Package 'sudachir2'

Title: R Wrapper for 'sudachi.rs'
Description: Offers bindings to 'sudachi.rs' <https://github.com/WorksApplications/sudachi.rs>, a Rust implementation of the 'Sudachi' Japanese morphological analyzer.
Authors: Akiru Kato [aut, cre]
Maintainer: Akiru Kato <[email protected]>
License: Apache License (>= 2)
Version: 0.6.10.4
Built: 2025-03-01 06:26:17 UTC
Source: https://github.com/paithiov909/sudachir2

Help Index


Create a list of tokens

Description

Create a list of tokens

Usage

as_tokens(tbl, token_field = "token", pos_field = NULL, nm = NULL)

Arguments

tbl

A tibble of tokens returned by tokenize().

token_field

<data-masked> Column containing tokens.

pos_field

Column containing features that will be kept as the names of tokens. If you don't need them, pass NULL for this argument.

nm

Names of the returned list. If NULL, the "doc_id" field of tbl is used instead.

Value

A named list of tokens.
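
Examples

A minimal sketch, not taken from the package itself: a hand-built tibble stands in for the output of tokenize(), whose exact columns depend on the tagger.

# A tibble shaped like tokenize() output; in practice, pass
# the result of tokenize() directly.
tbl <- tibble::tibble(
  doc_id = c("1", "1", "2"),
  token = c("quick", "brown", "fox"),
  pos = c("ADJ", "ADJ", "NOUN")
)
# Returns a list named by doc_id; each element is a character
# vector of tokens whose names are taken from the pos column.
as_tokens(tbl, token_field = "token", pos_field = "pos")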


Create a tagger function

Description

Create a tagger function

Usage

create_tagger(
  dictionary_path,
  config_file = system.file("resources/sudachi.json", package = "sudachir2"),
  resource_dir = system.file("resources", package = "sudachir2"),
  mode = c("C", "A", "B")
)

Arguments

dictionary_path

A path to a dictionary file such as "system_core.dic".

config_file

A path to a config file.

resource_dir

A path to a resource directory.

mode

Split mode for 'sudachi.rs'. Either "C", "A", or "B".

Details

This function just returns a wrapper function for tokenization; it does not actually create a tagger instance. For this reason, it does not raise an error even if its arguments are invalid.

Value

A function inheriting class purrr_function_partial.
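
Examples

A sketch of typical usage; the dictionary path below is hypothetical (see fetch_dict() for obtaining a real dictionary).

# Returns a partialised wrapper immediately; per the details above,
# an invalid path does not raise an error at this point.
tagger <- create_tagger("path/to/system_core.dic", mode = "A")
# The wrapper is then passed to tokenize(), e.g.
# tokenize("こんにちは、世界", tagger = tagger)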


Download and unarchive a dictionary for 'Sudachi'

Description

Download and unarchive a dictionary for 'Sudachi'

Usage

fetch_dict(
  exdir,
  dict_version = "latest",
  dict_type = c("small", "core", "full")
)

Arguments

exdir

Directory where the dictionary will be unarchived.

dict_version

Version of the dictionary to be downloaded.

dict_type

Type of the dictionary to be downloaded. Either "small", "core", or "full".

Value

exdir is invisibly returned.
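
Examples

A sketch of typical usage, not run here because it requires a network connection. The unarchived file name below is an assumption; it may differ by dictionary type and version.

# dict_dir <- fetch_dict(tempdir(), dict_type = "core")
# create_tagger(file.path(dict_dir, "system_core.dic"))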


Check if scalars are blank

Description

Check if scalars are blank

Usage

is_blank(x, trim = TRUE, ...)

Arguments

x

Object to be checked for emptiness.

trim

Logical; if TRUE, whitespace is trimmed before elements are checked for blankness.

...

Additional arguments passed to base::sapply().

Value

A logical vector.

Examples

is_blank(list(c(a = "", b = NA_character_), NULL))

Ngrams tokenizer

Description

Makes an ngram tokenizer function.

Usage

ngram_tokenizer(n = 1L)

Arguments

n

Integer; the size of the ngrams (e.g., 2 for bigrams).

Value

An ngram tokenizer function.

Examples

bigram <- ngram_tokenizer(2)
bigram(letters, sep = "-")

Pack a data.frame of tokens

Description

Packs a data.frame of tokens into a new data.frame of a corpus that is compatible with the Text Interchange Formats.

Usage

pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")

Arguments

tbl

A data.frame of tokens.

pull

<data-masked> Column to be packed into the text or ngram body. The default value is token.

n

Integer internally passed to the ngram tokenizer function created by ngram_tokenizer().

sep

Character scalar internally used as the concatenator of ngrams.

.collapse

This argument is passed to stringi::stri_c().

Value

A tibble.

Text Interchange Formats (TIF)

The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.

Valid data.frame of tokens

The data.frame of tokens here is a data.frame object compatible with the TIF.

A TIF-valid data.frame of tokens is expected to have one unique key column (named doc_id) for each text and several feature columns for each token. The feature columns must contain at least the token itself.

See Also

https://github.com/ropenscilabs/tif
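
Examples

A minimal sketch with a hand-built data.frame standing in for real tokenize() output.

tbl <- data.frame(
  doc_id = c("a", "a", "a", "b"),
  token = c("i", "am", "here", "hello")
)
pack(tbl)          # tokens collapsed into one text per doc_id
pack(tbl, n = 2L)  # bigrams joined with "-" before collapsing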


Prettify tokenized output

Description

Turns a single character column into multiple feature columns by splitting it on a delimiter.

Usage

prettify(
  tbl,
  col = "feature",
  into = c("POS1", "POS2", "POS3", "POS4", "cType", "cForm"),
  col_select = seq_along(into),
  delim = ","
)

Arguments

tbl

A data.frame that has a feature column to be prettified.

col

<data-masked> Column containing features to be prettified.

into

Character vector used as the column names of the features.

col_select

Character or integer vector specifying which columns to keep in the prettified features.

delim

Character scalar used to separate fields within a feature.

Value

A data.frame.

Examples

prettify(
  data.frame(x = c("x,y", "y,z", "z,x")),
  col = "x",
  into = c("a", "b"),
  col_select = "b"
)

Tokenize sentences using a tagger function

Description

Tokenize sentences using a tagger function

Usage

tokenize(x, text_field = "text", docid_field = "doc_id", tagger)

Arguments

x

A data.frame-like object or a character vector to be tokenized.

text_field

<data-masked> String or symbol; column containing texts to be tokenized.

docid_field

<data-masked> String or symbol; column containing document IDs.

tagger

A tagger function created by create_tagger().

Value

A tibble.
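
Examples

A sketch of the whole pipeline, not run here because the dictionary path is hypothetical. Whether the output carries a feature column that prettify() can consume depends on the tagger and is assumed here.

# tagger <- create_tagger("path/to/system_core.dic")
# dat <- data.frame(doc_id = "doc1", text = "こんにちは、世界")
# tokenize(dat, text_field = "text", docid_field = "doc_id", tagger = tagger) |>
#   prettify(col_select = "POS1")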