| Title: | R Wrapper for 'sudachi.rs' |
|---|---|
| Description: | Offers bindings to 'sudachi.rs' <https://github.com/WorksApplications/sudachi.rs>, a Rust implementation of 'Sudachi' Japanese morphological analyzer. |
| Authors: | Akiru Kato [aut, cre] |
| Maintainer: | Akiru Kato <[email protected]> |
| License: | Apache License (>= 2) |
| Version: | 0.6.10.4 |
| Built: | 2025-03-01 06:26:17 UTC |
| Source: | https://github.com/paithiov909/sudachir2 |
Create a list of tokens
as_tokens(tbl, token_field = "token", pos_field = NULL, nm = NULL)

| tbl | A tibble of tokens out of `tokenize()`. |
|---|---|
| token_field | `<data-masked>` Column containing tokens. |
| pos_field | Column containing features that will be kept as the names of tokens. If you don't need them, give a `NULL` for this argument. |
| nm | Names of the returned list. If left as `NULL`, the `doc_id` field of `tbl` is used instead. |
A named list of tokens.
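A minimal sketch of using as_tokens() on the output of tokenize(); the dictionary path below is hypothetical and must point to a dictionary fetched beforehand:

# Hypothetical dictionary path; see fetch_dict() and create_tagger()
tagger <- create_tagger("sudachi-dict/system_core.dic")
toks <- tokenize("望遠鏡で泳ぐ彼女を見た。", tagger = tagger)
as_tokens(toks, token_field = "token", pos_field = NULL)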
Create a tagger function
create_tagger(
  dictionary_path,
  config_file = system.file("resources/sudachi.json", package = "sudachir2"),
  resource_dir = system.file("resources", package = "sudachir2"),
  mode = c("C", "A", "B")
)

| dictionary_path | A path to a dictionary file such as one unarchived by `fetch_dict()`. |
|---|---|
| config_file | A path to a config file. |
| resource_dir | A path to a resource directory. |
| mode | Split mode for 'sudachi.rs'. Either "C", "A", or "B". |
This function just returns a wrapper function for tokenization; it does not actually create a tagger instance. Because of this, it does not raise an error even when its arguments are invalid.

A function inheriting class `purrr_function_partial`.
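For instance, a tagger wrapper might be created as sketched below (the dictionary path is hypothetical). Nothing is validated at this point, so a wrong path only surfaces once the returned function is actually used:

# Hypothetical dictionary path; no dictionary is loaded at this point
tagger <- create_tagger("sudachi-dict/system_core.dic", mode = "C")
class(tagger) # should include "purrr_function_partial"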
Download and unarchive a dictionary for 'Sudachi'
fetch_dict(
  exdir,
  dict_version = "latest",
  dict_type = c("small", "core", "full")
)

| exdir | Directory where the dictionary will be unarchived. |
|---|---|
| dict_version | Version of the dictionary to be downloaded. |
| dict_type | Type of the dictionary to be downloaded. Either "small", "core", or "full". |

`exdir` is invisibly returned.
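A sketch of fetching a dictionary into a temporary directory (requires network access; the exact file names inside the archive depend on the dictionary version):

# Download and unarchive the latest core dictionary (not run: needs network)
dict_dir <- fetch_dict(tempdir(), dict_version = "latest", dict_type = "core")
list.files(dict_dir) # the unarchived dictionary file(s) should appear here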
Check if scalars are blank
is_blank(x, trim = TRUE, ...)

| x | Object whose emptiness is checked. |
|---|---|
| trim | Logical. |
| ... | Additional arguments. |

Logicals.

is_blank(list(c(a = "", b = NA_character_), NULL))
Makes an ngram tokenizer function.
ngram_tokenizer(n = 1L)

| n | Integer. |
|---|---|

An ngram tokenizer function.

bigram <- ngram_tokenizer(2)
bigram(letters, sep = "-")
Packs a data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats.

pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")

| tbl | A data.frame of tokens. |
|---|---|
| pull | `<data-masked>` Column containing tokens to be packed. |
| n | Integer internally passed to the ngram tokenizer function created by `ngram_tokenizer()`. |
| sep | Character scalar internally used as the concatenator of ngrams. |
| .collapse | Character scalar used to collapse the tokens or ngrams of each document into a single string. |
A tibble.
The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices. The data.frame of tokens here is a data.frame object compatible with the TIF.

A TIF-valid data.frame of tokens is expected to have one unique key column (named `doc_id`) for each text and several feature columns for each token. The feature columns must contain at least the `token` itself.
https://github.com/ropenscilabs/tif
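As an illustration, a sketch of packing a hand-made data.frame of tokens back into a TIF corpus (the token values are arbitrary and only serve to keep the example self-contained):

# A tiny data.frame of tokens with the doc_id and token columns TIF expects
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc2", "doc2"),
  token = c("quick", "fox", "lazy", "dog")
)
pack(tokens, pull = "token") # one row per doc_id, tokens collapsed with " "
pack(tokens, pull = "token", n = 2L, sep = "-") # bigrams such as "quick-fox"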
Turns a single character column into features, separating it on a delimiter.

prettify(
  tbl,
  col = "feature",
  into = c("POS1", "POS2", "POS3", "POS4", "cType", "cForm"),
  col_select = seq_along(into),
  delim = ","
)

| tbl | A data.frame that has a feature column to be prettified. |
|---|---|
| col | `<data-masked>` Column containing features to be prettified. |
| into | Character vector that is used as column names of features. |
| col_select | Character or integer vector specifying the features to keep in the prettified result. |
| delim | Character scalar used to separate fields within a feature. |
A data.frame.
prettify(
  data.frame(x = c("x,y", "y,z", "z,x")),
  col = "x",
  into = c("a", "b"),
  col_select = "b"
)
Tokenize sentences using a tagger function
tokenize(x, text_field = "text", docid_field = "doc_id", tagger)

| x | A data.frame-like object or a character vector to be tokenized. |
|---|---|
| text_field | `<data-masked>` Column containing texts to be tokenized. |
| docid_field | `<data-masked>` Column containing document IDs. |
| tagger | A tagger function out of `create_tagger()`. |
A tibble.
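A sketch of tokenizing a small TIF-style corpus data.frame, again assuming a hypothetical dictionary path:

# Hypothetical dictionary path; the corpus follows the TIF layout (doc_id, text)
tagger <- create_tagger("sudachi-dict/system_core.dic")
corp <- data.frame(
  doc_id = c("doc1", "doc2"),
  text = c("望遠鏡で泳ぐ彼女を見た。", "吾輩は猫である。")
)
tokenize(corp, text_field = "text", docid_field = "doc_id", tagger = tagger)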