Title: | An R Wrapper for 'vibrato' |
---|---|
Description: | An R wrapper for 'vibrato' <https://github.com/daac-tools/vibrato>, a Rust reimplementation of 'MeCab' for fast tokenization. |
Authors: | Akiru Kato [aut, cre] |
Maintainer: | Akiru Kato <[email protected]> |
License: | Apache License (>= 2) |
Version: | 0.5.1.3 |
Built: | 2025-03-01 06:26:24 UTC |
Source: | https://github.com/paithiov909/vibrrt |
Create a list of tokens
as_tokens( tbl, token_field = "token", pos_field = get_dict_features()[1], nm = NULL )
tbl |
A tibble of tokens out of |
token_field |
< |
pos_field |
Column containing features
that will be kept as the names of tokens.
If you don't need them, give a |
nm |
Names of returned list.
If left with |
A named list of tokens.
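The shape of the return value can be pictured with a small base R sketch. It is not the package's implementation: the helper name `tokens_list`, the toy data, and the column names `doc_id` and `POS1` are assumptions made for illustration only.

```r
# Toy token table; the column names doc_id, token, and POS1 are assumptions.
tbl <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("cats", "sleep", "hello"),
  POS1   = c("noun", "verb", "interjection"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the package): builds one character
# vector per document, naming each token by its part-of-speech value.
tokens_list <- function(tbl, token_field = "token", pos_field = "POS1") {
  tokens <- tbl[[token_field]]
  if (!is.null(pos_field)) {
    names(tokens) <- tbl[[pos_field]]
  }
  # split() keeps the element names and groups by document id.
  split(tokens, tbl[["doc_id"]])
}

res <- tokens_list(tbl)
```

Here `res` is a list with one element per document, and each element is a character vector of tokens whose names carry the feature values.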
Create a tagger function
create_tagger(sys_dic, user_dic = "", max_grouping_len = 0L, verbose = FALSE)
sys_dic |
Character scalar; path to the system dictionary for 'vibrato'. |
user_dic |
Character scalar; path to the user dictionary for 'vibrato'. |
max_grouping_len |
Integer scalar;
The maximum grouping length for unknown words.
The default value is |
verbose |
Logical.
If |
A function inheriting class purrr_function_partial.
Returns names of dictionary features. Currently supports "unidic17" (2.1.2 src schema), "unidic26" (2.1.2 bin schema), "unidic29" (schema used in 2.2.0, 2.3.0), "cc-cedict", "ko-dic" (mecab-ko-dic), "naist11", and "ipa".
get_dict_features( dict = c("ipa", "unidic17", "unidic26", "unidic29", "cc-cedict", "ko-dic", "naist11") )
dict |
Character scalar; one of "ipa", "unidic17", "unidic26", "unidic29", "cc-cedict", "ko-dic", or "naist11". |
A character vector.
See also 'CC-CEDICT-MeCab' and 'mecab-ko-dic'.
get_dict_features("ipa")
Check if scalars are blank
is_blank(x, trim = TRUE, ...)
x |
Object to check for emptiness. |
trim |
Logical. |
... |
Additional arguments for |
Logicals.
is_blank(list(c(a = "", b = NA_character_), NULL))
Makes an ngram tokenizer function.
ngram_tokenizer(n = 1L)
n |
Integer. |
An n-gram tokenizer function.
bigram <- ngram_tokenizer(2)
bigram(letters, sep = "-")
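To make the idea of a tokenizer factory concrete, here is a minimal base R sketch of a closure that joins each run of `n` consecutive tokens. The name `make_ngram_tokenizer` is hypothetical and the body is an assumption about the general technique, not the package's actual code.

```r
# Hypothetical stand-in for illustration; not the package's implementation.
make_ngram_tokenizer <- function(n = 1L) {
  stopifnot(is.numeric(n), length(n) == 1L, n >= 1)
  n <- as.integer(n)
  # The returned closure remembers n.
  function(tokens, sep = " ") {
    len <- length(tokens)
    # Too few tokens to form even one n-gram.
    if (len < n) {
      return(character(0))
    }
    starts <- seq_len(len - n + 1L)
    # Join each window of n consecutive tokens with sep.
    vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = sep),
      character(1)
    )
  }
}

bigram_sketch <- make_ngram_tokenizer(2)
bigrams <- bigram_sketch(c("a", "b", "c", "d"), sep = "-")
```

With four input tokens and `n = 2`, the sketch yields three overlapping pairs.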
Packs a data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats.
pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")
tbl |
A data.frame of tokens. |
pull |
< |
n |
Integer internally passed to ngrams tokenizer function
created of |
sep |
Character scalar internally used as the concatenator of ngrams. |
.collapse |
This argument is passed to |
A tibble.
The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.
The data.frame of tokens here is a data.frame object compatible with the TIF.
A TIF-valid data.frame of tokens is expected to have one unique key column (named doc_id) for each text and several feature columns for each token.
The feature columns must contain at least the token itself.
https://github.com/ropenscilabs/tif
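The relationship between the two TIF shapes can be sketched in base R. This is an illustration under assumptions (toy data, a single-space separator) and does not reproduce the internals of `pack()`.

```r
# A TIF-style data.frame of tokens: one row per token, keyed by doc_id.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token  = c("I", "like", "tea", "Good", "morning"),
  stringsAsFactors = FALSE
)

# Collapse the tokens of each document into a single string.
text <- vapply(
  split(tokens$token, tokens$doc_id),
  paste,
  character(1),
  collapse = " "
)

# A TIF-style corpus data.frame: one row per document,
# with the columns doc_id and text.
corpus <- data.frame(
  doc_id = names(text),
  text = unname(text),
  stringsAsFactors = FALSE
)
```

The resulting `corpus` has one row per `doc_id`, which is the form other TIF-aware packages expect as corpus input.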
Turns a single character column into multiple feature columns by splitting it on a delimiter.
prettify( tbl, col = "feature", into = get_dict_features("ipa"), col_select = seq_along(into), delim = "," )
tbl |
A data.frame that has a feature column to be prettified. |
col |
< |
into |
Character vector that is used as column names of features. |
col_select |
Character or integer vector that will be kept in prettified features. |
delim |
Character scalar used to separate fields within a feature. |
A data.frame.
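The splitting step can be illustrated with base R alone. The sketch below uses toy data and hypothetical column names (`POS1`, `POS2`, `POS3`), which are not claimed to match the output of `get_dict_features()`. It only shows the general idea of splitting a delimited column and keeping a subset of the resulting fields.

```r
# Toy input; the feature strings and column names are assumptions.
tbl <- data.frame(
  token   = c("foo", "bar"),
  feature = c("noun,common,*", "verb,main,*"),
  stringsAsFactors = FALSE
)
into <- c("POS1", "POS2", "POS3")  # hypothetical feature names
col_select <- c(1, 2)              # keep only the first two fields

# Split each feature string on the delimiter and stack the pieces row-wise.
parts <- strsplit(tbl$feature, ",", fixed = TRUE)
mat <- do.call(rbind, parts)
colnames(mat) <- into

# Keep the selected fields and bind them next to the token column.
features <- as.data.frame(
  mat[, col_select, drop = FALSE],
  stringsAsFactors = FALSE
)
pretty <- cbind(tbl["token"], features)
```

The sketch leaves placeholder values such as `*` untouched. How the package treats them is not stated here.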
Tokenize sentences using a tagger
tokenize( x, text_field = "text", docid_field = "doc_id", split = FALSE, mode = c("parse", "wakati"), tagger )
x |
A data.frame like object or a character vector to be tokenized. |
text_field |
< |
docid_field |
< |
split |
Logical. When passed as |
mode |
Character scalar to switch output format. |
tagger |
A tagger function created by |
A tibble or a named list of tokens.
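A typical call sequence, based only on the signatures documented above, might look like the following sketch. The dictionary path is a placeholder and the sample sentences are made up. The code is guarded so that it does nothing when the package or the dictionary file is unavailable.

```r
# Placeholder path (an assumption); point this at a real 'vibrato'
# system dictionary before running.
sys_dic <- "path/to/system.dic"

result <- NULL
if (requireNamespace("vibrrt", quietly = TRUE) && file.exists(sys_dic)) {
  # Build a tagger function from the system dictionary.
  tagger <- vibrrt::create_tagger(sys_dic)

  # Input with the default column names expected by tokenize().
  docs <- data.frame(
    doc_id = c("1", "2"),
    text = c("すもももももももものうち", "今日は良い天気です"),
    stringsAsFactors = FALSE
  )

  # Tokenize with the default text_field, docid_field, and mode.
  result <- vibrrt::tokenize(docs, tagger = tagger)
}
```

When the guard passes, `result` holds the tokenized output. Otherwise it stays `NULL`.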