Title: An R Wrapper for 'Vibrato'
Description: An R wrapper for 'Vibrato', a Viterbi-based accelerated tokenizer.
Authors: Akiru Kato [aut, cre]
Maintainer: Akiru Kato <[email protected]>
License: MIT + file LICENSE
Version: 0.1.0
Built: 2025-01-19 05:20:01 UTC
Source: https://github.com/paithiov909/vibrrt
as_tokens(): Create a list of tokens

Usage:
as_tokens(tbl, token_field = "token", pos_field = get_dict_features()[1], nm = NULL)

Arguments:
tbl: A tibble of tokens.
token_field: <data-masked> Column containing tokens.
pos_field: Column containing features that will be kept as the names of tokens. If you don't need them, pass NULL for this argument.
nm: Names of the returned list. If left as NULL, default names are used.

Value:
A named list of tokens.
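As a rough illustration of the return shape only (not the package's implementation), base R's split() produces a similar named list when a token column is grouped by document IDs:

```r
# Hypothetical illustration of the return shape of as_tokens(),
# using base R's split() to group a token column by document ID.
tbl <- data.frame(
  doc_id = c("doc1", "doc1", "doc2"),
  token = c("this", "works", "fine")
)
tokens <- split(tbl$token, tbl$doc_id)
# tokens is list(doc1 = c("this", "works"), doc2 = "fine")
```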
bind_lr(): Calculates and binds the importance of bigrams and their synergistic average.

Usage:
bind_lr(tbl, term = "token", lr_mode = c("n", "dn"), avg_rate = 1)

Arguments:
tbl: A tidy text dataset.
term: <data-masked> Column containing terms.
lr_mode: Method for computing the 'FL' and 'FR' values; one of "n" or "dn".
avg_rate: Weight of the 'LR' value.

Details:
The 'LR' value is the synergistic average of bigram importance, based on the words and their positions (left or right side).

Value:
A data.frame.
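The documentation does not spell out the formulas, so the following is a toy sketch under stated assumptions only: suppose FL(w) counts the bigrams where w occurs on the left, FR(w) those where it occurs on the right, and the LR value is the geometric mean of (FL + 1) and (FR + 1). The package's actual "n" and "dn" modes may compute these differently.

```r
# Toy sketch (assumption, not the package's exact formulas):
# FL(w) = number of bigrams in which w appears on the left,
# FR(w) = number of bigrams in which w appears on the right,
# LR(w) = geometric mean of (FL + 1) and (FR + 1).
tokens <- c("tokyo", "university", "tokyo", "station")
bigrams <- data.frame(
  left = head(tokens, -1),
  right = tail(tokens, -1)
)
vocab <- unique(tokens)
fl <- table(factor(bigrams$left, levels = vocab))
fr <- table(factor(bigrams$right, levels = vocab))
lr <- sqrt((as.numeric(fl) + 1) * (as.numeric(fr) + 1))
names(lr) <- vocab
# lr["tokyo"] = sqrt(3 * 2), lr["university"] = sqrt(2 * 2)
```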
bind_tf_idf2(): Calculates and binds the term frequency, inverse document frequency, and TF-IDF of the dataset. This function experimentally supports 4 types of term frequencies and 5 types of inverse document frequencies.

Usage:
bind_tf_idf2(
  tbl,
  term = "token",
  document = "doc_id",
  n = "n",
  tf = c("tf", "tf2", "tf3", "itf"),
  idf = c("idf", "idf2", "idf3", "idf4", "df"),
  norm = FALSE,
  rmecab_compat = TRUE
)

Arguments:
tbl: A tidy text dataset.
term: <data-masked> Column containing terms.
document: <data-masked> Column containing document IDs.
n: <data-masked> Column containing counts of terms.
tf: Method for computing term frequency.
idf: Method for computing inverse document frequency.
norm: Logical; if passed as TRUE, the TF-IDF values are normalized.
rmecab_compat: Logical; if passed as TRUE, values are computed in a way compatible with 'RMeCab'.
Details:
Types of term frequency can be switched with the tf argument:
- "tf" is term frequency (not the raw count of terms).
- "tf2" is logarithmic term frequency, of which the base is exp(1).
- "tf3" is binary-weighted term frequency.
- "itf" is inverse term frequency. Use together with idf = "df".

Types of inverse document frequency can be switched with the idf argument:
- "idf" is inverse document frequency, of which the base is 2, smoothed. 'Smoothed' here means just adding 1 to the raw values after taking the logarithm.
- "idf2" is global-frequency IDF.
- "idf3" is probabilistic IDF, of which the base is 2.
- "idf4" is global entropy; not actually an IDF.
- "df" is document frequency. Use together with tf = "itf".

Value:
A data.frame.
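A few of these weights can be sketched in base R on a tiny count table. The "tf2" formula below (1 + ln(n)) is an assumption; "tf", "tf3", and "idf" follow the descriptions above (relative frequency, binary weight, and a base-2 logarithm with 1 added afterwards). The document-frequency shortcut via table() works here only because each (doc_id, token) pair occurs once.

```r
# Sketch of some weighting schemes described above, on a tiny term-count
# table. Not the package's code; tf2 = 1 + ln(n) is an assumption.
d <- data.frame(
  doc_id = c("a", "a", "b"),
  token = c("x", "y", "x"),
  n = c(2, 1, 3)
)
n_docs <- length(unique(d$doc_id))

# "tf": relative term frequency within each document
d$tf <- d$n / ave(d$n, d$doc_id, FUN = sum)
# "tf2": logarithmic term frequency, base exp(1) (assumed 1 + ln(n))
d$tf2 <- 1 + log(d$n)
# "tf3": binary-weighted term frequency
d$tf3 <- as.numeric(d$n > 0)
# "idf": base-2 IDF, smoothed by adding 1 after taking the logarithm
df <- table(d$token)  # document frequency (one row per doc-term pair here)
d$idf <- log2(n_docs / as.numeric(df[d$token])) + 1
d$tf_idf <- d$tf * d$idf
```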
collapse_tokens(): Concatenates sequences of tokens in the tidy text dataset, grouping them by an expression.

Usage:
collapse_tokens(tbl, condition, .collapse = "")

Arguments:
tbl: A tidy text dataset.
condition: <data-masked> Expression by which sequences of tokens are grouped.
.collapse: String with which tokens are concatenated.

Details:
Note that this function drops all columns except 'token' and the columns used for grouping sequences. So the returned data.frame has only the 'doc_id', 'sentence_id', 'token_id', and 'token' columns.

Value:
A data.frame.
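In spirit (the real function takes a data-masked condition and keeps the ID columns), per-group concatenation can be sketched with base R's aggregate():

```r
# Base R illustration of concatenating tokens within each document,
# similar in spirit to collapse_tokens() but grouping only by 'doc_id'.
tbl <- data.frame(
  doc_id = c("a", "a", "b"),
  token = c("tok", "yo", "kyoto")
)
collapsed <- aggregate(
  token ~ doc_id,
  data = tbl,
  FUN = paste,
  collapse = ""
)
collapsed$token  # "tokyo" "kyoto"
```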
get_dict_features(): Returns names of dictionary features. Currently supports "unidic17" (2.1.2 src schema), "unidic26" (2.1.2 bin schema), "unidic29" (schema used in 2.2.0, 2.3.0), "cc-cedict", "ko-dic" (mecab-ko-dic), "naist11", and "ipa".

Usage:
get_dict_features(
  dict = c("ipa", "unidic17", "unidic26", "unidic29", "cc-cedict", "ko-dic", "naist11")
)

Arguments:
dict: Character scalar; one of "ipa", "unidic17", "unidic26", "unidic29", "cc-cedict", "ko-dic", "naist11".

Value:
A character vector.

See also 'CC-CEDICT-MeCab' and 'mecab-ko-dic'.

Examples:
get_dict_features("ipa")
is_blank(): Check if scalars are blank

Usage:
is_blank(x, trim = TRUE, ...)

Arguments:
x: Object to be checked for emptiness.
trim: Logical; whether whitespace is trimmed before checking.
...: Additional arguments.

Value:
A logical vector.

Examples:
is_blank(list(c(a = "", b = NA_character_), NULL))
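A minimal re-implementation sketch (the package's version likely covers more edge cases) that treats NULL, NA, and empty or whitespace-only strings as blank:

```r
# Hypothetical minimal version of is_blank(), for illustration only.
# A scalar is "blank" if it is NULL, NA, or an empty string
# (after optional whitespace trimming).
is_blank_sketch <- function(x, trim = TRUE) {
  if (is.list(x)) return(vapply(x, is_blank_sketch, logical(1), trim = trim))
  if (is.null(x) || length(x) == 0) return(TRUE)
  if (trim && is.character(x)) x <- trimws(x)
  all(is.na(x) | x == "")
}
is_blank_sketch(list("", "  ", NA_character_, NULL, "a"))
# TRUE TRUE TRUE TRUE FALSE
```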
lex_density(): The lexical density is the proportion of content words (lexical items) in documents. This function is a simple helper for calculating the lexical density of given datasets.

Usage:
lex_density(vec, contents_words, targets = NULL, negate = c(FALSE, FALSE))

Arguments:
vec: A character vector.
contents_words: A character vector containing values to be counted as content words.
targets: A character vector with which the denominator of lexical density is filtered before computing values.
negate: A logical vector of length 2. If an element is passed as TRUE, the corresponding match (against 'contents_words' and 'targets', respectively) is negated.

Value:
A numeric vector.
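The computation can be sketched as follows, ignoring the negate argument; this is an assumed reading of the description above, not the package's code:

```r
# Assumed sketch of lexical density: the count of content words divided
# by the number of tokens considered (optionally restricted to 'targets').
lex_density_sketch <- function(vec, contents_words, targets = NULL) {
  denom <- if (is.null(targets)) vec else vec[vec %in% targets]
  sum(denom %in% contents_words) / length(denom)
}
pos <- c("noun", "particle", "verb", "particle", "noun")
lex_density_sketch(pos, contents_words = c("noun", "verb"))  # 0.6
```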
mute_tokens(): Replaces tokens in the tidy text dataset with a string scalar only when they match an expression.

Usage:
mute_tokens(tbl, condition, .as = NA_character_)

Arguments:
tbl: A tidy text dataset.
condition: <data-masked> Condition that determines which tokens are replaced.
.as: String with which tokens are replaced when they match the condition. The default value is NA_character_.

Value:
A data.frame.
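What this does can be pictured with base R's ifelse() (the real function takes a data-masked condition):

```r
# Base R illustration of replacing matching tokens with NA_character_,
# similar in spirit to mute_tokens().
tbl <- data.frame(
  token = c("the", "cat", "a", "dog"),
  pos = c("DET", "NOUN", "DET", "NOUN")
)
tbl$token <- ifelse(tbl$pos == "DET", NA_character_, tbl$token)
tbl$token  # NA "cat" NA "dog"
```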
ngram_tokenizer(): Makes an n-gram tokenizer function.

Usage:
ngram_tokenizer(n = 1L)

Arguments:
n: Integer; window size of the n-grams.

Value:
An n-gram tokenizer function.

Examples:
bigram <- ngram_tokenizer(2)
bigram(letters, sep = "-")
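A plausible minimal version of such a factory, for illustration only (not the package's actual implementation): it returns a closure that slides a window of size n over the input and pastes each window together with `sep`.

```r
# Hypothetical minimal n-gram tokenizer factory.
make_ngram <- function(n = 1L) {
  function(x, sep = " ") {
    len <- length(x) - n + 1L
    if (len < 1L) return(character(0))
    vapply(
      seq_len(len),
      function(i) paste(x[i:(i + n - 1L)], collapse = sep),
      character(1)
    )
  }
}
bigram <- make_ngram(2)
bigram(c("a", "b", "c"), sep = "-")  # "a-b" "b-c"
```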
pack(): Packs a data.frame of tokens into a new data.frame of corpus, which is compatible with the Text Interchange Formats.

Usage:
pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")

Arguments:
tbl: A data.frame of tokens.
pull: <data-masked> Column whose values are packed into the text.
n: Integer internally passed to the n-gram tokenizer function created by ngram_tokenizer().
sep: Character scalar internally used as the concatenator of n-grams.
.collapse: String with which tokens are concatenated into texts.

Value:
A tibble.
Details:
The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices. The data.frame of tokens here is a data.frame object compatible with the TIF. A TIF-valid data.frame of tokens is expected to have one unique key column (named 'doc_id') for each text and several feature columns for each token. The feature columns must contain at least the 'token' itself.

See https://github.com/ropenscilabs/tif for details.
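The TIF corpus shape described above can be sketched with base R: one row per 'doc_id' and a 'text' column holding the concatenated tokens (the real pack() returns a tibble and can concatenate n-grams instead):

```r
# Base R sketch of packing a token table back into a TIF-style corpus
# data.frame: one row per 'doc_id', with a 'text' column.
tokens <- data.frame(
  doc_id = c("a", "a", "b"),
  token = c("hello", "world", "bye")
)
corpus <- aggregate(
  token ~ doc_id,
  data = tokens,
  FUN = paste,
  collapse = " "
)
names(corpus)[2] <- "text"
corpus$text  # "hello world" "bye"
```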
prettify(): Turns a single character column into features, separating the column with a delimiter.

Usage:
prettify(
  tbl,
  col = "feature",
  into = get_dict_features("ipa"),
  col_select = seq_along(into),
  delim = ","
)

Arguments:
tbl: A data.frame that has a feature column to be prettified.
col: <data-masked> Column to be prettified.
into: Character vector that is used as the column names of the features.
col_select: Character or integer vector of features that will be kept in the prettified result.
delim: Character scalar used to separate fields within a feature.

Value:
A data.frame.

Examples:
prettify(
  data.frame(x = c("x,y", "y,z", "z,x")),
  col = "x",
  into = c("a", "b"),
  col_select = "b"
)
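Roughly what that example does, expressed with base R's strsplit() for illustration: the "x" column is split on "," into columns "a" and "b", and only "b" is kept.

```r
# Base R illustration of splitting a delimited feature column into
# named columns, similar in spirit to prettify().
x <- c("x,y", "y,z", "z,x")
parts <- do.call(rbind, strsplit(x, ",", fixed = TRUE))
features <- data.frame(a = parts[, 1], b = parts[, 2])
features["b"]  # a one-column data.frame with b = c("y", "z", "x")
```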
tokenize(): Tokenize sentences using 'Vibrato'

Usage:
tokenize(
  x,
  text_field = "text",
  docid_field = "doc_id",
  sys_dic = "",
  user_dic = "",
  split = FALSE,
  mode = c("parse", "wakati")
)

Arguments:
x: A data.frame-like object or a character vector to be tokenized.
text_field: <data-masked> Column containing texts to be tokenized.
docid_field: <data-masked> Column containing document IDs.
sys_dic: Character scalar; path to the system dictionary for 'Vibrato'.
user_dic: Character scalar; path to the user dictionary for 'Vibrato'.
split: Logical; when passed as TRUE, the texts are split into sentences before tokenizing.
mode: Character scalar to switch the output format; one of "parse" or "wakati".

Value:
A tibble or a named list of tokens.