Title: | rJava Interface to Kuromoji |
---|---|
Description: | An rJava wrapper for atilika/kuromoji (v0.7.7). The package works, but it is too slow for production use. |
Authors: | Akiru Kato [aut, cre] |
Maintainer: | Akiru Kato <[email protected]> |
License: | Apache License (>= 2) |
Version: | 0.2.0 |
Built: | 2024-11-02 05:43:35 UTC |
Source: | https://github.com/paithiov909/tangela |
Call kuromoji tokenizer
kuromoji(chr)
chr |
Character vector to be tokenized. |
A tibble.
Makes an ngram tokenizer function.
ngram_tokenizer(n = 1L)
n |
Integer. |
An ngram tokenizer function.
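As a rough illustration of what a function factory of this kind can look like, here is a minimal base-R sketch. It is not the package's implementation: the name make_ngram_tokenizer, the sep argument, and the choice to return an empty vector when there are fewer than n tokens are all assumptions made for the example.

```r
# Hypothetical helper, not part of the package:
# returns a function that builds n-grams from a character vector of tokens.
make_ngram_tokenizer <- function(n = 1L, sep = "-") {
  stopifnot(is.numeric(n), length(n) == 1L, n >= 1L)
  n <- as.integer(n)
  function(tokens) {
    len <- length(tokens)
    # Assumption: inputs shorter than n yield no n-grams.
    if (len < n) {
      return(character(0))
    }
    starts <- seq_len(len - n + 1L)
    # Join each window of n consecutive tokens with the separator.
    vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = sep),
      character(1)
    )
  }
}

bigrams <- make_ngram_tokenizer(2L)
bigrams(c("a", "b", "c"))
# "a-b" "b-c"
```

Returning a closure lets the n-gram size be fixed once and the resulting function be reused across many token vectors.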
Packs a data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats (TIF).
pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")
tbl |
A data.frame of tokens. |
pull |
< |
n |
Integer internally passed to the ngram tokenizer function created by ngram_tokenizer(). |
sep |
Character scalar internally used as the concatenator of ngrams. |
.collapse |
This argument is passed to |
A tibble.
The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.
The data.frame of tokens here is a data.frame object compatible with the TIF.
A TIF-valid data.frame of tokens is expected to have one unique key column (named doc_id) identifying each text and several feature columns describing each token.
The feature columns must contain at least the token itself.
https://github.com/ropenscilabs/tif
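To make the two shapes concrete, the following base-R sketch builds a toy tokens data.frame with doc_id and token columns and collapses it into a corpus-style data.frame with doc_id and text columns. The data are invented, and the collapsing step only approximates the kind of transformation described for pack(); it is not the package's code.

```r
# A toy tokens data.frame in the shape described above:
# doc_id identifies each text, token holds one token per row.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token = c("this", "is", "text", "another", "one"),
  stringsAsFactors = FALSE
)

# Collapse the tokens of each document into one space-separated string.
texts <- tapply(tokens$token, tokens$doc_id, paste, collapse = " ")

# Corpus-style data.frame: one row per document.
corpus <- data.frame(
  doc_id = names(texts),
  text = as.vector(texts),
  stringsAsFactors = FALSE
)
corpus
```

Each document ends up as a single row whose text is its tokens joined by a space, which mirrors the default .collapse = " " shown in the usage of pack().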
Splits a single character column into separate feature columns using a delimiter.
prettify( tbl, col = "feature", into = c("POS1", "POS2", "POS3", "POS4", "X5StageUse1", "X5StageUse2", "Original", "Yomi1", "Yomi2"), col_select = seq_along(into), delim = "," )
tbl |
A data.frame that has a feature column to be prettified. |
col |
< |
into |
Character vector that is used as column names of features. |
col_select |
Character or integer vector that will be kept in prettified features. |
delim |
Character scalar used to separate fields within a feature. |
A data.frame.
prettify( data.frame(x = c("x,y", "y,z", "z,x")), col = "x", into = c("a", "b"), col_select = "b" )
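For readers without the package installed, the delimiter-based splitting can be approximated in base R. The sketch below is an assumption-laden stand-in, not the package's implementation: split_features is a hypothetical name, it takes a plain character vector rather than a data.frame column, and it returns only the selected feature columns.

```r
# Hypothetical base-R approximation of delimiter-based feature splitting.
split_features <- function(x, into, col_select = seq_along(into), delim = ",") {
  # Split each string on the literal delimiter.
  parts <- strsplit(x, delim, fixed = TRUE)
  # Pad (with NA) or truncate each row to length(into) fields.
  mat <- do.call(rbind, lapply(parts, function(p) p[seq_along(into)]))
  features <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(features) <- into
  # Keep only the requested columns, selected by name or position.
  features[, col_select, drop = FALSE]
}

out <- split_features(c("x,y", "y,z", "z,x"), into = c("a", "b"), col_select = "b")
out$b
# "y" "z" "x"
```

Selecting columns after naming them is what allows col_select to be given either as names from into or as integer positions.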
Initialize kuromoji tokenizer
rebuild_tokenizer(user_dic = "")
user_dic |
File path to a user dictionary, if any. |
The stored kuromoji tokenizer instance is returned invisibly.