Title: | Japanese Text Processing Tools |
---|---|
Description: | A collection of Japanese text processing tools for filling Japanese iteration marks, Japanese character type conversions, segmentation by phrase, and text normalization which is based on rules for the 'Sudachi' morphological analyzer and the 'NEologd' (Neologism dictionary for 'MeCab'). These features are specific to Japanese and are not implemented in 'ICU' (International Components for Unicode). |
Authors: | Akiru Kato [cre, aut], Koki Takahashi [cph] (Author of japanese.js), Shuhei Iitsuka [cph] (Author of budoux), Taku Kudo [cph] (Author of TinySegmenter) |
Maintainer: | Akiru Kato <[email protected]> |
License: | Apache License (>= 2) |
Version: | 0.5.2 |
Built: | 2024-12-26 06:14:13 UTC |
Source: | https://github.com/paithiov909/audubon |
Calculates and binds the importance of bigrams and their synergistic average.
bind_lr(tbl, term = "token", lr_mode = c("n", "dn"), avg_rate = 1)
tbl |
A tidy text dataset. |
term |
< |
lr_mode |
Method for computing 'FL' and 'FR' values.
|
avg_rate |
Weight of the 'LR' value. |
The 'LR' value is the synergistic average of bigram importance, which is based on the words and their positions (left or right side).
A data.frame.
prettify(hiroba, col_select = "POS1") |> mute_tokens(POS1 != "\u540d\u8a5e") |> bind_lr() |> head()
Calculates and binds the term frequency, inverse document frequency, and TF-IDF of the dataset. This function experimentally supports 4 types of term frequencies and 5 types of inverse document frequencies.
bind_tf_idf2( tbl, term = "token", document = "doc_id", n = "n", tf = c("tf", "tf2", "tf3", "itf"), idf = c("idf", "idf2", "idf3", "idf4", "df"), norm = FALSE, rmecab_compat = TRUE )
tbl |
A tidy text dataset. |
term |
< |
document |
< |
n |
< |
tf |
Method for computing term frequency. |
idf |
Method for computing inverse document frequency. |
norm |
Logical; If passed as |
rmecab_compat |
Logical; If passed as |
Types of term frequency can be switched with the tf argument:
tf is term frequency (not raw count of terms).
tf2 is logarithmic term frequency with base exp(1).
tf3 is binary-weighted term frequency.
itf is inverse term frequency. Use with idf="df".
Types of inverse document frequency can be switched with the idf argument:
idf is smoothed inverse document frequency with base 2. 'Smoothed' here simply means adding 1 to the raw values after taking the logarithm.
idf2 is global frequency IDF.
idf3 is probabilistic IDF with base 2.
idf4 is global entropy, which is not actually an IDF.
df is document frequency. Use with tf="itf".
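As a rough illustration of the 'tf' and 'idf' options, the following base R sketch computes a relative term frequency and a base-2 IDF with 1 added after the logarithm. It assumes that 'tf' means the count divided by the document total, and it ignores the norm and rmecab_compat arguments. The toy counts table and the helper name tf_idf_sketch are hypothetical; this is not the package's implementation.

```r
# Toy term counts per document (hypothetical data).
counts <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3"),
  token  = c("a", "b", "a", "c", "a"),
  n      = c(2, 2, 1, 3, 4),
  stringsAsFactors = FALSE
)

tf_idf_sketch <- function(tbl) {
  n_docs <- length(unique(tbl$doc_id))
  # Relative term frequency: count divided by the document total (assumption).
  tbl$tf <- tbl$n / ave(tbl$n, tbl$doc_id, FUN = sum)
  # Document frequency: number of documents that contain each token.
  df <- tapply(tbl$doc_id, tbl$token, function(x) length(unique(x)))
  # Base-2 IDF, "smoothed" by adding 1 after taking the logarithm.
  tbl$idf <- log2(n_docs / as.numeric(df[tbl$token])) + 1
  tbl$tf_idf <- tbl$tf * tbl$idf
  tbl
}

res <- tf_idf_sketch(counts)
res
```

A token that occurs in every document (here "a") gets an IDF of exactly 1 under this smoothing, so its TF-IDF equals its term frequency.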
A data.frame.
df <- dplyr::count(hiroba, doc_id, token)
bind_tf_idf2(df) |> head()
Concatenates sequences of tokens in the tidy text dataset, while grouping them by an expression.
collapse_tokens(tbl, condition, .collapse = "")
tbl |
A tidy text dataset. |
condition |
< |
.collapse |
String with which tokens are concatenated. |
Note that this function drops all columns except 'token' and the columns used for grouping sequences, so the returned data.frame has only the 'doc_id', 'sentence_id', 'token_id', and 'token' columns.
A data.frame.
df <- prettify(head(hiroba), col_select = "POS1")
collapse_tokens(df, POS1 == "\u540d\u8a5e")
Returns the feature names of a dictionary. Currently supports "unidic17" (2.1.2 src schema), "unidic26" (2.1.2 bin schema), "unidic29" (schema used in 2.2.0, 2.3.0), "cc-cedict", "ko-dic" (mecab-ko-dic), "naist11", "sudachi", and "ipa".
get_dict_features( dict = c("ipa", "unidic17", "unidic26", "unidic29", "cc-cedict", "ko-dic", "naist11", "sudachi") )
dict |
Character scalar; one of "ipa", "unidic17", "unidic26", "unidic29", "cc-cedict", "ko-dic", "naist11", or "sudachi". |
A character vector.
See also 'CC-CEDICT-MeCab', and 'mecab-ko-dic'.
get_dict_features("ipa")
A tidy text dataset of audubon::polano that was tokenized with 'MeCab'.
hiroba
An object of class data.frame with 26849 rows and 5 columns.
head(hiroba)
The lexical density is the proportion of content words (lexical items) in documents. This function is a simple helper for calculating the lexical density of given datasets.
lex_density(vec, contents_words, targets = NULL, negate = c(FALSE, FALSE))
vec |
A character vector. |
contents_words |
A character vector containing values to be counted as content words. |
targets |
A character vector with which the denominator of lexical density is filtered before computing values. |
negate |
A logical vector of length 2.
If passed as |
A numeric vector.
head(hiroba) |> prettify(col_select = "POS1") |> dplyr::group_by(doc_id) |> dplyr::summarise( noun_ratio = lex_density(POS1, "\u540d\u8a5e", c("\u52a9\u8a5e", "\u52a9\u52d5\u8a5e"), negate = c(FALSE, TRUE) ), mvr = lex_density( POS1, c("\u5f62\u5bb9\u8a5e", "\u526f\u8a5e", "\u9023\u4f53\u8a5e"), "\u52d5\u8a5e" ), vnr = lex_density(POS1, "\u52d5\u8a5e", "\u540d\u8a5e") )
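The basic ratio can be sketched in base R as follows. This is a minimal sketch that assumes the numerator counts elements found in the content-word set and the denominator counts either all elements or only those listed in targets. It leaves out the negate argument, and the helper name lex_density_sketch and the toy part-of-speech vector are hypothetical.

```r
# Hypothetical helper: share of content words among all (or selected) items.
lex_density_sketch <- function(vec, content_words, targets = NULL) {
  numerator <- sum(vec %in% content_words)
  # Without targets the denominator is every element; otherwise only
  # elements listed in targets are counted.
  denominator <- if (is.null(targets)) length(vec) else sum(vec %in% targets)
  numerator / denominator
}

pos <- c("noun", "particle", "verb", "noun", "auxiliary", "adverb")
density <- lex_density_sketch(pos, c("noun", "verb", "adverb"))  # content words / all items
vnr <- lex_density_sketch(pos, "verb", targets = "noun")         # verbs per noun
```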
Replaces tokens in the tidy text dataset with a string scalar only if they match an expression.
mute_tokens(tbl, condition, .as = NA_character_)
tbl |
A tidy text dataset. |
condition |
< |
.as |
String with which tokens are replaced when they match condition. The default value is NA_character_. |
A data.frame.
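The replacement itself amounts to a conditional assignment on the 'token' column. The base R sketch below shows the idea with a plain logical vector in place of a data-masked expression; the helper name mute_sketch and the toy table are hypothetical and not part of the package.

```r
# Toy tokens table (hypothetical data).
toks <- data.frame(
  doc_id = c(1L, 1L, 1L),
  token  = c("cat", "is", "cute"),
  pos    = c("noun", "verb", "adjective"),
  stringsAsFactors = FALSE
)

# Replace tokens in rows where the condition holds; other columns are untouched.
mute_sketch <- function(tbl, condition, .as = NA_character_) {
  tbl$token[condition] <- .as
  tbl
}

muted <- mute_sketch(toks, toks$pos != "noun")
muted$token
```

Because the rows are kept and only the token value is replaced, token positions within each document are preserved, which matters for position-based statistics computed afterwards.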
df <- prettify(head(hiroba), col_select = "POS1")
mute_tokens(df, POS1 %in% c("\u52a9\u8a5e", "\u52a9\u52d5\u8a5e"))
Makes an ngram tokenizer function.
ngram_tokenizer(n = 1L)
n |
Integer. |
An ngram tokenizer function.
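A function factory of this kind can be sketched as a closure in base R. The name make_ngram_tokenizer and the (tokens, sep) signature of the returned function are assumptions made for illustration; the function returned by the package may differ.

```r
# Hypothetical closure-based n-gram tokenizer; not the package's code.
make_ngram_tokenizer <- function(n = 1L) {
  stopifnot(n >= 1L)
  function(tokens, sep = " ") {
    len <- length(tokens)
    if (len < n) return(character(0))
    starts <- seq_len(len - n + 1L)
    # Join each window of n consecutive tokens with sep.
    vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = sep),
      character(1)
    )
  }
}

bigrams <- make_ngram_tokenizer(2L)
bigrams(c("a", "b", "c", "d"), sep = "-")
```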
Packs a data.frame of tokens into a new data.frame of corpus, which is compatible with the Text Interchange Formats.
pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")
tbl |
A data.frame of tokens. |
pull |
< |
n |
Integer internally passed to ngrams tokenizer function
created of |
sep |
Character scalar internally used as the concatenator of ngrams. |
.collapse |
This argument is passed to |
A tibble.
The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.
The data.frame of tokens here is a data.frame object compatible with the TIF.
A TIF-valid data.frame of tokens is expected to have one unique key column (named doc_id) for each text and several feature columns for each token.
The feature columns must contain at least the token itself.
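The relationship between the two TIF shapes can be sketched in base R: a tokens data.frame with one row per token is collapsed into a corpus data.frame with one row per document. This is only an illustration with hypothetical toy data; it skips the n-gram step and returns a plain data.frame rather than a tibble.

```r
# A TIF-style tokens data.frame: one row per token, keyed by doc_id.
tokens <- data.frame(
  doc_id = c("1", "1", "1", "2", "2"),
  token  = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Collapse tokens per document into a TIF-style corpus (doc_id, text).
text <- tapply(tokens$token, tokens$doc_id, paste, collapse = " ")
corpus <- data.frame(
  doc_id = names(text),
  text   = as.vector(text),
  stringsAsFactors = FALSE
)
corpus
```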
https://github.com/ropenscilabs/tif
pack(strj_tokenize(polano[1:5], format = "data.frame"))
Whole text of 'Porano no Hiroba' written by Miyazawa Kenji from Aozora Bunko
polano
An object of class character of length 899.
A dataset containing the text of Miyazawa Kenji's novel "Porano no Hiroba" which was published in 1934, the year after Kenji's death. Copyright of this work has expired since more than 70 years have passed after the author's death.
The UTF-8 plain text is sourced from https://www.aozora.gr.jp/cards/000081/card1935.html and is cleaned of meta data.
https://www.aozora.gr.jp/cards/000081/files/1935_ruby_19924.zip
head(polano)
Splits a single character column into feature columns using a delimiter.
prettify( tbl, col = "feature", into = get_dict_features("ipa"), col_select = seq_along(into), delim = "," )
tbl |
A data.frame that has feature column to be prettified. |
col |
< |
into |
Character vector that is used as column names of features. |
col_select |
Character or integer vector that will be kept in prettified features. |
delim |
Character scalar used to separate fields within a feature. |
A data.frame.
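The splitting step can be approximated in base R as shown below. This is a sketch under simplifying assumptions: it ignores any quoting or escaping of the delimiter, does not implement col_select, and uses the hypothetical helper name split_features.

```r
# Hypothetical stand-in for the splitting step, using base R only.
split_features <- function(tbl, col, into, delim = ",") {
  parts <- strsplit(as.character(tbl[[col]]), delim, fixed = TRUE)
  # Pad or truncate each row to length(into), then bind the rows into a matrix.
  mat <- do.call(rbind, lapply(parts, function(p) p[seq_along(into)]))
  colnames(mat) <- into
  cbind(
    tbl[setdiff(names(tbl), col)],
    as.data.frame(mat, stringsAsFactors = FALSE)
  )
}

df <- data.frame(
  token   = c("t1", "t2"),
  feature = c("x,y", "y,z"),
  stringsAsFactors = FALSE
)
out <- split_features(df, "feature", c("a", "b"))
out
```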
prettify( data.frame(x = c("x,y", "y,z", "z,x")), col = "x", into = c("a", "b"), col_select = "b" )
Reads a 'rewrite.def' file.
read_rewrite_def( def_path = system.file("def/rewrite.def", package = "audubon") )
def_path |
Character scalar; path to the rewriting definition file. |
A list.
str(read_rewrite_def())
Fills Japanese iteration marks (Odori-ji) with their previous characters if the element has more than 5 characters.
strj_fill_iter_mark(text)
text |
Character vector. |
A character vector.
strj_fill_iter_mark(c( "\u3042\u3044\u3046\u309d\u3003\u304b\u304d", "\u91d1\u5b50\u307f\u3059\u309e", "\u306e\u305f\u308a\u3033\u3035\u304b\u306a", "\u3057\u308d\uff0f\u2033\uff3c\u3068\u3057\u305f" ))
Converts Japanese katakana to hiragana.
It is almost the same as stringi::stri_trans_general(text, "kana-hira"); however, this implementation can also handle some additional symbols, such as Japanese kana ligatures (a.k.a. goryaku-gana).
strj_hiraganize(text)
text |
Character vector. |
A character vector.
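The core of a katakana-to-hiragana conversion is a fixed code point offset, which the base R sketch below illustrates. It only shifts the basic katakana range (U+30A1 to U+30F6) and does not handle ligatures or other additional symbols; the helper name kata_to_hira is hypothetical.

```r
# Shift katakana (U+30A1..U+30F6) down by 0x60 to reach the matching hiragana.
kata_to_hira <- function(text) {
  vapply(text, function(s) {
    cp <- utf8ToInt(s)
    is_kata <- cp >= 0x30A1L & cp <= 0x30F6L
    cp[is_kata] <- cp[is_kata] - 0x60L
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# The prolonged sound mark (U+30FC) lies outside the range and is left as-is.
converted <- kata_to_hira("\u30a4\u30fc\u30cf\u30c8")
converted
```

The reverse direction works the same way by adding 0x60 to code points in the hiragana range U+3041 to U+3096.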
strj_hiraganize( c( paste0( "\u3042\u306e\u30a4\u30fc\u30cf\u30c8", "\u30fc\u30f4\u30a9\u306e\u3059\u304d", "\u3068\u304a\u3063\u305f\u98a8" ), "\u677f\u57a3\u6b7b\u30b9\U0002a708" ) )
Converts Japanese hiragana to katakana.
It is almost the same as stringi::stri_trans_general(text, "hira-kana"); however, this implementation can also handle some additional symbols, such as Japanese kana ligatures (a.k.a. goryaku-gana).
strj_katakanize(text)
text |
Character vector. |
A character vector.
strj_katakanize( c( paste0( "\u3042\u306e\u30a4\u30fc\u30cf\u30c8", "\u30fc\u30f4\u30a9\u306e\u3059\u304d", "\u3068\u304a\u3063\u305f\u98a8" ), "\u672c\u65e5\u309f\u304b\u304d\u6c37\u89e3\u7981" ) )
Converts characters into a normalized style, following the rules recommended by the Neologism dictionary for 'MeCab'.
strj_normalize(text)
text |
Character vector to be normalized. |
A character vector.
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
strj_normalize( paste0( "\u2015\u2015\u5357\u30a2\u30eb\u30d7\u30b9", "\u306e\u3000\u5929\u7136\u6c34-\u3000\uff33", "\uff50\uff41\uff52\uff4b\uff49\uff4e\uff47*", "\u3000\uff2c\uff45\uff4d\uff4f\uff4e+", "\u3000\u30ec\u30e2\u30f3\u4e00\u7d5e\u308a" ) )
Rewrites text using a 'rewrite.def' file.
strj_rewrite_as_def(text, as = read_rewrite_def())
text |
Character vector to be normalized. |
as |
List. |
A character vector.
strj_rewrite_as_def(
  paste0(
    "\u2015\u2015\u5357\u30a2\u30eb",
    "\u30d7\u30b9\u306e\u3000\u5929",
    "\u7136\u6c34-\u3000\uff33\uff50",
    "\uff41\uff52\uff4b\uff49\uff4e\uff47*",
    "\u3000\uff2c\uff45\uff4d\uff4f\uff4e+",
    "\u3000\u30ec\u30e2\u30f3\u4e00\u7d5e\u308a"
  )
)
strj_rewrite_as_def(
  "\u60e1\u3068\u5047\u9762\u306e\u30eb\u30fc\u30eb",
  read_rewrite_def(system.file("def/kyuji.def", package = "audubon"))
)
Romanizes Japanese hiragana and katakana.
strj_romanize( text, config = c("wikipedia", "traditional hepburn", "modified hepburn", "kunrei", "nihon") )
text |
Character vector. If elements contain characters other than hiragana and katakana letters, those characters are dropped from the return value. |
config |
Configuration used to romanize. Default is "wikipedia". |
There are several ways to romanize Japanese.
Using this implementation, you can convert hiragana and katakana into 5 different styles: the wikipedia style, the traditional hepburn style, the modified hepburn style, the kunrei style, and the nihon style.
Note that all of these styles return results that differ slightly from those of stringi::stri_trans_general(text, "Any-latn").
A character vector.
https://github.com/hakatashi/japanese.js#japaneseromanizetext-config
strj_romanize( paste0( "\u3042\u306e\u30a4\u30fc\u30cf\u30c8", "\u30fc\u30f4\u30a9\u306e\u3059\u304d", "\u3068\u304a\u3063\u305f\u98a8" ) )
An alias of strj_tokenize(engine = "budoux").
strj_segment(text, format = c("list", "data.frame"), split = FALSE)
text |
Character vector to be tokenized. |
format |
Output format. Choose "list" or "data.frame". |
split |
Logical. If passed as, the function splits the vector
into some sentences using |
A list or a data.frame.
strj_segment(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  )
)
strj_segment(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame"
)
An alias of strj_tokenize(engine = "tinyseg").
strj_tinyseg(text, format = c("list", "data.frame"), split = FALSE)
text |
Character vector to be tokenized. |
format |
Output format. Choose "list" or "data.frame". |
split |
Logical. If passed as |
A list or a data.frame.
strj_tinyseg(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  )
)
strj_tinyseg(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame"
)
Splits text into several tokens using the specified tokenizer.
strj_tokenize( text, format = c("list", "data.frame"), engine = c("stringi", "budoux", "tinyseg", "mecab", "sudachipy"), rcpath = NULL, mode = c("C", "B", "A"), split = FALSE )
text |
Character vector to be tokenized. |
format |
Output format. Choose "list" or "data.frame". |
engine |
Tokenizer name. Choose one of 'stringi', 'budoux', 'tinyseg', 'mecab', or 'sudachipy'. Note that the specified tokenizer must be installed and available when you use 'mecab' or 'sudachipy'. |
rcpath |
Path to a setting file for 'MeCab' or 'sudachipy' if any. |
mode |
Splitting mode for 'sudachipy'. |
split |
Logical. If passed as |
A list or a data.frame.
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  )
)
strj_tokenize(
  paste0(
    "\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
    "\u30fc\u30f4\u30a9\u306e\u3059\u304d",
    "\u3068\u304a\u3063\u305f\u98a8"
  ),
  format = "data.frame"
)
Transcribes Arabic integers to Kansuji with auxiliary numerals.
strj_transcribe_num(int)
int |
Integers. |
Because its implementation is limited, this function can only transcribe numbers up to trillions. To convert much larger numbers, try the 'arabic2kansuji' package.
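To illustrate what transcription with auxiliary numerals involves, here is a small base R sketch for positive integers below 10^8. It follows the common convention of dropping a leading "one" before ten, hundred, and thousand, which is an assumption and may not match the package's output in every case. The helper names kansuji_block and to_kansuji are hypothetical.

```r
# Transcribe an integer in 0..9999 using the units thousand, hundred, and ten.
kansuji_block <- function(x) {
  digits <- c("\u4e00", "\u4e8c", "\u4e09", "\u56db", "\u4e94",
              "\u516d", "\u4e03", "\u516b", "\u4e5d")
  units <- c("\u5343", "\u767e", "\u5341", "")
  d <- (x %/% c(1000L, 100L, 10L, 1L)) %% 10L
  out <- character(0)
  for (i in seq_along(d)) {
    if (d[i] == 0L) next
    # A leading "one" is dropped before thousand, hundred, and ten (assumed convention).
    numeral <- if (d[i] == 1L && i < 4L) "" else digits[d[i]]
    out <- c(out, numeral, units[i])
  }
  paste(out, collapse = "")
}

# Split into a ten-thousands block and a remainder, joined by the unit for 10^4.
to_kansuji <- function(x) {
  stopifnot(length(x) == 1L, x > 0, x < 1e8)
  x <- as.integer(x)
  high <- x %/% 10000L
  low <- x %% 10000L
  paste0(
    if (high > 0L) paste0(kansuji_block(high), "\u4e07") else "",
    kansuji_block(low)
  )
}

to_kansuji(31415L)
```

Supporting larger numbers means repeating the same four-digit block for each further unit (10^8, 10^12, and so on), which is why a fixed list of units bounds the range such a function can handle.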
A character vector.
strj_transcribe_num(c(10L, 31415L))