Package 'tangela' reference manual

Title:	rJava Interface to Kuromoji
Description:	An rJava wrapper for atilika/kuromoji (v0.7.7). The package works, but it is too slow for production use.
Authors:	Akiru Kato [aut, cre]
Maintainer:	Akiru Kato <[email protected]>
License:	Apache License (>= 2)
Version:	0.2.0
Built:	2025-01-26 09:20:05 UTC
Source:	https://github.com/paithiov909/tangela

Call kuromoji tokenizer

Description

Call kuromoji tokenizer

Usage

kuromoji(chr)

Arguments

`chr`	Character vector to be tokenized.

Value

A tibble.
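
A minimal usage sketch, assuming 'tangela' and a working Java runtime are installed. The sample sentences are illustrative only, and the call is guarded so that the snippet does nothing when the package is unavailable. The columns of the returned tibble are not listed in this manual, so the sketch only prints the result.

```r
# Illustrative input; any character vector should do.
text <- c("すもももももももものうち", "吾輩は猫である")

tokens <- NULL
if (requireNamespace("tangela", quietly = TRUE)) {
  # kuromoji() takes a character vector and, per this manual, returns a tibble.
  tokens <- tangela::kuromoji(text)
  print(tokens)
}
```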

Pack a data.frame of tokens

Description

Packs a data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats (TIF).

Usage

pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")

Arguments

`tbl`	A data.frame of tokens.
`pull`	<`data-masked`> Column to be packed into text or ngrams body. Default value is `token`.
`n`	Integer internally passed to the ngrams tokenizer function created by `tangela::ngram_tokenizer()`.
`sep`	Character scalar internally used as the concatenator of ngrams.
`.collapse`	This argument is passed to `stringi::stri_c()`.

Value

A tibble.
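
The idea of packing can be sketched in base R as follows. This is not the package's implementation: `ngrams()` and `pack_sketch()` are hypothetical helpers, the token table is made up, and the output column name `text` is an assumption. The sketch only shows how `n`, `sep`, and `.collapse` are described to interact: tokens are joined into n-grams with `sep`, and each document's n-grams are then collapsed into one string.

```r
# Made-up token table: one row per token, keyed by doc_id.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("a", "b", "c", "x", "y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of tangela): n-grams joined by `sep`.
ngrams <- function(x, n = 1L, sep = "-") {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1L)
  vapply(
    starts,
    function(i) paste(x[i:(i + n - 1L)], collapse = sep),
    character(1)
  )
}

# Hypothetical helper: collapse each document's n-grams into one text.
pack_sketch <- function(tbl, n = 1L, sep = "-", collapse = " ") {
  pieces <- split(tbl$token, tbl$doc_id)
  text <- vapply(
    pieces,
    function(x) paste(ngrams(x, n, sep), collapse = collapse),
    character(1)
  )
  data.frame(doc_id = names(text), text = unname(text), stringsAsFactors = FALSE)
}

unigrams <- pack_sketch(tokens)          # one row per document
bigrams  <- pack_sketch(tokens, n = 2L)  # bigrams joined by "-"
```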

Text Interchange Formats (TIF)

The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.

Valid data.frame of tokens

The data.frame of tokens here is a data.frame object compatible with the TIF.

A TIF-valid data.frame of tokens is expected to have one unique key column (named doc_id) identifying each text and several feature columns for each token. The feature columns must contain at least the token itself.
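
A minimal base R sketch of the layout described above. The helper `is_tokens_df_sketch()` is hypothetical and checks only what this section states (a `doc_id` key column and a `token` column); it is not a full TIF validator. The `feature` column and its values are made up for illustration.

```r
# Made-up token table with a doc_id key, the token itself, and one extra feature.
tokens <- data.frame(
  doc_id  = c("d1", "d1", "d2"),
  token   = c("a", "b", "c"),
  feature = c("f1,f2", "f3,f4", "f5,f6"),
  stringsAsFactors = FALSE
)

# Hypothetical check: a data.frame with at least doc_id and token columns.
is_tokens_df_sketch <- function(x) {
  is.data.frame(x) && all(c("doc_id", "token") %in% names(x))
}

ok  <- is_tokens_df_sketch(tokens)             # has both required columns
bad <- is_tokens_df_sketch(data.frame(x = 1))  # lacks doc_id and token
```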

Prettify tokenized output

Description

Splits a single character column into separate feature columns using a delimiter.

Usage

prettify(
  tbl,
  col = "feature",
  into = c("POS1", "POS2", "POS3", "POS4", "X5StageUse1", "X5StageUse2", "Original",
    "Yomi1", "Yomi2"),
  col_select = seq_along(into),
  delim = ","
)

Arguments

`tbl`	A data.frame that has a feature column to be prettified.
`col`	<`data-masked`> Name of the column to be prettified.
`into`	Character vector that is used as column names of features.
`col_select`	Character or integer vector of the features to keep in the prettified output.
`delim`	Character scalar used to separate fields within a feature.

Value

A data.frame.

Examples

prettify(
  data.frame(x = c("x,y", "y,z", "z,x")),
  col = "x",
  into = c("a", "b"),
  col_select = "b"
)

Initialize kuromoji tokenizer

Description

Initialize kuromoji tokenizer

Usage

rebuild_tokenizer(user_dic = "")

Arguments

`user_dic`	File path to a user dictionary, if any.

Value

The stored kuromoji tokenizer instance is returned invisibly.
