Package 'vibrrt'

Title: An R Wrapper for 'vibrato'
Description: An R wrapper for 'vibrato' <https://github.com/daac-tools/vibrato>, a Rust reimplementation of 'MeCab' for fast tokenization.
Authors: Akiru Kato [aut, cre]
Maintainer: Akiru Kato
License: Apache License (>= 2)
Version: 0.5.1.3
Built: 2025-03-01 06:26:24 UTC
Source: https://github.com/paithiov909/vibrrt


Create a list of tokens

Description

Converts a tibble of tokens into a named list of tokens.

Usage

as_tokens(
  tbl,
  token_field = "token",
  pos_field = get_dict_features()[1],
  nm = NULL
)

Arguments

tbl

A tibble of tokens returned by tokenize().

token_field

<data-masked> Column containing tokens.

pos_field

Column containing features that will be kept as the names of tokens. If you don't need them, pass NULL for this argument.

nm

Names of the returned list. If NULL, the "doc_id" column of tbl is used instead.

Value

A named list of tokens.
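
As a rough sketch of how as_tokens() fits after tokenize(), the pipeline below keeps only the first dictionary feature and then converts the result to a named list. The dictionary path is a placeholder, and an IPA-style system dictionary is assumed so that the default `into` of prettify() and the default `pos_field` of as_tokens() line up. Nothing runs unless the package and the dictionary file are both available.

```r
# Hypothetical path (an assumption): point this at your own
# compiled 'vibrato' system dictionary with IPA-style features.
sys_dic <- file.path("path", "to", "system.dic")

# Corpus-style input: one row per document.
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text = c("猫が鳴く。", "今日は良い天気です。"),
  stringsAsFactors = FALSE
)

if (requireNamespace("vibrrt", quietly = TRUE) && file.exists(sys_dic)) {
  tagger <- vibrrt::create_tagger(sys_dic)
  toks <- vibrrt::tokenize(docs, tagger = tagger)

  # Keep only the first feature, which is the one that the default
  # pos_field = get_dict_features()[1] refers to.
  toks <- vibrrt::prettify(toks, col_select = 1L)

  # Default: tokens are named by the feature in pos_field.
  with_pos <- vibrrt::as_tokens(toks)

  # pos_field = NULL drops the feature names.
  plain <- vibrrt::as_tokens(toks, pos_field = NULL)
}
```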


Create a tagger function

Description

Creates a tagger function for 'vibrato' from the given dictionaries.

Usage

create_tagger(sys_dic, user_dic = "", max_grouping_len = 0L, verbose = FALSE)

Arguments

sys_dic

Character scalar; path to the system dictionary for 'vibrato'.

user_dic

Character scalar; path to the user dictionary for 'vibrato'.

max_grouping_len

Integer scalar; the maximum grouping length for unknown words. The default value, 0L, means that the length is unlimited.

verbose

Logical. If TRUE, returns additional information for debugging.

Value

A function inheriting class purrr_function_partial.
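
A minimal sketch of creating a tagger, assuming a compiled 'vibrato' system dictionary has been obtained separately. The path below is a placeholder and is not shipped with the package; the value 24L is an arbitrary illustration, not a recommended setting.

```r
# Hypothetical path (an assumption): replace with the location of
# your own compiled 'vibrato' system dictionary.
sys_dic <- file.path("path", "to", "system.dic")

if (requireNamespace("vibrrt", quietly = TRUE) && file.exists(sys_dic)) {
  # Defaults: no user dictionary, unlimited grouping length.
  tagger <- vibrrt::create_tagger(sys_dic)

  # Limit the grouping length for unknown words instead.
  tagger_limited <- vibrrt::create_tagger(sys_dic, max_grouping_len = 24L)

  # The result is a function, to be passed to tokenize() as `tagger`.
  is.function(tagger)
}
```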


Get dictionary features

Description

Returns names of dictionary features. Currently supports "unidic17" (2.1.2 src schema), "unidic26" (2.1.2 bin schema), "unidic29" (schema used in 2.2.0, 2.3.0), "cc-cedict", "ko-dic" (mecab-ko-dic), "naist11", and "ipa".

Usage

get_dict_features(
  dict = c("ipa", "unidic17", "unidic26", "unidic29", "cc-cedict", "ko-dic", "naist11")
)

Arguments

dict

Character scalar; one of "ipa", "unidic17", "unidic26", "unidic29", "cc-cedict", "ko-dic", or "naist11".

Value

A character vector.

See Also

See also 'CC-CEDICT-MeCab' and 'mecab-ko-dic'.

Examples

get_dict_features("ipa")

Check if scalars are blank

Description

Check if scalars are blank

Usage

is_blank(x, trim = TRUE, ...)

Arguments

x

Object to check for emptiness.

trim

Logical.

...

Additional arguments for base::sapply().

Value

Logicals.

Examples

is_blank(list(c(a = "", b = NA_character_), NULL))

Ngrams tokenizer

Description

Makes an ngram tokenizer function.

Usage

ngram_tokenizer(n = 1L)

Arguments

n

Integer; the size of the ngrams.

Value

An ngram tokenizer function.

Examples

bigram <- ngram_tokenizer(2)
bigram(letters, sep = "-")

Pack a data.frame of tokens

Description

Packs a data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats.

Usage

pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")

Arguments

tbl

A data.frame of tokens.

pull

<data-masked> Column to be packed into text or ngrams body. Default value is token.

n

Integer passed internally to the ngram tokenizer function created by ngram_tokenizer().

sep

Character scalar internally used as the concatenator of ngrams.

.collapse

This argument is passed to stringi::stri_c().

Value

A tibble.

Text Interchange Formats (TIF)

The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.

Valid data.frame of tokens

The data.frame of tokens here is a data.frame object compatible with the TIF.

A TIF-valid data.frame of tokens is expected to have one unique key column (named doc_id) identifying each text and several feature columns for each token. The feature columns must contain at least the token itself.

See Also

https://github.com/ropenscilabs/tif
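
The shape of a valid data.frame of tokens can be sketched with a hand-built example. The helper `is_tif_like_tokens()` is hypothetical and not part of 'vibrrt'; it only checks the two column names described above. The pack() calls are guarded so that they run only when the package is installed, and they assume that a hand-built data.frame of this shape is accepted in the same way as tokenize() output.

```r
# A hand-built data.frame of tokens: doc_id is the key column and
# token is the one feature column that must be present.
toks <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token = c("this", "is", "fine", "so", "far"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of 'vibrrt'): a minimal structural
# check for the columns a data.frame of tokens needs.
is_tif_like_tokens <- function(df) {
  is.data.frame(df) && all(c("doc_id", "token") %in% names(df))
}
is_tif_like_tokens(toks)

if (requireNamespace("vibrrt", quietly = TRUE)) {
  # n = 1L: the tokens of each document are packed into text,
  # with .collapse passed on to stringi::stri_c().
  corpus <- vibrrt::pack(toks, pull = "token", n = 1L, .collapse = " ")

  # n = 2L: ngrams are built first, using sep as the concatenator.
  bigrams <- vibrrt::pack(toks, n = 2L, sep = "-")
}
```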


Prettify tokenized output

Description

Splits a single character column into multiple feature columns using a delimiter.

Usage

prettify(
  tbl,
  col = "feature",
  into = get_dict_features("ipa"),
  col_select = seq_along(into),
  delim = ","
)

Arguments

tbl

A data.frame that has a feature column to be prettified.

col

<data-masked> Column containing features to be prettified.

into

Character vector that is used as column names of features.

col_select

Character or integer vector specifying which features are kept in the prettified output.

delim

Character scalar used to separate fields within a feature.

Value

A data.frame.
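
To illustrate what the delimiter-based split amounts to, the sketch below builds a small data.frame by hand with IPA-style feature strings. The strings are made up for illustration and do not come from a real tagger run. The base R `strsplit()` call only shows the fields that a delimiter of "," yields; the prettify() call itself is guarded and runs only when the package is installed.

```r
# Hand-built input with a single comma-delimited feature column.
# The feature strings are illustrative assumptions.
raw <- data.frame(
  doc_id = c("d1", "d1"),
  token = c("猫", "が"),
  feature = c(
    "名詞,一般,*,*,*,*,猫,ネコ,ネコ",
    "助詞,格助詞,一般,*,*,*,が,ガ,ガ"
  ),
  stringsAsFactors = FALSE
)

# Base R view of the fields inside each feature string.
fields <- strsplit(raw$feature, ",", fixed = TRUE)
lengths(fields)

if (requireNamespace("vibrrt", quietly = TRUE)) {
  # Name the fields after the "ipa" schema and keep the first two.
  pretty <- vibrrt::prettify(
    raw,
    col = "feature",
    into = vibrrt::get_dict_features("ipa"),
    col_select = c(1L, 2L),
    delim = ","
  )
}
```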


Tokenize sentences using a tagger

Description

Tokenize sentences using a tagger

Usage

tokenize(
  x,
  text_field = "text",
  docid_field = "doc_id",
  split = FALSE,
  mode = c("parse", "wakati"),
  tagger
)

Arguments

x

A data.frame-like object or a character vector to be tokenized.

text_field

<data-masked> String or symbol; column containing texts to be tokenized.

docid_field

<data-masked> String or symbol; column containing document IDs.

split

Logical. If TRUE, the function internally splits the sentences into sub-sentences.

mode

Character scalar to switch the output format; either "parse" or "wakati".

tagger

A tagger function created by create_tagger().

Value

A tibble or a named list of tokens.
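
A hedged sketch of calling tokenize() in both modes. The dictionary path is a placeholder, and the sample texts are arbitrary; the calls run only when the package and the dictionary file are both available.

```r
# Hypothetical path (an assumption): replace with the location of
# your own compiled 'vibrato' system dictionary.
sys_dic <- file.path("path", "to", "system.dic")

# A data.frame-like input with the default column names.
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text = c("こんにちは、世界。", "今日は良い天気です。"),
  stringsAsFactors = FALSE
)

if (requireNamespace("vibrrt", quietly = TRUE) && file.exists(sys_dic)) {
  tagger <- vibrrt::create_tagger(sys_dic)

  # "parse" is the first choice of `mode`.
  parsed <- vibrrt::tokenize(
    docs,
    text_field = "text",
    docid_field = "doc_id",
    mode = "parse",
    tagger = tagger
  )

  # "wakati" switches to the other output format.
  wakati <- vibrrt::tokenize(docs, mode = "wakati", tagger = tagger)

  # split = TRUE asks for sentences to be split into sub-sentences first.
  parsed_split <- vibrrt::tokenize(docs, split = TRUE, tagger = tagger)
}
```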