Title: | Tiny Interface to CaboCha for R |
---|---|
Description: | A tiny interface to 'CaboCha', a Japanese dependency structure parser. The main goal of this package is to implement a parser for its XML output. |
Authors: | Akiru Kato [aut, cre], Marcin Kalicinski [aut] (Author of rapidxml) |
Maintainer: | Akiru Kato <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.9 |
Built: | 2024-11-17 04:23:37 UTC |
Source: | https://github.com/paithiov909/pipian |
Make an ngram tokenizer function.
ngram_tokenizer(n = 1L)
n |
Integer. |
ngram tokenizer function
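As a rough illustration of what a function factory of this kind might look like, the base R sketch below builds a closure that joins every run of `n` adjacent tokens. The name `make_ngram_tokenizer`, the `tokens` and `sep` arguments of the returned function, and the behaviour for inputs shorter than `n` are assumptions made for this example; they are not taken from the package source.

```r
# Hypothetical sketch, not the package's implementation.
make_ngram_tokenizer <- function(n = 1L) {
  n <- as.integer(n)
  stopifnot(length(n) == 1L, !is.na(n), n >= 1L)
  # The returned closure remembers `n` and can be reused on many token vectors.
  function(tokens, sep = "-") {
    len <- length(tokens)
    if (len < n) {
      return(character(0))  # assumption: too-short input yields no ngrams
    }
    starts <- seq_len(len - n + 1L)
    vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = sep),
      character(1)
    )
  }
}

bigram <- make_ngram_tokenizer(2L)
bigram(c("a", "b", "c"))
```

Returning a closure keeps the window size fixed once, so the same tokenizer can be applied to each document without passing `n` again.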
Packs a prettified data.frame of tokens into a new corpus data.frame that is compatible with the Text Interchange Formats.
pack(tbl, pull = "token", n = 1L, sep = "-", .collapse = " ")
tbl |
A prettified data.frame of tokens. |
pull |
Column to be packed into text or ngrams body. Default value is 'token'. |
n |
Integer passed internally to the ngram tokenizer function created by |
sep |
Character scalar internally used as the concatenator of ngrams. |
.collapse |
This argument is passed to |
A data.frame.
The Text Interchange Formats (TIF) is a set of standards that allows R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices.
The prettified data.frame of tokens here is a data.frame object compatible with the TIF.
A TIF-valid data.frame of tokens is expected to have one unique key column (named 'doc_id') identifying each text and several feature columns describing each token. The feature columns must contain at least 'token' itself.
https://github.com/ropensci/tif
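To make the two shapes concrete, the base R sketch below starts from a small tokens data.frame with 'doc_id' and 'token' columns and collapses it into a corpus data.frame with one row per document and 'doc_id' and 'text' columns. The helper `pack_sketch` and the sample data are hypothetical; this only illustrates the idea of packing and does not reproduce how `pack()` is implemented (ngram handling is left out).

```r
# Hypothetical tokens table: one row per token, keyed by doc_id.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token = c("a", "b", "c", "x", "y"),
  stringsAsFactors = FALSE
)

# Illustrative helper (an assumption, not the package's code): paste the
# pulled column together per document, keeping first-appearance order.
pack_sketch <- function(tbl, pull = "token", .collapse = " ") {
  groups <- factor(tbl$doc_id, levels = unique(tbl$doc_id))
  texts <- tapply(tbl[[pull]], groups, paste, collapse = .collapse)
  data.frame(
    doc_id = names(texts),
    text = as.vector(texts),
    stringsAsFactors = FALSE
  )
}

corpus <- pack_sketch(tokens)
corpus
```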
Execute the 'cabocha -f3 -n1' command using system2(), then return the paths to the temporary XML files.
ppn_cabocha(text, rcpath = NULL)
text |
A character vector to be parsed with CaboCha. |
rcpath |
String; path to the 'mecabrc' file if any. |
Paths to the CaboCha XML output are returned.
## Not run:
ppn_cabocha(enc2utf8("\u96e8\u306b\u3082\u8ca0\u3051\u305a"))
## End(Not run)
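For readers curious how a wrapper of this kind could be put together, here is a minimal sketch that writes each text to a temporary file, runs 'cabocha -f3 -n1' through `system2()`, and collects the output file paths. The function name `run_cabocha_sketch`, the use of `stdin`/`stdout` redirection, and the early return when the executable is missing are all assumptions made for illustration; the package may handle input, output, and the 'mecabrc' path differently.

```r
# Hypothetical sketch, not the package's implementation.
run_cabocha_sketch <- function(text) {
  exe <- Sys.which("cabocha")
  if (!nzchar(exe)) {
    message("'cabocha' was not found on the PATH; nothing was run.")
    return(character(0))
  }
  vapply(text, function(x) {
    infile <- tempfile(fileext = ".txt")
    outfile <- tempfile(fileext = ".xml")
    writeLines(enc2utf8(x), infile, useBytes = TRUE)
    # -f3 and -n1 are the flags named in the description above.
    system2(exe, args = c("-f3", "-n1"), stdin = infile, stdout = outfile)
    outfile
  }, character(1), USE.NAMES = FALSE)
}

paths <- run_cabocha_sketch("\u96e8\u306b\u3082\u8ca0\u3051\u305a")
```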
Cast dependency structure as an igraph
ppn_make_graph(df)
df |
Output of |
An 'igraph' object is returned.
xml <- ppn_parse_xml(system.file("sample.xml", package = "pipian"))
ppn_make_graph(xml)
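The general idea of casting a dependency structure to a graph can be sketched with 'igraph' directly. In CaboCha output each chunk points to the chunk it depends on, and the root chunk carries a link of -1. The column names `chunk_id` and `link` and the three-chunk sample below are hypothetical and need not match the columns that this package's parser produces.

```r
# Hypothetical sketch; runs only when the 'igraph' package is installed.
if (requireNamespace("igraph", quietly = TRUE)) {
  chunks <- data.frame(
    chunk_id = c(0, 1, 2),
    link = c(1, 2, -1)  # -1 marks the root chunk, which has no head
  )
  # Keep only chunks that actually depend on another chunk.
  edges <- chunks[chunks$link >= 0, c("chunk_id", "link")]
  g <- igraph::graph_from_data_frame(
    edges,
    directed = TRUE,
    vertices = data.frame(name = chunks$chunk_id)
  )
  print(g)
}
```

Passing the full chunk list as `vertices` keeps the root in the graph even though it has no outgoing edge.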
Parse XML output of CaboCha
ppn_parse_xml( path, into = c("POS1", "POS2", "POS3", "POS4", "X5StageUse1", "X5StageUse2", "Original", "Yomi1", "Yomi2"), col_select = seq_along(into) )
path |
String; output from |
into |
Character vector; feature names of output. |
col_select |
Character or integer vector; features that will be kept in the result. |
A data.frame.
head(ppn_parse_xml(system.file("sample.xml", package = "pipian")))
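The roles of 'into' and 'col_select' can be illustrated without the package: the morphological features of a token arrive as one comma-separated string, 'into' supplies a name for each field, and 'col_select' keeps a subset of them. The helper `split_features` and the sample feature string below are assumptions made for this sketch and do not show how `ppn_parse_xml()` is implemented.

```r
into <- c("POS1", "POS2", "POS3", "POS4", "X5StageUse1",
          "X5StageUse2", "Original", "Yomi1", "Yomi2")

# Sample feature string in the comma-separated style of a 'tok' element.
feature <- "\u540d\u8a5e,\u4e00\u822c,*,*,*,*,\u96e8,\u30a2\u30e1,\u30a2\u30e1"

# Hypothetical helper: split each feature string, name the fields with
# `into`, then keep only the columns requested through `col_select`.
split_features <- function(feature, into, col_select = seq_along(into)) {
  parts <- strsplit(feature, ",", fixed = TRUE)
  rows <- lapply(parts, function(p) {
    length(p) <- length(into)  # pad with NA or truncate to match `into`
    p
  })
  mat <- do.call(rbind, rows)
  colnames(mat) <- into
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  df[, col_select, drop = FALSE]
}

res <- split_features(feature, into, col_select = c("POS1", "Original"))
res
```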