Title: | 'Rcpp' Wrapper for 'MeCab' Library |
---|---|
Description: | R package based on 'Rcpp' for 'MeCab': Yet Another Part-of-Speech and Morphological Analyzer. The purpose of this package is to provide a seamless development and analysis environment for CJK texts. The package uses parallel programming to provide a highly efficient text preprocessing function, 'posParallel()'. |
Authors: | Junhewk Kim [aut, cre], Akiru Kato [aut], Kohei Watanabe [ctb] |
Maintainer: | Akiru Kato <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.0 |
Built: | 2024-11-05 05:13:34 UTC |
Source: | https://github.com/paithiov909/RcppMeCab |
pos
Returns part-of-speech (POS) tagged morphemes of the sentence.
pos( sentence, join = TRUE, format = c("list", "data.frame"), sys_dic = "", user_dic = "" )
sentence |
A character vector of any length. For analyzing multiple sentences, put them in one character vector. |
join |
A logical that decides the output format. The default value is TRUE. If FALSE, the function returns morphemes only, with the tags stored in an attribute. If 'format = "data.frame"', this argument is ignored. |
format |
The data type of the result. The default value is "list". Set this to "data.frame" to get the result as a data frame. |
sys_dic |
The location of the system MeCab dictionary. The default value is "". |
user_dic |
The location of a user-specific MeCab dictionary. The default value is "". |
This is the basic function for the MeCab part-of-speech tagger. It takes a character vector of any length and runs the loop inside C++ for faster processing.
You can add a user dictionary with 'user_dic'. It must be compiled by 'mecab-dict-index'. An explanation of how to compile a user dictionary is available at https://github.com/junhewk/RcppMeCab.
You can also set a system dictionary with 'sys_dic', which is especially useful if you are using multiple dictionaries (for example, both the IPA and Juman dictionaries for Japanese). Using options(mecabSysDic = "#the path to your system dictionary"), you can set your preferred system dictionary for the R session.
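As a minimal sketch of the options() approach described above, the snippet below sets the 'mecabSysDic' option, reads it back, and restores the previous value. The path is only a placeholder borrowed from the examples in this manual; substitute the directory of your own dictionary. The sketch uses base R only and does not call MeCab.

```r
# Placeholder path taken from the examples in this manual;
# replace it with the directory of your own system dictionary.
dic_path <- "/usr/local/lib/mecab/dic/mecab-ipadic-neologd/"

# options() returns the previous values, so they can be restored later.
old <- options(mecabSysDic = dic_path)

# The option can now be read by any code running in this R session.
current <- getOption("mecabSysDic")
print(current)

# Restore the previous setting when done.
options(old)
```

Saving the return value of options() is a general base R idiom; it is handy when only part of a script should use a particular dictionary.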
If you want morphemes only, use 'join = FALSE' to put the tag names in the attribute. By default, the function returns a list of character vectors with (morpheme)/(tag) elements.
A character vector of POS-tagged morphemes is returned in conjoined form. The element names of the list are the original phrases.
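The conjoined (morpheme)/(tag) form can be taken apart again with base R. The following is a rough sketch: the input list is a hand-written stand-in with placeholder tokens, not real MeCab output, and split_tagged() is a hypothetical helper that is not part of this package. It assumes that tags themselves contain no "/".

```r
# Hand-written stand-in for a joined result; NOT real MeCab output.
# The list name plays the role of the original phrase.
tagged <- list("word1 word2" = c("word1/TAG1", "word2/TAG2"))

# Hypothetical helper: split each "morpheme/tag" element at the last "/".
split_tagged <- function(x) {
  morpheme <- sub("/[^/]*$", "", x)  # drop the final "/tag" part
  tag <- sub("^.*/", "", x)          # keep only what follows the last "/"
  data.frame(morpheme = morpheme, tag = tag, stringsAsFactors = FALSE)
}

# One data frame per original phrase.
parsed <- lapply(tagged, split_tagged)
print(parsed[["word1 word2"]])
```

When a tabular result is all that is needed, calling the function with 'format = "data.frame"' avoids this post-processing step.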
## Not run:
sentence <- c("some UTF-8 texts")
pos(sentence)
pos(sentence, join = FALSE)
pos(sentence, format = "data.frame")
pos(sentence, user_dic = "~/user_dic.dic")
# System dictionary example: in case of using mecab-ipadic-NEologd
pos(sentence, sys_dic = "/usr/local/lib/mecab/dic/mecab-ipadic-neologd/")
## End(Not run)
posParallel
Returns part-of-speech (POS) tagged morphemes of the sentence.
pos_parallel( sentence, join = TRUE, format = c("list", "data.frame"), sys_dic = "", user_dic = "" )
sentence |
A character vector of any length. For analyzing multiple sentences, put them in one character vector. |
join |
A logical that decides the output format. The default value is TRUE. If FALSE, the function returns morphemes only, with the tags stored in an attribute. If 'format = "data.frame"', this argument is ignored. |
format |
The data type of the result. The default value is "list". Set this to "data.frame" to get the result as a data frame. |
sys_dic |
The location of the system MeCab dictionary. The default value is "". |
user_dic |
The location of a user-specific MeCab dictionary. The default value is "". |
This is a parallelized version of the MeCab part-of-speech tagger. It takes a character vector of any length and runs the loop inside C++ with Intel TBB for faster processing.
Parallelizing over a character vector is not supported by RcppParallel. Thus, this function makes duplicates of the input and the output. Therefore, if your data volume is large, use pos or divide the vector into several sub-vectors.
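The advice to divide a large vector into sub-vectors could be sketched as follows. tag_in_chunks() and fake_tagger() are hypothetical names introduced only for this illustration; the stand-in tagger simply splits on spaces so that the sketch runs without MeCab installed. With RcppMeCab loaded, the parallel tagger documented here could be passed as the 'tagger' argument instead.

```r
# Hypothetical helper: tag a character vector in chunks of at most
# `chunk_size` elements, so only one chunk is duplicated at a time.
tag_in_chunks <- function(x, chunk_size, tagger) {
  stopifnot(is.character(x), chunk_size >= 1)
  # Chunk index for each element: 1, 1, ..., 2, 2, ...
  groups <- ceiling(seq_along(x) / chunk_size)
  chunks <- split(x, groups)
  # Tag each chunk separately, then concatenate the per-chunk lists.
  results <- lapply(chunks, tagger)
  do.call(c, unname(results))
}

# Stand-in tagger (NOT MeCab): splits on spaces and names each
# result by its original phrase, mimicking the list output shape.
fake_tagger <- function(x) setNames(strsplit(x, " ", fixed = TRUE), x)

out <- tag_in_chunks(c("a b", "c d e", "f"), chunk_size = 2,
                     tagger = fake_tagger)
print(names(out))
```

A suitable chunk size depends on the available memory and the length of the texts, so it is left as a parameter rather than fixed.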
You can add a user dictionary with 'user_dic'. It must be compiled by 'mecab-dict-index'. An explanation of how to compile a user dictionary is available at https://github.com/junhewk/RcppMeCab.
You can also set a system dictionary with 'sys_dic', which is especially useful if you are using multiple dictionaries (for example, both the IPA and Juman dictionaries for Japanese). Using options(mecabSysDic = "#the path to your system dictionary"), you can set your preferred system dictionary for the R session.
If you want morphemes only, use 'join = FALSE' to put the tag names in the attribute. By default, the function returns a list of character vectors with (morpheme)/(tag) elements.
A character vector of POS-tagged morphemes is returned in conjoined form. The element names of the list are the original phrases.
## Not run:
sentence <- c("some UTF-8 texts")
posParallel(sentence)
posParallel(sentence, join = FALSE)
posParallel(sentence, format = "data.frame")
posParallel(sentence, user_dic = "~/user_dic.dic")
# System dictionary example: in case of using mecab-ipadic-NEologd
pos(sentence, sys_dic = "/usr/local/lib/mecab/dic/mecab-ipadic-neologd/")
## End(Not run)
posParallel
Returns part-of-speech (POS) tagged morphemes of the sentence.
posParallel( sentence, join = TRUE, format = c("list", "data.frame"), sys_dic = "", user_dic = "" )
sentence |
A character vector of any length. For analyzing multiple sentences, put them in one character vector. |
join |
A logical that decides the output format. The default value is TRUE. If FALSE, the function returns morphemes only, with the tags stored in an attribute. If 'format = "data.frame"', this argument is ignored. |
format |
The data type of the result. The default value is "list". Set this to "data.frame" to get the result as a data frame. |
sys_dic |
The location of the system MeCab dictionary. The default value is "". |
user_dic |
The location of a user-specific MeCab dictionary. The default value is "". |
This is a parallelized version of the MeCab part-of-speech tagger. It takes a character vector of any length and runs the loop inside C++ with Intel TBB for faster processing.
Parallelizing over a character vector is not supported by RcppParallel. Thus, this function makes duplicates of the input and the output. Therefore, if your data volume is large, use pos or divide the vector into several sub-vectors.
You can add a user dictionary with 'user_dic'. It must be compiled by 'mecab-dict-index'. An explanation of how to compile a user dictionary is available at https://github.com/junhewk/RcppMeCab.
You can also set a system dictionary with 'sys_dic', which is especially useful if you are using multiple dictionaries (for example, both the IPA and Juman dictionaries for Japanese). Using options(mecabSysDic = "#the path to your system dictionary"), you can set your preferred system dictionary for the R session.
If you want morphemes only, use 'join = FALSE' to put the tag names in the attribute. By default, the function returns a list of character vectors with (morpheme)/(tag) elements.
A character vector of POS-tagged morphemes is returned in conjoined form. The element names of the list are the original phrases.
## Not run:
sentence <- c("some UTF-8 texts")
posParallel(sentence)
posParallel(sentence, join = FALSE)
posParallel(sentence, format = "data.frame")
posParallel(sentence, user_dic = "~/user_dic.dic")
# System dictionary example: in case of using mecab-ipadic-NEologd
pos(sentence, sys_dic = "/usr/local/lib/mecab/dic/mecab-ipadic-neologd/")
## End(Not run)
R package based on 'Rcpp' for 'MeCab': Yet Another Part-of-Speech and Morphological Analyzer (http://taku910.github.io/mecab/). The purpose of this package is to provide a seamless development and analysis environment for CJK texts. The package uses parallel programming to provide a highly efficient text preprocessing function, posParallel().
For installation, please refer to the README.md file.
This package uses the 'MeCab' C API and 'Rcpp' code.
Junhewk Kim
Useful links:
Report bugs at https://github.com/paithiov909/RcppMeCab/issues