Title: | Interface to 'MeCab' |
---|---|
Description: | Parses Japanese texts with 'MeCab'. The original 'MeCab' is licensed under the BSD 3-Clause "New" or "Revised" License. See the "LICENSE.note" file for its license notice. |
Authors: | Motohiro Ishida [aut, cre], Taku Kudo [cph], Nippon Telegraph and Telephone Corporation [cph] |
Maintainer: | Motohiro Ishida <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.14 |
Built: | 2025-03-05 15:23:08 UTC |
Source: | https://github.com/paithiov909/rmecab-doc |
Checks if any mecabrc file exists.
anyRcfileExists()
This is a helper function that checks whether any mecabrc file exists before the tagger is initialized.
'MeCab' expects a mecabrc file to be present; if not, it will raise an error (without any message!).
A logical.
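A typical use is to guard calls that depend on the tagger. The following is a minimal sketch, not taken from the package: tagger_ready is a hypothetical variable name, and the requireNamespace() guard is added only so that the snippet also runs on machines where 'RMeCab' is not installed.

```r
# Check for the package first so the snippet is safe to run anywhere.
# tagger_ready is a hypothetical name used only in this sketch.
tagger_ready <- requireNamespace("RMeCab", quietly = TRUE) &&
  isTRUE(RMeCab::anyRcfileExists())

if (tagger_ready) {
  message("A mecabrc file was found; the tagger can be initialized.")
} else {
  message("No mecabrc file found (or 'RMeCab' is not installed).")
}
```

Checking up front gives a readable message in place of the silent error described above.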
Finds collocations from the specified text file. Takes a node word and a window span as arguments.
collocate(filename, node, span = 3, dic = "", mecabrc = "", etc = "")
filename |
An input file. |
node |
Node word. |
span |
Window span. Defaults to 3. |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data.frame.
## Not run:
text_file <- system.file("samples/doc1.txt", package = "RMeCab")
out <- collocate(text_file, "\u6570\u5b66")
out
## End(Not run)
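To make the idea of a node word and a window span concrete, here is a base R sketch that counts the tokens falling within span positions of each occurrence of a node. It is an illustration only and not the implementation used by collocate(): the function name count_collocates and the pre-split tokens vector are hypothetical, whereas collocate() itself reads a file and tokenizes it with 'MeCab'.

```r
# Count tokens that occur within `span` positions of a node word.
# Hypothetical helper for illustration; not part of 'RMeCab'.
count_collocates <- function(tokens, node, span = 3) {
  hits <- which(tokens == node)
  if (length(hits) == 0) {
    return(data.frame(Term = character(0), Span = integer(0)))
  }
  neighbours <- unlist(lapply(hits, function(i) {
    # Window of `span` tokens on each side, clipped to the text boundaries
    window <- seq(max(1, i - span), min(length(tokens), i + span))
    tokens[setdiff(window, i)]
  }))
  as.data.frame(
    table(Term = neighbours),
    responseName = "Span",
    stringsAsFactors = FALSE
  )
}

tokens <- c("x", "math", "y", "z", "math", "x")
count_collocates(tokens, node = "math", span = 1)
```

In this sketch, a token that lies inside the windows of two node occurrences is counted once per window.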
Calculates T-score and MI-score according to the result of collocate().
collScores(kekka, node, span)
kekka |
Result of collocate(). |
node |
Node word. |
span |
Window span. |
A data frame.
## Not run:
text_file <- system.file("samples/doc1.txt", package = "RMeCab")
out <- collocate(text_file, "\u6570\u5b66")
collScores(out, "\u6570\u5b66", 3)
## End(Not run)
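For readers unfamiliar with these association measures, the sketch below computes a T-score and an MI-score from raw counts using a common corpus-linguistics formulation. It is an assumption that collScores() estimates the expected frequency in the same way; the function name assoc_scores and all of its argument names are hypothetical.

```r
# O:               observed co-occurrences of the collocate inside the window
# collocate_total: total frequency of the collocate in the text
# node_total:      total frequency of the node word in the text
# n_tokens:        total number of tokens in the text
# span:            window size on each side of the node
assoc_scores <- function(O, collocate_total, node_total, n_tokens, span) {
  # Expected co-occurrences if the collocate were spread evenly over the text;
  # the window covers `span` tokens on each side of every node occurrence.
  E <- collocate_total / n_tokens * node_total * span * 2
  c(T = (O - E) / sqrt(O), MI = log2(O / E))
}

assoc_scores(O = 8, collocate_total = 20, node_total = 10,
             n_tokens = 1000, span = 3)
```

Both scores grow as the observed count exceeds the expected count; the MI-score tends to favour rare collocates, while the T-score favours frequent ones.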
Counts tokens (characters, terms, or N-grams) within target, which can be a file, a directory, or a data.frame.
docDF( target, column = 0, type = 0, pos = NULL, minFreq = 1, N = 1, Genkei = 0, weight = "", nDF = 0, co = 0, dic = "", mecabrc = "", etc = "" )
target |
A file, directory, or a data.frame. |
column |
Column number or name that contains the text to analyze. |
type |
Kind of tokens. |
pos |
Parts of speech that should be extracted.
If |
minFreq |
Minimum document frequency for filtering terms.
Terms that appear less than |
N |
Unit of tokens. If |
Genkei |
If |
weight |
Method to weight term frequencies. |
nDF |
If |
co |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data.frame is invisibly returned.
## Not run:
text_dir <- system.file("samples", package = "RMeCab")
out <- docDF(text_dir, column = 0, type = 1, minFreq = 2)
head(out)
## End(Not run)
Creates a document-term matrix out of all files in a given directory. Each cell of the matrix shows the actual frequency of each word.
docMatrix( mydir, pos = "Default", minFreq = 1, weight = "no", kigo = 0, co = 0, dic = "", mecabrc = "", etc = "" )
mydir |
A directory where text files are stored. |
pos |
Parts of speech that should be extracted.
If |
minFreq |
Minimum document frequency for filtering terms.
Terms that appear less than |
weight |
Method to weight term frequencies. |
kigo |
If |
co |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
An integer matrix is invisibly returned.
## Not run:
text_dir <- system.file("samples", package = "RMeCab")
out <- docMatrix(text_dir)
head(out)
## End(Not run)
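The following base R sketch shows what a matrix of raw term frequencies looks like and how a minimum document frequency filter can be applied to it. It does not call docMatrix(): the docs list of already-tokenized documents, the variable names, and the choice of terms in rows and documents in columns are all assumptions made for illustration.

```r
# Hypothetical, already-tokenized documents
# (docMatrix() would instead tokenize files with 'MeCab').
docs <- list(
  doc1 = c("cat", "dog", "cat"),
  doc2 = c("dog", "bird")
)

# One row per distinct term, one column per document
terms <- sort(unique(unlist(docs)))
dtm <- vapply(
  docs,
  function(tokens) as.integer(table(factor(tokens, levels = terms))),
  integer(length(terms))
)
rownames(dtm) <- terms
dtm

# Keep only terms that occur in at least min_freq documents
min_freq <- 2
doc_freq <- rowSums(dtm > 0)
dtm_filtered <- dtm[doc_freq >= min_freq, , drop = FALSE]
dtm_filtered
```

Note that the filter counts the number of documents containing a term, not the total number of occurrences.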
Creates a document-term matrix out of all files in a given directory. Each cell of the matrix shows the actual frequency of each word.
docMatrix2( directory, pos = "Default", minFreq = 1, weight = "no", kigo = 0, co = 0, dic = "", mecabrc = "", etc = "" )
directory |
A directory where text files are stored or a single file. |
pos |
Parts of speech that should be extracted.
If |
minFreq |
Minimum document frequency for filtering terms.
Terms that appear less than |
weight |
Method to weight term frequencies. |
kigo |
If |
co |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
An integer matrix is invisibly returned.
## Not run:
text_dir <- system.file("samples", package = "RMeCab")
out <- docMatrix2(text_dir)
head(out)
## End(Not run)
Creates a document-term matrix out of a character vector. Each cell of the matrix shows the actual frequency of each word.
docMatrixDF( charVec = c("MeCab", "CaBoCha"), pos = "Default", minFreq = 1, weight = "no", co = 0, dic = "", mecabrc = "", etc = "" )
charVec |
A character vector. |
pos |
Parts of speech that should be extracted.
If |
minFreq |
Minimum document frequency for filtering terms.
Terms that appear less than |
weight |
Method to weight term frequencies. |
co |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
An integer matrix is invisibly returned.
Creates a data.frame of N-grams out of all files in a given directory.
docNgram( mydir, type = 1, N = 2, pos = "Default", dic = "", mecabrc = "", etc = "" )
mydir |
A directory where text files are stored. |
type |
Kind of tokens. |
N |
Unit of tokens. If |
pos |
Parts of speech that should be extracted.
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data.frame is invisibly returned.
## Not run:
text_dir <- system.file("samples", package = "RMeCab")
out <- docNgram(text_dir, type = 1)
head(out)
## End(Not run)
Creates a data frame of N-grams out of all files in a given directory.
docNgram2( directory, type = 0, pos = "Default", minFreq = 1, N = 2, kigo = 0, weight = "no", dic = "", mecabrc = "", etc = "" )
directory |
A directory in which text files are stored or a single file. |
type |
Kind of tokens. |
pos |
Parts of speech that should be extracted.
If |
minFreq |
Minimum document frequency for filtering terms.
Terms that appear less than |
N |
Unit of tokens. If |
kigo |
If |
weight |
Method to weight term frequencies. |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data.frame is invisibly returned.
## Not run:
text_dir <- system.file("samples", package = "RMeCab")
out <- docNgram2(text_dir, type = 1)
head(out)
## End(Not run)
Creates a data.frame of N-grams out of a character vector.
docNgramDF( mojiVec = "MeCab", type = 0, pos = "Default", baseform = 0, minFreq = 1, N = 1, kigo = 0, weight = "no", co = 0, dic = "", mecabrc = "", etc = "" )
mojiVec |
A character vector. |
type |
Kind of tokens. |
pos |
Parts of speech that should be extracted.
If |
baseform |
Genkei. See |
minFreq |
Minimum document frequency for filtering terms.
Terms that appear less than |
N |
Unit of tokens. If |
kigo |
If |
weight |
Method to weight term frequencies. |
co |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data frame is invisibly returned.
Returns a data.frame of N-grams.
Ngram( filename, type = 0, N = 2, pos = "Default", dic = "", mecabrc = "", etc = "" )
filename |
An input file. |
type |
Kind of tokens. |
N |
Unit of tokens. If |
pos |
Parts of speech that should be extracted.
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data.frame.
## Not run:
text_file <- system.file("samples/doc1.txt", package = "RMeCab")
out <- Ngram(text_file, type = 1)
head(out)
## End(Not run)
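As a rough picture of what an N-gram frequency table contains, the base R sketch below counts N-grams over a vector of tokens. It is not how Ngram() is implemented: the function name ngram_counts, the hyphen used to join tokens, and the pre-split tokens vector are assumptions for illustration. The same logic applies whether the tokens are characters or terms.

```r
# Count contiguous runs of `n` tokens. Hypothetical helper; not part of 'RMeCab'.
ngram_counts <- function(tokens, n = 2) {
  if (length(tokens) < n) {
    return(data.frame(Ngram = character(0), Freq = integer(0)))
  }
  # Each N-gram starts at one of the first length(tokens) - n + 1 positions
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = "-"),
    character(1)
  )
  as.data.frame(table(Ngram = grams), stringsAsFactors = FALSE)
}

ngram_counts(c("a", "b", "a", "b", "c"), n = 2)
```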
Returns a data frame of N-grams.
NgramDF( filename, type = 0, N = 2, pos = "Default", dic = "", mecabrc = "", etc = "" )
filename |
An input file. |
type |
Kind of tokens. |
N |
Unit of tokens. If |
pos |
Parts of speech that should be extracted.
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data.frame.
## Not run:
text_file <- system.file("samples/doc1.txt", package = "RMeCab")
out <- NgramDF(text_file, type = 1)
head(out)
## End(Not run)
Creates a data.frame of N-grams out of all files in a given directory.
NgramDF2( directory, type = 0, pos = "Default", minFreq = 1, N = 2, kigo = 0, dic = "", mecabrc = "", etc = "" )
directory |
A directory in which text files are stored or a single file. |
type |
Kind of tokens. |
pos |
Parts of speech that should be extracted.
If |
minFreq |
Minimum document frequency for filtering terms.
Terms that appear less than |
N |
Unit of tokens. If |
kigo |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data.frame is invisibly returned.
## Not run:
text_dir <- system.file("samples", package = "RMeCab")
out <- NgramDF2(text_dir, type = 1)
head(out)
## End(Not run)
Takes a string as an argument and tokenizes it into a list of length-1 terms.
RMeCabC(str, mypref = 0, dic = "", mecabrc = "", etc = "")
str |
A string scalar to be tokenized. |
mypref |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A list.
## Not run:
text <- scan(
  system.file("samples/doc1.txt", package = "RMeCab"),
  what = character()
)
unlist(RMeCabC(text))
## End(Not run)
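The sketch below shows why the example wraps the result in unlist(). It builds a hypothetical stand-in for the returned list by hand, assuming that each element is a length-1 character vector named by its part of speech; that shape is an assumption here, since this page only states that a list is returned.

```r
# Hypothetical stand-in for a tokenization result: one element per term,
# each a length-1 character vector named by its part of speech.
tokens <- list(
  setNames("\u6570\u5b66", "\u540d\u8a5e"),  # noun: "mathematics"
  setNames("\u3092", "\u52a9\u8a5e"),        # particle
  setNames("\u5b66\u3076", "\u52d5\u8a5e")   # verb: "to learn"
)

# Flatten to a named character vector: names are tags, values are terms
flat <- unlist(tokens)

# Select terms by part of speech, here nouns only
nouns <- flat[names(flat) == "\u540d\u8a5e"]
nouns
```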
Takes a data frame as an argument and tokenizes the text in one of its columns into lists of length-1 terms.
RMeCabDF(dataf, coln, mypref = 0, dic = "", mecabrc = "", etc = "")
dataf |
A data.frame. |
coln |
Column number or name that contains the text to analyze. |
mypref |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
This is a wrapper around RMeCabC(). Any blank values in coln should be replaced with NA.
A list.
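Replacing blanks with NA can be done in base R before calling the function. The sketch below is one way to do it; the data frame dataf and its text column are made up for illustration, and treating whitespace-only strings as blank is an assumption.

```r
# Hypothetical input with an empty and a whitespace-only entry
dataf <- data.frame(
  id = 1:3,
  text = c("\u6570\u5b66\u3092\u5b66\u3076", "", "   "),
  stringsAsFactors = FALSE
)

# Mark empty or whitespace-only strings as missing
is_blank <- !nzchar(trimws(dataf$text))
dataf$text[is_blank] <- NA_character_
dataf
```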
Takes a file as an argument and tokenizes it into a list of terms.
RMeCabDoc(filename, mypref = 1, kigo = 0, dic = "", mecabrc = "", etc = "")
filename |
An input file. |
mypref |
If |
kigo |
If |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A list.
## Not run:
text_file <- system.file("samples/doc1.txt", package = "RMeCab")
unlist(RMeCabDoc(text_file))
## End(Not run)
Takes a text file as its first argument and returns parts of speech and frequencies as a data.frame.
RMeCabFreq(filename, dic = "", mecabrc = "", etc = "")
filename |
An input file. |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A data.frame.
## Not run:
text_file <- system.file("samples/doc1.txt", package = "RMeCab")
RMeCabFreq(text_file)
## End(Not run)
Takes a file as an argument and tokenizes it into a list of terms and parts of speech.
RMeCabText(filename, dic = "", mecabrc = "", etc = "")
filename |
An input file. |
dic |
Path to a user dictionary file such as |
mecabrc |
Path to a mecabrc file. |
etc |
Other options for 'MeCab' tagger. |
A list.
## Not run:
text_file <- system.file("samples/doc1.txt", package = "RMeCab")
RMeCabText(text_file)
## End(Not run)