NEWS
gibasa 1.1.1.9002
- Refactored
tagger_impl
to improve performance of tokenize(split = TRUE)
.
gibasa 1.1.1 (2024-07-06)
tokenize
now warns rather than throws an error when an invalid input is given during partial parsing. With this change, tokenize
is no longer entirely aborted even if an invalid string is given. Parsing of those strings is simply skipped.
gibasa 1.1.0 (2024-02-17)
- Corrected probabilistic IDF calculation by
global_idf3
.
- Refactored
bind_tf_idf2
.
- Breaking Change: Changed behavior when
norm=TRUE
. Cosine nomalization is now performed on tf_idf
values as in the RMeCab package.
- Added
tf="itf"
and idf="df"
options.
gibasa 1.0.1 (2023-12-02)
- Added wrappers around dictionary compiler of MeCab.
gibasa 0.9.5 (2023-07-09)
- Removed audubon dependency for maintainability.
pack
now preserves doc_id
type when it's factor.
gibasa 0.9.4 (2023-06-03)
- Updated Makevars for Unix alikes. Users can now use a file specified by the
MECABRC
environment variable or ~/.mecabrc
to set up dictionaries.
gibasa 0.9.3 (2023-04-20)
- Removed unnecessary C++ files.
gibasa 0.9.2 (2023-04-12)
- Prepare for CRAN release.
gibasa 0.8.1
- For performance,
tokenize
now skips resetting the output encodings to UTF-8.
gibasa 0.8.0
- Breaking Change: Changed numbering style of 'sentence_id' when
split
is FALSE
.
- Added
grain_size
argument to tokenize
.
- Added new
bind_lr
function.
gibasa 0.7.4
- Use
RcppParallel::parallelFor
instead of tbb::parallel_for
. There are no user's visible changes.
gibasa 0.7.1
- Fix documentations. There are no visible changes.
gibasa 0.7.0
tokenize
can now accept a character vector in addition to a data.frame like object.
gbs_tokenize
is now deprecated. Please use the tokenize
function instead.
gibasa 0.6.4
gibasa 0.6.3
- Added the
partial
argument to gbs_tokenize
and tokenize
. This argument controls the partial parsing mode, which forces to extract given chunks of sentences when activated.
gibasa 0.6.2
- More friendly errors are returned when invalid dictionary path was provided.
- Added new
posDebugRcpp
function.
gibasa 0.6.1
- Revert some missing examples.
gibasa 0.6.0
- Functions added in version '0.5.1' was moved to 'audubon' package (>= 0.4.0).
gibasa 0.5.1
- Added some new functions.
bind_tf_idf2
can calculate and bind the term frequency, inverse document frequency, and tf-idf of the tidy text dataset.
collapse_tokens
, mute_tokens
, and lexical_density
can be used for handling a tidy text dataset of tokens.
gibasa 0.5.0
- gibasa now includes the MeCab source, so that users do not need to pre-install the MeCab library when building and installing the package (to use
tokenize
, it still requires MeCab and its dictionaries installed and available).
gibasa 0.4.1
tokenize
now preserves the original order of docid_field
.
gibasa 0.4.0
- Added
bind_tf_idf2
function and is_blank
function.
gibasa 0.3.1
gibasa 0.3.0
- Changed build process on Windows.
- Added a vignette.
gibasa 0.2.1
prettify
now can extract columns only specified by col_select
.
gibasa 0.2.0
- Added a
NEWS.md
file to track changes to the package.
tokenize
now takes a data.frame as its first argument, returns a data.frame only. The former function that gets character vector and returns a data.frame or named list was renamed as gbs_tokenize
.