| Title: | Utilities for Various Japanese Corpora |
|---|---|
| Description: | The goal of the ldccr package is to make it easy to use Japanese language resources. The package provides parsers for several Japanese corpora that are distributed under free or open licenses, as well as a downloader for zipped text files published on Aozora Bunko. |
| Authors: | Akiru Kato [aut, cre], Sqids maintainers [cph] (2023-present) |
| Maintainer: | Akiru Kato <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2025.02.02 |
| Built: | 2025-02-02 14:18:35 UTC |
| Source: | https://github.com/paithiov909/ldccr |
Metadata of text files published on Aozora Bunko

AozoraBunkoSnapshot

An object of class tbl_df (inherits from tbl, data.frame) with 19296 rows and 55 columns.

Source: http://www.aozora.gr.jp/index_pages/list_person_all_extended_utf8.zip

The structure of the data is described on the Aozora Bunko site.
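
A minimal sketch of inspecting the snapshot, assuming ldccr is installed (dplyr::glimpse() is optional sugar):

```r
library(ldccr)

# Load the bundled metadata table and peek at its shape.
data("AozoraBunkoSnapshot", package = "ldccr")
dim(AozoraBunkoSnapshot)  # 19296 rows, 55 columns
dplyr::glimpse(AozoraBunkoSnapshot)
```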
Data for Textual Entailment

```r
jrte_rte_files(
  keep = c("rte.nlp2020_base", "rte.nlp2020_append", "rte.lrec2020_surf",
           "rte.lrec2020_sem_short", "rte.lrec2020_sem_long", "rte.lrec2020_me")
)
```

keep | Character vector. File names to parse.

TSV file names.

Other jrte-reader: parse_jrte_judges(), parse_jrte_reasoning(), read_jrte()
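
Since jrte_rte_files() only filters file names, it runs offline. A short sketch, assuming keep is matched against the default names:

```r
library(ldccr)

jrte_rte_files()                           # all known RTE file names
jrte_rte_files(keep = "rte.nlp2020_base")  # restrict to the base split
```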
List of categories of the Livedoor News Corpus

```r
ldnws_categories(
  keep = c("dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
           "movie-enter", "peachy", "smax", "sports-watch", "topic-news")
)
```

keep | Category names to parse.

A character vector.

Other ldnws-reader: read_ldnws()
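
Likewise, ldnws_categories() just filters category names and runs offline:

```r
library(ldccr)

# Keep only the two categories we care about.
ldnws_categories(keep = c("sports-watch", "topic-news"))
```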
Whole text of 'Wagahai Wa Neko Dearu' written by Natsume Souseki, from Aozora Bunko

NekoText

An object of class character of length 2258.

Source: https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip
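
A minimal sketch of inspecting the bundled text:

```r
library(ldccr)

length(NekoText)   # 2258 elements
head(NekoText, 3)  # opening lines of the novel
```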
Parse the reasoning column of 'rte.*.tsv'

```r
parse_jrte_reasoning(tbl)
```

tbl | A tibble returned from read_jrte().

A tibble.

Other jrte-reader: jrte_rte_files(), parse_jrte_judges(), read_jrte()
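
A hedged sketch of chaining the readers; it assumes read_jrte() names the returned list after the TSV files in keep, and it downloads the corpus on first use:

```r
library(ldccr)

# Download (memoised) and keep a single RTE file.
jrte <- read_jrte(keep = jrte_rte_files(keep = "rte.lrec2020_sem_long"))

# Expand the reasoning column of that tibble.
parse_jrte_reasoning(jrte[["rte.lrec2020_sem_long"]])
```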
Downloads a file from the specified URL, unzips it, and converts the text to UTF-8.

If you want to read a large number of the texts published on Aozora Bunko, you can download them all at once via globis-university/aozorabunko-clean.

```r
read_aozora(
  url = "https://www.aozora.gr.jp/cards/000081/files/472_ruby_654.zip",
  txtname = NULL,
  directory = file.path(getwd(), "cache")
)
```

url | URL of the text download link.
txtname | New file name as which the text is saved. If left as NULL, the name of the source file is kept.
directory | Path where the new file is saved.

The path to the downloaded file.
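
Because read_aozora() returns the path to the converted file, readLines() can pick it up directly. A sketch using the default URL from the usage above (this downloads a file on first run):

```r
library(ldccr)

path <- read_aozora(directory = tempdir())
head(readLines(path))  # the saved file is UTF-8
```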
Downloads and reads the ja.text8 corpus as a tibble.

```r
read_ja_text8(
  url = "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip",
  size = NULL
)
```

url | String.
size | Integer. If supplied, the number of rows to sample.

By default, this function reads the ja.text8 corpus as a tibble, splitting it into sentences. The whole ja.text8 corpus consists of over 582,000 sentences, 16,900,026 tokens, and a vocabulary of 290,811 items.

A tibble.
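
A minimal sketch; passing size samples rows from the parsed corpus, which keeps the result small (the zip is still fetched from the URL above on first run):

```r
library(ldccr)

text8 <- read_ja_text8(size = 100L)  # sample 100 sentences
text8
```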
Download and read the Japanese Realistic Textual Entailment Corpus. The result of this function is memoised with memoise::memoise() internally.

```r
read_jrte(
  url = "https://github.com/megagonlabs/jrte-corpus/archive/refs/heads/master.zip",
  exdir = tempdir(),
  keep = jrte_rte_files(),
  keep_rhr = FALSE,
  keep_pn = FALSE
)
```

url | String. If left as the default, the corpus is downloaded from the GitHub repository above.
exdir | String. Path in which to temporarily unzip text files.
keep | List. File names to parse and keep in the returned value.
keep_rhr | Logical. If TRUE, keeps 'rhr.tsv' as an element of the returned value.
keep_pn | Logical. If TRUE, keeps 'pn.tsv' as an element of the returned value.

A list of tibbles.

Other jrte-reader: jrte_rte_files(), parse_jrte_judges(), parse_jrte_reasoning()
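
A hedged sketch, assuming the optional 'pn.tsv' shows up as an extra named element of the returned list when keep_pn = TRUE (the download is memoised across calls):

```r
library(ldccr)

jrte <- read_jrte(
  keep = jrte_rte_files(keep = "rte.nlp2020_base"),
  keep_pn = TRUE
)
names(jrte)  # names of the tibbles that were kept
```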
Downloads and reads the Livedoor News Corpus. The result of this function is memoised with memoise::memoise() internally.

```r
read_ldnws(
  url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz",
  exdir = tempdir(),
  keep = ldnws_categories(),
  collapse = "\n\n",
  include_title = TRUE
)
```

url | String. If left as the default, the corpus is downloaded from rondhuit.com.
exdir | String. Directory in which to temporarily untar text files.
keep | Categories to parse and keep in the tibble.
collapse | String with which the lines of each article body are concatenated.
include_title | Logical. Whether to include the title in the text body field. Defaults to TRUE.

This function downloads the Livedoor News Corpus and parses it into a tibble. For details about the Livedoor News Corpus, see the distributor's page.

A tibble.

Other ldnws-reader: ldnws_categories()
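
A sketch restricted to a single category so the parse stays small (the tarball is downloaded on first run; the result is memoised):

```r
library(ldccr)

ldnws <- read_ldnws(keep = ldnws_categories(keep = "topic-news"))
dplyr::glimpse(ldnws)
```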
sqids() is an alternative to dplyr::row_number() that generates random-looking IDs from integer ranks using Sqids (formerly Hashids). IDs generated with sqids() can easily be decoded back into the original ranks using unsqids().

```r
sqids(
  x,
  .salt = sample.int(1000, 3),
  .ties = c("sequential", "min", "max", "dense")
)

unsqids(x)
```

x | For sqids(), a vector whose ranks are encoded. For unsqids(), a character vector of IDs to decode.
.salt | Integers to use with each value of x.
.ties | Method to rank duplicate values. One of "sequential", "min", "max", or "dense".

For sqids(), a character vector of IDs. For unsqids(), integers.

```r
ids <- sqids(c(5, 1, 3, 2, 2, NA))
ids
unsqids(ids)

df <- data.frame(
  grp = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# You can use `sqids()` without referencing `x` in dplyr verbs.
dplyr::mutate(df, sqids = sqids(), row_id = unsqids(sqids))

# Use `.ties` to control how to rank duplicate values.
dplyr::mutate(df, sqids = sqids(grp, .ties = "min"), grp_id = unsqids(sqids))

# When you need to generate the same IDs for each group, fix the `.salt`:
dplyr::mutate(df, sqids = sqids(.salt = 1234L), .by = grp)
```
Downloads 'UniDic' of the specified version into dirname. This function is a partial port of polm/unidic-py. Note that the unzipped dictionary takes up about 770MB on disk after downloading.

```r
unidic_availables()

download_unidic(version = "latest", dirname = "unidic")
```

version | String; version of 'UniDic'.
dirname | String; directory in which to unzip the dictionary.

The full path to dirname is returned invisibly.
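
A sketch of checking what is available before committing to the large download; both calls are taken from the usage above:

```r
library(ldccr)

unidic_availables()  # list the UniDic versions the package knows about

# The download itself is commented out here because the dictionary
# takes up roughly 770MB on disk once unzipped.
# download_unidic(version = "latest", dirname = file.path(tempdir(), "unidic"))
```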
These functions are experimental and may be withdrawn in the future.

```r
clean_url(text, replacement = "")

clean_emoji(text, replacement = "")

is_within_era(date, era)

parse_to_jdate(date, format)
```

text | A character vector.
replacement | String.
date | Dates.
era | String.
format | String.
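
A hedged sketch of the two text cleaners; the input strings are made up, and is_within_era() and parse_to_jdate() are left out because the expected era and format values are not documented above:

```r
library(ldccr)

# Strip URLs from text (the default replacement is an empty string).
clean_url("See https://www.aozora.gr.jp/ for the full list.")

# Replace emoji with a placeholder.
clean_emoji("\u732b\U0001F431 desu.", replacement = "[emoji]")
```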