Title: | Utilities for Various Japanese Corpora |
---|---|
Description: | The goal of the ldccr package is to make Japanese language resources easy to use. This package provides parsers for several Japanese corpora that are free or openly licensed, and a downloader for zipped text files published on Aozora Bunko. |
Authors: | Akiru Kato [aut, cre] |
Maintainer: | Akiru Kato <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2024.10.10 |
Built: | 2025-01-08 05:28:41 UTC |
Source: | https://github.com/paithiov909/ldccr |
Meta data of text files published on Aozora Bunko
AozoraBunkoSnapshot
An object of class tbl_df (inherits from tbl, data.frame) with 19233 rows and 55 columns.
http://www.aozora.gr.jp/index_pages/list_person_all_extended_utf8.zip
The structure of the data is described here.
Remove emojis
clean_emoji(text, replacement = "")
text |
A character vector. |
replacement |
String. |
A character vector.
Remove URLs
clean_url(text, replacement = "")
text |
A character vector. |
replacement |
String. |
A character vector.
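As a rough illustration of what URL removal amounts to, the sketch below uses base R only. `strip_urls` and its regular expression are hypothetical stand-ins and are not taken from ldccr; the pattern actually used by `clean_url()` may differ.

```r
# Hypothetical stand-in for illustration only; not ldccr's implementation.
strip_urls <- function(text, replacement = "") {
  # Match "http://" or "https://" followed by any run of non-whitespace characters.
  gsub("https?://[^[:space:]]+", replacement, text, perl = TRUE)
}

x <- c("See https://www.aozora.gr.jp/ for texts", "no link here")
cleaned <- strip_urls(x)
print(cleaned)
```

If ldccr is installed, `ldccr::clean_url(x)` would be called the same way, with `replacement` defaulting to an empty string.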
Download 'UniDic' of the specified version into dirname.
This function is a partial port of polm/unidic-py.
Note that the unzipped dictionary takes up 770MB on disk after downloading.
download_unidic(version = "latest", dirname = "unidic")
version |
String; version of 'UniDic'. |
dirname |
String; directory into which the dictionary is unzipped. |
Full path to dirname is returned invisibly.
Check if dates are within Japanese era
is_within_era(date, era)
date |
Dates. |
era |
String. |
Logicals.
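The check can be pictured as a comparison against the first and last day of an era. The sketch below is a base R approximation: `era_bounds` and `within_era_sketch` are hypothetical names, the lowercase era keys are an assumption, and the values that `is_within_era()` actually accepts for `era` are not documented here.

```r
# Hypothetical lookup table; only two eras are listed for brevity.
# Heisei ran from 1989-01-08 to 2019-04-30; Reiwa began on 2019-05-01.
era_bounds <- list(
  heisei = as.Date(c("1989-01-08", "2019-04-30")),
  reiwa  = as.Date(c("2019-05-01", NA)) # NA marks an era with no end date yet
)

# Illustration only; not ldccr's implementation.
within_era_sketch <- function(date, era) {
  bounds <- era_bounds[[era]]
  if (is.null(bounds)) stop("Unknown era: ", era)
  date <- as.Date(date)
  upper_ok <- if (is.na(bounds[2])) TRUE else date <= bounds[2]
  date >= bounds[1] & upper_ok
}

dates <- c("2019-04-30", "2019-05-01")
within_era_sketch(dates, "heisei") # TRUE FALSE
within_era_sketch(dates, "reiwa")  # FALSE TRUE
```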
Data for Textual Entailment
jrte_rte_files( keep = c("rte.nlp2020_base", "rte.nlp2020_append", "rte.lrec2020_surf", "rte.lrec2020_sem_short", "rte.lrec2020_sem_long", "rte.lrec2020_me") )
keep |
Character vector. File names to parse. |
tsv file names.
List of categories of the Livedoor News Corpus
ldnws_categories( keep = c("dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme", "movie-enter", "peachy", "smax", "sports-watch", "topic-news") )
keep |
Character vector. File names to parse. |
A character vector.
Whole text of ‘Wagahai Wa Neko Dearu’ written by Natsume Souseki from Aozora Bunko
NekoText
An object of class character of length 2258.
https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip
Parse reasoning column of 'rte.*.tsv'
parse_jrte_reasoning(tbl)
tbl |
A tibble returned from |
A tibble.
Parse dates to Japanese dates
parse_to_jdate(date, format)
date |
Dates. |
format |
String. |
A character vector.
Download a file from specified URL, unzip and convert it to UTF-8.
read_aozora( url = "https://www.aozora.gr.jp/cards/000081/files/472_ruby_654.zip", txtname = NULL, directory = file.path(getwd(), "cache") )
url |
URL of text download link. |
txtname |
New file name under which the text is saved.
If left to |
directory |
Path where new file is saved. |
The path to the file downloaded.
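The download, unzip, and convert steps can be sketched in base R as follows. This is an outline under stated assumptions rather than the package's code: `fetch_aozora_sketch` and `to_utf8` are hypothetical helpers, and the source encoding is assumed to be CP932 (Shift_JIS), which Aozora Bunko text files commonly use.

```r
# Hypothetical helper: re-encode lines to UTF-8 (source encoding is an assumption).
to_utf8 <- function(lines, from = "CP932") {
  iconv(lines, from = from, to = "UTF-8")
}

# Illustration only; not ldccr's implementation.
fetch_aozora_sketch <- function(url, directory = file.path(tempdir(), "cache")) {
  dir.create(directory, showWarnings = FALSE, recursive = TRUE)
  zip_path <- tempfile(fileext = ".zip")
  on.exit(unlink(zip_path), add = TRUE)

  # Download in binary mode so the archive is not corrupted on Windows.
  download.file(url, zip_path, mode = "wb", quiet = TRUE)

  # unzip() returns the paths of the extracted files; take the first .txt file.
  extracted <- unzip(zip_path, exdir = directory)
  txt_path <- extracted[grepl("\\.txt$", extracted)][1]

  # Read the raw lines, convert them, and write them back as UTF-8 bytes.
  lines <- to_utf8(readLines(txt_path, warn = FALSE))
  out_path <- file.path(directory, basename(txt_path))
  writeLines(lines, out_path, useBytes = TRUE)
  out_path
}
```

With ldccr installed, `ldccr::read_aozora()` called with its default arguments would perform the equivalent steps and return the path to the saved file.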
Download and read the ja.text8 corpus as a tibble.
read_ja_text8( url = "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip", size = NULL )
url |
String. |
size |
Integer. If supplied, samples rows by this argument. |
By default, this function reads the ja.text8 corpus as a tibble by splitting it into sentences. The whole ja.text8 corpus consists of over 582,000 sentences, 16,900,026 tokens, and a vocabulary of 290,811 words.
A tibble.
Download and read the Japanese Realistic Textual Entailment Corpus.
The result of this function is memoised with memoise::memoise internally.
read_jrte( url = "https://github.com/megagonlabs/jrte-corpus/archive/refs/heads/master.zip", exdir = tempdir(), keep = jrte_rte_files(), keep_rhr = FALSE, keep_pn = FALSE )
url |
String.
If left to |
exdir |
String. Path to temporarily unzip text files. |
keep |
List. File names to parse and keep in returned value. |
keep_rhr |
Logical. If supplied |
keep_pn |
Logical. If supplied |
A list of tibbles.
Download and read the Livedoor News Corpus.
The result of this function is memoised with memoise::memoise internally.
read_ldnws( url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz", exdir = tempdir(), keep = ldnws_categories(), collapse = "\n\n", include_title = TRUE )
url |
String.
If left to |
exdir |
String. Path to temporarily untar text files. |
keep |
Character vector. Categories to parse and keep in data.frame. |
collapse |
String with which |
include_title |
Logical. Whether to include title in text body field.
Defaults to |
This function downloads the Livedoor News Corpus and parses it to a tibble. For details about the Livedoor News Corpus, please see this page.
A tibble.
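A typical call might restrict parsing to a few categories. The wrapper below is a minimal sketch based only on the usage shown above: `load_news_subset` is a hypothetical name, the two categories are arbitrary picks from the `ldnws_categories()` defaults, and the columns of the returned tibble are not assumed.

```r
# Hypothetical convenience wrapper; requires ldccr and network access when called.
load_news_subset <- function(categories = c("it-life-hack", "sports-watch")) {
  if (!requireNamespace("ldccr", quietly = TRUE)) {
    stop("The ldccr package is required for this example.")
  }
  ldccr::read_ldnws(
    keep = ldccr::ldnws_categories(keep = categories), # categories to parse
    collapse = "\n\n",     # same separator as the documented default
    include_title = TRUE   # keep the title in the text body field
  )
}

# news <- load_news_subset()  # downloads the corpus archive on first use
```

Because the result is memoised, repeated calls with the same arguments in one session should not trigger a second download.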
List of available 'UniDic'
unidic_availables()
A list.