Package 'ldccr'

Title: Utilities for Various Japanese Corpora
Description: The goal of the ldccr package is to make Japanese language resources easy to use. It provides parsers for several free or openly licensed Japanese corpora and a downloader for zipped text files published on Aozora Bunko.
Authors: Akiru Kato [aut, cre]
Maintainer: Akiru Kato <[email protected]>
License: MIT + file LICENSE
Version: 2024.10.10
Built: 2025-01-08 05:28:41 UTC
Source: https://github.com/paithiov909/ldccr

Help Index


Metadata of text files published on Aozora Bunko

Description

Metadata of text files published on Aozora Bunko

Usage

AozoraBunkoSnapshot

Format

An object of class tbl_df (inherits from tbl, data.frame) with 19233 rows and 55 columns.

Source

http://www.aozora.gr.jp/index_pages/list_person_all_extended_utf8.zip

See Also

The structure of the data is described here.


Remove emojis

Description

Remove emojis

Usage

clean_emoji(text, replacement = "")

Arguments

text

A character vector.

replacement

String.

Value

A character vector.


Remove URLs

Description

Remove URLs

Usage

clean_url(text, replacement = "")

Arguments

text

A character vector.

replacement

String.

Value

A character vector.
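
As an illustration of how the two cleaning helpers, clean_emoji and clean_url, might be chained, here is a minimal sketch. The sample strings are made up for this example, and the exact patterns the functions match are not documented here, so no output is shown.

```r
library(ldccr)

# Hypothetical sample input (not taken from any corpus)
posts <- c(
  "新刊が出ました 📚 https://example.com/books",
  "今日はいい天気 ☀"
)

# Remove URLs first, then emojis, using the default empty replacement
no_urls <- clean_url(posts)
cleaned <- clean_emoji(no_urls)

# A visible placeholder can be substituted instead of an empty string
masked <- clean_url(posts, replacement = "<URL>")
```

Both functions take and return character vectors, so they can be applied in either order or inside a pipeline.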


Download and unzip 'UniDic'

Description

Download the specified version of 'UniDic' into dirname. This function is a partial port of polm/unidic-py. Note that the unzipped dictionary takes up about 770MB on disk.

Usage

download_unidic(version = "latest", dirname = "unidic")

Arguments

version

String; version of 'UniDic'.

dirname

String; directory into which the dictionary is unzipped.

Value

The full path to dirname, returned invisibly.
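
A possible workflow, sketched under the assumption that a temporary directory is an acceptable target: list the available versions with unidic_availables(), then download the default one. The structure of the list returned by unidic_availables() is not described in this manual, so the sketch only inspects it with str().

```r
library(ldccr)

# List the versions the package knows about
versions <- unidic_availables()
str(versions)

# Download the default ("latest") version into a temporary directory.
# The unzipped dictionary is large (about 770MB according to the
# Description above), so a persistent directory may be preferable.
dict_dir <- download_unidic(
  version = "latest",
  dirname = file.path(tempdir(), "unidic")
)

# The path is returned invisibly, so print it explicitly
print(dict_dir)
```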


Check whether dates fall within a Japanese era

Description

Check whether dates fall within a Japanese era

Usage

is_within_era(date, era)

Arguments

date

Dates.

era

String.

Value

Logicals.


Data for Textual Entailment

Description

Data for Textual Entailment

Usage

jrte_rte_files(
  keep = c("rte.nlp2020_base", "rte.nlp2020_append", "rte.lrec2020_surf",
    "rte.lrec2020_sem_short", "rte.lrec2020_sem_long", "rte.lrec2020_me")
)

Arguments

keep

Character vector. File names to parse.

Value

TSV file names.


List of categories of the Livedoor News Corpus

Description

List of categories of the Livedoor News Corpus

Usage

ldnws_categories(
  keep = c("dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
    "movie-enter", "peachy", "smax", "sports-watch", "topic-news")
)

Arguments

keep

Character vector. Categories to keep.

Value

A character vector.


Whole text of ‘Wagahai Wa Neko Dearu’ written by Natsume Souseki from Aozora Bunko

Description

Whole text of ‘Wagahai Wa Neko Dearu’ written by Natsume Souseki from Aozora Bunko

Usage

NekoText

Format

An object of class character of length 2258.

Source

https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip
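
Because NekoText ships with the package, it can be inspected without any download. The following is a small sketch using only base R functions; the length check relies on the figure given in the Format section and may differ in other package versions.

```r
library(ldccr)

# Documented as a character vector of length 2258
length(NekoText)

# Peek at the first few elements
head(NekoText, 3)

# Distribution of the number of characters per element
summary(nchar(NekoText))
```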


Parse reasoning column of 'rte.*.tsv'

Description

Parse reasoning column of 'rte.*.tsv'

Usage

parse_jrte_reasoning(tbl)

Arguments

tbl

A tibble returned from read_jrte whose name matches rte.*.tsv.

Value

A tibble.


Parse dates to Japanese dates

Description

Parse dates to Japanese dates

Usage

parse_to_jdate(date, format)

Arguments

date

Dates.

format

String.

Value

A character vector.


Download text file from Aozora Bunko

Description

Download a file from specified URL, unzip and convert it to UTF-8.

Usage

read_aozora(
  url = "https://www.aozora.gr.jp/cards/000081/files/472_ruby_654.zip",
  txtname = NULL,
  directory = file.path(getwd(), "cache")
)

Arguments

url

URL of text download link.

txtname

New file name under which the text is saved. If left as NULL, the name of the source file is kept.

directory

Path where the new file is saved.

Value

The path to the downloaded file.
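
A minimal sketch of a typical call, assuming network access and that a temporary directory is a suitable place for the file. It uses the default URL from the Usage section and reads the result back with base R.

```r
library(ldccr)

# Save into a temporary directory rather than the default "cache" folder.
# Creating it up front is a precaution; whether read_aozora creates
# missing directories itself is not stated in this manual.
out_dir <- file.path(tempdir(), "aozora")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Download, unzip, and convert the default text to UTF-8
path <- read_aozora(directory = out_dir)

# Read the converted file back in
lines <- readLines(path, encoding = "UTF-8")
head(lines)
```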


Read the ja.text8 corpus

Description

Download and read the ja.text8 corpus as a tibble.

Usage

read_ja_text8(
  url =
    "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip",
  size = NULL
)

Arguments

url

String.

size

Integer. If supplied, the number of rows to sample.

Details

By default, this function reads the ja.text8 corpus as a tibble by splitting it into sentences. The whole ja.text8 corpus consists of over 582,000 sentences, 16,900,026 tokens, and a vocabulary of 290,811 words.

Value

A tibble.
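
Since the full corpus is large, sampling with size may be convenient while exploring. This is a sketch only: the value 1000 is an arbitrary choice, and the columns of the returned tibble are not documented here, so the result is simply printed.

```r
library(ldccr)

# Sample rows instead of keeping every sentence.
# The archive itself is still downloaded in full from the default URL.
sentences <- read_ja_text8(size = 1000L)

# Inspect the resulting tibble
print(sentences)
```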


Read the JRTE Corpus

Description

Download and read the Japanese Realistic Textual Entailment Corpus. The result of this function is memoised with memoise::memoise internally.

Usage

read_jrte(
  url = "https://github.com/megagonlabs/jrte-corpus/archive/refs/heads/master.zip",
  exdir = tempdir(),
  keep = jrte_rte_files(),
  keep_rhr = FALSE,
  keep_pn = FALSE
)

Arguments

url

String. If set to NULL, the function skips downloading the file.

exdir

String. Path to which text files are temporarily unzipped.

keep

List. File names to parse and keep in the returned value.

keep_rhr

Logical. If TRUE, keeps rhr.tsv.

keep_pn

Logical. If TRUE, keeps pn.tsv.

Value

A list of tibbles.
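
The functions read_jrte, jrte_rte_files, and parse_jrte_reasoning are meant to work together. The sketch below shows one way they might be combined. It assumes network access, and it assumes that with keep_rhr and keep_pn left at FALSE the first list element is an rte.*.tsv tibble; the element names are not documented here, so positional indexing is used.

```r
library(ldccr)

# Restrict parsing to a single RTE file; the name is one of the
# defaults of jrte_rte_files()
files <- jrte_rte_files(keep = "rte.nlp2020_base")
corpus <- read_jrte(keep = files)

# Inspect which tibbles were returned
names(corpus)

# Assumption: the first element is the parsed rte.*.tsv tibble
rte <- corpus[[1]]

# Parse its reasoning column into a tibble
reasoning <- parse_jrte_reasoning(rte)
print(reasoning)
```

Because the result of read_jrte is memoised, repeating the same call within a session should not trigger another download.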


Read the Livedoor News Corpus

Description

Download and read the Livedoor News Corpus. The result of this function is memoised with memoise::memoise internally.

Usage

read_ldnws(
  url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz",
  exdir = tempdir(),
  keep = ldnws_categories(),
  collapse = "\n\n",
  include_title = TRUE
)

Arguments

url

String. If set to NULL, the function skips downloading the file.

exdir

String. Path to which text files are temporarily untarred.

keep

Character vector. Categories to parse and keep in the returned tibble.

collapse

String with which base::paste collapses lines.

include_title

Logical. Whether to include the title in the text body field. Defaults to TRUE.

Details

This function downloads the Livedoor News Corpus and parses it to a tibble. For details about the Livedoor News Corpus, please see this page.

Value

A tibble.
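
A short sketch of reading only part of the corpus, assuming network access. The two category names are taken from the defaults of ldnws_categories; the column layout of the returned tibble is not documented here, so the result is only printed.

```r
library(ldccr)

# Keep two of the nine categories listed in ldnws_categories()
cats <- ldnws_categories(keep = c("it-life-hack", "smax"))

# Download and parse, leaving the title out of the text body
news <- read_ldnws(keep = cats, include_title = FALSE)

# Inspect the resulting tibble
print(news)
```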


List of available 'UniDic'

Description

List of available 'UniDic'

Usage

unidic_availables()

Value

A list.