| Title: | Utilities for Various Japanese Corpora |
|---|---|
| Description: | The goal of the ldccr package is to make it easy to use Japanese language resources. The package provides parsers for several Japanese corpora that are distributed under free or open licenses, as well as a downloader for zipped text files published on Aozora Bunko. |
| Authors: | Akiru Kato [aut, cre], Sqids maintainers [cph] (2023-present) |
| Maintainer: | Akiru Kato <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2025.02.02 |
| Built: | 2025-02-02 14:18:35 UTC |
| Source: | https://github.com/paithiov909/ldccr |
Metadata of text files published on Aozora Bunko

AozoraBunkoSnapshot

An object of class tbl_df (inherits from tbl, data.frame) with 19296 rows and 55 columns.

Source: http://www.aozora.gr.jp/index_pages/list_person_all_extended_utf8.zip

The structure of the data is described on the Aozora Bunko site.
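
A minimal sketch of inspecting the snapshot, assuming ldccr is installed (dplyr::glimpse() is optional sugar):

```r
library(ldccr)

# Load the bundled metadata table and peek at its shape.
data("AozoraBunkoSnapshot", package = "ldccr")
dim(AozoraBunkoSnapshot)  # 19296 rows, 55 columns
dplyr::glimpse(AozoraBunkoSnapshot)
```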
Data for Textual Entailment

```r
jrte_rte_files(
  keep = c("rte.nlp2020_base", "rte.nlp2020_append", "rte.lrec2020_surf",
           "rte.lrec2020_sem_short", "rte.lrec2020_sem_long", "rte.lrec2020_me")
)
```

keep | Character vector. File names to parse.

TSV file names.

Other jrte-reader: parse_jrte_judges(), parse_jrte_reasoning(), read_jrte()
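
Since jrte_rte_files() only filters file names, it runs offline. A short sketch, assuming keep is matched against the default names:

```r
library(ldccr)

jrte_rte_files()                           # all known RTE file names
jrte_rte_files(keep = "rte.nlp2020_base")  # restrict to the base split
```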
List of categories of the Livedoor News Corpus

```r
ldnws_categories(
  keep = c("dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
           "movie-enter", "peachy", "smax", "sports-watch", "topic-news")
)
```

keep | Category names to parse.

A character vector.

Other ldnws-reader: read_ldnws()
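
Likewise, ldnws_categories() just filters category names and runs offline:

```r
library(ldccr)

# Keep only the two categories we care about.
ldnws_categories(keep = c("sports-watch", "topic-news"))
```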
Whole text of 'Wagahai Wa Neko Dearu' written by Natsume Souseki, from Aozora Bunko

NekoText

An object of class character of length 2258.

Source: https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip
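
A minimal sketch of inspecting the bundled text:

```r
library(ldccr)

length(NekoText)   # 2258 elements
head(NekoText, 3)  # opening lines of the novel
```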
Parse the reasoning column of 'rte.*.tsv'

```r
parse_jrte_reasoning(tbl)
```

tbl | A tibble returned from read_jrte().

A tibble.

Other jrte-reader: jrte_rte_files(), parse_jrte_judges(), read_jrte()
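
A hedged sketch of chaining the readers; it assumes read_jrte() names the returned list after the TSV files in keep, and it downloads the corpus on first use:

```r
library(ldccr)

# Download (memoised) and keep a single RTE file.
jrte <- read_jrte(keep = jrte_rte_files(keep = "rte.lrec2020_sem_long"))

# Expand the reasoning column of that tibble.
parse_jrte_reasoning(jrte[["rte.lrec2020_sem_long"]])
```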
Downloads a file from the specified URL, unzips it, and converts the text to UTF-8.

If you want to read a large number of the texts published on Aozora Bunko, you can download them all at once via globis-university/aozorabunko-clean.

```r
read_aozora(
  url = "https://www.aozora.gr.jp/cards/000081/files/472_ruby_654.zip",
  txtname = NULL,
  directory = file.path(getwd(), "cache")
)
```

url | URL of the text download link.
txtname | New file name as which the text is saved. If left as NULL, the name of the source file is kept.
directory | Path where the new file is saved.

The path to the downloaded file.
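
Because read_aozora() returns the path to the converted file, readLines() can pick it up directly. A sketch using the default URL from the usage above (this downloads a file on first run):

```r
library(ldccr)

path <- read_aozora(directory = tempdir())
head(readLines(path))  # the saved file is UTF-8
```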
Downloads and reads the ja.text8 corpus as a tibble.

```r
read_ja_text8(
  url = "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip",
  size = NULL
)
```

url | String.
size | Integer. If supplied, the number of rows to sample.

By default, this function reads the ja.text8 corpus as a tibble, splitting it into sentences. The whole ja.text8 corpus consists of over 582,000 sentences, 16,900,026 tokens, and a vocabulary of 290,811 items.

A tibble.
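
A minimal sketch; passing size samples rows from the parsed corpus, which keeps the result small (the zip is still fetched from the URL above on first run):

```r
library(ldccr)

text8 <- read_ja_text8(size = 100L)  # sample 100 sentences
text8
```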
Download and read the Japanese Realistic Textual Entailment Corpus. The result of this function is memoised with memoise::memoise() internally.

```r
read_jrte(
  url = "https://github.com/megagonlabs/jrte-corpus/archive/refs/heads/master.zip",
  exdir = tempdir(),
  keep = jrte_rte_files(),
  keep_rhr = FALSE,
  keep_pn = FALSE
)
```

url | String. If left as the default, the corpus is downloaded from the GitHub repository above.
exdir | String. Path in which to temporarily unzip text files.
keep | List. File names to parse and keep in the returned value.
keep_rhr | Logical. If TRUE, keeps 'rhr.tsv' as an element of the returned value.
keep_pn | Logical. If TRUE, keeps 'pn.tsv' as an element of the returned value.

A list of tibbles.

Other jrte-reader: jrte_rte_files(), parse_jrte_judges(), parse_jrte_reasoning()
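
A hedged sketch, assuming the optional 'pn.tsv' shows up as an extra named element of the returned list when keep_pn = TRUE (the download is memoised across calls):

```r
library(ldccr)

jrte <- read_jrte(
  keep = jrte_rte_files(keep = "rte.nlp2020_base"),
  keep_pn = TRUE
)
names(jrte)  # names of the tibbles that were kept
```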
Downloads and reads the Livedoor News Corpus. The result of this function is memoised with memoise::memoise() internally.

```r
read_ldnws(
  url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz",
  exdir = tempdir(),
  keep = ldnws_categories(),
  collapse = "\n\n",
  include_title = TRUE
)
```

url | String. If left as the default, the corpus is downloaded from rondhuit.com.
exdir | String. Directory in which to temporarily untar text files.
keep | Categories to parse and keep in the tibble.
collapse | String with which the lines of each article body are concatenated.
include_title | Logical. Whether to include the title in the text body field. Defaults to TRUE.

This function downloads the Livedoor News Corpus and parses it into a tibble. For details about the Livedoor News Corpus, see the distributor's page.

A tibble.

Other ldnws-reader: ldnws_categories()
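
A sketch restricted to a single category so the parse stays small (the tarball is downloaded on first run; the result is memoised):

```r
library(ldccr)

ldnws <- read_ldnws(keep = ldnws_categories(keep = "topic-news"))
dplyr::glimpse(ldnws)
```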
sqids() is an alternative to dplyr::row_number() that generates random-looking IDs from integer ranks using Sqids (formerly Hashids). IDs generated with sqids() can easily be decoded back into the original ranks using unsqids().

```r
sqids(
  x,
  .salt = sample.int(1000, 3),
  .ties = c("sequential", "min", "max", "dense")
)

unsqids(x)
```

x | For sqids(), a vector whose ranks are encoded. For unsqids(), a character vector of IDs to decode.
.salt | Integers to use with each value of x.
.ties | Method to rank duplicate values. One of "sequential", "min", "max", or "dense".

For sqids(), a character vector of IDs. For unsqids(), integers.

```r
ids <- sqids(c(5, 1, 3, 2, 2, NA))
ids
unsqids(ids)

df <- data.frame(
  grp = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

# You can use `sqids()` without referencing `x` in dplyr verbs.
dplyr::mutate(df, sqids = sqids(), row_id = unsqids(sqids))

# Use `.ties` to control how to rank duplicate values.
dplyr::mutate(df, sqids = sqids(grp, .ties = "min"), grp_id = unsqids(sqids))

# When you need to generate the same IDs for each group, fix the `.salt`:
dplyr::mutate(df, sqids = sqids(.salt = 1234L), .by = grp)
```
Downloads 'UniDic' of the specified version into dirname. This function is a partial port of polm/unidic-py. Note that the unzipped dictionary takes up about 770MB on disk after downloading.

```r
unidic_availables()

download_unidic(version = "latest", dirname = "unidic")
```

version | String; version of 'UniDic'.
dirname | String; directory in which to unzip the dictionary.

The full path to dirname is returned invisibly.
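
A sketch of checking what is available before committing to the large download; both calls are taken from the usage above:

```r
library(ldccr)

unidic_availables()  # list the UniDic versions the package knows about

# The download itself is commented out here because the dictionary
# takes up roughly 770MB on disk once unzipped.
# download_unidic(version = "latest", dirname = file.path(tempdir(), "unidic"))
```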
These functions are experimental and may be withdrawn in the future.

```r
clean_url(text, replacement = "")

clean_emoji(text, replacement = "")

is_within_era(date, era)

parse_to_jdate(date, format)
```

text | A character vector.
replacement | String.
date | Dates.
era | String.
format | String.
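
A hedged sketch of the two text cleaners; the input strings are made up, and is_within_era() and parse_to_jdate() are left out because the expected era and format values are not documented above:

```r
library(ldccr)

# Strip URLs from text (the default replacement is an empty string).
clean_url("See https://www.aozora.gr.jp/ for the full list.")

# Replace emoji with a placeholder.
clean_emoji("\u732b\U0001F431 desu.", replacement = "[emoji]")
```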