Package 'ldccr'

Title: Utilities for Various Japanese Corpora
Description: The goal of the ldccr package is to make Japanese language resources easy to use. It provides parsers for several free or openly licensed Japanese corpora and a downloader for zipped text files published on Aozora Bunko.
Authors: Akiru Kato [aut, cre] (2023-present Sqids maintainers [cph])
Maintainer: Akiru Kato <[email protected]>
License: MIT + file LICENSE
Version: 2025.02.02
Built: 2025-02-02 14:18:35 UTC
Source: https://github.com/paithiov909/ldccr


Metadata of text files published on Aozora Bunko

Description

Metadata of text files published on Aozora Bunko

Usage

AozoraBunkoSnapshot

Format

An object of class tbl_df (inherits from tbl, data.frame) with 19296 rows and 55 columns.

Source

http://www.aozora.gr.jp/index_pages/list_person_all_extended_utf8.zip

See Also

The structure of the data is described here.


Data for Textual Entailment

Description

Data for Textual Entailment

Usage

jrte_rte_files(
  keep = c("rte.nlp2020_base", "rte.nlp2020_append", "rte.lrec2020_surf",
    "rte.lrec2020_sem_short", "rte.lrec2020_sem_long", "rte.lrec2020_me")
)

Arguments

keep

Character vector. File names to parse.

Value

TSV file names.

See Also

Other jrte-reader: parse_jrte_judges(), parse_jrte_reasoning(), read_jrte()


List of categories of the Livedoor News Corpus

Description

List of categories of the Livedoor News Corpus

Usage

ldnws_categories(
  keep = c("dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
    "movie-enter", "peachy", "smax", "sports-watch", "topic-news")
)

Arguments

keep

Category names to parse.

Value

A character vector.

See Also

Other ldnws-reader: read_ldnws()


Whole text of ‘Wagahai Wa Neko Dearu’ written by Natsume Souseki from Aozora Bunko

Description

Whole text of ‘Wagahai Wa Neko Dearu’ written by Natsume Souseki from Aozora Bunko

Usage

NekoText

Format

An object of class character of length 2258.

Source

https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip


Parse reasoning column of 'rte.*.tsv'

Description

Parse reasoning column of 'rte.*.tsv'

Usage

parse_jrte_reasoning(tbl)

Arguments

tbl

A tibble from the list returned by read_jrte() whose name matches rte.*.tsv.

Value

A tibble.

See Also

Other jrte-reader: jrte_rte_files(), parse_jrte_judges(), read_jrte()


Download text file from Aozora Bunko

Description

[Superseded] Downloads a file from the specified URL, unzips it, and converts it to UTF-8.

If you want to read a large portion of the texts published on Aozora Bunko, you can download them all at once via globis-university/aozorabunko-clean.

Usage

read_aozora(
  url = "https://www.aozora.gr.jp/cards/000081/files/472_ruby_654.zip",
  txtname = NULL,
  directory = file.path(getwd(), "cache")
)

Arguments

url

URL of the text download link.

txtname

File name under which the text is saved. If NULL, the name of the source file is kept.

directory

Directory where the new file is saved.

Value

The path to the downloaded file.
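The conversion step can be pictured with base R alone. The sketch below is not the package's implementation: it assumes the source text is encoded as CP932 (Windows Shift_JIS), which this manual does not state, and it uses a short sample string instead of a downloaded file.

```r
# "Aozora Bunko" written in kanji, given as Unicode escapes so that the
# script itself stays ASCII-only.
utf8_text <- "\u9752\u7a7a\u6587\u5eab"

# Simulate the ASSUMED source encoding (CP932); the manual does not name it.
sjis_text <- iconv(utf8_text, from = "UTF-8", to = "CP932")

# The conversion step: re-encode the legacy bytes as UTF-8.
restored <- iconv(sjis_text, from = "CP932", to = "UTF-8")

# Each of these kanji takes 2 bytes in CP932 and 3 bytes in UTF-8.
sjis_bytes <- nchar(sjis_text, type = "bytes")
utf8_bytes <- nchar(restored, type = "bytes")
```

Whether "CP932" is an accepted encoding name depends on the iconv implementation of the platform; iconvlist() shows the names available locally.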


Read the ja.text8 corpus

Description

Downloads and reads the ja.text8 corpus as a tibble.

Usage

read_ja_text8(
  url =
    "https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip",
  size = NULL
)

Arguments

url

String. URL of the ja.text8 zip archive.

size

Integer. If supplied, the number of rows to sample.

Details

By default, this function reads the ja.text8 corpus as a tibble by splitting it into sentences. The whole ja.text8 corpus consists of over 582,000 sentences, 16,900,026 tokens, and 290,811 vocabulary items.

Value

A tibble.
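As a rough picture of what row sampling by size means, the following base R sketch draws rows from a small stand-in data frame. The data, the column names doc_id and text, and the use of sample.int() are assumptions made for illustration; the manual does not say how the function samples internally.

```r
# Hypothetical stand-in for the sentence table; the real corpus is not loaded.
sentences <- data.frame(
  doc_id = seq_len(10),
  text = paste("sentence", seq_len(10)),
  stringsAsFactors = FALSE
)

# Number of rows to keep, playing the role of the `size` argument.
size <- 3L

# Draw `size` distinct row indices and keep only those rows.
set.seed(123)
sampled <- sentences[sample.int(nrow(sentences), size), , drop = FALSE]

nrow(sampled)
```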


Read the JRTE Corpus

Description

Downloads and reads the Japanese Realistic Textual Entailment Corpus. The result of this function is memoised with memoise::memoise() internally.

Usage

read_jrte(
  url = "https://github.com/megagonlabs/jrte-corpus/archive/refs/heads/master.zip",
  exdir = tempdir(),
  keep = jrte_rte_files(),
  keep_rhr = FALSE,
  keep_pn = FALSE
)

Arguments

url

String. If NULL, the function skips downloading the file.

exdir

String. Directory into which the text files are temporarily unzipped.

keep

List. File names to parse and keep in returned value.

keep_rhr

Logical. If TRUE, keeps rhr.tsv.

keep_pn

Logical. If TRUE, keeps pn.tsv.

Value

A list of tibbles.

See Also

Other jrte-reader: jrte_rte_files(), parse_jrte_judges(), parse_jrte_reasoning()


Read the Livedoor News Corpus

Description

Downloads and reads the Livedoor News Corpus. The result of this function is memoised with memoise::memoise() internally.

Usage

read_ldnws(
  url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz",
  exdir = tempdir(),
  keep = ldnws_categories(),
  collapse = "\n\n",
  include_title = TRUE
)

Arguments

url

String. If NULL, the function skips downloading the file.

exdir

String. Directory into which the text files are temporarily untarred.

keep

Categories to parse and keep in the tibble.

collapse

String with which base::paste() collapses lines.

include_title

Logical. Whether to include the title in the text body field. Defaults to TRUE.

Details

This function downloads the Livedoor News Corpus and parses it into a tibble. For details about the Livedoor News Corpus, please see this page.

Value

A tibble.
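The effect of the collapse argument can be previewed with base::paste() on its own. The lines below are invented for illustration and do not come from the corpus; this is a sketch of the joining step only, not of how read_ldnws() reads files.

```r
# Hypothetical lines of a single article (not taken from the corpus).
lines <- c("Title of the article", "First paragraph.", "Second paragraph.")

# The default separator of read_ldnws(): a blank line between lines.
body_default <- paste(lines, collapse = "\n\n")

# A single space instead produces one long line.
body_flat <- paste(lines, collapse = " ")

cat(body_default, "\n")
```

Keeping the default "\n\n" preserves line boundaries, so the text body can later be split back into its lines with strsplit(body_default, "\n\n", fixed = TRUE).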

See Also

Other ldnws-reader: ldnws_categories()


Generate random-looking IDs from integer ranks

Description

sqids() is an alternative to dplyr::row_number() that generates random-looking IDs from integer ranks using Sqids (formerly Hashids).

IDs generated with sqids() can be easily decoded back into the original ranks using unsqids().

Usage

sqids(
  x,
  .salt = sample.int(1000, 3),
  .ties = c("sequential", "min", "max", "dense")
)

unsqids(x)

Arguments

x

For sqids(), a vector to rank. You can leave this argument missing to refer to the "current" row number in 'dplyr' verbs.

For unsqids(), a character vector of IDs.

.salt

Integers to use with each value of x to generate IDs.

.ties

Method to rank duplicate values. One of "sequential", "min", "max", or "dense". See ties argument of vctrs::vec_rank() for more details.

Value

For sqids(), a character vector of IDs.

For unsqids(), integers.

See Also

sqids/sqids-cpp

Examples

ids <- sqids(c(5, 1, 3, 2, 2, NA))
ids
unsqids(ids)

df <- data.frame(
  grp = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)
# You can use `sqids()` without referencing `x` in dplyr verbs.
dplyr::mutate(df, sqids = sqids(), row_id = unsqids(sqids))
# Use `.ties` to control how to rank duplicate values.
dplyr::mutate(df, sqids = sqids(grp, .ties = "min"), grp_id = unsqids(sqids))
# When you need to generate the same IDs for each group, fix the `.salt`:
dplyr::mutate(df, sqids = sqids(.salt = 1234L), .by = grp)

Download and unzip 'UniDic'

Description

Downloads the specified version of 'UniDic' into dirname. This function is a partial port of polm/unidic-py. Note that the unzipped dictionary takes up 770MB on disk.

Usage

unidic_availables()

download_unidic(version = "latest", dirname = "unidic")

Arguments

version

String; version of 'UniDic'.

dirname

String; directory into which the dictionary is unzipped.

Value

The full path to dirname, returned invisibly.


Utility functions

Description

[Experimental] These functions are experimental and may be withdrawn in the future.

Usage

clean_url(text, replacement = "")

clean_emoji(text, replacement = "")

is_within_era(date, era)

parse_to_jdate(date, format)

Arguments

text

A character vector.

replacement

String.

date

Dates.

era

String.

format

String.