Class: Ucode::Parsers::Unihan

Inherits: Base

Object → Base → Ucode::Parsers::Unihan

Defined in: lib/ucode/parsers/unihan.rb

Overview

Parses all eight Unihan files (Unihan_IRGSources.txt, Unihan_NumericValues.txt, Unihan_RadicalStrokeCounts.txt, Unihan_Readings.txt, Unihan_DictionaryIndices.txt, Unihan_DictionaryLikeData.txt, Unihan_Variants.txt, Unihan_OtherMappings.txt).

File format is uniform across all eight (Unihan documentation):

U+XXXX<TAB>kField<TAB>value

The value may be a space-separated list (kRSUnicode; kDefinition, whose prose splits into words; kCangjieInput when there are multiple codes). Splitting the value on whitespace with .split therefore produces the values array the same way for every field. The Coordinator groups records by code point (cp) and writes them into CodePoint.unihan.fields[field].
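As a rough illustration of the line format and the grouping step described above, the sketch below parses a few sample lines and groups them by code point. UnihanLine, parse_unihan_line, and group_by_cp are hypothetical names invented for this example, not part of Ucode, and the sample lines are demo data; the real Record and Coordinator may differ.

```ruby
# Hypothetical stand-ins for illustration; not the Ucode Record or Coordinator.
UnihanLine = Struct.new(:cp, :field, :values)

# Parses "U+XXXX<TAB>kField<TAB>value"; returns nil for comments and blanks.
def parse_unihan_line(line)
  text = line.chomp
  return nil if text.empty? || text.start_with?("#")

  cp_text, field, value = text.split("\t", 3)
  return nil if field.nil? || value.nil?

  UnihanLine.new(cp_text.delete_prefix("U+").to_i(16), field, value.split)
end

# Groups parsed lines as { cp => { field => values } }.
def group_by_cp(lines)
  lines.each_with_object({}) do |line, grouped|
    parsed = parse_unihan_line(line)
    next if parsed.nil?

    (grouped[parsed.cp] ||= {})[parsed.field] = parsed.values
  end
end

# Assumption: sample lines made up for the demo.
SAMPLE_LINES = [
  "# sample data for the demo",
  "U+4E00\tkDefinition\tone; a, an; alone",
  "U+4E00\tkTotalStrokes\t1",
].freeze

GROUPED = group_by_cp(SAMPLE_LINES)
# GROUPED[0x4E00]["kDefinition"] is ["one;", "a,", "an;", "alone"]
```

Note that whitespace splitting keeps punctuation attached to each word, which is why prose fields such as kDefinition come back as word-level tokens rather than as semicolon-separated senses.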

One parser, not eight: the format is uniform. The filename is not needed to parse a line, because every line is self-describing via its field name; the filename is used only to tag each Record with a category (see FILE_TO_CATEGORY). Adding a new Unihan file is a one-line change to FILES, with no parser modification (open/closed principle).

Defined Under Namespace

Classes: Record

Constant Summary

FILES =

%w[
  Unihan_DictionaryIndices.txt
  Unihan_DictionaryLikeData.txt
  Unihan_IRGSources.txt
  Unihan_NumericValues.txt
  Unihan_RadicalStrokeCounts.txt
  Unihan_Readings.txt
  Unihan_Variants.txt
  Unihan_OtherMappings.txt
].freeze

FILE_TO_CATEGORY =

Filename → category symbol. The parser tags every Record with the category derived from its source file, so consumers (Coordinator → UnihanEntry) don't need to know the mapping. Unicode does not reorganize files across versions, so this mapping is stable without per-field hardcoding.

{
  "Unihan_DictionaryIndices.txt" => :dictionary_indices,
  "Unihan_DictionaryLikeData.txt" => :dictionary_like_data,
  "Unihan_IRGSources.txt" => :irg_sources,
  "Unihan_NumericValues.txt" => :numeric_values,
  "Unihan_RadicalStrokeCounts.txt" => :radical_stroke_counts,
  "Unihan_Readings.txt" => :readings,
  "Unihan_Variants.txt" => :variants,
  "Unihan_OtherMappings.txt" => :other_mappings,
}.freeze
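A minimal sketch of how a category might be resolved from a path, assuming a lookup by basename that returns nil for unknown files. CATEGORY_BY_FILE is a two-entry excerpt of the table above, and category_for is a hypothetical helper, not a Ucode method.

```ruby
# Assumption: a two-entry excerpt of the mapping shown above.
CATEGORY_BY_FILE = {
  "Unihan_Readings.txt" => :readings,
  "Unihan_Variants.txt" => :variants,
}.freeze

# Hypothetical helper: resolves a category from a path, nil when unknown.
def category_for(path)
  CATEGORY_BY_FILE[File.basename(path.to_s)]
end

category_for("/data/ucd/Unihan_Readings.txt") # => :readings
category_for("Unihan_Custom.txt")             # => nil
```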

Class Method Summary

.each_in_dir(dir) ⇒ Object
Iterates every known Unihan file in dir, yielding one Record per data line across all files.
.each_record(path, filename: nil) ⇒ Object
Yields one Record per non-comment line in a single Unihan file.

Methods inherited from Base

each_line, parse_codepoint_or_range, parse_field, parse_hex_cp

Class Method Details

.each_in_dir(dir) ⇒ Object

Iterates every known Unihan file in dir, yielding one Record per data line across all files. Missing files are silently skipped (incremental runs, partial downloads). Each Record carries its category so callers don't need to re-derive it.

# File 'lib/ucode/parsers/unihan.rb', line 92

def each_in_dir(dir)
  return enum_for(:each_in_dir, dir) unless block_given?

  dir_path = Pathname.new(dir)
  FILES.each do |filename|
    path = dir_path.join(filename)
    next unless path.exist?

    each_record(path, filename: filename) { |record| yield record }
  end

  nil
end
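The skip-if-missing behaviour can be tried out with a simplified stand-in. The sketch below is not the Ucode implementation: UnihanDirDemo, its shortened FILES list, and each_line_in_dir are assumptions made for the demo, and it yields [filename, line] pairs rather than Records.

```ruby
require "tmpdir"

# Simplified stand-in for the directory walk; not the Ucode implementation.
module UnihanDirDemo
  # Assumption: a shortened file list for the demo.
  FILES = %w[Unihan_Readings.txt Unihan_Variants.txt].freeze

  # Yields [filename, line] for each data line in each file that exists.
  def self.each_line_in_dir(dir)
    return enum_for(:each_line_in_dir, dir) unless block_given?

    FILES.each do |filename|
      path = File.join(dir, filename)
      next unless File.exist?(path) # missing files are skipped

      File.foreach(path, chomp: true) do |line|
        next if line.empty? || line.start_with?("#")

        yield [filename, line]
      end
    end

    nil
  end
end

# Only one of the two files is present; the walk does not raise.
DEMO_PAIRS = Dir.mktmpdir do |dir|
  File.write(File.join(dir, "Unihan_Readings.txt"), "# comment\nU+4E00\tkDefinition\tone\n")
  UnihanDirDemo.each_line_in_dir(dir).to_a
end
```

Because absent files are skipped without any signal, a caller that needs to know whether a directory was complete would have to check for the files separately.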

.each_record(path, filename: nil) ⇒ Object

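Yields one Record per non-comment line in a single Unihan file. Pass the source filename so the Record carries its category; when filename is omitted, the basename of path is used instead, and the category is nil if the name is not in FILE_TO_CATEGORY. Returns an Enumerator when no block is given.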

# File 'lib/ucode/parsers/unihan.rb', line 75

def each_record(path, filename: nil)
  return enum_for(:each_record, path, filename: filename) unless block_given?

  path_str = path.to_s
  category = FILE_TO_CATEGORY.fetch(filename || File.basename(path_str), nil)

  each_line_with_lineno(path_str) do |line, lineno|
    yield tagged_record(line, category, path_str, lineno)
  end

  nil
end
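The block-less form relies on the common enum_for idiom. The sketch below shows that idiom on a simplified stand-in that reads from an in-memory array; RecordEnumDemo and the sample lines are assumptions for the demo, and the hashes it yields are not Ucode Records.

```ruby
# Simplified stand-in that reads from an in-memory array; not the Ucode parser.
module RecordEnumDemo
  def self.each_record(lines)
    return enum_for(:each_record, lines) unless block_given?

    lines.each do |line|
      next if line.start_with?("#") || line.strip.empty?

      cp, field, value = line.chomp.split("\t", 3)
      yield({ cp: cp, field: field, values: value.to_s.split })
    end

    nil
  end
end

# Assumption: sample lines made up for the demo.
ENUM_LINES = [
  "# comment",
  "U+4E00\tkDefinition\tone; a, an; alone",
  "U+4E8C\tkDefinition\ttwo; twice",
].freeze

# Without a block, the call returns a plain Enumerator.
RECORD_ENUM = RecordEnumDemo.each_record(ENUM_LINES)

# Calling .lazy lets a chain stop after the first match.
FIRST_FIELD_VALUES = RECORD_ENUM.lazy.map { |r| r[:values] }.first(1)
```

The enumerator returned by enum_for is a plain Enumerator, so chained calls such as map or select build full arrays unless .lazy is inserted first.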