Class: Ucode::Parsers::Unihan

Inherits:
Base
  • Object
Defined in:
lib/ucode/parsers/unihan.rb

Overview

Parses all eight Unihan files (`Unihan_IRGSources.txt`, `Unihan_NumericValues.txt`, `Unihan_RadicalStrokeCounts.txt`, `Unihan_Readings.txt`, `Unihan_DictionaryIndices.txt`, `Unihan_DictionaryLikeData.txt`, `Unihan_Variants.txt`, `Unihan_OtherMappings.txt`).

The file format is uniform across all eight, per the Unihan documentation:

U+XXXX<TAB>kField<TAB>value

The value may be a space-separated list (`kRSUnicode`; `kDefinition`, which is prose; `kCangjieInput`, which may carry multiple codes). Calling `.split` with no arguments splits on whitespace and produces the values array uniformly for every field. The coordinator groups records by `cp` and writes them into `CodePoint.unihan.fields`.
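As a rough illustration of the format and the grouping step described above, the following sketch parses a few sample lines and collects their values per code point. `SketchRecord`, `parse_unihan_line`, and `fields_by_cp` are hypothetical names used only here; the real `Record` class and coordinator may be shaped differently.

```ruby
# Hypothetical stand-in for Ucode::Parsers::Unihan::Record; the real
# class may expose different attribute names.
SketchRecord = Struct.new(:cp, :field, :values)

# Parse one "U+XXXX<TAB>kField<TAB>value" data line (assumed helper).
def parse_unihan_line(line)
  cp_text, field, value = line.split("\t", 3)
  unless cp_text&.start_with?("U+") && field && value
    raise ArgumentError, "malformed Unihan line: #{line.inspect}"
  end

  # String#split with no arguments splits on runs of whitespace.
  SketchRecord.new(cp_text.delete_prefix("U+").to_i(16), field, value.split)
end

# Sample lines in the documented shape, for illustration only.
lines = [
  "U+4E00\tkRSUnicode\t1.0",
  "U+4E00\tkCangjieInput\tM",
  "U+4E8C\tkDefinition\ttwo; twice",
]

# Group values by code point, then by field name.
fields_by_cp = Hash.new { |hash, cp| hash[cp] = {} }
lines.each do |line|
  record = parse_unihan_line(line)
  fields_by_cp[record.cp][record.field] = record.values
end
```

Because the split is purely on whitespace, a prose value such as a `kDefinition` entry ends up as one array element per word in this sketch.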

One parser, not eight: the format is uniform. The filename carries no parse-time information; every line is self-describing via its field name. Adding a new Unihan file is a one-line change to `FILES` and requires no parser modification (open/closed principle).

Defined Under Namespace

Classes: Record

Constant Summary

FILES =
%w[
  Unihan_DictionaryIndices.txt
  Unihan_DictionaryLikeData.txt
  Unihan_IRGSources.txt
  Unihan_NumericValues.txt
  Unihan_RadicalStrokeCounts.txt
  Unihan_Readings.txt
  Unihan_Variants.txt
  Unihan_OtherMappings.txt
].freeze

Class Method Summary

Methods inherited from Base

each_line, parse_codepoint_or_range, parse_field, parse_hex_cp

Class Method Details

.each_in_dir(dir) ⇒ Object

Iterates over every known Unihan file in `dir`, yielding one Record per data line across all files. Missing files are silently skipped, which accommodates incremental runs and partial downloads. Returns an Enumerator when no block is given.



# File 'lib/ucode/parsers/unihan.rb', line 80

def each_in_dir(dir)
  return enum_for(:each_in_dir, dir) unless block_given?

  dir_path = Pathname.new(dir)
  FILES.each do |filename|
    path = dir_path.join(filename)
    next unless path.exist?

    each_record(path) { |record| yield record }
  end

  nil
end
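
The skip-if-missing behaviour can be tried in isolation. The sketch below is not the library's code: `SKETCH_FILES` and `each_data_line_in_dir` are hypothetical stand-ins that yield raw lines instead of Records, and the temporary directory deliberately contains only one of the two listed files.

```ruby
require "pathname"
require "tmpdir"

# Hypothetical, shortened file list for the sketch.
SKETCH_FILES = %w[Unihan_Readings.txt Unihan_Variants.txt].freeze

# Yields every non-blank, non-comment line from the listed files in dir.
def each_data_line_in_dir(dir)
  return enum_for(:each_data_line_in_dir, dir) unless block_given?

  dir_path = Pathname.new(dir)
  SKETCH_FILES.each do |filename|
    path = dir_path.join(filename)
    next unless path.exist? # a missing file is skipped, not an error

    File.foreach(path.to_s) do |raw|
      line = raw.chomp
      next if line.empty? || line.start_with?("#")

      yield line
    end
  end

  nil
end

# Only Unihan_Readings.txt exists; Unihan_Variants.txt is absent.
seen = Dir.mktmpdir do |dir|
  File.write(File.join(dir, "Unihan_Readings.txt"),
             "# comment\n\nU+4E00\tkDefinition\tone\n")
  each_data_line_in_dir(dir).to_a
end
```

Calling the method without a block returns an Enumerator, so callers can chain `to_a`, `first`, or `lazy` as needed.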

.each_record(path) ⇒ Object

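Yields one Record per data line in a single Unihan file; blank lines and lines starting with `#` are skipped. If a line is malformed, the raised MalformedLineError is annotated with the file path and line number before being re-raised. Returns an Enumerator when no block is given.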



# File 'lib/ucode/parsers/unihan.rb', line 54

def each_record(path)
  return enum_for(:each_record, path) unless block_given?

  path_str = path.to_s
  lineno = 0

  File.foreach(path_str) do |raw|
    lineno += 1
    line = raw.chomp
    next if line.empty? || line.start_with?("#")

    begin
      yield parse_line(line)
    rescue MalformedLineError => e
      e.context[:file] ||= path_str
      e.context[:line] ||= lineno
      raise
    end
  end

  nil
end
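
The rescue block above fills in `:file` and `:line` only when they are not already set, then re-raises the same exception. A minimal sketch of that pattern follows, assuming an error class that carries a mutable `context` hash. `SketchMalformedLineError`, `check_line`, and `check_lines` are hypothetical and are not part of the library.

```ruby
# Hypothetical error class; the real MalformedLineError may differ.
class SketchMalformedLineError < StandardError
  attr_reader :context

  def initialize(message, context = {})
    super(message)
    @context = context
  end
end

# Raises unless the line has exactly three tab-separated fields.
def check_line(line)
  return line if line.count("\t") == 2

  raise SketchMalformedLineError.new("expected 3 tab-separated fields", { text: line })
end

# Checks each line and enriches any error with its location.
def check_lines(lines, source:)
  lines.each.with_index(1) do |line, lineno|
    begin
      check_line(line)
    rescue SketchMalformedLineError => e
      e.context[:file] ||= source # keep any value set closer to the failure
      e.context[:line] ||= lineno
      raise                       # re-raise the same exception object
    end
  end
end

error = begin
  check_lines(["U+4E00\tkDefinition\tone", "broken line"], source: "sample.txt")
  nil
rescue SketchMalformedLineError => e
  e
end
```

Using `||=` means a lower layer that already knows the location keeps its value, while the caller only supplies what is missing.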