Class: Ucode::Parsers::Unihan
Overview
Parses all eight Unihan files (`Unihan_IRGSources.txt`, `Unihan_NumericValues.txt`, `Unihan_RadicalStrokeCounts.txt`, `Unihan_Readings.txt`, `Unihan_DictionaryIndices.txt`, `Unihan_DictionaryLikeData.txt`, `Unihan_Variants.txt`, `Unihan_OtherMappings.txt`).
File format is uniform across all eight (Unihan documentation):
U+XXXX<TAB>kField<TAB>value
The value may be a space-separated list (`kRSUnicode`, `kDefinition` for prose, `kCangjieInput` for multiple codes), so calling `.split` with no arguments (which splits on whitespace) produces the values array uniformly for every field. The coordinator groups records by `cp` and writes them into `CodePoint.unihan.fields`.
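A minimal sketch of how one such line could be split, assuming a record with `cp`, `field`, and `values` members. The real `Record` class and the parser's own line-parsing code are not shown on this page, so `SketchRecord` and `parse_unihan_line` are hypothetical names, and the sample line is illustrative rather than copied from the database:

```ruby
# Hypothetical stand-in for Ucode::Parsers::Unihan::Record; the member
# names are assumptions based on the description above.
SketchRecord = Struct.new(:cp, :field, :values)

# Splits one Unihan data line into code point, field name, and values.
def parse_unihan_line(line)
  cp_text, field, value = line.chomp.split("\t", 3)
  raise ArgumentError, "malformed line: #{line.inspect}" if value.nil?

  # "U+4E00" becomes the integer 0x4E00.
  cp = Integer(cp_text.delete_prefix("U+"), 16)

  # No-argument split breaks the value on runs of whitespace.
  SketchRecord.new(cp, field, value.split)
end

# Illustrative line, not copied from the database.
record = parse_unihan_line("U+4E00\tkDefinition\tone two")
puts record.field
puts record.values.inspect
```

Note that whitespace splitting treats prose values word by word; a consumer that wants the original text of such a field would join the values with a single space.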
One parser, not eight: the format is uniform. The filename carries no parse-time information, because every line is self-describing via its field name. Adding a new Unihan file is a one-line change to `FILES` and requires no parser modification (open/closed principle).
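Because each record names its own field, records from different files can be merged by code point without knowing where they came from. The following is only a sketch of that idea, not the coordinator's actual code; `FieldRecord` and `fields_by_cp` are hypothetical names, and the sample values are illustrative:

```ruby
# Hypothetical record shape; the member names are assumptions.
FieldRecord = Struct.new(:cp, :field, :values)

# Records as they might arrive from different Unihan files, in any order.
records = [
  FieldRecord.new(0x4E00, "kTotalStrokes", ["1"]),
  FieldRecord.new(0x4E00, "kRSUnicode", ["1.0"]),
  FieldRecord.new(0x4E8C, "kTotalStrokes", ["2"])
]

# Group by code point, then by field name; the source file never matters.
fields_by_cp = Hash.new { |hash, cp| hash[cp] = {} }
records.each do |record|
  fields_by_cp[record.cp][record.field] = record.values
end

puts fields_by_cp[0x4E00].keys.inspect
```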
Defined Under Namespace
Classes: Record
Constant Summary
- FILES =
  %w[
    Unihan_DictionaryIndices.txt
    Unihan_DictionaryLikeData.txt
    Unihan_IRGSources.txt
    Unihan_NumericValues.txt
    Unihan_RadicalStrokeCounts.txt
    Unihan_Readings.txt
    Unihan_Variants.txt
    Unihan_OtherMappings.txt
  ].freeze
Class Method Summary
- .each_in_dir(dir) ⇒ Object
  Iterates every known Unihan file in `dir`, yielding one Record per data line across all files.
- .each_record(path) ⇒ Object
  Yields one Record per non-blank, non-comment line in a single Unihan file.
Methods inherited from Base
each_line, parse_codepoint_or_range, parse_field, parse_hex_cp
Class Method Details
.each_in_dir(dir) ⇒ Object
Iterates every known Unihan file in `dir`, yielding one Record per data line across all files. Missing files are silently skipped (incremental runs, partial downloads). Returns an Enumerator when no block is given.
# File 'lib/ucode/parsers/unihan.rb', line 80

def each_in_dir(dir)
  return enum_for(:each_in_dir, dir) unless block_given?

  dir_path = Pathname.new(dir)
  FILES.each do |filename|
    path = dir_path.join(filename)
    next unless path.exist?

    each_record(path) { |record| yield record }
  end
  nil
end
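The skip-if-missing behaviour can be illustrated in isolation. This is a sketch under stated assumptions, not the library's code: `SKETCH_FILES` and `present_files` are hypothetical names, and the two-entry list stands in for the real `FILES` constant:

```ruby
require "pathname"
require "tmpdir"

# Hypothetical file list for the sketch; the real list is FILES.
SKETCH_FILES = %w[Unihan_Readings.txt Unihan_Variants.txt].freeze

# Returns the paths from SKETCH_FILES that actually exist in dir.
def present_files(dir)
  dir_path = Pathname.new(dir)
  SKETCH_FILES.map { |name| dir_path.join(name) }.select(&:exist?)
end

Dir.mktmpdir do |dir|
  # Only one of the two expected files is present, as in a partial download.
  File.write(File.join(dir, "Unihan_Readings.txt"), "# empty\n")
  puts present_files(dir).map { |path| path.basename.to_s }.inspect
end
```

Skipping silently keeps partial directories usable; a caller that needs all files could compare the list of present files against the expected list before iterating.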
.each_record(path) ⇒ Object
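Yields one Record per non-blank, non-comment line in a single Unihan file. Returns an Enumerator when no block is given; records are still produced on demand as the file is read. A MalformedLineError raised while parsing is re-raised with the file path and line number added to its context.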
# File 'lib/ucode/parsers/unihan.rb', line 54

def each_record(path)
  return enum_for(:each_record, path) unless block_given?

  path_str = path.to_s
  lineno = 0
  File.foreach(path_str) do |raw|
    lineno += 1
    line = raw.chomp
    next if line.empty? || line.start_with?("#")

    begin
      yield parse_line(line)
    rescue MalformedLineError => e
      e.context[:file] ||= path_str
      e.context[:line] ||= lineno
      raise
    end
  end
  nil
end
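The `enum_for` guard used above is a general Ruby idiom, and it can be tried without the library. The sketch below assumes nothing about `Ucode` itself: `LineSketch.each_data_line` is a hypothetical helper that yields raw lines instead of Records, and the file contents are illustrative:

```ruby
require "tempfile"

module LineSketch
  # Yields each non-blank, non-comment line of the file at path.
  # Returns an Enumerator when called without a block.
  def self.each_data_line(path)
    return enum_for(:each_data_line, path) unless block_given?

    File.foreach(path.to_s) do |raw|
      line = raw.chomp
      next if line.empty? || line.start_with?("#")

      yield line
    end
    nil
  end
end

Tempfile.create("unihan_sketch") do |file|
  # Illustrative contents: a comment, a blank line, and two data lines.
  file.write("# comment\n\nU+4E00\tkTotalStrokes\t1\nU+4E8C\tkTotalStrokes\t2\n")
  file.flush

  enum = LineSketch.each_data_line(file.path)
  puts enum.class
  # Only as much of the file is read as is needed for the first line.
  puts enum.first(1).inspect
end
```

The object returned is a plain `Enumerator`, not an `Enumerator::Lazy`; chaining `.lazy` on it is the way to get lazy `map` or `select` over a large file.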