Class: Ucode::Parsers::Unihan
Overview
Parses all eight Unihan files (Unihan_IRGSources.txt,
Unihan_NumericValues.txt, Unihan_RadicalStrokeCounts.txt,
Unihan_Readings.txt, Unihan_DictionaryIndices.txt,
Unihan_DictionaryLikeData.txt, Unihan_Variants.txt,
Unihan_OtherMappings.txt).
File format is uniform across all eight (Unihan documentation):
U+XXXX<TAB>kField<TAB>value
The value may be a space-separated list (kRSUnicode, or kCangjieInput
with multiple codes) or prose (kDefinition). The parser calls .split
on every value, so the values array is always produced the same way,
by splitting on whitespace. The Coordinator groups records by code
point (cp) and writes them into CodePoint.unihan.fields[field].
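A minimal sketch of the line format described above, assuming nothing about the real implementation: UnihanLine and parse_unihan_line are hypothetical names used only for illustration, and the real Record may expose different attributes.

```ruby
# Hypothetical stand-in for the parser's Record; attribute names are assumed.
UnihanLine = Struct.new(:cp, :field, :values)

# Parses one "U+XXXX<TAB>kField<TAB>value" line.
# Returns nil for blank lines, comment lines, and lines with fewer than three fields.
def parse_unihan_line(line)
  line = line.chomp
  return nil if line.empty? || line.start_with?("#")

  cp_text, field, value = line.split("\t", 3)
  return nil if value.nil?

  # "U+4E00" -> 0x4E00; the value is split on whitespace, whatever the field.
  UnihanLine.new(cp_text.delete_prefix("U+").to_i(16), field, value.split)
end

record = parse_unihan_line("U+4E00\tkDefinition\tone; a, an; alone")
puts format("U+%04X %s %p", record.cp, record.field, record.values)
```

Note that a prose field such as kDefinition comes back as individual whitespace-separated tokens, so a consumer that wants the original sentence would need to join the values with a space.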
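One parser, not eight: the format is uniform, so the filename carries
no information needed to parse a line; every line is self-describing
via its field name. The filename is used only to tag each Record with
a category (see FILE_TO_CATEGORY). Adding a new Unihan file is a
one-line change to FILES, plus a FILE_TO_CATEGORY entry if its records
should carry a category; the parser itself is not modified
(open/closed principle).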
Defined Under Namespace
Classes: Record
Constant Summary
- FILES =
  %w[
    Unihan_DictionaryIndices.txt
    Unihan_DictionaryLikeData.txt
    Unihan_IRGSources.txt
    Unihan_NumericValues.txt
    Unihan_RadicalStrokeCounts.txt
    Unihan_Readings.txt
    Unihan_Variants.txt
    Unihan_OtherMappings.txt
  ].freeze
- FILE_TO_CATEGORY =
  Filename → category symbol. The parser tags every Record with the category derived from its source file, so consumers (Coordinator → UnihanEntry) don't need to know the mapping. Unicode does not reorganize files across versions, so this mapping is stable without per-field hardcoding.
  {
    "Unihan_DictionaryIndices.txt" => :dictionary_indices,
    "Unihan_DictionaryLikeData.txt" => :dictionary_like_data,
    "Unihan_IRGSources.txt" => :irg_sources,
    "Unihan_NumericValues.txt" => :numeric_values,
    "Unihan_RadicalStrokeCounts.txt" => :radical_stroke_counts,
    "Unihan_Readings.txt" => :readings,
    "Unihan_Variants.txt" => :variants,
    "Unihan_OtherMappings.txt" => :other_mappings,
  }.freeze
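The category lookup can be sketched in isolation. This is an illustration only: CATEGORY_BY_FILE is an abbreviated copy of the mapping and category_for is a hypothetical helper, though the fetch call mirrors the one in each_record's source.

```ruby
# Abbreviated copy of the filename => category mapping, for illustration only.
CATEGORY_BY_FILE = {
  "Unihan_Readings.txt" => :readings,
  "Unihan_Variants.txt" => :variants,
}.freeze

# An explicit filename wins; otherwise the basename of the path is used.
# Filenames missing from the mapping produce nil rather than raising.
def category_for(path, filename: nil)
  CATEGORY_BY_FILE.fetch(filename || File.basename(path.to_s), nil)
end

p category_for("/data/ucd/Unihan_Readings.txt")
p category_for("/tmp/download.tmp", filename: "Unihan_Variants.txt")
p category_for("/data/ucd/Unihan_Unknown.txt")
```

Because the lookup falls back to nil, a file that is listed in FILES but missing from the mapping would still be parsed; its records would simply have no category.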
Class Method Summary
- .each_in_dir(dir) ⇒ Object
  Iterates every known Unihan file in dir, yielding one Record per data line across all files.
- .each_record(path, filename: nil) ⇒ Object
  Yields one Record per non-comment line in a single Unihan file.
Methods inherited from Base
each_line, parse_codepoint_or_range, parse_field, parse_hex_cp
Class Method Details
.each_in_dir(dir) ⇒ Object
Iterates every known Unihan file in dir, yielding one Record
per data line across all files. Missing files are silently
skipped (incremental runs, partial downloads). Each Record
carries its category so callers don't need to re-derive it.
# File 'lib/ucode/parsers/unihan.rb', line 92

def each_in_dir(dir)
  return enum_for(:each_in_dir, dir) unless block_given?

  dir_path = Pathname.new(dir)
  FILES.each do |filename|
    path = dir_path.join(filename)
    next unless path.exist?

    each_record(path, filename: filename) { |record| yield record }
  end

  nil
end
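The skip-if-missing behaviour can be tried without the library. The sketch below is a self-contained stand-in, not the library's code: KNOWN_FILES and each_data_line_in_dir are hypothetical names, and it yields raw lines rather than Records.

```ruby
require "pathname"
require "tmpdir"

KNOWN_FILES = %w[Unihan_Readings.txt Unihan_Variants.txt].freeze # abbreviated list

# Yields [filename, line] for every data line of every known file present in dir.
def each_data_line_in_dir(dir)
  return enum_for(:each_data_line_in_dir, dir) unless block_given?

  dir_path = Pathname.new(dir)
  KNOWN_FILES.each do |filename|
    path = dir_path.join(filename)
    next unless path.exist? # a missing file is skipped, not an error

    path.each_line do |raw|
      line = raw.chomp
      next if line.empty? || line.start_with?("#")

      yield filename, line
    end
  end
  nil
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "Unihan_Readings.txt"), "# header\nU+4E00\tkCantonese\tjat1\n")
  # Unihan_Variants.txt is deliberately absent from the directory.
  each_data_line_in_dir(dir) { |file, line| puts "#{file}: #{line}" }
end
```

Only the file that exists contributes lines, which is what makes partial downloads and incremental runs workable.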
.each_record(path, filename: nil) ⇒ Object
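Yields one Record per non-comment line in a single Unihan file. The category is looked up from filename, which defaults to the basename of path; pass filename explicitly when the path's basename is not one of the names in FILE_TO_CATEGORY. For a filename that is not in FILE_TO_CATEGORY, the looked-up category is nil. Returns an Enumerator when no block is given.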
# File 'lib/ucode/parsers/unihan.rb', line 75

def each_record(path, filename: nil)
  return enum_for(:each_record, path, filename: filename) unless block_given?

  path_str = path.to_s
  category = FILE_TO_CATEGORY.fetch(filename || File.basename(path_str), nil)
  each_line_with_lineno(path_str) do |line, lineno|
    yield tagged_record(line, category, path_str, lineno)
  end
  nil
end
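Since enum_for returns a plain Enumerator, callers can chain Enumerable methods on the blockless form. The sketch below uses a stand-in method over in-memory sample data; Rec, SAMPLE, and each_sample_record are hypothetical, and the Record attribute names are assumed rather than taken from the library.

```ruby
# Hypothetical Record shape; the real class may differ.
Rec = Struct.new(:cp, :field, :values, :category)

# Sample data for illustration only.
SAMPLE = [
  Rec.new(0x4E00, "kCantonese", ["jat1"], :readings),
  Rec.new(0x4E00, "kDefinition", %w[one; a, an; alone], :readings),
  Rec.new(0x4E8C, "kCantonese", ["ji6"], :readings),
].freeze

# Same block_given?/enum_for shape as each_record, over the sample data.
def each_sample_record
  return enum_for(:each_sample_record) unless block_given?

  SAMPLE.each { |rec| yield rec }
  nil
end

enum = each_sample_record # an Enumerator, not an Enumerator::Lazy

# Group field names by code point, roughly what a coordinator might do.
fields_by_cp = enum.group_by(&:cp).transform_values { |recs| recs.map(&:field) }

# Opt in to lazy evaluation explicitly to stop at the first match.
first_4e8c = each_sample_record.lazy.select { |rec| rec.cp == 0x4E8C }.first

p fields_by_cp
p first_4e8c.values
```

Calling .lazy is the caller's choice here; without it, methods such as select and map build full intermediate arrays.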