Class: Ucode::Parsers::Base
- Inherits: Object
- Defined in: lib/ucode/parsers/base.rb
Overview
Shared infrastructure for every UCD text-file parser. Subclasses implement `.each_record(path) { |record| … }`, which returns an Enumerator when called without a block.
All methods are class methods — parsers are stateless.
UCD text-file format (UAX #44):
- Fields separated by `;`
- Lines starting with `#` are comments
- Blank lines are ignored
- Some lines carry an inline `# trailing comment` after the data
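The format rules above can be sketched as a small standalone splitter. `split_ucd_line` is a hypothetical helper written for illustration, not part of Ucode, and it assumes that `#` never appears inside a data field.

```ruby
# Hypothetical helper (not part of Ucode): splits one UCD data line into
# its `;`-separated fields and its optional trailing comment.
# Assumption: `#` only ever introduces a comment, never appears in a field.
def split_ucd_line(raw)
  data, comment = raw.chomp.split("#", 2)
  # Limit -1 keeps trailing empty fields, which some UCD files rely on.
  fields = data.to_s.split(";", -1).map(&:strip)
  [fields, comment&.strip]
end

# A line in the style of Scripts.txt.
sample = "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z"
fields, comment = split_ucd_line(sample)
# fields  => ["0041..005A", "Latin"]
# comment starts with "L&"
```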
Direct Known Subclasses
BidiBrackets, BidiMirroring, Blocks, CaseFolding, CjkRadicals, DerivedAge, DerivedCoreProperties, ExtractedProperties, NameAliases, NamedSequences, NamesList, PropertyAliases, PropertyValueAliases, ScriptExtensions, Scripts, SpecialCasing, StandardizedVariants, UnicodeData, Unihan
Defined Under Namespace
Classes: Line
Class Method Summary
- .each_line(path) ⇒ Object
  Iterates non-blank, non-comment lines from `path`, yielding Line records.
- .parse_codepoint_or_range(field) ⇒ Object
  Parses a codepoint-or-range field per UAX #44.
- .parse_field(line, n) ⇒ Object
  Parses the n-th `;`-separated field from a line of text or a Line struct.
- .parse_hex_cp(input) ⇒ Object
  Parses a single hex codepoint string into an Integer.
Class Method Details
.each_line(path) ⇒ Object
Iterates non-blank, non-comment lines from `path`, yielding Line records. Returns an Enumerator when no block is given so callers can chain (`.first(n)`, `.lazy.map`, etc.).
Lines that are entirely whitespace, or whose first non-whitespace character is `#`, are skipped silently. Comment text is preserved on data lines that carry an inline `# trailing comment`.
    # File 'lib/ucode/parsers/base.rb', line 54
    def each_line(path)
      return enum_for(:each_line, path) unless block_given?

      lineno = 0
      File.foreach(path.to_s) do |raw|
        lineno += 1
        stripped = raw.strip
        next if stripped.empty?
        next if stripped.start_with?("#")

        yield build_line(lineno, raw)
      end
    end
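The Enumerator behaviour can be tried without the library using a stand-in. This is a minimal sketch: `EachLineSketch` and its two-member `Line` struct are assumptions for illustration (the real Line is built by `build_line`, which is not shown here), and the sample data imitates Blocks.txt.

```ruby
require "tempfile"

# Stand-in for the documented method. The Line layout (lineno, raw) is
# an assumption made for this sketch.
module EachLineSketch
  Line = Struct.new(:lineno, :raw)

  def self.each_line(path)
    # Without a block, hand back an Enumerator over this same method.
    return enum_for(:each_line, path) unless block_given?

    lineno = 0
    File.foreach(path.to_s) do |raw|
      lineno += 1
      stripped = raw.strip
      next if stripped.empty? || stripped.start_with?("#")

      yield Line.new(lineno, raw)
    end
  end
end

linenos, names = Tempfile.create("ucd") do |file|
  file.write("# sample\n\n0000..007F; Basic Latin\n0080..00FF; Latin-1 Supplement\n")
  file.flush

  enum = EachLineSketch.each_line(file.path)
  [enum.map(&:lineno), enum.lazy.map { |l| l.raw.split(";")[1].strip }.first(1)]
end
# linenos => [3, 4]   (skipped lines still advance the counter)
# names   => ["Basic Latin"]
```

Because `lineno` is incremented before the skip checks, the yielded line numbers refer to positions in the original file, which keeps error messages accurate.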
.parse_codepoint_or_range(field) ⇒ Object
Parses a codepoint-or-range field per UAX #44. Accepts:
"0041" → 0x0041 (Integer)
"3400..4DBF" → 0x3400..0x4DBF (Range)
Returns nil for nil or empty input. Raises Ucode::MalformedLineError for invalid hex.
    # File 'lib/ucode/parsers/base.rb', line 84
    def parse_codepoint_or_range(field)
      return nil if field.nil? || field.empty?

      if field.include?(RANGE_SEPARATOR)
        first_str, last_str = field.split(RANGE_SEPARATOR, 2)
        first = parse_hex_cp(first_str)
        last = parse_hex_cp(last_str)
        Range.new(first, last)
      else
        parse_hex_cp(field)
      end
    end
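Since the method can return an Integer, a Range, or nil, callers may find it convenient to normalise the result before use. The sketch below shows one way to do that; `to_codepoint_range` is a hypothetical helper, not part of Ucode, and the input values are the documented results for "0041" and "3400..4DBF".

```ruby
# Hypothetical helper (not part of Ucode): normalises the three possible
# return shapes into a Range, or nil when there was no value.
def to_codepoint_range(value)
  case value
  when nil     then nil
  when Range   then value
  when Integer then (value..value)
  else raise ArgumentError, "unexpected value: #{value.inspect}"
  end
end

single = to_codepoint_range(0x0041)          # documented result for "0041"
block  = to_codepoint_range(0x3400..0x4DBF)  # documented result for "3400..4DBF"

single.size          # => 1
block.size           # => 6592
block.cover?(0x4E00) # => false
```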
.parse_field(line, n) ⇒ Object
Parses the n-th (zero-based) `;`-separated field from a line of text or a Line struct. Strips surrounding whitespace. Returns nil if the field is missing or out of range.
    # File 'lib/ucode/parsers/base.rb', line 71
    def parse_field(line, n)
      fields = line_fields(line)
      return nil if fields.length <= n

      fields[n]
    end
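The lookup can be exercised with a stand-in for `line_fields`, which is not shown on this page. In this sketch `FieldSketch`, its `Line` struct, and the comment-stripping behaviour of `line_fields` are all assumptions made for illustration.

```ruby
# Stand-in module. The real line_fields is not shown, so this version
# is an assumption: drop any trailing comment, split on ";", strip.
module FieldSketch
  Line = Struct.new(:lineno, :raw) # assumed layout

  def self.line_fields(line)
    text = line.respond_to?(:raw) ? line.raw : line.to_s
    text.split("#", 2).first.to_s.split(";", -1).map(&:strip)
  end

  def self.parse_field(line, n)
    fields = line_fields(line)
    return nil if fields.length <= n

    fields[n]
  end
end

text = "0041;LATIN CAPITAL LETTER A;Lu"
name    = FieldSketch.parse_field(text, 1) # => "LATIN CAPITAL LETTER A"
missing = FieldSketch.parse_field(text, 5) # => nil

line  = FieldSketch::Line.new(7, "0000..007F; Basic Latin\n")
block = FieldSketch.parse_field(line, 1)   # => "Basic Latin"
```

Note that the guard only checks the upper bound. Assuming `line_fields` returns an Array, a negative `n` passes the guard and indexes from the end of the array.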
.parse_hex_cp(input) ⇒ Object
Parses a single hex codepoint string into an Integer, ignoring surrounding whitespace. Raises Ucode::MalformedLineError for invalid input, with the offending value attached as `context: { input: input }`.
    # File 'lib/ucode/parsers/base.rb', line 100
    def parse_hex_cp(input)
      s = input.to_s.strip
      unless s.match?(HEX_PATTERN)
        raise MalformedLineError.new(
          "invalid codepoint: #{input.inspect}",
          context: { input: input }
        )
      end
      s.to_i(16)
    end
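The error path can be illustrated with stand-ins for the pieces not shown here. In this sketch the `HEX_PATTERN` value (4 to 6 hex digits) and the `MalformedLineError` class with its `context` reader are assumptions. The real definitions live elsewhere in Ucode and may differ.

```ruby
module HexSketch
  # Assumed shape of the error class: a message plus a context hash.
  class MalformedLineError < StandardError
    attr_reader :context

    def initialize(message, context: {})
      super(message)
      @context = context
    end
  end

  # Assumed pattern: 4 to 6 hex digits and nothing else.
  HEX_PATTERN = /\A\h{4,6}\z/

  def self.parse_hex_cp(input)
    s = input.to_s.strip
    unless s.match?(HEX_PATTERN)
      raise MalformedLineError.new(
        "invalid codepoint: #{input.inspect}",
        context: { input: input }
      )
    end
    s.to_i(16)
  end
end

ok = HexSketch.parse_hex_cp(" 1F600 ") # => 128512 (0x1F600)

begin
  HexSketch.parse_hex_cp("U+0041")
rescue HexSketch::MalformedLineError => e
  message = e.message # => invalid codepoint: "U+0041"
  context = e.context # => { input: "U+0041" }
end
```

Validating against a pattern before calling `to_i(16)` matters because `String#to_i` never raises: it returns 0 or a partial parse for malformed input.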