Class: Ucode::Parsers::Base

Inherits:
Object
Defined in:
lib/ucode/parsers/base.rb

Overview

Shared infrastructure for every UCD text-file parser. Subclasses implement `.each_record(path) { |record| … }`, which returns an Enumerator when called without a block.

All methods are class methods — parsers are stateless.

UCD text-file format (UAX #44):

- Fields separated by `;`
- Lines starting with `#` are comments
- Blank lines are ignored
- Some lines carry an inline `# trailing comment` after the data
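
The format rules above can be sketched in a few lines of plain Ruby. This is only an illustration: `split_ucd_line` is a hypothetical helper, not part of Ucode, and it assumes that `#` never appears inside a field value.

```ruby
# Hypothetical helper (not the library's API): splits one UCD data line
# into whitespace-stripped fields plus an optional trailing comment.
def split_ucd_line(raw)
  # Everything after the first "#" is treated as the trailing comment.
  data, comment = raw.chomp.split("#", 2)
  # A limit of -1 keeps empty trailing fields instead of dropping them.
  fields = data.to_s.split(";", -1).map(&:strip)
  { fields: fields, comment: comment&.strip }
end

sample = "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z"
parsed = split_ucd_line(sample)
p parsed[:fields]   # ["0041..005A", "Latin"]
p parsed[:comment]
```

Blank lines and full-line comments are expected to be filtered out before a helper like this is called.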

Defined Under Namespace

Classes: Line

Class Method Details

.each_line(path) ⇒ Object

Iterates non-blank, non-comment lines from `path`, yielding Line records. Returns an Enumerator when no block is given so callers can chain (`.first(n)`, `.lazy.map`, etc.).

Lines that are entirely whitespace, or whose first non-whitespace character is `#`, are skipped silently. Comment text is preserved on data lines that carry an inline `# trailing comment`.



# File 'lib/ucode/parsers/base.rb', line 54

def each_line(path)
  return enum_for(:each_line, path) unless block_given?

  lineno = 0
  File.foreach(path.to_s) do |raw|
    lineno += 1
    stripped = raw.strip
    next if stripped.empty?
    next if stripped.start_with?("#")

    yield build_line(lineno, raw)
  end
end
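
To show how the Enumerator return value might be used, here is a self-contained sketch that mirrors the listing above. `DemoParser` and `DemoLine` are hypothetical stand-ins, since the library's `build_line` and its `Line` struct are not shown here. The sketch assumes line numbers count every physical line, including skipped ones, as the `lineno += 1` placement suggests.

```ruby
require "tempfile"

# Hypothetical stand-in for the library's Line record.
DemoLine = Struct.new(:lineno, :raw)

module DemoParser
  def self.each_line(path)
    return enum_for(:each_line, path) unless block_given?

    lineno = 0
    File.foreach(path.to_s) do |raw|
      lineno += 1 # incremented before the skip checks, so numbers match the file
      stripped = raw.strip
      next if stripped.empty? || stripped.start_with?("#")

      yield DemoLine.new(lineno, raw.chomp)
    end
  end
end

file = Tempfile.new("ucd-demo")
file.write("# header comment\n\n0041;A # trailing\n0042;B\n")
file.close

records = DemoParser.each_line(file.path)
p records.map(&:lineno)                                      # [3, 4]
p records.lazy.map { |l| l.raw.split(";").first }.first(1)   # ["0041"]
```

Because no block is passed, nothing is read until the Enumerator is consumed, and `.lazy` stops reading the file as soon as `first(1)` is satisfied.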

.parse_codepoint_or_range(field) ⇒ Object

Parses a codepoint-or-range field per UAX #44. Accepts:

"0041"           → 0x0041 (Integer)
"3400..4DBF"     → 0x3400..0x4DBF (Range)

Returns nil when the field is nil or empty. Raises Ucode::MalformedLineError for invalid hex.



# File 'lib/ucode/parsers/base.rb', line 84

def parse_codepoint_or_range(field)
  return nil if field.nil? || field.empty?

  if field.include?(RANGE_SEPARATOR)
    first_str, last_str = field.split(RANGE_SEPARATOR, 2)
    first = parse_hex_cp(first_str)
    last = parse_hex_cp(last_str)
    Range.new(first, last)
  else
    parse_hex_cp(field)
  end
end
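
Since the result may be nil, an Integer, or a Range, callers usually need to branch on the type. The sketch below shows one way to normalize the result to a Range. `demo_parse_cp` and `as_range` are hypothetical names, and the values chosen for `DEMO_RANGE_SEP` and `DEMO_HEX_RE` are assumptions, because the library's `RANGE_SEPARATOR` and `HEX_PATTERN` are not shown on this page.

```ruby
DEMO_RANGE_SEP = ".."          # assumed value of RANGE_SEPARATOR
DEMO_HEX_RE = /\A\h{4,6}\z/    # assumed shape of HEX_PATTERN

# Hypothetical stand-in for parse_codepoint_or_range.
def demo_parse_cp(field)
  return nil if field.nil? || field.empty?

  parts = field.split(DEMO_RANGE_SEP, 2).map do |s|
    s = s.strip
    raise ArgumentError, "invalid codepoint: #{s.inspect}" unless s.match?(DEMO_HEX_RE)
    s.to_i(16)
  end
  parts.length == 2 ? Range.new(*parts) : parts.first
end

# Normalizes the Integer-or-Range result so callers can treat both alike.
def as_range(value)
  case value
  when Integer then value..value
  when Range   then value
  end
end

p as_range(demo_parse_cp("0041")).size         # 1
p as_range(demo_parse_cp("3400..4DBF")).size   # 6592
p as_range(demo_parse_cp(""))                  # nil
```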

.parse_field(line, n) ⇒ Object

Parses the n-th (zero-based) `;`-separated field from a line of text or a Line struct. Strips surrounding whitespace. Returns nil if the field is missing or out of range.



# File 'lib/ucode/parsers/base.rb', line 71

def parse_field(line, n)
  fields = line_fields(line)
  return nil if fields.length <= n

  fields[n]
end
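
The difference between an empty field and a missing one matters for files such as UnicodeData.txt, where many fields are intentionally blank. The sketch below illustrates it with a hypothetical `demo_field` helper. The library's `line_fields` is not shown, so the splitting here (comment stripping, keeping empty trailing fields) is an assumption rather than a description of its behavior.

```ruby
# Hypothetical stand-in for parse_field working on a plain String.
def demo_field(text, n)
  data = text.split("#", 2).first.to_s
  fields = data.split(";", -1).map(&:strip)
  return nil if fields.length <= n

  fields[n]
end

row = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

p demo_field(row, 2)    # "Lu"
p demo_field(row, 5)    # ""   (present but empty)
p demo_field(row, 13)   # "0061"
p demo_field(row, 15)   # nil  (out of range)
```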

.parse_hex_cp(input) ⇒ Object

Parses a single hex codepoint string into an Integer. Raises Ucode::MalformedLineError with the offending input in context for invalid input.



# File 'lib/ucode/parsers/base.rb', line 100

def parse_hex_cp(input)
  s = input.to_s.strip
  unless s.match?(HEX_PATTERN)
    raise MalformedLineError.new(
      "invalid codepoint: #{input.inspect}",
      context: { input: input }
    )
  end
  s.to_i(16)
end
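
The pattern check before `to_i(16)` matters because Ruby's `String#to_i` is lenient: it parses whatever leading digits are valid and returns 0 when there are none. The following sketch shows the contrast. `DemoMalformedLineError` is a hypothetical stand-in for Ucode::MalformedLineError (its `context` reader is an assumption), and `DEMO_HEX` is an assumed pattern, since `HEX_PATTERN` is not shown on this page.

```ruby
# Hypothetical stand-in for Ucode::MalformedLineError.
class DemoMalformedLineError < StandardError
  attr_reader :context

  def initialize(message, context: {})
    super(message)
    @context = context
  end
end

DEMO_HEX = /\A\h+\z/ # assumed; the real HEX_PATTERN may be stricter

def demo_parse_hex_cp(input)
  s = input.to_s.strip
  unless s.match?(DEMO_HEX)
    raise DemoMalformedLineError.new(
      "invalid codepoint: #{input.inspect}",
      context: { input: input }
    )
  end
  s.to_i(16)
end

p "41xyz".to_i(16)   # 65, garbage after the digits is silently ignored
p "zz".to_i(16)      # 0

begin
  demo_parse_hex_cp("41xyz")
rescue DemoMalformedLineError => e
  warn "#{e.message} (context: #{e.context})"
end
```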