Class: Ucode::Parsers::Base

Inherits: Object

Defined in: lib/ucode/parsers/base.rb

Overview

Shared infrastructure for every UCD text-file parser. Subclasses implement `.each_record(path) { |record| … }`, which returns an Enumerator when called without a block.

All methods are class methods — parsers are stateless.

UCD text-file format (UAX #44):

- Fields separated by `;`
- Lines starting with `#` are comments
- Blank lines are ignored
- Some lines carry an inline `# trailing comment` after the data
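
The four rules above can be sketched as a single function. This is an illustration only: `split_ucd_line` is a hypothetical helper, not part of Ucode, and it assumes that the first `#` on a data line always starts the trailing comment.

```ruby
# Hypothetical helper (not part of Ucode): applies the format rules above
# to one raw line. Returns [fields, comment], or nil when the line carries
# no data.
def split_ucd_line(raw)
  stripped = raw.strip
  return nil if stripped.empty?            # blank lines are ignored
  return nil if stripped.start_with?("#")  # whole-line comments

  # Assumption: the first "#" begins the inline trailing comment.
  data, comment = stripped.split("#", 2)

  # A limit of -1 keeps trailing empty fields instead of dropping them.
  fields = data.split(";", -1).map(&:strip)
  [fields, comment&.strip]
end

fields, comment = split_ucd_line("0041..005A ; Latin # L& [26]\n")
p fields   # → ["0041..005A", "Latin"]
p comment  # → "L& [26]"
```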

Direct Known Subclasses

BidiBrackets, BidiMirroring, Blocks, CaseFolding, CjkRadicals, DerivedAge, DerivedCoreProperties, ExtractedProperties, NameAliases, NamedSequences, NamesList, PropertyAliases, PropertyValueAliases, ScriptExtensions, Scripts, SpecialCasing, StandardizedVariants, UnicodeData, Unihan

Defined Under Namespace

Classes: Line

Class Method Summary

.each_line(path) ⇒ Object

Iterates non-blank, non-comment lines from `path`, yielding Line records.

.parse_codepoint_or_range(field) ⇒ Object

Parses a codepoint-or-range field per UAX #44.

.parse_field(line, n) ⇒ Object

Parses the n-th `;`-separated field from a line of text or a Line struct.

.parse_hex_cp(input) ⇒ Object

Parses a single hex codepoint string into an Integer.

Class Method Details

.each_line(path) ⇒ `Object`

Iterates non-blank, non-comment lines from `path`, yielding Line records. Returns an Enumerator when no block is given so callers can chain (`.first(n)`, `.lazy.map`, etc.).

Lines that are entirely whitespace, or whose first non-whitespace character is `#`, are skipped silently. Data lines that carry an inline `# trailing comment` are yielded with the comment text intact.

# File 'lib/ucode/parsers/base.rb', line 54

def each_line(path)
  return enum_for(:each_line, path) unless block_given?

  lineno = 0
  File.foreach(path.to_s) do |raw|
    lineno += 1
    stripped = raw.strip
    next if stripped.empty?
    next if stripped.start_with?("#")

    yield build_line(lineno, raw)
  end
end
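
The block-or-Enumerator pattern used above can be tried in isolation. The sketch below is a stand-in, not the library code: `LineReaderSketch` and `sketch_first_data_lines` are hypothetical names, and because `build_line` and the Line struct are not shown here, it yields plain `[lineno, text]` pairs instead.

```ruby
require "tempfile"

# Stand-in for illustration only: yields [lineno, text] pairs rather than
# the library's Line records.
module LineReaderSketch
  def self.each_line(path)
    # Without a block, hand back an Enumerator over this same method.
    return enum_for(:each_line, path) unless block_given?

    lineno = 0
    File.foreach(path.to_s) do |raw|
      lineno += 1
      stripped = raw.strip
      next if stripped.empty? || stripped.start_with?("#")

      yield [lineno, raw.chomp]
    end
  end
end

# Writes a small sample file and returns its first n data lines.
def sketch_first_data_lines(n)
  Tempfile.create("ucd") do |f|
    f.write("# header\n\n0041;A\n0042;B # trailing\n0043;C\n")
    f.flush
    LineReaderSketch.each_line(f.path).first(n)
  end
end

p sketch_first_data_lines(2)
# → [[3, "0041;A"], [4, "0042;B # trailing"]]
```

Line numbers count every physical line, including skipped ones, so the first data line in the sample is reported as line 3.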

.parse_codepoint_or_range(field) ⇒ `Object`

Parses a codepoint-or-range field per UAX #44. Accepts:

"0041"           → 0x0041 (Integer)
"3400..4DBF"     → 0x3400..0x4DBF (Range)

Returns nil when the field is nil or empty. Raises Ucode::MalformedLineError for invalid hex.

# File 'lib/ucode/parsers/base.rb', line 84

def parse_codepoint_or_range(field)
  return nil if field.nil? || field.empty?

  if field.include?(RANGE_SEPARATOR)
    first_str, last_str = field.split(RANGE_SEPARATOR, 2)
    first = parse_hex_cp(first_str)
    last = parse_hex_cp(last_str)
    Range.new(first, last)
  else
    parse_hex_cp(field)
  end
end
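
Because the return value is an Integer, a Range, or nil, callers usually need to branch on its type. One possible way to normalize it is sketched below; `to_codepoint_range` is a hypothetical caller-side helper, not part of Ucode.

```ruby
# Hypothetical caller-side helper: turns the Integer-or-Range result into
# a Range (or nil) so downstream code can treat both cases uniformly.
def to_codepoint_range(value)
  case value
  when nil     then nil
  when Range   then value
  when Integer then (value..value)
  else raise ArgumentError, "expected Integer, Range, or nil, got #{value.class}"
  end
end

single = to_codepoint_range(0x0041)
block  = to_codepoint_range(0x3400..0x4DBF)

puts single.include?(0x0041) # → true
puts block.cover?(0x4000)    # → true
```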

.parse_field(line, n) ⇒ `Object`

Parses the n-th (zero-based) `;`-separated field from a line of text or a Line struct. Strips surrounding whitespace. Returns nil if the field is missing or out of range.

# File 'lib/ucode/parsers/base.rb', line 71

def parse_field(line, n)
  fields = line_fields(line)
  return nil if fields.length <= n

  fields[n]
end
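
The `line_fields` helper is not shown in this section. As a rough sketch of the behavior described above, assuming the helper simply splits on `;` and strips each field, a String-only equivalent might look like this (`field_at` is a hypothetical name, and whether the real helper also removes trailing comments is not known from this page):

```ruby
# Sketch only: assumes fields are produced by splitting on ";" and
# stripping whitespace. Handles String input, not Line structs.
def field_at(text, n)
  fields = text.to_s.split(";", -1).map(&:strip)
  return nil if fields.length <= n  # missing or out-of-range field

  fields[n]
end

p field_at("0041 ; Lu ; L", 0) # → "0041"
p field_at("0041 ; Lu ; L", 1) # → "Lu"
p field_at("0041 ; Lu ; L", 5) # → nil
```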

.parse_hex_cp(input) ⇒ `Object`

Parses a single hex codepoint string into an Integer. Surrounding whitespace is stripped first. Raises Ucode::MalformedLineError, carrying the offending input in its context, when the string does not match HEX_PATTERN.

# File 'lib/ucode/parsers/base.rb', line 100

def parse_hex_cp(input)
  s = input.to_s.strip
  unless s.match?(HEX_PATTERN)
    raise MalformedLineError.new(
      "invalid codepoint: #{input.inspect}",
      context: { input: input }
    )
  end
  s.to_i(16)
end
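
To show how a caller might rescue the error and read its context, here is a self-contained sketch. Everything suffixed `Sketch` or `_sketch` is a stand-in: the real HEX_PATTERN and Ucode::MalformedLineError are defined elsewhere and may differ. The pattern below assumes 4 to 6 hex digits, and the error class assumes a `context` reader.

```ruby
# Assumption: codepoints are written as 4 to 6 hex digits.
HEX_PATTERN_SKETCH = /\A\h{4,6}\z/

# Assumption: the real error class exposes its context in a similar way.
class MalformedLineErrorSketch < StandardError
  attr_reader :context

  def initialize(message, context: {})
    super(message)
    @context = context
  end
end

def parse_hex_cp_sketch(input)
  s = input.to_s.strip
  unless s.match?(HEX_PATTERN_SKETCH)
    raise MalformedLineErrorSketch.new(
      "invalid codepoint: #{input.inspect}",
      context: { input: input }
    )
  end
  s.to_i(16)
end

p parse_hex_cp_sketch(" 0041 ") # → 65

begin
  parse_hex_cp_sketch("ZZZZ")
rescue MalformedLineErrorSketch => e
  # The context hash carries the original, unstripped input.
  warn "#{e.message} (context: #{e.context.inspect})"
end
```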