Class: Ucode::Parsers::UnicodeData

Inherits:
Base
  • Object
Defined in:
lib/ucode/parsers/unicode_data.rb,
lib/ucode/parsers/unicode_data/hangul_name.rb

Overview

Parses `UnicodeData.txt`, the primary per-codepoint record file.

Field layout (UAX #44, 15 `;`-separated fields):

0.  codepoint
1.  name (`<control>` or `<Type, First>` / `<Type, Last>` for ranges)
2.  general_category
3.  canonical_combining_class
4.  bidi_class
5.  decomposition_type_and_mapping (combined: optional `<tag>` + cps)
6.  numeric_value_decimal (set only for Numeric_Type=Decimal; duplicates 8)
7.  numeric_value_digit    (set for Numeric_Type=Decimal or Digit; duplicates 8)
8.  numeric_value          (canonical)
9.  bidi_mirrored (Y/N)
10. Unicode_1_Name         (deprecated, kept as `name1`)
11. ISO_10646_comment      (deprecated, ignored)
12. simple_uppercase_mapping
13. simple_lowercase_mapping
14. simple_titlecase_mapping
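
The field layout above can be illustrated with a minimal, self-contained sketch. `split_unicode_data_line` and `FIELD_COUNT` are hypothetical names used only for illustration; they are not part of Ucode, whose own line handling lives in `Base` and is not shown here.

```ruby
# Hypothetical helper (not part of Ucode): splits one UnicodeData.txt line
# into its 15 fields and labels a handful of them.
FIELD_COUNT = 15

def split_unicode_data_line(line)
  # The -1 limit keeps trailing empty fields, which are common in this file.
  fields = line.chomp.split(";", -1)
  unless fields.size == FIELD_COUNT
    raise ArgumentError, "expected #{FIELD_COUNT} fields, got #{fields.size}"
  end

  {
    codepoint: fields[0].to_i(16),
    name: fields[1],
    general_category: fields[2],
    canonical_combining_class: fields[3].to_i,
    bidi_class: fields[4],
    bidi_mirrored: fields[9] == "Y",
    # Case mappings are hex codepoints, or empty when there is no mapping.
    simple_lowercase_mapping: fields[13].empty? ? nil : fields[13].to_i(16)
  }
end

record = split_unicode_data_line(
  "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
)
puts record[:name]                      # prints "LATIN CAPITAL LETTER A"
puts record[:simple_lowercase_mapping]  # prints 97 (U+0061)
```

Passing `-1` as the split limit matters: without it Ruby drops trailing empty strings, and a line ending in `;` would come back with fewer than 15 fields.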

Large blocks such as Hangul syllables and CJK ideographs appear as a pair of range-marker lines (`<…, First>` / `<…, Last>`) rather than one line per codepoint. Each range is expanded to one CodePoint per codepoint, with the appropriate synthesized name.
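
As a rough sketch of what "synthesized name" means for these ranges, the following follows the Hangul syllable name generation rule from the Unicode Standard (section 3.12) and the `CJK UNIFIED IDEOGRAPH-` prefix rule. The method and constant names are assumptions made for this example; the actual implementation in `hangul_name.rb` is not shown on this page and may differ.

```ruby
# Illustrative only; names here are hypothetical, not Ucode's API.
S_BASE  = 0xAC00
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT      # 11172 precomposed syllables

# Jamo short names, indexed by leading consonant, vowel, trailing consonant.
JAMO_L = %w[G GG N D DD R M B BB S SS] + [""] + %w[J JJ C K T P H]
JAMO_V = %w[A AE YA YAE EO E YEO YE O WA WAE OE YO U WEO WE WI YU EU YI I]
JAMO_T = [""] + %w[G GG GS N NJ NH D L LG LM LB LS LT LP LH
                   M B BS S SS NG J C K T P H]

def hangul_syllable_name(cp)
  s_index = cp - S_BASE
  unless (0...S_COUNT).cover?(s_index)
    raise ArgumentError, format("U+%04X is not a Hangul syllable", cp)
  end

  # Decompose the syllable index arithmetically into its three jamo indices.
  l = s_index / N_COUNT
  v = (s_index % N_COUNT) / T_COUNT
  t = s_index % T_COUNT
  "HANGUL SYLLABLE #{JAMO_L[l]}#{JAMO_V[v]}#{JAMO_T[t]}"
end

def cjk_ideograph_name(cp)
  # Unified ideograph names are the fixed prefix plus the uppercase hex codepoint.
  format("CJK UNIFIED IDEOGRAPH-%04X", cp)
end

puts hangul_syllable_name(0xAC00) # prints "HANGUL SYLLABLE GA"
puts hangul_syllable_name(0xD4DB) # prints "HANGUL SYLLABLE PWILH"
puts cjk_ideograph_name(0x4E00)   # prints "CJK UNIFIED IDEOGRAPH-4E00"
```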

Class Method Summary

Methods inherited from Base

each_line, parse_codepoint_or_range, parse_field, parse_hex_cp

Class Method Details

.each_record(path) ⇒ Object

Yields one CodePoint per codepoint in `path`. Range markers (`<…, First>` to `<…, Last>`) are expanded to one CodePoint per codepoint, with names synthesized per Unicode rules.

Returns an Enumerator (via `enum_for`) when called without a block, so records are produced on demand rather than collected up front.
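
The block-or-Enumerator behaviour comes from the standard `enum_for` idiom. The sketch below shows that idiom in isolation; `Demo.each_number` is a hypothetical stand-in for a record-yielding method, not part of Ucode.

```ruby
# Hypothetical stand-in that mirrors the "return enum_for unless block_given?" pattern.
module Demo
  def self.each_number(limit)
    return enum_for(:each_number, limit) unless block_given?

    1.upto(limit) { |n| yield n }
    nil
  end
end

# With a block: values are yielded one at a time and the method returns nil.
sum = 0
Demo.each_number(4) { |n| sum += n }
puts sum # prints 10

# Without a block: an Enumerator comes back, so Enumerable methods apply and
# iteration stops as soon as enough values have been taken.
first_three = Demo.each_number(1_000_000).first(3)
p first_three # prints [1, 2, 3]
```

Assuming `each_record` behaves the same way, a caller could write `Ucode::Parsers::UnicodeData.each_record(path).first(10)` to read only the start of the file.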



# File 'lib/ucode/parsers/unicode_data.rb', line 43

def each_record(path)
  return enum_for(:each_record, path) unless block_given?

  # Holds the properties of a `<…, First>` line until its `<…, Last>` line arrives.
  pending_range = nil

  each_line(path) do |line|
    begin
      fields = line.fields

      if pending_range
        unless fields[1]&.end_with?("#{LAST_MARKER}>")
          raise MalformedLineError.new(
            "expected <#{pending_range[:template]}, #{LAST_MARKER}>, " \
            "got #{fields[1].inspect}",
            context: { file: path.to_s, line: line.number }
          )
        end

        last_cp = parse_hex_cp(fields[0])
        expand_range(pending_range, last_cp).each { |cp| yield cp }
        pending_range = nil
        next
      end

      cp = parse_hex_cp(fields[0])
      name = fields[1]

      # A range-start line yields nothing yet; remember it and wait for the end marker.
      if range_start?(name)
        pending_range = {
          first_cp: cp,
          template: extract_template(name),
          general_category: fields[2],
          combining_class: fields[3].to_i,
          bidi_class: fields[4],
          bidi_mirrored: fields[9]
        }
        next
      end

      yield build_codepoint(
        cp: cp,
        name: synthesize_name(cp, name),
        general_category: fields[2],
        combining_class: fields[3].to_i,
        bidi_class: fields[4],
        decomposition_field: fields[5],
        numeric_decimal: fields[6],
        numeric_digit: fields[7],
        numeric_value: fields[8],
        bidi_mirrored: fields[9],
        unicode_1_name: fields[10],
        simple_upper_id: fields[12],
        simple_lower_id: fields[13],
        simple_title_id: fields[14]
      )
    rescue MalformedLineError => e
      e.context[:file] ||= path.to_s
      e.context[:line] ||= line.number
      raise
    end
  end

  nil
end
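
The First/Last pairing in the listing above can be isolated as a small state machine. The sketch below is a simplified, hypothetical version that works on `[codepoint, name]` pairs instead of parsed lines; `RangeDemo`, `RANGE_FIRST`, and `RANGE_LAST` are names invented for this example. It is deliberately stricter than the listing in two places, both assumptions rather than library behaviour: it requires the Last marker to name the same template as the First marker, and it raises if the input ends while a range is still open.

```ruby
# Simplified illustration of range pairing; not Ucode's implementation.
module RangeDemo
  RANGE_FIRST = /\A<(.+), First>\z/
  RANGE_LAST  = /\A<(.+), Last>\z/

  def self.expand_rows(rows)
    return enum_for(:expand_rows, rows) unless block_given?

    pending = nil # [first_cp, template] between a First row and its Last row
    rows.each do |cp, name|
      if pending
        first_cp, template = pending
        m = RANGE_LAST.match(name)
        unless m && m[1] == template
          raise ArgumentError, "expected <#{template}, Last>, got #{name.inspect}"
        end
        # Emit every codepoint in the closed range, tagged with its template.
        (first_cp..cp).each { |c| yield c, template }
        pending = nil
      elsif (m = RANGE_FIRST.match(name))
        pending = [cp, m[1]]
      else
        yield cp, name
      end
    end

    raise ArgumentError, "unterminated range <#{pending[1]}, First>" if pending
    nil
  end
end

rows = [
  [0x0041, "LATIN CAPITAL LETTER A"],
  [0xAC00, "<Hangul Syllable, First>"],
  [0xAC02, "<Hangul Syllable, Last>"]
]
expanded = RangeDemo.expand_rows(rows).to_a
puts expanded.size # prints 4
```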