Class: Ucode::Parsers::UnicodeData
Defined in:
- lib/ucode/parsers/unicode_data.rb
- lib/ucode/parsers/unicode_data/hangul_name.rb
Overview
Parses `UnicodeData.txt`, the primary per-codepoint record file.
Field layout (UAX #44, 15 `;`-separated fields):
0. codepoint
1. name (`<control>` or `<Type, First>` / `<Type, Last>` for ranges)
2. general_category
3. canonical_combining_class
4. bidi_class
5. decomposition_type_and_mapping (combined: optional `<tag>` + cps)
6. numeric_value_decimal (duplicate of 8; set only when Numeric_Type=Decimal, i.e. Nd digits)
7. numeric_value_digit (duplicate of 8; set when Numeric_Type is Decimal or Digit)
8. numeric_value (canonical)
9. bidi_mirrored (Y/N)
10. Unicode_1_Name (deprecated, kept as `name1`)
11. ISO_10646_comment (deprecated, ignored)
12. simple_uppercase_mapping
13. simple_lowercase_mapping
14. simple_titlecase_mapping
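The field layout above can be sketched as a plain split on `;`. This is a minimal illustration, not the parser's actual implementation: `FIELD_NAMES` and `split_unicode_data_line` are hypothetical names chosen for this example.

```ruby
# Hypothetical symbolic names for the 15 fields, in file order.
FIELD_NAMES = %i[
  codepoint name general_category canonical_combining_class bidi_class
  decomposition numeric_decimal numeric_digit numeric_value bidi_mirrored
  unicode_1_name iso_comment simple_uppercase simple_lowercase simple_titlecase
].freeze

# Splits one UnicodeData.txt line into a Hash keyed by field name.
def split_unicode_data_line(line)
  # The -1 limit keeps trailing empty fields, so a well-formed line
  # always produces exactly 15 values.
  values = line.chomp.split(";", -1)
  unless values.size == FIELD_NAMES.size
    raise ArgumentError, "expected #{FIELD_NAMES.size} fields, got #{values.size}"
  end

  FIELD_NAMES.zip(values).to_h
end

record = split_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
puts record[:name]             # name field (index 1)
puts record[:simple_lowercase] # simple lowercase mapping (index 13)
```

Empty fields come back as empty strings, so a caller still has to decide how to represent "no value" for mappings and numeric fields.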
Hangul syllables and CJK ideographs appear as range markers (`<…, First>` / `<…, Last>`). Each range is expanded to one CodePoint per codepoint, with the appropriate synthesized name.
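As a rough sketch of what name synthesis for these ranges involves, the following applies the Unicode Standard's Hangul syllable name rule and the `CJK UNIFIED IDEOGRAPH-` hex-suffix rule. The method names (`hangul_syllable_name`, `sketch_name`) are hypothetical and are not taken from this library. Only these two range families are handled, and everything else returns `nil`.

```ruby
# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE  = 0xAC00
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT

# Jamo short names for leading consonants, vowels, and trailing consonants.
L_NAMES = %w[G GG N D DD R M B BB S SS] + [""] + %w[J JJ C K T P H]
V_NAMES = %w[A AE YA YAE EO E YEO YE O WA WAE OE YO U WEO WE WI YU EU YI I]
T_NAMES = [""] + %w[G GG GS N NJ NH D L LG LM LB LS LT LP LH M B BS S SS NG J C K T P H]

# Builds "HANGUL SYLLABLE ..." from the arithmetic decomposition of cp.
def hangul_syllable_name(cp)
  s_index = cp - S_BASE
  l = s_index / N_COUNT
  v = (s_index % N_COUNT) / T_COUNT
  t = s_index % T_COUNT
  "HANGUL SYLLABLE #{L_NAMES[l]}#{V_NAMES[v]}#{T_NAMES[t]}"
end

# template is the text before ", First>" in the range marker.
def sketch_name(template, cp)
  case template
  when "Hangul Syllable" then hangul_syllable_name(cp)
  when /\ACJK Ideograph/ then format("CJK UNIFIED IDEOGRAPH-%04X", cp)
  end
end

puts sketch_name("Hangul Syllable", 0xAC00) # → HANGUL SYLLABLE GA
puts sketch_name("CJK Ideograph", 0x4E00)   # → CJK UNIFIED IDEOGRAPH-4E00
```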
Class Method Summary
.each_record(path) ⇒ Object
Yields one CodePoint per codepoint in `path`.
Methods inherited from Base
each_line, parse_codepoint_or_range, parse_field, parse_hex_cp
Class Method Details
.each_record(path) ⇒ Object
Yields one CodePoint per codepoint in `path`. Range markers (`<…, First>` to `<…, Last>`) are expanded to one CodePoint per codepoint, with names synthesized per Unicode rules.
Returns an Enumerator (built with `enum_for`) when called without a block; records are produced on demand as the Enumerator is consumed.
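The block-or-Enumerator behavior follows a common Ruby idiom. The sketch below shows that idiom in isolation. `each_codepoint` is a hypothetical stand-in that reads from an in-memory array rather than a file, so it only illustrates the calling convention and not the real parser.

```ruby
# Yields the integer codepoint from field 0 of each line, or returns an
# Enumerator over the same values when no block is given.
def each_codepoint(lines)
  return enum_for(:each_codepoint, lines) unless block_given?

  lines.each { |line| yield line.split(";", -1).first.to_i(16) }
  nil
end

lines = [
  "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
  "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;"
]

# Block form: values are yielded one at a time, and the method returns nil.
each_codepoint(lines) { |cp| puts format("U+%04X", cp) }

# Enumerator form: nothing is parsed until the Enumerator is consumed.
enum = each_codepoint(lines)
first_cp = enum.first
```

With the Enumerator form, callers can chain `first`, `take_while`, or `lazy` without reading every record up front.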
# File 'lib/ucode/parsers/unicode_data.rb', line 43

def each_record(path)
  return enum_for(:each_record, path) unless block_given?

  pending_range = nil
  each_line(path) do |line|
    begin
      fields = line.fields
      if pending_range
        unless fields[1]&.end_with?("#{LAST_MARKER}>")
          raise MalformedLineError.new(
            "expected <#{pending_range[:template]}, #{LAST_MARKER}>, " \
            "got #{fields[1].inspect}",
            context: { file: path.to_s, line: line.number }
          )
        end
        last_cp = parse_hex_cp(fields[0])
        (pending_range, last_cp).each { |cp| yield cp }
        pending_range = nil
        next
      end

      cp = parse_hex_cp(fields[0])
      name = fields[1]
      if range_start?(name)
        pending_range = {
          first_cp: cp,
          template: extract_template(name),
          general_category: fields[2],
          combining_class: fields[3].to_i,
          bidi_class: fields[4],
          bidi_mirrored: fields[9]
        }
        next
      end

      yield build_codepoint(
        cp: cp,
        name: synthesize_name(cp, name),
        general_category: fields[2],
        combining_class: fields[3].to_i,
        bidi_class: fields[4],
        decomposition_field: fields[5],
        numeric_decimal: fields[6],
        numeric_digit: fields[7],
        numeric_value: fields[8],
        bidi_mirrored: fields[9],
        unicode_1_name: fields[10],
        simple_upper_id: fields[12],
        simple_lower_id: fields[13],
        simple_title_id: fields[14]
      )
    rescue MalformedLineError => e
      e.context[:file] ||= path.to_s
      e.context[:line] ||= line.number
      raise
    end
  end
  nil
end
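The First/Last pairing in the listing above can be illustrated on its own with a much smaller state machine. This is a simplified sketch under stated assumptions: `expand_ranges` is a hypothetical name, rows are pre-split `[hex_codepoint, name]` pairs, and it yields bare integers rather than CodePoint objects.

```ruby
# Yields every codepoint described by rows, expanding First/Last pairs.
def expand_ranges(rows)
  return enum_for(:expand_ranges, rows) unless block_given?

  range_start = nil # holds the First codepoint while a range is open
  rows.each do |cp_hex, name|
    cp = cp_hex.to_i(16)
    if range_start
      # The line after a First marker must be the matching Last marker.
      unless name.end_with?(", Last>")
        raise ArgumentError, "expected a Last marker, got #{name.inspect}"
      end
      (range_start..cp).each { |c| yield c }
      range_start = nil
    elsif name.end_with?(", First>")
      range_start = cp
    else
      yield cp
    end
  end
  nil
end

rows = [
  ["0041", "LATIN CAPITAL LETTER A"],
  ["AC00", "<Hangul Syllable, First>"],
  ["D7A3", "<Hangul Syllable, Last>"]
]
codepoints = expand_ranges(rows).to_a
```

Holding only the First row in a local variable keeps the pass single-scan: a range never needs lookahead beyond the next line.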