Class: Unisec::Blocks
- Inherits:
- Object
- Defined in:
- lib/unisec/blocks.rb
Overview
Operations about Unicode blocks
Constant Summary
- UCD_BLOCKS =
UCD Blocks file location
File.join(__dir__, '../../data/Blocks.txt')
- INVALID_RANGES =
List of invalid, private, and reserved ranges. Unassigned (unallocated) ranges are calculated dynamically in list_unassigned.
[
  { range: 0xd800..0xdfff, name: 'Surrogates (invalid outside UTF-16)' },
  { range: 0xe000..0xf8ff, name: 'Private Use Area (located in BMP)' },
  { range: 0xf0000..0xfffff, name: 'Supplementary Private Use Area-A' },
  { range: 0x100000..0x10ffff, name: 'Supplementary Private Use Area-B' }
].freeze
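As an illustration of how such a table can be consulted, here is a minimal sketch that looks up which entry, if any, contains a given code point. It uses a local copy of the ranges so it runs without the gem; `INVALID_RANGES_SKETCH` and `invalid_range_name` are hypothetical names, not part of Unisec.

```ruby
# Local copy of the ranges listed above, so the sketch runs without the unisec gem.
INVALID_RANGES_SKETCH = [
  { range: 0xd800..0xdfff, name: 'Surrogates (invalid outside UTF-16)' },
  { range: 0xe000..0xf8ff, name: 'Private Use Area (located in BMP)' },
  { range: 0xf0000..0xfffff, name: 'Supplementary Private Use Area-A' },
  { range: 0x100000..0x10ffff, name: 'Supplementary Private Use Area-B' }
].freeze

# Hypothetical helper (not a Unisec method): returns the name of the
# matching entry, or nil when the code point is in none of the ranges.
def invalid_range_name(code_point)
  entry = INVALID_RANGES_SKETCH.find { |r| r[:range].include?(code_point) }
  entry && entry[:name]
end

puts invalid_range_name(0xd83d).inspect # a high surrogate
puts invalid_range_name(0x41).inspect   # 'A' is in none of the ranges, so nil
```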
Class Method Summary
-
.block(block_arg, with_count: false) ⇒ Hash|nil
Find the block including the target character or code point, or matching the provided name.
-
.block_display(block_arg, with_count: false) ⇒ Object
Display a CLI-friendly output detailing the searched block.
-
.count_char_in_block(range) ⇒ Integer
Count the number of characters allocated in a block.
-
.list(with_count: false) ⇒ Array<Hash>
List Unicode block names. ⚠️ The char count value may be wrong for CJK UNIFIED IDEOGRAPH because they are poorly described in DerivedName.txt.
-
.list_display(with_count: false) ⇒ Object
Display a CLI-friendly output listing all blocks.
-
.list_invalid_display ⇒ Object
Display a CLI-friendly output listing all invalid and unassigned ranges.
-
.list_unassigned ⇒ Array<Range>
List unassigned (unallocated) ranges.
-
.reverse(char) ⇒ String
Returns the name of the Unicode block containing the given character.
-
.reverse_display(char) ⇒ Object
Display a CLI-friendly output showing the block name for a given character.
-
.ucd_blocks_version ⇒ String
Returns the version of Unicode used in the local UCD file (data/Blocks.txt).
Class Method Details
.block(block_arg, with_count: false) ⇒ Hash|nil
Find the block including the target character or code point, or matching the provided name.
# File 'lib/unisec/blocks.rb', line 100
def self.block(block_arg, with_count: false) # rubocop:disable Metrics/AbcSize,Metrics/CyclomaticComplexity,Metrics/MethodLength,Metrics/PerceivedComplexity
  file = File.new(UCD_BLOCKS)
  found = false
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point range and the name
    blk_range, blk_name = line.split(';')
    blk_range = Unisec::Utils::String.to_range(blk_range)
    blk_name.lstrip!
    case block_arg
    when Integer # block_arg is an integer code point
      found = true if blk_range.include?(block_arg)
    when String # can be a char or block name or a string code point
      if block_arg.size == 1 # is a char (1 code point, not one grapheme)
        found = true if blk_range.include?(Utils::String.convert_to_integer(block_arg))
      elsif block_arg.start_with?('U+') # string code point
        found = true if blk_range.include?(Utils::String.convert(block_arg, :integer))
      elsif blk_name.downcase == block_arg.downcase # block name
        found = true
      end
    end
    if found
      return { range: blk_range, name: blk_name, range_size: with_count ? blk_range.size : nil,
               char_count: with_count ? count_char_in_block(blk_range) : nil }
    end
  end
  nil # not found
end
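The parsing step in the method above (split on `;`, convert the hexadecimal range, strip the name) can be sketched in isolation. This is a simplified illustration, not the library's code: `parse_hex_range` is a hypothetical stand-in for `Unisec::Utils::String.to_range`, whose exact behavior is assumed here, and the sample lines only imitate the layout of Blocks.txt.

```ruby
# Hypothetical stand-in for Unisec::Utils::String.to_range (behavior assumed):
# turns "0000..007F" into the Integer range 0..127.
def parse_hex_range(text)
  first, last = text.split('..').map { |hex| hex.to_i(16) }
  first..last
end

# Parse one line in the Blocks.txt layout; returns nil for blank lines and comments.
def parse_block_line(line)
  return nil if line.empty? || line.start_with?('#')

  range_text, name = line.split(';')
  { range: parse_hex_range(range_text), name: name.lstrip }
end

# Sample lines imitating Blocks.txt (not read from the real file).
sample = ['# sample header', '', '0000..007F; Basic Latin', '0080..00FF; Latin-1 Supplement']
blocks = sample.filter_map { |line| parse_block_line(line) }

# Look up the block containing U+00E9 by its integer code point.
hit = blocks.find { |blk| blk[:range].include?("\u00e9".ord) }
puts hit[:name]
```

With the real class, the equivalent lookup would go through `Unisec::Blocks.block`, which accepts an Integer, a single character, a `U+` string, or a block name.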
.block_display(block_arg, with_count: false) ⇒ Object
Display a CLI-friendly output detailing the searched block.
# File 'lib/unisec/blocks.rb', line 177
def self.block_display(block_arg, with_count: false)
  blk = block(block_arg, with_count: with_count)
  if blk.nil?
    puts "no block found with #{block_arg}"
  else
    display = ->(key, value) { puts Paint[key, :red, :bold] + " #{value}" }
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]))
    display.call('Name:', blk[:name])
    if with_count
      display.call('Range size:', blk[:range_size])
      display.call('Char count:', blk[:char_count])
    end
  end
  nil
end
.count_char_in_block(range) ⇒ Integer
Count the number of characters allocated in a block. ⚠️ Char count value may be wrong for CJK UNIFIED IDEOGRAPH because they are poorly described in DerivedName.txt.
# File 'lib/unisec/blocks.rb', line 66
def self.count_char_in_block(range) # rubocop:disable Metrics/AbcSize
  counter = 0
  file = File.new(Rugrep::UCD_DERIVEDNAME)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point as integer and the name
    cp_int, _name = line.split(';')
    if cp_int.include?('..') # handle ranges in DerivedName.txt
      ucd_range = Utils::String.to_range(cp_int)
      next unless range.include_range?(ucd_range)

      counter += ucd_range.size
      next
    end
    cp_int = cp_int.chomp.to_i(16)
    next unless range.include?(cp_int)

    counter += 1
    break if cp_int == range.end
  end
  counter
end
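The counting logic can be tried on a few in-memory lines. The sketch below is an approximation under stated assumptions: the sample lines only imitate the DerivedName.txt layout, `count_in_range` is a hypothetical helper, and core Ruby's `Range#cover?` (which accepts a Range since Ruby 2.6) is used in place of `include_range?`, which is not a core Range method and is presumably provided by Unisec itself.

```ruby
# Hypothetical sample imitating the DerivedName.txt layout: "code point(s) ; name".
SAMPLE_DERIVED_NAMES = [
  '# comment',
  '0041          ; LATIN CAPITAL LETTER A',
  '0042          ; LATIN CAPITAL LETTER B',
  '3400..4DBF    ; CJK UNIFIED IDEOGRAPH-*',
  '4DC0          ; HEXAGRAM FOR THE CREATIVE HEAVEN'
].freeze

# Count the code points described by `lines` that fall inside `range`.
# A range entry is counted only when it lies entirely inside `range`.
def count_in_range(lines, range)
  lines.sum do |line|
    next 0 if line.empty? || line.start_with?('#')

    field = line.split(';').first.strip
    if field.include?('..')
      first, last = field.split('..').map { |hex| hex.to_i(16) }
      sub = first..last
      range.cover?(sub) ? sub.size : 0
    else
      range.include?(field.to_i(16)) ? 1 : 0
    end
  end
end

puts count_in_range(SAMPLE_DERIVED_NAMES, 0x0041..0x005a)
puts count_in_range(SAMPLE_DERIVED_NAMES, 0x3400..0x4dbf)
puts count_in_range(SAMPLE_DERIVED_NAMES, 0x3400..0x4000)
```

The last call returns 0 because the range entry only partly overlaps the queried range. That all-or-nothing treatment of range entries is one way a count can differ from the true number of assigned characters.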
.list(with_count: false) ⇒ Array<Hash>
List Unicode block names. ⚠️ The char count value may be wrong for CJK UNIFIED IDEOGRAPH because they are poorly described in DerivedName.txt. ⚠️ Populating char_count is slow and can take a few seconds.
# File 'lib/unisec/blocks.rb', line 39
def self.list(with_count: false)
  out = []
  file = File.new(UCD_BLOCKS)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point range and the name
    blk_range, blk_name = line.split(';')
    blk_range = Unisec::Utils::String.to_range(blk_range)
    blk_name.lstrip!
    out << { range: blk_range, name: blk_name, range_size: with_count ? blk_range.size : nil,
             char_count: with_count ? count_char_in_block(blk_range) : nil }
  end
  out
end
.list_display(with_count: false) ⇒ Object
Display a CLI-friendly output listing all blocks.
# File 'lib/unisec/blocks.rb', line 159
def self.list_display(with_count: false) # rubocop:disable Metrics/AbcSize
  blocks = list(with_count: with_count)
  display = ->(key, value, just) { print Paint[key, :red, :bold] + " #{value}".ljust(just) }
  blocks.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]), 22)
    display.call('Name:', blk[:name], 50)
    if with_count
      display.call('Range size:', blk[:range_size], 8)
      display.call('Char count:', blk[:char_count], 0)
    end
    puts
  end
  nil
end
.list_invalid_display ⇒ Object
Display a CLI-friendly output listing all invalid and unassigned ranges.
# File 'lib/unisec/blocks.rb', line 194
def self.list_invalid_display # rubocop:disable Metrics/AbcSize
  display = ->(key, value, just) { print Paint[key, :red, :bold] + " #{value}".ljust(just) }
  puts '(Assigned) invalid, private, reserved ranges:'
  INVALID_RANGES.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]), 22)
    display.call('Name:', blk[:name], 50)
    puts
  end
  puts "\nUnasigned, unallocated ranges:"
  list_unassigned.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk), 22)
    puts
  end
  nil
end
.list_unassigned ⇒ Array<Range>
List unassigned (unallocated) ranges.
# File 'lib/unisec/blocks.rb', line 139
def self.list_unassigned # rubocop:disable Metrics/AbcSize
  base = (0x0000..0x10ffff)
  assigned = Unisec::Blocks.list.map { |b| b[:range] }

  unassigned = []
  cursor = base.begin

  assigned.each do |r|
    unassigned << (cursor..(r.begin - 1)) if cursor < r.begin
    cursor = r.end + 1
    break if cursor > base.end
  end

  unassigned << (cursor..base.end) if cursor <= base.end
  unassigned
end
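The cursor sweep used above is easier to follow on small numbers. The sketch below repeats the same loop on plain integer ranges; `gaps` is a hypothetical name, and, like the original, it assumes the assigned ranges are sorted and do not overlap (which appears to hold for the ranges read from Blocks.txt).

```ruby
# Return the sub-ranges of `base` not covered by `assigned`.
# Assumes `assigned` is sorted by start and contains no overlapping ranges.
def gaps(base, assigned)
  result = []
  cursor = base.begin

  assigned.each do |r|
    # Anything between the cursor and the start of this range is a gap.
    result << (cursor..(r.begin - 1)) if cursor < r.begin
    cursor = r.end + 1
    break if cursor > base.end
  end

  # Trailing gap after the last assigned range, if any.
  result << (cursor..base.end) if cursor <= base.end
  result
end

p gaps(0..20, [0..4, 5..9, 15..17]) # → [10..14, 18..20]
p gaps(0..9, [0..9])                # → []
```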
.reverse(char) ⇒ String
Returns the name of the Unicode block containing the given character.
# File 'lib/unisec/blocks.rb', line 220
def self.reverse(char)
  cp_num = TwitterCldr::Utils::CodePoints.from_string(char)
  cp = TwitterCldr::Shared::CodePoint.get(cp_num.first)
  props = cp.properties
  props.block.join
rescue NoMethodError # in case of invalid character where CodePoint.get() => nil
  ''
end
.reverse_display(char) ⇒ Object
Display a CLI-friendly output showing the block name for a given character.
# File 'lib/unisec/blocks.rb', line 233
def self.reverse_display(char)
  blk_name = reverse(char)
  if blk_name.empty?
    puts "no block found for #{char.inspect}"
  else
    puts blk_name
  end
  nil
end
.ucd_blocks_version ⇒ String
Returns the version of Unicode used in the local UCD file (data/Blocks.txt).
# File 'lib/unisec/blocks.rb', line 26
def self.ucd_blocks_version
  first_line = File.open(UCD_BLOCKS, &:readline)
  first_line.match(/-(\d+\.\d+\.\d+)\.txt/).captures.first
end
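To show what the regular expression extracts, here is a small sketch applied to a string instead of the file. The sample header is an assumption about what the first line of Blocks.txt looks like, and `version_from_header` is a hypothetical helper. Unlike the method above, which would raise NoMethodError when the pattern does not match, this variant returns nil.

```ruby
# Extract "X.Y.Z" from a header line such as "# Blocks-15.1.0.txt".
# Returns nil when the line carries no version (the original method would raise).
def version_from_header(first_line)
  match = first_line.match(/-(\d+\.\d+\.\d+)\.txt/)
  match&.captures&.first
end

puts version_from_header('# Blocks-15.1.0.txt')       # → 15.1.0
puts version_from_header('# no version here').inspect # → nil
```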