Class: Unisec::Blocks

Inherits:
Object
Defined in:
lib/unisec/blocks.rb

Overview

Operations on Unicode blocks.

Constant Summary

UCD_BLOCKS =

UCD Blocks file location

File.join(__dir__, '../../data/Blocks.txt')
INVALID_RANGES =

List of invalid, private, and reserved ranges. Unassigned (unallocated) ranges are calculated dynamically in list_unassigned.

[
  { range: 0xd800..0xdfff, name: 'Surrogates (invalid outside UTF-16)' },
  { range: 0xe000..0xf8ff, name: 'Private Use Area (located in BMP)' },
  { range: 0xf0000..0xfffff, name: 'Supplementary Private Use Area-A' },
  { range: 0x100000..0x10ffff, name: 'Supplementary Private Use Area-B' }
].freeze
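
As an illustration of how these entries can be used, here is a minimal sketch that looks up the range containing a code point. The helper names `sample_invalid_ranges` and `invalid_range_name` are hypothetical and not part of Unisec; the data is a local copy of the list above so that the snippet runs without the gem.

```ruby
# Local copy of the ranges listed above (assumption: same content as
# Unisec::Blocks::INVALID_RANGES), so the sketch runs standalone.
def sample_invalid_ranges
  [
    { range: 0xd800..0xdfff, name: 'Surrogates (invalid outside UTF-16)' },
    { range: 0xe000..0xf8ff, name: 'Private Use Area (located in BMP)' },
    { range: 0xf0000..0xfffff, name: 'Supplementary Private Use Area-A' },
    { range: 0x100000..0x10ffff, name: 'Supplementary Private Use Area-B' }
  ]
end

# Hypothetical helper: name of the invalid/private range containing the
# code point, or nil when the code point is in none of them.
def invalid_range_name(code_point)
  entry = sample_invalid_ranges.find { |r| r[:range].cover?(code_point) }
  entry && entry[:name]
end

puts invalid_range_name(0xe000).inspect
puts invalid_range_name(0x41).inspect
```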

Class Method Details

.block(block_arg, with_count: false) ⇒ Hash|nil

Find the block including the target character or code point, or matching the provided name.

Examples:

Unisec::Blocks.block(65, with_count: true) # => {range: 0..127, name: "Basic Latin", range_size: 128, char_count: 95}
Unisec::Blocks.block("U+1f4a9") # => {range: 127744..128511, name: "Miscellaneous Symbols and Pictographs", range_size: nil, char_count: nil}
Unisec::Blocks.block("\u200d", with_count: true) # => {range: 8192..8303, name: "General Punctuation", range_size: 112, char_count: 111}
Unisec::Blocks.block("javanese") # => {range: 43392..43487, name: "Javanese", range_size: nil, char_count: nil}

Parameters:

  • block_arg (Integer|String)

    Decimal code point, standardized hexadecimal code point (e.g. "U+1f4a9"), single-character string, or block name (case insensitive). A character argument must be exactly one code point, so be careful with emojis and composed or joined characters that span several code points.

  • with_count (TrueClass|FalseClass) (defaults to: false)

    calculate block's range size & char count?

Returns:

  • (Hash|nil)

    Matching block (block name, range and count) or nil if not found



# File 'lib/unisec/blocks.rb', line 100

def self.block(block_arg, with_count: false) # rubocop:disable Metrics/AbcSize,Metrics/CyclomaticComplexity,Metrics/MethodLength,Metrics/PerceivedComplexity
  file = File.new(UCD_BLOCKS)
  found = false
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point range and the name
    blk_range, blk_name = line.split(';')
    blk_range = Unisec::Utils::String.to_range(blk_range)
    blk_name.lstrip!
    case block_arg
    when Integer # block_arg is an integer code point
      found = true if blk_range.include?(block_arg)
    when String # can be a char or block name or a string code point
      if block_arg.size == 1 # is a char (1 code point, not one grapheme)
        found = true if blk_range.include?(Utils::String.convert_to_integer(block_arg))
      elsif block_arg.start_with?('U+') # string code point
        found = true if blk_range.include?(Utils::String.convert(block_arg, :integer))
      elsif blk_name.downcase == block_arg.downcase # block name
        found = true
      end
    end
    if found
      return {
        range: blk_range,
        name: blk_name,
        range_size: with_count ? blk_range.size : nil,
        char_count: with_count ? count_char_in_block(blk_range) : nil
      }
    end
  end
  nil # not found
end
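
The lookup above combines two steps: parsing a "START..END; Name" line and dispatching on the argument type. The following is a simplified, standalone sketch of that idea, assuming a tiny inline sample instead of data/Blocks.txt. The names `sample_blocks_txt`, `parse_block_line` and `find_block` are hypothetical, and core Ruby methods stand in for the `Unisec::Utils` helpers.

```ruby
# A few lines in the Blocks.txt format; a stand-in for data/Blocks.txt.
def sample_blocks_txt
  <<~TXT
    # Blocks-17.0.0.txt
    0000..007F; Basic Latin
    0080..00FF; Latin-1 Supplement
    2000..206F; General Punctuation
  TXT
end

# Hypothetical parser: "0000..007F; Basic Latin" => [0..127, "Basic Latin"]
def parse_block_line(line)
  bounds, name = line.split(';', 2)
  first, last = bounds.split('..').map { |hex| hex.to_i(16) }
  [first..last, name.strip]
end

# Hypothetical lookup mirroring the dispatch on the argument type:
# Integer => code point, 1-char String => character,
# "U+XXXX" => hexadecimal code point, anything else => block name.
def find_block(arg)
  sample_blocks_txt.each_line(chomp: true) do |line|
    next if line.empty? || line.start_with?('#')

    range, name = parse_block_line(line)
    hit = case arg
          when Integer then range.cover?(arg)
          when String
            if arg.size == 1 then range.cover?(arg.ord)
            elsif arg.start_with?('U+') then range.cover?(arg.delete_prefix('U+').to_i(16))
            else name.casecmp?(arg)
            end
          end
    return { range: range, name: name } if hit
  end
  nil # not found
end

p find_block(65)
p find_block('U+00e9')
p find_block('general punctuation')
```

Because the single-character check comes first, a one-character string is always treated as a character, never as a block name.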

.block_display(block_arg, with_count: false) ⇒ Object

Display a CLI-friendly output detailing the searched block

Parameters:

  • block_arg (Integer|String)

    Decimal code point, standardized hexadecimal code point (e.g. "U+1f4a9"), single-character string, or block name (case insensitive). A character argument must be exactly one code point, so be careful with emojis and composed or joined characters that span several code points.

  • with_count (TrueClass|FalseClass) (defaults to: false)

    calculate block's range size & char count?



# File 'lib/unisec/blocks.rb', line 177

def self.block_display(block_arg, with_count: false)
  blk = block(block_arg, with_count: with_count)
  if blk.nil?
    puts "no block found with #{block_arg}"
  else
    display = ->(key, value) { puts Paint[key, :red, :bold] + " #{value}" }
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]))
    display.call('Name:', blk[:name])
    if with_count
      display.call('Range size:', blk[:range_size])
      display.call('Char count:', blk[:char_count])
    end
  end
  nil
end

.count_char_in_block(range) ⇒ Integer

Count the number of characters allocated in a block.

⚠️ The char count may be wrong for CJK UNIFIED IDEOGRAPH blocks because they are poorly described in DerivedName.txt.

Examples:

Unisec::Blocks.count_char_in_block(0xAC00..0xD7AF) # => 11172

Parameters:

  • range (Range)

    Block code point range

Returns:

  • (Integer)

    Number of allocated (named) characters in the block



# File 'lib/unisec/blocks.rb', line 66

def self.count_char_in_block(range) # rubocop:disable Metrics/AbcSize
  counter = 0
  file = File.new(Rugrep::UCD_DERIVEDNAME)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point as integer and the name
    cp_int, _name = line.split(';')
    if cp_int.include?('..') # handle ranges in DerivedName.txt
      ucd_range = Utils::String.to_range(cp_int)
      next unless range.include_range?(ucd_range)

      counter += ucd_range.size
      next
    end
    cp_int = cp_int.chomp.to_i(16)
    next unless range.include?(cp_int)

    counter += 1
    break if cp_int == range.end
  end
  counter
end
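
To make the counting rule concrete, here is a minimal standalone sketch, assuming a small made-up excerpt in the DerivedName.txt layout rather than the real file. The names `sample_derived_names` and `count_in_range` are hypothetical, and core `Range#cover?` (Ruby 2.6+ for range arguments) stands in for the `include_range?` helper used above.

```ruby
# Made-up excerpt in the DerivedName.txt layout ("CODE ; NAME" or
# "START..END ; NAME"); not the real data file.
def sample_derived_names
  <<~TXT
    # DerivedName-17.0.0.txt
    0041          ; LATIN CAPITAL LETTER A
    0042          ; LATIN CAPITAL LETTER B
    3400..4DBF    ; CJK UNIFIED IDEOGRAPH-*
  TXT
end

# Hypothetical counter: single entries count for 1 when inside the block,
# range entries count for their size only when fully inside the block.
def count_in_range(range)
  sample_derived_names.each_line(chomp: true).sum do |line|
    next 0 if line.empty? || line.start_with?('#')

    field = line.split(';').first.strip
    if field.include?('..')
      first, last = field.split('..').map { |hex| hex.to_i(16) }
      range.cover?(first..last) ? (first..last).size : 0
    else
      range.cover?(field.to_i(16)) ? 1 : 0
    end
  end
end

p count_in_range(0x0000..0x007f)
p count_in_range(0x3400..0x4dbf)
p count_in_range(0x3400..0x3fff) # range entry only partly inside: skipped
```

The last call shows one way a count can come out lower than expected: a range entry that is not entirely contained in the requested block contributes nothing.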

.list(with_count: false) ⇒ Array<Hash>

List Unicode blocks (name and range, optionally with range size and char count).

⚠️ The char count may be wrong for CJK UNIFIED IDEOGRAPH blocks because they are poorly described in DerivedName.txt.

⚠️ Populating char_count is slow and can take a few seconds.

Examples:

Unisec::Blocks.list # => [{range: 0..127, name: "Basic Latin", range_size: nil, char_count: nil}, … ]
Unisec::Blocks.list(with_count: true) # => [{range: 0..127, name: "Basic Latin", range_size: 128, char_count: 95}, … ]

Parameters:

  • with_count (TrueClass|FalseClass) (defaults to: false)

    calculate block's range size & char count?

Returns:

  • (Array<Hash>)

    List of blocks (block name, range and count)



# File 'lib/unisec/blocks.rb', line 39

def self.list(with_count: false)
  out = []
  file = File.new(UCD_BLOCKS)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point range and the name
    blk_range, blk_name = line.split(';')
    blk_range = Unisec::Utils::String.to_range(blk_range)
    blk_name.lstrip!
    out << {
      range: blk_range,
      name: blk_name,
      range_size: with_count ? blk_range.size : nil,
      char_count: with_count ? count_char_in_block(blk_range) : nil
    }
  end
  out
end

.list_display(with_count: false) ⇒ Object

Display a CLI-friendly output listing all blocks

Parameters:

  • with_count (TrueClass|FalseClass) (defaults to: false)

    calculate block's range size & char count?



# File 'lib/unisec/blocks.rb', line 159

def self.list_display(with_count: false) # rubocop:disable Metrics/AbcSize
  blocks = list(with_count: with_count)
  display = ->(key, value, just) { print Paint[key, :red, :bold] + " #{value}".ljust(just) }
  blocks.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]), 22)
    display.call('Name:', blk[:name], 50)
    if with_count
      display.call('Range size:', blk[:range_size], 8)
      display.call('Char count:', blk[:char_count], 0)
    end
    puts
  end
  nil
end
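
The column layout relies on `String#ljust` applied to the value only, so every key starts at a fixed offset. A minimal sketch of that padding follows, without the Paint coloring. The helper names `cell` and `block_row` are hypothetical, and the "U+0000..U+007F" range text is an assumed rendering, not necessarily what `Utils::Range.range2codepoint_range` returns.

```ruby
# Hypothetical helper: key followed by the value padded to a fixed width
# (same idea as the display lambda above, minus the coloring).
def cell(key, value, width)
  key + " #{value}".ljust(width)
end

# Hypothetical row builder using the widths from list_display (22 and 50).
def block_row(range_text, name)
  cell('Range:', range_text, 22) + cell('Name:', name, 50)
end

# The range text below is an assumed format, used for illustration only.
puts block_row('U+0000..U+007F', 'Basic Latin')
puts block_row('U+0080..U+00FF', 'Latin-1 Supplement')
```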

.list_invalid_display ⇒ Object

Display a CLI-friendly output listing all invalid and unassigned ranges.



# File 'lib/unisec/blocks.rb', line 194

def self.list_invalid_display # rubocop:disable Metrics/AbcSize
  display = ->(key, value, just) { print Paint[key, :red, :bold] + " #{value}".ljust(just) }
  puts '(Assigned) invalid, private, reserved ranges:'
  INVALID_RANGES.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]), 22)
    display.call('Name:', blk[:name], 50)
    puts
  end
  puts "\nUnassigned, unallocated ranges:"
  list_unassigned.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk), 22)
    puts
  end
  nil
end

.list_unassigned ⇒ Array<Range>

List unassigned, unallocated ranges.

Examples:

Unisec::Blocks.list_unassigned # => [12256..12271, 66048..66175, …]

Returns:

  • (Array<Range>)

    List of unassigned (code-point) ranges



# File 'lib/unisec/blocks.rb', line 139

def self.list_unassigned # rubocop:disable Metrics/AbcSize
  base = (0x0000..0x10ffff)
  assigned = Unisec::Blocks.list.map { |b| b[:range] }

  unassigned = []
  cursor = base.begin

  assigned.each do |r|
    unassigned << (cursor..(r.begin - 1)) if cursor < r.begin
    cursor = r.end + 1
    break if cursor > base.end
  end

  unassigned << (cursor..base.end) if cursor <= base.end

  unassigned
end
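
The gap computation can be tried in isolation on small numbers. The sketch below applies the same cursor sweep to an arbitrary base range; the method name `gaps` is hypothetical, and it assumes the assigned ranges are sorted and non-overlapping, as block ranges read in file order are expected to be.

```ruby
# Hypothetical standalone version of the sweep: walk the assigned ranges in
# order and collect every stretch of the base range that none of them covers.
# Assumes `assigned` is sorted by start and free of overlaps.
def gaps(base, assigned)
  out = []
  cursor = base.begin

  assigned.each do |r|
    out << (cursor..(r.begin - 1)) if cursor < r.begin # hole before this range
    cursor = r.end + 1
    break if cursor > base.end
  end

  out << (cursor..base.end) if cursor <= base.end # trailing hole
  out
end

p gaps(0..20, [0..4, 8..10, 11..15]) # => [5..7, 16..20]
p gaps(0..9, [0..9])                 # => []
```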

.reverse(char) ⇒ String

Returns the name of the Unicode block containing the given character.

Examples:

Unisec::Blocks.reverse("\u200d") # => "General Punctuation"
Unisec::Blocks.reverse('A') # => "Basic Latin"
Unisec::Blocks.reverse('💩') # => "Miscellaneous Symbols and Pictographs"
Unisec::Blocks.reverse('🇫🇷') # => "Enclosed Alphanumeric Supplement" (only the first code point is kept)

Parameters:

  • char (String)

    Single character. Only the first code point is kept, so be careful with emojis and composed or joined characters that span several code points.

Returns:

  • (String)

    Block name or empty string if not found.



# File 'lib/unisec/blocks.rb', line 220

def self.reverse(char)
  cp_num = TwitterCldr::Utils::CodePoints.from_string(char)
  cp = TwitterCldr::Shared::CodePoint.get(cp_num.first)
  props = cp.properties
  props.block.join
rescue NoMethodError # in case of invalid character where CodePoint.get() => nil
  ''
end
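
The flag example is easier to follow once the string is split into code points. This sketch uses only core Ruby (`String#codepoints`) rather than TwitterCldr, and the helper name `first_code_point` is hypothetical. It shows that a flag emoji consists of two code points and that the first one lies in U+1F100..U+1F1FF, the Enclosed Alphanumeric Supplement block.

```ruby
# Hypothetical helper: first code point of a string, or nil if it is empty.
def first_code_point(str)
  str.codepoints.first
end

flag = "\u{1F1EB}\u{1F1F7}" # 🇫🇷, two regional indicator symbols

p flag.codepoints.size                          # => 2
puts format('U+%04X', first_code_point(flag))   # => U+1F1EB
p (0x1f100..0x1f1ff).cover?(first_code_point(flag)) # => true
```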

.reverse_display(char) ⇒ Object

Display a CLI-friendly output showing the block name for a given character.

Parameters:

  • char (String)

    Single character. Only the first code point is kept, so be careful with emojis and composed or joined characters that span several code points.



# File 'lib/unisec/blocks.rb', line 233

def self.reverse_display(char)
  blk_name = reverse(char)
  if blk_name.empty?
    puts "no block found for #{char.inspect}"
  else
    puts blk_name
  end
  nil
end

.ucd_blocks_version ⇒ String

Returns the version of Unicode used in the local UCD file (data/Blocks.txt).

Examples:

Unisec::Blocks.ucd_blocks_version # => "17.0.0"

Returns:

  • (String)

    Unicode version



# File 'lib/unisec/blocks.rb', line 26

def self.ucd_blocks_version
  first_line = File.open(UCD_BLOCKS, &:readline)
  first_line.match(/-(\d+\.\d+\.\d+)\.txt/).captures.first
end
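
The version is taken from the file name embedded in the first line of Blocks.txt. Below is a minimal sketch of the same regular expression applied to a sample header string, assuming the header has the form "# Blocks-X.Y.Z.txt". The method name `version_from_header` is hypothetical; unlike the method above, it returns nil instead of raising when the line does not match.

```ruby
# Hypothetical helper: extract "X.Y.Z" from a header such as
# "# Blocks-17.0.0.txt"; returns nil when the pattern is absent.
def version_from_header(first_line)
  match = first_line.match(/-(\d+\.\d+\.\d+)\.txt/)
  match && match[1]
end

p version_from_header('# Blocks-17.0.0.txt') # => "17.0.0"
p version_from_header('# no version here')   # => nil
```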