Module: Dommy::Internal::IDNA

Defined in:
lib/dommy/internal/idna.rb

Overview

IDNA ToASCII / ToUnicode for domain names. Built on `Internal::Punycode` plus the Unicode tables in `Internal::IDNAData` (generated by `script/build_idna_tables.rb` from Unicode 16.0 source files in `vendor/unicode/`).

Conforms (approximately) to UTS #46 with WHATWG URL parameters:

UseSTD3ASCIIRules       = false   (so `_` etc. are allowed)
Transitional_Processing = false   (so `ß` stays as `ß`)
CheckHyphens            = true    (strict per RFC 5891)
CheckBidi               = true    (RFC 5893)
CheckJoiners            = true    (RFC 5892 ContextJ for ZWJ/ZWNJ)

Algorithm: UTS #46 §4 (Processing). Each input goes through:

1. Map (UTS #46 mapping table)
2. Normalize (NFC)
3. Break into labels on `.`
4. ACE-decode any `xn--`-prefixed label
5. Validate (hyphen rules, leading combining marks, Bidi, ContextJ)
6. Punycode-encode non-ASCII labels
7. Length-validate the result
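Steps 2 and 3 need only the Ruby standard library, so they can be illustrated in isolation. The snippet below is a minimal sketch, not the module's code; `split_labels` is a hypothetical helper name, and it performs no mapping or validation.

```ruby
# Hypothetical helper covering steps 2 and 3 only (no mapping, no validation).
def split_labels(domain)
  # Step 2: NFC composes "e" + U+0301 COMBINING ACUTE ACCENT into U+00E9.
  normalized = domain.unicode_normalize(:nfc)
  # Step 3: the -1 limit keeps trailing empty strings, so a trailing dot
  # survives as a final empty label.
  normalized.split(".", -1)
end

p split_labels("cafe\u0301.example.")
# => ["café", "example", ""]
```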

Defined Under Namespace

Classes: Error

Constant Summary

ACE_PREFIX =
"xn--"
MAX_LABEL_OCTETS =
63
MAX_DOMAIN_OCTETS =
253
RTL_BODY_CLASSES =

Bidi classes permitted in the body of each kind of label (per RFC 5893 §2).

%i[R AL AN EN ES CS ET ON BN NSM].freeze
LTR_BODY_CLASSES =
%i[L EN ES CS ET ON BN NSM].freeze
GREEK_RANGES =

Script ranges used by RFC 5892 ContextO checks. We only need Greek / Hebrew / Hiragana / Katakana / Han, so we hardcode the block ranges instead of pulling in the full Script property table. This covers the practical cases: a code point that sits in one of these blocks without actually belonging to that script is rare, and the stricter reading of the spec still flags it correctly for IDN.

[
  # Greek and Coptic
  [0x0370, 0x03FF],
  # Greek Extended
  [0x1F00, 0x1FFF]
].freeze
HEBREW_RANGES =
[
  # Hebrew
  [0x0590, 0x05FF],
  # Alphabetic Presentation Forms — Hebrew
  [0xFB1D, 0xFB4F]
].freeze
HIRAGANA_KATAKANA_HAN_RANGES =
[
  # Hiragana
  [0x3040, 0x309F],
  # Katakana (incl. U+30FB itself)
  [0x30A0, 0x30FF],
  # Katakana Phonetic Extensions
  [0x31F0, 0x31FF],
  # CJK Unified Ideographs Extension A
  [0x3400, 0x4DBF],
  # CJK Unified Ideographs
  [0x4E00, 0x9FFF],
  # CJK Compatibility Ideographs
  [0xF900, 0xFAFF],
  # Halfwidth Katakana
  [0xFF66, 0xFF9F],
  # Kana Extended-A
  [0x1B100, 0x1B12F],
  # Small Kana Extension
  [0x1B130, 0x1B16F],
  # CJK Ext B
  [0x20000, 0x2A6DF],
  # CJK Ext C–G
  [0x2A700, 0x2EBEF],
  # CJK Compatibility Supplement
  [0x2F800, 0x2FA1F]
].freeze

Class Method Summary

Class Method Details

.ace_decode(label) ⇒ Object

Step 4: ACE decode if prefixed



# File 'lib/dommy/internal/idna.rb', line 159

def self.ace_decode(label)
  return label unless label.downcase.start_with?(ACE_PREFIX)
  return "" if label.length == ACE_PREFIX.length

  Punycode.decode(label[ACE_PREFIX.length..])
end
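`ace_decode` hands the part after `xn--` to `Punycode.decode`, whose source is not shown on this page. As a rough illustration of what that call has to do, here is a stand-alone decoder sketch following RFC 3492. The module name `PunycodeDecodeSketch` and its helpers are hypothetical and are not the library's implementation; overflow checks and validation of the basic prefix are omitted.

```ruby
# Hypothetical stand-alone decoder following RFC 3492; not Internal::Punycode.
module PunycodeDecodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128

  module_function

  # Bias adaptation (RFC 3492 section 6.1).
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
  end

  # Digit values: a-z (either case) => 0..25, 0-9 => 26..35.
  def digit_value(char)
    case char
    when "a".."z" then char.ord - 97
    when "A".."Z" then char.ord - 65
    when "0".."9" then char.ord - 22
    else raise ArgumentError, "invalid Punycode digit: #{char.inspect}"
    end
  end

  def decode(input)
    output = []
    pos = 0
    # Everything before the last "-" is copied through as basic code points.
    if (last_delim = input.rindex("-"))
      output = input[0...last_delim].codepoints
      pos = last_delim + 1
    end

    n = INITIAL_N
    i = 0
    bias = INITIAL_BIAS
    while pos < input.length
      old_i = i
      w = 1
      k = BASE
      # Read one variable-length integer (the delta).
      loop do
        raise ArgumentError, "truncated Punycode input" if pos >= input.length

        digit = digit_value(input[pos])
        pos += 1
        i += digit * w
        t = (k - bias).clamp(TMIN, TMAX)
        break if digit < t

        w *= BASE - t
        k += BASE
      end
      bias = adapt(i - old_i, output.length + 1, old_i.zero?)
      n += i / (output.length + 1)
      i %= output.length + 1
      output.insert(i, n) # place code point n at position i
      i += 1
    end
    output.pack("U*")
  end
end

label = "xn--bcher-kva"
puts PunycodeDecodeSketch.decode(label[4..]) # prints "bücher"
```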

.bidi_class_of(cp) ⇒ Object



# File 'lib/dommy/internal/idna.rb', line 432

def self.bidi_class_of(cp)
  row = IDNAData.lookup(IDNAData::BIDI_CLASS, cp)
  row ? row[2] : :L
end

.bidi_label?(label) ⇒ Boolean

Bidi (RFC 5893)

Returns:

  • (Boolean)


# File 'lib/dommy/internal/idna.rb', line 270

def self.bidi_label?(label)
  label.each_codepoint.any? do |cp|
    %i[R AL AN].include?(bidi_class_of(cp))
  end
end

.check_bidi(label) ⇒ Object



# File 'lib/dommy/internal/idna.rb', line 276

def self.check_bidi(label)
  cps = label.codepoints
  return if cps.empty?

  classes = cps.map { |cp| bidi_class_of(cp) }
  first = classes.first
  last_non_nsm = classes.reverse.find { |c| c != :NSM }

  case first
  when :R, :AL
    classes.each do |c|
      next if RTL_BODY_CLASSES.include?(c)

      raise Error, "Bidi rule 2 violation: class #{c} in RTL label"
    end

    unless %i[R AL EN AN].include?(last_non_nsm)
      raise Error, "Bidi rule 3 violation: RTL label trailing class #{last_non_nsm}"
    end

    if classes.include?(:EN) && classes.include?(:AN)
      raise Error, "Bidi rule 4 violation: EN and AN both present"
    end

  when :L
    classes.each do |c|
      next if LTR_BODY_CLASSES.include?(c)

      raise Error, "Bidi rule 5 violation: class #{c} in LTR label"
    end

    unless %i[L EN].include?(last_non_nsm)
      raise Error, "Bidi rule 6 violation: LTR label trailing class #{last_non_nsm}"
    end
  else
    raise Error, "Bidi rule 1 violation: label starts with #{first}"
  end
end
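To see which labels trip which rule, the same checks can be run against a toy classifier. This is a sketch under stated assumptions: `toy_bidi_class` is a hypothetical stand-in for the generated `BIDI_CLASS` table and knows only a handful of code points, and `toy_bidi_violation` returns the violated rule number instead of raising.

```ruby
# Hypothetical mini-classifier; the real lookup uses the generated Unicode table.
def toy_bidi_class(cp)
  case cp
  when 0x05D0..0x05EA then :R   # Hebrew letters
  when 0x0660..0x0669 then :AN  # Arabic-Indic digits
  when 0x30..0x39 then :EN      # ASCII digits
  when 0x2D then :ES            # hyphen-minus
  else :L                       # toy default
  end
end

RTL_BODY = %i[R AL AN EN ES CS ET ON BN NSM].freeze
LTR_BODY = %i[L EN ES CS ET ON BN NSM].freeze

# Returns the number of the first violated Bidi rule, or nil if the label passes.
def toy_bidi_violation(label)
  classes = label.codepoints.map { |cp| toy_bidi_class(cp) }
  return nil if classes.empty?

  last = classes.reverse.find { |c| c != :NSM }
  case classes.first
  when :R, :AL
    return 2 unless classes.all? { |c| RTL_BODY.include?(c) }
    return 3 unless %i[R AL EN AN].include?(last)
    return 4 if classes.include?(:EN) && classes.include?(:AN)
  when :L
    return 5 unless classes.all? { |c| LTR_BODY.include?(c) }
    return 6 unless %i[L EN].include?(last)
  else
    return 1
  end
  nil
end

p toy_bidi_violation("\u05D0\u05D1") # => nil (pure Hebrew label)
p toy_bidi_violation("1a")           # => 1 (starts with a digit)
p toy_bidi_violation("\u05D0a")      # => 2 (Latin letter inside an RTL label)
```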

.check_contextj(label) ⇒ Object

ContextJ (RFC 5892)



# File 'lib/dommy/internal/idna.rb', line 317

def self.check_contextj(label)
  cps = label.codepoints
  cps.each_with_index do |cp, i|
    case cp
    # ZWNJ
    when 0x200C
      next if zwnj_allowed?(cps, i)

      raise Error, "ZWNJ in invalid context"
    # ZWJ
    when 0x200D
      next if zwj_allowed?(cps, i)

      raise Error, "ZWJ in invalid context"
    end
  end
end

.check_contexto(label) ⇒ Object

ContextO (RFC 5892 §4)

Seven position-sensitive rules for code points whose validity depends on their neighbors or on script co-occurrence within the label.



# File 'lib/dommy/internal/idna.rb', line 370

def self.check_contexto(label)
  cps = label.codepoints
  cps.each_with_index do |cp, i|
    case cp
    when 0x00B7
      # §4.1 MIDDLE DOT — allowed only between two `l` characters
      # (Catalan `l·l` ligature).
      unless i.positive? && cps[i - 1] == 0x006C && cps[i + 1] == 0x006C
        raise Error, "U+00B7 MIDDLE DOT requires surrounding 'l' characters"
      end

    when 0x0375
      # §4.2 GREEK LOWER NUMERAL SIGN — next char must be Greek.
      unless in_ranges?(cps[i + 1], GREEK_RANGES)
        raise Error, "U+0375 must precede a Greek-script character"
      end

    when 0x05F3, 0x05F4
      # §4.3, §4.4 HEBREW GERESH / GERSHAYIM — previous char
      # must be Hebrew.
      unless i.positive? && in_ranges?(cps[i - 1], HEBREW_RANGES)
        raise(
          Error,
          "U+#{cp.to_s(16).upcase} must follow a Hebrew-script character"
        )
      end

    when 0x30FB
      # §4.5 KATAKANA MIDDLE DOT — label must contain at least
      # one Hiragana/Katakana/Han character. U+30FB itself has
      # Script=Common, not Katakana — only its block falls in
      # the Katakana block, so we exclude it from the cohort.
      companions = cps.each_with_index.reject { |c, j| j == i || c == 0x30FB }.map(&:first)
      unless companions.any? { |c| in_ranges?(c, HIRAGANA_KATAKANA_HAN_RANGES) }
        raise(
          Error,
          "U+30FB requires another Hiragana/Katakana/Han character in the label"
        )
      end

    when 0x0660..0x0669
      # §4.6 Arabic-Indic Digits cannot mix with Extended
      # Arabic-Indic Digits in the same label.
      if cps.any? { |c| (0x06F0..0x06F9).cover?(c) }
        raise Error, "Arabic-Indic digit forbidden alongside Extended Arabic-Indic digit"
      end

    when 0x06F0..0x06F9
      # §4.7 symmetric to §4.6.
      if cps.any? { |c| (0x0660..0x0669).cover?(c) }
        raise Error, "Extended Arabic-Indic digit forbidden alongside Arabic-Indic digit"
      end
    end
  end
end
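The neighbor-based rules read `cps[i - 1]` and `cps[i + 1]`. One Ruby detail matters here: an index of -1 refers to the last element, so a "previous neighbor" lookup at position 0 needs a guard if wraparound is not intended. The helper below is a hypothetical sketch of such a guard, not part of the module.

```ruby
# Hypothetical helper: neighbor lookup that returns nil at the edges
# instead of wrapping around to the other end of the array.
def neighbor(cps, index)
  return nil if index.negative? || index >= cps.length

  cps[index]
end

cps = "\u00B7l".codepoints # MIDDLE DOT followed by "l"

p cps[0 - 1]           # => 108 (plain indexing wraps to the trailing "l")
p neighbor(cps, 0 - 1) # => nil (no previous neighbor at position 0)
```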

.encode_label(label) ⇒ Object

Step 6: encode



# File 'lib/dommy/internal/idna.rb', line 207

def self.encode_label(label)
  return label if label.empty?
  return label if label.ascii_only?

  ACE_PREFIX + Punycode.encode(label)
end
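For non-ASCII labels the work is delegated to `Punycode.encode`, which is not shown on this page. The following is a minimal stand-alone sketch of the RFC 3492 encoding procedure, to illustrate what ends up after the `xn--` prefix. `PunycodeEncodeSketch` is a hypothetical name, not the library's module, and overflow checks are omitted.

```ruby
# Hypothetical stand-alone encoder following RFC 3492; not Internal::Punycode.
module PunycodeEncodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128
  DIGITS = (("a".."z").to_a + ("0".."9").to_a).freeze

  module_function

  # Bias adaptation (RFC 3492 section 6.1).
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
  end

  def encode(input)
    cps = input.codepoints
    # Basic (ASCII) code points are copied through first, then a delimiter.
    output = cps.select { |c| c < INITIAL_N }.pack("U*")
    basic_count = output.length
    handled = basic_count
    output << "-" if basic_count.positive?

    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    while handled < cps.length
      # Next code point to insert: the smallest one not yet handled.
      m = cps.select { |c| c >= n }.min
      delta += (m - n) * (handled + 1)
      n = m
      cps.each do |c|
        delta += 1 if c < n
        next unless c == n

        # Emit delta as a variable-length integer.
        q = delta
        k = BASE
        loop do
          t = (k - bias).clamp(TMIN, TMAX)
          break if q < t

          output << DIGITS[t + (q - t) % (BASE - t)]
          q = (q - t) / (BASE - t)
          k += BASE
        end
        output << DIGITS[q]
        bias = adapt(delta, handled + 1, handled == basic_count)
        delta = 0
        handled += 1
      end
      delta += 1
      n += 1
    end
    output
  end
end

puts "xn--" + PunycodeEncodeSketch.encode("bücher") # prints "xn--bcher-kva"
```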

.in_ranges?(cp, ranges) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/dommy/internal/idna.rb', line 426

def self.in_ranges?(cp, ranges)
  return false if cp.nil?

  ranges.any? { |(lo, hi)| cp >= lo && cp <= hi }
end
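The `nil` guard matters because callers pass `cps[i + 1]`, which is `nil` past the end of the label. A small stand-alone sketch of the same range test, using hypothetical names (`GREEK_BLOCKS`, `in_block_ranges?`) so it runs without the module:

```ruby
# Hypothetical copy of the Greek block ranges for illustration.
GREEK_BLOCKS = [[0x0370, 0x03FF], [0x1F00, 0x1FFF]].freeze

# True when cp falls inside any inclusive [lo, hi] pair; nil is never in range.
def in_block_ranges?(cp, ranges)
  return false if cp.nil?

  ranges.any? { |(lo, hi)| cp.between?(lo, hi) }
end

cps = "\u03B1".codepoints # GREEK SMALL LETTER ALPHA

p in_block_ranges?(cps[0], GREEK_BLOCKS)  # => true
p in_block_ranges?("a".ord, GREEK_BLOCKS) # => false
p in_block_ranges?(cps[1], GREEK_BLOCKS)  # => false (cps[1] is nil)
```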

.joining_type_of(cp) ⇒ Object



# File 'lib/dommy/internal/idna.rb', line 437

def self.joining_type_of(cp)
  row = IDNAData.lookup(IDNAData::JOINING_TYPE, cp)
  row ? row[2] : :U
end

.scan_joining(cps, start, step) ⇒ Object

Walk in `step` direction from `start`, skipping Joining_Type=T, and return the first non-T joining type encountered (or nil at the edge).



# File 'lib/dommy/internal/idna.rb', line 353

def self.scan_joining(cps, start, step)
  i = start
  while i >= 0 && i < cps.length
    jt = joining_type_of(cps[i])
    return jt unless jt == :T

    i += step
  end

  nil
end
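A worked example may help. The sketch below repeats the walk over a toy joining-type table; `TOY_JOINING` and `toy_scan` are hypothetical names, and the table lists only three Arabic code points instead of the generated `JOINING_TYPE` data.

```ruby
# Hypothetical three-entry table: BEH is dual-joining, ALEF is right-joining,
# FATHA is a transparent combining mark. Everything else counts as :U here.
TOY_JOINING = { 0x0628 => :D, 0x0627 => :R, 0x064E => :T }.freeze

def toy_scan(cps, start, step)
  i = start
  while i >= 0 && i < cps.length
    jt = TOY_JOINING.fetch(cps[i], :U)
    return jt unless jt == :T # transparent marks are skipped

    i += step
  end
  nil # ran off the edge of the label
end

# BEH, FATHA, ZWNJ, ALEF: the ZWNJ sits at index 2.
cps = [0x0628, 0x064E, 0x200C, 0x0627]

left = toy_scan(cps, 1, -1) # skips FATHA, finds BEH
right = toy_scan(cps, 3, 1) # finds ALEF
p [left, right]             # => [:D, :R]
```

With `left` in `{L, D}` and `right` in `{R, D}`, this ZWNJ satisfies the joining-context rule used by `zwnj_allowed?`.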

.to_ascii(domain) ⇒ Object

`domain` → ASCII-only form. Returns nil for nil input.



# File 'lib/dommy/internal/idna.rb', line 90

def self.to_ascii(domain)
  return domain if domain.nil?

  mapped = uts46_map(domain.to_s)
  normalized = mapped.unicode_normalize(:nfc)
  labels = normalized.split(".", -1)

  validate_no_empty_intermediate(labels)
  bidi_domain = labels.any? { |l| bidi_label?(l) }

  encoded = labels.map do |label|
    ace_encoded = label.downcase.start_with?(ACE_PREFIX)
    decoded = ace_decode(label)

    # A-labels carry an extra invariant: the decoded U-label
    # must itself be valid IDNA (no mapped / ignored /
    # disallowed code points), and re-encoding it must produce
    # the original A-label modulo case. UTS #46 §4 step 4 / RFC
    # 5891 §4.2.
    if ace_encoded
      validate_decoded_u_label(decoded)
      validate_a_label_roundtrip(label, decoded)
    end

    validate_label(decoded, bidi_domain: bidi_domain)
    encode_label(decoded)
  end

  encoded.each { |label| validate_a_label_form(label) }
  result = encoded.join(".")
  validate_total_length(result)
  result
end

.to_unicode(domain) ⇒ Object

Inverse: any `xn--`-prefixed label is Punycode-decoded back to Unicode. ASCII labels pass through unchanged. Returns nil for nil input.



# File 'lib/dommy/internal/idna.rb', line 126

def self.to_unicode(domain)
  return domain if domain.nil?

  labels = domain.to_s.split(".", -1)
  labels.map { |label| ace_decode(label) }.join(".")
end

.uts46_map(input) ⇒ Object

UTS #46 step 1: map



# File 'lib/dommy/internal/idna.rb', line 135

def self.uts46_map(input)
  out = +""
  input.each_codepoint do |cp|
    row = IDNAData.lookup(IDNAData::IDNA_MAPPING, cp)
    status = row ? row[2] : :disallowed

    case status
    when :valid
      out << [cp].pack("U*")
    when :ignored
      # drop
    when :mapped
      mapping = row[3]
      out << (mapping || [cp].pack("U*"))
    when :disallowed
      raise Error, "disallowed code point: U+#{cp.to_s(16).upcase}"
    end
  end

  out
end

.validate_a_label_form(label) ⇒ Object



# File 'lib/dommy/internal/idna.rb', line 214

def self.validate_a_label_form(label)
  if label.bytesize > MAX_LABEL_OCTETS
    raise Error, "label exceeds 63 octets: #{label.inspect}"
  end
end

.validate_a_label_roundtrip(a_label, u_label) ⇒ Object

Round-trip invariant: re-encoding the U-label must produce the original A-label (case-insensitively). Catches malformed `xn--` inputs whose Punycode decodes but does not re-encode to the same label.

Raises:

  • (Error)



# File 'lib/dommy/internal/idna.rb', line 251

def self.validate_a_label_roundtrip(a_label, u_label)
  re_encoded = u_label.ascii_only? ? u_label : ACE_PREFIX + Punycode.encode(u_label)
  return if re_encoded.downcase == a_label.downcase

  raise(
    Error,
    "A-label fails round-trip: #{a_label.inspect} re-encodes to #{re_encoded.inspect}"
  )
end

.validate_decoded_u_label(label) ⇒ Object

After ACE-decoding an A-label, every code point in the resulting U-label must itself be IDNA `:valid`; `:mapped`, `:ignored`, or `:disallowed` are not allowed at this stage.



# File 'lib/dommy/internal/idna.rb', line 235

def self.validate_decoded_u_label(label)
  label.each_codepoint do |cp|
    row = IDNAData.lookup(IDNAData::IDNA_MAPPING, cp)
    status = row ? row[2] : :disallowed
    next if status == :valid

    raise(
      Error,
      "A-label decodes to invalid code point U+#{cp.to_s(16).upcase} (status #{status})"
    )
  end
end

.validate_hyphens(label) ⇒ Object



# File 'lib/dommy/internal/idna.rb', line 178

def self.validate_hyphens(label)
  if label.start_with?("-")
    raise Error, "label starts with hyphen: #{label.inspect}"
  end

  if label.end_with?("-")
    raise Error, "label ends with hyphen: #{label.inspect}"
  end

  if label.length >= 4 &&
      label[2] == "-" &&
      label[3] == "-" &&
      !label.downcase.start_with?(ACE_PREFIX)
    raise Error, "label has reserved hyphens at positions 3-4: #{label.inspect}"
  end
end
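The three hyphen checks are easiest to see on concrete labels. The sketch below mirrors them in a hypothetical predicate, `hyphen_problem`, that returns a symbol instead of raising; it is an illustration, not the module's API.

```ruby
# Hypothetical predicate mirroring the three hyphen checks.
# Returns :leading, :trailing, :reserved, or nil when the label passes.
def hyphen_problem(label)
  return :leading if label.start_with?("-")
  return :trailing if label.end_with?("-")
  # "--" in positions 3-4 is reserved unless the label carries the ACE prefix.
  return :reserved if label[2, 2] == "--" && !label.downcase.start_with?("xn--")

  nil
end

p hyphen_problem("-abc")          # => :leading
p hyphen_problem("abc-")          # => :trailing
p hyphen_problem("ab--cd")        # => :reserved
p hyphen_problem("xn--bcher-kva") # => nil
```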

.validate_label(label, bidi_domain:) ⇒ Object

Step 5: validate per-label



# File 'lib/dommy/internal/idna.rb', line 168

def self.validate_label(label, bidi_domain:)
  return if label.empty?

  validate_hyphens(label)
  validate_no_leading_combining_mark(label)
  check_contextj(label)
  check_contexto(label)
  check_bidi(label) if bidi_domain
end

.validate_no_empty_intermediate(labels) ⇒ Object

Per RFC 5891 §4.2.3 a non-final label must be non-empty. `example.test.` (trailing dot) parses as `["example", "test", ""]`; the trailing empty label is OK, but any other empty label (e.g. `a..b` → `["a", "", "b"]`) is invalid.



# File 'lib/dommy/internal/idna.rb', line 224

def self.validate_no_empty_intermediate(labels)
  labels[0...-1].each_with_index do |label, idx|
    next unless label.empty?

    raise Error, "empty label at position #{idx}"
  end
end
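The distinction between a trailing empty label and an intermediate one can be checked with plain `String#split`. A minimal sketch, using a hypothetical helper `first_empty_intermediate` that reports the offending position instead of raising:

```ruby
# Hypothetical helper: index of the first empty non-final label, or nil.
def first_empty_intermediate(labels)
  labels[0...-1].index(&:empty?)
end

trailing_dot = "example.test.".split(".", -1) # ["example", "test", ""]
double_dot = "a..b".split(".", -1)            # ["a", "", "b"]

p first_empty_intermediate(trailing_dot) # => nil
p first_empty_intermediate(double_dot)   # => 1
```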

.validate_no_leading_combining_mark(label) ⇒ Object

Raises:

  • (Error)



# File 'lib/dommy/internal/idna.rb', line 195

def self.validate_no_leading_combining_mark(label)
  first_cp = label.codepoints.first
  return unless first_cp

  bidi_class = bidi_class_of(first_cp)
  return unless bidi_class == :NSM

  raise Error, "label starts with combining mark: #{label.inspect}"
end

.validate_total_length(domain) ⇒ Object



# File 'lib/dommy/internal/idna.rb', line 261

def self.validate_total_length(domain)
  measured = domain.end_with?(".") ? domain[0...-1] : domain
  if measured.bytesize > MAX_DOMAIN_OCTETS
    raise Error, "domain exceeds 253 octets: #{measured.bytesize} octets"
  end
end
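The 253-octet limit is measured without a single trailing root dot. The sketch below shows that measurement on boundary-length inputs; `measured_octets`, `within_domain_limit?`, and `MAX_DOMAIN_OCTETS_SKETCH` are hypothetical names chosen for the example.

```ruby
MAX_DOMAIN_OCTETS_SKETCH = 253

# Byte length of the domain, ignoring one trailing root dot if present.
def measured_octets(domain)
  domain.delete_suffix(".").bytesize
end

def within_domain_limit?(domain)
  measured_octets(domain) <= MAX_DOMAIN_OCTETS_SKETCH
end

# Three 63-octet labels plus one 61-octet label and three dots: 253 octets.
at_limit = ([("a" * 63)] * 3 + ["b" * 61]).join(".")

p measured_octets(at_limit)             # => 253
p within_domain_limit?(at_limit + ".")  # => true (root dot is not counted)
p within_domain_limit?("c" + at_limit)  # => false (254 octets)
```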

.zwj_allowed?(cps, idx) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/dommy/internal/idna.rb', line 345

def self.zwj_allowed?(cps, idx)
  prev = idx.positive? ? cps[idx - 1] : nil
  prev && IDNAData::VIRAMA.include?(prev)
end

.zwnj_allowed?(cps, idx) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/dommy/internal/idna.rb', line 335

def self.zwnj_allowed?(cps, idx)
  prev = idx.positive? ? cps[idx - 1] : nil
  return true if prev && IDNAData::VIRAMA.include?(prev)

  # Or: (Joining_Type:L|D)(Joining_Type:T)* . (Joining_Type:T)*(Joining_Type:R|D)
  left = scan_joining(cps, idx - 1, -1)
  right = scan_joining(cps, idx + 1, 1)
  %i[L D].include?(left) && %i[R D].include?(right)
end