Module: Scrapetor::Encoding

Defined in:
lib/scrapetor/encoding.rb

Overview

Encoding detection + UTF-8 normalization.

The native streaming engine treats the input as a byte stream and tags output strings as UTF-8. To make that honest, we transcode non-UTF-8 input to UTF-8 in Ruby before handing it to C, using a cascade modeled on the one the HTML5 spec describes:

1. BOM: UTF-8, UTF-16 BE/LE, or UTF-32 BE/LE
2. <meta charset=...> in the first 1024 bytes (SNIFF_BYTES)
3. <meta http-equiv="Content-Type" content="...; charset=..."> in the same prefix
4. Fall back to UTF-8

If the detected encoding is UTF-8 (including aliases such as utf8), we leave valid bytes alone and only drop invalid sequences. Otherwise we transcode with `invalid: :replace, undef: :replace, replace: ""` so a single bad byte doesn't poison the whole document.
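The effect of those options can be seen with plain Ruby. The following is a minimal sketch, not Scrapetor code: `transcode_lossy` is a hypothetical helper introduced only for illustration, and the sample bytes are made up.

```ruby
# Sketch only: `transcode_lossy` is a hypothetical helper, not Scrapetor API.
# It tags the raw bytes with their real encoding, then converts to UTF-8,
# dropping anything that cannot be converted instead of raising.
def transcode_lossy(raw, source_encoding)
  raw.dup
     .force_encoding(source_encoding)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
end

# 0xE9 is "é" in Windows-1252 but an invalid lone byte in UTF-8.
raw = "caf\xE9".b

puts raw.dup.force_encoding(Encoding::UTF_8).valid_encoding? # prints "false"
puts transcode_lossy(raw, "WINDOWS-1252")                    # prints "café"
```

Merely tagging the bytes as UTF-8 would leave an invalid string, which is why the input is transcoded from its detected encoding instead.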

Constant Summary

META_CHARSET_RE =
/<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-:]+)/i.freeze
META_HTTP_EQUIV_RE =
/<meta[^>]+http-equiv\s*=\s*["']?content-type["']?[^>]+content\s*=\s*["'][^"'>]*charset=([A-Za-z0-9_\-:]+)/i.freeze
SNIFF_BYTES =
1024
BOM_UTF8 =

The UTF-8 byte order mark that to_utf8 strips. to_utf8 is a best-effort transcode of `bytes` to a UTF-8 String: it strips a leading UTF-8 BOM and never raises; invalid sequences become "" (dropped).

"\xEF\xBB\xBF".b.freeze
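To see what the two patterns capture, here is a small sketch that copies them into a hypothetical `SniffDemo` module so it runs on its own. The module name, the `charset_label` helper, and the sample markup are assumptions for illustration.

```ruby
# Sketch only: SniffDemo and charset_label are hypothetical names.
# The two regexes are copied verbatim from the constants listed above.
module SniffDemo
  META_CHARSET_RE = /<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-:]+)/i
  META_HTTP_EQUIV_RE = /<meta[^>]+http-equiv\s*=\s*["']?content-type["']?[^>]+content\s*=\s*["'][^"'>]*charset=([A-Za-z0-9_\-:]+)/i

  # Returns the raw charset label, or nil when neither pattern matches.
  def self.charset_label(html)
    m = html.match(META_CHARSET_RE) || html.match(META_HTTP_EQUIV_RE)
    m && m[1]
  end
end

puts SniffDemo.charset_label('<meta charset="Shift_JIS">')
# prints "Shift_JIS"
puts SniffDemo.charset_label('<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">')
# prints "windows-1252"
p SniffDemo.charset_label("<title>no declaration</title>")
# prints nil
```

In the second sample the shorter pattern already matches, because `[^>]+` can span the `http-equiv` and `content` attributes before reaching `charset=`. Both patterns capture the same label there.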

Class Method Summary

Class Method Details

.detect(bytes) ⇒ Object



# File 'lib/scrapetor/encoding.rb', line 24

def self.detect(bytes)
  return "UTF-8" if bytes.nil? || bytes.empty?

  # 1. BOM sniffing on the first four bytes, compared as raw binary.
  #    UTF-32LE is tested before UTF-16LE because its BOM begins with FF FE.
  head = (bytes.byteslice(0, 4) || "").dup.force_encoding(::Encoding::ASCII_8BIT)
  return "UTF-8"     if head.start_with?("\xEF\xBB\xBF".b)
  return "UTF-32LE"  if head.bytesize >= 4 && head.start_with?("\xFF\xFE\x00\x00".b)
  return "UTF-32BE"  if head.bytesize >= 4 && head.start_with?("\x00\x00\xFE\xFF".b)
  return "UTF-16LE"  if head.bytesize >= 2 && head.byteslice(0, 2) == "\xFF\xFE".b
  return "UTF-16BE"  if head.bytesize >= 2 && head.byteslice(0, 2) == "\xFE\xFF".b

  # 2-3. Look for a <meta> charset declaration in the first SNIFF_BYTES bytes.
  prefix = (bytes.byteslice(0, SNIFF_BYTES) || "").dup.force_encoding(::Encoding::ASCII_8BIT)
  if (m = prefix.match(META_CHARSET_RE))
    return normalize(m[1])
  end
  if (m = prefix.match(META_HTTP_EQUIV_RE))
    return normalize(m[1])
  end

  # 4. No BOM and no declaration: assume UTF-8.
  "UTF-8"
end
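The order of the BOM checks matters because the UTF-32LE signature begins with the same two bytes as the UTF-16LE one. A minimal table-driven sketch of that ordering follows; `BomDemo` and `sniff_bom` are hypothetical names, not part of the module.

```ruby
# Sketch only: BomDemo and sniff_bom are hypothetical, for illustration.
module BomDemo
  # Four-byte signatures come first so that FF FE 00 00 is not
  # mistaken for the two-byte UTF-16LE signature FF FE.
  BOM_TABLE = [
    ["\xFF\xFE\x00\x00".b, "UTF-32LE"],
    ["\x00\x00\xFE\xFF".b, "UTF-32BE"],
    ["\xEF\xBB\xBF".b,     "UTF-8"],
    ["\xFF\xFE".b,         "UTF-16LE"],
    ["\xFE\xFF".b,         "UTF-16BE"]
  ].freeze

  # Returns the encoding name for a leading BOM, or nil if there is none.
  def self.sniff_bom(bytes)
    head = bytes.byteslice(0, 4).to_s.b
    hit = BOM_TABLE.find { |bom, _name| head.start_with?(bom) }
    hit && hit[1]
  end
end

puts BomDemo.sniff_bom("\xFF\xFE\x00\x00A\x00\x00\x00".b) # prints "UTF-32LE"
puts BomDemo.sniff_bom("\xFF\xFEh\x00i\x00".b)            # prints "UTF-16LE"
p BomDemo.sniff_bom("<html>")                             # prints nil
```

One consequence of this ordering, here and in `detect`, is that UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE. That input is unusual for HTML.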

.normalize(name) ⇒ Object



# File 'lib/scrapetor/encoding.rb', line 42

def self.normalize(name)
  # Compare on an uppercase, alphanumeric-only key so that
  # "utf-8", "UTF_8", and "utf8" all hit the same branch.
  n = name.to_s.upcase.gsub(/[^A-Z0-9]/, "")
  case n
  when "UTF8", "UTF8N"                                then "UTF-8"
  when "LATIN1", "ISO88591", "WINDOWS1252", "WIN1252", "CP1252"
    "WINDOWS-1252"
  when "SHIFTJIS", "SJIS"                             then "Shift_JIS"
  when "EUCJP"                                        then "EUC-JP"
  when "GBK", "GB2312", "CP936"                       then "GBK"
  when "BIG5"                                         then "Big5"
  when "UTF16", "UTF16LE"                             then "UTF-16LE"
  when "UTF16BE"                                      then "UTF-16BE"
  when "USASCII", "ASCII"                             then "US-ASCII"
  # Unknown label: pass it through uppercased; to_utf8 rescues if Ruby rejects it.
  else name.to_s.upcase
  end
end
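The source does not say why LATIN1 and ISO88591 are folded into WINDOWS-1252. One plausible reason is that the WHATWG Encoding Standard treats those labels the same way, since pages declaring ISO-8859-1 often contain Windows-1252 punctuation. The sketch below shows the difference for a single byte and checks that the canonical names resolve in Ruby. `decode_byte` and `CANONICAL_NAMES` are hypothetical names used only here.

```ruby
# Sketch only: decode_byte and CANONICAL_NAMES are hypothetical.
# The list mirrors the names returned by the `case` branches above.
CANONICAL_NAMES = %w[
  UTF-8 WINDOWS-1252 Shift_JIS EUC-JP GBK Big5 UTF-16LE UTF-16BE US-ASCII
].freeze

# Decodes raw bytes under the given encoding label and returns UTF-8.
def decode_byte(raw, label)
  raw.dup.force_encoding(label).encode(Encoding::UTF_8)
end

# Each canonical name should resolve on a standard CRuby build;
# Encoding.find raises ArgumentError for a name it does not know.
CANONICAL_NAMES.each { |name| Encoding.find(name) }

# Byte 0x93 is a left curly quote in Windows-1252 ...
p decode_byte("\x93".b, "WINDOWS-1252") == "\u201C" # prints true
# ... but an invisible C1 control character in ISO-8859-1.
p decode_byte("\x93".b, "ISO-8859-1") == "\u0093"   # prints true
```

A label that falls through to the `else` branch and is unknown to Ruby raises `ArgumentError` in `force_encoding`, which is the case `to_utf8` rescues.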

.to_utf8(bytes) ⇒ Object



# File 'lib/scrapetor/encoding.rb', line 63

def self.to_utf8(bytes)
  # Work on a copy so the caller's string is never re-tagged in place.
  s = bytes.is_a?(String) ? bytes.dup : bytes.to_s
  enc = detect(s)
  s.force_encoding(::Encoding::ASCII_8BIT)
  # Strip UTF-8 BOM if present
  if s.bytesize >= 3 && s.byteslice(0, 3) == BOM_UTF8
    s = s.byteslice(3, s.bytesize - 3) || ""
  end
  if enc.casecmp("UTF-8").zero?
    # Already UTF-8: return as-is when valid, otherwise drop the invalid bytes.
    s.force_encoding(::Encoding::UTF_8)
    return s if s.valid_encoding?
    return s.encode(::Encoding::UTF_8, ::Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
  end
  begin
    # Tag the bytes with the detected encoding, then transcode to UTF-8.
    s.force_encoding(enc)
    s.encode(::Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
  rescue ::Encoding::ConverterNotFoundError, ArgumentError
    # Unknown or unconvertible encoding name: treat the bytes as UTF-8 and drop invalid ones.
    s.force_encoding(::Encoding::UTF_8)
    s.encode(::Encoding::UTF_8, ::Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
  end
end
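As listed, `to_utf8` removes only the three-byte UTF-8 BOM before transcoding, so a UTF-16 BOM appears to survive as a leading U+FEFF character in the result. The following sketch shows that effect in plain Ruby, along with one possible way for a caller to trim it. `utf16le_to_utf8` is a hypothetical helper, not part of the module.

```ruby
# Sketch only: utf16le_to_utf8 is a hypothetical helper for illustration.
# It transcodes UTF-16LE bytes to UTF-8 with the same lossy options used
# above, then removes a leading U+FEFF left behind by the BOM.
def utf16le_to_utf8(raw)
  text = raw.dup
            .force_encoding(Encoding::UTF_16LE)
            .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
  text.delete_prefix("\uFEFF")
end

# "hi" in UTF-16LE, preceded by the FF FE byte order mark.
raw = "\xFF\xFEh\x00i\x00".b

plain = raw.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
p plain.start_with?("\uFEFF") # prints true: the BOM became a character
p utf16le_to_utf8(raw)        # prints "hi"
```

Whether a retained U+FEFF matters depends on the consumer. `String#delete_prefix` requires Ruby 2.5 or later.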