Module: Philiprehberger::EncodingKit::Detector

Defined in:
lib/philiprehberger/encoding_kit/detector.rb

Overview

Encoding detection via BOM inspection and byte-pattern heuristics

Constant Summary

BOMS =

BOM signatures ordered from longest to shortest to avoid false matches

[
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b, Encoding::UTF_8],
  ["\xFE\xFF".b,         Encoding::UTF_16BE],
  ["\xFF\xFE".b,         Encoding::UTF_16LE]
].freeze
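
The ordering matters because the UTF-32LE signature begins with the same two bytes as the UTF-16LE signature. The sketch below copies the table locally and scans it in both directions. `first_bom_match` and `BOMS_LONGEST_FIRST` are hypothetical names used only for this illustration, not part of the module.

```ruby
# Local copy of the signature table, longest signatures first.
BOMS_LONGEST_FIRST = [
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b,     Encoding::UTF_8],
  ["\xFE\xFF".b,         Encoding::UTF_16BE],
  ["\xFF\xFE".b,         Encoding::UTF_16LE]
].freeze

# Hypothetical helper: encoding of the first signature that prefixes the input.
def first_bom_match(bytes, table)
  table.each { |bom, encoding| return encoding if bytes.start_with?(bom) }
  nil
end

# The letter "A" in UTF-32LE, preceded by the UTF-32LE BOM.
utf32le_sample = "\xFF\xFE\x00\x00".b + "A\x00\x00\x00".b

puts first_bom_match(utf32le_sample, BOMS_LONGEST_FIRST)         # UTF-32LE
puts first_bom_match(utf32le_sample, BOMS_LONGEST_FIRST.reverse) # UTF-16LE (false match)
```

Scanning shortest-first stops at the two-byte UTF-16LE signature, which is why the table lists the four-byte signatures first.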
CP1252_SPECIFIC =

Bytes in 0x80-0x9F that map to printable characters in CP1252 but not in ISO-8859-1. In ISO-8859-1 this range holds only C1 control codes, which rarely occur in text, so the presence of these bytes strongly suggests a Windows codepage.

[
  0x80, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88,
  0x89, 0x8A, 0x8B, 0x8C, 0x8E, 0x91, 0x92, 0x93,
  0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B,
  0x9C, 0x9E, 0x9F
].freeze
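
To make the distinction concrete, the sketch below decodes the same bytes under both encodings. `CP1252_ONLY`, `contains_cp1252_specific?`, and `decode_bytes` are hypothetical names for this illustration; how the detector itself consumes the constant is not shown in this section.

```ruby
# Local copy of the byte list above, for illustration only.
CP1252_ONLY = [
  0x80, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88,
  0x89, 0x8A, 0x8B, 0x8C, 0x8E, 0x91, 0x92, 0x93,
  0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B,
  0x9C, 0x9E, 0x9F
].freeze

# Hypothetical helper: does the input contain any of the listed bytes?
def contains_cp1252_specific?(bytes)
  bytes.each_byte.any? { |byte| CP1252_ONLY.include?(byte) }
end

# Hypothetical helper: interpret raw bytes in the given encoding, as UTF-8.
def decode_bytes(bytes, encoding)
  bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

raw = "\x93quoted\x94".b

puts contains_cp1252_specific?(raw)                # true
puts decode_bytes(raw, Encoding::Windows_1252)     # curly quotes around "quoted"
p decode_bytes(raw, Encoding::ISO_8859_1).codepoints.first # 147, the C1 control U+0093
```

Under CP1252 the bytes 0x93 and 0x94 are typographic quotation marks; under ISO-8859-1 they become invisible control characters, which is rarely what the author of a text intended.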
CP1250_MARKERS =

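CP1250 (Central European) differs from CP1252 at several positions in 0x80-0x9F. The markers used here are the caron letters common in Central European text: 0x8A (S-caron), 0x8E (Z-caron), 0x9A (s-caron), 0x9E (z-caron). These four bytes decode to the same characters in CP1252, so they hint at CP1250 rather than prove it.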

[0x8A, 0x8E, 0x9A, 0x9E].freeze
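
As a rough illustration of how the two codepages relate, the sketch below decodes two byte strings under each. The marker byte 0x8A yields the same letter in both, while 0x9C does not. `decode_codepage` is a hypothetical helper, not part of the module.

```ruby
# Hypothetical helper: interpret raw bytes in the given encoding, as UTF-8.
def decode_codepage(bytes, encoding)
  bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

marker_word = "\x8Akoda".b # starts with the marker byte 0x8A
other_word  = "\x9Cwiat".b # starts with 0x9C, which is not in the marker list

[Encoding::Windows_1250, Encoding::Windows_1252].each do |encoding|
  puts "#{encoding}: #{decode_codepage(marker_word, encoding)} / " \
       "#{decode_codepage(other_word, encoding)}"
end
```

Both encodings decode 0x8A as S-caron, whereas 0x9C is s-acute in CP1250 and the oe ligature in CP1252, so a marker byte alone leaves the two codepages ambiguous.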
CP1251_RANGE =

CP1251 (Cyrillic) maps most of 0x80-0xFF to Cyrillic letters. Bytes 0xC0-0xFF are the contiguous block А-я (U+0410-U+044F); 0x80-0xBF mixes further Cyrillic letters with punctuation and symbols.

(0xC0..0xFF)
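
One way such a range could feed a heuristic is to measure how dense the input is in those bytes. The sketch below is an assumption about usage, not the detector's actual logic; `CYRILLIC_LETTER_RANGE` and `cp1251_letter_ratio` are hypothetical names.

```ruby
# Local copy of the range above.
CYRILLIC_LETTER_RANGE = (0xC0..0xFF)

# Hypothetical helper: fraction of all bytes that fall in 0xC0-0xFF.
def cp1251_letter_ratio(bytes)
  return 0.0 if bytes.empty?

  in_range = bytes.each_byte.count { |byte| CYRILLIC_LETTER_RANGE.cover?(byte) }
  in_range.to_f / bytes.bytesize
end

# The Russian word "Privet" (hello), written with escapes to keep the source ASCII.
russian = "\u041F\u0440\u0438\u0432\u0435\u0442".encode(Encoding::Windows_1251).b
french  = "caf\xE9".b # Latin-1 / CP1252 bytes for "cafe" with an acute accent

puts cp1251_letter_ratio(russian) # 1.0
puts cp1251_letter_ratio(french)  # 0.25
```

Western European text also uses bytes in 0xC0-0xFF for accented letters, but only sparsely, so the density of such bytes is more telling than their mere presence.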

Class Method Summary

Class Method Details

.call(string) ⇒ DetectionResult

Detect the encoding of a byte string, returning a DetectionResult with encoding and confidence score.

Strategy:

1. Check for a byte order mark (BOM) - confidence 1.0
2. Try UTF-8 validity - confidence 0.9
3. Check pure ASCII - confidence 0.9
4. Check Windows codepages (CP1252, CP1250, CP1251) - confidence 0.6-0.7
5. Apply Latin-1 heuristic - confidence 0.7
6. Fall back to BINARY - confidence 0.5

Parameters:

  • string (String)

    the input string (ideally with BINARY/ASCII-8BIT encoding)

Returns:

  • (DetectionResult)

    the detected encoding together with its confidence score

# File 'lib/philiprehberger/encoding_kit/detector.rb', line 49

def call(string)
  bytes = string.b

  bom_result = detect_bom_with_confidence(bytes)
  return bom_result if bom_result

  return DetectionResult.new(Encoding::UTF_8, utf8_confidence(bytes)) if valid_utf8?(bytes)
  return DetectionResult.new(Encoding::US_ASCII, 0.9) if ascii_only?(bytes)

  codepage_result = detect_windows_codepage(bytes)
  return codepage_result if codepage_result

  return DetectionResult.new(Encoding::ISO_8859_1, 0.7) if latin1_heuristic?(bytes)

  DetectionResult.new(Encoding::BINARY, 0.5)
end
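
The early-return cascade can be tried in isolation with a reduced, self-contained sketch. Everything here is a stand-in: `SketchResult` replaces DetectionResult (whose definition is not shown in this section), `sketch_detect` covers only the BOM, UTF-8, ASCII, and BINARY steps, and the fixed 0.9 for UTF-8 follows the strategy list rather than the real `utf8_confidence`. The UTF-8 branch also assumes that ASCII-only input is excluded, so the ASCII branch stays reachable; whether the real `valid_utf8?` behaves that way is not confirmed here.

```ruby
# Stand-in for the gem's DetectionResult (assumption: encoding + confidence).
SketchResult = Struct.new(:encoding, :confidence)

SKETCH_BOMS = [
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b,     Encoding::UTF_8],
  ["\xFE\xFF".b,         Encoding::UTF_16BE],
  ["\xFF\xFE".b,         Encoding::UTF_16LE]
].freeze

# Reduced cascade: each step returns early; the codepage and Latin-1 steps are omitted.
def sketch_detect(string)
  bytes = string.b

  # A BOM is decisive.
  SKETCH_BOMS.each do |bom, encoding|
    return SketchResult.new(encoding, 1.0) if bytes.start_with?(bom)
  end

  # Assumption: count as UTF-8 only if some non-ASCII byte is present.
  multibyte_utf8 = !bytes.ascii_only? &&
                   bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  return SketchResult.new(Encoding::UTF_8, 0.9) if multibyte_utf8

  return SketchResult.new(Encoding::US_ASCII, 0.9) if bytes.ascii_only?

  # Nothing matched: report raw bytes with low confidence.
  SketchResult.new(Encoding::BINARY, 0.5)
end

result = sketch_detect("caf\u00E9")
puts "#{result.encoding} (#{result.confidence})" # UTF-8 (0.9)
```

Ordering the checks from most to least decisive means a cheap, certain signal such as a BOM short-circuits the heuristic steps entirely.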

.detect_bom(bytes) ⇒ Encoding?

Check whether the string starts with a known BOM.

Parameters:

  • bytes (String)

    binary string

Returns:

  • (Encoding, nil)

    the encoding indicated by the BOM, or nil



# File 'lib/philiprehberger/encoding_kit/detector.rb', line 70

def detect_bom(bytes)
  BOMS.each do |bom, encoding|
    return encoding if bytes.start_with?(bom)
  end
  nil
end