Module: Philiprehberger::EncodingKit::Detector

Defined in: lib/philiprehberger/encoding_kit/detector.rb

Overview

Encoding detection via BOM inspection and byte-pattern heuristics.

Constant Summary

BOMS = BOM signatures ordered from longest to shortest to avoid false matches

[
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b, Encoding::UTF_8],
  ["\xFE\xFF".b,         Encoding::UTF_16BE],
  ["\xFF\xFE".b,         Encoding::UTF_16LE]
].freeze
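
The ordering can be illustrated with a minimal sketch, assuming simple prefix matching against the table. The `BomSketch` module and its `detect` method are hypothetical stand-ins for illustration, not the library's own `detect_bom`. The UTF-32LE signature begins with the same two bytes as the UTF-16LE signature, so the longer one has to be tried first.

```ruby
# Hypothetical stand-in for illustration; not the library's detect_bom.
module BomSketch
  # Same signatures as the BOMS table above, longest first.
  BOMS = [
    ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
    ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
    ["\xEF\xBB\xBF".b,     Encoding::UTF_8],
    ["\xFE\xFF".b,         Encoding::UTF_16BE],
    ["\xFF\xFE".b,         Encoding::UTF_16LE]
  ].freeze

  # Return the encoding of the first signature the bytes start with, or nil.
  def self.detect(bytes)
    binary = bytes.b
    match = BOMS.find { |signature, _encoding| binary.start_with?(signature) }
    match&.last
  end
end

# "A" encoded as UTF-32LE and as UTF-16LE, each preceded by its BOM.
utf32le_input = [0xFF, 0xFE, 0x00, 0x00, 0x41, 0x00, 0x00, 0x00].pack("C*")
utf16le_input = [0xFF, 0xFE, 0x41, 0x00].pack("C*")

BomSketch.detect(utf32le_input) # returns Encoding::UTF_32LE
BomSketch.detect(utf16le_input) # returns Encoding::UTF_16LE
BomSketch.detect("plain".b)     # returns nil, no BOM present
```

If the two-byte UTF-16LE entry came first in the table, the UTF-32LE input above would be reported as UTF-16LE.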

CP1252_SPECIFIC = Bytes in 0x80-0x9F that CP1252 assigns to printable characters. ISO-8859-1 has no printable characters in this range (it holds only C1 control codes), so the presence of these bytes strongly suggests a Windows codepage.

[
  0x80, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88,
  0x89, 0x8A, 0x8B, 0x8C, 0x8E, 0x91, 0x92, 0x93,
  0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B,
  0x9C, 0x9E, 0x9F
].freeze
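
A minimal sketch of how such a set might be used, assuming the heuristic simply counts matching bytes. The `Cp1252Sketch` module and its methods are hypothetical; the library's private helpers are not shown on this page.

```ruby
# Hypothetical stand-in for illustration; not the library's own heuristic.
module Cp1252Sketch
  # Same byte values as CP1252_SPECIFIC above.
  SPECIFIC = [
    0x80, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88,
    0x89, 0x8A, 0x8B, 0x8C, 0x8E, 0x91, 0x92, 0x93,
    0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B,
    0x9C, 0x9E, 0x9F
  ].freeze

  # Count the bytes that CP1252 maps to printable characters in 0x80-0x9F.
  def self.specific_count(bytes)
    bytes.each_byte.count { |byte| SPECIFIC.include?(byte) }
  end

  # Decode the bytes as CP1252 and return a UTF-8 string.
  def self.decode(bytes)
    bytes.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
  end
end

# "hello" wrapped in CP1252 curly quotes (0x93 and 0x94).
quoted = [0x93, *"hello".bytes, 0x94].pack("C*")
# "cafe" with e-acute as a single Latin-1 byte (0xE9), outside 0x80-0x9F.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*")

Cp1252Sketch.specific_count(quoted) # returns 2
Cp1252Sketch.specific_count(latin1) # returns 0
Cp1252Sketch.decode(quoted)         # returns "hello" in U+201C and U+201D quotes
```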

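CP1250_MARKERS = Bytes in 0x80-0x9F used as indicators of CP1250 (Central European): 0x8A (S-caron), 0x8E (Z-caron), 0x9A (s-caron), 0x9E (z-caron). CP1252 maps these four bytes to the same characters, so on their own they indicate a Windows codepage rather than separating CP1250 from CP1252; the two codepages differ at other positions in this range, such as 0x8C.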

[0x8A, 0x8E, 0x9A, 0x9E].freeze
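
The marker bytes can be inspected with Ruby's built-in transcoding. This is a small sketch, not part of the library; the `Cp1250Sketch` module is a hypothetical name. It decodes each marker under both codepages, and adds 0x8C as a contrasting byte where the two codepages disagree.

```ruby
# Hypothetical helper for illustration only.
module Cp1250Sketch
  MARKERS = [0x8A, 0x8E, 0x9A, 0x9E].freeze

  # Decode a single byte under the given single-byte encoding, as UTF-8.
  def self.decode(byte, encoding)
    [byte].pack("C").force_encoding(encoding).encode(Encoding::UTF_8)
  end

  # True when CP1250 and CP1252 map the byte to the same character.
  def self.same_in_both?(byte)
    decode(byte, Encoding::Windows_1250) == decode(byte, Encoding::Windows_1252)
  end
end

Cp1250Sketch::MARKERS.each do |byte|
  cp1250 = Cp1250Sketch.decode(byte, Encoding::Windows_1250)
  cp1252 = Cp1250Sketch.decode(byte, Encoding::Windows_1252)
  puts format("0x%02X  CP1250=%s  CP1252=%s", byte, cp1250, cp1252)
end

Cp1250Sketch.same_in_both?(0x8A) # returns true, S-caron in both codepages
Cp1250Sketch.same_in_both?(0x8C) # returns false, S-acute in CP1250, OE ligature in CP1252
```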

CP1251_RANGE = CP1251 (Cyrillic) maps 0x80-0xFF almost entirely to Cyrillic letters. Bytes 0xC0-0xFF are Cyrillic А-я in CP1251.

(0xC0..0xFF)
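
One way a range like this could feed a heuristic is to measure how many of the non-ASCII bytes fall inside it. The sketch below assumes that approach; the `Cp1251Sketch` module and the ratio it computes are hypothetical, and the library's actual scoring is not shown on this page.

```ruby
# Hypothetical stand-in for illustration; not the library's own heuristic.
module Cp1251Sketch
  # Same range as CP1251_RANGE above.
  RANGE = (0xC0..0xFF)

  # Fraction of non-ASCII bytes that fall in the CP1251 Cyrillic letter block.
  # Returns 0.0 when the input has no non-ASCII bytes at all.
  def self.cyrillic_ratio(bytes)
    high = bytes.each_byte.select { |byte| byte >= 0x80 }
    return 0.0 if high.empty?

    high.count { |byte| RANGE.cover?(byte) }.to_f / high.size
  end
end

# The Russian word "Privet" (hello), written in Cyrillic and encoded as CP1251.
russian = "\u041F\u0440\u0438\u0432\u0435\u0442".encode(Encoding::Windows_1251)
# Two CP1252-style curly quotes, both below 0xC0.
quotes = [0x93, 0x94].pack("C*")

Cp1251Sketch.cyrillic_ratio(russian)   # returns 1.0
Cp1251Sketch.cyrillic_ratio(quotes)    # returns 0.0
Cp1251Sketch.cyrillic_ratio("ascii".b) # returns 0.0
```

A ratio alone cannot separate CP1251 from Latin-1 text made up mostly of accented letters, since those also occupy 0xC0-0xFF, which fits the lower confidence the strategy assigns to codepage guesses.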

Class Method Summary

.call(string) ⇒ DetectionResult

Detect the encoding of a byte string, returning a DetectionResult with encoding and confidence score.

.detect_bom(bytes) ⇒ Encoding, nil

Check whether the string starts with a known BOM.

Class Method Details

.call(string) ⇒ `DetectionResult`

Detect the encoding of a byte string, returning a DetectionResult with encoding and confidence score.

Strategy:

1. Check for a byte order mark (BOM) - confidence 1.0
2. Try UTF-8 validity - confidence computed by utf8_confidence(bytes), nominally 0.9
3. Check pure ASCII - confidence 0.9
4. Check Windows codepages (CP1252, CP1250, CP1251) - confidence 0.6-0.7
5. Apply Latin-1 heuristic - confidence 0.7
6. Fall back to BINARY - confidence 0.5

Parameters:

string (String) - the input string (ideally with BINARY/ASCII-8BIT encoding)

Returns:

(DetectionResult) - the detected encoding with confidence

# File 'lib/philiprehberger/encoding_kit/detector.rb', line 49

def call(string)
  bytes = string.b

  bom_result = detect_bom_with_confidence(bytes)
  return bom_result if bom_result

  return DetectionResult.new(Encoding::UTF_8, utf8_confidence(bytes)) if valid_utf8?(bytes)
  return DetectionResult.new(Encoding::US_ASCII, 0.9) if ascii_only?(bytes)

  codepage_result = detect_windows_codepage(bytes)
  return codepage_result if codepage_result

  return DetectionResult.new(Encoding::ISO_8859_1, 0.7) if latin1_heuristic?(bytes)

  DetectionResult.new(Encoding::BINARY, 0.5)
end
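
The helpers `valid_utf8?` and `ascii_only?` called above are private and not shown on this page. The sketch below shows one plausible way to write such checks with Ruby's built-in String methods; the `ValiditySketch` module is hypothetical and its behavior is an assumption, not the library's implementation.

```ruby
# Hypothetical stand-ins for illustration; the library's helpers may differ.
module ValiditySketch
  # Reinterpret the bytes as UTF-8 and ask Ruby whether they are well formed.
  def self.valid_utf8?(bytes)
    bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  end

  # True when every byte is below 0x80.
  def self.ascii_only?(bytes)
    bytes.b.ascii_only?
  end
end

ascii  = "hello".b
utf8   = "caf\u00E9".b                        # e-acute as two UTF-8 bytes: C3 A9
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*")  # e-acute as one Latin-1 byte: E9

ValiditySketch.valid_utf8?(utf8)    # returns true
ValiditySketch.valid_utf8?(latin1)  # returns false, 0xE9 lacks continuation bytes
ValiditySketch.ascii_only?(utf8)    # returns false
ValiditySketch.valid_utf8?(ascii)   # returns true, ASCII is a subset of UTF-8
ValiditySketch.ascii_only?(ascii)   # returns true
```

With checks written this way, pure ASCII input also passes the UTF-8 test, so the order of the two checks decides which encoding is reported for it. How the library's own helpers treat that case is not visible in this listing.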

.detect_bom(bytes) ⇒ `Encoding`, `nil`

Check whether the string starts with a known BOM.