Module: Philiprehberger::EncodingKit::Detector
- Defined in:
- lib/philiprehberger/encoding_kit/detector.rb
Overview
Encoding detection via BOM inspection and byte-pattern heuristics
Constant Summary
- BOMS =
BOM signatures ordered from longest to shortest to avoid false matches
[
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b, Encoding::UTF_8],
  ["\xFE\xFF".b, Encoding::UTF_16BE],
  ["\xFF\xFE".b, Encoding::UTF_16LE]
].freeze
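The ordering matters because the UTF-32LE signature begins with the same two bytes as the UTF-16LE signature. The sketch below illustrates this with a local copy of the table; `BOMS_SKETCH` and `first_bom_match` are hypothetical names used only for this example, not part of the library.

```ruby
# Local copy of the BOM table shown above, longest signatures first.
BOMS_SKETCH = [
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b, Encoding::UTF_8],
  ["\xFE\xFF".b, Encoding::UTF_16BE],
  ["\xFF\xFE".b, Encoding::UTF_16LE]
].freeze

# Hypothetical helper: return the encoding of the first signature that matches.
def first_bom_match(table, bytes)
  table.each { |bom, encoding| return encoding if bytes.start_with?(bom) }
  nil
end

# UTF-32LE BOM followed by the letter "A" as a 4-byte code unit.
utf32le_input = "\xFF\xFE\x00\x00A\x00\x00\x00".b

first_bom_match(BOMS_SKETCH, utf32le_input)          # Encoding::UTF_32LE
first_bom_match(BOMS_SKETCH.reverse, utf32le_input)  # Encoding::UTF_16LE (shorter BOM matched first)
```

With the shortest signatures checked first, the UTF-32LE input would be reported as UTF-16LE, which is the false match the ordering is meant to prevent.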
- CP1252_SPECIFIC =
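Bytes in 0x80-0x9F that CP1252 assigns to printable characters. ISO-8859-1 assigns no printable characters to this range (the bytes are C1 control codes at best), so their presence strongly suggests a Windows codepage. The five bytes that CP1252 itself leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are omitted.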
[ 0x80, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8E, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9E, 0x9F ].freeze
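To make the distinction concrete, the sketch below decodes the same bytes under both encodings using Ruby's built-in transcoders. `SMART_PUNCTUATION` is a hypothetical subset of the list above, chosen for illustration; how the library actually counts or weights these bytes is not shown on this page.

```ruby
# 0x93 and 0x94 are curly double quotes in Windows-1252.
raw = "\x93quoted\x94".b

as_cp1252 = raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

as_cp1252         # "\u201Cquoted\u201D", readable typographic quotes
as_latin1[0].ord  # 0x93, i.e. U+0093, an invisible C1 control character

# Hypothetical subset of the CP1252-specific bytes: quotes and dashes.
SMART_PUNCTUATION = [0x91, 0x92, 0x93, 0x94, 0x96, 0x97].freeze

# Count how many bytes of the input fall in that subset.
hits = raw.each_byte.count { |byte| SMART_PUNCTUATION.include?(byte) }  # 2
```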
- CP1250_MARKERS =
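CP1250 (Central European) marker bytes in 0x80-0x9F: 0x8A (S-caron), 0x8E (Z-caron), 0x9A (s-caron), 0x9E (z-caron). CP1252 assigns the same four characters to these bytes, so they signal caron letters that are common in Central European text rather than a byte-level difference between the two codepages.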
[0x8A, 0x8E, 0x9A, 0x9E].freeze
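As a quick check of what these four bytes represent, the sketch below decodes them with Ruby's built-in Windows-1250 and Windows-1252 transcoders. It only illustrates the byte-to-character mapping; how the library scores these markers is not shown on this page.

```ruby
# The four marker bytes, in the order listed above.
markers = [0x8A, 0x8E, 0x9A, 0x9E].pack("C*")

as_cp1250 = markers.dup.force_encoding(Encoding::Windows_1250).encode(Encoding::UTF_8)
as_cp1252 = markers.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)

as_cp1250               # "ŠŽšž"
as_cp1250 == as_cp1252  # true: both codepages place the caron letters at these bytes
```

Because both codepages decode the markers identically, a detector presumably needs additional evidence, such as other bytes or letter frequency, to prefer CP1250 over CP1252.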
- CP1251_RANGE =
CP1251 (Cyrillic) maps most of 0x80-0xFF to Cyrillic letters and punctuation. Bytes 0xC0-0xFF form the contiguous Cyrillic block А-я in CP1251.
(0xC0..0xFF)
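One way such a range could be used is to measure how densely the input falls inside it. The sketch below is an assumption about a plausible heuristic, not the library's implementation; `high_range_ratio` is a hypothetical helper.

```ruby
# Hypothetical helper: fraction of bytes that fall in 0xC0..0xFF.
def high_range_ratio(bytes)
  return 0.0 if bytes.empty?

  bytes.each_byte.count { |byte| (0xC0..0xFF).cover?(byte) }.fdiv(bytes.bytesize)
end

# "Привет" encoded as CP1251, and "café" encoded as ISO-8859-1.
russian = "\u041F\u0440\u0438\u0432\u0435\u0442".encode(Encoding::Windows_1251).b
french  = "caf\u00E9".encode(Encoding::ISO_8859_1).b

high_range_ratio(russian)  # 1.0, every byte is a Cyrillic letter
high_range_ratio(french)   # 0.25, only the accented letter is in range
```

Western European text also uses bytes in this range for accented letters, so a density threshold rather than mere presence would be needed to separate the two cases.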
Class Method Summary
- .call(string) ⇒ DetectionResult
Detect the encoding of a byte string, returning a DetectionResult with encoding and confidence score.
- .detect_bom(bytes) ⇒ Encoding?
Check whether the string starts with a known BOM.
Class Method Details
.call(string) ⇒ DetectionResult
Detect the encoding of a byte string, returning a DetectionResult with encoding and confidence score.
Strategy:
1. Check for a byte order mark (BOM) - confidence 1.0
2. Try UTF-8 validity - confidence 0.9 (supplied by utf8_confidence)
3. Check pure ASCII - confidence 0.9
4. Check Windows codepages (CP1252, CP1250, CP1251) - confidence 0.6-0.7
5. Apply Latin-1 heuristic - confidence 0.7
6. Fall back to BINARY - confidence 0.5
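The tiered approach can be sketched with core Ruby alone. This is a simplified illustration that covers only the ASCII, UTF-8, and BINARY tiers; `SketchResult` and `sketch_detect` are hypothetical stand-ins, not the library's `DetectionResult` or `call`. The sketch checks ASCII before UTF-8 because every pure-ASCII string is also valid UTF-8; how the library's own helpers handle that overlap is not shown on this page.

```ruby
# Hypothetical stand-in for the library's DetectionResult.
SketchResult = Struct.new(:encoding, :confidence)

# Simplified cascade: return at the first tier that matches.
def sketch_detect(string)
  bytes = string.b  # binary copy, so checks operate on raw bytes

  return SketchResult.new(Encoding::US_ASCII, 0.9) if bytes.ascii_only?

  utf8_view = bytes.dup.force_encoding(Encoding::UTF_8)
  return SketchResult.new(Encoding::UTF_8, 0.9) if utf8_view.valid_encoding?

  # Nothing matched: report raw bytes with low confidence.
  SketchResult.new(Encoding::BINARY, 0.5)
end

sketch_detect("hello").encoding             # Encoding::US_ASCII
sketch_detect("h\u00E9llo").encoding        # Encoding::UTF_8
sketch_detect("\xFF\xFE\x80".b).confidence  # 0.5
```

Each tier returns early, so the confidence reflects the strongest evidence found rather than an aggregate score.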
# File 'lib/philiprehberger/encoding_kit/detector.rb', line 49
def call(string)
  bytes = string.b

  bom_result = detect_bom_with_confidence(bytes)
  return bom_result if bom_result

  return DetectionResult.new(Encoding::UTF_8, utf8_confidence(bytes)) if valid_utf8?(bytes)
  return DetectionResult.new(Encoding::US_ASCII, 0.9) if ascii_only?(bytes)

  codepage_result = detect_windows_codepage(bytes)
  return codepage_result if codepage_result

  return DetectionResult.new(Encoding::ISO_8859_1, 0.7) if latin1_heuristic?(bytes)

  DetectionResult.new(Encoding::BINARY, 0.5)
end
.detect_bom(bytes) ⇒ Encoding?
Check whether the string starts with a known BOM.
# File 'lib/philiprehberger/encoding_kit/detector.rb', line 70
def detect_bom(bytes)
  BOMS.each do |bom, encoding|
    return encoding if bytes.start_with?(bom)
  end
  nil
end