Module: Philiprehberger::EncodingKit
Defined in:
- lib/philiprehberger/encoding_kit.rb
- lib/philiprehberger/encoding_kit/version.rb
- lib/philiprehberger/encoding_kit/detector.rb
- lib/philiprehberger/encoding_kit/converter.rb
- lib/philiprehberger/encoding_kit/detection_result.rb
Defined Under Namespace
Modules: Converter, Detector
Classes: DetectionResult, Error
Constant Summary
- BOMS =
BOM signatures (re-exported for public use)
Detector::BOMS
- LINE_ENDINGS =
{ lf: "\n", crlf: "\r\n", cr: "\r" }.freeze
- FILENAME_ENCODING_HINTS =
Filename suffix / extension hints that imply a specific encoding. Matched against up to the last three dot-separated tokens of the filename (the extension plus up to two modifiers).
{
  'utf8' => Encoding::UTF_8,
  'utf-8' => Encoding::UTF_8,
  'utf16' => Encoding::UTF_16,
  'utf-16' => Encoding::UTF_16,
  'utf16le' => Encoding::UTF_16LE,
  'utf-16le' => Encoding::UTF_16LE,
  'utf16be' => Encoding::UTF_16BE,
  'utf-16be' => Encoding::UTF_16BE,
  'utf32' => Encoding::UTF_32,
  'utf-32' => Encoding::UTF_32,
  'ascii' => Encoding::US_ASCII,
  'us-ascii' => Encoding::US_ASCII,
  'latin1' => Encoding::ISO_8859_1,
  'latin-1' => Encoding::ISO_8859_1,
  'iso88591' => Encoding::ISO_8859_1,
  'iso-8859-1' => Encoding::ISO_8859_1,
  'iso88592' => Encoding::ISO_8859_2,
  'iso-8859-2' => Encoding::ISO_8859_2,
  'cp1252' => Encoding::Windows_1252,
  'windows1252' => Encoding::Windows_1252,
  'windows-1252' => Encoding::Windows_1252,
  'sjis' => Encoding::Shift_JIS,
  'shiftjis' => Encoding::Shift_JIS,
  'shift-jis' => Encoding::Shift_JIS,
  'shift_jis' => Encoding::Shift_JIS,
  'euc-jp' => Encoding::EUC_JP,
  'eucjp' => Encoding::EUC_JP,
  'gbk' => Encoding::GBK,
  'gb2312' => Encoding::GB2312,
  'big5' => Encoding::Big5
}.freeze
- VERSION =
'0.6.0'
Class Method Summary
- .analyze(string) ⇒ Hash
Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.
- .bom?(string) ⇒ Boolean
Check if a string starts with a byte order mark.
- .convert(string, from:, to:) ⇒ String
Convert a string between encodings.
- .detect(string) ⇒ DetectionResult
Detect the encoding of a string via BOM and heuristics.
- .detect_file(path, sample_size: 4096) ⇒ DetectionResult
Detect the encoding of a file by reading a byte sample.
- .detect_stream(io, sample_size: 4096) ⇒ DetectionResult
Detect encoding from an IO stream by reading a sample of bytes.
- .file_valid?(path, encoding: nil) ⇒ Boolean
Check if a file’s content is valid in the detected or specified encoding.
- .guess_from_filename(filename) ⇒ Encoding?
Guess the encoding based on filename suffixes/extensions alone.
- .normalize(string) ⇒ String
Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).
- .normalize_line_endings(string, to: :lf) ⇒ String
Normalize line endings to a single canonical form.
- .read_as_utf8(path, from: nil, strip_bom: false) ⇒ String
Read a file and return its content as UTF-8.
- .scrub(string) ⇒ String
Strip invalid bytes from a string, returning valid UTF-8.
- .strip_bom(string) ⇒ String
Remove a byte order mark from the beginning of a string.
- .to_utf8(string, from: nil, strip_bom: false) ⇒ String
Convert a string to UTF-8, auto-detecting source encoding if not specified.
- .transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String
Transcode a string to the target encoding, auto-detecting the source.
- .valid?(string, encoding: nil) ⇒ Boolean
Check if a string is valid in the given encoding (or its current encoding).
Class Method Details
.analyze(string) ⇒ Hash
Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.
# File 'lib/philiprehberger/encoding_kit.rb', line 51
def self.analyze(string)
  bytes = string.b
  total = bytes.bytesize.to_f

  if total.zero?
    return {
      encoding: Encoding::BINARY,
      confidence: 0.5,
      printable_ratio: 0.0,
      ascii_ratio: 0.0,
      high_bytes: 0,
      candidates: [{ encoding: Encoding::BINARY, confidence: 0.5 }]
    }
  end

  ascii_count = 0
  printable_count = 0
  high_byte_count = 0

  bytes.each_byte do |b|
    ascii_count += 1 if b < 128
    printable_count += 1 if (0x20..0x7E).cover?(b) || b == 0x09 || b == 0x0A || b == 0x0D
    high_byte_count += 1 if b >= 128
  end

  primary = Detector.call(bytes)
  candidates = build_candidates(bytes, primary)

  {
    encoding: primary.encoding,
    confidence: primary.confidence,
    printable_ratio: (printable_count / total).round(4),
    ascii_ratio: (ascii_count / total).round(4),
    high_bytes: high_byte_count,
    candidates: candidates
  }
end
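To make the ratio fields concrete, the following is a minimal standalone sketch of the counting loop only. The helper name byte_stats is hypothetical and not part of EncodingKit, and the detection step (Detector.call, build_candidates) is deliberately left out because its implementation is not shown on this page.

```ruby
# Hypothetical helper mirroring only the counting loop of .analyze.
# It is not part of the EncodingKit API and performs no encoding detection.
def byte_stats(string)
  bytes = string.b
  total = bytes.bytesize.to_f
  return { printable_ratio: 0.0, ascii_ratio: 0.0, high_bytes: 0 } if total.zero?

  ascii = printable = high = 0
  bytes.each_byte do |b|
    ascii += 1 if b < 128
    # Printable ASCII plus tab, line feed, and carriage return.
    printable += 1 if (0x20..0x7E).cover?(b) || [0x09, 0x0A, 0x0D].include?(b)
    high += 1 if b >= 128
  end

  {
    printable_ratio: (printable / total).round(4),
    ascii_ratio: (ascii / total).round(4),
    high_bytes: high
  }
end

# "café\n" is 6 bytes in UTF-8: four ASCII bytes plus the two-byte "é".
stats = byte_stats("caf\u00E9\n")
# stats => { printable_ratio: 0.6667, ascii_ratio: 0.6667, high_bytes: 2 }
```

Note that the ratios are per byte, not per character, so a single multi-byte character contributes several high bytes.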
.bom?(string) ⇒ Boolean
Check if a string starts with a byte order mark.
# File 'lib/philiprehberger/encoding_kit.rb', line 193
def self.bom?(string)
  bytes = string.b
  BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end
.convert(string, from:, to:) ⇒ String
Convert a string between encodings.
# File 'lib/philiprehberger/encoding_kit.rb', line 153
def self.convert(string, from:, to:)
  Converter.convert(string, from: from, to: to)
end
.detect(string) ⇒ DetectionResult
Detect the encoding of a string via BOM and heuristics. Returns a DetectionResult that delegates to the underlying Encoding, so it can be compared directly (e.g., result == Encoding::UTF_8) while also providing a confidence score via result.confidence.
# File 'lib/philiprehberger/encoding_kit.rb', line 22
def self.detect(string)
  Detector.call(string)
end
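The delegation described above can be pictured with a small stand-in. This is only a sketch: SketchResult is a hypothetical class built on Ruby's SimpleDelegator, and the real DetectionResult (defined in detection_result.rb, not shown here) may be implemented differently.

```ruby
require 'delegate'

# Hypothetical stand-in for DetectionResult, for illustration only.
# It wraps an Encoding and adds a confidence score.
class SketchResult < SimpleDelegator
  attr_reader :confidence

  def initialize(encoding, confidence)
    super(encoding) # the wrapped Encoding receives all delegated calls
    @confidence = confidence
  end

  # Expose the wrapped Encoding explicitly.
  def encoding
    __getobj__
  end
end

result = SketchResult.new(Encoding::UTF_8, 0.9)
result == Encoding::UTF_8 # => true, comparison is forwarded to the Encoding
result.name               # => "UTF-8", delegated to Encoding#name
result.confidence         # => 0.9
```

With this shape, callers that only care about the encoding can treat the result like an Encoding, while callers that need certainty can inspect confidence.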
.detect_file(path, sample_size: 4096) ⇒ DetectionResult
Detect the encoding of a file by reading a byte sample.
# File 'lib/philiprehberger/encoding_kit.rb', line 203
def self.detect_file(path, sample_size: 4096)
  File.open(path, 'rb') do |file|
    detect_stream(file, sample_size: sample_size)
  end
end
.detect_stream(io, sample_size: 4096) ⇒ DetectionResult
Detect encoding from an IO stream by reading a sample of bytes. The IO position is restored after reading (if the IO supports seek).
# File 'lib/philiprehberger/encoding_kit.rb', line 32
def self.detect_stream(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)

  if original_pos && io.respond_to?(:seek)
    io.seek(original_pos)
  end

  return DetectionResult.new(Encoding::BINARY, 0.5) if sample.nil? || sample.empty?

  Detector.call(sample)
end
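The read-then-restore behaviour can be tried in isolation with a StringIO. The helper peek_sample below is hypothetical and reproduces only the position handling, not the detection itself.

```ruby
require 'stringio'

# Hypothetical helper: read a sample, then put the cursor back where it was.
# Mirrors the pos/seek handling of .detect_stream without any detection.
def peek_sample(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)
  io.seek(original_pos) if original_pos && io.respond_to?(:seek)
  sample
end

io = StringIO.new("header\nbody\n")
io.read(7)                              # consume "header\n", cursor is now at 7
sample = peek_sample(io, sample_size: 4)
# sample => "body", and io.pos is back at 7
```

One caveat worth keeping in mind: respond_to? only checks that the method exists. Non-seekable IO objects such as pipes generally still respond to pos and seek but raise Errno::ESPIPE when those methods are called, so this sketch assumes a seekable stream.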
.file_valid?(path, encoding: nil) ⇒ Boolean
Check if a file’s content is valid in the detected or specified encoding.
# File 'lib/philiprehberger/encoding_kit.rb', line 226
def self.file_valid?(path, encoding: nil)
  raw = File.binread(path)
  valid?(raw, encoding: encoding)
end
.guess_from_filename(filename) ⇒ Encoding?
Guess the encoding based on filename suffixes/extensions alone. Useful when a file name carries an explicit encoding hint (e.g., “data.utf8.csv”, “legacy.latin1.txt”). Falls back to nil when no hint can be extracted — callers should then use detect_file to inspect the bytes.
Matching is case-insensitive and considers up to the last three dot-separated tokens of the base name (the extension plus up to two modifiers); the rightmost recognizable hint wins.
# File 'lib/philiprehberger/encoding_kit.rb', line 277
def self.guess_from_filename(filename)
  name = File.basename(filename.to_s).downcase
  tokens = name.split('.').last(3) # extension + up to two modifiers
  tokens.reverse_each do |token|
    enc = FILENAME_ENCODING_HINTS[token]
    return enc if enc
  end
  nil
end
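The token matching is easiest to follow with a few filenames. The sketch below copies the lookup logic into a standalone method; SKETCH_HINTS is a deliberately small, assumed subset of FILENAME_ENCODING_HINTS and sketch_guess is a hypothetical name.

```ruby
# Small assumed subset of FILENAME_ENCODING_HINTS, for illustration only.
SKETCH_HINTS = {
  'utf8'   => Encoding::UTF_8,
  'latin1' => Encoding::ISO_8859_1,
  'sjis'   => Encoding::Shift_JIS
}.freeze

# Standalone copy of the lookup: last three dot-separated tokens,
# scanned from right to left, first recognized hint wins.
def sketch_guess(filename)
  tokens = File.basename(filename.to_s).downcase.split('.').last(3)
  tokens.reverse_each do |token|
    enc = SKETCH_HINTS[token]
    return enc if enc
  end
  nil
end

sketch_guess('exports/Data.UTF8.csv')   # => Encoding::UTF_8 (case-insensitive)
sketch_guess('legacy.latin1.sjis.txt')  # => Encoding::Shift_JIS (rightmost hint wins)
sketch_guess('a.utf8.b.c.txt')          # => nil (hint lies outside the last three tokens)
sketch_guess('utf8.csv')                # => Encoding::UTF_8 (the base name itself counts as a token)
sketch_guess('notes.txt')               # => nil
```

The last two cases show why a nil result should be followed by byte-level detection, and why a match is only a hint: a file that happens to be named utf8.csv is matched even though the token is not really an extension.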
.normalize(string) ⇒ String
Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).
# File 'lib/philiprehberger/encoding_kit.rb', line 104
def self.normalize(string)
  Converter.normalize(string)
end
.normalize_line_endings(string, to: :lf) ⇒ String
Normalize line endings to a single canonical form.
# File 'lib/philiprehberger/encoding_kit.rb', line 127
def self.normalize_line_endings(string, to: :lf)
  target = LINE_ENDINGS[to] or
    raise Error, "Unknown line ending: #{to.inspect} (expected :lf, :crlf, or :cr)"
  string.gsub(/\r\n|\r|\n/, target)
end
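A standalone sketch of the same substitution shows how mixed input behaves. The names below are hypothetical, and ArgumentError stands in for the library's own Error class, which is not defined in this snippet.

```ruby
SKETCH_LINE_ENDINGS = { lf: "\n", crlf: "\r\n", cr: "\r" }.freeze

# Standalone copy of the substitution. ArgumentError is used here only
# because the library's Error class is not available in this sketch.
def sketch_normalize_line_endings(string, to: :lf)
  target = SKETCH_LINE_ENDINGS.fetch(to) do
    raise ArgumentError, "Unknown line ending: #{to.inspect}"
  end
  # \r\n must come first in the alternation so a CRLF pair is replaced
  # once rather than as two separate line endings.
  string.gsub(/\r\n|\r|\n/, target)
end

mixed = "a\r\nb\rc\n"
unix    = sketch_normalize_line_endings(mixed)            # => "a\nb\nc\n"
windows = sketch_normalize_line_endings(mixed, to: :crlf) # => "a\r\nb\r\nc\r\n"
```

Because every recognized ending is rewritten, the operation is idempotent: normalizing already-normalized text returns an equal string.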
.read_as_utf8(path, from: nil, strip_bom: false) ⇒ String
Read a file and return its content as UTF-8. Auto-detects the source encoding unless specified via 'from:'.
# File 'lib/philiprehberger/encoding_kit.rb', line 216
def self.read_as_utf8(path, from: nil, strip_bom: false)
  raw = File.binread(path)
  to_utf8(raw, from: from, strip_bom: strip_bom)
end
.scrub(string) ⇒ String
Strip invalid bytes from a string, returning valid UTF-8.
Unlike normalize, which replaces invalid bytes with '�' (U+FFFD), this method removes them entirely.
# File 'lib/philiprehberger/encoding_kit.rb', line 115
def self.scrub(string)
  Converter.scrub(string)
end
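The difference between replacing and removing invalid bytes can be illustrated with Ruby's built-in String#scrub. This is a stand-in only: the implementations of Converter.normalize and Converter.scrub are not shown on this page and may handle more cases (for example, source-encoding detection).

```ruby
# A UTF-8 string containing two invalid bytes: a dangling lead byte (0xC3)
# and 0xFF, which can never appear in valid UTF-8.
bad = "caf\xC3 ok\xFF".dup.force_encoding(Encoding::UTF_8)

# Replace-style cleanup (the behaviour described for normalize):
replaced = bad.scrub("\uFFFD") # => "caf\uFFFD ok\uFFFD"

# Remove-style cleanup (the behaviour described for scrub):
removed = bad.scrub('')        # => "caf ok"
```

Replacement keeps a visible marker where data was lost, which helps when auditing imports; removal yields cleaner text but hides the fact that bytes were dropped.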
.strip_bom(string) ⇒ String
Remove a byte order mark from the beginning of a string.
# File 'lib/philiprehberger/encoding_kit.rb', line 178
def self.strip_bom(string)
  bytes = string.b
  BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
    if bytes.start_with?(bom)
      result = bytes[bom.bytesize..]
      return result.force_encoding(string.encoding)
    end
  end
  string.dup
end
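Both bom? and strip_bom walk the BOMS table and compare raw byte prefixes. The sketch below reproduces that with an assumed table: SKETCH_BOMS lists the standard Unicode BOM signatures, but the actual contents and ordering of Detector::BOMS are not shown on this page and may differ.

```ruby
# Assumed BOM table (standard Unicode signatures), for illustration only.
# The four-byte UTF-32 signatures are listed before the two-byte UTF-16
# ones because the UTF-32LE BOM begins with the UTF-16LE BOM.
SKETCH_BOMS = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xEF\xBB\xBF".b     => Encoding::UTF_8,
  "\xFE\xFF".b         => Encoding::UTF_16BE,
  "\xFF\xFE".b         => Encoding::UTF_16LE
}.freeze

def sketch_bom?(string)
  bytes = string.b
  SKETCH_BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end

def sketch_strip_bom(string)
  bytes = string.b
  SKETCH_BOMS.each_key do |bom|
    # Drop the prefix and restore the caller's original encoding label.
    return bytes[bom.bytesize..].force_encoding(string.encoding) if bytes.start_with?(bom)
  end
  string.dup
end

with_bom = "\xEF\xBB\xBFhello"
sketch_bom?(with_bom)       # => true
sketch_strip_bom(with_bom)  # => "hello"
sketch_bom?('hello')        # => false

# UTF-32LE "A" with BOM: all four BOM bytes are removed, leaving 4 bytes.
utf32 = "\xFF\xFE\x00\x00A\x00\x00\x00".b
sketch_strip_bom(utf32).bytesize # => 4
```

Iteration order matters in a table like this: if the UTF-16LE entry came first, a UTF-32LE input would lose only two bytes of its four-byte BOM.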
.to_utf8(string, from: nil, strip_bom: false) ⇒ String
Convert a string to UTF-8, auto-detecting source encoding if not specified.
# File 'lib/philiprehberger/encoding_kit.rb', line 95
def self.to_utf8(string, from: nil, strip_bom: false)
  Converter.to_utf8(string, from: from, strip_bom: strip_bom)
end
.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String
Transcode a string to the target encoding, auto-detecting the source. Simpler API for the most common conversion pattern.
# File 'lib/philiprehberger/encoding_kit.rb', line 166
def self.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  detected = Detector.call(string)
  source = detected.encoding
  target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)

  Converter.convert(string, from: source, to: target, fallback: fallback, replace: replace)
end
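As a rough picture of what a replace-style fallback means, the sketch below uses Ruby's String#encode directly. It rests on two assumptions: that fallback: :replace corresponds to the invalid: :replace and undef: :replace options of String#encode (Converter.convert is not shown, so this is a guess at the semantics), and that the source encoding is passed explicitly because detection is not reproduced here. The method names are hypothetical.

```ruby
# Same target resolution as in .transcode: accept an Encoding or a name.
def sketch_resolve(to)
  to.is_a?(Encoding) ? to : Encoding.find(to.to_s)
end

# Assumed semantics of a replace-style fallback, using String#encode.
# The source encoding is explicit; auto-detection is out of scope here.
def sketch_transcode(string, from:, to: Encoding::UTF_8, replace: '?')
  string.encode(sketch_resolve(to), from,
                invalid: :replace, undef: :replace, replace: replace)
end

# "é" exists in ISO-8859-1, the snowman (U+2603) does not.
latin = sketch_transcode("caf\u00E9 \u2603", from: Encoding::UTF_8, to: 'ISO-8859-1')
latin.encoding # => #<Encoding:ISO-8859-1>
latin.bytes    # => [99, 97, 102, 233, 32, 63]  ("caf", 0xE9, space, "?")
```

The default replacement of '?' is a sensible choice for legacy targets, since U+FFFD itself cannot be represented in most single-byte encodings.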
.valid?(string, encoding: nil) ⇒ Boolean
Check if a string is valid in the given encoding (or its current encoding).
# File 'lib/philiprehberger/encoding_kit.rb', line 138
def self.valid?(string, encoding: nil)
  if encoding
    enc = Encoding.find(encoding.to_s)
    string.dup.force_encoding(enc).valid_encoding?
  else
    string.valid_encoding?
  end
end
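The check relabels a copy of the bytes and asks Ruby whether they form a valid sequence in that encoding. A standalone copy of the logic (sketch_valid? is a hypothetical name) makes the behaviour easy to try:

```ruby
# Standalone copy of the validity check shown above.
def sketch_valid?(string, encoding: nil)
  return string.valid_encoding? unless encoding

  # dup so the caller's string keeps its original encoding label.
  string.dup.force_encoding(Encoding.find(encoding.to_s)).valid_encoding?
end

raw = "\xE9t\xE9".b # the bytes of "été" in ISO-8859-1

sketch_valid?(raw, encoding: 'UTF-8')      # => false, 0xE9 starts an incomplete sequence
sketch_valid?(raw, encoding: 'ISO-8859-1') # => true
sketch_valid?(raw)                         # => true, any bytes are valid as binary
raw.encoding                               # => #<Encoding:ASCII-8BIT>, input is untouched
```

A true result only means the bytes are well-formed in that encoding, not that the encoding is the right one: single-byte encodings such as ISO-8859-1 accept every byte sequence, so validity is most informative for multi-byte encodings like UTF-8.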