Module: Philiprehberger::EncodingKit
Defined in:
- lib/philiprehberger/encoding_kit.rb
- lib/philiprehberger/encoding_kit/version.rb
- lib/philiprehberger/encoding_kit/detector.rb
- lib/philiprehberger/encoding_kit/converter.rb
- lib/philiprehberger/encoding_kit/detection_result.rb
Defined Under Namespace
Modules: Converter, Detector
Classes: DetectionResult, Error
Constant Summary
- BOMS =
BOM signatures (re-exported for public use)
Detector::BOMS
- LINE_ENDINGS =
{ lf: "\n", crlf: "\r\n", cr: "\r" }.freeze
- FILENAME_ENCODING_HINTS =
Filename suffix / extension hints that imply a specific encoding. Matched against the last three dot-separated tokens of the file's basename (the extension plus up to two preceding modifiers).
{
  'utf8' => Encoding::UTF_8,
  'utf-8' => Encoding::UTF_8,
  'utf16' => Encoding::UTF_16,
  'utf-16' => Encoding::UTF_16,
  'utf16le' => Encoding::UTF_16LE,
  'utf-16le' => Encoding::UTF_16LE,
  'utf16be' => Encoding::UTF_16BE,
  'utf-16be' => Encoding::UTF_16BE,
  'utf32' => Encoding::UTF_32,
  'utf-32' => Encoding::UTF_32,
  'ascii' => Encoding::US_ASCII,
  'us-ascii' => Encoding::US_ASCII,
  'latin1' => Encoding::ISO_8859_1,
  'latin-1' => Encoding::ISO_8859_1,
  'iso88591' => Encoding::ISO_8859_1,
  'iso-8859-1' => Encoding::ISO_8859_1,
  'iso88592' => Encoding::ISO_8859_2,
  'iso-8859-2' => Encoding::ISO_8859_2,
  'cp1252' => Encoding::Windows_1252,
  'windows1252' => Encoding::Windows_1252,
  'windows-1252' => Encoding::Windows_1252,
  'sjis' => Encoding::Shift_JIS,
  'shiftjis' => Encoding::Shift_JIS,
  'shift-jis' => Encoding::Shift_JIS,
  'shift_jis' => Encoding::Shift_JIS,
  'euc-jp' => Encoding::EUC_JP,
  'eucjp' => Encoding::EUC_JP,
  'gbk' => Encoding::GBK,
  'gb2312' => Encoding::GB2312,
  'big5' => Encoding::Big5
}.freeze
- VERSION =
'0.5.0'
Class Method Summary
- .analyze(string) ⇒ Hash
Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.
- .bom?(string) ⇒ Boolean
Check if a string starts with a byte order mark.
- .convert(string, from:, to:) ⇒ String
Convert a string between encodings.
- .detect(string) ⇒ DetectionResult
Detect the encoding of a string via BOM and heuristics.
- .detect_file(path, sample_size: 4096) ⇒ DetectionResult
Detect the encoding of a file by reading a byte sample.
- .detect_stream(io, sample_size: 4096) ⇒ DetectionResult
Detect encoding from an IO stream by reading a sample of bytes.
- .file_valid?(path, encoding: nil) ⇒ Boolean
Check if a file’s content is valid in the detected or specified encoding.
- .guess_from_filename(filename) ⇒ Encoding?
Guess the encoding based on filename suffixes/extensions alone.
- .normalize(string) ⇒ String
Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).
- .normalize_line_endings(string, to: :lf) ⇒ String
Normalize line endings to a single canonical form.
- .read_as_utf8(path, from: nil) ⇒ String
Read a file and return its content as UTF-8.
- .scrub(string) ⇒ String
Strip invalid bytes from a string, returning valid UTF-8.
- .strip_bom(string) ⇒ String
Remove a byte order mark from the beginning of a string.
- .to_utf8(string, from: nil) ⇒ String
Convert a string to UTF-8, auto-detecting source encoding if not specified.
- .transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String
Transcode a string to the target encoding, auto-detecting the source.
- .valid?(string, encoding: nil) ⇒ Boolean
Check if a string is valid in the given encoding (or its current encoding).
Class Method Details
.analyze(string) ⇒ Hash
Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.
# File 'lib/philiprehberger/encoding_kit.rb', line 51
def self.analyze(string)
  bytes = string.b
  total = bytes.bytesize.to_f

  if total.zero?
    return {
      encoding: Encoding::BINARY,
      confidence: 0.5,
      printable_ratio: 0.0,
      ascii_ratio: 0.0,
      high_bytes: 0,
      candidates: [{ encoding: Encoding::BINARY, confidence: 0.5 }]
    }
  end

  ascii_count = 0
  printable_count = 0
  high_byte_count = 0

  bytes.each_byte do |b|
    ascii_count += 1 if b < 128
    printable_count += 1 if (0x20..0x7E).cover?(b) || b == 0x09 || b == 0x0A || b == 0x0D
    high_byte_count += 1 if b >= 128
  end

  primary = Detector.call(bytes)
  candidates = build_candidates(bytes, primary)

  {
    encoding: primary.encoding,
    confidence: primary.confidence,
    printable_ratio: (printable_count / total).round(4),
    ascii_ratio: (ascii_count / total).round(4),
    high_bytes: high_byte_count,
    candidates: candidates
  }
end
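As a rough illustration of the counting loop above, the following standalone sketch computes the same three byte statistics without the gem. The helper name byte_stats is hypothetical, and the candidate ranking (Detector.call and build_candidates) is deliberately left out because its logic is not shown on this page.

```ruby
# Hypothetical helper mirroring only the byte-counting part of .analyze.
def byte_stats(string)
  bytes = string.b
  total = bytes.bytesize.to_f
  return { printable_ratio: 0.0, ascii_ratio: 0.0, high_bytes: 0 } if total.zero?

  ascii = 0
  printable = 0
  high = 0
  bytes.each_byte do |b|
    ascii += 1 if b < 128
    # Printable means visible ASCII plus tab, line feed and carriage return.
    printable += 1 if (0x20..0x7E).cover?(b) || [0x09, 0x0A, 0x0D].include?(b)
    high += 1 if b >= 128
  end

  {
    printable_ratio: (printable / total).round(4),
    ascii_ratio: (ascii / total).round(4),
    high_bytes: high
  }
end

# "café\n" is six bytes in UTF-8: four ASCII bytes and two high bytes for "é".
stats = byte_stats("caf\u00E9\n")
```

A high ascii_ratio with zero high_bytes suggests plain ASCII, while a non-zero high_bytes count is what makes the encoding candidates worth comparing.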
.bom?(string) ⇒ Boolean
Check if a string starts with a byte order mark.
# File 'lib/philiprehberger/encoding_kit.rb', line 192
def self.bom?(string)
  bytes = string.b
  BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end
.convert(string, from:, to:) ⇒ String
Convert a string between encodings.
# File 'lib/philiprehberger/encoding_kit.rb', line 152
def self.convert(string, from:, to:)
  Converter.convert(string, from: from, to: to)
end
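Converter.convert is not shown on this page, so the sketch below only illustrates what an explicit from/to conversion looks like with Ruby's built-in String#encode. It is an assumption that the gem behaves similarly for well-formed input.

```ruby
# Raw bytes for "café" as stored in ISO-8859-1 (0xE9 is "é").
latin1_bytes = "caf\xE9".b

# String#encode(destination, source) reinterprets the bytes as the source
# encoding and transcodes them to the destination encoding.
utf8 = latin1_bytes.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

# Converting back yields the original single-byte representation.
round_trip = utf8.encode(Encoding::ISO_8859_1)
```

Here utf8 holds "café" as valid UTF-8, and round_trip.bytes ends in 233 (0xE9) again.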
.detect(string) ⇒ DetectionResult
Detect the encoding of a string via BOM and heuristics. Returns a DetectionResult that delegates to the underlying Encoding, so it can be compared directly (e.g., result == Encoding::UTF_8) while also providing a confidence score via result.confidence.
# File 'lib/philiprehberger/encoding_kit.rb', line 22
def self.detect(string)
  Detector.call(string)
end
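The delegation described above can be pictured with a small stand-in class. SketchResult is hypothetical and built on SimpleDelegator from the standard library; the real DetectionResult may be implemented differently.

```ruby
require 'delegate'

# Hypothetical stand-in for DetectionResult: wraps an Encoding and adds a
# confidence score. Unknown methods are forwarded to the wrapped Encoding.
class SketchResult < SimpleDelegator
  attr_reader :confidence

  def initialize(encoding, confidence)
    super(encoding)
    @confidence = confidence
  end

  # Expose the wrapped object explicitly, as the source does via .encoding.
  def encoding
    __getobj__
  end
end

result = SketchResult.new(Encoding::UTF_8, 0.9)
same = (result == Encoding::UTF_8) # comparison is forwarded to the Encoding
name = result.name                 # forwarded call to Encoding#name
```

In this sketch the comparison only works with the result on the left-hand side; Encoding::UTF_8 == result would use Encoding's own identity comparison.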
.detect_file(path, sample_size: 4096) ⇒ DetectionResult
Detect the encoding of a file by reading a byte sample.
# File 'lib/philiprehberger/encoding_kit.rb', line 202
def self.detect_file(path, sample_size: 4096)
  File.open(path, 'rb') do |file|
    detect_stream(file, sample_size: sample_size)
  end
end
.detect_stream(io, sample_size: 4096) ⇒ DetectionResult
Detect encoding from an IO stream by reading a sample of bytes. The IO position is restored after reading (if the IO supports seek).
# File 'lib/philiprehberger/encoding_kit.rb', line 32
def self.detect_stream(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)

  if original_pos && io.respond_to?(:seek)
    io.seek(original_pos)
  end

  return DetectionResult.new(Encoding::BINARY, 0.5) if sample.nil? || sample.empty?

  Detector.call(sample)
end
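The position-restoring behaviour can be checked in isolation with a StringIO. The helper peek_sample below is hypothetical and covers only the read-and-rewind part of the method, not the detection itself.

```ruby
require 'stringio'

# Hypothetical helper: read up to sample_size bytes, then seek back to where
# the caller was, provided the IO supports pos and seek.
def peek_sample(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)
  io.seek(original_pos) if original_pos && io.respond_to?(:seek)
  sample
end

io = StringIO.new("header\nbody\n")
io.read(7)                               # the caller already consumed "header\n"
sample = peek_sample(io, sample_size: 4) # reads "body" from the current position
pos_after = io.pos                       # back at 7, not 11
rest = io.read                           # the caller still sees "body\n"
```

Because the sample starts at the current position rather than at byte 0, a caller that has already read past a BOM would not have it included in the sample.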
.file_valid?(path, encoding: nil) ⇒ Boolean
Check if a file’s content is valid in the detected or specified encoding.
# File 'lib/philiprehberger/encoding_kit.rb', line 224
def self.file_valid?(path, encoding: nil)
  raw = File.binread(path)
  valid?(raw, encoding: encoding)
end
.guess_from_filename(filename) ⇒ Encoding?
Guess the encoding based on filename suffixes/extensions alone. Useful when a file name carries an explicit encoding hint (e.g., “data.utf8.csv”, “legacy.latin1.txt”). Falls back to nil when no hint can be extracted; callers should then use detect_file to inspect the bytes.
Matching is case-insensitive and considers the last three dot-separated tokens of the basename (the extension plus up to two preceding modifiers); the rightmost recognizable hint wins.
# File 'lib/philiprehberger/encoding_kit.rb', line 275
def self.guess_from_filename(filename)
  name = File.basename(filename.to_s).downcase
  tokens = name.split('.').last(3) # extension + up to two modifiers
  tokens.reverse_each do |token|
    enc = FILENAME_ENCODING_HINTS[token]
    return enc if enc
  end
  nil
end
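The matching rules can be tried out with a reduced hint table. SKETCH_HINTS and sketch_guess are hypothetical names; the lookup logic follows the listing above.

```ruby
# A three-entry subset of the hint table, for illustration only.
SKETCH_HINTS = {
  'utf8' => Encoding::UTF_8,
  'latin1' => Encoding::ISO_8859_1,
  'sjis' => Encoding::Shift_JIS
}.freeze

def sketch_guess(filename)
  # Directory components are dropped and the name is lowercased first.
  tokens = File.basename(filename.to_s).downcase.split('.').last(3)
  # Walk from the rightmost token so the last recognizable hint wins.
  tokens.reverse_each do |token|
    enc = SKETCH_HINTS[token]
    return enc if enc
  end
  nil
end

upper    = sketch_guess('exports/DATA.UTF8.csv') # case-insensitive match on "utf8"
two_hint = sketch_guess('notes.latin1.sjis.txt') # "sjis" is further right than "latin1"
no_hint  = sketch_guess('report.csv')            # nothing recognizable
```

When the result is nil, as for report.csv, the bytes themselves have to be inspected instead.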
.normalize(string) ⇒ String
Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).
# File 'lib/philiprehberger/encoding_kit.rb', line 103
def self.normalize(string)
  Converter.normalize(string)
end
.normalize_line_endings(string, to: :lf) ⇒ String
Normalize line endings to a single canonical form.
# File 'lib/philiprehberger/encoding_kit.rb', line 126
def self.normalize_line_endings(string, to: :lf)
  target = LINE_ENDINGS[to] or
    raise Error, "Unknown line ending: #{to.inspect} (expected :lf, :crlf, or :cr)"
  string.gsub(/\r\n|\r|\n/, target)
end
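A standalone sketch of the same substitution, useful for seeing why the alternation lists \r\n first. The names are hypothetical, and ArgumentError stands in for the gem's own Error class.

```ruby
SKETCH_LINE_ENDINGS = { lf: "\n", crlf: "\r\n", cr: "\r" }.freeze

def sketch_normalize_line_endings(string, to: :lf)
  target = SKETCH_LINE_ENDINGS.fetch(to) do
    raise ArgumentError, "Unknown line ending: #{to.inspect}"
  end
  # "\r\n" must be tried before "\r" and "\n", otherwise a CRLF pair would be
  # matched as two separate line endings and doubled in the output.
  string.gsub(/\r\n|\r|\n/, target)
end

mixed = "a\r\nb\rc\n"
as_lf   = sketch_normalize_line_endings(mixed)            # "a\nb\nc\n"
as_crlf = sketch_normalize_line_endings(mixed, to: :crlf) # "a\r\nb\r\nc\r\n"
```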
.read_as_utf8(path, from: nil) ⇒ String
Read a file and return its content as UTF-8. Auto-detects the source encoding unless one is specified via from:.
# File 'lib/philiprehberger/encoding_kit.rb', line 214
def self.read_as_utf8(path, from: nil)
  raw = File.binread(path)
  to_utf8(raw, from: from)
end
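The binread-then-convert pattern can be sketched with the standard library alone. sketch_read_as_utf8 is a hypothetical helper that requires an explicit source encoding, since the gem's auto-detection path is not reproduced here.

```ruby
require 'tempfile'

# Hypothetical helper: read raw bytes, then transcode from a known encoding.
# File.binread avoids any newline or external-encoding conversion on read.
def sketch_read_as_utf8(path, from:)
  File.binread(path).encode(Encoding::UTF_8, from)
end

content = Tempfile.create(['legacy', '.latin1.txt']) do |file|
  file.binmode
  file.write("na\xEFve".b) # 0xEF is "ï" in ISO-8859-1
  file.flush
  sketch_read_as_utf8(file.path, from: Encoding::ISO_8859_1)
end
```

After the block, content is "naïve" encoded as UTF-8, and the temporary file has been removed.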
.scrub(string) ⇒ String
Strip invalid bytes from a string, returning valid UTF-8.
Unlike normalize, which replaces invalid bytes with U+FFFD (�), this method removes them entirely.
# File 'lib/philiprehberger/encoding_kit.rb', line 114
def self.scrub(string)
  Converter.scrub(string)
end
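The difference between replacing and removing invalid bytes can be seen with Ruby's built-in String#scrub. This is only an illustration of the two behaviours; how Converter.normalize and Converter.scrub are actually implemented is not shown on this page.

```ruby
# 0xFF can never appear in well-formed UTF-8, so this string is invalid.
broken = "abc\xFFdef"

replaced = broken.scrub("\uFFFD") # normalize-style: substitute U+FFFD
removed  = broken.scrub('')       # scrub-style: drop the offending byte
```

Both results are valid UTF-8, but replaced keeps a visible marker at the position of the damage while removed silently shortens the text.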
.strip_bom(string) ⇒ String
Remove a byte order mark from the beginning of a string.
# File 'lib/philiprehberger/encoding_kit.rb', line 177
def self.strip_bom(string)
  bytes = string.b
  BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
    if bytes.start_with?(bom)
      result = bytes[bom.bytesize..]
      return result.force_encoding(string.encoding)
    end
  end
  string.dup
end
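The BOM check and removal can be sketched with a stand-in signature table. SKETCH_BOMS is hypothetical: it lists the standard Unicode byte order marks, but the contents and ordering of the gem's Detector::BOMS are not shown on this page.

```ruby
# Longer signatures come first so that the UTF-32LE mark (FF FE 00 00) is not
# mistaken for the UTF-16LE mark (FF FE), which is its prefix.
SKETCH_BOMS = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b => Encoding::UTF_16BE,
  "\xFF\xFE".b => Encoding::UTF_16LE
}.freeze

def sketch_bom?(string)
  bytes = string.b
  SKETCH_BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end

def sketch_strip_bom(string)
  bytes = string.b
  SKETCH_BOMS.each_key do |bom|
    next unless bytes.start_with?(bom)

    # Slice on the binary copy, then restore the caller's encoding label.
    return bytes[bom.bytesize..].force_encoding(string.encoding)
  end
  string.dup
end

with_bom = "\xEF\xBB\xBFhello"
stripped = sketch_strip_bom(with_bom)
```

Working on string.b keeps the comparison byte-based, so the check does not depend on whether the input's encoding label is correct.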
.to_utf8(string, from: nil) ⇒ String
Convert a string to UTF-8, auto-detecting source encoding if not specified.
# File 'lib/philiprehberger/encoding_kit.rb', line 94
def self.to_utf8(string, from: nil)
  Converter.to_utf8(string, from: from)
end
.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String
Transcode a string to the target encoding, auto-detecting the source. Simpler API for the most common conversion pattern.
# File 'lib/philiprehberger/encoding_kit.rb', line 165
def self.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  detected = Detector.call(string)
  source = detected.encoding
  target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)
  Converter.convert(string, from: source, to: target,
                            fallback: fallback, replace: replace)
end
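Two details of the listing above can be illustrated with plain Ruby: resolving the target from a name, and substituting a replacement string for characters the target cannot represent. How Converter.convert maps fallback: and replace: onto Ruby's transcoding options is an assumption here, since its source is not shown.

```ruby
# Target resolution as in the listing: accept an Encoding or a name.
to = 'iso-8859-1'
target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)

# Replacement-style fallback using String#encode options. Neither "é" nor
# the snowman (U+2603) exists in US-ASCII, so each becomes the replacement.
text = "caf\u00E9 \u2603"
ascii = text.encode(Encoding::US_ASCII, invalid: :replace, undef: :replace, replace: '?')
```

Encoding.find is case-insensitive, so target is Encoding::ISO_8859_1, and ascii is "caf? ?".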
.valid?(string, encoding: nil) ⇒ Boolean
Check if a string is valid in the given encoding (or its current encoding).
# File 'lib/philiprehberger/encoding_kit.rb', line 137
def self.valid?(string, encoding: nil)
  if encoding
    enc = Encoding.find(encoding.to_s)
    string.dup.force_encoding(enc).valid_encoding?
  else
    string.valid_encoding?
  end
end
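The same check can be run without the gem to see how one byte sequence is valid in one encoding and invalid in another. sketch_valid? is a hypothetical copy of the logic above.

```ruby
def sketch_valid?(string, encoding: nil)
  return string.valid_encoding? unless encoding

  # dup first so the caller's string keeps its original encoding label.
  string.dup.force_encoding(Encoding.find(encoding.to_s)).valid_encoding?
end

bytes = "caf\xE9".b # "café" as ISO-8859-1 bytes

as_utf8   = sketch_valid?(bytes, encoding: 'UTF-8')                 # false: lone 0xE9
as_latin1 = sketch_valid?(bytes, encoding: Encoding::ISO_8859_1)    # true
unchanged = bytes.encoding                                          # still binary
```

Single-byte encodings such as ISO-8859-1 accept every byte value, so a true result for them says little about whether the guess is right.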