Module: Philiprehberger::EncodingKit
- Defined in:
- lib/philiprehberger/encoding_kit.rb,
lib/philiprehberger/encoding_kit/version.rb,
lib/philiprehberger/encoding_kit/detector.rb,
lib/philiprehberger/encoding_kit/converter.rb,
lib/philiprehberger/encoding_kit/detection_result.rb
Defined Under Namespace
Modules: Converter, Detector
Classes: DetectionResult, Error
Constant Summary
- BOMS =
BOM signatures (re-exported for public use)
Detector::BOMS
- FILENAME_ENCODING_HINTS =
Filename suffix / extension hints that imply a specific encoding. Matched against the final three dot-separated tokens of the filename (the extension plus up to two preceding tokens).
    {
      'utf8' => Encoding::UTF_8,
      'utf-8' => Encoding::UTF_8,
      'utf16' => Encoding::UTF_16,
      'utf-16' => Encoding::UTF_16,
      'utf16le' => Encoding::UTF_16LE,
      'utf-16le' => Encoding::UTF_16LE,
      'utf16be' => Encoding::UTF_16BE,
      'utf-16be' => Encoding::UTF_16BE,
      'utf32' => Encoding::UTF_32,
      'utf-32' => Encoding::UTF_32,
      'ascii' => Encoding::US_ASCII,
      'us-ascii' => Encoding::US_ASCII,
      'latin1' => Encoding::ISO_8859_1,
      'latin-1' => Encoding::ISO_8859_1,
      'iso88591' => Encoding::ISO_8859_1,
      'iso-8859-1' => Encoding::ISO_8859_1,
      'iso88592' => Encoding::ISO_8859_2,
      'iso-8859-2' => Encoding::ISO_8859_2,
      'cp1252' => Encoding::Windows_1252,
      'windows1252' => Encoding::Windows_1252,
      'windows-1252' => Encoding::Windows_1252,
      'sjis' => Encoding::Shift_JIS,
      'shiftjis' => Encoding::Shift_JIS,
      'shift-jis' => Encoding::Shift_JIS,
      'shift_jis' => Encoding::Shift_JIS,
      'euc-jp' => Encoding::EUC_JP,
      'eucjp' => Encoding::EUC_JP,
      'gbk' => Encoding::GBK,
      'gb2312' => Encoding::GB2312,
      'big5' => Encoding::Big5
    }.freeze
- VERSION =
'0.4.0'
Class Method Summary
-
.analyze(string) ⇒ Hash
Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.
-
.bom?(string) ⇒ Boolean
Check if a string starts with a byte order mark.
-
.convert(string, from:, to:) ⇒ String
Convert a string between encodings.
-
.detect(string) ⇒ DetectionResult
Detect the encoding of a string via BOM and heuristics.
-
.detect_file(path, sample_size: 4096) ⇒ DetectionResult
Detect the encoding of a file by reading a byte sample.
-
.detect_stream(io, sample_size: 4096) ⇒ DetectionResult
Detect encoding from an IO stream by reading a sample of bytes.
-
.file_valid?(path, encoding: nil) ⇒ Boolean
Check if a file’s content is valid in the detected or specified encoding.
-
.guess_from_filename(filename) ⇒ Encoding?
Guess the encoding based on filename suffixes/extensions alone.
-
.normalize(string) ⇒ String
Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).
-
.read_as_utf8(path, from: nil) ⇒ String
Read a file and return its content as UTF-8.
-
.strip_bom(string) ⇒ String
Remove a byte order mark from the beginning of a string.
-
.to_utf8(string, from: nil) ⇒ String
Convert a string to UTF-8, auto-detecting source encoding if not specified.
-
.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String
Transcode a string to the target encoding, auto-detecting the source.
-
.valid?(string, encoding: nil) ⇒ Boolean
Check if a string is valid in the given encoding (or its current encoding).
Class Method Details
.analyze(string) ⇒ Hash
Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.
    # File 'lib/philiprehberger/encoding_kit.rb', line 51
    def self.analyze(string)
      bytes = string.b
      total = bytes.bytesize.to_f

      if total.zero?
        return {
          encoding: Encoding::BINARY,
          confidence: 0.5,
          printable_ratio: 0.0,
          ascii_ratio: 0.0,
          high_bytes: 0,
          candidates: [{ encoding: Encoding::BINARY, confidence: 0.5 }]
        }
      end

      ascii_count = 0
      printable_count = 0
      high_byte_count = 0

      bytes.each_byte do |b|
        ascii_count += 1 if b < 128
        printable_count += 1 if (0x20..0x7E).cover?(b) || b == 0x09 || b == 0x0A || b == 0x0D
        high_byte_count += 1 if b >= 128
      end

      primary = Detector.call(bytes)
      candidates = build_candidates(bytes, primary)

      {
        encoding: primary.encoding,
        confidence: primary.confidence,
        printable_ratio: (printable_count / total).round(4),
        ascii_ratio: (ascii_count / total).round(4),
        high_bytes: high_byte_count,
        candidates: candidates
      }
    end
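The byte-counting part of the method can be tried in isolation. The following is a minimal standalone sketch, not the library's code path: `byte_stats` is a hypothetical helper name, and it leaves out the `Detector.call` and `build_candidates` steps, which are not shown on this page.

```ruby
# Hypothetical helper that reproduces only the ratio calculations
# from .analyze; detection and candidate ranking are omitted.
def byte_stats(string)
  bytes = string.b
  total = bytes.bytesize.to_f
  return { printable_ratio: 0.0, ascii_ratio: 0.0, high_bytes: 0 } if total.zero?

  ascii = 0
  printable = 0
  high = 0
  bytes.each_byte do |b|
    ascii += 1 if b < 128
    # Printable ASCII plus tab, line feed, and carriage return.
    printable += 1 if (0x20..0x7E).cover?(b) || [0x09, 0x0A, 0x0D].include?(b)
    high += 1 if b >= 128
  end

  {
    printable_ratio: (printable / total).round(4),
    ascii_ratio: (ascii / total).round(4),
    high_bytes: high
  }
end

# "café\n" in UTF-8 is 6 bytes: c, a, f, 0xC3, 0xA9, and a line feed.
stats = byte_stats("caf\u00E9\n")
puts stats[:high_bytes]   # 2
puts stats[:ascii_ratio]  # 0.6667
```

Because the counts are taken over raw bytes, a single non-ASCII character in UTF-8 contributes two or more high bytes, so `high_bytes` is a byte count rather than a character count.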
.bom?(string) ⇒ Boolean
Check if a string starts with a byte order mark.
    # File 'lib/philiprehberger/encoding_kit.rb', line 167
    def self.bom?(string)
      bytes = string.b
      BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
    end
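The prefix check can be illustrated with a small standalone sketch. The table `SAMPLE_BOMS` and the method `sample_bom?` are hypothetical names; the actual contents and ordering of `Detector::BOMS` are not shown on this page and may differ.

```ruby
# Assumed BOM table, for illustration only.
SAMPLE_BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b => Encoding::UTF_16LE,
  "\xFE\xFF".b => Encoding::UTF_16BE
}.freeze

# Compare on a binary copy so the string's declared encoding is irrelevant.
def sample_bom?(string)
  bytes = string.b
  SAMPLE_BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end

with_bom = "\xEF\xBB\xBFhello".b
puts sample_bom?(with_bom)  # true
puts sample_bom?('hello')   # false
```

Working on `string.b` means the check gives the same answer whether the input is tagged as UTF-8, binary, or anything else.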
.convert(string, from:, to:) ⇒ String
Convert a string between encodings.
    # File 'lib/philiprehberger/encoding_kit.rb', line 127
    def self.convert(string, from:, to:)
      Converter.convert(string, from: from, to: to)
    end
.detect(string) ⇒ DetectionResult
Detect the encoding of a string via BOM and heuristics. Returns a DetectionResult that delegates to the underlying Encoding, so it can be compared directly (e.g., result == Encoding::UTF_8) while also providing a confidence score via result.confidence.
    # File 'lib/philiprehberger/encoding_kit.rb', line 22
    def self.detect(string)
      Detector.call(string)
    end
.detect_file(path, sample_size: 4096) ⇒ DetectionResult
Detect the encoding of a file by reading a byte sample.
    # File 'lib/philiprehberger/encoding_kit.rb', line 177
    def self.detect_file(path, sample_size: 4096)
      File.open(path, 'rb') do |file|
        detect_stream(file, sample_size: sample_size)
      end
    end
.detect_stream(io, sample_size: 4096) ⇒ DetectionResult
Detect encoding from an IO stream by reading a sample of bytes. The IO position is restored after reading (if the IO supports seek).
    # File 'lib/philiprehberger/encoding_kit.rb', line 32
    def self.detect_stream(io, sample_size: 4096)
      original_pos = io.respond_to?(:pos) ? io.pos : nil
      sample = io.read(sample_size)

      if original_pos && io.respond_to?(:seek)
        io.seek(original_pos)
      end

      return DetectionResult.new(Encoding::BINARY, 0.5) if sample.nil? || sample.empty?

      Detector.call(sample)
    end
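The read-then-rewind behaviour is easy to observe with a `StringIO`. This sketch isolates that pattern under a hypothetical helper name, `peek_sample`, and does not call the detector.

```ruby
require 'stringio'

# Hypothetical helper: read a sample, then restore the original position
# when the IO supports it, mirroring the pattern in .detect_stream.
def peek_sample(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)
  io.seek(original_pos) if original_pos && io.respond_to?(:seek)
  sample
end

io = StringIO.new("header\nbody\n")
io.read(7)                                # advance past the first line
sample = peek_sample(io, sample_size: 4)

puts sample   # body
puts io.pos   # 7
```

The sample is taken from the current position, not from the start of the stream, so a caller that has already consumed part of the IO gets a detection result for the remainder only.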
.file_valid?(path, encoding: nil) ⇒ Boolean
Check if a file’s content is valid in the detected or specified encoding.
    # File 'lib/philiprehberger/encoding_kit.rb', line 199
    def self.file_valid?(path, encoding: nil)
      raw = File.binread(path)
      valid?(raw, encoding: encoding)
    end
.guess_from_filename(filename) ⇒ Encoding?
Guess the encoding based on filename suffixes/extensions alone. Useful when a file name carries an explicit encoding hint (e.g., “data.utf8.csv”, “legacy.latin1.txt”). Falls back to nil when no hint can be extracted — callers should then use detect_file to inspect the bytes.
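Matching is case-insensitive and considers the final three dot-separated tokens of the basename (the extension plus up to two preceding tokens); the rightmost recognizable hint wins. For short names this window includes the stem, so "utf8.csv" also yields UTF-8.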
    # File 'lib/philiprehberger/encoding_kit.rb', line 250
    def self.guess_from_filename(filename)
      name = File.basename(filename.to_s).downcase
      tokens = name.split('.').last(3) # extension + up to two modifiers
      tokens.reverse_each do |token|
        enc = FILENAME_ENCODING_HINTS[token]
        return enc if enc
      end
      nil
    end
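The token window and the rightmost-wins rule can be checked with a standalone sketch. `SAMPLE_HINTS` is a three-entry subset of the hint table and `sample_guess` is a hypothetical name; the lookup logic follows the listing above.

```ruby
# Small subset of the hint table, for illustration only.
SAMPLE_HINTS = {
  'utf8' => Encoding::UTF_8,
  'latin1' => Encoding::ISO_8859_1,
  'sjis' => Encoding::Shift_JIS
}.freeze

def sample_guess(filename)
  # Only the last three dot-separated tokens of the basename are considered.
  tokens = File.basename(filename.to_s).downcase.split('.').last(3)
  tokens.reverse_each do |token|
    enc = SAMPLE_HINTS[token]
    return enc if enc
  end
  nil
end

p sample_guess('data.utf8.csv')           # #<Encoding:UTF-8>
p sample_guess('/tmp/LEGACY.Latin1.TXT')  # #<Encoding:ISO-8859-1>
p sample_guess('export.sjis.latin1.txt')  # #<Encoding:ISO-8859-1>
p sample_guess('a.utf8.b.c.csv')          # nil
p sample_guess('report.csv')              # nil
```

In the third call the tokens are scanned right to left, so `latin1` is found before `sjis`. In the fourth, `utf8` falls outside the three-token window and is ignored.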
.normalize(string) ⇒ String
Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).
    # File 'lib/philiprehberger/encoding_kit.rb', line 103
    def self.normalize(string)
      Converter.normalize(string)
    end
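`Converter.normalize` is not shown on this page, so its exact strategy is unknown. As a rough illustration of what replacing invalid bytes with U+FFFD looks like in plain Ruby, the sketch below uses the core `String#scrub` method; `sample_normalize` is a hypothetical name, and treating the input as UTF-8 up front is an assumption that the library may not share.

```ruby
# Assumption: interpret the bytes as UTF-8 and replace anything invalid.
# The real Converter.normalize may detect the source encoding first.
def sample_normalize(string)
  string.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

broken = "abc\xFFdef".b           # 0xFF can never appear in valid UTF-8
fixed = sample_normalize(broken)

puts fixed.valid_encoding?        # true
puts fixed == "abc\uFFFDdef"      # true
```

The `dup` keeps the caller's string untouched, since `force_encoding` would otherwise retag it in place.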
.read_as_utf8(path, from: nil) ⇒ String
Read a file and return its content as UTF-8. Auto-detects the source encoding unless specified via `from:`.
    # File 'lib/philiprehberger/encoding_kit.rb', line 189
    def self.read_as_utf8(path, from: nil)
      raw = File.binread(path)
      to_utf8(raw, from: from)
    end
.strip_bom(string) ⇒ String
Remove a byte order mark from the beginning of a string.
    # File 'lib/philiprehberger/encoding_kit.rb', line 152
    def self.strip_bom(string)
      bytes = string.b
      BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
        if bytes.start_with?(bom)
          result = bytes[bom.bytesize..]
          return result.force_encoding(string.encoding)
        end
      end
      string.dup
    end
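A single-BOM version of the same idea shows that the result keeps the input's encoding. This is a simplified sketch: `UTF8_BOM` and `strip_utf8_bom` are hypothetical names, and only the UTF-8 mark is handled.

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b

# Strip a leading UTF-8 BOM on the byte level, then restore the
# original encoding tag on the remainder.
def strip_utf8_bom(string)
  bytes = string.b
  return string.dup unless bytes.start_with?(UTF8_BOM)

  bytes[UTF8_BOM.bytesize..].force_encoding(string.encoding)
end

text = "\uFEFFname,value"          # U+FEFF encodes to EF BB BF in UTF-8
stripped = strip_utf8_bom(text)

puts stripped            # name,value
puts stripped.encoding   # UTF-8
```

One thing the sketch sidesteps: with a multi-entry table, the UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE), so iteration order decides which one is removed. Whether `Detector::BOMS` is ordered longest-first is not shown on this page.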
.to_utf8(string, from: nil) ⇒ String
Convert a string to UTF-8, auto-detecting source encoding if not specified.
    # File 'lib/philiprehberger/encoding_kit.rb', line 94
    def self.to_utf8(string, from: nil)
      Converter.to_utf8(string, from: from)
    end
.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String
Transcode a string to the target encoding, auto-detecting the source. Simpler API for the most common conversion pattern.
    # File 'lib/philiprehberger/encoding_kit.rb', line 140
    def self.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?')
      detected = Detector.call(string)
      source = detected.encoding
      target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)
      Converter.convert(string, from: source, to: target,
                                fallback: fallback, replace: replace)
    end
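How `Converter.convert` interprets `fallback:` and `replace:` is not shown on this page. The sketch below shows what a replace-on-failure conversion looks like with Ruby's core `String#encode`, which is one plausible reading. `sample_transcode` is a hypothetical name, and the source encoding is passed explicitly here rather than auto-detected.

```ruby
# Assumption: a :replace fallback behaves like String#encode with
# invalid: :replace and undef: :replace. The library may differ.
def sample_transcode(string, from:, to: Encoding::UTF_8, replace: '?')
  target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)
  string.dup.force_encoding(from)
        .encode(target, invalid: :replace, undef: :replace, replace: replace)
end

latin1_bytes = "caf\xE9".b   # "café" in ISO-8859-1
utf8 = sample_transcode(latin1_bytes, from: Encoding::ISO_8859_1)
ascii = sample_transcode(utf8, from: Encoding::UTF_8, to: 'US-ASCII')

puts utf8    # café
puts ascii   # caf?
```

The second call shows the replacement string in action: "é" has no US-ASCII equivalent, so it is swapped for `?` instead of raising `Encoding::UndefinedConversionError`.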
.valid?(string, encoding: nil) ⇒ Boolean
Check if a string is valid in the given encoding (or its current encoding).
    # File 'lib/philiprehberger/encoding_kit.rb', line 112
    def self.valid?(string, encoding: nil)
      if encoding
        enc = Encoding.find(encoding.to_s)
        string.dup.force_encoding(enc).valid_encoding?
      else
        string.valid_encoding?
      end
    end
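The same bytes can be valid in one encoding and invalid in another, which is what the `encoding:` argument is for. A standalone sketch of the logic above, under the hypothetical name `sample_valid?`:

```ruby
def sample_valid?(string, encoding: nil)
  return string.valid_encoding? unless encoding

  # Work on a copy so the caller's string keeps its encoding tag.
  string.dup.force_encoding(Encoding.find(encoding.to_s)).valid_encoding?
end

bytes = "caf\xE9".b   # "café" as ISO-8859-1 bytes

puts sample_valid?(bytes, encoding: 'UTF-8')               # false
puts sample_valid?(bytes, encoding: Encoding::ISO_8859_1)  # true
puts sample_valid?(bytes)                                  # true
```

The last call returns true because the string is tagged as binary, and every byte sequence is valid in that encoding. A true result without `encoding:` therefore says little about strings that came from `File.binread`.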