Module: Philiprehberger::EncodingKit

Defined in:
lib/philiprehberger/encoding_kit.rb,
lib/philiprehberger/encoding_kit/version.rb,
lib/philiprehberger/encoding_kit/detector.rb,
lib/philiprehberger/encoding_kit/converter.rb,
lib/philiprehberger/encoding_kit/detection_result.rb

Defined Under Namespace

Modules: Converter, Detector

Classes: DetectionResult, Error

Constant Summary

BOMS =

BOM signatures (re-exported for public use)

Detector::BOMS
LINE_ENDINGS =
{ lf: "\n", crlf: "\r\n", cr: "\r" }.freeze
FILENAME_ENCODING_HINTS =

Filename suffix / extension hints that imply a specific encoding. Matched against the last three dot-separated tokens of the filename (the extension plus up to two tokens before it).

{
  'utf8' => Encoding::UTF_8,
  'utf-8' => Encoding::UTF_8,
  'utf16' => Encoding::UTF_16,
  'utf-16' => Encoding::UTF_16,
  'utf16le' => Encoding::UTF_16LE,
  'utf-16le' => Encoding::UTF_16LE,
  'utf16be' => Encoding::UTF_16BE,
  'utf-16be' => Encoding::UTF_16BE,
  'utf32' => Encoding::UTF_32,
  'utf-32' => Encoding::UTF_32,
  'ascii' => Encoding::US_ASCII,
  'us-ascii' => Encoding::US_ASCII,
  'latin1' => Encoding::ISO_8859_1,
  'latin-1' => Encoding::ISO_8859_1,
  'iso88591' => Encoding::ISO_8859_1,
  'iso-8859-1' => Encoding::ISO_8859_1,
  'iso88592' => Encoding::ISO_8859_2,
  'iso-8859-2' => Encoding::ISO_8859_2,
  'cp1252' => Encoding::Windows_1252,
  'windows1252' => Encoding::Windows_1252,
  'windows-1252' => Encoding::Windows_1252,
  'sjis' => Encoding::Shift_JIS,
  'shiftjis' => Encoding::Shift_JIS,
  'shift-jis' => Encoding::Shift_JIS,
  'shift_jis' => Encoding::Shift_JIS,
  'euc-jp' => Encoding::EUC_JP,
  'eucjp' => Encoding::EUC_JP,
  'gbk' => Encoding::GBK,
  'gb2312' => Encoding::GB2312,
  'big5' => Encoding::Big5
}.freeze
VERSION =
'0.5.0'

Class Method Details

.analyze(string) ⇒ Hash

Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.

Parameters:

  • string (String)

    the input string

Returns:

  • (Hash)

    analysis results with keys :encoding, :confidence, :printable_ratio, :ascii_ratio, :high_bytes, :candidates



# File 'lib/philiprehberger/encoding_kit.rb', line 51

def self.analyze(string)
  bytes = string.b
  total = bytes.bytesize.to_f

  if total.zero?
    return {
      encoding: Encoding::BINARY,
      confidence: 0.5,
      printable_ratio: 0.0,
      ascii_ratio: 0.0,
      high_bytes: 0,
      candidates: [{ encoding: Encoding::BINARY, confidence: 0.5 }]
    }
  end

  ascii_count = 0
  printable_count = 0
  high_byte_count = 0

  bytes.each_byte do |b|
    ascii_count += 1 if b < 128
    printable_count += 1 if (0x20..0x7E).cover?(b) || b == 0x09 || b == 0x0A || b == 0x0D
    high_byte_count += 1 if b >= 128
  end

  primary = Detector.call(bytes)
  candidates = build_candidates(bytes, primary)

  {
    encoding: primary.encoding,
    confidence: primary.confidence,
    printable_ratio: (printable_count / total).round(4),
    ascii_ratio: (ascii_count / total).round(4),
    high_bytes: high_byte_count,
    candidates: candidates
  }
end
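
The counting loop in the listing above can be tried in isolation. The following is a minimal standalone sketch that reproduces only the byte-statistics part of analyze with core Ruby. The method name byte_stats is hypothetical and not part of EncodingKit; detection and candidate ranking are left out.

```ruby
# Hypothetical helper (not part of EncodingKit): reproduces only the
# byte-counting portion of the analyze listing.
def byte_stats(string)
  bytes = string.b
  total = bytes.bytesize.to_f
  return { printable_ratio: 0.0, ascii_ratio: 0.0, high_bytes: 0 } if total.zero?

  ascii = 0
  printable = 0
  high = 0
  bytes.each_byte do |b|
    ascii += 1 if b < 128
    # Printable ASCII plus tab, line feed, and carriage return.
    printable += 1 if (0x20..0x7E).cover?(b) || [0x09, 0x0A, 0x0D].include?(b)
    high += 1 if b >= 128
  end

  {
    printable_ratio: (printable / total).round(4),
    ascii_ratio: (ascii / total).round(4),
    high_bytes: high
  }
end

# "é" (U+00E9) occupies two bytes in UTF-8, so this string is six bytes long:
# four ASCII bytes and two high bytes.
stats = byte_stats("h\u00E9llo")
```

Because the ratios are computed over bytes rather than characters, a single non-ASCII character can contribute several high bytes.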

.bom?(string) ⇒ Boolean

Check if a string starts with a byte order mark.

Parameters:

  • string (String)

    the input string

Returns:

  • (Boolean)


# File 'lib/philiprehberger/encoding_kit.rb', line 192

def self.bom?(string)
  bytes = string.b
  BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end

.convert(string, from:, to:) ⇒ String

Convert a string between encodings.

Parameters:

  • string (String)

    the input string

  • from (String, Encoding)

    source encoding

  • to (String, Encoding)

    target encoding

Returns:

  • (String)

    the converted string



# File 'lib/philiprehberger/encoding_kit.rb', line 152

def self.convert(string, from:, to:)
  Converter.convert(string, from: from, to: to)
end

.detect(string) ⇒ DetectionResult

Detect the encoding of a string via BOM and heuristics. Returns a DetectionResult that delegates to the underlying Encoding, so it can be compared directly (e.g., result == Encoding::UTF_8) while also providing a confidence score via result.confidence.

Parameters:

  • string (String)

    the input string

Returns:

  • (DetectionResult)

# File 'lib/philiprehberger/encoding_kit.rb', line 22

def self.detect(string)
  Detector.call(string)
end
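
How a result object can compare equal to an Encoding while still carrying a confidence score is easiest to see in a small stand-in class. The sketch below is hypothetical: SketchResult is not the library's DetectionResult, and the real class may implement delegation differently. It only illustrates the pattern the description names.

```ruby
# Hypothetical stand-in for DetectionResult (an assumption, not the library's code).
class SketchResult
  attr_reader :encoding, :confidence

  def initialize(encoding, confidence)
    @encoding = encoding
    @confidence = confidence
  end

  # Compare against another result or directly against an Encoding.
  def ==(other)
    other.is_a?(SketchResult) ? encoding == other.encoding : encoding == other
  end

  # Forward everything else (name, ascii_compatible?, ...) to the Encoding.
  def respond_to_missing?(name, include_private = false)
    encoding.respond_to?(name, include_private) || super
  end

  def method_missing(name, *args, &block)
    encoding.respond_to?(name) ? encoding.public_send(name, *args, &block) : super
  end
end

result = SketchResult.new(Encoding::UTF_8, 0.9)
is_utf8 = (result == Encoding::UTF_8)
encoding_name = result.name # forwarded to Encoding#name
```

In this sketch the comparison only works with the result on the left-hand side, because Encoding#== knows nothing about the wrapper.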

.detect_file(path, sample_size: 4096) ⇒ DetectionResult

Detect the encoding of a file by reading a byte sample.

Parameters:

  • path (String)

    path to the file

  • sample_size (Integer) (defaults to: 4096)

    number of bytes to sample (default: 4096)

Returns:

  • (DetectionResult)

# File 'lib/philiprehberger/encoding_kit.rb', line 202

def self.detect_file(path, sample_size: 4096)
  File.open(path, 'rb') do |file|
    detect_stream(file, sample_size: sample_size)
  end
end

.detect_stream(io, sample_size: 4096) ⇒ DetectionResult

Detect encoding from an IO stream by reading a sample of bytes. The IO position is restored after reading (if the IO supports seek).

Parameters:

  • io (IO, StringIO)

    the IO object to read from

  • sample_size (Integer) (defaults to: 4096)

    number of bytes to sample (default: 4096)

Returns:

  • (DetectionResult)

# File 'lib/philiprehberger/encoding_kit.rb', line 32

def self.detect_stream(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)

  if original_pos && io.respond_to?(:seek)
    io.seek(original_pos)
  end

  return DetectionResult.new(Encoding::BINARY, 0.5) if sample.nil? || sample.empty?

  Detector.call(sample)
end
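
The position handling in the listing above can be checked without the gem. The sketch below is a minimal, hypothetical helper (sample_and_rewind is not an EncodingKit method) that reads a sample from a StringIO and restores the read position, leaving detection itself out.

```ruby
require 'stringio'

# Hypothetical helper: read up to sample_size bytes, then put the IO back
# where it was, mirroring the seek logic in the detect_stream listing.
def sample_and_rewind(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)
  io.seek(original_pos) if original_pos && io.respond_to?(:seek)
  sample
end

io = StringIO.new('hello world')
io.read(6)                                     # advance past "hello "
sample = sample_and_rewind(io, sample_size: 3) # reads "wor"
position_after = io.pos                        # back at 6
rest = io.read                                 # the caller still sees "world"
```

Restoring the position means a caller can run detection on a stream and then read the same bytes again.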

.file_valid?(path, encoding: nil) ⇒ Boolean

Check if a file’s content is valid in the detected or specified encoding.

Parameters:

  • path (String)

    path to the file

  • encoding (String, Encoding, nil) (defaults to: nil)

    encoding to check against (auto-detect if nil)

Returns:

  • (Boolean)


# File 'lib/philiprehberger/encoding_kit.rb', line 224

def self.file_valid?(path, encoding: nil)
  raw = File.binread(path)
  valid?(raw, encoding: encoding)
end
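
As a rough illustration of what the check amounts to, the sketch below validates the raw bytes of a temporary file against two encodings using core Ruby only. The helper name bytes_valid_as? is hypothetical and the file contents are made up for the example.

```ruby
require 'tempfile'

# Hypothetical helper mirroring the explicit-encoding branch of valid?.
def bytes_valid_as?(raw, encoding)
  raw.dup.force_encoding(Encoding.find(encoding.to_s)).valid_encoding?
end

latin1_ok = nil
utf8_ok = nil
binary_ok = nil

Tempfile.create('encoding-sketch') do |file|
  file.binmode
  file.write("caf\xE9".b) # "café" encoded as ISO-8859-1
  file.flush

  raw = File.binread(file.path)
  latin1_ok = bytes_valid_as?(raw, 'ISO-8859-1')    # every byte is valid Latin-1
  utf8_ok   = bytes_valid_as?(raw, Encoding::UTF_8) # \xE9 starts an incomplete sequence
  binary_ok = raw.valid_encoding?                   # binary strings are always valid
end
```

File.binread returns a binary (ASCII-8BIT) string, and such a string always reports valid_encoding? as true. In the listing above, a nil encoding: passes that string to valid? unchanged, so supplying an explicit encoding: gives the more informative check.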

.guess_from_filename(filename) ⇒ Encoding?

Guess the encoding based on filename suffixes/extensions alone. Useful when a file name carries an explicit encoding hint (e.g., “data.utf8.csv”, “legacy.latin1.txt”). Falls back to nil when no hint can be extracted — callers should then use detect_file to inspect the bytes.

Matching is case-insensitive and considers the last three dot-separated tokens of the basename (the extension plus up to two tokens before it); the rightmost recognizable hint wins.

Parameters:

  • filename (String)

    filename or path

Returns:

  • (Encoding, nil)

    detected encoding or nil when no hint matches



# File 'lib/philiprehberger/encoding_kit.rb', line 275

def self.guess_from_filename(filename)
  name = File.basename(filename.to_s).downcase
  tokens = name.split('.').last(3) # extension + up to two modifiers
  tokens.reverse_each do |token|
    enc = FILENAME_ENCODING_HINTS[token]
    return enc if enc
  end
  nil
end
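
The token matching is simple enough to reproduce standalone. The sketch below uses a hypothetical, trimmed-down hint table (HINTS_SKETCH) and a hypothetical method name (guess_sketch); the lookup logic follows the listing above.

```ruby
# Hypothetical, trimmed-down hint table; the FILENAME_ENCODING_HINTS
# constant listed earlier has more entries.
HINTS_SKETCH = {
  'utf8' => Encoding::UTF_8,
  'latin1' => Encoding::ISO_8859_1,
  'sjis' => Encoding::Shift_JIS
}.freeze

def guess_sketch(filename)
  name = File.basename(filename.to_s).downcase
  # Only the last three dot-separated tokens are inspected, rightmost first.
  name.split('.').last(3).reverse_each do |token|
    enc = HINTS_SKETCH[token]
    return enc if enc
  end
  nil
end

from_path  = guess_sketch('exports/data.utf8.csv') # directory part is ignored
upper_case = guess_sketch('LEGACY.LATIN1.TXT')     # matching is case-insensitive
no_hint    = guess_sketch('notes.txt')             # nil: nothing recognizable
too_far    = guess_sketch('a.utf8.b.c.d')          # nil: hint lies outside the last three tokens
rightmost  = guess_sketch('x.latin1.utf8')         # the rightmost hint wins
```

A nil result is the cue to fall back to byte-level detection, as the description suggests.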

.normalize(string) ⇒ String

Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).

Parameters:

  • string (String)

    the input string

Returns:

  • (String)

    valid UTF-8 string



# File 'lib/philiprehberger/encoding_kit.rb', line 103

def self.normalize(string)
  Converter.normalize(string)
end

.normalize_line_endings(string, to: :lf) ⇒ String

Normalize line endings to a single canonical form.

Parameters:

  • string (String)

    the input string

  • to (Symbol) (defaults to: :lf)

    target line ending: `:lf`, `:crlf`, or `:cr`

Returns:

  • (String)

    string with normalized line endings

Raises:

  • (Error)

    if `to:` is not one of `:lf`, `:crlf`, or `:cr`



# File 'lib/philiprehberger/encoding_kit.rb', line 126

def self.normalize_line_endings(string, to: :lf)
  target = LINE_ENDINGS[to] or raise Error, "Unknown line ending: #{to.inspect} (expected :lf, :crlf, or :cr)"

  string.gsub(/\r\n|\r|\n/, target)
end
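
A standalone sketch of the same substitution follows. The names ENDINGS_SKETCH and normalize_endings_sketch are hypothetical, and the sketch raises ArgumentError where the library raises its own Error class.

```ruby
ENDINGS_SKETCH = { lf: "\n", crlf: "\r\n", cr: "\r" }.freeze

# Hypothetical stand-in for normalize_line_endings.
def normalize_endings_sketch(string, to: :lf)
  target = ENDINGS_SKETCH.fetch(to) do
    raise ArgumentError, "Unknown line ending: #{to.inspect}"
  end
  # \r\n is listed first in the alternation so a CRLF pair is replaced
  # once rather than as two separate line breaks.
  string.gsub(/\r\n|\r|\n/, target)
end

mixed = "one\r\ntwo\rthree\n"
unix = normalize_endings_sketch(mixed)
windows = normalize_endings_sketch(mixed, to: :crlf)
```

The order of alternatives in the regular expression matters: with `\r` tried before `\r\n`, a Windows line break would be counted twice.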

.read_as_utf8(path, from: nil) ⇒ String

Read a file and return its content as UTF-8. Auto-detects the source encoding unless specified via `from:`.

Parameters:

  • path (String)

    path to the file

  • from (String, Encoding, nil) (defaults to: nil)

    source encoding (auto-detect if nil)

Returns:

  • (String)

    UTF-8 encoded file content



# File 'lib/philiprehberger/encoding_kit.rb', line 214

def self.read_as_utf8(path, from: nil)
  raw = File.binread(path)
  to_utf8(raw, from: from)
end

.scrub(string) ⇒ String

Strip invalid bytes from a string, returning valid UTF-8.

Unlike normalize, which replaces invalid bytes with `�` (U+FFFD), this method removes them entirely.

Parameters:

  • string (String)

    the input string

Returns:

  • (String)

    valid UTF-8 string with invalid bytes removed



# File 'lib/philiprehberger/encoding_kit.rb', line 114

def self.scrub(string)
  Converter.scrub(string)
end
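
The difference between replacing and removing invalid bytes can be seen with core Ruby's String#scrub. This is only an illustration of the two behaviours described for normalize and scrub; how Converter implements them is not shown in this document, so the use of String#scrub here is an assumption.

```ruby
# "caf\xE9" is "café" in ISO-8859-1; reinterpreted as UTF-8, the final
# byte is an incomplete multi-byte sequence and therefore invalid.
broken = "caf\xE9".b.force_encoding(Encoding::UTF_8)

replaced = broken.scrub("\uFFFD") # normalize-style: substitute U+FFFD
removed  = broken.scrub('')       # scrub-style: drop the invalid byte
```

Replacement keeps a visible marker where data was lost, while removal yields shorter output with no trace of the damage.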

.strip_bom(string) ⇒ String

Remove a byte order mark from the beginning of a string.

Parameters:

  • string (String)

    the input string

Returns:

  • (String)

    the string without a BOM



# File 'lib/philiprehberger/encoding_kit.rb', line 177

def self.strip_bom(string)
  bytes = string.b
  BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
    if bytes.start_with?(bom)
      result = bytes[bom.bytesize..]
      return result.force_encoding(string.encoding)
    end
  end
  string.dup
end
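
The BOM check and removal can be sketched without the gem. The table below is hypothetical: the contents and ordering of Detector::BOMS are not shown in this document, so BOMS_SKETCH lists the standard Unicode signatures and orders them longest first as an assumption.

```ruby
# Hypothetical BOM table. Longer signatures come first because the
# UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.
BOMS_SKETCH = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b => Encoding::UTF_16BE,
  "\xFF\xFE".b => Encoding::UTF_16LE
}.freeze

def bom_sketch?(string)
  bytes = string.b
  BOMS_SKETCH.any? { |bom, _encoding| bytes.start_with?(bom) }
end

def strip_bom_sketch(string)
  bytes = string.b
  BOMS_SKETCH.each_key do |bom|
    next unless bytes.start_with?(bom)

    # Drop the signature bytes and restore the caller's encoding label.
    return bytes[bom.bytesize..].force_encoding(string.encoding)
  end
  string.dup
end

with_bom = "\xEF\xBB\xBFhello".b.force_encoding(Encoding::UTF_8)
stripped = strip_bom_sketch(with_bom)
untouched = strip_bom_sketch('hello')
```

Working on a binary copy keeps the prefix comparison byte-exact regardless of the encoding the input string is labelled with.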

.to_utf8(string, from: nil) ⇒ String

Convert a string to UTF-8, auto-detecting source encoding if not specified.

Parameters:

  • string (String)

    the input string

  • from (String, Encoding, nil) (defaults to: nil)

    source encoding (auto-detect if nil)

Returns:

  • (String)

    UTF-8 encoded string



# File 'lib/philiprehberger/encoding_kit.rb', line 94

def self.to_utf8(string, from: nil)
  Converter.to_utf8(string, from: from)
end

.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String

Transcode a string to the target encoding, auto-detecting the source encoding. This is a simpler API for the most common conversion pattern.

Parameters:

  • string (String)

    the input string

  • to (String, Encoding) (defaults to: Encoding::UTF_8)

    target encoding (default: UTF-8)

  • fallback (Symbol) (defaults to: :replace)

    fallback strategy (:replace or :raise)

  • replace (String) (defaults to: '?')

    replacement character for invalid bytes

Returns:

  • (String)

    the transcoded string

Raises:



# File 'lib/philiprehberger/encoding_kit.rb', line 165

def self.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  detected = Detector.call(string)
  source = detected.encoding
  target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)

  Converter.convert(string, from: source, to: target, fallback: fallback, replace: replace)
end
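
The effect of the fallback: and replace: options can be approximated with core Ruby's String#encode. The sketch below is an assumption about the behaviour, not the library's Converter: transcode_sketch is a hypothetical name, and the source encoding is passed explicitly because detection is not reproduced here.

```ruby
# Hypothetical stand-in: :replace substitutes unconvertible input,
# any other fallback lets String#encode raise.
def transcode_sketch(string, from:, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)
  options = fallback == :replace ? { invalid: :replace, undef: :replace, replace: replace } : {}
  string.encode(target, from, **options)
end

latin1 = "caf\xE9".b # "café" as ISO-8859-1 bytes
utf8 = transcode_sketch(latin1, from: Encoding::ISO_8859_1)

# Neither character exists in US-ASCII, so each becomes the replacement string.
ascii = transcode_sketch("\u65E5\u672C", from: Encoding::UTF_8, to: 'US-ASCII')
```

With fallback: :raise the same call would raise Encoding::UndefinedConversionError in this sketch; which exception the library raises is not stated in this document.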

.valid?(string, encoding: nil) ⇒ Boolean

Check if a string is valid in the given encoding (or its current encoding).

Parameters:

  • string (String)

    the input string

  • encoding (String, Encoding, nil) (defaults to: nil)

    encoding to check against (defaults to string’s encoding)

Returns:

  • (Boolean)


# File 'lib/philiprehberger/encoding_kit.rb', line 137

def self.valid?(string, encoding: nil)
  if encoding
    enc = Encoding.find(encoding.to_s)
    string.dup.force_encoding(enc).valid_encoding?
  else
    string.valid_encoding?
  end
end