Module: Philiprehberger::EncodingKit

Defined in:
lib/philiprehberger/encoding_kit.rb,
lib/philiprehberger/encoding_kit/version.rb,
lib/philiprehberger/encoding_kit/detector.rb,
lib/philiprehberger/encoding_kit/converter.rb,
lib/philiprehberger/encoding_kit/detection_result.rb

Defined Under Namespace

Modules: Converter, Detector

Classes: DetectionResult, Error

Constant Summary

BOMS =

BOM signatures (re-exported for public use)

Detector::BOMS

LINE_ENDINGS =

{ lf: "\n", crlf: "\r\n", cr: "\r" }.freeze

FILENAME_ENCODING_HINTS =

Filename suffix / extension hints that imply a specific encoding. Matched against the last three dot-separated tokens of the filename (the extension plus up to two preceding modifiers).

{
  'utf8' => Encoding::UTF_8,
  'utf-8' => Encoding::UTF_8,
  'utf16' => Encoding::UTF_16,
  'utf-16' => Encoding::UTF_16,
  'utf16le' => Encoding::UTF_16LE,
  'utf-16le' => Encoding::UTF_16LE,
  'utf16be' => Encoding::UTF_16BE,
  'utf-16be' => Encoding::UTF_16BE,
  'utf32' => Encoding::UTF_32,
  'utf-32' => Encoding::UTF_32,
  'ascii' => Encoding::US_ASCII,
  'us-ascii' => Encoding::US_ASCII,
  'latin1' => Encoding::ISO_8859_1,
  'latin-1' => Encoding::ISO_8859_1,
  'iso88591' => Encoding::ISO_8859_1,
  'iso-8859-1' => Encoding::ISO_8859_1,
  'iso88592' => Encoding::ISO_8859_2,
  'iso-8859-2' => Encoding::ISO_8859_2,
  'cp1252' => Encoding::Windows_1252,
  'windows1252' => Encoding::Windows_1252,
  'windows-1252' => Encoding::Windows_1252,
  'sjis' => Encoding::Shift_JIS,
  'shiftjis' => Encoding::Shift_JIS,
  'shift-jis' => Encoding::Shift_JIS,
  'shift_jis' => Encoding::Shift_JIS,
  'euc-jp' => Encoding::EUC_JP,
  'eucjp' => Encoding::EUC_JP,
  'gbk' => Encoding::GBK,
  'gb2312' => Encoding::GB2312,
  'big5' => Encoding::Big5
}.freeze

VERSION =

'0.6.0'

Class Method Summary

Class Method Details

.analyze(string) ⇒ Hash

Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.

Parameters:

  • string (String)

    the input string

Returns:

  • (Hash)

    analysis results with keys :encoding, :confidence, :printable_ratio, :ascii_ratio, :high_bytes, :candidates



# File 'lib/philiprehberger/encoding_kit.rb', line 51

def self.analyze(string)
  bytes = string.b
  total = bytes.bytesize.to_f

  if total.zero?
    return {
      encoding: Encoding::BINARY,
      confidence: 0.5,
      printable_ratio: 0.0,
      ascii_ratio: 0.0,
      high_bytes: 0,
      candidates: [{ encoding: Encoding::BINARY, confidence: 0.5 }]
    }
  end

  ascii_count = 0
  printable_count = 0
  high_byte_count = 0

  bytes.each_byte do |b|
    ascii_count += 1 if b < 128
    printable_count += 1 if (0x20..0x7E).cover?(b) || b == 0x09 || b == 0x0A || b == 0x0D
    high_byte_count += 1 if b >= 128
  end

  primary = Detector.call(bytes)
  candidates = build_candidates(bytes, primary)

  {
    encoding: primary.encoding,
    confidence: primary.confidence,
    printable_ratio: (printable_count / total).round(4),
    ascii_ratio: (ascii_count / total).round(4),
    high_bytes: high_byte_count,
    candidates: candidates
  }
end
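
The byte counting step in the listing above can be tried in isolation. The following is a minimal sketch that uses only core Ruby; `byte_stats` is a hypothetical helper name, not part of EncodingKit, and it leaves out the Detector call and the candidate ranking.

```ruby
# Standalone sketch of the counting loop from .analyze.
# `byte_stats` is a hypothetical name used only for this illustration.
def byte_stats(string)
  bytes = string.b                       # binary copy, so each_byte sees raw bytes
  total = bytes.bytesize.to_f
  return { printable_ratio: 0.0, ascii_ratio: 0.0, high_bytes: 0 } if total.zero?

  ascii = printable = high = 0
  bytes.each_byte do |b|
    ascii += 1 if b < 128
    # Printable ASCII plus tab, line feed and carriage return
    printable += 1 if (0x20..0x7E).cover?(b) || [0x09, 0x0A, 0x0D].include?(b)
    high += 1 if b >= 128
  end

  {
    printable_ratio: (printable / total).round(4),
    ascii_ratio: (ascii / total).round(4),
    high_bytes: high
  }
end

# "café\n" in UTF-8 is six bytes: c, a, f, 0xC3, 0xA9, "\n"
stats = byte_stats("caf\u00E9\n")
```

For this six-byte input, four bytes are ASCII and printable and two are high bytes, so both ratios come out to 0.6667 and `high_bytes` is 2.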

.bom?(string) ⇒ Boolean

Check if a string starts with a byte order mark.

Parameters:

  • string (String)

    the input string

Returns:

  • (Boolean)


# File 'lib/philiprehberger/encoding_kit.rb', line 193

def self.bom?(string)
  bytes = string.b
  BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end

.convert(string, from:, to:) ⇒ String

Convert a string between encodings.

Parameters:

  • string (String)

    the input string

  • from (String, Encoding)

    source encoding

  • to (String, Encoding)

    target encoding

Returns:

  • (String)

    the converted string



# File 'lib/philiprehberger/encoding_kit.rb', line 153

def self.convert(string, from:, to:)
  Converter.convert(string, from: from, to: to)
end

.detect(string) ⇒ DetectionResult

Detect the encoding of a string via BOM and heuristics. Returns a DetectionResult that delegates to the underlying Encoding, so it can be compared directly (e.g., result == Encoding::UTF_8) while also providing a confidence score via result.confidence.

Parameters:

  • string (String)

    the input string

Returns:

  • (DetectionResult)



# File 'lib/philiprehberger/encoding_kit.rb', line 22

def self.detect(string)
  Detector.call(string)
end
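
How DetectionResult implements the delegation is not shown on this page. As a rough illustration of the described behavior, the sketch below assumes a SimpleDelegator-based wrapper; `SketchResult` is a hypothetical class, not the library's DetectionResult.

```ruby
require 'delegate'

# Hypothetical stand-in for a result object that behaves like an Encoding
# and also carries a confidence score. Not the library's implementation.
class SketchResult < SimpleDelegator
  attr_reader :confidence

  def initialize(encoding, confidence)
    super(encoding)            # unknown methods are forwarded to the Encoding
    @confidence = confidence
  end

  # Explicit accessor for the wrapped Encoding
  def encoding
    __getobj__
  end
end

result = SketchResult.new(Encoding::UTF_8, 0.9)
same = (result == Encoding::UTF_8)   # Delegator#== compares the wrapped object
name = result.name                   # forwarded to Encoding#name
```

In this sketch the comparison only works with the result on the left-hand side; `Encoding::UTF_8 == result` uses identity comparison and is false.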

.detect_file(path, sample_size: 4096) ⇒ DetectionResult

Detect the encoding of a file by reading a byte sample.

Parameters:

  • path (String)

    path to the file

  • sample_size (Integer) (defaults to: 4096)

    number of bytes to sample (default: 4096)

Returns:

  • (DetectionResult)



# File 'lib/philiprehberger/encoding_kit.rb', line 203

def self.detect_file(path, sample_size: 4096)
  File.open(path, 'rb') do |file|
    detect_stream(file, sample_size: sample_size)
  end
end

.detect_stream(io, sample_size: 4096) ⇒ DetectionResult

Detect encoding from an IO stream by reading a sample of bytes. The IO position is restored after reading (if the IO supports seek).

Parameters:

  • io (IO, StringIO)

    the IO object to read from

  • sample_size (Integer) (defaults to: 4096)

    number of bytes to sample (default: 4096)

Returns:

  • (DetectionResult)



# File 'lib/philiprehberger/encoding_kit.rb', line 32

def self.detect_stream(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)

  if original_pos && io.respond_to?(:seek)
    io.seek(original_pos)
  end

  return DetectionResult.new(Encoding::BINARY, 0.5) if sample.nil? || sample.empty?

  Detector.call(sample)
end
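
The position handling in the listing above can be seen with a StringIO. This is a minimal sketch; `peek_sample` is a hypothetical helper that keeps only the read-and-restore part and skips detection.

```ruby
require 'stringio'

# Read up to sample_size bytes, then put the IO back where it was.
# `peek_sample` is a hypothetical name, not an EncodingKit method.
def peek_sample(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)
  io.seek(original_pos) if original_pos && io.respond_to?(:seek)
  sample
end

io = StringIO.new("header\nbody\n")
io.read(7)                                 # consume "header\n" first
sample = peek_sample(io, sample_size: 4)   # samples from the current position
rest = io.read                             # position was restored, so the body is still unread
```

The sample is taken from the current position rather than from the start of the stream, so a caller that has already consumed a header will sample the bytes that follow it.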

.file_valid?(path, encoding: nil) ⇒ Boolean

Check if a file’s content is valid in the detected or specified encoding.

Parameters:

  • path (String)

    path to the file

  • encoding (String, Encoding, nil) (defaults to: nil)

    encoding to check against (auto-detect if nil)

Returns:

  • (Boolean)


# File 'lib/philiprehberger/encoding_kit.rb', line 226

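def self.file_valid?(path, encoding: nil)
  raw = File.binread(path)
  # File.binread returns a BINARY string, which is always valid in its own
  # encoding, so fall back to the detected encoding when none is given.
  valid?(raw, encoding: encoding || detect(raw).encoding)
end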

.guess_from_filename(filename) ⇒ Encoding?

Guess the encoding based on filename suffixes/extensions alone. Useful when a file name carries an explicit encoding hint (e.g., “data.utf8.csv”, “legacy.latin1.txt”). Falls back to nil when no hint can be extracted — callers should then use detect_file to inspect the bytes.

Matching is case-insensitive and considers the last three dot-separated tokens of the basename (the extension plus up to two preceding modifiers); the rightmost recognizable hint wins.

Parameters:

  • filename (String)

    filename or path

Returns:

  • (Encoding, nil)

    detected encoding or nil when no hint matches



# File 'lib/philiprehberger/encoding_kit.rb', line 277

def self.guess_from_filename(filename)
  name = File.basename(filename.to_s).downcase
  tokens = name.split('.').last(3) # extension + up to two modifiers
  tokens.reverse_each do |token|
    enc = FILENAME_ENCODING_HINTS[token]
    return enc if enc
  end
  nil
end
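
The token matching can be tried without the library. The sketch below copies the lookup logic from the listing but uses a deliberately small hint table; `SKETCH_HINTS` and `sketch_guess` are hypothetical names for illustration only.

```ruby
# Reduced hint table for illustration; the real FILENAME_ENCODING_HINTS is larger.
SKETCH_HINTS = {
  'utf8' => Encoding::UTF_8,
  'latin1' => Encoding::ISO_8859_1,
  'sjis' => Encoding::Shift_JIS
}.freeze

def sketch_guess(filename)
  tokens = File.basename(filename.to_s).downcase.split('.').last(3)
  tokens.reverse_each do |token|   # rightmost token is checked first
    enc = SKETCH_HINTS[token]
    return enc if enc
  end
  nil
end

upper   = sketch_guess('exports/DATA.UTF8.csv')  # directory is dropped, case is ignored
legacy  = sketch_guess('legacy.latin1.txt')
both    = sketch_guess('sjis.latin1.txt')        # two hints: the rightmost one wins
too_far = sketch_guess('utf8.a.b.c')             # hint lies outside the last three tokens
plain   = sketch_guess('report.csv')
```

Here `upper` is UTF-8, `legacy` and `both` are ISO-8859-1, and `too_far` and `plain` are nil. Because the basename itself counts as a token when the name is short, a file called `utf8.csv` would also match.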

.normalize(string) ⇒ String

Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).

Parameters:

  • string (String)

    the input string

Returns:

  • (String)

    valid UTF-8 string



# File 'lib/philiprehberger/encoding_kit.rb', line 104

def self.normalize(string)
  Converter.normalize(string)
end

.normalize_line_endings(string, to: :lf) ⇒ String

Normalize line endings to a single canonical form.

Parameters:

  • string (String)

    the input string

  • to (Symbol) (defaults to: :lf)

    target line ending: `:lf`, `:crlf`, or `:cr`

Returns:

  • (String)

    string with normalized line endings

Raises:

  • (Error)

    if `to:` is not one of `:lf`, `:crlf`, or `:cr`



# File 'lib/philiprehberger/encoding_kit.rb', line 127

def self.normalize_line_endings(string, to: :lf)
  target = LINE_ENDINGS[to] or raise Error, "Unknown line ending: #{to.inspect} (expected :lf, :crlf, or :cr)"

  string.gsub(/\r\n|\r|\n/, target)
end
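
A standalone sketch of the same substitution shows why the alternation lists `\r\n` first: a CRLF pair is replaced as one unit rather than as two separate line breaks. `SKETCH_ENDINGS` and `sketch_normalize_eol` are hypothetical names, and the sketch raises ArgumentError where the library raises its own Error class.

```ruby
SKETCH_ENDINGS = { lf: "\n", crlf: "\r\n", cr: "\r" }.freeze

def sketch_normalize_eol(string, to: :lf)
  # ArgumentError is used here only because the library's Error class is not available
  target = SKETCH_ENDINGS.fetch(to) do
    raise ArgumentError, "Unknown line ending: #{to.inspect}"
  end
  # \r\n must come first so a CRLF pair is matched once, not as \r then \n
  string.gsub(/\r\n|\r|\n/, target)
end

mixed = "a\r\nb\rc\n"
as_lf   = sketch_normalize_eol(mixed)              # "a\nb\nc\n"
as_crlf = sketch_normalize_eol(mixed, to: :crlf)   # "a\r\nb\r\nc\r\n"
```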

.read_as_utf8(path, from: nil, strip_bom: false) ⇒ String

Read a file and return its content as UTF-8. Auto-detects the source encoding unless specified via `from:`.

Parameters:

  • path (String)

    path to the file

  • from (String, Encoding, nil) (defaults to: nil)

    source encoding (auto-detect if nil)

  • strip_bom (Boolean) (defaults to: false)

    remove any leading UTF BOM from the result (default: false)

Returns:

  • (String)

    UTF-8 encoded file content



# File 'lib/philiprehberger/encoding_kit.rb', line 216

def self.read_as_utf8(path, from: nil, strip_bom: false)
  raw = File.binread(path)
  to_utf8(raw, from: from, strip_bom: strip_bom)
end

.scrub(string) ⇒ String

Strip invalid bytes from a string, returning valid UTF-8.

Unlike normalize, which replaces invalid bytes with `�`, this method removes them entirely.

Parameters:

  • string (String)

    the input string

Returns:

  • (String)

    valid UTF-8 string with invalid bytes removed



# File 'lib/philiprehberger/encoding_kit.rb', line 115

def self.scrub(string)
  Converter.scrub(string)
end
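
The difference between replacing and removing invalid bytes can be illustrated with Ruby's built-in String#scrub. This is only a sketch of the two outcomes; how Converter.normalize and Converter.scrub are implemented is not shown on this page.

```ruby
# A UTF-8 string with one stray byte (0xFF is never valid in UTF-8)
broken = "ab\xFFcd".dup.force_encoding(Encoding::UTF_8)

replaced = broken.scrub("\uFFFD")   # normalize-style: substitute U+FFFD
removed  = broken.scrub('')         # scrub-style: drop the invalid byte
```

Both results are valid UTF-8; `replaced` keeps a visible marker where the bad byte was, while `removed` is simply "abcd".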

.strip_bom(string) ⇒ String

Remove a byte order mark from the beginning of a string.

Parameters:

  • string (String)

    the input string

Returns:

  • (String)

    the string without a BOM



# File 'lib/philiprehberger/encoding_kit.rb', line 178

def self.strip_bom(string)
  bytes = string.b
  BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
    if bytes.start_with?(bom)
      result = bytes[bom.bytesize..]
      return result.force_encoding(string.encoding)
    end
  end
  string.dup
end
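
The contents of Detector::BOMS are not listed on this page, so the sketch below uses an assumed table of the standard Unicode BOM signatures. `SKETCH_BOMS`, `sketch_bom?` and `sketch_strip_bom` are hypothetical names. The assumed table puts the four-byte UTF-32 signatures first, because the UTF-32LE signature begins with the same two bytes as the UTF-16LE one.

```ruby
# Assumed BOM table (standard Unicode signatures); the real Detector::BOMS
# may differ in content or order.
SKETCH_BOMS = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b => Encoding::UTF_16BE,
  "\xFF\xFE".b => Encoding::UTF_16LE
}.freeze

def sketch_bom?(string)
  bytes = string.b
  SKETCH_BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end

def sketch_strip_bom(string)
  bytes = string.b
  SKETCH_BOMS.each_key do |bom|
    next unless bytes.start_with?(bom)

    # Drop the signature and restore the caller's encoding tag
    return bytes[bom.bytesize..].force_encoding(string.encoding)
  end
  string.dup
end

with_bom = "\xEF\xBB\xBFhello"
stripped = sketch_strip_bom(with_bom)     # "hello", still tagged UTF-8
untouched = sketch_strip_bom('hello')     # no BOM: returns a copy
```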

.to_utf8(string, from: nil, strip_bom: false) ⇒ String

Convert a string to UTF-8, auto-detecting source encoding if not specified.

Parameters:

  • string (String)

    the input string

  • from (String, Encoding, nil) (defaults to: nil)

    source encoding (auto-detect if nil)

  • strip_bom (Boolean) (defaults to: false)

    remove any leading UTF BOM from the result (default: false)

Returns:

  • (String)

    UTF-8 encoded string



# File 'lib/philiprehberger/encoding_kit.rb', line 95

def self.to_utf8(string, from: nil, strip_bom: false)
  Converter.to_utf8(string, from: from, strip_bom: strip_bom)
end

.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String

Transcode a string to the target encoding, auto-detecting the source. Simpler API for the most common conversion pattern.

Parameters:

  • string (String)

    the input string

  • to (String, Encoding) (defaults to: Encoding::UTF_8)

    target encoding (default: UTF-8)

  • fallback (Symbol) (defaults to: :replace)

    fallback strategy (:replace or :raise)

  • replace (String) (defaults to: '?')

    replacement character for invalid bytes

Returns:

  • (String)

    the transcoded string

Raises:



# File 'lib/philiprehberger/encoding_kit.rb', line 166

def self.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  detected = Detector.call(string)
  source = detected.encoding
  target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)

  Converter.convert(string, from: source, to: target, fallback: fallback, replace: replace)
end
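
Converter.convert is not shown on this page, so how it interprets `fallback:` and `replace:` is an assumption here. The sketch below shows one plausible mapping onto Ruby's String#encode options; `sketch_convert` is a hypothetical helper and takes the source encoding explicitly instead of detecting it.

```ruby
# Hypothetical helper: maps a :replace / :raise fallback onto String#encode.
def sketch_convert(string, from:, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  opts = if fallback == :replace
           { invalid: :replace, undef: :replace, replace: replace }
         else
           {}   # no options: String#encode raises on bad or unmappable input
         end
  string.encode(to, from, **opts)
end

# Latin-1 bytes to UTF-8: 0xE9 becomes "é"
utf8 = sketch_convert("caf\xE9".b, from: Encoding::ISO_8859_1)

# UTF-8 to US-ASCII: "é" has no ASCII equivalent and is replaced
ascii = sketch_convert("caf\u00E9", from: Encoding::UTF_8, to: 'US-ASCII')
```

With `fallback: :raise`, the second call in this sketch would raise Encoding::UndefinedConversionError instead of substituting a character.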

.valid?(string, encoding: nil) ⇒ Boolean

Check if a string is valid in the given encoding (or its current encoding).

Parameters:

  • string (String)

    the input string

  • encoding (String, Encoding, nil) (defaults to: nil)

    encoding to check against (defaults to string’s encoding)

Returns:

  • (Boolean)


# File 'lib/philiprehberger/encoding_kit.rb', line 138

def self.valid?(string, encoding: nil)
  if encoding
    enc = Encoding.find(encoding.to_s)
    string.dup.force_encoding(enc).valid_encoding?
  else
    string.valid_encoding?
  end
end
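
The effect of the `encoding:` argument is easiest to see with bytes that are valid in one encoding but not another. The sketch below repeats the logic of the listing under a hypothetical name, `sketch_valid?`, so it runs without the library.

```ruby
def sketch_valid?(string, encoding: nil)
  return string.valid_encoding? unless encoding

  # Re-tag a copy with the requested encoding and check the bytes against it
  string.dup.force_encoding(Encoding.find(encoding.to_s)).valid_encoding?
end

latin1_bytes = "caf\xE9".b   # "café" in ISO-8859-1, tagged as binary

as_utf8   = sketch_valid?(latin1_bytes, encoding: 'UTF-8')               # false: 0xE9 starts an incomplete sequence
as_latin1 = sketch_valid?(latin1_bytes, encoding: Encoding::ISO_8859_1)  # true: every byte is a Latin-1 character
as_is     = sketch_valid?(latin1_bytes)                                  # true: binary strings are always valid
```

The last case is worth keeping in mind: without an explicit encoding, a binary-tagged string always passes, so the check says nothing about whether the bytes form valid text.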