Module: Philiprehberger::EncodingKit

Defined in:
lib/philiprehberger/encoding_kit.rb,
lib/philiprehberger/encoding_kit/version.rb,
lib/philiprehberger/encoding_kit/detector.rb,
lib/philiprehberger/encoding_kit/converter.rb,
lib/philiprehberger/encoding_kit/detection_result.rb

Defined Under Namespace

Modules: Converter, Detector

Classes: DetectionResult, Error

Constant Summary

BOMS =

BOM signatures (re-exported for public use)

Detector::BOMS
FILENAME_ENCODING_HINTS =

Filename suffix / extension hints that imply a specific encoding. Matched against up to the last three dot-separated tokens of the file's base name (the extension plus up to two tokens before it).

{
  'utf8' => Encoding::UTF_8,
  'utf-8' => Encoding::UTF_8,
  'utf16' => Encoding::UTF_16,
  'utf-16' => Encoding::UTF_16,
  'utf16le' => Encoding::UTF_16LE,
  'utf-16le' => Encoding::UTF_16LE,
  'utf16be' => Encoding::UTF_16BE,
  'utf-16be' => Encoding::UTF_16BE,
  'utf32' => Encoding::UTF_32,
  'utf-32' => Encoding::UTF_32,
  'ascii' => Encoding::US_ASCII,
  'us-ascii' => Encoding::US_ASCII,
  'latin1' => Encoding::ISO_8859_1,
  'latin-1' => Encoding::ISO_8859_1,
  'iso88591' => Encoding::ISO_8859_1,
  'iso-8859-1' => Encoding::ISO_8859_1,
  'iso88592' => Encoding::ISO_8859_2,
  'iso-8859-2' => Encoding::ISO_8859_2,
  'cp1252' => Encoding::Windows_1252,
  'windows1252' => Encoding::Windows_1252,
  'windows-1252' => Encoding::Windows_1252,
  'sjis' => Encoding::Shift_JIS,
  'shiftjis' => Encoding::Shift_JIS,
  'shift-jis' => Encoding::Shift_JIS,
  'shift_jis' => Encoding::Shift_JIS,
  'euc-jp' => Encoding::EUC_JP,
  'eucjp' => Encoding::EUC_JP,
  'gbk' => Encoding::GBK,
  'gb2312' => Encoding::GB2312,
  'big5' => Encoding::Big5
}.freeze
VERSION =
'0.4.0'

Class Method Details

.analyze(string) ⇒ Hash

Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.

Parameters:

  • string (String)

    the input string

Returns:

  • (Hash)

    analysis results with keys :encoding, :confidence, :printable_ratio, :ascii_ratio, :high_bytes, :candidates



# File 'lib/philiprehberger/encoding_kit.rb', line 51

def self.analyze(string)
  bytes = string.b
  total = bytes.bytesize.to_f

  if total.zero?
    return {
      encoding: Encoding::BINARY,
      confidence: 0.5,
      printable_ratio: 0.0,
      ascii_ratio: 0.0,
      high_bytes: 0,
      candidates: [{ encoding: Encoding::BINARY, confidence: 0.5 }]
    }
  end

  ascii_count = 0
  printable_count = 0
  high_byte_count = 0

  bytes.each_byte do |b|
    ascii_count += 1 if b < 128
    printable_count += 1 if (0x20..0x7E).cover?(b) || b == 0x09 || b == 0x0A || b == 0x0D
    high_byte_count += 1 if b >= 128
  end

  primary = Detector.call(bytes)
  candidates = build_candidates(bytes, primary)

  {
    encoding: primary.encoding,
    confidence: primary.confidence,
    printable_ratio: (printable_count / total).round(4),
    ascii_ratio: (ascii_count / total).round(4),
    high_bytes: high_byte_count,
    candidates: candidates
  }
end
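
The byte counters in the listing above can be tried without the gem. The following is a minimal sketch that reproduces only the counting loop; the helper name byte_stats is hypothetical, and the detector call and candidate ranking are deliberately left out.

```ruby
# Hypothetical helper: mirrors the counting loop of .analyze, standard library only.
def byte_stats(string)
  bytes = string.b
  total = bytes.bytesize.to_f
  return { printable_ratio: 0.0, ascii_ratio: 0.0, high_bytes: 0 } if total.zero?

  ascii = printable = high = 0
  bytes.each_byte do |b|
    ascii += 1 if b < 128
    # Printable means visible ASCII plus tab, line feed and carriage return.
    printable += 1 if (0x20..0x7E).cover?(b) || [0x09, 0x0A, 0x0D].include?(b)
    high += 1 if b >= 128
  end

  {
    printable_ratio: (printable / total).round(4),
    ascii_ratio: (ascii / total).round(4),
    high_bytes: high
  }
end

# "café\n" is six bytes in UTF-8: four ASCII bytes and a two-byte "é".
stats = byte_stats("caf\u00E9\n")
# stats[:high_bytes]  => 2
# stats[:ascii_ratio] => 0.6667
```

Note that the ratios are per byte, not per character, so one multi-byte character contributes several high bytes.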

.bom?(string) ⇒ Boolean

Check if a string starts with a byte order mark.

Parameters:

  • string (String)

    the input string

Returns:

  • (Boolean)


# File 'lib/philiprehberger/encoding_kit.rb', line 167

def self.bom?(string)
  bytes = string.b
  BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end

.convert(string, from:, to:) ⇒ String

Convert a string between encodings.

Parameters:

  • string (String)

    the input string

  • from (String, Encoding)

    source encoding

  • to (String, Encoding)

    target encoding

Returns:

  • (String)

    the converted string



# File 'lib/philiprehberger/encoding_kit.rb', line 127

def self.convert(string, from:, to:)
  Converter.convert(string, from: from, to: to)
end
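
The internals of Converter.convert are not shown on this page. As a rough illustration of what an explicit from/to conversion amounts to in plain Ruby, the sketch below uses String#encode with a destination and a source encoding; it is an assumption that the converter behaves comparably for valid input.

```ruby
# Bytes for "café" in ISO-8859-1: "é" is the single byte 0xE9.
latin1 = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)

# String#encode(destination, source) re-encodes the characters.
utf8 = latin1.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

# utf8.bytes => [99, 97, 102, 195, 169]  ("é" is now the two bytes C3 A9)
```

String#encode also accepts encoding names such as 'UTF-8' and 'ISO-8859-1', which matches the (String, Encoding) parameter types documented above.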

.detect(string) ⇒ DetectionResult

Detect the encoding of a string via BOM and heuristics. Returns a DetectionResult that delegates to the underlying Encoding, so it can be compared directly (e.g., result == Encoding::UTF_8) while also providing a confidence score via result.confidence.

Parameters:

  • string (String)

    the input string

Returns:

  • (DetectionResult)

# File 'lib/philiprehberger/encoding_kit.rb', line 22

def self.detect(string)
  Detector.call(string)
end
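
The DetectionResult class itself is not listed here. To make the "delegates to the underlying Encoding" behaviour concrete, here is a hypothetical stand-in (SketchResult is an invented name, and the real class may be implemented quite differently) that compares equal to an Encoding and forwards unknown calls to it.

```ruby
# Hypothetical stand-in for DetectionResult; not the library's implementation.
class SketchResult
  attr_reader :encoding, :confidence

  def initialize(encoding, confidence)
    @encoding = encoding
    @confidence = confidence
  end

  # Equal to the wrapped Encoding, or to another result wrapping the same one.
  def ==(other)
    other_encoding = other.is_a?(SketchResult) ? other.encoding : other
    encoding == other_encoding
  end

  # Forward calls this class does not define (name, ascii_compatible?, ...).
  def method_missing(name, *args, &block)
    encoding.respond_to?(name) ? encoding.public_send(name, *args, &block) : super
  end

  def respond_to_missing?(name, include_private = false)
    encoding.respond_to?(name, include_private) || super
  end
end

result = SketchResult.new(Encoding::UTF_8, 0.9)
# result == Encoding::UTF_8  => true
# result.name                => "UTF-8"
# result.confidence          => 0.9
```

In this sketch the comparison only works with the result on the left-hand side; Encoding::UTF_8 == result would still be false, because Encoding#== knows nothing about the wrapper.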

.detect_file(path, sample_size: 4096) ⇒ DetectionResult

Detect the encoding of a file by reading a byte sample.

Parameters:

  • path (String)

    path to the file

  • sample_size (Integer) (defaults to: 4096)

    number of bytes to sample (default: 4096)

Returns:

  • (DetectionResult)

# File 'lib/philiprehberger/encoding_kit.rb', line 177

def self.detect_file(path, sample_size: 4096)
  File.open(path, 'rb') do |file|
    detect_stream(file, sample_size: sample_size)
  end
end

.detect_stream(io, sample_size: 4096) ⇒ DetectionResult

Detect encoding from an IO stream by reading a sample of bytes. The IO position is restored after reading (if the IO supports both pos and seek).

Parameters:

  • io (IO, StringIO)

    the IO object to read from

  • sample_size (Integer) (defaults to: 4096)

    number of bytes to sample (default: 4096)

Returns:

  • (DetectionResult)

# File 'lib/philiprehberger/encoding_kit.rb', line 32

def self.detect_stream(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)

  if original_pos && io.respond_to?(:seek)
    io.seek(original_pos)
  end

  return DetectionResult.new(Encoding::BINARY, 0.5) if sample.nil? || sample.empty?

  Detector.call(sample)
end
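
The position-restoring part of the listing above can be exercised on its own with a StringIO. This sketch keeps only the read-and-rewind logic; peek_sample is a hypothetical helper name and no detection is performed.

```ruby
require 'stringio'

# Hypothetical helper: read a sample, then put the stream back where it was.
def peek_sample(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)
  io.seek(original_pos) if original_pos && io.respond_to?(:seek)
  sample
end

io = StringIO.new('header,value')
io.read(7)                                # advance past "header,"
sample = peek_sample(io, sample_size: 3)  # reads "val"
# io.pos  => 7 (restored)
# io.read => "value"
```

Streams that cannot seek, such as pipes or sockets, will have the sample consumed, so callers that need those bytes afterwards should buffer them.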

.file_valid?(path, encoding: nil) ⇒ Boolean

Check if a file’s content is valid in the detected or specified encoding.

Parameters:

  • path (String)

    path to the file

  • encoding (String, Encoding, nil) (defaults to: nil)

    encoding to check against (auto-detect if nil)

Returns:

  • (Boolean)


199
200
201
202
# File 'lib/philiprehberger/encoding_kit.rb', line 199

def self.file_valid?(path, encoding: nil)
  raw = File.binread(path)
  valid?(raw, encoding: encoding)
end

.guess_from_filename(filename) ⇒ Encoding?

Guess the encoding based on filename suffixes/extensions alone. Useful when a file name carries an explicit encoding hint (e.g., “data.utf8.csv”, “legacy.latin1.txt”). Returns nil when no hint can be extracted; callers should then use detect_file to inspect the bytes.

Matching is case-insensitive and considers up to the last three dot-separated tokens of the base name (the extension plus up to two tokens before it); the rightmost recognizable hint wins.

Parameters:

  • filename (String)

    filename or path

Returns:

  • (Encoding, nil)

    detected encoding or nil when no hint matches



# File 'lib/philiprehberger/encoding_kit.rb', line 250

def self.guess_from_filename(filename)
  name = File.basename(filename.to_s).downcase
  tokens = name.split('.').last(3) # extension + up to two modifiers
  tokens.reverse_each do |token|
    enc = FILENAME_ENCODING_HINTS[token]
    return enc if enc
  end
  nil
end
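
The "rightmost hint wins" rule is easiest to see with a few file names. The sketch below copies the lookup loop from the listing above but uses a deliberately tiny hint table (SKETCH_HINTS and sketch_guess are illustrative names, not part of the gem).

```ruby
# Small subset of the hint table, for illustration only.
SKETCH_HINTS = {
  'utf8' => Encoding::UTF_8,
  'latin1' => Encoding::ISO_8859_1,
  'sjis' => Encoding::Shift_JIS
}.freeze

def sketch_guess(filename)
  # Keep at most the last three dot-separated tokens of the base name.
  tokens = File.basename(filename.to_s).downcase.split('.').last(3)
  tokens.reverse_each do |token|
    enc = SKETCH_HINTS[token]
    return enc if enc
  end
  nil
end

sketch_guess('/tmp/data.UTF8.csv')      # => Encoding::UTF_8 (case-insensitive)
sketch_guess('export.latin1.sjis.txt')  # => Encoding::Shift_JIS (rightmost hint)
sketch_guess('notes.txt')               # => nil
sketch_guess('latin1.txt')              # => Encoding::ISO_8859_1 (base name matched)
```

The last call shows a side effect of taking three tokens: for names with fewer than three dots, the base name itself takes part in the match.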

.normalize(string) ⇒ String

Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).

Parameters:

  • string (String)

    the input string

Returns:

  • (String)

    valid UTF-8 string



# File 'lib/philiprehberger/encoding_kit.rb', line 103

def self.normalize(string)
  Converter.normalize(string)
end
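
Converter.normalize is not listed on this page, so its exact steps are unknown here. The documented effect (invalid bytes become U+FFFD) resembles what Ruby's String#scrub does; the sketch below assumes the input is meant to be UTF-8 and is not a description of the library's implementation.

```ruby
# Hypothetical helper: treat the bytes as UTF-8 and replace anything invalid.
def sketch_normalize(string)
  string.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

broken = "ok\xFFok".b            # 0xFF can never appear in valid UTF-8
fixed = sketch_normalize(broken)
# fixed                 => "ok\uFFFDok"
# fixed.valid_encoding? => true
```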

.read_as_utf8(path, from: nil) ⇒ String

Read a file and return its content as UTF-8. Auto-detects the source encoding unless specified via `from:`.

Parameters:

  • path (String)

    path to the file

  • from (String, Encoding, nil) (defaults to: nil)

    source encoding (auto-detect if nil)

Returns:

  • (String)

    UTF-8 encoded file content



# File 'lib/philiprehberger/encoding_kit.rb', line 189

def self.read_as_utf8(path, from: nil)
  raw = File.binread(path)
  to_utf8(raw, from: from)
end
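
The read-then-convert pattern above can be sketched with the standard library alone when the source encoding is known. In this sketch the helper sketch_read_as_utf8 is hypothetical, auto-detection is not reproduced, and from: is therefore required.

```ruby
require 'tempfile'

# Hypothetical helper: read raw bytes, label them, then re-encode to UTF-8.
def sketch_read_as_utf8(path, from:)
  raw = File.binread(path)
  raw.force_encoding(from).encode(Encoding::UTF_8)
end

content = Tempfile.create('legacy') do |file|
  file.binmode
  file.write("caf\xE9".b)   # "café" as ISO-8859-1 bytes
  file.flush
  sketch_read_as_utf8(file.path, from: Encoding::ISO_8859_1)
end
# content => "café" (UTF-8)
```

Reading with File.binread first, as the library does, avoids any transcoding by Ruby's default external encoding before the bytes are inspected.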

.strip_bom(string) ⇒ String

Remove a byte order mark from the beginning of a string.

Parameters:

  • string (String)

    the input string

Returns:

  • (String)

    the string without a BOM



# File 'lib/philiprehberger/encoding_kit.rb', line 152

def self.strip_bom(string)
  bytes = string.b
  BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
    if bytes.start_with?(bom)
      result = bytes[bom.bytesize..]
      return result.force_encoding(string.encoding)
    end
  end
  string.dup
end
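
The contents of BOMS are not shown on this page, so the sketch below handles only the UTF-8 BOM (bytes EF BB BF), which is a well-known signature. It follows the same compare-on-bytes, restore-the-encoding approach as the listing above; the names UTF8_BOM and sketch_strip_utf8_bom are illustrative.

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: strip a leading UTF-8 BOM, keep the original encoding.
def sketch_strip_utf8_bom(string)
  bytes = string.b
  return string.dup unless bytes.start_with?(UTF8_BOM)

  bytes[UTF8_BOM.bytesize..].force_encoding(string.encoding)
end

with_bom = "\uFEFFhello"          # U+FEFF encodes to EF BB BF in UTF-8
stripped = sketch_strip_utf8_bom(with_bom)
# stripped          => "hello"
# stripped.encoding => Encoding::UTF_8
```

When several signatures are checked in a loop, order matters: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer signature needs to be tested first.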

.to_utf8(string, from: nil) ⇒ String

Convert a string to UTF-8, auto-detecting source encoding if not specified.

Parameters:

  • string (String)

    the input string

  • from (String, Encoding, nil) (defaults to: nil)

    source encoding (auto-detect if nil)

Returns:

  • (String)

    UTF-8 encoded string



# File 'lib/philiprehberger/encoding_kit.rb', line 94

def self.to_utf8(string, from: nil)
  Converter.to_utf8(string, from: from)
end

.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String

Transcode a string to the target encoding, auto-detecting the source. Simpler API for the most common conversion pattern.

Parameters:

  • string (String)

    the input string

  • to (String, Encoding) (defaults to: Encoding::UTF_8)

    target encoding (default: UTF-8)

  • fallback (Symbol) (defaults to: :replace)

    fallback strategy (:replace or :raise)

  • replace (String) (defaults to: '?')

    replacement character for invalid bytes

Returns:

  • (String)

    the transcoded string

Raises:



# File 'lib/philiprehberger/encoding_kit.rb', line 140

def self.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  detected = Detector.call(string)
  source = detected.encoding
  target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)

  Converter.convert(string, from: source, to: target, fallback: fallback, replace: replace)
end
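
How Converter.convert interprets fallback: and replace: is not visible here. One plausible reading, shown purely as an assumption, is that :replace maps onto the invalid:/undef:/replace: options of String#encode and :raise leaves Ruby's default error behaviour in place. The sketch takes an explicit from: because detection is not reproduced.

```ruby
# Hypothetical helper illustrating the two fallback strategies.
def sketch_encode(string, from:, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  options = fallback == :replace ? { invalid: :replace, undef: :replace, replace: replace } : {}
  string.encode(to, from, **options)
end

utf8 = "caf\u00E9 \u2603"   # "café" followed by a snowman; neither fits US-ASCII

ascii = sketch_encode(utf8, from: Encoding::UTF_8, to: Encoding::US_ASCII)
# ascii => "caf? ?"

# With fallback: :raise the same call raises Encoding::UndefinedConversionError.
```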

.valid?(string, encoding: nil) ⇒ Boolean

Check if a string is valid in the given encoding (or its current encoding).

Parameters:

  • string (String)

    the input string

  • encoding (String, Encoding, nil) (defaults to: nil)

    encoding to check against (defaults to string’s encoding)

Returns:

  • (Boolean)


# File 'lib/philiprehberger/encoding_kit.rb', line 112

def self.valid?(string, encoding: nil)
  if encoding
    enc = Encoding.find(encoding.to_s)
    string.dup.force_encoding(enc).valid_encoding?
  else
    string.valid_encoding?
  end
end
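
A short standalone sketch of the same check shows why the result depends heavily on the encoding asked about. sketch_valid? is a hypothetical copy of the logic above so that the example runs without the gem.

```ruby
# Hypothetical copy of the validity check, standard library only.
def sketch_valid?(string, encoding: nil)
  return string.valid_encoding? unless encoding

  string.dup.force_encoding(Encoding.find(encoding.to_s)).valid_encoding?
end

latin1_bytes = "caf\xE9".b   # "café" as ISO-8859-1 bytes

sketch_valid?(latin1_bytes, encoding: 'UTF-8')       # => false (0xE9 starts an incomplete sequence)
sketch_valid?(latin1_bytes, encoding: 'ISO-8859-1')  # => true
sketch_valid?(latin1_bytes)                          # => true (binary strings are always valid)
```

Single-byte encodings such as ISO-8859-1 accept every byte sequence, so a true result there says nothing about whether the text was actually written in that encoding.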