Module: Philiprehberger::EncodingKit

Defined in: lib/philiprehberger/encoding_kit.rb,
lib/philiprehberger/encoding_kit/version.rb,
lib/philiprehberger/encoding_kit/detector.rb,
lib/philiprehberger/encoding_kit/converter.rb,
lib/philiprehberger/encoding_kit/detection_result.rb

Defined Under Namespace

Modules: Converter, Detector

Classes: DetectionResult, Error

Constant Summary

BOMS = BOM signatures (re-exported for public use)

Detector::BOMS

LINE_ENDINGS =

{ lf: "\n", crlf: "\r\n", cr: "\r" }.freeze

FILENAME_ENCODING_HINTS = Filename suffix / extension hints that imply a specific encoding. Matched against the last three dot-separated tokens of the filename (the extension plus up to two preceding tokens).

{
  'utf8' => Encoding::UTF_8,
  'utf-8' => Encoding::UTF_8,
  'utf16' => Encoding::UTF_16,
  'utf-16' => Encoding::UTF_16,
  'utf16le' => Encoding::UTF_16LE,
  'utf-16le' => Encoding::UTF_16LE,
  'utf16be' => Encoding::UTF_16BE,
  'utf-16be' => Encoding::UTF_16BE,
  'utf32' => Encoding::UTF_32,
  'utf-32' => Encoding::UTF_32,
  'ascii' => Encoding::US_ASCII,
  'us-ascii' => Encoding::US_ASCII,
  'latin1' => Encoding::ISO_8859_1,
  'latin-1' => Encoding::ISO_8859_1,
  'iso88591' => Encoding::ISO_8859_1,
  'iso-8859-1' => Encoding::ISO_8859_1,
  'iso88592' => Encoding::ISO_8859_2,
  'iso-8859-2' => Encoding::ISO_8859_2,
  'cp1252' => Encoding::Windows_1252,
  'windows1252' => Encoding::Windows_1252,
  'windows-1252' => Encoding::Windows_1252,
  'sjis' => Encoding::Shift_JIS,
  'shiftjis' => Encoding::Shift_JIS,
  'shift-jis' => Encoding::Shift_JIS,
  'shift_jis' => Encoding::Shift_JIS,
  'euc-jp' => Encoding::EUC_JP,
  'eucjp' => Encoding::EUC_JP,
  'gbk' => Encoding::GBK,
  'gb2312' => Encoding::GB2312,
  'big5' => Encoding::Big5
}.freeze

VERSION =

'0.6.0'

Class Method Summary

.analyze(string) ⇒ Hash

Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.
.bom?(string) ⇒ Boolean

Check if a string starts with a byte order mark.
.convert(string, from:, to:) ⇒ String

Convert a string between encodings.
.detect(string) ⇒ DetectionResult

Detect the encoding of a string via BOM and heuristics.
.detect_file(path, sample_size: 4096) ⇒ DetectionResult

Detect the encoding of a file by reading a byte sample.
.detect_stream(io, sample_size: 4096) ⇒ DetectionResult

Detect encoding from an IO stream by reading a sample of bytes.
.file_valid?(path, encoding: nil) ⇒ Boolean

Check if a file’s content is valid in the detected or specified encoding.
.guess_from_filename(filename) ⇒ Encoding?

Guess the encoding based on filename suffixes/extensions alone.
.normalize(string) ⇒ String

Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).
.normalize_line_endings(string, to: :lf) ⇒ String

Normalize line endings to a single canonical form.
.read_as_utf8(path, from: nil, strip_bom: false) ⇒ String

Read a file and return its content as UTF-8.
.scrub(string) ⇒ String

Strip invalid bytes from a string, returning valid UTF-8.
.strip_bom(string) ⇒ String

Remove a byte order mark from the beginning of a string.
.to_utf8(string, from: nil, strip_bom: false) ⇒ String

Convert a string to UTF-8, auto-detecting source encoding if not specified.
.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ String

Transcode a string to the target encoding, auto-detecting the source.
.valid?(string, encoding: nil) ⇒ Boolean

Check if a string is valid in the given encoding (or its current encoding).

Class Method Details

.analyze(string) ⇒ `Hash`

Analyze a string and return detailed byte distribution statistics along with encoding candidates ranked by confidence.

Parameters:

string (String) —

the input string

Returns:

(Hash) —

analysis results with keys :encoding, :confidence, :printable_ratio, :ascii_ratio, :high_bytes, :candidates

# File 'lib/philiprehberger/encoding_kit.rb', line 51

def self.analyze(string)
  bytes = string.b
  total = bytes.bytesize.to_f

  if total.zero?
    return {
      encoding: Encoding::BINARY,
      confidence: 0.5,
      printable_ratio: 0.0,
      ascii_ratio: 0.0,
      high_bytes: 0,
      candidates: [{ encoding: Encoding::BINARY, confidence: 0.5 }]
    }
  end

  ascii_count = 0
  printable_count = 0
  high_byte_count = 0

  bytes.each_byte do |b|
    ascii_count += 1 if b < 128
    printable_count += 1 if (0x20..0x7E).cover?(b) || b == 0x09 || b == 0x0A || b == 0x0D
    high_byte_count += 1 if b >= 128
  end

  primary = Detector.call(bytes)
  candidates = build_candidates(bytes, primary)

  {
    encoding: primary.encoding,
    confidence: primary.confidence,
    printable_ratio: (printable_count / total).round(4),
    ascii_ratio: (ascii_count / total).round(4),
    high_bytes: high_byte_count,
    candidates: candidates
  }
end
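
The counting loop above can be tried in isolation. The following is a minimal standalone sketch in core Ruby; `byte_stats` is a hypothetical helper name, and it leaves out the detection and candidate ranking that `analyze` delegates to `Detector` and `build_candidates`.

```ruby
# Hypothetical helper mirroring the byte counters in analyze.
def byte_stats(string)
  bytes = string.b
  total = bytes.bytesize.to_f
  return { printable_ratio: 0.0, ascii_ratio: 0.0, high_bytes: 0 } if total.zero?

  ascii = printable = high = 0
  bytes.each_byte do |b|
    ascii += 1 if b < 128
    # Printable ASCII plus tab, line feed, and carriage return.
    printable += 1 if (0x20..0x7E).cover?(b) || [0x09, 0x0A, 0x0D].include?(b)
    high += 1 if b >= 128
  end

  {
    printable_ratio: (printable / total).round(4),
    ascii_ratio: (ascii / total).round(4),
    high_bytes: high
  }
end

# "h\u00E9llo\n" is 7 bytes in UTF-8: the accented letter takes two high bytes.
stats = byte_stats("h\u00E9llo\n")
# stats => { printable_ratio: 0.7143, ascii_ratio: 0.7143, high_bytes: 2 }
```

Because the ratios are computed over bytes rather than characters, a single non-ASCII character in UTF-8 lowers both ratios by more than one unit.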

.bom?(string) ⇒ `Boolean`

Check if a string starts with a byte order mark.

Parameters:

string (String) —

the input string

Returns:

(Boolean)

# File 'lib/philiprehberger/encoding_kit.rb', line 193

def self.bom?(string)
  bytes = string.b
  BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end
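
The contents of `Detector::BOMS` are not shown on this page, so the following sketch uses an assumed table (`SKETCH_BOMS`) with the standard Unicode signatures. The helper names are hypothetical. It illustrates the prefix check and one detail any such table has to handle: the UTF-32LE signature starts with the same two bytes as the UTF-16LE one.

```ruby
# Assumed table for illustration; the real entries live in Detector::BOMS.
# Longer signatures come first because the UTF-32LE BOM begins with the
# two bytes of the UTF-16LE BOM.
SKETCH_BOMS = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b => Encoding::UTF_16BE,
  "\xFF\xFE".b => Encoding::UTF_16LE
}.freeze

# Same shape as bom?: compare raw bytes against each known signature.
def sketch_bom?(string)
  bytes = string.b
  SKETCH_BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
end

# Returns the encoding of the first matching signature, or nil.
def sketch_bom_encoding(string)
  bytes = string.b
  SKETCH_BOMS.each { |bom, encoding| return encoding if bytes.start_with?(bom) }
  nil
end

sketch_bom?("\xEF\xBB\xBFhello")                         # => true
sketch_bom?('hello')                                     # => false
sketch_bom_encoding("\xFF\xFE\x00\x00a\x00\x00\x00")     # => Encoding::UTF_32LE
sketch_bom_encoding("\xFF\xFEa\x00")                     # => Encoding::UTF_16LE
```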

.convert(string, from:, to:) ⇒ `String`

Convert a string between encodings.

Parameters:

string (String) —

the input string
from (String, Encoding) —

source encoding
to (String, Encoding) —

target encoding

Returns:

(String) —

the converted string



# File 'lib/philiprehberger/encoding_kit.rb', line 153

def self.convert(string, from:, to:)
  Converter.convert(string, from: from, to: to)
end
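
`Converter.convert` is not shown on this page. As a rough illustration of what converting between two named encodings involves, here is a sketch using only core Ruby (`String#force_encoding` and `String#encode`); it is not the library's implementation.

```ruby
# Bytes of "café" in ISO-8859-1: the final byte 0xE9 is the accented "e".
latin1_bytes = "caf\xE9".b

# Tag the bytes with their real source encoding, then transcode to the target.
utf8 = latin1_bytes.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

utf8.encoding        # => #<Encoding:UTF-8>
utf8.bytes           # => [99, 97, 102, 195, 169]
utf8.valid_encoding? # => true
```

The `from:` argument matters because the same byte, 0xE9, means different things (or nothing valid at all) depending on the encoding it is read in.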

.detect(string) ⇒ `DetectionResult`

Detect the encoding of a string via BOM and heuristics. Returns a DetectionResult that delegates to the underlying Encoding, so it can be compared directly (e.g., result == Encoding::UTF_8) while also providing a confidence score via result.confidence.

Parameters:

string (String) —

the input string

Returns:

(DetectionResult) —

the detected encoding with confidence score



# File 'lib/philiprehberger/encoding_kit.rb', line 22

def self.detect(string)
  Detector.call(string)
end
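
The source of `DetectionResult` is not shown here. One way an object could delegate to an `Encoding` while carrying a confidence score is Ruby's `SimpleDelegator`; the class below (`SketchDetectionResult`) is a hypothetical stand-in written under that assumption, not the library's class.

```ruby
require 'delegate'

# Hypothetical stand-in for DetectionResult, built on SimpleDelegator.
class SketchDetectionResult < SimpleDelegator
  attr_reader :confidence

  def initialize(encoding, confidence)
    super(encoding) # the Encoding becomes the delegation target
    @confidence = confidence
  end

  # The wrapped Encoding object.
  def encoding
    __getobj__
  end
end

result = SketchDetectionResult.new(Encoding::UTF_8, 0.9)

same = (result == Encoding::UTF_8)     # => true, Delegator#== compares the wrapped object
name = result.name                     # => "UTF-8", forwarded to the Encoding
score = result.confidence              # => 0.9
reversed = (Encoding::UTF_8 == result) # => false in this sketch
```

In this sketch the comparison only holds with the result on the left-hand side, because `Encoding` itself compares by identity. Whether the real `DetectionResult` behaves the same way is not stated in this documentation.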

.detect_file(path, sample_size: 4096) ⇒ `DetectionResult`

Detect the encoding of a file by reading a byte sample.

Parameters:

path (String) —

path to the file
sample_size (Integer) (defaults to: 4096) —

number of bytes to sample (default: 4096)

Returns:

(DetectionResult) —

the detected encoding with confidence score

# File 'lib/philiprehberger/encoding_kit.rb', line 203

def self.detect_file(path, sample_size: 4096)
  File.open(path, 'rb') do |file|
    detect_stream(file, sample_size: sample_size)
  end
end

.detect_stream(io, sample_size: 4096) ⇒ `DetectionResult`

Detect encoding from an IO stream by reading a sample of bytes. The IO position is restored after reading (if the IO supports seek).

Parameters:

io (IO, StringIO) —

the IO object to read from
sample_size (Integer) (defaults to: 4096) —

number of bytes to sample (default: 4096)

Returns:

(DetectionResult) —

the detected encoding with confidence score

# File 'lib/philiprehberger/encoding_kit.rb', line 32

def self.detect_stream(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)

  if original_pos && io.respond_to?(:seek)
    io.seek(original_pos)
  end

  return DetectionResult.new(Encoding::BINARY, 0.5) if sample.nil? || sample.empty?

  Detector.call(sample)
end
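
The position-restoring behaviour can be seen with a `StringIO`. The sketch below copies the read-and-rewind steps into a hypothetical helper, `peek_sample`, and skips the detection call so that it runs without the library.

```ruby
require 'stringio'

# Hypothetical helper: read up to sample_size bytes, then move back to where
# the caller left the stream, as detect_stream does for seekable IOs.
def peek_sample(io, sample_size: 4096)
  original_pos = io.respond_to?(:pos) ? io.pos : nil
  sample = io.read(sample_size)
  io.seek(original_pos) if original_pos && io.respond_to?(:seek)
  sample
end

io = StringIO.new('header,body,footer')
io.read(7) # the caller has already consumed "header,"

sample = peek_sample(io, sample_size: 4) # => "body"
pos_after = io.pos                       # => 7, the position is unchanged
rest = io.read                           # => "body,footer"

# At end of input, read with a length returns nil, which is why
# detect_stream checks for both nil and an empty sample.
empty_sample = peek_sample(StringIO.new(''))
```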

.file_valid?(path, encoding: nil) ⇒ `Boolean`

Check if a file’s content is valid in the detected or specified encoding.

Parameters:

path (String) —

path to the file
encoding (String, Encoding, nil) (defaults to: nil) —

encoding to check against (auto-detect if nil)

Returns:

(Boolean)

# File 'lib/philiprehberger/encoding_kit.rb', line 226

def self.file_valid?(path, encoding: nil)
  raw = File.binread(path)
  valid?(raw, encoding: encoding)
end

.guess_from_filename(filename) ⇒ `Encoding`?

Guess the encoding based on filename suffixes/extensions alone. Useful when a file name carries an explicit encoding hint (e.g., “data.utf8.csv”, “legacy.latin1.txt”). Falls back to nil when no hint can be extracted — callers should then use detect_file to inspect the bytes.

Matching is case-insensitive and considers the last three dot-separated tokens of the base name (the extension plus up to two preceding tokens); the rightmost recognizable hint wins.

Parameters:

filename (String) —

filename or path

Returns:

(Encoding, nil) —

detected encoding or nil when no hint matches

# File 'lib/philiprehberger/encoding_kit.rb', line 277

def self.guess_from_filename(filename)
  name = File.basename(filename.to_s).downcase
  tokens = name.split('.').last(3) # extension + up to two modifiers
  tokens.reverse_each do |token|
    enc = FILENAME_ENCODING_HINTS[token]
    return enc if enc
  end
  nil
end
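
The matching rules are easiest to see with a few inputs. The sketch below repeats the lookup logic against a reduced, illustrative table (`SKETCH_HINTS`); `sketch_guess` is a hypothetical name, and the full table is `FILENAME_ENCODING_HINTS`.

```ruby
# Reduced hint table for illustration only.
SKETCH_HINTS = {
  'utf8' => Encoding::UTF_8,
  'latin1' => Encoding::ISO_8859_1,
  'sjis' => Encoding::Shift_JIS
}.freeze

def sketch_guess(filename)
  # Only the base name is inspected, lowercased, and only its last three tokens.
  tokens = File.basename(filename.to_s).downcase.split('.').last(3)
  tokens.reverse_each do |token|
    enc = SKETCH_HINTS[token]
    return enc if enc
  end
  nil
end

sketch_guess('data.utf8.csv')         # => Encoding::UTF_8
sketch_guess('LEGACY.Latin1.TXT')     # => Encoding::ISO_8859_1 (case-insensitive)
sketch_guess('export.utf8.sjis.txt')  # => Encoding::Shift_JIS (rightmost hint wins)
sketch_guess('/srv/utf8/report.csv')  # => nil (directory names are ignored)
sketch_guess('utf8.sjis.old.bak.txt') # => nil (hints lie outside the last three tokens)
sketch_guess('latin1.txt')            # => Encoding::ISO_8859_1 (the base name counts as a token)
```

The last case is worth keeping in mind: with two or three tokens in total, the part before the first dot is also checked against the table.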

.normalize(string) ⇒ `String`

Normalize a string to valid UTF-8, replacing invalid/undefined bytes with the Unicode replacement character (U+FFFD).

Parameters:

string (String) —

the input string

Returns:

(String) —

valid UTF-8 string



# File 'lib/philiprehberger/encoding_kit.rb', line 104

def self.normalize(string)
  Converter.normalize(string)
end

.normalize_line_endings(string, to: :lf) ⇒ `String`

Normalize line endings to a single canonical form.

Parameters:

string (String) —

the input string
to (Symbol) (defaults to: :lf) —

target line ending: `:lf`, `:crlf`, or `:cr`

Returns:

(String) —

string with normalized line endings

Raises:

(Error) —

if `to:` is not one of `:lf`, `:crlf`, or `:cr`

# File 'lib/philiprehberger/encoding_kit.rb', line 127

def self.normalize_line_endings(string, to: :lf)
  target = LINE_ENDINGS[to] or raise Error, "Unknown line ending: #{to.inspect} (expected :lf, :crlf, or :cr)"

  string.gsub(/\r\n|\r|\n/, target)
end
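
A short standalone sketch of the same substitution on mixed input. The names `SKETCH_LINE_ENDINGS` and `sketch_normalize_line_endings` are hypothetical, and the sketch raises `ArgumentError` where the library raises its own `Error`.

```ruby
SKETCH_LINE_ENDINGS = { lf: "\n", crlf: "\r\n", cr: "\r" }.freeze

# Hypothetical standalone copy of the method above.
def sketch_normalize_line_endings(string, to: :lf)
  target = SKETCH_LINE_ENDINGS.fetch(to) do
    raise ArgumentError, "Unknown line ending: #{to.inspect}"
  end
  # "\r\n" is listed first so a CRLF pair is replaced once, not twice.
  string.gsub(/\r\n|\r|\n/, target)
end

mixed = "one\r\ntwo\rthree\nfour"

as_lf = sketch_normalize_line_endings(mixed)              # => "one\ntwo\nthree\nfour"
as_crlf = sketch_normalize_line_endings(mixed, to: :crlf) # => "one\r\ntwo\r\nthree\r\nfour"
```

The order of the alternatives in the pattern is what makes the operation idempotent: normalizing already-normalized CRLF text to `:crlf` leaves it unchanged.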

.read_as_utf8(path, from: nil, strip_bom: false) ⇒ `String`

Read a file and return its content as UTF-8. Auto-detects the source encoding unless specified via `from:`.

Parameters:

path (String) —

path to the file
from (String, Encoding, nil) (defaults to: nil) —

source encoding (auto-detect if nil)
strip_bom (Boolean) (defaults to: false) —

remove any leading UTF BOM from the result (default: false)

Returns:

(String) —

UTF-8 encoded file content

# File 'lib/philiprehberger/encoding_kit.rb', line 216

def self.read_as_utf8(path, from: nil, strip_bom: false)
  raw = File.binread(path)
  to_utf8(raw, from: from, strip_bom: strip_bom)
end

.scrub(string) ⇒ `String`

Strip invalid bytes from a string, returning valid UTF-8.

Unlike normalize, which replaces invalid bytes with `�` (U+FFFD), this method removes them entirely.

Parameters:

string (String) —

the input string

Returns:

(String) —

valid UTF-8 string with invalid bytes removed



# File 'lib/philiprehberger/encoding_kit.rb', line 115

def self.scrub(string)
  Converter.scrub(string)
end
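
The difference between replacing and removing invalid bytes can be shown with core Ruby's `String#scrub`. This is only an illustration of the two documented behaviours; how `Converter.normalize` and `Converter.scrub` are implemented is not shown on this page.

```ruby
# "abc", one byte that is invalid in UTF-8, then "def".
broken = "abc\xFFdef".dup.force_encoding(Encoding::UTF_8)

replaced = broken.scrub("\uFFFD") # the behaviour documented for normalize
removed = broken.scrub('')        # the behaviour documented for scrub

broken.valid_encoding?   # => false
replaced                 # => "abc\uFFFDdef"
removed                  # => "abcdef"
```

Replacement keeps a visible marker where data was lost; removal produces cleaner output but hides the fact that bytes were dropped.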

.strip_bom(string) ⇒ `String`

Remove a byte order mark from the beginning of a string.

Parameters:

string (String) —

the input string

Returns:

(String) —

the string without a BOM

# File 'lib/philiprehberger/encoding_kit.rb', line 178

def self.strip_bom(string)
  bytes = string.b
  BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
    if bytes.start_with?(bom)
      result = bytes[bom.bytesize..]
      return result.force_encoding(string.encoding)
    end
  end
  string.dup
end
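
A reduced sketch of the same slicing step, limited to the UTF-8 signature. `UTF8_BOM` and `strip_utf8_bom` are hypothetical names used only for this example.

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical single-signature version of strip_bom.
def strip_utf8_bom(string)
  bytes = string.b
  return string.dup unless bytes.start_with?(UTF8_BOM)

  # Slice in binary so the offset counts bytes, then restore the encoding tag.
  bytes[UTF8_BOM.bytesize..].force_encoding(string.encoding)
end

with_bom = "\uFEFFid,name"
stripped = strip_utf8_bom(with_bom)

stripped                              # => "id,name"
stripped.encoding                     # => #<Encoding:UTF-8>
with_bom.bytesize - stripped.bytesize # => 3
```

Working on a binary copy avoids any dependence on whether the input is valid in its tagged encoding, and the input string itself is never modified.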

.to_utf8(string, from: nil, strip_bom: false) ⇒ `String`

Convert a string to UTF-8, auto-detecting source encoding if not specified.

Parameters:

string (String) —

the input string
from (String, Encoding, nil) (defaults to: nil) —

source encoding (auto-detect if nil)
strip_bom (Boolean) (defaults to: false) —

remove any leading UTF BOM from the result (default: false)

Returns:

(String) —

UTF-8 encoded string



# File 'lib/philiprehberger/encoding_kit.rb', line 95

def self.to_utf8(string, from: nil, strip_bom: false)
  Converter.to_utf8(string, from: from, strip_bom: strip_bom)
end

.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?') ⇒ `String`

Transcode a string to the target encoding, auto-detecting the source. Simpler API for the most common conversion pattern.

Parameters:

string (String) —

the input string
to (String, Encoding) (defaults to: Encoding::UTF_8) —

target encoding (default: UTF-8)
fallback (Symbol) (defaults to: :replace) —

fallback strategy (:replace or :raise)
replace (String) (defaults to: '?') —

replacement character for invalid bytes

Returns:

(String) —

the transcoded string

Raises:

(EncodingKit::Error) —

on conversion failure when fallback is :raise

# File 'lib/philiprehberger/encoding_kit.rb', line 166

def self.transcode(string, to: Encoding::UTF_8, fallback: :replace, replace: '?')
  detected = Detector.call(string)
  source = detected.encoding
  target = to.is_a?(Encoding) ? to : Encoding.find(to.to_s)

  Converter.convert(string, from: source, to: target, fallback: fallback, replace: replace)
end
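
How `Converter.convert` applies `fallback:` and `replace:` is not shown here. The sketch below assumes a straightforward mapping onto `String#encode` options; `sketch_encode` is a hypothetical helper, and it surfaces Ruby's own exception where the library documents `EncodingKit::Error`.

```ruby
# Hypothetical mapping of the fallback symbol onto String#encode options.
def sketch_encode(string, to:, fallback: :replace, replace: '?')
  options = fallback == :replace ? { invalid: :replace, undef: :replace, replace: replace } : {}
  string.encode(to, **options)
end

text = "caf\u00E9 \u2603" # the snowman has no ISO-8859-1 equivalent

replaced = sketch_encode(text, to: Encoding::ISO_8859_1)
replaced.bytes # => [99, 97, 102, 233, 32, 63], the snowman became "?"

error = nil
begin
  sketch_encode(text, to: Encoding::ISO_8859_1, fallback: :raise)
rescue Encoding::UndefinedConversionError => e
  error = e # raised by String#encode when no replacement is configured
end
```

With `:replace`, conversion always succeeds at the cost of silently losing characters; `:raise` is the safer choice when lossy output would be a bug.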

.valid?(string, encoding: nil) ⇒ `Boolean`

Check if a string is valid in the given encoding (or its current encoding).

Parameters:

string (String) —

the input string
encoding (String, Encoding, nil) (defaults to: nil) —

encoding to check against (defaults to string’s encoding)

Returns:

(Boolean)

# File 'lib/philiprehberger/encoding_kit.rb', line 138

def self.valid?(string, encoding: nil)
  if encoding
    enc = Encoding.find(encoding.to_s)
    string.dup.force_encoding(enc).valid_encoding?
  else
    string.valid_encoding?
  end
end
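
The check above re-tags a copy of the bytes and asks Ruby whether they form valid sequences in that encoding. A small core-Ruby illustration, with no library calls:

```ruby
raw = "caf\xE9".b # ISO-8859-1 bytes for "café"

as_utf8 = raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?        # => false
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1).valid_encoding? # => true

# Every possible byte is a valid ISO-8859-1 character in Ruby, so a true
# result for that encoding says nothing about what the text really is.
all_bytes = (0..255).to_a.pack('C*')
any_latin1 = all_bytes.force_encoding(Encoding::ISO_8859_1).valid_encoding? # => true
```

A `false` result is therefore strong evidence against an encoding, while a `true` result for a single-byte encoding such as ISO-8859-1 is weak evidence for it.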
