Module: Clacky::Utils::Encoding

Defined in: lib/clacky/utils/encoding.rb

Overview

Centralised UTF-8 encoding helpers used throughout the codebase.

Three distinct use-cases exist:

1. to_utf8       – binary/unknown bytes → valid UTF-8 String.
                 Used when reading shell output, HTTP response bodies,
                 or any raw byte stream that is *expected* to be UTF-8
                 but arrives with ASCII-8BIT (binary) encoding.
                 Strategy: force_encoding("UTF-8") then scrub invalid
                 sequences with U+FFFD so multibyte characters (CJK,
                 emoji, …) are preserved as-is.

2. sanitize_utf8 – UTF-8 String → clean UTF-8 String.
                 Used for UI rendering (terminal output, screen
                 buffers) where the string is already nominally UTF-8
                 but may still contain isolated invalid bytes.
                 Strategy: encode UTF-8→UTF-8 replacing invalid /
                 undefined codepoints with an empty string so the
                 rendered output never contains replacement characters.

3. safe_check    – any String → UTF-8 String that is safe for regex.
                 Used only for security pattern matching (terminal/Security).
                 Bytes that are invalid in the source encoding, or that
                 have no UTF-8 mapping, are replaced with '?' so that
                 Ruby's regex engine can scan the result without
                 raising Encoding errors.
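
The three strategies differ only in what happens to an invalid byte. The following is a minimal sketch, not the module itself: it applies the underlying String calls directly to one input so that it runs without the gem loaded. The variable names are illustrative.

```ruby
# One stray 0xFF byte inside an otherwise ASCII, UTF-8-tagged string.
bad = "a\xFFb"

# to_utf8 strategy: keep the position and mark it with U+FFFD.
replaced = bad.dup.force_encoding("UTF-8").scrub("\u{FFFD}")

# sanitize_utf8 strategy: drop the invalid byte entirely.
dropped = bad.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")

# safe_check strategy: substitute a plain ASCII "?".
masked = bad.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

p replaced.valid_encoding?
p dropped
p masked
```

All three results are valid UTF-8; which one is appropriate depends on whether the consumer should see a marker, nothing, or an ASCII placeholder.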

Class Method Summary

.cmd_to_utf8(data, source_encoding: "GBK") ⇒ String
Convert raw shell command output to valid UTF-8.
.pty_to_utf8(data) ⇒ String
Decode a raw PTY byte stream to valid UTF-8, auto-detecting the source encoding.
.safe_check(str) ⇒ String
Return an ASCII-safe UTF-8 copy of str suitable for security regex pattern matching.
.sanitize_utf8(str) ⇒ String
Clean an already-UTF-8 string by removing (not replacing) any invalid or undefined byte sequences.
.to_utf8(data) ⇒ String
Convert a binary (or unknown-encoding) byte string to a valid UTF-8 String.

Class Method Details

.cmd_to_utf8(data, source_encoding: "GBK") ⇒ `String`

Convert raw shell command output to valid UTF-8. Handles two common cases:

- Windows commands (e.g. powershell.exe) that output GBK/CP936 bytes
- Unix commands that output UTF-8 or ASCII bytes with ASCII-8BIT encoding

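Strategy: decode the bytes as source_encoding (GBK by default, a superset of ASCII that covers Chinese Windows) and transcode to UTF-8, dropping any byte that is invalid or has no UTF-8 mapping; if the conversion still raises, fall back to to_utf8, which scrubs the data as UTF-8.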

Parameters:

data (String, nil) —
raw bytes from backtick / IO.popen
source_encoding (String) (defaults to: "GBK") —
hint for source encoding (default: "GBK")

Returns:

(String) —
valid UTF-8 string

# File 'lib/clacky/utils/encoding.rb', line 68

def self.cmd_to_utf8(data, source_encoding: "GBK")
  return "" if data.nil? || data.empty?

  data.dup
      .force_encoding(source_encoding)
      .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
  to_utf8(data)
end
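
A usage sketch, assuming standalone copies of the two method bodies shown on this page (the `_sketch` names are hypothetical, chosen so the snippet runs without the gem). It decodes GBK bytes under the default hint, decodes UTF-8 bytes with an explicit `source_encoding: "UTF-8"`, and shows what the default hint does to UTF-8 multibyte input.

```ruby
# Hypothetical standalone copies of the method bodies above (not the gem itself).
def to_utf8_sketch(data)
  return "" if data.nil? || data.empty?

  data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
end

def cmd_to_utf8_sketch(data, source_encoding: "GBK")
  return "" if data.nil? || data.empty?

  data.dup
      .force_encoding(source_encoding)
      .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
  to_utf8_sketch(data)
end

gbk_bytes  = "\xD6\xD0\xCE\xC4".b          # U+4E2D U+6587 encoded as GBK
utf8_bytes = "\xE4\xB8\xAD\xE6\x96\x87".b  # the same two characters encoded as UTF-8

from_gbk   = cmd_to_utf8_sketch(gbk_bytes)
from_utf8  = cmd_to_utf8_sketch(utf8_bytes, source_encoding: "UTF-8")
misdecoded = cmd_to_utf8_sketch(utf8_bytes) # default GBK hint applied to UTF-8 bytes

p from_gbk == from_utf8
p misdecoded == from_utf8
```

In this sketch the last call returns valid UTF-8 but not the original characters, because the UTF-8 bytes are interpreted as GBK. Pure ASCII output is unaffected by the hint.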

.pty_to_utf8(data) ⇒ `String`

Decode a raw PTY byte stream to valid UTF-8, auto-detecting the source encoding. UTF-8 is tried first (Linux/macOS and modern programs). Bytes that are not valid UTF-8 are decoded as GBK/CP936 (the default output of powershell.exe / cmd.exe on Simplified Chinese Windows), with any byte that cannot be converted replaced by '?'. If the conversion raises, the data is passed to to_utf8 and scrubbed.

MUST be called on complete byte runs — callers slice on "\n" (0x0A), which is never a trailing byte of a GBK or UTF-8 multibyte sequence, so a character is never split across the boundary.

Parameters:

data (String, nil) —
raw PTY bytes (binary/ASCII-8BIT)

Returns:

(String) —
valid UTF-8 string

# File 'lib/clacky/utils/encoding.rb', line 90

def self.pty_to_utf8(data)
  return "" if data.nil? || data.empty?

  s = data.dup.force_encoding("UTF-8")
  return s if s.valid_encoding?

  data.dup
      .force_encoding("GBK")
      .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
  to_utf8(data)
end
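
The requirement to decode only complete byte runs can be sketched as a small line buffer. This is an illustration under stated assumptions, not the callers' actual code: the `_sketch` methods are standalone copies of the bodies shown on this page, and the `chunks` array simulates PTY reads in which a GBK character is split across two chunks.

```ruby
# Hypothetical standalone copies of the method bodies above (not the gem itself).
def to_utf8_sketch(data)
  return "" if data.nil? || data.empty?

  data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
end

def pty_to_utf8_sketch(data)
  return "" if data.nil? || data.empty?

  s = data.dup.force_encoding("UTF-8")
  return s if s.valid_encoding?

  data.dup
      .force_encoding("GBK")
      .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
  to_utf8_sketch(data)
end

# Simulated reads: a GBK line whose first character is split, then a UTF-8 line.
chunks = ["\xD6".b, "\xD0\xCE\xC4\n".b, "\xE4\xB8\xAD\n".b]

buffer = String.new(encoding: Encoding::BINARY)
lines  = []

chunks.each do |chunk|
  buffer << chunk
  # Decode only complete lines; a trailing partial line stays in the buffer.
  while (idx = buffer.index("\n"))
    lines << pty_to_utf8_sketch(buffer.slice!(0..idx))
  end
end

p lines.length
p lines.all?(&:valid_encoding?)
```

Because the UTF-8 check runs first, a byte run that happens to be valid in both encodings is treated as UTF-8.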

.safe_check(str) ⇒ `String`

Return an ASCII-safe UTF-8 copy of str suitable for security regex pattern matching. Any byte that is not valid in the source encoding, or that cannot be represented in UTF-8, is replaced with '?'. The original string is never mutated.

Parameters:

str (String, nil)

Returns:

(String) —
UTF-8 string safe for regex matching

# File 'lib/clacky/utils/encoding.rb', line 110

def self.safe_check(str)
  return "" if str.nil? || str.empty?

  str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end
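
A usage sketch, assuming a standalone copy of the method body above (`safe_check_sketch` is a hypothetical name) and a made-up pattern that stands in for a real security rule. It shows that high bytes in binary-tagged input become '?', while valid UTF-8 input is returned unchanged.

```ruby
# Hypothetical standalone copy of the method body above (not the gem itself).
def safe_check_sketch(str)
  return "" if str.nil? || str.empty?

  str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Binary-tagged input: bytes >= 0x80 have no mapping to UTF-8 and become "?".
binary  = "rm -rf /tmp/\xE6\x97\xA5".b
checked = safe_check_sketch(binary)

# Illustrative pattern only; not taken from the codebase.
dangerous = /rm\s+-rf\s+\//
flagged   = checked.match?(dangerous)

# Valid UTF-8 input passes through unchanged.
kept = safe_check_sketch("caf\u00E9")

p checked
p flagged
```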

.sanitize_utf8(str) ⇒ `String`

Clean an already-UTF-8 string by removing (not replacing) any invalid or undefined byte sequences. Suitable for terminal / UI rendering where replacement characters would appear as visual noise.

Parameters:

str (String, nil) —
nominally UTF-8 string

Returns:

(String) —
clean UTF-8 string (invalid bytes silently dropped)

# File 'lib/clacky/utils/encoding.rb', line 51

def self.sanitize_utf8(str)
  return "" if str.nil? || str.empty?

  str.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
end
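
A usage sketch, assuming a standalone copy of the method body above (`sanitize_utf8_sketch` is a hypothetical name). The input simulates a UI string containing a truncated two-byte sequence: a lead byte 0xC3 with no continuation byte.

```ruby
# Hypothetical standalone copy of the method body above (not the gem itself).
def sanitize_utf8_sketch(str)
  return "" if str.nil? || str.empty?

  str.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
end

dirty = "progress: 50%\xC3 done"     # stray lead byte before the space
clean = sanitize_utf8_sketch(dirty)

# Valid multibyte characters are left alone.
emoji = sanitize_utf8_sketch("ok \u{1F600}")

p clean
p emoji.valid_encoding?
```

Dropping the byte keeps the rendered width predictable, at the cost of hiding that the input was damaged; to_utf8 is the alternative when a visible marker is preferable.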

.to_utf8(data) ⇒ `String`

Convert a binary (or unknown-encoding) byte string to a valid UTF-8 String. Multibyte sequences that are already valid UTF-8 (e.g. CJK characters) are preserved unchanged; only genuinely invalid byte sequences are replaced with U+FFFD (the Unicode replacement character).

Parameters:

data (String, nil) —
raw bytes, typically from a pipe or HTTP body

Returns:

(String) —
valid UTF-8 string

# File 'lib/clacky/utils/encoding.rb', line 39

def self.to_utf8(data)
  return "" if data.nil? || data.empty?

  data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
end
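
A usage sketch, assuming a standalone copy of the method body above (`to_utf8_sketch` is a hypothetical name). The input is a frozen, binary-tagged string holding two valid three-byte UTF-8 characters followed by a stray 0xFF byte.

```ruby
# Hypothetical standalone copy of the method body above (not the gem itself).
def to_utf8_sketch(data)
  return "" if data.nil? || data.empty?

  data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
end

# U+65E5 U+672C as raw UTF-8 bytes, then an invalid byte, tagged ASCII-8BIT.
raw    = "\xE6\x97\xA5\xE6\x9C\xAC\xFF".b.freeze
result = to_utf8_sketch(raw)

p result.encoding
p result.valid_encoding?
p result.length  # character count, not byte count
```

The `dup` is what allows a frozen input here: the caller's string keeps its binary tag and its bytes.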
