Module: Clacky::Utils::Encoding
Defined in: lib/clacky/utils/encoding.rb
Overview
Centralised UTF-8 encoding helpers used throughout the codebase.
Three distinct use-cases exist:
1. to_utf8 – binary/unknown bytes → valid UTF-8 String.
   Used when reading shell output, HTTP response bodies, or any raw byte stream that is *expected* to be UTF-8 but arrives with ASCII-8BIT (binary) encoding.
   Strategy: force_encoding("UTF-8"), then scrub invalid sequences with U+FFFD so that multibyte characters (CJK, emoji, …) are preserved as-is.
2. sanitize_utf8 – UTF-8 String → clean UTF-8 String.
   Used for UI rendering (terminal output, screen buffers) where the string is already nominally UTF-8 but may still contain isolated invalid bytes.
   Strategy: encode UTF-8→UTF-8, replacing invalid / undefined sequences with an empty string so the rendered output never contains replacement characters.
3. safe_check – any String → ASCII-safe UTF-8 String for regex.
   Used only for security pattern matching (terminal/Security).
   Bytes that are invalid in the source encoding or cannot be converted to UTF-8 are replaced with '?' (for ASCII-8BIT input, that is every non-ASCII byte), so that Ruby's regex engine operates on a valid, ASCII-compatible string without raising encoding errors.
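To make the difference between the three use-cases concrete, here is a minimal sketch that runs them on similar input. The module name EncodingSketch is hypothetical; its method bodies are copied from the listings on this page so the example runs without loading the library.

```ruby
# Hypothetical stand-in module: bodies copied from the listings on this page.
module EncodingSketch
  def self.to_utf8(data)
    return "" if data.nil? || data.empty?

    data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
  end

  def self.sanitize_utf8(str)
    return "" if str.nil? || str.empty?

    str.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
  end

  def self.safe_check(str)
    return "" if str.nil? || str.empty?

    str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  end
end

# "日本" as UTF-8 bytes plus one stray invalid byte (0xFF), tagged as binary.
raw = "\xE6\x97\xA5\xE6\x9C\xAC\xFF".b

# 1. to_utf8: valid multibyte characters survive, the stray byte becomes U+FFFD.
decoded = EncodingSketch.to_utf8(raw)

# 2. sanitize_utf8: same bytes, already tagged UTF-8; the stray byte is dropped.
cleaned = EncodingSketch.sanitize_utf8(raw.dup.force_encoding("UTF-8"))

# 3. safe_check: binary input, so every non-ASCII byte is replaced with '?'.
checked = EncodingSketch.safe_check("ls \xE6\x97\xA5".b)

puts decoded.inspect
puts cleaned.inspect
puts checked.inspect
```

The helpers differ only in what they do with bytes they cannot keep: substitute U+FFFD, drop them, or substitute '?'.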
Class Method Summary
- .cmd_to_utf8(data, source_encoding: "GBK") ⇒ String
  Convert raw shell command output to valid UTF-8.
- .pty_to_utf8(data) ⇒ String
  Decode a raw PTY byte stream to valid UTF-8, auto-detecting the source encoding.
- .safe_check(str) ⇒ String
  Return an ASCII-safe UTF-8 copy of str suitable for security regex pattern matching.
- .sanitize_utf8(str) ⇒ String
  Clean an already-UTF-8 string by removing (not replacing) any invalid or undefined byte sequences.
- .to_utf8(data) ⇒ String
  Convert a binary (or unknown-encoding) byte string to a valid UTF-8 String.
Class Method Details
.cmd_to_utf8(data, source_encoding: "GBK") ⇒ String
Convert raw shell command output to valid UTF-8. Handles two common cases:
- Windows commands (e.g. powershell.exe) that output GBK/CP936 bytes
- Unix commands that output UTF-8 or ASCII bytes with ASCII-8BIT encoding
Strategy: decode as source_encoding first (GBK by default, a superset of ASCII that covers Chinese Windows); if that raises a conversion error, fall back to the UTF-8 scrub in to_utf8.
    # File 'lib/clacky/utils/encoding.rb', line 68
    def self.cmd_to_utf8(data, source_encoding: "GBK")
      return "" if data.nil? || data.empty?

      data.dup
          .force_encoding(source_encoding)
          .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
    rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
      to_utf8(data)
    end
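As a usage sketch, the snippet below feeds GBK bytes and UTF-8 bytes through the method. EncodingSketch is a hypothetical stand-in whose bodies are copied from the listings on this page. The second half illustrates an assumption worth checking against real callers: because decoding uses source_encoding as given, UTF-8 bytes passed with the default "GBK" are interpreted as GBK, so a caller that knows the output is UTF-8 would presumably pass source_encoding: "UTF-8".

```ruby
# Hypothetical stand-in module: bodies copied from the listings on this page.
module EncodingSketch
  def self.to_utf8(data)
    return "" if data.nil? || data.empty?

    data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
  end

  def self.cmd_to_utf8(data, source_encoding: "GBK")
    return "" if data.nil? || data.empty?

    data.dup
        .force_encoding(source_encoding)
        .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
  rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
    to_utf8(data)
  end
end

# "中文" encoded as GBK (D6 D0 CE C4), as a Windows command might emit it.
gbk_bytes = "\xD6\xD0\xCE\xC4".b
from_gbk = EncodingSketch.cmd_to_utf8(gbk_bytes)

# "café" encoded as UTF-8, as a Unix command might emit it.
utf8_bytes = "caf\xC3\xA9".b
as_gbk  = EncodingSketch.cmd_to_utf8(utf8_bytes)                           # read as GBK
as_utf8 = EncodingSketch.cmd_to_utf8(utf8_bytes, source_encoding: "UTF-8") # read as UTF-8

puts from_gbk
puts as_gbk
puts as_utf8
```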
.pty_to_utf8(data) ⇒ String
Decode a raw PTY byte stream to valid UTF-8, auto-detecting the source encoding. UTF-8 is tried first (Linux/macOS and modern programs); when the bytes are not valid UTF-8 they are decoded as GBK/CP936 (Simplified Chinese Windows powershell.exe / cmd.exe default output); anything that still fails is scrubbed.
MUST be called on complete byte runs — callers slice on "\n" (0x0A), which is never a trailing byte of a GBK or UTF-8 multibyte sequence, so a character is never split across the boundary.
    # File 'lib/clacky/utils/encoding.rb', line 90
    def self.pty_to_utf8(data)
      return "" if data.nil? || data.empty?

      s = data.dup.force_encoding("UTF-8")
      return s if s.valid_encoding?

      data.dup
          .force_encoding("GBK")
          .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
    rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
      to_utf8(data)
    end
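The "complete byte runs" requirement can be sketched as a small line buffer. Everything outside pty_to_utf8 and to_utf8 is an assumption for illustration: EncodingSketch is a hypothetical stand-in with bodies copied from the listings on this page, and drain_complete_lines is an invented helper, not part of the library.

```ruby
# Hypothetical stand-in module: bodies copied from the listings on this page.
module EncodingSketch
  def self.to_utf8(data)
    return "" if data.nil? || data.empty?

    data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
  end

  def self.pty_to_utf8(data)
    return "" if data.nil? || data.empty?

    s = data.dup.force_encoding("UTF-8")
    return s if s.valid_encoding?

    data.dup
        .force_encoding("GBK")
        .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
    to_utf8(data)
  end
end

# Invented helper: decode only runs that end at "\n"; leave the rest buffered.
def drain_complete_lines(buffer)
  lines = []
  while (idx = buffer.index("\n"))
    lines << EncodingSketch.pty_to_utf8(buffer.slice!(0..idx))
  end
  lines
end

buffer = String.new(encoding: Encoding::BINARY)

# First read: one complete UTF-8 line, then a GBK "中文" cut after 3 of 4 bytes.
buffer << "ok \xE4\xB8\xAD\n".b
buffer << "\xD6\xD0\xCE".b
lines = drain_complete_lines(buffer) # the partial GBK run stays in the buffer

# Second read: the missing trail byte and the newline arrive.
buffer << "\xC4\n".b
lines.concat(drain_complete_lines(buffer))

lines.each { |line| puts line }
```

Keeping the buffer binary and decoding only up to each newline means the split GBK character is decoded once both of its bytes are present.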
.safe_check(str) ⇒ String
Return an ASCII-safe UTF-8 copy of str suitable for security regex pattern matching. Any byte that is not valid in the source encoding, or that cannot be represented in UTF-8, is replaced with '?'. The original string is never mutated.
    # File 'lib/clacky/utils/encoding.rb', line 110
    def self.safe_check(str)
      return "" if str.nil? || str.empty?

      str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
    end
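A minimal sketch of the intended use: normalise a string, then run a security pattern against the result. The pattern below is invented for illustration and is not one of the library's actual rules; EncodingSketch is a hypothetical stand-in whose body is copied from the listing above.

```ruby
# Hypothetical stand-in module: body copied from the listing on this page.
module EncodingSketch
  def self.safe_check(str)
    return "" if str.nil? || str.empty?

    str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  end
end

# Invented example pattern, for illustration only.
pattern = /\brm\s+-rf\b/

# A UTF-8 string that contains two invalid bytes.
cmd = "rm -rf /tmp/\xFF\xFE"
bytes_before = cmd.bytes

checked = EncodingSketch.safe_check(cmd)

puts checked.valid_encoding?   # the copy is valid UTF-8
puts checked.match?(pattern)   # so the pattern can be applied to it
puts cmd.bytes == bytes_before # the original string is untouched

# Characters that are already valid UTF-8 pass through unchanged.
puts EncodingSketch.safe_check("echo \u4E2D\u6587")
```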
.sanitize_utf8(str) ⇒ String
Clean an already-UTF-8 string by removing (not replacing) any invalid or undefined byte sequences. Suitable for terminal / UI rendering where replacement characters would appear as visual noise.
    # File 'lib/clacky/utils/encoding.rb', line 51
    def self.sanitize_utf8(str)
      return "" if str.nil? || str.empty?

      str.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
    end
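The removal behaviour can be seen in a short sketch. EncodingSketch is a hypothetical stand-in whose body is copied from the listing above; the sample line is made up.

```ruby
# Hypothetical stand-in module: body copied from the listing on this page.
module EncodingSketch
  def self.sanitize_utf8(str)
    return "" if str.nil? || str.empty?

    str.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
  end
end

# "ok ✓" (E2 9C 93 is the check mark) followed by a stray 0xFF byte.
line = "ok \xE2\x9C\x93\xFF done"

rendered = EncodingSketch.sanitize_utf8(line)

puts rendered                 # the check mark is kept, the stray byte is gone
puts rendered.valid_encoding?
puts EncodingSketch.sanitize_utf8(nil).inspect # nil and "" both yield ""
```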
.to_utf8(data) ⇒ String
Convert a binary (or unknown-encoding) byte string to a valid UTF-8 String. Multibyte sequences that are already valid UTF-8 (e.g. CJK characters) are preserved unchanged; only genuinely invalid byte sequences are replaced with U+FFFD (the Unicode replacement character).
    # File 'lib/clacky/utils/encoding.rb', line 39
    def self.to_utf8(data)
      return "" if data.nil? || data.empty?

      data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
    end
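A short sketch of the preserve-and-replace behaviour, using a byte string that ends in a truncated multibyte sequence. EncodingSketch is a hypothetical stand-in whose body is copied from the listing above.

```ruby
# Hypothetical stand-in module: body copied from the listing on this page.
module EncodingSketch
  def self.to_utf8(data)
    return "" if data.nil? || data.empty?

    data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
  end
end

# "café " as UTF-8 bytes, followed by a lone lead byte (0xC3) whose
# continuation byte never arrived. Tagged as binary, like raw IO output.
raw = "caf\xC3\xA9 \xC3".b

text = EncodingSketch.to_utf8(raw)

puts text.inspect   # "é" is preserved, the truncated sequence becomes U+FFFD
puts text.encoding  # UTF-8
puts raw.encoding   # the input keeps its binary tag because the method dups it
```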