Module: Clacky::Utils::Encoding
Defined in: lib/clacky/utils/encoding.rb
Overview
Centralised UTF-8 encoding helpers used throughout the codebase.
Three distinct use-cases exist:
1. to_utf8 – binary/unknown bytes → valid UTF-8 String.
Used when reading shell output, HTTP response bodies,
or any raw byte stream that is *expected* to be UTF-8
but arrives with ASCII-8BIT (binary) encoding.
Strategy: force_encoding("UTF-8") then scrub invalid
sequences with U+FFFD so multibyte characters (CJK,
emoji, …) are preserved as-is.
2. sanitize_utf8 – UTF-8 String → clean UTF-8 String.
Used for UI rendering (terminal output, screen
buffers) where the string is already nominally UTF-8
but may still contain isolated invalid bytes.
Strategy: encode UTF-8→UTF-8 replacing invalid /
undefined codepoints with an empty string so the
rendered output never contains replacement characters.
3. safe_check – any String → ASCII-safe UTF-8 String for regex.
Used only for security pattern matching (safe_shell).
Invalid or unconvertible bytes are replaced with '?'
(every non-ASCII byte, when the input is tagged
ASCII-8BIT) so that Ruby's regex engine can match
without raising encoding errors.
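The three strategies differ mainly in what they substitute for a bad byte. The following is a minimal sketch, not the Clacky module itself: the module name StrategySketch, its method names, and the sample input are assumptions for illustration, and each body mirrors the String calls described above.

```ruby
# Hypothetical stand-in module; each method mirrors one strategy from the overview.
module StrategySketch
  # 1. to_utf8 strategy: retag as UTF-8, replace invalid sequences with U+FFFD.
  def self.to_utf8_style(bytes)
    bytes.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
  end

  # 2. sanitize_utf8 strategy: UTF-8 to UTF-8, dropping invalid sequences.
  #    The input is retagged first to simulate a "nominally UTF-8" string.
  def self.sanitize_style(bytes)
    tagged = bytes.dup.force_encoding("UTF-8")
    tagged.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
  end

  # 3. safe_check strategy: transcode to UTF-8, substituting "?" for bad bytes.
  def self.safe_check_style(str)
    str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  end
end

# Sample input: "ok ", the UTF-8 bytes of "中", a space, and a stray 0xFF,
# tagged ASCII-8BIT as raw shell output typically is.
raw = "ok \xE4\xB8\xAD \xFF".b

StrategySketch.to_utf8_style(raw)    # "ok 中 " followed by U+FFFD
StrategySketch.sanitize_style(raw)   # "ok 中 " (the stray byte is dropped)
StrategySketch.safe_check_style(raw) # "ok ??? ?" (binary input: every non-ASCII byte becomes "?")
```

Note that the third result depends on the input's encoding tag: the same call on a valid UTF-8-tagged string would leave "中" intact.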
Class Method Summary
- .cmd_to_utf8(data, source_encoding: "GBK") ⇒ String
  Convert raw shell command output to valid UTF-8.
- .safe_check(str) ⇒ String
  Return an ASCII-safe UTF-8 copy of str suitable for security regex pattern matching.
- .sanitize_utf8(str) ⇒ String
  Clean an already-UTF-8 string by removing (not replacing) any invalid or undefined byte sequences.
- .to_utf8(data) ⇒ String
  Convert a binary (or unknown-encoding) byte string to a valid UTF-8 String.
Class Method Details
.cmd_to_utf8(data, source_encoding: "GBK") ⇒ String
Convert raw shell command output to valid UTF-8. Handles two common cases:
- Windows commands (e.g. powershell.exe) that output GBK/CP936 bytes
- Unix commands that output UTF-8 or ASCII bytes with ASCII-8BIT encoding
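Strategy: tag the bytes as source_encoding (GBK by default, a superset of ASCII that covers Chinese-locale Windows) and transcode to UTF-8, dropping bytes that are invalid or have no UTF-8 mapping; if transcoding raises an encoding error, fall back to to_utf8.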
    # File 'lib/clacky/utils/encoding.rb', line 68
    def self.cmd_to_utf8(data, source_encoding: "GBK")
      return "" if data.nil? || data.empty?

      data.dup
          .force_encoding(source_encoding)
          .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
    rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
      to_utf8(data)
    end
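As an illustration of the GBK path, here is a small sketch that copies the two method bodies from the listings into a stand-in module so it runs without loading Clacky. The module name EncodingSketch and the sample bytes are assumptions for illustration.

```ruby
# Hypothetical stand-in module; method bodies copied from the listings.
module EncodingSketch
  def self.to_utf8(data)
    return "" if data.nil? || data.empty?

    data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
  end

  def self.cmd_to_utf8(data, source_encoding: "GBK")
    return "" if data.nil? || data.empty?

    data.dup
        .force_encoding(source_encoding)
        .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
  rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
    to_utf8(data)
  end
end

# "中文" encoded as GBK is the four bytes D6 D0 CE C4.
gbk_bytes = "\xD6\xD0\xCE\xC4".b

EncodingSketch.cmd_to_utf8(gbk_bytes)    # "中文", now tagged UTF-8
EncodingSketch.cmd_to_utf8("hello\n".b)  # "hello\n" (ASCII is identical in GBK and UTF-8)
EncodingSketch.cmd_to_utf8(nil)          # ""
```

Because the default decodes as GBK, a caller that knows the output is UTF-8 may prefer to pass source_encoding: "UTF-8" explicitly.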
.safe_check(str) ⇒ String
Return an ASCII-safe UTF-8 copy of str suitable for security regex pattern matching. Any byte that is not valid in the source encoding, or that cannot be represented in UTF-8, is replaced with ‘?’. The original string is never mutated.
    # File 'lib/clacky/utils/encoding.rb', line 85
    def self.safe_check(str)
      return "" if str.nil? || str.empty?

      str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
    end
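A short sketch of how the result might feed a security check. The module name EncodingSketch, the DANGEROUS pattern, and the sample command are hypothetical; only the method body comes from the listing.

```ruby
# Hypothetical stand-in module; method body copied from the listing.
module EncodingSketch
  def self.safe_check(str)
    return "" if str.nil? || str.empty?

    str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  end
end

# Hypothetical pattern, for illustration only.
DANGEROUS = /\brm\s+-rf\b/

# Raw command bytes tagged ASCII-8BIT; the path ends with the UTF-8 bytes of "中".
raw = "rm -rf /tmp/\xE4\xB8\xAD".b

checked = EncodingSketch.safe_check(raw)  # "rm -rf /tmp/???"
checked.match?(DANGEROUS)                 # true
raw.encoding == Encoding::BINARY          # true: the input is not mutated

# A string that is already valid UTF-8 passes through unchanged.
EncodingSketch.safe_check("rm -rf /tmp/中")  # "rm -rf /tmp/中"
```

The replacement only affects bytes that cannot be converted, so the ASCII portion that the pattern targets is matched exactly as written.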
.sanitize_utf8(str) ⇒ String
Clean an already-UTF-8 string by removing (not replacing) any invalid or undefined byte sequences. Suitable for terminal / UI rendering where replacement characters would appear as visual noise.
    # File 'lib/clacky/utils/encoding.rb', line 51
    def self.sanitize_utf8(str)
      return "" if str.nil? || str.empty?

      str.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
    end
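A minimal sketch of the removal behaviour on a UI string, assuming a Ruby version in which a same-encoding encode call with invalid: :replace scrubs invalid bytes. The module name EncodingSketch and the sample text are assumptions for illustration.

```ruby
# Hypothetical stand-in module; method body copied from the listing.
module EncodingSketch
  def self.sanitize_utf8(str)
    return "" if str.nil? || str.empty?

    str.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
  end
end

# A UTF-8-tagged status line: the bytes of "✓" (E2 9C 93), some ASCII,
# and one stray 0xFF byte in the middle.
line = "\xE2\x9C\x93 50%\xFF done".b.force_encoding("UTF-8")

line.valid_encoding?                      # false
clean = EncodingSketch.sanitize_utf8(line)
clean                                     # "✓ 50% done"
clean.valid_encoding?                     # true
clean.include?("\u{FFFD}")                # false: nothing is substituted
```

Dropping the byte keeps the rendered width predictable, at the cost of hiding the fact that the input was damaged.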
.to_utf8(data) ⇒ String
Convert a binary (or unknown-encoding) byte string to a valid UTF-8 String. Multibyte sequences that are already valid UTF-8 (e.g. CJK characters) are preserved unchanged; only genuinely invalid byte sequences are replaced with U+FFFD (the Unicode replacement character).
    # File 'lib/clacky/utils/encoding.rb', line 39
    def self.to_utf8(data)
      return "" if data.nil? || data.empty?

      data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
    end
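The following sketch shows the preserve-and-replace behaviour and that the caller's string is left untouched. The module name EncodingSketch and the sample bytes are assumptions for illustration; the method body is copied from the listing.

```ruby
# Hypothetical stand-in module; method body copied from the listing.
module EncodingSketch
  def self.to_utf8(data)
    return "" if data.nil? || data.empty?

    data.dup.force_encoding("UTF-8").scrub("\u{FFFD}")
  end
end

# The UTF-8 bytes of "中" followed by a stray 0xFF, tagged ASCII-8BIT.
raw = "\xE4\xB8\xAD\xFF".b

text = EncodingSketch.to_utf8(raw)
text                              # "中" followed by U+FFFD
text.encoding == Encoding::UTF_8  # true
text.valid_encoding?              # true
raw.encoding == Encoding::BINARY  # true: dup keeps the original as it was
```

Using force_encoding rather than transcoding is what keeps valid multibyte sequences intact: the bytes are reinterpreted, not converted.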