Class: SasLinter::Rules::EncodingIssues
- Inherits: SasLinter::Rule
  - Object
    - SasLinter::Rule
      - SasLinter::Rules::EncodingIssues
- Defined in: lib/sas_linter/rules/encoding_issues.rb
Overview
Flag and (optionally) rewrite encoding issues in SAS source: smart quotes (`‘ ’ “ ”`), en/em dashes (`– —`), ellipsis (`…`), no-break space, U+2000–U+200A typographic spaces, line/paragraph separators, and the Windows-1252 single-byte forms of the same characters (0x91/0x92, 0x93/0x94, 0x96/0x97, 0xA0, …) that bypass UTF-8 transcoding when source files arrive from Word, Outlook, or a legacy Latin-1 round-trip.
The rule has two layers, both off by default:
1. `use_defaults: true` enables the canonical fix
table — UTF8_REPLACEMENTS (multibyte UTF-8 smart
punctuation) plus BYTE_REPLACEMENTS (single Windows-1252
bytes the lexer can't make sense of). The fixer walks the
source as a byte stream and only touches a Win-1252 byte
when it's not already part of a valid UTF-8 sequence — so
names like `MÖLLER` (`\xC3\x96`) survive intact.
2. `replacements: { from => to }` adds project-specific
string-level substitutions. Use this for site-local
cleanups, stylistic preferences (em-dash → "--" instead
of "-"), to override default behavior on specific bytes,
or to add extra characters not in the default table.
Findings carry a line:column position for every match. Autofix runs the user `replacements:` map FIRST and the canonical fixer SECOND, so a project-specific pattern can target byte sequences the canonical defaults would otherwise consume (e.g. catch a multi-byte ellipsis in a specific surname before the default `… → ...` rewrite hits it).
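The effect of that ordering can be sketched outside the linter. The snippet below is a simplified illustration, not the rule's implementation: `CANONICAL` is a hypothetical one-entry stand-in for the default table, `USER` is a hypothetical project map, and both passes use plain `gsub` (the real canonical pass walks bytes instead).

```ruby
# Hypothetical stand-ins for illustration only; not the rule's real tables.
ELLIPSIS  = "\xE2\x80\xA6".b                        # UTF-8 bytes of U+2026
CANONICAL = { ELLIPSIS => "...".b }.freeze          # stand-in for the default table
USER      = { "M\u2026LLER" => "M\u00D6LLER" }.freeze # project-specific repair

# Apply every from => to pair as a byte-level substitution.
def apply_map(source, map)
  map.inject(source.b) { |s, (from, to)| s.gsub(from.b, to.b) }
end

source = "name = 'M\u2026LLER'; /* etc\u2026 */"

# User map first: the surname is repaired before the ellipsis is flattened.
user_first = apply_map(apply_map(source, USER), CANONICAL)
# Canonical first: the ellipsis is already "..." so the user pattern never matches.
canonical_first = apply_map(apply_map(source, CANONICAL), USER)

puts user_first.force_encoding(Encoding::UTF_8)
puts canonical_first.force_encoding(Encoding::UTF_8)
```

With the user map first, the surname comes out as `MÖLLER` and the remaining ellipsis is still flattened; with the order reversed, the surname is left as `M...LLER`.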
Recognized config options:
use_defaults: true | false (default: false)
replacements: { String => String } (default: {})
autofix: true | false (default: false)
Constant Summary
- UTF8_REPLACEMENTS =
Map of UTF-8 (multi-byte) byte sequences to their ASCII replacement. Stored as raw byte strings so substitution doesn’t depend on the encoding state of the source.
{
  "\xE2\x80\x98".b => "'",   # U+2018 LEFT SINGLE QUOTATION MARK
  "\xE2\x80\x99".b => "'",   # U+2019 RIGHT SINGLE QUOTATION MARK
  "\xE2\x80\x9A".b => "'",   # U+201A SINGLE LOW-9 QUOTATION MARK
  "\xE2\x80\x9B".b => "'",   # U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
  "\xE2\x80\x9C".b => '"',   # U+201C LEFT DOUBLE QUOTATION MARK
  "\xE2\x80\x9D".b => '"',   # U+201D RIGHT DOUBLE QUOTATION MARK
  "\xE2\x80\x9E".b => '"',   # U+201E DOUBLE LOW-9 QUOTATION MARK
  "\xE2\x80\x93".b => "-",   # U+2013 EN DASH
  "\xE2\x80\x94".b => "-",   # U+2014 EM DASH
  "\xE2\x80\x95".b => "-",   # U+2015 HORIZONTAL BAR
  "\xE2\x80\xA6".b => "...", # U+2026 HORIZONTAL ELLIPSIS
  "\xC2\xA0".b => " ",       # U+00A0 NO-BREAK SPACE
  # Typographic spaces in the U+2000–U+200A range plus the line/para
  # separators. Word docs sprinkle these in liberally and SAS chokes
  # with `ERROR 217-322: Invalid statement due to first character
  # being unprintable`.
  "\xE2\x80\x80".b => " ",   # U+2000 EN QUAD
  "\xE2\x80\x81".b => " ",   # U+2001 EM QUAD
  "\xE2\x80\x82".b => " ",   # U+2002 EN SPACE
  "\xE2\x80\x83".b => " ",   # U+2003 EM SPACE
  "\xE2\x80\x84".b => " ",   # U+2004 THREE-PER-EM SPACE
  "\xE2\x80\x85".b => " ",   # U+2005 FOUR-PER-EM SPACE
  "\xE2\x80\x86".b => " ",   # U+2006 SIX-PER-EM SPACE
  "\xE2\x80\x87".b => " ",   # U+2007 FIGURE SPACE
  "\xE2\x80\x88".b => " ",   # U+2008 PUNCTUATION SPACE
  "\xE2\x80\x89".b => " ",   # U+2009 THIN SPACE
  "\xE2\x80\x8A".b => " ",   # U+200A HAIR SPACE
  "\xE2\x80\xA8".b => "\n",  # U+2028 LINE SEPARATOR
  "\xE2\x80\xA9".b => "\n",  # U+2029 PARAGRAPH SEPARATOR
  # Mac Roman 0xD0–0xD5 misread as Win-1252 → Latin-1 letters.
  # Mac OS Roman uses these byte slots for smart punctuation; when
  # `SasLinter#read_source` interprets a Mac-Roman-authored file
  # as Win-1252 it produces these spurious Latin-1 letters in the
  # post-transcode UTF-8. A typical SAS source corpus (English,
  # with documents that round-tripped through Word on Mac) shows
  # this as Latin-1 letters Ð / Ò / Ó / Ô / Õ standing in for
  # smart-punctuation glyphs.
  #
  # Skipping U+00D1 (Ñ) since it has too much legitimate Spanish-
  # name traffic to safely auto-replace.
  "\xC3\x90".b => "-",       # U+00D0 Ð (Mac Roman: en dash)
  "\xC3\x92".b => '"',       # U+00D2 Ò (Mac Roman: left double quote)
  "\xC3\x93".b => '"',       # U+00D3 Ó (Mac Roman: right double quote)
  "\xC3\x94".b => "'",       # U+00D4 Ô (Mac Roman: left single quote)
  "\xC3\x95".b => "'",       # U+00D5 Õ (Mac Roman: right single quote)
}.freeze
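Why the keys are stored as raw byte strings can be seen with a one-entry table. This is an illustrative sketch only: `TABLE` is a hypothetical stand-in for the real constant, and the point is that Ruby treats two non-ASCII strings with the same bytes but different encoding tags as different hash keys.

```ruby
# Illustrative one-entry table; the real constant is UTF8_REPLACEMENTS.
TABLE = { "\xE2\x80\x99".b => "'" }.freeze # U+2019 stored as raw bytes

utf8_key   = "\u2019"   # the same three bytes, tagged UTF-8
binary_key = utf8_key.b # the same three bytes, tagged ASCII-8BIT

puts utf8_key.bytes == binary_key.bytes # byte content is identical
puts TABLE.key?(utf8_key)               # encoding tag differs, so the lookup misses
puts TABLE.key?(binary_key)             # normalized with .b, so the lookup hits
```

Normalizing both the table and the lookup key with `.b` keeps the match independent of whatever encoding the source string happens to carry.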
- BYTE_REPLACEMENTS =
Map of single Windows-1252 bytes (0x80-0x9F range, plus 0xA0) to their ASCII replacement. These bytes are invalid as standalone UTF-8 but are how Word documents on legacy Windows render the same characters covered by UTF8_REPLACEMENTS above. Only applied to bytes that are not part of a valid UTF-8 sequence in context.
0x85 (HORIZONTAL ELLIPSIS) is intentionally absent. In real-world SAS source corpora a standalone 0x85 is overwhelmingly a corrupted Latin-1 letter inside a name (e.g. `\x85` standing in for `Ö`, `Á`, etc.), not a real ellipsis. Mapping it to `...` would corrupt those names. Recovering the proper letter is data-loss recovery and belongs in a separate, source-specific fix configured via the user `replacements:` map.
{
  0x82 => "'", # SINGLE LOW-9 QUOTATION MARK
  0x91 => "'", # LEFT SINGLE QUOTATION MARK
  0x92 => "'", # RIGHT SINGLE QUOTATION MARK
  0x93 => '"', # LEFT DOUBLE QUOTATION MARK
  0x94 => '"', # RIGHT DOUBLE QUOTATION MARK
  0x96 => "-", # EN DASH
  0x97 => "-", # EM DASH
  0xA0 => " ", # NO-BREAK SPACE
}.freeze
Instance Attribute Summary
-
#replacements ⇒ Object
readonly
Returns the value of attribute replacements.
-
#use_defaults ⇒ Object
readonly
Returns the value of attribute use_defaults.
Class Method Summary
- .from_config(opts = {}) ⇒ Object
- .supports_autofix? ⇒ Boolean
Instance Method Summary
-
#apply_canonical_fix(source) ⇒ Object
Walk `source` as a byte stream applying UTF8_REPLACEMENTS and BYTE_REPLACEMENTS.
- #autofix(source) ⇒ Object
-
#check(_tokens, path:, all_tokens: nil, source: nil) ⇒ Object
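Collect findings for `source`: default-table matches when `use_defaults` is set, plus matches for any user `replacements:`. Returns `[]` when no `source` is given.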
-
#initialize(use_defaults: false, replacements: {}, autofix: false) ⇒ EncodingIssues
constructor
A new instance of EncodingIssues.
Methods inherited from SasLinter::Rule
all, #autofix?, description, fetch, inherited, register, registry, rule_id, severity
Constructor Details
#initialize(use_defaults: false, replacements: {}, autofix: false) ⇒ EncodingIssues
Returns a new instance of EncodingIssues.
# File 'lib/sas_linter/rules/encoding_issues.rb', line 139

def initialize(use_defaults: false, replacements: {}, autofix: false)
  super(autofix: autofix)
  @use_defaults = use_defaults
  @replacements = replacements
end
Instance Attribute Details
#replacements ⇒ Object (readonly)
Returns the value of attribute replacements.
# File 'lib/sas_linter/rules/encoding_issues.rb', line 137

def replacements
  @replacements
end
#use_defaults ⇒ Object (readonly)
Returns the value of attribute use_defaults.
# File 'lib/sas_linter/rules/encoding_issues.rb', line 137

def use_defaults
  @use_defaults
end
Class Method Details
.from_config(opts = {}) ⇒ Object
# File 'lib/sas_linter/rules/encoding_issues.rb', line 127

def self.from_config(opts = {})
  opts = opts.transform_keys(&:to_s)
  replacements = (opts["replacements"] || {}).to_h { |k, v| [k.to_s, v.to_s] }
  new(
    use_defaults: opts.fetch("use_defaults", false) ? true : false,
    replacements: replacements,
    autofix: opts["autofix"] ? true : false
  )
end
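To see what that normalization does to loosely typed config input, here is a standalone sketch. `normalize_options` is a hypothetical helper that mirrors the coercion steps of `from_config` and returns a plain Hash instead of building a rule, so it runs without the gem.

```ruby
# Hypothetical helper mirroring the coercions in `from_config`; it returns
# the keyword arguments as a Hash rather than instantiating the rule.
def normalize_options(opts = {})
  opts = opts.transform_keys(&:to_s) # accept symbol or string keys
  {
    use_defaults: opts.fetch("use_defaults", false) ? true : false,
    replacements: (opts["replacements"] || {}).to_h { |k, v| [k.to_s, v.to_s] },
    autofix: opts["autofix"] ? true : false
  }
end

# Mixed key types, a truthy non-boolean flag, and non-String replacement entries.
raw = { use_defaults: "yes", "replacements" => { em: 2 }, "autofix" => nil }
normalized = normalize_options(raw)
p normalized
```

Any truthy value enables a flag, a missing or `nil` value disables it, and replacement keys and values are always coerced to strings.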
.supports_autofix? ⇒ Boolean
# File 'lib/sas_linter/rules/encoding_issues.rb', line 123

def self.supports_autofix?
  true
end
Instance Method Details
#apply_canonical_fix(source) ⇒ Object
Walk `source` as a byte stream applying UTF8_REPLACEMENTS and BYTE_REPLACEMENTS. A Win-1252 byte is replaced only when it is not already part of a valid UTF-8 multibyte sequence, so `M\xC3\x96LLER` (the Ö in `MÖLLER`) survives, but a standalone `\x96` becomes `-`. Returns ASCII-8BIT.
# File 'lib/sas_linter/rules/encoding_issues.rb', line 182

def apply_canonical_fix(source)
  bytes = source.bytes
  out = String.new(encoding: Encoding::BINARY)
  i = 0
  n = bytes.length
  while i < n
    b = bytes[i]
    # Plain ASCII passes through untouched.
    if b < 0x80
      out << b
      i += 1
      next
    end
    seq_len = utf8_sequence_length(bytes, i, n)
    if seq_len.positive?
      # Valid UTF-8 sequence: replace it only if the table lists it.
      seq = bytes[i, seq_len].pack("C*")
      replacement = UTF8_REPLACEMENTS[seq]
      out << (replacement ? replacement.b : seq)
      i += seq_len
    else
      # Standalone high byte: treat it as Windows-1252.
      replacement = BYTE_REPLACEMENTS[b]
      out << (replacement ? replacement.b : b)
      i += 1
    end
  end
  out
end
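The listing relies on a private `utf8_sequence_length` helper that this page does not show. A minimal sketch of what such a helper might look like follows, assuming it returns the length of a well-formed sequence starting at `i` and 0 otherwise. The name `utf8_sequence_length_sketch` is hypothetical, and the real method may be stricter (for example about overlong or surrogate encodings).

```ruby
# Hypothetical sketch; the real private helper is not shown on this page.
# Returns the byte length of a UTF-8 sequence starting at bytes[i], or 0
# when the byte is a stray continuation byte, an invalid lead, or truncated.
def utf8_sequence_length_sketch(bytes, i, n)
  lead = bytes[i]
  len =
    if lead.between?(0xC2, 0xDF) then 2
    elsif lead.between?(0xE0, 0xEF) then 3
    elsif lead.between?(0xF0, 0xF4) then 4
    else return 0 # continuation byte or invalid lead byte
    end
  return 0 if i + len > n # sequence cut off by end of input
  # Every trailing byte must be a continuation byte (10xxxxxx).
  bytes[i + 1, len - 1].all? { |b| (b & 0xC0) == 0x80 } ? len : 0
end

moller = "M\xC3\x96LLER".b.bytes # Ö encoded as a valid two-byte sequence
stray  = "a \x96 b".b.bytes      # standalone Win-1252 en dash

puts utf8_sequence_length_sketch(moller, 1, moller.length)
puts utf8_sequence_length_sketch(stray, 2, stray.length)
```

A positive result sends the bytes down the UTF8_REPLACEMENTS branch; a zero result lets the single byte fall through to BYTE_REPLACEMENTS, which is what keeps the `\x96` inside `Ö` from being rewritten as a dash.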
#autofix(source) ⇒ Object
# File 'lib/sas_linter/rules/encoding_issues.rb', line 154

def autofix(source)
  # User `replacements:` run FIRST so a project-specific pattern
  # can target byte sequences the canonical defaults would
  # otherwise consume. The motivating case: a SAS source corpus
  # has stray `\x85` bytes inside surnames (a corrupted Latin-1
  # letter) that `SasLinter#read_source` transcodes to
  # `\xE2\x80\xA6` (the UTF-8 ellipsis); a project-specific
  # `"M…LLER": "MÖLLER"` entry only fires if it sees those bytes
  # BEFORE the canonical map rewrites them to `...`.
  #
  # `.b` is required on every side of the `gsub` so mismatched
  # encodings can't blow up `String#gsub` with
  # `Encoding::CompatibilityError`. `source` arrives from
  # `lint_file` as UTF-8; `@replacements` keys/values from YAML
  # are UTF-8; `apply_canonical_fix` returns ASCII-8BIT.
  # Either combination raises when a multi-byte pattern
  # actually matches. `.b` returns a binary-encoded duplicate
  # without altering bytes, so the substitution stays byte-
  # faithful regardless of which direction the mismatch goes.
  step1 = @replacements.inject(source.b) { |s, (from, to)| s.gsub(from.b, to.b) }
  @use_defaults ? apply_canonical_fix(step1) : step1
end
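The encoding hazard described in that comment can be reproduced in isolation. The sketch below uses made-up sample strings; the mixed-encoding call is wrapped in a `rescue` because it is expected to raise `Encoding::CompatibilityError`, and the snippet reports whichever outcome occurs rather than assuming one.

```ruby
source  = "x = 'caf\u00E9';" # UTF-8 source containing a non-ASCII character
pattern = "\xC3\xA9".b       # the same character as a binary-tagged pattern

# Mixed encodings: a UTF-8 receiver with a binary, non-ASCII pattern.
mixed =
  begin
    source.gsub(pattern, "e")
  rescue Encoding::CompatibilityError => e
    "raised #{e.class}"
  end
puts mixed

# Normalizing every operand with .b keeps the substitution byte-faithful.
safe = source.b.gsub(pattern.b, "e".b)
puts safe
```

The second form is the shape `autofix` uses: every operand goes through `.b`, so the result is binary-tagged and the substitution cannot trip over an encoding mismatch.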
#check(_tokens, path:, all_tokens: nil, source: nil) ⇒ Object
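Collect findings for `source`: default-table matches when `use_defaults` is set, plus matches for any user `replacements:`. Returns `[]` when no `source` is given.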
# File 'lib/sas_linter/rules/encoding_issues.rb', line 145

def check(_tokens, path:, all_tokens: nil, source: nil) # rubocop:disable Lint/UnusedMethodArgument
  return [] unless source

  findings = []
  findings.concat(default_findings(source, path: path)) if @use_defaults
  findings.concat(replacement_findings(source, path: path)) unless @replacements.empty?
  findings
end
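The private finders `default_findings` and `replacement_findings` are not shown on this page. As a rough illustration of how a line:column position could be derived for each match, the hypothetical helper below scans line by line. It assumes 1-based lines and 1-based character columns, which may not match the rule's actual convention.

```ruby
# Hypothetical helper, not part of the rule. Assumes 1-based line numbers
# and 1-based character (not byte) columns.
def positions_of(source, needle)
  positions = []
  source.each_line.with_index(1) do |line, lineno|
    offset = 0
    while (idx = line.index(needle, offset))
      positions << [lineno, idx + 1]
      offset = idx + needle.length
    end
  end
  positions
end

src = "title \u201Cx\u201D;\nrun;\nfootnote \u201Cy\u201D;\n"
found = positions_of(src, "\u201C")
p found
```

Each `[line, column]` pair corresponds to one left double quotation mark in the sample source, one entry per occurrence.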