Module: AllStak::Sanitizer

Defined in:
lib/allstak/sanitizer.rb

Constant Summary

REDACTED =
"[REDACTED]"
DEFAULT_DENYLIST =
%w[
  authorization
  proxy-authorization
  cookie
  set-cookie
  password
  passwd
  pwd
  api_key
  apikey
  x-api-key
  x-allstak-key
  x-auth-token
  x-access-token
  token
  bearer
  jwt
  session
  sessionid
  session_id
  secret
  credit_card
  card_number
  cvv
  ssn
  csrf
].freeze
ALLOWLIST =

Exact, CASE-SENSITIVE keys that look sensitive by substring but are NOT: they are first-class SDK telemetry fields that must survive scrubbing. The release-health `sessionId` (camelCase) carries the SDK's own per-process session id (a random UUID, not a user/auth session token); the backend error consumer needs it to attribute crashes, so it must never be redacted. Because the match is exact and case-sensitive, genuine cookie/auth keys like `session`, `session_id`, or `sessionid` (the lower-case denylist terms) are still scrubbed.

%w[
  sessionId
].freeze
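
The interaction between the allowlist and the substring denylist can be sketched in isolation. This is a minimal illustration, not the SDK's code: the names `ALLOWLIST_SKETCH`, `SESSION_DENYLIST_SKETCH`, and `allowlist_sketch_sensitive?` are hypothetical, and the denylist is trimmed to the session-related terms.

```ruby
# Hypothetical stand-ins for the SDK constants, trimmed to the session terms.
SESSION_DENYLIST_SKETCH = %w[session sessionid session_id].freeze
ALLOWLIST_SKETCH        = %w[sessionId].freeze

# Allowlist first (exact, case-sensitive), then a case-insensitive
# substring match against the denylist.
def allowlist_sketch_sensitive?(key)
  return false if ALLOWLIST_SKETCH.include?(key.to_s)

  lowered = key.to_s.downcase
  SESSION_DENYLIST_SKETCH.any? { |term| lowered.include?(term) }
end

%w[sessionId sessionid session_id SessionID].each do |key|
  verdict = allowlist_sketch_sensitive?(key) ? "redacted" : "kept"
  puts "#{key}: #{verdict}"
end
```

Only the exact spelling `sessionId` is kept. A key such as `SessionID` differs from the allowlist entry in case, so it falls through to the denylist and is redacted.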
MAX_SCAN_LENGTH =

Longest single string we will scan for value patterns. Larger strings are passed through untouched so a pathological multi-MB blob never stalls the wire path. Key-name redaction still applies to its containing key.

16_384
VALUE_SCRUB_SKIP_KEYS =

Keys whose scalar string value is exempt from value-pattern scrubbing. Matching is effectively case-insensitive: the table stores downcased names and the lookup downcases the key before checking it. These carry structured identifiers / locations that the patterns would otherwise corrupt: stack-frame fields, release/sdk/build metadata, span & trace ids, and URLs/paths (their own URL redactor owns them).

%w[
  filename
  function
  abspath
  abs_path
  lineno
  colno
  release
  version
  dist
  platform
  environment
  sdkname
  sdk_name
  sdkversion
  sdk_version
  sdk.name
  sdk.version
  commit.sha
  commit.branch
  commit_sha
  url
  path
  host
  hostname
  route
  operation
  op
  spanid
  span_id
  parentspanid
  parent_span_id
  traceid
  trace_id
  requestid
  request_id
  sessionid
  sessionId
  timestamp
].each_with_object({}) { |k, h| h[k.downcase] = true }.freeze
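
As a rough sketch of how a lookup table built this way behaves, the snippet below uses a shortened, hypothetical key list (`SKIP_KEYS_SKETCH`) and a hypothetical helper (`skip_key_sketch?`) that mirrors the downcase-then-lookup pattern.

```ruby
# Hypothetical, shortened key list; the real constant has many more entries.
SKIP_KEYS_SKETCH = %w[release url traceid trace_id sessionId]
  .each_with_object({}) { |k, h| h[k.downcase] = true }
  .freeze

# Downcase the incoming key, then do a constant-time Hash lookup.
def skip_key_sketch?(key)
  return false unless key.is_a?(String) || key.is_a?(Symbol)

  SKIP_KEYS_SKETCH.key?(key.to_s.downcase)
end

puts skip_key_sketch?("traceId")  # matches the downcased "traceid" entry
puts skip_key_sketch?(:URL)       # symbols are accepted too
puts skip_key_sketch?("message")  # not in the table
```

Because every entry is downcased when the table is built, listing both `sessionid` and `sessionId` in the source array produces a single table entry.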
VALUE_SCRUB_SKIP_SUBTREES =

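Subtrees that are never value-scrubbed, matched by key name (case-insensitively) wherever the walker encounters them. `user` holds data the caller explicitly set via setUser (intentional identification, so it ships as before). `frames`/`stackTrace` hold structured stack frames whose filenames/functions must not be corrupted. Key-name redaction still applies inside these subtrees.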

%w[
  user
  frames
  stackTrace
  stacktrace
].each_with_object({}) { |k, h| h[k.downcase] = true }.freeze
SSN_REGEX =

US SSN — REQUIRE the hyphens so bare 9-digit numbers (order ids, etc.) are not nuked. Compiled once.

/\b\d{3}-\d{2}-\d{4}\b/.freeze
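
A small sketch of the hyphen requirement, using a local copy of the pattern. The constant `SSN_SKETCH_REGEX` and the helper `redact_ssn_sketch` are illustrative names only.

```ruby
# Local copy of the documented pattern.
SSN_SKETCH_REGEX = /\b\d{3}-\d{2}-\d{4}\b/

def redact_ssn_sketch(text)
  text.gsub(SSN_SKETCH_REGEX, "[REDACTED]")
end

puts redact_ssn_sketch("SSN on file: 123-45-6789")  # hyphenated form is replaced
puts redact_ssn_sketch("order 123456789 shipped")   # bare 9-digit run is left alone
```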
CC_CANDIDATE_REGEX =

Candidate credit-card runs: 13–19 digits with optional single space/hyphen separators between groups. Luhn-validated before redaction (see #luhn?), so digit runs that fail the checksum (timestamps, order ids) survive. Word-boundary-ish anchors keep us from matching the middle of a longer digit string.

/(?<![\d-])(?:\d[ -]?){12,18}\d(?![\d-])/.freeze
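
To make the length bounds and separators concrete, the sketch below scans a few strings with a local copy of the pattern. `CC_CANDIDATE_SKETCH` and `cc_candidates_sketch` are hypothetical names; no Luhn check is applied at this stage.

```ruby
# Local copy of the documented candidate pattern.
CC_CANDIDATE_SKETCH = /(?<![\d-])(?:\d[ -]?){12,18}\d(?![\d-])/

# Returns every candidate run found in the text (no checksum yet).
def cc_candidates_sketch(text)
  text.scan(CC_CANDIDATE_SKETCH)
end

p cc_candidates_sketch("card 4111 1111 1111 1111 ok")  # spaced groups
p cc_candidates_sketch("card 4111-1111-1111-1111")     # hyphenated groups
p cc_candidates_sketch("id 123456789012")              # 12 digits: too short
p cc_candidates_sketch("id 12345678901234567890")      # 20 digits: too long
```

The last case yields no candidate at all rather than a 19-digit prefix, because the lookarounds refuse to start or stop in the middle of a longer digit run.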
EMAIL_REGEX =

Standard email address. Compiled once.

/\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b/.freeze
IPV4_OCTET =

IPv4 with each octet validated to 0–255. Compiled once.

'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
IPV4_REGEX =
/\b#{IPV4_OCTET}\.#{IPV4_OCTET}\.#{IPV4_OCTET}\.#{IPV4_OCTET}\b/.freeze
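
The octet validation can be exercised with a local copy of the two constants. `OCTET_SKETCH`, `IPV4_SKETCH`, and `redact_ipv4_sketch` are illustrative names.

```ruby
# Local copies of the documented octet fragment and the assembled pattern.
OCTET_SKETCH = '(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
IPV4_SKETCH  = /\b#{OCTET_SKETCH}\.#{OCTET_SKETCH}\.#{OCTET_SKETCH}\.#{OCTET_SKETCH}\b/

def redact_ipv4_sketch(text)
  text.gsub(IPV4_SKETCH, "[REDACTED]")
end

puts redact_ipv4_sketch("client 10.0.0.7 connected")  # valid address is replaced
puts redact_ipv4_sketch("bad 256.1.1.1")              # 256 is not a valid octet
puts redact_ipv4_sketch("1.2.3.4")                    # four-part version strings also match
```

The last line shows why a four-part version number is indistinguishable from an address at the pattern level, which is consistent with `version` and `release` appearing in VALUE_SCRUB_SKIP_KEYS.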
IPV6_REGEX =

IPv6 best-effort: 2+ groups of hex separated by colons, with optional compression. Intentionally loose: IPv6 detection is best-effort per spec.

/\b(?:[0-9A-Fa-f]{1,4}:){2,7}[0-9A-Fa-f]{0,4}\b|\b::(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4}\b/.freeze
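
A short sketch of what "intentionally loose" means in practice, using a local copy of the pattern (`IPV6_SKETCH` and `redact_ipv6_sketch` are hypothetical names). Only a fully written-out address and a clock-style string are shown here.

```ruby
# Local copy of the documented best-effort pattern.
IPV6_SKETCH = /\b(?:[0-9A-Fa-f]{1,4}:){2,7}[0-9A-Fa-f]{0,4}\b|\b::(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4}\b/

def redact_ipv6_sketch(text)
  text.gsub(IPV6_SKETCH, "[REDACTED]")
end

puts redact_ipv6_sketch("peer 2001:db8:0:0:0:0:0:1 closed")  # full 8-group address
puts redact_ipv6_sketch("started at 12:34:56")               # colon-separated time also matches
```

Since any run of two or more colon-separated hex groups qualifies, clock times are caught as well. That is consistent with `timestamp` being listed in VALUE_SCRUB_SKIP_KEYS.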

Class Method Summary

Class Method Details

.luhn?(digits) ⇒ Boolean

Luhn (mod-10) checksum over a string of digits.

Returns:

  • (Boolean)


# File 'lib/allstak/sanitizer.rb', line 305

def luhn?(digits)
  return false unless digits =~ /\A\d{13,19}\z/

  sum = 0
  double = false
  digits.reverse.each_char do |ch|
    d = ch.to_i
    if double
      d *= 2
      d -= 9 if d > 9
    end
    sum += d
    double = !double
  end
  (sum % 10).zero?
end
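
For a worked example of the arithmetic, the sketch below exposes the intermediate sum. `luhn_sum_sketch` and `luhn_sketch?` are hypothetical helpers written for illustration; they follow the same doubling rule and the same 13 to 19 digit guard as the listing above.

```ruby
# Returns the Luhn sum so the intermediate arithmetic is visible.
def luhn_sum_sketch(digits)
  digits.reverse.each_char.with_index.sum do |ch, i|
    d = ch.to_i
    d *= 2 if i.odd?   # every second digit, counting from the right
    d > 9 ? d - 9 : d  # same as adding the two digits of the product
  end
end

def luhn_sketch?(digits)
  return false unless digits.match?(/\A\d{13,19}\z/)

  (luhn_sum_sketch(digits) % 10).zero?
end

puts luhn_sum_sketch("4111111111111111")  # 8 undoubled ones + 7 doubled ones + doubled 4 = 30
puts luhn_sketch?("4111111111111111")     # 30 % 10 == 0
puts luhn_sketch?("4111111111111112")     # sum is 31
puts luhn_sketch?("79927398713")          # checksum is fine, but only 11 digits
```

The last call shows the effect of the length guard: a string can satisfy the mod-10 rule and still be rejected because it is outside the 13 to 19 digit range.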

.scrub(payload, extra_denylist: nil, send_default_pii: false, values: true) ⇒ Object

Returns a sanitized deep copy of `payload`.

Parameters:

  • extra_denylist (Array<String>, nil) (defaults to: nil)

    additional key terms to redact; may extend but not narrow the canonical list.

  • send_default_pii (Boolean) (defaults to: false)

    when true, the tier-B value scrubbers (email, IPv4/IPv6) are disabled — the caller has opted into PII. Tier-A (credit card, SSN) is ALWAYS applied. Default false (privacy-safe).

  • values (Boolean) (defaults to: true)

    when false, only key-name redaction runs (no value-pattern scrubbing). Useful for an intermediate pre-scrub (e.g. Sidekiq job args) where the wire-path scrub will value-scrub later with the authoritative config. Default true.



# File 'lib/allstak/sanitizer.rb', line 180

def scrub(payload, extra_denylist: nil, send_default_pii: false, values: true)
  denylist = DEFAULT_DENYLIST.dup
  denylist.concat(extra_denylist.map { |t| t.to_s.downcase }) if extra_denylist
  denylist.uniq!
  return walk_keys_only(payload, denylist, Set.new) unless values
  walk(payload, denylist, Set.new, send_default_pii)
end
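
The "extend but not narrow" behaviour of `extra_denylist` follows from how the list is assembled. A minimal sketch, assuming a two-term default list (`DENYLIST_DEFAULTS_SKETCH`) and a hypothetical helper `denylist_sketch` that isolates the merge step:

```ruby
# Hypothetical, shortened default list.
DENYLIST_DEFAULTS_SKETCH = %w[password token].freeze

# Copy the defaults, append caller terms (stringified and downcased),
# then drop duplicates. Nothing is ever removed from the defaults.
def denylist_sketch(extra_denylist)
  denylist = DENYLIST_DEFAULTS_SKETCH.dup
  denylist.concat(extra_denylist.map { |t| t.to_s.downcase }) if extra_denylist
  denylist.uniq!
  denylist
end

p denylist_sketch(nil)                           # defaults only
p denylist_sketch([:Password, "X-Internal-Id"])  # duplicate dropped, new term appended
```

Because the caller's terms are only appended to a copy of the defaults, passing an empty or overlapping list leaves the default terms in force.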

.scrub_credit_cards(str) ⇒ Object

Replace only those candidate credit-card runs that pass the Luhn checksum. A run that fails Luhn (e.g. an order id or timestamp that happens to be 13–19 digits) is left intact, minimizing over-redaction.



# File 'lib/allstak/sanitizer.rb', line 293

def scrub_credit_cards(str)
  str.gsub(CC_CANDIDATE_REGEX) do |match|
    digits = match.gsub(/[ -]/, "")
    if digits.length.between?(13, 19) && luhn?(digits)
      REDACTED
    else
      match
    end
  end
end
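
The effect of the Luhn filter can be seen on two 16-digit runs, only one of which has a valid checksum. This is a self-contained sketch: `CC_RUN_SKETCH`, `cc_luhn_sketch?`, and `scrub_cards_sketch` are illustrative names, and the checksum helper is a compact rewrite rather than the method documented above.

```ruby
CC_RUN_SKETCH = /(?<![\d-])(?:\d[ -]?){12,18}\d(?![\d-])/

# Compact Luhn check (illustrative rewrite of the documented algorithm).
def cc_luhn_sketch?(digits)
  total = digits.reverse.each_char.with_index.sum do |ch, i|
    d = ch.to_i
    d *= 2 if i.odd?
    d > 9 ? d - 9 : d
  end
  (total % 10).zero?
end

# Replace a candidate run only when its digits pass the checksum.
def scrub_cards_sketch(text)
  text.gsub(CC_RUN_SKETCH) do |match|
    digits = match.gsub(/[ -]/, "")
    digits.length.between?(13, 19) && cc_luhn_sketch?(digits) ? "[REDACTED]" : match
  end
end

puts scrub_cards_sketch("pay with 4111 1111 1111 1111 today")  # passes Luhn
puts scrub_cards_sketch("order 1234567890123456 shipped")      # fails Luhn, kept
```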

.scrub_value(str, send_default_pii) ⇒ Object

Apply value-pattern scrubbing to a single string. Fail-open: any error returns the original string. Oversized strings are passed through.



# File 'lib/allstak/sanitizer.rb', line 268

def scrub_value(str, send_default_pii)
  return str unless str.is_a?(String)
  return str if str.empty? || str.length > MAX_SCAN_LENGTH

  out = str

  # Tier A — ALWAYS (regardless of send_default_pii).
  out = out.gsub(SSN_REGEX, REDACTED)
  out = scrub_credit_cards(out)

  # Tier B — only when the caller has NOT opted into PII.
  unless send_default_pii
    out = out.gsub(EMAIL_REGEX, REDACTED)
    out = out.gsub(IPV4_REGEX, REDACTED)
    out = out.gsub(IPV6_REGEX, REDACTED)
  end

  out
rescue StandardError
  str
end
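
The two tiers and the size cutoff can be sketched with one pattern per tier. This is a reduced illustration, not the full method: the `TIER_*` constants and `tier_scrub_sketch` are hypothetical, with SSN standing in for tier A and email for tier B.

```ruby
TIER_REDACTED = "[REDACTED]"
TIER_MAX_SCAN = 16_384
TIER_A_SSN    = /\b\d{3}-\d{2}-\d{4}\b/
TIER_B_EMAIL  = /\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b/

def tier_scrub_sketch(str, send_default_pii)
  return str unless str.is_a?(String)
  return str if str.empty? || str.length > TIER_MAX_SCAN

  out = str.gsub(TIER_A_SSN, TIER_REDACTED)                             # tier A: always
  out = out.gsub(TIER_B_EMAIL, TIER_REDACTED) unless send_default_pii   # tier B: opt-out
  out
rescue StandardError
  str # fail-open: hand back the original string on any error
end

sample = "ssn 123-45-6789, mail jane@example.com"
puts tier_scrub_sketch(sample, false)  # both values replaced
puts tier_scrub_sketch(sample, true)   # SSN replaced, email kept

oversized = ("x" * 16_384) + " 123-45-6789"
puts tier_scrub_sketch(oversized, false).equal?(oversized)  # passed through untouched
```

The oversized case returns the very same string object, so anything past the cutoff relies on key-name redaction alone.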

.sensitive?(key, denylist) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/allstak/sanitizer.rb', line 188

def sensitive?(key, denylist)
  return false unless key.is_a?(String) || key.is_a?(Symbol)

  # Exact, case-sensitive allowlist wins: a first-class SDK field (e.g.
  # release-health `sessionId`) is never scrubbed even though its lowercase
  # form contains a denied substring. Checked against the ORIGINAL key so
  # `sessionId` survives while `sessionid`/`session_id`/`session` are scrubbed.
  return false if ALLOWLIST.include?(key.to_s)

  k = key.to_s.downcase
  denylist.any? { |term| k.include?(term) }
end

.skip_subtree?(key) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/allstak/sanitizer.rb', line 256

def skip_subtree?(key)
  return false unless key.is_a?(String) || key.is_a?(Symbol)
  VALUE_SCRUB_SKIP_SUBTREES.key?(key.to_s.downcase)
end

.skip_value_scrub_key?(key) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/allstak/sanitizer.rb', line 261

def skip_value_scrub_key?(key)
  return false unless key.is_a?(String) || key.is_a?(Symbol)
  VALUE_SCRUB_SKIP_KEYS.key?(key.to_s.downcase)
end

.walk(value, denylist, seen, send_default_pii) ⇒ Object

Recursively builds a sanitized deep copy of `value`. For each hash key, the first matching rule wins: a sensitive key is replaced with REDACTED, an exempt subtree gets key-name redaction only, a skip-listed key keeps its scalar value as-is, and everything else is walked and value-scrubbed. A hash or array whose object id has already been seen is replaced with REDACTED, which stops cyclic structures from recursing forever.



# File 'lib/allstak/sanitizer.rb', line 201

def walk(value, denylist, seen, send_default_pii)
  case value
  when Hash
    return REDACTED if seen.include?(value.object_id)

    seen.add(value.object_id)
    value.each_with_object({}) do |(k, v), out|
      out[k] =
        if sensitive?(k, denylist)
          REDACTED
        elsif skip_subtree?(k)
          # Explicit user object / stack frames: deep-copy with key-name
          # redaction still applied, but NO value-pattern scrubbing.
          walk_keys_only(v, denylist, seen)
        elsif skip_value_scrub_key?(k)
          # Structured scalar (release, url, span id, …): recurse for nested
          # collections, but do not value-scrub a scalar string here.
          v.is_a?(Hash) || v.is_a?(Array) ? walk(v, denylist, seen, send_default_pii) : v
        else
          walk(v, denylist, seen, send_default_pii)
        end
    end
  when Array
    return REDACTED if seen.include?(value.object_id)

    seen.add(value.object_id)
    value.map { |v| walk(v, denylist, seen, send_default_pii) }
  when String
    scrub_value(value, send_default_pii)
  else
    value
  end
end
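
The branch order in the hash case can be traced on a small event. The sketch below is deliberately reduced: every `MINI_*` constant and `mini_*` method is hypothetical, only an email pattern is scrubbed, and cycle tracking is omitted.

```ruby
MINI_REDACTED  = "[REDACTED]"
MINI_DENYLIST  = %w[password].freeze
MINI_SUBTREES  = { "user" => true }.freeze
MINI_SKIP_KEYS = { "url" => true }.freeze
MINI_EMAIL     = /\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b/

def mini_sensitive?(key)
  MINI_DENYLIST.any? { |term| key.to_s.downcase.include?(term) }
end

# Key-name redaction only; strings are copied through unchanged.
def mini_walk_keys_only(value)
  case value
  when Hash
    value.each_with_object({}) do |(k, v), out|
      out[k] = mini_sensitive?(k) ? MINI_REDACTED : mini_walk_keys_only(v)
    end
  when Array then value.map { |v| mini_walk_keys_only(v) }
  else value
  end
end

def mini_walk(value)
  case value
  when Hash
    value.each_with_object({}) do |(k, v), out|
      name = k.to_s.downcase
      out[k] =
        if mini_sensitive?(k)                # 1. denylisted key
          MINI_REDACTED
        elsif MINI_SUBTREES.key?(name)       # 2. exempt subtree
          mini_walk_keys_only(v)
        elsif MINI_SKIP_KEYS.key?(name) && !v.is_a?(Hash) && !v.is_a?(Array)
          v                                  # 3. structured scalar
        else
          mini_walk(v)                       # 4. everything else
        end
    end
  when Array  then value.map { |v| mini_walk(v) }
  when String then value.gsub(MINI_EMAIL, MINI_REDACTED)
  else value
  end
end

MINI_EVENT = {
  "message"  => "signup failed for jane@example.com",
  "password" => "hunter2",
  "user"     => { "email" => "jane@example.com", "password" => "hunter2" },
  "url"      => "https://example.com/u/jane@example.com"
}.freeze

MINI_RESULT = mini_walk(MINI_EVENT)
MINI_RESULT.each { |k, v| puts "#{k}: #{v.inspect}" }
```

In this run the email in `message` is replaced, the one under `user` is kept, the `password` keys are redacted at both levels, and the `url` string is left untouched.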

.walk_keys_only(value, denylist, seen) ⇒ Object

Recurse applying ONLY key-name redaction (no value-pattern scrubbing). Used for exempt subtrees (explicit user object, stack frames).



# File 'lib/allstak/sanitizer.rb', line 237

def walk_keys_only(value, denylist, seen)
  case value
  when Hash
    return REDACTED if seen.include?(value.object_id)

    seen.add(value.object_id)
    value.each_with_object({}) do |(k, v), out|
      out[k] = sensitive?(k, denylist) ? REDACTED : walk_keys_only(v, denylist, seen)
    end
  when Array
    return REDACTED if seen.include?(value.object_id)

    seen.add(value.object_id)
    value.map { |v| walk_keys_only(v, denylist, seen) }
  else
    value
  end
end
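
A sketch of key-only redaction on a self-referencing job hash, which is the kind of input the `seen` set guards against. `keys_only_sketch` is a hypothetical standalone helper that takes the denylist terms directly; the other constants are illustrative.

```ruby
require "set"

KEYS_ONLY_REDACTED = "[REDACTED]"

def keys_only_sketch(value, denylist, seen = Set.new)
  case value
  when Hash
    return KEYS_ONLY_REDACTED if seen.include?(value.object_id)

    seen.add(value.object_id)
    value.each_with_object({}) do |(k, v), out|
      hit = denylist.any? { |term| k.to_s.downcase.include?(term) }
      out[k] = hit ? KEYS_ONLY_REDACTED : keys_only_sketch(v, denylist, seen)
    end
  when Array
    return KEYS_ONLY_REDACTED if seen.include?(value.object_id)

    seen.add(value.object_id)
    value.map { |v| keys_only_sketch(v, denylist, seen) }
  else
    value # strings are NOT pattern-scrubbed on this path
  end
end

CYCLIC_JOB = { "args" => ["jane@example.com"], "token" => "abc123" }
CYCLIC_JOB["self"] = CYCLIC_JOB # self-reference: naive recursion would never end

KEYS_ONLY_RESULT = keys_only_sketch(CYCLIC_JOB, %w[token])
p KEYS_ONLY_RESULT

SHARED_HASH   = { "id" => 1 }
SHARED_RESULT = keys_only_sketch({ "a" => SHARED_HASH, "b" => SHARED_HASH }, %w[token])
p SHARED_RESULT
```

The email in `args` survives because no value patterns run on this path, while `token` and the back-reference under `self` are replaced. The second call shows a side effect of never removing ids from `seen`: a hash referenced twice without forming a cycle is also replaced on its second visit.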