Module: IuguLogger::Pii

Defined in:
lib/iugu_logger/pii.rb

Overview

PII detection and redaction module.

3-layer defense:

- Layer 1 (ParamFilter): blocks values of keys whose names match a
  sensitive blocklist (password, secret, token, etc.) BEFORE the deep
  content scan
- Layer 2 (Scanner): regex-based deep content redaction in all string
  fields, with strategy-based replacement (full_redact, last4,
  detect_only, preserve)
- Layer 3 (Logger): emitted log payload always carries pii.scanned=true
  populated by Scanner — handled in Logger, not here

PII patterns reuse those validated in production by core/utils/sanitizer.py (iugu-agents).

Decisions applied:

- ILS-002: iugu.account_id 32-hex preserved (SAFE_PATTERNS exclusion)
- ILS-003: email :detect_only by default (deferred — tech debt)

Spec: IUGU_LOGGING_STANDARD.md §5

Defined Under Namespace

Classes: Result, Scanner

Constant Summary

PATTERNS =
{
  cpf:            /\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b/,
  cnpj:           /\b\d{2}\.\d{3}\.\d{3}\/\d{4}-\d{2}\b/,
  email:          /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/,
  # Lookarounds (?<![\w-]) and (?![\w-]) require the phone-shaped digit
  # group to be flanked by non-identifier chars. Without them the regex
  # matched the middle of dense identifiers (span_id, trace_id, jids,
  # UUIDs without hyphens) and produced false positives that broke trace
  # correlation in production. SAFE_KEY_PATHS is the primary defense;
  # this is defense-in-depth for arbitrary user content.
  phone:          /(?<![\w-])\(?\d{2}\)?\s?9?\d{4}-?\d{4}(?![\w-])/,
  cc:             /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{1,7}\b/,
  aws_key:        /\bAKIA[0-9A-Z]{16}\b/,
  bearer:         /Bearer\s+[A-Za-z0-9\-._~+\/]+=*\b/i,
  url_with_creds: /https?:\/\/[^\/\s:]+:[^\/\s@]+@\S+/
}.freeze
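
The effect of the phone lookarounds described in the comment above can be checked in isolation. The sketch below copies `PATTERNS[:phone]` as `PHONE` and compares it with a hypothetical `PHONE_NAIVE` variant that has the lookarounds stripped; the constant names and the sample identifier are made up for illustration and are not part of the module.

```ruby
# Copied verbatim from PATTERNS[:phone].
PHONE = /(?<![\w-])\(?\d{2}\)?\s?9?\d{4}-?\d{4}(?![\w-])/

# Hypothetical variant without the lookarounds, for comparison only.
PHONE_NAIVE = /\(?\d{2}\)?\s?9?\d{4}-?\d{4}/

free_text = 'call (11) 98765-4321 today'
dense_id  = 'a1b21198765432c3' # made-up 16-hex value containing an 11-digit run

PHONE.match?(free_text)      # => true
PHONE_NAIVE.match?(dense_id) # => true  (false positive inside the identifier)
PHONE.match?(dense_id)       # => false (the digit run is flanked by word characters)
```

Because the lookbehind rejects any start position preceded by a word character or hyphen, a digit run buried inside an identifier can never begin a match.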
SAFE_PATTERNS =

Strings matching SAFE_PATTERNS are excluded from redaction even when they incidentally match a PII pattern. Hex identifiers (trace_id 32, span_id 16, account_id 32, UUID v4 / v7) are structural identifiers that by definition never carry PII; pre-empting them at the value level avoids the regex coincidence that any 10+ consecutive digits look like a Brazilian phone number.

{
  iugu_account_id: /\A[A-Fa-f0-9]{32}\z/, # 32-hex (case-insensitive: legacy uppercase + modern lowercase)
  otel_trace_id:   /\A[a-fA-F0-9]{32}\z/, # OpenTelemetry trace_id (16 bytes hex)
  otel_span_id:    /\A[a-fA-F0-9]{16}\z/, # OpenTelemetry span_id (8 bytes hex)
  uuid:            /\A[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}\z/i # UUID v4 / v7
}.freeze
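
One way to picture the value-level exclusion is a pre-check that runs before any PII pattern. The sketch below is an assumption about how such a check could be wired, not the Scanner's actual code: `SAFE`, `CC`, `safe_value?` and `cc_hit?` are hypothetical names, and the regexes are copied from SAFE_PATTERNS and `PATTERNS[:cc]`.

```ruby
# Copied from SAFE_PATTERNS (names shortened for the sketch).
SAFE = {
  iugu_account_id: /\A[A-Fa-f0-9]{32}\z/,
  otel_trace_id:   /\A[a-fA-F0-9]{32}\z/,
  otel_span_id:    /\A[a-fA-F0-9]{16}\z/,
  uuid:            /\A[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}\z/i
}.freeze

# Copied from PATTERNS[:cc].
CC = /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{1,7}\b/

# True when the WHOLE value is shaped like a structural identifier.
def safe_value?(value)
  SAFE.each_value.any? { |re| re.match?(value) }
end

# Hypothetical helper: consult the safe list first, then the PII pattern.
def cc_hit?(value)
  return false if safe_value?(value)

  CC.match?(value)
end

all_digit_span_id = '1234567890123456' # valid 16-hex span_id made only of digits

CC.match?(all_digit_span_id)        # => true  (looks like a 16-digit card number)
cc_hit?(all_digit_span_id)          # => false (pre-empted by the safe pattern)
cc_hit?('card 4111 1111 1111 1111') # => true  (free text is still scanned)
```

The safe patterns are anchored with `\A` and `\z`, so only a value that is entirely an identifier is exempt; an identifier embedded in a longer string is still scanned.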
SAFE_KEY_PATHS =

Canonical schema paths whose values are skipped entirely by the scan. These fields hold structural identifiers / controlled metadata defined by IUGU_LOGGING_STANDARD §2 — never user-supplied content. Skipping them prevents false-positive PII detection on hex/UUID-shaped values AND saves CPU on the hot path.

Path = dot-joined hash keys from the user_section root, e.g. a value at `payload['trace']['span_id']` has path "trace.span_id".

%w[
  @timestamp
  log.level
  event.kind
  event.action
  service.name
  service.version
  service.environment
  service.instance
  trace.id
  trace.span_id
  trace.parent_id
  request.id
  http.status_code
  http.duration_ms
].freeze
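
The dot-joined path rule can be sketched as a recursive walk that builds each path and skips safe ones before any regex runs. This is a minimal illustration under stated assumptions: `SAFE_PATHS` is a three-entry subset of the list above, `scannable_strings` is a hypothetical helper, and arrays are ignored for brevity.

```ruby
# Subset of SAFE_KEY_PATHS, enough for the example.
SAFE_PATHS = %w[trace.id trace.span_id request.id].freeze

# Collects { "dot.joined.path" => string_value } for every string that
# would still need scanning. Values under a safe path are never visited.
def scannable_strings(node, prefix = nil, out = {})
  node.each do |key, value|
    path = prefix ? "#{prefix}.#{key}" : key.to_s
    next if SAFE_PATHS.include?(path)

    case value
    when Hash   then scannable_strings(value, path, out)
    when String then out[path] = value
    end
  end
  out
end

payload = {
  'trace' => { 'id' => '0af7651916cd43dd8448eb211c80319c', 'span_id' => '1234567890123456' },
  'user'  => { 'note' => 'call 11987654321' }
}

scannable_strings(payload) # => {"user.note"=>"call 11987654321"}
```

Skipping by path means the identifier values are never handed to a regex at all, which is where the CPU saving mentioned above comes from.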
DEFAULT_STRATEGIES =

Default redaction strategies. Override via Configuration#pii_redaction.

Strategies:

:full_redact  → "[<TYPE>_REDACTED]"
:last4        → "**** **** **** 1234" (CC only)
:detect_only  → unchanged content, but `detected` is recorded
:preserve     → neither detected nor redacted (escape hatch)

Philosophy (data-completeness-first, since v0.7):

Operational logs in iugu serve ops, support, fraud analysts, compliance, and ML pipelines, not just engineers. Redacting personal data at emission time breaks those downstream consumers; the legacy rails_semantic_logger output that they already rely on includes full CPF, CNPJ, phone, address, email, and bank account details. We normalize that: `:detect_only` means we still RECORD that PII was found (so `pii.detected: [cpf, phone]` is queryable for audit) but we don't remove the values from the log. LGPD compliance is met via access control on the log store and retention policies, not via redaction at the source.

Things that DO stay redacted by default:

- Payment card numbers (`:cc` → `:last4`) — PCI-DSS hard rule
- Credentials (`aws_key`, `bearer`, `url_with_creds`) — these are
  never user data, only ever leak risk

Override per-app: any app needing stricter redaction (e.g. external log export targets) can set `:full_redact` for the types it needs via `IuguLogger.configure { |c| c.pii_redaction = … }`.

{
  cpf:            :detect_only,  # personal data — detected, not redacted (v0.7+)
  cnpj:           :detect_only,  # legal entity — detected, not redacted (v0.7+)
  email:          :detect_only,  # personal data — detected, not redacted (was always)
  phone:          :detect_only,  # personal data — detected, not redacted (v0.7+)
  cc:             :last4,        # PCI-DSS — last 4 only (KEPT)
  aws_key:        :full_redact,  # credential — never log (KEPT)
  bearer:         :full_redact,  # credential — never log (KEPT)
  url_with_creds: :full_redact   # credential — never log (KEPT)
}.freeze
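
To make the four strategies concrete, the sketch below applies them to a sample string. It is an illustration only, not the Scanner's implementation: `apply_strategy`, `redact`, `SAMPLE_PATTERNS` and `SAMPLE_STRATEGIES` are hypothetical names, the upper-cased type inside `[<TYPE>_REDACTED]` is an assumption, and merging overrides over the defaults with `Hash#merge` is likewise assumed.

```ruby
# Three entries copied from PATTERNS and DEFAULT_STRATEGIES.
SAMPLE_PATTERNS = {
  cpf:     /\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b/,
  cc:      /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{1,7}\b/,
  aws_key: /\bAKIA[0-9A-Z]{16}\b/
}.freeze
SAMPLE_STRATEGIES = { cpf: :detect_only, cc: :last4, aws_key: :full_redact }.freeze

# Replacement text for one match under one strategy.
def apply_strategy(type, match, strategy)
  case strategy
  when :full_redact then "[#{type.to_s.upcase}_REDACTED]" # casing is an assumption
  when :last4       then "**** **** **** #{match.gsub(/\D/, '')[-4, 4]}"
  when :detect_only then match
  else raise ArgumentError, "unknown strategy: #{strategy.inspect}"
  end
end

# Returns [rewritten_text, detected_types]. :preserve skips the pattern
# entirely, so the type is neither redacted nor recorded.
def redact(text, patterns, strategies)
  detected = []
  out = patterns.reduce(text) do |acc, (type, regex)|
    strategy = strategies.fetch(type)
    next acc if strategy == :preserve

    acc.gsub(regex) do |match|
      detected << type unless detected.include?(type)
      apply_strategy(type, match, strategy)
    end
  end
  [out, detected]
end

text = 'cpf 123.456.789-09 card 4111 1111 1111 1234 key AKIAIOSFODNN7EXAMPLE'

redact(text, SAMPLE_PATTERNS, SAMPLE_STRATEGIES)
# => ["cpf 123.456.789-09 card **** **** **** 1234 key [AWS_KEY_REDACTED]",
#     [:cpf, :cc, :aws_key]]

# Stricter per-app override, e.g. for an external export target.
redact(text, SAMPLE_PATTERNS, SAMPLE_STRATEGIES.merge(cpf: :full_redact)).first
# => "cpf [CPF_REDACTED] card **** **** **** 1234 key [AWS_KEY_REDACTED]"
```

Note how `:detect_only` leaves the CPF readable while still reporting `:cpf` in the detected list, which is the behavior the philosophy section relies on for audit queries.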
DEFAULT_PARAM_BLOCKLIST =

Layer 1: keys whose values are filtered before any scanning. Case-insensitive.

%w[
  password password_confirmation passwd
  secret token api_key apikey
  authorization auth bearer_token
  credit_card cc_number ccnumber cvv cvc
  ssn pin private_key
].freeze
PARAM_FILTER_PLACEHOLDER =
'[FILTERED]'
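
A Layer 1 filter built on these two constants might look like the sketch below. It assumes exact, case-insensitive key-name matching (the source only says "Case-insensitive"), and `BLOCKLIST`, `PLACEHOLDER` and `filter_params` are hypothetical names; ParamFilter's real interface is not shown on this page.

```ruby
# Copied from DEFAULT_PARAM_BLOCKLIST and PARAM_FILTER_PLACEHOLDER.
BLOCKLIST = %w[
  password password_confirmation passwd
  secret token api_key apikey
  authorization auth bearer_token
  credit_card cc_number ccnumber cvv cvc
  ssn pin private_key
].freeze
PLACEHOLDER = '[FILTERED]'

# Returns a copy of the structure with every blocklisted key's value
# replaced by the placeholder, recursing into nested hashes and arrays.
def filter_params(node)
  case node
  when Hash
    node.each_with_object({}) do |(key, value), out|
      blocked = BLOCKLIST.include?(key.to_s.downcase)
      out[key] = blocked ? PLACEHOLDER : filter_params(value)
    end
  when Array
    node.map { |item| filter_params(item) }
  else
    node
  end
end

params = { 'Password' => 'hunter2', 'user' => { 'API_KEY' => 'k-123', 'name' => 'Ana' } }

filter_params(params)
# => {"Password"=>"[FILTERED]", "user"=>{"API_KEY"=>"[FILTERED]", "name"=>"Ana"}}
```

Because the whole value is replaced before the deep content scan, a blocked value is never examined by the regexes, whatever its shape.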