Module: DataRedactor

Defined in: lib/data_redactor.rb,
lib/data_redactor/version.rb,
lib/data_redactor/name_pattern.rb,
lib/data_redactor/integrations/rack.rb,
lib/data_redactor/integrations/rails.rb,
lib/data_redactor/integrations/logger.rb,
ext/data_redactor/data_redactor.c

Overview

High-performance regex-based redactor for sensitive data.

DataRedactor scans text for sensitive patterns (API keys, IBANs, national IDs, emails, phone numbers, etc.) and replaces matches with a configurable placeholder. The matching is done by a C extension backed by POSIX regex.h, so it is fast enough to run inline on large payloads.

Examples:

Basic redaction

DataRedactor.redact("key is AKIAIOSFODNN7EXAMPLE")
# => "key is [REDACTED]"

Filter by tag or pattern name

DataRedactor.redact(text, only: :credentials)
DataRedactor.redact(text, except: [:contact, :network])
DataRedactor.redact(text, only: :contact, except: ["email"])
DataRedactor.redact(text, only: ["aws_access_key_id"])

Custom placeholder

DataRedactor.redact(text, placeholder: "***")
DataRedactor.redact(text, placeholder: :tagged) # => "[REDACTED:CONTACT]"
DataRedactor.redact(text, placeholder: :hash)   # => "[CONTACT_a3f9]"

Audit / dry-run

DataRedactor.scan(text)
# => { redacted: "...", matches: [{tag:, name:, value:, start:, length:}, ...] }

Custom pattern

DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")

Defined Under Namespace

Modules: Integrations

Classes: InvalidPatternError, UnknownPatternError, UnknownTagError

Constant Summary

TAGS = Map of tag symbol to the integer bit used by the C layer. The keys of this hash are the canonical list of supported tags; pass any of them to redact or scan via only: / except:. Returns: (Hash{Symbol => Integer}) — frozen tag-to-bit map

{
  credentials: TAG_CREDENTIALS,
  financial:   TAG_FINANCIAL,
  tax_id:      TAG_TAX_ID,
  national_id: TAG_NATIONAL_ID,
  contact:     TAG_CONTACT,
  network:     TAG_NETWORK,
  travel:      TAG_TRAVEL,
  other:       TAG_OTHER,
  custom:      TAG_CUSTOM
}.freeze

CAPTURE_GROUP_RE =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Capture groups in a user pattern break the boundary wrapper's group-index assumptions (the wrapper reads groups [1] and [3], and extra groups shift those indices).

/(?<!\\)\((?!\?:)/.freeze

RUBY_ONLY_SYNTAX_RE =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Ruby regex syntax that has no POSIX ERE equivalent.

/\\[dDwWsShHbB]|\(\?[<!=]|\(\?<[a-zA-Z]|\(\?[imx]|[*+?]\?/.freeze

PLACEHOLDER_DEFAULT = Default placeholder used when placeholder: is not given to redact.

"[REDACTED]"

CHUNK_SIZE =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Inputs larger than this (bytes) are split into newline-bounded chunks before being handed to the C engine. Bounds the per-call O(N) cost glibc regexec pays for state-log allocation, turning total redaction cost from O(N²) (one giant pass) into O(N × CHUNK_SIZE) (many bounded passes). 64 KB is a compromise: small enough to keep per-call cost low, large enough that typical log/JSON inputs use few chunks. See option G in TODO.md.

64 * 1024

VERSION = Current gem version. Follows Semantic Versioning 2.0.0.

"0.10.1"

DIACRITIC_FOLD =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Maps a base ASCII letter to the set of accented characters that should also match it. Used to make generated name patterns diacritic-tolerant: an input “Jose” still matches “José”, and “Munoz” matches “Muñoz”.

{
  "a" => "àáâãäåāăą",
  "c" => "çćĉċč",
  "e" => "èéêëēĕėęě",
  "i" => "ìíîïĩīĭįı",
  "n" => "ñńņňŉ",
  "o" => "òóôõöøōŏő",
  "u" => "ùúûüũūŭůűų",
  "y" => "ýÿŷ",
  "s" => "śŝşš",
  "z" => "źżž",
  "g" => "ĝğġģ",
  "l" => "ĺļľŀł",
  "r" => "ŕŗř",
  "t" => "ţťŧ"
}.freeze

BUILTIN_PATTERN_NAMES =

rb_ary_freeze(builtin_names)

BUILTIN_PATTERN_TAG_BITS =

rb_ary_freeze(builtin_tag_bits)

BUILTIN_PATTERN_SOURCES =

rb_ary_freeze(builtin_sources)

BUILTIN_PATTERN_BOUNDARY =

rb_ary_freeze(builtin_boundary)

PH_MODE_PLAIN = Placeholder mode constants.

INT2NUM(PLACEHOLDER_MODE_PLAIN)

PH_MODE_TAGGED =

INT2NUM(PLACEHOLDER_MODE_TAGGED)

PH_MODE_HASH =

INT2NUM(PLACEHOLDER_MODE_HASH)

TAG_CREDENTIALS = Tag bitmask values used by the Ruby wrapper to build only/except masks.

INT2NUM(TAG_CREDENTIALS)

TAG_FINANCIAL =

INT2NUM(TAG_FINANCIAL)

TAG_TAX_ID =

INT2NUM(TAG_TAX_ID)

TAG_NATIONAL_ID =

INT2NUM(TAG_NATIONAL_ID)

TAG_CONTACT =

INT2NUM(TAG_CONTACT)

TAG_NETWORK =

INT2NUM(TAG_NETWORK)

TAG_TRAVEL =

INT2NUM(TAG_TRAVEL)

TAG_OTHER =

INT2NUM(TAG_OTHER)

TAG_CUSTOM =

INT2NUM(TAG_CUSTOM)

TAG_ALL =

INT2NUM(TAG_ALL)

Class Method Summary

._add_pattern ⇒ Object

Note: _redact(text, ph_mode, ph_str, enable_bits) and _scan(text, enable_bits).
._ascii_base(char) ⇒ String? private

If char is an accented letter, return the bare ASCII letter it folds to; otherwise nil.
._chunk_bytes(text) ⇒ Array<String> private

Split text into byte-bounded chunks for the chunked redact/scan path.
._chunked_scan(text, enable_bits) ⇒ Hash{Symbol => Object} private

Chunked variant of _scan: runs the C scanner on each chunk, then offsets each match’s :start by the chunk’s base byte-position in the original input so the byteslice invariant holds end-to-end.
._clear_custom_patterns ⇒ Object
._custom_patterns ⇒ Object
._letter_class(char) ⇒ String private

Build a POSIX bracket expression matching one letter case-insensitively and, where applicable, its accented variants.
._part_token(part) ⇒ String private

Build the alternation for one name part: the full case-insensitive name, or its initial (with optional dot).
._redact(rb_text, rb_ph_mode, rb_ph_str, rb_enable_bits) ⇒ Object
._remove_pattern ⇒ Object
._scan(rb_text, rb_enable_bits) ⇒ Object
._validate_name_arg!(value, label) ⇒ Object private
._walk(node, only:, except:, placeholder:, seen:) ⇒ Object private

Depth-first recursive walker for DataRedactor.redact_deep.
._word_alternatives(word) ⇒ Array<String> private

Alternatives for a single whitespace-free word: the full name (each letter as a case-insensitive, diacritic-folded class) and its initial.
.add_pattern(name:, regex:, tag: :custom, boundary: false) ⇒ Boolean

Register a custom redaction pattern.
.build_enable_bits(only, except) ⇒ Array<Integer> private

Build the per-pattern enable bit-list passed to the C layer.
.clear_custom_patterns! ⇒ nil

Remove every registered custom pattern.
.custom_patterns ⇒ Array<Hash{Symbol => Object}>

List every currently registered custom pattern.
.name_pattern(first, last, middle: nil) ⇒ String

Build a POSIX ERE that matches a person’s name across common written variations, ready to hand to DataRedactor.add_pattern.
.pattern_enabled?(name, tag_bit, only_present, only_bits, only_names, except_bits, except_names) ⇒ Boolean private
.pattern_names ⇒ Array<String>

List of every pattern name the redactor knows about.
.redact(text, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ String

Redact every match of the configured patterns in text.
.redact_deep(data, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ Hash, ...

Recursively redact every String value in a nested Hash/Array structure.
.redact_json(json_string, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ String

Parse json_string, redact every String value in the resulting structure, and return valid JSON.
.remove_pattern(name) ⇒ Boolean

Remove a previously registered custom pattern.
.resolve_placeholder(placeholder) ⇒ Array(Integer, String) private

Translate the user-facing placeholder: value into the (mode_int, str) pair the C layer expects.
.scan(text, only: nil, except: nil) ⇒ Hash{Symbol => Object}

Scan text and return both the redacted string and per-match metadata.
.split_filter(entries) ⇒ Array(Integer, Set<String>) private

Split a mixed Symbol/String filter list into (tag_bitmask, name_set).
.tags ⇒ Array<Symbol>

List of supported tag symbols.

Class Method Details

._add_pattern ⇒ `Object`

Note: _redact(text, ph_mode, ph_str, enable_bits) and _scan(text, enable_bits).

._ascii_base(char) ⇒ `String`?

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

If char is an accented letter, return the bare ASCII letter it folds to; otherwise nil.

Parameters:

char (String) —

a single lowercase character.

Returns:

(String, nil)

# File 'lib/data_redactor/name_pattern.rb', line 159

def _ascii_base(char)
  DIACRITIC_FOLD.each { |ascii, accents| return ascii if accents.include?(char) }
  nil
end

._chunk_bytes(text) ⇒ `Array<String>`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Split text into byte-bounded chunks for the chunked redact/scan path. Chunks end at a \n when possible so no match straddles a boundary. If no \n falls inside a CHUNK_SIZE window (a single line longer than CHUNK_SIZE, rare in real inputs), the code below falls back to a hard cut at CHUNK_SIZE bytes, so a match spanning that cut can be missed; this is a documented limitation. Returns an Array of byte-Strings whose concatenation equals text exactly (including the original newline separators).

Parameters:

text (String)

Returns:

(Array<String>)

# File 'lib/data_redactor.rb', line 452

def _chunk_bytes(text)
  chunks = []
  pos = 0
  len = text.bytesize
  while pos < len
    remaining = len - pos
    if remaining <= CHUNK_SIZE
      chunks << text.byteslice(pos, remaining)
      break
    end
    # Find the last \n in [pos, pos+CHUNK_SIZE). If none, chunk is one long
    # line — take CHUNK_SIZE bytes as a fallback (boundary-split risk).
    window = text.byteslice(pos, CHUNK_SIZE)
    nl = window.rindex("\n")
    take = nl ? nl + 1 : CHUNK_SIZE
    chunks << text.byteslice(pos, take)
    pos += take
  end
  chunks
end
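
The chunking rule is easier to follow with a tiny chunk size. The following is a minimal standalone sketch, not the gem's code path: chunk_bytes and its chunk_size parameter are hypothetical names, and it works on a binary copy of the input so that every index is a byte offset.

```ruby
# Hypothetical re-implementation of the chunking rule with a tunable size.
def chunk_bytes(text, chunk_size)
  bytes = text.b # binary copy: rindex and byteslice now agree on byte offsets
  chunks = []
  pos = 0
  while pos < bytes.bytesize
    remaining = bytes.bytesize - pos
    if remaining <= chunk_size
      chunks << bytes.byteslice(pos, remaining)
      break
    end
    window = bytes.byteslice(pos, chunk_size)
    nl = window.rindex("\n")
    take = nl ? nl + 1 : chunk_size # hard cut when the window has no newline
    chunks << bytes.byteslice(pos, take)
    pos += take
  end
  chunks
end

text = "alpha\nbravo\ncharlie delta echo\nfoxtrot\n"
chunks = chunk_bytes(text, 12)
# chunks => ["alpha\nbravo\n", "charlie delt", "a echo\n", "foxtrot\n"]
```

The second chunk shows the hard-cut fallback: the third line is longer than the window, so it is split mid-word, while the concatenation of all chunks still equals the input.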

._chunked_scan(text, enable_bits) ⇒ `Hash{Symbol => Object}`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Chunked variant of _scan: runs the C scanner on each chunk, then offsets each match’s :start by the chunk’s base byte-position in the original input so the byteslice invariant holds end-to-end.

Parameters:

text (String)
enable_bits (Array<Integer>)

Returns:

(Hash{Symbol => Object}) —

{ redacted: String, matches: Array<Hash> }

# File 'lib/data_redactor.rb', line 481

def _chunked_scan(text, enable_bits)
  redacted = +""
  matches = []
  base = 0
  _chunk_bytes(text).each do |chunk|
    part = _scan(chunk, enable_bits)
    redacted << part[:redacted]
    part[:matches].each do |m|
      m[:start] += base
      matches << m
    end
    base += chunk.bytesize
  end
  { redacted: redacted, matches: matches }
end
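
The offset rebasing can be checked in isolation. This is a minimal sketch under stated assumptions: toy_scan is a hypothetical pure-Ruby stand-in for the C scanner (it only finds digit runs), and the chunks come from String#lines rather than from _chunk_bytes.

```ruby
# Hypothetical stand-in for the C scanner: reports digit runs with byte
# offsets relative to the chunk it was given.
def toy_scan(chunk)
  matches = []
  chunk.b.scan(/[0-9]+/) do
    m = Regexp.last_match
    matches << { value: m[0], start: m.begin(0), length: m[0].bytesize }
  end
  matches
end

# Rebase each chunk-relative :start by the chunk's byte position in the input.
def chunked_matches(chunks)
  matches = []
  base = 0
  chunks.each do |chunk|
    toy_scan(chunk).each { |m| matches << m.merge(start: m[:start] + base) }
    base += chunk.bytesize
  end
  matches
end

text = "id 42 for José\nsecond line 1234\n"
matches = chunked_matches(text.lines)

# The byteslice invariant: slicing the original input by (start, length)
# yields the matched value, even after a multi-byte character.
ok = matches.all? { |m| text.byteslice(m[:start], m[:length]) == m[:value] }
```

Because base advances by chunk.bytesize, the multi-byte "é" in the first line is accounted for and the second match still lands on the right bytes.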

._clear_custom_patterns ⇒ `Object`

._custom_patterns ⇒ `Object`

._letter_class(char) ⇒ `String`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Build a POSIX bracket expression matching one letter case-insensitively and, where applicable, its accented variants.

Parameters:

char (String) —

a single character.

Returns:

(String) —

a bracket expression, e.g. “[Mm]” or “[EeÈÉÊËèéêë]”.

# File 'lib/data_redactor/name_pattern.rb', line 137

def _letter_class(char)
  down = char.downcase
  up   = char.upcase
  members = [down]
  members << up unless up == down

  base = DIACRITIC_FOLD.key?(down) ? down : _ascii_base(down)
  if base && DIACRITIC_FOLD.key?(base)
    accented = DIACRITIC_FOLD[base]
    members << accented << accented.upcase
    members << base << base.upcase # accented input still matches bare ASCII
  end

  "[#{members.join}]"
end
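
To see what the generated classes accept, here is a minimal standalone sketch with a reduced fold table. FOLD, ascii_base, and letter_class are hypothetical local copies, and Ruby's Regexp is used only as a stand-in for the POSIX engine that the gem actually compiles patterns with.

```ruby
# Reduced, hypothetical fold table (the real DIACRITIC_FOLD has more entries).
FOLD = { "e" => "èéêë", "n" => "ñ" }.freeze

def ascii_base(char)
  FOLD.each { |ascii, accents| return ascii if accents.include?(char) }
  nil
end

def letter_class(char)
  down = char.downcase
  up = char.upcase
  members = [down]
  members << up unless up == down
  base = FOLD.key?(down) ? down : ascii_base(down)
  if base
    accented = FOLD[base]
    # accented forms in both cases, plus the bare ASCII letter
    members << accented << accented.upcase << base << base.upcase
  end
  "[#{members.join}]"
end

# Build a whole-word matcher for "Jose" out of per-letter classes.
jose = Regexp.new("\\A" + "Jose".chars.map { |c| letter_class(c) }.join + "\\z")

plain_class    = letter_class("m")  # no fold entry: just both cases
accented_class = letter_class("é")  # accented input also admits bare "e"
```

A bracket expression is order-insensitive, so the repeated base letters that the method appends are harmless.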

._part_token(part) ⇒ `String`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Build the alternation for one name part: the full case-insensitive name, or its initial (with optional dot). Hyphenated/multi-word parts also match each sub-word alone and tolerant separators between sub-words.

Parameters:

part (String) —

a single name part, e.g. “Mario” or “Anne-Marie”.

Returns:

(String) —

a parenthesised POSIX ERE alternation.

# File 'lib/data_redactor/name_pattern.rb', line 103

def _part_token(part)
  words = part.split(/[ -]+/).reject(&:empty?)

  word_alts = words.map { |w| _word_alternatives(w) }

  forms = []
  # whole part with tolerant separators between its words
  forms << word_alts.map { |alts| "(#{alts.join('|')})" }.join("[ -]?")
  # each word on its own (covers "Anne" / "Marie" from "Anne-Marie")
  if words.length > 1
    word_alts.each { |alts| forms << "(#{alts.join('|')})" }
  end

  "(#{forms.uniq.join('|')})"
end

._redact(rb_text, rb_ph_mode, rb_ph_str, rb_enable_bits) ⇒ `Object`

# File 'ext/data_redactor/redact.c', line 182

VALUE rb_data_redactor_redact(VALUE self, VALUE rb_text,
                              VALUE rb_ph_mode, VALUE rb_ph_str,
                              VALUE rb_enable_bits) {
    Check_Type(rb_text,         T_STRING);
    Check_Type(rb_ph_str,       T_STRING);
    Check_Type(rb_enable_bits,  T_ARRAY);

    int ph_mode = NUM2INT(rb_ph_mode);
    const char *ph_str_plain = StringValueCStr(rb_ph_str);

    const char *input = RSTRING_PTR(rb_text);
    size_t in_len = (size_t)RSTRING_LEN(rb_text);

    /* Stage 1: built-ins through the fast v19 engine (single pass, resolved to
     * earlier-index-wins). */
    int *bits = builtin_enable_bits(rb_enable_bits);
    if (!bits) rb_raise(rb_eNoMemError, "enable_bits allocation failed");
    size_t work_len = 0;
    char *working = redact_builtins(input, in_len, bits, ph_mode, ph_str_plain, &work_len);
    free(bits);
    if (!working) rb_raise(rb_eNoMemError, "built-in redaction allocation failed");

    /* Stage 2: custom patterns through the glibc regexec path, on the buffer the
     * built-ins already rewrote — preserving the sequential built-ins→customs
     * order and full UTF-8 matching for user regex (see Gap 2 hybrid split). The
     * "[REDACTED…]" placeholders introduce none of any custom pattern's literals
     * incidentally beyond what today already did. */
    placeholder_t ph;
    ph.mode = ph_mode;
    for (int i = 0; i < custom_count; i++) {
        if (!enable_bit(rb_enable_bits, NUM_PATTERNS + i)) continue;
        ph.str = (ph_mode == PLACEHOLDER_MODE_PLAIN)
                     ? ph_str_plain
                     : tag_name_for_bit(custom_patterns[i].tag);
        char *result = replace_all_matches(&custom_patterns[i].compiled, working,
                                           custom_patterns[i].boundary, &ph);
        free(working);
        if (!result) rb_raise(rb_eNoMemError, "replace_all_matches allocation failed (custom)");
        working = result;
    }

    VALUE rb_result = rb_str_new_cstr(working);
    free(working);
    /* Preserve the input's encoding. We go through Ruby's force_encoding rather
     * than the C rb_enc_* API because pulling in ruby/encoding.h drags in
     * onigmo.h, whose regex_t collides with the POSIX <regex.h> this TU uses for
     * the custom-pattern path. Placeholders are pure ASCII, valid in every
     * encoding the gem accepts. */
    rb_funcall(rb_result, rb_intern("force_encoding"), 1,
               rb_funcall(rb_text, rb_intern("encoding"), 0));
    return rb_result;
}

._remove_pattern ⇒ `Object`

._scan(rb_text, rb_enable_bits) ⇒ `Object`

# File 'ext/data_redactor/scan.c', line 48

VALUE rb_data_redactor_scan(VALUE self, VALUE rb_text, VALUE rb_enable_bits) {
    Check_Type(rb_text,        T_STRING);
    Check_Type(rb_enable_bits, T_ARRAY);

    const char *input  = RSTRING_PTR(rb_text);
    size_t      in_len = (size_t)RSTRING_LEN(rb_text);

    static const placeholder_t ph_plain = { PLACEHOLDER_MODE_PLAIN, "[REDACTED]" };

    /* ------------------------------------------------------------------ */
    /* Stage 1: built-ins through v19 (original-frame coords, no rewrite  */
    /* coordinate mapping needed).                                         */
    /* ------------------------------------------------------------------ */

    /* Build enable-bits array for built-ins. */
    int *bits = (int *)malloc((size_t)NUM_PATTERNS * sizeof(int));
    if (!bits) rb_raise(rb_eNoMemError, "enable_bits allocation failed");
    long alen = RARRAY_LEN(rb_enable_bits);
    for (int i = 0; i < NUM_PATTERNS; i++) {
        if (i < alen) {
            VALUE v = rb_ary_entry(rb_enable_bits, i);
            bits[i] = (RTEST(v) && NUM2INT(v) != 0) ? 1 : 0;
        } else {
            bits[i] = 0;
        }
    }

    /* Scan + resolve, growing buffer if needed. */
    size_t cap = in_len / 4 + 16;
    mm_match_t *ev = NULL;
    size_t n_ev;
    for (;;) {
        mm_match_t *grown = (mm_match_t *)realloc(ev, cap * sizeof(mm_match_t));
        if (!grown) { free(ev); free(bits); rb_raise(rb_eNoMemError, "mm_scan alloc"); }
        ev = grown;
        n_ev = mm_scan(input, in_len, bits, (size_t)NUM_PATTERNS, ev, cap);
        if (n_ev < cap) break;
        cap *= 2;
    }
    free(bits);
    n_ev = mm_resolve(ev, n_ev);

    /* Collect built-in match hashes. */
    VALUE matches_arr = rb_ary_new();
    for (size_t i = 0; i < n_ev; i++) {
        int   pid = ev[i].pattern_id;
        VALUE h   = rb_hash_new();
        rb_hash_aset(h, ID2SYM(rb_intern("tag")),
                     ID2SYM(rb_intern(tag_name_for_bit(pattern_tags[pid]))));
        rb_hash_aset(h, ID2SYM(rb_intern("name")),
                     rb_str_new_cstr(pattern_names[pid]));
        rb_hash_aset(h, ID2SYM(rb_intern("value")),
                     rb_str_new(input + ev[i].start, ev[i].length));
        rb_hash_aset(h, ID2SYM(rb_intern("start")),
                     LONG2NUM((long)ev[i].start));
        rb_hash_aset(h, ID2SYM(rb_intern("length")),
                     LONG2NUM((long)ev[i].length));
        rb_ary_push(matches_arr, h);
    }

    /* Build the redacted working buffer (same logic as redact_builtins). */
    size_t ph_len  = strlen(ph_plain.str); /* "[REDACTED]" = 10 */
    size_t out_cap = in_len + n_ev * ph_len + 1;
    char *working  = (char *)malloc(out_cap);
    if (!working) { free(ev); rb_raise(rb_eNoMemError, "scan working buffer alloc"); }

    size_t out_len = 0, cur = 0;
    for (size_t i = 0; i < n_ev; i++) {
        size_t s = ev[i].start, l = ev[i].length;
        if (s > cur) { memcpy(working + out_len, input + cur, s - cur); out_len += s - cur; }
        memcpy(working + out_len, ph_plain.str, ph_len);
        out_len += ph_len;
        cur = s + l;
    }
    if (cur < in_len) { memcpy(working + out_len, input + cur, in_len - cur); out_len += in_len - cur; }
    working[out_len] = '\0';

    /* ------------------------------------------------------------------ */
    /* Stage 2: custom patterns via glibc on the rewritten buffer.         */
    /* Original coords recovered via working_to_orig() using ev[].         */
    /* ------------------------------------------------------------------ */
    for (int i = 0; i < custom_count; i++) {
        if (!scan_enable_bit(rb_enable_bits, NUM_PATTERNS + i)) continue;

        const char *cur_ptr = working;
        regmatch_t  m[4];
        while (regexec(&custom_patterns[i].compiled, cur_ptr, 4, m, 0) == 0) {
            regoff_t fso = m[0].rm_so, feo = m[0].rm_eo;
            if (fso < 0 || feo < fso) break;

            regoff_t cso = fso, ceo = feo;
            if (custom_patterns[i].boundary) {
                if (m[1].rm_so >= 0 && m[1].rm_eo > m[1].rm_so) cso = m[1].rm_eo;
                if (m[3].rm_so >= 0 && m[3].rm_eo > m[3].rm_so) ceo = m[3].rm_so;
            }

            long wpos_core  = (long)(cur_ptr - working) + (long)cso;
            long orig_start = working_to_orig(wpos_core, ev, n_ev, ph_len);
            long core_len   = (long)(ceo - cso);

            VALUE h = rb_hash_new();
            rb_hash_aset(h, ID2SYM(rb_intern("tag")),
                         ID2SYM(rb_intern(tag_name_for_bit(custom_patterns[i].tag))));
            rb_hash_aset(h, ID2SYM(rb_intern("name")),
                         rb_str_new_cstr(custom_patterns[i].name));
            rb_hash_aset(h, ID2SYM(rb_intern("value")),
                         rb_str_new(cur_ptr + cso, (size_t)core_len));
            rb_hash_aset(h, ID2SYM(rb_intern("start")),  LONG2NUM(orig_start));
            rb_hash_aset(h, ID2SYM(rb_intern("length")), LONG2NUM(core_len));
            rb_ary_push(matches_arr, h);

            if (feo == fso) { if (*cur_ptr) cur_ptr++; else break; }
            else cur_ptr += feo;
        }

        char *next = replace_all_matches(&custom_patterns[i].compiled, working,
                                         custom_patterns[i].boundary, &ph_plain);
        free(working);
        if (!next) { free(ev); rb_raise(rb_eNoMemError, "replace_all_matches failed in scan"); }
        working = next;
    }

    free(ev);

    VALUE result      = rb_hash_new();
    VALUE rb_redacted = rb_str_new_cstr(working);
    free(working);
    rb_funcall(rb_redacted, rb_intern("force_encoding"), 1,
               rb_funcall(rb_text, rb_intern("encoding"), 0));
    rb_hash_aset(result, ID2SYM(rb_intern("redacted")), rb_redacted);
    rb_hash_aset(result, ID2SYM(rb_intern("matches")),  matches_arr);
    return result;
}
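
Stage 2 reports custom-pattern matches in original coordinates even though it scans the rewritten buffer. The body of working_to_orig is not shown on this page, so the following Ruby sketch is only an assumption about how such a mapping can work: it takes the sorted, non-overlapping built-in spans and the fixed placeholder length, and undoes the length change that each earlier replacement introduced. The rewrite helper and the sample strings are hypothetical.

```ruby
PLACEHOLDER = "[REDACTED]"

# Replace each (start, length) span of orig with the placeholder.
def rewrite(orig, events)
  out = +""
  cur = 0
  events.each do |ev|
    out << orig.byteslice(cur, ev[:start] - cur) << PLACEHOLDER
    cur = ev[:start] + ev[:length]
  end
  out << orig.byteslice(cur, orig.bytesize - cur)
end

# Map a byte position in the rewritten buffer back to the original input.
def working_to_orig(wpos, events, ph_len)
  shift = 0 # original position minus working position, accumulated so far
  events.each do |ev|
    w_start = ev[:start] - shift        # where this placeholder starts in the buffer
    break if wpos < w_start             # position lies before this replacement
    return ev[:start] if wpos < w_start + ph_len # inside a placeholder: clamp
    shift += ev[:length] - ph_len
  end
  wpos + shift
end

orig    = "aaa SECRET bbb TOKEN 77"
events  = [{ start: 4, length: 6 }, { start: 15, length: 5 }]
working = rewrite(orig, events)

wpos     = working.index("77")
orig_pos = working_to_orig(wpos, events, PLACEHOLDER.bytesize)
```

This sketch only accounts for the spans listed in events; it does not model further rewrites of the buffer by earlier custom patterns.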

._validate_name_arg!(value, label) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Raises:

(ArgumentError)

# File 'lib/data_redactor/name_pattern.rb', line 165

def _validate_name_arg!(value, label)
  return if value.is_a?(String) && !value.strip.empty?

  raise ArgumentError, "#{label} must be a non-empty String, got #{value.inspect}"
end

._walk(node, only:, except:, placeholder:, seen:) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Depth-first recursive walker for redact_deep. seen is a Set of object_ids already on the current traversal stack, used to detect circular references.

# File 'lib/data_redactor.rb', line 393

def _walk(node, only:, except:, placeholder:, seen:)
  case node
  when String
    redact(node, only: only, except: except, placeholder: placeholder)
  when Hash
    raise ArgumentError, "redact_deep: circular reference detected" if seen.include?(node.object_id)
    seen.add(node.object_id)
    result = node.transform_values { |v| _walk(v, only: only, except: except, placeholder: placeholder, seen: seen) }
    seen.delete(node.object_id)
    result
  when Array
    raise ArgumentError, "redact_deep: circular reference detected" if seen.include?(node.object_id)
    seen.add(node.object_id)
    result = node.map { |v| _walk(v, only: only, except: except, placeholder: placeholder, seen: seen) }
    seen.delete(node.object_id)
    result
  else
    node
  end
end
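
The seen set tracks only the containers on the current traversal stack, which is why a container shared by two siblings is fine while a true cycle raises. A minimal standalone sketch of that behaviour follows; walk and the mask lambda are hypothetical stand-ins (mask replaces runs of four or more digits instead of calling the real redactor).

```ruby
require "set"

def walk(node, seen, &redact)
  case node
  when String
    redact.call(node)
  when Hash, Array
    raise ArgumentError, "circular reference detected" if seen.include?(node.object_id)
    seen.add(node.object_id)
    result =
      if node.is_a?(Hash)
        node.transform_values { |v| walk(v, seen, &redact) }
      else
        node.map { |v| walk(v, seen, &redact) }
      end
    seen.delete(node.object_id) # leaving this container: pop it off the stack
    result
  else
    node
  end
end

# Hypothetical stand-in for the redactor.
mask = ->(s) { s.gsub(/[0-9]{4,}/, "[REDACTED]") }

shared = ["pin 1234"]
clean = walk({ a: shared, b: shared, n: 7 }, Set.new, &mask)

cyclic = {}
cyclic[:self] = cyclic
error = begin
  walk(cyclic, Set.new, &mask)
  nil
rescue ArgumentError => e
  e
end
```

Deleting the object_id on the way out is what distinguishes "already on the stack" from "already visited".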

._word_alternatives(word) ⇒ `Array<String>`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Alternatives for a single whitespace-free word: the full name (each letter as a case-insensitive, diacritic-folded class) and its initial.

Parameters:

word (String) —

a single word with no spaces or hyphens.

Returns:

(Array<String>) —

alternation members for this word.

# File 'lib/data_redactor/name_pattern.rb', line 125

def _word_alternatives(word)
  full    = word.chars.map { |ch| _letter_class(ch) }.join
  initial = "#{_letter_class(word[0])}\\.?"
  [full, initial]
end

.add_pattern(name:, regex:, tag: :custom, boundary: false) ⇒ `Boolean`

Patterns must be valid POSIX ERE. Ruby-only syntax (\d, \s, \w, \b, lookaround, non-greedy quantifiers, named groups) is rejected at registration time, never at redaction time.

If a pattern with the same name is already registered, it is replaced (the old compiled regex_t is freed).

Examples:

DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
DataRedactor.add_pattern(name: "internal_key",
                         regex: /INT-[A-Z]{3}/,
                         tag: :credentials,
                         boundary: true)

Parameters:

name (String) —

unique identifier for this pattern. Used by remove_pattern.
regex (String, Regexp) —

POSIX ERE source. A Regexp is accepted for convenience but only its .source is used; flags are ignored.
tag (Symbol) (defaults to: :custom) —

one of TAGS keys. Defaults to :custom.
boundary (Boolean) (defaults to: false) —

when true, the pattern is wrapped with (^|[^0-9A-Za-z])(…)([^0-9A-Za-z]|$) so it only matches when not embedded in a longer alphanumeric token. Incompatible with patterns that contain capture groups.


Returns:

(Boolean) —

true on success.

Raises:

(ArgumentError) —

if name is not a non-empty String, or regex is neither a String nor a Regexp.
(InvalidPatternError) —

if the pattern uses Ruby-only syntax, contains capture groups while boundary: true, or fails regcomp.
(UnknownTagError) —

if tag is not in TAGS.

# File 'lib/data_redactor.rb', line 263

def add_pattern(name:, regex:, tag: :custom, boundary: false)
  raise ArgumentError, "name must be a non-empty String" \
    unless name.is_a?(String) && !name.empty?

  source = case regex
           when String then regex
           when Regexp then regex.source
           else raise ArgumentError, "regex must be a String or Regexp, got #{regex.class}"
           end

  if source =~ RUBY_ONLY_SYNTAX_RE
    raise InvalidPatternError,
      "pattern #{name.inspect} uses Ruby-only syntax (#{$&.inspect}); " \
      "use POSIX ERE — no \\d, \\s, \\w, \\b, lookaround, non-greedy, or named groups"
  end

  if boundary && source =~ CAPTURE_GROUP_RE
    raise InvalidPatternError,
      "pattern #{name.inspect} has capture groups and cannot use boundary: true"
  end

  tag_bit = TAGS[tag] or raise UnknownTagError,
    "unknown tag #{tag.inspect}; valid tags: #{TAGS.keys.inspect}"

  _add_pattern(name, source, tag_bit, boundary ? 1 : 0)
end
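
The two registration-time checks can be tried without the C extension. This is a minimal sketch: the two regexes are copied from RUBY_ONLY_SYNTAX_RE and CAPTURE_GROUP_RE as listed on this page, while classify and its return symbols are hypothetical and only mirror the order of the checks above (the real method raises InvalidPatternError instead).

```ruby
RUBY_ONLY = /\\[dDwWsShHbB]|\(\?[<!=]|\(\?<[a-zA-Z]|\(\?[imx]|[*+?]\?/
CAPTURE   = /(?<!\\)\((?!\?:)/

# Hypothetical helper: report which check a pattern source would trip.
def classify(source, boundary: false)
  return :ruby_only_syntax if source.match?(RUBY_ONLY)
  return :capture_group_with_boundary if boundary && source.match?(CAPTURE)
  :ok
end

results = {
  posix_ok:      classify("EMP-[0-9]{6}"),
  digit_escape:  classify('\d{3}'),
  non_greedy:    classify("a+?"),
  group_plain:   classify("(foo|bar)-[0-9]+"),
  group_bounded: classify("(foo|bar)-[0-9]+", boundary: true),
  from_regexp:   classify(/INT-[A-Z]{3}/.source, boundary: true)
}
```

Capture groups are only a problem together with boundary: true, which is why the same source passes in one case and is rejected in the other.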

.build_enable_bits(only, except) ⇒ `Array<Integer>`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Build the per-pattern enable bit-list passed to the C layer.

The list has one Integer (0 or 1) per pattern in execution order: built-ins first (NUM_PATTERNS entries), then currently registered custom patterns in registration order. C iterates by index and skips zeros.

Semantics of only: / except: — both accept a mix of Symbols (tags) and Strings (pattern names):

enabled(p) iff
  (only is nil OR p.tag ∈ only_tags OR p.name ∈ only_names)
  AND p.tag ∉ except_tags AND p.name ∉ except_names

Returns:

(Array<Integer>) —

same length as built-ins + customs.

# File 'lib/data_redactor.rb', line 366

def build_enable_bits(only, except)
  only_bits,   only_names   = split_filter(only)
  except_bits, except_names = split_filter(except)
  only_present = !only.nil?

  bits = Array.new(BUILTIN_PATTERN_NAMES.length + _custom_patterns.length, 0)

  BUILTIN_PATTERN_NAMES.each_with_index do |name, i|
    tag_bit = BUILTIN_PATTERN_TAG_BITS[i]
    bits[i] = 1 if pattern_enabled?(name, tag_bit, only_present,
                                    only_bits, only_names,
                                    except_bits, except_names)
  end

  _custom_patterns.each_with_index do |h, i|
    bits[BUILTIN_PATTERN_NAMES.length + i] = 1 if pattern_enabled?(
      h[:name], h[:tag_bit], only_present,
      only_bits, only_names, except_bits, except_names)
  end

  bits
end
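
The only:/except: semantics can be exercised on a toy pattern table. The following is a sketch under stated assumptions: the bit values, the pattern list (including the names "phone" and "ipv4"), and the helper names are hypothetical; only the enabling rule follows the description above.

```ruby
# Hypothetical tag bits and pattern table, for illustration only.
TAGS = { credentials: 0b001, contact: 0b010, network: 0b100 }.freeze
PATTERNS = [
  { name: "aws_access_key_id", tag: :credentials },
  { name: "email",             tag: :contact },
  { name: "phone",             tag: :contact },
  { name: "ipv4",              tag: :network }
].freeze

# Symbols contribute to a tag bitmask, Strings to a list of pattern names.
def split_filter(entries)
  bits = 0
  names = []
  Array(entries).each do |entry|
    if entry.is_a?(Symbol)
      bits |= TAGS.fetch(entry)
    else
      names << entry
    end
  end
  [bits, names]
end

def enable_bits(only: nil, except: nil)
  only_bits, only_names = split_filter(only)
  except_bits, except_names = split_filter(except)
  PATTERNS.map do |pattern|
    bit = TAGS.fetch(pattern[:tag])
    next 0 if (bit & except_bits) != 0 || except_names.include?(pattern[:name])
    next 1 if only.nil?
    ((bit & only_bits) != 0 || only_names.include?(pattern[:name])) ? 1 : 0
  end
end

all_on       = enable_bits
contact_only = enable_bits(only: :contact)
no_email     = enable_bits(only: :contact, except: ["email"])
mixed        = enable_bits(only: [:contact, "aws_access_key_id"])
cancelled    = enable_bits(only: :contact, except: :contact)
```

Checking except: first is what makes it win whenever the two filters overlap.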

.clear_custom_patterns! ⇒ `nil`

Remove every registered custom pattern.

Mostly useful in test suites that need a clean slate between examples.

Returns:

(nil)



# File 'lib/data_redactor.rb', line 316

def clear_custom_patterns!
  _clear_custom_patterns
end

.custom_patterns ⇒ `Array<Hash{Symbol => Object}>`

List every currently registered custom pattern.

Returns:

(Array<Hash{Symbol => Object}>) —

one hash per pattern with keys :name (String), :source (String — the POSIX ERE source), :tag (Symbol), :boundary (Boolean).

# File 'lib/data_redactor.rb', line 304

def custom_patterns
  _custom_patterns.map do |h|
    { name: h[:name], source: h[:source], tag: TAGS.key(h[:tag_bit]) || :custom,
      boundary: h[:boundary] }
  end
end

.name_pattern(first, last, middle: nil) ⇒ `String`

Build a POSIX ERE that matches a person’s name across common written variations, ready to hand to add_pattern.

The returned pattern is boundary-wrapped — it embeds (^|[^A-Za-z]) … ([^A-Za-z]|$) so that “Mario” matches as a whole word but not inside “Mariolino”. Because the wrapper uses capture groups, register the pattern with the default boundary: false (do not pass boundary: true — that would double-wrap and reject the groups).

Variations covered:

Case: every letter becomes a case-insensitive character class ([Mm][Aa]...), since POSIX ERE has no /i flag.
Order: “First Last”, “Last First”, “Last, First”, “Last,First”.
Initials: “M. Last”, “M Last”, “First R.”, “First R”, “M.R.”, “M R”, “MR”.
Diacritics: an ASCII letter with a DIACRITIC_FOLD entry also matches its accented forms (“Jose” matches “José”). An accented input letter also matches its bare ASCII form.
Separators: spaces and hyphens are interchangeable between and within name parts. A hyphenated part like “Anne-Marie” also matches “Anne Marie”, “AnneMarie”, and each half on its own (“Anne”, “Marie”). Multi-word parts like “Van der Berg” tolerate any space/hyphen separator between words.

Examples:

Register a name pattern

DataRedactor.add_pattern(
  name:  "person_mario_rossi",
  regex: DataRedactor.name_pattern("Mario", "Rossi"),
  tag:   :contact
)

With a middle name

DataRedactor.name_pattern("Mario", "Rossi", middle: "Luigi")

Parameters:

first (String) —

the given name. May contain hyphens or spaces.
last (String) —

the family name. May contain hyphens or spaces.
middle (String, nil) (defaults to: nil) —

optional middle name. When given, the pattern matches both the no-middle forms and the with-middle forms.

Returns:

(String) —

a POSIX ERE source string.

Raises:

(ArgumentError) —

if first or last is not a non-empty String, or middle is given but is not a non-empty String.

# File 'lib/data_redactor/name_pattern.rb', line 71

def name_pattern(first, last, middle: nil)
  _validate_name_arg!(first, "first")
  _validate_name_arg!(last, "last")
  _validate_name_arg!(middle, "middle") unless middle.nil?

  first_tok  = _part_token(first)
  last_tok   = _part_token(last)
  middle_tok = middle && _part_token(middle)

  # Separator between name parts. Optional so initial-only forms collapse
  # ("MR", "M.R.") and so "First,Last" with no space still matches.
  sep = "[ ,-]*"

  bodies = []
  bodies << "#{first_tok}#{sep}#{last_tok}"            # First Last
  bodies << "#{last_tok}#{sep}#{first_tok}"            # Last First / Last, First

  if middle_tok
    bodies << "#{first_tok}#{sep}#{middle_tok}#{sep}#{last_tok}" # First Middle Last
    bodies << "#{last_tok}#{sep}#{first_tok}#{sep}#{middle_tok}" # Last First Middle
  end

  "(^|[^A-Za-z])(#{bodies.join('|')})([^A-Za-z]|$)"
end
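
The shape of the returned pattern can be explored with a hand-written, simplified equivalent. This is a sketch, not the output of name_pattern: the tokens below skip diacritic folding, and Ruby's Regexp stands in for the POSIX engine the gem uses, whose leftmost-longest matching may choose a different match in ambiguous inputs.

```ruby
# Simplified, hand-written tokens: full name or initial with optional dot.
first = "([Mm][Aa][Rr][Ii][Oo]|[Mm]\\.?)"
last  = "([Rr][Oo][Ss][Ss][Ii]|[Rr]\\.?)"
sep   = "[ ,-]*"

# Same wrapper shape as the method above: leading boundary, body, trailing boundary.
source = "(^|[^A-Za-z])(#{first}#{sep}#{last}|#{last}#{sep}#{first})([^A-Za-z]|$)"
re = Regexp.new(source)

# Group 2 is the body, without the boundary characters consumed around it.
def body(re, text)
  m = re.match(text)
  m && m[2]
end

greeting = body(re, "Dear Mario Rossi,")
reversed = body(re, "ROSSI, MARIO")
initial  = body(re, "M. Rossi called")
embedded = body(re, "Mariolino Rossini")
```

The embedded case returns nil because the wrapper requires a non-letter (or the string edge) on both sides of the name.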

.pattern_enabled?(name, tag_bit, only_present, only_bits, only_names, except_bits, except_names) ⇒ `Boolean`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Returns:

(Boolean)

# File 'lib/data_redactor.rb', line 415

def pattern_enabled?(name, tag_bit, only_present, only_bits, only_names,
                     except_bits, except_names)
  return false if (tag_bit & except_bits) != 0
  return false if except_names.include?(name)
  return true  unless only_present
  return true  if (tag_bit & only_bits) != 0
  only_names.include?(name)
end
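
To see the precedence in action, the predicate can be copied into a standalone script and called directly. This is a sketch for illustration only: the bit values and the "phone" pattern name are assumptions (the real bits come from TAGS), and pattern_enabled_sketch? is a hypothetical copy of the method above.

```ruby
require "set"

# Hypothetical bit values; the real ones are defined by the C layer.
CONTACT_BIT     = 0b01
CREDENTIALS_BIT = 0b10

# Same logic as pattern_enabled? above, copied so it runs on its own.
def pattern_enabled_sketch?(name, tag_bit, only_present, only_bits, only_names,
                            except_bits, except_names)
  return false if (tag_bit & except_bits) != 0   # except: by tag wins first
  return false if except_names.include?(name)    # except: by name wins next
  return true  unless only_present               # no only: means everything else is on
  return true  if (tag_bit & only_bits) != 0     # only: by tag
  only_names.include?(name)                      # only: by name
end

# Equivalent of only: :contact, except: ["email"]
args = [true, CONTACT_BIT, Set.new, 0, Set["email"]]

pattern_enabled_sketch?("email", CONTACT_BIT, *args)                  # => false
pattern_enabled_sketch?("phone", CONTACT_BIT, *args)                  # => true
pattern_enabled_sketch?("aws_access_key_id", CREDENTIALS_BIT, *args)  # => false
```

Because both except checks run before any only check, an entry listed in both filters is always disabled.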

.pattern_names ⇒ `Array<String>`

List of every pattern name the redactor knows about.

Includes the BUILTIN_PATTERN_NAMES plus any names registered via add_pattern. Useful for discovering what String values only: / except: accept, and for filtering / debugging.

Returns:

(Array<String>) —

built-in names first (in execution order), then custom names in registration order.



# File 'lib/data_redactor.rb', line 103

def pattern_names
  BUILTIN_PATTERN_NAMES + _custom_patterns.map { |h| h[:name] }
end

.redact(text, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ `String`

Redact every match of the configured patterns in text.

only: and except: both accept a single value or an Array, mixing:

Symbols: tag names from TAGS (e.g. :contact, :credentials).
Strings: specific pattern names from pattern_names (e.g. "email").

They can be combined: only: :contact, except: ["email"] means "redact every contact pattern except email." Symbols give you tag-level control; Strings give you per-pattern precision.

Precedence: a pattern is redacted iff (only is nil OR pattern matches only:) AND (pattern does not match except:). except: always wins over only: when they overlap. For example, only: :contact, except: :contact redacts nothing (a no-op), and only: ["email"], except: ["email"] likewise skips email entirely.

Examples:

DataRedactor.redact("token sk_live_abc123", only: :credentials)
DataRedactor.redact(text, only: [:contact, "aws_access_key_id"])
DataRedactor.redact(text, only: :contact, except: ["email"])

Parameters:

text (String) —

input string. Returned unchanged if no patterns match.
only (Symbol, String, Array, nil) (defaults to: nil) —

include only the given tag(s) and/or pattern name(s).
except (Symbol, String, Array, nil) (defaults to: nil) —

exclude the given tag(s) and/or pattern name(s). May be combined with only:.
placeholder (String, :tagged, :hash) (defaults to: PLACEHOLDER_DEFAULT) —

replacement strategy. A String is used verbatim. :tagged produces [REDACTED:TAGNAME]. :hash produces a deterministic [TAGNAME_xxxx] token (4-hex djb2) so the same input value always maps to the same token.

Returns:

(String) —

a new string with every match replaced.

Raises:

(ArgumentError) —

if placeholder: is not a String/:tagged/:hash.
(UnknownTagError) —

if any Symbol in only:/except: is not in TAGS.
(UnknownPatternError) —

if any String in only:/except: is not in pattern_names.

# File 'lib/data_redactor.rb', line 141

def redact(text, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
  enable_bits = build_enable_bits(only, except)
  ph_mode, ph_str = resolve_placeholder(placeholder)
  # Defer to the C layer's TypeError for non-Strings; only chunk if the input
  # is a String big enough to benefit (avoid bytesize on non-Strings).
  if text.is_a?(String) && text.bytesize > CHUNK_SIZE
    return _chunk_bytes(text).map { |c| _redact(c, ph_mode, ph_str, enable_bits) }.join
  end
  _redact(text, ph_mode, ph_str, enable_bits)
end
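
The :hash placeholder is described above as a 4-hex djb2 token. The sketch below shows the general idea in pure Ruby. The hashing itself happens in the C layer and is not shown in this document, so the details here are assumptions: djb2 and hash_token_sketch are hypothetical helpers, and keeping the low 16 bits of a 32-bit djb2 value is a guess at how four hex digits are derived. Tokens produced by this sketch are not expected to match the library's output.

```ruby
# Classic djb2: start at 5381, then hash = hash * 33 + byte, kept to 32 bits.
def djb2(str)
  str.each_byte.reduce(5381) { |h, b| ((h * 33) + b) & 0xFFFFFFFF }
end

# Assumption: the token is the upper-cased tag plus the low 16 bits
# of the hash rendered as four lowercase hex digits.
def hash_token_sketch(tag, value)
  format("[%s_%04x]", tag.to_s.upcase, djb2(value) & 0xFFFF)
end

a = hash_token_sketch(:contact, "alice@example.com")
b = hash_token_sketch(:contact, "alice@example.com")
a == b # => true, the same value always maps to the same token
```

A deterministic token lets the same value be correlated across log lines without exposing it, at the cost of possible collisions in a 16-bit space.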

.redact_deep(data, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ `Hash`, ...

Recursively redact every String value in a nested Hash/Array structure.

Walks the structure depth-first. Only String leaves are passed through redact; all other leaf types (Integer, Float, nil, Symbol, true/false) are copied unchanged. Hash keys are never modified.

Returns a deep copy — the original structure is never mutated.

Examples:

Rails params

safe = DataRedactor.redact_deep(params.to_h)

Mixed filter

DataRedactor.redact_deep(payload, only: :credentials, placeholder: :tagged)

Parameters:

data (Hash, Array, String, Object) —

the structure to walk. Any type is accepted; non-String scalars are returned as-is.
only (Symbol, String, Array, nil) (defaults to: nil) —

forwarded to redact.
except (Symbol, String, Array, nil) (defaults to: nil) —

forwarded to redact.
placeholder (String, :tagged, :hash) (defaults to: PLACEHOLDER_DEFAULT) —

forwarded to redact.

Returns:

(Hash, Array, String, Object) —

a new structure of the same shape with all String leaves redacted.

Raises:

(ArgumentError) —

if the structure contains a circular reference.



# File 'lib/data_redactor.rb', line 207

def redact_deep(data, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
  _walk(data, only: only, except: except, placeholder: placeholder, seen: Set.new)
end
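
The body of _walk is not shown here, so the following is only a sketch of what a depth-first, copy-on-walk traversal with cycle detection can look like. deep_map_strings is a hypothetical helper, the block stands in for the call to redact, and tracking containers by object_id along the current path is an assumption about how circular references are detected.

```ruby
require "set"

# Hypothetical helper: returns a deep copy of node with every String leaf
# replaced by the block's result. Hash keys and non-String leaves are kept.
def deep_map_strings(node, seen = Set.new, &leaf)
  case node
  when String
    leaf.call(node)
  when Hash, Array
    id = node.object_id
    raise ArgumentError, "circular reference detected" if seen.include?(id)
    seen << id # mark this container as being on the current path
    copy =
      if node.is_a?(Hash)
        node.each_with_object({}) { |(k, v), h| h[k] = deep_map_strings(v, seen, &leaf) }
      else
        node.map { |v| deep_map_strings(v, seen, &leaf) }
      end
    seen.delete(id) # leaving the container: shared, non-circular refs stay legal
    copy
  else
    node
  end
end

payload = { "user" => { "email" => "alice@example.com", "age" => 30 }, "tags" => ["a@b.co", nil] }
deep_map_strings(payload) { |s| s.gsub(/\S+@\S+/, "[REDACTED]") }
# => {"user"=>{"email"=>"[REDACTED]", "age"=>30}, "tags"=>["[REDACTED]", nil]}
```

The input is left untouched because every Hash and Array is rebuilt rather than modified in place.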

.redact_json(json_string, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ `String`

Parse json_string, redact every String value in the resulting structure, and return valid JSON.

Delegates traversal to redact_deep. All keyword arguments are forwarded to redact.

Examples:

DataRedactor.redact_json('{"email":"alice@example.com","count":3}')
# => '{"email":"[REDACTED]","count":3}'

Parameters:

json_string (String) —

valid JSON input.
only (Symbol, String, Array, nil) (defaults to: nil) —

forwarded to redact.
except (Symbol, String, Array, nil) (defaults to: nil) —

forwarded to redact.
placeholder (String, :tagged, :hash) (defaults to: PLACEHOLDER_DEFAULT) —

forwarded to redact.

Returns:

(String) —

a JSON string with all String values redacted.

Raises:

(JSON::ParserError) —

if json_string is not valid JSON.

# File 'lib/data_redactor.rb', line 227

def redact_json(json_string, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
  parsed = JSON.parse(json_string)
  redacted = redact_deep(parsed, only: only, except: except, placeholder: placeholder)
  JSON.generate(redacted)
end
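
The parse, walk, generate round trip can be reproduced with the standard json library alone. This is a sketch under assumptions: redact_json_sketch is a hypothetical helper, and the block stands in for redact so that the example runs without the gem.

```ruby
require "json"

# Hypothetical helper: parse, transform every String value, re-serialize.
def redact_json_sketch(json_string, &leaf)
  walk = lambda do |node|
    case node
    when String then leaf.call(node)
    when Array  then node.map { |v| walk.call(v) }
    when Hash   then node.transform_values { |v| walk.call(v) } # keys untouched
    else node # numbers, true, false, nil pass through
    end
  end
  JSON.generate(walk.call(JSON.parse(json_string)))
end

redact_json_sketch('{"email":"alice@example.com","count":3}') { |_s| "[REDACTED]" }
# => '{"email":"[REDACTED]","count":3}'
```

Because the output is regenerated from the parsed structure, insignificant whitespace in the input JSON is not preserved.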

.remove_pattern(name) ⇒ `Boolean`

Remove a previously registered custom pattern.

Parameters:

name (String, Symbol) —

the name used in add_pattern.

Returns:

(Boolean) —

true if a pattern was removed, false if no pattern with that name was registered.



# File 'lib/data_redactor.rb', line 295

def remove_pattern(name)
  _remove_pattern(name.to_s)
end

.resolve_placeholder(placeholder) ⇒ `Array(Integer, String)`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Translate the user-facing placeholder: value into the (mode_int, str) pair the C layer expects.

Parameters:

placeholder (String, :tagged, :hash)

Returns:

(Array(Integer, String))

Raises:

(ArgumentError) —

if placeholder is none of the accepted values.

# File 'lib/data_redactor.rb', line 431

def resolve_placeholder(placeholder)
  case placeholder
  when :tagged then [PH_MODE_TAGGED, ""]
  when :hash   then [PH_MODE_HASH,   ""]
  when String  then [PH_MODE_PLAIN,  placeholder]
  else
    raise ArgumentError,
      "placeholder must be a String, :tagged, or :hash — got #{placeholder.inspect}"
  end
end

.scan(text, only: nil, except: nil) ⇒ `Hash{Symbol => Object}`

Scan text and return both the redacted string and per-match metadata.

Useful for auditing, false-positive tuning, and compliance pipelines. :start and :length are byte offsets into the original string, so for each match m, text.byteslice(m[:start], m[:length]) == m[:value].

Examples:

DataRedactor.scan("user@example.com")
# => { redacted: "[REDACTED]",
#      matches: [{tag: :contact, name: "email",
#                 value: "user@example.com", start: 0, length: 16}] }

Parameters:

text (String) —

input string.
only (Symbol, String, Array, nil) (defaults to: nil) —

same semantics as redact.
except (Symbol, String, Array, nil) (defaults to: nil) —

same semantics as redact.

Returns:

(Hash{Symbol => Object}) —

{ redacted: String, matches: Array<Hash> }. Each match hash has :tag (Symbol), :name (String), :value (String), :start (Integer byte offset), :length (Integer).

Raises:

(UnknownTagError) —

if any Symbol in only:/except: is not in TAGS.
(UnknownPatternError) —

if any String in only:/except: is not in pattern_names.

# File 'lib/data_redactor.rb', line 172

def scan(text, only: nil, except: nil)
  enable_bits = build_enable_bits(only, except)
  result =
    if text.is_a?(String) && text.bytesize > CHUNK_SIZE
      _chunked_scan(text, enable_bits)
    else
      _scan(text, enable_bits)
    end
  # Normalise: convert tag string from C (uppercase) back to the Symbol used in TAGS
  result[:matches].each { |m| m[:tag] = m[:tag].to_s.downcase.to_sym }
  result
end
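
Byte offsets differ from character offsets as soon as the text contains multi-byte characters. The sketch below illustrates why byteslice, not String#[], is the right way to read a match back. byte_span is a hypothetical helper that builds a hash shaped like one entry of :matches using Ruby's own Regexp; it does not call the library.

```ruby
# Hypothetical helper: returns :value, :start and :length for the first
# match, with :start and :length measured in bytes.
def byte_span(text, pattern)
  m = text.match(pattern) or return nil
  { value: m[0], start: m.pre_match.bytesize, length: m[0].bytesize }
end

text = "caf\u00E9 user@example.com" # the accented letter is 2 bytes in UTF-8
m = byte_span(text, /\S+@\S+/)

m[:start]                                  # => 6 (5 characters, 6 bytes)
text.byteslice(m[:start], m[:length])      # => "user@example.com"
text[m[:start], m[:length]]                # => "ser@example.com" (character indexing is off)
```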

.split_filter(entries) ⇒ `Array(Integer, Set<String>)`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Split a mixed Symbol/String filter list into (tag_bitmask, name_set).

Parameters:

entries (nil, Symbol, String, Array)

Returns:

(Array(Integer, Set<String>)) —

tag bits OR-ed together; set of pattern-name Strings.

Raises:

(UnknownTagError) —

for unknown Symbols.
(UnknownPatternError) —

for unknown Strings.

# File 'lib/data_redactor.rb', line 328

def split_filter(entries)
  bits = 0
  names = Set.new
  return [bits, names] if entries.nil?
  Array(entries).each do |e|
    case e
    when Symbol
      bit = TAGS[e] or raise UnknownTagError,
        "unknown tag #{e.inspect}; valid tags: #{TAGS.keys.inspect}"
      bits |= bit
    when String
      unless pattern_names.include?(e)
        raise UnknownPatternError,
          "unknown pattern name #{e.inspect}; see DataRedactor.pattern_names"
      end
      names << e
    else
      raise ArgumentError,
        "only:/except: entries must be a Symbol (tag) or String (pattern name), got #{e.inspect}"
    end
  end
  [bits, names]
end

.tags ⇒ `Array<Symbol>`

List of supported tag symbols.

Returns:

(Array<Symbol>) —

every key from TAGS



# File 'lib/data_redactor.rb', line 91

def tags
  TAGS.keys
end
