Module: DataRedactor

Defined in:
lib/data_redactor.rb,
lib/data_redactor/version.rb,
lib/data_redactor/name_pattern.rb,
lib/data_redactor/integrations/rack.rb,
lib/data_redactor/integrations/rails.rb,
lib/data_redactor/integrations/claude.rb,
lib/data_redactor/integrations/logger.rb,
lib/data_redactor/integrations/openai.rb,
lib/data_redactor/integrations/llm_support.rb,
ext/data_redactor/data_redactor.c

Overview

High-performance regex-based redactor for sensitive data.

DataRedactor scans text for sensitive patterns (API keys, IBANs, national IDs, emails, phone numbers, etc.) and replaces matches with a configurable placeholder. The matching is done by a C extension backed by POSIX regex.h, so it is fast enough to run inline on large payloads.

Examples:

Basic redaction

DataRedactor.redact("key is AKIAIOSFODNN7EXAMPLE")
# => "key is [REDACTED]"

Filter by tag or pattern name

DataRedactor.redact(text, only: :credentials)
DataRedactor.redact(text, except: [:contact, :network])
DataRedactor.redact(text, only: :contact, except: ["email"])
DataRedactor.redact(text, only: ["aws_access_key_id"])

Custom placeholder

DataRedactor.redact(text, placeholder: "***")
DataRedactor.redact(text, placeholder: :tagged) # => "[REDACTED:CONTACT]"
DataRedactor.redact(text, placeholder: :hash)   # => "[CONTACT_a3f9]"

Audit / dry-run

DataRedactor.scan(text)
# => { redacted: "...", matches: [{tag:, name:, value:, start:, length:}, ...] }

Custom pattern

DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")

Defined Under Namespace

Modules: Integrations

Classes: InvalidPatternError, UnknownPatternError, UnknownTagError

Constant Summary

TAGS =

Map of tag symbol to the integer bit used by the C layer.

The keys of this hash are the canonical list of supported tags; pass any of them to redact or scan via only: / except:.

Returns:

  • (Hash{Symbol => Integer})

    frozen tag-to-bit map

{
  credentials: TAG_CREDENTIALS,
  financial:   TAG_FINANCIAL,
  tax_id:      TAG_TAX_ID,
  national_id: TAG_NATIONAL_ID,
  contact:     TAG_CONTACT,
  network:     TAG_NETWORK,
  travel:      TAG_TRAVEL,
  other:       TAG_OTHER,
  custom:      TAG_CUSTOM
}.freeze
CAPTURE_GROUP_RE =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Capture groups break the boundary wrapper's group-index assumptions: the wrapper reads groups 1 and 3, and any extra capture group shifts those indices.

/(?<!\\)\((?!\?:)/.freeze
RUBY_ONLY_SYNTAX_RE =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Ruby regex syntax that has no POSIX ERE equivalent.

/\\[dDwWsShHbB]|\(\?[<!=]|\(\?<[a-zA-Z]|\(\?[imx]|[*+?]\?/.freeze
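As a rough illustration of what these two private checks flag, the sketch below copies both regexes verbatim from this page and runs them against a few pattern sources. The helper names `ruby_only_token` and `capture_group?` are hypothetical and exist only for this example.

```ruby
# Constants copied verbatim from this page.
CAPTURE_GROUP_RE    = /(?<!\\)\((?!\?:)/.freeze
RUBY_ONLY_SYNTAX_RE = /\\[dDwWsShHbB]|\(\?[<!=]|\(\?<[a-zA-Z]|\(\?[imx]|[*+?]\?/.freeze

# Hypothetical helper: the first Ruby-only token found in +source+,
# or nil when the source passes the check.
def ruby_only_token(source)
  source[RUBY_ONLY_SYNTAX_RE]
end

# Hypothetical helper: true when +source+ has an unescaped capturing group.
def capture_group?(source)
  source.match?(CAPTURE_GROUP_RE)
end

p ruby_only_token("EMP-[0-9]{6}")  # => nil
p ruby_only_token('\d{6}')         # => "\\d"
p ruby_only_token("a+?")           # => "+?"
p capture_group?("(ab|cd)")        # => true
p capture_group?('\(literal\)')    # => false
```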
PLACEHOLDER_DEFAULT =

Default placeholder used when placeholder: is not given to redact.

"[REDACTED]"
CHUNK_SIZE =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Inputs larger than this (bytes) are split into newline-bounded chunks before being handed to the C engine. Bounds the per-call O(N) cost glibc regexec pays for state-log allocation, turning total redaction cost from O(N²) (one giant pass) into O(N × CHUNK_SIZE) (many bounded passes). 64 KB is a compromise: small enough to keep per-call cost low, large enough that typical log/JSON inputs use few chunks. See option G in TODO.md.

64 * 1024
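To make the O(N²) versus O(N × CHUNK_SIZE) claim concrete, here is a small worked calculation. The cost functions are an abstract model of the description above (cost in arbitrary units), not a benchmark of the C engine, and the names `single_pass_cost` and `chunked_cost` are illustrative.

```ruby
CHUNK_SIZE = 64 * 1024

# Model: one giant pass over n bytes costs on the order of n * n.
def single_pass_cost(n)
  n * n
end

# Model: every full chunk costs CHUNK_SIZE**2, plus one smaller tail chunk.
def chunked_cost(n)
  full, rest = n.divmod(CHUNK_SIZE)
  full * CHUNK_SIZE**2 + rest**2
end

n = 10 * 1024 * 1024 # a 10 MiB payload is exactly 160 chunks
puts single_pass_cost(n) / chunked_cost(n) # => 160
```

Under this model the saving grows linearly with the number of chunks, which is why the constant trades per-call cost against chunk count.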
VERSION =

Current gem version. Follows Semantic Versioning 2.0.0.

"0.11.0"
DIACRITIC_FOLD =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Maps a base ASCII letter to the set of accented characters that should also match it. Used to make generated name patterns diacritic-tolerant: an input “Jose” still matches “José”, and “Munoz” matches “Muñoz”.

{
  "a" => "àáâãäåāăą",
  "c" => "çćĉċč",
  "e" => "èéêëēĕėęě",
  "i" => "ìíîïĩīĭįı",
  "n" => "ñńņňʼn",
  "o" => "òóôõöøōŏő",
  "u" => "ùúûüũūŭůűų",
  "y" => "ýÿŷ",
  "s" => "śŝşš",
  "z" => "źżž",
  "g" => "ĝğġģ",
  "l" => "ĺļľŀł",
  "r" => "ŕŗř",
  "t" => "ţťŧ"
}.freeze
BUILTIN_PATTERN_NAMES =
rb_ary_freeze(builtin_names)
BUILTIN_PATTERN_TAG_BITS =
rb_ary_freeze(builtin_tag_bits)
BUILTIN_PATTERN_SOURCES =
rb_ary_freeze(builtin_sources)
BUILTIN_PATTERN_BOUNDARY =
rb_ary_freeze(builtin_boundary)
PH_MODE_PLAIN =

Placeholder mode constants.

INT2NUM(PLACEHOLDER_MODE_PLAIN)
PH_MODE_TAGGED =
INT2NUM(PLACEHOLDER_MODE_TAGGED)
PH_MODE_HASH =
INT2NUM(PLACEHOLDER_MODE_HASH)
TAG_CREDENTIALS =

Tag bitmask values used by the Ruby wrapper to build only/except masks.

INT2NUM(TAG_CREDENTIALS)
TAG_FINANCIAL =
INT2NUM(TAG_FINANCIAL)
TAG_TAX_ID =
INT2NUM(TAG_TAX_ID)
TAG_NATIONAL_ID =
INT2NUM(TAG_NATIONAL_ID)
TAG_CONTACT =
INT2NUM(TAG_CONTACT)
TAG_NETWORK =
INT2NUM(TAG_NETWORK)
TAG_TRAVEL =
INT2NUM(TAG_TRAVEL)
TAG_OTHER =
INT2NUM(TAG_OTHER)
TAG_CUSTOM =
INT2NUM(TAG_CUSTOM)
TAG_ALL =
INT2NUM(TAG_ALL)

Class Method Summary

Class Method Details

._add_pattern ⇒ Object

Note: _redact(text, ph_mode, ph_str, enable_bits) and _scan(text, enable_bits).

._ascii_base(char) ⇒ String?

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

If char is an accented letter, return the bare ASCII letter it folds to; otherwise nil.

Parameters:

  • char (String)

    a single lowercase character.

Returns:

  • (String, nil)


# File 'lib/data_redactor/name_pattern.rb', line 159

def _ascii_base(char)
  DIACRITIC_FOLD.each { |ascii, accents| return ascii if accents.include?(char) }
  nil
end

._chunk_bytes(text) ⇒ Array<String>

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Split text into byte-bounded chunks for the chunked redact/scan path. Chunks end at a \n when possible so no match straddles a boundary. If a single line exceeds CHUNK_SIZE (rare in real inputs), the implementation shown here falls back to a hard cut at CHUNK_SIZE bytes, so a match on that line can straddle the cut; this is a documented limitation. Returns an Array of byte-Strings whose concatenation equals text exactly (including the original newline separators).

Parameters:

  • text (String)

Returns:

  • (Array<String>)


# File 'lib/data_redactor.rb', line 452

def _chunk_bytes(text)
  chunks = []
  pos = 0
  len = text.bytesize
  while pos < len
    remaining = len - pos
    if remaining <= CHUNK_SIZE
      chunks << text.byteslice(pos, remaining)
      break
    end
    # Find the last \n in [pos, pos+CHUNK_SIZE). If none, chunk is one long
    # line — take CHUNK_SIZE bytes as a fallback (boundary-split risk).
    window = text.byteslice(pos, CHUNK_SIZE)
    nl = window.b.rindex("\n") # byte offset; rindex on a UTF-8 window returns a character index
    take = nl ? nl + 1 : CHUNK_SIZE
    chunks << text.byteslice(pos, take)
    pos += take
  end
  chunks
end
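The chunking rule can be tried in isolation with a tiny limit. The sketch below is a stand-alone variant, not the gem's method: `chunk_bytes` and its `limit` parameter are illustrative names, and the newline search is done on a binary copy so the offset is always in bytes.

```ruby
# Stand-alone sketch of newline-bounded chunking with a configurable limit.
def chunk_bytes(text, limit)
  chunks = []
  pos = 0
  len = text.bytesize
  while pos < len
    remaining = len - pos
    if remaining <= limit
      chunks << text.byteslice(pos, remaining)
      break
    end
    window = text.byteslice(pos, limit)
    nl = window.b.rindex("\n")  # byte offset of the last newline, if any
    take = nl ? nl + 1 : limit  # no newline in the window: hard cut at the limit
    chunks << text.byteslice(pos, take)
    pos += take
  end
  chunks
end

text = "alpha\nbeta\ngamma delta epsilon\nzeta\n"
p chunk_bytes(text, 12)
# => ["alpha\nbeta\n", "gamma delta ", "epsilon\n", "zeta\n"]
p chunk_bytes(text, 12).join == text # => true
```

The second chunk shows the fallback: the long third line has no newline inside the 12-byte window, so it is cut at the limit.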

._chunked_scan(text, enable_bits) ⇒ Hash{Symbol => Object}

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Chunked variant of _scan: runs the C scanner on each chunk, then offsets each match’s :start by the chunk’s base byte-position in the original input so the byteslice invariant holds end-to-end.

Parameters:

  • text (String)
  • enable_bits (Array<Integer>)

Returns:

  • (Hash{Symbol => Object})

    { redacted: String, matches: Array<Hash> }



# File 'lib/data_redactor.rb', line 481

def _chunked_scan(text, enable_bits)
  redacted = +""
  matches = []
  base = 0
  _chunk_bytes(text).each do |chunk|
    part = _scan(chunk, enable_bits)
    redacted << part[:redacted]
    part[:matches].each do |m|
      m[:start] += base
      matches << m
    end
    base += chunk.bytesize
  end
  { redacted: redacted, matches: matches }
end
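The offset rebasing can be checked without the C extension. In this sketch `stub_scan` is a hypothetical stand-in for the C scanner (it only finds runs of ASCII digits), and `rebased_scan` is an illustrative name for the per-chunk loop; the point is that each rebased :start still satisfies the byteslice invariant against the full input.

```ruby
# Hypothetical stand-in for the C scanner: reports each run of ASCII digits
# with byte offsets relative to the chunk it was given.
def stub_scan(chunk)
  bytes = chunk.b # binary copy, so regexp offsets are byte offsets
  matches = []
  pos = 0
  while (m = bytes.match(/[0-9]+/, pos))
    matches << { value: m[0], start: m.begin(0), length: m[0].bytesize }
    pos = m.end(0)
  end
  matches
end

# Scan chunk by chunk, shifting each :start by the bytes already consumed.
def rebased_scan(chunks)
  base = 0
  chunks.flat_map do |chunk|
    found = stub_scan(chunk).map { |m| m.merge(start: m[:start] + base) }
    base += chunk.bytesize
    found
  end
end

chunks = ["id 42 é\n", "tel 555\n"] # the first chunk is 9 bytes, not 8 characters
text = chunks.join
rebased_scan(chunks).each do |m|
  puts text.byteslice(m[:start], m[:length]) == m[:value] # prints true twice
end
```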

._clear_custom_patterns ⇒ Object

._custom_patterns ⇒ Object

._letter_class(char) ⇒ String

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Build a POSIX bracket expression matching one letter case-insensitively and, where applicable, its accented variants.

Parameters:

  • char (String)

    a single character.

Returns:

  • (String)

    a bracket expression, e.g. “[Mm]” or “[EeÈÉÊËèéêë]”.



# File 'lib/data_redactor/name_pattern.rb', line 137

def _letter_class(char)
  down = char.downcase
  up   = char.upcase
  members = [down]
  members << up unless up == down

  base = DIACRITIC_FOLD.key?(down) ? down : _ascii_base(down)
  if base && DIACRITIC_FOLD.key?(base)
    accented = DIACRITIC_FOLD[base]
    members << accented << accented.upcase
    members << base << base.upcase # accented input still matches bare ASCII
  end

  "[#{members.join}]"
end
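A quick way to see what the bracket expressions look like is to run the same logic against a trimmed fold table. In the sketch below, `FOLD` is a two-entry subset chosen for the example, the method names drop the leading underscore, and Ruby's Regexp is used as a stand-in for the POSIX engine, which is an assumption that holds for simple bracket expressions like these.

```ruby
# Trimmed fold table for illustration; the real DIACRITIC_FOLD has more entries.
FOLD = { "e" => "èéêë", "n" => "ñ" }.freeze

# Bare ASCII letter that an accented character folds to, or nil.
def ascii_base(char)
  FOLD.each { |ascii, accents| return ascii if accents.include?(char) }
  nil
end

# Same construction as the method above, using the trimmed table.
def letter_class(char)
  down = char.downcase
  up   = char.upcase
  members = [down]
  members << up unless up == down
  base = FOLD.key?(down) ? down : ascii_base(down)
  if base
    accented = FOLD[base]
    members << accented << accented.upcase << base << base.upcase
  end
  "[#{members.join}]"
end

p letter_class("m") # => "[mM]"
jose = "Jose".chars.map { |ch| letter_class(ch) }.join
p Regexp.new("\\A#{jose}\\z").match?("JOSÉ") # => true
```

Repeated members inside a bracket expression are harmless, which is why the construction does not bother to deduplicate them.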

._part_token(part) ⇒ String

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Build the alternation for one name part: the full case-insensitive name, or its initial (with optional dot). Hyphenated/multi-word parts also match each sub-word alone and tolerant separators between sub-words.

Parameters:

  • part (String)

    a single name part, e.g. “Mario” or “Anne-Marie”.

Returns:

  • (String)

    a parenthesised POSIX ERE alternation.



# File 'lib/data_redactor/name_pattern.rb', line 103

def _part_token(part)
  words = part.split(/[ -]+/).reject(&:empty?)

  word_alts = words.map { |w| _word_alternatives(w) }

  forms = []
  # whole part with tolerant separators between its words
  forms << word_alts.map { |alts| "(#{alts.join('|')})" }.join("[ -]?")
  # each word on its own (covers "Anne" / "Marie" from "Anne-Marie")
  if words.length > 1
    word_alts.each { |alts| forms << "(#{alts.join('|')})" }
  end

  "(#{forms.uniq.join('|')})"
end

._redact(rb_text, rb_ph_mode, rb_ph_str, rb_enable_bits) ⇒ Object



# File 'ext/data_redactor/redact.c', line 182

VALUE rb_data_redactor_redact(VALUE self, VALUE rb_text,
                              VALUE rb_ph_mode, VALUE rb_ph_str,
                              VALUE rb_enable_bits) {
    Check_Type(rb_text,         T_STRING);
    Check_Type(rb_ph_str,       T_STRING);
    Check_Type(rb_enable_bits,  T_ARRAY);

    int ph_mode = NUM2INT(rb_ph_mode);
    const char *ph_str_plain = StringValueCStr(rb_ph_str);

    const char *input = RSTRING_PTR(rb_text);
    size_t in_len = (size_t)RSTRING_LEN(rb_text);

    /* Stage 1: built-ins through the fast v19 engine (single pass, resolved to
     * earlier-index-wins). */
    int *bits = builtin_enable_bits(rb_enable_bits);
    if (!bits) rb_raise(rb_eNoMemError, "enable_bits allocation failed");
    size_t work_len = 0;
    char *working = redact_builtins(input, in_len, bits, ph_mode, ph_str_plain, &work_len);
    free(bits);
    if (!working) rb_raise(rb_eNoMemError, "built-in redaction allocation failed");

    /* Stage 2: custom patterns through the glibc regexec path, on the buffer the
     * built-ins already rewrote — preserving the sequential built-ins→customs
     * order and full UTF-8 matching for user regex (see Gap 2 hybrid split). The
     * "[REDACTED…]" placeholders introduce none of any custom pattern's literals
     * incidentally beyond what today already did. */
    placeholder_t ph;
    ph.mode = ph_mode;
    for (int i = 0; i < custom_count; i++) {
        if (!enable_bit(rb_enable_bits, NUM_PATTERNS + i)) continue;
        ph.str = (ph_mode == PLACEHOLDER_MODE_PLAIN)
                     ? ph_str_plain
                     : tag_name_for_bit(custom_patterns[i].tag);
        char *result = replace_all_matches(&custom_patterns[i].compiled, working,
                                           custom_patterns[i].boundary, &ph);
        free(working);
        if (!result) rb_raise(rb_eNoMemError, "replace_all_matches allocation failed (custom)");
        working = result;
    }

    VALUE rb_result = rb_str_new_cstr(working);
    free(working);
    /* Preserve the input's encoding. We go through Ruby's force_encoding rather
     * than the C rb_enc_* API because pulling in ruby/encoding.h drags in
     * onigmo.h, whose regex_t collides with the POSIX <regex.h> this TU uses for
     * the custom-pattern path. Placeholders are pure ASCII, valid in every
     * encoding the gem accepts. */
    rb_funcall(rb_result, rb_intern("force_encoding"), 1,
               rb_funcall(rb_text, rb_intern("encoding"), 0));
    return rb_result;
}

._remove_pattern ⇒ Object

._scan(rb_text, rb_enable_bits) ⇒ Object



# File 'ext/data_redactor/scan.c', line 48

VALUE rb_data_redactor_scan(VALUE self, VALUE rb_text, VALUE rb_enable_bits) {
    Check_Type(rb_text,        T_STRING);
    Check_Type(rb_enable_bits, T_ARRAY);

    const char *input  = RSTRING_PTR(rb_text);
    size_t      in_len = (size_t)RSTRING_LEN(rb_text);

    static const placeholder_t ph_plain = { PLACEHOLDER_MODE_PLAIN, "[REDACTED]" };

    /* ------------------------------------------------------------------ */
    /* Stage 1: built-ins through v19 (original-frame coords, no rewrite  */
    /* coordinate mapping needed).                                         */
    /* ------------------------------------------------------------------ */

    /* Build enable-bits array for built-ins. */
    int *bits = (int *)malloc((size_t)NUM_PATTERNS * sizeof(int));
    if (!bits) rb_raise(rb_eNoMemError, "enable_bits allocation failed");
    long alen = RARRAY_LEN(rb_enable_bits);
    for (int i = 0; i < NUM_PATTERNS; i++) {
        if (i < alen) {
            VALUE v = rb_ary_entry(rb_enable_bits, i);
            bits[i] = (RTEST(v) && NUM2INT(v) != 0) ? 1 : 0;
        } else {
            bits[i] = 0;
        }
    }

    /* Scan + resolve, growing buffer if needed. */
    size_t cap = in_len / 4 + 16;
    mm_match_t *ev = NULL;
    size_t n_ev;
    for (;;) {
        mm_match_t *grown = (mm_match_t *)realloc(ev, cap * sizeof(mm_match_t));
        if (!grown) { free(ev); free(bits); rb_raise(rb_eNoMemError, "mm_scan alloc"); }
        ev = grown;
        n_ev = mm_scan(input, in_len, bits, (size_t)NUM_PATTERNS, ev, cap);
        if (n_ev < cap) break;
        cap *= 2;
    }
    free(bits);
    n_ev = mm_resolve(ev, n_ev);

    /* Collect built-in match hashes. */
    VALUE matches_arr = rb_ary_new();
    for (size_t i = 0; i < n_ev; i++) {
        int   pid = ev[i].pattern_id;
        VALUE h   = rb_hash_new();
        rb_hash_aset(h, ID2SYM(rb_intern("tag")),
                     ID2SYM(rb_intern(tag_name_for_bit(pattern_tags[pid]))));
        rb_hash_aset(h, ID2SYM(rb_intern("name")),
                     rb_str_new_cstr(pattern_names[pid]));
        rb_hash_aset(h, ID2SYM(rb_intern("value")),
                     rb_str_new(input + ev[i].start, ev[i].length));
        rb_hash_aset(h, ID2SYM(rb_intern("start")),
                     LONG2NUM((long)ev[i].start));
        rb_hash_aset(h, ID2SYM(rb_intern("length")),
                     LONG2NUM((long)ev[i].length));
        rb_ary_push(matches_arr, h);
    }

    /* Build the redacted working buffer (same logic as redact_builtins). */
    size_t ph_len  = strlen(ph_plain.str); /* "[REDACTED]" = 10 */
    size_t out_cap = in_len + n_ev * ph_len + 1;
    char *working  = (char *)malloc(out_cap);
    if (!working) { free(ev); rb_raise(rb_eNoMemError, "scan working buffer alloc"); }

    size_t out_len = 0, cur = 0;
    for (size_t i = 0; i < n_ev; i++) {
        size_t s = ev[i].start, l = ev[i].length;
        if (s > cur) { memcpy(working + out_len, input + cur, s - cur); out_len += s - cur; }
        memcpy(working + out_len, ph_plain.str, ph_len);
        out_len += ph_len;
        cur = s + l;
    }
    if (cur < in_len) { memcpy(working + out_len, input + cur, in_len - cur); out_len += in_len - cur; }
    working[out_len] = '\0';

    /* ------------------------------------------------------------------ */
    /* Stage 2: custom patterns via glibc on the rewritten buffer.         */
    /* Original coords recovered via working_to_orig() using ev[].         */
    /* ------------------------------------------------------------------ */
    for (int i = 0; i < custom_count; i++) {
        if (!scan_enable_bit(rb_enable_bits, NUM_PATTERNS + i)) continue;

        const char *cur_ptr = working;
        regmatch_t  m[4];
        while (regexec(&custom_patterns[i].compiled, cur_ptr, 4, m, 0) == 0) {
            regoff_t fso = m[0].rm_so, feo = m[0].rm_eo;
            if (fso < 0 || feo < fso) break;

            regoff_t cso = fso, ceo = feo;
            if (custom_patterns[i].boundary) {
                if (m[1].rm_so >= 0 && m[1].rm_eo > m[1].rm_so) cso = m[1].rm_eo;
                if (m[3].rm_so >= 0 && m[3].rm_eo > m[3].rm_so) ceo = m[3].rm_so;
            }

            long wpos_core  = (long)(cur_ptr - working) + (long)cso;
            long orig_start = working_to_orig(wpos_core, ev, n_ev, ph_len);
            long core_len   = (long)(ceo - cso);

            VALUE h = rb_hash_new();
            rb_hash_aset(h, ID2SYM(rb_intern("tag")),
                         ID2SYM(rb_intern(tag_name_for_bit(custom_patterns[i].tag))));
            rb_hash_aset(h, ID2SYM(rb_intern("name")),
                         rb_str_new_cstr(custom_patterns[i].name));
            rb_hash_aset(h, ID2SYM(rb_intern("value")),
                         rb_str_new(cur_ptr + cso, (size_t)core_len));
            rb_hash_aset(h, ID2SYM(rb_intern("start")),  LONG2NUM(orig_start));
            rb_hash_aset(h, ID2SYM(rb_intern("length")), LONG2NUM(core_len));
            rb_ary_push(matches_arr, h);

            if (feo == fso) { if (*cur_ptr) cur_ptr++; else break; }
            else cur_ptr += feo;
        }

        char *next = replace_all_matches(&custom_patterns[i].compiled, working,
                                         custom_patterns[i].boundary, &ph_plain);
        free(working);
        if (!next) { free(ev); rb_raise(rb_eNoMemError, "replace_all_matches failed in scan"); }
        working = next;
    }

    free(ev);

    VALUE result      = rb_hash_new();
    VALUE rb_redacted = rb_str_new_cstr(working);
    free(working);
    rb_funcall(rb_redacted, rb_intern("force_encoding"), 1,
               rb_funcall(rb_text, rb_intern("encoding"), 0));
    rb_hash_aset(result, ID2SYM(rb_intern("redacted")), rb_redacted);
    rb_hash_aset(result, ID2SYM(rb_intern("matches")),  matches_arr);
    return result;
}

._validate_name_arg!(value, label) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Raises:

  • (ArgumentError)


# File 'lib/data_redactor/name_pattern.rb', line 165

def _validate_name_arg!(value, label)
  return if value.is_a?(String) && !value.strip.empty?

  raise ArgumentError, "#{label} must be a non-empty String, got #{value.inspect}"
end

._walk(node, only:, except:, placeholder:, seen:) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Depth-first recursive walker for redact_deep. seen is a Set of object_ids already on the current traversal stack, used to detect circular references.



# File 'lib/data_redactor.rb', line 393

def _walk(node, only:, except:, placeholder:, seen:)
  case node
  when String
    redact(node, only: only, except: except, placeholder: placeholder)
  when Hash
    raise ArgumentError, "redact_deep: circular reference detected" if seen.include?(node.object_id)
    seen.add(node.object_id)
    result = node.transform_values { |v| _walk(v, only: only, except: except, placeholder: placeholder, seen: seen) }
    seen.delete(node.object_id)
    result
  when Array
    raise ArgumentError, "redact_deep: circular reference detected" if seen.include?(node.object_id)
    seen.add(node.object_id)
    result = node.map { |v| _walk(v, only: only, except: except, placeholder: placeholder, seen: seen) }
    seen.delete(node.object_id)
    result
  else
    node
  end
end
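The stack-based cycle check is easy to exercise on its own. The following sketch is a condensed stand-alone walker, not the gem's method: the block passed to `walk` is a hypothetical stand-in for the redaction call. It shows that a structure referenced twice from siblings is fine, while a structure that contains itself raises.

```ruby
require "set"

# Condensed stand-alone walker; the block stands in for the redaction call.
def walk(node, seen = Set.new, &redact)
  case node
  when String
    redact.call(node)
  when Hash, Array
    raise ArgumentError, "circular reference detected" if seen.include?(node.object_id)
    seen.add(node.object_id)
    result =
      if node.is_a?(Hash)
        node.transform_values { |v| walk(v, seen, &redact) }
      else
        node.map { |v| walk(v, seen, &redact) }
      end
    seen.delete(node.object_id) # off the stack again, so siblings may share it
    result
  else
    node
  end
end

mask = ->(s) { s.gsub(/[0-9]+/, "[REDACTED]") }

shared = ["pin 1234"]
result = walk({ a: shared, b: shared, n: 7 }, &mask)
p result[:b] # => ["pin [REDACTED]"]
p result[:n] # => 7

cyclic = {}
cyclic[:self] = cyclic
begin
  walk(cyclic, &mask)
rescue ArgumentError => e
  puts e.message # prints "circular reference detected"
end
```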

._word_alternatives(word) ⇒ Array<String>

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Alternatives for a single whitespace-free word: the full name (each letter as a case-insensitive, diacritic-folded class) and its initial.

Parameters:

  • word (String)

    a single word with no spaces or hyphens.

Returns:

  • (Array<String>)

    alternation members for this word.



# File 'lib/data_redactor/name_pattern.rb', line 125

def _word_alternatives(word)
  full    = word.chars.map { |ch| _letter_class(ch) }.join
  initial = "#{_letter_class(word[0])}\\.?"
  [full, initial]
end

.add_pattern(name:, regex:, tag: :custom, boundary: false) ⇒ Boolean

Register a custom redaction pattern.

Patterns must be valid POSIX ERE. Ruby-only syntax (\d, \s, \w, \b, lookaround, non-greedy quantifiers, named groups) is rejected at registration time, never at redaction time.

If a pattern with the same name is already registered, it is replaced (the old compiled regex_t is freed).

Examples:

DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
DataRedactor.add_pattern(name: "internal_key",
                         regex: /INT-[A-Z]{3}/,
                         tag: :credentials,
                         boundary: true)

Parameters:

  • name (String)

    unique identifier for this pattern. Used by remove_pattern.

  • regex (String, Regexp)

    POSIX ERE source. A Regexp is accepted for convenience but only its .source is used; flags are ignored.

  • tag (Symbol) (defaults to: :custom)

    one of TAGS keys. Defaults to :custom.

  • boundary (Boolean) (defaults to: false)

    when true, the pattern is wrapped with (^|[^0-9A-Za-z])(…)([^0-9A-Za-z]|$) so it only matches when not embedded in a longer alphanumeric token. Incompatible with patterns that contain capture groups.

Returns:

  • (Boolean)

    true on success.

Raises:

  • (ArgumentError)

    if name is not a non-empty String, or regex is neither a String nor a Regexp.

  • (InvalidPatternError)

    if the pattern uses Ruby-only syntax, contains capture groups while boundary: true, or fails regcomp.

  • (UnknownTagError)

    if tag is not in TAGS.



# File 'lib/data_redactor.rb', line 263

def add_pattern(name:, regex:, tag: :custom, boundary: false)
  raise ArgumentError, "name must be a non-empty String" \
    unless name.is_a?(String) && !name.empty?

  source = case regex
           when String then regex
           when Regexp then regex.source
           else raise ArgumentError, "regex must be a String or Regexp, got #{regex.class}"
           end

  if source =~ RUBY_ONLY_SYNTAX_RE
    raise InvalidPatternError,
      "pattern #{name.inspect} uses Ruby-only syntax (#{$&.inspect}); " \
      "use POSIX ERE — no \\d, \\s, \\w, \\b, lookaround, non-greedy, or named groups"
  end

  if boundary && source =~ CAPTURE_GROUP_RE
    raise InvalidPatternError,
      "pattern #{name.inspect} has capture groups and cannot use boundary: true"
  end

  tag_bit = TAGS[tag] or raise UnknownTagError,
    "unknown tag #{tag.inspect}; valid tags: #{TAGS.keys.inspect}"

  _add_pattern(name, source, tag_bit, boundary ? 1 : 0)
end
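The effect of boundary: true, and the reason it rejects capture groups, can be sketched with plain Ruby. This is a minimal illustration under two assumptions: the trailing wrapper group mirrors the leading one, and Ruby's Regexp behaves like the POSIX engine for this pattern. `boundary_wrap` and `core_match` are hypothetical helper names.

```ruby
# Hypothetical helper: wrap +core+ between a leading and a trailing boundary group.
def boundary_wrap(core)
  "(^|[^0-9A-Za-z])(#{core})([^0-9A-Za-z]|$)"
end

# Hypothetical helper: the core match (group 2), or nil when nothing matches.
def core_match(core, text)
  m = Regexp.new(boundary_wrap(core)).match(text)
  m && m[2]
end

p core_match("EMP-[0-9]{6}", "id: EMP-123456.") # => "EMP-123456"
p core_match("EMP-[0-9]{6}", "XEMP-123456")     # => nil
p core_match("EMP-[0-9]{6}", "EMP-1234567")     # => nil

# With a capture group inside the core, group 3 is no longer the trailing
# boundary: it is the inner group, which is the index shift being guarded against.
m = Regexp.new(boundary_wrap("(EMP|CTR)-[0-9]{6}")).match(" EMP-123456 ")
p m[3] # => "EMP"
```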

.build_enable_bits(only, except) ⇒ Array<Integer>

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Build the per-pattern enable bit-list passed to the C layer.

The list has one Integer (0 or 1) per pattern in execution order: built-ins first (NUM_PATTERNS entries), then currently registered custom patterns in registration order. C iterates by index and skips zeros.

Semantics of only: / except: — both accept a mix of Symbols (tags) and Strings (pattern names):

enabled(p) iff
  (only is nil OR p.tag ∈ only_tags OR p.name ∈ only_names)
  AND p.tag ∉ except_tags AND p.name ∉ except_names
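The rule above can be read as a small predicate. The sketch below restates it with plain arrays instead of bitmasks; `Pattern`, `enabled?`, and the pattern name "phone" are hypothetical and serve only to exercise the rule.

```ruby
# Minimal stand-in for a pattern: a name (String) and a tag (Symbol).
Pattern = Struct.new(:name, :tag)

# Symbols in a filter are tags, Strings are pattern names.
def enabled?(pattern, only: nil, except: nil)
  only_tags,   only_names   = Array(only).partition { |f| f.is_a?(Symbol) }
  except_tags, except_names = Array(except).partition { |f| f.is_a?(Symbol) }
  return false if except_tags.include?(pattern.tag) || except_names.include?(pattern.name)
  return true if only.nil?
  only_tags.include?(pattern.tag) || only_names.include?(pattern.name)
end

email = Pattern.new("email", :contact)
phone = Pattern.new("phone", :contact)
aws   = Pattern.new("aws_access_key_id", :credentials)

p enabled?(email, only: :contact, except: ["email"])     # => false
p enabled?(phone, only: :contact, except: ["email"])     # => true
p enabled?(aws,   only: :contact)                        # => false
p enabled?(aws,   only: [:contact, "aws_access_key_id"]) # => true
p enabled?(email)                                        # => true
```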

Returns:

  • (Array<Integer>)

    same length as built-ins + customs.



# File 'lib/data_redactor.rb', line 366

def build_enable_bits(only, except)
  only_bits,   only_names   = split_filter(only)
  except_bits, except_names = split_filter(except)
  only_present = !only.nil?

  bits = Array.new(BUILTIN_PATTERN_NAMES.length + _custom_patterns.length, 0)

  BUILTIN_PATTERN_NAMES.each_with_index do |name, i|
    tag_bit = BUILTIN_PATTERN_TAG_BITS[i]
    bits[i] = 1 if pattern_enabled?(name, tag_bit, only_present,
                                    only_bits, only_names,
                                    except_bits, except_names)
  end

  _custom_patterns.each_with_index do |h, i|
    bits[BUILTIN_PATTERN_NAMES.length + i] = 1 if pattern_enabled?(
      h[:name], h[:tag_bit], only_present,
      only_bits, only_names, except_bits, except_names)
  end

  bits
end

.clear_custom_patterns! ⇒ nil

Remove every registered custom pattern.

Mostly useful in test suites that need a clean slate between examples.

Returns:

  • (nil)


# File 'lib/data_redactor.rb', line 316

def clear_custom_patterns!
  _clear_custom_patterns
end

.custom_patterns ⇒ Array<Hash{Symbol => Object}>

List every currently registered custom pattern.

Returns:

  • (Array<Hash{Symbol => Object}>)

    one hash per pattern with keys :name (String), :source (String — the POSIX ERE source), :tag (Symbol), :boundary (Boolean).



# File 'lib/data_redactor.rb', line 304

def custom_patterns
  _custom_patterns.map do |h|
    { name: h[:name], source: h[:source], tag: TAGS.key(h[:tag_bit]) || :custom,
      boundary: h[:boundary] }
  end
end

.name_pattern(first, last, middle: nil) ⇒ String

Build a POSIX ERE that matches a person’s name across common written variations, ready to hand to add_pattern.

The returned pattern is boundary-wrapped: it embeds (^|[^A-Za-z])(…)([^A-Za-z]|$) so that “Mario” matches as a whole word but not inside “Mariolino”. Because the wrapper uses capture groups, register the pattern with the default boundary: false (do not pass boundary: true, since that would double-wrap and reject the groups).

Variations covered:

  • Case: every letter becomes a case-insensitive character class ([Mm][Aa]...), since POSIX ERE has no /i flag.

  • Order: “First Last”, “Last First”, “Last, First”, “Last,First”.

  • Initials: “M. Last”, “M Last”, “First R.”, “First R”, “M.R.”, “M R”, “MR”.

  • Diacritics: an ASCII letter with a DIACRITIC_FOLD entry also matches its accented forms (“Jose” matches “José”). An accented input letter also matches its bare ASCII form.

  • Separators: spaces and hyphens are interchangeable between and within name parts. A hyphenated part like “Anne-Marie” also matches “Anne Marie”, “AnneMarie”, and each half on its own (“Anne”, “Marie”). Multi-word parts like “Van der Berg” tolerate any space/hyphen separator between words.

Examples:

Register a name pattern

DataRedactor.add_pattern(
  name:  "person_mario_rossi",
  regex: DataRedactor.name_pattern("Mario", "Rossi"),
  tag:   :contact
)

With a middle name

DataRedactor.name_pattern("Mario", "Rossi", middle: "Luigi")

Parameters:

  • first (String)

    the given name. May contain hyphens or spaces.

  • last (String)

    the family name. May contain hyphens or spaces.

  • middle (String, nil) (defaults to: nil)

    optional middle name. When given, the pattern matches both the no-middle forms and the with-middle forms.

Returns:

  • (String)

    a POSIX ERE source string.

Raises:

  • (ArgumentError)

    if first or last is not a non-empty String, or middle is given but is not a non-empty String.



# File 'lib/data_redactor/name_pattern.rb', line 71

def name_pattern(first, last, middle: nil)
  _validate_name_arg!(first, "first")
  _validate_name_arg!(last, "last")
  _validate_name_arg!(middle, "middle") unless middle.nil?

  first_tok  = _part_token(first)
  last_tok   = _part_token(last)
  middle_tok = middle && _part_token(middle)

  # Separator between name parts. Optional so initial-only forms collapse
  # ("MR", "M.R.") and so "First,Last" with no space still matches.
  sep = "[ ,-]*"

  bodies = []
  bodies << "#{first_tok}#{sep}#{last_tok}"            # First Last
  bodies << "#{last_tok}#{sep}#{first_tok}"            # Last First / Last, First

  if middle_tok
    bodies << "#{first_tok}#{sep}#{middle_tok}#{sep}#{last_tok}" # First Middle Last
    bodies << "#{last_tok}#{sep}#{first_tok}#{sep}#{middle_tok}" # Last First Middle
  end

  "(^|[^A-Za-z])(#{bodies.join('|')})([^A-Za-z]|$)"
end
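To see the order and initial handling without the gem loaded, here is a deliberately simplified sketch: ASCII case folding only, single-word parts, no middle name. `simple_name_pattern` and its helpers are hypothetical names, and Ruby's Regexp stands in for the POSIX engine.

```ruby
# Case-insensitive class for one ASCII letter (no diacritic folding here).
def letter_class(ch)
  "[#{ch.downcase}#{ch.upcase}]"
end

# Full word or its initial with an optional dot.
def part_token(word)
  full    = word.chars.map { |ch| letter_class(ch) }.join
  initial = "#{letter_class(word[0])}\\.?"
  "(#{full}|#{initial})"
end

# "First Last" and "Last First" bodies inside the letter-boundary wrapper.
def simple_name_pattern(first, last)
  f = part_token(first)
  l = part_token(last)
  sep = "[ ,-]*"
  "(^|[^A-Za-z])(#{f}#{sep}#{l}|#{l}#{sep}#{f})([^A-Za-z]|$)"
end

NAME_RE = Regexp.new(simple_name_pattern("Mario", "Rossi"))

p NAME_RE.match?("Mario Rossi")       # => true
p NAME_RE.match?("ROSSI, mario")      # => true
p NAME_RE.match?("M. Rossi")          # => true
p NAME_RE.match?("Mariolino Rossini") # => false
```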

.pattern_enabled?(name, tag_bit, only_present, only_bits, only_names, except_bits, except_names) ⇒ Boolean

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Returns:

  • (Boolean)


# File 'lib/data_redactor.rb', line 415

def pattern_enabled?(name, tag_bit, only_present, only_bits, only_names,
                     except_bits, except_names)
  return false if (tag_bit & except_bits) != 0
  return false if except_names.include?(name)
  return true  unless only_present
  return true  if (tag_bit & only_bits) != 0
  only_names.include?(name)
end

.pattern_names ⇒ Array<String>

List of every pattern name the redactor knows about.

Includes the BUILTIN_PATTERN_NAMES plus any names registered via add_pattern. Useful for discovering what String values only: / except: accept, and for filtering / debugging.

Returns:

  • (Array<String>)

    built-in names first (in execution order), then custom names in registration order.



# File 'lib/data_redactor.rb', line 103

def pattern_names
  BUILTIN_PATTERN_NAMES + _custom_patterns.map { |h| h[:name] }
end

.redact(text, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ String

Redact every match of the configured patterns in text.

only: and except: both accept a single value or an Array, mixing:

  • Symbols — tag names from TAGS (e.g. :contact, :credentials).

  • Strings — specific pattern names from pattern_names (e.g. “email”).

They can be combined: only: :contact, except: [“email”] means “redact every contact pattern except email.” Symbols give you tag-level control; Strings give you per-pattern precision.

Precedence: a pattern is redacted iff (only is nil OR pattern matches only:) AND (pattern does not match except:). except: always wins over only: when they overlap — e.g. only: :contact, except: :contact produces an empty redaction (no-op), and only: [“email”], except: [“email”] likewise skips email entirely.

Examples:

DataRedactor.redact("token sk_live_abc123", only: :credentials)
DataRedactor.redact(text, only: [:contact, "aws_access_key_id"])
DataRedactor.redact(text, only: :contact, except: ["email"])

Parameters:

  • text (String)

    input string. Returned unchanged if no patterns match.

  • only (Symbol, String, Array, nil) (defaults to: nil)

    include only the given tag(s) and/or pattern name(s).

  • except (Symbol, String, Array, nil) (defaults to: nil)

    exclude the given tag(s) and/or pattern name(s). May be combined with only:.

  • placeholder (String, :tagged, :hash) (defaults to: PLACEHOLDER_DEFAULT)

    replacement strategy. A String is used verbatim. :tagged produces [REDACTED:TAGNAME]. :hash produces a deterministic [TAGNAME_xxxx] token (4-hex djb2) so the same input value always maps to the same token.
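
As a rough illustration of how a deterministic :hash token could be derived, here is a minimal sketch. It assumes the classic djb2 recurrence (seed 5381, hash * 33 + byte) truncated to the low 16 bits to obtain 4 hex digits; the C extension's exact variant and truncation are not shown in this page, and djb2_token is a hypothetical helper, not part of the library.

```ruby
# Sketch of a deterministic token in the style of placeholder: :hash.
# Assumptions: classic djb2 (seed 5381, hash * 33 + byte), keeping the low
# 16 bits so the result fits in 4 hex digits. The C layer may differ.
def djb2_token(tag, value)
  hash = 5381
  value.each_byte { |b| hash = ((hash * 33) + b) & 0xFFFFFFFF }
  format("[%s_%04x]", tag.to_s.upcase, hash & 0xFFFF)
end

first  = djb2_token(:contact, "alice@example.com")
second = djb2_token(:contact, "alice@example.com")

puts first == second  # true: the same value always yields the same token
puts first            # has the shape [CONTACT_xxxx]
```

Deterministic tokens let you correlate repeated values across a log without exposing them. With only 4 hex digits there are 65,536 possible tokens per tag, so two different values can share a token.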

Returns:

  • (String)

    a new string with every match replaced.

Raises:



# File 'lib/data_redactor.rb', line 141

def redact(text, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
  enable_bits = build_enable_bits(only, except)
  ph_mode, ph_str = resolve_placeholder(placeholder)
  # Defer to the C layer's TypeError for non-Strings; only chunk if the input
  # is a String big enough to benefit (avoid bytesize on non-Strings).
  if text.is_a?(String) && text.bytesize > CHUNK_SIZE
    return _chunk_bytes(text).map { |c| _redact(c, ph_mode, ph_str, enable_bits) }.join
  end
  _redact(text, ph_mode, ph_str, enable_bits)
end

.redact_deep(data, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ Hash, ...

Recursively redact every String value in a nested Hash/Array structure.

Walks the structure depth-first. Only String leaves are passed through redact; all other leaf types (Integer, Float, nil, Symbol, Boolean) are copied unchanged. Hash keys are never modified.

Returns a deep copy — the original structure is never mutated.

Examples:

Rails params

safe = DataRedactor.redact_deep(params.to_h)

Mixed filter

DataRedactor.redact_deep(payload, only: :credentials, placeholder: :tagged)

Parameters:

  • data (Hash, Array, String, Object)

    the structure to walk. Any type is accepted; non-String scalars are returned as-is.

  • only (Symbol, String, Array, nil) (defaults to: nil)

    forwarded to redact.

  • except (Symbol, String, Array, nil) (defaults to: nil)

    forwarded to redact.

  • placeholder (String, :tagged, :hash) (defaults to: PLACEHOLDER_DEFAULT)

    forwarded to redact.

Returns:

  • (Hash, Array, String, Object)

    a new structure of the same shape with all String leaves redacted.

Raises:

  • (ArgumentError)

    if the structure contains a circular reference.



# File 'lib/data_redactor.rb', line 207

def redact_deep(data, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
  _walk(data, only: only, except: except, placeholder: placeholder, seen: Set.new)
end
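
The private _walk helper is not listed here, so the following is only a sketch of how a depth-first walk with cycle detection might look. walk_sketch is a hypothetical name, the block stands in for DataRedactor.redact, and the choice to track only the current ancestors (so a shared but non-circular reference is allowed) is an assumption.

```ruby
require "set"

# Hypothetical walker in the spirit of redact_deep; not the library's _walk.
# The block stands in for DataRedactor.redact.
def walk_sketch(node, seen = Set.new, &redactor)
  case node
  when String
    redactor.call(node)
  when Hash, Array
    id = node.object_id
    raise ArgumentError, "circular reference detected" if seen.include?(id)
    seen.add(id)
    copy =
      if node.is_a?(Hash)
        # Keys are kept as-is; only values are walked.
        node.each_with_object({}) { |(k, v), h| h[k] = walk_sketch(v, seen, &redactor) }
      else
        node.map { |v| walk_sketch(v, seen, &redactor) }
      end
    seen.delete(id) # only ancestors count as a cycle
    copy
  else
    node # Integer, Float, nil, Symbol, true/false pass through unchanged
  end
end

input  = { "user" => { "email" => "alice@example.com", "age" => 30 },
           "cc"   => ["bob@example.com", nil] }
output = walk_sketch(input) { |s| s.gsub(/[^\s@]+@[^\s@]+/, "[REDACTED]") }

puts output.inspect
puts input["user"]["email"] # alice@example.com (the input is not mutated)

cyclic = []
cyclic << cyclic
begin
  walk_sketch(cyclic) { |s| s }
rescue ArgumentError => e
  puts e.message # circular reference detected
end
```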

.redact_json(json_string, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT) ⇒ String

Parse json_string, redact every String value in the resulting structure, and return valid JSON.

Delegates traversal to redact_deep. All keyword arguments are forwarded to redact.

Examples:

DataRedactor.redact_json('{"email":"alice@example.com","count":3}')
# => '{"email":"[REDACTED]","count":3}'

Parameters:

  • json_string (String)

    valid JSON input.

  • only (Symbol, String, Array, nil) (defaults to: nil)

    forwarded to redact.

  • except (Symbol, String, Array, nil) (defaults to: nil)

    forwarded to redact.

  • placeholder (String, :tagged, :hash) (defaults to: PLACEHOLDER_DEFAULT)

    forwarded to redact.

Returns:

  • (String)

    a JSON string with all String values redacted.

Raises:

  • (JSON::ParserError)

    if json_string is not valid JSON.



# File 'lib/data_redactor.rb', line 227

def redact_json(json_string, only: nil, except: nil, placeholder: PLACEHOLDER_DEFAULT)
  parsed = JSON.parse(json_string)
  redacted = redact_deep(parsed, only: only, except: except, placeholder: placeholder)
  JSON.generate(redacted)
end
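
The parse, walk, generate round trip can be tried without the gem. In this sketch redact_stub is a hypothetical stand-in for DataRedactor.redact that only masks email-like strings, and walk is a simplified substitute for redact_deep; only the JSON calls are real library API.

```ruby
require "json"

# Stand-in for DataRedactor.redact: masks anything that looks like an email.
redact_stub = ->(s) { s.gsub(/[^\s@]+@[^\s@]+/, "[REDACTED]") }

# Minimal recursive walk over parsed JSON (Hash / Array / scalar).
walk = lambda do |node|
  case node
  when String then redact_stub.call(node)
  when Hash   then node.transform_values { |v| walk.call(v) }
  when Array  then node.map { |v| walk.call(v) }
  else node
  end
end

json   = '{"email":"alice@example.com","count":3,"ok":true,"note":null}'
result = JSON.generate(walk.call(JSON.parse(json)))

puts result
# => {"email":"[REDACTED]","count":3,"ok":true,"note":null}
```

Numbers, booleans, and null survive the round trip because only String values reach the redactor.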

.remove_pattern(name) ⇒ Boolean

Remove a previously registered custom pattern.

Parameters:

  • name (String, Symbol)

    the name used in add_pattern.

Returns:

  • (Boolean)

    true if a pattern was removed, false if no pattern with that name was registered.



# File 'lib/data_redactor.rb', line 295

def remove_pattern(name)
  _remove_pattern(name.to_s)
end

.resolve_placeholder(placeholder) ⇒ Array(Integer, String)

This method is part of a private API. Avoid using it if possible, as it may be removed or changed in the future.

Translate the user-facing placeholder: value into the (mode_int, str) pair the C layer expects.

Parameters:

  • placeholder (String, :tagged, :hash)

Returns:

  • (Array(Integer, String))

Raises:

  • (ArgumentError)

    if placeholder is none of the accepted values.



# File 'lib/data_redactor.rb', line 431

def resolve_placeholder(placeholder)
  case placeholder
  when :tagged then [PH_MODE_TAGGED, ""]
  when :hash   then [PH_MODE_HASH,   ""]
  when String  then [PH_MODE_PLAIN,  placeholder]
  else
    raise ArgumentError,
      "placeholder must be a String, :tagged, or :hash — got #{placeholder.inspect}"
  end
end

.scan(text, only: nil, except: nil) ⇒ Hash{Symbol => Object}

Scan text and return both the redacted string and per-match metadata.

Useful for auditing, false-positive tuning, and compliance pipelines. :start is a byte offset into the original string and :length is a byte count, so for each match m, text.byteslice(m[:start], m[:length]) == m[:value].

Examples:

DataRedactor.scan("user@example.com")
# => { redacted: "[REDACTED]",
#      matches: [{tag: :contact, name: "email",
#                 value: "user@example.com", start: 0, length: 16}] }

Parameters:

  • text (String)

    input string.

  • only (Symbol, String, Array, nil) (defaults to: nil)

    same semantics as redact.

  • except (Symbol, String, Array, nil) (defaults to: nil)

    same semantics as redact.

Returns:

  • (Hash{Symbol => Object})

    { redacted: String, matches: Array<Hash> }. Each match hash has :tag (Symbol), :name (String), :value (String), :start (Integer byte offset), :length (Integer).
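
Because :start and :length count bytes rather than characters, the difference matters as soon as the text contains multibyte characters. The sketch below builds a match hash by hand in the shape described above (no library call is made; the sample text is an assumption) and compares byte slicing with character slicing.

```ruby
# Hand-built match hash in the shape scan returns; no library call is made.
text  = "caf\u00e9 user@example.com" # "é" occupies 2 bytes in UTF-8
value = "user@example.com"

match = {
  tag: :contact, name: "email", value: value,
  start:  text.b.index(value.b), # byte offset, not character offset
  length: value.bytesize
}

by_bytes = text.byteslice(match[:start], match[:length])
by_chars = text[match[:start], match[:length]]

puts match[:start]             # 6
puts by_bytes == match[:value] # true
puts by_chars == match[:value] # false: String#[] counts characters
```

Use String#byteslice, not String#[], when mapping matches back onto the original text.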

Raises:



# File 'lib/data_redactor.rb', line 172

def scan(text, only: nil, except: nil)
  enable_bits = build_enable_bits(only, except)
  result =
    if text.is_a?(String) && text.bytesize > CHUNK_SIZE
      _chunked_scan(text, enable_bits)
    else
      _scan(text, enable_bits)
    end
  # Normalise: convert tag string from C (uppercase) back to the Symbol used in TAGS
  result[:matches].each { |m| m[:tag] = m[:tag].to_s.downcase.to_sym }
  result
end

.split_filter(entries) ⇒ Array(Integer, Set<String>)

This method is part of a private API. Avoid using it if possible, as it may be removed or changed in the future.

Split a mixed Symbol/String filter list into (tag_bitmask, name_set).

Parameters:

  • entries (nil, Symbol, String, Array)

Returns:

  • (Array(Integer, Set<String>))

    tag bits OR-ed together; set of pattern-name Strings.

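Raises:

  • (UnknownTagError)

    if a Symbol entry is not a key of TAGS.

  • (UnknownPatternError)

    if a String entry is not in pattern_names.

  • (ArgumentError)

    if an entry is neither a Symbol nor a String.
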

# File 'lib/data_redactor.rb', line 328

def split_filter(entries)
  bits = 0
  names = Set.new
  return [bits, names] if entries.nil?
  Array(entries).each do |e|
    case e
    when Symbol
      bit = TAGS[e] or raise UnknownTagError,
        "unknown tag #{e.inspect}; valid tags: #{TAGS.keys.inspect}"
      bits |= bit
    when String
      unless pattern_names.include?(e)
        raise UnknownPatternError,
          "unknown pattern name #{e.inspect}; see DataRedactor.pattern_names"
      end
      names << e
    else
      raise ArgumentError,
        "only:/except: entries must be a Symbol (tag) or String (pattern name), got #{e.inspect}"
    end
  end
  [bits, names]
end
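
To make the (tag_bitmask, name_set) result concrete, here is a standalone sketch of the same splitting logic. SKETCH_TAGS, SKETCH_NAMES, and their bit values are hypothetical stand-ins for TAGS and pattern_names, and the sketch raises plain ArgumentError instead of the library's UnknownTagError and UnknownPatternError.

```ruby
require "set"

# Hypothetical stand-ins; the real TAGS and pattern_names come from the library.
SKETCH_TAGS  = { credentials: 0b001, financial: 0b010, contact: 0b100 }.freeze
SKETCH_NAMES = %w[email aws_access_key_id].freeze

def split_filter_sketch(entries)
  bits  = 0
  names = Set.new
  Array(entries).each do |e| # Array(nil) is [], so nil yields [0, empty set]
    case e
    when Symbol
      bits |= SKETCH_TAGS.fetch(e) { raise ArgumentError, "unknown tag #{e.inspect}" }
    when String
      raise ArgumentError, "unknown pattern name #{e.inspect}" unless SKETCH_NAMES.include?(e)
      names << e
    else
      raise ArgumentError, "expected Symbol or String, got #{e.inspect}"
    end
  end
  [bits, names]
end

bits, names = split_filter_sketch([:contact, :credentials, "email"])

puts bits.to_s(2)        # 101
puts names.to_a.inspect  # ["email"]
```

Tags collapse into a single Integer that the C layer can test with a bitwise AND, while pattern names need a Set because they are not backed by bits.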

.tags ⇒ Array<Symbol>

List of supported tag symbols.

Returns:

  • (Array<Symbol>)

    every key from TAGS



# File 'lib/data_redactor.rb', line 91

def tags
  TAGS.keys
end