Class: Rubino::Memory::ThreatScanner

Inherits:

Object

Object
Rubino::Memory::ThreatScanner

show all

Defined in:: lib/rubino/memory/threat_scanner.rb

Overview

Scans content destined for the memories table for adversarial patterns.

Memory is a long-lived, cross-session channel that gets *spliced into every future system prompt*, so a single tainted write can persistently bias the agent across runs. We inspect every write at the boundary and refuse anything that smells like a known injection / exfiltration vector. We deliberately err on the side of false-positives — the agent can rephrase, but a planted directive in memory has no antidote.

‘.scan(content)` returns nil when safe, otherwise a short string describing the threat (used as both error_code label and audit log payload).

Constant Summary collapse

PROMPT_INJECTION_PATTERNS = Prompt-injection markers. These are the cliches that show up in documented jailbreak attempts; any one match is enough to refuse —legitimate user-profile content has no reason to embed them.

[
  /ignore (?:all |the )?previous/i,
  /disregard (?:all |the )?(?:above|previous)/i,
  /you are now/i,
  /new instructions:/i,
  /^\s*system\s*:/i,
  /^\s*assistant\s*:/i,
  /<\|im_start\|>/i,
  /<\|im_end\|>/i,
  /\[INST\]/i
].freeze

URL_CREDENTIAL_PATTERN = Credentials embedded in a URL — classic data-exfil channel (scheme://user:pass@host).

%r{\b[a-z][a-z0-9+\-.]*://[^/\s:@]+:[^/\s@]+@}i

BASE64_BLOB_PATTERN = Contiguous base64 of 200+ chars. Reasonable prose never has this; encoded payloads (binaries, encrypted blobs) do.

%r{[A-Za-z0-9+/]{200,}={0,2}}

PIPE_TO_SHELL_PATTERN = curl/wget piped to a shell — remote code execution recipe.

/\b(?:curl|wget)\b[^\n]*\|\s*(?:sudo\s+)?(?:bash|sh|zsh)\b/i

INVISIBLE_UNICODE_PATTERN = Zero-width characters and BIDI override / isolate codepoints. Used to hide instructions or swap visible text direction — see the “Trojan Source” class of attacks (CVE-2021-42574).

/[‌‍‮⁦-⁩]/

Class Method Summary collapse

.scan(content) ⇒ Object

Returns nil when the content is safe, otherwise a short string naming the detected threat class (e.g. “prompt_injection”).

Class Method Details

.scan(content) ⇒ `Object`

Returns nil when the content is safe, otherwise a short string naming the detected threat class (e.g. “prompt_injection”).