Class: Phronomy::Filter::PromptInjectionFilter

Inherits:
Base
  • Object
show all
Defined in:
lib/phronomy/filter/prompt_injection_filter.rb

Overview

Detects potential prompt injection attempts in the agent input.

Prompt injection is an attack where an adversary embeds LLM instructions inside data sources (e.g. RAG chunks, tool results, user input) to override the agent's intended behaviour.

This filter scans the input string for common injection patterns and calls Base#block! when a match is found. It is intended to be registered as an input filter on agents that consume untrusted external content.

Examples:

class MyAgent < Phronomy::Agent::Base
  model "gpt-4o"
  input_filter Phronomy::Filter::PromptInjectionFilter
end

Custom patterns

filter = Phronomy::Filter::PromptInjectionFilter.new(
  extra_patterns: [/exfiltrate/i]
)
agent.add_input_filter(filter)

Constant Summary collapse

DEFAULT_PATTERNS =

Common prompt injection / jailbreak patterns.

[
  /ignore\s+(previous|prior|all)\s+instructions?/i,
  /disregard\s+(previous|prior|all)\s+instructions?/i,
  /forget\s+(previous|prior|all)\s+instructions?/i,
  /override\s+(previous|prior|all)\s+instructions?/i,
  /new\s+instructions?:\s/i,
  /\byour\s+new\s+(role|instructions?|task)\b/i,
  /you\s+are\s+now\s+(a|an)\b/i,
  /\bact\s+as\s+(a|an)\b/i,
  /\bpretend\s+(you\s+are|to\s+be)\b/i,
  /\bdo\s+not\s+follow\s+(your|the)\s+instructions?\b/i
].freeze

Instance Method Summary collapse

Constructor Details

#initialize(extra_patterns: []) ⇒ PromptInjectionFilter

Returns a new instance of PromptInjectionFilter.

Parameters:

  • extra_patterns (Array<Regexp>) (defaults to: [])

    additional patterns to scan for



45
46
47
48
# File 'lib/phronomy/filter/prompt_injection_filter.rb', line 45

def initialize(extra_patterns: [])
  super()
  @patterns = DEFAULT_PATTERNS + extra_patterns
end

Instance Method Details

#call(value, **_context) ⇒ String, Hash

Scans the input string for injection patterns.

Parameters:

  • value (String, Hash)
  • context (Hash)

Returns:

  • (String, Hash)

    the original value when no injection is detected

Raises:



56
57
58
59
60
61
62
# File 'lib/phronomy/filter/prompt_injection_filter.rb', line 56

def call(value, **_context)
  text = value.is_a?(Hash) ? value.values.join(" ") : value.to_s
  @patterns.each do |pattern|
    block!("Potential prompt injection detected") if text.match?(pattern)
  end
  value
end