Class: Phronomy::Guardrail::PromptInjectionGuardrail

Inherits:
InputGuardrail show all
Defined in:
lib/phronomy/guardrail/prompt_injection_guardrail.rb

Overview

Detects potential prompt injection attempts in the agent input.

Prompt injection is an attack where an adversary embeds LLM instructions inside data sources (e.g. RAG chunks, tool results, user input) to override the agent's intended behaviour.

This guardrail scans the input string for common injection patterns and calls Base#fail! when a match is found. It is intended to be registered as an input guardrail on agents that consume untrusted external content.

Examples:

class MyAgent < Phronomy::Agent::Base
  model "gpt-4o"
  input_guardrails Phronomy::Guardrail::PromptInjectionGuardrail.new
end

Custom patterns

guard = Phronomy::Guardrail::PromptInjectionGuardrail.new(
  extra_patterns: [/exfiltrate/i]
)

Constant Summary collapse

DEFAULT_PATTERNS =

Common prompt injection / jailbreak patterns.

[
  /ignore\s+(previous|prior|all)\s+instructions?/i,
  /disregard\s+(previous|prior|all)\s+instructions?/i,
  /forget\s+(previous|prior|all)\s+instructions?/i,
  /override\s+(previous|prior|all)\s+instructions?/i,
  /new\s+instructions?:\s/i,
  /\byour\s+new\s+(role|instructions?|task)\b/i,
  /you\s+are\s+now\s+(a|an)\b/i,
  /\bact\s+as\s+(a|an)\b/i,
  /\bpretend\s+(you\s+are|to\s+be)\b/i,
  /\bdo\s+not\s+follow\s+(your|the)\s+instructions?\b/i
].freeze

Instance Method Summary collapse

Methods inherited from Base

#run!

Constructor Details

#initialize(extra_patterns: []) ⇒ PromptInjectionGuardrail

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Returns a new instance of PromptInjectionGuardrail.

Parameters:

  • extra_patterns (Array<Regexp>) (defaults to: [])

    additional patterns to scan for



42
43
44
45
# File 'lib/phronomy/guardrail/prompt_injection_guardrail.rb', line 42

def initialize(extra_patterns: [])
  super()
  @patterns = DEFAULT_PATTERNS + extra_patterns
end

Instance Method Details

#check(input) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Scans the input string for injection patterns.

Parameters:

  • input (String, Hash)


50
51
52
53
54
55
# File 'lib/phronomy/guardrail/prompt_injection_guardrail.rb', line 50

def check(input)
  text = input.is_a?(Hash) ? input.values.join(" ") : input.to_s
  @patterns.each do |pattern|
    fail!("Potential prompt injection detected") if text.match?(pattern)
  end
end