Class: Rubino::Tools::VisionTool

Inherits:
Base
  • Object
Defined in:
lib/rubino/tools/vision_tool.rb

Overview

Delegates image understanding to a multimodal aux model so a text-only primary can still “see” what the user uploaded. Implements the agent-as-tool semantics from the OpenAI Agents SDK: the primary stays in control, calls this tool with a focused question, and receives a plain-text reply. There is no conversation handoff and no shared history.

The aux model is resolved from `auxiliary.vision` in config. Registry hides this tool ONLY when no aux vision model is configured AND the primary itself can’t see (per Configuration#model_supports_vision?), which is the one case where calling it could only error. Whenever the primary supports vision OR an aux model is set, the tool stays EXPOSED (see Tools::Registry#aux_dependency_satisfied?), since the model may still prefer to delegate to a better-suited aux model.
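
The exposure rule above reduces to a two-input boolean. The sketch below is illustrative only: `vision_tool_exposed?` and its keyword arguments are hypothetical names, not part of Rubino, and the real check lives in Tools::Registry#aux_dependency_satisfied?.

```ruby
# Hypothetical stand-in for the registry's exposure rule. The tool is hidden
# only when BOTH inputs are false: no aux vision model and a blind primary.
def vision_tool_exposed?(aux_vision_configured:, primary_supports_vision:)
  aux_vision_configured || primary_supports_vision
end

# Print the full truth table for the two inputs.
[true, false].product([true, false]).each do |aux, primary|
  exposed = vision_tool_exposed?(aux_vision_configured: aux,
                                 primary_supports_vision: primary)
  puts "aux=#{aux} primary=#{primary} -> #{exposed ? "exposed" : "hidden"}"
end
```

Only the `aux=false primary=false` row yields `hidden`, matching the single case where a call could only error.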

Instance Attribute Summary

Attributes inherited from Base

#cancel_token, #read_tracker, #stream_chunk, #stream_kind

Instance Method Summary

Methods inherited from Base

#cancellation_requested?, #config_key, #display_name, #emit_chunk, #mcp?, #risky?, #to_tool_definition, workspace_root, workspace_roots

Instance Method Details

#call(arguments) ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 55

def call(arguments)
  path     = (arguments["file_path"] || arguments[:file_path]).to_s
  question = (arguments["question"]  || arguments[:question] ||
              "Describe what you see in markdown.").to_s

  return "Error: file_path is required" if path.empty?

  expanded = File.expand_path(path)
  # Like summarize_file, vision sends the raw bytes off to the auxiliary
  # LLM, so an out-of-workspace image must be DENIED rather than read and
  # exfiltrated. Checked before existence so a file outside the sandbox
  # isn't even probed for presence (r5 MF-1 / r5c NEW-2).
  return outside_workspace_message(path) if outside_workspace?(expanded)
  return "Error: file not found: #{path}" unless File.exist?(expanded)
  return "Error: not a regular file: #{path}" unless File.file?(expanded)

  ext = File.extname(expanded).downcase
  unless LLM::ContentBuilder::SUPPORTED_IMAGE_TYPES.include?(ext)
    return "Error: unsupported image extension '#{ext}'. " \
           "Supported: #{LLM::ContentBuilder::SUPPORTED_IMAGE_TYPES.join(", ")}"
  end

  # Egress kill-switch (#578): routing the bytes to the aux vision model
  # is data egress. When attachments.policy.aux_vision_egress is set to
  # false, refuse BEFORE reading/shipping anything so the operator's
  # opt-out is real and not a dead config key.
  unless Attachments::Policy.aux_vision_egress?
    return "Error: image egress is disabled by config " \
           "(attachments.policy.aux_vision_egress: false). " \
           "The vision tool will not send image bytes to the auxiliary model."
  end

  # Content-sniff BEFORE egress (#579): the extension check above can be
  # spoofed (a text/binary file renamed `.png`), and on the bare tool
  # path the raw bytes would otherwise reach the EXTERNAL aux/vision model
  # before the model itself rejects them — the bytes have already left the
  # host. Reuse Attachments::Classify (magic wins, fail-closed; same
  # detector the executor's native-attachment path uses) and reject when
  # the real content isn't an image, so nothing is shipped off-host.
  classification = Attachments::Classify.call(expanded)
  unless classification&.safe && classification.kind == :image
    return "Error: '#{path}' is not a valid image (extension spoof or corrupt file?). " \
           "Its content is not a recognised image format, so nothing was sent to the vision model."
  end

  # Pass the image through ruby_llm's native `with:` slot (image_paths),
  # NOT as an OpenAI-style content array. ruby_llm's `ask` stringifies an
  # array content, so the base64 bytes would reach the model as TEXT and
  # it hallucinates (prod sessions 38/41: M3 saw the image perfectly when
  # called directly, but got a text blob through this path). image_paths
  # attaches the file as a real multimodal part — same route the primary
  # uses for native vision.
  response = LLM::AuxiliaryClient.new.call(
    task: :vision,
    messages: [{ role: "user", content: question }],
    image_paths: [expanded]
  )
  response.content.to_s
rescue StandardError => e
  "Error calling vision model: #{e.class}: #{e.message}"
end
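
The ordering of the guards in #call matters: workspace containment is checked before existence, and the content sniff runs before anything is sent off-host. The following is a minimal, self-contained sketch of that ordering, not Rubino's implementation. `inside_root?`, `png_bytes?`, and `preflight` are hypothetical helpers, the sketch accepts only `.png`, and the magic-byte check is a simplified stand-in for Attachments::Classify, whose internals are not shown in this page.

```ruby
require "tmpdir"

# The 8-byte PNG file signature, as a binary string.
PNG_MAGIC = "\x89PNG\r\n\x1A\n".b

# Hypothetical containment check: true when path is root itself or below it.
# Note: File.expand_path does not resolve symlinks.
def inside_root?(root, path)
  root = File.expand_path(root)
  path = File.expand_path(path)
  path == root || path.start_with?(root + File::SEPARATOR)
end

# Hypothetical content sniff: compare the leading bytes to the PNG signature.
def png_bytes?(path)
  File.binread(path, PNG_MAGIC.bytesize) == PNG_MAGIC
end

# Runs the guards in the same order as #call and returns :ok or an error string.
def preflight(path, root:, egress_allowed: true)
  expanded = File.expand_path(path)
  return "Error: outside workspace" unless inside_root?(root, expanded)
  return "Error: file not found" unless File.exist?(expanded)
  return "Error: not a regular file" unless File.file?(expanded)
  return "Error: unsupported extension" unless File.extname(expanded).downcase == ".png"
  return "Error: egress disabled" unless egress_allowed
  return "Error: not a valid image" unless png_bytes?(expanded)

  :ok
end

Dir.mktmpdir do |root|
  real  = File.join(root, "real.png")
  spoof = File.join(root, "spoof.png")
  File.binwrite(real, PNG_MAGIC + "rest-of-file")
  File.binwrite(spoof, "plain text renamed to .png")

  puts preflight(real, root: root)
  puts preflight(spoof, root: root)
  puts preflight(real, root: root, egress_allowed: false)
  puts preflight(File.join(root, "..", "elsewhere.png"), root: root)
end
```

Because containment is tested first, the last call reports "outside workspace" without ever touching the filesystem for that path, which is the property the source comment describes as not probing for presence.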

#description ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 27

def description
  "Ask a multimodal model to describe or interpret an image. " \
    "Use when you need to understand visual content (charts, screenshots, " \
    "diagrams, photos). Provide an optional focused question to direct the " \
    "analysis; default is a full markdown description."
end

#input_schema ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 34

def input_schema
  {
    type: "object",
    properties: {
      file_path: {
        type: "string",
        description: "Absolute path to an image file (.png .jpg .jpeg .webp .gif .bmp)"
      },
      question: {
        type: "string",
        description: "Optional focused question. Default: 'Describe what you see in markdown.'"
      }
    },
    required: %w[file_path]
  }
end
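
To illustrate how arguments shaped by this schema are read, the sketch below copies the two extraction lines from #call into a standalone helper. `extract_args` and `DEFAULT_QUESTION` are hypothetical names introduced for the example; only the lookup logic comes from the source.

```ruby
DEFAULT_QUESTION = "Describe what you see in markdown."

# Hypothetical helper mirroring the first lines of #call: accepts string or
# symbol keys and falls back to the default question when none is given.
def extract_args(arguments)
  path     = (arguments["file_path"] || arguments[:file_path]).to_s
  question = (arguments["question"] || arguments[:question] || DEFAULT_QUESTION).to_s
  [path, question]
end

p extract_args("file_path" => "/tmp/chart.png")
p extract_args(file_path: "/tmp/chart.png", question: "What is the y-axis unit?")
p extract_args({})
```

An empty `file_path` comes back as `""`, which is what #call tests with `path.empty?` before returning its "file_path is required" error. An empty-string `question` is truthy in Ruby, so it is passed through as-is and does not trigger the default.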

#name ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 23

def name
  "vision"
end

#risk_level ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 51

def risk_level
  :low
end