Class: Rubino::Tools::VisionTool

Inherits:
Base
  • Object
Defined in:
lib/rubino/tools/vision_tool.rb

Overview

Delegates image-understanding to a multimodal aux model so a text-only primary can still “see” what the user uploaded. Implements the agent-as-tool semantics from the OpenAI Agents SDK: the primary stays in control, calls this tool with a focused question, and receives a structured (text) reply — no conversation handoff, no shared history.

The aux model is resolved from `auxiliary.vision` in config. When the primary already supports vision (per Configuration#model_supports_vision?) and no aux model is configured, Registry hides this tool, since there is no useful delegation to perform.
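
The visibility rule above can be sketched as a small predicate. This only illustrates the stated condition: `vision_tool_visible?` and its keyword arguments are hypothetical names, and the actual check performed by Registry and Configuration#model_supports_vision? is not shown on this page.

```ruby
# Hypothetical predicate mirroring the rule described above.
# Neither the method name nor its keywords come from Rubino itself.
def vision_tool_visible?(primary_supports_vision:, aux_vision_model:)
  aux_configured = !aux_vision_model.to_s.strip.empty?
  # Hide the tool only when the primary can already see images
  # and there is no auxiliary model to delegate to.
  !(primary_supports_vision && !aux_configured)
end

vision_tool_visible?(primary_supports_vision: true,  aux_vision_model: nil)                 # => false
vision_tool_visible?(primary_supports_vision: true,  aux_vision_model: "some-vision-model") # => true
vision_tool_visible?(primary_supports_vision: false, aux_vision_model: nil)                 # => true
```

The last case follows from the rule as written: the text only describes hiding the tool when the primary supports vision and no aux model is set. How Rubino behaves when neither is available is not stated here.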

Instance Attribute Summary

Attributes inherited from Base

#cancel_token, #read_tracker, #stream_chunk

Instance Method Summary

Methods inherited from Base

#cancellation_requested?, #config_key, #emit_chunk, #risky?, #to_tool_definition, workspace_root, workspace_roots

Instance Method Details

#call(arguments) ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 50

def call(arguments)
  path     = (arguments["file_path"] || arguments[:file_path]).to_s
  question = (arguments["question"]  || arguments[:question] ||
              "Describe what you see in markdown.").to_s

  return "Error: file_path is required" if path.empty?

  expanded = File.expand_path(path)
  return "Error: file not found: #{path}" unless File.exist?(expanded)
  return "Error: not a regular file: #{path}" unless File.file?(expanded)

  ext = File.extname(expanded).downcase
  unless LLM::ContentBuilder::SUPPORTED_IMAGE_TYPES.include?(ext)
    return "Error: unsupported image extension '#{ext}'. " \
           "Supported: #{LLM::ContentBuilder::SUPPORTED_IMAGE_TYPES.join(", ")}"
  end

  # Pass the image through ruby_llm's native `with:` slot (image_paths),
  # NOT as an OpenAI-style content array. ruby_llm's `ask` stringifies an
  # array content, so the base64 bytes would reach the model as TEXT and
  # it hallucinates (prod sessions 38/41: M3 saw the image perfectly when
  # called directly, but got a text blob through this path). image_paths
  # attaches the file as a real multimodal part — same route the primary
  # uses for native vision.
  response = LLM::AuxiliaryClient.new.call(
    task: :vision,
    messages: [{ role: "user", content: question }],
    image_paths: [expanded]
  )
  response.content.to_s
rescue StandardError => e
  "Error calling vision model: #{e.class}: #{e.message}"
end
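
Because #call reports failures as plain strings rather than raising, a caller has to inspect the return value to tell an error from a model reply. The sketch below assumes the two prefixes visible in the source above. `vision_error?` is a hypothetical helper, not part of Rubino, and a genuine model reply that happens to begin with the same text would be misclassified.

```ruby
# Hypothetical helper: true when the string looks like one of #call's
# error returns. The prefixes are copied from the method body above.
def vision_error?(result)
  result.to_s.start_with?("Error: ", "Error calling vision model: ")
end

vision_error?("Error: file_path is required")                      # => true
vision_error?("Error calling vision model: RuntimeError: timeout") # => true
vision_error?("The chart shows monthly revenue by region.")        # => false
```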

#description ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 22

def description
  "Ask a multimodal model to describe or interpret an image. " \
    "Use when you need to understand visual content (charts, screenshots, " \
    "diagrams, photos). Provide an optional focused question to direct the " \
    "analysis; default is a full markdown description."
end

#input_schema ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 29

def input_schema
  {
    type: "object",
    properties: {
      file_path: {
        type: "string",
        description: "Absolute path to an image file (.png .jpg .jpeg .webp .gif .bmp)"
      },
      question: {
        type: "string",
        description: "Optional focused question. Default: 'Describe what you see in markdown.'"
      }
    },
    required: %w[file_path]
  }
end
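
As an illustration of arguments that satisfy this schema, the sketch below builds a payload and checks it against the `required` list. `missing_required` is a hypothetical helper and the file paths are placeholders. The check accepts string or symbol keys because #call reads both.

```ruby
# Hypothetical helper: returns the required keys that are absent (or blank)
# in the arguments, treating "file_path" and :file_path as the same key.
# The default list is copied from `required` in input_schema above.
def missing_required(arguments, required = %w[file_path])
  required.reject do |key|
    value = arguments[key] || arguments[key.to_sym]
    !value.to_s.empty?
  end
end

args = {
  "file_path" => "/tmp/example.png",            # placeholder path
  "question"  => "What does the y-axis measure?" # optional
}

missing_required(args)                          # => []
missing_required({ file_path: "/tmp/a.png" })   # => []
missing_required({ "question" => "Anything?" }) # => ["file_path"]
```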

#name ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 18

def name
  "vision"
end

#risk_level ⇒ Object



# File 'lib/rubino/tools/vision_tool.rb', line 46

def risk_level
  :low
end