Module: AgentSandbox::BrowserTools::VisionSupport

Defined in:
lib/agent_sandbox/browser_tools.rb

Overview

Mixin that writes image bytes downloaded from the sandbox to a host tempfile, runs a multimodal sub-call on the image, and deletes the tempfile immediately afterwards. It keeps no global state; each call is self-contained.

Constant Summary

DEFAULT_FOCUS_PROMPT =
lambda { |focus|
  "Read this image. Focus on: #{focus}. Return structured plain " \
    "text. Quote exact numbers and labels as they appear. If " \
    "something isn't visible, say so instead of guessing."
}
DEFAULT_GENERAL_PROMPT =
"Describe this image. List every product, price, heading, and " \
"notable text you see. Be exact with numbers and labels."
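
The choice between the two prompts can be sketched in isolation as follows. This is a minimal illustration, assuming the same rule the method below uses (a nil or empty focus falls back to the general prompt); the module name `PromptSketch` and the helper `prompt_for` are hypothetical and not part of AgentSandbox.

```ruby
# Hypothetical stand-alone copy of the two prompt constants, for illustration only.
module PromptSketch
  FOCUS_PROMPT = lambda { |focus|
    "Read this image. Focus on: #{focus}. Return structured plain " \
      "text. Quote exact numbers and labels as they appear. If " \
      "something isn't visible, say so instead of guessing."
  }
  GENERAL_PROMPT =
    "Describe this image. List every product, price, heading, and " \
    "notable text you see. Be exact with numbers and labels."

  # Returns the focus prompt for a non-empty focus string,
  # otherwise the general prompt (nil and "" are treated alike).
  def self.prompt_for(focus)
    focus && !focus.empty? ? FOCUS_PROMPT.call(focus) : GENERAL_PROMPT
  end
end

puts PromptSketch.prompt_for("the total price")
puts PromptSketch.prompt_for(nil)
```

Keeping the focus prompt as a lambda lets the focus text be interpolated at call time, while the general prompt can stay a plain string.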

Class Method Summary

Class Method Details

.read_image_bytes(bytes, extension:, focus:, vision_model:) ⇒ Object



# File 'lib/agent_sandbox/browser_tools.rb', line 96

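def self.read_image_bytes(bytes, extension:, focus:, vision_model:)
  # Relies on `require "tempfile"` and the ruby_llm gem being loaded by the enclosing file.
  tmp = Tempfile.new(["agent-vision", ".#{extension}"])
  begin
    # Write in binary mode so the image bytes are not transcoded, then close to flush to disk.
    tmp.binmode
    tmp.write(bytes)
    tmp.close
    # A nil or empty focus falls back to the general description prompt.
    prompt = focus && !focus.empty? ? DEFAULT_FOCUS_PROMPT.call(focus) : DEFAULT_GENERAL_PROMPT
    chat = RubyLLM.chat(model: vision_model)
    reply = chat.ask(prompt, with: tmp.path)
    reply.content
  ensure
    # close! closes and unlinks the tempfile; errors from cleanup itself are ignored.
    tmp.close! rescue nil
  end
end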
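
The tempfile lifecycle used above (write, close, hand the path to a consumer, unlink in `ensure`) can be tried without a vision model. The sketch below is illustrative only: `with_image_tempfile` is a hypothetical helper, not part of AgentSandbox, and the block stands in for the `RubyLLM` call.

```ruby
require "tempfile"

# Hypothetical helper (not part of AgentSandbox): writes bytes to a tempfile,
# yields its path, and removes the file when the block finishes or raises.
def with_image_tempfile(bytes, extension:)
  tmp = Tempfile.new(["agent-vision", ".#{extension}"])
  begin
    tmp.binmode
    tmp.write(bytes)
    tmp.close          # flush to disk; the file stays until close! below
    yield tmp.path
  ensure
    tmp.close!         # close (if still open) and unlink the file
  end
end

seen_path = nil
size_inside = with_image_tempfile("\x89PNG".b, extension: "png") do |path|
  seen_path = path
  File.size(path)      # a real caller would pass `path` to the vision model here
end

puts size_inside             # => 4
puts File.exist?(seen_path)  # => false
```

Closing the file before handing over the path ensures the bytes are flushed, and doing the unlink in `ensure` means the tempfile is removed even if the consumer raises.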