Module: AgentSandbox::BrowserTools::VisionSupport
Defined in: lib/agent_sandbox/browser_tools.rb
Overview
Mixin that downloads image bytes out of the sandbox into a host tempfile, runs a multimodal sub-call on the image, and deletes the tempfile immediately afterwards. It keeps no global state: each call is self-contained.
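The tempfile lifecycle described above can be sketched without RubyLLM or network access. This is only an illustration: `read_image_bytes_sketch` and its `ask:` parameter are hypothetical names, and the injected callable stands in for the real model call. The point it shows is that the file exists while the sub-call runs and is removed afterwards, even when the sub-call raises.

```ruby
require "tempfile"

# Hypothetical variant of read_image_bytes (not part of AgentSandbox): the
# model call is replaced by an injected callable so the sketch runs offline.
def read_image_bytes_sketch(bytes, extension:, ask:)
  tmp = Tempfile.new(["agent-vision", ".#{extension}"])
  tmp.binmode              # image data is binary; skip newline/encoding translation
  tmp.write(bytes)
  tmp.close                # flush to disk so a reader that opens the path sees all bytes
  begin
    ask.call(tmp.path)     # stand-in for the multimodal sub-call
  ensure
    tmp.close! rescue nil  # best-effort close and unlink; never masks the original error
  end
end

# Happy path: the callable sees a real file on the host.
size = read_image_bytes_sketch("fake-bytes", extension: "png",
                               ask: ->(path) { File.size(path) })
puts "bytes visible to the sub-call: #{size}"

# Failure path: the error propagates, but the tempfile is still removed.
seen_path = nil
begin
  read_image_bytes_sketch("fake-bytes", extension: "png",
                          ask: ->(path) { seen_path = path; raise "model unavailable" })
rescue RuntimeError => e
  puts "#{e.message}; tempfile still exists? #{File.exist?(seen_path)}"
end
```

Because every call creates and deletes its own tempfile, two calls never share a path, which is what makes each call self-contained.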
Constant Summary
- DEFAULT_FOCUS_PROMPT =
lambda { |focus|
  "Read this image. Focus on: #{focus}. Return structured plain " \
  "text. Quote exact numbers and labels as they appear. If " \
  "something isn't visible, say so instead of guessing."
}
- DEFAULT_GENERAL_PROMPT =
"Describe this image. List every product, price, heading, and " \
"notable text you see. Be exact with numbers and labels."
Class Method Summary
Class Method Details
.read_image_bytes(bytes, extension:, focus:, vision_model:) ⇒ Object
# File 'lib/agent_sandbox/browser_tools.rb', line 96

def self.read_image_bytes(bytes, extension:, focus:, vision_model:)
  tmp = Tempfile.new(["agent-vision", ".#{extension}"])
  tmp.binmode
  tmp.write(bytes)
  tmp.close
  begin
    prompt = focus && !focus.empty? ? DEFAULT_FOCUS_PROMPT.call(focus) : DEFAULT_GENERAL_PROMPT
    chat = RubyLLM.chat(model: vision_model)
    reply = chat.ask(prompt, with: tmp.path)
    reply.content
  ensure
    tmp.close! rescue nil
  end
end
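The prompt selection on the first line of the `begin` block can be tried in isolation. The sketch below copies the two constants from the Constant Summary into a hypothetical `PromptSketch` module and wraps the ternary in a hypothetical `prompt_for` helper; neither name exists in the library.

```ruby
# Hypothetical module for illustration; the constants are copied from the
# Constant Summary, the helper mirrors the ternary in read_image_bytes.
module PromptSketch
  DEFAULT_FOCUS_PROMPT = lambda { |focus|
    "Read this image. Focus on: #{focus}. Return structured plain " \
    "text. Quote exact numbers and labels as they appear. If " \
    "something isn't visible, say so instead of guessing."
  }

  DEFAULT_GENERAL_PROMPT =
    "Describe this image. List every product, price, heading, and " \
    "notable text you see. Be exact with numbers and labels."

  # A nil or empty focus selects the general prompt; anything else is
  # interpolated into the focus prompt.
  def self.prompt_for(focus)
    focus && !focus.empty? ? DEFAULT_FOCUS_PROMPT.call(focus) : DEFAULT_GENERAL_PROMPT
  end
end

puts PromptSketch.prompt_for("the price table")
puts PromptSketch.prompt_for(nil)
```

Note that the check is `empty?`, not a blank check, so a whitespace-only focus such as `"  "` still selects the focus prompt. Callers that may pass padded input might want to `strip` it first.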