Class: AgentSandbox::BrowserTools::Screenshot

Inherits:
Base
  • Object
Defined in:
lib/agent_sandbox/browser_tools.rb

Overview

Captures a screenshot of the current viewport and “reads” it by sending a sub-request to a multimodal model. The description is returned as text, so the main tool loop is not constrained by OpenAI’s rule that only role:user messages may contain images.
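
The sub-request might look roughly like the sketch below. It assumes an OpenAI-style chat payload in which the image travels as a base64 data URL inside a role:user message. `build_vision_message` and its prompt text are hypothetical names used for illustration, not part of AgentSandbox, and the actual internals of VisionSupport.read_image_bytes may differ.

```ruby
# Hypothetical helper (not part of AgentSandbox): wraps raw image bytes in a
# role:user message, the role the overview says is allowed to carry images.
def build_vision_message(bytes, extension: "png", focus: nil)
  prompt = "Describe this screenshot."
  prompt += " Focus on: #{focus}" if focus

  # pack("m0") is strict base64 with no line breaks
  # (same output as Base64.strict_encode64).
  encoded = [bytes].pack("m0")
  data_url = "data:image/#{extension};base64,#{encoded}"

  {
    role: "user",
    content: [
      { type: "text", text: prompt },
      { type: "image_url", image_url: { url: data_url } }
    ]
  }
end

# Stand-in bytes; a real call would pass the PNG read from the sandbox.
message = build_vision_message("\x89PNG".b, focus: "the login form")
```

Only the model's text reply would then flow back into the main conversation, which is why the tool result itself never needs to contain an image.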

Instance Method Summary

Methods inherited from Base

#run_ab

Constructor Details

#initialize(sandbox, vision_model:) ⇒ Screenshot

Returns a new instance of Screenshot.



# File 'lib/agent_sandbox/browser_tools.rb', line 224

def initialize(sandbox, vision_model:)
  @vision_model = vision_model
  super(sandbox)
end

Instance Method Details

#execute(focus: nil) ⇒ Object

Captures the viewport to a temporary PNG inside the sandbox, reads the bytes back, deletes the file, and asks the vision model for a description. Returns a Hash with :description, :bytes (the size of the PNG in bytes), and :vision_model. If the screenshot command returns a Hash containing :error, that Hash is returned unchanged. The optional focus argument is passed through to the vision request.



# File 'lib/agent_sandbox/browser_tools.rb', line 229

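def execute(focus: nil)
  # Timestamp-derived file name for the temporary PNG inside the sandbox.
  sandbox_path = "/tmp/agent-shot-#{Time.now.to_f.to_s.tr('.', '')}.png"
  data = run_ab(["screenshot", sandbox_path, "--json"])
  # Pass an error result straight through to the caller.
  return data if data.is_a?(Hash) && data[:error]

  # Read the PNG back, then remove the temporary file from the sandbox.
  bytes = @sandbox.read_file(sandbox_path)
  @sandbox.exec("rm -f #{Shellwords.escape(sandbox_path)}")

  # Ask the vision model for a text description, optionally guided by focus.
  description = VisionSupport.read_image_bytes(
    bytes, extension: "png", focus: focus, vision_model: @vision_model
  )
  { description: description, bytes: bytes.bytesize, vision_model: @vision_model }
end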
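
A caller might flatten the result into plain text for a tool message along the lines below. This is a sketch under assumptions: `tool_message_text` is a hypothetical helper, not part of AgentSandbox, and the error shape (a Hash with an :error key) is inferred from the early return in #execute. The sample values are stand-ins.

```ruby
# Hypothetical helper (not part of AgentSandbox): turns the Hash returned by
# Screenshot#execute into text that fits in an ordinary tool message.
def tool_message_text(result)
  # Mirrors the early return in #execute: failures carry an :error key.
  return "screenshot failed: #{result[:error]}" if result[:error]

  "#{result[:description]} (#{result[:bytes]} bytes, read by #{result[:vision_model]})"
end

# Stand-in results; a real one would come from Screenshot#execute(focus: ...).
ok = {
  description: "A login form with two fields.",
  bytes: 20_480,
  vision_model: "example-vision-model"
}
failed = { error: "browser not running" }

puts tool_message_text(ok)
puts tool_message_text(failed)
```

Checking :error first keeps the failure path explicit, since both outcomes arrive as a Hash.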