Module: Pikuri::Tool::WebScrape

Defined in:
lib/pikuri/tool/web_scrape.rb

Overview

Truncation policy and Tool spec for the web_scrape tool. The actual scraping lives in Scraper::Simple; this module is a thin wrapper that picks the scraper, applies a character cap so the LLM doesn’t drown in long-form content, and exposes the result to the agent loop in OpenAI tool-call shape.
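For orientation, a sketch of what the exposed spec plausibly looks like in OpenAI function-calling terms. The field wording and schema details here are assumptions for illustration; only the url and max_chars parameters come from the documentation below.

# Hypothetical tool spec in OpenAI function-call shape; descriptions are
# assumptions, but url/max_chars mirror the documented visit signature.
WEB_SCRAPE_SPEC = {
  type: "function",
  function: {
    name: "web_scrape",
    description: "Fetch a page and return its readable content as Markdown.",
    parameters: {
      type: "object",
      properties: {
        url: { type: "string", description: "absolute HTTP(S) URL to fetch" },
        max_chars: { type: "integer", description: "character cap on the returned Markdown" }
      },
      required: ["url"]
    }
  }
}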

Constant Summary

DEFAULT_MAX_CHARS =

Default character cap on the Markdown returned by visit. Sized to cover most post-readability article bodies in full on the first call, so the LLM doesn’t have to re-request with a larger cap and pollute its context with the same prefix twice. ~5K tokens at the typical char/token ratio; light even for small local models. The genuinely long pages (long Wikipedia entries, multi-section docs) still get cut, and the truncation marker invites a deliberate larger visit call when needed.

Returns:

  • (Integer)

20_000
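The cap-to-token estimate above follows the common heuristic of roughly four characters per English token; the ratio is an assumption, not something the module measures.

# Back-of-the-envelope check, assuming ~4 chars per token:
Pikuri::Tool::WebScrape::DEFAULT_MAX_CHARS / 4 # => 5_000, i.e. ~5K tokens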
MAX_MAX_CHARS =

Hard ceiling on the max_chars argument to visit. Requests above this are clamped silently so the LLM cannot dump an arbitrarily large page into the conversation.

Returns:

  • (Integer)

100_000
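Since visit clamps with Ruby’s Integer#clamp (see its source below), out-of-range requests in either direction are pulled back to the bounds. Illustrative values:

150_000.clamp(1, Pikuri::Tool::WebScrape::MAX_MAX_CHARS) # => 100_000
0.clamp(1, Pikuri::Tool::WebScrape::MAX_MAX_CHARS)       # => 1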
CACHE =

On-disk cache used by visit to memoize fetched pages. Exposed through the cache class method so specs can swap it for an isolated cache or UrlCache::NULL without touching the shared instance.

Returns:

  • (UrlCache)

UrlCache.new(ttl: UrlCache::DEFAULT_TTL, dir: "#{UrlCache::ROOT_DIR}/web_scrape")

Class Method Summary

  • .cache ⇒ Object

  • .truncate(markdown, max_chars) ⇒ String

    Cut markdown to at most max_chars characters, appending a marker describing the original length when truncation actually happens.

  • .visit(url, max_chars: DEFAULT_MAX_CHARS) ⇒ String

    Fetch url via Scraper::Simple and truncate the rendered Markdown to max_chars characters.

Class Method Details

.cache ⇒ Object

Accessor for the shared CACHE instance. Specs stub this method to swap in an isolated cache or UrlCache::NULL without touching the constant.

# File 'lib/pikuri/tool/web_scrape.rb', line 32

def self.cache
  CACHE
end
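A minimal sketch of the swap described under CACHE, assuming an RSpec-style spec; the stubbing framework is an assumption, while UrlCache::NULL comes from the documentation above.

# Route the tool at the null cache for an isolated spec (RSpec sketch):
allow(Pikuri::Tool::WebScrape).to receive(:cache).and_return(UrlCache::NULL)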

.truncate(markdown, max_chars) ⇒ String

Cut markdown to at most max_chars characters, appending a marker describing the original length when truncation actually happens. Returns markdown unchanged if it already fits.

Parameters:

  • markdown (String)

    full Markdown text

  • max_chars (Integer)

    character cap; assumed already clamped

Returns:

  • (String)

    markdown unchanged if it already fits, otherwise the first max_chars characters plus a marker noting the original length

# File 'lib/pikuri/tool/web_scrape.rb', line 77

def self.truncate(markdown, max_chars)
  return markdown if markdown.length <= max_chars

  "#{markdown[0, max_chars]}\n\n" \
    "... [truncated at #{max_chars} of #{markdown.length} chars; " \
    'call again with a larger `max_chars` to see more]'
end
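An illustrative call; the marker wording comes straight from the implementation above.

long = "x" * 25_000
Pikuri::Tool::WebScrape.truncate(long, 20_000)
# => first 20_000 chars, then "... [truncated at 20000 of 25000 chars;
#    call again with a larger `max_chars` to see more]"
Pikuri::Tool::WebScrape.truncate("short", 20_000)
# => "short" (unchanged, already within the cap)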

.visit(url, max_chars: DEFAULT_MAX_CHARS) ⇒ String

Fetch url via Scraper::Simple and truncate the rendered Markdown to max_chars characters.

The full extracted Markdown is cached on disk via cache, keyed by URL, so repeat visits within the cache TTL skip the network and the extraction pass entirely. max_chars is not part of the cache key — different values for the same URL share one entry, and truncation runs after the cache lookup.

Scraper::FetchError (HTTP non-2xx, network failure, redirect loop, missing Location header) is caught and returned as “Error: …” in the calculator-style convention so the agent loop feeds the failure back to the model as the next observation instead of crashing; the LLM can then try a different URL or search again. The rescue lives outside cache's fetch block, so failure strings are never persisted: a retry on the next call hits the network again. Other exceptions (parser bugs in our own code) bubble up unchanged.

Parameters:

  • url (String)

    absolute HTTP(S) URL of the page to download

  • max_chars (Integer) (defaults to: DEFAULT_MAX_CHARS)

character cap on the returned Markdown. Clamped to [1, MAX_MAX_CHARS]; defaults to DEFAULT_MAX_CHARS. When the full page exceeds the cap, output is cut and a marker noting the original length is appended.

Returns:

  • (String)

    Markdown representation of the page, possibly truncated, or “Error: …” on a recoverable fetch failure



# File 'lib/pikuri/tool/web_scrape.rb', line 62

def self.visit(url, max_chars: DEFAULT_MAX_CHARS)
  max_chars = max_chars.clamp(1, MAX_MAX_CHARS)
  markdown = cache.fetch(url) { Scraper::Simple.visit(url) }
  truncate(markdown, max_chars)
rescue Scraper::FetchError => e
  "Error: #{e.message}"
end
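An end-to-end sketch; the URLs are illustrative, and the error string carries whatever message the raised Scraper::FetchError had.

page = Pikuri::Tool::WebScrape.visit("https://example.com/article", max_chars: 2_000)
# => up to 2_000 chars of extracted Markdown, plus the truncation marker
#    when the article runs longer

Pikuri::Tool::WebScrape.visit("https://example.invalid/")
# => "Error: ..." (fed back to the model as the next observation)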