Module: Pikuri::Tool::Fetch
- Defined in:
  lib/pikuri/tool/fetch.rb
Overview
Truncation policy and Tool spec for the fetch tool. The HTTP work lives in Scraper::Simple.fetch; this module is a thin wrapper that accepts only textual content-types, applies a character cap so the LLM doesn’t drown in long-form bodies, and exposes the result to the agent loop in OpenAI tool-call shape.
Sister of WebScrape, but without HTML→Markdown or PDF→text extraction: bodies are returned verbatim. Useful for raw textual data — JSON APIs, CSV files, robots.txt, sitemaps, source files — where any rendering pass would corrupt the payload.
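For orientation, "OpenAI tool-call shape" here means the chat-completions tool-result message. A minimal sketch of that shape (the field names follow the public chat-completions format; the id and content values are illustrative, not Pikuri's actual output):

# Sketch only: the general tool-result message the agent loop feeds back
# to the model. The id and content values are placeholders.
{
  role: 'tool',
  tool_call_id: 'call_abc123',   # id echoed from the model's tool call
  content: '{"items": []}'       # fetch's (possibly truncated) body
}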
Constant Summary
- DEFAULT_MAX_CHARS =
  Returns default character cap on the body returned by fetch. Smaller than WebScrape::DEFAULT_MAX_CHARS because fetch’s content profile is bimodal — most JSON/XML/CSV responses are tiny, and the long tail (large data dumps) is better re-requested deliberately than padded into every default.

    5_000

- MAX_MAX_CHARS =
  Returns hard ceiling on the max_chars argument to fetch. Matches WebScrape::MAX_MAX_CHARS.

    100_000

- TEXTUAL_APPLICATION_TYPES =
  Application content-types that are textual in practice and so safe to return verbatim to the LLM, despite their application/ prefix making them fail the text/* check. Anything outside text/* and this allowlist is refused.

    %w[
      application/json
      application/xml
      application/javascript
      application/xhtml+xml
      application/rss+xml
      application/atom+xml
    ].freeze
- CACHE =
  On-disk cache used by fetch to memoize downloads. Defined as a method so specs can swap it for an isolated cache or UrlCache::NULL without touching the shared instance. Lives in its own subdir under UrlCache::ROOT_DIR so a fetch on a URL and a web_scrape on the same URL cannot collide on the same cache file (one returns the raw body, the other returns extracted Markdown).

    UrlCache.new(ttl: UrlCache::DEFAULT_TTL, dir: "#{UrlCache::ROOT_DIR}/fetch")
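A spec sketch of that swap, assuming RSpec (UrlCache::NULL is the no-op cache named above; the example body is illustrative):

RSpec.describe Pikuri::Tool::Fetch do
  before do
    # Point the tool at a no-op cache so specs never touch the shared
    # on-disk instance under UrlCache::ROOT_DIR.
    allow(described_class).to receive(:cache).and_return(UrlCache::NULL)
  end

  it 'downloads on every call when the cache is nulled' do
    # ...assert against a stubbed Scraper::Simple.fetch here
  end
end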
Class Method Summary
- .cache ⇒ Object
- .download(url) ⇒ String
  GET url and verify the response’s content-type is textual.
- .fetch(url, max_chars: DEFAULT_MAX_CHARS) ⇒ String
  Download url via Scraper::Simple.fetch and return the response body verbatim, provided the content-type is one we deem textual (any text/*, plus the formats listed in TEXTUAL_APPLICATION_TYPES).
- .textual?(content_type) ⇒ Boolean
  True when the content-type is text/* or one of TEXTUAL_APPLICATION_TYPES.
- .truncate(body, max_chars) ⇒ String
  Cut body to at most max_chars characters, appending a marker describing the original length when truncation actually happens.
Class Method Details
.cache ⇒ Object
# File 'lib/pikuri/tool/fetch.rb', line 51

def self.cache
  CACHE
end
.download(url) ⇒ String
GET url and verify the response’s content-type is textual. Caller is responsible for caching and truncation; this method always hits the network.
# File 'lib/pikuri/tool/fetch.rb', line 98

def self.download(url)
  fetched = Scraper::Simple.fetch(url)
  return fetched.body if textual?(fetched.content_type)

  raise Scraper::FetchError,
        "refused to fetch #{url}: content-type #{fetched.content_type.inspect} " \
        'is not textual (use web_scrape for PDFs or rendered pages)'
end
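A usage sketch (URLs and bodies are placeholders):

Pikuri::Tool::Fetch.download('https://example.com/data.json')
# => '{"items": []}'   (body returned verbatim; no caching, no truncation)

Pikuri::Tool::Fetch.download('https://example.com/report.pdf')
# raises Scraper::FetchError: refused to fetch ...: content-type
# "application/pdf" is not textual (use web_scrape for PDFs or rendered pages)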
.fetch(url, max_chars: DEFAULT_MAX_CHARS) ⇒ String
Download url via Scraper::Simple.fetch and return the response body verbatim, provided the content-type is one we deem textual (any text/*, plus the formats listed in TEXTUAL_APPLICATION_TYPES). Anything else — PDFs, images, other binaries — produces an “Error: …” string in the calculator-style convention so the agent loop feeds the failure back to the model as the next observation.
The body is cached on disk via cache, keyed by URL, so repeat fetches within the cache TTL skip the network. max_chars is not part of the cache key — different values for the same URL share one entry, and truncation runs after the cache lookup. The cache is only populated on success: Scraper::FetchError (HTTP non-2xx, network failure, redirect-loop exhaustion, refused content-type) is caught outside the cache.fetch block, so failure strings are never persisted and a retry on the next call hits the network again. Other exceptions (parser bugs in our own code) bubble up unchanged.
# File 'lib/pikuri/tool/fetch.rb', line 81

def self.fetch(url, max_chars: DEFAULT_MAX_CHARS)
  max_chars = max_chars.clamp(1, MAX_MAX_CHARS)
  body = cache.fetch(url) { download(url) }
  truncate(body, max_chars)
rescue Scraper::FetchError => e
  "Error: #{e.message}"
end
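A usage sketch illustrating the caching and error semantics above (URLs are placeholders):

Pikuri::Tool::Fetch.fetch('https://example.com/feed.json')
# first call within the TTL downloads and caches the full body

Pikuri::Tool::Fetch.fetch('https://example.com/feed.json', max_chars: 200)
# served from the same cache entry: max_chars is not part of the key,
# it only changes the truncation applied after the lookup

Pikuri::Tool::Fetch.fetch('https://example.com/logo.png')
# => "Error: refused to fetch ..." (returned as a string, not raised,
#    and never written to the cache)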
.textual?(content_type) ⇒ Boolean
Returns true when the content-type is text/* or one of TEXTUAL_APPLICATION_TYPES.
# File 'lib/pikuri/tool/fetch.rb', line 111

def self.textual?(content_type)
  content_type.start_with?('text/') ||
    TEXTUAL_APPLICATION_TYPES.include?(content_type)
end
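For example:

Pikuri::Tool::Fetch.textual?('text/csv')          # => true  (any text/*)
Pikuri::Tool::Fetch.textual?('application/json')  # => true  (allowlisted)
Pikuri::Tool::Fetch.textual?('application/pdf')   # => false (refused)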
.truncate(body, max_chars) ⇒ String
Cut body to at most max_chars characters, appending a marker describing the original length when truncation actually happens. Returns body unchanged if it already fits. Same shape as WebScrape.truncate so the LLM sees a consistent truncation marker across both tools.
# File 'lib/pikuri/tool/fetch.rb', line 125

def self.truncate(body, max_chars)
  return body if body.length <= max_chars

  "#{body[0, max_chars]}\n\n" \
  "... [truncated at #{max_chars} of #{body.length} chars; " \
  'call again with a larger `max_chars` to see more]'
end
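For example, a 10-character body capped at 4:

Pikuri::Tool::Fetch.truncate('a' * 10, 4)
# => "aaaa\n\n... [truncated at 4 of 10 chars; call again with a larger `max_chars` to see more]"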