Module: Pikuri::Tool::Fetch

Defined in:
lib/pikuri/tool/fetch.rb

Overview

Truncation policy and Tool spec for the fetch tool. The HTTP work lives in Scraper::Simple.fetch; this module is a thin wrapper that accepts only textual content-types, applies a character cap so the LLM doesn’t drown in long-form bodies, and exposes the result to the agent loop in OpenAI tool-call shape.

Sister of WebScrape, but without HTML→Markdown or PDF→text extraction: bodies are returned verbatim. Useful for raw textual data — JSON APIs, CSV files, robots.txt, sitemaps, source files — where any rendering pass would corrupt the payload.

Constant Summary

DEFAULT_MAX_CHARS =

Returns default character cap on the body returned by fetch. Smaller than WebScrape::DEFAULT_MAX_CHARS because fetch’s content profile is bimodal — most JSON/XML/CSV responses are tiny, and the long-tail (large data dumps) is better re-requested deliberately than padded into every default.

Returns:

  • (Integer)


5_000
MAX_MAX_CHARS =

Returns hard ceiling on the max_chars argument to fetch. Matches WebScrape::MAX_MAX_CHARS.

Returns:

  • (Integer)

100_000
TEXTUAL_APPLICATION_TYPES =

Application content-types that are textual in practice and so safe to return verbatim to the LLM, despite their application/ prefix making them fail the text/* check. Anything outside text/* and this allowlist is refused.

Returns:

  • (Array<String>)
%w[
  application/json
  application/xml
  application/javascript
  application/xhtml+xml
  application/rss+xml
  application/atom+xml
].freeze
CACHE =

On-disk cache used by fetch to memoize downloads. Defined as a method so specs can swap it for an isolated cache or UrlCache::NULL without touching the shared instance. Lives in its own subdir under UrlCache::ROOT_DIR so a fetch on a URL and a web_scrape on the same URL cannot collide on the same cache file (one returns the raw body, the other returns extracted Markdown).

Returns:

UrlCache.new(ttl: UrlCache::DEFAULT_TTL, dir: "#{UrlCache::ROOT_DIR}/fetch")
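The method indirection mentioned above can be sketched standalone. `FakeFetch` and the Hash-backed caches below are illustrative stand-ins for the real module and `UrlCache`, not the actual implementation:

```ruby
# Sketch: exposing the cache through a reader method lets a spec swap it
# per-test without reassigning the CACHE constant (which Ruby warns about).
module FakeFetch
  CACHE = {} # stands in for the shared UrlCache instance

  def self.cache
    CACHE
  end
end

# Spec-style swap: redefine the reader to return an isolated cache.
isolated = {}
FakeFetch.singleton_class.define_method(:cache) { isolated }

FakeFetch.cache["https://example.com"] = "body"
FakeFetch::CACHE.empty? # => true, the shared instance was never touched
```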

Class Method Summary

Class Method Details

.cache ⇒ Object



# File 'lib/pikuri/tool/fetch.rb', line 51

def self.cache
  CACHE
end

.download(url) ⇒ String

GET url and verify the response’s content-type is textual. Caller is responsible for caching and truncation; this method always hits the network.

Parameters:

  • url (String)

Returns:

  • (String)

    response body

Raises:

  • (Scraper::FetchError)

    on HTTP non-2xx, network failure, redirect-loop exhaustion, missing Location on a 3xx, or a non-textual content-type



# File 'lib/pikuri/tool/fetch.rb', line 98

def self.download(url)
  fetched = Scraper::Simple.fetch(url)
  return fetched.body if textual?(fetched.content_type)

  raise Scraper::FetchError,
        "refused to fetch #{url}: content-type #{fetched.content_type.inspect} " \
        'is not textual (use web_scrape for PDFs or rendered pages)'
end

.fetch(url, max_chars: DEFAULT_MAX_CHARS) ⇒ String

Download url via Scraper::Simple.fetch and return the response body verbatim, provided the content-type is one we deem textual (any text/*, plus the formats listed in TEXTUAL_APPLICATION_TYPES). Anything else — PDFs, images, other binaries — produces an “Error: …” string in the calculator-style convention, so the agent loop feeds the failure back to the model as the next observation.

The body is cached on disk via cache, keyed by URL, so repeat fetches within the cache TTL skip the network. max_chars is not part of the cache key — different values for the same URL share one entry, and truncation runs after the cache lookup. The cache is only populated on success: Scraper::FetchError (HTTP non-2xx, network failure, redirect-loop exhaustion, refused content-type) is caught outside the cache.fetch block, so failure strings are never persisted and a retry on the next call hits the network again. Other exceptions (parser bugs in our own code) bubble up unchanged.

Parameters:

  • url (String)

    absolute HTTP(S) URL to download

  • max_chars (Integer) (defaults to: DEFAULT_MAX_CHARS)

    character cap on the returned body. Clamped to [1, MAX_MAX_CHARS]; defaults to DEFAULT_MAX_CHARS. When the body exceeds the cap, output is cut and a marker noting the original length is appended.

Returns:

  • (String)

    response body, possibly truncated, or “Error: …” on a recoverable failure



# File 'lib/pikuri/tool/fetch.rb', line 81

def self.fetch(url, max_chars: DEFAULT_MAX_CHARS)
  max_chars = max_chars.clamp(1, MAX_MAX_CHARS)
  body = cache.fetch(url) { download(url) }
  truncate(body, max_chars)
rescue Scraper::FetchError => e
  "Error: #{e.message}"
end
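The success-only caching described above can be illustrated with a minimal stand-in. The `MEMO` Hash and `FakeFetchError` below are illustrative, not the real UrlCache or Scraper::FetchError:

```ruby
# Sketch of fetch's error handling: the rescue sits outside the cache
# lookup, so a failing download never populates the cache and the next
# call retries the network.
class FakeFetchError < StandardError; end

MEMO = {}

def cache_fetch(url)
  MEMO.fetch(url) { MEMO[url] = yield } # memoize only on success
end

def fetch(url, attempts)
  cache_fetch(url) do
    attempts << url
    raise FakeFetchError, 'boom' if attempts.size == 1 # first try fails
    'body'
  end
rescue FakeFetchError => e
  "Error: #{e.message}"
end

attempts = []
fetch('https://example.com/a.json', attempts) # => "Error: boom", nothing cached
fetch('https://example.com/a.json', attempts) # => "body", now cached
fetch('https://example.com/a.json', attempts) # cache hit, no network attempt
attempts.size # => 2
```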

.textual?(content_type) ⇒ Boolean

Returns true when the content-type is text/* or one of TEXTUAL_APPLICATION_TYPES.

Parameters:

  • content_type (String)

    normalized content-type (no charset parameter, lowercased) as produced by Scraper::Simple.fetch

Returns:

  • (Boolean)



# File 'lib/pikuri/tool/fetch.rb', line 111

def self.textual?(content_type)
  content_type.start_with?('text/') ||
    TEXTUAL_APPLICATION_TYPES.include?(content_type)
end
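The predicate can be exercised standalone; the allowlist below is copied from TEXTUAL_APPLICATION_TYPES above, and the content-type is assumed already normalized (lowercased, charset stripped) as the parameter doc states:

```ruby
# Standalone replica of the textual? check: text/* passes
# unconditionally; application/* passes only via the allowlist.
TEXTUAL_APPLICATION_TYPES = %w[
  application/json
  application/xml
  application/javascript
  application/xhtml+xml
  application/rss+xml
  application/atom+xml
].freeze

def textual?(content_type)
  content_type.start_with?('text/') ||
    TEXTUAL_APPLICATION_TYPES.include?(content_type)
end

textual?('text/csv')         # => true  (any text/* passes)
textual?('application/json') # => true  (allowlisted)
textual?('application/pdf')  # => false (refused; use web_scrape)
```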

.truncate(body, max_chars) ⇒ String

Cut body to at most max_chars characters, appending a marker describing the original length when truncation actually happens. Returns body unchanged if it already fits. Same shape as WebScrape.truncate so the LLM sees a consistent truncation marker across both tools.

Parameters:

  • body (String)

    full response body

  • max_chars (Integer)

    character cap; assumed already clamped

Returns:

  • (String)

    body cut to max_chars with the marker appended, or body unchanged if it fits


# File 'lib/pikuri/tool/fetch.rb', line 125

def self.truncate(body, max_chars)
  return body if body.length <= max_chars

  "#{body[0, max_chars]}\n\n" \
    "... [truncated at #{max_chars} of #{body.length} chars; " \
    'call again with a larger `max_chars` to see more]'
end
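Run standalone (the method body is copied verbatim from above), the marker shape looks like this:

```ruby
def truncate(body, max_chars)
  return body if body.length <= max_chars

  "#{body[0, max_chars]}\n\n" \
    "... [truncated at #{max_chars} of #{body.length} chars; " \
    'call again with a larger `max_chars` to see more]'
end

truncate('short', 100) # => "short" (fits, returned unchanged)
truncate('a' * 10, 4)
# => "aaaa\n\n... [truncated at 4 of 10 chars; call again with a larger `max_chars` to see more]"
```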