Module: Pikuri::Tool::Scraper

Defined in:
lib/pikuri/tool/scraper.rb

Overview

HTTP side of the web tools (WEB_SCRAPE and FETCH): GET the URL with a real-browser User-Agent, follow redirects, and hand the response body to Extractor.extract with the response’s Content-Type as the hint. HTML/XHTML render via Extractor::HTML, any other text/* type passes through verbatim, and plug-in extractors extend the set. With pikuri-pdf registered, application/pdf is extracted whether it is identified by header or by the %PDF- magic bytes, so a PDF served under a lying header still works. The remaining types raise FetchError so the LLM observes the failure instead of receiving an empty rendering.

Split into a thin HTTP fetch (Scraper.fetch) and the extraction wrapper (Scraper.visit) so tests can drive each piece in isolation and Fetch can reuse the HTTP half without the extraction pass. Nothing here knows about the LLM; the tools that wrap this module own caching and truncation and turn rendered Markdown (or FetchError) into the next observation.
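
As a rough illustration of the division of labor described above, the sketch below shows how a wrapping tool might turn either outcome into an observation string. The names observation_for and SketchFetchError and the max_chars cap are hypothetical and not part of Pikuri; the block stands in for Scraper.visit.

```ruby
# Hypothetical stand-in for Pikuri::Tool::Scraper::FetchError.
class SketchFetchError < StandardError; end

# Hypothetical tool wrapper. The block plays the role of Scraper.visit;
# the 2_000-character cap is an arbitrary illustrative limit.
def observation_for(url, max_chars: 2_000, &visit)
  markdown = visit.call(url)
  # Truncation lives in the wrapper, not in the scraper.
  markdown.length > max_chars ? "#{markdown[0, max_chars]}\n[truncated]" : markdown
rescue SketchFetchError => e
  # A failure becomes the observation instead of crashing the tool.
  "Error: #{e.message}"
end

puts observation_for('https://example.com/', max_chars: 5) { |_url| 'abcdefgh' }
puts observation_for('https://example.com/') { |_url| raise SketchFetchError, 'HTTP 403' }
```

Keeping truncation and error formatting in the wrapper is what lets the scraper itself stay unaware of the LLM.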

Defined Under Namespace

Classes: FetchError, Fetched

Constant Summary

USER_AGENT =

User-Agent sent with each request; many sites reject requests with no UA or an obvious bot UA.

Returns:

  • (String)

'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' \
'(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
ACCEPT =

Accept header sent with each request, so servers that content-negotiate hand back something we can use: rendered HTML first, application/pdf for hosts with a PDF extractor registered, then any text/* for the verbatim pass-through arm.

Returns:

  • (String)

'text/html,application/xhtml+xml,application/pdf,text/*;q=0.8'
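
To make the preference order in this header concrete, here is a small sketch that splits an Accept value into media ranges and quality values. The parse_accept helper and the ACCEPT_SKETCH constant are hypothetical illustrations, not part of Pikuri; the default quality of 1.0 for entries without a q parameter follows the HTTP specification.

```ruby
# Copy of the header value shown above, under a sketch-only name.
ACCEPT_SKETCH = 'text/html,application/xhtml+xml,application/pdf,text/*;q=0.8'

# Hypothetical helper: returns [media_range, quality] pairs in header order.
def parse_accept(header)
  header.split(',').map do |entry|
    type, *params = entry.strip.split(';').map(&:strip)
    q = params.find { |param| param.start_with?('q=') }
    # An entry without an explicit q parameter has quality 1.0.
    [type, q ? q.delete_prefix('q=').to_f : 1.0]
  end
end

p parse_accept(ACCEPT_SKETCH)
# => [["text/html", 1.0], ["application/xhtml+xml", 1.0], ["application/pdf", 1.0], ["text/*", 0.8]]
```

Only text/* carries a reduced quality, so a negotiating server should prefer HTML, XHTML, or PDF when it can produce them.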
MAX_REDIRECTS =

Redirect budget for fetch before it gives up. The initial request counts against the budget, so at most MAX_REDIRECTS - 1 redirects are followed.

Returns:

  • (Integer)

5
OPEN_TIMEOUT =

Connect timeout in seconds for the underlying Faraday request.

Returns:

  • (Integer)

10
READ_TIMEOUT =

Read timeout in seconds for the underlying Faraday request.

Returns:

  • (Integer)

20
ERROR_BODY_EXCERPT =

Maximum number of characters of an error response body to include in a FetchError message. The body is often a multi-kilobyte HTML challenge page (Cloudflare, WAF interstitial, etc.); a short excerpt tells the LLM what kind of page came back without flooding the next observation.

Returns:

  • (Integer)

200
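
The excerpt helper that applies this limit is not shown on this page. A minimal sketch of what such a helper could look like, assuming it collapses whitespace runs and then cuts the text at the limit (excerpt_sketch and EXCERPT_LIMIT are hypothetical names):

```ruby
# Same value as ERROR_BODY_EXCERPT, under a sketch-only name.
EXCERPT_LIMIT = 200

# Hypothetical stand-in for the module's private excerpt helper.
def excerpt_sketch(body, limit = EXCERPT_LIMIT)
  # Collapse every whitespace run to one space, then trim the ends.
  text = body.to_s.gsub(/\s+/, ' ').strip
  # Keep at most `limit` characters.
  text[0, limit]
end

challenge = "<html>\n  <title>Just a moment...</title>\n" + ('<div>x</div>' * 500)
puts excerpt_sketch(challenge).length
```

Collapsing whitespace before truncating means the character budget is spent on visible text rather than on indentation.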

Class Method Details

.extract(fetched) ⇒ String

Render a Fetched response as Markdown via Extractor.extract, re-raising both extraction failure modes (Extractor::Unsupported and Extractor::Error) as FetchError, the single exception type the web tools rescue. The content-type is passed verbatim, including the empty string "" that stands for a missing header. An empty content-type matches no text arm: a body without transport metadata is refused, not sniffed. Only a strong magic sniff such as pikuri-pdf’s %PDF- overrides a wrong or missing header, because such a sniff never misfires on text.

Parameters:

  • fetched (Fetched)

    response to render, as returned by fetch

Returns:

  • (String)

    Markdown representation produced by the matched extractor

Raises:

  • (FetchError)

    when no extractor matches the response’s content-type, or when extraction fails



# File 'lib/pikuri/tool/scraper.rb', line 108

def self.extract(fetched)
  Pikuri::Extractor.extract(StringIO.new(fetched.body), content_type: fetched.content_type)
rescue Pikuri::Extractor::Unsupported
  raise FetchError, "unsupported content-type #{fetched.content_type.inspect} for #{fetched.url}"
rescue Pikuri::Extractor::Error => e
  raise FetchError, e.message
end
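
The error translation above can be tried in isolation with stand-in classes. In the sketch below, SketchExtractor, SketchFetchError, and sketch_extract are hypothetical; the toy dispatch only knows text/* pass-through, and the assumption that Unsupported is a subclass of Error is a guess based on the rescue order, not something this page confirms.

```ruby
require 'stringio'

module SketchExtractor
  class Error < StandardError; end
  class Unsupported < Error; end # assumption: subclass relationship

  # Toy dispatch: text/* passes through verbatim, everything else is refused.
  def self.extract(io, content_type:)
    raise Unsupported unless content_type.start_with?('text/')

    io.read
  end
end

class SketchFetchError < StandardError; end

# Mirrors the shape of Scraper.extract: two failure modes, one exception type.
def sketch_extract(body, content_type)
  SketchExtractor.extract(StringIO.new(body), content_type: content_type)
rescue SketchExtractor::Unsupported
  raise SketchFetchError, "unsupported content-type #{content_type.inspect}"
rescue SketchExtractor::Error => e
  raise SketchFetchError, e.message
end

puts sketch_extract("plain text\n", 'text/plain')
begin
  sketch_extract('%PNG...', '') # missing header arrives as ""
rescue SketchFetchError => e
  puts e.message
end
```

If the subclass assumption holds, the more specific rescue must come first, as it does here; otherwise the generic clause would swallow the unsupported case and lose the content-type from the message.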

.fetch(url, limit: MAX_REDIRECTS) ⇒ Fetched

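Download the body of url, following redirects manually within a budget of MAX_REDIRECTS requests. The initial request counts against the budget, so with the default limit at most MAX_REDIRECTS - 1 redirects are followed before FetchError is raised. Faraday is configured with no middleware, so behavior here mirrors the rest of the codebase (see Tool::Search::DuckDuckGo.search).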

All recoverable failures (HTTP 4xx/5xx, Faraday::Error network blips, an exhausted redirect budget, a 3xx without a Location header) surface as FetchError so the caller has a single exception type to rescue. Error bodies are trimmed to ERROR_BODY_EXCERPT characters with whitespace collapsed, so a Cloudflare-challenge response doesn’t dump kilobytes of inline HTML into the next LLM observation.

Parameters:

  • url (String)

    absolute HTTP(S) URL to fetch

  • limit (Integer) (defaults to: MAX_REDIRECTS)

    requests remaining, including the current one; recurses with limit - 1 on each 3xx

Returns:

  • (Fetched)

    body, normalized content-type, and final URL after redirects

Raises:

  • (FetchError)

    on non-2xx/3xx responses, network errors, redirect-loop exhaustion, or 3xx without a Location header
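
The normalize_content_type helper that produces the normalized content-type is not shown on this page. One plausible shape, offered purely as an assumption: drop parameters such as charset, lowercase the media type, and map a missing header to the empty string (the one behavior the documentation of extract does state).

```ruby
# Hypothetical stand-in for the private normalize_content_type helper.
# Assumption: parameters are dropped and the media type is lowercased.
def normalize_content_type_sketch(header)
  # nil.to_s is "", and "".split(';').first is nil, hence the second to_s.
  header.to_s.split(';').first.to_s.strip.downcase
end

p normalize_content_type_sketch('Text/HTML; charset=UTF-8') # => "text/html"
p normalize_content_type_sketch(nil)                        # => ""
```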



# File 'lib/pikuri/tool/scraper.rb', line 136

def self.fetch(url, limit: MAX_REDIRECTS)
  raise FetchError, "too many redirects fetching #{url}" if limit.zero?

  response = begin
    Faraday.new(request: { open_timeout: OPEN_TIMEOUT, timeout: READ_TIMEOUT }).get(url) do |req|
      req.headers['User-Agent'] = USER_AGENT
      req.headers['Accept']     = ACCEPT
    end
  rescue Faraday::Error => e
    raise FetchError, "#{e.class.name.split('::').last} fetching #{url}: #{e.message}"
  end

  case response.status
  when 200..299
    Fetched.new(body: response.body, content_type: normalize_content_type(response.headers['content-type']), url: url)
  when 300..399
    location = response.headers['location']
    raise FetchError, "HTTP #{response.status} from #{url} with no Location header" if location.nil? || location.empty?

    fetch(URI.join(url, location).to_s, limit: limit - 1)
  else
    raise FetchError, "HTTP #{response.status} fetching #{url}: #{excerpt(response.body)}"
  end
end
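
The redirect handling above can be exercised without a network by swapping the HTTP call for a table of canned responses. Everything named here (sketch_fetch, SKETCH_RESPONSES, SketchFetchError) is hypothetical; only the control flow and the URI.join resolution of the Location value mirror the method shown.

```ruby
require 'uri'

class SketchFetchError < StandardError; end

# Canned responses standing in for the network: url => [status, location or body].
SKETCH_RESPONSES = {
  'http://example.com/a' => [301, '/b'], # relative Location
  'http://example.com/b' => [302, 'https://example.org/final'], # absolute Location
  'https://example.org/final' => [200, 'hello'],
  'http://example.com/loop' => [302, '/loop'] # redirects to itself
}.freeze

def sketch_fetch(url, limit: 5)
  raise SketchFetchError, "too many redirects fetching #{url}" if limit.zero?

  status, payload = SKETCH_RESPONSES.fetch(url)
  case status
  when 200..299
    { body: payload, url: url }
  when 300..399
    raise SketchFetchError, "HTTP #{status} from #{url} with no Location header" if payload.nil? || payload.empty?

    # URI.join resolves a relative Location against the current URL.
    sketch_fetch(URI.join(url, payload).to_s, limit: limit - 1)
  else
    raise SketchFetchError, "HTTP #{status} fetching #{url}"
  end
end

p sketch_fetch('http://example.com/a')
begin
  sketch_fetch('http://example.com/loop')
rescue SketchFetchError => e
  puts e.message
end
```

Because the budget is checked before each request, a URL that redirects to itself is looked up once per unit of limit and then fails with the "too many redirects" message.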

.visit(url) ⇒ String

Fetch url and render its main content as Markdown.

No caching here — every call hits the network. Callers that want to memoize results should wrap this method themselves (see WebScrape.visit, which does exactly that).

The extracted output is passed through String#strip so the LLM never sees a body that opens or closes with blank lines, which is common with the page-feed whitespace of extracted PDFs and with text bodies that carry a trailing newline. Interior whitespace is preserved because Markdown paragraph breaks and source-code indentation are load-bearing.

Parameters:

  • url (String)

    absolute HTTP(S) URL of the page to download

Returns:

  • (String)

    full Markdown representation of the page with leading/trailing whitespace trimmed but otherwise uncapped; the caller is responsible for any size limiting before feeding the result back to the LLM

Raises:

  • (FetchError)

    on HTTP non-2xx, network failure, redirect loop, a 3xx without a Location header, a response no extractor recognizes, or an extraction failure (malformed PDF, …)



# File 'lib/pikuri/tool/scraper.rb', line 90

def self.visit(url)
  extract(fetch(url)).strip
end
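
Since visit does no caching, a caller that wants memoization has to add it. The following is a minimal sketch of such a wrapper, not the actual WebScrape.visit implementation, which this page does not show. MemoizedVisitor is a hypothetical name, and the block passed to it stands in for Scraper.visit.

```ruby
# Hypothetical caching wrapper around a visit-like callable.
class MemoizedVisitor
  def initialize(&fetcher)
    @fetcher = fetcher
    @cache = {}
  end

  # Returns the cached result for url, calling the fetcher only on a miss.
  # A raised error is not cached, so a later call retries the fetch.
  def visit(url)
    @cache.fetch(url) { @cache[url] = @fetcher.call(url) }
  end
end

calls = 0
visitor = MemoizedVisitor.new do |url|
  calls += 1
  "\n# Page at #{url}\n\n".strip # same trimming as Scraper.visit
end

visitor.visit('https://example.com/')
visitor.visit('https://example.com/')
puts calls
```

Keying the cache on the requested URL is the simplest choice; a real wrapper might also want an eviction policy, which is out of scope for this sketch.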