Module: Pikuri::Tool::Scraper
- Defined in:
- lib/pikuri/tool/scraper.rb
Overview
HTTP side of the web tools (WEB_SCRAPE and FETCH): GET the URL with a real-browser User-Agent, follow redirects, and hand the response body to Extractor.extract with the response’s Content-Type as the hint. HTML/XHTML render via Extractor::HTML, any other text/* type passes through verbatim, and plug-in extractors extend the set (with pikuri-pdf registered, application/pdf extracts — by header or by %PDF- magic, so a PDF served under a lying header still works); the remaining types raise FetchError so the LLM observes the failure instead of receiving an empty rendering.
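The routing described above can be sketched roughly as follows. This is a minimal illustration, not the Extractor code itself: `sketch_route`, `SketchFetchError`, the `pdf_registered` flag and the returned symbols are hypothetical names chosen for the example, and the order of the checks is an assumption.

```ruby
# Hypothetical stand-in for Pikuri::Tool::Scraper::FetchError.
class SketchFetchError < StandardError; end

# Pick an extraction arm from the content type, mirroring the arms the
# overview describes. The returned symbols are illustrative labels only.
def sketch_route(body, content_type, pdf_registered: false)
  # Assumption: a registered PDF extractor claims the body by header or by magic.
  return :pdf if pdf_registered && (content_type == 'application/pdf' || body.start_with?('%PDF-'))

  case content_type
  when 'text/html', 'application/xhtml+xml' then :html
  when %r{\Atext/} then :text # any other text/* passes through verbatim
  else raise SketchFetchError, "unsupported content-type #{content_type.inspect}"
  end
end
```

In this sketch an empty content type falls through to the `else` arm and raises, while a body that starts with `%PDF-` is still routed to the PDF arm even under a `text/html` header.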
Split into a thin HTTP fetch (Scraper.fetch) and the extraction wrapper (Scraper.visit) so tests can drive each piece in isolation and Fetch can reuse the HTTP half without the extraction pass. Nothing here knows about the LLM; the tools that wrap this module own caching and truncation and turn rendered Markdown (or FetchError) into the next observation.
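As an illustration of that division of labour, a wrapping tool might turn a visit result or its failure into observation text along these lines. Everything here is a hypothetical sketch: `demo_observation`, `DemoFetchError`, `DEMO_MAX_CHARS` and the `[truncated]` marker are not taken from the real tools, which choose their own truncation and error wording.

```ruby
# Hypothetical stand-in for Pikuri::Tool::Scraper::FetchError.
class DemoFetchError < StandardError; end

# Hypothetical truncation budget; the real tools pick their own limit.
DEMO_MAX_CHARS = 40

# Turn a visit result (or its failure) into the text of the next observation.
# `visitor` is any callable that takes a URL and returns Markdown.
def demo_observation(url, visitor)
  markdown = visitor.call(url)
  return markdown if markdown.length <= DEMO_MAX_CHARS

  "#{markdown[0, DEMO_MAX_CHARS]}\n[truncated]"
rescue DemoFetchError => e
  "Error: #{e.message}"
end

ok = ->(_url) { '# Title' }
failing = ->(url) { raise DemoFetchError, "HTTP 403 fetching #{url}" }

puts demo_observation('https://example.com/', ok)
puts demo_observation('https://example.com/', failing)
```

Because only one exception type is rescued, the wrapper needs no knowledge of Faraday or of the extractors.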
Defined Under Namespace
Classes: FetchError, Fetched
Constant Summary
- USER_AGENT =
User-Agent sent with each request; many sites reject requests with no UA or an obvious bot UA.
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' \
'(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
- ACCEPT =
Accept header sent with each request, so that servers that content-negotiate hand back something we can use: rendered HTML first, application/pdf for hosts with a PDF extractor registered, then any text/* for the verbatim pass-through arm.
'text/html,application/xhtml+xml,application/pdf,text/*;q=0.8'
- MAX_REDIRECTS =
Maximum number of HTTP redirects to follow before giving up.
5
- OPEN_TIMEOUT =
Connect timeout in seconds for the underlying Faraday request.
10
- READ_TIMEOUT =
Read timeout in seconds for the underlying Faraday request.
20
- ERROR_BODY_EXCERPT =
Maximum number of characters of an error response body to include in a FetchError message. The body is often a multi-kilobyte HTML challenge page (Cloudflare, WAF interstitial, etc.); a short excerpt tells the LLM what kind of page came back without flooding the next observation.
200
Class Method Summary
- .extract(fetched) ⇒ String
Render a Fetched response as Markdown via Extractor.extract, re-raising both extraction failure modes as FetchError, the single exception type the web tools rescue.
- .fetch(url, limit: MAX_REDIRECTS) ⇒ Fetched
Download the body of url, manually following up to MAX_REDIRECTS redirects.
- .visit(url) ⇒ String
Fetch url and render its main content as Markdown.
Class Method Details
.extract(fetched) ⇒ String
Render a Fetched response as Markdown via Extractor.extract, re-raising both extraction failure modes as FetchError, the single exception type the web tools rescue. The content type is passed through verbatim, including the empty string "" that stands for a missing header. An empty content type matches no text arm: a body without transport metadata is refused, not sniffed. Only a strong magic sniff such as pikuri-pdf's %PDF- overrides a wrong or missing header, because such a sniff never misfires on text.
# File 'lib/pikuri/tool/scraper.rb', line 108

def self.extract(fetched)
  Pikuri::Extractor.extract(StringIO.new(fetched.body), content_type: fetched.content_type)
rescue Pikuri::Extractor::Unsupported
  raise FetchError, "unsupported content-type #{fetched.content_type.inspect} for #{fetched.url}"
rescue Pikuri::Extractor::Error => e
  raise FetchError, e.message
end
.fetch(url, limit: MAX_REDIRECTS) ⇒ Fetched
Download the body of url, manually following up to MAX_REDIRECTS redirects. Faraday is configured with no middleware so behavior here mirrors the rest of the codebase (see Tool::Search::DuckDuckGo.search).
All recoverable failures (HTTP 4xx/5xx, Faraday::Error network blips, an exhausted redirect budget, a 3xx without a Location header) surface as FetchError so the caller has a single exception type to rescue. Error bodies are trimmed to ERROR_BODY_EXCERPT characters with whitespace collapsed, so a Cloudflare-challenge response doesn't dump kilobytes of inline HTML into the next LLM observation.
# File 'lib/pikuri/tool/scraper.rb', line 136

def self.fetch(url, limit: MAX_REDIRECTS)
  raise FetchError, "too many redirects fetching #{url}" if limit.zero?

  response = begin
    Faraday.new(request: { open_timeout: OPEN_TIMEOUT, timeout: READ_TIMEOUT }).get(url) do |req|
      req.headers['User-Agent'] = USER_AGENT
      req.headers['Accept'] = ACCEPT
    end
  rescue Faraday::Error => e
    raise FetchError, "#{e.class.name.split('::').last} fetching #{url}: #{e.message}"
  end

  case response.status
  when 200..299
    Fetched.new(body: response.body,
                content_type: normalize_content_type(response.headers['content-type']),
                url: url)
  when 300..399
    location = response.headers['location']
    raise FetchError, "HTTP #{response.status} from #{url} with no Location header" if location.nil? || location.empty?

    fetch(URI.join(url, location).to_s, limit: limit - 1)
  else
    raise FetchError, "HTTP #{response.status} fetching #{url}: #{excerpt(response.body)}"
  end
end
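The redirect budget and the error-body excerpt can be exercised without a network by swapping the Faraday call for a stub. The sketch below is an assumption-laden illustration: `demo_fetch`, `demo_excerpt`, `DemoResponse`, `RedirectDemoError` and the `transport` callable are hypothetical names, and `demo_excerpt` only guesses at what the private `excerpt` helper does from the prose (collapse whitespace, cap the length).

```ruby
require 'uri'

# Hypothetical stand-ins; only the control flow mirrors the method above.
class RedirectDemoError < StandardError; end
DemoResponse = Struct.new(:status, :headers, :body, keyword_init: true)

# Collapse whitespace runs and cap the length. The real helper may differ.
def demo_excerpt(body, max = 200)
  body.to_s.gsub(/\s+/, ' ').strip[0, max]
end

# Follow redirects with a decreasing budget. `transport` is a callable that
# maps a URL to a DemoResponse, standing in for the HTTP GET.
def demo_fetch(url, transport, limit: 5)
  raise RedirectDemoError, "too many redirects fetching #{url}" if limit.zero?

  response = transport.call(url)
  case response.status
  when 200..299
    response.body
  when 300..399
    location = response.headers['location']
    raise RedirectDemoError, "HTTP #{response.status} from #{url} with no Location header" if location.nil? || location.empty?

    # URI.join resolves relative Location values against the current URL.
    demo_fetch(URI.join(url, location).to_s, transport, limit: limit - 1)
  else
    raise RedirectDemoError, "HTTP #{response.status} fetching #{url}: #{demo_excerpt(response.body)}"
  end
end

pages = {
  'https://example.com/old' => DemoResponse.new(status: 302, headers: { 'location' => '/new' }, body: ''),
  'https://example.com/new' => DemoResponse.new(status: 200, headers: {}, body: 'hello')
}
puts demo_fetch('https://example.com/old', pages.method(:fetch))
```

In this sketch a budget of `limit` permits `limit` requests in total, so a chain can contain at most `limit - 1` redirects before the "too many redirects" error is raised.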
.visit(url) ⇒ String
Fetch url and render its main content as Markdown.
No caching here — every call hits the network. Callers that want to memoize results should wrap this method themselves (see WebScrape.visit, which does exactly that).
The extracted output is passed through String#strip so the LLM never sees a body that opens or closes with blank lines, which is common with the page-feed whitespace of extracted PDFs and with text bodies that carry a trailing newline. Interior whitespace is preserved because Markdown paragraph breaks and source-code indentation are load-bearing.
# File 'lib/pikuri/tool/scraper.rb', line 90

def self.visit(url)
  extract(fetch(url)).strip
end
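Since visit does no caching, a caller that wants memoization has to add it. One possible shape for such a wrapper is sketched below; `DemoVisitCache` is a hypothetical class, not the cache WebScrape actually uses, and the `visitor` callable stands in for Scraper.visit so the example runs without a network.

```ruby
# Hypothetical memoizing wrapper; the real WebScrape cache may differ.
class DemoVisitCache
  def initialize(visitor)
    @visitor = visitor # callable: url -> Markdown String
    @cache = {}
  end

  # Return the cached rendering for url, calling the visitor only on a miss.
  def visit(url)
    @cache.fetch(url) { @cache[url] = @visitor.call(url) }
  end
end

calls = 0
cache = DemoVisitCache.new(->(url) { calls += 1; "# Page at #{url}" })
cache.visit('https://example.com/')
cache.visit('https://example.com/')
puts calls
```

In this sketch a visitor that raises stores nothing, so a failed fetch is retried on the next call instead of being cached.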