Module: Pikuri::Tool::Scraper::Simple
- Defined in:
- lib/pikuri/tool/scraper/simple.rb
Overview
Plain HTTP scraper: GET the URL with a real-browser User-Agent, follow redirects, and dispatch the response body to the parser matching its Content-Type. HTML and XHTML route to HTML.extract; application/pdf routes to PDF.extract; any other text/* type (plain text, Markdown, source files, …) is passed through verbatim since the LLM can already read it; the remaining types raise FetchError so the LLM observes the failure instead of receiving an empty rendering.
Split into a thin HTTP fetch (Simple.fetch) and a content-type dispatcher (Simple.visit) so tests can drive each piece in isolation. “Simple” because everything happens in one Faraday GET — no headless browser, no JS execution.
Defined Under Namespace
Classes: Fetched
Constant Summary collapse
- USER_AGENT =
Returns the User-Agent sent with each request; many sites reject requests with no UA or an obvious bot UA.
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
- ACCEPT =
Returns the Accept header sent with each request. It lists every content-type the dispatcher in visit knows how to render, so servers that content-negotiate hand back something we can use. The trailing text/*;q=0.8 covers the verbatim pass-through arm (plain text, Markdown, source files, …) at a lower preference than rendered HTML/PDF.
'text/html,application/xhtml+xml,application/pdf,text/*;q=0.8'
- MAX_REDIRECTS =
Returns maximum number of HTTP redirects to follow before giving up.
5
- OPEN_TIMEOUT =
Returns connect timeout in seconds for the underlying Faraday request.
10
- READ_TIMEOUT =
Returns read timeout in seconds for the underlying Faraday request.
20
- ERROR_BODY_EXCERPT =
Returns maximum number of characters of an error response body to include in a FetchError message. The body is often a multi-kilobyte HTML challenge page (Cloudflare, WAF interstitial, etc.); a short excerpt tells the LLM what kind of page came back without flooding the next observation.
200
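The private helper that applies ERROR_BODY_EXCERPT is not shown on this page; a minimal sketch of an assumed implementation (the name `excerpt` matches the call in fetch below, but the body here is an illustration, not the library's code) collapses whitespace runs and truncates:

```ruby
ERROR_BODY_EXCERPT = 200

# Assumed sketch of the excerpt helper: collapse runs of whitespace into
# single spaces, then cut the result to ERROR_BODY_EXCERPT characters so a
# challenge page can't flood the next LLM observation.
def excerpt(body)
  body.to_s.gsub(/\s+/, ' ').strip[0, ERROR_BODY_EXCERPT]
end
```

With this sketch, a 5 KB Cloudflare interstitial becomes a single 200-character line in the FetchError message.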
Class Method Summary collapse
- .dispatch(fetched) ⇒ String
Route a Fetched response to the parser that matches its content-type.
- .fetch(url, limit: MAX_REDIRECTS) ⇒ Fetched
Download the body of url, manually following up to MAX_REDIRECTS redirects.
- .visit(url) ⇒ String
Fetch url and render its main content as Markdown.
Class Method Details
.dispatch(fetched) ⇒ String
Route a Fetched response to the parser that matches its content-type. Unknown types raise FetchError so the LLM gets a legible observation instead of an empty string.
# File 'lib/pikuri/tool/scraper/simple.rb', line 138

def self.dispatch(fetched)
  case fetched.content_type
  when 'text/html', 'application/xhtml+xml'
    HTML.extract(fetched.body)
  when 'application/pdf'
    PDF.extract(fetched.body)
  when %r{\Atext/}
    fetched.body
  else
    raise FetchError,
          "unsupported content-type #{fetched.content_type.inspect} for #{fetched.url}"
  end
end
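The routing logic can be exercised on its own in a self-contained sketch; the parser calls are replaced with symbolic labels (`:html`, `:pdf`, `:verbatim`, `:error` are illustrative names, not part of the module) so it runs without the HTML/PDF parsers, and the `%r{\Atext/}` arm shows why types like text/markdown fall through verbatim:

```ruby
# Mirror of the dispatch case/when with parsers stubbed out as labels.
# Note the order matters: the exact-match arms run before the text/* catch-all.
def route(content_type)
  case content_type
  when 'text/html', 'application/xhtml+xml' then :html
  when 'application/pdf'                    then :pdf
  when %r{\Atext/}                          then :verbatim
  else                                           :error
  end
end
```

Anything that reaches the `:error` arm in the real module raises FetchError rather than returning an empty rendering.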
.fetch(url, limit: MAX_REDIRECTS) ⇒ Fetched
Download the body of url, manually following up to MAX_REDIRECTS redirects. Faraday is configured with no middleware so behavior here mirrors the rest of the codebase (see Tool::Search::DuckDuckGo.search).
All recoverable failures — HTTP 4xx/5xx, Faraday::Error network blips, exhausted redirect budget, 3xx without a Location — surface as FetchError so the caller has a single exception type to rescue. Error bodies are trimmed to ERROR_BODY_EXCERPT characters with whitespace collapsed, so a Cloudflare-challenge response doesn’t dump kilobytes of inline HTML into the next LLM observation.
# File 'lib/pikuri/tool/scraper/simple.rb', line 104

def self.fetch(url, limit: MAX_REDIRECTS)
  raise FetchError, "too many redirects fetching #{url}" if limit.zero?

  response =
    begin
      Faraday.new(request: { open_timeout: OPEN_TIMEOUT, timeout: READ_TIMEOUT }).get(url) do |req|
        req.headers['User-Agent'] = USER_AGENT
        req.headers['Accept'] = ACCEPT
      end
    rescue Faraday::Error => e
      raise FetchError, "#{e.class.name.split('::').last} fetching #{url}: #{e.message}"
    end

  case response.status
  when 200..299
    Fetched.new(body: response.body,
                content_type: normalize_content_type(response.headers['content-type']),
                url: url)
  when 300..399
    location = response.headers['location']
    raise FetchError, "HTTP #{response.status} from #{url} with no Location header" if location.nil? || location.empty?

    fetch(URI.join(url, location).to_s, limit: limit - 1)
  else
    raise FetchError, "HTTP #{response.status} fetching #{url}: #{excerpt(response.body)}"
  end
end
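The method leans on the private helper `normalize_content_type`, which is not shown on this page. A plausible sketch (an assumption about its behavior: strip parameters such as charset and lowercase the media type so the dispatcher's case/when matches exactly), together with the URI.join redirect resolution the 3xx arm relies on:

```ruby
require 'uri'

# Assumed sketch of normalize_content_type: drop parameters like
# "; charset=utf-8" and lowercase the bare media type.
def normalize_content_type(header)
  header.to_s.split(';').first.to_s.strip.downcase
end

# Redirect resolution as in the 3xx arm above: URI.join resolves both
# absolute and relative Location headers against the original URL.
URI.join('https://example.com/a/b', '/moved').to_s  # => "https://example.com/moved"
URI.join('https://example.com/a/b', 'c').to_s       # => "https://example.com/a/c"
```

Normalizing before the dispatch case/when is what lets `'Text/HTML; charset=UTF-8'` match the `'text/html'` arm.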
.visit(url) ⇒ String
Fetch url and render its main content as Markdown.
No caching here — every call hits the network. Callers that want to memoize results should wrap this method themselves (see WebScrape.visit, which does exactly that).
The dispatcher’s output is String#strip'd so the LLM never sees a body that opens or closes with blank lines — common with pdf-reader’s page-feed whitespace and with text bodies that carry a trailing newline. Interior whitespace is preserved because Markdown paragraph breaks and source-code indentation are load-bearing.
# File 'lib/pikuri/tool/scraper/simple.rb', line 80

def self.visit(url)
  dispatch(fetch(url)).strip
end