Module: Pikuri::Tool::Scraper::Simple

Defined in:
lib/pikuri/tool/scraper/simple.rb

Overview

Plain HTTP scraper: GET the URL with a real-browser User-Agent, follow redirects, and dispatch the response body to the parser matching its Content-Type. HTML and XHTML route to HTML.extract; application/pdf routes to PDF.extract; any other text/* type (plain text, Markdown, source files, …) is passed through verbatim since the LLM can already read it; the remaining types raise FetchError so the LLM observes the failure instead of receiving an empty rendering.

Split into a thin HTTP fetch (Simple.fetch) and a content-type dispatcher (Simple.visit) so tests can drive each piece in isolation. “Simple” because everything happens in one Faraday GET — no headless browser, no JS execution.

Defined Under Namespace

Classes: Fetched

Constant Summary

USER_AGENT =

Returns User-Agent sent with each request; many sites reject requests with no UA or an obvious bot UA.

Returns:

  • (String)

    User-Agent sent with each request; many sites reject requests with no UA or an obvious bot UA

'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' \
'(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
ACCEPT =

Returns Accept header sent with each request. Lists every content-type the dispatcher in visit knows how to render, so servers that content-negotiate hand back something we can use. The trailing text/*;q=0.8 covers the verbatim pass-through arm (plain text, Markdown, source files, …) at a lower preference than rendered HTML/PDF.

Returns:

  • (String)

    Accept header sent with each request. Lists every content-type the dispatcher in visit knows how to render, so servers that content-negotiate hand back something we can use. The trailing text/*;q=0.8 covers the verbatim pass-through arm (plain text, Markdown, source files, …) at a lower preference than rendered HTML/PDF.

'text/html,application/xhtml+xml,application/pdf,text/*;q=0.8'
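To see why the trailing `q=0.8` demotes the pass-through arm, here is an illustrative sketch of how a content-negotiating server would rank the types ACCEPT advertises. This is not part of Simple — real servers implement full RFC 9110 negotiation — but the q-value ordering it shows is the standard one: parameters default to q=1.0 when omitted.

```ruby
ACCEPT = 'text/html,application/xhtml+xml,application/pdf,text/*;q=0.8'

# Parse an Accept header into [type, q] pairs, highest preference first.
# Types without an explicit q parameter default to 1.0.
def accept_preferences(header)
  header.split(',').map do |entry|
    type, *params = entry.strip.split(';').map(&:strip)
    q = params.find { |p| p.start_with?('q=') }&.delete_prefix('q=')&.to_f || 1.0
    [type, q]
  end.sort_by { |_, q| -q }
end

accept_preferences(ACCEPT)
# text/html, application/xhtml+xml and application/pdf share q=1.0,
# so all three outrank the text/* fallback at q=0.8.
```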
MAX_REDIRECTS =

Returns maximum number of HTTP redirects to follow before giving up.

Returns:

  • (Integer)

    maximum number of HTTP redirects to follow before giving up

5
OPEN_TIMEOUT =

Returns connect timeout in seconds for the underlying Faraday request.

Returns:

  • (Integer)

    connect timeout in seconds for the underlying Faraday request

10
READ_TIMEOUT =

Returns read timeout in seconds for the underlying Faraday request.

Returns:

  • (Integer)

    read timeout in seconds for the underlying Faraday request

20
ERROR_BODY_EXCERPT =

Returns maximum number of characters of an error response body to include in a FetchError message. The body is often a multi-kilobyte HTML challenge page (Cloudflare, WAF interstitial, etc.); a short excerpt tells the LLM what kind of page came back without flooding the next observation.

Returns:

  • (Integer)

    maximum number of characters of an error response body to include in a FetchError message. The body is often a multi-kilobyte HTML challenge page (Cloudflare, WAF interstitial, etc.); a short excerpt tells the LLM what kind of page came back without flooding the next observation.

200
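The excerpt helper that applies this limit is private and not shown on this page, but the behavior described above — collapse whitespace runs, then truncate to ERROR_BODY_EXCERPT characters — can be sketched as follows. The method body here is a hypothetical reconstruction, not the module's actual code.

```ruby
ERROR_BODY_EXCERPT = 200

# Hypothetical sketch: squash all whitespace runs (newlines, indentation)
# into single spaces, then keep at most ERROR_BODY_EXCERPT characters.
def excerpt(body)
  body.to_s.gsub(/\s+/, ' ').strip[0, ERROR_BODY_EXCERPT]
end

excerpt("<html>\n  <body>\n    Access denied\n  </body>\n</html>")
# => "<html> <body> Access denied </body> </html>"
```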

Class Method Summary

  • .dispatch(fetched) ⇒ String

    Route a Fetched response to the parser that matches its content-type.

  • .fetch(url, limit: MAX_REDIRECTS) ⇒ Fetched

    Download the body of url, manually following up to MAX_REDIRECTS redirects.

  • .visit(url) ⇒ String

    Fetch url and render its main content as Markdown.

Class Method Details

.dispatch(fetched) ⇒ String

Route a Fetched response to the parser that matches its content-type. Unknown types raise FetchError so the LLM gets a legible observation instead of an empty string.

Parameters:

  • fetched (Fetched)

    response to route, as produced by fetch

Returns:

  • (String)

    Markdown representation produced by the matched parser

Raises:

  • (FetchError)

    when no parser matches the response’s content-type



# File 'lib/pikuri/tool/scraper/simple.rb', line 138

def self.dispatch(fetched)
  case fetched.content_type
  when 'text/html', 'application/xhtml+xml'
    HTML.extract(fetched.body)
  when 'application/pdf'
    PDF.extract(fetched.body)
  when %r{\Atext/}
    fetched.body
  else
    raise FetchError, "unsupported content-type #{fetched.content_type.inspect} for #{fetched.url}"
  end
end
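The routing can be exercised without the real HTML/PDF parsers by standing in stubs for them — a Struct for Fetched and lambdas in place of HTML.extract / PDF.extract (both placeholders, not the module's code). Note in particular the %r{\Atext/} arm: any text/* subtype, including ones the Accept header never named, passes through verbatim.

```ruby
# Stubs standing in for the real Fetched class and parser modules.
Fetched    = Struct.new(:body, :content_type, :url, keyword_init: true)
FetchError = Class.new(StandardError)
PARSERS    = { html: ->(b) { "html:#{b}" }, pdf: ->(b) { "pdf:#{b}" } }

def dispatch(fetched)
  case fetched.content_type
  when 'text/html', 'application/xhtml+xml' then PARSERS[:html].call(fetched.body)
  when 'application/pdf'                    then PARSERS[:pdf].call(fetched.body)
  when %r{\Atext/}                          then fetched.body  # verbatim pass-through
  else raise FetchError, "unsupported content-type #{fetched.content_type.inspect}"
  end
end

dispatch(Fetched.new(body: '# hi', content_type: 'text/markdown', url: 'https://x'))
# => "# hi"  (any text/* subtype passes through verbatim)
```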

.fetch(url, limit: MAX_REDIRECTS) ⇒ Fetched

Download the body of url, manually following up to MAX_REDIRECTS redirects. Faraday is configured with no middleware so behavior here mirrors the rest of the codebase (see Tool::Search::DuckDuckGo.search).

All recoverable failures — HTTP 4xx/5xx, Faraday::Error network blips, exhausted redirect budget, 3xx without a Location — surface as FetchError so the caller has a single exception type to rescue. Error bodies are trimmed to ERROR_BODY_EXCERPT characters with whitespace collapsed, so a Cloudflare-challenge response doesn’t dump kilobytes of inline HTML into the next LLM observation.

Parameters:

  • url (String)

    absolute HTTP(S) URL to fetch

  • limit (Integer) (defaults to: MAX_REDIRECTS)

    redirects remaining; recurses with limit - 1 on each 3xx

Returns:

  • (Fetched)

    body, normalized content-type, and final URL after redirects

Raises:

  • (FetchError)

    on non-2xx/3xx responses, network errors, redirect-loop exhaustion, or 3xx without a Location header



# File 'lib/pikuri/tool/scraper/simple.rb', line 104

def self.fetch(url, limit: MAX_REDIRECTS)
  raise FetchError, "too many redirects fetching #{url}" if limit.zero?

  response = begin
    Faraday.new(request: { open_timeout: OPEN_TIMEOUT, timeout: READ_TIMEOUT }).get(url) do |req|
      req.headers['User-Agent'] = USER_AGENT
      req.headers['Accept']     = ACCEPT
    end
  rescue Faraday::Error => e
    raise FetchError, "#{e.class.name.split('::').last} fetching #{url}: #{e.message}"
  end

  case response.status
  when 200..299
    Fetched.new(body: response.body, content_type: normalize_content_type(response.headers['content-type']), url: url)
  when 300..399
    location = response.headers['location']
    raise FetchError, "HTTP #{response.status} from #{url} with no Location header" if location.nil? || location.empty?

    fetch(URI.join(url, location).to_s, limit: limit - 1)
  else
    raise FetchError, "HTTP #{response.status} fetching #{url}: #{excerpt(response.body)}"
  end
end
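Because fetch resolves the Location header against the current URL with URI.join, relative redirects work as well as absolute ones. A quick illustration of the three common Location shapes, run against a stand-in base URL:

```ruby
require 'uri'

base = 'https://example.com/docs/page'

URI.join(base, '/login').to_s               # absolute-path Location
# => "https://example.com/login"
URI.join(base, 'next').to_s                 # relative Location
# => "https://example.com/docs/next"
URI.join(base, 'https://other.test/').to_s  # absolute Location wins outright
# => "https://other.test/"
```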

.visit(url) ⇒ String

Fetch url and render its main content as Markdown.

No caching here — every call hits the network. Callers that want to memoize results should wrap this method themselves (see WebScrape.visit, which does exactly that).

The dispatcher’s output is passed through String#strip so the LLM never sees a body that opens or closes with blank lines — common with pdf-reader’s page-feed whitespace and with text bodies that carry a trailing newline. Interior whitespace is preserved because Markdown paragraph breaks and source-code indentation are load-bearing.

Parameters:

  • url (String)

    absolute HTTP(S) URL of the page to download

Returns:

  • (String)

full Markdown representation of the page with leading/trailing whitespace trimmed, uncapped otherwise — the caller is responsible for any size limiting before feeding the result back to the LLM

Raises:

  • (FetchError)

    on HTTP non-2xx, network failure, redirect loop, a 3xx without a Location header, or a response whose content-type the dispatcher does not recognize



# File 'lib/pikuri/tool/scraper/simple.rb', line 80

def self.visit(url)
  dispatch(fetch(url)).strip
end
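Since visit leaves caching to callers, a memoizing wrapper in the style the docs attribute to WebScrape.visit might look like the sketch below. The CachedVisitor class and the injected backend lambda are illustrative assumptions — the real WebScrape is not shown on this page — and the lambda stands in for Simple.visit so the sketch runs offline.

```ruby
# Sketch of a caller-side memoization wrapper around a visit-like method.
# `backend` stands in for Simple.visit; injected so the sketch needs no network.
class CachedVisitor
  def initialize(backend)
    @backend = backend
    @cache   = {}
  end

  # First visit to a URL delegates to the backend; repeats return the
  # cached Markdown without touching the network again.
  def visit(url)
    @cache[url] ||= @backend.call(url)
  end
end

calls = 0
visitor = CachedVisitor.new(->(url) { calls += 1; "# page at #{url}" })
visitor.visit('https://example.com')
visitor.visit('https://example.com')
calls  # => 1
```

One caveat of the `||=` idiom: a backend that could legitimately return nil or false would be re-fetched every call, but Markdown strings are always truthy, so it is safe here.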