Module: Scrapetor::Fetcher
- Defined in:
- lib/scrapetor/fetcher.rb
Overview
Native HTTP/2-capable fetch layer. Wraps the libcurl-backed Scrapetor::Native::Http module. Distinct from Scrapetor::HTTP / Scrapetor.fetch, which is the Net::HTTP-based fallback used by tests, the CLI, and environments without libcurl.
Capabilities depend on the libcurl your gem links against. Inspect Scrapetor::Fetcher.features to see what’s actually wired up (HTTP/2 / brotli / zstd / libz). HTTP/2 + gzip is the typical baseline on macOS; brotli/zstd need a libcurl rebuilt with them.
The connection cache lives on a per-OS-thread libcurl easy handle —repeated fetches to the same host inside a single thread reuse the TLS session and HTTP/2 stream. The fetch itself drops the GVL, so background Ruby threads keep running during the round-trip.
resp = Scrapetor::Fetcher.get("https://api.example.com/items")
resp[:status] # => 200
resp[:http_version] # => "2"
resp[:headers] # => {"content-type" => "application/json", ...}
resp[:body] # => "..."
doc = Scrapetor::Fetcher.fetch("https://example.com/")
# => Scrapetor::Document parsed from the response body, base_url
# set to the final URL after redirects.
Defined Under Namespace
Classes: FetchError, NotAvailableError
Constant Summary collapse
- DEFAULT_USER_AGENT =
"scrapetor/#{Scrapetor::VERSION} (libcurl)"- DEFAULT_RETRY_STATUSES =
Status codes worth retrying. 408 (timeout), 425 (early), 429 (rate-limit), 500–504 (transient upstream). 5xx >= 505 are protocol-level rejections; they don’t usually heal on retry.
[408, 425, 429, 500, 502, 503, 504].freeze
Class Method Summary collapse
- .available? ⇒ Boolean
-
.backoff_for(attempt, base: 0.3, max: 10.0, retry_after: nil) ⇒ Object
Compute the backoff delay for attempt N (1-indexed).
-
.build_body(body, form, json, opts) ⇒ Object
Build the request body from one of :body / :form / :json.
- .delete(url, **opts) ⇒ Object
- .ensure_available! ⇒ Object
- .features ⇒ Object
-
.fetch(url, raise_for_status: true, **opts) ⇒ Object
Fetch + parse.
-
.get(url, **opts) ⇒ Object
Single GET with optional retry + exponential backoff.
- .head(url, **opts) ⇒ Object
-
.multi_each(urls, **opts) ⇒ Object
Streaming variant of multi_get: yields each response as it completes (in completion order, not input order), so the caller can start processing while other transfers are still in flight.
-
.multi_fetch(urls, **opts) ⇒ Object
multi_get + per-response parse, all under one no-GVL window.
-
.multi_get(urls, **opts) ⇒ Object
Single-thread curl_multi bulk fetch — one driver thread, one multi handle, N concurrent transfers multiplexed via curl_multi_perform.
-
.parallel_fetch(urls, **opts) ⇒ Object
Convenience: parallel_get + parse each successful response into a Scrapetor::Document.
-
.parallel_get(urls, **opts) ⇒ Object
N concurrent GETs across pthread workers, each with a persistent libcurl handle (per-OS-thread connection cache).
- .parse_retry_after(headers) ⇒ Object
- .patch(url, body: nil, form: nil, json: nil, **opts) ⇒ Object
-
.post(url, body: nil, form: nil, json: nil, multipart: nil, **opts) ⇒ Object
Method shorthands.
- .put(url, body: nil, form: nil, json: nil, **opts) ⇒ Object
- .retry_eligible?(r, retry_on) ⇒ Boolean
- .retryable_response?(resp, retry_statuses) ⇒ Boolean
-
.revalidate(urls, cache_dir:, **opts) ⇒ Object
Bulk-revalidate cached entries.
- .upload_bytes(bytes, filename:, content_type: "application/octet-stream") ⇒ Object
-
.upload_file(path, filename: nil, content_type: nil) ⇒ Object
Convenience constructors for multipart values.
Class Method Details
.available? ⇒ Boolean
68 69 70 71 |
# File 'lib/scrapetor/fetcher.rb', line 68 def self.available? defined?(Scrapetor::Native::Http::AVAILABLE) && Scrapetor::Native::Http::AVAILABLE end |
.backoff_for(attempt, base: 0.3, max: 10.0, retry_after: nil) ⇒ Object
Compute the backoff delay for attempt N (1-indexed). Exponential with full jitter — the AWS-style ‘random between 0 and 2^n * base’ variant — capped at max_backoff.
49 50 51 52 53 |
# File 'lib/scrapetor/fetcher.rb', line 49 def self.backoff_for(attempt, base: 0.3, max: 10.0, retry_after: nil) return [retry_after.to_f, max].min if retry_after && retry_after.to_f > 0 hi = [base * (2.0**(attempt - 1)), max].min rand * hi end |
.build_body(body, form, json, opts) ⇒ Object
Build the request body from one of :body / :form / :json. Returns [body_string, opts_with_content_type_header_set].
346 347 348 349 350 351 352 353 354 355 356 357 358 359 |
# File 'lib/scrapetor/fetcher.rb', line 346 def self.build_body(body, form, json, opts) headers = (opts[:headers] || {}).dup if json require "json" body = JSON.generate(json) headers["Content-Type"] ||= "application/json" elsif form require "uri" body = URI.encode_www_form(form) headers["Content-Type"] ||= "application/x-www-form-urlencoded" end opts[:headers] = headers unless headers.empty? [body, opts] end |
.delete(url, **opts) ⇒ Object
336 337 338 |
# File 'lib/scrapetor/fetcher.rb', line 336 def self.delete(url, **opts) get(url, **opts.merge(method: :delete)) end |
.ensure_available! ⇒ Object
131 132 133 134 135 136 |
# File 'lib/scrapetor/fetcher.rb', line 131 def self.ensure_available! return if available? raise NotAvailableError, "Scrapetor::Fetcher requires libcurl at build time. " \ "Reinstall after `brew install curl` / `apt-get install libcurl4-openssl-dev`." end |
.features ⇒ Object
73 74 75 76 |
# File 'lib/scrapetor/fetcher.rb', line 73 def self.features ensure_available! Scrapetor::Native::Http.features end |
.fetch(url, raise_for_status: true, **opts) ⇒ Object
Fetch + parse. Raises FetchError on non-2xx status by default; pass raise_for_status: false to inspect non-2xx responses.
120 121 122 123 124 125 126 127 128 129 |
# File 'lib/scrapetor/fetcher.rb', line 120 def self.fetch(url, raise_for_status: true, **opts) resp = get(url, **opts) if raise_for_status && (resp[:status] < 200 || resp[:status] >= 400) raise FetchError.new( "Scrapetor::Fetcher.fetch #{url} -> HTTP #{resp[:status]}", status: resp[:status], response: resp ) end Scrapetor.parse(resp[:body], base_url: resp[:final_url]) end |
.get(url, **opts) ⇒ Object
Single GET with optional retry + exponential backoff.
retry: 0 - try once and return whatever happens (default).
retry: N - retry up to N times on transient failure.
backoff: - base backoff in seconds (default 0.3).
max_backoff: - cap on a single sleep (default 10.0).
retry_on: - statuses to retry (default [408, 425, 429, 500..504]).
Network failures (IOError from libcurl: connect refused, DNS, TLS, timeout) are also retried. The wait between attempts honours Retry-After response headers (numeric form) when present and otherwise uses exponential backoff with full jitter.
90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 |
# File 'lib/scrapetor/fetcher.rb', line 90 def self.get(url, **opts) ensure_available! retries = opts.delete(:retry) || 0 base = opts.delete(:backoff) || 0.3 max_backoff = opts.delete(:max_backoff) || 10.0 retry_on = opts.delete(:retry_on) || DEFAULT_RETRY_STATUSES opts[:user_agent] ||= DEFAULT_USER_AGENT attempt = 0 last_err = nil loop do begin resp = Scrapetor::Native::Http.get(url.to_s, opts) return resp unless retries > attempt && retryable_response?(resp, retry_on) ra = parse_retry_after(resp[:headers]) sleep backoff_for(attempt + 1, base: base, max: max_backoff, retry_after: ra) rescue IOError => e last_err = e raise e unless retries > attempt sleep backoff_for(attempt + 1, base: base, max: max_backoff) end attempt += 1 end rescue IOError raise last_err if last_err raise end |
.head(url, **opts) ⇒ Object
340 341 342 |
# File 'lib/scrapetor/fetcher.rb', line 340 def self.head(url, **opts) get(url, **opts.merge(method: :head)) end |
.multi_each(urls, **opts) ⇒ Object
Streaming variant of multi_get: yields each response as it completes (in completion order, not input order), so the caller can start processing while other transfers are still in flight. Pass parse: true to also parse the body in the worker thread.
Scrapetor::Fetcher.multi_each(urls, threads: 8) do |r|
puts r[:final_url], r[:status]
# later transfers may still be on the wire here
end
289 290 291 292 293 294 295 296 297 298 299 |
# File 'lib/scrapetor/fetcher.rb', line 289 def self.multi_each(urls, **opts) return enum_for(:multi_each, urls, **opts) unless block_given? ensure_available! urls = Array(urls).map(&:to_s) return if urls.empty? opts[:user_agent] ||= DEFAULT_USER_AGENT batch = Scrapetor::Native::Http::MultiBatch.new(urls, opts) while (r = batch.next) yield r end end |
.multi_fetch(urls, **opts) ⇒ Object
multi_get + per-response parse, all under one no-GVL window. Returns Array<Scrapetor::Document | nil>, in input order. Failed entries are nil. Best for high-fan-out crawls where you want parsed Documents back and the I/O outweighs the per-page CPU cost.
266 267 268 269 270 271 272 273 274 275 276 277 278 |
# File 'lib/scrapetor/fetcher.rb', line 266 def self.multi_fetch(urls, **opts) urls = Array(urls).map(&:to_s) return [] if urls.empty? ensure_available! opts[:user_agent] ||= DEFAULT_USER_AGENT results = Scrapetor::Native::Http.multi_fetch(urls, opts.merge(parse: true)) results.map do |r| next nil if r[:error] native = r[:document] next Scrapetor.parse(r[:body], base_url: r[:final_url]) unless native Scrapetor::Document.new("", base_url: r[:final_url], native: native) end end |
.multi_get(urls, **opts) ⇒ Object
Single-thread curl_multi bulk fetch — one driver thread, one multi handle, N concurrent transfers multiplexed via curl_multi_perform. Complements parallel_get:
parallel_get - N pthread workers, each running blocking easy.
Best when each transfer has CPU work after the
fetch (decode + parse) since the GVL is released
across the full batch and CPU scales with cores.
multi_get - one driver thread, N concurrent in-flight.
Best for I/O-dominated high-fan-out (hundreds of
URLs across many hosts) where pthread setup
overhead outweighs the in-flight count.
Both share the same global CURLSH so connections / DNS / TLS sessions pool across them.
217 218 219 220 221 222 223 |
# File 'lib/scrapetor/fetcher.rb', line 217 def self.multi_get(urls, **opts) ensure_available! urls = Array(urls).map(&:to_s) return [] if urls.empty? opts[:user_agent] ||= DEFAULT_USER_AGENT Scrapetor::Native::Http.multi_fetch(urls, opts) end |
.parallel_fetch(urls, **opts) ⇒ Object
Convenience: parallel_get + parse each successful response into a Scrapetor::Document. Failed entries return nil.
The parse runs INSIDE the same no-GVL pthread worker that did the fetch (parse: true on the C side), so the whole batch —network + decode + transcode + dom_parse + index build — runs multi-core under a single GVL release. The main thread only wraps already-parsed documents.
369 370 371 372 373 374 375 376 377 378 379 380 381 |
# File 'lib/scrapetor/fetcher.rb', line 369 def self.parallel_fetch(urls, **opts) urls = Array(urls).map(&:to_s) return [] if urls.empty? ensure_available! opts[:user_agent] ||= DEFAULT_USER_AGENT results = Scrapetor::Native::Http.parallel_fetch(urls, opts.merge(parse: true)) results.map do |r| next nil if r[:error] native = r[:document] next Scrapetor.parse(r[:body], base_url: r[:final_url]) unless native Scrapetor::Document.new("", base_url: r[:final_url], native: native) end end |
.parallel_get(urls, **opts) ⇒ Object
N concurrent GETs across pthread workers, each with a persistent libcurl handle (per-OS-thread connection cache). The full batch runs under one GVL release — other Ruby threads stay live throughout.
results = Scrapetor::Fetcher.parallel_get(urls, threads: 8,
timeout_ms: 5_000)
# results is Array<Hash>; successful entries carry
# :status, :headers, :body, :final_url, :http_version
# failed entries carry { error: { url:, error: } } only.
N concurrent GETs with optional retry. The native batch returns all results in one pass; failed entries (transient status or network error) get a second batch dispatch under retry, with the previous-attempt’s wait honoured globally — i.e. one sleep between attempts rather than per-URL — so the pool keeps moving.
Per-URL Retry-After headers are read on retryable responses and the maximum of them governs the next inter-attempt sleep, so a rate-limited host pulls the whole pool to its backoff rather than thrashing the rest in parallel.
158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 |
# File 'lib/scrapetor/fetcher.rb', line 158 def self.parallel_get(urls, **opts) ensure_available! urls = Array(urls).map(&:to_s) return [] if urls.empty? retries = opts.delete(:retry) || 0 base = opts.delete(:backoff) || 0.3 max_backoff = opts.delete(:max_backoff) || 10.0 retry_on = opts.delete(:retry_on) || DEFAULT_RETRY_STATUSES opts[:user_agent] ||= DEFAULT_USER_AGENT results = Array.new(urls.size) pending = (0...urls.size).to_a attempt = 0 loop do batch = pending.map { |i| urls[i] } batch_res = Scrapetor::Native::Http.parallel_fetch(batch, opts) next_pending = [] next_retry_after = nil pending.each_with_index do |orig_i, pos| r = batch_res[pos] if attempt < retries && retry_eligible?(r, retry_on) ra = r[:headers] ? parse_retry_after(r[:headers]) : nil next_retry_after = ra if ra && (next_retry_after.nil? || ra > next_retry_after) next_pending << orig_i else results[orig_i] = r end end break if next_pending.empty? attempt += 1 sleep backoff_for(attempt, base: base, max: max_backoff, retry_after: next_retry_after) pending = next_pending end results end |
.parse_retry_after(headers) ⇒ Object
59 60 61 62 63 64 65 66 |
# File 'lib/scrapetor/fetcher.rb', line 59 def self.parse_retry_after(headers) v = headers && (headers["retry-after"] || headers["Retry-After"]) return nil unless v # Either integer seconds or HTTP-date. We only honour the # integer form (the date form is rare and parsing it adds a # dependency on Time.httpdate that the caller may not need). v.to_s.strip.match?(/\A\d+\z/) ? v.to_i : nil end |
.patch(url, body: nil, form: nil, json: nil, **opts) ⇒ Object
331 332 333 334 |
# File 'lib/scrapetor/fetcher.rb', line 331 def self.patch(url, body: nil, form: nil, json: nil, **opts) body, opts = build_body(body, form, json, opts) get(url, **opts.merge(method: :patch, body: body)) end |
.post(url, body: nil, form: nil, json: nil, multipart: nil, **opts) ⇒ Object
Method shorthands. Each is just a ‘.get` invocation with the corresponding method, plus the body sugar that POST/PUT/PATCH almost always need.
304 305 306 307 308 309 310 311 312 |
# File 'lib/scrapetor/fetcher.rb', line 304 def self.post(url, body: nil, form: nil, json: nil, multipart: nil, **opts) if multipart opts[:multipart] = multipart get(url, **opts.merge(method: :post)) else body, opts = build_body(body, form, json, opts) get(url, **opts.merge(method: :post, body: body)) end end |
.put(url, body: nil, form: nil, json: nil, **opts) ⇒ Object
326 327 328 329 |
# File 'lib/scrapetor/fetcher.rb', line 326 def self.put(url, body: nil, form: nil, json: nil, **opts) body, opts = build_body(body, form, json, opts) get(url, **opts.merge(method: :put, body: body)) end |
.retry_eligible?(r, retry_on) ⇒ Boolean
196 197 198 199 |
# File 'lib/scrapetor/fetcher.rb', line 196 def self.retry_eligible?(r, retry_on) return true if r[:error] # network-level r[:status] && retry_on.any? { |s| s == r[:status] } end |
.retryable_response?(resp, retry_statuses) ⇒ Boolean
55 56 57 |
# File 'lib/scrapetor/fetcher.rb', line 55 def self.retryable_response?(resp, retry_statuses) retry_statuses.any? { |s| s == resp[:status] } end |
.revalidate(urls, cache_dir:, **opts) ⇒ Object
Bulk-revalidate cached entries. Issues HEAD with If-None-Match / If-Modified-Since for every URL whose cache entry exists; the server’s 304 / 200 verdict classifies each:
:fresh - server said 304; cache still valid.
:changed - server returned a new 2xx; cache rewritten.
:missing - server returned 4xx (gone / not found).
:error - transport failure.
Returns a Hash[url => Symbol]. Optimal for crawls of N pages over moderate intervals: HEAD round-trips are cheap and run all-concurrent under curl_multi.
237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 |
# File 'lib/scrapetor/fetcher.rb', line 237 def self.revalidate(urls, cache_dir:, **opts) ensure_available! urls = Array(urls).map(&:to_s) return {} if urls.empty? opts[:user_agent] ||= DEFAULT_USER_AGENT results = Scrapetor::Native::Http.multi_fetch(urls, opts.merge(method: :head, cache_dir: cache_dir)) out = {} results.each_with_index do |r, i| url = urls[i] out[url] = if r[:error] :error elsif r[:headers] && r[:headers]["x-scrapetor-cache"] == "hit" :fresh elsif r[:status] && r[:status] >= 400 :missing else :changed end end out end |
.upload_bytes(bytes, filename:, content_type: "application/octet-stream") ⇒ Object
322 323 324 |
# File 'lib/scrapetor/fetcher.rb', line 322 def self.upload_bytes(bytes, filename:, content_type: "application/octet-stream") { data: bytes, filename: filename, content_type: content_type } end |
.upload_file(path, filename: nil, content_type: nil) ⇒ Object
Convenience constructors for multipart values.
315 316 317 318 319 320 |
# File 'lib/scrapetor/fetcher.rb', line 315 def self.upload_file(path, filename: nil, content_type: nil) h = { path: path.to_s } h[:filename] = filename if filename h[:content_type] = content_type if content_type h end |