Module: Scrapetor::Fetcher

Defined in:
lib/scrapetor/fetcher.rb

Overview

Native HTTP/2-capable fetch layer. Wraps the libcurl-backed Scrapetor::Native::Http module. Distinct from Scrapetor::HTTP / Scrapetor.fetch, which is the Net::HTTP-based fallback used by tests, the CLI, and environments without libcurl.

Capabilities depend on the libcurl your gem links against. Inspect Scrapetor::Fetcher.features to see what’s actually wired up (HTTP/2 / brotli / zstd / libz). HTTP/2 + gzip is the typical baseline on macOS; brotli/zstd need a libcurl rebuilt with them.

The connection cache lives on a per-OS-thread libcurl easy handle —repeated fetches to the same host inside a single thread reuse the TLS session and HTTP/2 stream. The fetch itself drops the GVL, so background Ruby threads keep running during the round-trip.

resp = Scrapetor::Fetcher.get("https://api.example.com/items")
resp[:status]        # => 200
resp[:http_version]  # => "2"
resp[:headers]       # => {"content-type" => "application/json", ...}
resp[:body]          # => "..."

doc = Scrapetor::Fetcher.fetch("https://example.com/")
# => Scrapetor::Document parsed from the response body, base_url
#    set to the final URL after redirects.

Defined Under Namespace

Classes: FetchError, NotAvailableError

Constant Summary collapse

DEFAULT_USER_AGENT =
"scrapetor/#{Scrapetor::VERSION} (libcurl)"
DEFAULT_RETRY_STATUSES =

Status codes worth retrying. 408 (timeout), 425 (early), 429 (rate-limit), 500–504 (transient upstream). 5xx >= 505 are protocol-level rejections; they don’t usually heal on retry.

[408, 425, 429, 500, 502, 503, 504].freeze

Class Method Summary collapse

Class Method Details

.available?Boolean

Returns:

  • (Boolean)


68
69
70
71
# File 'lib/scrapetor/fetcher.rb', line 68

def self.available?
  defined?(Scrapetor::Native::Http::AVAILABLE) &&
    Scrapetor::Native::Http::AVAILABLE
end

.backoff_for(attempt, base: 0.3, max: 10.0, retry_after: nil) ⇒ Object

Compute the backoff delay for attempt N (1-indexed). Exponential with full jitter — the AWS-style ‘random between 0 and 2^n * base’ variant — capped at max_backoff.



49
50
51
52
53
# File 'lib/scrapetor/fetcher.rb', line 49

def self.backoff_for(attempt, base: 0.3, max: 10.0, retry_after: nil)
  return [retry_after.to_f, max].min if retry_after && retry_after.to_f > 0
  hi = [base * (2.0**(attempt - 1)), max].min
  rand * hi
end

.build_body(body, form, json, opts) ⇒ Object

Build the request body from one of :body / :form / :json. Returns [body_string, opts_with_content_type_header_set].



346
347
348
349
350
351
352
353
354
355
356
357
358
359
# File 'lib/scrapetor/fetcher.rb', line 346

def self.build_body(body, form, json, opts)
  headers = (opts[:headers] || {}).dup
  if json
    require "json"
    body = JSON.generate(json)
    headers["Content-Type"] ||= "application/json"
  elsif form
    require "uri"
    body = URI.encode_www_form(form)
    headers["Content-Type"] ||= "application/x-www-form-urlencoded"
  end
  opts[:headers] = headers unless headers.empty?
  [body, opts]
end

.delete(url, **opts) ⇒ Object



336
337
338
# File 'lib/scrapetor/fetcher.rb', line 336

def self.delete(url, **opts)
  get(url, **opts.merge(method: :delete))
end

.ensure_available!Object

Raises:



131
132
133
134
135
136
# File 'lib/scrapetor/fetcher.rb', line 131

def self.ensure_available!
  return if available?
  raise NotAvailableError,
        "Scrapetor::Fetcher requires libcurl at build time. " \
        "Reinstall after `brew install curl` / `apt-get install libcurl4-openssl-dev`."
end

.featuresObject



73
74
75
76
# File 'lib/scrapetor/fetcher.rb', line 73

def self.features
  ensure_available!
  Scrapetor::Native::Http.features
end

.fetch(url, raise_for_status: true, **opts) ⇒ Object

Fetch + parse. Raises FetchError on non-2xx status by default; pass raise_for_status: false to inspect non-2xx responses.



120
121
122
123
124
125
126
127
128
129
# File 'lib/scrapetor/fetcher.rb', line 120

def self.fetch(url, raise_for_status: true, **opts)
  resp = get(url, **opts)
  if raise_for_status && (resp[:status] < 200 || resp[:status] >= 400)
    raise FetchError.new(
      "Scrapetor::Fetcher.fetch #{url} -> HTTP #{resp[:status]}",
      status: resp[:status], response: resp
    )
  end
  Scrapetor.parse(resp[:body], base_url: resp[:final_url])
end

.get(url, **opts) ⇒ Object

Single GET with optional retry + exponential backoff.

retry: 0     - try once and return whatever happens (default).
retry: N     - retry up to N times on transient failure.
backoff:     - base backoff in seconds (default 0.3).
max_backoff: - cap on a single sleep (default 10.0).
retry_on:    - statuses to retry (default [408, 425, 429, 500..504]).

Network failures (IOError from libcurl: connect refused, DNS, TLS, timeout) are also retried. The wait between attempts honours Retry-After response headers (numeric form) when present and otherwise uses exponential backoff with full jitter.



90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
# File 'lib/scrapetor/fetcher.rb', line 90

def self.get(url, **opts)
  ensure_available!
  retries        = opts.delete(:retry) || 0
  base           = opts.delete(:backoff) || 0.3
  max_backoff    = opts.delete(:max_backoff) || 10.0
  retry_on       = opts.delete(:retry_on) || DEFAULT_RETRY_STATUSES
  opts[:user_agent] ||= DEFAULT_USER_AGENT
  attempt = 0
  last_err = nil
  loop do
    begin
      resp = Scrapetor::Native::Http.get(url.to_s, opts)
      return resp unless retries > attempt && retryable_response?(resp, retry_on)
      ra = parse_retry_after(resp[:headers])
      sleep backoff_for(attempt + 1, base: base, max: max_backoff,
                        retry_after: ra)
    rescue IOError => e
      last_err = e
      raise e unless retries > attempt
      sleep backoff_for(attempt + 1, base: base, max: max_backoff)
    end
    attempt += 1
  end
rescue IOError
  raise last_err if last_err
  raise
end

.head(url, **opts) ⇒ Object



340
341
342
# File 'lib/scrapetor/fetcher.rb', line 340

def self.head(url, **opts)
  get(url, **opts.merge(method: :head))
end

.multi_each(urls, **opts) ⇒ Object

Streaming variant of multi_get: yields each response as it completes (in completion order, not input order), so the caller can start processing while other transfers are still in flight. Pass parse: true to also parse the body in the worker thread.

Scrapetor::Fetcher.multi_each(urls, threads: 8) do |r|
  puts r[:final_url], r[:status]
  # later transfers may still be on the wire here
end


289
290
291
292
293
294
295
296
297
298
299
# File 'lib/scrapetor/fetcher.rb', line 289

def self.multi_each(urls, **opts)
  return enum_for(:multi_each, urls, **opts) unless block_given?
  ensure_available!
  urls = Array(urls).map(&:to_s)
  return if urls.empty?
  opts[:user_agent] ||= DEFAULT_USER_AGENT
  batch = Scrapetor::Native::Http::MultiBatch.new(urls, opts)
  while (r = batch.next)
    yield r
  end
end

.multi_fetch(urls, **opts) ⇒ Object

multi_get + per-response parse, all under one no-GVL window. Returns Array<Scrapetor::Document | nil>, in input order. Failed entries are nil. Best for high-fan-out crawls where you want parsed Documents back and the I/O outweighs the per-page CPU cost.



266
267
268
269
270
271
272
273
274
275
276
277
278
# File 'lib/scrapetor/fetcher.rb', line 266

def self.multi_fetch(urls, **opts)
  urls = Array(urls).map(&:to_s)
  return [] if urls.empty?
  ensure_available!
  opts[:user_agent] ||= DEFAULT_USER_AGENT
  results = Scrapetor::Native::Http.multi_fetch(urls, opts.merge(parse: true))
  results.map do |r|
    next nil if r[:error]
    native = r[:document]
    next Scrapetor.parse(r[:body], base_url: r[:final_url]) unless native
    Scrapetor::Document.new("", base_url: r[:final_url], native: native)
  end
end

.multi_get(urls, **opts) ⇒ Object

Single-thread curl_multi bulk fetch — one driver thread, one multi handle, N concurrent transfers multiplexed via curl_multi_perform. Complements parallel_get:

parallel_get  - N pthread workers, each running blocking easy.
                Best when each transfer has CPU work after the
                fetch (decode + parse) since the GVL is released
                across the full batch and CPU scales with cores.

multi_get     - one driver thread, N concurrent in-flight.
                Best for I/O-dominated high-fan-out (hundreds of
                URLs across many hosts) where pthread setup
                overhead outweighs the in-flight count.

Both share the same global CURLSH so connections / DNS / TLS sessions pool across them.



217
218
219
220
221
222
223
# File 'lib/scrapetor/fetcher.rb', line 217

def self.multi_get(urls, **opts)
  ensure_available!
  urls = Array(urls).map(&:to_s)
  return [] if urls.empty?
  opts[:user_agent] ||= DEFAULT_USER_AGENT
  Scrapetor::Native::Http.multi_fetch(urls, opts)
end

.parallel_fetch(urls, **opts) ⇒ Object

Convenience: parallel_get + parse each successful response into a Scrapetor::Document. Failed entries return nil.

The parse runs INSIDE the same no-GVL pthread worker that did the fetch (parse: true on the C side), so the whole batch —network + decode + transcode + dom_parse + index build — runs multi-core under a single GVL release. The main thread only wraps already-parsed documents.



369
370
371
372
373
374
375
376
377
378
379
380
381
# File 'lib/scrapetor/fetcher.rb', line 369

def self.parallel_fetch(urls, **opts)
  urls = Array(urls).map(&:to_s)
  return [] if urls.empty?
  ensure_available!
  opts[:user_agent] ||= DEFAULT_USER_AGENT
  results = Scrapetor::Native::Http.parallel_fetch(urls, opts.merge(parse: true))
  results.map do |r|
    next nil if r[:error]
    native = r[:document]
    next Scrapetor.parse(r[:body], base_url: r[:final_url]) unless native
    Scrapetor::Document.new("", base_url: r[:final_url], native: native)
  end
end

.parallel_get(urls, **opts) ⇒ Object

N concurrent GETs across pthread workers, each with a persistent libcurl handle (per-OS-thread connection cache). The full batch runs under one GVL release — other Ruby threads stay live throughout.

results = Scrapetor::Fetcher.parallel_get(urls, threads: 8,
                                           timeout_ms: 5_000)
# results is Array<Hash>; successful entries carry
#   :status, :headers, :body, :final_url, :http_version
# failed entries carry { error: { url:, error: } } only.

N concurrent GETs with optional retry. The native batch returns all results in one pass; failed entries (transient status or network error) get a second batch dispatch under retry, with the previous-attempt’s wait honoured globally — i.e. one sleep between attempts rather than per-URL — so the pool keeps moving.

Per-URL Retry-After headers are read on retryable responses and the maximum of them governs the next inter-attempt sleep, so a rate-limited host pulls the whole pool to its backoff rather than thrashing the rest in parallel.



158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
# File 'lib/scrapetor/fetcher.rb', line 158

def self.parallel_get(urls, **opts)
  ensure_available!
  urls = Array(urls).map(&:to_s)
  return [] if urls.empty?

  retries     = opts.delete(:retry) || 0
  base        = opts.delete(:backoff) || 0.3
  max_backoff = opts.delete(:max_backoff) || 10.0
  retry_on    = opts.delete(:retry_on) || DEFAULT_RETRY_STATUSES
  opts[:user_agent] ||= DEFAULT_USER_AGENT

  results = Array.new(urls.size)
  pending = (0...urls.size).to_a
  attempt = 0
  loop do
    batch = pending.map { |i| urls[i] }
    batch_res = Scrapetor::Native::Http.parallel_fetch(batch, opts)
    next_pending = []
    next_retry_after = nil
    pending.each_with_index do |orig_i, pos|
      r = batch_res[pos]
      if attempt < retries && retry_eligible?(r, retry_on)
        ra = r[:headers] ? parse_retry_after(r[:headers]) : nil
        next_retry_after = ra if ra && (next_retry_after.nil? || ra > next_retry_after)
        next_pending << orig_i
      else
        results[orig_i] = r
      end
    end
    break if next_pending.empty?
    attempt += 1
    sleep backoff_for(attempt, base: base, max: max_backoff,
                      retry_after: next_retry_after)
    pending = next_pending
  end
  results
end

.parse_retry_after(headers) ⇒ Object



59
60
61
62
63
64
65
66
# File 'lib/scrapetor/fetcher.rb', line 59

def self.parse_retry_after(headers)
  v = headers && (headers["retry-after"] || headers["Retry-After"])
  return nil unless v
  # Either integer seconds or HTTP-date. We only honour the
  # integer form (the date form is rare and parsing it adds a
  # dependency on Time.httpdate that the caller may not need).
  v.to_s.strip.match?(/\A\d+\z/) ? v.to_i : nil
end

.patch(url, body: nil, form: nil, json: nil, **opts) ⇒ Object



331
332
333
334
# File 'lib/scrapetor/fetcher.rb', line 331

def self.patch(url, body: nil, form: nil, json: nil, **opts)
  body, opts = build_body(body, form, json, opts)
  get(url, **opts.merge(method: :patch, body: body))
end

.post(url, body: nil, form: nil, json: nil, multipart: nil, **opts) ⇒ Object

Method shorthands. Each is just a ‘.get` invocation with the corresponding method, plus the body sugar that POST/PUT/PATCH almost always need.



304
305
306
307
308
309
310
311
312
# File 'lib/scrapetor/fetcher.rb', line 304

def self.post(url, body: nil, form: nil, json: nil, multipart: nil, **opts)
  if multipart
    opts[:multipart] = multipart
    get(url, **opts.merge(method: :post))
  else
    body, opts = build_body(body, form, json, opts)
    get(url, **opts.merge(method: :post, body: body))
  end
end

.put(url, body: nil, form: nil, json: nil, **opts) ⇒ Object



326
327
328
329
# File 'lib/scrapetor/fetcher.rb', line 326

def self.put(url, body: nil, form: nil, json: nil, **opts)
  body, opts = build_body(body, form, json, opts)
  get(url, **opts.merge(method: :put, body: body))
end

.retry_eligible?(r, retry_on) ⇒ Boolean

Returns:

  • (Boolean)


196
197
198
199
# File 'lib/scrapetor/fetcher.rb', line 196

def self.retry_eligible?(r, retry_on)
  return true if r[:error]                       # network-level
  r[:status] && retry_on.any? { |s| s == r[:status] }
end

.retryable_response?(resp, retry_statuses) ⇒ Boolean

Returns:

  • (Boolean)


55
56
57
# File 'lib/scrapetor/fetcher.rb', line 55

def self.retryable_response?(resp, retry_statuses)
  retry_statuses.any? { |s| s == resp[:status] }
end

.revalidate(urls, cache_dir:, **opts) ⇒ Object

Bulk-revalidate cached entries. Issues HEAD with If-None-Match / If-Modified-Since for every URL whose cache entry exists; the server’s 304 / 200 verdict classifies each:

:fresh    - server said 304; cache still valid.
:changed  - server returned a new 2xx; cache rewritten.
:missing  - server returned 4xx (gone / not found).
:error    - transport failure.

Returns a Hash[url => Symbol]. Optimal for crawls of N pages over moderate intervals: HEAD round-trips are cheap and run all-concurrent under curl_multi.



237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
# File 'lib/scrapetor/fetcher.rb', line 237

def self.revalidate(urls, cache_dir:, **opts)
  ensure_available!
  urls = Array(urls).map(&:to_s)
  return {} if urls.empty?
  opts[:user_agent] ||= DEFAULT_USER_AGENT
  results = Scrapetor::Native::Http.multi_fetch(urls,
    opts.merge(method: :head, cache_dir: cache_dir))
  out = {}
  results.each_with_index do |r, i|
    url = urls[i]
    out[url] =
      if r[:error]
        :error
      elsif r[:headers] && r[:headers]["x-scrapetor-cache"] == "hit"
        :fresh
      elsif r[:status] && r[:status] >= 400
        :missing
      else
        :changed
      end
  end
  out
end

.upload_bytes(bytes, filename:, content_type: "application/octet-stream") ⇒ Object



322
323
324
# File 'lib/scrapetor/fetcher.rb', line 322

def self.upload_bytes(bytes, filename:, content_type: "application/octet-stream")
  { data: bytes, filename: filename, content_type: content_type }
end

.upload_file(path, filename: nil, content_type: nil) ⇒ Object

Convenience constructors for multipart values.



315
316
317
318
319
320
# File 'lib/scrapetor/fetcher.rb', line 315

def self.upload_file(path, filename: nil, content_type: nil)
  h = { path: path.to_s }
  h[:filename] = filename if filename
  h[:content_type] = content_type if content_type
  h
end