Module: Scrapetor::HTTP
- Defined in: lib/scrapetor/http.rb
Overview
Convenience HTTP fetcher built on `Net::HTTP` (Ruby stdlib, so there is no external runtime dependency).

    doc = Scrapetor.fetch("https://example.com/products")
    doc.css(".product").map { |p| p.at(".title").text }

Follows 3xx redirects (up to `MAX_REDIRECTS` by default), sends a default User-Agent that identifies Scrapetor, applies the response’s encoding to the parsed document, and uses the final URL after redirects as `base_url` for absolute-URL helpers.
For production scraping you’ll usually want a full HTTP client (HTTPX, Typhoeus, Faraday) with connection pooling, retries, and cookie storage. `Scrapetor.fetch` is intentionally minimal: it exists so that simple scripts and the CLI don’t need extra dependencies.
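For scripts that stay on the minimal fetcher, retries can be layered on from the outside. The following is only a sketch: `with_retries` and its `sleeper` parameter are hypothetical names, not part of Scrapetor, and the flaky block stands in for a real network call.

```ruby
# Hypothetical helper (not part of Scrapetor): runs the block, retrying on
# StandardError with exponential backoff, and re-raises after the last attempt.
def with_retries(attempts: 3, base_delay: 0.5, sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    raise if tries >= attempts
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Stand-in for a flaky fetch: fails twice, then succeeds.
# The no-op sleeper keeps the demo instant.
calls = 0
result = with_retries(attempts: 3, sleeper: ->(_seconds) {}) do
  calls += 1
  raise IOError, "simulated network failure" if calls < 3
  "ok after #{calls} calls"
end
puts result # ok after 3 calls
```

In a real script the block would wrap a call such as `Scrapetor.fetch(url)`, and the rescue would probably be narrowed to the error classes you actually expect.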
Defined Under Namespace
Classes: FetchError, Response, TooManyRedirects
Constant Summary
- DEFAULT_HEADERS =
    {
      "User-Agent" => "Scrapetor/#{Scrapetor::VERSION} (+https://scrapetor.org)",
      "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Accept-Language" => "en-US,en;q=0.5",
      "Accept-Encoding" => "identity"
    }.freeze
- MAX_REDIRECTS =
    5
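The `.get` source further down this page applies these defaults first and the caller's `headers:` second, so a caller-supplied header replaces the default of the same name. The sketch below reproduces that ordering with plain `Net::HTTP` and makes no network call. `SKETCH_DEFAULTS`, `build_request`, and the `0.0.0` version string are illustrative assumptions, not Scrapetor names.

```ruby
require "net/http"

# Illustrative copy of some defaults; "0.0.0" is a placeholder version.
SKETCH_DEFAULTS = {
  "User-Agent" => "Scrapetor/0.0.0 (+https://scrapetor.org)",
  "Accept-Language" => "en-US,en;q=0.5",
  "Accept-Encoding" => "identity"
}.freeze

# Hypothetical helper mirroring the two header loops in `.get`:
# defaults first, then caller headers (keys and values stringified).
def build_request(path, headers = {})
  req = Net::HTTP::Get.new(path)
  SKETCH_DEFAULTS.each { |k, v| req[k] = v }
  headers.each { |k, v| req[k.to_s] = v.to_s }
  req
end

req = build_request("/products", { "User-Agent" => "my-crawler/1.0", "X-Request-Id" => 42 })
puts req["User-Agent"]      # my-crawler/1.0 (caller value wins)
puts req["Accept-Encoding"] # identity (default kept)
puts req["X-Request-Id"]    # 42 (value converted with to_s)
```

Note that keys are converted with `to_s` only, so a symbol key such as `:accept_language` would become the header name `accept_language`, not `Accept-Language`.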
Class Method Summary
- .fetch(url, **opts) ⇒ Object
  Fetch, parse, and return a `Scrapetor::Document` whose `base_url` is the final URL after redirects.
- .fetch_extract(url, schema, **opts) ⇒ Object
  Fetch and parse the page, then run `extract(schema)` on the resulting document.
- .get(url, headers: {}, follow_redirects: true, max_redirects: MAX_REDIRECTS, open_timeout: 10, read_timeout: 30) ⇒ Object
Class Method Details
.fetch(url, **opts) ⇒ Object
Fetch, parse, and return a `Scrapetor::Document` whose `base_url` is the final URL after redirects.

    # File 'lib/scrapetor/http.rb', line 69
    def self.fetch(url, **opts)
      resp = get(url, **opts)
      Scrapetor.parse(resp.body, base_url: resp.final_url.to_s)
    end

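Using the final URL as `base_url` matters when a redirect changes the host or path, because relative links in the page are relative to where the page was actually served from. The sketch below shows the difference using only `URI.join` from the stdlib. The URLs and the `resolve_link` helper are made up for illustration, and how Scrapetor's own absolute-URL helpers resolve links internally is not shown in this page.

```ruby
require "uri"

# Hypothetical helper: resolve a link found in a page against a base URL.
def resolve_link(base_url, href)
  URI.join(base_url, href).to_s
end

requested = "http://example.com/catalog"            # URL passed to fetch
final     = "https://www.example.com/shop/catalog/" # assumed URL after redirects
href      = "item/42"                               # relative link in the page

puts resolve_link(requested, href) # http://example.com/item/42
puts resolve_link(final, href)     # https://www.example.com/shop/catalog/item/42
```

Only the second result points at the page the server actually links to.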
.fetch_extract(url, schema, **opts) ⇒ Object
Fetch and parse the page as `.fetch` does, then run `extract(schema)` on the resulting document.

    # File 'lib/scrapetor/http.rb', line 75
    def self.fetch_extract(url, schema, **opts)
      resp = get(url, **opts)
      Scrapetor.parse(resp.body, base_url: resp.final_url.to_s).extract(schema)
    end

.get(url, headers: {}, follow_redirects: true, max_redirects: MAX_REDIRECTS, open_timeout: 10, read_timeout: 30) ⇒ Object

    # File 'lib/scrapetor/http.rb', line 34
    def self.get(url, headers: {}, follow_redirects: true, max_redirects: MAX_REDIRECTS, open_timeout: 10, read_timeout: 30)
      uri = URI(url.to_s)
      raise FetchError, "unsupported scheme: #{uri.scheme.inspect}" unless %w[http https].include?(uri.scheme)

      hops = 0
      loop do
        req = Net::HTTP::Get.new(uri.request_uri)
        # Defaults first, so caller-supplied headers override them.
        DEFAULT_HEADERS.each { |k, v| req[k] = v }
        headers.each { |k, v| req[k.to_s] = v.to_s }

        net = Net::HTTP.new(uri.host, uri.port)
        net.use_ssl = (uri.scheme == "https")
        net.open_timeout = open_timeout
        net.read_timeout = read_timeout
        resp = net.start { |h| h.request(req) }

        case resp
        when Net::HTTPSuccess
          return Response.new(resp, uri)
        when Net::HTTPRedirection
          raise TooManyRedirects, "exceeded #{max_redirects} redirects" if hops >= max_redirects
          raise FetchError, "redirect with no Location header" unless resp["location"]

          # Resolve a possibly relative Location against the current URL.
          uri = URI.join(uri.to_s, resp["location"])
          hops += 1
          next if follow_redirects

          return Response.new(resp, uri)
        else
          raise FetchError, "HTTP #{resp.code} #{resp.message} for #{uri}"
        end
      end
    end

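Two steps in `.get` rely on stdlib `URI` behavior that is easy to check in isolation: the scheme guard at the top, and the `URI.join` call that turns a `Location` header into the next request URL. The sketch below exercises both without touching the network. The helper names `supported_scheme?` and `next_location` are hypothetical and exist only for this example.

```ruby
require "uri"

# Same test as the guard at the top of `.get`: only http and https pass.
def supported_scheme?(url)
  %w[http https].include?(URI(url.to_s).scheme)
end

# Same resolution as the redirect branch: Location may be absolute,
# root-relative, or relative to the current URL.
def next_location(current_url, location)
  URI.join(current_url, location).to_s
end

puts supported_scheme?("https://example.com/")  # true
puts supported_scheme?("ftp://example.com/")    # false
puts supported_scheme?("example.com/products")  # false (scheme is nil)

current = "https://example.com/shop/items?page=1"
puts next_location(current, "/login")                 # https://example.com/login
puts next_location(current, "page2")                  # https://example.com/shop/page2
puts next_location(current, "http://other.example/x") # http://other.example/x
```

A URL without a scheme, such as `example.com/products`, parses with a `nil` scheme, so `.get` would reject it with `FetchError` instead of guessing `http`.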