Class: Scrapetor::Session
- Inherits: Object
  - Object
    - Scrapetor::Session
- Defined in: lib/scrapetor/session.rb
Overview
Stateful HTTP session. Wraps Scrapetor::Fetcher with:
- persistent cookie jar (libcurl COOKIEJAR/COOKIEFILE)
- default headers merged into every request
- basic / bearer auth applied automatically
- per-host rate limiting (polite throttle)
- default retry/backoff
- auto charset transcoding of HTML bodies to UTF-8
session = Scrapetor::Session.new(
  cookies: true,          # ephemeral tempfile jar
  user_agent: "MyBot/1.0",
  rate_limit: 0.5,        # min seconds between same-host requests
  retry: 3,
  headers: { "Accept-Language" => "en-US" },
)
doc = session.fetch("https://example.com/login")
session.post("https://example.com/login", form: { user: "x", pass: "y" })
doc = session.fetch("https://example.com/dashboard")
Cookies set during the login persist for the dashboard call.
Constant Summary
- DEFAULT_HEADERS =
  {
    "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" => "en-US,en;q=0.5",
  }.freeze
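The constructor combines these defaults with the caller's `headers:` through `DEFAULT_HEADERS.merge(headers)`. The following is a minimal sketch of what that merge does, using a local copy of the constant so it runs without the gem; the `X-Trace` header is a hypothetical example and not part of Scrapetor.

```ruby
# Local copy of the DEFAULT_HEADERS constant documented above,
# reproduced here only so the sketch is self-contained.
DEFAULT_HEADERS = {
  "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" => "en-US,en;q=0.5",
}.freeze

# Caller-supplied headers; "X-Trace" is a made-up header for illustration.
custom = { "Accept-Language" => "en-US", "X-Trace" => "1" }

# Hash#merge returns a new, unfrozen hash; on a key collision the
# argument's value (the caller's header) wins.
merged = DEFAULT_HEADERS.merge(custom)

puts merged["Accept-Language"]
puts merged.key?("Accept")
```

Because the keys are plain strings, the merge is case-sensitive: passing `"accept-language"` would add a second entry rather than replace the default. How the underlying fetcher treats such duplicates is not covered by this page.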
Instance Attribute Summary
-
#cookie_jar_path ⇒ Object
readonly
Returns the value of attribute cookie_jar_path.
Class Method Summary
- .make_jar_finalizer(path) ⇒ Object
Instance Method Summary
- #close ⇒ Object
-
#fetch(url, **opts) ⇒ Object
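Performs a GET and parses the response body into a Document. Raises Scrapetor::Fetcher::FetchError when the status is below 200 or at least 400.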
-
#initialize(cookies: true, user_agent: nil, headers: {}, basic_auth: nil, bearer_token: nil, proxy: nil, ca_path: nil, rate_limit: nil, retry: 0, backoff: 0.3, max_backoff: 10.0, timeout_ms: 30_000, follow_redirects: true, insecure: false, transcode_charset: true) ⇒ Session
constructor
A new instance of Session.
-
#parallel_get(urls, **opts) ⇒ Object
parallel_get respects the session’s defaults (cookies, headers, auth, per-host rate limit).
Constructor Details
#initialize(cookies: true, user_agent: nil, headers: {}, basic_auth: nil, bearer_token: nil, proxy: nil, ca_path: nil, rate_limit: nil, retry: 0, backoff: 0.3, max_backoff: 10.0, timeout_ms: 30_000, follow_redirects: true, insecure: false, transcode_charset: true) ⇒ Session
Returns a new instance of Session.
# File 'lib/scrapetor/session.rb', line 36
def initialize(cookies: true, user_agent: nil, headers: {},
               basic_auth: nil, bearer_token: nil,
               proxy: nil, ca_path: nil,
               rate_limit: nil, retry: 0, backoff: 0.3, max_backoff: 10.0,
               timeout_ms: 30_000, follow_redirects: true, insecure: false,
               transcode_charset: true)
  Scrapetor::Fetcher.ensure_available!
  # cookies: accepts a jar path (String), true (ephemeral jar) or false/nil (no jar)
  @cookie_jar_path = case cookies
                     when String then cookies
                     when true then ephemeral_jar_path
                     when false, nil then nil
                     else raise ArgumentError, "cookies: must be String/true/false"
                     end
  @defaults = {
    user_agent: user_agent || Scrapetor::Fetcher::DEFAULT_USER_AGENT,
    headers: DEFAULT_HEADERS.merge(headers),
    basic_auth: basic_auth,
    bearer_token: bearer_token,
    proxy: proxy,
    ca_path: ca_path,
    # retry is a reserved word, so the keyword argument is read through the binding
    retry: binding.local_variable_get(:retry),
    backoff: backoff,
    max_backoff: max_backoff,
    timeout_ms: timeout_ms,
    follow_redirects: follow_redirects,
    insecure: insecure,
  }.compact # drop options left at nil
  @defaults[:transcode_utf8] = transcode_charset
  # rate_limit is given in seconds; it is stored as whole milliseconds
  @defaults[:rate_limit_ms] = (rate_limit * 1000).to_i if rate_limit
end
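Three details of the constructor are easy to miss: `retry` is a Ruby keyword, so the argument has to be read with `binding.local_variable_get(:retry)`; `.compact` removes every option left at `nil`; and `rate_limit` (seconds) is stored as integer milliseconds. The sketch below isolates those three steps in a stand-alone method. `sketch_defaults` is a hypothetical name, not part of Scrapetor, and it covers only a subset of the real options.

```ruby
# Hypothetical helper that mirrors a subset of Session#initialize.
def sketch_defaults(proxy: nil, rate_limit: nil, retry: 0)
  defaults = {
    proxy: proxy,
    # `retry` alone would be parsed as the retry keyword, so the
    # argument is fetched by name from the current binding.
    retry: binding.local_variable_get(:retry),
  }.compact # nil-valued options (here: proxy) are dropped

  # Seconds are converted to whole milliseconds; to_i truncates.
  defaults[:rate_limit_ms] = (rate_limit * 1000).to_i if rate_limit
  defaults
end

p sketch_defaults
p sketch_defaults(rate_limit: 0.5, retry: 3)
```

Note that `.compact` only drops `nil`: an explicit `false` (for example `follow_redirects: false` in the real constructor) is kept.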
Instance Attribute Details
#cookie_jar_path ⇒ Object (readonly)
Returns the value of attribute cookie_jar_path.
# File 'lib/scrapetor/session.rb', line 34
def cookie_jar_path
  @cookie_jar_path
end
Class Method Details
.make_jar_finalizer(path) ⇒ Object
# File 'lib/scrapetor/session.rb', line 121
def self.make_jar_finalizer(path)
  proc { File.delete(path) if File.exist?(path) rescue nil }
end
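Building the finalizer in a class method means the proc closes over `path` only and not over the session object, which is the usual way to keep a finalizer from pinning the object it is attached to. The page does not show where the finalizer is registered, so the following is a sketch under that assumption: `JarOwner` and its temp-file naming are hypothetical stand-ins, while `ObjectSpace.define_finalizer` is the standard Ruby call.

```ruby
require "tmpdir"
require "securerandom"

# Hypothetical stand-in for an object that owns a temporary cookie jar.
class JarOwner
  attr_reader :path

  # Same shape as Session.make_jar_finalizer: the proc captures `path`
  # only, never `self`, so the owner can still be garbage-collected.
  def self.make_finalizer(path)
    proc { File.delete(path) if File.exist?(path) rescue nil }
  end

  def initialize
    @path = File.join(Dir.tmpdir, "jar-#{SecureRandom.hex(4)}.txt")
    File.write(@path, "")
    ObjectSpace.define_finalizer(self, self.class.make_finalizer(@path))
  end
end

owner = JarOwner.new
puts File.exist?(owner.path)

# Finalizers are called with the object id; the proc ignores it.
# Calling it by hand shows the cleanup without waiting for the GC.
JarOwner.make_finalizer(owner.path).call(owner.object_id)
puts File.exist?(owner.path)
```

Finalizers run at an unspecified time, so calling `#close` explicitly remains the predictable way to remove an ephemeral jar.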
Instance Method Details
#close ⇒ Object
# File 'lib/scrapetor/session.rb', line 104
def close
  File.delete(@cookie_jar_path) if @cookie_jar_path && File.exist?(@cookie_jar_path) && @ephemeral
rescue StandardError
  # tempfile may have already been GC'd; ignore
end
#fetch(url, **opts) ⇒ Object
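Performs a GET and parses the response body into a Document. Raises Scrapetor::Fetcher::FetchError when the status is below 200 or at least 400.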
# File 'lib/scrapetor/session.rb', line 85
def fetch(url, **opts)
  resp = get(url, **opts)
  raise Scrapetor::Fetcher::FetchError.new(
    "Session.fetch #{url} -> HTTP #{resp[:status]}",
    status: resp[:status], response: resp
  ) if resp[:status] < 200 || resp[:status] >= 400
  Scrapetor.parse(resp[:body], base_url: resp[:final_url])
end
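The status gate in `#fetch` accepts anything from 200 to 399 and raises otherwise. The sketch below reproduces only that check so it can be run without the gem. `SketchFetchError` and `check_status!` are hypothetical names; in particular, whether the real `FetchError` exposes `status` and `response` readers is an assumption based on its constructor arguments.

```ruby
# Hypothetical stand-in for Scrapetor::Fetcher::FetchError.
class SketchFetchError < StandardError
  attr_reader :status, :response

  def initialize(message, status:, response:)
    super(message)
    @status = status
    @response = response
  end
end

# Returns the response hash for 2xx/3xx statuses, raises for anything else.
def check_status!(url, resp)
  status = resp[:status]
  return resp if status >= 200 && status < 400

  raise SketchFetchError.new("fetch #{url} -> HTTP #{status}",
                             status: status, response: resp)
end

begin
  check_status!("https://example.com/missing", { status: 404, body: "" })
rescue SketchFetchError => e
  puts "#{e.message} (status #{e.status})"
end
```

A 3xx status passes the gate, which matters mainly when `follow_redirects` is disabled and a redirect response reaches the parser unresolved.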
#parallel_get(urls, **opts) ⇒ Object
parallel_get respects the session’s defaults (cookies, headers, auth, per-host rate limit). The native batch honours rate_limit_ms per-host via a shared C-side throttle table, so N parallel workers hitting one host all queue at that gate while different hosts run concurrently.
# File 'lib/scrapetor/session.rb', line 99
def parallel_get(urls, **opts)
  merged = merge_opts(opts)
  Scrapetor::Fetcher.parallel_get(urls, **merged)
end
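The per-host gate described above lives in native code and is not shown on this page. As a rough mental model only, the following pure-Ruby sketch keeps a table of "next free slot" per host and hands out waiting times; `HostThrottle` and `reserve` are hypothetical names, and the real C-side table may work differently. The clock is passed in as an argument so the behaviour can be checked without sleeping.

```ruby
require "uri"

# Hypothetical model of a per-host throttle table.
class HostThrottle
  def initialize(min_interval_s)
    @min_interval = min_interval_s
    @next_slot = Hash.new(0.0) # host => earliest time the next request may start
    @lock = Mutex.new          # workers share one table, so updates are serialized
  end

  # Reserves the next slot for the URL's host and returns how many
  # seconds the caller should wait, given the current time `now`.
  def reserve(url, now)
    host = URI(url).host
    @lock.synchronize do
      start = [now, @next_slot[host]].max
      @next_slot[host] = start + @min_interval
      start - now
    end
  end
end

throttle = HostThrottle.new(0.5)
now = 10.0
p throttle.reserve("https://a.example/1", now) # first request to a.example
p throttle.reserve("https://a.example/2", now) # queues behind the first
p throttle.reserve("https://b.example/1", now) # different host, separate gate
```

In this model, requests to the same host are spaced by the minimum interval while other hosts proceed immediately, which matches the behaviour the description attributes to `rate_limit_ms`.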