Class: Scrapetor::Session

Inherits: Object

Defined in: lib/scrapetor/session.rb

Overview

Stateful HTTP session. Wraps Scrapetor::Fetcher with:

- persistent cookie jar (libcurl COOKIEJAR/COOKIEFILE)
- default headers merged into every request
- basic / bearer auth applied automatically
- per-host rate limiting (polite throttle)
- default retry/backoff
- auto charset transcoding of HTML bodies to UTF-8

session = Scrapetor::Session.new(
  cookies:     true,          # ephemeral tempfile jar
  user_agent:  "MyBot/1.0",
  rate_limit:  0.5,           # min seconds between same-host requests
  retry:       3,
  headers:     { "Accept-Language" => "en-US" },
)
doc = session.fetch("https://example.com/login")
session.post("https://example.com/login", form: { user: "x", pass: "y" })
doc = session.fetch("https://example.com/dashboard")

Cookies set during the login persist for the dashboard call.

Constant Summary

DEFAULT_HEADERS =

{
  "Accept"          => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" => "en-US,en;q=0.5",
}.freeze
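
The constructor combines these defaults with caller-supplied headers via `DEFAULT_HEADERS.merge(headers)`. The sketch below reproduces the constant so it runs without the gem and shows plain `Hash#merge` behaviour only; the `X-Trace` header is a made-up example, and how the fetcher treats two keys that differ only in case is not shown in the source.

```ruby
# Same constant as documented above, reproduced so the sketch runs on its own.
DEFAULT_HEADERS = {
  "Accept"          => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" => "en-US,en;q=0.5",
}.freeze

# Hash#merge returns a new Hash; on a key collision the argument's value wins.
# "X-Trace" is a hypothetical header used only for illustration.
merged = DEFAULT_HEADERS.merge("Accept-Language" => "de-DE", "X-Trace" => "1")
puts merged["Accept-Language"]   # de-DE
puts merged.size                 # 3

# Keys are plain Strings, so a differently cased key is added next to the
# default instead of replacing it.
mixed = DEFAULT_HEADERS.merge("accept-language" => "de-DE")
puts mixed.size                  # 3
```

Passing header names in the same casing as the defaults is therefore the safe way to override one.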

Instance Attribute Summary

#cookie_jar_path ⇒ Object readonly

Path of the cookie jar file, or nil when cookies are disabled.

Class Method Summary

.make_jar_finalizer(path) ⇒ Object

Returns a proc that deletes the file at path if it exists, ignoring errors.

Instance Method Summary

#close ⇒ Object

Deletes the cookie jar file when the jar is marked ephemeral; errors are ignored.
#fetch(url, **opts) ⇒ Object

Performs a GET request and parses the response body into a Document.
#initialize(cookies: true, user_agent: nil, headers: {}, basic_auth: nil, bearer_token: nil, proxy: nil, ca_path: nil, rate_limit: nil, retry: 0, backoff: 0.3, max_backoff: 10.0, timeout_ms: 30_000, follow_redirects: true, insecure: false, transcode_charset: true) ⇒ Session constructor

A new instance of Session.
#parallel_get(urls, **opts) ⇒ Object

parallel_get respects the session’s defaults (cookies, headers, auth, per-host rate limit).

Constructor Details

#initialize(cookies: true, user_agent: nil, headers: {}, basic_auth: nil, bearer_token: nil, proxy: nil, ca_path: nil, rate_limit: nil, retry: 0, backoff: 0.3, max_backoff: 10.0, timeout_ms: 30_000, follow_redirects: true, insecure: false, transcode_charset: true) ⇒ `Session`

Returns a new instance of Session.

# File 'lib/scrapetor/session.rb', line 36

def initialize(cookies: true,
               user_agent: nil,
               headers: {},
               basic_auth: nil,
               bearer_token: nil,
               proxy: nil,
               ca_path: nil,
               rate_limit: nil,
               retry: 0,
               backoff: 0.3,
               max_backoff: 10.0,
               timeout_ms: 30_000,
               follow_redirects: true,
               insecure: false,
               transcode_charset: true)
  Scrapetor::Fetcher.ensure_available!
  @cookie_jar_path =
    case cookies
    when String then cookies
    when true   then ephemeral_jar_path
    when false, nil then nil
    else raise ArgumentError, "cookies: must be String/true/false"
    end
  @defaults = {
    user_agent: user_agent || Scrapetor::Fetcher::DEFAULT_USER_AGENT,
    headers: DEFAULT_HEADERS.merge(headers),
    basic_auth: basic_auth,
    bearer_token: bearer_token,
    proxy: proxy,
    ca_path: ca_path,
    # `retry` is a reserved word in Ruby, so the keyword argument is read via the binding
    retry: binding.local_variable_get(:retry),
    backoff: backoff,
    max_backoff: max_backoff,
    timeout_ms: timeout_ms,
    follow_redirects: follow_redirects,
    insecure: insecure,
  }.compact
  @defaults[:transcode_utf8] = transcode_charset
  # rate_limit is given in seconds; it is stored as whole milliseconds
  @defaults[:rate_limit_ms] = (rate_limit * 1000).to_i if rate_limit
end
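
Two details of the listing above are easy to miss: `.compact` drops only `nil` entries (so `insecure: false` is kept), and `rate_limit` in seconds is truncated to whole milliseconds. A minimal sketch of just that tail of the constructor follows; `sketch_defaults` is a hypothetical stand-in with a reduced option set, not part of Scrapetor.

```ruby
# Hypothetical stand-in for the tail of Session#initialize; not part of Scrapetor.
def sketch_defaults(proxy: nil, ca_path: nil, timeout_ms: 30_000, insecure: false, rate_limit: nil)
  # Hash#compact removes nil values only; false survives.
  defaults = { proxy: proxy, ca_path: ca_path, timeout_ms: timeout_ms, insecure: insecure }.compact
  # Seconds to whole milliseconds; to_i truncates toward zero.
  defaults[:rate_limit_ms] = (rate_limit * 1000).to_i if rate_limit
  defaults
end

p sketch_defaults                      # only :timeout_ms and :insecure remain
p sketch_defaults(rate_limit: 0.5)     # :rate_limit_ms is 500
p sketch_defaults(rate_limit: 0.0004)  # :rate_limit_ms is 0, the fraction is truncated
```

Under these assumptions, a `rate_limit` below one millisecond is stored as 0. Whether the fetcher treats 0 as "no throttle" is not shown in the source.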

Instance Attribute Details

#cookie_jar_path ⇒ `Object` (readonly)

Path of the cookie jar file, or nil when cookies are disabled.



# File 'lib/scrapetor/session.rb', line 34

def cookie_jar_path
  @cookie_jar_path
end

Class Method Details

.make_jar_finalizer(path) ⇒ `Object`

Returns a proc that deletes the file at path if it exists, ignoring errors.



# File 'lib/scrapetor/session.rb', line 121

def self.make_jar_finalizer(path)
  proc { File.delete(path) if File.exist?(path) rescue nil }
end
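
The source does not show where this proc is registered. A likely use is `ObjectSpace.define_finalizer`, and building the proc in a class method is the usual way to keep it from closing over the instance, since a finalizer that references its own object prevents that object from being collected. The sketch below illustrates the idiom with a hypothetical `JarHolder` class; it is not Scrapetor's implementation.

```ruby
require "tmpdir"

# Hypothetical holder class illustrating the idiom; not Scrapetor's Session.
class JarHolder
  attr_reader :path

  # Built in a class method so the proc closes over `path` only, never the instance.
  def self.make_finalizer(path)
    proc { File.delete(path) if File.exist?(path) rescue nil }
  end

  def initialize(path)
    @path = path
    ObjectSpace.define_finalizer(self, self.class.make_finalizer(path))
  end
end

# Create a throwaway file and attach the finalizer to a holder object.
path = File.join(Dir.tmpdir, "jar-sketch-#{Process.pid}.txt")
File.write(path, "")
holder = JarHolder.new(path)

# Calling the proc directly does what the garbage collector would trigger later.
JarHolder.make_finalizer(path).call
puts File.exist?(path)   # false
```

Because the proc checks `File.exist?` first, running it a second time (for example at interpreter exit) is harmless.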

Instance Method Details

#close ⇒ `Object`

Deletes the cookie jar file when the jar is marked ephemeral; errors are ignored.

# File 'lib/scrapetor/session.rb', line 104

def close
  File.delete(@cookie_jar_path) if @cookie_jar_path && File.exist?(@cookie_jar_path) && @ephemeral
rescue StandardError
  # tempfile may have already been GC'd; ignore
end

#fetch(url, **opts) ⇒ `Object`

Performs a GET request and parses the response body into a Document.

Raises:

(Scrapetor::Fetcher::FetchError) when the HTTP status is below 200, or is 400 or higher

# File 'lib/scrapetor/session.rb', line 85

def fetch(url, **opts)
  resp = get(url, **opts)
  raise Scrapetor::Fetcher::FetchError.new(
    "Session.fetch #{url} -> HTTP #{resp[:status]}",
    status: resp[:status], response: resp
  ) if resp[:status] < 200 || resp[:status] >= 400
  Scrapetor.parse(resp[:body], base_url: resp[:final_url])
end
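
The guard in the listing above accepts any status from 200 through 399 and rejects everything else. The sketch below isolates that condition in a hypothetical helper, `fetch_would_raise?`, so it can be checked without the gem; it mirrors the comparison only and does not call Scrapetor.

```ruby
# Hypothetical helper mirroring the guard in Session#fetch; not part of Scrapetor.
def fetch_would_raise?(status)
  status < 200 || status >= 400
end

[100, 200, 204, 304, 399, 400, 404, 503].each do |status|
  outcome = fetch_would_raise?(status) ? "raises FetchError" : "parsed"
  puts "#{status}: #{outcome}"
end
```

Note that a 3xx status that reaches the guard (for instance 304) passes it, so its body is handed to the parser rather than raising.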

#parallel_get(urls, **opts) ⇒ `Object`

parallel_get respects the session’s defaults (cookies, headers, auth, per-host rate limit). The native batch honours rate_limit_ms per-host via a shared C-side throttle table, so N parallel workers hitting one host all queue at that gate while different hosts run concurrently.

# File 'lib/scrapetor/session.rb', line 99

def parallel_get(urls, **opts)
  merged = merge_opts(opts)
  Scrapetor::Fetcher.parallel_get(urls, **merged)
end
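
The per-host gate described above lives in native code and is not shown here. As a rough mental model only, the following plain-Ruby sketch reserves time slots per host; the `HostThrottle` class, its `reserve` method, and the `a.example` / `b.example` hosts are all hypothetical, and the real C-side table may work differently.

```ruby
require "uri"

# Illustrative per-host gate in plain Ruby; a stand-in for the native throttle table.
class HostThrottle
  def initialize(interval_ms)
    @interval = interval_ms / 1000.0
    @next_slot = Hash.new(0.0)   # host => earliest time the next request may start
    @lock = Mutex.new
  end

  # Returns how many seconds a request to `url` issued at time `now` must wait,
  # and reserves that slot so later callers for the same host queue behind it.
  def reserve(url, now)
    host = URI(url).host
    @lock.synchronize do
      start = [now, @next_slot[host]].max
      @next_slot[host] = start + @interval
      start - now
    end
  end
end

gate = HostThrottle.new(500)
p gate.reserve("https://a.example/1", 0.0)   # 0.0
p gate.reserve("https://a.example/2", 0.0)   # 0.5
p gate.reserve("https://b.example/1", 0.0)   # 0.0
p gate.reserve("https://a.example/3", 0.0)   # 1.0
```

Requests to the same host are spaced one interval apart, while a request to a different host starts immediately, which matches the behaviour the description attributes to the native batch.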