Class: Scrapetor::Session

Inherits:
Object
Defined in:
lib/scrapetor/session.rb

Overview

Stateful HTTP session. Wraps Scrapetor::Fetcher with:

- persistent cookie jar (libcurl COOKIEJAR/COOKIEFILE)
- default headers merged into every request
- basic / bearer auth applied automatically
- per-host rate limiting (polite throttle)
- default retry/backoff
- auto charset transcoding of HTML bodies to UTF-8

session = Scrapetor::Session.new(
  cookies:     true,          # ephemeral tempfile jar
  user_agent:  "MyBot/1.0",
  rate_limit:  0.5,           # min seconds between same-host requests
  retry:       3,
  headers:     { "Accept-Language" => "en-US" },
)
doc = session.fetch("https://example.com/login")
session.post("https://example.com/login", form: { user: "x", pass: "y" })
doc = session.fetch("https://example.com/dashboard")

Cookies set during the login persist for the dashboard call.

Constant Summary

DEFAULT_HEADERS =
{
  "Accept"          => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" => "en-US,en;q=0.5",
}.freeze
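
The constructor merges caller-supplied headers over these defaults with `DEFAULT_HEADERS.merge(headers)`, so a caller's value replaces a default with the same key. A minimal standalone sketch of that merge follows; it copies the constant so it runs without Scrapetor, and `merged_headers` and the `X-Trace` header are hypothetical names used only for illustration.

```ruby
# Copy of the documented constant, so the sketch has no Scrapetor dependency.
DEFAULT_HEADERS = {
  "Accept"          => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" => "en-US,en;q=0.5",
}.freeze

# Hypothetical helper mirroring `DEFAULT_HEADERS.merge(headers)`.
# Hash#merge returns a new, unfrozen Hash; on key collisions the argument wins.
def merged_headers(custom)
  DEFAULT_HEADERS.merge(custom)
end

headers = merged_headers("Accept-Language" => "en-US", "X-Trace" => "1")
puts headers["Accept-Language"]  # the caller's value, not the default
puts headers["Accept"]           # default kept because the caller did not set it
```

Because `merge` returns a new Hash, the frozen constant itself is never modified.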

Constructor Details

#initialize(cookies: true, user_agent: nil, headers: {}, basic_auth: nil, bearer_token: nil, proxy: nil, ca_path: nil, rate_limit: nil, retry: 0, backoff: 0.3, max_backoff: 10.0, timeout_ms: 30_000, follow_redirects: true, insecure: false, transcode_charset: true) ⇒ Session

Returns a new instance of Session.



# File 'lib/scrapetor/session.rb', line 36

def initialize(cookies: true,
               user_agent: nil,
               headers: {},
               basic_auth: nil,
               bearer_token: nil,
               proxy: nil,
               ca_path: nil,
               rate_limit: nil,
               retry: 0,
               backoff: 0.3,
               max_backoff: 10.0,
               timeout_ms: 30_000,
               follow_redirects: true,
               insecure: false,
               transcode_charset: true)
  Scrapetor::Fetcher.ensure_available!
  @cookie_jar_path =
    case cookies
    when String then cookies
    when true   then ephemeral_jar_path
    when false, nil then nil
    else raise ArgumentError, "cookies: must be String/true/false"
    end
  @defaults = {
    user_agent: user_agent || Scrapetor::Fetcher::DEFAULT_USER_AGENT,
    headers: DEFAULT_HEADERS.merge(headers),
    basic_auth: basic_auth,
    bearer_token: bearer_token,
    proxy: proxy,
    ca_path: ca_path,
    retry: binding.local_variable_get(:retry),
    backoff: backoff,
    max_backoff: max_backoff,
    timeout_ms: timeout_ms,
    follow_redirects: follow_redirects,
    insecure: insecure,
  }.compact
  @defaults[:transcode_utf8] = transcode_charset
  @defaults[:rate_limit_ms] = (rate_limit * 1000).to_i if rate_limit
end
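
Two of the constructor's conversions are easy to miss: `cookies:` accepts a String, `true`, `false`, or `nil` and rejects anything else, and `rate_limit:` is given in seconds but stored as integer milliseconds. The standalone sketch below mirrors just those two steps. The helper names are hypothetical, and `sketch_ephemeral_jar_path` is only a stand-in, since the real `ephemeral_jar_path` is not shown on this page.

```ruby
require "tmpdir"

# Stand-in for the private ephemeral_jar_path helper (assumption: the real one
# is not documented here); this version just builds a path under the temp dir.
def sketch_ephemeral_jar_path
  File.join(Dir.tmpdir, "session-sketch-#{Process.pid}.jar")
end

# Mirrors the constructor's `case cookies` branch.
def resolve_cookie_jar(cookies)
  case cookies
  when String     then cookies
  when true       then sketch_ephemeral_jar_path
  when false, nil then nil
  else raise ArgumentError, "cookies: must be String/true/false"
  end
end

# Mirrors `(rate_limit * 1000).to_i if rate_limit`: seconds in, milliseconds out.
def rate_limit_ms(rate_limit)
  (rate_limit * 1000).to_i if rate_limit
end

p resolve_cookie_jar("/tmp/my.jar")  # explicit path is used as given
p resolve_cookie_jar(false)          # cookies disabled, no jar path
p rate_limit_ms(0.5)                 # 500
p rate_limit_ms(nil)                 # nil, no throttle configured
```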

Instance Attribute Details

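#cookie_jar_path ⇒ Object

Returns the value of attribute cookie_jar_path: the String passed as cookies:, the ephemeral jar path when cookies: is true, or nil when cookies are disabled.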



# File 'lib/scrapetor/session.rb', line 34

def cookie_jar_path
  @cookie_jar_path
end

Class Method Details

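.make_jar_finalizer(path) ⇒ Object

Returns a proc that deletes the file at path if it still exists. Any error raised during deletion is rescued and ignored.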



# File 'lib/scrapetor/session.rb', line 121

def self.make_jar_finalizer(path)
  proc { File.delete(path) if File.exist?(path) rescue nil }
end
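
Building the proc in a class method means it closes over `path` only and not over an instance, which matters if it is registered with `ObjectSpace.define_finalizer`: a finalizer that references its own object keeps that object from being collected. This page does not show where the finalizer is registered, so the sketch below is an assumption about typical usage. `JarOwner` is a hypothetical stand-in class, and the finalizer is invoked by hand so the effect is observable without waiting for GC.

```ruby
require "tmpdir"

# Hypothetical owner of a temp file, standing in for a session with an ephemeral jar.
class JarOwner
  attr_reader :path

  # Same shape as the documented method: the proc captures `path`, never `self`.
  def self.make_jar_finalizer(path)
    proc { File.delete(path) if File.exist?(path) rescue nil }
  end

  def initialize(path)
    @path = path
    # Register cleanup to run when this object is garbage-collected.
    ObjectSpace.define_finalizer(self, self.class.make_jar_finalizer(path))
  end
end

path = File.join(Dir.tmpdir, "jar-sketch-#{Process.pid}-#{rand(1_000_000)}.txt")
File.write(path, "")

finalizer = JarOwner.make_jar_finalizer(path)
finalizer.call(0)        # GC would pass the object id; called manually here
puts File.exist?(path)   # false
finalizer.call(0)        # harmless when the file is already gone
```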

Instance Method Details

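#close ⇒ Object

Deletes the cookie jar file, but only when the jar is ephemeral and still exists on disk. Errors raised during deletion are ignored.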



# File 'lib/scrapetor/session.rb', line 104

def close
  File.delete(@cookie_jar_path) if @cookie_jar_path && File.exist?(@cookie_jar_path) && @ephemeral
rescue StandardError
  # tempfile may have already been GC'd; ignore
end

#fetch(url, **opts) ⇒ Object

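Performs a GET and parses the response body into a Document. Raises Scrapetor::Fetcher::FetchError when the response status is below 200 or is 400 or above.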



# File 'lib/scrapetor/session.rb', line 85

def fetch(url, **opts)
  resp = get(url, **opts)
  raise Scrapetor::Fetcher::FetchError.new(
    "Session.fetch #{url} -> HTTP #{resp[:status]}",
    status: resp[:status], response: resp
  ) if resp[:status] < 200 || resp[:status] >= 400
  Scrapetor.parse(resp[:body], base_url: resp[:final_url])
end
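
The status guard in `#fetch` treats every code from 200 through 399 as success and raises for anything else, attaching the status and the full response to the error. The standalone sketch below reproduces that guard. `SketchFetchError` and `check_status!` are hypothetical names; the error class only imitates the keyword arguments seen at the call site above and is not the real `Scrapetor::Fetcher::FetchError`.

```ruby
# Hypothetical stand-in for Scrapetor::Fetcher::FetchError, accepting the
# same keywords (status:, response:) that the documented call site passes.
class SketchFetchError < StandardError
  attr_reader :status, :response

  def initialize(message, status:, response:)
    super(message)
    @status = status
    @response = response
  end
end

# Mirrors the guard in #fetch: statuses outside 200...400 raise.
def check_status!(url, resp)
  status = resp[:status]
  return resp if status >= 200 && status < 400

  raise SketchFetchError.new("Session.fetch #{url} -> HTTP #{status}",
                             status: status, response: resp)
end

begin
  check_status!("https://example.com/missing", { status: 404, body: "" })
rescue SketchFetchError => e
  # The caller can branch on e.status or inspect e.response.
  puts "#{e.message} (status #{e.status})"
end
```

Since the error carries the response, a caller can rescue it and still read the body of, say, a 404 page.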

#parallel_get(urls, **opts) ⇒ Object

parallel_get respects the session’s defaults (cookies, headers, auth, per-host rate limit). The native batch honours rate_limit_ms per host via a shared C-side throttle table, so N parallel workers hitting one host all queue at that gate, while requests to different hosts run concurrently.



# File 'lib/scrapetor/session.rb', line 99

def parallel_get(urls, **opts)
  merged = merge_opts(opts)
  Scrapetor::Fetcher.parallel_get(urls, **merged)
end
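
To illustrate the per-host gate described above, here is a small pure-Ruby sketch of a throttle table keyed by host. It is not the native implementation, which this page only describes as C-side. `HostThrottle`, `reserve`, and the injectable `clock:` argument are hypothetical names, and the sketch only computes how long a caller should wait instead of sleeping.

```ruby
require "uri"

# Illustrative per-host gate: one "next free slot" timestamp per host,
# guarded by a mutex so concurrent workers see a consistent table.
class HostThrottle
  def initialize(min_interval_ms,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @min_interval = min_interval_ms / 1000.0
    @clock = clock
    @next_slot = {}   # host => earliest time the next request may start
    @lock = Mutex.new
  end

  # Reserves the next slot for the URL's host and returns the seconds to wait.
  def reserve(url)
    host = URI(url).host
    @lock.synchronize do
      now = @clock.call
      slot = [@next_slot.fetch(host, now), now].max
      @next_slot[host] = slot + @min_interval
      slot - now
    end
  end
end

now = 100.0                                   # fixed clock for a repeatable demo
throttle = HostThrottle.new(500, clock: -> { now })
p throttle.reserve("https://a.example/1")     # 0.0, first request to this host
p throttle.reserve("https://a.example/2")     # 0.5, queued behind the first
p throttle.reserve("https://b.example/1")     # 0.0, a different host is not delayed
```

Each worker would sleep for the returned duration before issuing its request, so requests to one host are spaced out while other hosts proceed immediately.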