Class: Grubby

Inherits: Mechanize

Grubby < Mechanize < Object

Defined in: lib/grubby.rb

Defined Under Namespace

Classes: JsonParser, JsonScraper, PageScraper, Scraper

Constant Summary

VERSION = GRUBBY_VERSION

Class Attribute Summary

.logger ⇒ Logger

Logger used by Grubby.

Instance Attribute Summary

#journal ⇒ Pathname?

Journal file used to ensure only-once processing of resources by #fulfill across multiple program runs.

#time_between_requests ⇒ Integer, ...

The minimum amount of time enforced between requests, in seconds.

Instance Method Summary

#fulfill(uri, purpose = "") {|resource| ... } ⇒ Object?

Ensures only-once processing of the resource indicated by uri for the specified purpose.

#get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {}) ⇒ Mechanize::Page, ...

Calls #get with each of mirror_uris until a successful (“200 OK”) response is received, and returns that #get result.

#initialize(journal = nil) ⇒ Grubby constructor

A new instance of Grubby.

#ok?(uri, query_params = {}, headers = {}) ⇒ Boolean

Calls #head and returns true if a response code “200” is received, false otherwise.

Constructor Details

#initialize(journal = nil) ⇒ `Grubby`

Returns a new instance of Grubby.

Parameters:

journal (Pathname, String) (defaults to: nil): optional journal file used to ensure only-once processing of resources by #fulfill across multiple program runs.

# File 'lib/grubby.rb', line 54

def initialize(journal = nil)
  super()

  # Prevent "memory leaks", and prevent mistakenly blank urls from
  # resolving.  (Blank urls resolve as a path relative to the last
  # history entry.  Without this setting, an erroneous `agent.get("")`
  # could sometimes successfully fetch a page.)
  self.max_history = 0

  # Prevent files of unforeseen content type from being buffered into
  # memory by default, in case they are very large.  However, increase
  # the threshold for what is considered "large", to prevent
  # unnecessary writes to disk.
  #
  # References:
  #   - http://docs.seattlerb.org/mechanize/Mechanize/PluggableParser.html
  #   - http://docs.seattlerb.org/mechanize/Mechanize/Download.html
  #   - http://docs.seattlerb.org/mechanize/Mechanize/File.html
  self.max_file_buffer = 1_000_000 # only applies to Mechanize::Download
  self.pluggable_parser.default = Mechanize::Download
  self.pluggable_parser["text/plain"] = Mechanize::File
  self.pluggable_parser["application/json"] = Grubby::JsonParser

  # Set up configurable rate limiting, and choose a reasonable default
  # rate limit.
  self.pre_connect_hooks << Proc.new{ self.send(:sleep_between_requests) }
  self.post_connect_hooks << Proc.new do |agent, uri, response, body|
    self.send(:mark_last_request_time, (Time.now unless response.code.to_s.start_with?("3")))
  end
  self.time_between_requests = 1.0

  self.journal = journal
end

Class Attribute Details

.logger ⇒ `Logger`

Logger used by Grubby.

Returns:

(Logger)

# File 'lib/grubby.rb', line 27

def logger
  @logger ||= Logger.new($stderr).tap do |logger|
    logger.formatter = -> (severity, time, progname, msg) do
      "[#{time.strftime "%Y-%m-%d %H:%M:%S"}] #{severity} #{msg}\n"
    end
  end
end
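
The formatter above produces one line per message, with a timestamp followed by the severity and the message. A minimal sketch of that line format, using only Ruby's standard Logger and writing to a StringIO instead of $stderr (the StringIO target and the sample message are assumptions for illustration):

```ruby
require "logger"
require "stringio"

# Capture output in memory rather than writing to $stderr.
output = StringIO.new
logger = Logger.new(output)

# Same line format as the formatter shown above.
logger.formatter = ->(severity, time, progname, msg) do
  "[#{time.strftime("%Y-%m-%d %H:%M:%S")}] #{severity} #{msg}\n"
end

logger.info("Fetch https://example.com/posts")

# output.string now holds a line shaped like:
#   [YYYY-MM-DD HH:MM:SS] INFO Fetch https://example.com/posts
```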

Instance Attribute Details

#journal ⇒ `Pathname`?

Journal file used to ensure only-once processing of resources by #fulfill across multiple program runs.

Returns:

(Pathname, nil)



# File 'lib/grubby.rb', line 49

def journal
  @journal
end

#time_between_requests ⇒ `Integer`, ...

The minimum amount of time enforced between requests, in seconds. If the value is a Range, a random number within the Range is chosen for each request.

Returns:

(Integer, Float, Range<Integer>, Range<Float>)



# File 'lib/grubby.rb', line 43

def time_between_requests
  @time_between_requests
end
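
How a Range value might be turned into a concrete delay can be sketched as follows. This is an illustration only: `resolve_delay` is a hypothetical helper, not part of Grubby's API, and the library's private rate-limiting code may differ.

```ruby
# Hypothetical helper: resolves a configured delay (a number or a Range)
# to a concrete number of seconds for one request.
def resolve_delay(time_between_requests)
  case time_between_requests
  when Range
    # Kernel#rand accepts a Range and returns a random value within it
    # (an Integer for Integer ranges, a Float for Float ranges).
    rand(time_between_requests)
  else
    time_between_requests
  end
end

resolve_delay(1.0)       # always 1.0
resolve_delay(2..5)      # a random Integer between 2 and 5
resolve_delay(0.5..1.5)  # a random Float between 0.5 and 1.5
```

A Range is useful when a fixed interval between requests would look too regular to the remote server.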

Instance Method Details

#fulfill(uri, purpose = "") {|resource| ... } ⇒ `Object`?

Ensures only-once processing of the resource indicated by uri for the specified purpose. The given block is executed and its result is returned if and only if the Grubby instance has not recorded a previous call to #fulfill for the same resource and purpose. Otherwise, the block is not executed and nil is returned.

Note that the resource is identified by both its URI and its content hash. The latter prevents superfluous and rearranged URI query string parameters from interfering with only-once processing.

If #journal is set, and if the block does not raise an exception, the resource and purpose are logged to the journal file. This enables only-once processing across multiple program runs. It also provides a means to resume batch processing after an unexpected termination.

Examples:

grubby = Grubby.new

grubby.fulfill("https://example.com/posts") do |page|
  "first time"
end
# == "first time"

grubby.fulfill("https://example.com/posts") do |page|
  "already seen" # not evaluated
end
# == nil

grubby.fulfill("https://example.com/posts?page=1") do |page|
  "already seen content hash" # not evaluated
end
# == nil

grubby.fulfill("https://example.com/posts", "again!") do |page|
  "already seen, but new purpose"
end
# == "already seen, but new purpose"

Parameters:

uri (URI, String)
purpose (String) (defaults to: "")

Yield Parameters:

resource (Mechanize::Page, Mechanize::File, Mechanize::Download, ...)

Yield Returns:

(Object)

Returns:

(Object, nil)

Raises:

(Mechanize::ResponseCodeError): if fetching the resource results in an error (see Mechanize#get)

# File 'lib/grubby.rb', line 210

def fulfill(uri, purpose = "")
  series = []

  uri = uri.to_absolute_uri
  return unless add_fulfilled(uri, purpose, series)

  normalized_uri = normalize_uri(uri)
  return unless add_fulfilled(normalized_uri, purpose, series)

  Grubby.logger.info("Fetch #{normalized_uri}")
  resource = get(normalized_uri)
  unprocessed = add_fulfilled(resource.uri, purpose, series) &
    add_fulfilled("content hash: #{resource.content_hash}", purpose, series)

  result = yield resource if unprocessed

  CSV.open(journal, "a") do |csv|
    series.each{|entry| csv << entry }
  end if journal

  result
end
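
The journal bookkeeping described above can be illustrated without any network access. The following is a simplified sketch, not Grubby's implementation: `OnceTracker` and its `once` method are hypothetical names, and it records a single key per call rather than the URI, normalized URI, and content hash that #fulfill records.

```ruby
require "csv"
require "set"
require "tmpdir"

# Hypothetical stand-in for the only-once bookkeeping behind #fulfill.
class OnceTracker
  def initialize(journal = nil)
    @journal = journal
    @fulfilled = Set.new
    # Reload entries written by earlier program runs, if any.
    if @journal && File.exist?(@journal)
      CSV.foreach(@journal) { |row| @fulfilled << row }
    end
  end

  # Runs the block and returns its result the first time a given
  # (purpose, key) pair is seen; returns nil on later calls.
  def once(key, purpose = "")
    entry = [purpose, key.to_s]
    return unless @fulfilled.add?(entry)

    result = yield
    # Only reached if the block did not raise, so a failed attempt is
    # not written to the journal.
    CSV.open(@journal, "a") { |csv| csv << entry } if @journal
    result
  end
end

journal = File.join(Dir.mktmpdir, "journal.csv")

first_run = OnceTracker.new(journal)
first_run.once("https://example.com/posts") { "first time" }  # block runs
first_run.once("https://example.com/posts") { "again" }       # skipped, nil

second_run = OnceTracker.new(journal)  # simulates a later program run
second_run.once("https://example.com/posts") { "again" }          # skipped, nil
second_run.once("https://example.com/posts", "again!") { "new" }  # new purpose, block runs
```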

#get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {}) ⇒ `Mechanize::Page`, ...

Calls #get with each of mirror_uris until a successful (“200 OK”) response is received, and returns that #get result. Rescues and logs Mechanize::ResponseCodeError failures for all but the last mirror.

Examples:

grubby = Grubby.new

urls = [
  "https://httpstat.us/404",
  "https://httpstat.us/500",
  "https://httpstat.us/200?foo",
  "https://httpstat.us/200?bar",
]

grubby.get_mirrored(urls).uri  # == URI("https://httpstat.us/200?foo")

grubby.get_mirrored(urls.take(2))  # raises Mechanize::ResponseCodeError

Parameters:

mirror_uris (Array<URI>, Array<String>)

Returns:

(Mechanize::Page, Mechanize::File, Mechanize::Download, ...)

Raises:

(Mechanize::ResponseCodeError): if all mirror_uris fail

# File 'lib/grubby.rb', line 149

def get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {})
  i = 0
  begin
    get(mirror_uris[i], parameters, referer, headers)
  rescue Mechanize::ResponseCodeError => e
    i += 1
    if i >= mirror_uris.length
      raise
    else
      Grubby.logger.debug("Mirror failed (code #{e.response_code}): #{mirror_uris[i - 1]}")
      Grubby.logger.debug("Try mirror: #{mirror_uris[i]}")
      retry
    end
  end
end
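
The fallback behavior can also be exercised offline. The sketch below is an assumption-laden illustration rather than the library's code: `FetchError` stands in for Mechanize::ResponseCodeError, and `first_successful` and `fake_fetch` are hypothetical names. It uses a plain loop instead of the begin/retry structure shown above.

```ruby
# Hypothetical error type standing in for Mechanize::ResponseCodeError.
class FetchError < StandardError; end

# Yields each mirror in order and returns the first successful result.
# Re-raises the last failure if every mirror fails.
def first_successful(mirrors)
  raise ArgumentError, "no mirrors given" if mirrors.empty?

  last_error = nil
  mirrors.each do |mirror|
    begin
      return yield(mirror)
    rescue FetchError => e
      last_error = e
    end
  end
  raise last_error
end

# Fake fetcher: succeeds only for URLs containing "/200".
fake_fetch = ->(url) do
  raise FetchError, "failed: #{url}" unless url.include?("/200")
  url
end

urls = [
  "https://example.test/404",
  "https://example.test/500",
  "https://example.test/200?foo",
  "https://example.test/200?bar",
]

first_successful(urls, &fake_fetch)  # returns the "200?foo" URL
```

As with #get_mirrored, later mirrors are never contacted once one succeeds, so the order of the list expresses preference.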

#ok?(uri, query_params = {}, headers = {}) ⇒ `Boolean`

Calls #head and returns true if a response code “200” is received, false otherwise. Unlike #head, error response codes (e.g. “404”, “500”) do not result in a Mechanize::ResponseCodeError being raised.

Parameters:

uri (URI, String)

Returns:

(Boolean)

# File 'lib/grubby.rb', line 118

def ok?(uri, query_params = {}, headers = {})
  begin
    head(uri, query_params, headers).code == "200"
  rescue Mechanize::ResponseCodeError
    false
  end
end
