Class: Grubby
- Inherits: Mechanize
  - Object
  - Mechanize
  - Grubby
- Defined in: lib/grubby.rb
Defined Under Namespace
Classes: JsonParser, JsonScraper, PageScraper, Scraper
Constant Summary
- VERSION = GRUBBY_VERSION
Class Attribute Summary
- .logger ⇒ Logger
  Logger used by Grubby.
Instance Attribute Summary
- #journal ⇒ Pathname?
  Journal file used to ensure only-once processing of resources by #fulfill across multiple program runs.
- #time_between_requests ⇒ Integer, ...
  The minimum amount of time enforced between requests, in seconds.
Instance Method Summary
- #fulfill(uri, purpose = "") {|resource| ... } ⇒ Object?
  Ensures only-once processing of the resource indicated by uri for the specified purpose.
- #get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {}) ⇒ Mechanize::Page, ...
  Calls #get with each of mirror_uris until a successful (“200 OK”) response is received, and returns that #get result.
- #initialize(journal = nil) ⇒ Grubby (constructor)
  A new instance of Grubby.
- #ok?(uri, query_params = {}, headers = {}) ⇒ Boolean
  Calls #head and returns true if a response code “200” is received, false otherwise.
Constructor Details
#initialize(journal = nil) ⇒ Grubby
Returns a new instance of Grubby.
    # File 'lib/grubby.rb', line 54

    def initialize(journal = nil)
      super()

      # Prevent "memory leaks", and prevent mistakenly blank urls from
      # resolving. (Blank urls resolve as a path relative to the last
      # history entry. Without this setting, an erroneous `agent.get("")`
      # could sometimes successfully fetch a page.)
      self.max_history = 0

      # Prevent files of unforeseen content type from being buffered into
      # memory by default, in case they are very large. However, increase
      # the threshold for what is considered "large", to prevent
      # unnecessary writes to disk.
      #
      # References:
      # - http://docs.seattlerb.org/mechanize/Mechanize/PluggableParser.html
      # - http://docs.seattlerb.org/mechanize/Mechanize/Download.html
      # - http://docs.seattlerb.org/mechanize/Mechanize/File.html
      self.max_file_buffer = 1_000_000 # only applies to Mechanize::Download
      self.pluggable_parser.default = Mechanize::Download
      self.pluggable_parser["text/plain"] = Mechanize::File
      self.pluggable_parser["application/json"] = Grubby::JsonParser

      # Set up configurable rate limiting, and choose a reasonable default
      # rate limit.
      self.pre_connect_hooks << Proc.new{ self.send(:sleep_between_requests) }
      self.post_connect_hooks << Proc.new do |agent, uri, response, body|
        self.send(:mark_last_request_time, (Time.now unless response.code.to_s.start_with?("3")))
      end
      self.time_between_requests = 1.0

      self.journal = journal
    end
Class Attribute Details
.logger ⇒ Logger
Logger used by Grubby.
    # File 'lib/grubby.rb', line 27

    def logger
      @logger ||= Logger.new($stderr).tap do |logger|
        logger.formatter = -> (severity, time, progname, msg) do
          "[#{time.strftime "%Y-%m-%d %H:%M:%S"}] #{severity} #{msg}\n"
        end
      end
    end
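To see what a line produced by this formatter looks like, the following minimal sketch reuses the same formatter shape with the standard-library Logger. It writes to an in-memory StringIO rather than $stderr, and the helper name build_demo_logger is hypothetical, introduced only for this illustration.

```ruby
require "logger"
require "stringio"

# Hypothetical helper (not part of Grubby): builds a Logger that uses the
# same timestamp/severity/message layout as the formatter shown above,
# but writes to whatever IO object is passed in.
def build_demo_logger(io)
  Logger.new(io).tap do |logger|
    logger.formatter = ->(severity, time, progname, msg) do
      "[#{time.strftime("%Y-%m-%d %H:%M:%S")}] #{severity} #{msg}\n"
    end
  end
end

buffer = StringIO.new
build_demo_logger(buffer).info("Fetch http://example.com/")

# The buffer now holds one line: a bracketed timestamp, the severity,
# and the message, terminated by a newline.
puts buffer.string
```

Because the formatter ignores progname, any program name passed to the logger does not appear in the output.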
Instance Attribute Details
#journal ⇒ Pathname?
Journal file used to ensure only-once processing of resources by #fulfill across multiple program runs.
    # File 'lib/grubby.rb', line 49

    def journal
      @journal
    end
#time_between_requests ⇒ Integer, ...
The minimum amount of time enforced between requests, in seconds. If the value is a Range, a random number within the Range is chosen for each request.
    # File 'lib/grubby.rb', line 43

    def time_between_requests
      @time_between_requests
    end
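The description above says that a Range value yields a random delay per request. The private method that applies the delay is not shown on this page, so the following is only a sketch of how such a setting could be resolved to a number of seconds. The helper name resolve_delay is hypothetical and is not part of Grubby.

```ruby
# Hypothetical helper (not Grubby's implementation): turns a
# time_between_requests-style setting into a concrete delay in seconds.
def resolve_delay(time_between_requests)
  if time_between_requests.is_a?(Range)
    # Kernel#rand accepts a Range and returns a random value within it.
    rand(time_between_requests)
  else
    time_between_requests
  end
end

fixed_delay  = resolve_delay(1.0)       # always the same value
random_delay = resolve_delay(2.0..5.0)  # differs from call to call
puts "fixed: #{fixed_delay}, random: #{random_delay}"
```

A Range is a common way to make request timing look less mechanical, while a single number gives a predictable lower bound.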
Instance Method Details
#fulfill(uri, purpose = "") {|resource| ... } ⇒ Object?
Ensures only-once processing of the resource indicated by uri for the specified purpose. The given block is executed and the result is returned if and only if the Grubby instance has not recorded a previous call to fulfill for the same resource and purpose.
Note that the resource is identified by both its URI and its content hash. The latter prevents superfluous and rearranged URI query string parameters from interfering with only-once processing.
If #journal is set, and if the block does not raise an exception, the resource and purpose are logged to the journal file. This enables only-once processing across multiple program runs. It also provides a means to resume batch processing after an unexpected termination.
    # File 'lib/grubby.rb', line 210

    def fulfill(uri, purpose = "")
      series = []

      uri = uri.to_absolute_uri
      return unless add_fulfilled(uri, purpose, series)

      normalized_uri = normalize_uri(uri)
      return unless add_fulfilled(normalized_uri, purpose, series)

      Grubby.logger.info("Fetch #{normalized_uri}")
      resource = get(normalized_uri)
      unprocessed = add_fulfilled(resource.uri, purpose, series) &
        add_fulfilled("content hash: #{resource.content_hash}", purpose, series)

      result = yield resource if unprocessed

      CSV.open(journal, "a") do |csv|
        series.each{|entry| csv << entry }
      end if journal

      result
    end
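The only-once behavior described above can be illustrated without any network access. The sketch below is a simplified in-memory stand-in: it tracks (key, purpose) pairs in a Set and omits the URI normalization, content hashing, and journal file that the real method uses. The class name OnceOnly is hypothetical.

```ruby
require "set"

# Hypothetical stand-in for the only-once bookkeeping; not Grubby's code.
class OnceOnly
  def initialize
    @seen = Set.new
  end

  # Runs the block and returns its result only the first time a given
  # key/purpose pair is seen. Returns nil on every repeat.
  def fulfill(key, purpose = "")
    # Set#add? returns nil when the element was already present.
    return unless @seen.add?([key, purpose])
    yield key
  end
end

tracker = OnceOnly.new
first  = tracker.fulfill("http://example.com/a", "index")   { |key| key.length }
second = tracker.fulfill("http://example.com/a", "index")   { |key| key.length }
third  = tracker.fulfill("http://example.com/a", "archive") { |key| key.length }

# The block ran for the first and third calls only: the second call
# repeated an already recorded key/purpose pair.
p [first, second, third]
```

Keying on the purpose as well as the resource lets the same page be processed once per task, for example once for indexing and once for archiving.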
#get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {}) ⇒ Mechanize::Page, ...
Calls #get with each of mirror_uris until a successful (“200 OK”) response is received, and returns that #get result. Rescues and logs Mechanize::ResponseCodeError failures for all but the last mirror.
    # File 'lib/grubby.rb', line 149

    def get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {})
      i = 0
      begin
        get(mirror_uris[i], parameters, referer, headers)
      rescue Mechanize::ResponseCodeError => e
        i += 1
        if i >= mirror_uris.length
          raise
        else
          Grubby.logger.debug("Mirror failed (code #{e.response_code}): #{mirror_uris[i - 1]}")
          Grubby.logger.debug("Try mirror: #{mirror_uris[i]}")
          retry
        end
      end
    end
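The method relies on Ruby's begin/rescue/retry idiom to walk through the mirror list. The sketch below isolates that control flow with no HTTP involved. FetchError and fetch_with_fallback are hypothetical names standing in for Mechanize::ResponseCodeError and #get_mirrored, and the block plays the role of #get.

```ruby
# Hypothetical error class standing in for Mechanize::ResponseCodeError.
class FetchError < StandardError; end

# Yields each source in order until the block succeeds. A failure on the
# last source is re-raised to the caller, mirroring the behavior above.
def fetch_with_fallback(sources)
  i = 0
  begin
    yield sources[i]
  rescue FetchError
    i += 1
    raise if i >= sources.length  # no mirrors left: propagate the error
    retry                         # re-run the begin block with the next source
  end
end

attempts = []
result = fetch_with_fallback(["mirror-a", "mirror-b", "mirror-c"]) do |source|
  attempts << source
  raise FetchError, "unavailable" unless source == "mirror-b"
  "payload from #{source}"
end

# mirror-a fails, mirror-b succeeds, and mirror-c is never contacted.
p attempts
p result
```

Since only the last failure is re-raised, callers see a single exception when every mirror is down, and none when any mirror responds.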
#ok?(uri, query_params = {}, headers = {}) ⇒ Boolean
Calls #head and returns true if a response code “200” is received, false otherwise. Unlike #head, error response codes (e.g. “404”, “500”) do not result in a Mechanize::ResponseCodeError being raised.
    # File 'lib/grubby.rb', line 118

    def ok?(uri, query_params = {}, headers = {})
      begin
        head(uri, query_params, headers).code == "200"
      rescue Mechanize::ResponseCodeError
        false
      end
    end
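The pattern here is to convert an exception-raising probe into a plain boolean. A minimal sketch of that pattern, with no HTTP involved, might look as follows. ResponseError and reachable? are hypothetical names, and the block stands in for the #head call returning a response code string.

```ruby
# Hypothetical error class standing in for Mechanize::ResponseCodeError.
class ResponseError < StandardError; end

# Returns true only when the probe block yields the string "200".
# Any other code, or a ResponseError raised by the probe, gives false.
def reachable?
  yield == "200"
rescue ResponseError
  false
end

p reachable? { "200" }                        # success code
p reachable? { "301" }                        # non-200 code, no exception
p reachable? { raise ResponseError, "404" }   # error converted to false
```

Returning false instead of raising makes the check convenient inside conditionals and filters, at the cost of hiding which error code occurred.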