Class: Scrapetor::Robots
- Inherits: Object (Object → Scrapetor::Robots)
- Defined in: lib/scrapetor/robots.rb
Overview
robots.txt parser + path-match decider.
r = Scrapetor::Robots.fetch_for("https://example.com")
r.allowed?("https://example.com/private")
r.crawl_delay
r.sitemaps
Implements the de facto Google / RFC 9309 longest-match semantics: among the Allow/Disallow rules that match a path, the most specific one (the one with the longest pattern) wins. User-agent matching is a case-insensitive prefix match; '*' is the fallback group.
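The longest-match rule described above can be sketched in isolation. This is a minimal illustration, not the library's implementation: `SketchRule`, `prefix_match?`, and `allowed_by_longest_match?` are hypothetical names, and the matcher here is a plain prefix comparison that ignores the `*` and `$` wildcards a real robots.txt matcher would need to handle.

```ruby
# Hypothetical stand-in for Scrapetor::Robots::Rule; the real class may differ.
SketchRule = Struct.new(:type, :pattern)

# Simplified matcher (assumption): plain prefix comparison, no wildcards.
def prefix_match?(path, pattern)
  path.start_with?(pattern)
end

# True when the longest matching rule is an Allow rule,
# or when no rule matches the path at all.
def allowed_by_longest_match?(path, rules)
  best = rules
    .select { |r| prefix_match?(path, r.pattern) }
    .max_by { |r| r.pattern.length }
  best.nil? || best.type == :allow
end

rules = [
  SketchRule.new(:disallow, "/private"),
  SketchRule.new(:allow, "/private/public")
]

puts allowed_by_longest_match?("/private/public/page.html", rules) # => true
puts allowed_by_longest_match?("/private/secret", rules)           # => false
puts allowed_by_longest_match?("/index.html", rules)               # => true
```

In the first call both rules match, and the longer Allow pattern overrides the shorter Disallow. In the last call nothing matches, so the path is allowed by default.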
Defined Under Namespace
Classes: Rule
Instance Attribute Summary
- #sitemaps ⇒ Object (readonly)
  Returns the value of attribute sitemaps.
Class Method Summary
- .fetch_for(origin, user_agent: "*", **opts) ⇒ Object
Instance Method Summary
- #allowed?(url) ⇒ Boolean
- #crawl_delay ⇒ Object
- #disallowed?(url) ⇒ Boolean
- #initialize(body, user_agent: "*") ⇒ Robots (constructor)
  A new instance of Robots.
Constructor Details
#initialize(body, user_agent: "*") ⇒ Robots
Returns a new instance of Robots.
# File 'lib/scrapetor/robots.rb', line 21
def initialize(body, user_agent: "*")
  @ua = user_agent
  @groups = {}    # ua_pattern (lowercased) => Array<Rule>
  @delays = {}    # ua_pattern => Float
  @sitemaps = []
  parse!(body.to_s)
end
Instance Attribute Details
#sitemaps ⇒ Object (readonly)
Returns the value of attribute sitemaps.
# File 'lib/scrapetor/robots.rb', line 19
def sitemaps
  @sitemaps
end
Class Method Details
.fetch_for(origin, user_agent: "*", **opts) ⇒ Object
# File 'lib/scrapetor/robots.rb', line 61
def self.fetch_for(origin, user_agent: "*", **opts)
  uri = URI(origin.to_s)
  # Build the origin's robots.txt URL; omit the port when it is the scheme default.
  url = "#{uri.scheme}://#{uri.host}#{uri.port == uri.default_port ? "" : ":#{uri.port}"}/robots.txt"
  resp = Scrapetor::Fetcher.get(url, raise_for_status: false, **opts)
  # Any non-200 response is treated as an empty robots.txt.
  body = resp[:status] == 200 ? resp[:body] : ""
  new(body, user_agent: user_agent)
end
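The URL construction in fetch_for can be tried on its own with Ruby's standard `uri` library. The sketch below pulls that one expression into a hypothetical helper, `robots_url`, which is not part of Scrapetor; it only illustrates how the path and query of the origin are dropped and how a non-default port is preserved.

```ruby
require "uri"

# Hypothetical helper mirroring the URL construction shown in fetch_for.
def robots_url(origin)
  uri = URI(origin.to_s)
  # Keep the port only when it differs from the scheme's default (80/443).
  port = uri.port == uri.default_port ? "" : ":#{uri.port}"
  "#{uri.scheme}://#{uri.host}#{port}/robots.txt"
end

puts robots_url("https://example.com/some/page?x=1") # => https://example.com/robots.txt
puts robots_url("http://example.com:8080/")          # => http://example.com:8080/robots.txt
```

Because fetch_for passes an empty body to the constructor on any non-200 status, a missing or failing robots.txt appears to result in an instance with no rules, which allowed? treats as "everything allowed".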
Instance Method Details
#allowed?(url) ⇒ Boolean
# File 'lib/scrapetor/robots.rb', line 29
def allowed?(url)
  s = url.to_s
  # Accept either a bare path or a full URL; the query string is kept.
  path = if s.start_with?("/")
    s
  else
    uri = URI(s)
    (uri.path.empty? ? "/" : uri.path) + (uri.query ? "?#{uri.query}" : "")
  end
  rules = applicable_rules
  return true if rules.empty?

  # Find the longest matching pattern (Google convention; RFC 9309
  # also says the most specific match wins).
  best = nil
  rules.each do |r|
    next unless path_matches?(path, r.pattern)
    if best.nil? || r.pattern.length > best.pattern.length
      best = r
    end
  end
  # No matching rule means the path is allowed.
  best.nil? || best.type == :allow
end
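Before any rule is consulted, allowed? reduces its argument to a path plus optional query string. A minimal sketch of that normalization step follows, using a hypothetical helper named `request_path` (not a Scrapetor method) and only the standard `uri` library.

```ruby
require "uri"

# Hypothetical helper mirroring the path extraction at the top of allowed?.
def request_path(url)
  s = url.to_s
  # Strings that already look like a path are used unchanged.
  return s if s.start_with?("/")

  uri = URI(s)
  path = uri.path.empty? ? "/" : uri.path
  uri.query ? "#{path}?#{uri.query}" : path
end

puts request_path("https://example.com")              # => /
puts request_path("https://example.com/private?id=7") # => /private?id=7
puts request_path("/already/a/path")                  # => /already/a/path
```

Keeping the query string matters because robots.txt patterns can target it, for example a rule that disallows `/*?sessionid=`.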
#crawl_delay ⇒ Object
# File 'lib/scrapetor/robots.rb', line 56
def crawl_delay
  ua = ua_for(@ua)
  @delays[ua] || @delays["*"]
end
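The fallback in crawl_delay is an ordinary hash lookup with a default group. The sketch below reproduces only that lookup; the `delays` hash, its values, and the `delay_for` helper are made up for illustration, and the key is assumed to be already resolved (the job that ua_for does in the real class, whose source is not shown here).

```ruby
# Hypothetical per-group delays, keyed by lowercased user-agent pattern.
delays = { "examplebot" => 5.0, "*" => 1.0 }

# Prefer the delay of the matched group, else fall back to the "*" group.
# Returns nil when neither group declares a Crawl-delay.
def delay_for(delays, ua_key)
  delays[ua_key] || delays["*"]
end

p delay_for(delays, "examplebot") # => 5.0
p delay_for(delays, "otherbot")   # => 1.0
p delay_for({}, "otherbot")       # => nil
```

Since the result can be nil, a caller would typically guard before sleeping, for example `sleep(delay) if delay`.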
#disallowed?(url) ⇒ Boolean
# File 'lib/scrapetor/robots.rb', line 52
def disallowed?(url)
  !allowed?(url)
end