Class: Scrapetor::Robots

Inherits:
Object
Defined in:
lib/scrapetor/robots.rb

Overview

Parses robots.txt and decides whether a given path may be fetched.

r = Scrapetor::Robots.fetch_for("https://example.com")
r.allowed?("https://example.com/private")
r.crawl_delay
r.sitemaps

Implements the longest-match semantics used by Google and specified in RFC 9309: among the Allow/Disallow rules that match a path, the one with the longest (most specific) pattern wins. User-agent matching is a case-insensitive prefix match; '*' is the fallback group.
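The user-agent selection described above can be sketched in isolation. This is a minimal illustration, not Scrapetor's code: `select_group` is a hypothetical helper, and it assumes that a group applies when its token is a prefix of the crawler's user-agent string and that the longest matching token is preferred. The page does not show the library's own `ua_for`, so the real behaviour may differ.

```ruby
# Hypothetical helper, not part of Scrapetor: choose which robots.txt
# group token applies to a crawler's user-agent string.
def select_group(group_tokens, user_agent)
  ua = user_agent.downcase
  # Assumption: a group applies when its token is a prefix of the UA.
  candidates = group_tokens.map(&:downcase)
                           .select { |t| t != "*" && ua.start_with?(t) }
  # Assumption: prefer the longest (most specific) token, else fall back to "*".
  best = candidates.max_by(&:length)
  best || (group_tokens.include?("*") ? "*" : nil)
end

select_group(["*", "googlebot", "googlebot-news"], "Googlebot-News/1.0") # => "googlebot-news"
select_group(["*", "bingbot"], "MyCrawler/0.1")                          # => "*"
```

Lowercasing both sides before comparing is what makes the match case-insensitive.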

Defined Under Namespace

Classes: Rule

Constructor Details

#initialize(body, user_agent: "*") ⇒ Robots

Returns a new instance of Robots.



# File 'lib/scrapetor/robots.rb', line 21

def initialize(body, user_agent: "*")
  @ua = user_agent
  @groups = {}      # ua_pattern (lowercased) => Array<Rule>
  @delays = {}      # ua_pattern => Float
  @sitemaps = []
  parse!(body.to_s)
end

Instance Attribute Details

#sitemaps ⇒ Object (readonly)

Returns the value of attribute sitemaps.



# File 'lib/scrapetor/robots.rb', line 19

def sitemaps
  @sitemaps
end

Class Method Details

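.fetch_for(origin, user_agent: "*", **opts) ⇒ Robots

Fetches /robots.txt from the given origin and returns a new instance of Robots. Any response other than HTTP 200 is treated as an empty robots.txt.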



# File 'lib/scrapetor/robots.rb', line 61

def self.fetch_for(origin, user_agent: "*", **opts)
  uri = URI(origin.to_s)
  url = "#{uri.scheme}://#{uri.host}#{uri.port == uri.default_port ? "" : ":#{uri.port}"}/robots.txt"
  resp = Scrapetor::Fetcher.get(url, raise_for_status: false, **opts)
  body = resp[:status] == 200 ? resp[:body] : ""
  new(body, user_agent: user_agent)
end
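The robots.txt URL construction in the listing above can be tried on its own. The sketch below is a standalone illustration: `robots_url` is a hypothetical helper name, not part of Scrapetor, and it only reproduces the string-building step using Ruby's standard `uri` library.

```ruby
require "uri"

# Hypothetical standalone helper mirroring the URL construction in fetch_for.
def robots_url(origin)
  uri = URI(origin.to_s)
  # Only include the port when it differs from the scheme's default.
  port = uri.port == uri.default_port ? "" : ":#{uri.port}"
  # The original path and query are discarded; robots.txt lives at the root.
  "#{uri.scheme}://#{uri.host}#{port}/robots.txt"
end

robots_url("https://example.com/blog/post?id=1") # => "https://example.com/robots.txt"
robots_url("http://example.com:8080/")           # => "http://example.com:8080/robots.txt"
```

Because the path and query are dropped, any URL on a site can be passed as the origin.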

Instance Method Details

#allowed?(url) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/scrapetor/robots.rb', line 29

def allowed?(url)
  s = url.to_s
  path =
    if s.start_with?("/")
      s
    else
      uri = URI(s)
      (uri.path.empty? ? "/" : uri.path) + (uri.query ? "?#{uri.query}" : "")
    end
  rules = applicable_rules
  return true if rules.empty?
  # Find the longest matching pattern (Google convention; RFC 9309
  # also says the most specific match wins).
  best = nil
  rules.each do |r|
    next unless path_matches?(path, r.pattern)
    if best.nil? || r.pattern.length > best.pattern.length
      best = r
    end
  end
  best.nil? || best.type == :allow
end
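To see the longest-match selection on concrete rules, here is a small self-contained sketch. `SketchRule` is a hypothetical stand-in for `Scrapetor::Robots::Rule`, and the matcher assumes plain prefix matching; the library's `path_matches?` is not shown on this page and may also handle wildcards such as `*` and `$`.

```ruby
# Hypothetical stand-in for Scrapetor::Robots::Rule (type is :allow or :disallow).
SketchRule = Struct.new(:type, :pattern)

# Assumption: plain prefix matching instead of the library's path_matches?.
def longest_match_allowed?(rules, path)
  best = nil
  rules.each do |r|
    next unless path.start_with?(r.pattern)
    # Keep the rule with the strictly longest matching pattern.
    best = r if best.nil? || r.pattern.length > best.pattern.length
  end
  # No matching rule means the path is allowed.
  best.nil? || best.type == :allow
end

rules = [
  SketchRule.new(:disallow, "/private"),
  SketchRule.new(:allow, "/private/press"),
]

longest_match_allowed?(rules, "/private/press/2024") # => true
longest_match_allowed?(rules, "/private/notes")      # => false
longest_match_allowed?(rules, "/public")             # => true
```

Because the comparison is a strict `>`, the first-listed rule wins when two matching patterns have the same length.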

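#crawl_delay ⇒ Object

Returns the Crawl-delay for the configured user agent, falling back to the '*' group. Returns nil when neither is set.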



# File 'lib/scrapetor/robots.rb', line 56

def crawl_delay
  ua = ua_for(@ua)
  @delays[ua] || @delays["*"]
end
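Since the lookup above yields nil when no delay is recorded, callers may want to supply their own default. The sketch below illustrates that under stated assumptions: `effective_delay`, the `delays` table, and the 1.0 second default are all hypothetical and not part of Scrapetor.

```ruby
# Hypothetical delay table, shaped like the @delays hash (ua_pattern => Float).
delays = { "examplebot" => 2.0, "*" => 10.0 }

# Mirrors the `@delays[ua] || @delays["*"]` lookup, plus a caller-supplied
# default (an assumption) for when robots.txt sets no Crawl-delay at all.
def effective_delay(delays, ua, default: 1.0)
  delays[ua] || delays["*"] || default
end

effective_delay(delays, "examplebot") # => 2.0
effective_delay(delays, "otherbot")   # => 10.0
effective_delay({}, "otherbot")       # => 1.0
```

A crawler loop could then call `sleep` with the returned value between requests.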

#disallowed?(url) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/scrapetor/robots.rb', line 52

def disallowed?(url)
  !allowed?(url)
end