Class: Scrapetor::Robots

Inherits: Object

Defined in: lib/scrapetor/robots.rb

Overview

Parses robots.txt and decides whether a given path may be crawled.

r = Scrapetor::Robots.fetch_for("https://example.com")
r.allowed?("https://example.com/private")
r.crawl_delay
r.sitemaps

Implements the longest-match semantics used by Google and described in RFC 9309: among the Allow/Disallow rules that match a path, the rule with the longest (most specific) pattern wins. User-agent matching is a case-insensitive prefix match; "*" is the fallback group.
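To illustrate the longest-match rule, here is a minimal standalone sketch. It does not use the library: `SketchRule` and `sketch_allowed?` are hypothetical names, and matching is reduced to a plain prefix comparison (no `*` or `$` handling), which may differ from what the real `path_matches?` does.

```ruby
# Hypothetical stand-in for Scrapetor::Robots::Rule; the real class may differ.
SketchRule = Struct.new(:type, :pattern)

# Simplified decision: plain prefix matching only (assumption), then the
# matching rule with the longest pattern decides the outcome.
def sketch_allowed?(path, rules)
  matching = rules.select { |r| path.start_with?(r.pattern) }
  best = matching.max_by { |r| r.pattern.length }
  best.nil? || best.type == :allow
end

rules = [
  SketchRule.new(:disallow, "/private"),
  SketchRule.new(:allow, "/private/press"),
]

sketch_allowed?("/private/press/2024", rules)  # => true  (longer Allow wins)
sketch_allowed?("/private/notes", rules)       # => false (only Disallow matches)
sketch_allowed?("/public", rules)              # => true  (no rule matches)
```

A path that matches no rule is treated as allowed, which mirrors the `best.nil?` branch in `#allowed?`.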

Defined Under Namespace

Classes: Rule

Instance Attribute Summary

#sitemaps ⇒ Object readonly

Returns the value of attribute sitemaps.

Class Method Summary

.fetch_for(origin, user_agent: "*", **opts) ⇒ Object

Instance Method Summary

#allowed?(url) ⇒ Boolean
#crawl_delay ⇒ Object
#disallowed?(url) ⇒ Boolean
#initialize(body, user_agent: "*") ⇒ Robots constructor

A new instance of Robots.

Constructor Details

#initialize(body, user_agent: "*") ⇒ `Robots`

Returns a new instance of Robots.

# File 'lib/scrapetor/robots.rb', line 21

def initialize(body, user_agent: "*")
  @ua = user_agent
  @groups = {}      # ua_pattern (lowercased) => Array<Rule>
  @delays = {}      # ua_pattern => Float
  @sitemaps = []
  parse!(body.to_s)
end

Instance Attribute Details

#sitemaps ⇒ `Object` (readonly)

Returns the value of attribute sitemaps.



# File 'lib/scrapetor/robots.rb', line 19

def sitemaps
  @sitemaps
end

Class Method Details

.fetch_for(origin, user_agent: "*", **opts) ⇒ `Object`

# File 'lib/scrapetor/robots.rb', line 61

def self.fetch_for(origin, user_agent: "*", **opts)
  uri = URI(origin.to_s)
  url = "#{uri.scheme}://#{uri.host}#{uri.port == uri.default_port ? "" : ":#{uri.port}"}/robots.txt"
  resp = Scrapetor::Fetcher.get(url, raise_for_status: false, **opts)
  body = resp[:status] == 200 ? resp[:body] : ""
  new(body, user_agent: user_agent)
end
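As the source above shows, `fetch_for` builds the robots.txt URL from the origin's scheme, host, and port (omitting the port when it is the scheme's default), and treats any status other than 200 as an empty body. The URL derivation can be tried in isolation with the sketch below; `robots_url_for` is a hypothetical helper name, not part of the library.

```ruby
require "uri"

# Hypothetical helper mirroring the URL-building line in fetch_for.
# Any path or query on the origin is discarded; only scheme, host,
# and a non-default port are kept.
def robots_url_for(origin)
  uri = URI(origin.to_s)
  port = uri.port == uri.default_port ? "" : ":#{uri.port}"
  "#{uri.scheme}://#{uri.host}#{port}/robots.txt"
end

robots_url_for("https://example.com/some/page?x=1")
# => "https://example.com/robots.txt"
robots_url_for("http://example.com:8080")
# => "http://example.com:8080/robots.txt"
```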

Instance Method Details

#allowed?(url) ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/scrapetor/robots.rb', line 29

def allowed?(url)
  s = url.to_s
  path =
    if s.start_with?("/")
      s
    else
      uri = URI(s)
      (uri.path.empty? ? "/" : uri.path) + (uri.query ? "?#{uri.query}" : "")
    end
  rules = applicable_rules
  return true if rules.empty?
  # Find the longest matching pattern (Google convention; RFC 9309
  # also says the most specific match wins).
  best = nil
  rules.each do |r|
    next unless path_matches?(path, r.pattern)
    if best.nil? || r.pattern.length > best.pattern.length
      best = r
    end
  end
  best.nil? || best.type == :allow
end
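The first half of `#allowed?` normalizes its argument into the string that rules are matched against: a value starting with "/" is used as-is, otherwise the path plus any query string is taken from the parsed URL. A minimal sketch of just that step, with `match_path` as a hypothetical name:

```ruby
require "uri"

# Hypothetical extraction of the path-normalization step in #allowed?.
def match_path(url)
  s = url.to_s
  return s if s.start_with?("/")      # already a path: use it unchanged

  uri = URI(s)
  path = uri.path.empty? ? "/" : uri.path
  path + (uri.query ? "?#{uri.query}" : "")
end

match_path("https://example.com")             # => "/"
match_path("https://example.com/a/b?page=2")  # => "/a/b?page=2"
match_path("/already/a/path")                 # => "/already/a/path"
```

Because the query string is kept, a rule pattern that includes query parameters can be compared against the full path-and-query string.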

#crawl_delay ⇒ `Object`

# File 'lib/scrapetor/robots.rb', line 56

def crawl_delay
  ua = ua_for(@ua)
  @delays[ua] || @delays["*"]
end
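`#crawl_delay` looks up the delay for the matched user-agent group and falls back to the "*" group, so it can return `nil` when neither group declares a delay. The fallback can be sketched with a plain hash; the data and the `delay_for` name are hypothetical, and the key is assumed to be whatever `ua_for` resolves to.

```ruby
# Hypothetical data: keys are lowercased user-agent patterns, values are seconds.
delays = { "*" => 1.0, "examplebot" => 5.0 }

# Same fallback shape as crawl_delay: the specific group first, then "*".
def delay_for(delays, ua_key)
  delays[ua_key] || delays["*"]
end

delay_for(delays, "examplebot")  # => 5.0
delay_for(delays, "otherbot")    # => 1.0
delay_for({}, "otherbot")        # => nil
```

Callers that sleep between requests would need to guard against the `nil` case, for example `sleep(delay) if delay`.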

#disallowed?(url) ⇒ `Boolean`