Module: Scrapetor::Sitemap

Defined in:
lib/scrapetor/sitemap.rb

Overview

Ingests sitemap.xml files. Handles both <urlset> (URL listings) and <sitemapindex> (nested sitemap references), streaming entries one at a time so a huge sitemap doesn’t have to fit in memory at once.

Scrapetor::Sitemap.urls("https://example.com/sitemap.xml") do |url, meta|
  puts url, meta[:lastmod], meta[:priority]
end

Without a block, .urls returns an Enumerator, so to_a collects the [url, meta] pairs into an array:

Scrapetor::Sitemap.urls("https://example.com/sitemap.xml").to_a

Class Method Summary

  • .open_source(source) ⇒ Object
  • .urls(source, depth: 0, max_depth: 5, &block) ⇒ Object

Class Method Details

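.open_source(source) ⇒ Object

Resolves source to a readable IO. An object that responds to read is returned unchanged. A String that does not start with "http" is treated as XML content and wrapped in a StringIO. Anything else is converted with to_s and fetched through Scrapetor::Fetcher.get, and the response body is wrapped in a StringIO.
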
# File 'lib/scrapetor/sitemap.rb', line 45

def self.open_source(source)
  return source if source.respond_to?(:read)
  return StringIO.new(source) if source.is_a?(String) && !source.start_with?("http")
  resp = Scrapetor::Fetcher.get(source.to_s)
  StringIO.new(resp[:body])
end
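
The three branches of open_source can be tried in isolation with a sketch like the one below. It is not the library's code: open_source_sketch and fake_fetch are hypothetical names, and fake_fetch stands in for Scrapetor::Fetcher.get on the assumption that only the :body key of its return value matters, which is all the listing above reads.

```ruby
require "stringio"

# Hypothetical stand-in for Scrapetor::Fetcher.get. It returns a hash with a
# :body key, the only part of the response that open_source reads.
def fake_fetch(url)
  { body: "<urlset><!-- fetched from #{url} --></urlset>" }
end

# Mirrors the three branches of open_source, with the fetcher injected so the
# sketch runs without network access.
def open_source_sketch(source, fetch: method(:fake_fetch))
  # IO-like objects (File, StringIO, ...) are used as-is.
  return source if source.respond_to?(:read)
  # Strings that do not start with "http" are treated as XML content.
  return StringIO.new(source) if source.is_a?(String) && !source.start_with?("http")
  # Everything else (URL strings, URI objects) is fetched.
  StringIO.new(fetch.call(source.to_s)[:body])
end

xml = "<urlset><url><loc>https://example.com/</loc></url></urlset>"
io  = StringIO.new(xml)

open_source_sketch(io).equal?(io)                           # same object handed back
open_source_sketch(xml).read == xml                         # string wrapped in a StringIO
open_source_sketch("https://example.com/sitemap.xml").read  # body supplied by the fetcher
```

One consequence of the second branch: a String holding a local file path, such as "sitemap.xml", is wrapped as if it were XML content rather than opened. To read from disk, pass an open File instead.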

.urls(source, depth: 0, max_depth: 5, &block) ⇒ Object

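Iterates over every URL in the sitemap as a stream, recursing into <sitemapindex> entries automatically. Yields (url, meta), where meta is a Hash with the keys :lastmod, :changefreq, and :priority; each value is the stripped element text, or nil when the element is absent. Without a block, returns an Enumerator.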

Raises:

  • (ArgumentError): if nested sitemap indexes recurse deeper than max_depth
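
Because the meta values arrive as raw strings, callers will often want to convert them before sorting or filtering. The sketch below shows one way to do that. typed_meta and parse_lastmod are hypothetical helpers, not part of Scrapetor, and the date handling is a simplification: it keeps only the calendar date and returns nil for the shorter year-only or year-month forms.

```ruby
require "date"

# Hypothetical helper, not part of Scrapetor: keeps only the calendar date
# from a lastmod string. Returns nil when the value is missing or does not
# begin with a full YYYY-MM-DD date.
def parse_lastmod(value)
  return nil if value.nil?
  Date.strptime(value[0, 10], "%Y-%m-%d")
rescue ArgumentError
  nil
end

# Hypothetical helper, not part of Scrapetor: converts the raw strings in
# meta into Ruby values. Missing or malformed values become nil.
def typed_meta(meta)
  {
    lastmod:    parse_lastmod(meta[:lastmod]),
    changefreq: meta[:changefreq]&.downcase&.to_sym,
    priority:   (Float(meta[:priority], exception: false) if meta[:priority]),
  }
end

raw   = { lastmod: "2024-01-15T10:30:00+00:00", changefreq: "Weekly", priority: "0.8" }
typed = typed_meta(raw)
typed[:lastmod]     # => Date for 2024-01-15
typed[:changefreq]  # => :weekly
typed[:priority]    # => 0.8
```

Inside a .urls block this would be called as typed_meta(meta) for each yielded entry.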


# File 'lib/scrapetor/sitemap.rb', line 21

def self.urls(source, depth: 0, max_depth: 5, &block)
  return enum_for(:urls, source, depth: depth, max_depth: max_depth) unless block
  raise ArgumentError, "sitemap recursion too deep" if depth > max_depth
  io = open_source(source)
  Scrapetor.stream(io, outer: "url") do |doc|
    loc = doc.at_css("loc")&.text&.strip
    next unless loc && !loc.empty?
    meta = {
      lastmod:    doc.at_css("lastmod")&.text&.strip,
      changefreq: doc.at_css("changefreq")&.text&.strip,
      priority:   doc.at_css("priority")&.text&.strip,
    }
    yield loc, meta
  end
  # If the file was a sitemapindex instead, the <url> stream above
  # found nothing. Re-open and scan for <sitemap><loc>.
  child_io = open_source(source)
  Scrapetor.stream(child_io, outer: "sitemap") do |doc|
    child_loc = doc.at_css("loc")&.text&.strip
    next unless child_loc
    urls(child_loc, depth: depth + 1, max_depth: max_depth, &block)
  end
end
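
The control flow above (Enumerator fallback, depth guard, then recursion into child sitemaps) can be seen without any XML or network access in the following sketch. SitemapWalk and its Hash-based data model are hypothetical and exist only for illustration; the real method streams XML through Scrapetor.stream instead.

```ruby
# Hypothetical stand-in for Scrapetor::Sitemap. It walks a Hash that models
# sitemap files rather than parsing XML.
module SitemapWalk
  def self.walk(sitemaps, name, depth: 0, max_depth: 5, &block)
    # Without a block, defer the whole call behind an Enumerator.
    return enum_for(:walk, sitemaps, name, depth: depth, max_depth: max_depth) unless block
    # Depths 0 through max_depth are allowed; one level further raises.
    raise ArgumentError, "sitemap recursion too deep" if depth > max_depth

    entry = sitemaps.fetch(name)
    # A <urlset>-style entry: yield each page URL.
    entry.fetch(:urls, []).each { |url| yield url }
    # A <sitemapindex>-style entry: recurse into each child one level deeper.
    entry.fetch(:sitemaps, []).each do |child|
      walk(sitemaps, child, depth: depth + 1, max_depth: max_depth, &block)
    end
  end
end

sitemaps = {
  "index" => { sitemaps: ["posts", "pages"] },
  "posts" => { urls: ["https://example.com/posts/1", "https://example.com/posts/2"] },
  "pages" => { urls: ["https://example.com/about"] },
}

SitemapWalk.walk(sitemaps, "index").to_a
# => ["https://example.com/posts/1", "https://example.com/posts/2", "https://example.com/about"]

# The guard fires only once enumeration starts, because enum_for defers the call.
deferred = SitemapWalk.walk(sitemaps, "index", max_depth: 0)  # no error yet
begin
  deferred.to_a
rescue ArgumentError => e
  e.message  # => "sitemap recursion too deep"
end
```

The same deferral applies to .urls when it is called without a block: the ArgumentError surfaces during iteration, not when the Enumerator is created.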