Module: Scrapetor::Sitemap
- Defined in:
- lib/scrapetor/sitemap.rb
Overview
Sitemap.xml ingestion. Handles both <urlset> (URL listings) and <sitemapindex> (nested sitemap references), streaming so a huge sitemap doesn’t have to fit in memory at once.
Scrapetor::Sitemap.urls("https://example.com/sitemap.xml") do |url, meta|
  puts url, meta[:lastmod], meta[:priority]
end
Or, return an array:
Scrapetor::Sitemap.urls("https://example.com/sitemap.xml").to_a
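Since the blockless form returns an Enumerator (see the enum_for call in the method source), the usual Enumerable methods should apply, and to_a is only one option. The sketch below illustrates that calling convention with a hypothetical stand-in, FakeSitemap, which yields from an in-memory array instead of fetching and parsing XML. The module name and its data are assumptions made for illustration and are not part of Scrapetor.

```ruby
# Hypothetical stand-in for Scrapetor::Sitemap.urls: same block-or-Enumerator
# convention, but backed by an in-memory array rather than a real sitemap.
module FakeSitemap
  ENTRIES = [
    ["https://example.com/",      { lastmod: "2024-01-01", changefreq: nil,      priority: "1.0" }],
    ["https://example.com/about", { lastmod: nil,          changefreq: nil,      priority: "0.5" }],
    ["https://example.com/blog",  { lastmod: "2024-02-10", changefreq: "weekly", priority: nil }],
  ].freeze

  def self.urls(&block)
    # No block given: hand back an Enumerator over this same method.
    return enum_for(:urls) unless block

    ENTRIES.each { |url, meta| yield url, meta }
  end
end

# Take only the first two entries; iteration stops once they are collected.
first_two = FakeSitemap.urls.first(2).map { |url, _meta| url }

# Keep only entries that carry a lastmod value.
dated = FakeSitemap.urls.select { |_url, meta| meta[:lastmod] }.map(&:first)
```

With a very large sitemap, first(n) or lazy is likely a better fit than to_a, because to_a collects every entry before returning.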
Class Method Summary
- .open_source(source) ⇒ Object
- .urls(source, depth: 0, max_depth: 5, &block) ⇒ Object
  Stream-iterate every URL in the sitemap.
Class Method Details
.open_source(source) ⇒ Object
# File 'lib/scrapetor/sitemap.rb', line 45
def self.open_source(source)
  return source if source.respond_to?(:read)
  return StringIO.new(source) if source.is_a?(String) && !source.start_with?("http")
  resp = Scrapetor::Fetcher.get(source.to_s)
  StringIO.new(resp[:body])
end
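The listing above dispatches on three kinds of input: an IO-like object, a raw XML string, and anything else, which is treated as a URL. The following sketch mirrors that logic so each branch can be exercised without network access. The open_source_sketch method and its fetcher: keyword are hypothetical. The keyword stands in for Scrapetor::Fetcher.get, assumed here to return a hash with a :body key, as the listing suggests.

```ruby
require "stringio"

# Sketch of the three-way dispatch. `fetcher` is a hypothetical stand-in for
# Scrapetor::Fetcher.get and must return a hash with a :body key.
def open_source_sketch(source, fetcher:)
  # 1. Anything that responds to #read is passed through untouched.
  return source if source.respond_to?(:read)
  # 2. A String that does not start with "http" is treated as XML content.
  return StringIO.new(source) if source.is_a?(String) && !source.start_with?("http")

  # 3. Everything else is fetched as a URL and its body wrapped in a StringIO.
  resp = fetcher.call(source.to_s)
  StringIO.new(resp[:body])
end

fake_fetcher = ->(url) { { body: "<urlset><!-- fetched from #{url} --></urlset>" } }

io       = StringIO.new("<urlset/>")
from_io  = open_source_sketch(io, fetcher: fake_fetcher)
from_xml = open_source_sketch("<urlset/>", fetcher: fake_fetcher)
from_url = open_source_sketch("https://example.com/sitemap.xml", fetcher: fake_fetcher)
```

One consequence of the second branch is that a plain file path such as "sitemap.xml" would be read as XML content rather than opened. Passing an open File, which responds to read, appears to be the way to ingest a local file.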
.urls(source, depth: 0, max_depth: 5, &block) ⇒ Object
Stream-iterate every URL in the sitemap. Recurses into <sitemapindex> entries automatically. Yields (url, meta) where meta carries :lastmod / :changefreq / :priority when present.
# File 'lib/scrapetor/sitemap.rb', line 21
def self.urls(source, depth: 0, max_depth: 5, &block)
  return enum_for(:urls, source, depth: depth, max_depth: max_depth) unless block
  raise ArgumentError, "sitemap recursion too deep" if depth > max_depth
  io = open_source(source)
  Scrapetor.stream(io, outer: "url") do |doc|
    loc = doc.at_css("loc")&.text&.strip
    next unless loc && !loc.empty?
    meta = {
      lastmod: doc.at_css("lastmod")&.text&.strip,
      changefreq: doc.at_css("changefreq")&.text&.strip,
      priority: doc.at_css("priority")&.text&.strip,
    }
    yield loc, meta
  end
  # If the file was a sitemapindex instead, the <url> stream above
  # found nothing. Re-open and scan for <sitemap><loc>.
  child_io = open_source(source)
  Scrapetor.stream(child_io, outer: "sitemap") do |doc|
    child_loc = doc.at_css("loc")&.text&.strip
    next unless child_loc
    urls(child_loc, depth: depth + 1, max_depth: max_depth, &block)
  end
end
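The recursion and depth guard in the listing above can be illustrated without any XML parsing. In the sketch below, the SitemapWalk module, its SITEMAPS table, and the file names are all hypothetical. Each sitemap is modelled as a hash entry that is either an index of child sitemaps or a set of page URLs. Only the control flow (enum_for, the depth check, and passing the block down) follows the listing.

```ruby
# Hypothetical in-memory model: an :index entry lists child sitemaps,
# a :urlset entry lists page URLs.
module SitemapWalk
  SITEMAPS = {
    "root.xml"  => { type: :index,  items: ["posts.xml", "pages.xml"] },
    "posts.xml" => { type: :urlset, items: ["https://example.com/p/1", "https://example.com/p/2"] },
    "pages.xml" => { type: :urlset, items: ["https://example.com/about"] },
    "loop.xml"  => { type: :index,  items: ["loop.xml"] }, # refers to itself
  }.freeze

  def self.walk(name, depth: 0, max_depth: 5, &block)
    return enum_for(:walk, name, depth: depth, max_depth: max_depth) unless block
    # Same guard as the listing: fail once depth exceeds max_depth.
    raise ArgumentError, "sitemap recursion too deep" if depth > max_depth

    entry = SITEMAPS.fetch(name)
    case entry[:type]
    when :urlset
      entry[:items].each { |url| yield url }
    when :index
      # Recurse one level deeper, forwarding the caller's block.
      entry[:items].each { |child| walk(child, depth: depth + 1, max_depth: max_depth, &block) }
    end
  end
end

all_urls = SitemapWalk.walk("root.xml").to_a

# A self-referencing index trips the guard instead of recursing forever.
loop_error =
  begin
    SitemapWalk.walk("loop.xml", max_depth: 3).to_a
    nil
  rescue ArgumentError => e
    e.message
  end
```

Because the check is depth > max_depth, the default max_depth of 5 allows depths 0 through 5 and raises only at depth 6.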