Module: Scrapetor::Pagination
- Defined in:
- lib/scrapetor/pagination.rb
Overview
Pagination helper. Walks a page sequence by detecting the “next page” URL from the document — in priority order:
1. <link rel="next" href="..."> in <head>
2. a[rel~="next"] (most common pattern; HTML spec compliant)
3. The configured CSS selector via :next_link
Stops when no next link is found, when max_pages is reached, or when the next URL has already been visited (defensive against malformed next links that point back at the current page or form a loop).
    Scrapetor::Pagination.each_page("https://example.com/listings") do |doc, url|
      doc.css(".product").each { |p| ... }
    end
Yields (doc, url) for each page in order; called without a block, it returns an Enumerator. When :http is set to an object that responds to fetch (a Scrapetor::Fetcher or Session-like object), that object is used for fetches; otherwise Scrapetor::Fetcher (HTTP/2 via libcurl) is used by default, with a Net::HTTP fallback if libcurl isn't available.
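The stop conditions described above can be illustrated without any network access. The following is a minimal sketch, not the library's code: walk_pages and the in-memory hashes are hypothetical stand-ins in which each URL maps directly to its next-page URL (or nil on the last page).

```ruby
require "set"

# Hypothetical stand-in for the page walk: next_links maps each URL to the
# URL of the following page, or nil when there is no next link.
def walk_pages(next_links, start_url, max_pages: 50)
  visited = Set.new
  order = []
  url = start_url
  while url && order.size < max_pages
    break if visited.include?(url) # self-link or loop: stop
    visited << url
    order << url
    url = next_links[url]          # nil ends the walk
  end
  order
end

site        = { "/p1" => "/p2", "/p2" => "/p3", "/p3" => nil }
looped      = { "/p1" => "/p2", "/p2" => "/p1" }
self_linked = { "/p1" => "/p1" }

walk_pages(site, "/p1")               # => ["/p1", "/p2", "/p3"]
walk_pages(site, "/p1", max_pages: 2) # => ["/p1", "/p2"]
walk_pages(looped, "/p1")             # => ["/p1", "/p2"]
walk_pages(self_linked, "/p1")        # => ["/p1"]
```

Each call exercises one condition: no next link, the max_pages cap, a loop back to an earlier page, and a next link pointing at the page itself.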
Constant Summary
- DEFAULT_MAX_PAGES = 50
- DEFAULT_DELAY = 0.0
Class Method Summary
- .absolutize(href, base) ⇒ Object
- .each_page(start_url, max_pages: DEFAULT_MAX_PAGES, delay: DEFAULT_DELAY, http: nil, next_link: nil) ⇒ Object
- .fetch_page(url, http) ⇒ Object
- .next_page_url(doc, current_url, custom_selector = nil) ⇒ Object
  Inspect a document and return the next page URL, or nil if this is the last page.
Class Method Details
.absolutize(href, base) ⇒ Object
    # File 'lib/scrapetor/pagination.rb', line 97

    def self.absolutize(href, base)
      return nil if href.nil? || href.empty?
      URI.join(base, href).to_s
    rescue URI::InvalidURIError
      nil
    end
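To make the resolution rules concrete, here is a small sketch using only Ruby's standard uri library. The resolve helper is a hypothetical copy of the method's logic for illustration, and the URLs are invented.

```ruby
require "uri"

# Hypothetical stand-alone copy of the absolutize logic, for illustration.
def resolve(href, base)
  return nil if href.nil? || href.empty?
  URI.join(base, href).to_s
rescue URI::InvalidURIError
  nil
end

base = "https://example.com/listings?page=2"

resolve("?page=3", base)                    # query-only link keeps the base path
resolve("/listings/page/3", base)           # root-relative link replaces the path
resolve("page/3", base)                     # relative link replaces the last segment
resolve("https://cdn.example.org/p3", base) # absolute link is returned unchanged
resolve("", base)                           # => nil
resolve("bad href", base)                   # => nil (the space makes the URI invalid)
```

Note that a relative href such as "page/3" resolves against the directory of the base path, so with the base above it becomes https://example.com/page/3 rather than https://example.com/listings/page/3.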
.each_page(start_url, max_pages: DEFAULT_MAX_PAGES, delay: DEFAULT_DELAY, http: nil, next_link: nil) ⇒ Object
    # File 'lib/scrapetor/pagination.rb', line 29

    def self.each_page(start_url, max_pages: DEFAULT_MAX_PAGES, delay: DEFAULT_DELAY, http: nil, next_link: nil)
      return enum_for(:each_page, start_url, max_pages: max_pages, delay: delay, http: http, next_link: next_link) unless block_given?

      url = start_url.to_s
      visited = {}
      page_no = 0
      while url && page_no < max_pages
        break if visited[url]
        visited[url] = true
        page_no += 1
        doc = fetch_page(url, http)
        yield doc, url
        nxt = next_page_url(doc, url, next_link)
        sleep delay if delay > 0 && nxt
        url = nxt
      end
      nil
    end
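Because the method returns enum_for(...) when no block is given, callers can treat the page walk as an Enumerator and stop early. The sketch below shows that pattern with a hypothetical stand-in, each_fake_page, which logs a URL instead of fetching it; it assumes nothing about Scrapetor beyond the enum_for line in the source above.

```ruby
FETCH_LOG = [] # records each simulated fetch

# Stand-in for each_page: yields (doc, url) pairs, and returns an
# Enumerator when called without a block.
def each_fake_page(start, max_pages: 50)
  return enum_for(:each_fake_page, start, max_pages: max_pages) unless block_given?

  max_pages.times do |i|
    url = "https://example.com/listings?page=#{start + i}"
    FETCH_LOG << url # stands in for the network fetch
    yield "doc #{start + i}", url
  end
  nil
end

# Take only the first two pages; iteration stops after the second yield.
first_two = each_fake_page(1).first(2).map { |_doc, url| url }

first_two.size # => 2
FETCH_LOG.size # => 2 (page 3 was never "fetched")
```

The same shape should apply to the real method, for example taking the first few pages with first(n), since pages are only fetched as the Enumerator is consumed.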
.fetch_page(url, http) ⇒ Object
    # File 'lib/scrapetor/pagination.rb', line 87

    def self.fetch_page(url, http)
      if http && http.respond_to?(:fetch)
        http.fetch(url)
      elsif defined?(Scrapetor::Fetcher) && Scrapetor::Fetcher.available?
        Scrapetor::Fetcher.fetch(url)
      else
        Scrapetor.fetch(url)
      end
    end
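Since the only requirement checked here is respond_to?(:fetch), any object with a fetch(url) method could in principle be passed as :http. The following is a sketch of such an object; CachingFetcher and its backend argument are hypothetical names, and the lambda returns a plain string where a real backend would need to return whatever document type Scrapetor's own fetchers return.

```ruby
# Hypothetical duck-typed fetcher: wraps any callable and memoizes results.
class CachingFetcher
  attr_reader :hits

  def initialize(backend)
    @backend = backend # anything that responds to call(url)
    @cache = {}
    @hits = 0
  end

  def fetch(url)
    if @cache.key?(url)
      @hits += 1 # served from the cache, backend not called
      return @cache[url]
    end
    @cache[url] = @backend.call(url)
  end
end

backend_calls = []
fetcher = CachingFetcher.new(->(url) { backend_calls << url; "<html>#{url}</html>" })

fetcher.respond_to?(:fetch)            # => true
fetcher.fetch("https://example.com/a")
fetcher.fetch("https://example.com/a")
backend_calls.size                     # => 1
fetcher.hits                           # => 1
```

An object like this would be handed over as each_page(start_url, http: fetcher), which is useful for caching, rate limiting, or substituting canned pages in tests.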
.next_page_url(doc, current_url, custom_selector = nil) ⇒ Object
Inspect a document and return the next page URL, or nil if this is the last page. Honours, in priority order: <link rel="next">, then a[rel~="next"], then a custom selector (the value given as :next_link to each_page).
    # File 'lib/scrapetor/pagination.rb', line 57

    def self.next_page_url(doc, current_url, custom_selector = nil)
      # 1. <link rel="next">
      if (link = doc.at_css('link[rel~="next"]'))
        href = link["href"] || link[:href]
        return absolutize(href, current_url) if href && !href.empty?
      end
      # 2. a[rel~="next"]
      doc.css('a[rel~="next"]').each do |a|
        href = a["href"] || a[:href]
        next unless href && !href.empty?
        abs = absolutize(href, current_url)
        return abs if abs && abs != current_url
      end
      # 3. Custom selector: first link element under the match.
      if custom_selector
        node = doc.at_css(custom_selector)
        if node
          # Walk up if user gave us a link target like ".next-link"
          # already pointing at an <a>, or treat as the wrapper and
          # grab the first <a> within.
          link_node = node.respond_to?(:name) && node.name.casecmp?("a") ? node : node.at_css("a")
          if link_node
            href = link_node["href"] || link_node[:href]
            return absolutize(href, current_url) if href && !href.empty?
          end
        end
      end
      nil
    end