Class: Scrapetor::Document
- Inherits: Object
  - Object
  - Scrapetor::Document
- Defined in: lib/scrapetor/document.rb
Instance Attribute Summary
- #base_url ⇒ Object (readonly)
  Returns the value of attribute base_url.
- #encoding ⇒ Object (readonly)
  Returns the value of attribute encoding.
Instance Method Summary
- #all_elements ⇒ Object
- #at(selector, *_extra) ⇒ Object (also: #at_css)
  Accepts the Nokogiri-compatible signature `doc.at(sel, ns_or_handler)`.
- #at_xpath(expr) ⇒ Object
- #backing ⇒ Object
- #batch_css(selectors) ⇒ Object
  Run an array of CSS selectors in ONE Ruby/C boundary crossing.
- #body ⇒ Object
- #cache_selector(selector) ⇒ Object
- #class_index ⇒ Object
  Phase-2 hooks: structural indexes.
- #css(*selectors) ⇒ Object (also: #search)
  CSS query entry point.
- #css_single(selector) ⇒ Object
- #errors ⇒ Object
  Nokogiri-compat predicates.
- #extract(schema = nil, &block) ⇒ Object
  Single-result extract on the document scope.
- #extract_css(map) ⇒ Object
  Hash form: `{ name => selector, … }` -> `{ name => result, … }`.
- #extract_each(outer_selector, fields) ⇒ Object
  Iterate matches of `outer_selector` across the whole document and build a Hash per match using `fields` (a name => selector map).
- #head ⇒ Object
- #html ⇒ Object
- #html? ⇒ Boolean
- #html_str ⇒ Object
- #id_index ⇒ Object
- #initialize(html, base_url: nil, build_indexes: false, encoding: :auto, native: nil) ⇒ Document (constructor)
  A new instance of Document.
- #json_ld ⇒ Object
  Structured-data extractors for SEO/RAG/structured-content pipelines.
- #microdata ⇒ Object
- #opengraph ⇒ Object
- #page_type ⇒ Object
- #rdfa ⇒ Object
- #root ⇒ Object
- #run_selector(selector, scope) ⇒ Object
- #schema_org(type: nil) ⇒ Object
- #selector_cache_size ⇒ Object
- #stats ⇒ Object
- #tag_index ⇒ Object
- #text ⇒ Object (also: #content, #inner_text)
- #title ⇒ Object
- #to_html ⇒ Object (also: #to_s)
- #traverse(&block) ⇒ Object
- #twitter_card ⇒ Object
- #xml? ⇒ Boolean
- #xpath(expr) ⇒ Object
  Evaluate an XPath expression against this document.
Constructor Details
#initialize(html, base_url: nil, build_indexes: false, encoding: :auto, native: nil) ⇒ Document
Returns a new instance of Document.
    # File 'lib/scrapetor/document.rb', line 7

    def initialize(html, base_url: nil, build_indexes: false, encoding: :auto, native: nil)
      @base_url = base_url
      raw = html.to_s
      if encoding == :auto
        @encoding = Scrapetor::Encoding.detect(raw)
        @html_str = Scrapetor::Encoding.to_utf8(raw)
      else
        @encoding = encoding.to_s
        @html_str = raw.dup.force_encoding(@encoding).encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
      end
      @backing = nil # parsed lazily; native extract bypasses this entirely
      @selector_cache = {}
      @indexes_built = false
      @class_index = nil
      @id_index = nil
      @tag_index = nil
      # Hot-path slots (populated by backing()): keeping these
      # initialised silences "instance variable not initialized" and
      # makes the fast-path test a simple nil check.
      @native_doc = nil
      @native_wrapper = nil
      @plan_cache = nil
      @lazy_ids = nil
      # If a pre-parsed native handle was passed in (persistent-cache
      # hit), wrap it directly and skip the lazy-parse path.
      @prebuilt_native = native
      build_indexes! if build_indexes
    end
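The explicit-encoding branch of the constructor can be tried in isolation. The following is a minimal sketch that uses only Ruby's core String API; the helper name `to_utf8_from` is hypothetical and not part of Scrapetor, and the `:auto` branch (which relies on `Scrapetor::Encoding`) is not covered here.

```ruby
# Hypothetical helper mirroring the non-:auto branch of Document#initialize.
# It is not part of Scrapetor; it only demonstrates the String calls involved.
def to_utf8_from(raw, encoding)
  raw.to_s
     .dup                            # never mutate the caller's string
     .force_encoding(encoding.to_s)  # relabel the bytes; no transcoding yet
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
end

latin1_bytes = "caf\xE9".b           # "café" as ISO-8859-1 bytes, tagged binary
utf8 = to_utf8_from(latin1_bytes, "ISO-8859-1")

puts utf8.encoding                   # UTF-8
puts utf8.valid_encoding?            # true
```

Because `replace: ""` is passed, bytes that cannot be converted are dropped silently rather than raising, so the resulting string can be shorter than the input.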
Instance Attribute Details
#base_url ⇒ Object (readonly)
Returns the value of attribute base_url.
    # File 'lib/scrapetor/document.rb', line 5

    def base_url
      @base_url
    end
#encoding ⇒ Object (readonly)
Returns the value of attribute encoding.
    # File 'lib/scrapetor/document.rb', line 5

    def encoding
      @encoding
    end
Instance Method Details
#all_elements ⇒ Object
    # File 'lib/scrapetor/document.rb', line 419

    def all_elements
      build_indexes! unless @indexes_built
      @all_elements
    end
#at(selector, *_extra) ⇒ Object Also known as: at_css
Accepts the Nokogiri-compatible signature `doc.at(sel, ns_or_handler)`. The extra arguments (namespace prefix, handler) matter only for XPath; CSS selectors ignore them, so we accept varargs and discard everything past the selector. Without this, callers that pass `doc.at(sel, namespaces_hash)` (or similar Bing-style patterns) hit `ArgumentError: wrong number of arguments`.

    # File 'lib/scrapetor/document.rb', line 190

    def at(selector, *_extra)
      result = backing.at_css(selector)
      return nil if result.nil?
      return result if result.is_a?(String)
      Node.new(self, result)
    end
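The varargs signature described above can be illustrated without the library. This is a sketch under stated assumptions: `TinyDoc` and its Hash-backed lookup are hypothetical stand-ins, not Scrapetor's implementation; only the `at(selector, *_extra)` shape is taken from the listing.

```ruby
# Stand-in class (hypothetical, not Scrapetor's implementation) showing the
# "accept and discard extra arguments" signature.
class TinyDoc
  def initialize(nodes)
    @nodes = nodes # Hash of selector => node; a stub for a real parsed tree
  end

  # Extra positional arguments (for example a namespaces Hash) are collected
  # into _extra and ignored, so Nokogiri-style call sites do not raise.
  def at(selector, *_extra)
    @nodes[selector]
  end
  alias at_css at
end

doc = TinyDoc.new("h1" => "heading node")
plain      = doc.at("h1")
with_extra = doc.at("h1", { "xmlns" => "http://www.w3.org/1999/xhtml" })
```

Both calls return the same value; with a strict `at(selector)` signature the second call would raise `ArgumentError`.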
#at_xpath(expr) ⇒ Object
    # File 'lib/scrapetor/document.rb', line 211

    def at_xpath(expr)
      result = xpath(expr)
      result.is_a?(Array) ? result.first : result
    end
#backing ⇒ Object
    # File 'lib/scrapetor/document.rb', line 381

    def backing
      return @backing if @backing
      @backing = if defined?(Scrapetor::Native::DocumentWrapper) && Scrapetor::Native::AVAILABLE_DOM
        native = @prebuilt_native || Scrapetor::Native::Document.parse(@html_str)
        @prebuilt_native = nil
        Scrapetor::Native::DocumentWrapper.new(native)
      else
        Dom::Parser.parse(@html_str)
      end
      # Cache the hot-path slots so Document#css can skip the indirection.
      if defined?(Scrapetor::Native::DocumentWrapper) && @backing.is_a?(Scrapetor::Native::DocumentWrapper)
        @native_doc = @backing.native
        @native_wrapper = @backing
        @plan_cache = @backing.instance_variable_get(:@compile_cache)
        @lazy_ids = Scrapetor::Native::DocumentWrapper::LazyIds
      end
      @backing
    end
#batch_css(selectors) ⇒ Object
Run an array of CSS selectors in ONE Ruby/C boundary crossing. On selector-heavy workloads (SERP-style pages with ~30 selectors per scrape) this amortises the per-query Ruby overhead across all of them: N selectors cost roughly one selector's worth of Ruby dispatch, not N. Returns an Array of NodeSets (or Arrays of Strings, for `::text` / `::attr(name)` selectors) parallel to the input.

    title_ns, price_strs, hrefs = doc.batch_css(
      ["h1.title", ".price::text", "a::attr(href)"]
    )

    # File 'lib/scrapetor/document.rb', line 120

    def batch_css(selectors)
      bk = backing
      unless bk.respond_to?(:batch_css)
        # Pure-Ruby Dom fallback: no native engine. Loop manually.
        return selectors.map { |s| css(s) }
      end
      bk.batch_css(self, selectors)
    end
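The capability check in the listing (one batched call when the backend supports it, a per-selector loop otherwise) can be sketched with two stub backends. `BatchingBackend`, `LoopingBackend`, and `run_batch` are hypothetical names introduced for illustration; the call counters exist only to make the difference in dispatch visible.

```ruby
# Hypothetical backend that answers a whole selector list in one call.
class BatchingBackend
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def batch_css(selectors)
    @calls += 1
    selectors.map { |s| "batched:#{s}" }
  end
end

# Hypothetical backend that only knows single-selector queries.
class LoopingBackend
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def css(selector)
    @calls += 1
    "looped:#{selector}"
  end
end

# Use the batched entry point when it exists; otherwise loop manually.
def run_batch(backend, selectors)
  return backend.batch_css(selectors) if backend.respond_to?(:batch_css)

  selectors.map { |s| backend.css(s) }
end

selectors = ["h1.title", ".price::text", "a::attr(href)"]
fast = BatchingBackend.new
slow = LoopingBackend.new
fast_results = run_batch(fast, selectors)
slow_results = run_batch(slow, selectors)
```

In both cases the result array is parallel to the input, so destructuring assignment works the same way regardless of which path ran.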
#body ⇒ Object
    # File 'lib/scrapetor/document.rb', line 246

    def body
      n = backing.at_css("body")
      n && Node.new(self, n)
    end
#cache_selector(selector) ⇒ Object
    # File 'lib/scrapetor/document.rb', line 429

    def cache_selector(selector)
      @selector_cache[selector] ||= Selector.compile(selector)
    end
#class_index ⇒ Object
Phase-2 hooks: structural indexes. Built on demand. The native backend will replace these with arena-resident indexes.
    # File 'lib/scrapetor/document.rb', line 404

    def class_index
      build_indexes! unless @indexes_built
      @class_index
    end
#css(*selectors) ⇒ Object Also known as: search
CSS query entry point. Inlined hot path for the >95% case: a selector with no `::` pseudo-element and a cache-hit native plan. That bypasses backing.lazy_css, peel_pseudo_element, and the method dispatch chain, dropping the per-call Ruby overhead to a single Hash#[] + Struct.new + NodeSet.new.

    # File 'lib/scrapetor/document.rb', line 45

    def css(*selectors)
      # Nokogiri-compat: `doc.css(sel1, sel2, ...)` accepts multiple
      # selectors and returns the union of matches across all of them.
      # Drop trailing non-string arguments (Nokogiri also accepts an
      # XPath namespaces hash here; that's a no-op for CSS).
      selectors = selectors.reject { |a| !a.is_a?(String) }
      raise ArgumentError, "Document#css requires at least one selector" if selectors.empty?
      return css_single(selectors.first) if selectors.size == 1

      seen = {}
      union = []
      string_result = nil
      selectors.each do |sel|
        result = css_single(sel)
        if result.is_a?(Array)
          string_result = true
          result.each { |s| union << s }
        else
          # NodeSet: pull backing items and dedupe.
          string_result = false if string_result.nil?
          result.each do |node|
            bk = node.respond_to?(:backing_node) ? node.backing_node : node
            key = bk.object_id
            next if seen[key]
            seen[key] = true
            union << bk
          end
        end
      end
      string_result ? union : NodeSet.new(self, union)
    end
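The multi-selector branch of the listing merges node results by object identity rather than by equality. The sketch below isolates that step; `union_by_identity` is a hypothetical helper name, and plain String objects stand in for backing nodes. Note that, per the listing, String results (from `::text` / `::attr(name)` selectors) are appended without deduplication, which this sketch does not model.

```ruby
# Merge several result lists into one, keeping first-seen order and dropping
# repeats by object identity (not by ==), as the node branch of the listing does.
def union_by_identity(result_sets)
  seen = {}
  union = []
  result_sets.each do |set|
    set.each do |node|
      key = node.object_id
      next if seen[key]

      seen[key] = true
      union << node
    end
  end
  union
end

a = "div".dup   # distinct String objects stand in for backing nodes
b = "div".dup   # equal to a by ==, but a different object
c = "span".dup

merged = union_by_identity([[a, c], [c, b]])
puts merged.size # 3: c is dropped once, b is kept because it is a separate object
```

Identity-based deduplication matters when two distinct elements have identical markup: they compare equal as strings but must both appear in the union.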
#css_single(selector) ⇒ Object
    # File 'lib/scrapetor/document.rb', line 77

    def css_single(selector)
      # Fast path: native backing, no mutations applied yet, plain
      # String selector, no pseudo-element. One Hash lookup + one C
      # call + two allocations. After any mutation the wrapper flips
      # into dom_mode and we route through the slow path so reads see
      # the user's edits; checking @native_wrapper.dom_mode? is one
      # ivar read, negligible vs the saving when we stay native.
      if @native_doc && !@native_wrapper.dom_mode? && selector.is_a?(String) && !selector.include?("::")
        plan = @plan_cache[selector]
        if plan
          return NodeSet.new(self, @lazy_ids.new(@native_wrapper, @native_doc, @native_doc.run_chain(plan, nil)))
        elsif !@plan_cache.key?(selector)
          plan = Scrapetor::Native.compile_selector_chain(selector)
          @plan_cache[selector] = plan || false
          if plan
            return NodeSet.new(self, @lazy_ids.new(@native_wrapper, @native_doc, @native_doc.run_chain(plan, nil)))
          end
        end
      end
      # Slow path: pseudo-element, comma, fallback, post-mutation, or
      # non-native backing.
      bk = backing
      result = bk.respond_to?(:lazy_css) ? bk.lazy_css(selector) : bk.css(selector)
      if result.is_a?(Array) && (result.first.is_a?(String) || (result.empty? && pseudo_element?(selector)))
        return result
      end
      if @lazy_ids && result.is_a?(@lazy_ids)
        return NodeSet.new(self, result)
      end
      NodeSet.new(self, result.to_a)
    end
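The fast path above stores `false` in the plan cache when a selector does not compile, and uses `key?` to tell "never tried" apart from "tried and failed". A minimal sketch of that negative-caching idea follows; `PlanCache` and the stand-in compiler block are hypothetical and only model the cache behaviour, not the native engine.

```ruby
# Hypothetical cache (not Scrapetor's class) showing how storing `false`
# for a selector that failed to compile avoids recompiling it on every call.
class PlanCache
  attr_reader :compile_calls

  def initialize(&compiler)
    @plans = {}
    @compiler = compiler
    @compile_calls = 0
  end

  # Returns the plan, or nil when the selector has no usable plan.
  def fetch(selector)
    plan = @plans[selector]
    return plan if plan                   # cache hit with a usable plan
    return nil if @plans.key?(selector)   # cached failure (stored as false)

    @compile_calls += 1
    plan = @compiler.call(selector)
    @plans[selector] = plan || false      # remember failures too
    plan
  end
end

# Stand-in compiler: pretends selectors containing "!" are unsupported.
cache = PlanCache.new { |sel| sel.include?("!") ? nil : [:plan, sel] }
3.times { cache.fetch("div.item") }
3.times { cache.fetch("div!") }
puts cache.compile_calls # 2: each selector is compiled at most once
```

Without the `false` marker, a selector that never compiles would pay the compile cost on every query before falling through to the slow path.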
#errors ⇒ Object
Nokogiri-compat predicates.
    # File 'lib/scrapetor/document.rb', line 267

    def errors
      []
    end
#extract(schema = nil, &block) ⇒ Object
Single-result extract on the document scope. One C call covers field compilation, plan lookup, and result assembly.
    # File 'lib/scrapetor/document.rb', line 143

    def extract(map)
      bk = backing
      if defined?(Scrapetor::Native::DocumentWrapper) && bk.is_a?(Scrapetor::Native::DocumentWrapper) && !bk.dom_mode?
        r = bk.native.extract_one_h(nil, map, bk)
        return r unless r.equal?(true)
      end
      out = {}
      map.each_pair { |k, sel| out[k] = at_css(sel) }
      out
    end
#extract_css(map) ⇒ Object
Hash form: `{ name => selector, … }` -> `{ name => result, … }`. The classic scrape pattern in two lines. Same one-boundary cost as batch_css.

    # File 'lib/scrapetor/document.rb', line 132

    def extract_css(map)
      keys = map.keys
      selectors = map.values
      results = batch_css(selectors)
      out = {}
      keys.each_with_index { |k, i| out[k] = results[i] }
      out
    end
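The Hash form is a thin wrapper: it sends the values through one batched query and re-attaches the names by position. The sketch below shows that reshaping on its own; `fake_batch` and `extract_with` are hypothetical names, and `fake_batch` merely fabricates one placeholder result per selector where a real document would return NodeSets or String arrays.

```ruby
# Stand-in for a batch query (hypothetical): returns one result per selector,
# in input order, the way batch_css is documented to.
def fake_batch(selectors)
  selectors.map { |s| "result for #{s}" }
end

# Split the map into names and selectors, query once, then zip names back on.
def extract_with(map)
  results = fake_batch(map.values)  # one call for all selectors
  map.keys.zip(results).to_h        # re-attach names by position
end

record = extract_with({ title: "h1.title", price: ".price::text" })
```

This relies on Ruby Hashes preserving insertion order, so `map.keys` and `map.values` line up index by index.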
#extract_each(outer_selector, fields) ⇒ Object
Iterate matches of `outer_selector` across the whole document and build a Hash per match using `fields` (a name => selector map). Returns Array<Hash>. The inner selectors run scoped to each match, so a `result.at_css(field)`-style parser becomes:

    doc.extract_each(".result", {
      title: ".title::text",
      price: ".price::text",
      href: "a::attr(href)",
    })

When the document is native-backed and every selector compiles cleanly, the whole iteration runs in a single C call: one outer plan plus N inner plans times M matches, with zero Ruby↔C round-trips on the hot path. Falls back to the per-row Ruby loop only when a selector compiles to nil (rare; the engine covers nearly every CSS Selectors Level 4 shape natively after the audit-driven coverage work).

    # File 'lib/scrapetor/document.rb', line 173

    def extract_each(outer_selector, fields)
      bk = backing
      if defined?(Scrapetor::Native::DocumentWrapper) && bk.is_a?(Scrapetor::Native::DocumentWrapper) && !bk.dom_mode?
        outer_str = outer_selector.is_a?(String) ? outer_selector : outer_selector.to_s
        r = bk.native.extract_each_h(outer_str, nil, fields, bk)
        return r unless r.equal?(true)
      end
      css(outer_selector).map { |node| node.extract(fields) }
    end
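Judging from `return r unless r.equal?(true)` in the listing, the native call appears to return the literal `true` as a signal that it could not handle the request, at which point the per-row Ruby loop runs. The sketch below models that control flow only. `Row`, `fast_extract_each`, and `extract_each_rows` are hypothetical, and treating `:has(` as "unsupported" is an arbitrary choice made for the demonstration, not a statement about the engine.

```ruby
# Hypothetical row type standing in for a matched node: it answers
# extract(fields) by looking each selector up in its own small table.
class Row
  def initialize(table)
    @table = table
  end

  def extract(fields)
    fields.transform_values { |selector| @table[selector] }
  end
end

# Hypothetical fast path. Returns `true` as a sentinel meaning "not handled
# here, use the Ruby loop"; any other return value is the finished result.
def fast_extract_each(rows, fields)
  return true if fields.values.any? { |sel| sel.include?(":has(") }

  rows.map { |row| row.extract(fields) }
end

def extract_each_rows(rows, fields)
  r = fast_extract_each(rows, fields)
  return r unless r.equal?(true)          # fast path produced a result

  rows.map { |row| row.extract(fields) }  # per-row fallback
end

rows = [
  Row.new({ ".title::text" => "Kettle", ".price::text" => "19.99" }),
  Row.new({ ".title::text" => "Teapot", ".price::text" => "24.50" })
]
records  = extract_each_rows(rows, { title: ".title::text", price: ".price::text" })
fallback = extract_each_rows(rows, { deal: ".badge:has(b)" })
```

Using `equal?(true)` rather than a truthiness test means an ordinary result such as an empty Array is still returned from the fast path instead of triggering the fallback.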
#head ⇒ Object
    # File 'lib/scrapetor/document.rb', line 251

    def head
      n = backing.at_css("head")
      n && Node.new(self, n)
    end
#html ⇒ Object
    # File 'lib/scrapetor/document.rb', line 256

    def html
      n = backing.at_css("html") || backing
      Node.new(self, n)
    end
#html? ⇒ Boolean
    # File 'lib/scrapetor/document.rb', line 271

    def html?
      true
    end
#html_str ⇒ Object
    # File 'lib/scrapetor/document.rb', line 36

    def html_str
      @html_str
    end
#id_index ⇒ Object
    # File 'lib/scrapetor/document.rb', line 409

    def id_index
      build_indexes! unless @indexes_built
      @id_index
    end
#json_ld ⇒ Object
Structured-data extractors for SEO/RAG/structured-content pipelines.

    # File 'lib/scrapetor/document.rb', line 281

    def json_ld
      Scrapetor::StructuredData.json_ld(self)
    end
#microdata ⇒ Object
    # File 'lib/scrapetor/document.rb', line 297

    def microdata
      Scrapetor::Microdata.extract(self)
    end
#opengraph ⇒ Object
    # File 'lib/scrapetor/document.rb', line 285

    def opengraph
      Scrapetor::StructuredData.opengraph(self)
    end
#page_type ⇒ Object
    # File 'lib/scrapetor/document.rb', line 305

    def page_type
      Scrapetor::PageType.detect(self)
    end
#rdfa ⇒ Object
    # File 'lib/scrapetor/document.rb', line 301

    def rdfa
      Scrapetor::RDFa.extract(self)
    end
#root ⇒ Object
    # File 'lib/scrapetor/document.rb', line 230

    def root
      el = backing.at_css("html") || backing
      Node.new(self, el)
    end
#run_selector(selector, scope) ⇒ Object
    # File 'lib/scrapetor/document.rb', line 424

    def run_selector(selector, scope)
      plan = @selector_cache[selector] ||= Selector.compile(selector)
      Selector.execute(self, plan, scope)
    end
#schema_org(type: nil) ⇒ Object
    # File 'lib/scrapetor/document.rb', line 293

    def schema_org(type: nil)
      Scrapetor::StructuredData.schema_org(self, type: type)
    end
#selector_cache_size ⇒ Object
    # File 'lib/scrapetor/document.rb', line 433

    def selector_cache_size
      @selector_cache.size
    end
#stats ⇒ Object
    # File 'lib/scrapetor/document.rb', line 371

    def stats
      {
        classes: @class_index ? @class_index.size : 0,
        ids: @id_index ? @id_index.size : 0,
        tags: @tag_index ? @tag_index.size : 0,
        selector_cache_size: @selector_cache.size,
        indexes_built: @indexes_built
      }
    end
#tag_index ⇒ Object
    # File 'lib/scrapetor/document.rb', line 414

    def tag_index
      build_indexes! unless @indexes_built
      @tag_index
    end
#text ⇒ Object Also known as: content, inner_text
    # File 'lib/scrapetor/document.rb', line 235

    def text
      backing.text
    end
#title ⇒ Object
    # File 'lib/scrapetor/document.rb', line 241

    def title
      n = backing.at_css("title")
      n && n.text
    end
#to_html ⇒ Object Also known as: to_s
    # File 'lib/scrapetor/document.rb', line 261

    def to_html
      backing.to_html
    end
#traverse(&block) ⇒ Object
    # File 'lib/scrapetor/document.rb', line 216

    def traverse(&block)
      return enum_for(:traverse) unless block_given?
      backing.traverse { |n| yield(n.respond_to?(:element?) ? Node.new(self, n) : n) } if backing.respond_to?(:traverse)
      self
    end
#twitter_card ⇒ Object
    # File 'lib/scrapetor/document.rb', line 289

    def twitter_card
      Scrapetor::StructuredData.twitter_card(self)
    end
#xml? ⇒ Boolean
    # File 'lib/scrapetor/document.rb', line 275

    def xml?
      false
    end
#xpath(expr) ⇒ Object
Evaluate an XPath expression against this document. Implements the common XPath 1.0 subset via Scrapetor::XPath (descendant / child / parent axes, tag / @attr / text() node tests, position + attr-presence + attr-equality + contains() + starts-with() + text() predicates). Returns an Array of Scrapetor::Node when the expression ends at element nodes, or an Array of String for `/@attr` and `/text()` terminations. See lib/scrapetor/xpath.rb for the full supported grammar.

    # File 'lib/scrapetor/document.rb', line 207

    def xpath(expr)
      Scrapetor::XPath.evaluate(self, expr)
    end