Class: Scrapetor::Document

Inherits:
  Object
Defined in:
lib/scrapetor/document.rb

Constructor Details

#initialize(html, base_url: nil, build_indexes: false, encoding: :auto, native: nil) ⇒ Document

Returns a new instance of Document.



# File 'lib/scrapetor/document.rb', line 7

def initialize(html, base_url: nil, build_indexes: false, encoding: :auto, native: nil)
  @base_url = base_url
  raw = html.to_s
  if encoding == :auto
    @encoding = Scrapetor::Encoding.detect(raw)
    @html_str = Scrapetor::Encoding.to_utf8(raw)
  else
    @encoding = encoding.to_s
    @html_str = raw.dup.force_encoding(@encoding).encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
  end
  @backing = nil # parsed lazily; native extract bypasses this entirely
  @selector_cache = {}
  @indexes_built = false
  @class_index = nil
  @id_index = nil
  @tag_index = nil
  # Hot-path slots (populated by backing()): keeping these
  # initialised silences "instance variable not initialized" and
  # makes the fast-path test a simple nil check.
  @native_doc     = nil
  @native_wrapper = nil
  @plan_cache     = nil
  @lazy_ids       = nil
  # If a pre-parsed native handle was passed in (persistent-cache
  # hit), wrap it directly and skip the lazy-parse path.
  @prebuilt_native = native
  build_indexes! if build_indexes
end
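
The explicit-encoding branch of the constructor can be tried in isolation. The following is a minimal sketch using only core String methods; `transcode_to_utf8` is a hypothetical helper name that does not exist in Scrapetor, and the `:auto` branch (which depends on Scrapetor::Encoding) is not reproduced here.

```ruby
# Hypothetical helper (not part of Scrapetor) mirroring the explicit-encoding
# branch of Document#initialize: label the raw bytes with the declared
# encoding, then transcode to UTF-8, dropping anything that cannot be mapped.
def transcode_to_utf8(raw, encoding)
  raw.to_s.dup
     .force_encoding(encoding.to_s)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
end

latin1_bytes = "caf\xE9".b # the bytes of "café" in ISO-8859-1
utf8 = transcode_to_utf8(latin1_bytes, "ISO-8859-1")

puts utf8.encoding        # UTF-8
puts utf8.valid_encoding? # true
```

Because `replace: ""` is passed, unmappable bytes are silently dropped rather than replaced with a marker character, which is worth keeping in mind when the declared encoding might be wrong.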

Instance Attribute Details

#base_url ⇒ Object (readonly)

Returns the value of attribute base_url.

# File 'lib/scrapetor/document.rb', line 5

def base_url
  @base_url
end

#encoding ⇒ Object (readonly)

Returns the value of attribute encoding.

# File 'lib/scrapetor/document.rb', line 5

def encoding
  @encoding
end

Instance Method Details

#all_elements ⇒ Object

# File 'lib/scrapetor/document.rb', line 419

def all_elements
  build_indexes! unless @indexes_built
  @all_elements
end

#at(selector, *_extra) ⇒ Object Also known as: at_css

Accepts the Nokogiri-compatible signature `doc.at(sel, ns_or_handler)`. The extra arguments (namespace prefix, handler) only matter for XPath; CSS selectors ignore them, so this method accepts varargs and discards everything past the selector. Without this, callers that pass `doc.at(sel, namespaces_hash)` (or similar Bing-style patterns) hit `ArgumentError: wrong number of arguments`.

# File 'lib/scrapetor/document.rb', line 190

def at(selector, *_extra)
  result = backing.at_css(selector)
  return nil if result.nil?
  return result if result.is_a?(String)
  Node.new(self, result)
end
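
The varargs signature can be illustrated without the library. This is a minimal sketch, assuming a stand-in class: `TinyDoc` and its lookup table are hypothetical and only demonstrate that extra positional arguments are accepted and ignored.

```ruby
# TinyDoc is a hypothetical stand-in, not Scrapetor::Document. It shows only
# the signature: anything after the selector is swallowed by *_extra.
class TinyDoc
  def initialize(matches)
    @matches = matches # selector => first match
  end

  def at(selector, *_extra)
    @matches[selector]
  end
  alias at_css at
end

doc = TinyDoc.new("h1.title" => "<h1 class=\"title\">Hello</h1>")

plain   = doc.at("h1.title")
with_ns = doc.at_css("h1.title", { "xhtml" => "http://www.w3.org/1999/xhtml" })

puts plain == with_ns # true
```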

#at_xpath(expr) ⇒ Object



# File 'lib/scrapetor/document.rb', line 211

def at_xpath(expr)
  result = xpath(expr)
  result.is_a?(Array) ? result.first : result
end

#backing ⇒ Object

# File 'lib/scrapetor/document.rb', line 381

def backing
  return @backing if @backing
  @backing =
    if defined?(Scrapetor::Native::DocumentWrapper) && Scrapetor::Native::AVAILABLE_DOM
      native = @prebuilt_native || Scrapetor::Native::Document.parse(@html_str)
      @prebuilt_native = nil
      Scrapetor::Native::DocumentWrapper.new(native)
    else
      Dom::Parser.parse(@html_str)
    end
  # Cache the hot-path slots so Document#css can skip the indirection.
  if defined?(Scrapetor::Native::DocumentWrapper) &&
     @backing.is_a?(Scrapetor::Native::DocumentWrapper)
    @native_doc     = @backing.native
    @native_wrapper = @backing
    @plan_cache     = @backing.instance_variable_get(:@compile_cache)
    @lazy_ids       = Scrapetor::Native::DocumentWrapper::LazyIds
  end
  @backing
end

#batch_css(selectors) ⇒ Object

Runs an array of CSS selectors in one Ruby/C boundary crossing. On selector-heavy workloads (SERP-style pages with ~30 selectors per scrape) this amortises the per-query Ruby overhead across all of them: N selectors cost roughly one selector's worth of Ruby dispatch, not N. Returns an Array of NodeSets (or Arrays of Strings, for `::text` / `::attr(name)` selectors) parallel to the input.

title_ns, price_strs, hrefs = doc.batch_css(
  ["h1.title", ".price::text", "a::attr(href)"]
)


# File 'lib/scrapetor/document.rb', line 120

def batch_css(selectors)
  bk = backing
  unless bk.respond_to?(:batch_css)
    # Pure-Ruby Dom fallback — no native engine. Loop manually.
    return selectors.map { |s| css(s) }
  end
  bk.batch_css(self, selectors)
end
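
The capability check behind the fallback can be sketched on its own. The two backing classes and `run_batch` below are hypothetical stand-ins; the real `batch_css` on the native wrapper also receives the document itself, which this sketch leaves out.

```ruby
# Both backings are hypothetical stand-ins used only for illustration.
class BatchBacking
  # Pretends to answer every selector in one call.
  def batch_css(selectors)
    selectors.map { |s| "batched:#{s}" }
  end
end

class PlainBacking
  # Only knows how to answer one selector at a time.
  def css(selector)
    "looped:#{selector}"
  end
end

# Use the batched entry point when the backing offers it, else loop.
def run_batch(backing, selectors)
  return backing.batch_css(selectors) if backing.respond_to?(:batch_css)

  selectors.map { |s| backing.css(s) }
end

selectors = ["h1.title", ".price::text"]
native_results   = run_batch(BatchBacking.new, selectors)
fallback_results = run_batch(PlainBacking.new, selectors)
```

Either way the caller gets one result per selector, in input order, so destructuring assignment works the same on both paths.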

#body ⇒ Object

# File 'lib/scrapetor/document.rb', line 246

def body
  n = backing.at_css("body")
  n && Node.new(self, n)
end

#cache_selector(selector) ⇒ Object



# File 'lib/scrapetor/document.rb', line 429

def cache_selector(selector)
  @selector_cache[selector] ||= Selector.compile(selector)
end

#class_index ⇒ Object

Phase-2 hooks: structural indexes. Built on demand. The native backend will replace these with arena-resident indexes.

# File 'lib/scrapetor/document.rb', line 404

def class_index
  build_indexes! unless @indexes_built
  @class_index
end

#css(*selectors) ⇒ Object Also known as: search

CSS query entry point. The hot path is inlined for the >95% case: a selector with no `::` pseudo-element and a cache-hit native plan. That bypasses backing.lazy_css, peel_pseudo_element, and the method dispatch chain, dropping the per-call Ruby overhead to a single Hash#[] + Struct.new + NodeSet.new.

Raises:

  • (ArgumentError)


# File 'lib/scrapetor/document.rb', line 45

def css(*selectors)
  # Nokogiri-compat: `doc.css(sel1, sel2, ...)` accepts multiple
  # selectors and returns the union of matches across all of them.
  # Drop trailing non-string arguments (Nokogiri also accepts an
  # XPath namespaces hash here — that's a no-op for CSS).
  selectors = selectors.reject { |a| !a.is_a?(String) }
  raise ArgumentError, "Document#css requires at least one selector" if selectors.empty?
  return css_single(selectors.first) if selectors.size == 1

  seen = {}
  union = []
  string_result = nil
  selectors.each do |sel|
    result = css_single(sel)
    if result.is_a?(Array)
      string_result = true
      result.each { |s| union << s }
    else
      # NodeSet — pull backing items and dedupe.
      string_result = false if string_result.nil?
      result.each do |node|
        bk = node.respond_to?(:backing_node) ? node.backing_node : node
        key = bk.object_id
        next if seen[key]
        seen[key] = true
        union << bk
      end
    end
  end
  string_result ? union : NodeSet.new(self, union)
end
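
The multi-selector union deduplicates by object identity rather than equality. A minimal sketch of that loop, assuming plain Ruby objects in place of backing nodes; `union_by_identity` is a hypothetical helper name.

```ruby
# Hypothetical helper: merges several result lists, keeping the first
# occurrence of each object. Identity (object_id) is the key, so two
# distinct objects that happen to be == are both kept.
def union_by_identity(result_lists)
  seen = {}
  union = []
  result_lists.each do |list|
    list.each do |node|
      key = node.object_id
      next if seen[key]

      seen[key] = true
      union << node
    end
  end
  union
end

a = Object.new
b = Object.new
c = Object.new

merged = union_by_identity([[a, b], [b, c], [a]])
puts merged.size # 3
```

The result preserves first-seen order across the selector lists, which is not necessarily document order.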

#css_single(selector) ⇒ Object



# File 'lib/scrapetor/document.rb', line 77

def css_single(selector)
  # Fast path: native backing, no mutations applied yet, plain
  # String selector, no pseudo-element. One Hash lookup + one C
  # call + two allocations. After any mutation the wrapper flips
  # into dom_mode and we route through the slow path so reads see
  # the user's edits — checking @native_wrapper.dom_mode? is one
  # ivar read, negligible vs the saving when we stay native.
  if @native_doc && !@native_wrapper.dom_mode? && selector.is_a?(String) && !selector.include?("::")
    plan = @plan_cache[selector]
    if plan
      return NodeSet.new(self, @lazy_ids.new(@native_wrapper, @native_doc, @native_doc.run_chain(plan, nil)))
    elsif !@plan_cache.key?(selector)
      plan = Scrapetor::Native.compile_selector_chain(selector)
      @plan_cache[selector] = plan || false
      if plan
        return NodeSet.new(self, @lazy_ids.new(@native_wrapper, @native_doc, @native_doc.run_chain(plan, nil)))
      end
    end
  end
  # Slow path: pseudo-element, comma, fallback, post-mutation, or
  # non-native backing.
  bk = backing
  result = bk.respond_to?(:lazy_css) ? bk.lazy_css(selector) : bk.css(selector)
  if result.is_a?(Array) && (result.first.is_a?(String) || (result.empty? && pseudo_element?(selector)))
    return result
  end
  if @lazy_ids && result.is_a?(@lazy_ids)
    return NodeSet.new(self, result)
  end
  NodeSet.new(self, result.to_a)
end
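
The fast path stores `false` for selectors that fail to compile, so a failed compile is attempted only once. The following sketch isolates that negative-caching pattern; `PlanCache` and the toy compiler block are hypothetical and stand in for the native plan cache and compiler.

```ruby
# Hypothetical cache illustrating the `plan || false` sentinel: a stored
# false means "already tried, not compilable", which is distinct from a
# missing key meaning "never tried".
class PlanCache
  attr_reader :compile_calls

  def initialize(&compiler)
    @compiler = compiler
    @plans = {}
    @compile_calls = 0
  end

  def fetch(selector)
    plan = @plans[selector]
    return plan if plan
    return nil if @plans.key?(selector) # cached false: skip recompiling

    @compile_calls += 1
    plan = @compiler.call(selector)
    @plans[selector] = plan || false
    plan
  end
end

# Toy compiler: pretends comma-separated selectors are not compilable.
cache = PlanCache.new { |sel| sel.include?(",") ? nil : [:plan, sel] }

3.times { cache.fetch("div.item") }
3.times { cache.fetch("a, b") }

puts cache.compile_calls # 2
```

A caller that receives nil from such a cache would route the selector through the slower general path, as the listing above does.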

#errors ⇒ Object

Nokogiri-compat predicates.

# File 'lib/scrapetor/document.rb', line 267

def errors
  []
end

#extract(schema = nil, &block) ⇒ Object

Single-result extract on the document scope. One C call covers field compilation, plan lookup, and result assembly.



# File 'lib/scrapetor/document.rb', line 143

def extract(map)
  bk = backing
  if defined?(Scrapetor::Native::DocumentWrapper) &&
     bk.is_a?(Scrapetor::Native::DocumentWrapper) && !bk.dom_mode?
    r = bk.native.extract_one_h(nil, map, bk)
    return r unless r.equal?(true)
  end
  out = {}
  map.each_pair { |k, sel| out[k] = at_css(sel) }
  out
end

#extract_css(map) ⇒ Object

Hash form: `{ name => selector, … }` -> `{ name => result, … }`. The classic scrape pattern in two lines. Same one-boundary cost as batch_css.

# File 'lib/scrapetor/document.rb', line 132

def extract_css(map)
  keys = map.keys
  selectors = map.values
  results = batch_css(selectors)
  out = {}
  keys.each_with_index { |k, i| out[k] = results[i] }
  out
end
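
The key/result pairing can be shown without a document. In this sketch `fake_batch` is a hypothetical stand-in for `batch_css`, and `extract_with` is an illustrative name; it uses `zip` instead of the index loop in the listing, which produces the same Hash.

```ruby
# `fake_batch` is a hypothetical stand-in for Document#batch_css: it returns
# one result per selector, in input order.
fake_batch = ->(selectors) { selectors.map { |s| "result for #{s}" } }

# Pair each key of the map with the result at the same position.
def extract_with(map, batch)
  results = batch.call(map.values)
  map.keys.zip(results).to_h
end

out = extract_with({ title: "h1.title", price: ".price::text" }, fake_batch)
puts out.keys.inspect # [:title, :price]
```

This relies on Ruby Hashes preserving insertion order, so `keys` and `values` line up position by position.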

#extract_each(outer_selector, fields) ⇒ Object

Iterates over matches of `outer_selector` across the whole document and builds a Hash per match using `fields` (a name => selector map). Returns Array<Hash>. The inner selectors run scoped to each match, so a `result.at_css(field)`-style parser becomes:

doc.extract_each(".result", {
  title: ".title::text",
  price: ".price::text",
  href:  "a::attr(href)",
})

When the document is native-backed and every selector compiles cleanly, the whole iteration runs in a single C call: one outer plan + N inner plans times M matches, with zero Ruby↔C round-trips on the hot path. It falls back to the per-row Ruby loop only when a selector compiles to nil (rare; the engine covers nearly every CSS Selectors L4 shape natively after the audit-driven coverage work).

# File 'lib/scrapetor/document.rb', line 173

def extract_each(outer_selector, fields)
  bk = backing
  if defined?(Scrapetor::Native::DocumentWrapper) &&
     bk.is_a?(Scrapetor::Native::DocumentWrapper) && !bk.dom_mode?
    outer_str = outer_selector.is_a?(String) ? outer_selector : outer_selector.to_s
    r = bk.native.extract_each_h(outer_str, nil, fields, bk)
    return r unless r.equal?(true)
  end
  css(outer_selector).map { |node| node.extract(fields) }
end
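
The per-row fallback is a plain map over the outer matches. A minimal sketch of its shape, assuming a stand-in `Row` class: `Row` and its lookup table are hypothetical and replace real nodes and scoped selector evaluation.

```ruby
# Row is a hypothetical stand-in for a matched node. Its table plays the
# role of whatever the scoped inner selectors would return for that match.
class Row
  def initialize(data)
    @data = data
  end

  # Build one Hash per row: field name => value for that field's selector.
  def extract(fields)
    fields.each_with_object({}) do |(name, selector), out|
      out[name] = @data[selector]
    end
  end
end

rows = [
  Row.new({ ".title::text" => "Widget", ".price::text" => "$5" }),
  Row.new({ ".title::text" => "Gadget", ".price::text" => "$7" })
]
fields = { title: ".title::text", price: ".price::text" }

records = rows.map { |row| row.extract(fields) }
puts records.length # 2
```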

#head ⇒ Object

# File 'lib/scrapetor/document.rb', line 251

def head
  n = backing.at_css("head")
  n && Node.new(self, n)
end

#html ⇒ Object

# File 'lib/scrapetor/document.rb', line 256

def html
  n = backing.at_css("html") || backing
  Node.new(self, n)
end

#html? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/scrapetor/document.rb', line 271

def html?
  true
end

#html_str ⇒ Object

# File 'lib/scrapetor/document.rb', line 36

def html_str
  @html_str
end

#id_index ⇒ Object

# File 'lib/scrapetor/document.rb', line 409

def id_index
  build_indexes! unless @indexes_built
  @id_index
end

#json_ld ⇒ Object

Structured-data extractors for SEO/RAG/structured-content pipelines.

# File 'lib/scrapetor/document.rb', line 281

def json_ld
  Scrapetor::StructuredData.json_ld(self)
end

#microdata ⇒ Object

# File 'lib/scrapetor/document.rb', line 297

def microdata
  Scrapetor::Microdata.extract(self)
end

#opengraph ⇒ Object

# File 'lib/scrapetor/document.rb', line 285

def opengraph
  Scrapetor::StructuredData.opengraph(self)
end

#page_type ⇒ Object

# File 'lib/scrapetor/document.rb', line 305

def page_type
  Scrapetor::PageType.detect(self)
end

#rdfa ⇒ Object

# File 'lib/scrapetor/document.rb', line 301

def rdfa
  Scrapetor::RDFa.extract(self)
end

#root ⇒ Object

# File 'lib/scrapetor/document.rb', line 230

def root
  el = backing.at_css("html") || backing
  Node.new(self, el)
end

#run_selector(selector, scope) ⇒ Object



# File 'lib/scrapetor/document.rb', line 424

def run_selector(selector, scope)
  plan = @selector_cache[selector] ||= Selector.compile(selector)
  Selector.execute(self, plan, scope)
end

#schema_org(type: nil) ⇒ Object



# File 'lib/scrapetor/document.rb', line 293

def schema_org(type: nil)
  Scrapetor::StructuredData.schema_org(self, type: type)
end

#selector_cache_size ⇒ Object

# File 'lib/scrapetor/document.rb', line 433

def selector_cache_size
  @selector_cache.size
end

#stats ⇒ Object

# File 'lib/scrapetor/document.rb', line 371

def stats
  {
    classes: @class_index ? @class_index.size : 0,
    ids: @id_index ? @id_index.size : 0,
    tags: @tag_index ? @tag_index.size : 0,
    selector_cache_size: @selector_cache.size,
    indexes_built: @indexes_built
  }
end

#tag_index ⇒ Object

# File 'lib/scrapetor/document.rb', line 414

def tag_index
  build_indexes! unless @indexes_built
  @tag_index
end

#text ⇒ Object Also known as: content, inner_text

# File 'lib/scrapetor/document.rb', line 235

def text
  backing.text
end

#title ⇒ Object

# File 'lib/scrapetor/document.rb', line 241

def title
  n = backing.at_css("title")
  n && n.text
end

#to_html ⇒ Object Also known as: to_s

# File 'lib/scrapetor/document.rb', line 261

def to_html
  backing.to_html
end

#traverse(&block) ⇒ Object



# File 'lib/scrapetor/document.rb', line 216

def traverse(&block)
  return enum_for(:traverse) unless block_given?
  backing.traverse { |n| yield(n.respond_to?(:element?) ? Node.new(self, n) : n) } if backing.respond_to?(:traverse)
  self
end

#twitter_card ⇒ Object

# File 'lib/scrapetor/document.rb', line 289

def twitter_card
  Scrapetor::StructuredData.twitter_card(self)
end

#xml? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/scrapetor/document.rb', line 275

def xml?
  false
end

#xpath(expr) ⇒ Object

Evaluates an XPath expression against this document. Implements the common XPath 1.0 subset via Scrapetor::XPath (descendant / child / parent axes, tag / @attr / text() node tests, position + attr-presence + attr-equality + contains() + starts-with() + text() predicates). Returns an Array of Scrapetor::Node when the expression ends at element nodes, or an Array of String for `/@attr` and `/text()` terminations. See lib/scrapetor/xpath.rb for the full supported grammar.

# File 'lib/scrapetor/document.rb', line 207

def xpath(expr)
  Scrapetor::XPath.evaluate(self, expr)
end