Class: Scrapetor::Document

Inherits: Object

Defined in: lib/scrapetor/document.rb

Instance Attribute Summary

#base_url ⇒ Object readonly

Returns the value of attribute base_url.
#encoding ⇒ Object readonly

Returns the value of attribute encoding.

Instance Method Summary

#all_elements ⇒ Object
#at(selector, *_extra) ⇒ Object (also: #at_css)

Accepts the Nokogiri-compatible signature `doc.at(sel, ns_or_handler)`.
#at_xpath(expr) ⇒ Object
#backing ⇒ Object
#batch_css(selectors) ⇒ Object

Run an array of CSS selectors in ONE Ruby/C boundary crossing.
#body ⇒ Object
#cache_selector(selector) ⇒ Object
#class_index ⇒ Object

Phase-2 hooks: structural indexes.
#css(*selectors) ⇒ Object (also: #search)

CSS query entry point.
#css_single(selector) ⇒ Object
#errors ⇒ Object

Nokogiri-compat predicates.
#extract(schema = nil, &block) ⇒ Object

Single-result extract on the document scope.
#extract_css(map) ⇒ Object

Hash form: `{ name => selector, … }` -> `{ name => result, … }`.
#extract_each(outer_selector, fields) ⇒ Object

Iterate matches of `outer_selector` across the whole document and build a Hash per match using `fields` (a name => selector map).
#head ⇒ Object
#html ⇒ Object
#html? ⇒ Boolean
#html_str ⇒ Object
#id_index ⇒ Object
#initialize(html, base_url: nil, build_indexes: false, encoding: :auto, native: nil) ⇒ Document constructor

A new instance of Document.
#json_ld ⇒ Object

Structured-data extractors — for SEO/RAG/structured-content pipelines.
#microdata ⇒ Object
#opengraph ⇒ Object
#page_type ⇒ Object
#rdfa ⇒ Object
#root ⇒ Object
#run_selector(selector, scope) ⇒ Object
#schema_org(type: nil) ⇒ Object
#selector_cache_size ⇒ Object
#stats ⇒ Object
#tag_index ⇒ Object
#text ⇒ Object (also: #content, #inner_text)
#title ⇒ Object
#to_html ⇒ Object (also: #to_s)
#traverse(&block) ⇒ Object
#twitter_card ⇒ Object
#xml? ⇒ Boolean
#xpath(expr) ⇒ Object

Evaluate an XPath expression against this document.

Constructor Details

#initialize(html, base_url: nil, build_indexes: false, encoding: :auto, native: nil) ⇒ `Document`

Returns a new instance of Document.

# File 'lib/scrapetor/document.rb', line 7

def initialize(html, base_url: nil, build_indexes: false, encoding: :auto, native: nil)
  @base_url = base_url
  raw = html.to_s
  if encoding == :auto
    @encoding = Scrapetor::Encoding.detect(raw)
    @html_str = Scrapetor::Encoding.to_utf8(raw)
  else
    @encoding = encoding.to_s
    @html_str = raw.dup.force_encoding(@encoding).encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
  end
  @backing = nil # parsed lazily; native extract bypasses this entirely
  @selector_cache = {}
  @indexes_built = false
  @class_index = nil
  @id_index = nil
  @tag_index = nil
  # Hot-path slots (populated by backing()): keeping these
  # initialised silences "instance variable not initialized" and
  # makes the fast-path test a simple nil check.
  @native_doc     = nil
  @native_wrapper = nil
  @plan_cache     = nil
  @lazy_ids       = nil
  # If a pre-parsed native handle was passed in (persistent-cache
  # hit), wrap it directly and skip the lazy-parse path.
  @prebuilt_native = native
  build_indexes! if build_indexes
end
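
The explicit-encoding branch of the constructor can be tried in isolation with core Ruby only. The sketch below is a minimal illustration, assuming a hypothetical helper name `normalize_to_utf8` that is not part of Scrapetor. It does not cover the `:auto` branch, which goes through `Scrapetor::Encoding`.

```ruby
# Hypothetical helper (not part of Scrapetor): mirrors the explicit-encoding
# branch of Document#initialize using only core String methods.
def normalize_to_utf8(html, encoding)
  bytes = html.to_s.dup                  # leave the caller's string untouched
  bytes.force_encoding(encoding.to_s)    # relabel the bytes, no conversion yet
  # Transcode; bytes that cannot be converted are dropped (replace: "").
  bytes.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
end

latin1_bytes = "caf\xE9".b               # "café" as raw ISO-8859-1 bytes
utf8 = normalize_to_utf8(latin1_bytes, :"ISO-8859-1")

puts utf8.encoding   # UTF-8
puts utf8            # café
```

Because the replacement string is empty, unconvertible bytes disappear silently instead of turning into a visible marker character.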

Instance Attribute Details

#base_url ⇒ `Object` (readonly)

Returns the value of attribute base_url.



# File 'lib/scrapetor/document.rb', line 5

def base_url
  @base_url
end

#encoding ⇒ `Object` (readonly)

Returns the value of attribute encoding.



# File 'lib/scrapetor/document.rb', line 5

def encoding
  @encoding
end

Instance Method Details

#all_elements ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 419

def all_elements
  build_indexes! unless @indexes_built
  @all_elements
end

#at(selector, *_extra) ⇒ `Object` Also known as: at_css

Accepts the Nokogiri-compatible signature `doc.at(sel, ns_or_handler)`. The extra arguments (namespace prefix, handler) only matter for XPath; CSS selectors ignore them, so the method accepts varargs and discards everything past the selector. Without this, callers that pass `doc.at(sel, namespaces_hash)` (or similar Bing-style patterns) hit `ArgumentError: wrong number of arguments`.

# File 'lib/scrapetor/document.rb', line 190

def at(selector, *_extra)
  result = backing.at_css(selector)
  return nil if result.nil?
  return result if result.is_a?(String)
  Node.new(self, result)
end
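
The varargs signature can be illustrated without the library. The following is a minimal sketch, assuming a hypothetical stand-in class `TinyDoc` backed by a plain Hash; it only shows how a `*_extra` parameter absorbs the additional arguments.

```ruby
# Stand-in class for illustration only; TinyDoc and its Hash "table" are
# hypothetical and unrelated to Scrapetor's internals.
class TinyDoc
  def initialize(table)
    @table = table
  end

  # Extra positional arguments are accepted and ignored.
  def at(selector, *_extra)
    @table[selector]
  end
  alias at_css at
end

doc = TinyDoc.new("h1" => "Hello")
namespaces = { "xmlns" => "http://www.w3.org/1999/xhtml" }

puts doc.at("h1")                  # Hello
puts doc.at("h1", namespaces)      # Hello
p    doc.at_css("h2", namespaces)  # nil
```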

#at_xpath(expr) ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 211

def at_xpath(expr)
  result = xpath(expr)
  result.is_a?(Array) ? result.first : result
end

#backing ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 381

def backing
  return @backing if @backing
  @backing =
    if defined?(Scrapetor::Native::DocumentWrapper) && Scrapetor::Native::AVAILABLE_DOM
      native = @prebuilt_native || Scrapetor::Native::Document.parse(@html_str)
      @prebuilt_native = nil
      Scrapetor::Native::DocumentWrapper.new(native)
    else
      Dom::Parser.parse(@html_str)
    end
  # Cache the hot-path slots so Document#css can skip the indirection.
  if defined?(Scrapetor::Native::DocumentWrapper) &&
     @backing.is_a?(Scrapetor::Native::DocumentWrapper)
    @native_doc     = @backing.native
    @native_wrapper = @backing
    @plan_cache     = @backing.instance_variable_get(:@compile_cache)
    @lazy_ids       = Scrapetor::Native::DocumentWrapper::LazyIds
  end
  @backing
end

#batch_css(selectors) ⇒ `Object`

Run an array of CSS selectors in ONE Ruby/C boundary crossing. On selector-heavy workloads (SERP-style pages with ~30 selectors per scrape) this amortises the per-query Ruby overhead across all of them: N selectors cost roughly one selector's worth of Ruby dispatch, not N. Returns an Array of NodeSets (or Arrays of Strings, for `::text` / `::attr(name)` selectors) parallel to the input.

title_ns, price_strs, hrefs = doc.batch_css(
  ["h1.title", ".price::text", "a::attr(href)"]
)

# File 'lib/scrapetor/document.rb', line 120

def batch_css(selectors)
  bk = backing
  unless bk.respond_to?(:batch_css)
    # Pure-Ruby Dom fallback — no native engine. Loop manually.
    return selectors.map { |s| css(s) }
  end
  bk.batch_css(self, selectors)
end

#body ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 246

def body
  n = backing.at_css("body")
  n && Node.new(self, n)
end

#cache_selector(selector) ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 429

def cache_selector(selector)
  @selector_cache[selector] ||= Selector.compile(selector)
end

#class_index ⇒ `Object`

Phase-2 hooks: structural indexes. Built on demand. The native backend will replace these with arena-resident indexes.

# File 'lib/scrapetor/document.rb', line 404

def class_index
  build_indexes! unless @indexes_built
  @class_index
end
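
The build-on-demand guard used by `class_index`, `id_index`, `tag_index`, and `all_elements` can be sketched as follows. This page does not show how `build_indexes!` is implemented, so the index layout, the `Element` struct, and the `LazyIndexes` class below are all assumptions made for illustration.

```ruby
# Illustration only: LazyIndexes, Element, and the index layout are
# assumptions; the real build_indexes! is not shown on this page.
Element = Struct.new(:tag, :id, :classes)

class LazyIndexes
  attr_reader :build_count

  def initialize(elements)
    @elements = elements
    @indexes_built = false
    @build_count = 0
  end

  def class_index
    build_indexes! unless @indexes_built
    @class_index
  end

  def id_index
    build_indexes! unless @indexes_built
    @id_index
  end

  private

  # One pass over the elements fills every index at once.
  def build_indexes!
    @build_count += 1
    @class_index = {}
    @id_index = {}
    @elements.each do |el|
      el.classes.each { |c| (@class_index[c] ||= []) << el }
      @id_index[el.id] = el if el.id
    end
    @indexes_built = true
  end
end

elements = [
  Element.new("div", "main", ["card", "wide"]),
  Element.new("p", nil, ["card"]),
]
idx = LazyIndexes.new(elements)
puts idx.build_count               # 0
puts idx.class_index["card"].size  # 2
puts idx.id_index["main"].tag      # div
puts idx.build_count               # 1
```

The first accessor call pays for the build; later calls to any accessor reuse the same pass.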

#css(*selectors) ⇒ `Object` Also known as: search

CSS query entry point. Inlined hot path for the >95% case: a selector with no `::` pseudo-element and a cache-hit native plan. That bypasses backing.lazy_css, peel_pseudo_element, and the method dispatch chain, dropping the per-call Ruby overhead to a single Hash#[] + Struct.new + NodeSet.new.

Raises:

(ArgumentError) if no String selector is given

# File 'lib/scrapetor/document.rb', line 45

def css(*selectors)
  # Nokogiri-compat: `doc.css(sel1, sel2, ...)` accepts multiple
  # selectors and returns the union of matches across all of them.
  # Drop trailing non-string arguments (Nokogiri also accepts an
  # XPath namespaces hash here — that's a no-op for CSS).
  selectors = selectors.reject { |a| !a.is_a?(String) }
  raise ArgumentError, "Document#css requires at least one selector" if selectors.empty?
  return css_single(selectors.first) if selectors.size == 1

  seen = {}
  union = []
  string_result = nil
  selectors.each do |sel|
    result = css_single(sel)
    if result.is_a?(Array)
      string_result = true
      result.each { |s| union << s }
    else
      # NodeSet — pull backing items and dedupe.
      string_result = false if string_result.nil?
      result.each do |node|
        bk = node.respond_to?(:backing_node) ? node.backing_node : node
        key = bk.object_id
        next if seen[key]
        seen[key] = true
        union << bk
      end
    end
  end
  string_result ? union : NodeSet.new(self, union)
end
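
The multi-selector branch dedupes node results by object identity while appending string results as-is. A minimal sketch of the identity-based union, assuming a hypothetical helper `union_by_identity` and a stand-in `Item` struct in place of real backing nodes:

```ruby
# Item and union_by_identity are hypothetical names used for illustration.
Item = Struct.new(:tag)

# Merge several result lists, keeping the first occurrence of each object.
def union_by_identity(result_sets)
  seen = {}
  union = []
  result_sets.each do |set|
    set.each do |node|
      key = node.object_id
      next if seen[key]
      seen[key] = true
      union << node
    end
  end
  union
end

h1 = Item.new("h1")
p1 = Item.new("p")
p2 = Item.new("p")   # equal by value to p1, but a different object

merged = union_by_identity([[h1, p1], [p1, p2]])
puts merged.size     # 3
puts p1 == p2        # true
```

Keying on `object_id` rather than `==` keeps two distinct elements that happen to look alike, which `Array#uniq` would collapse for value-equal objects.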

#css_single(selector) ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 77

def css_single(selector)
  # Fast path: native backing, no mutations applied yet, plain
  # String selector, no pseudo-element. One Hash lookup + one C
  # call + two allocations. After any mutation the wrapper flips
  # into dom_mode and we route through the slow path so reads see
  # the user's edits — checking @native_wrapper.dom_mode? is one
  # ivar read, negligible vs the saving when we stay native.
  if @native_doc && !@native_wrapper.dom_mode? && selector.is_a?(String) && !selector.include?("::")
    plan = @plan_cache[selector]
    if plan
      return NodeSet.new(self, @lazy_ids.new(@native_wrapper, @native_doc, @native_doc.run_chain(plan, nil)))
    elsif !@plan_cache.key?(selector)
      plan = Scrapetor::Native.compile_selector_chain(selector)
      @plan_cache[selector] = plan || false
      if plan
        return NodeSet.new(self, @lazy_ids.new(@native_wrapper, @native_doc, @native_doc.run_chain(plan, nil)))
      end
    end
  end
  # Slow path: pseudo-element, comma, fallback, post-mutation, or
  # non-native backing.
  bk = backing
  result = bk.respond_to?(:lazy_css) ? bk.lazy_css(selector) : bk.css(selector)
  if result.is_a?(Array) && (result.first.is_a?(String) || (result.empty? && pseudo_element?(selector)))
    return result
  end
  if @lazy_ids && result.is_a?(@lazy_ids)
    return NodeSet.new(self, result)
  end
  NodeSet.new(self, result.to_a)
end
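
The fast path stores `false` for selectors that fail to compile, so a failed compilation is not retried on every call. The sketch below isolates that caching pattern; `PlanCache` and the toy compiler block are hypothetical and do not reflect the native engine.

```ruby
# PlanCache is a hypothetical stand-in that isolates the caching pattern.
class PlanCache
  attr_reader :compile_calls

  def initialize(&compiler)
    @compiler = compiler
    @plans = {}
    @compile_calls = 0
  end

  # Returns a plan, or nil when the selector is known not to compile.
  def fetch(selector)
    plan = @plans[selector]
    return plan if plan
    return nil if @plans.key?(selector)   # cached failure (stored as false)

    @compile_calls += 1
    plan = @compiler.call(selector)
    @plans[selector] = plan || false
    plan
  end
end

# Toy compiler: pretends selectors containing ":has(" are unsupported.
cache = PlanCache.new { |sel| sel.include?(":has(") ? nil : [:plan, sel] }

cache.fetch("div.card")
cache.fetch("div.card")
cache.fetch("li:has(a)")
cache.fetch("li:has(a)")
puts cache.compile_calls   # 2
```

Distinguishing "never seen" from "seen and failed" needs `key?`, because both cases read back as falsy through `Hash#[]`.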

#errors ⇒ `Object`

Nokogiri-compat predicates.



# File 'lib/scrapetor/document.rb', line 267

def errors
  []
end

#extract(schema = nil, &block) ⇒ `Object`

Single-result extract on the document scope. One C call covers field compilation, plan lookup, and result assembly.

# File 'lib/scrapetor/document.rb', line 143

def extract(map)
  bk = backing
  if defined?(Scrapetor::Native::DocumentWrapper) &&
     bk.is_a?(Scrapetor::Native::DocumentWrapper) && !bk.dom_mode?
    r = bk.native.extract_one_h(nil, map, bk)
    return r unless r.equal?(true)
  end
  out = {}
  map.each_pair { |k, sel| out[k] = at_css(sel) }
  out
end

#extract_css(map) ⇒ `Object`

Hash form: `{ name => selector, … }` -> `{ name => result, … }`. The classic scrape pattern in two lines. Same one-boundary cost as batch_css.

# File 'lib/scrapetor/document.rb', line 132

def extract_css(map)
  keys = map.keys
  selectors = map.values
  results = batch_css(selectors)
  out = {}
  keys.each_with_index { |k, i| out[k] = results[i] }
  out
end
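
The keys-and-values bookkeeping in the listing amounts to pairing names with positional results from one batched call. A minimal sketch, assuming a hypothetical `run_batch` lambda standing in for `Document#batch_css`:

```ruby
# run_batch is a hypothetical stand-in for Document#batch_css; it counts how
# often it is invoked and returns one fake result per selector.
batch_calls = 0
run_batch = lambda do |selectors|
  batch_calls += 1
  selectors.map { |s| "result for #{s}" }
end

# extract_with is a hypothetical helper name.
def extract_with(map, batch)
  results = batch.call(map.values)   # one call covers every selector
  map.keys.zip(results).to_h         # pair names with results by position
end

out = extract_with({ title: "h1.title", price: ".price::text" }, run_batch)
puts out[:title]    # result for h1.title
puts batch_calls    # 1
```

This relies on Ruby Hashes preserving insertion order, so `keys` and `values` line up index by index.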

#extract_each(outer_selector, fields) ⇒ `Object`

Iterate matches of `outer_selector` across the whole document and build a Hash per match using `fields` (a name => selector map). Returns Array<Hash>. The inner selectors run scoped to each match, so a `result.at_css(field)`-style parser becomes:

doc.extract_each(".result", {
  title: ".title::text",
  price: ".price::text",
  href:  "a::attr(href)",
})

When the document is native-backed and every selector compiles cleanly, the whole iteration runs in a single C call: one outer plan plus N inner plans across M matches, with zero Ruby↔C round-trips on the hot path. It falls back to the per-row Ruby loop only when a selector compiles to nil (rare; the engine covers nearly every CSS Selectors Level 4 shape natively after the audit-driven coverage work).

# File 'lib/scrapetor/document.rb', line 173

def extract_each(outer_selector, fields)
  bk = backing
  if defined?(Scrapetor::Native::DocumentWrapper) &&
     bk.is_a?(Scrapetor::Native::DocumentWrapper) && !bk.dom_mode?
    outer_str = outer_selector.is_a?(String) ? outer_selector : outer_selector.to_s
    r = bk.native.extract_each_h(outer_str, nil, fields, bk)
    return r unless r.equal?(true)
  end
  css(outer_selector).map { |node| node.extract(fields) }
end
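
Judging from the `return r unless r.equal?(true)` line, the native call appears to hand back `true` when it declines the request, and any other value is treated as the real result. The sketch below shows that sentinel-then-fallback shape under that assumption; `extract_rows` and both lambdas are hypothetical.

```ruby
# Hypothetical names throughout: extract_rows and both fast paths are
# stand-ins, and rows are plain Hashes rather than matched nodes.
def extract_rows(rows, fields, fast_path)
  r = fast_path.call(rows, fields)
  return r unless r.equal?(true)   # anything but the sentinel is a real result

  # Fallback: build one Hash per row in plain Ruby.
  rows.map do |row|
    fields.each_with_object({}) { |(name, key), out| out[name] = row[key] }
  end
end

rows   = [{ "t" => "A", "p" => "1" }, { "t" => "B", "p" => "2" }]
fields = { title: "t", price: "p" }

native_ok      = ->(_rows, _fields) { [] }     # fast path handled it: no matches
native_decline = ->(_rows, _fields) { true }   # fast path asks for the fallback

p extract_rows(rows, fields, native_ok)             # []
p extract_rows(rows, fields, native_decline).size   # 2
```

Using `true` as the sentinel keeps `nil` and empty collections available as legitimate results.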

#head ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 251

def head
  n = backing.at_css("head")
  n && Node.new(self, n)
end

#html ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 256

def html
  n = backing.at_css("html") || backing
  Node.new(self, n)
end

#html? ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/scrapetor/document.rb', line 271

def html?
  true
end

#html_str ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 36

def html_str
  @html_str
end

#id_index ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 409

def id_index
  build_indexes! unless @indexes_built
  @id_index
end

#json_ld ⇒ `Object`

Structured-data extractors — for SEO/RAG/structured-content pipelines.



# File 'lib/scrapetor/document.rb', line 281

def json_ld
  Scrapetor::StructuredData.json_ld(self)
end

#microdata ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 297

def microdata
  Scrapetor::Microdata.extract(self)
end

#opengraph ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 285

def opengraph
  Scrapetor::StructuredData.opengraph(self)
end

#page_type ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 305

def page_type
  Scrapetor::PageType.detect(self)
end

#rdfa ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 301

def rdfa
  Scrapetor::RDFa.extract(self)
end

#root ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 230

def root
  el = backing.at_css("html") || backing
  Node.new(self, el)
end

#run_selector(selector, scope) ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 424

def run_selector(selector, scope)
  plan = @selector_cache[selector] ||= Selector.compile(selector)
  Selector.execute(self, plan, scope)
end

#schema_org(type: nil) ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 293

def schema_org(type: nil)
  Scrapetor::StructuredData.schema_org(self, type: type)
end

#selector_cache_size ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 433

def selector_cache_size
  @selector_cache.size
end

#stats ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 371

def stats
  {
    classes: @class_index ? @class_index.size : 0,
    ids: @id_index ? @id_index.size : 0,
    tags: @tag_index ? @tag_index.size : 0,
    selector_cache_size: @selector_cache.size,
    indexes_built: @indexes_built
  }
end

#tag_index ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 414

def tag_index
  build_indexes! unless @indexes_built
  @tag_index
end

#text ⇒ `Object` Also known as: content, inner_text



# File 'lib/scrapetor/document.rb', line 235

def text
  backing.text
end

#title ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 241

def title
  n = backing.at_css("title")
  n && n.text
end

#to_html ⇒ `Object` Also known as: to_s



# File 'lib/scrapetor/document.rb', line 261

def to_html
  backing.to_html
end

#traverse(&block) ⇒ `Object`

# File 'lib/scrapetor/document.rb', line 216

def traverse(&block)
  return enum_for(:traverse) unless block_given?
  backing.traverse { |n| yield(n.respond_to?(:element?) ? Node.new(self, n) : n) } if backing.respond_to?(:traverse)
  self
end
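
The `enum_for` guard lets `traverse` work both with a block and as an Enumerator source. A minimal sketch on a hypothetical `TreeNode` class; the pre-order visiting order is an assumption of the sketch, since this page does not state the order the backing uses.

```ruby
# TreeNode is a hypothetical stand-in for a parsed document tree.
class TreeNode
  attr_reader :value

  def initialize(value, children = [])
    @value = value
    @children = children
  end

  # Pre-order walk; the visiting order is an assumption of this sketch.
  def traverse(&block)
    return enum_for(:traverse) unless block_given?
    yield self
    @children.each { |child| child.traverse(&block) }
    self
  end
end

tree = TreeNode.new("html", [
  TreeNode.new("head", [TreeNode.new("title")]),
  TreeNode.new("body"),
])

tree.traverse { |n| puts n.value }   # html, head, title, body (one per line)
tags = tree.traverse.map(&:value)    # blockless call returns an Enumerator
p tags                               # ["html", "head", "title", "body"]
```

Returning `self` from the block form allows chaining, while the blockless form composes with `map`, `select`, and the other Enumerable methods.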

#twitter_card ⇒ `Object`



# File 'lib/scrapetor/document.rb', line 289

def twitter_card
  Scrapetor::StructuredData.twitter_card(self)
end

#xml? ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/scrapetor/document.rb', line 275

def xml?
  false
end

#xpath(expr) ⇒ `Object`

Evaluate an XPath expression against this document. Implements the common XPath 1.0 subset via Scrapetor::XPath (descendant / child / parent axes, tag / @attr / text() node tests, position + attr-presence + attr-equality + contains() + starts-with() + text() predicates). Returns an Array of Scrapetor::Node when the expression ends at element nodes, or an Array of String for `/@attr` and `/text()` terminations. See lib/scrapetor/xpath.rb for the full supported grammar.



# File 'lib/scrapetor/document.rb', line 207

def xpath(expr)
  Scrapetor::XPath.evaluate(self, expr)
end
