Module: Scrapetor::XPath

Defined in:
lib/scrapetor/xpath.rb

Overview

Full XPath 1.0 expression engine.

Pipeline:

1. Tokenizer  -> array of [:type, value] tokens
2. Parser     -> AST (recursive-descent, full XPath 1.0 grammar)
3. Evaluator  -> walks the AST against a Scrapetor::Document/Node
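
To make the first pipeline stage concrete, here is a minimal sketch of a tokenizer that emits `[:type, value]` pairs. It is not Scrapetor's `Tokenizer`: the module name `ToyXPathTokenizer`, the token type names, and the tiny grammar subset (no functions, axes, or operators beyond `=`) are all assumptions made for illustration.

```ruby
require "strscan"

# Hypothetical toy tokenizer (NOT the library's): handles only a small
# subset of XPath so the [:type, value] token shape is easy to see.
module ToyXPathTokenizer
  # Order matters: "//" must be tried before "/".
  TOKEN_PATTERNS = [
    [:dslash,  %r{//}],
    [:slash,   %r{/}],
    [:at,      /@/],
    [:lbrack,  /\[/],
    [:rbrack,  /\]/],
    [:eq,      /=/],
    [:star,    /\*/],
    [:literal, /'[^']*'|"[^"]*"/],
    [:number,  /\d+(?:\.\d+)?/],
    [:name,    /[A-Za-z_][\w.-]*/]
  ].freeze

  def self.tokenize(expr)
    scanner = StringScanner.new(expr)
    tokens = []
    until scanner.eos?
      next if scanner.skip(/\s+/) # ignore whitespace between tokens

      # Take the first pattern that matches at the current position.
      type, _pattern = TOKEN_PATTERNS.find { |_, re| scanner.scan(re) }
      raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless type

      text = scanner.matched
      # Strip the surrounding quotes from string literals.
      tokens << [type, type == :literal ? text[1..-2] : text]
    end
    tokens
  end
end

p ToyXPathTokenizer.tokenize("//a[@href='x']")
```

A real XPath 1.0 tokenizer also has to disambiguate names from operators such as `div` and `*` by context, which this sketch deliberately leaves out.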

Axis traversals dispatch to native C primitives on the arena DOM (`node_following_sibling_ids`, `node_ancestor_ids`, `node_following_ids`, `node_preceding_ids`, `node_descendant_comment_ids`, …) so the hot path stays in C even though the AST walk runs in Ruby.

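Compiled ASTs are cached on the module, bounded at AST_CACHE_CAP entries, so repeated queries (typical in scraping pipelines that run the same parser against thousands of pages) pay the tokenize/parse cost only once. The eviction in cache_compile drops the oldest-inserted entry and lookups do not refresh recency, so in practice the bound behaves as first-in-first-out rather than strict LRU.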
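
The caching scheme can be sketched in isolation as a size-bounded memo guarded by a mutex. This is an illustrative stand-in, not the module's code: the class name `BoundedCompileCache`, its `fetch` method, and the block-based compiler are hypothetical. It relies only on Ruby's guarantee that `Hash` preserves insertion order, so `Hash#shift` removes the oldest-inserted entry.

```ruby
# Hypothetical stand-in for a module-level compile cache.
class BoundedCompileCache
  def initialize(capacity, &compiler)
    @capacity = capacity
    @compiler = compiler # called once per uncached expression
    @entries = {}        # Ruby hashes keep insertion order
    @mutex = Mutex.new
  end

  def fetch(expr)
    # Fast path: a plain Hash lookup with no locking.
    cached = @entries[expr]
    return cached if cached

    @mutex.synchronize do
      # Re-check under the lock; another thread may have compiled it.
      cached = @entries[expr]
      return cached if cached

      entry = @compiler.call(expr)
      # Evict the oldest-inserted entry once the cap is reached.
      @entries.shift if @entries.size >= @capacity
      @entries[expr] = entry
      entry
    end
  end

  def size
    @entries.size
  end

  def key?(expr)
    @entries.key?(expr)
  end
end

compiles = 0
cache = BoundedCompileCache.new(2) do |expr|
  compiles += 1
  { ast: [:toy_ast, expr] } # placeholder for a real parse result
end

cache.fetch("//a")
cache.fetch("//a") # served from the cache, no second compile
cache.fetch("//b")
cache.fetch("//c") # cap of 2 reached, "//a" is evicted
```

Because reads never reorder the hash, a frequently used expression can still be evicted once enough new expressions arrive. A strict LRU would need to delete and re-insert the key on every hit, which would put a write on the hot path.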

Defined Under Namespace

Modules: CssTranslator, Tokenizer

Classes: Evaluator, ParseError, Parser, UnsupportedError

Constant Summary

AST_CACHE_CAP = 1024

Class Method Summary

Class Method Details

.cache_compile(expr) ⇒ Object



# File 'lib/scrapetor/xpath.rb', line 44

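def self.cache_compile(expr)
  # Fast path: lock-free read for expressions that are already compiled.
  cached = @ast_cache[expr]
  return cached if cached
  @ast_cache_mutex.synchronize do
    # Re-check under the lock in case another thread compiled it meanwhile.
    cached = @ast_cache[expr]
    return cached if cached
    ast = Parser.new(Tokenizer.tokenize(expr), expr).parse_expr
    css = CssTranslator.translate(ast)
    entry = { ast: ast, css: css }
    # Hash#shift drops the oldest-inserted entry once the cap is reached.
    @ast_cache.shift if @ast_cache.size >= AST_CACHE_CAP
    @ast_cache[expr] = entry
    entry
  end
end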

.compile(expr) ⇒ Object



# File 'lib/scrapetor/xpath.rb', line 40

def self.compile(expr)
  cache_compile(expr.to_s)[:ast]
end

.evaluate(context, expr) ⇒ Object



# File 'lib/scrapetor/xpath.rb', line 27

def self.evaluate(context, expr)
  expr_s = expr.to_s
  # Memo the AST + CSS-translation result together so the per-call
  # overhead on the hot path collapses to one Hash lookup. The first
  # call for a new expression pays parse + translate; every later
  # call gets the cached descriptor or `false` (= no CSS fast path).
  entry = @ast_cache[expr_s] || cache_compile(expr_s)
  if (css = entry[:css])
    return run_via_css(context, css)
  end
  Evaluator.new(context).eval_program(entry[:ast])
end

.run_via_css(context, css_descriptor) ⇒ Object

Execute a translated CSS chain. Handles the ::attr / ::text tail forms that CssTranslator emits for `/@x` and `/text()` terminations. Returns an Array (XPath shape) regardless of the underlying CSS return type.



# File 'lib/scrapetor/xpath.rb', line 63

def self.run_via_css(context, css_descriptor)
  sel = css_descriptor[:sel]
  kind = css_descriptor[:kind] # :nodes / :attr / :text
  result = context.css(sel)
  arr = result.respond_to?(:to_a) ? result.to_a : Array(result)
  arr
end
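
The "always return an Array" normalization at the end of this method can be tried on its own. The sketch below pulls that one expression into a hypothetical helper, `to_result_array`, and uses a made-up `FakeNodeSet` class in place of whatever `context.css` returns; neither name exists in Scrapetor.

```ruby
# Hypothetical stand-in for a CSS result collection (not a Scrapetor class).
class FakeNodeSet
  def initialize(nodes)
    @nodes = nodes
  end

  def to_a
    @nodes.dup
  end
end

# Same normalization expression as in run_via_css, extracted for illustration.
def to_result_array(result)
  result.respond_to?(:to_a) ? result.to_a : Array(result)
end

p to_result_array(FakeNodeSet.new(%w[n1 n2])) # collection -> its elements
p to_result_array("href-value")               # single String -> one-element Array
p to_result_array(nil)                        # nil -> empty Array
```

Strings do not respond to `to_a`, so a single attribute or text value falls through to `Array(...)` and is wrapped, while `nil` responds to `to_a` and becomes an empty Array.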