Module: Scrapetor::XPath
- Defined in: lib/scrapetor/xpath.rb
Overview
Full XPath 1.0 expression engine.
Pipeline:
1. Tokenizer -> array of [:type, value] tokens
2. Parser -> AST (recursive-descent, full XPath 1.0 grammar)
3. Evaluator -> walks the AST against a Scrapetor::Document/Node
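To make the first pipeline stage concrete, here is a minimal sketch of a tokenizer that emits `[:type, value]` pairs. It is not Scrapetor's `Tokenizer`: the module name `TinyXPathTokenizer`, the token type names, and the small XPath subset it accepts are all assumptions made for illustration.

```ruby
require "strscan"

# Hypothetical mini-tokenizer, NOT Scrapetor::XPath::Tokenizer.
# It covers a small XPath subset to show the [:type, value] token shape.
module TinyXPathTokenizer
  # Order matters: "//" must be tried before "/", "!=" before "=".
  RULES = [
    [:dslash,  %r{//}],
    [:slash,   %r{/}],
    [:at,      /@/],
    [:lbrack,  /\[/],
    [:rbrack,  /\]/],
    [:op,      /!=|<=|>=|=|<|>/],
    [:number,  /\d+(?:\.\d+)?/],
    [:literal, /"[^"]*"|'[^']*'/],
    [:name,    /[A-Za-z_][\w.\-]*/],
    [:star,    /\*/]
  ].freeze

  def self.tokenize(expr)
    scanner = StringScanner.new(expr)
    tokens = []
    until scanner.eos?
      next if scanner.skip(/\s+/) # ignore whitespace between tokens

      # Try each rule in order; the first pattern matching at the cursor wins.
      type, = RULES.find { |_, pattern| scanner.scan(pattern) }
      raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless type

      value = scanner.matched
      value = value[1..-2] if type == :literal # strip the surrounding quotes
      tokens << [type, value]
    end
    tokens
  end
end

tokens = TinyXPathTokenizer.tokenize('//div[@class="item"]/a')
```

A recursive-descent parser would then consume this flat array one token at a time, which is why the token type is kept as a symbol that is cheap to compare.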
Axis traversals dispatch to native C primitives on the arena DOM (`node_following_sibling_ids`, `node_ancestor_ids`, `node_following_ids`, `node_preceding_ids`, `node_descendant_comment_ids`, …) so the hot path stays in C even though the AST walk runs in Ruby.
Compiled ASTs are cached on the module, bounded at AST_CACHE_CAP entries with the oldest-inserted entry evicted first, so repeated queries (typical in scraping pipelines that run the same parser against thousands of pages) only pay the tokenize/parse cost once.
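The caching behaviour can be sketched in isolation. The class below is a hypothetical stand-in (the name `BoundedCache` and its methods are not part of Scrapetor) that mirrors the pattern in `cache_compile`: an unlocked read, a re-check under a mutex, and `Hash#shift` to drop the oldest-inserted entry once the cap is reached.

```ruby
# Hypothetical stand-in for the module-level AST cache; names are illustrative.
class BoundedCache
  def initialize(cap)
    @cap = cap
    @store = {} # Ruby hashes keep insertion order
    @mutex = Mutex.new
  end

  # Returns the cached value for key, computing it with the block on a miss.
  def fetch(key)
    cached = @store[key] # unlocked fast path for hits
    return cached if cached

    @mutex.synchronize do
      cached = @store[key] # re-check: another thread may have filled it
      return cached if cached

      value = yield(key)
      @store.shift if @store.size >= @cap # evict the oldest-inserted entry
      @store[key] = value
    end
  end

  def size
    @store.size
  end

  def key?(key)
    @store.key?(key)
  end
end

compiles = 0
cache = BoundedCache.new(2)
cache.fetch("//a") { |expr| compiles += 1; "ast(#{expr})" }
cache.fetch("//a") { |expr| compiles += 1; "ast(#{expr})" } # hit, block not run
cache.fetch("//b") { |expr| compiles += 1; "ast(#{expr})" }
cache.fetch("//c") { |expr| compiles += 1; "ast(#{expr})" } # evicts "//a"
```

Because a hit does not move the entry to the back of the hash, eviction follows insertion order. A strict LRU policy would need to delete and reinsert the key on every hit, which costs a write on the read path.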
Defined Under Namespace
Modules: CssTranslator, Tokenizer
Classes: Evaluator, ParseError, Parser, UnsupportedError
Constant Summary
- AST_CACHE_CAP = 1024
Class Method Summary
- .cache_compile(expr) ⇒ Object
- .compile(expr) ⇒ Object
- .evaluate(context, expr) ⇒ Object
- .run_via_css(context, css_descriptor) ⇒ Object
  Execute a translated CSS chain.
Class Method Details
.cache_compile(expr) ⇒ Object
    # File 'lib/scrapetor/xpath.rb', line 44
    def self.cache_compile(expr)
      cached = @ast_cache[expr]
      return cached if cached
      @ast_cache_mutex.synchronize do
        cached = @ast_cache[expr]
        return cached if cached
        ast = Parser.new(Tokenizer.tokenize(expr), expr).parse_expr
        css = CssTranslator.translate(ast)
        entry = { ast: ast, css: css }
        @ast_cache.shift if @ast_cache.size >= AST_CACHE_CAP
        @ast_cache[expr] = entry
        entry
      end
    end
.compile(expr) ⇒ Object
    # File 'lib/scrapetor/xpath.rb', line 40
    def self.compile(expr)
      cache_compile(expr.to_s)[:ast]
    end
.evaluate(context, expr) ⇒ Object
    # File 'lib/scrapetor/xpath.rb', line 27
    def self.evaluate(context, expr)
      expr_s = expr.to_s
      # Memo the AST + CSS-translation result together so the per-call
      # overhead on the hot path collapses to one Hash lookup. The first
      # call for a new expression pays parse + translate; every later
      # call gets the cached descriptor or `false` (= no CSS fast path).
      entry = @ast_cache[expr_s] || cache_compile(expr_s)
      if (css = entry[:css])
        return run_via_css(context, css)
      end
      Evaluator.new(context).eval_program(entry[:ast])
    end
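The fast-path decision in `evaluate` can be illustrated with stubs. In this sketch `FakeContext`, `dispatch`, and the `slow_path` lambda are hypothetical stand-ins for a Scrapetor context, the method body, and the `Evaluator`; only the `{ ast:, css: }` entry shape and the `:sel` / `:kind` descriptor keys are taken from the listings on this page.

```ruby
# Hypothetical stand-in for a document/node that supports #css.
FakeContext = Struct.new(:log) do
  def css(sel)
    log << [:css, sel] # record that the CSS engine was used
    ["<node for #{sel}>"]
  end
end

# Mirrors the branch in evaluate: use the CSS descriptor when the cache
# entry carries one, otherwise fall back to walking the AST.
def dispatch(context, entry, slow_path)
  if (css = entry[:css])
    return context.css(css[:sel]).to_a
  end
  slow_path.call(context, entry[:ast])
end

log = []
ctx = FakeContext.new(log)
slow_path = ->(_context, ast) { [:evaluated, ast] }

# Entry with a CSS translation: the slow path is never called.
fast = dispatch(ctx, { ast: :ast1, css: { sel: "div a", kind: :nodes } }, slow_path)

# Entry whose translation is `false`: falls through to the AST walk.
slow = dispatch(ctx, { ast: :ast2, css: false }, slow_path)
```

Storing `false` rather than leaving the key unset means a cached entry already records that translation was attempted and failed, so later calls do not retry it.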
.run_via_css(context, css_descriptor) ⇒ Object
Execute a translated CSS chain. Handles the ::attr / ::text tail forms that CssTranslator emits for `/@x` and `/text()` terminations. Returns an Array (XPath shape) regardless of the underlying CSS return type.
    # File 'lib/scrapetor/xpath.rb', line 63
    def self.run_via_css(context, css_descriptor)
      sel = css_descriptor[:sel]
      kind = css_descriptor[:kind] # :nodes / :attr / :text
      result = context.css(sel)
      arr = result.respond_to?(:to_a) ? result.to_a : Array(result)
      arr
    end
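The Array normalization on the last lines of `run_via_css` can be tried on plain Ruby values. The helper name `to_result_array` is made up for this sketch, and the inputs are ordinary objects rather than Scrapetor node sets.

```ruby
require "set"

# Same expression as in run_via_css, extracted into a hypothetical helper.
def to_result_array(result)
  result.respond_to?(:to_a) ? result.to_a : Array(result)
end

from_nil    = to_result_array(nil)            # nil responds to #to_a
from_string = to_result_array("href-value")   # String has no #to_a, so Array() wraps it
from_array  = to_result_array(%w[a b])        # already an Array
from_set    = to_result_array(Set.new([1, 2])) # any Enumerable-like collection
```

The `respond_to?` check lets collection-like results convert themselves, while scalar results such as a single attribute string are wrapped by `Array()`, so callers always receive an Array.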