Scrapetor
A Ruby HTML parsing and structured-extraction library. Scrapetor pairs a native C arena DOM with a streaming extraction engine that compiles a schema DSL into a single forward pass over the input. On the streaming path no DOM is materialised, and each document costs one Ruby/C boundary crossing.
The same gem also exposes a full read/mutate DOM API, encoding
detection, structured-data extractors (JSON-LD, OpenGraph, Schema.org,
Microdata, RDFa, Twitter Cards), a pure-Ruby builder and SAX streamer,
a CLI, and an HTTP fetcher built on Net::HTTP. There are no external
parser dependencies.
Project page: scrapetor.org · Source: github.com/Alaa-abdulridha/scrapetor
Requirements
- Ruby 2.7 or newer
- A C99-capable compiler (clang or gcc). The native extension is built automatically when the gem is installed.
There are no other runtime dependencies.
Installation
Add to your Gemfile:
gem "scrapetor"
Or install directly:
gem install scrapetor
Quick start
require "scrapetor"
doc = Scrapetor::HTML(html, base_url: "https://example.com/")
result = doc.extract do
  field :title, from: "h1.headline", clean: true
  repeated ".product-card", as: :products do
    field :title, from: ".title", clean: true
    field :price, from: ".price", type: :money
    field :url, from: "a", attr: :href, type: :url, normalize_url: true
    field :image, from: "img", attr: :src, type: :url, normalize_url: true
  end
end
Structured-data extractors are built in:
doc.json_ld # Array of parsed <script type="application/ld+json"> blocks
doc.opengraph # Hash of og:* meta values
doc.twitter_card # Hash of twitter:* meta values
doc.schema_org(type: "Product")
doc.microdata # HTML5 itemscope/itemprop trees
doc.page_type # :product_page | :product_listing | :article | …
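To make concrete what a JSON-LD extractor collects, here is a rough pure-Ruby sketch that uses only the standard library. It is not how Scrapetor implements `json_ld`; the helper name `json_ld_blocks` and the regex-based scan are assumptions made for illustration.

```ruby
require "json"

# Hypothetical helper, not part of Scrapetor: find every
# <script type="application/ld+json"> body and parse it as JSON.
JSON_LD_SCRIPT =
  %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi

def json_ld_blocks(html)
  html.scan(JSON_LD_SCRIPT).filter_map do |(body)|
    begin
      JSON.parse(body)
    rescue JSON::ParserError
      nil # skip malformed blocks instead of failing the whole page
    end
  end
end

sample = <<~HTML
  <html><head>
  <script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>
  <script type="application/ld+json">{not valid json</script>
  </head></html>
HTML

blocks = json_ld_blocks(sample)
```

A regex scan is good enough for a demonstration, but a real extractor should locate the script elements with an HTML parser, since attribute order and quoting vary in the wild.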
A minimal HTTP fetcher (uses Net::HTTP; no extra gems):
doc = Scrapetor.fetch("https://example.com/products")
data = Scrapetor.fetch_extract("https://example.com/products", schema)
Production HTTP layer (libcurl, HTTP/2)
If libcurl is available at build time, Scrapetor ships an optional
Scrapetor::Fetcher backed by it. The whole pipeline — TLS,
HTTP/2 multiplexing, gzip/deflate/brotli/zstd decoding, charset
transcoding, retry, ETag cache, per-host throttle — runs in C, with
the GVL released across every network and CPU phase.
# Single GET with HTTP/2 + connection share + retry.
resp = Scrapetor::Fetcher.get(url,
  retry: 3, backoff: 0.3, max_backoff: 10,
  bearer_token: ENV["TOKEN"],
  cache_dir: "~/.cache/scrapetor")
# => { status: 200, headers: {...}, body: "...", final_url: "...",
#      http_version: "2" }
# POST + JSON / form / multipart.
Scrapetor::Fetcher.post(url, json: {name: "alice"})
Scrapetor::Fetcher.post(url, form: {user: "x", pass: "y"})
Scrapetor::Fetcher.post(url,
  multipart: { name: "avatar",
               file: Scrapetor::Fetcher.upload_file("/tmp/pic.png") })
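The `retry`, `backoff`, and `max_backoff` options shown above go with what this README later calls full-jitter backoff. The gem's exact formula is not documented here, so the following is only a sketch of the usual full-jitter scheme, assuming `backoff` is the base delay and `max_backoff` the cap. The names `full_jitter_delay` and `with_retries` are hypothetical.

```ruby
# Full jitter: wait a uniformly random time in
# [0, min(cap, base * 2**attempt)) before the next attempt.
def full_jitter_delay(attempt, base: 0.3, cap: 10.0, rng: Random.new)
  ceiling = [cap, base * (2**attempt)].min
  rng.rand * ceiling
end

# Hypothetical retry wrapper; the block stands in for a fetch call.
# The sleeper is injectable so the logic can be exercised without waiting.
def with_retries(max_retries: 3, base: 0.3, cap: 10.0, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    yield attempt
  rescue StandardError
    raise if attempt >= max_retries
    sleeper.call(full_jitter_delay(attempt, base: base, cap: cap))
    attempt += 1
    retry
  end
end

calls = 0
outcome = with_retries(sleeper: ->(_seconds) {}) do |attempt|
  calls += 1
  raise "transient failure" if attempt < 2
  :ok
end
```

Randomising over the whole interval, rather than adding a small jitter to a fixed delay, spreads retries from many clients so they do not hit a recovering server at the same moment.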
Bulk fetch APIs
Three concurrency models pick different tradeoffs:
# 1. pthread + easy: N workers, each blocking. Best when each
# response has meaningful CPU work after the fetch (decode + parse)
# since the GVL is released across the full batch.
docs = Scrapetor::Fetcher.parallel_fetch(urls, threads: 8)
# 2. curl_multi async: single driver thread, N concurrent in-flight.
# Best for I/O-fan-out (hundreds of URLs across many hosts).
results = Scrapetor::Fetcher.multi_get(urls, max_concurrent: 32)
# 3. streaming multi_each: yields each response in completion order
# so processing starts as soon as the first transfer lands.
Scrapetor::Fetcher.multi_each(urls) do |r|
  puts r[:final_url], r[:status] # called as each completes
end
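The idea of yielding in completion order can be illustrated in plain Ruby. This sketch uses one thread per job and a `Queue`, which is not how `multi_each` works (it drives curl_multi from a single thread); the helper `each_completed` is hypothetical, and the lambdas stand in for network transfers.

```ruby
# Hypothetical stand-in: run each job on its own thread and yield
# results in the order they finish, not the order they were submitted.
def each_completed(jobs)
  queue = Queue.new
  threads = jobs.map do |job|
    Thread.new do
      result = begin
        job.call
      rescue StandardError => e
        e # hand failures to the consumer instead of killing the thread
      end
      queue << result
    end
  end
  jobs.size.times { yield queue.pop }
ensure
  threads&.each(&:join)
end

jobs = [
  -> { sleep 0.30; :slow },
  -> { sleep 0.01; :fast },
  -> { sleep 0.15; :medium },
]

order = []
each_completed(jobs) { |result| order << result }
```

The consumer starts work as soon as the fastest job finishes, which is the property the streaming API advertises.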
Session: cookies, auth, throttle, retry
session = Scrapetor::Session.new(
  cookies: true,                 # ephemeral cookie jar (path or true)
  user_agent: "MyBot/1.0",
  rate_limit: 0.5,               # min seconds between same-host calls
  retry: 3,                      # default retry for all calls
  headers: { "Accept-Language" => "en-US" },
  proxy: ENV["HTTP_PROXY"],
)
session.post(login_url, form: { user: user, pass: pass }) # explicit pairs; `{user:, pass:}` shorthand needs Ruby 3.1+
doc = session.fetch(dashboard_url) # cookies carry forward
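The `rate_limit` option is described as a minimum number of seconds between calls to the same host. A minimal sketch of that kind of throttle is shown below, assuming a simple "next allowed time per host" table. `HostThrottle` is a hypothetical class, not part of the gem, and the clock and sleeper are injectable purely so the logic can be checked without real waiting.

```ruby
require "uri"

# Hypothetical per-host minimum-interval throttle.
class HostThrottle
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @next_allowed = Hash.new(0.0) # host => earliest permitted start time
    @mutex = Mutex.new
  end

  # Blocks until the host of +url+ may be contacted again.
  # Returns the number of seconds waited.
  def wait(url)
    host = URI(url).host
    delay = @mutex.synchronize do
      now = @clock.call
      start = [now, @next_allowed[host]].max
      @next_allowed[host] = start + @min_interval
      start - now
    end
    @sleeper.call(delay) if delay > 0
    delay
  end
end

throttle = HostThrottle.new(0.5)
throttle.wait("https://example.com/a") # first call to a host never waits
```

Reserving the slot inside the mutex and sleeping outside it keeps concurrent callers for different hosts from blocking one another.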
HTTP cache with ETag / Last-Modified
# Cold fetch: server returns 200 + ETag, response cached.
# Warm fetch: scrapetor sends If-None-Match. Server's 304 swaps in
# the cached body and marks headers["x-scrapetor-cache"] = "hit".
Scrapetor::Fetcher.get(url, cache_dir: "~/.cache/scrapetor")
# Bulk revalidation: HEAD every URL in one curl_multi sweep,
# classify each as :fresh / :changed / :missing / :error.
status = Scrapetor::Fetcher.revalidate(urls, cache_dir: "~/.cache/scrapetor")
stale = status.select { |_, v| v == :changed }.keys
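For readers unfamiliar with conditional requests, the sketch below shows the underlying HTTP mechanics with Net::HTTP from the standard library. It builds a request but does not send it. The `classify` mapping from status code to the labels above is an assumption for illustration; the gem's actual rules (and its use of HEAD for bulk revalidation) may differ. Both helper names are hypothetical.

```ruby
require "net/http"
require "uri"

# Build a conditional GET from cached validators. Nothing is sent here.
def conditional_get(url, etag: nil, last_modified: nil)
  request = Net::HTTP::Get.new(URI(url))
  request["If-None-Match"] = etag if etag
  request["If-Modified-Since"] = last_modified if last_modified
  request
end

# Assumed mapping from response status to a revalidation label.
def classify(status)
  case status
  when 304 then :fresh        # validator still matches the cached copy
  when 200..299 then :changed # server sent a new representation
  when 404, 410 then :missing
  else :error
  end
end

request = conditional_get("https://example.com/products", etag: '"abc123"')
request["If-None-Match"]
```

A 304 response carries no body, which is why the cached body is swapped in on a warm fetch.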
Crawl helpers
robots = Scrapetor::Robots.fetch_for("https://example.com")
robots.allowed?("https://example.com/private") # => false
robots.crawl_delay # => 2.0
robots.sitemaps # => [...]
Scrapetor::Sitemap.urls("https://example.com/sitemap.xml") do |url, |
  # streams large sitemaps without buffering in memory
  # recurses into <sitemapindex> automatically
  process(url, )
end
Streaming HTML parser
Bounded-memory parser for huge documents:
Scrapetor.stream(io, outer: "div.result") do |row_doc|
  # one row at a time; peak memory ~= max(chunk, longest_row)
  puts row_doc.at_css(".title").text, row_doc.at_css(".price").text
end
Accepts tag, tag.class, tag.cls1.cls2, tag#id, and combinations.
Parallel parse for offline corpora
htmls = paths.map { |p| File.read(p) }
docs = Scrapetor.parallel_parse(htmls, threads: 8)
# Real multi-core HTML parsing under one GVL release.
Limits — what Scrapetor does NOT do
Worth being explicit about what's out of scope so you can pick the right tool for the rest of the pipeline.
No JavaScript execution
Scrapetor reads HTML as the server sent it. Pages that build their content client-side via React / Vue / Angular / etc. will look empty to the parser. There's no embedded JS engine and there won't be — that's a different class of tool (headless browser).
Practical paths if you need rendered HTML:
- Pre-render upstream: many SPA hosts can pre-render for crawlers (?_escaped_fragment_=, prerender.io, Cloudflare's HTML Rewriter, Vercel/Netlify ISR). Cheapest if available.
- Headless browser layer: drive Playwright / Puppeteer / Selenium from Ruby (ferrum, playwright-ruby-client). Have it spit out rendered HTML, then hand that to Scrapetor for fast extract.
- Per-site API mining: most JS-heavy apps load data from a JSON API that Scrapetor can hit directly via Fetcher.get.
No TLS fingerprint impersonation
Scrapetor's HTTP layer is plain libcurl. Sites that fingerprint TLS handshakes (Cloudflare, Akamai, DataDome, Imperva) will identify the client as libcurl and may block / challenge accordingly. Scrapetor won't impersonate Chrome's JA3, JA4, HTTP/2 SETTINGS frame order, or header capitalisation.
If you need impersonation:
- curl-impersonate is a fork of libcurl patched to match Chrome / Firefox / Edge fingerprints exactly. You can build it locally and Scrapetor will link against it transparently; the gem's HTTP options are unchanged.
- Reach the JSON API directly with browser-mimicking headers (Accept, User-Agent, Sec-*). Many sites only fingerprint the HTML route; the API is more permissive.
- Use a residential / mobile proxy with a real browser at the other end. Scrapetor's :proxy + :proxy_auth options handle the proxy plumbing; the impersonation happens upstream.
The HTTP layer DOES support the rest of the production-scraping surface: HTTP/2 multiplexing, retries with full-jitter backoff, per-host throttle, cookie jar + auth, ETag cache, bulk revalidation, multi-handle concurrency. Treat fingerprint impersonation as the one externality you may need to bring yourself.
XPath 1.0: full expression language
Document#xpath / Node#xpath evaluate the complete XPath 1.0 grammar:
all 13 axes (child, descendant, descendant-or-self, parent,
self, following-sibling, preceding-sibling, following,
preceding, ancestor, ancestor-or-self, attribute, namespace),
all node tests (node(), text(), comment(), processing-instruction(),
named, *, qualified-with-prefix), every operator (=, !=, <,
<=, >, >=, +, -, *, div, mod, and, or, |), and
the full standard function library (not, last, position, count,
local-name, name, string, concat, starts-with, contains,
substring, substring-before, substring-after, string-length,
normalize-space, translate, boolean, true, false, lang,
number, sum, floor, ceiling, round, id).
Compiled ASTs cache per unique expression string. The compiler also
detects expressions that map cleanly onto the native CSS chain
(//div[@class='x'], //ul/li[1], //dt/following-sibling::dd,
//div[position() > 50], etc.) and dispatches them directly to the
arena's C selector matcher — same hot path the rest of the library
rides. Sibling, ancestor, and following/preceding axis walks all run
through dedicated C primitives over the DFS-range-encoded arena, so
no Ruby per-step traversal is involved.
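Caching one compiled form per unique expression string is a plain memoisation pattern. The sketch below shows the shape of such a cache in Ruby. It says nothing about how the gem stores its ASTs; `ExpressionCache` is a hypothetical class, and the string split is only a stand-in for a real compiler.

```ruby
# Hypothetical per-expression cache: compile each distinct string once
# and hand back the same frozen object on every later lookup.
class ExpressionCache
  def initialize(&compiler)
    @compiler = compiler
    @cache = {}
    @mutex = Mutex.new
  end

  def fetch(expression)
    @mutex.synchronize do
      @cache[expression] ||= @compiler.call(expression).freeze
    end
  end

  def size
    @mutex.synchronize { @cache.size }
  end
end

compiles = 0
cache = ExpressionCache.new do |expression|
  compiles += 1
  expression.split("/").reject(&:empty?) # stand-in for a real AST
end

first = cache.fetch("//ul/li[1]")
second = cache.fetch("//ul/li[1]")
```

Freezing the cached value matters because the same object is shared by every caller that uses the expression.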
HTTP/3 and WebSocket: capability-detected
Scrapetor::Fetcher.features reports whether the linked libcurl
exposes HTTP/3 and WebSocket support. Pass http_version: "3" to
opt into HTTP/3 when available; otherwise the default is HTTP/2
over TLS with HTTP/1.1 fallback. WebSocket frames go through
libcurl 7.86+'s curl_ws_send / curl_ws_recv; the gem doesn't
yet ship a friendly Ruby API for these — patches welcome.
Command-line interface
$ scrapetor extract page.html --schema schema.rb
$ scrapetor info page.html
$ scrapetor jsonld page.html
$ scrapetor opengraph page.html
$ scrapetor microdata page.html
$ scrapetor page-type page.html
Performance
Benchmarks live in benchmark/comprehensive.rb. Every workload asserts
that all three engines produce equivalent output before timing. Numbers
below were measured on Apple Silicon (Ruby 2.7.8); they're reproducible
from this repository.
Parse throughput (build the DOM tree, MB/s).
| Document | Scrapetor | Nokolexbor | Nokogiri |
|---|---|---|---|
| small (170 B) | 37 | 18 | 11 |
| article (2 KB) | 279 | 116 | 31 |
| product (3 KB) | 426 | 140 | 31 |
| listing (36 KB) | 3,933 | 154 | 31 |
| large (2.5 MB) | 18,434 | 136 | 31 |
CSS selector evaluation (one selector against a pre-parsed document, iter/sec).
| Selector | Scrapetor | Nokolexbor | Speedup |
|---|---|---|---|
| #main (single id) | 1,272,698 | 65,170 | 19.53x |
| article (tag) | 1,244,279 | 65,122 | 19.11x |
| .product-card (class) | 1,226,004 | 68,065 | 18.01x |
| #main article (id descendant) | 1,086,604 | 35,228 | 30.85x |
| img.product-image (tag.class) | 875,901 | 65,707 | 13.33x |
| .product-grid > .product-card (child) | 754,486 | 60,924 | 12.38x |
| [data-sku="SKU0001"] (attr) | 516,157 | 78,101 | 6.61x |
| .product-card .price (descendant) | 405,062 | 42,870 | 9.45x |
Pseudo-classes (:has, :not, :is, :nth-child, :first-child,
:last-child, :nth-of-type, etc.) and pseudo-elements (::text,
::attr(name)) run natively in the same C engine — see the
Selector support section below for the full list.
End-to-end extraction (parse plus run an extraction schema, iter/sec).
| Workload | Scrapetor | Nokolexbor | Nokogiri |
|---|---|---|---|
| listing (50 cards x 4 fields) | 9,360 | 573 | 171 |
| product detail (top + 3 reviews) | 30,101 | 11,636 | 2,022 |
| article (top + tags + sections) | 53,837 | 25,553 | 6,338 |
Allocations per extraction call (live Ruby objects, lower is better).
| Workload | Scrapetor | Nokolexbor | Nokogiri |
|---|---|---|---|
| listing (50 cards x 4 fields) | 363 | 4,710 | 9,501 |
| product detail (top + 3 reviews) | 96 | 140 | 596 |
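If you want to check allocation counts for your own schemas, one common approach is to diff `GC.stat(:total_allocated_objects)` around a block. This is a generic sketch and may not match how benchmark/comprehensive.rb measures its figures; the helper name `allocations` is hypothetical, and the string-building block stands in for an extraction call.

```ruby
# Count Ruby objects allocated while the block runs.
# GC is paused so a collection does not interleave with the measurement.
def allocations
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

# Stand-in workload: 100 interpolated strings plus the array holding them.
count = allocations { 100.times.map { |i| "item-#{i}" } }
puts "allocated #{count} objects"
```

The counter is cumulative and never decreases, so the difference is a count of allocations rather than of objects still alive afterwards.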
The full report - including the article workload, selector micro-benchmarks
for every supported selector form, and per-document MB/s figures - is
written to benchmark/RESULTS.md whenever you run ruby -Ilib
benchmark/comprehensive.rb.
Architecture
HTML in ─┬─► Native streaming extract (C)
         │     `doc.extract(schema)`
         │     Schema runs during tokenisation. No DOM is built.
         │     One Ruby/C boundary crossing per document.
         │
         └─► Native arena DOM (C)
               `doc.css(...)`, `doc.at(...)`, mutation, traversal.
               Class/id/tag indexes built during the parse pass.
               Zero-copy text and attribute spans into the input buffer.
The same C extension (ext/scrapetor/native/) provides both paths. A
pure-Ruby DOM is kept as a fallback for environments where the
extension can't be loaded.
Selector support
The native engine runs the following CSS forms in C:
- tag, .class, tag.class, #id, tag#id, universal *
- [attr], [attr=value] and the *=, ^=, $=, ~=, |= variants
- descendant (A B) and child (A > B) combinators
- structural pseudo-classes: :first-child, :last-child, :only-child, :first-of-type, :last-of-type, :only-of-type, :nth-child(...), :nth-last-child(...), :nth-of-type(...), :nth-last-of-type(...), :empty, :root, :scope
- boolean-attribute pseudos: :checked, :disabled, :enabled, :required, :optional, :read-only, :read-write, :any-link, :link
- logical pseudos: :not(...), :is(...), :matches(...), :where(...), :has(...) (each runs natively when its inner selector is a single atom, typically a class, id, tag, or attribute predicate)
- Scrapy/Parsel-style pseudo-elements: ::text and ::attr(name). The engine emits strings directly via a bulk C path, so a 100-item result is one boundary crossing, not 100
Sibling combinators (+, ~) and inner selectors more complex than a
single atom — for example :not(div > .x) or :has(.x .y) — transparently
fall back to a pure-Ruby implementation that mirrors Nokogiri's output.
Selector.compile never raises on a syntactically valid CSS selector;
the fallback is automatic.
API reference
See the project documentation at scrapetor.org/docs. The main entry points are:
- Scrapetor.parse(html, base_url:) - returns a Scrapetor::Document
- Scrapetor::HTML(html, base_url:) - same, Nokogiri-style alias
- Scrapetor.schema { … } - schema DSL
- Scrapetor.extract(html, schema) - parse and extract
- Scrapetor.fetch(url) - HTTP GET and parse
- Scrapetor.fetch_extract(url, schema) - HTTP GET, parse, and extract
- Scrapetor::Builder.build { |b| b.html { … } }
- Scrapetor::SAX::Parser.new(handler).parse(html)
Document and Node expose the standard read/mutate API: css, at_css,
xpath, text, content, inner_html, outer_html, [], []=,
attributes, children, parent, add_child, before, after,
replace, remove, add_class, remove_class, and so on.
Compatibility
Scrapetor is tested on Ruby 2.7, 3.0, 3.1, 3.2, and 3.3 on Linux and
macOS (see .github/workflows/ci.yml). The native extension uses only
the stable public Ruby C API.
Development
git clone https://github.com/Alaa-abdulridha/scrapetor
cd scrapetor
bundle install
rake compile
rake test
To run the benchmarks:
ruby -Ilib benchmark/comprehensive.rb # full report → benchmark/RESULTS.md
ruby -Ilib benchmark/parse_extract.rb # listing workload only
ruby -Ilib benchmark/product_detail.rb # product detail workload only
Contributing
Issues, bug reports, and pull requests are welcome on GitHub at
https://github.com/Alaa-abdulridha/scrapetor. Please read
CONTRIBUTING.md before submitting a pull request.
To report a security vulnerability, follow the process in
SECURITY.md.
License
Scrapetor is released under the MIT License. See LICENSE.