# spidra-ruby
Official Ruby SDK for the Spidra web scraping and crawling API. Scrape pages, run browser actions, batch-process URLs, and crawl entire sites — all from Ruby, with no external dependencies.
## Installation

```sh
gem install spidra
```

Or add it to your Gemfile:

```ruby
gem "spidra"
```

Requires Ruby 2.7 or higher.
## Quick start

```ruby
require "spidra"

client = Spidra.new(ENV["SPIDRA_API_KEY"])

job = client.scrape.run(
  { urls: [{ url: "https://example.com/pricing" }],
    prompt: "Extract all pricing plans with name, price, and features",
    output: "json" }
)

puts job["content"]
```

Get your API key from app.spidra.io under **Settings → API Keys**.
## Scraping

### `scrape.run`

Submit a job and wait for it to finish. Returns the full result.

```ruby
job = client.scrape.run(
  urls: [{ url: "https://example.com" }],
  prompt: "Extract the main headline and subheading"
)

puts job["content"]
```
Pass `poll_interval:` and `timeout:` as keyword arguments to control how long it waits:

```ruby
job = client.scrape.run(
  { urls: [{ url: "https://example.com" }], prompt: "..." },
  poll_interval: 5,
  timeout: 60
)
```

On timeout, `run` returns `{ "status" => "timeout", "jobId" => "..." }` so you can keep polling with `scrape.get`.
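If you want to resume waiting after a timeout, a small polling helper can wrap `scrape.get`. A minimal sketch — only `scrape.get` and the `"status"` field come from the SDK; the helper name and its defaults are illustrative:

```ruby
# Sketch: poll scrape.get until the job leaves the running states.
# Returns the final status Hash, or nil if it never finished.
def wait_for_scrape(client, job_id, interval: 5, max_attempts: 60)
  max_attempts.times do
    status = client.scrape.get(job_id)
    return status if %w[completed failed].include?(status["status"])
    sleep interval
  end
  nil # still running after max_attempts polls
end
```

Call it with the `jobId` from the timed-out result, e.g. `wait_for_scrape(client, job["jobId"])`.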
### `scrape.submit` and `scrape.get`

Fire and forget — submit a job and check status yourself.

```ruby
response = client.scrape.submit(
  urls: [{ url: "https://example.com" }],
  prompt: "Extract the main headline"
)
job_id = response["jobId"]

# Later...
status = client.scrape.get(job_id)
puts status["content"] if status["status"] == "completed"
```
### Scrape parameters

| Parameter | Type | Description |
|---|---|---|
| `urls` | Array | Up to 3 entries. Each is `{ url: "..." }` with optional `actions:` |
| `prompt` | String | What to extract, in plain English |
| `output` | String | `"markdown"` (default) or `"json"` |
| `schema` | Hash | JSON Schema to enforce a specific output shape |
| `use_proxy` | Boolean | Route through a residential proxy |
| `proxy_country` | String | Two-letter country code, e.g. `"us"`, `"de"`, `"jp"` |
| `extract_content_only` | Boolean | Strip nav, ads, and boilerplate before extraction |
| `screenshot` | Boolean | Capture a viewport screenshot |
| `full_page_screenshot` | Boolean | Capture a full-page screenshot |
| `cookies` | String | Raw `Cookie` header for authenticated pages |
## Browser actions

Pass an `actions:` array inside a URL entry to interact with the page before extraction runs.

```ruby
job = client.scrape.run(
  urls: [
    {
      url: "https://example.com/products",
      actions: [
        { type: "click", selector: "#accept-cookies" },
        { type: "wait", duration: 1000 },
        { type: "scroll", to: "80%" }
      ]
    }
  ],
  prompt: "Extract all product names and prices"
)
```
## Batch scraping

Submit up to 50 URLs in one request. They all run in parallel.

```ruby
batch = client.batch.run(
  { urls: [
      "https://shop.example.com/product/1",
      "https://shop.example.com/product/2",
      "https://shop.example.com/product/3"
    ],
    prompt: "Extract product name, price, and stock status",
    output: "json" }
)

puts "#{batch["completedCount"]}/#{batch["totalUrls"]} completed"

batch["items"].each do |item|
  if item["status"] == "completed"
    puts item["result"].inspect
  else
    puts "Failed: #{item["url"]} — #{item["error"]}"
  end
end
```
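Because a single batch caps out at 50 URLs, a longer list has to be split across requests. A sketch of one way to do that — only `batch.submit` and the `"batchId"` field are from the SDK; the helper itself is illustrative:

```ruby
# Sketch: split a large URL list into chunks of up to 50 (the batch
# limit), submit each chunk, and collect the resulting batch IDs.
def submit_in_chunks(client, urls, prompt, chunk_size: 50)
  urls.each_slice(chunk_size).map do |chunk|
    client.batch.submit(urls: chunk, prompt: prompt)["batchId"]
  end
end
```

Each returned ID can then be polled with `batch.get` as shown above.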
### `batch.submit` and `batch.get`

```ruby
response = client.batch.submit(
  urls: ["https://example.com/1", "https://example.com/2"],
  prompt: "Extract the page title"
)
batch_id = response["batchId"]

result = client.batch.get(batch_id)
puts "#{result["completedCount"]}/#{result["totalUrls"]} done"
```
### Retry failed items

```ruby
client.batch.retry(batch_id) if result["failedCount"] > 0
```

### Cancel a batch

```ruby
client.batch.cancel(batch_id)
```

### List past batches

```ruby
page = client.batch.list(1, 20) # page, limit

page["jobs"].each do |job|
  puts "#{job["uuid"]} #{job["status"]} — #{job["completedCount"]}/#{job["totalUrls"]}"
end
```
## Crawling

```ruby
job = client.crawl.run(
  { base_url: "https://competitor.com/blog",
    crawl_instruction: "Follow blog post links only — skip tag and category pages",
    transform_instruction: "Extract post title, author, publish date, and a one-sentence summary",
    max_pages: 30,
    use_proxy: true }
)

job["result"].each do |page|
  puts "#{page["url"]}: #{page["data"].inspect}"
end
```
Crawl jobs often take a few minutes. The default timeout for `crawl.run` is 300 seconds; pass `timeout:` if you expect longer runs.
### `crawl.submit` and `crawl.get`

```ruby
response = client.crawl.submit(
  base_url: "https://example.com/docs",
  crawl_instruction: "Follow all documentation pages",
  transform_instruction: "Extract the page title and a short content summary",
  max_pages: 50
)
job_id = response["jobId"]

status = client.crawl.get(job_id)
# status["status"]: "waiting" | "active" | "running" | "completed" | "failed"
```
### Downloading raw content

```ruby
result = client.crawl.pages(job_id)

result["pages"].each do |page|
  puts page["url"]
  # page["html_url"]     — download the raw HTML (expires in 1 hour)
  # page["markdown_url"] — download the Markdown version
end
```
### Re-extracting with a new prompt

```ruby
result = client.crawl.extract(completed_job_id, "Extract product SKUs and prices as JSON")
new_job_id = result["jobId"]

extracted = client.crawl.get(new_job_id)
```

### History and stats

```ruby
history = client.crawl.history(1, 10) # page, limit
puts "#{history["total"]} total crawl jobs"

stats = client.crawl.stats
puts "#{stats["total"]} all-time"
```
## Logs

```ruby
result = client.logs.list(
  status: "failed",
  searchTerm: "amazon.com",
  dateStart: "2024-01-01",
  dateEnd: "2024-12-31",
  page: 1,
  limit: 20
)

result["logs"].each do |log|
  puts "#{log["urls"][0]["url"]} — #{log["status"]} (#{log["credits_used"]} credits)"
end

# Full detail for a single log entry
log = client.logs.get(log_uuid)
puts log["result_data"].inspect
```
## Usage statistics

```ruby
rows = client.usage.get("30d") # "7d" | "30d" | "weekly"

rows.each do |row|
  puts "#{row["date"]}: #{row["requests"]} requests, #{row["credits"]} credits"
end
```
## Error handling

```ruby
require "spidra"

begin
  job = client.scrape.run(
    urls: [{ url: "https://example.com" }],
    prompt: "Extract the headline"
  )
rescue Spidra::AuthenticationError
  puts "Invalid or missing API key"
rescue Spidra::InsufficientCreditsError
  puts "Account is out of credits"
rescue Spidra::RateLimitError
  puts "Rate limited — slow down"
rescue Spidra::ServerError => e
  puts "Server error (#{e.status}): #{e.message}"
rescue Spidra::Error => e
  puts "API error #{e.status}: #{e.message}"
end
```
| Exception | HTTP status | When |
|---|---|---|
| `Spidra::AuthenticationError` | 401 | Missing or invalid API key |
| `Spidra::InsufficientCreditsError` | 403 | No credits remaining |
| `Spidra::RateLimitError` | 429 | Too many requests |
| `Spidra::ServerError` | 5xx | Unexpected server-side error |
| `Spidra::Error` | any | Base class for all Spidra exceptions |
All exceptions expose `.status` (HTTP status code) and `.message`.
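Since `Spidra::RateLimitError` maps to a 429, it is a natural candidate for retry with backoff. A sketch — only the exception class is from the SDK; the helper and its defaults are illustrative:

```ruby
# Sketch: retry a block with exponential backoff when rate limited.
# Re-raises the error once max_attempts is exhausted.
def with_rate_limit_retries(max_attempts: 3, base_delay: 2.0)
  attempt = 0
  begin
    yield
  rescue Spidra::RateLimitError
    attempt += 1
    raise if attempt >= max_attempts
    sleep base_delay * (2**(attempt - 1))
    retry
  end
end
```

Usage: `job = with_rate_limit_retries { client.scrape.run(urls: [{ url: "https://example.com" }], prompt: "Extract the headline") }`.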
## License
MIT. See LICENSE for details.