spidra-ruby

Official Ruby SDK for the Spidra web scraping and crawling API. Scrape pages, run browser actions, batch-process URLs, and crawl entire sites — all from Ruby, with no external dependencies.

Installation

gem install spidra

Or add it to your Gemfile:

gem "spidra"

Requires Ruby 2.7 or higher.

Quick start

require "spidra"

client = Spidra.new(ENV["SPIDRA_API_KEY"])

job = client.scrape.run(
  { urls: [{ url: "https://example.com/pricing" }],
    prompt: "Extract all pricing plans with name, price, and features",
    output: "json" }
)

puts job["content"]

Get your API key from app.spidra.io under Settings → API Keys.

Scraping

scrape.run

Submit a job and wait for it to finish. Returns the full result.

job = client.scrape.run(
  urls:   [{ url: "https://example.com" }],
  prompt: "Extract the main headline and subheading"
)

puts job["content"]

Pass poll_interval: and timeout: as keyword arguments to control how long it waits:

job = client.scrape.run(
  { urls: [{ url: "https://example.com" }], prompt: "..." },
  poll_interval: 5,
  timeout: 60
)

On timeout, run returns { "status" => "timeout", "jobId" => "..." } so you can keep polling with scrape.get.

scrape.submit and scrape.get

Fire and forget — submit a job and check status yourself.

response = client.scrape.submit(
  urls:   [{ url: "https://example.com" }],
  prompt: "Extract the main headline"
)
job_id = response["jobId"]

# Later...
status = client.scrape.get(job_id)
puts status["content"] if status["status"] == "completed"
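
If you poll yourself, a small helper keeps the loop in one place. This is our own sketch, not an SDK method; it assumes scrape.get reports "completed" or "failed" as terminal statuses:

```ruby
# Poll a submitted scrape job until it reaches a terminal state
# or the deadline passes. A plain wrapper around client.scrape.get.
def wait_for_scrape(client, job_id, interval: 5, timeout: 120)
  deadline = Time.now + timeout
  loop do
    status = client.scrape.get(job_id)
    return status if %w[completed failed].include?(status["status"])
    return status if Time.now >= deadline
    sleep interval
  end
end
```

Call it as wait_for_scrape(client, job_id) after scrape.submit and check the returned status as usual.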

Scrape parameters

Parameter              Type     Description
urls                   Array    Up to 3 entries, each { url: "..." } with an optional actions: array
prompt                 String   What to extract, in plain English
output                 String   "markdown" (default) or "json"
schema                 Hash     JSON Schema to enforce a specific output shape
use_proxy              Boolean  Route through a residential proxy
proxy_country          String   Two-letter country code, e.g. "us", "de", "jp"
extract_content_only   Boolean  Strip nav, ads, and boilerplate before extraction
screenshot             Boolean  Capture a viewport screenshot
full_page_screenshot   Boolean  Capture a full-page screenshot
cookies                String   Raw Cookie header for authenticated pages
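
For instance, schema can pin the JSON output to a fixed shape. The schema below is a sketch; the plan fields are invented for the example:

```ruby
# A JSON Schema describing an array of pricing-plan objects.
plan_schema = {
  type: "array",
  items: {
    type: "object",
    properties: {
      name:     { type: "string" },
      price:    { type: "string" },
      features: { type: "array", items: { type: "string" } }
    },
    required: ["name", "price"]
  }
}
```

Pass it as schema: plan_schema alongside output: "json" when calling client.scrape.run.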

Browser actions

Pass an actions: array inside a URL entry to interact with the page before extraction runs.

job = client.scrape.run(
  urls: [
    {
      url:     "https://example.com/products",
      actions: [
        { type: "click",  selector: "#accept-cookies" },
        { type: "wait",   duration: 1000 },
        { type: "scroll", to: "80%" }
      ]
    }
  ],
  prompt: "Extract all product names and prices"
)

Batch scraping

Submit up to 50 URLs in one request. They all run in parallel.

batch = client.batch.run(
  { urls: [
      "https://shop.example.com/product/1",
      "https://shop.example.com/product/2",
      "https://shop.example.com/product/3"
    ],
    prompt: "Extract product name, price, and stock status",
    output: "json" }
)

puts "#{batch["completedCount"]}/#{batch["totalUrls"]} completed"

batch["items"].each do |item|
  if item["status"] == "completed"
    puts item["result"].inspect
  else
    puts "Failed: #{item["url"]} (#{item["error"]})"
  end
end
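
To collect just the failures before logging or retrying, plain Enumerable methods work on the items array. The sample data below stands in for a real batch response:

```ruby
# Sample items in the shape the batch response uses above.
batch = {
  "items" => [
    { "status" => "completed", "url" => "https://shop.example.com/product/1",
      "result" => { "name" => "Widget" } },
    { "status" => "failed", "url" => "https://shop.example.com/product/2",
      "error" => "timeout" }
  ]
}

completed, failed = batch["items"].partition { |item| item["status"] == "completed" }
failed_urls = failed.map { |item| item["url"] }
# failed_urls → ["https://shop.example.com/product/2"]
```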

batch.submit and batch.get

response = client.batch.submit(
  urls:   ["https://example.com/1", "https://example.com/2"],
  prompt: "Extract the page title"
)
batch_id = response["batchId"]

result = client.batch.get(batch_id)
puts "#{result["completedCount"]}/#{result["totalUrls"]} done"

Retry failed items

if result["failedCount"] > 0
  client.batch.retry(batch_id)
end

Cancel a batch

client.batch.cancel(batch_id)

List past batches

page = client.batch.list(1, 20) # page, limit

page["jobs"].each do |job|
  puts "#{job["uuid"]} #{job["status"]} #{job["completedCount"]}/#{job["totalUrls"]}"
end
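
To walk every past batch rather than a single page, a small loop over batch.list does the job. This helper is our own, and assumes an empty "jobs" array marks the last page:

```ruby
# Iterate over all past batch jobs, page by page. Not an SDK
# method; a plain wrapper around client.batch.list(page, limit).
def each_batch_job(client, limit: 20)
  page = 1
  loop do
    jobs = client.batch.list(page, limit)["jobs"] || []
    break if jobs.empty?
    jobs.each { |job| yield job }
    page += 1
  end
end
```

Use it like each_batch_job(client) { |job| puts job["uuid"] }.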

Crawling

job = client.crawl.run(
  { base_url:               "https://competitor.com/blog",
    crawl_instruction:      "Follow blog post links only — skip tag and category pages",
    transform_instruction:  "Extract post title, author, publish date, and a one-sentence summary",
    max_pages:              30,
    use_proxy:              true }
)

job["result"].each do |page|
  puts "#{page["url"]}: #{page["data"].inspect}"
end

Crawl jobs often take a few minutes. The default timeout for crawl.run is 300 seconds. Adjust with timeout: n if you expect longer runs.

crawl.submit and crawl.get

response = client.crawl.submit(
  base_url:              "https://example.com/docs",
  crawl_instruction:     "Follow all documentation pages",
  transform_instruction: "Extract the page title and a short content summary",
  max_pages:             50
)
job_id = response["jobId"]

status = client.crawl.get(job_id)
# status["status"]: "waiting" | "active" | "running" | "completed" | "failed"

Downloading raw content

result = client.crawl.pages(job_id)

result["pages"].each do |page|
  puts page["url"]
  # page["html_url"]     — download the raw HTML (expires in 1 hour)
  # page["markdown_url"] — download the Markdown version
end
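
The download URLs are plain HTTPS, so the standard library is enough to fetch them. A sketch that saves each page's HTML locally (remember the links expire after an hour):

```ruby
require "net/http"
require "uri"

# Save the raw HTML for every crawled page into dir. The signed
# URLs expire after an hour, so download promptly.
def download_pages(pages, dir: ".")
  pages.each_with_index do |page, i|
    html = Net::HTTP.get(URI(page["html_url"]))
    File.write(File.join(dir, "page-#{i}.html"), html)
  end
end
```

Use it like download_pages(client.crawl.pages(job_id)["pages"]).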

Re-extracting with a new prompt

result = client.crawl.extract(completed_job_id, "Extract product SKUs and prices as JSON")
new_job_id = result["jobId"]

extracted = client.crawl.get(new_job_id)

History and stats

history = client.crawl.history(1, 10)
puts "#{history["total"]} total crawl jobs"

stats = client.crawl.stats
puts "#{stats["total"]} all-time"

Logs

result = client.logs.list(
  status:     "failed",
  searchTerm: "amazon.com",
  dateStart:  "2024-01-01",
  dateEnd:    "2024-12-31",
  page:       1,
  limit:      20
)

result["logs"].each do |log|
  puts "#{log["urls"][0]["url"]}: #{log["status"]} (#{log["credits_used"]} credits)"
end

# Full detail for a single log entry
log = client.logs.get(log_uuid)
puts log["result_data"].inspect

Usage statistics

rows = client.usage.get("30d") # "7d" | "30d" | "weekly"

rows.each do |row|
  puts "#{row["date"]}: #{row["requests"]} requests, #{row["credits"]} credits"
end
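
Totals are easy to derive from the rows with plain Ruby. The sample data below mimics the shape shown above:

```ruby
# Rows in the shape client.usage.get returns, per the example above.
rows = [
  { "date" => "2024-06-01", "requests" => 12, "credits" => 30 },
  { "date" => "2024-06-02", "requests" => 8,  "credits" => 21 }
]

totals = rows.each_with_object(Hash.new(0)) do |row, acc|
  acc["requests"] += row["requests"]
  acc["credits"]  += row["credits"]
end
# totals → { "requests" => 20, "credits" => 51 }
```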

Error handling

require "spidra"

client = Spidra.new(ENV["SPIDRA_API_KEY"])

begin
  job = client.scrape.run(
    urls:   [{ url: "https://example.com" }],
    prompt: "Extract the headline"
  )
rescue Spidra::AuthenticationError
  puts "Invalid or missing API key"
rescue Spidra::InsufficientCreditsError
  puts "Account is out of credits"
rescue Spidra::RateLimitError
  puts "Rate limited — slow down"
rescue Spidra::ServerError => e
  puts "Server error (#{e.status}): #{e.message}"
rescue Spidra::Error => e
  puts "API error #{e.status}: #{e.message}"
end

Exception                         HTTP status   When
Spidra::AuthenticationError       401           Missing or invalid API key
Spidra::InsufficientCreditsError  403           No credits remaining
Spidra::RateLimitError            429           Too many requests
Spidra::ServerError               5xx           Unexpected server-side error
Spidra::Error                     any           Base class for all Spidra exceptions

All exceptions expose .status (HTTP status code) and .message.
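
Because rate limits surface as a distinct class, a generic retry-with-backoff wrapper is straightforward. This helper is our own, not part of the SDK; the error class is a parameter so the pattern stays reusable:

```ruby
# Run a block, retrying with exponential backoff whenever the
# given error class is raised. Gives up after max_attempts tries.
def with_backoff(error_class, max_attempts: 5, base_delay: 1)
  attempts = 0
  begin
    yield
  rescue error_class
    attempts += 1
    raise if attempts >= max_attempts
    sleep base_delay * (2**(attempts - 1))
    retry
  end
end
```

Use it like with_backoff(Spidra::RateLimitError) { client.scrape.run(urls: [{ url: "https://example.com" }], prompt: "...") }.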

License

MIT. See LICENSE for details.