## Purpose

Archaeo is a Ruby client for the Internet Archive's Wayback Machine APIs. It provides a model-driven interface for querying archived snapshots, checking availability, saving URLs, fetching archived content, and bulk downloading with resume support.
## Installation

```sh
gem install archaeo
```

Or add to your Gemfile:

```ruby
gem "archaeo"
```
## Quick Start

```ruby
require "archaeo"
```
## Query Snapshots (CDX API)

```ruby
cdx = Archaeo::CdxApi.new

# Enumerate all snapshots (auto-paginates via resume key)
cdx.snapshots("example.com").each do |snapshot|
  puts snapshot.timestamp
  puts snapshot.original_url
  puts snapshot.archive_url
end

# Find specific snapshots
oldest = cdx.oldest("example.com")
newest = cdx.newest("example.com")
near   = cdx.near("example.com", timestamp: "20220101")

# Filter by time
before = cdx.before("example.com", timestamp: "20220101")
after  = cdx.after("example.com", timestamp: "20220101")

# Filter by status code, MIME type, or URL pattern
cdx.snapshots("example.com",
  filters: [Archaeo::CdxFilter.by_status(200)],
  collapse: ["digest"],
  match_type: "domain",
  sort: "reverse",
)

# Page-based pagination
cdx.snapshots("example.com", page: 0)

# Count pages
cdx.num_pages("example.com")

# Discover all known URLs for a domain
cdx.known_urls("example.com")
```
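Under the hood, calls like `cdx.snapshots` map onto the public Wayback CDX server. As a rough sketch, such a client builds a request like the one below; the endpoint and parameter names (`matchType`, `collapse`) are the CDX server's own, while the helper `cdx_query_url` is hypothetical and not part of Archaeo:

```ruby
require "uri"

# Illustrative only: build the kind of query URL a CDX client sends.
# Parameter names follow the public Wayback CDX server, not Archaeo's source.
def cdx_query_url(url, match_type: "exact", collapse: nil, limit: nil)
  params = { "url" => url, "output" => "json", "matchType" => match_type }
  params["collapse"] = collapse if collapse
  params["limit"] = limit.to_s if limit
  "https://web.archive.org/cdx/search/cdx?" + URI.encode_www_form(params)
end

cdx_query_url("example.com", match_type: "domain", collapse: "digest")
```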
## Check Availability

```ruby
api = Archaeo::AvailabilityApi.new

result = api.near("example.com")
result.available?   # => true/false
result.archive_url  # => "https://web.archive.org/web/..."
result.timestamp    # => Archaeo::Timestamp

api.available?("example.com")  # => true/false
```
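These calls wrap the Wayback Availability JSON API (`https://archive.org/wayback/available?url=...`), whose response nests the nearest capture under `archived_snapshots.closest`. A sketch of parsing that shape with a canned sample (the sample mirrors the public API's documented response; how Archaeo maps it onto its result object is an assumption):

```ruby
require "json"

# Canned sample in the shape the Availability API returns.
sample = <<~JSON
  {
    "url": "example.com",
    "archived_snapshots": {
      "closest": {
        "status": "200",
        "available": true,
        "url": "https://web.archive.org/web/20220615000000/https://example.com/",
        "timestamp": "20220615000000"
      }
    }
  }
JSON

closest = JSON.parse(sample).dig("archived_snapshots", "closest")
# When no snapshot exists, "archived_snapshots" is an empty object,
# so a nil check doubles as the availability test.
available = closest ? closest["available"] : false
```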
## Save a URL (SavePageNow)
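Archaeo's save interface is not shown above (the CLI exposes it as `archaeo save`). As a hedged sketch of what the feature wraps: the Wayback Machine's SavePageNow endpoint is simply `https://web.archive.org/save/<url>`, so a save request can be modeled as building and requesting that URL. `build_save_url` below is illustrative, not an Archaeo method:

```ruby
# Hypothetical sketch: SavePageNow lives at https://web.archive.org/save/<url>.
def build_save_url(url)
  "https://web.archive.org/save/#{url}"
end

save_url = build_save_url("https://example.com/")
# An HTTP request to save_url asks the Wayback Machine to archive the page.
# Archaeo::SaveFailed (see Error Handling) covers SavePageNow session limits.
```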
## Fetch Archived Content

```ruby
fetcher = Archaeo::Fetcher.new

page = fetcher.fetch("https://example.com/",
  timestamp: "20220615000000")

page.content      # => "<html>...</html>"
page.content_type # => "text/html"
page.status_code  # => 200
page.archive_url  # => full archive URL

# Raw (identity) mode -- no Wayback Machine rewriting
page = fetcher.fetch("https://example.com/",
  timestamp: "20220615000000",
  identity: true)
```
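Identity mode corresponds to the Wayback Machine's `id_` URL flag: appending `id_` to the timestamp segment of an archive URL returns the original bytes without the archive's link rewriting. A sketch of that mapping (`archive_url_for` is illustrative, not an Archaeo method):

```ruby
# The "id_" flag after the timestamp requests unrewritten (identity) content.
def archive_url_for(url, timestamp, identity: false)
  flag = identity ? "id_" : ""
  "https://web.archive.org/web/#{timestamp}#{flag}/#{url}"
end

archive_url_for("https://example.com/", "20220615000000", identity: true)
# => "https://web.archive.org/web/20220615000000id_/https://example.com/"
```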
## Fetch Page with Assets

```ruby
fetcher = Archaeo::Fetcher.new

bundle = fetcher.fetch_page_with_assets("https://example.com/",
  timestamp: "20220615000000")

bundle.page    # => Archaeo::Page
bundle.assets  # => Archaeo::AssetList

bundle.assets.css     # => ["https://example.com/style.css", ...]
bundle.assets.js      # => ["https://example.com/app.js", ...]
bundle.assets.images
bundle.assets.fonts
bundle.assets.media
```
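Conceptually, asset extraction means scanning the fetched HTML for resource references and resolving them against the page URL. A deliberately naive sketch of that step using only the standard library (Archaeo's actual parser is not shown here, and a real one must handle far more cases than this regex does):

```ruby
require "uri"

# Naive illustration: pull src/href attribute values out of HTML and
# resolve each against the page's base URL.
def extract_asset_urls(html, base_url)
  html.scan(/(?:src|href)=["']([^"']+)["']/).flatten.map do |ref|
    URI.join(base_url, ref).to_s
  end
end

html = '<link href="/style.css" rel="stylesheet"><script src="app.js"></script>'
extract_asset_urls(html, "https://example.com/")
# => ["https://example.com/style.css", "https://example.com/app.js"]
```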
## Bulk Download with Resume

```ruby
downloader = Archaeo::BulkDownloader.new(output_dir: "archive")

downloader.download("example.com") do |current, total, snapshot|
  puts "[#{current}/#{total}] #{snapshot.original_url}"
end

# Resume an interrupted download
downloader.download("example.com", resume: true)

# Filter by date range
downloader.download("example.com",
  from: "20220101", to: "20221231")
```
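One common way resume support works (the real `BulkDownloader`'s mechanism is not shown here, so treat this as a sketch of the idea): persist a key for each completed snapshot to a state file, and skip already-recorded keys on the next run.

```ruby
require "set"
require "tmpdir"

# Illustrative resume bookkeeping: append-only state file of finished keys.
class ResumeState
  def initialize(path)
    @path = path
    @done = File.exist?(path) ? Set.new(File.readlines(path, chomp: true)) : Set.new
  end

  def done?(key)
    @done.include?(key)
  end

  def record(key)
    @done << key
    File.open(@path, "a") { |f| f.puts(key) }
  end
end
```

A resumed run re-reads the file, so interrupted work is never repeated, only the in-flight item is retried.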
## URL Normalization

```ruby
Archaeo::UrlNormalizer.normalize(" https://example.com/ ")
# => "https://example.com/"

Archaeo::UrlNormalizer.normalize('"https://example.com/%252F"')
# => "https://example.com/%2F"

Archaeo::UrlNormalizer.with_scheme("example.com")
# => "https://example.com"
```
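The `%252F` → `%2F` example is one round of percent-decoding, and the whitespace/quote examples are plain string cleanup. In standard-library Ruby the same steps look like this (a sketch of the idea, not Archaeo's implementation):

```ruby
require "cgi"

# One round of percent-decoding collapses double-encoded "%252F" into "%2F".
# (Note: CGI.unescape also turns "+" into a space, which a real URL
# normalizer would need to guard against for path components.)
CGI.unescape("%252F")  # => "%2F"

# Stripping wrapping quotes and stray whitespace, as in the examples above:
'" https://example.com/ "'.delete('"').strip  # => "https://example.com/"
```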
## CDX Filters
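The only constructor shown above is `Archaeo::CdxFilter.by_status(200)`. As a hedged sketch of what such filters serialize to: the CDX server expresses filters as `field:regex` strings (e.g. `statuscode:200`, `mimetype:text/html`), so a filter object presumably reduces to one of those expressions. `MiniCdxFilter` below is illustrative, not Archaeo's class:

```ruby
# Illustrative stand-in for a CDX filter value object.
class MiniCdxFilter
  def initialize(field, pattern)
    @field = field
    @pattern = pattern
  end

  def self.by_status(code)
    new("statuscode", code.to_s)
  end

  def self.by_mimetype(type)
    new("mimetype", type)
  end

  # Serialize to the CDX server's "field:regex" filter syntax.
  def to_query_param
    "#{@field}:#{@pattern}"
  end
end
```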
## Timestamps

```ruby
# Create from components
ts = Archaeo::Timestamp.new(year: 2022, month: 6, day: 15)

# Parse from Wayback format
ts = Archaeo::Timestamp.parse("20220615120000")

# From a Time object
ts = Archaeo::Timestamp.from_time(Time.now)

# Current time
ts = Archaeo::Timestamp.now

# Format as a 14-digit string
ts.to_s  # => "20220615000000"

# Comparison
ts1 < ts2  # => true/false
```
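Wayback timestamps are plain `YYYYMMDDhhmmss` strings, so the format round-trips cleanly through Ruby's `Time`. A standard-library sketch of what `Timestamp` models:

```ruby
require "time"

# The Wayback Machine's 14-digit timestamp format.
WAYBACK_FORMAT = "%Y%m%d%H%M%S"

Time.utc(2022, 6, 15).strftime(WAYBACK_FORMAT)
# => "20220615000000"

Time.strptime("20220615120000", WAYBACK_FORMAT).hour
# => 12
```

Because the format orders fields from most to least significant, the strings also sort chronologically, which is why plain comparison works.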
## Command-Line Interface

```sh
# Show version
archaeo --version

# List snapshots (table, json, or csv format)
archaeo snapshots example.com
archaeo snapshots --format json example.com
archaeo snapshots --format csv --from 20220101 --to 20221231 example.com

# Find the closest snapshot
archaeo near example.com 20220101

# Check availability
archaeo available example.com

# Save a URL
archaeo save https://example.com/

# Fetch archived content
archaeo fetch https://example.com/ 20220615120000

# Fetch and save to a file
archaeo fetch --output page.html https://example.com/ 20220615120000

# Fetch raw (identity) content
archaeo fetch --identity https://example.com/ 20220615120000

# Download all snapshots
archaeo download example.com --output ./archive

# Resume an interrupted download
archaeo download example.com --resume

# Discover all known URLs for a domain
archaeo known_urls example.com
```
## Error Handling

```ruby
# Blocked site (robots.txt)
Archaeo::BlockedSiteError

# No snapshot found
Archaeo::NoSnapshotFound

# Rate limited by the Wayback Machine
Archaeo::RateLimitError

# Maximum retries exceeded
Archaeo::MaximumRetriesExceeded

# SavePageNow session limit
Archaeo::SaveFailed
```
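A typical caller rescues the rate-limit error and retries with backoff, letting the retry-exhaustion error surface only after a bounded number of attempts. A self-contained sketch of that shape (`RateLimitError` here is a local stand-in for `Archaeo::RateLimitError`, and the loop is illustrative rather than Archaeo's internal retry policy):

```ruby
# Local stand-in for Archaeo::RateLimitError, for illustration.
RateLimitError = Class.new(StandardError)

# Run the block, retrying on rate limits with exponential backoff.
def with_retries(max_attempts: 3, base_delay: 0.0)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue RateLimitError
    raise if attempts >= max_attempts
    sleep(base_delay * (2**attempts))  # exponential backoff
    retry
  end
end
```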
## Architecture

Archaeo follows a model-driven, OOP design:

| Layer | Purpose |
|---|---|
| Models | Domain value objects |
| URL Processing | URL sanitization, filtering, and rewriting |
| Asset Extraction | Parse HTML for resource URLs |
| APIs | Query and mutate the archive |
| Operations | Download content with resume support |
| Infrastructure | HTTP transport with retries, gzip, connection pooling |

All API classes accept an `HttpClient` via dependency injection for testability.
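That injection point makes testing straightforward: any object exposing the client's request interface can stand in for the real transport. The sketch below assumes a simple `get(url)` interface and a keyword-argument constructor; Archaeo's actual `HttpClient` signature may differ.

```ruby
# Hypothetical stub standing in for Archaeo's HttpClient in a spec.
class StubHttpClient
  def initialize(responses)
    @responses = responses  # url => canned body
  end

  def get(url)
    @responses.fetch(url)
  end
end

stub = StubHttpClient.new("https://example.com/" => "<html>ok</html>")
# e.g. Archaeo::CdxApi.new(http_client: stub)  # keyword name assumed
```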
## Development

```sh
bundle install
bundle exec rspec
bundle exec rubocop
```
## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/riboseinc/archaeo.

## License

MIT License. See LICENSE for details.