Purpose

Archaeo is a Ruby client for the Internet Archive’s Wayback Machine APIs.

It provides a model-driven interface for querying archived snapshots, checking availability, saving URLs, fetching archived content, and bulk downloading with resume support.

Installation

gem install archaeo

Or add to your Gemfile:

gem "archaeo"

Quick Start

require "archaeo"

Query Snapshots (CDX API)

cdx = Archaeo::CdxApi.new

# Enumerate all snapshots (auto-paginates via resume key)
cdx.snapshots("example.com").each do |snapshot|
  puts snapshot.timestamp
  puts snapshot.original_url
  puts snapshot.archive_url
end

# Find specific snapshots
oldest = cdx.oldest("example.com")
newest = cdx.newest("example.com")
near   = cdx.near("example.com", timestamp: "20220101")

# Filter by time
before = cdx.before("example.com", timestamp: "20220101")
after  = cdx.after("example.com", timestamp: "20220101")

# Time range query
cdx.between("example.com", from: "20220101", to: "20221231").each do |snap|
  puts snap.timestamp
end

# Filter by status code, mimetype, or URL pattern
cdx.snapshots("example.com",
  filters: [Archaeo::CdxFilter.by_status(200)],
  collapse: ["digest"],
  match_type: "domain",
  sort: "reverse",
)

# Compose multiple filters
filters = Archaeo::CdxFilter.combine(
  Archaeo::CdxFilter.only_successful,
  Archaeo::CdxFilter.excluding_mimetype("text/css"),
)
cdx.snapshots("example.com", filters: filters)

# Page-based pagination
cdx.snapshots("example.com", page: 0)

# Count pages
cdx.num_pages("example.com")

# Discover all known URLs for a domain
cdx.known_urls("example.com")

Check Availability

api = Archaeo::AvailabilityApi.new

result = api.near("example.com")
result.available?   # => true/false
result.archive_url  # => "https://web.archive.org/web/..."
result.timestamp    # => Archaeo::Timestamp
result.archived_status  # => HTTP status code of the archived page

api.available?("example.com")  # => true/false

Save a URL (SavePageNow)

save = Archaeo::SaveApi.new
result = save.save("https://example.com/")
result.url          # => "https://example.com/"
result.archive_url  # => "https://web.archive.org/web/..."
result.timestamp    # => Archaeo::Timestamp
result.cached?      # => true if already archived

Fetch Archived Content

fetcher = Archaeo::Fetcher.new
page = fetcher.fetch("https://example.com/",
                     timestamp: "20220615000000")

page.content        # => "<html>...</html>"
page.content_type   # => "text/html"
page.status_code    # => 200
page.archive_url    # => full archive URL
page.title          # => "Example Domain"
page.html?          # => true
page.json?          # => false
page.size           # => content length in bytes

# Raw (identity) mode -- no Wayback Machine rewriting
page = fetcher.fetch("https://example.com/",
                     timestamp: "20220615000000",
                     identity: true)

Fetch Page with Assets

fetcher = Archaeo::Fetcher.new
bundle = fetcher.fetch_page_with_assets("https://example.com/",
                                        timestamp: "20220615000000")

bundle.page        # => Archaeo::Page
bundle.assets      # => Archaeo::AssetList
bundle.assets.css  # => ["https://example.com/style.css", ...]
bundle.assets.js   # => ["https://example.com/app.js", ...]
bundle.assets.images
bundle.assets.fonts
bundle.assets.media
bundle.size        # => total count (page + assets)
bundle.asset_count # => number of assets

# Serialize asset list
bundle.assets.to_json
bundle.assets.counts  # => { css: 1, js: 2, image: 3, font: 0, media: 1 }

Bulk Download with Resume

downloader = Archaeo::BulkDownloader.new(output_dir: "archive")
downloader.download("example.com") do |current, total, snapshot|
  puts "[#{current}/#{total}] #{snapshot.original_url}"
end

# Resume interrupted download
downloader.download("example.com", resume: true)

# Filter by date range
downloader.download("example.com",
                    from: "20220101", to: "20221231")

# Parallel downloads
downloader = Archaeo::BulkDownloader.new(
  output_dir: "archive", concurrency: 4,
)
downloader.download("example.com")

URL Normalization

Archaeo::UrlNormalizer.normalize("  https://example.com/  ")
# => "https://example.com/"

Archaeo::UrlNormalizer.normalize('"https://example.com/%252F"')
# => "https://example.com/%2F"

Archaeo::UrlNormalizer.with_scheme("example.com")
# => "https://example.com"

CDX Filters

# Build validated filter expressions
Archaeo::CdxFilter.by_status(200)           # => "statuscode:200"
Archaeo::CdxFilter.excluding_status(404)    # => "!statuscode:404"
Archaeo::CdxFilter.by_mimetype("text/html") # => "mimetype:text/html"
Archaeo::CdxFilter.by_url("example.com")    # => "original:example.com"

# Compose filters
filters = Archaeo::CdxFilter.only_successful
error_filters = Archaeo::CdxFilter.excluding_errors

Snapshot Convenience

snap = cdx.near("example.com", timestamp: "20220101")

# Status predicates
snap.success?       # => true (200)
snap.redirect?      # => true for 3xx
snap.client_error?  # => true for 4xx
snap.server_error?  # => true for 5xx
snap.error?         # => true for 4xx/5xx

# Fetch content directly from a snapshot
page = snap.fetch

# Fetch with assets
bundle = snap.fetch_with_assets

# JSON-serializable representation
snap.as_json  # => Hash with primitive values only

Timestamps

# Create from components
ts = Archaeo::Timestamp.new(year: 2022, month: 6, day: 15)

# Parse from Wayback format
ts = Archaeo::Timestamp.parse("20220615120000")

# From Time object
ts = Archaeo::Timestamp.from_time(Time.now)

# Current time
ts = Archaeo::Timestamp.now

# Format as 14-digit string
ts.to_s  # => "20220615000000"

# Standard time formats
ts.to_iso8601  # => "2022-06-15T00:00:00Z"
ts.to_rfc3339  # => "2022-06-15T00:00:00+00:00"

# Arithmetic
ts + 3600          # => Timestamp one hour later
ts - 3600          # => Timestamp one hour earlier
ts1 - ts2          # => seconds between timestamps

# Comparison
ts1 < ts2   # => true/false

Command-Line Interface

# Show version
archaeo --version

# List snapshots (table, json, or csv format)
archaeo snapshots example.com
archaeo snapshots --format json example.com
archaeo snapshots --format csv --from 20220101 --to 20221231 example.com

# Find closest snapshot
archaeo near example.com 20220101
archaeo near --format json example.com 20220101

# Find oldest/newest
archaeo oldest example.com
archaeo newest --format json example.com

# Check availability (with optional timestamp)
archaeo available example.com
archaeo available --timestamp 20220101 example.com

# Save a URL
archaeo save https://example.com/

# Fetch archived content
archaeo fetch https://example.com/ 20220615120000

# Fetch and save to file
archaeo fetch --output page.html https://example.com/ 20220615120000

# Fetch raw (identity) content
archaeo fetch --identity https://example.com/ 20220615120000

# Download all snapshots
archaeo download example.com --output ./archive

# Parallel downloads
archaeo download --concurrency 4 example.com --output ./archive

# Resume interrupted download
archaeo download example.com --resume

# Discover all known URLs for a domain
archaeo known_urls example.com

Error Handling

# Blocked site (robots.txt)
Archaeo::BlockedSiteError

# No snapshot found
Archaeo::NoSnapshotFound

# Rate limited by Wayback Machine
Archaeo::RateLimitError

# Maximum retries exceeded
Archaeo::MaximumRetriesExceeded

# SavePageNow failure (e.g. session limit reached)
Archaeo::SaveFailed
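A sketch of rescuing these errors around a fetch. The snippet defines stand-in error classes so it runs without the gem installed; with Archaeo loaded you would rescue the real constants instead, and the shared Archaeo::Error base class shown here is an assumption, not something the list above confirms:

```ruby
# Stand-in definitions (assumed hierarchy) so this sketch is self-contained.
module Archaeo
  class Error < StandardError; end
  class NoSnapshotFound < Error; end
  class RateLimitError < Error; end
end

# Simulates a lookup that finds no capture for the requested timestamp.
def fetch_snapshot
  raise Archaeo::NoSnapshotFound, "no capture near 20220101"
end

begin
  fetch_snapshot
rescue Archaeo::RateLimitError
  # Real code might sleep and retry here.
  warn "rate limited; backing off"
rescue Archaeo::NoSnapshotFound => e
  puts "fallback: #{e.message}"
end
```

Rescuing the most specific class first keeps rate-limit handling (back off, retry) separate from a genuinely missing snapshot (fall back or skip).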

Architecture

Archaeo follows a model-driven, OOP design:

| Layer | Classes | Purpose |
|---|---|---|
| Models | Timestamp, ArchiveUrl, Snapshot, Page, PageBundle, SaveResult, AvailabilityResult | Domain value objects |
| URL Processing | UrlNormalizer, CdxFilter, UrlRewriter | URL sanitization, validated filtering with composition, and rewriting |
| Asset Extraction | AssetExtractor, AssetList | Parse HTML for resource URLs |
| APIs | CdxApi, AvailabilityApi, SaveApi | Query and mutate the archive |
| Operations | Fetcher, BulkDownloader, DownloadState | Download content with resume support |
| Infrastructure | HttpClient | HTTP transport with retries, gzip, 429/503 handling, connection pooling with eviction |

All API classes accept an HttpClient via dependency injection for testability.
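A sketch of what that injection pattern buys in a test. FakeHttpClient and CdxApiLike below are hypothetical stand-ins (so the snippet runs without the gem), and the exact constructor keyword the real CdxApi uses is an assumption; the point is that swapping the transport for a canned-response fake makes API-layer code testable offline:

```ruby
# A fake transport that returns canned responses instead of hitting the network.
class FakeHttpClient
  def initialize(responses)
    @responses = responses
  end

  def get(url, params = {})
    @responses.fetch(url)
  end
end

# A minimal stand-in mirroring how an API class would consume the client.
class CdxApiLike
  CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

  def initialize(http_client:)
    @http = http_client
  end

  # Counts lines in the CDX plain-text response for a domain.
  def snapshot_count(domain)
    @http.get(CDX_ENDPOINT, url: domain).lines.count
  end
end

fake = FakeHttpClient.new(
  CdxApiLike::CDX_ENDPOINT => "20220101000000 example.com\n20220615000000 example.com\n"
)
api = CdxApiLike.new(http_client: fake)
puts api.snapshot_count("example.com")  # prints 2
```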

Development

bundle install
bundle exec rspec
bundle exec rubocop

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/riboseinc/archaeo.

License

MIT License. See LICENSE for details.