Purpose

Archaeo is a Ruby client for the Internet Archive’s Wayback Machine APIs.

It provides a model-driven interface for querying archived snapshots, checking availability, saving URLs, fetching archived content, bulk downloading with resume support, comparing snapshots, analyzing coverage, tracking content changes, searching archived full text, reading and writing WARC files, and more.

Installation

gem install archaeo

Or add to your Gemfile:

gem "archaeo"

Quick Start

require "archaeo"

Query Snapshots (CDX API)

cdx = Archaeo::CdxApi.new

# Enumerate all snapshots (auto-paginates via resume key)
cdx.snapshots("example.com").each do |snapshot|
  puts snapshot.timestamp
  puts snapshot.original_url
  puts snapshot.archive_url
end

# Find specific snapshots
oldest = cdx.oldest("example.com")
newest = cdx.newest("example.com")
near   = cdx.near("example.com", timestamp: "20220101")

# Filter by time
before = cdx.before("example.com", timestamp: "20220101")
after  = cdx.after("example.com", timestamp: "20220101")

# Time range query
cdx.between("example.com", from: "20220101", to: "20221231").each do |snap|
  puts snap.timestamp
end

# Count snapshots
cdx.count("example.com")  # => Integer

# Deduplicated snapshots (collapse by digest)
cdx.unique_snapshots("example.com").each do |snap|
  puts snap.timestamp
end

# Timeline analysis (time-bucketed frequency)
timeline = cdx.timeline("example.com",
                        from: "20220101", to: "20221231",
                        bucket_size: :month)
timeline.to_h     # => { "202201" => 5, "202202" => 3, ... }
timeline.peak     # => ["202201", 5]
timeline.total    # => 42
timeline.span     # => ["202201", "202212"]
timeline.size     # => 12 (number of buckets)

# Filter by status code, mimetype, or URL pattern
cdx.snapshots("example.com",
  filters: [Archaeo::CdxFilter.by_status(200)],
  collapse: ["digest"],
  match_type: "domain",
  sort: "reverse",
)

# Compose multiple filters
filters = Archaeo::CdxFilter.combine(
  Archaeo::CdxFilter.only_successful,
  Archaeo::CdxFilter.excluding_mimetype("text/css"),
)
cdx.snapshots("example.com", filters: filters)

# Convenience filter factories
Archaeo::CdxFilter.only_html              # text/html only
Archaeo::CdxFilter.by_mimetype_prefix("image")  # any image/*
Archaeo::CdxFilter.excluding_redirects    # exclude 3xx

# Page-based pagination
cdx.snapshots("example.com", page: 0)

# Count pages
cdx.num_pages("example.com")

# Discover all known URLs for a domain
cdx.known_urls("example.com")

# Composite snapshot (point-in-time site reconstruction)
cdx.composite_snapshot("example.com", timestamp: "20220615",
                       collapse: ["digest"])
# => picks newest snapshot per URL at or before the given timestamp
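#
# Conceptually (a sketch, not the library's implementation), the composite
# is equivalent to grouping captures by URL and keeping the newest one at
# or before the cutoff:
#
#   cutoff = "20220615235959"
#   composite = cdx.snapshots("example.com")
#     .group_by(&:original_url)
#     .transform_values do |snaps|
#       snaps.select { |s| s.timestamp.to_s <= cutoff }
#            .max_by { |s| s.timestamp.to_s }
#     end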

# CDX caching (speeds up repeated queries)
cdx = Archaeo::CdxApi.new(cache_dir: ".cache")

# Parallel CDX fetching (thread pool for multi-page queries)
parallel = Archaeo::ParallelCdx.new(concurrency: 4)
snapshots = parallel.snapshots("example.com")

Check Availability

api = Archaeo::AvailabilityApi.new

result = api.near("example.com")
result.available?   # => true/false
result.archive_url  # => "https://web.archive.org/web/..."
result.timestamp    # => Archaeo::Timestamp
result.archived_status  # => HTTP status code of the archived page
result.to_h         # => Hash representation
result.as_json      # => JSON-serializable Hash

api.available?("example.com")  # => true/false

# Batch availability check
results = api.batch_available?(%w[example.com other.com])
# => { "example.com" => AvailabilityResult, ... }

Save a URL (SavePageNow)

save = Archaeo::SaveApi.new
result = save.save("https://example.com/")
result.url          # => "https://example.com/"
result.archive_url  # => "https://web.archive.org/web/..."
result.timestamp    # => Archaeo::Timestamp
result.cached?      # => true if already archived
result.success?     # => true if archive_url is present
result.to_h         # => Hash representation
result.as_json      # => JSON-serializable Hash

# Batch save multiple URLs
results = save.batch_save(%w[https://a.com https://b.com],
                          delay: 2, stop_on_error: false)
results.each { |r| puts "#{r.url}: #{r.success?}" }

# Inspect response details
result.status_code       # => HTTP status from Save API
result.response_url      # => redirect URL if any
result.response_headers  # => Hash of response headers

# With rate limiter
save = Archaeo::SaveApi.new(rate_limiter: Archaeo::RateLimiter.new(min_interval: 1.0))

Fetch Archived Content

fetcher = Archaeo::Fetcher.new
page = fetcher.fetch("https://example.com/",
                     timestamp: "20220615000000")

page.content        # => "<html>...</html>"
page.content_type   # => "text/html"
page.status_code    # => 200
page.archive_url    # => full archive URL
page.title          # => "Example Domain"
page.html?          # => true
page.css?           # => false (true for text/css pages)
page.json?          # => false
page.size           # => content length in bytes
page.to_h           # => Hash with all fields
page.as_json        # => JSON-serializable Hash
page.inspect        # => "#<Archaeo::Page text/html 1234 bytes>"

# Raw (identity) mode -- no Wayback Machine rewriting
page = fetcher.fetch("https://example.com/",
                     timestamp: "20220615000000",
                     identity: true)

# With digest verification (raises IntegrityError on mismatch)
page = fetcher.fetch("https://example.com/",
                     timestamp: "20220615000000",
                     snapshot: snap)

# Raise on error status (raises FetchError with page attached)
page = fetcher.fetch!("https://example.com/",
                      timestamp: "20220615000000")
# FetchError includes: .status_code, .url, .page

# Page links and meta extraction
page.links      # => [{ href: "...", text: "...", external: true/false }]
page.meta_tags  # => { "description" => "...", "og:title" => "...", "canonical" => "..." }

# Structured content extraction (HTML pages only)
page.headings   # => [{ level: 1, text: "Title" }, { level: 2, text: "Subtitle" }]
page.images     # => [{ src: "photo.jpg", alt: "...", width: 800, height: 600 }]
page.forms      # => [{ action: "/submit", method: "POST", fields: [{ name: "q", type: "text" }] }]
page.scripts    # => [{ src: "app.js", type: "text/javascript" }]

Fetch Page with Assets

fetcher = Archaeo::Fetcher.new
bundle = fetcher.fetch_page_with_assets("https://example.com/",
                                        timestamp: "20220615000000")

bundle.page        # => Archaeo::Page
bundle.assets      # => Archaeo::AssetList
bundle.assets.css  # => ["https://example.com/style.css", ...]
bundle.assets.js   # => ["https://example.com/app.js", ...]
bundle.assets.images
bundle.assets.fonts
bundle.assets.media
bundle.size        # => total count (page + assets)
bundle.asset_count # => number of assets
bundle.to_h        # => Hash representation
bundle.to_json     # => JSON string

# Serialize asset list
bundle.assets.to_json
bundle.assets.counts  # => { css: 1, js: 2, image: 3, font: 0, media: 1 }

# Filter assets by type
css_only = bundle.assets.filter(:css)
images_and_fonts = bundle.assets.filter(:image, :font)

# Merge asset lists (deduplicates)
merged = bundle.assets.merge(other_assets)

# Reconstruct from JSON
restored = Archaeo::AssetList.from_json(json_string)

# Safe type access
bundle.assets.urls_by_type(:image)  # works for any type key

# Domain analysis
bundle.assets.domain_counts
# => { "cdn.example.com" => 3, "fonts.googleapis.com" => 1 }

# Filter downloadable assets (excludes data: and fragment URLs)
downloadable = bundle.assets.downloadable

Bulk Download with Resume

downloader = Archaeo::BulkDownloader.new(output_dir: "archive")
summary = downloader.download("example.com") do |current, total, snapshot|
  puts "[#{current}/#{total}] #{snapshot.original_url}"
end

summary.total          # => total snapshots found
summary.downloaded     # => successfully downloaded
summary.skipped        # => skipped (already downloaded with resume)
summary.failed         # => failed downloads
summary.bytes_written  # => total bytes written
summary.elapsed        # => seconds elapsed

# Resume interrupted download
downloader.download("example.com", resume: true)

# Dry run (preview without fetching)
summary = downloader.download("example.com", dry_run: true)

# Filter by date range
downloader.download("example.com",
                    from: "20220101", to: "20221231")

# Parallel downloads
downloader = Archaeo::BulkDownloader.new(
  output_dir: "archive", concurrency: 4,
)
downloader.download("example.com")

# Download with page requisites (CSS/JS/images)
downloader.download("example.com", page_requisites: true)

# Point-in-time composite snapshot
downloader.download("example.com", snapshot_at: "20220615")

# All timestamps (not just latest per URL)
downloader.download("example.com", all_timestamps: true)

# URL pattern filtering
filter = Archaeo::PatternFilter.new(only: ".*\\.html$", exclude: nil)
downloader.download("example.com", filter: filter)

# Download scheduling strategies
scheduler = Archaeo::DownloadScheduler.new(
  strategy: :breadth_first,   # or :depth_first, :newest_first, :oldest_first
  priority: :html_first,
  max_file_size: 50 * 1024 * 1024,
)
# Integrates with BulkDownloader via the strategy: option (see below)

# Rate limiting
limiter = Archaeo::RateLimiter.new(min_interval: 0.5)
downloader = Archaeo::BulkDownloader.new(
  output_dir: "archive", rate_limiter: limiter,
)

# Limit snapshots
downloader.download("example.com", max_snapshots: 10, strategy: :newest_first)

# Progress reporting
downloader.download("example.com") do |current, total, snap|
  report = Archaeo::ProgressReport.new(
    current: current, total: total,
    downloaded_bytes: current * 1024, elapsed: 10.0,
    current_url: snap.original_url,
  )
  puts "#{report.percent_complete}% — ETA #{report.eta}s"
end

Download State (Resume Tracking)

state = Archaeo::DownloadState.new("archive")

# Check if a snapshot was already downloaded
state.completed?("20220615000000")  # => true/false

# Get metadata for a completed snapshot
entry = state.entry_for("20220615000000")
# => { "ts" => "20220615000000", "at" => "2022-06-15T12:00:00Z",
#      "url" => "https://example.com/", "bytes" => 12345 }

# Total bytes downloaded
state.total_bytes  # => Integer

# List all completed timestamps
state.size        # => number of completed entries
state.timestamps  # => ["20220101000000", "20220102000000"]

# Clear state for a fresh download
state.clear

URL Normalization

Archaeo::UrlNormalizer.normalize("  https://example.com/  ")
# => "https://example.com/"

Archaeo::UrlNormalizer.normalize('"https://example.com/%252F"')
# => "https://example.com/%2F"

Archaeo::UrlNormalizer.with_scheme("example.com")
# => "https://example.com"

# Default ports are stripped
Archaeo::UrlNormalizer.normalize("https://example.com:443/path")
# => "https://example.com/path"

CDX Filters

# Build validated filter expressions
Archaeo::CdxFilter.by_status(200)           # => "statuscode:200"
Archaeo::CdxFilter.excluding_status(404)    # => "!statuscode:404"
Archaeo::CdxFilter.by_mimetype("text/html") # => "mimetype:text/html"
Archaeo::CdxFilter.by_url("example.com")    # => "original:example.com"

# Compose filters
filters = Archaeo::CdxFilter.only_successful
error_filters = Archaeo::CdxFilter.excluding_errors

# Mimetype prefix matching
Archaeo::CdxFilter.by_mimetype_prefix("image")  # => matches image/*

# Convenience factories
Archaeo::CdxFilter.only_html            # => text/html only
Archaeo::CdxFilter.excluding_redirects  # => excludes 3xx statuses

# Introspection
filter = Archaeo::CdxFilter.by_status(200)
filter.field    # => "statuscode"
filter.pattern  # => "200"
filter.matches?("200")  # => true
filter.matches?("404")  # => false
filter.negated?         # => false

URL Rewriting

rewriter = Archaeo::UrlRewriter.new(
  "https://web.archive.org/web/20220615000000/",
  "local",
)

# Rewrite single URL
rewriter.rewrite("https://web.archive.org/web/20220615000000/style.css")
# => "local/style.css"

# Rewrite batch
rewriter.rewrite_batch(["url1", "url2"])

# Rewrite URLs within HTML (src, href, srcset, data-src, poster, action, data-url)
# Also rewrites inline style url() and <style> element url()
rewritten_html = rewriter.rewrite_html(html_content)

# Enhanced rewriting with JS strings, absolute URLs, and server extensions
rewriter = Archaeo::UrlRewriter.new(
  "https://web.archive.org/web/20220615000000/",
  "local",
  rewrite_js: true,           # rewrite URLs inside JS string literals
  rewrite_absolute: true,     # rewrite all absolute archive URLs (not just prefix match)
  server_extensions: true,    # handle .php/.asp/.jsp URLs specially
)

# Standalone CSS file rewriting
rewritten_css = rewriter.rewrite_css(css_content)

Snapshot Convenience

snap = cdx.near("example.com", timestamp: "20220101")

# Status predicates
snap.success?       # => true (200)
snap.redirect?      # => true for 3xx
snap.client_error?  # => true for 4xx
snap.server_error?  # => true for 5xx
snap.error?         # => true for 4xx/5xx

# Age helpers
snap.age            # => seconds since capture
snap.older_than?(3600)  # => true if older than 1 hour
snap.newer_than?(3600)  # => true if newer than 1 hour

# Content comparison (by digest)
snap1.same_content_as?(snap2)  # => true if same digest
snap1.duplicate_of?(snap2)     # => true if same digest AND different timestamp

# Identity URL (raw content, no Wayback rewriting)
snap.identity_url

# Fetch content directly from a snapshot
page = snap.fetch

# Fetch with assets
bundle = snap.fetch_with_assets

# JSON-serializable representation
snap.as_json  # => Hash with primitive values only
snap.inspect  # => "#<Archaeo::Snapshot 20220101 ...>"

Timestamps

# Create from components
ts = Archaeo::Timestamp.new(year: 2022, month: 6, day: 15)

# Parse from Wayback format
ts = Archaeo::Timestamp.parse("20220615120000")

# From Time object
ts = Archaeo::Timestamp.from_time(Time.now)

# Current time
ts = Archaeo::Timestamp.now

# Format as 14-digit string
ts.to_s  # => "20220615000000"

# Standard time formats
ts.to_iso8601  # => "2022-06-15T00:00:00Z"
ts.to_rfc3339  # => "2022-06-15T00:00:00+00:00"

# Decompose
ts.to_h  # => { year: 2022, month: 6, day: 15, hour: 0, minute: 0, second: 0 }
ts.to_a  # => [2022, 6, 15, 0, 0, 0]

# Arithmetic
ts + 3600          # => Timestamp one hour later
ts - 3600          # => Timestamp one hour earlier
ts1 - ts2          # => seconds between timestamps

# Comparison
ts1 < ts2   # => true/false

# Immutable -- frozen on creation
ts.frozen?  # => true

# Date/time helpers
ts.quarter         # => 1..4
ts.wday            # => 0..6 (Sunday = 0)
ts.human_readable  # => "2022-06-15 00:00:00 UTC"
ts.to_date         # => Date object

# Date ranges for coverage analysis
range = ts.date_range(:month)
# => Timestamp(Jun 1)..Timestamp(Jun 30 23:59:59)
ts.date_range(:day)   # => single day range
ts.date_range(:year)  # => full year range

HTTP Client Observability

# Track every request with a callback
client = Archaeo::HttpClient.new(
  on_request: ->(uri, elapsed, status, retries) {
    puts "#{status} #{uri} (#{elapsed.round(3)}s, #{retries} retries)"
  },
)

# Intercept requests before they are sent
client = Archaeo::HttpClient.new(
  before_request: ->(uri, request) {
    request["X-Custom-Header"] = "value"
  },
)

# Inspect connection pool state
client.pool_stats
# => { active_connections: 2, max_pool_size: 8,
#      hosts: ["web.archive.org"],
#      idle_times: { "web.archive.org" => 12 } }

Snapshot Comparison (Diff)

diff = Archaeo::SnapshotDiff.new(
  url: "https://example.com/",
  page_a: page_a, page_b: page_b,
  timestamp_a: "20220101", timestamp_b: "20220615",
)

diff.content_changed?    # => true/false (SHA256 digest comparison)
diff.text_diff           # => unified diff of content lines
diff.link_changes        # => { added: [...], removed: [...], unchanged: N }
diff.asset_changes       # => { added: [...], removed: [...], unchanged: N }
diff.structural_changes  # => { "a" => { from: 1, to: 2 }, ... }
diff.to_h                # => Hash with all fields

Coverage Analysis

analyzer = Archaeo::CoverageAnalyzer.new
report = analyzer.analyze("example.com", from: "20220101", to: "20221231")

report.url               # => "example.com"
report.total_urls        # => unique URLs found
report.archived_urls     # => URLs with at least one capture
report.coverage_percent  # => 87.3
report.temporal_gaps     # => [{ from: ts, to: ts, gap_days: 45 }, ...]
report.has_gaps?         # => true/false
report.status_distribution # => { 200 => 150, 404 => 10 }
report.missing_assets    # => resources referenced but not archived

Archive Health Check

checker = Archaeo::ArchiveHealthCheck.new
report = checker.check("example.com", from: "20220101", to: "20221231")

report.total       # => 150
report.accessible  # => 148
report.missing     # => 2
report.errors      # => 0
report.details     # => [HealthDetail, ...]

# Sample a subset (for large collections)
report = checker.check("example.com", sample: 50)

Content Tracking

tracker = Archaeo::ContentTracker.new
report = tracker.track("example.com", from: "20220101", to: "20221231")

report.changed_urls       # => URLs whose digest changed over time
report.new_urls           # => URLs that appeared in the second half
report.removed_urls       # => URLs that disappeared in the second half
report.content_frequency  # => { "url" => unique_digest_count }
report.any_changes?       # => true if any changes detected

Full-Text Search

searcher = Archaeo::ArchiveSearch.new
results = searcher.search("example.com",
                          query: "contact us",
                          from: "20220101",
                          to: "20221231",
                          case_sensitive: false,
                          max_results: 10)

results.each do |match|
  puts match.snapshot.timestamp  # => when it was archived
  puts match.url                 # => the page URL
  puts match.context             # => "...contact us..." with surrounding text
end

WARC Support

# Export snapshots to WARC format
writer = Archaeo::WarcWriter.new
writer.write("archive/output.warc", pages)

# Gzip-compressed output
writer.write("archive/output.warc.gz", pages, compress: true)

# Read WARC files
reader = Archaeo::WarcReader.new
records = reader.read_records("archive/output.warc")

records.each do |record|
  record.warc_type    # => "response" or "warcinfo"
  record.target_uri   # => original URL
  record.body         # => archived content
  record.response?    # => true for response records
end

Configuration

# Load .archaeo.yml config
config = Archaeo::Configuration.new

config.get("output_dir")             # => "archive" (default)
config.get("rate_limit")             # => 0.5
config.get("concurrency", profile: "fast")  # => 8

# Persist settings
config.set("rate_limit", 1.0)
config.set("concurrency", 4, profile: "fast")

# List profiles
config.profiles  # => ["fast", "careful"]
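
For reference, a .archaeo.yml might look like the following. This is a hypothetical shape inferred from the accessors above (in particular, the profiles nesting is an assumption); the actual schema may differ:

output_dir: archive
rate_limit: 0.5
profiles:
  fast:
    concurrency: 8
  careful:
    rate_limit: 1.0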

Encoding Detection

detector = Archaeo::EncodingDetector.new

# Detect encoding from content + content-type charset
encoding = detector.detect(binary_content, content_type: "text/html; charset=iso-8859-1")
# => Encoding::ISO_8859_1

# Detect from HTML meta tag
encoding = detector.detect("<html><head><meta charset='utf-8'>...")
# => Encoding::UTF_8

# Multi-encoding fallback chain
detector.detect(content)  # tries UTF-8, ISO-8859-1, Windows-1252

Path Sanitization

sanitizer = Archaeo::PathSanitizer.new
safe_path = sanitizer.sanitize("https://example.com/path?q=1&r=2")
# => "path_q_1_r_2"

# Handles query string hashing, recursive percent-decoding,
# and file/directory conflict resolution

Pattern Filtering

# Include/exclude URL patterns
filter = Archaeo::PatternFilter.new(
  only: ".*\\.html$",       # regex string or %r{} Regexp
  exclude: %r{/api/},
)

filter.match?("https://example.com/page.html")  # => true
filter.match?("https://example.com/style.css")   # => false
filter.match?("https://example.com/api/data")    # => false (excluded)

Subdomain Discovery

discovery = Archaeo::SubdomainDiscovery.new("example.com", max_depth: 2)

# Scan downloaded files to discover subdomains
subdomains = discovery.scan_files("archive/")
# => ["cdn.example.com", "blog.example.com"]

# Scan raw content (HTML, CSS, JS)
subdomains = discovery.scan_content("<a href='https://blog.example.com/post'>")
# => ["blog.example.com"]

Rate Limiting

# Per-host rate limiter with adaptive backoff
limiter = Archaeo::RateLimiter.new(min_interval: 0.5)
limiter.wait(host: "web.archive.org")  # sleeps if needed
limiter.wait(host: "api.example.com")  # independent per-host tracking

Color Output

color = Archaeo::ColorOutput.new(enabled: true)

color.success("Done!")    # green + bold
color.warning("Careful")  # yellow + bold
color.error("Failed!")    # red + bold
color.info("Info")        # cyan

# Auto-detects from TTY, NO_COLOR env, TERM=dumb
color = Archaeo::ColorOutput.new  # enabled: auto-detected

Command-Line Interface

# Show version
archaeo --version

# List snapshots (table, json, or csv format)
archaeo snapshots example.com
archaeo snapshots --format json example.com
archaeo snapshots --format csv --from 20220101 --to 20221231 example.com
archaeo snapshots --filter-status 200 --filter-type text/html example.com

# Find closest snapshot
archaeo near example.com 20220101
archaeo near --format json example.com 20220101

# Find oldest/newest
archaeo oldest example.com
archaeo newest --format json example.com

# Find before/after a timestamp
archaeo before example.com 20220101
archaeo after example.com 20220101

# List snapshots in a date range
archaeo between example.com 20220101 20221231

# Count snapshots
archaeo count example.com

# Check availability (with optional timestamp)
archaeo available example.com
archaeo available --timestamp 20220101 example.com

# Save a URL
archaeo save https://example.com/

# Fetch archived content
archaeo fetch https://example.com/ 20220615120000

# Fetch and save to file
archaeo fetch --output page.html https://example.com/ 20220615120000

# Fetch raw (identity) content
archaeo fetch --identity https://example.com/ 20220615120000

# Fetch a page and list its extracted assets
archaeo fetch-assets https://example.com/ 20220615120000
archaeo fetch-assets --format json https://example.com/ 20220615120000

# Rewrite archive URLs to local paths
archaeo rewrite https://example.com/ 20220615120000
archaeo rewrite --output page.html --prefix local https://example.com/ 20220615120000

# Compare assets between two snapshots
archaeo diff https://example.com/ 20220101 20220615
archaeo diff --format json https://example.com/ 20220101 20220615

# Audit assets for an archived page
archaeo asset-audit https://example.com/ 20220615120000
archaeo asset-audit --format json https://example.com/ 20220615120000

# Download all snapshots
archaeo download example.com --output ./archive

# Dry run (preview without fetching)
archaeo download --dry-run example.com

# Parallel downloads
archaeo download --concurrency 4 example.com --output ./archive

# Resume interrupted download
archaeo download example.com --resume

# Download with page requisites (linked assets)
archaeo download --page-requisites example.com

# Point-in-time composite snapshot
archaeo download --snapshot-at 20220615 example.com

# All timestamps (not just latest)
archaeo download --all-timestamps example.com

# URL pattern filtering
archaeo download --only '.*\.html$' --exclude '/api/' example.com

# Download scheduling
archaeo download --strategy newest_first --max-snapshots 10 example.com

# Reset download state
archaeo download --reset example.com

# Rate limiting
archaeo download --rate-limit 0.5 example.com

# Recursive subdomain discovery
archaeo download --recursive-subdomains --subdomain-depth 2 example.com

# Suppress progress messages
archaeo --quiet download example.com

# Disable colored output
archaeo --no-color download example.com

# Discover all known URLs for a domain
archaeo known_urls example.com
archaeo known_urls --file urls.txt example.com
archaeo known_urls --subdomain example.com

# Check archive health
archaeo health example.com
archaeo health --from 20220101 --to 20221231 --sample 50 example.com

# Analyze archive coverage
archaeo coverage example.com
archaeo coverage --from 20220101 --to 20221231 --format json example.com

# Compare two snapshots
archaeo snapshot-diff example.com 20220101 20220615
archaeo snapshot-diff --format json example.com 20220101 20220615

# Search archived content
archaeo search example.com "contact us"
archaeo search --from 20220101 --to 20221231 --max-results 10 example.com "about"

# Track content changes over time
archaeo track-changes example.com
archaeo track-changes --from 20220101 --to 20221231 --format json example.com

# Export to WARC format
archaeo warc-export --output archive.warc example.com
archaeo warc-export --output archive.warc.gz --gzip example.com

# Save a URL and show the response headers
archaeo save --headers https://example.com/

Error Handling

# Blocked site (robots.txt)
Archaeo::BlockedSiteError

# No snapshot found
Archaeo::NoSnapshotFound

# Rate limited by Wayback Machine
Archaeo::RateLimitError

# Maximum retries exceeded
Archaeo::MaximumRetriesExceeded

# Save request failed (e.g., SavePageNow session limit)
Archaeo::SaveFailed

# Content digest mismatch
Archaeo::IntegrityError

# HTTP error during fetch (includes .page, .url, .status_code)
Archaeo::FetchError
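
A typical handling pattern (a sketch using the classes above; whether a given call raises NoSnapshotFound depends on the operation):

begin
  page = fetcher.fetch!("https://example.com/", timestamp: "20220615000000")
rescue Archaeo::NoSnapshotFound
  warn "nothing archived for that URL"
rescue Archaeo::RateLimitError
  sleep 60   # back off, then try once more
  retry
rescue Archaeo::FetchError => e
  warn "fetch failed: #{e.status_code} #{e.url}"
end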

Architecture

Archaeo follows a model-driven, OOP design:

Layer: Models
Classes: Timestamp, ArchiveUrl, Snapshot, Page, PageBundle, SaveResult, AvailabilityResult, CdxTimeline, ProgressReport, CoverageReport, ContentChangeReport, SearchResult, WarcRecord, HealthReport
Purpose: Domain value objects with to_h, as_json, and inspect support

Layer: URL Processing
Classes: UrlNormalizer, CdxFilter, UrlRewriter, PatternFilter, PathSanitizer
Purpose: URL sanitization, validated filtering, regex include/exclude, path conflict resolution, and HTML/JS/CSS URL rewriting

Layer: Asset Extraction
Classes: AssetExtractor, AssetList
Purpose: Parse HTML for resource URLs, including preload and modulepreload links

Layer: APIs
Classes: CdxApi, ParallelCdx, AvailabilityApi, SaveApi, ArchiveSearch
Purpose: Query and mutate the archive, parallel CDX fetching, and full-text search

Layer: Operations
Classes: Fetcher, BulkDownloader, DownloadState, DownloadScheduler, SubdomainDiscovery
Purpose: Download content with resume, scheduling strategies, subdomain discovery, and digest verification

Layer: Analysis
Classes: SnapshotDiff, CoverageAnalyzer, ContentTracker, ArchiveHealthCheck
Purpose: Compare snapshots, analyze coverage, track changes over time, and verify accessibility

Layer: Infrastructure
Classes: HttpClient, RateLimiter, EncodingDetector, CdxCache, Configuration, ColorOutput, WarcWriter, WarcReader
Purpose: HTTP transport, rate limiting, encoding detection, caching, config management, WARC I/O, and color output

All API classes accept an HttpClient via dependency injection for testability.
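
For example (a minimal sketch; the http_client: keyword argument is an assumption inferred from this design, so check each class's constructor):

client = Archaeo::HttpClient.new(
  on_request: ->(uri, elapsed, status, retries) { puts "#{status} #{uri}" },
)
cdx = Archaeo::CdxApi.new(http_client: client)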

Development

bundle install
bundle exec rspec
bundle exec rubocop

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/riboseinc/archaeo.

License

MIT License. See LICENSE for details.