Class: Archaeo::BulkDownloader

Inherits: Object

Defined in: lib/archaeo/bulk_downloader.rb

Overview

Downloads all archived snapshots of a URL with resume support.

Queries the CDX API for matching snapshots, fetches each page, and saves content to disk. Progress is tracked in a state file for interrupted download recovery.
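The resume behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `SketchState`, `run_sketch`, the `.state.json` file name, and the JSON layout are all hypothetical, since the real `DownloadState` class is not shown on this page.

```ruby
require "json"
require "set"
require "tmpdir"

# Hypothetical stand-in for a state file; the real DownloadState may differ.
class SketchState
  def initialize(dir)
    @path = File.join(dir, ".state.json") # assumed file name
    @done = File.exist?(@path) ? Set.new(JSON.parse(File.read(@path))) : Set.new
  end

  def done?(timestamp)
    @done.include?(timestamp)
  end

  # Record a finished snapshot and persist the set immediately,
  # so an interruption loses at most the item in flight.
  def mark_done(timestamp)
    @done << timestamp
    File.write(@path, JSON.generate(@done.to_a))
  end
end

# Pretend to download each snapshot, skipping those already recorded.
# `fail_at` simulates an interruption before the given timestamp.
def run_sketch(dir, timestamps, fail_at: nil)
  state = SketchState.new(dir)
  fetched = []
  timestamps.each do |ts|
    next if state.done?(ts)
    break if ts == fail_at
    fetched << ts
    state.mark_done(ts)
  end
  fetched
end

Dir.mktmpdir do |dir|
  p run_sketch(dir, %w[20200101 20210101 20220101], fail_at: "20210101")
  # prints ["20200101"]
  p run_sketch(dir, %w[20200101 20210101 20220101])
  # prints ["20210101", "20220101"]
end
```

The second run skips the snapshot that completed before the simulated interruption and picks up where the first run stopped.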

Instance Method Summary

#download(url, from: nil, to: nil, resume: false, &block) ⇒ Object
#initialize(client: HttpClient.new, output_dir: "archive", cdx_api: nil, concurrency: 1) ⇒ BulkDownloader constructor

A new instance of BulkDownloader.

Constructor Details

#initialize(client: HttpClient.new, output_dir: "archive", cdx_api: nil, concurrency: 1) ⇒ `BulkDownloader`

Returns a new instance of BulkDownloader.

# File 'lib/archaeo/bulk_downloader.rb', line 12

def initialize(client: HttpClient.new, output_dir: "archive",
               cdx_api: nil, concurrency: 1)
  @client = client
  @output_dir = output_dir
  @cdx_api = cdx_api
  @concurrency = [1, concurrency.to_i].max
end
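The constructor clamps `concurrency` to a minimum of 1 by way of `to_i`. The standalone helper below (`clamp_concurrency` is a name introduced here for illustration only) reproduces that one expression to show how different inputs are coerced.

```ruby
# Same expression as in the constructor, extracted for illustration.
def clamp_concurrency(value)
  [1, value.to_i].max
end

clamp_concurrency(4)     # => 4
clamp_concurrency("8")   # => 8  (numeric strings are converted)
clamp_concurrency(0)     # => 1
clamp_concurrency(-3)    # => 1
clamp_concurrency(nil)   # => 1  (nil.to_i is 0)
clamp_concurrency("abc") # => 1  (non-numeric strings convert to 0)
```

As a consequence, invalid or missing values silently fall back to sequential downloading rather than raising an error.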

Instance Method Details

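#download(url, from: nil, to: nil, resume: false, &block) ⇒ `Object`

Normalizes the URL, creates the output directory, fetches the matching snapshots, and downloads them sequentially when concurrency is 1 or concurrently otherwise. The optional block is passed along as a progress callback.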

# File 'lib/archaeo/bulk_downloader.rb', line 20

def download(url, from: nil, to: nil, resume: false, &block)
  url = UrlNormalizer.normalize(url)
  FileUtils.mkdir_p(@output_dir)
  state = DownloadState.new(@output_dir)

  snapshots = fetch_snapshots(url, from: from, to: to)
  total = snapshots.size
  progress = block

  if @concurrency == 1
    download_sequential(snapshots, total, state, resume, progress)
  else
    download_concurrent(snapshots, total, state, resume, progress)
  end
end
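The private `download_sequential` and `download_concurrent` methods are not shown on this page. As a rough sketch of how a concurrent path could fan work out over a fixed number of threads, the hypothetical helper below (`each_concurrently` is not part of the library) uses a closed `Queue` as the work list. It assumes the items are never `nil` or `false`.

```ruby
# Hypothetical sketch: run the block for each item on `concurrency` threads.
def each_concurrently(items, concurrency)
  work = Queue.new
  items.each { |item| work << item }
  work.close # pop returns nil once the closed queue is empty

  results = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (item = work.pop)
        results << yield(item)
      end
    end
  end
  workers.each(&:join)

  # Drain results; completion order is not guaranteed.
  Array.new(results.size) { results.pop }
end

saved = each_concurrently(%w[20200101 20210101 20220101], 2) do |ts|
  "saved #{ts}"
end
p saved.sort
# prints ["saved 20200101", "saved 20210101", "saved 20220101"]
```

Because results arrive in completion order, any caller that needs snapshot order would have to sort or index the results itself.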