Class: Relaton::W3c::DataFetcher

Inherits: Core::DataFetcher

Inheritance chain: Object, Core::DataFetcher, Relaton::W3c::DataFetcher

Includes: SafeRealize

Defined in: lib/relaton/w3c/data_fetcher.rb

Defined Under Namespace

Classes: CrawlIncompleteError

Constant Summary

DEFAULT_CONCURRENCY = Conservative default. Too many parallel workers issue the per-spec version-history requests in bursts fast enough to trip the W3C API rate limiter (HTTP 429). Those 429s silently truncated the dataset before the crawl learned to abort on incomplete pagination. Raise the value via RELATON_W3C_FETCH_CONCURRENCY for a faster or shallower run; lower it further if 429s still appear.

PAGE_FETCH_ATTEMPTS = How many times #fetch_specifications_page retries a transient failure (rate-limit/connection) before giving up and aborting the crawl.

Class Method Summary

.concurrency ⇒ Object

Number of fetch_spec worker threads.
.fetch_versions? ⇒ Boolean

Whether to crawl each specification’s version history (version_history, predecessor_versions, successor_versions).

Instance Method Summary

#client ⇒ Object
#enqueue_specs(queue) ⇒ Object

Page through the specifications index, feeding each spec (paired with its embedded page) to the worker queue.
#fetch(_source = nil) ⇒ Object

Parse documents in parallel.
#fetch_spec(unrealized_spec, page = nil) ⇒ Object
#fetch_versions(spec) ⇒ Object

Crawl a specification’s version history: its dated editions plus the predecessor/successor version chains.
#file_name(id) ⇒ String

Generate file name.
#guard_complete_pagination(last_page, expected_pages) ⇒ Object

Defense in depth: even when no page fetch raised, make sure pagination actually reached the last page the API advertised.
#index ⇒ Object
#initialize(*args) ⇒ DataFetcher constructor

A new instance of DataFetcher.
#log_error(msg) ⇒ Object
#save_doc(bib, warn_duplicate: true) ⇒ Object

Save document to file.
#to_bibxml(bib) ⇒ Object
#to_xml(bib) ⇒ Object
#to_yaml(bib) ⇒ Object

Methods included from SafeRealize

#realize, skipped

Constructor Details

#initialize(*args) ⇒ `DataFetcher`

Returns a new instance of DataFetcher.

# File 'lib/relaton/w3c/data_fetcher.rb', line 49

def initialize(*args)
  super
  @mutex = Mutex.new
  @interrupted = false
end

Class Method Details

.concurrency ⇒ `Object`

Number of fetch_spec worker threads. Tunable via env var so CI or local runs can dial it up for speed or down to lighten load on api.w3.org (or for debugging).



# File 'lib/relaton/w3c/data_fetcher.rb', line 33

def self.concurrency
  (ENV["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
end
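The environment lookup above can be exercised in isolation. The following is a minimal sketch, not the library's code: `concurrency_from` is a hypothetical helper, it reads from a plain hash instead of `ENV`, and the default of 4 is an assumption because this page does not show the actual value of DEFAULT_CONCURRENCY.

```ruby
# Hypothetical helper mirroring the lookup in .concurrency.
# `default: 4` is an assumed value, not the library's DEFAULT_CONCURRENCY.
def concurrency_from(env, default: 4)
  (env["RELATON_W3C_FETCH_CONCURRENCY"] || default).to_i
end

concurrency_from({})                                           # => 4
concurrency_from({ "RELATON_W3C_FETCH_CONCURRENCY" => "8" })   # => 8
# String#to_i returns 0 for non-numeric input, so a typo yields 0 workers.
concurrency_from({ "RELATON_W3C_FETCH_CONCURRENCY" => "abc" }) # => 0
```

Because an empty string is truthy in Ruby, an empty variable would also bypass the default and convert to 0, so it seems safest to either unset the variable or give it a positive integer.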

.fetch_versions? ⇒ `Boolean`

Whether to crawl each specification’s version history (version_history, predecessor_versions, successor_versions). Enabled by default for a complete dataset. Set RELATON_W3C_FETCH_VERSIONS=false for a faster, shallower crawl that emits only the top-level specifications and skips the per-spec version fan-out (the bulk of the API requests).

Returns:

(Boolean)

# File 'lib/relaton/w3c/data_fetcher.rb', line 42

def self.fetch_versions?
  val = ENV["RELATON_W3C_FETCH_VERSIONS"]
  return true if val.nil? || val.empty?

  !%w[0 false no off].include?(val.strip.downcase)
end
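To see which values turn the version crawl off, the same parsing rule can be applied to a plain string. This is a sketch under the assumption that the logic matches the listing above; `versions_enabled?` is a hypothetical name and takes the raw value as an argument instead of reading `ENV`.

```ruby
# Hypothetical standalone copy of the flag parsing in .fetch_versions?.
def versions_enabled?(val)
  # Unset or empty means "use the default", which is enabled.
  return true if val.nil? || val.empty?

  # Only these four spellings (case-insensitive, surrounding spaces ignored) disable it.
  !%w[0 false no off].include?(val.strip.downcase)
end

versions_enabled?(nil)       # => true
versions_enabled?(" FALSE ") # => false
versions_enabled?("off")     # => false
versions_enabled?("yes")     # => true
```

Any value outside the four "off" spellings, including unrecognized ones, leaves the version crawl enabled.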

Instance Method Details

#client ⇒ `Object`



# File 'lib/relaton/w3c/data_fetcher.rb', line 63

def client
  @client ||= W3cApi::Client.new
end

#enqueue_specs(queue) ⇒ `Object`

Page through the specifications index, feeding each spec (paired with its embedded page) to the worker queue. Returns early when interrupted.

embed: true inlines each specification’s full payload into the index page’s `_embedded` block, so a spec link realizes from that page in memory instead of making its own HTTP request: one request per page rather than one per specification. The page is queued alongside each link so the worker can hand it back to realize as the parent_resource.

# File 'lib/relaton/w3c/data_fetcher.rb', line 109

def enqueue_specs(queue)
  specs = client.specifications(embed: true)
  expected_pages = specs.pages
  last_page = nil
  loop do
    page = specs
    page.links.specifications.each do |spec|
      break if @interrupted

      queue << [spec, page]
    end
    break if @interrupted

    last_page = page.page
    break unless page.next?

    # Fetch the next page through the client's fetch path rather than
    # realizing the `next` link: only fetch populates the page's
    # embedded_data, so this keeps embed working past page 1. Realizing
    # the `next` link drops `_embedded` and forces a per-spec HTTP
    # request for every specification on every later page.
    next_page = fetch_specifications_page(page.page + 1)
    # A nil here means the page fetch failed after retries (not the end
    # of the list — that is `!page.next?` above). Aborting rather than
    # `break`ing prevents a rate-limit blip from silently truncating the
    # dataset: a partial crawl must never be saved/committed.
    unless next_page
      raise CrawlIncompleteError,
            "specifications pagination stopped at page #{page.page}: " \
            "failed to fetch page #{page.page + 1}"
    end

    specs = next_page
  end

  return if @interrupted

  guard_complete_pagination(last_page, expected_pages)
end
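The distinction between "reached the end of the list" and "failed to fetch the next page" can be illustrated without the W3C client. The sketch below is a simplified model and not the library's implementation: `crawl_pages`, `PaginationAborted`, and the hash of pages are hypothetical, and a missing hash key stands in for a page fetch that returned nil after retries.

```ruby
# Hypothetical stand-in for CrawlIncompleteError.
class PaginationAborted < StandardError; end

# pages: Hash of page number => items. A missing key simulates a failed fetch.
# total_pages: the page count the index advertised.
def crawl_pages(pages, total_pages)
  collected = []
  current = 1
  loop do
    collected.concat(pages.fetch(current))
    break if current >= total_pages # normal end of the list

    # A missing next page is a failure, not the end: abort instead of breaking.
    unless pages.key?(current + 1)
      raise PaginationAborted,
            "pagination stopped at page #{current}: failed to fetch page #{current + 1}"
    end
    current += 1
  end
  collected
end

crawl_pages({ 1 => [:a], 2 => [:b] }, 2) # => [:a, :b]
# crawl_pages({ 1 => [:a] }, 2) raises PaginationAborted
```

Raising here, rather than leaving the loop quietly, is what keeps a partial result from being treated as a complete one by the caller.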

#fetch(_source = nil) ⇒ `Object`

Parse documents in parallel. The crawler is heavily I/O-bound on api.w3.org round-trips (~30-50k requests per run), so a small thread pool gives a near-linear speedup. Pagination still happens serially: each page’s `next?` flag gates whether the next page is requested.

A SIGINT (Ctrl-C) is handled gracefully: the producer stops queuing and the workers stop processing after their in-flight spec, then the index of everything fetched so far is saved rather than the run being lost.

# File 'lib/relaton/w3c/data_fetcher.rb', line 77

def fetch(_source = nil)
  n_workers = self.class.concurrency
  queue = SizedQueue.new(n_workers * 4)
  workers = Array.new(n_workers) { spawn_worker(queue) }

  with_interrupt_handler do
    # The poison pills + join run in `ensure` so an exception raised while
    # enqueuing (e.g. CrawlIncompleteError) still unblocks the producer
    # and drains the workers instead of deadlocking on queue.pop.
    begin
      enqueue_specs(queue)
    ensure
      n_workers.times { queue << nil } # poison pills
      workers.each(&:join)
    end
    Util.warn "Crawl interrupted — saving progress collected so far." if @interrupted
    index.save
  end

  report_errors
end
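The producer/worker shape used by #fetch can be reproduced with the standard library alone. The following is a minimal sketch, assuming workers that simply double each item; `run_pool` is a hypothetical name, and the private `spawn_worker` and `with_interrupt_handler` helpers, which this page does not show, are not modelled.

```ruby
# Runs items through a small worker pool and returns the doubled values, sorted.
def run_pool(items, n_workers: 2)
  queue = SizedQueue.new(n_workers * 4) # bounded: the producer blocks when workers lag
  results = Queue.new                   # thread-safe collector
  workers = Array.new(n_workers) do
    Thread.new do
      # A nil popped from the queue is the poison pill that ends this worker.
      while (item = queue.pop)
        results << item * 2
      end
    end
  end

  begin
    items.each { |item| queue << item }
  ensure
    # Runs even if the producer raised, so no worker stays blocked on queue.pop.
    n_workers.times { queue << nil }
    workers.each(&:join)
  end

  Array.new(results.size) { results.pop }.sort
end

run_pool([3, 1, 2]) # => [2, 4, 6]
```

Placing the poison pills and the join in `ensure` means an exception from the producer still propagates to the caller, but only after every worker thread has finished.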

#fetch_spec(unrealized_spec, page = nil) ⇒ `Object`

# File 'lib/relaton/w3c/data_fetcher.rb', line 162

def fetch_spec(unrealized_spec, page = nil)
  # When `page` came from an embed:true fetch, realizing against it as the
  # parent_resource serves the spec from embedded data (no HTTP request).
  spec = realize(unrealized_spec, parent_resource: page)
  return unless spec

  local_errors = Hash.new(true)
  save_doc DataParser.parse(spec, local_errors)

  fetch_versions(spec) if self.class.fetch_versions?

  @mutex.synchronize { local_errors.each { |k, v| @errors[k] &&= v } }
end

#fetch_versions(spec) ⇒ `Object`

Crawl a specification’s version history: its dated editions plus the predecessor/successor version chains. Each entry is a separate HTTP request, so this is the bulk of a run and can be skipped via RELATON_W3C_FETCH_VERSIONS=false (see .fetch_versions?).

# File 'lib/relaton/w3c/data_fetcher.rb', line 182

def fetch_versions(spec)
  if spec.links.respond_to?(:version_history) && spec.links.version_history
    version_history = realize spec.links.version_history
    version_history&.links&.spec_versions&.each { |version| parse_and_save version }
  end

  if spec.links.respond_to?(:predecessor_versions) && spec.links.predecessor_versions
    predecessor_versions = realize spec.links.predecessor_versions
    predecessor_versions&.links&.predecessor_versions&.each { |version| parse_and_save version }
  end

  return unless spec.links.respond_to?(:successor_versions) && spec.links.successor_versions

  successor_versions = realize spec.links.successor_versions
  successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
end

#file_name(id) ⇒ `String`

Generate file name

Parameters:

id (String): document id

Returns:

(String): file name

# File 'lib/relaton/w3c/data_fetcher.rb', line 239

def file_name(id)
  name = id.sub(/^W3C\s/, "").gsub(/[\s,:\/+]/, "_").squeeze("_").downcase
  File.join @output, "#{name}.#{@ext}"
end
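The substitution chain is easier to follow on concrete input. This sketch copies the expression from the listing into a hypothetical `sample_file_name` helper; the `output` and `ext` defaults replace the `@output` and `@ext` instance variables, whose real values are not shown on this page, and the document ids are made up for illustration.

```ruby
# Hypothetical standalone version of #file_name with assumed output dir and extension.
def sample_file_name(id, output: "data", ext: "yaml")
  # Drop the "W3C " prefix, map separators to "_", collapse runs of "_", lowercase.
  name = id.sub(/^W3C\s/, "").gsub(/[\s,:\/+]/, "_").squeeze("_").downcase
  File.join output, "#{name}.#{ext}"
end

sample_file_name("W3C REC-xml-19980210") # => "data/rec-xml-19980210.yaml"
sample_file_name("W3C CSS 2.1: Part 1")  # => "data/css_2.1_part_1.yaml"
```

Hyphens and dots pass through unchanged; only whitespace, commas, colons, slashes, and plus signs become underscores.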

#guard_complete_pagination(last_page, expected_pages) ⇒ `Object`

Defense in depth: even when no page fetch raised, make sure pagination actually reached the last page the API advertised. Catches truncation modes other than a failed fetch (e.g. a `next` link that goes missing). Only enforced when the index reported a positive page count.

Raises:

(CrawlIncompleteError)

# File 'lib/relaton/w3c/data_fetcher.rb', line 153

def guard_complete_pagination(last_page, expected_pages)
  return unless expected_pages.is_a?(Integer) && expected_pages.positive?
  return unless last_page.is_a?(Integer) && last_page < expected_pages

  raise CrawlIncompleteError,
        "specifications pagination ended at page #{last_page} of " \
        "#{expected_pages}; refusing to save a partial dataset"
end
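The two guard clauses determine when the check applies at all. The sketch below repeats them in a hypothetical `check_pagination` method with a stand-in error class, `IncompleteCrawl`, so the cases can be tried without the library loaded.

```ruby
# Hypothetical stand-in for CrawlIncompleteError.
class IncompleteCrawl < StandardError; end

def check_pagination(last_page, expected_pages)
  # Not enforced unless the index advertised a positive integer page count.
  return unless expected_pages.is_a?(Integer) && expected_pages.positive?
  # Not enforced unless a page number was recorded and it falls short.
  return unless last_page.is_a?(Integer) && last_page < expected_pages

  raise IncompleteCrawl,
        "specifications pagination ended at page #{last_page} of #{expected_pages}"
end

check_pagination(10, 10)  # => nil, the crawl reached the advertised last page
check_pagination(3, 0)    # => nil, no positive page count to compare against
check_pagination(nil, 10) # => nil, no page number was recorded
# check_pagination(7, 10) raises IncompleteCrawl
```

Note that a nil `last_page` passes silently under these clauses, so the guard only catches truncation after at least one page number has been recorded.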

#index ⇒ `Object`



# File 'lib/relaton/w3c/data_fetcher.rb', line 55

def index
  @index ||= Relaton::Index.find_or_create(:W3C, file: "#{INDEXFILE}.yaml")
end

#log_error(msg) ⇒ `Object`



# File 'lib/relaton/w3c/data_fetcher.rb', line 59

def log_error(msg)
  Util.error msg
end

#save_doc(bib, warn_duplicate: true) ⇒ `Object`

Save document to file

Parameters:

bib (Relaton::W3c::ItemData, nil): bibliographic item

# File 'lib/relaton/w3c/data_fetcher.rb', line 204

def save_doc(bib, warn_duplicate: true)
  return unless bib

  file = file_name(bib.docnumber)
  @mutex.synchronize do
    if @files.include?(file)
      Util.warn "File #{file} already exists. Document: #{bib.docnumber}" if warn_duplicate
    else
      pubid = PubId.parse bib.docnumber
      index.add_or_update pubid.to_hash, file
      @files << file
    end
    File.write file, serialize(bib), encoding: "UTF-8"
  end
end
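One detail of the listing above is that the write sits outside the if/else, so a duplicate is still written; only the index entry and the bookkeeping are skipped. The following in-memory sketch illustrates that behavior. `DocStore` is hypothetical, it records writes in an array instead of calling File.write, and the use of a Set for the file list is an assumption, since the type of `@files` is not shown on this page.

```ruby
require "set"

# Hypothetical in-memory model of the duplicate handling in #save_doc.
class DocStore
  attr_reader :index, :written

  def initialize
    @mutex = Mutex.new
    @files = Set.new # assumed container; the real @files type is not shown
    @index = {}
    @written = []
  end

  def save(id, body, warn_duplicate: true)
    file = "#{id.downcase}.yaml"
    @mutex.synchronize do
      if @files.include?(file)
        warn "File #{file} already exists. Document: #{id}" if warn_duplicate
      else
        @index[id] = file
        @files << file
      end
      # Like File.write in the listing, this runs for duplicates too.
      @written << [file, body]
    end
  end
end

store = DocStore.new
store.save("REC-xml", "first")
store.save("REC-xml", "second", warn_duplicate: false)
store.index.size   # => 1
store.written.size # => 2
```

Holding the mutex across both the check and the write keeps two worker threads from registering the same file name at once.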

#to_bibxml(bib) ⇒ `Object`



# File 'lib/relaton/w3c/data_fetcher.rb', line 228

def to_bibxml(bib)
  bib.to_xml
end

#to_xml(bib) ⇒ `Object`



# File 'lib/relaton/w3c/data_fetcher.rb', line 220

def to_xml(bib)
  bib.to_xml(bibdata: true)
end

#to_yaml(bib) ⇒ `Object`



# File 'lib/relaton/w3c/data_fetcher.rb', line 224

def to_yaml(bib)
  bib.to_yaml
end
