Class: Relaton::W3c::DataFetcher

Inherits: Core::DataFetcher

Object < Core::DataFetcher < Relaton::W3c::DataFetcher

Includes: SafeRealize

Defined in: lib/relaton/w3c/data_fetcher.rb

Constant Summary

DEFAULT_CONCURRENCY =

Class Method Summary

.concurrency ⇒ Object

Number of fetch_spec worker threads.
.fetch_versions? ⇒ Boolean

Whether to crawl each specification’s version history (version_history, predecessor_versions, successor_versions).

Instance Method Summary

#client ⇒ Object
#enqueue_specs(queue) ⇒ Object

Page through the specifications index, feeding each spec (paired with its embedded page) to the worker queue.
#fetch(_source = nil) ⇒ Object

Parse documents in parallel.
#fetch_spec(unrealized_spec, page = nil) ⇒ Object
#fetch_versions(spec) ⇒ Object

Crawl a specification’s version history: its dated editions plus the predecessor/successor version chains.
#file_name(id) ⇒ String

Generate file name.
#index ⇒ Object
#initialize(*args) ⇒ DataFetcher constructor

A new instance of DataFetcher.
#log_error(msg) ⇒ Object
#save_doc(bib, warn_duplicate: true) ⇒ Object

Save document to file.
#to_bibxml(bib) ⇒ Object
#to_xml(bib) ⇒ Object
#to_yaml(bib) ⇒ Object

Methods included from SafeRealize

#realize, skipped

Constructor Details

#initialize(*args) ⇒ `DataFetcher`

Returns a new instance of DataFetcher.

# File 'lib/relaton/w3c/data_fetcher.rb', line 34

def initialize(*args)
  super
  @mutex = Mutex.new
  @interrupted = false
end

Class Method Details

.concurrency ⇒ `Object`

Number of fetch_spec worker threads. Tunable via env var so CI or local runs can dial it down (e.g. for debugging or to lighten load on api.w3.org).




# File 'lib/relaton/w3c/data_fetcher.rb', line 18

def self.concurrency
  (ENV["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
end
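A minimal sketch of how the override resolves, using the same expression as the method above. The value `4` for `DEFAULT_CONCURRENCY` is a placeholder assumption (the real value is not shown on this page), and the environment is passed in as a plain hash so the snippet runs in isolation.

```ruby
# Placeholder: the actual DEFAULT_CONCURRENCY value is not shown on this page.
DEFAULT_CONCURRENCY = 4

# Same expression as .concurrency, with the environment injected for illustration.
def concurrency(env = ENV)
  (env["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
end

unset    = concurrency({})                                            # variable absent: default applies
override = concurrency({ "RELATON_W3C_FETCH_CONCURRENCY" => "2" })    # numeric string: parsed as-is
garbage  = concurrency({ "RELATON_W3C_FETCH_CONCURRENCY" => "many" }) # String#to_i gives 0 for non-numeric text
blank    = concurrency({ "RELATON_W3C_FETCH_CONCURRENCY" => "" })     # "" is truthy in Ruby, so no fallback

puts [unset, override, garbage, blank].inspect
```

Because an empty or non-numeric value parses to `0` rather than falling back to the default, it is probably safest to either leave the variable unset or set it to a positive integer.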

.fetch_versions? ⇒ `Boolean`

Whether to crawl each specification’s version history (version_history, predecessor_versions, successor_versions). Enabled by default for a complete dataset. Set RELATON_W3C_FETCH_VERSIONS=false for a faster, shallower crawl that emits only the top-level specifications and skips the per-spec version fan-out (the bulk of the API requests).

Returns:

(Boolean)

# File 'lib/relaton/w3c/data_fetcher.rb', line 27

def self.fetch_versions?
  val = ENV["RELATON_W3C_FETCH_VERSIONS"]
  return true if val.nil? || val.empty?

  !%w[0 false no off].include?(val.strip.downcase)
end
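To make the accepted values concrete, the sketch below applies the same rules to a handful of inputs. The value is passed as an argument instead of being read from `ENV`; that is an illustrative change so the logic can be run standalone.

```ruby
# Same logic as .fetch_versions?, with the raw value passed in for illustration.
def fetch_versions?(val)
  return true if val.nil? || val.empty?

  !%w[0 false no off].include?(val.strip.downcase)
end

samples = [nil, "", "false", " OFF ", "0", "no", "true", "1", "disabled"]
results = samples.to_h { |v| [v, fetch_versions?(v)] }

results.each { |value, enabled| puts "#{value.inspect} => #{enabled}" }
```

Only `0`, `false`, `no` and `off` (case-insensitive, surrounding whitespace ignored) turn the crawl off. Any other value, including an unrecognized one such as `disabled`, leaves it enabled.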

Instance Method Details

#client ⇒ `Object`




# File 'lib/relaton/w3c/data_fetcher.rb', line 48

def client
  @client ||= W3cApi::Client.new
end

#enqueue_specs(queue) ⇒ `Object`

Page through the specifications index, feeding each spec (paired with its embedded page) to the worker queue. Returns early when interrupted.

embed: true inlines each specification’s full payload into the index page’s `_embedded` block, so a spec link realizes from that page in memory instead of making its own HTTP request. The cost is one request per page rather than one per specification. The page is queued alongside each link so the worker can hand it back to realize as the parent_resource.

# File 'lib/relaton/w3c/data_fetcher.rb', line 88

def enqueue_specs(queue)
  specs = client.specifications(embed: true)
  loop do
    page = specs
    page.links.specifications.each do |spec|
      break if @interrupted

      queue << [spec, page]
    end
    break if @interrupted || !page.next?

    # Fetch the next page through the client's fetch path rather than
    # realizing the `next` link: only fetch populates the page's
    # embedded_data, so this keeps embed working past page 1. Realizing
    # the `next` link drops `_embedded` and forces a per-spec HTTP
    # request for every specification on every later page.
    next_page = fetch_specifications_page(page.page + 1)
    break unless next_page

    specs = next_page
  end
end
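The paging loop can be tried without the network by substituting stub pages. In the sketch below, `Page`, `PAGES` and `fetch_page` are hypothetical stand-ins for the W3cApi page objects and for `fetch_specifications_page` (whose implementation is not shown here); only the control flow mirrors the method above, and the interrupt check is omitted.

```ruby
# Hypothetical stand-in for an index page: its number, its spec links,
# and whether another page follows.
Page = Struct.new(:page, :specs, :has_next) do
  def next?
    has_next
  end
end

PAGES = {
  1 => Page.new(1, %w[a b], true),
  2 => Page.new(2, %w[c d], true),
  3 => Page.new(3, %w[e], false)
}.freeze

# Stand-in for fetch_specifications_page: returns nil for an unknown page.
def fetch_page(number)
  PAGES[number]
end

def enqueue(queue)
  page = fetch_page(1)
  loop do
    # Queue each spec together with the page that embeds it.
    page.specs.each { |spec| queue << [spec, page] }
    break unless page.next?

    page = fetch_page(page.page + 1)
    break unless page
  end
end

queue = Queue.new
enqueue(queue)

queued = []
queued << queue.pop until queue.empty?
puts queued.map { |spec, page| "#{spec}@#{page.page}" }.join(" ")
```

Each queued entry keeps a reference to its own page, which is what lets a worker later resolve the spec from that page's embedded data.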

#fetch(_source = nil) ⇒ `Object`

Parse documents in parallel. The crawler is heavily I/O-bound on api.w3.org round-trips (~30-50k requests per run), so a small thread pool gives a near-linear speedup. Pagination still happens serially: each page’s `next?` flag gates whether the next page is requested.

A SIGINT (Ctrl-C) is handled gracefully: the producer stops queuing and the workers stop processing after their in-flight spec, then the index of everything fetched so far is saved rather than the run being lost.

# File 'lib/relaton/w3c/data_fetcher.rb', line 62

def fetch(_source = nil)
  n_workers = self.class.concurrency
  queue = SizedQueue.new(n_workers * 4)
  workers = Array.new(n_workers) { spawn_worker(queue) }

  with_interrupt_handler do
    enqueue_specs(queue)
    n_workers.times { queue << nil } # poison pills
    workers.each(&:join)
    Util.warn "Crawl interrupted — saving progress collected so far." if @interrupted
    index.save
  end

  report_errors
end
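The producer/worker arrangement above follows a common bounded-queue pattern. The sketch below shows that pattern in isolation; it is not the library's `spawn_worker` (which is not shown on this page), and the doubling step is an arbitrary stand-in for `fetch_spec`.

```ruby
n_workers = 3
queue   = SizedQueue.new(n_workers * 4) # bounded: the producer blocks when it is full
results = Queue.new                     # thread-safe collector for the demo

workers = Array.new(n_workers) do
  Thread.new do
    # Pop until a nil arrives; nil is the poison pill that ends this worker.
    while (item = queue.pop)
      results << item * 2 # stand-in for the real per-spec work
    end
  end
end

(1..20).each { |i| queue << i }   # producer
n_workers.times { queue << nil }  # one poison pill per worker
workers.each(&:join)

processed = []
processed << results.pop until results.empty?
puts processed.sort.inspect
```

One pill per worker is needed because each worker consumes exactly one `nil` before exiting. The bounded queue keeps the producer from running far ahead of the workers, which limits how many pages are held in memory at once.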

#fetch_spec(unrealized_spec, page = nil) ⇒ `Object`

# File 'lib/relaton/w3c/data_fetcher.rb', line 111

def fetch_spec(unrealized_spec, page = nil)
  # When `page` came from an embed:true fetch, realizing against it as the
  # parent_resource serves the spec from embedded data (no HTTP request).
  spec = realize(unrealized_spec, parent_resource: page)
  return unless spec

  local_errors = Hash.new(true)
  save_doc DataParser.parse(spec, local_errors)

  fetch_versions(spec) if self.class.fetch_versions?

  @mutex.synchronize { local_errors.each { |k, v| @errors[k] &&= v } }
end

#fetch_versions(spec) ⇒ `Object`

Crawl a specification’s version history: its dated editions plus the predecessor/successor version chains. Each entry is a separate HTTP request, so this is the bulk of a run and can be skipped via RELATON_W3C_FETCH_VERSIONS=false (see .fetch_versions?).

# File 'lib/relaton/w3c/data_fetcher.rb', line 131

def fetch_versions(spec)
  if spec.links.respond_to?(:version_history) && spec.links.version_history
    version_history = realize spec.links.version_history
    version_history&.links&.spec_versions&.each { |version| parse_and_save version }
  end

  if spec.links.respond_to?(:predecessor_versions) && spec.links.predecessor_versions
    predecessor_versions = realize spec.links.predecessor_versions
    predecessor_versions&.links&.predecessor_versions&.each { |version| parse_and_save version }
  end

  return unless spec.links.respond_to?(:successor_versions) && spec.links.successor_versions

  successor_versions = realize spec.links.successor_versions
  successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
end

#file_name(id) ⇒ `String`

Generate file name

Parameters:

id (String): document id

Returns:

(String): file name

# File 'lib/relaton/w3c/data_fetcher.rb', line 188

def file_name(id)
  name = id.sub(/^W3C\s/, "").gsub(/[\s,:\/+]/, "_").squeeze("_").downcase
  File.join @output, "#{name}.#{@ext}"
end
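As an illustration of the normalization, the sketch below runs the same string pipeline on two made-up identifiers. The `output` and `ext` keyword arguments replace the `@output` and `@ext` instance variables, and their defaults (`data`, `yaml`) are assumptions chosen for the example.

```ruby
# Same transformation as #file_name; output directory and extension are
# passed in (assumed values) instead of read from instance variables.
def file_name(id, output: "data", ext: "yaml")
  name = id.sub(/^W3C\s/, "").gsub(/[\s,:\/+]/, "_").squeeze("_").downcase
  File.join output, "#{name}.#{ext}"
end

plain = file_name("W3C REC-xml-20081126")   # hypothetical id
messy = file_name("W3C NOTE foo: bar/baz")  # hypothetical id with separators

puts plain # data/rec-xml-20081126.yaml
puts messy # data/note_foo_bar_baz.yaml
```

The leading `W3C ` prefix is dropped, whitespace and the characters `,`, `:`, `/` and `+` become underscores, runs of underscores collapse to one, and the result is lowercased.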

#index ⇒ `Object`




# File 'lib/relaton/w3c/data_fetcher.rb', line 40

def index
  @index ||= Relaton::Index.find_or_create(:W3C, file: "#{INDEXFILE}.yaml")
end

#log_error(msg) ⇒ `Object`




# File 'lib/relaton/w3c/data_fetcher.rb', line 44

def log_error(msg)
  Util.error msg
end

#save_doc(bib, warn_duplicate: true) ⇒ `Object`

Save document to file

Parameters:

bib (Relaton::W3c::ItemData, nil): bibliographic item

# File 'lib/relaton/w3c/data_fetcher.rb', line 153

def save_doc(bib, warn_duplicate: true)
  return unless bib

  file = file_name(bib.docnumber)
  @mutex.synchronize do
    if @files.include?(file)
      Util.warn "File #{file} already exists. Document: #{bib.docnumber}" if warn_duplicate
    else
      pubid = PubId.parse bib.docnumber
      index.add_or_update pubid.to_hash, file
      @files << file
    end
    File.write file, serialize(bib), encoding: "UTF-8"
  end
end

#to_bibxml(bib) ⇒ `Object`




# File 'lib/relaton/w3c/data_fetcher.rb', line 177

def to_bibxml(bib)
  bib.to_xml
end

#to_xml(bib) ⇒ `Object`




# File 'lib/relaton/w3c/data_fetcher.rb', line 169

def to_xml(bib)
  bib.to_xml(bibdata: true)
end

#to_yaml(bib) ⇒ `Object`




# File 'lib/relaton/w3c/data_fetcher.rb', line 173

def to_yaml(bib)
  bib.to_yaml
end
