Class: Relaton::W3c::DataFetcher
- Inherits: Core::DataFetcher
  - Object
  - Core::DataFetcher
  - Relaton::W3c::DataFetcher
- Includes: SafeRealize
- Defined in: lib/relaton/w3c/data_fetcher.rb
Defined Under Namespace
Classes: CrawlIncompleteError
Constant Summary
- DEFAULT_CONCURRENCY = 4
  Conservative default. Too many parallel workers burst the per-spec version-history requests fast enough to trip the W3C API rate limiter (HTTP 429 responses), which is what silently truncated the dataset before the crawl learned to abort on incomplete pagination. Raise it via the RELATON_W3C_FETCH_CONCURRENCY environment variable on a faster, shallower run; lower it further if 429s still appear.
- PAGE_FETCH_ATTEMPTS = 3
  How many times #fetch_specifications_page retries a transient failure (rate limit or connection error) before giving up and aborting the crawl.
Class Method Summary
- .concurrency ⇒ Object
  Number of fetch_spec worker threads.
- .fetch_versions? ⇒ Boolean
  Whether to crawl each specification’s version history (version_history, predecessor_versions, successor_versions).
Instance Method Summary
- #client ⇒ Object
- #enqueue_specs(queue) ⇒ Object
  Page through the specifications index, feeding each spec (paired with its embedded page) to the worker queue.
- #fetch(_source = nil) ⇒ Object
  Parse documents in parallel.
- #fetch_spec(unrealized_spec, page = nil) ⇒ Object
- #fetch_versions(spec) ⇒ Object
  Crawl a specification’s version history: its dated editions plus the predecessor/successor version chains.
- #file_name(id) ⇒ String
  Generate file name.
- #guard_complete_pagination(last_page, expected_pages) ⇒ Object
  Defense in depth: even when no page fetch raised, make sure pagination actually reached the last page the API advertised.
- #index ⇒ Object
- #initialize(*args) ⇒ DataFetcher (constructor)
  A new instance of DataFetcher.
- #log_error(msg) ⇒ Object
- #save_doc(bib, warn_duplicate: true) ⇒ Object
  Save document to file.
- #to_bibxml(bib) ⇒ Object
- #to_xml(bib) ⇒ Object
- #to_yaml(bib) ⇒ Object
Methods included from SafeRealize
Constructor Details
#initialize(*args) ⇒ DataFetcher
Returns a new instance of DataFetcher.
# File 'lib/relaton/w3c/data_fetcher.rb', line 49
def initialize(*args)
  super
  @mutex = Mutex.new
  @interrupted = false
end
Class Method Details
.concurrency ⇒ Object
Number of fetch_spec worker threads. Tunable via env var so CI or local runs can dial it up for speed or down to lighten load on api.w3.org (or for debugging).
# File 'lib/relaton/w3c/data_fetcher.rb', line 33
def self.concurrency
  (ENV["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
end
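The lookup can be tried in isolation. This is a minimal sketch, not the gem's code: `concurrency_from` is a hypothetical helper that takes the environment as a plain hash so the fallback is easy to see. One thing it makes visible is that `String#to_i` returns 0 for non-numeric text, so under this expression a mistyped value yields zero rather than the default.

```ruby
DEFAULT_CONCURRENCY = 4

# Hypothetical helper mirroring the expression used in .concurrency,
# with the environment passed in as a hash instead of reading ENV.
def concurrency_from(env)
  (env["RELATON_W3C_FETCH_CONCURRENCY"] || DEFAULT_CONCURRENCY).to_i
end

unset   = concurrency_from({})                                          # falls back to the default
raised  = concurrency_from("RELATON_W3C_FETCH_CONCURRENCY" => "8")      # parsed from the string
garbled = concurrency_from("RELATON_W3C_FETCH_CONCURRENCY" => "many")   # String#to_i gives 0

puts [unset, raised, garbled].inspect
```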
.fetch_versions? ⇒ Boolean
Whether to crawl each specification’s version history (version_history, predecessor_versions, successor_versions). Enabled by default for a complete dataset. Set RELATON_W3C_FETCH_VERSIONS=false for a faster, shallower crawl that emits only the top-level specifications and skips the per-spec version fan-out (the bulk of the API requests).
# File 'lib/relaton/w3c/data_fetcher.rb', line 42
def self.fetch_versions?
  val = ENV["RELATON_W3C_FETCH_VERSIONS"]
  return true if val.nil? || val.empty?

  !%w[0 false no off].include?(val.strip.downcase)
end
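To see which values switch the version crawl off, the same check can be run against a few inputs. A small sketch: `fetch_versions_enabled?` is a hypothetical standalone copy that takes the raw value as an argument instead of reading the environment.

```ruby
# Hypothetical standalone copy of the check in .fetch_versions?,
# taking the raw environment value as a parameter.
def fetch_versions_enabled?(val)
  return true if val.nil? || val.empty?

  !%w[0 false no off].include?(val.strip.downcase)
end

samples = [nil, "", "false", " OFF ", "0", "no", "yes", "1"]
samples.each do |val|
  puts "#{val.inspect} -> #{fetch_versions_enabled?(val)}"
end
```

Only the four listed words disable the crawl, compared case-insensitively and after trimming whitespace; anything else, including an unset or empty variable, leaves it enabled.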
Instance Method Details
#client ⇒ Object
63 64 65 |
# File 'lib/relaton/w3c/data_fetcher.rb', line 63 def client @client ||= W3cApi::Client.new end |
#enqueue_specs(queue) ⇒ Object
Page through the specifications index, feeding each spec (paired with its embedded page) to the worker queue. Returns early when interrupted.
embed: true inlines each specification’s full payload into the index page’s `_embedded` block, so a spec link realizes from that page in memory instead of making its own HTTP request: one request per page rather than one per specification. The page is queued alongside each link so the worker can hand it back to realize as the parent_resource.
# File 'lib/relaton/w3c/data_fetcher.rb', line 109
def enqueue_specs(queue)
  specs = client.specifications(embed: true)
  expected_pages = specs.pages
  last_page = nil
  loop do
    page = specs
    page.links.specifications.each do |spec|
      break if @interrupted

      queue << [spec, page]
    end
    break if @interrupted

    last_page = page.page
    break unless page.next?

    # Fetch the next page through the client's fetch path rather than
    # realizing the `next` link: only fetch populates the page's
    # embedded_data, so this keeps embed working past page 1. Realizing
    # the `next` link drops `_embedded` and forces a per-spec HTTP
    # request for every specification on every later page.
    next_page = fetch_specifications_page(page.page + 1)
    # A nil here means the page fetch failed after retries (not the end
    # of the list — that is `!page.next?` above). Aborting rather than
    # `break`ing prevents a rate-limit blip from silently truncating the
    # dataset: a partial crawl must never be saved/committed.
    unless next_page
      raise CrawlIncompleteError,
            "specifications pagination stopped at page #{page.page}: " \
            "failed to fetch page #{page.page + 1}"
    end

    specs = next_page
  end
  return if @interrupted

  guard_complete_pagination(last_page, expected_pages)
end
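The distinction between "the list ended" and "the next page could not be fetched" can be shown without the W3C client. The sketch below is an illustration under stated assumptions: `Page`, `each_item`, and the `fetch_page` callable are hypothetical stand-ins, and `CrawlIncompleteError` is assumed to descend from StandardError.

```ruby
# Assumed superclass; the real class is defined inside DataFetcher.
class CrawlIncompleteError < StandardError; end

# Hypothetical stand-in for an index page: its number, its items,
# and whether another page follows.
Page = Struct.new(:page, :items, :has_next) do
  def next?
    has_next
  end
end

# Walk pages serially. `fetch_page` is any callable that returns a Page,
# or nil when the fetch failed.
def each_item(first_page, fetch_page)
  page = first_page
  loop do
    page.items.each { |item| yield item, page }
    break unless page.next? # normal end of the list

    next_page = fetch_page.call(page.page + 1)
    # nil means the fetch failed, so abort instead of stopping quietly
    raise CrawlIncompleteError, "failed to fetch page #{page.page + 1}" unless next_page

    page = next_page
  end
end

pages = {
  1 => Page.new(1, %w[a b], true),
  2 => Page.new(2, %w[c], false),
}
seen = []
each_item(pages[1], ->(n) { pages[n] }) { |item, page| seen << [item, page.page] }
puts seen.inspect
```

Each item is yielded together with the page it came from, which mirrors how the method queues `[spec, page]` pairs.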
#fetch(_source = nil) ⇒ Object
Parse documents in parallel. The crawler is heavily I/O-bound on api.w3.org round-trips (~30-50k requests per run), so a small thread pool gives a near-linear speedup. Pagination still happens serially: each page’s `next?` flag gates whether the next page is requested.
A SIGINT (Ctrl-C) is handled gracefully: the producer stops queuing and the workers stop processing after their in-flight spec, then the index of everything fetched so far is saved rather than the run being lost.
# File 'lib/relaton/w3c/data_fetcher.rb', line 77
def fetch(_source = nil)
  n_workers = self.class.concurrency
  queue = SizedQueue.new(n_workers * 4)
  workers = Array.new(n_workers) { spawn_worker(queue) }

  with_interrupt_handler do
    # The poison pills + join run in `ensure` so an exception raised while
    # enqueuing (e.g. CrawlIncompleteError) still unblocks the producer
    # and drains the workers instead of deadlocking on queue.pop.
    begin
      enqueue_specs(queue)
    ensure
      n_workers.times { queue << nil } # poison pills
      workers.each(&:join)
    end
    Util.warn "Crawl interrupted — saving progress collected so far." if @interrupted
    index.save
  end
  report_errors
end
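The producer/worker arrangement can be sketched with the standard library alone. This is an illustration of the general pattern (a bounded queue, one nil "poison pill" per worker, and cleanup in `ensure`), not the gem's implementation: `process_in_parallel` is a hypothetical helper, and the source does not show `spawn_worker`, so the worker loop here is an assumption.

```ruby
# Hypothetical helper: push each job through `n_workers` threads and
# collect the block's results.
def process_in_parallel(jobs, n_workers, &block)
  queue = SizedQueue.new(n_workers * 4) # bounded, so the producer cannot run far ahead
  results = Queue.new                   # thread-safe collector
  workers = Array.new(n_workers) do
    Thread.new do
      # A nil job is the poison pill that ends this worker's loop.
      while (job = queue.pop)
        results << block.call(job)
      end
    end
  end

  begin
    jobs.each { |job| queue << job }
  ensure
    # Runs even if enqueuing raised, so no worker blocks forever on pop.
    n_workers.times { queue << nil }
    workers.each(&:join)
  end

  Array.new(results.size) { results.pop }
end

squares = process_in_parallel(1..10, 3) { |n| n * n }
puts squares.sort.inspect
```

Results arrive in whatever order the threads finish, which is why the usage line sorts them before printing.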
#fetch_spec(unrealized_spec, page = nil) ⇒ Object
# File 'lib/relaton/w3c/data_fetcher.rb', line 162
def fetch_spec(unrealized_spec, page = nil)
  # When `page` came from an embed:true fetch, realizing against it as the
  # parent_resource serves the spec from embedded data (no HTTP request).
  spec = realize(unrealized_spec, parent_resource: page)
  return unless spec

  local_errors = Hash.new(true)
  save_doc DataParser.parse(spec, local_errors)

  fetch_versions(spec) if self.class.fetch_versions?

  @mutex.synchronize { local_errors.each { |k, v| @errors[k] &&= v } }
end
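The final line merges each worker's flags into a shared hash under the mutex. The sketch below shows what the `&&=` merge does, assuming the shared `@errors` hash defaults missing keys to true in the same way `local_errors` does; the source shown here does not confirm that, and the key names are made up for illustration.

```ruby
errors = Hash.new(true) # shared tally; the default of true is an assumption
mutex = Mutex.new

# Same merge expression as in #fetch_spec, wrapped in a lambda.
merge = lambda do |local_errors|
  mutex.synchronize { local_errors.each { |k, v| errors[k] &&= v } }
end

merge.call({ "title" => true })
merge.call({ "title" => false, "date" => false })
merge.call({ "title" => true }) # cannot flip "title" back to true

puts errors.inspect
```

Under that assumption a key stays true only while every merge reports true for it; the first false is sticky.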
#fetch_versions(spec) ⇒ Object
Crawl a specification’s version history: its dated editions plus the predecessor/successor version chains. Each entry is a separate HTTP request, so this is the bulk of a run and can be skipped via RELATON_W3C_FETCH_VERSIONS=false (see .fetch_versions?).
# File 'lib/relaton/w3c/data_fetcher.rb', line 182
def fetch_versions(spec)
  if spec.links.respond_to?(:version_history) && spec.links.version_history
    version_history = realize spec.links.version_history
    version_history&.links&.spec_versions&.each { |version| parse_and_save version }
  end

  if spec.links.respond_to?(:predecessor_versions) && spec.links.predecessor_versions
    predecessor_versions = realize spec.links.predecessor_versions
    predecessor_versions&.links&.predecessor_versions&.each { |version| parse_and_save version }
  end

  return unless spec.links.respond_to?(:successor_versions) && spec.links.successor_versions

  successor_versions = realize spec.links.successor_versions
  successor_versions&.links&.successor_versions&.each { |version| parse_and_save version }
end
#file_name(id) ⇒ String
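Generate the file name for a document: strip a leading "W3C " from the ID, replace whitespace, commas, colons, slashes, and plus signs with underscores, collapse repeated underscores, lowercase the result, and join it with the output directory and extension.
# File 'lib/relaton/w3c/data_fetcher.rb', line 239
def file_name(id)
  name = id.sub(/^W3C\s/, "").gsub(/[\s,:\/+]/, "_").squeeze("_").downcase
  File.join @output, "#{name}.#{@ext}"
end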
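A quick way to see the normalization is to run it on a couple of IDs. In this sketch `file_name_for` is a hypothetical standalone copy that takes the output directory and extension as arguments instead of reading `@output` and `@ext`, and both sample IDs are made up for illustration.

```ruby
# Hypothetical standalone copy of #file_name with the instance state
# passed in as parameters.
def file_name_for(id, output, ext)
  name = id.sub(/^W3C\s/, "").gsub(/[\s,:\/+]/, "_").squeeze("_").downcase
  File.join output, "#{name}.#{ext}"
end

plain = file_name_for("W3C REC-xml-names-19990114", "data", "yaml")
# => "data/rec-xml-names-19990114.yaml"
messy = file_name_for("W3C CSS 2.1, Level 2", "data", "yaml")
# => "data/css_2.1_level_2.yaml"

puts plain, messy
```

In the second ID the comma and the following space each become an underscore, and `squeeze("_")` then collapses the pair into one.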
#guard_complete_pagination(last_page, expected_pages) ⇒ Object
Defense in depth: even when no page fetch raised, make sure pagination actually reached the last page the API advertised. Catches truncation modes other than a failed fetch (e.g. a `next` link that goes missing). Only enforced when the index reported a positive page count.
# File 'lib/relaton/w3c/data_fetcher.rb', line 153
def guard_complete_pagination(last_page, expected_pages)
  return unless expected_pages.is_a?(Integer) && expected_pages.positive?
  return unless last_page.is_a?(Integer) && last_page < expected_pages

  raise CrawlIncompleteError,
        "specifications pagination ended at page #{last_page} of " \
        "#{expected_pages}; refusing to save a partial dataset"
end
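The guard's three outcomes can be checked directly. The sketch below copies the method body as a top-level function and defines `CrawlIncompleteError` as a StandardError subclass, which is an assumption since its definition is not shown here.

```ruby
# Assumed superclass; the real class is nested under DataFetcher.
class CrawlIncompleteError < StandardError; end

# Top-level copy of the guard, for experimentation only.
def guard_complete_pagination(last_page, expected_pages)
  return unless expected_pages.is_a?(Integer) && expected_pages.positive?
  return unless last_page.is_a?(Integer) && last_page < expected_pages

  raise CrawlIncompleteError,
        "specifications pagination ended at page #{last_page} of " \
        "#{expected_pages}; refusing to save a partial dataset"
end

guard_complete_pagination(12, 12)  # reached the last page: passes
guard_complete_pagination(3, nil)  # no page count advertised: not enforced

message = begin
  guard_complete_pagination(7, 12) # stopped early: raises
rescue CrawlIncompleteError => e
  e.message
end
puts message
```

As written, the guard also passes when `last_page` is nil, so a crawl that never recorded a page number is not caught by this check.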
#index ⇒ Object
# File 'lib/relaton/w3c/data_fetcher.rb', line 55
def index
  @index ||= Relaton::Index.find_or_create(:W3C, file: "#{INDEXFILE}.yaml")
end
#log_error(msg) ⇒ Object
# File 'lib/relaton/w3c/data_fetcher.rb', line 59
def log_error(msg)
  Util.error msg
end
#save_doc(bib, warn_duplicate: true) ⇒ Object
Save document to file
# File 'lib/relaton/w3c/data_fetcher.rb', line 204
def save_doc(bib, warn_duplicate: true)
  return unless bib

  file = file_name(bib.docnumber)
  @mutex.synchronize do
    if @files.include?(file)
      Util.warn "File #{file} already exists. Document: #{bib.docnumber}" if warn_duplicate
    else
      pubid = PubId.parse bib.docnumber
      index.add_or_update pubid.to_hash, file
      @files << file
    end
    File.write file, serialize(bib), encoding: "UTF-8"
  end
end
#to_bibxml(bib) ⇒ Object
# File 'lib/relaton/w3c/data_fetcher.rb', line 228
def to_bibxml(bib)
  bib.to_xml
end
#to_xml(bib) ⇒ Object
# File 'lib/relaton/w3c/data_fetcher.rb', line 220
def to_xml(bib)
  bib.to_xml(bibdata: true)
end
#to_yaml(bib) ⇒ Object
# File 'lib/relaton/w3c/data_fetcher.rb', line 224
def to_yaml(bib)
  bib.to_yaml
end