Module: Scrapetor::PersistentCache
Defined in: lib/scrapetor/persistent_cache.rb
Overview
Disk-backed parse cache. Persists the parsed arena (nodes blob, attrs blob, HTML bytes) to disk so that subsequent process invocations restore the document via memcpy plus an index rebuild; the SAX tokeniser does not run on a hit. The implementation is fully native: `Scrapetor::Native::Document#serialize_to_file` writes the binary arena, and `Scrapetor::Native::Document.load_from_file` reads it back.
Designed for:
- CI / test suites looping the same fixture HTML across boots
- Batch jobs that restart (cron, Sidekiq workers)
- A/B parser comparisons over a corpus
Storage layout: SCRAP_CACHE_DIR/<first-2-bytes>/<sha256>.arena. Files are content-addressed, so identical HTML inputs share one cache entry regardless of caller.
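The layout above can be sketched in plain Ruby. The module's own path helper is not shown on this page, so `sketch_path_for` is a hypothetical stand-in, and it assumes that `<first-2-bytes>` means the first two hex characters of the SHA-256 digest.

```ruby
require "digest"

# Hypothetical stand-in for the module's path helper (not shown on this page).
# Assumption: the shard directory is the first two hex characters of the key.
def sketch_path_for(root, html)
  key = Digest::SHA256.hexdigest(html)          # 64 hex characters
  File.join(root, key[0, 2], "#{key}.arena")    # <root>/<shard>/<key>.arena
end

# Identical HTML maps to the same file, whoever asks for it.
puts sketch_path_for("/tmp/cache", "<p>hello</p>")
```

Sharding by a short key prefix keeps any single directory from accumulating every cache entry.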
Opt in by calling Scrapetor::PersistentCache.enable! or by setting SCRAP_PERSISTENT_CACHE=1. Override the cache root via SCRAP_CACHE_DIR (default ~/.cache/scrapetor/parse).
Constant Summary
- DEFAULT_DIR = File.expand_path("~/.cache/scrapetor/parse")
Class Attribute Summary
- .dir ⇒ Object
  Returns the value of attribute dir.
Class Method Summary
- .clear! ⇒ Object
- .directory ⇒ Object
- .disable! ⇒ Object
- .disk_usage ⇒ Object
- .enable! ⇒ Object
- .enabled? ⇒ Boolean
- .key_for(html) ⇒ Object
  SHA-256 hex digest of the HTML; the chance of a collision is effectively zero.
- .load(html) ⇒ Object
  Load a cached parsed arena for the given HTML, or nil on miss.
- .store(html, native_doc) ⇒ Object
  Persist a parsed arena to disk under its content fingerprint.
- .warm(paths_or_globs) ⇒ Object
  Pre-warm the cache for fixture files matching the given paths or globs.
Class Attribute Details
.dir ⇒ Object
Returns the value of attribute dir.
# File 'lib/scrapetor/persistent_cache.rb', line 30
def dir
  @dir
end
Class Method Details
.clear! ⇒ Object
# File 'lib/scrapetor/persistent_cache.rb', line 116
def clear!
  return 0 unless File.directory?(directory)
  Dir.glob(File.join(directory, "*", "*.arena")).each(&File.method(:delete)).size
end
.directory ⇒ Object
# File 'lib/scrapetor/persistent_cache.rb', line 49
def directory
  @dir ||= ENV.fetch("SCRAP_CACHE_DIR", DEFAULT_DIR)
end
.disable! ⇒ Object
# File 'lib/scrapetor/persistent_cache.rb', line 45
def disable!
  @enabled = false
end
.disk_usage ⇒ Object
# File 'lib/scrapetor/persistent_cache.rb', line 111
def disk_usage
  return 0 unless File.directory?(directory)
  Dir.glob(File.join(directory, "*", "*.arena")).sum { |p| File.size(p) }
end
.enable! ⇒ Object
# File 'lib/scrapetor/persistent_cache.rb', line 38
def enable!
  @enabled = true
  @dir ||= ENV.fetch("SCRAP_CACHE_DIR", DEFAULT_DIR)
  FileUtils.mkdir_p(@dir)
  true
end
.enabled? ⇒ Boolean
# File 'lib/scrapetor/persistent_cache.rb', line 32
def enabled?
  e = defined?(@enabled) ? @enabled : nil
  return e unless e.nil?
  ENV["SCRAP_PERSISTENT_CACHE"] == "1"
end
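The precedence in `enabled?` is easy to miss: an explicit `enable!` or `disable!` call wins, and the environment variable is consulted only when neither has been called. A minimal sketch of that three-state toggle follows, using a hypothetical `ToggleSketch` module rather than the real one; `reset!` is an illustration-only helper and not part of the documented API.

```ruby
# Hypothetical mirror of the toggle logic, for illustration only.
module ToggleSketch
  class << self
    def enable!
      @enabled = true
    end

    def disable!
      @enabled = false
    end

    # Illustration-only helper: return to the "never toggled" state.
    def reset!
      @enabled = nil
    end

    def enabled?
      # An explicit true/false takes priority over the environment.
      return @enabled unless @enabled.nil?
      ENV["SCRAP_PERSISTENT_CACHE"] == "1"
    end
  end
end
```

One consequence: calling `disable!` in a test helper switches the cache off even when the environment variable is set to 1.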
.key_for(html) ⇒ Object
SHA-256 hex digest of the HTML; the chance of a collision is effectively zero.
# File 'lib/scrapetor/persistent_cache.rb', line 92
def key_for(html)
  Digest::SHA256.hexdigest(html)
end
.load(html) ⇒ Object
Load a cached parsed arena for the given HTML, or nil on miss. Also returns nil when the cache is disabled or the HTML is nil or empty. A cache file that fails to load is deleted and treated as a miss. The return value is a Scrapetor::Native::Document ready to be wrapped by Scrapetor::Document.
# File 'lib/scrapetor/persistent_cache.rb', line 56
def load(html)
  return nil unless enabled?
  return nil if html.nil? || html.empty?
  key = key_for(html)
  path = path_for(key)
  return nil unless File.exist?(path)
  native = Scrapetor::Native::Document.load_from_file(path)
  native
rescue StandardError
  File.delete(path) rescue nil
  nil
end
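Because `load` returns nil on a miss, a caller can pair it with `store` in a load-or-parse pattern. The helper below is a hypothetical sketch, not part of the library: `fetch_native_sketch`, `cache:` and `parse_native:` are illustrative names, and the collaborators are injected so that the snippet runs without the gem.

```ruby
# Hypothetical helper: return a native document handle, preferring the cache.
# `cache` must respond to load(html) and store(html, native);
# `parse_native` is any callable that parses HTML into a native handle.
def fetch_native_sketch(html, cache:, parse_native:)
  cached = cache.load(html)
  return cached if cached            # hit: the parser is skipped entirely

  native = parse_native.call(html)   # miss: parse once...
  cache.store(html, native)          # ...and persist for the next process
  native
end
```

With the gem loaded, one would presumably pass `cache: Scrapetor::PersistentCache` and `parse_native: ->(h) { Scrapetor.parse(h).backing.native }`, mirroring the calls that `.warm` makes.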
.store(html, native_doc) ⇒ Object
Persist a parsed arena to disk under its content fingerprint. Takes the Scrapetor::Native::Document handle (i.e. `doc.backing.native` for an unmutated document). Returns the cache key on success, including when an entry for the same HTML already exists, and nil when the cache is disabled, the HTML is nil or empty, native_doc is nil, or serialization fails.
# File 'lib/scrapetor/persistent_cache.rb', line 73
def store(html, native_doc)
  return nil unless enabled?
  return nil if html.nil? || html.empty?
  return nil if native_doc.nil?
  key = key_for(html)
  path = path_for(key)
  return key if File.exist?(path)
  FileUtils.mkdir_p(File.dirname(path))
  tmp = "#{path}.tmp.#{Process.pid}"
  ok = native_doc.serialize_to_file(tmp)
  unless ok
    File.delete(tmp) rescue nil
    return nil
  end
  File.rename(tmp, path)
  key
end
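`store` writes to a per-process temporary file and then renames it into place, so a concurrent reader never observes a half-written arena (on POSIX systems, a rename within one filesystem is atomic). The same pattern, sketched with plain bytes in place of the native serializer; `atomic_write_sketch` is a hypothetical name.

```ruby
require "fileutils"

# Hypothetical helper: write `bytes` to `path` via temp file + rename.
# Returns true if the file was written, false if `path` already existed.
def atomic_write_sketch(path, bytes)
  return false if File.exist?(path)

  FileUtils.mkdir_p(File.dirname(path))
  tmp = "#{path}.tmp.#{Process.pid}"   # the PID keeps concurrent writers apart
  File.binwrite(tmp, bytes)
  File.rename(tmp, path)               # readers see the old state or the full file
  true
ensure
  # Remove a leftover temp file if the write or rename raised.
  File.delete(tmp) if tmp && File.exist?(tmp)
end
```

Two processes that cache the same HTML at once each write their own temporary file; whichever rename lands last wins, and both files hold the same content.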
.warm(paths_or_globs) ⇒ Object
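Pre-warm the cache for fixture files. Accepts a single path or glob, or an array of them. Each entry is expanded with Dir.glob, so to cover a directory, pass a glob such as "fixtures/*.html" rather than the bare directory path. Returns the number of files read, or 0 when the cache is disabled.
# File 'lib/scrapetor/persistent_cache.rb', line 97
def warm(paths_or_globs)
  return 0 unless enabled?
  n = 0
  Array(paths_or_globs).each do |entry|
    Dir.glob(entry).each do |path|
      html = File.read(path)
      doc = Scrapetor.parse(html)
      store(html, doc.backing.native)
      n += 1
    end
  end
  n
end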
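The argument handling in `warm` comes down to `Array(...)` followed by `Dir.glob`. The sketch below isolates that expansion so its behaviour can be checked without the gem; `expand_entries_sketch` is a hypothetical name, not a library method.

```ruby
# Hypothetical helper: expand one path/glob, or an array of them,
# into a flat list of matching paths (the Array + Dir.glob idiom used by warm).
def expand_entries_sketch(paths_or_globs)
  Array(paths_or_globs).flat_map { |entry| Dir.glob(entry) }
end

# A String and an Array of Strings are both accepted.
html_files = expand_entries_sketch("spec/fixtures/*.html")
more_files = expand_entries_sketch(["a/*.html", "b/**/*.htm"])
puts html_files.size + more_files.size
```

A pattern that matches nothing contributes no paths, so such entries are silently skipped.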