Module: Scrapetor::PersistentCache

Defined in:
lib/scrapetor/persistent_cache.rb

Overview

Disk-backed parse cache. Persists the parsed arena (nodes blob, attrs blob, HTML bytes) to disk so that later process invocations restore the document via memcpy plus an index rebuild; on a cache hit the SAX tokeniser does not run. The implementation is fully native: `Scrapetor::Native::Document#serialize_to_file` writes the binary arena, and `Scrapetor::Native::Document.load_from_file` reads it back.

Designed for:

- CI / test suites looping the same fixture HTML across boots
- Batch jobs that restart (cron, sidekiq workers)
- A/B parser comparisons over a corpus

Storage layout: SCRAP_CACHE_DIR/<first-2-bytes>/<sha256>.arena

Files are content-addressed, so identical HTML inputs share one cache entry regardless of caller.
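The layout above can be sketched with the standard library alone. `path_for` is not shown on this page, so `sketch_path_for` is a hypothetical helper, and treating the shard directory as the first two hex characters of the digest is an assumption rather than the library's confirmed behaviour.

```ruby
require "digest"

# Hypothetical helper: path_for is not shown on this page, so the shard
# prefix (first two hex characters of the digest) is an assumption.
def sketch_path_for(root, html)
  key = Digest::SHA256.hexdigest(html)
  File.join(root, key[0, 2], "#{key}.arena")
end

first  = sketch_path_for("/tmp/scrap-cache", "<p>hello</p>")
second = sketch_path_for("/tmp/scrap-cache", "<p>hello</p>")
other  = sketch_path_for("/tmp/scrap-cache", "<p>bye</p>")

puts first
puts first == second # identical input maps to the identical path
puts first == other  # different input maps to a different path
```

Because the path depends only on the HTML bytes, two callers that parse the same fixture end up sharing one file.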

Opt in by setting SCRAP_PERSISTENT_CACHE=1 or by calling `Scrapetor::PersistentCache.enable!`. Override the cache root via SCRAP_CACHE_DIR (default: ~/.cache/scrapetor/parse).

Constant Summary

DEFAULT_DIR =
File.expand_path("~/.cache/scrapetor/parse")

Class Attribute Details

.dir ⇒ Object

Returns the value of attribute dir.



# File 'lib/scrapetor/persistent_cache.rb', line 30

def dir
  @dir
end

Class Method Details

.clear! ⇒ Object



# File 'lib/scrapetor/persistent_cache.rb', line 116

def clear!
  return 0 unless File.directory?(directory)
  Dir.glob(File.join(directory, "*", "*.arena")).each(&File.method(:delete)).size
end

.directory ⇒ Object



# File 'lib/scrapetor/persistent_cache.rb', line 49

def directory
  @dir ||= ENV.fetch("SCRAP_CACHE_DIR", DEFAULT_DIR)
end

.disable! ⇒ Object



# File 'lib/scrapetor/persistent_cache.rb', line 45

def disable!
  @enabled = false
end

.disk_usage ⇒ Object



# File 'lib/scrapetor/persistent_cache.rb', line 111

def disk_usage
  return 0 unless File.directory?(directory)
  Dir.glob(File.join(directory, "*", "*.arena")).sum { |p| File.size(p) }
end

.enable! ⇒ Object



# File 'lib/scrapetor/persistent_cache.rb', line 38

def enable!
  @enabled = true
  @dir ||= ENV.fetch("SCRAP_CACHE_DIR", DEFAULT_DIR)
  FileUtils.mkdir_p(@dir)
  true
end

.enabled? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/scrapetor/persistent_cache.rb', line 32

def enabled?
  e = defined?(@enabled) ? @enabled : nil
  return e unless e.nil?
  ENV["SCRAP_PERSISTENT_CACHE"] == "1"
end
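The precedence in `enabled?` is easy to misread: an explicit `enable!` or `disable!` call always wins, and the environment variable is consulted only when neither has been called. The sketch below copies that logic into a stand-in module (`CacheToggleSketch`, a hypothetical name) so it runs without the gem installed.

```ruby
# Stand-in module (hypothetical name) that copies the precedence logic of
# enabled? so it can run without the scrapetor gem installed.
module CacheToggleSketch
  class << self
    def enable!
      @enabled = true
    end

    def disable!
      @enabled = false
    end

    def enabled?
      e = defined?(@enabled) ? @enabled : nil
      return e unless e.nil?
      ENV["SCRAP_PERSISTENT_CACHE"] == "1"
    end
  end
end

ENV["SCRAP_PERSISTENT_CACHE"] = "1"
puts CacheToggleSketch.enabled? # no explicit call yet, so the env decides

CacheToggleSketch.disable!
puts CacheToggleSketch.enabled? # explicit disable! overrides the env

ENV.delete("SCRAP_PERSISTENT_CACHE")
CacheToggleSketch.enable!
puts CacheToggleSketch.enabled? # explicit enable! works without the env
```

Note that only the exact string "1" enables the cache through the environment; values such as "true" do not.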

.key_for(html) ⇒ Object

SHA-256 hex digest of the HTML, used as the cache key. Collisions are negligible in practice.



# File 'lib/scrapetor/persistent_cache.rb', line 92

def key_for(html)
  Digest::SHA256.hexdigest(html)
end

.load(html) ⇒ Object

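Load the cached parsed arena for the given HTML, or return nil on a miss (including when the cache is disabled or the HTML is nil or empty). The return value is a Scrapetor::Native::Document ready to be wrapped by Scrapetor::Document. If loading raises, the cache file is deleted and nil is returned.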



# File 'lib/scrapetor/persistent_cache.rb', line 56

def load(html)
  return nil unless enabled?
  return nil if html.nil? || html.empty?
  key = key_for(html)
  path = path_for(key)
  return nil unless File.exist?(path)
  native = Scrapetor::Native::Document.load_from_file(path)
  native
rescue StandardError
  File.delete(path) rescue nil
  nil
end
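A caller would typically pair `load` with `store` in a load-or-parse pattern. The sketch below shows that flow against an in-memory stand-in; `FakeCache`, `load_or_parse`, and the `parser` lambda are hypothetical names for illustration, and how a native handle gets wrapped into a Scrapetor::Document is not covered on this page.

```ruby
# In-memory stand-in (hypothetical) exposing the same load/store shape as
# Scrapetor::PersistentCache, so the sketch runs without the gem.
class FakeCache
  def initialize
    @entries = {}
  end

  def load(html)
    @entries[html]
  end

  def store(html, native_doc)
    @entries[html] = native_doc
    html.hash.to_s # placeholder for the real SHA-256 cache key
  end
end

# Hypothetical helper: try the cache first, parse and store only on a miss.
def load_or_parse(html, cache:, parser:)
  cached = cache.load(html)
  return [:hit, cached] if cached

  native = parser.call(html)
  cache.store(html, native)
  [:miss, native]
end

parse_calls = 0
parser = lambda do |html|
  parse_calls += 1
  "parsed(#{html})" # placeholder for a native document handle
end

cache = FakeCache.new
first  = load_or_parse("<p>hi</p>", cache: cache, parser: parser)
second = load_or_parse("<p>hi</p>", cache: cache, parser: parser)

puts first.first  # the first call misses and parses
puts second.first # the second call is served from the cache
puts parse_calls
```

Since `load` returns nil both on a genuine miss and when the cache is disabled, the same calling code works whether or not the cache is switched on.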

.store(html, native_doc) ⇒ Object

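Persist a parsed arena to disk under its content fingerprint. Takes the Scrapetor::Native::Document handle (i.e. `doc.backing.native` for an unmutated document). Returns the cache key on success or when the entry already exists; returns nil when the cache is disabled, the HTML is nil or empty, the handle is nil, or serialization fails.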



# File 'lib/scrapetor/persistent_cache.rb', line 73

def store(html, native_doc)
  return nil unless enabled?
  return nil if html.nil? || html.empty?
  return nil if native_doc.nil?
  key = key_for(html)
  path = path_for(key)
  return key if File.exist?(path)
  FileUtils.mkdir_p(File.dirname(path))
  tmp = "#{path}.tmp.#{Process.pid}"
  ok = native_doc.serialize_to_file(tmp)
  unless ok
    File.delete(tmp) rescue nil
    return nil
  end
  File.rename(tmp, path)
  key
end
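`store` writes to a per-process temp file and then renames it into place, so a concurrent reader never observes a half-written arena. The same technique is sketched below with the standard library only; `atomic_write` is a hypothetical helper, and `File.binwrite` stands in for `native_doc.serialize_to_file`.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper mirroring the temp-file-then-rename steps of store.
# File.binwrite stands in for native_doc.serialize_to_file.
def atomic_write(path, bytes)
  FileUtils.mkdir_p(File.dirname(path))
  tmp = "#{path}.tmp.#{Process.pid}"
  File.binwrite(tmp, bytes)
  File.rename(tmp, path) # atomic when tmp and path share a filesystem
  path
rescue StandardError
  # Do not leave a partial temp file behind if anything failed.
  File.delete(tmp) if tmp && File.exist?(tmp)
  raise
end

root = Dir.mktmpdir("arena-sketch")
target = File.join(root, "ab", "example.arena")
atomic_write(target, "arena-bytes")

puts File.binread(target)
puts Dir.glob(File.join(root, "ab", "*.tmp.*")).empty?
FileUtils.remove_entry(root)
```

Including the process ID in the temp name keeps two processes that store the same HTML at the same time from writing to the same temp file; whichever rename lands last simply replaces an identical entry.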

.warm(paths_or_globs) ⇒ Object

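Pre-warm the cache from fixture files. Accepts a path or glob pattern, or an array of them; each matching file is parsed and stored. Returns the number of files processed, or 0 when the cache is disabled.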



# File 'lib/scrapetor/persistent_cache.rb', line 97

def warm(paths_or_globs)
  return 0 unless enabled?
  n = 0
  Array(paths_or_globs).each do |entry|
    Dir.glob(entry).each do |path|
      html = File.read(path)
      doc = Scrapetor.parse(html)
      store(html, doc.backing.native)
      n += 1
    end
  end
  n
end
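The argument handling in `warm` can be tried in isolation. In the sketch below, `expand_entries` is a hypothetical helper that reproduces only the `Array(...)` plus `Dir.glob` expansion, without parsing or storing anything.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper showing how warm's argument is expanded: each entry is
# passed through Dir.glob, so literal paths and patterns can be mixed.
def expand_entries(paths_or_globs)
  Array(paths_or_globs).flat_map { |entry| Dir.glob(entry) }
end

root = Dir.mktmpdir("fixtures-sketch")
%w[a.html b.html notes.txt].each do |name|
  File.write(File.join(root, name), "<p>#{name}</p>")
end

single = expand_entries(File.join(root, "a.html"))
mixed  = expand_entries([File.join(root, "*.html"), File.join(root, "missing.html")])

puts single.length
puts mixed.length
FileUtils.remove_entry(root)
```

Two consequences follow from this expansion: a literal path that does not exist is skipped silently rather than raising, and a file matched by two overlapping patterns is visited twice.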