Class: Documentrix::Documents::Cache::SQLiteCache

Inherits:
Object
  • Object
show all
Includes:
Common
Defined in:
lib/documentrix/documents/cache/sqlite_cache.rb

Overview

SQLiteCache is a cache implementation that uses SQLite database for storing document embeddings and related metadata.

This class provides a persistent cache storage solution for document embeddings, leveraging SQLite's capabilities to store both the embedding vectors and associated text data, tags, and source information. It supports efficient vector similarity searches using the sqlite_vec extension for fast nearest neighbor queries.

Instance Attribute Summary collapse

Attributes included from Common

#prefix

Instance Method Summary collapse

Methods included from Common

#clear, #collections, #pre, #unpre

Methods included from Utils::Math

#cosine_similarity, #norm

Constructor Details

#initialize(prefix:, embedding_length: 1_024, filename: ':memory:', debug: false, busy_timeout: 5000) ⇒ void

The initialize method sets up the cache by calling super and setting various instance variables.

Parameters:

  • prefix (String)

    the prefix for keys

  • embedding_length (Integer) (defaults to: 1_024)

    the length of the embeddings vector

  • filename (String) (defaults to: ':memory:')

    the name of the SQLite database file or ':memory:' for in-memory.

  • debug (FalseClass, TrueClass) (defaults to: false)

    whether to enable debugging

  • busy_timeout (Integer) (defaults to: 5000)

    the SQLite busy timeout in milliseconds (defaults to 5000)



27
28
29
30
31
32
33
34
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 27

def initialize(prefix:, embedding_length: 1_024, filename: ':memory:', debug: false, busy_timeout: 5000)
  super(prefix:)
  @embedding_length = embedding_length
  @filename         = filename
  @debug            = debug
  @busy_timeout     = busy_timeout
  setup_database(filename)
end

Instance Attribute Details

#embedding_lengthObject (readonly)

length of the embeddings vector



38
39
40
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 38

def embedding_length
  @embedding_length
end

#filenameObject (readonly)

filename for the database, :memory: is in memory



36
37
38
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 36

def filename
  @filename
end

Instance Method Details

#[](key) ⇒ Documentrix::Documents::Record, NilClass

The method retrieves the value associated with the given key from the cache.

Parameters:

  • key (String)

    The key for which to retrieve the value.

Returns:



47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 47

def [](key)
  result = execute(
    %{
      SELECT records.key, records.text, records.norm, records.source,
        records.digest, records.tags, embeddings.embedding
      FROM records
      INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
      WHERE records.key = ?
    },
    pre(key)
  )&.first or return
  key, text, norm, source, digest, tags, embedding = *result
  embedding = embedding.unpack("f*")
  tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
  convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
end

#[]=(key, value) ⇒ Object

The []= method sets the value for a given key by inserting it into the database.

Parameters:

  • key (String)

    the key to set

  • value (Hash, Documentrix::Documents::Record)

    the hash or record containing the text, embedding, and other metadata



70
71
72
73
74
75
76
77
78
79
80
81
82
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 70

def []=(key, value)
  value     = convert_value_to_record(value)
  digest    = compute_file_digest(value.source)
  embedding = value.embedding.pack("f*")
  execute(%{BEGIN IMMEDIATE})
  execute(%{INSERT INTO embeddings(embedding) VALUES(?)}, [ embedding ])
  embedding_id, = execute(%{ SELECT last_insert_rowid() }).flatten
  execute(%{
    INSERT INTO records(key,text,embedding_id,norm,source,digest,tags)
    VALUES(?,?,?,?,?,?,?)
  }, [ pre(key), value.text, embedding_id, value.norm, value.source, digest, JSON(value.tags) ])
  execute(%{COMMIT})
end

#clear_all_with_prefixDocumentrix::Documents::RedisBackedMemoryCache

The clear_all_with_prefix method deletes all records for prefix prefix from the cache by executing a SQL query.

Returns:

  • (Documentrix::Documents::RedisBackedMemoryCache)

    self



161
162
163
164
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 161

def clear_all_with_prefix
  execute(%{DELETE FROM records WHERE key LIKE ?}, [ start_with_prefix ])
  self
end

#clear_by_source(source, digest: nil, operator: ?=) ⇒ self

Removes all records associated with the specified source from the cache.

If a digest is provided, the method will only remove records that do NOT match this digest. This allows for updating a source by wiping old versions while preserving records that are already up-to-date.

Parameters:

  • source (String)

    the source identifier used to filter records

  • digest (String, nil) (defaults to: nil)

    the SHA256 hexadecimal digest of the source. Records matching this digest will be preserved.

  • operator (String) (defaults to: ?=)

    the operator to use for comparison ('=' or '!='). Defaults to '='.

Returns:

  • (self)

    the cache instance for method chaining



179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 179

def clear_by_source(source, digest: nil, operator: ?=)
  operator = '!=' if operator != ?=
  if digest
    execute(
      %{
        DELETE FROM records
        WHERE key LIKE ? AND source = ? AND digest #{operator} ?
      },
      [ start_with_prefix, source, digest ]
    )
  else
    execute(
      %{
        DELETE FROM records
        WHERE key LIKE ? AND source = ?
      },
      [ start_with_prefix, source ]
    )
  end
  self
end

#clear_for_tags(tags = nil) ⇒ Documentrix::Documents::Cache::SQLiteCache

The clear_for_tags method clears the cache for specific tags by deleting records that match those tags and have the prefix prefix.

Parameters:

  • tags (Array<String>, NilClass) (defaults to: nil)

    An array of tag names to clear from the cache or nil for all records

Returns:



145
146
147
148
149
150
151
152
153
154
155
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 145

def clear_for_tags(tags = nil)
  tags = Documentrix::Utils::Tags.new(tags).to_a
  if tags.present?
    records = find_records_for_tags(tags)
    keys = '(%s)' % records.transpose.first.map { "'%s'" % quote(_1) }.join(?,)
    execute(%{DELETE FROM records WHERE key IN #{keys}})
  else
    clear_all_with_prefix
  end
  self
end

#convert_to_vector(vector) ⇒ Array

The convert_to_vector method returns the input vector itself, because conversion isn't necessary for this cache class.

Parameters:

  • vector (Array)

    the input vector

Returns:

  • (Array)

    the (not) converted vector



324
325
326
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 324

def convert_to_vector(vector)
  vector
end

#delete(key) ⇒ NilClass

The delete method removes a key from the cache by executing a SQL query.

Parameters:

  • key (String)

    the key to be deleted

Returns:

  • (NilClass)


102
103
104
105
106
107
108
109
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 102

def delete(key)
  result = key?(key)
  execute(
    %{ DELETE FROM records WHERE records.key = ? },
    pre(key)
  )
  result
end

#each(prefix: start_with_prefix) {|key, value| ... } ⇒ Object

The each method iterates over records matching the given prefix and yields them to the block.

Examples:

cache.each do |key, value|
  puts "#{key}: #{value}"
end

Parameters:

  • prefix (String) (defaults to: start_with_prefix)

    the prefix to match

Yields:

  • (key, value)

    where key is the record's key and value is the record itself



288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 288

def each(prefix: start_with_prefix, &block)
  block or return enum_for(__method__, prefix:)

  execute(%{
    SELECT records.key, records.text, records.norm, records.source,
      records.digest, records.tags, embeddings.embedding
    FROM records
    INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
    WHERE records.key LIKE ?
  }, [ prefix ]).each do |key, text, norm, source, digest, tags, embedding|
    embedding = embedding.unpack("f*")
    tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
    value     = convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
    block.(key, value)
  end
  self
end

#each_source {|source| ... } ⇒ Enumerator

Yields each unique, full source present in the cache records.

This is a high-performance override for SQLite that avoids loading embeddings and parsing JSON for every record.

Yields:

  • (source)

    the full source string

Returns:

  • (Enumerator)

    an enumerator if no block is given, nil otherwise.



239
240
241
242
243
244
245
246
247
248
249
250
251
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 239

def each_source(&block)
  block or return enum_for(__method__)

  execute(%{
    SELECT DISTINCT source FROM records
    WHERE key LIKE ? AND source IS NOT NULL
  }, [ start_with_prefix ]).each do |source,|
    source = source.full? or next

    block.(source)
  end
  nil
end

#find_records(needle, tags: nil, max_records: nil, min_similarity: -1)) {|key, value| ... } ⇒ Array<Documentrix::Documents::Record>

The find_records method finds records that match the given needle and tags.

Examples:

documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ], min_similarity: 0.7)

Parameters:

  • needle (Array)

    the embedding vector

  • tags (Array) (defaults to: nil)

    the list of tags to filter by (optional)

  • max_records (Integer) (defaults to: nil)

    the maximum number of records to return (optional)

  • min_similarity (Float) (defaults to: -1))

    the minimum similarity score to include (defaults to -1)

Yields:

  • (key, value)

Returns:

Raises:

  • (ArgumentError)

    if needle size does not match embedding length



370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 370

def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
  needle.size != @embedding_length and
    raise ArgumentError, "needle embedding length != %s" % @embedding_length
  needle_binary = needle.pack("f*")
  max_records   = [ max_records, size, 4_096 ].compact.min
  records = find_records_for_tags(tags)
  rowids_where = '(%s)' % records.transpose.last&.join(?,)
  execute(
    %{
      SELECT records.key, records.text, records.norm, records.source,
        records.digest, records.tags, embeddings.embedding,
        1 - vec_distance_cosine(?, vec_f32(embeddings.embedding)) AS similarity
      FROM records
      INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
      WHERE embeddings.rowid IN #{rowids_where}
        AND embeddings.embedding MATCH ? AND similarity >= ?
        AND embeddings.k = ?
      ORDER BY similarity DESC
    }, [ needle_binary, needle_binary, min_similarity, max_records ]
  ).map do |key, text, norm, source, digest, tags, embedding, similarity|
    key       = unpre(key)
    embedding = embedding.unpack("f*")
    tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
    convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:, similarity:)
  end
end

#find_records_for_tags(tags) ⇒ Array

The find_records_for_tags method filters records based on the provided tags.

Parameters:

  • tags (Array)

    an array of tag names

Returns:

  • (Array)

    an array of filtered records



333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 333

def find_records_for_tags(tags)
  if tags.present?
    tags_filter = Documentrix::Utils::Tags.new(tags).to_a
    unless tags_filter.empty?
      tags_where = ' AND (%s)' % tags_filter.map {
        'tags LIKE "%%%s%%"' % quote(_1)
      }.join(' OR ')
    end
  end
  records = execute(%{
    SELECT key, tags, embedding_id
    FROM records
    WHERE key LIKE ?#{tags_where}
  }, [ start_with_prefix ])
  if tags_filter
    records = records.select { |key, tags, embedding_id|
      (tags_filter & JSON(tags.to_s).to_a).size >= 1
    }
  end
  records
end

#full_each {|key, value| ... } ⇒ Documentrix::Documents::Cache::SQLiteCache

The full_each method iterates over all keys and values in the cache, regardless of their prefix.

Yields:

  • (key, value)

Returns:



312
313
314
315
316
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 312

def full_each(&block)
  block or return enum_for(__method__)

  each(prefix: ?%, &block)
end

#key?(key) ⇒ FalseClass, TrueClass

The key? method checks if the given key exists in the cache by executing a SQL query.

Parameters:

  • key (String)

    the key to check for existence

Returns:

  • (FalseClass, TrueClass)

    true if the key exists, false otherwise



90
91
92
93
94
95
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 90

def key?(key)
  execute(
    %{ SELECT count(records.key) FROM records WHERE records.key = ? },
    pre(key)
  ).flatten.first == 1
end

#move_prefix(old_prefix, new_prefix) ⇒ Documentrix::Documents::Cache::SQLiteCache

Move a key prefix in the cache.

This operation updates every record whose key starts with +old_prefix+, rewriting the prefix to +new_prefix+. It uses SQLite’s built‑in replace() string function, which means the change is atomic and performed entirely inside the database engine—no Ruby‑side iteration or temporary data structures are needed.

Parameters:

  • old_prefix (String)

    the current prefix to replace (must match the beginning of the keys you want to move).

  • new_prefix (String)

    the new prefix that should take its place.

Returns:



267
268
269
270
271
272
273
274
275
276
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 267

def move_prefix(old_prefix, new_prefix)
  execute(
    %{
      UPDATE records
        SET key = replace(key, '#{quote(old_prefix)}', '#{quote(new_prefix)}')
        WHERE key LIKE ?
    },
    old_prefix + '%'
  )
end

#sizeInteger

The size method returns the total number of records stored in the cache, that is the ones with prefix prefix.

Returns:

  • (Integer)

    the count of records



130
131
132
133
134
135
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 130

def size
  execute(
    %{SELECT COUNT(*) FROM records WHERE key LIKE ?},
    [ start_with_prefix ]
  ).flatten.first
end

#source_exist?(source, digest: nil, operator: ?=) ⇒ Boolean

The source_exist? method checks if any records associated with the given source exist in the cache. If a digest is provided, it verifies if the source exists and matches the specified digest using the provided operator.

Parameters:

  • source (#to_s)

    the source to check for existence

  • digest (String, nil) (defaults to: nil)

    the SHA256 hexadecimal digest to compare against the stored source digest (optional)

  • operator (String) (defaults to: ?=)

    the operator to use for comparison ('=' or '!='). Defaults to '='.

Returns:

  • (Boolean)

    true if the source exists (and matches the digest condition if provided), false otherwise.



213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 213

def source_exist?(source, digest: nil, operator: ?=)
  operator = '!=' if operator != ?=
  if digest
    !!execute(
      %{
        SELECT 1 FROM records WHERE key LIKE ? AND source = ? AND digest #{operator} ?
      },
      [ start_with_prefix, source, digest ]
    ).first
  else
    !!execute(
      %{
        SELECT 1 FROM records WHERE key LIKE ? AND source = ?
      },
      [ start_with_prefix, source ]
    ).first
  end
end

#tagsDocumentrix::Utils::Tags

The tags method returns an array of unique tags from the database.

Returns:

  • (Documentrix::Utils::Tags)

    An instance of Documentrix::Utils::Tags containing all unique tags found in the database.



115
116
117
118
119
120
121
122
123
124
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 115

def tags
  result = Documentrix::Utils::Tags.new
  execute(%{
      SELECT DISTINCT(tags) FROM records WHERE key LIKE ?
    }, [ start_with_prefix ]
  ).flatten.each do
    JSON(_1).each { |t| result.add(t) }
  end
  result
end