Class: Documentrix::Documents::Cache::SQLiteCache

Inherits:
Object
  • Object
show all
Includes:
Common
Defined in:
lib/documentrix/documents/cache/sqlite_cache.rb

Overview

SQLiteCache is a cache implementation that uses SQLite database for storing document embeddings and related metadata.

This class provides a persistent cache storage solution for document embeddings, leveraging SQLite's capabilities to store both the embedding vectors and associated text data, tags, and source information. It supports efficient vector similarity searches using the sqlite_vec extension for fast nearest neighbor queries.

Instance Attribute Summary collapse

Attributes included from Common

#prefix

Instance Method Summary collapse

Methods included from Common

#clear, #collections, #pre, #unpre

Methods included from Utils::Math

#cosine_similarity, #norm

Constructor Details

#initialize(prefix:, embedding_length: 1_024, filename: ':memory:', debug: false) ⇒ void

The initialize method sets up the cache by calling super and setting various instance variables.

Parameters:

  • prefix (String)

    the prefix for keys

  • embedding_length (Integer) (defaults to: 1_024)

    the length of the embeddings vector

  • filename (String) (defaults to: ':memory:')

    the name of the SQLite database file or ':memory:' for in-memory.

  • debug (FalseClass, TrueClass) (defaults to: false)

    whether to enable debugging



26
27
28
29
30
31
32
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 26

def initialize(prefix:, embedding_length: 1_024, filename: ':memory:', debug: false)
  super(prefix:)
  @embedding_length = embedding_length
  @filename         = filename
  @debug            = debug
  setup_database(filename)
end

Instance Attribute Details

#embedding_lengthObject (readonly)

length of the embeddings vector



36
37
38
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 36

def embedding_length
  @embedding_length
end

#filenameObject (readonly)

filename for the database, :memory: is in memory



34
35
36
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 34

def filename
  @filename
end

Instance Method Details

#[](key) ⇒ Documentrix::Documents::Record, NilClass

The method retrieves the value associated with the given key from the cache.

Parameters:

  • key (String)

    The key for which to retrieve the value.

Returns:



45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 45

def [](key)
  result = execute(
    %{
      SELECT records.key, records.text, records.norm, records.source,
        records.digest, records.tags, embeddings.embedding
      FROM records
      INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
      WHERE records.key = ?
    },
    pre(key)
  )&.first or return
  key, text, norm, source, digest, tags, embedding = *result
  embedding = embedding.unpack("f*")
  tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
  convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
end

#[]=(key, value) ⇒ Object

The []= method sets the value for a given key by inserting it into the database.

Parameters:

  • key (String)

    the key to set

  • value (Hash, Documentrix::Documents::Record)

    the hash or record containing the text, embedding, and other metadata



68
69
70
71
72
73
74
75
76
77
78
79
80
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 68

def []=(key, value)
  value     = convert_value_to_record(value)
  digest    = compute_file_digest(value.source)
  embedding = value.embedding.pack("f*")
  execute(%{BEGIN})
  execute(%{INSERT INTO embeddings(embedding) VALUES(?)}, [ embedding ])
  embedding_id, = execute(%{ SELECT last_insert_rowid() }).flatten
  execute(%{
    INSERT INTO records(key,text,embedding_id,norm,source,digest,tags)
    VALUES(?,?,?,?,?,?,?)
  }, [ pre(key), value.text, embedding_id, value.norm, value.source, digest, JSON(value.tags) ])
  execute(%{COMMIT})
end

#clear_all_with_prefixDocumentrix::Documents::RedisBackedMemoryCache

The clear_all_with_prefix method deletes all records for prefix prefix from the cache by executing a SQL query.

Returns:

  • (Documentrix::Documents::RedisBackedMemoryCache)

    self



159
160
161
162
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 159

def clear_all_with_prefix
  execute(%{DELETE FROM records WHERE key LIKE ?}, [ start_with_prefix ])
  self
end

#clear_by_source(source, digest: nil, operator: ?=) ⇒ self

Removes all records associated with the specified source from the cache.

If a digest is provided, the method will only remove records that do NOT match this digest. This allows for updating a source by wiping old versions while preserving records that are already up-to-date.

Parameters:

  • source (String)

    the source identifier used to filter records

  • digest (String, nil) (defaults to: nil)

    the SHA256 hexadecimal digest of the source. Records matching this digest will be preserved.

Returns:

  • (self)

    the cache instance for method chaining



175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 175

def clear_by_source(source, digest: nil, operator: ?=)
  operator = '!=' if operator != ?=
  if digest
    execute(
      %{
        DELETE FROM records
        WHERE key LIKE ? AND source = ? AND digest #{operator} ?
      },
      [ start_with_prefix, source, digest ]
    )
  else
    execute(
      %{
        DELETE FROM records
        WHERE key LIKE ? AND source = ?
      },
      [ start_with_prefix, source ]
    )
  end
  self
end

#clear_for_tags(tags = nil) ⇒ Documentrix::Documents::Cache::SQLiteCache

The clear_for_tags method clears the cache for specific tags by deleting records that match those tags and have the prefix prefix.

Parameters:

  • tags (Array<String>, NilClass) (defaults to: nil)

    An array of tag names to clear from the cache or nil for all records

Returns:



143
144
145
146
147
148
149
150
151
152
153
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 143

def clear_for_tags(tags = nil)
  tags = Documentrix::Utils::Tags.new(tags).to_a
  if tags.present?
    records = find_records_for_tags(tags)
    keys = '(%s)' % records.transpose.first.map { "'%s'" % quote(_1) }.join(?,)
    execute(%{DELETE FROM records WHERE key IN #{keys}})
  else
    clear_all_with_prefix
  end
  self
end

#convert_to_vector(vector) ⇒ Array

The convert_to_vector method returns the input vector itself, because conversion isn't necessary for this cache class.

Parameters:

  • vector (Array)

    the input vector

Returns:

  • (Array)

    the (not) converted vector



320
321
322
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 320

def convert_to_vector(vector)
  vector
end

#delete(key) ⇒ NilClass

The delete method removes a key from the cache by executing a SQL query.

Parameters:

  • key (String)

    the key to be deleted

Returns:

  • (NilClass)


100
101
102
103
104
105
106
107
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 100

def delete(key)
  result = key?(key)
  execute(
    %{ DELETE FROM records WHERE records.key = ? },
    pre(key)
  )
  result
end

#each(prefix: start_with_prefix) {|key, value| ... } ⇒ Object

The each method iterates over records matching the given prefix and yields them to the block.

Examples:

cache.each do |key, value|
  puts "#{key}: #{value}"
end

Parameters:

  • prefix (String) (defaults to: start_with_prefix)

    the prefix to match

Yields:

  • (key, value)

    where key is the record's key and value is the record itself



284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 284

def each(prefix: start_with_prefix, &block)
  block or return enum_for(__method__, prefix:)

  execute(%{
    SELECT records.key, records.text, records.norm, records.source,
      records.digest, records.tags, embeddings.embedding
    FROM records
    INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
    WHERE records.key LIKE ?
  }, [ prefix ]).each do |key, text, norm, source, digest, tags, embedding|
    embedding = embedding.unpack("f*")
    tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
    value     = convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
    block.(key, value)
  end
  self
end

#each_source {|source| ... } ⇒ Enumerator

Yields each unique, full source present in the cache records.

This is a high-performance override for SQLite that avoids loading embeddings and parsing JSON for every record.

Yields:

  • (source)

    the full source string

Returns:

  • (Enumerator)

    an enumerator if no block is given, nil otherwise.



235
236
237
238
239
240
241
242
243
244
245
246
247
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 235

def each_source(&block)
  block or return enum_for(__method__)

  execute(%{
    SELECT DISTINCT source FROM records
    WHERE key LIKE ? AND source IS NOT NULL
  }, [ start_with_prefix ]).each do |source,|
    source = source.full? or next

    block.(source)
  end
  nil
end

#find_records(needle, tags: nil, max_records: nil, min_similarity: -1)) {|key, value| ... } ⇒ Array<Documentrix::Documents::Record>

The find_records method finds records that match the given needle and tags.

Examples:

documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ], min_similarity: 0.7)

Parameters:

  • needle (Array)

    the embedding vector

  • tags (Array) (defaults to: nil)

    the list of tags to filter by (optional)

  • max_records (Integer) (defaults to: nil)

    the maximum number of records to return (optional)

  • min_similarity (Float) (defaults to: -1))

    the minimum similarity score to include (defaults to -1)

Yields:

  • (key, value)

Returns:

Raises:

  • (ArgumentError)

    if needle size does not match embedding length



366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 366

def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
  needle.size != @embedding_length and
    raise ArgumentError, "needle embedding length != %s" % @embedding_length
  needle_binary = needle.pack("f*")
  max_records   = [ max_records, size, 4_096 ].compact.min
  records = find_records_for_tags(tags)
  rowids_where = '(%s)' % records.transpose.last&.join(?,)
  execute(
    %{
      SELECT records.key, records.text, records.norm, records.source,
        records.digest, records.tags, embeddings.embedding,
        1 - vec_distance_cosine(?, vec_f32(embeddings.embedding)) AS similarity
      FROM records
      INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
      WHERE embeddings.rowid IN #{rowids_where}
        AND embeddings.embedding MATCH ? AND similarity >= ?
        AND embeddings.k = ?
      ORDER BY similarity DESC
    }, [ needle_binary, needle_binary, min_similarity, max_records ]
  ).map do |key, text, norm, source, digest, tags, embedding, similarity|
    key       = unpre(key)
    embedding = embedding.unpack("f*")
    tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
    convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:, similarity:)
  end
end

#find_records_for_tags(tags) ⇒ Array

The find_records_for_tags method filters records based on the provided tags.

Parameters:

  • tags (Array)

    an array of tag names

Returns:

  • (Array)

    an array of filtered records



329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 329

def find_records_for_tags(tags)
  if tags.present?
    tags_filter = Documentrix::Utils::Tags.new(tags).to_a
    unless tags_filter.empty?
      tags_where = ' AND (%s)' % tags_filter.map {
        'tags LIKE "%%%s%%"' % quote(_1)
      }.join(' OR ')
    end
  end
  records = execute(%{
    SELECT key, tags, embedding_id
    FROM records
    WHERE key LIKE ?#{tags_where}
  }, [ start_with_prefix ])
  if tags_filter
    records = records.select { |key, tags, embedding_id|
      (tags_filter & JSON(tags.to_s).to_a).size >= 1
    }
  end
  records
end

#full_each {|key, value| ... } ⇒ Documentrix::Documents::Cache::SQLiteCache

The full_each method iterates over all keys and values in the cache, regardless of their prefix.

Yields:

  • (key, value)

Returns:



308
309
310
311
312
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 308

def full_each(&block)
  block or return enum_for(__method__)

  each(prefix: ?%, &block)
end

#key?(key) ⇒ FalseClass, TrueClass

The key? method checks if the given key exists in the cache by executing a SQL query.

Parameters:

  • key (String)

    the key to check for existence

Returns:

  • (FalseClass, TrueClass)

    true if the key exists, false otherwise



88
89
90
91
92
93
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 88

def key?(key)
  execute(
    %{ SELECT count(records.key) FROM records WHERE records.key = ? },
    pre(key)
  ).flatten.first == 1
end

#move_prefix(old_prefix, new_prefix) ⇒ Documentrix::Documents::Cache::SQLiteCache

Move a key prefix in the cache.

This operation updates every record whose key starts with +old_prefix+, rewriting the prefix to +new_prefix+. It uses SQLite’s built‑in replace() string function, which means the change is atomic and performed entirely inside the database engine—no Ruby‑side iteration or temporary data structures are needed.

Parameters:

  • old_prefix (String)

    the current prefix to replace (must match the beginning of the keys you want to move).

  • new_prefix (String)

    the new prefix that should take its place.

Returns:



263
264
265
266
267
268
269
270
271
272
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 263

def move_prefix(old_prefix, new_prefix)
  execute(
    %{
      UPDATE records
        SET key = replace(key, '#{quote(old_prefix)}', '#{quote(new_prefix)}')
        WHERE key LIKE ?
    },
    old_prefix + '%'
  )
end

#sizeInteger

The size method returns the total number of records stored in the cache, that is the ones with prefix prefix.

Returns:

  • (Integer)

    the count of records



128
129
130
131
132
133
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 128

def size
  execute(
    %{SELECT COUNT(*) FROM records WHERE key LIKE ?},
    [ start_with_prefix ]
  ).flatten.first
end

#source_exist?(source, digest: nil, operator: ?=) ⇒ Boolean

The source_exist? method checks if any records associated with the given source exist in the cache. If a digest is provided, it verifies if the source exists and matches the specified digest using the provided operator.

Parameters:

  • source (#to_s)

    the source to check for existence

  • digest (String, nil) (defaults to: nil)

    the SHA256 hexadecimal digest to compare against the stored source digest (optional)

  • operator (String) (defaults to: ?=)

    the operator to use for comparison ('=' or '!='). Defaults to '='.

Returns:

  • (Boolean)

    true if the source exists (and matches the digest condition if provided), false otherwise.



209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 209

def source_exist?(source, digest: nil, operator: ?=)
  operator = '!=' if operator != ?=
  if digest
    !!execute(
      %{
        SELECT 1 FROM records WHERE key LIKE ? AND source = ? AND digest #{operator} ?
      },
      [ start_with_prefix, source, digest ]
    ).first
  else
    !!execute(
      %{
        SELECT 1 FROM records WHERE key LIKE ? AND source = ?
      },
      [ start_with_prefix, source ]
    ).first
  end
end

#tagsDocumentrix::Utils::Tags

The tags method returns an array of unique tags from the database.

Returns:

  • (Documentrix::Utils::Tags)

    An instance of Documentrix::Utils::Tags containing all unique tags found in the database.



113
114
115
116
117
118
119
120
121
122
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 113

def tags
  result = Documentrix::Utils::Tags.new
  execute(%{
      SELECT DISTINCT(tags) FROM records WHERE key LIKE ?
    }, [ start_with_prefix ]
  ).flatten.each do
    JSON(_1).each { |t| result.add(t) }
  end
  result
end