Class: Documentrix::Documents::Cache::SQLiteCache

Inherits:
Object
  • Object
show all
Includes:
Common
Defined in:
lib/documentrix/documents/cache/sqlite_cache.rb

Overview

SQLiteCache is a cache implementation that uses SQLite database for storing document embeddings and related metadata.

This class provides a persistent cache storage solution for document embeddings, leveraging SQLite's capabilities to store both the embedding vectors and associated text data, tags, and source information. It supports efficient vector similarity searches using the sqlite_vec extension for fast nearest neighbor queries.

Instance Attribute Summary collapse

Attributes included from Common

#prefix

Instance Method Summary collapse

Methods included from Common

#clear, #pre, #unpre

Methods included from Utils::Math

#cosine_similarity, #norm

Constructor Details

#initialize(prefix:, embedding_length: 1_024, filename: ':memory:', debug: false, busy_timeout: 5000) ⇒ void

The initialize method sets up the cache by calling super and setting various instance variables.

Parameters:

  • prefix (String)

    the prefix for keys

  • embedding_length (Integer) (defaults to: 1_024)

    the length of the embeddings vector

  • filename (String) (defaults to: ':memory:')

    the name of the SQLite database file or ':memory:' for in-memory.

  • debug (FalseClass, TrueClass) (defaults to: false)

    whether to enable debugging

  • busy_timeout (Integer) (defaults to: 5000)

    the SQLite busy timeout in milliseconds (defaults to 5000)



27
28
29
30
31
32
33
34
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 27

def initialize(prefix:, embedding_length: 1_024, filename: ':memory:', debug: false, busy_timeout: 5000)
  super(prefix:)
  @embedding_length = embedding_length
  @filename         = filename
  @debug            = debug
  @busy_timeout     = busy_timeout
  setup_database(filename)
end

Instance Attribute Details

#embedding_lengthObject (readonly)

length of the embeddings vector



38
39
40
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 38

def embedding_length
  @embedding_length
end

#filenameObject (readonly)

filename for the database, :memory: is in memory



36
37
38
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 36

def filename
  @filename
end

Instance Method Details

#[](key) ⇒ Documentrix::Documents::Record, NilClass

The method retrieves the value associated with the given key from the cache.

Parameters:

  • key (String)

    The key for which to retrieve the value.

Returns:



47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 47

def [](key)
  result = execute(
    %{
      SELECT records.key, records.text, records.norm, records.source,
        records.digest, records.tags, embeddings.embedding
      FROM records
      INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
      WHERE records.key = ?
    },
    pre(key)
  )&.first or return
  key, text, norm, source, digest, tags, embedding = *result
  embedding = embedding.unpack("f*")
  tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
  convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
end

#[]=(key, value) ⇒ Object

The []= method sets the value for a given key by inserting it into the database.

Parameters:

  • key (String)

    the key to set

  • value (Hash, Documentrix::Documents::Record)

    the hash or record containing the text, embedding, and other metadata



70
71
72
73
74
75
76
77
78
79
80
81
82
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 70

def []=(key, value)
  value     = convert_value_to_record(value)
  digest    = compute_file_digest(value.source)
  embedding = value.embedding.pack("f*")
  execute(%{BEGIN IMMEDIATE})
  execute(%{INSERT INTO embeddings(embedding) VALUES(?)}, [ embedding ])
  embedding_id, = execute(%{ SELECT last_insert_rowid() }).flatten
  execute(%{
    INSERT INTO records(key,text,embedding_id,norm,source,digest,tags)
    VALUES(?,?,?,?,?,?,?)
  }, [ pre(key), value.text, embedding_id, value.norm, value.source, digest, JSON(value.tags) ])
  execute(%{COMMIT})
end

#clear_all_with_prefixDocumentrix::Documents::RedisBackedMemoryCache

The clear_all_with_prefix method deletes all records for prefix prefix from the cache by executing a SQL query.

Returns:

  • (Documentrix::Documents::RedisBackedMemoryCache)

    self



177
178
179
180
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 177

def clear_all_with_prefix
  execute(%{DELETE FROM records WHERE key LIKE ?}, [ start_with_prefix ])
  self
end

#clear_by_source(source, digest: nil, operator: ?=) ⇒ self

Removes all records associated with the specified source from the cache.

If a digest is provided, the method will only remove records that do NOT match this digest. This allows for updating a source by wiping old versions while preserving records that are already up-to-date.

Parameters:

  • source (String)

    the source identifier used to filter records

  • digest (String, nil) (defaults to: nil)

    the SHA256 hexadecimal digest of the source. Records matching this digest will be preserved.

  • operator (String) (defaults to: ?=)

    the operator to use for comparison ('=' or '!='). Defaults to '='.

Returns:

  • (self)

    the cache instance for method chaining



195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 195

def clear_by_source(source, digest: nil, operator: ?=)
  operator = '!=' if operator != ?=
  if digest
    execute(
      %{
        DELETE FROM records
        WHERE key LIKE ? AND source = ? AND digest #{operator} ?
      },
      [ start_with_prefix, source, digest ]
    )
  else
    execute(
      %{
        DELETE FROM records
        WHERE key LIKE ? AND source = ?
      },
      [ start_with_prefix, source ]
    )
  end
  self
end

#clear_for_tags(tags = nil) ⇒ Documentrix::Documents::Cache::SQLiteCache

The clear_for_tags method clears the cache for specific tags by deleting records that match those tags and have the prefix prefix.

Parameters:

  • tags (Array<String>, NilClass) (defaults to: nil)

    An array of tag names to clear from the cache or nil for all records

Returns:



161
162
163
164
165
166
167
168
169
170
171
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 161

def clear_for_tags(tags = nil)
  tags = Documentrix::Utils::Tags.new(tags).to_a
  if tags.present?
    records = find_records_for_tags(tags)
    keys = '(%s)' % records.transpose.first.map { "'%s'" % quote(_1) }.join(?,)
    execute(%{DELETE FROM records WHERE key IN #{keys}})
  else
    clear_all_with_prefix
  end
  self
end

#collections(prefix) ⇒ Array<Symbol>

Returns an array of collection names that match the given prefix. This is a high-performance override for SQLite that only queries keys.

Parameters:

  • prefix (String)

    the prefix to search for in collection names

Returns:

  • (Array<Symbol>)

    an array of matching collection names



116
117
118
119
120
121
122
123
124
125
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 116

def collections(prefix)
  execute(
    %{ SELECT DISTINCT key FROM records WHERE key LIKE ? },
    [ "#{prefix}%" ]
  ).flatten.each_with_object(Set.new) do |key, set|
    if key =~ /\A#{prefix}(.+)-/
      set << $1.to_sym
    end
  end.to_a
end

#convert_to_vector(vector) ⇒ Array

The convert_to_vector method returns the input vector itself, because conversion isn't necessary for this cache class.

Parameters:

  • vector (Array)

    the input vector

Returns:

  • (Array)

    the (not) converted vector



340
341
342
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 340

def convert_to_vector(vector)
  vector
end

#delete(key) ⇒ NilClass

The delete method removes a key from the cache by executing a SQL query.

Parameters:

  • key (String)

    the key to be deleted

Returns:

  • (NilClass)


102
103
104
105
106
107
108
109
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 102

def delete(key)
  result = key?(key)
  execute(
    %{ DELETE FROM records WHERE records.key = ? },
    pre(key)
  )
  result
end

#each(prefix: start_with_prefix) {|key, value| ... } ⇒ Object

The each method iterates over records matching the given prefix and yields them to the block.

Examples:

cache.each do |key, value|
  puts "#{key}: #{value}"
end

Parameters:

  • prefix (String) (defaults to: start_with_prefix)

    the prefix to match

Yields:

  • (key, value)

    where key is the record's key and value is the record itself



304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 304

def each(prefix: start_with_prefix, &block)
  block or return enum_for(__method__, prefix:)

  execute(%{
    SELECT records.key, records.text, records.norm, records.source,
      records.digest, records.tags, embeddings.embedding
    FROM records
    INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
    WHERE records.key LIKE ?
  }, [ prefix ]).each do |key, text, norm, source, digest, tags, embedding|
    embedding = embedding.unpack("f*")
    tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
    value     = convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
    block.(key, value)
  end
  self
end

#each_source {|source| ... } ⇒ Enumerator

Yields each unique, full source present in the cache records.

This is a high-performance override for SQLite that avoids loading embeddings and parsing JSON for every record.

Yields:

  • (source)

    the full source string

Returns:

  • (Enumerator)

    an enumerator if no block is given, nil otherwise.



255
256
257
258
259
260
261
262
263
264
265
266
267
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 255

def each_source(&block)
  block or return enum_for(__method__)

  execute(%{
    SELECT DISTINCT source FROM records
    WHERE key LIKE ? AND source IS NOT NULL
  }, [ start_with_prefix ]).each do |source,|
    source = source.full? or next

    block.(source)
  end
  nil
end

#find_records(needle, tags: nil, max_records: nil, min_similarity: -1)) {|key, value| ... } ⇒ Array<Documentrix::Documents::Record>

The find_records method finds records that match the given needle and tags.

Examples:

documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ], min_similarity: 0.7)

Parameters:

  • needle (Array)

    the embedding vector

  • tags (Array) (defaults to: nil)

    the list of tags to filter by (optional)

  • max_records (Integer) (defaults to: nil)

    the maximum number of records to return (optional)

  • min_similarity (Float) (defaults to: -1))

    the minimum similarity score to include (defaults to -1)

Yields:

  • (key, value)

Returns:

Raises:

  • (ArgumentError)

    if needle size does not match embedding length



386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 386

def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
  needle.size != @embedding_length and
    raise ArgumentError, "needle embedding length != %s" % @embedding_length
  needle_binary = needle.pack("f*")
  max_records   = [ max_records, size, 4_096 ].compact.min
  records = find_records_for_tags(tags)
  rowids_where = '(%s)' % records.transpose.last&.join(?,)
  execute(
    %{
      SELECT records.key, records.text, records.norm, records.source,
        records.digest, records.tags, embeddings.embedding,
        1 - vec_distance_cosine(?, vec_f32(embeddings.embedding)) AS similarity
      FROM records
      INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
      WHERE embeddings.rowid IN #{rowids_where}
        AND embeddings.embedding MATCH ? AND similarity >= ?
        AND embeddings.k = ?
      ORDER BY similarity DESC
    }, [ needle_binary, needle_binary, min_similarity, max_records ]
  ).map do |key, text, norm, source, digest, tags, embedding, similarity|
    key       = unpre(key)
    embedding = embedding.unpack("f*")
    tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
    convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:, similarity:)
  end
end

#find_records_for_tags(tags) ⇒ Array

The find_records_for_tags method filters records based on the provided tags.

Parameters:

  • tags (Array)

    an array of tag names

Returns:

  • (Array)

    an array of filtered records



349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 349

def find_records_for_tags(tags)
  if tags.present?
    tags_filter = Documentrix::Utils::Tags.new(tags).to_a
    unless tags_filter.empty?
      tags_where = ' AND (%s)' % tags_filter.map {
        'tags LIKE "%%%s%%"' % quote(_1)
      }.join(' OR ')
    end
  end
  records = execute(%{
    SELECT key, tags, embedding_id
    FROM records
    WHERE key LIKE ?#{tags_where}
  }, [ start_with_prefix ])
  if tags_filter
    records = records.select { |key, tags, embedding_id|
      (tags_filter & JSON(tags.to_s).to_a).size >= 1
    }
  end
  records
end

#full_each {|key, value| ... } ⇒ Documentrix::Documents::Cache::SQLiteCache

The full_each method iterates over all keys and values in the cache, regardless of their prefix.

Yields:

  • (key, value)

Returns:



328
329
330
331
332
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 328

def full_each(&block)
  block or return enum_for(__method__)

  each(prefix: ?%, &block)
end

#key?(key) ⇒ FalseClass, TrueClass

The key? method checks if the given key exists in the cache by executing a SQL query.

Parameters:

  • key (String)

    the key to check for existence

Returns:

  • (FalseClass, TrueClass)

    true if the key exists, false otherwise



90
91
92
93
94
95
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 90

def key?(key)
  execute(
    %{ SELECT count(records.key) FROM records WHERE records.key = ? },
    pre(key)
  ).flatten.first == 1
end

#move_prefix(old_prefix, new_prefix) ⇒ Documentrix::Documents::Cache::SQLiteCache

Move a key prefix in the cache.

This operation updates every record whose key starts with +old_prefix+, rewriting the prefix to +new_prefix+. It uses SQLite’s built‑in replace() string function, which means the change is atomic and performed entirely inside the database engine—no Ruby‑side iteration or temporary data structures are needed.

Parameters:

  • old_prefix (String)

    the current prefix to replace (must match the beginning of the keys you want to move).

  • new_prefix (String)

    the new prefix that should take its place.

Returns:



283
284
285
286
287
288
289
290
291
292
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 283

def move_prefix(old_prefix, new_prefix)
  execute(
    %{
      UPDATE records
        SET key = replace(key, '#{quote(old_prefix)}', '#{quote(new_prefix)}')
        WHERE key LIKE ?
    },
    old_prefix + '%'
  )
end

#sizeInteger

The size method returns the total number of records stored in the cache, that is the ones with prefix prefix.

Returns:

  • (Integer)

    the count of records



146
147
148
149
150
151
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 146

def size
  execute(
    %{SELECT COUNT(*) FROM records WHERE key LIKE ?},
    [ start_with_prefix ]
  ).flatten.first
end

#source_exist?(source, digest: nil, operator: ?=) ⇒ Boolean

The source_exist? method checks if any records associated with the given source exist in the cache. If a digest is provided, it verifies if the source exists and matches the specified digest using the provided operator.

Parameters:

  • source (#to_s)

    the source to check for existence

  • digest (String, nil) (defaults to: nil)

    the SHA256 hexadecimal digest to compare against the stored source digest (optional)

  • operator (String) (defaults to: ?=)

    the operator to use for comparison ('=' or '!='). Defaults to '='.

Returns:

  • (Boolean)

    true if the source exists (and matches the digest condition if provided), false otherwise.



229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 229

def source_exist?(source, digest: nil, operator: ?=)
  operator = '!=' if operator != ?=
  if digest
    !!execute(
      %{
        SELECT 1 FROM records WHERE key LIKE ? AND source = ? AND digest #{operator} ?
      },
      [ start_with_prefix, source, digest ]
    ).first
  else
    !!execute(
      %{
        SELECT 1 FROM records WHERE key LIKE ? AND source = ?
      },
      [ start_with_prefix, source ]
    ).first
  end
end

#tagsDocumentrix::Utils::Tags

The tags method returns an array of unique tags from the database.

Returns:

  • (Documentrix::Utils::Tags)

    An instance of Documentrix::Utils::Tags containing all unique tags found in the database.



131
132
133
134
135
136
137
138
139
140
# File 'lib/documentrix/documents/cache/sqlite_cache.rb', line 131

def tags
  result = Documentrix::Utils::Tags.new
  execute(%{
      SELECT DISTINCT(tags) FROM records WHERE key LIKE ?
    }, [ start_with_prefix ]
  ).flatten.each do
    JSON(_1).each { |t| result.add(t) }
  end
  result
end