Class: Documentrix::Documents

Inherits:
Object
Includes:
Cache, Utils::Digests, Kramdown::ANSI::Width
Defined in:
lib/documentrix/documents.rb

Overview

Documentrix::Documents is a class that provides functionality for building and querying vector databases for natural language processing and large language model applications.

It allows users to store and retrieve dense vector embeddings for text strings, supporting various cache backends including memory, Redis, and SQLite for efficient data management.

The class handles the complete workflow of adding documents, computing their embeddings using a specified model, storing them in a cache, and performing similarity searches to find relevant documents based on query text.

Examples:

documents = Documentrix::Documents.new(
  ollama: ollama_client,
  model: 'mxbai-embed-large',
  collection: 'my-collection'
)
documents.add(['text1', 'text2'])
results = documents.find('search query')

Defined Under Namespace

Modules: Cache, Splitters

Classes: MemoryCache, RedisCache

Constant Summary

Record =

Shortcut for Documentrix::Documents::Cache::Records::Record

Class.new Documentrix::Documents::Cache::Records::Record

Constructor Details

#initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false) ⇒ Documents

The initialize method sets up the Documentrix::Documents instance by configuring its components.

Parameters:

  • ollama (Ollama::Client)

    the client used for embedding

  • model (String)

    the name of the model to use for embeddings

  • model_options (Hash) (defaults to: nil)

    optional parameters for the model

  • collection (Symbol) (defaults to: nil)

    the default collection to use (defaults to :default)

  • embedding_length (Integer) (defaults to: 1_024)

    the length of the embeddings (defaults to 1024)

  • cache (Documentrix::Cache) (defaults to: MemoryCache)

    the cache to use for storing documents (defaults to MemoryCache)

  • database_filename (String) (defaults to: nil)

    the filename of the SQLite database to use (defaults to ':memory:')

  • redis_url (String) (defaults to: nil)

    the URL of the Redis server to use (defaults to nil)

  • debug (FalseClass, TrueClass) (defaults to: false)

    whether to enable debugging mode (defaults to false)



# File 'lib/documentrix/documents.rb', line 79

def initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false)
  collection ||= default_collection
  @ollama, @model, @model_options, @collection, @debug =
    ollama, model, model_options, collection.to_sym, debug
  database_filename ||= ':memory:'
  @cache = connect_cache(cache, redis_url, embedding_length, database_filename)
end

Instance Attribute Details

#cache ⇒ Object (readonly)

Returns the value of attribute cache.



# File 'lib/documentrix/documents.rb', line 94

def cache
  @cache
end

#collection ⇒ Object

Returns the value of attribute collection.



# File 'lib/documentrix/documents.rb', line 94

def collection
  @collection
end

#model ⇒ Object (readonly)

Returns the value of attribute model.



# File 'lib/documentrix/documents.rb', line 94

def model
  @model
end

#ollama ⇒ Object (readonly)

Returns the value of attribute ollama.



# File 'lib/documentrix/documents.rb', line 94

def ollama
  @ollama
end

Instance Method Details

#[](text) ⇒ Object

The [] method retrieves the value associated with the given text from the cache.

Parameters:

  • text (String)

    the text for which to retrieve the cached value

Returns:

  • (Object)

    the cached value, or nil if not found



# File 'lib/documentrix/documents.rb', line 175

def [](text)
  @cache[key(text)]
end

#[]=(text, record) ⇒ Object

The []= method sets the value for a given text in the cache.

Parameters:

  • text (String)

    the text to set

  • record (Hash)

    the value to store



# File 'lib/documentrix/documents.rb', line 183

def []=(text, record)
  @cache[key(text)] = record
end
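Both accessors delegate to a private key(text) helper that is not shown on this page. The following is a minimal sketch of the idea, assuming the key is a collection prefix followed by a SHA256 digest of the text. The class name TextKeyedStore, the prefix string, and the key scheme are all hypothetical and are not taken from the Documentrix source.

```ruby
require 'digest'

# Hypothetical stand-in for a text-keyed cache; not the Documentrix implementation.
class TextKeyedStore
  def initialize(prefix)
    @prefix = prefix
    @data   = {}
  end

  # Look up the record stored for the given text, or nil if absent.
  def [](text)
    @data[key(text)]
  end

  # Store the record under the key derived from the text.
  def []=(text, record)
    @data[key(text)] = record
  end

  # True if a record was stored for exactly this text.
  def exist?(text)
    @data.key?(key(text))
  end

  private

  # Assumed key scheme: prefix followed by the SHA256 hex digest of the text.
  def key(text)
    @prefix + Digest::SHA256.hexdigest(text)
  end
end

store = TextKeyedStore.new('Documents-default-')
store['hello'] = { text: 'hello' }
```

Deriving the key from a digest keeps key length fixed regardless of how long the stored text is.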

#add(texts, batch_size: nil, source: nil, tags: [], digest: nil) ⇒ Documentrix::Documents Also known as: <<

The add method adds new texts to the documents collection. It first filters out texts that already exist using the prepare_texts method, then fetches embeddings for the remaining texts using the configured model and options. Each embedding is stored in the cache as a new record together with the original text, the source, the tags, and the version digest (if any).

Texts are processed in batches of size batch_size, with progress information displayed in the console. The method also accepts an optional source string to associate with the added texts, an array of tags to attach to each record, and an optional digest string for version tracking. Once all texts have been processed, it returns the Documentrix::Documents instance itself, allowing for method chaining.

Examples:

documents.add(%w[ foo bar ], batch_size: 23, source: 'https://example.com', tags: %w[tag1 tag2])

Parameters:

  • texts (Array)

    an array of input texts

  • batch_size (Integer) (defaults to: nil)

    the number of texts to process in one batch (10 if not given)

  • source (String) (defaults to: nil)

    the source URL or file path for the added texts

  • tags (Array) (defaults to: [])

    an array of tags associated with the added texts

  • digest (String, nil) (defaults to: nil)

    the SHA256 hexadecimal digest of the source

Returns:

  • (Documentrix::Documents)

    the instance itself, allowing for method chaining



# File 'lib/documentrix/documents.rb', line 143

def add(texts, batch_size: nil, source: nil, tags: [], digest: nil)
  texts    = prepare_texts(texts) or return self
  source   = normalize_source(source)
  tags     = Documentrix::Utils::Tags.new(tags, source:)
  digest ||= compute_file_digest(source)
  if source
    tags.add(File.basename(source).gsub(/\?.*/, ''), source:)
  end
  batches = texts.each_slice(batch_size || 10).
    with_infobar(
      label: "Add #{truncate(tags.to_s(link: false), percentage: 25)}",
      total: texts.size
    )
  batches.each do |batch|
    embeddings = fetch_embeddings(model:, options: @model_options, input: batch)
    batch.zip(embeddings) do |text, embedding|
      norm       = @cache.norm(embedding)
      self[text] = Record[text:, embedding:, norm:, source:, tags: tags.to_a, digest:]
    end
    infobar.progress by: batch.size
  end
  infobar.newline
  invalidate_collections_cache!
end
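The batching stage can be sketched without an embedding server. This sketch assumes a plain Hash as the store, a fake embedder lambda in place of the model call, and a Euclidean norm for the value the cache's norm method returns. The names add_in_batches and fake_embed are hypothetical.

```ruby
# Hypothetical embedder: maps each text to a tiny fake vector.
# A real setup would call the embedding model instead.
fake_embed = ->(batch) { batch.map { |text| [text.size.to_f, text.count('aeiou').to_f] } }

# Skip texts already in the store, then embed and store the rest batch by batch.
def add_in_batches(store, texts, embed, batch_size: nil)
  new_texts = texts.uniq.reject { |text| store.key?(text) }
  new_texts.each_slice(batch_size || 10) do |batch|
    embeddings = embed.call(batch) # one call per batch, not per text
    batch.zip(embeddings) do |text, embedding|
      norm = Math.sqrt(embedding.sum { |x| x * x }) # assumed: Euclidean norm
      store[text] = { text: text, embedding: embedding, norm: norm }
    end
  end
  store
end

store = { 'foo' => { text: 'foo', embedding: [3.0, 2.0], norm: Math.sqrt(13) } }
add_in_batches(store, %w[foo bar quux], fake_embed, batch_size: 2)
```

Storing the norm next to each embedding means a later similarity search does not have to recompute it for every stored vector.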

#clear(tags: nil) ⇒ Documentrix::Documents

The clear method clears all texts from the cache or, if tags was given, only the ones tagged with those tags.

Parameters:

  • tags (NilClass, Array<String>) (defaults to: nil)

    the tag names to filter by

Returns:

  • (Documentrix::Documents)



# File 'lib/documentrix/documents.rb', line 223

def clear(tags: nil)
  @cache.clear(tags:)
  invalidate_collections_cache!
end

#collections ⇒ Array

The collections method returns an array of unique collection names, always including the default collection.

Returns:

  • (Array)

    An array of unique collection names



# File 'lib/documentrix/documents.rb', line 383

def collections
  @collections_cache ||= (
    [ default_collection ] + @cache.collections('%s-' % class_prefix)
  ).uniq
end
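The listing memoizes the collection list in @collections_cache, and the mutating methods on this page call invalidate_collections_cache! afterwards. A minimal sketch of that pattern follows, assuming keys of the form "Prefix-collection-rest" with no hyphen inside the collection name. The class CollectionIndex and its scans counter are hypothetical and exist only to make the caching visible.

```ruby
# Hypothetical sketch of memoizing a derived list and invalidating it on writes.
class CollectionIndex
  attr_reader :scans

  def initialize(keys)
    @keys  = keys
    @scans = 0
  end

  # Memoized: the key scan only runs when no cached list is present.
  def collections
    @collections_cache ||= begin
      @scans += 1
      ([:default] + @keys.map { |k| k.split('-')[1].to_sym }).uniq
    end
  end

  # Any write drops the cached list so the next read rescans the keys.
  def add_key(key)
    @keys << key
    invalidate_collections_cache!
  end

  private

  def invalidate_collections_cache!
    @collections_cache = nil
  end
end

index = CollectionIndex.new(['Documents-default-abc', 'Documents-notes-def'])
index.collections
index.collections
```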

#default_collection ⇒ :default

The default_collection method returns the default collection name.

Returns:

  • (:default)

    The default collection name.



# File 'lib/documentrix/documents.rb', line 90

def default_collection
  :default
end

#delete(text) ⇒ FalseClass, TrueClass

The delete method removes the specified text from the cache by calling the delete method on the underlying cache object.

Parameters:

  • text (String)

    the text for which to remove the value

Returns:

  • (FalseClass, TrueClass)

    true if the text was removed, false otherwise.



# File 'lib/documentrix/documents.rb', line 203

def delete(text)
  res = @cache.delete(key(text))
  invalidate_collections_cache! if res
  res
end

#exist?(text) ⇒ FalseClass, TrueClass

The exist? method checks if the given text exists in the cache.

Parameters:

  • text (String)

    the text to check for existence

Returns:

  • (FalseClass, TrueClass)

    true if the text exists, false otherwise.



# File 'lib/documentrix/documents.rb', line 192

def exist?(text)
  @cache.key?(key(text))
end

#find(string, tags: nil, prompt: nil, max_records: nil, min_similarity: nil) ⇒ Array<Documentrix::Documents::Record>

The find method searches the cache for records similar to the given string. It converts the string into an embedding vector and computes similarity scores against the stored embeddings.

Examples:

documents.find("foo")

Parameters:

  • string (String)

    the input string

  • tags (Array<String>) (defaults to: nil)

    an array of tags to filter results by (optional)

  • prompt (String) (defaults to: nil)

    a prompt to use when searching for similar strings (optional)

  • max_records (Integer) (defaults to: nil)

    the maximum number of records to return (optional)

  • min_similarity (Numeric) (defaults to: nil)

    the minimum similarity score to include in results (defaults to -1)

Returns:

  • (Array<Documentrix::Documents::Record>)



# File 'lib/documentrix/documents.rb', line 339

def find(string, tags: nil, prompt: nil, max_records: nil, min_similarity: nil)
  min_similarity ||= -1
  needle = convert_to_vector(string, prompt:)
  @cache.find_records(needle, tags:, max_records:, min_similarity:)
end
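The actual scoring happens inside the cache's find_records method, which is not shown here. The sketch below assumes cosine similarity over plain Ruby arrays; the helper names cosine_similarity and find_similar are hypothetical. Under that assumption, the default min_similarity of -1 excludes nothing, because cosine similarity always lies between -1 and 1.

```ruby
# Cosine similarity of two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Hypothetical in-memory search: score every record, filter, sort best first.
def find_similar(records, needle, max_records: nil, min_similarity: -1)
  scored = records.map { |r| r.merge(similarity: cosine_similarity(r[:embedding], needle)) }
  hits   = scored.select { |r| r[:similarity] >= min_similarity }
                 .sort_by { |r| -r[:similarity] }
  max_records ? hits.first(max_records) : hits
end

records = [
  { text: 'east',  embedding: [1.0, 0.0] },
  { text: 'north', embedding: [0.0, 1.0] },
  { text: 'west',  embedding: [-1.0, 0.0] },
]
results = find_similar(records, [1.0, 0.0], min_similarity: 0.0)
```

With min_similarity: 0.0 the opposite-pointing 'west' vector is dropped, while the default keeps all three records.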

#find_where(string, text_size: nil, text_count: nil, **opts) ⇒ Array<Documentrix::Documents::Record>

The find_where method filters the records returned by find based on text size and count.

Examples:

documents.find_where('foo', text_size: 1000, text_count: 5, tags: ['ruby'])

Parameters:

  • string (String)

    the search query

  • text_size (Integer) (defaults to: nil)

    the maximum allowed total text size to return

  • text_count (Integer) (defaults to: nil)

    the maximum number of records to return

  • opts (Hash)

    additional options passed to #find, such as:

    • :tags [Array] filter results by tags
    • :prompt [String] prompt to use for the search
    • :min_similarity [Numeric] minimum similarity score

Returns:

  • (Array<Documentrix::Documents::Record>)



# File 'lib/documentrix/documents.rb', line 371

def find_where(string, text_size: nil, text_count: nil, **opts)
  text_count and opts[:max_records] =  text_count
  records = find(string, **opts)
  size    = 0
  records.take_while do |record|
    !text_size || (size += record.text.size) <= text_size
  end
end
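The size limit in the listing is a running total checked inside take_while. The sketch below isolates that step, with a hypothetical Rec struct standing in for the record class and a hypothetical helper name take_within_budget.

```ruby
# Hypothetical minimal record type with only the text attribute.
Rec = Struct.new(:text)

# Keep leading records while the running total of text sizes stays within the budget.
def take_within_budget(records, text_size)
  size = 0
  records.take_while do |record|
    !text_size || (size += record.text.size) <= text_size
  end
end

records = [Rec.new('aaaa'), Rec.new('bbbbbb'), Rec.new('cc')]
within_ten  = take_within_budget(records, 10)
within_nine = take_within_budget(records, 9)
```

Because take_while stops at the first record that pushes the total over the limit, later records are never considered even if they would fit. With a limit of 9, only 'aaaa' is returned although 'cc' alone would still fit.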

#normalize_source(source) ⇒ String?

Normalizes the source identifier to a canonical form.

If the source is blank, returns nil. If the source is an absolute URL, it is returned as-is. If the source is a local file path that exists, it is expanded to its real absolute path, with all symlinks resolved. Otherwise, the original source is returned.

Parameters:

  • source (String, #to_s)

    the source identifier to normalize

Returns:

  • (String, nil)

    the normalized canonical path, the original source, or nil if blank



# File 'lib/documentrix/documents.rb', line 239

def normalize_source(source)
  source.blank? and return
  begin
    URI::PARSER.parse(source).absolute? and return source
  rescue
  end
  Pathname.new(source).realpath.to_path
rescue Errno::ENOENT
  source
end
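The same rules can be tried out with only the standard library. This sketch is an approximation of the listing rather than a copy: it uses URI.parse and an explicit emptiness check in place of the listing's URI::PARSER and blank?, and the method name normalize_source_sketch is hypothetical.

```ruby
require 'uri'
require 'pathname'

# Sketch of the normalization rules described above, using only the standard library.
def normalize_source_sketch(source)
  source = source.to_s
  return nil if source.strip.empty?
  begin
    # A URI with a scheme (https:, file:, ...) counts as absolute and is kept as-is.
    return source if URI.parse(source).absolute?
  rescue URI::InvalidURIError
    # Not parseable as a URI; fall through and treat it as a file path.
  end
  Pathname.new(source).realpath.to_path
rescue Errno::ENOENT
  # The path does not exist, so there is nothing to resolve.
  source
end
```

Normalizing to the real path means the same file reached through a relative path or a symlink maps to one source identifier.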

#rename_collection(new_collection) ⇒ Documentrix::Documents

Renames the current collection, moving all keys from the old prefix to a new one. After the rename the instance’s collection attribute points to new_collection, and the cache keys are updated accordingly. Raises an ArgumentError if a collection named new_collection already exists.

Parameters:

  • new_collection (Symbol)

    The name of the collection to rename to.

Returns:

  • (Documentrix::Documents)



# File 'lib/documentrix/documents.rb', line 395

def rename_collection(new_collection)
  new_collection = new_collection.to_sym
  collections.member?(new_collection) and
    raise ArgumentError, "new collection #{new_collection} already exists!"
  new_prefix = '%s-%s-' % [ class_prefix, new_collection ]
  @cache.move_prefix(prefix, new_prefix)
  self.collection = new_collection
  invalidate_collections_cache!
end
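The rename relies on the cache's move_prefix method, whose implementation depends on the backend and is not shown here. As a rough sketch of what moving a prefix means, assume a plain Hash as the store; the key names and the 'Documents' class prefix below are hypothetical.

```ruby
# Hypothetical prefix move over a plain Hash: rewrite every key that starts
# with old_prefix so that it starts with new_prefix instead.
def move_prefix(store, old_prefix, new_prefix)
  store.keys.each do |key| # keys returns a copy, so mutating the hash is safe
    next unless key.start_with?(old_prefix)
    store[new_prefix + key.delete_prefix(old_prefix)] = store.delete(key)
  end
  store
end

store = {
  'Documents-default-k1' => 'record 1',
  'Documents-default-k2' => 'record 2',
  'Documents-other-k3'   => 'record 3',
}
move_prefix(store, 'Documents-default-', 'Documents-archive-')
```

Keys under other prefixes are left alone, which is why the listing refuses to rename onto an existing collection: the two key sets would otherwise merge.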

#size ⇒ Integer

The size method returns the number of texts stored in the cache of this Documentrix::Documents instance.

Returns:

  • (Integer)

    The total count of cached texts.



# File 'lib/documentrix/documents.rb', line 213

def size
  @cache.size
end

#source_exist?(source, digest: nil, operator: ?=) ⇒ Boolean

The source_exist? method checks if any records associated with the given source exist in the cache. If a digest is provided, it verifies if the source exists and satisfies the comparison with the specified digest.

Parameters:

  • source (#to_s)

    the source to check for existence

  • digest (String, nil) (defaults to: nil)

    the SHA256 hexadecimal digest to compare against the stored source digest (optional)

  • operator (Symbol, String) (defaults to: ?=)

    the operator to compare the digest with (defaults to '=')

Returns:

  • (Boolean)

    true if the source exists (and satisfies the digest comparison if provided), false otherwise.



# File 'lib/documentrix/documents.rb', line 262

def source_exist?(source, digest: nil, operator: ?=)
  source = normalize_source(source)
  @cache.source_exist?(source, digest:, operator:)
end

#source_modified?(source) ⇒ Boolean

Checks if the content of the given source has been modified compared to the version stored in the cache, or if it is missing from the cache.

The source is considered modified (the method returns true) if:

  1. The source is blank or cannot be normalized.
  2. The source is not a valid local file or its digest cannot be computed.
  3. No records exist in the cache for this source.
  4. Records exist in the cache for this source, but they have a different digest than the current version on disk.

Parameters:

  • source (String, #to_s)

    the source identifier to check

Returns:

  • (Boolean)

    true if the source is modified, missing, or cannot be verified, false if it is up-to-date.



# File 'lib/documentrix/documents.rb', line 280

def source_modified?(source)
  source = normalize_source(source) or return true
  digest = compute_file_digest(source) or return true
  !source_exist?(source) || source_exist?(source, digest:, operator: '!=')
end
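The digest comparison behind these rules can be sketched with a Hash that maps each source path to the digest recorded when it was added. The Hash index and the method name modified? are hypothetical, and the digest is assumed to be a SHA256 of the file contents, which matches the parameter descriptions on this page but not necessarily the private compute_file_digest helper.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical index mapping a source path to the digest recorded when it was added.
def modified?(index, path)
  return true unless File.file?(path) # cannot be verified: treat as modified
  current = Digest::SHA256.file(path).hexdigest
  !index.key?(path) || index[path] != current
end

file = Tempfile.new('source')
file.write('version 1')
file.flush
index = {}
before_add = modified?(index, file.path) # no record for this source yet
index[file.path] = Digest::SHA256.file(file.path).hexdigest
after_add = modified?(index, file.path)  # stored digest matches the file
file.write(' plus an edit')
file.flush
after_edit = modified?(index, file.path) # stored digest no longer matches
file.close!
```

Treating every unverifiable case as modified errs on the side of re-adding a source rather than silently serving stale records.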

#source_remove(source, digest: nil) ⇒ Documentrix::Documents

The source_remove method removes all documents associated with the given source.

Parameters:

  • source (#to_s)

    the source of the documents to remove

  • digest (String, nil) (defaults to: nil)

    if given, records with this source whose digest equals this SHA256 hexadecimal digest are kept, and only records with a different digest are removed

Returns:

  • (Documentrix::Documents)



# File 'lib/documentrix/documents.rb', line 320

def source_remove(source, digest: nil)
  source = normalize_source(source)
  @cache.clear_by_source(source, digest:, operator: '!=')
  invalidate_collections_cache!
end

#source_update(texts, **opts) ⇒ Documentrix::Documents?

Updates the records associated with a given source.

If the source already exists in the cache, this method computes its current digest and removes only the stale records that do not match this digest. If the source is new or has been modified, it adds the provided texts to the cache.

Parameters:

  • texts (Array)

    the text strings to add if the source is new or modified

  • opts (Hash)

    additional options passed to #add (e.g., :batch_size, :tags)

    • :source [#to_s] the source to update

Returns:

  • (Documentrix::Documents, nil)

    the instance itself if the source was added/updated, or nil if the source was already up-to-date.



# File 'lib/documentrix/documents.rb', line 299

def source_update(texts, **opts)
  if source = normalize_source(opts[:source]) and source_exist?(source)
    digest = compute_file_digest(source)
    source_remove(source, digest:)
    unless source_exist?(source, digest:, operator: ?=)
      opts[:digest] = digest
      add(texts, **opts)
    end
  else
    add(texts, **opts)
  end
end
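The update flow can be sketched over a plain array of record hashes. This is a simplified illustration under stated assumptions: the caller passes the current digest instead of having it computed from the file, and the method name refresh_source is hypothetical.

```ruby
# Hypothetical refresh: drop stale records of a source, then add texts only if needed.
def refresh_source(records, source, digest, texts)
  # Remove records of this source whose digest differs from the current one.
  records.reject! { |r| r[:source] == source && r[:digest] != digest }
  # If up-to-date records remain, there is nothing to do.
  return nil if records.any? { |r| r[:source] == source && r[:digest] == digest }
  texts.each { |text| records << { text: text, source: source, digest: digest } }
  records
end

records = [{ text: 'old', source: 'a.md', digest: 'd1' }]
first  = refresh_source(records, 'a.md', 'd2', ['new']) # stale record replaced
second = refresh_source(records, 'a.md', 'd2', ['new']) # already up to date
```

The nil return on the second call mirrors the documented behavior of returning nil when the source is already up-to-date, which lets callers skip follow-up work.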

#tags ⇒ Documentrix::Utils::Tags

The tags method returns the unique tags stored in the cache.

Returns:

  • (Documentrix::Utils::Tags)



# File 'lib/documentrix/documents.rb', line 408

def tags
  @cache.tags
end