Class: Documentrix::Documents
- Inherits:
- Object
  - Object
  - Documentrix::Documents
- Includes:
- Cache, Utils::Digests, Kramdown::ANSI::Width
- Defined in:
- lib/documentrix/documents.rb
Overview
Documentrix::Documents is a class that provides functionality for building and querying vector databases for natural language processing and large language model applications.
It allows users to store and retrieve dense vector embeddings for text strings, supporting various cache backends including memory, Redis, and SQLite for efficient data management.
The class handles the complete workflow of adding documents, computing their embeddings using a specified model, storing them in a cache, and performing similarity searches to find relevant documents based on query text.
Defined Under Namespace
Modules: Cache, Splitters
Classes: MemoryCache, RedisCache
Constant Summary
- Record = Class.new Documentrix::Documents::Cache::Records::Record
  Shortcut for Documentrix::Documents::Cache::Records::Record
Instance Attribute Summary
-
#cache ⇒ Object
readonly
Returns the value of attribute cache.
-
#collection ⇒ Object
Returns the value of attribute collection.
-
#model ⇒ Object
readonly
Returns the value of attribute model.
-
#ollama ⇒ Object
readonly
Returns the value of attribute ollama.
Instance Method Summary
-
#[](text) ⇒ Object
The [] method retrieves the value associated with the given text from the cache.
-
#[]=(text, record) ⇒ Object
The []= method sets the value for a given text in the cache.
-
#add(texts, batch_size: nil, source: nil, tags: [], digest: nil) ⇒ Documentrix::Documents
(also: #<<)
The add method adds new texts to the documents collection by processing them through various stages.
-
#clear(tags: nil) ⇒ Documentrix::Documents
The clear method clears all texts from the cache or, if tags are given, only the texts tagged with them.
-
#collections ⇒ Array
The collections method returns an array of unique collection names.
-
#default_collection ⇒ :default
The default_collection method returns the default collection name.
-
#delete(text) ⇒ FalseClass, TrueClass
The delete method removes the specified text from the cache by calling the delete method on the underlying cache object.
-
#exist?(text) ⇒ FalseClass, TrueClass
The exist? method checks if the given text exists in the cache.
-
#find(string, tags: nil, prompt: nil, max_records: nil, min_similarity: nil) ⇒ Array<Documentrix::Documents::Record>
The find method searches for strings within the cache by computing their similarity scores.
-
#find_where(string, text_size: nil, text_count: nil, **opts) ⇒ Array<Documentrix::Documents::Record>
The find_where method filters the records returned by find based on text size and count.
-
#initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false) ⇒ Documents
constructor
The initialize method sets up the Documentrix::Documents instance by configuring its components.
-
#normalize_source(source) ⇒ String?
Normalizes the source identifier to a canonical form.
-
#rename_collection(new_collection) ⇒ Documentrix::Documents
Rename the current collection, moving all keys from the old prefix to a new one.
-
#size ⇒ Integer
The size method returns the number of texts stored in the cache of this Documentrix::Documents instance.
-
#source_exist?(source, digest: nil, operator: ?=) ⇒ Boolean
The source_exist? method checks if any records associated with the given source exist in the cache.
-
#source_modified?(source) ⇒ Boolean
Checks if the content of the given source has been modified compared to the version stored in the cache, or if it is missing from the cache.
-
#source_remove(source, digest: nil) ⇒ Documentrix::Documents
The source_remove method removes all documents associated with the given source.
-
#source_update(texts, **opts) ⇒ Documentrix::Documents?
Updates the records associated with a given source.
-
#tags ⇒ Documentrix::Utils::Tags
The tags method returns an array of unique tags from the cache.
Constructor Details
#initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false) ⇒ Documents
The initialize method sets up the Documentrix::Documents instance by configuring its components.
# File 'lib/documentrix/documents.rb', line 79
def initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false)
  collection ||= default_collection
  @ollama, @model, @model_options, @collection, @debug =
    ollama, model, , collection.to_sym, debug
  database_filename ||= ':memory:'
  @cache = connect_cache(cache, redis_url, , database_filename)
end
Instance Attribute Details
#cache ⇒ Object (readonly)
Returns the value of attribute cache.
# File 'lib/documentrix/documents.rb', line 94
def cache
  @cache
end
#collection ⇒ Object
Returns the value of attribute collection.
# File 'lib/documentrix/documents.rb', line 94
def collection
  @collection
end
#model ⇒ Object (readonly)
Returns the value of attribute model.
# File 'lib/documentrix/documents.rb', line 94
def model
  @model
end
#ollama ⇒ Object (readonly)
Returns the value of attribute ollama.
# File 'lib/documentrix/documents.rb', line 94
def ollama
  @ollama
end
Instance Method Details
#[](text) ⇒ Object
The [] method retrieves the value associated with the given text from the cache.
# File 'lib/documentrix/documents.rb', line 175
def [](text)
  @cache[key(text)]
end
#[]=(text, record) ⇒ Object
The []= method sets the value for a given text in the cache.
# File 'lib/documentrix/documents.rb', line 183
def []=(text, record)
  @cache[key(text)] = record
end
#add(texts, batch_size: nil, source: nil, tags: [], digest: nil) ⇒ Documentrix::Documents Also known as: <<
The add method adds new texts to the documents collection by
processing them through various stages. It first filters out existing texts
from the input array using the prepare_texts method, then fetches
embeddings for each text using the specified model and options. The fetched
embeddings are used to create a new record in the cache, which is
associated with the original text, tags, and version digest (if any). The
method processes the texts in batches of size batch_size, displaying
progress information in the console. It also accepts an optional source
string to associate with the added texts, an array of tags to attach to
each record, and an optional digest string for version tracking. Once
all texts have been processed, it returns the Documentrix::Documents
instance itself, allowing for method chaining.
# File 'lib/documentrix/documents.rb', line 143
def add(texts, batch_size: nil, source: nil, tags: [], digest: nil)
  texts = prepare_texts(texts) or return self
  source = normalize_source(source)
  = Documentrix::Utils::Tags.new(, source:)
  digest ||= compute_file_digest(source)
  if source
    .add(File.basename(source).gsub(/\?.*/, ''), source:)
  end
  batches = texts.each_slice(batch_size || 10).
    (
      label: "Add #{truncate(.to_s(link: false), percentage: 25)}",
      total: texts.size
    )
  batches.each do |batch|
    = (model:, options: @model_options, input: batch)
    batch.zip() do |text, |
      norm = @cache.norm()
      self[text] = Record[text:, embedding:, norm:, source:, tags: .to_a, digest:]
    end
    .progress by: batch.size
  end
  .newline
  invalidate_collections_cache!
end
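The batching flow described for add can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the Hash-based store, the helper names toy_embedding and add_texts, and the hash-derived "embedding" are all assumptions made for illustration, and the progress display is left out. Only the default batch size of 10 and the idea of skipping already stored texts are taken from the documentation above.

```ruby
require 'digest'

# Hypothetical stand-in for an embedding model: derives a small numeric
# vector from a hash of the text. A real setup would query the model.
def toy_embedding(text)
  Digest::SHA256.digest(text).bytes.first(4).map { |b| b / 255.0 }
end

# Adds texts to a Hash-based store in batches, skipping texts that are
# already stored. Returns the store so calls can be chained.
def add_texts(store, texts, batch_size: nil, source: nil, tags: [])
  new_texts = texts.reject { |text| store.key?(text) }
  return store if new_texts.empty?

  new_texts.each_slice(batch_size || 10) do |batch|
    # One embedding request per batch rather than per text.
    embeddings = batch.map { |text| toy_embedding(text) }
    batch.zip(embeddings) do |text, embedding|
      norm = Math.sqrt(embedding.sum { |x| x * x })
      store[text] = {
        text: text, embedding: embedding, norm: norm,
        source: source, tags: tags
      }
    end
  end
  store
end

store = {}
add_texts(store, %w[alpha beta gamma], batch_size: 2, tags: %w[demo])
add_texts(store, %w[beta delta]) # "beta" is already stored and is skipped
puts store.size
```

Storing the norm next to the embedding, as the real record does, avoids recomputing it for every similarity comparison later.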
#clear(tags: nil) ⇒ Documentrix::Documents
The clear method clears all texts from the cache or, if tags are given, only the texts tagged with them.
# File 'lib/documentrix/documents.rb', line 223
def clear(tags: nil)
  @cache.clear(tags:)
  invalidate_collections_cache!
end
#collections ⇒ Array
The collections method returns an array of unique collection names.
# File 'lib/documentrix/documents.rb', line 383
def collections
  @collections_cache ||= (
    [ default_collection ] + @cache.collections('%s-' % class_prefix)
  ).uniq
end
#default_collection ⇒ :default
The default_collection method returns the default collection name.
# File 'lib/documentrix/documents.rb', line 90
def default_collection
  :default
end
#delete(text) ⇒ FalseClass, TrueClass
The delete method removes the specified text from the cache by calling the delete method on the underlying cache object.
# File 'lib/documentrix/documents.rb', line 203
def delete(text)
  res = @cache.delete(key(text))
  invalidate_collections_cache! if res
  res
end
#exist?(text) ⇒ FalseClass, TrueClass
The exist? method checks if the given text exists in the cache.
# File 'lib/documentrix/documents.rb', line 192
def exist?(text)
  @cache.key?(key(text))
end
#find(string, tags: nil, prompt: nil, max_records: nil, min_similarity: nil) ⇒ Array<Documentrix::Documents::Record>
The find method searches for strings within the cache by computing their similarity scores.
# File 'lib/documentrix/documents.rb', line 339
def find(string, tags: nil, prompt: nil, max_records: nil, min_similarity: nil)
  min_similarity ||= -1
  needle = convert_to_vector(string, prompt:)
  @cache.find_records(needle, tags:, max_records:, min_similarity:)
end
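To make the roles of min_similarity and max_records concrete, here is a small sketch of a similarity search over in-memory records. It assumes cosine similarity as the score, which is consistent with the default of -1 but is not confirmed by this page, and it assumes that a record matches when it carries at least one of the requested tags. ToyRecord, cosine_similarity and find_records are hypothetical names for this sketch and do not describe the cache's actual find_records.

```ruby
ToyRecord = Struct.new(:text, :embedding, :tags)

# Cosine similarity of two equally sized vectors, in the range -1..1.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Returns [record, similarity] pairs, best match first.
def find_records(records, needle, tags: nil, max_records: nil, min_similarity: nil)
  # Cosine similarity never drops below -1, so this default filters nothing.
  min_similarity ||= -1
  candidates = records
  # Assumption: a record matches if it shares at least one requested tag.
  candidates = candidates.select { |r| (r.tags & tags).any? } if tags
  scored = candidates.map { |r| [r, cosine_similarity(needle, r.embedding)] }
  scored = scored.select { |_, similarity| similarity >= min_similarity }
  scored = scored.sort_by { |_, similarity| -similarity }
  max_records ? scored.first(max_records) : scored
end

records = [
  ToyRecord.new('cats', [1.0, 0.0], %w[animals]),
  ToyRecord.new('dogs', [0.8, 0.6], %w[animals]),
  ToyRecord.new('cars', [0.0, 1.0], %w[machines])
]

top_two = find_records(records, [1.0, 0.0], max_records: 2)
puts top_two.map { |record, _| record.text }.inspect # ["cats", "dogs"]
```

Raising min_similarity trims weak matches before max_records is applied, so the two options can be combined to return "at most N results, all above a threshold".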
#find_where(string, text_size: nil, text_count: nil, **opts) ⇒ Array<Documentrix::Documents::Record>
The find_where method filters the records returned by find based on text size and count.
# File 'lib/documentrix/documents.rb', line 371
def find_where(string, text_size: nil, text_count: nil, **opts)
  text_count and opts[:max_records] = text_count
  records = find(string, **opts)
  size = 0
  records.take_while do |record|
    !text_size || (size += record.text.size) <= text_size
  end
end
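The text_size limit acts as a running character budget over the ranked results. The sketch below isolates that take_while step from the source shown above; SizedRecord and limit_by_text_size are hypothetical names, and the records stand in for an already ranked result list.

```ruby
SizedRecord = Struct.new(:text)

# Keeps leading records while the cumulative text length stays within
# text_size. With text_size nil, every record is kept.
def limit_by_text_size(records, text_size: nil)
  size = 0
  records.take_while do |record|
    !text_size || (size += record.text.size) <= text_size
  end
end

ranked = %w[aaaa bbbbbb cc].map { |text| SizedRecord.new(text) }

within_ten  = limit_by_text_size(ranked, text_size: 10).map(&:text)
within_five = limit_by_text_size(ranked, text_size: 5).map(&:text)
unlimited   = limit_by_text_size(ranked).map(&:text)

puts within_ten.inspect  # ["aaaa", "bbbbbb"]
puts within_five.inspect # ["aaaa"]
puts unlimited.inspect   # ["aaaa", "bbbbbb", "cc"]
```

Because take_while stops at the first record that exceeds the budget, a shorter record further down the ranking is not considered, even if it would still fit.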
#normalize_source(source) ⇒ String?
Normalizes the source identifier to a canonical form.
If the source is blank, returns nil. If the source is an absolute URL, it is returned as-is. If the source is a local file path that exists, it is expanded to its real absolute path with all symlinks resolved. Otherwise, the original source is returned.
# File 'lib/documentrix/documents.rb', line 239
def normalize_source(source)
  source.blank? and return
  begin
    URI::PARSER.parse(source).absolute? and return source
  rescue
  end
  Pathname.new(source).realpath.to_path
rescue Errno::ENOENT
  source
end
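The four cases of normalize_source can be tried out with a standalone approximation that uses only the Ruby standard library. It is a sketch, not the library method: normalize_source_sketch is a hypothetical name, the blank check is spelled out by hand instead of calling blank?, and URI.parse is used in place of URI::PARSER.

```ruby
require 'uri'
require 'pathname'
require 'tempfile'

# Approximates the normalization rules: blank -> nil, absolute URL -> as-is,
# existing path -> real path, anything else -> unchanged.
def normalize_source_sketch(source)
  return nil if source.nil? || source.to_s.strip.empty?
  begin
    return source if URI.parse(source).absolute?
  rescue URI::InvalidURIError
    # Not parseable as a URI; treat it as a file path below.
  end
  Pathname.new(source).realpath.to_path
rescue Errno::ENOENT
  source
end

file = Tempfile.new('doc')

blank_result   = normalize_source_sketch('   ')
url_result     = normalize_source_sketch('https://example.com/doc.html')
missing_result = normalize_source_sketch('no/such/file-xyz.txt')
file_result    = normalize_source_sketch(file.path)

puts blank_result.inspect
puts url_result
puts missing_result
puts file_result
```

Normalizing before every lookup means that two spellings of the same file, for example a relative path and a symlinked path, end up under one source key.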
#rename_collection(new_collection) ⇒ Documentrix::Documents
Rename the current collection, moving all keys from the old prefix to a new
one. After the rename the instance’s collection attribute points to
new_collection, and the cache keys are updated accordingly.
# File 'lib/documentrix/documents.rb', line 395
def rename_collection(new_collection)
  new_collection = new_collection.to_sym
  collections.member?(new_collection) and
    raise ArgumentError, "new collection #{new_collection} already exists!"
  new_prefix = '%s-%s-' % [ class_prefix, new_collection ]
  @cache.move_prefix(prefix, new_prefix)
  self.collection = new_collection
  invalidate_collections_cache!
end
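Renaming a collection comes down to rewriting key prefixes in the cache. The following sketch shows that idea on a plain Hash. The move_prefix defined here is a hypothetical stand-in written for this example, not the cache backend's method, and the 'Documents-<collection>-' key layout with 'Documents' as class prefix is an assumption based on the format string in the source above.

```ruby
# Moves every key that starts with old_prefix to the same key under
# new_prefix, leaving all other keys untouched.
def move_prefix(store, old_prefix, new_prefix)
  # Iterate over a snapshot of the keys, since the Hash is modified.
  store.keys.each do |key|
    next unless key.start_with?(old_prefix)
    store[new_prefix + key.delete_prefix(old_prefix)] = store.delete(key)
  end
  store
end

store = {
  'Documents-default-abc' => 1,
  'Documents-default-def' => 2,
  'Documents-other-xyz'   => 3
}

move_prefix(store, 'Documents-default-', 'Documents-archive-')
puts store.keys.sort.inspect
```

Including the trailing dash in the prefix keeps a collection named default from also matching one named default2.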
#size ⇒ Integer
The size method returns the number of texts stored in the cache of this Documentrix::Documents instance.
# File 'lib/documentrix/documents.rb', line 213
def size
  @cache.size
end
#source_exist?(source, digest: nil, operator: ?=) ⇒ Boolean
The source_exist? method checks if any records associated with the given source exist in the cache. If a digest is provided, it verifies if the source exists and satisfies the comparison with the specified digest.
# File 'lib/documentrix/documents.rb', line 262
def source_exist?(source, digest: nil, operator: ?=)
  source = normalize_source(source)
  @cache.source_exist?(source, digest:, operator:)
end
#source_modified?(source) ⇒ Boolean
Checks if the content of the given source has been modified compared to the version stored in the cache, or if it is missing from the cache.
The source is considered modified (the method returns true) if:
- The source is blank or cannot be normalized.
- The source is not a valid local file or its digest cannot be computed.
- No records exist in the cache for this source.
- Records exist in the cache for this source, but they have a different digest than the current version on disk.
# File 'lib/documentrix/documents.rb', line 280
def source_modified?(source)
  source = normalize_source(source) or return true
  digest = compute_file_digest(source) or return true
  !source_exist?(source) || source_exist?(source, digest:, operator: '!=')
end
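The conditions listed above amount to comparing a freshly computed digest with the one recorded at indexing time. A minimal sketch, assuming a SHA-256 content digest and a Hash that maps a path to its stored digest, could look as follows. The algorithm actually used by compute_file_digest is not shown on this page, and file_digest and modified? are hypothetical helper names.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical digest helper: SHA-256 of the file content, or nil if the
# file cannot be read (missing file, directory, permission problem).
def file_digest(path)
  Digest::SHA256.file(path).hexdigest
rescue SystemCallError
  nil
end

# True if the file is unreadable, unknown to the cache, or stored with a
# digest that differs from its current content.
def modified?(cache, path)
  digest = file_digest(path) or return true
  stored = cache[path]
  stored.nil? || stored != digest
end

file = Tempfile.new('doc')
file.write('version 1')
file.flush

cache = {}
before_indexing = modified?(cache, file.path) # not in the cache yet

cache[file.path] = file_digest(file.path)     # "index" the file
after_indexing = modified?(cache, file.path)

file.write(' plus an edit')
file.flush
after_edit = modified?(cache, file.path)

puts [before_indexing, after_indexing, after_edit].inspect # [true, false, true]
file.close!
```

Treating every failure case as "modified" errs on the side of re-indexing, which costs extra work but never leaves stale content unnoticed.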
#source_remove(source, digest: nil) ⇒ Documentrix::Documents
The source_remove method removes all documents associated with the given source. When a digest is given, it is passed to the cache together with the '!=' operator, so that only records whose digest differs from it are targeted.
# File 'lib/documentrix/documents.rb', line 320
def source_remove(source, digest: nil)
  source = normalize_source(source)
  @cache.clear_by_source(source, digest:, operator: '!=')
  invalidate_collections_cache!
end
#source_update(texts, **opts) ⇒ Documentrix::Documents?
Updates the records associated with a given source.
If the source already exists in the cache, this method computes its current digest and removes only the stale records that do not match this digest. If the source is new or has been modified, it adds the provided texts to the cache.
# File 'lib/documentrix/documents.rb', line 299
def source_update(texts, **opts)
  if source = normalize_source(opts[:source]) and source_exist?(source)
    digest = compute_file_digest(source)
    source_remove(source, digest:)
    unless source_exist?(source, digest:, operator: ?=)
      opts[:digest] = digest
      add(texts, **opts)
    end
  else
    add(texts, **opts)
  end
end
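The update flow (drop stale records, then add texts only if no record with the current digest remains) can be sketched on a plain array of record hashes. This illustrates the control flow only: refresh_source is a hypothetical helper, the digests are passed in as plain strings instead of being computed, and, unlike the real method, the sketch always returns the record list.

```ruby
# records: array of { source:, digest:, text: } hashes, modified in place.
def refresh_source(records, source, current_digest, texts)
  if records.any? { |r| r[:source] == source }
    # Remove only the stale records, i.e. those with a different digest.
    records.reject! { |r| r[:source] == source && r[:digest] != current_digest }
    # If up-to-date records remain, there is nothing to add.
    return records if records.any? { |r| r[:source] == source && r[:digest] == current_digest }
  end
  texts.each do |text|
    records << { source: source, digest: current_digest, text: text }
  end
  records
end

records = []
refresh_source(records, 'notes.txt', 'digest-1', %w[intro body])
first_count = records.size   # new source: both texts added

refresh_source(records, 'notes.txt', 'digest-1', %w[intro body])
second_count = records.size  # unchanged source: nothing added

refresh_source(records, 'notes.txt', 'digest-2', %w[rewritten])
final_texts = records.map { |r| r[:text] } # stale records replaced

puts [first_count, second_count, final_texts].inspect
```

Checking the digest before adding is what makes repeated calls for an unchanged source cheap: no embeddings need to be fetched again.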
#tags ⇒ Documentrix::Utils::Tags
The tags method returns an array of unique tags from the cache.
# File 'lib/documentrix/documents.rb', line 408
def
  @cache.
end