Class: Documentrix::Documents
- Inherits:
- Object
- Includes:
- Cache, Utils::Digests, Kramdown::ANSI::Width
- Defined in:
- lib/documentrix/documents.rb
Overview
Documentrix::Documents is a class that provides functionality for building and querying vector databases for natural language processing and large language model applications.
It allows users to store and retrieve dense vector embeddings for text strings, supporting various cache backends including memory, Redis, and SQLite for efficient data management.
The class handles the complete workflow of adding documents, computing their embeddings using a specified model, storing them in a cache, and performing similarity searches to find relevant documents based on query text.
Defined Under Namespace
Modules: Cache, Splitters
Classes: MemoryCache, RedisCache
Constant Summary
- Record =
Shortcut for Documentrix::Documents::Cache::Records::Record
Class.new Documentrix::Documents::Cache::Records::Record
Instance Attribute Summary
-
#cache ⇒ Object
readonly
Returns the value of attribute cache.
-
#collection ⇒ Object
Returns the value of attribute collection.
-
#model ⇒ Object
readonly
Returns the value of attribute model.
-
#ollama ⇒ Object
readonly
Returns the value of attribute ollama.
Instance Method Summary
-
#[](text) ⇒ Object
The [] method retrieves the value associated with the given text from the cache.
-
#[]=(text, record) ⇒ Object
The []= method sets the value for a given text in the cache.
-
#add(texts, batch_size: nil, source: nil, tags: [], digest: nil) ⇒ Documentrix::Documents
(also: #<<)
The add method adds new texts to the documents collection by processing them through various stages.
-
#clear(tags: nil) ⇒ Documentrix::Documents
The clear method clears all texts from the cache or, if tags were given, only the ones tagged with them.
-
#collections ⇒ Array
The collections method returns an array of unique collection names.
-
#default_collection ⇒ :default
The default_collection method returns the default collection name.
-
#delete(text) ⇒ FalseClass, TrueClass
The delete method removes the specified text from the cache by calling the delete method on the underlying cache object.
-
#each_record {|record| ... } ⇒ Enumerator
The each_record method iterates over all records stored in the cache.
-
#exist?(text) ⇒ FalseClass, TrueClass
The exist? method checks if the given text exists in the cache.
-
#find(string, tags: nil, prompt: nil, max_records: nil, min_similarity: nil) ⇒ Array<Documentrix::Documents::Record>
The find method searches for strings within the cache by computing their similarity scores.
-
#find_where(string, text_size: nil, text_count: nil, **opts) ⇒ Array<Documentrix::Documents::Record>
The find_where method filters the records returned by find based on text size and count.
-
#initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false, database_busy_timeout: 5000) ⇒ Documents
constructor
The initialize method sets up the Documentrix::Documents instance by configuring its components.
-
#normalize_source(source) ⇒ String?
Normalizes the source identifier to a canonical form.
-
#rename_collection(new_collection) ⇒ Documentrix::Documents
Rename the current collection, moving all keys from the old prefix to a new one.
-
#size ⇒ Integer
The size method returns the number of texts stored in the cache of this Documentrix::Documents instance.
-
#source_exist?(source, digest: nil, operator: ?=) ⇒ Boolean
The source_exist? method checks if any records associated with the given source exist in the cache.
-
#source_modified?(source) ⇒ Boolean
Checks if the content of the given source has been modified compared to the version stored in the cache, or if it is missing from the cache.
-
#source_remove(source, digest: nil) ⇒ Documentrix::Documents
The source_remove method removes all documents associated with the given source.
-
#source_update(texts, **opts) ⇒ Documentrix::Documents?
Updates the records associated with a given source.
-
#sources ⇒ Array<String>
Returns an array of all unique sources stored in the cache.
-
#tags ⇒ Documentrix::Utils::Tags
The tags method returns an array of unique tags from the cache.
Constructor Details
#initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false, database_busy_timeout: 5000) ⇒ Documents
The initialize method sets up the Documentrix::Documents instance by configuring its components.
# File 'lib/documentrix/documents.rb', line 80
def initialize(ollama:, model:, model_options: nil, collection: nil, embedding_length: 1_024, cache: MemoryCache, database_filename: nil, redis_url: nil, debug: false, database_busy_timeout: 5000)
  collection ||= default_collection
  @ollama, @model, @model_options, @collection, @debug =
    ollama, model, model_options, collection.to_sym, debug
  database_filename ||= ':memory:'
  @cache = connect_cache(cache, redis_url, embedding_length, database_filename, database_busy_timeout)
end
Instance Attribute Details
#cache ⇒ Object (readonly)
Returns the value of attribute cache.
# File 'lib/documentrix/documents.rb', line 95
def cache
  @cache
end
#collection ⇒ Object
Returns the value of attribute collection.
# File 'lib/documentrix/documents.rb', line 95
def collection
  @collection
end
#model ⇒ Object (readonly)
Returns the value of attribute model.
# File 'lib/documentrix/documents.rb', line 95
def model
  @model
end
#ollama ⇒ Object (readonly)
Returns the value of attribute ollama.
# File 'lib/documentrix/documents.rb', line 95
def ollama
  @ollama
end
Instance Method Details
#[](text) ⇒ Object
The [] method retrieves the value associated with the given text from the cache.
# File 'lib/documentrix/documents.rb', line 176
def [](text)
  @cache[key(text)]
end
#[]=(text, record) ⇒ Object
The []= method sets the value for a given text in the cache.
# File 'lib/documentrix/documents.rb', line 184
def []=(text, record)
  @cache[key(text)] = record
end
#add(texts, batch_size: nil, source: nil, tags: [], digest: nil) ⇒ Documentrix::Documents Also known as: <<
The add method adds new texts to the documents collection by
processing them through various stages. It first filters out existing texts
from the input array using the prepare_texts method, then fetches
embeddings for each text using the specified model and options. The fetched
embeddings are used to create a new record in the cache, which is
associated with the original text, tags, and version digest (if any). The
method processes the texts in batches of size batch_size, displaying
progress information in the console. It also accepts an optional source
string to associate with the added texts, an array of tags to attach to
each record, and an optional digest string for version tracking. Once
all texts have been processed, it returns the Documentrix::Documents
instance itself, allowing for method chaining.
# File 'lib/documentrix/documents.rb', line 144
def add(texts, batch_size: nil, source: nil, tags: [], digest: nil)
  texts = prepare_texts(texts) or return self
  source = normalize_source(source)
  tags = Documentrix::Utils::Tags.new(tags, source:)
  digest ||= compute_file_digest(source)
  if source
    tags.add(File.basename(source).gsub(/\?.*/, ''), source:)
  end
  batches = texts.each_slice(batch_size || 10).
    with_infobar(
      label: "Add #{truncate(tags.to_s(link: false), percentage: 25)}",
      total: texts.size
    )
  batches.each do |batch|
    embeddings = fetch_embeddings(model:, options: @model_options, input: batch)
    batch.zip(embeddings) do |text, embedding|
      norm = @cache.norm(embedding)
      self[text] = Record[text:, embedding:, norm:, source:, tags: tags.to_a, digest:]
    end
    infobar.progress by: batch.size
  end
  infobar.newline
  self
end
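The batching flow described for add can be sketched in isolation. This is a minimal, self-contained illustration under stated assumptions, not the library's implementation: fake_embed, vector_norm, add_in_batches, and the plain Hash used as a store are hypothetical stand-ins for the embedding call and the cache.

```ruby
# Hypothetical stand-in for an embedding model: maps each text to a tiny
# 2-dimensional vector. A real model returns much longer vectors.
def fake_embed(texts)
  texts.map { |t| [t.length.to_f, t.count('aeiou').to_f] }
end

# Euclidean norm of a vector, stored next to the embedding so that a later
# similarity search does not have to recompute it.
def vector_norm(vector)
  Math.sqrt(vector.sum { |x| x * x })
end

# Adds texts to the store in batches, skipping texts already present.
def add_in_batches(store, texts, batch_size: 10, tags: [])
  fresh = texts.uniq.reject { |t| store.key?(t) }
  fresh.each_slice(batch_size) do |batch|
    embeddings = fake_embed(batch) # one call per batch, not per text
    batch.zip(embeddings) do |text, embedding|
      store[text] = { text: text, embedding: embedding,
                      norm: vector_norm(embedding), tags: tags }
    end
  end
  store
end

store = {}
add_in_batches(store, %w[alpha beta gamma], batch_size: 2, tags: %w[greek])
add_in_batches(store, %w[beta delta], batch_size: 2) # "beta" is skipped
```

Requesting embeddings per batch rather than per text keeps the number of model calls low, which is presumably why the method exposes batch_size.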
#clear(tags: nil) ⇒ Documentrix::Documents
The clear method clears all texts from the cache or, if tags were given, only the ones tagged with them.
# File 'lib/documentrix/documents.rb', line 222
def clear(tags: nil)
  @cache.clear(tags:)
  self
end
#collections ⇒ Array
The collections method returns an array of unique collection names.
# File 'lib/documentrix/documents.rb', line 382
def collections
  [ default_collection ].concat(@cache.collections('%s-' % class_prefix)).uniq
end
#default_collection ⇒ :default
The default_collection method returns the default collection name.
# File 'lib/documentrix/documents.rb', line 91
def default_collection
  :default
end
#delete(text) ⇒ FalseClass, TrueClass
The delete method removes the specified text from the cache by calling the delete method on the underlying cache object.
# File 'lib/documentrix/documents.rb', line 204
def delete(text)
  @cache.delete(key(text))
end
#each_record {|record| ... } ⇒ Enumerator
The each_record method iterates over all records stored in the cache.
# File 'lib/documentrix/documents.rb', line 420
def each_record(&block)
  block or return enum_for(__method__)
  @cache.each { |_key, record| block.(record) }
end
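The each_record source relies on a common Ruby idiom: return an Enumerator when no block is given. A minimal sketch of that idiom follows; RecordStore and its Hash of records are hypothetical and only stand in for the cache.

```ruby
# Hypothetical container illustrating the "Enumerator without a block" idiom.
class RecordStore
  def initialize(records)
    @records = records # Hash of key => record
  end

  def each_record(&block)
    # Without a block, return an Enumerator bound to this same method.
    block or return enum_for(__method__)
    @records.each { |_key, record| block.(record) }
  end
end

store = RecordStore.new('k1' => 'first', 'k2' => 'second')
store.each_record { |record| puts record } # block form
upcased = store.each_record.map(&:upcase)  # enumerator form
```

The enumerator form makes the usual Enumerable methods (map, select, first, and so on) available without the class having to include Enumerable itself.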
#exist?(text) ⇒ FalseClass, TrueClass
The exist? method checks if the given text exists in the cache.
# File 'lib/documentrix/documents.rb', line 193
def exist?(text)
  @cache.key?(key(text))
end
#find(string, tags: nil, prompt: nil, max_records: nil, min_similarity: nil) ⇒ Array<Documentrix::Documents::Record>
The find method searches for strings within the cache by computing their similarity scores.
# File 'lib/documentrix/documents.rb', line 338
def find(string, tags: nil, prompt: nil, max_records: nil, min_similarity: nil)
  min_similarity ||= -1
  needle = convert_to_vector(string, prompt:)
  @cache.find_records(needle, tags:, max_records:, min_similarity:)
end
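How the cache scores records is not shown on this page. The sketch below assumes cosine similarity with precomputed norms, which the stored norm field suggests but does not confirm. Scored, vector_norm, cosine, and find_similar are hypothetical names used only for illustration.

```ruby
# Hypothetical record type: text plus embedding and its precomputed norm.
Scored = Struct.new(:text, :embedding, :norm)

def vector_norm(v)
  Math.sqrt(v.sum { |x| x * x })
end

# Cosine similarity using norms computed ahead of time.
def cosine(a, a_norm, b, b_norm)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (a_norm * b_norm)
end

# Returns [record, similarity] pairs sorted by descending similarity.
def find_similar(records, needle, max_records: nil, min_similarity: nil)
  min_similarity ||= -1 # cosine similarity is never below -1, so nothing is filtered
  needle_norm = vector_norm(needle)
  hits = records.map { |r| [r, cosine(needle, needle_norm, r.embedding, r.norm)] }
  hits = hits.select { |_, s| s >= min_similarity }.sort_by { |_, s| -s }
  max_records ? hits.first(max_records) : hits
end

records = [
  ['east',  [1.0, 0.0]],
  ['north', [0.0, 1.0]],
  ['west',  [-1.0, 0.0]],
].map { |text, e| Scored.new(text, e, vector_norm(e)) }

top = find_similar(records, [2.0, 0.0], max_records: 2)
top.map { |record, _| record.text } # => ["east", "north"]
```

Under this assumption, the default of -1 for min_similarity simply means "no threshold", since -1 is the lowest value cosine similarity can take.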
#find_where(string, text_size: nil, text_count: nil, **opts) ⇒ Array<Documentrix::Documents::Record>
The find_where method filters the records returned by find based on text size and count.
# File 'lib/documentrix/documents.rb', line 370
def find_where(string, text_size: nil, text_count: nil, **opts)
  text_count and opts[:max_records] = text_count
  records = find(string, **opts)
  size = 0
  records.take_while do |record|
    !text_size || (size += record.text.size) <= text_size
  end
end
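The text_size filter is a running character budget applied with take_while. The sketch below isolates that step; Doc and take_within_budget are hypothetical names, and the input is assumed to be already ranked by similarity.

```ruby
# Hypothetical record with only the field this sketch needs.
Doc = Struct.new(:text)

# Keeps leading records while the running total of text sizes stays within
# text_size; a nil text_size disables the limit.
def take_within_budget(records, text_size: nil)
  size = 0
  records.take_while do |record|
    !text_size || (size += record.text.size) <= text_size
  end
end

ranked = [Doc.new('aaaa'), Doc.new('bbbbbb'), Doc.new('cc')]
take_within_budget(ranked, text_size: 10).map(&:text) # => ["aaaa", "bbbbbb"]
take_within_budget(ranked, text_size: 5).map(&:text)  # => ["aaaa"]
take_within_budget(ranked).size                       # => 3
```

Because take_while stops at the first record that exceeds the budget, a later, shorter record is not picked up even if it would still fit: with a budget of 5, "cc" is dropped along with "bbbbbb".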
#normalize_source(source) ⇒ String?
Normalizes the source identifier to a canonical form.
If the source is blank, returns nil. If the source is an absolute URL, it is returned as-is. If the source is a local file path that exists, it is expanded to its real path, with symlinks and relative components resolved. Otherwise, the original source is returned.
# File 'lib/documentrix/documents.rb', line 238
def normalize_source(source)
  source.blank? and return
  begin
    URI::PARSER.parse(source).absolute? and return source
  rescue
  end
  Pathname.new(source).realpath.to_path
rescue Errno::ENOENT
  source
end
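The normalization rules can be approximated with the standard library alone. In this sketch, normalize is a hypothetical standalone function: it replaces the blank? helper with an explicit check and uses URI.parse instead of URI::PARSER, so edge cases may differ from the method above.

```ruby
require 'uri'
require 'pathname'

# Standalone approximation of the normalization rules.
def normalize(source)
  return nil if source.nil? || source.to_s.strip.empty?
  begin
    # Absolute URLs (those with a scheme) pass through unchanged.
    return source if URI.parse(source).absolute?
  rescue URI::InvalidURIError
    # Not parseable as a URI; treat it as a file path below.
  end
  # Existing paths are resolved to their real location on disk.
  Pathname.new(source).realpath.to_path
rescue Errno::ENOENT
  source # nonexistent paths are returned as given
end

normalize(nil)                         # => nil
normalize('https://example.com/a?b=1') # => "https://example.com/a?b=1"
normalize('/no/such/file-for-sketch')  # => "/no/such/file-for-sketch"
```

Normalizing before storing means that two spellings of the same file, such as a symlink and its target, map to a single source key.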
#rename_collection(new_collection) ⇒ Documentrix::Documents
Rename the current collection, moving all keys from the old prefix to a new
one. After the rename the instance’s collection attribute points to
new_collection, and the cache keys are updated accordingly.
# File 'lib/documentrix/documents.rb', line 392
def rename_collection(new_collection)
  new_collection = new_collection.to_sym
  collections.member?(new_collection) and
    raise ArgumentError, "new collection #{new_collection} already exists!"
  new_prefix = '%s-%s-' % [ class_prefix, new_collection ]
  @cache.move_prefix(prefix, new_prefix)
  self.collection = new_collection
  self
end
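Renaming a collection amounts to moving every key from one prefix to another. The sketch below shows that operation on a plain Hash; the move_prefix function here is a hypothetical reimplementation, and the "Documents-<collection>-" key layout is an assumption based on the format string in the source above.

```ruby
# Moves every key starting with old_prefix to the same key under new_prefix.
# Keys with other prefixes are left untouched.
def move_prefix(store, old_prefix, new_prefix)
  # Iterate over a snapshot of the keys so the hash can be modified safely.
  store.keys.each do |key|
    next unless key.start_with?(old_prefix)
    store[new_prefix + key.delete_prefix(old_prefix)] = store.delete(key)
  end
  store
end

store = {
  'Documents-default-abc' => 'record 1',
  'Documents-default-def' => 'record 2',
  'Documents-other-abc'   => 'record 3',
}
move_prefix(store, 'Documents-default-', 'Documents-archive-')
store.keys.sort
# => ["Documents-archive-abc", "Documents-archive-def", "Documents-other-abc"]
```

Checking that the target collection does not already exist, as rename_collection does, prevents moved keys from silently overwriting records in another collection.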
#size ⇒ Integer
The size method returns the number of texts stored in the cache of this Documentrix::Documents instance.
# File 'lib/documentrix/documents.rb', line 212
def size
  @cache.size
end
#source_exist?(source, digest: nil, operator: ?=) ⇒ Boolean
The source_exist? method checks if any records associated with the given source exist in the cache. If a digest is provided, it verifies if the source exists and satisfies the comparison with the specified digest.
# File 'lib/documentrix/documents.rb', line 261
def source_exist?(source, digest: nil, operator: ?=)
  source = normalize_source(source)
  @cache.source_exist?(source, digest:, operator:)
end
#source_modified?(source) ⇒ Boolean
Checks if the content of the given source has been modified compared to the version stored in the cache, or if it is missing from the cache.
The source is considered modified (the method returns true) if:
- The source is blank or cannot be normalized.
- The source is not a valid local file or its digest cannot be computed.
- No records exist in the cache for this source.
- Records exist in the cache for this source, but they have a different digest than the current version on disk.
# File 'lib/documentrix/documents.rb', line 279
def source_modified?(source)
  source = normalize_source(source) or return true
  digest = compute_file_digest(source) or return true
  !source_exist?(source) || source_exist?(source, digest:, operator: '!=')
end
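The four conditions listed above can be exercised with a small stand-in. Everything in this sketch is an assumption made for illustration: the cache is a Hash mapping a source path to the digests of its stored records, SHA-256 is an arbitrary choice of digest, and the '=' and '!=' operators are interpreted as "any record with an equal (or different) digest".

```ruby
require 'digest'
require 'tempfile'

# Hypothetical cache lookup: cache maps source => array of record digests.
def source_exist?(cache, source, digest: nil, operator: '=')
  digests = cache.fetch(source, [])
  return !digests.empty? if digest.nil?
  case operator
  when '='  then digests.any? { |d| d == digest }
  when '!=' then digests.any? { |d| d != digest }
  else raise ArgumentError, "unsupported operator #{operator}"
  end
end

# Digest of a local file, or nil if the path is not a regular file.
def file_digest(path)
  File.file?(path) ? Digest::SHA256.file(path).hexdigest : nil
end

def source_modified?(cache, source)
  return true if source.nil? || source.empty?
  digest = file_digest(source) or return true
  !source_exist?(cache, source) ||
    source_exist?(cache, source, digest: digest, operator: '!=')
end

Tempfile.create('sketch') do |f|
  f.write('v1')
  f.flush
  cache = {}
  source_modified?(cache, f.path) # true: nothing cached yet
  cache[f.path] = [file_digest(f.path)]
  source_modified?(cache, f.path) # false: digest matches
  f.write(' v2')
  f.flush
  source_modified?(cache, f.path) # true: content changed on disk
end
```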
#source_remove(source, digest: nil) ⇒ Documentrix::Documents
The source_remove method removes all documents associated with the given source.
# File 'lib/documentrix/documents.rb', line 319
def source_remove(source, digest: nil)
  source = normalize_source(source)
  @cache.clear_by_source(source, digest:, operator: '!=')
  self
end
#source_update(texts, **opts) ⇒ Documentrix::Documents?
Updates the records associated with a given source.
If the source already exists in the cache, this method computes its current digest and removes only the stale records that do not match this digest. If the source is new or has been modified, it adds the provided texts to the cache.
# File 'lib/documentrix/documents.rb', line 298
def source_update(texts, **opts)
  if source = normalize_source(opts[:source]) and source_exist?(source)
    digest = compute_file_digest(source)
    source_remove(source, digest:)
    unless source_exist?(source, digest:, operator: ?=)
      opts[:digest] = digest
      add(texts, **opts)
    end
  else
    add(texts, **opts)
  end
end
#sources ⇒ Array<String>
Returns an array of all unique sources stored in the cache.
# File 'lib/documentrix/documents.rb', line 412
def sources
  @cache.each_source.to_a
end
#tags ⇒ Documentrix::Utils::Tags
The tags method returns an array of unique tags from the cache.
# File 'lib/documentrix/documents.rb', line 405
def tags
  @cache.tags
end