Class: RobotLab::DocumentStore
- Inherits: Object
- Inheritance chain: RobotLab::DocumentStore < Object
- Defined in: lib/robot_lab/document_store.rb, lib/robot_lab/document_store/version.rb
Overview
Embedding-based document store for semantic search over arbitrary text.
Documents are embedded using fastembed (BAAI/bge-small-en-v1.5 by default) and stored in memory. Queries are embedded the same way, then compared by cosine similarity to find the closest documents.
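The cosine comparison described above can be sketched as follows. This is a minimal illustration, not the library's code: the store's private `cosine_similarity` helper is not shown on this page, so the body below is an assumption about a typical implementation, and the sample vectors are made up.

```ruby
# Hypothetical helper: the library's private cosine_similarity is not shown on this page.
def cosine_similarity(a, b)
  raise ArgumentError, 'vectors must have the same length' unless a.length == b.length

  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero? # a zero vector has no direction to compare

  dot / (norm_a * norm_b)
end

# Made-up vectors: one parallel to the query, one at right angles to it.
query  = [1.0, 0.0, 1.0]
docs   = { same: [2.0, 0.0, 2.0], orthogonal: [0.0, 3.0, 0.0] }
scores = docs.transform_values { |vec| cosine_similarity(query, vec) }
```

Because the score depends only on direction, a vector that is a scaled copy of the query scores close to 1, while an orthogonal vector scores 0.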
The embedding model is initialised lazily on first use: the ONNX model file is downloaded on that first call and cached locally afterwards.
When fastembed is not installed, DocumentStore falls back to a lightweight TF-IDF word-frequency embedder. The fallback is lower quality (no semantic understanding, only lexical overlap) but works offline with no downloads, making it suitable for development and testing.
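To make "lexical overlap only" concrete, here is a rough sketch of a word-frequency embedder. It is not the library's fallback, whose implementation is not shown on this page: the method name, the 64-bucket dimension, the short stop-word list, and the CRC32 hashing trick are all assumptions, and the IDF weighting is left out for brevity.

```ruby
require 'set'
require 'zlib'

# Hypothetical stand-in for a lexical fallback embedder (term frequency only, no IDF).
# Returns a fixed-size, L2-normalised vector of hashed word counts.
def word_frequency_vector(text, dimensions: 64,
                          stop_words: %w[a an the is are of in and or to].to_set)
  vector = Array.new(dimensions, 0.0)
  text.downcase.scan(/[a-z0-9]+/).each do |token|
    next if stop_words.include?(token)

    vector[Zlib.crc32(token) % dimensions] += 1.0 # hashing trick: one bucket per token
  end
  norm = Math.sqrt(vector.sum { |x| x * x })
  norm.zero? ? vector : vector.map { |x| x / norm }
end

def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

cats  = word_frequency_vector('The cat sat in the sun')
cats2 = word_frequency_vector('a cat and the sun')
empty = word_frequency_vector('the of and') # only stop words, so nothing is counted
```

Two texts score above zero here only when they share literal words, so a synonym such as "kitten" would contribute nothing to the match with "cat".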
Constant Summary
- FASTEMBED_AVAILABLE =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
begin
  require 'fastembed'
  true # :nocov:
rescue LoadError
  false # :nocov:
end
- DEFAULT_MODEL =
Default embedding model used when none is specified.
'BAAI/bge-small-en-v1.5'
- STOP_WORDS =
%w[
  a an the is are was were be been being am do does did
  to of in and or but for with on at by from as into
  it its this that these those i you he she we they
  not no nor so yet
].to_set.freeze
- VERSION =
'0.2.1'
Instance Method Summary
- #clear ⇒ self
  Remove all stored documents.
- #delete(key) ⇒ self
  Remove the document stored under key.
- #empty? ⇒ Boolean
  Whether the store contains no documents.
- #initialize(model_name: DEFAULT_MODEL) ⇒ DocumentStore (constructor)
  A new instance of DocumentStore.
- #keys ⇒ Array<Symbol>
  Keys of all stored documents.
- #search(query, limit: 5) ⇒ Array<Hash>
  Search for documents semantically similar to query.
- #size ⇒ Integer
  Number of stored documents.
- #store(key, text) ⇒ self
  Embed text and store it under key.
Constructor Details
#initialize(model_name: DEFAULT_MODEL) ⇒ DocumentStore
Returns a new instance of DocumentStore.
# File 'lib/robot_lab/document_store.rb', line 47

def initialize(model_name: DEFAULT_MODEL)
  @model_name = model_name
  @documents = {} # key (Symbol) => { text: String, vector: Array<Float> }
  @mutex = Mutex.new
  @fastembed_model = nil # lazy: initialised on first embed call
  @using_fastembed = FASTEMBED_AVAILABLE
end
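The constructor leaves the model unset and defers loading to the first embed call. The code that performs that load is not shown on this page, so the following is only a sketch of one common way to do lazy, thread-safe initialisation. `LazyEmbedder`, its loader block, and `load_count` are hypothetical names used for illustration.

```ruby
# Hypothetical illustration of lazy, thread-safe initialisation (not the library's code).
class LazyEmbedder
  attr_reader :load_count

  def initialize(&loader)
    @loader = loader   # expensive setup, e.g. loading a model (a stand-in here)
    @model = nil       # nothing is loaded at construction time
    @mutex = Mutex.new
    @load_count = 0
  end

  # Load the model on first access; later calls reuse the cached object.
  def model
    @mutex.synchronize do
      if @model.nil?
        @model = @loader.call
        @load_count += 1
      end
      @model
    end
  end
end

embedder = LazyEmbedder.new { { name: 'stand-in model' } }
before = embedder.load_count
4.times.map { Thread.new { embedder.model } }.each(&:join)
after = embedder.load_count
```

Holding the mutex around the nil check means concurrent first calls trigger a single load rather than one per thread.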
Instance Method Details
#clear ⇒ self
Remove all stored documents.
# File 'lib/robot_lab/document_store.rb', line 119

def clear
  @mutex.synchronize { @documents.clear }
  self
end
#delete(key) ⇒ self
Remove the document stored under key.
# File 'lib/robot_lab/document_store.rb', line 112

def delete(key)
  @mutex.synchronize { @documents.delete(key.to_sym) }
  self
end
#empty? ⇒ Boolean
Whether the store contains no documents.
# File 'lib/robot_lab/document_store.rb', line 105

def empty?
  @mutex.synchronize { @documents.empty? }
end
#keys ⇒ Array<Symbol>
Keys of all stored documents.
# File 'lib/robot_lab/document_store.rb', line 99

def keys
  @mutex.synchronize { @documents.keys }
end
#search(query, limit: 5) ⇒ Array<Hash>
Search for documents semantically similar to query.
# File 'lib/robot_lab/document_store.rb', line 75

def search(query, limit: 5)
  return [] if empty?

  query_vec = query_vector(query)
  results = []

  @mutex.synchronize do
    @documents.each do |key, doc|
      score = cosine_similarity(query_vec, doc[:vector])
      results << { key: key, text: doc[:text], score: score }
    end
  end

  results.sort_by { |r| -r[:score] }.first(limit)
end
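The ranking step at the end of #search can be tried in isolation. The snippet below uses made-up result hashes in the `{ key:, text:, score: }` shape that the method builds. The keys, texts, and scores are invented for illustration and do not come from a real model.

```ruby
# Hypothetical scored results in the shape #search returns: { key:, text:, score: }.
results = [
  { key: :billing, text: 'How invoices are generated', score: 0.41 },
  { key: :login,   text: 'Resetting a password',       score: 0.87 },
  { key: :api,     text: 'Authenticating API calls',   score: 0.63 }
]

limit = 2

# Same expression as in #search: highest score first, then keep at most `limit`.
top      = results.sort_by { |r| -r[:score] }.first(limit)
top_keys = top.map { |r| r[:key] }

# Enumerable#max_by(n) is an alternative way to take the n highest-scoring entries.
alt_keys = results.max_by(limit) { |r| r[:score] }.map { |r| r[:key] }
```

When `limit` exceeds the number of stored documents, `first(limit)` simply returns everything, so callers do not need to clamp it themselves.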
#size ⇒ Integer
Number of stored documents.
# File 'lib/robot_lab/document_store.rb', line 93

def size
  @mutex.synchronize { @documents.size }
end
#store(key, text) ⇒ self
Embed text and store it under key.
If a document already exists under key it is replaced.
# File 'lib/robot_lab/document_store.rb', line 62

def store(key, text)
  key = key.to_sym
  vector = passage_vector(text)
  @mutex.synchronize { @documents[key] = { text: text, vector: vector } }
  self
end
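Because #store calls `key.to_sym` before writing, a String key and the matching Symbol key refer to the same document, and the later write replaces the earlier one. The snippet below reproduces only that keying behaviour with a plain Hash. The `put` lambda is a hypothetical stand-in and does no embedding.

```ruby
# Hypothetical stand-in for the keying behaviour of #store (no embedding involved).
documents = {}
put = lambda do |key, text|
  documents[key.to_sym] = text # normalise the key, then insert or replace
end

put.call('faq', 'first version')
put.call(:faq, 'second version') # same key after to_sym, so this overwrites
```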