Class: Pikuri::VectorDb::Backend::InMemory
- Inherits: Object
- Defined in: lib/pikuri/vector_db/backend/in_memory.rb
Overview
Pure-Ruby vector store. The educational default backend: the “small enough to audit” first stop that the demo and guide walk through before promoting users to Chroma for persistence.
What it does
Holds an in-memory Hash from chunk id to [Chunk, vector]. #query computes cosine similarity against every stored vector, sorts the scores in descending order, and returns the top-k as Backend::Result instances. The scan is linear in n, the number of stored chunks (each comparison costs O(d) for d-dimensional vectors), and the sort adds O(n log n). Fine for thousands of chunks (a personal notes folder, a single product’s docs); slow for millions (a full corporate knowledge base, which is the Chroma use case).
What it deliberately doesn’t do
- **No persistence.** RAM-only, intentional: the user who wants persistence picks Chroma. Reloads from sources on every boot, which makes the in-memory backend the natural teaching shape: the same code path the demo binary walks on startup is the one the user inspects when they’re learning what “indexing” actually means.
- **No approximate search.** Exhaustive scan. Approximate nearest neighbor (HNSW, IVF) adds complexity that doesn’t teach anything additional once the cosine math is clear.
Thread safety
Every public method runs under a single reentrant Monitor. The agent’s main thread calls #query while a background Watcher thread calls #replace_source / #delete_by_source, so concurrent access is real once auto-watch is wired. The lock’s load-bearing job is #replace_source: it holds the monitor across the delete-then-upsert so a concurrent #query never observes the gap where a source has zero chunks. Monitor (not a bare Mutex) because #replace_source re-enters the lock via #delete_by_source + #upsert, which a non-reentrant Mutex would deadlock on. Chroma needs no client-side lock — the server serializes — so this is the one backend that locks.
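The reentrancy argument above can be checked in isolation. The following is a minimal sketch, assuming a hypothetical `TinyStore` class (not part of Pikuri) that mirrors only the locking shape of #replace_source; it runs the same nested call once with a Monitor and once with a plain Mutex.

```ruby
require 'monitor'

# Hypothetical stand-in for the backend; only the locking shape is mirrored.
class TinyStore
  def initialize(lock)
    @lock = lock
    @rows = {} # id => source
  end

  def delete_by_source(source)
    @lock.synchronize { @rows.reject! { |_id, src| src == source } }
    nil
  end

  def upsert(rows)
    @lock.synchronize { @rows.merge!(rows) }
    nil
  end

  # Calls two methods that each take @lock again while it is already held.
  def replace_source(source, rows)
    @lock.synchronize do
      delete_by_source(source)
      upsert(rows)
    end
    nil
  end

  def count
    @lock.synchronize { @rows.size }
  end
end

# Monitor: the owning thread may re-enter, so the nested calls succeed.
reentrant = TinyStore.new(Monitor.new)
reentrant.upsert({ 'a1' => 'a.md', 'a2' => 'a.md' })
reentrant.replace_source('a.md', { 'a3' => 'a.md' })
monitor_count = reentrant.count

# Mutex: Ruby detects the recursive lock attempt and raises ThreadError.
plain = TinyStore.new(Mutex.new)
mutex_error =
  begin
    plain.replace_source('a.md', { 'a3' => 'a.md' })
    nil
  rescue ThreadError => e
    e.class
  end
```

In this sketch the Monitor-backed store ends with one row, while the Mutex-backed store never completes the replace because the inner `synchronize` raises.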
Cosine, not dot product
Some embedders return pre-normalized vectors (text-embedding-3, most sentence-transformers); others don’t. Cosine normalizes at compute time, so the backend works regardless of whether the embedder did. The readable two-pass form below (compute dot + magnitudes separately) is intentional over the single-loop micro-optimization — this is the file the newcomer reads to understand what’s happening.
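The backend's own cosine helper is private and not listed on this page. As an illustration only, the following sketch computes the dot product and the two magnitudes separately, as described above. The `cosine` name matches the call in #query, but the body and the zero-vector handling are assumptions.

```ruby
# Hypothetical helper; the real private method's body is not shown in these docs.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }      # pass 1: dot product
  mag_a = Math.sqrt(a.sum { |x| x * x })   # pass 2: magnitudes
  mag_b = Math.sqrt(b.sum { |y| y * y })
  return 0.0 if mag_a.zero? || mag_b.zero? # assumption: zero vectors score 0.0

  dot / (mag_a * mag_b)
end

query = [1.0, 0.0]
raw   = [3.0, 4.0] # not normalized (length 5)
unit  = [0.6, 0.8] # the same direction scaled to length 1

score_raw  = cosine(query, raw)
score_unit = cosine(query, unit) # same score, up to floating-point rounding

# A bare dot product would rank the two differently.
dot_raw  = query.zip(raw).sum { |x, y| x * y }
dot_unit = query.zip(unit).sum { |x, y| x * y }
```

Both cosine scores agree because the magnitude division cancels the scale of the stored vector, whereas the plain dot products differ by that scale.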
Instance Method Summary
- #count ⇒ Integer
  Current chunk count.
- #delete_all ⇒ void
  Drop every stored chunk.
- #delete_by_source(source) ⇒ void
  Remove every chunk whose source matches.
- #initialize ⇒ InMemory (constructor)
- #query(vector:, top_k:) ⇒ Array<Backend::Result>
  Cosine-similarity nearest neighbor search.
- #replace_source(source:, chunks:, vectors:) ⇒ void
  Atomically replace all chunks for one source: delete the old set, then upsert the new one, under a single hold of the monitor.
- #source_indexed?(source) ⇒ Boolean
  Is source in the corpus? The scoped membership test behind Tools::Read’s gate: a short-circuiting scan rather than building the whole #sources_with_hashes map just to read one key.
- #sources_with_hashes ⇒ Hash{String => String, nil}
  The boot-sweep reference: a map from each indexed source to the content hash stored on its chunks.
- #upsert(chunks:, vectors:) ⇒ void
  Insert-or-replace by chunk.id.
Constructor Details
#initialize ⇒ InMemory
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 63
    def initialize
      # id (String) → [Chunk, vector (Array<Float>)]
      @entries = {}
      # Dimension of every stored vector. +nil+ before the first
      # +#upsert+; locked to the dim of the first vector seen and
      # enforced for every subsequent +#upsert+ + +#query+; see
      # the Backend protocol's "Vector-dim contract" yardoc.
      @dim = nil
      # Reentrant so +#replace_source+ can call +#delete_by_source+
      # + +#upsert+ while holding the lock; see the class yardoc's
      # "Thread safety" section.
      @lock = Monitor.new
    end
Instance Method Details
#count ⇒ Integer
Returns current chunk count.
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 149
    def count
      @lock.synchronize { @entries.size }
    end
#delete_all ⇒ void
This method returns an undefined value.
Drop every stored chunk. Used by the v1 nuke-and-reload reindex flow; the embedder dim lock is also released so a reindex with a different embedder model starts clean.
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 140
    def delete_all
      @lock.synchronize do
        @entries.clear
        @dim = nil
      end
      nil
    end
#delete_by_source(source) ⇒ void
This method returns an undefined value.
Remove every chunk whose source matches. The scoped counterpart to #delete_all — drops one document’s chunks without touching the rest. No-op (and no error) when the source isn’t present. The dim lock is left intact: unlike #delete_all, a per-source delete doesn’t imply an embedder change.
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 163
    def delete_by_source(source)
      @lock.synchronize do
        @entries.reject! { |_id, (chunk, _vector)| chunk.source == source }
      end
      nil
    end
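The block passed to `reject!` destructures each Hash value, a `[chunk, vector]` pair, directly in its parameter list. A small sketch of that pattern, using a hypothetical `DemoChunk` Struct in place of the real Chunk class:

```ruby
# Hypothetical stand-in for the real Chunk; only the fields used here.
DemoChunk = Struct.new(:id, :source)

entries = {
  'a#0' => [DemoChunk.new('a#0', 'a.md'), [1.0, 0.0]],
  'a#1' => [DemoChunk.new('a#1', 'a.md'), [0.0, 1.0]],
  'b#0' => [DemoChunk.new('b#0', 'b.md'), [1.0, 1.0]]
}

# |_id, (chunk, _vector)| unpacks the key and the two-element value in one step.
entries.reject! { |_id, (chunk, _vector)| chunk.source == 'a.md' }
remaining = entries.keys

# Hash#reject! returns nil when nothing was removed; it does not raise.
unknown_result = entries.reject! { |_id, (chunk, _vector)| chunk.source == 'nope.md' }
```

Because `Hash#reject!` returns `nil` when no entry matches, an unknown source is naturally a no-op, which is consistent with the documented behavior.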
#query(vector:, top_k:) ⇒ Array<Backend::Result>
Cosine-similarity nearest neighbor search. Returns the top-k Results in descending score order; empty array when the store has no entries.
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 118
    def query(vector:, top_k:)
      raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0

      @lock.synchronize do
        return [] if @entries.empty?

        if vector.size != @dim
          raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
        end

        scored = @entries.values.map do |chunk, stored|
          Result.new(chunk: chunk, score: cosine(vector, stored))
        end
        scored.sort_by { |r| -r.score }.first(top_k)
      end
    end
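The top-k step is a full sort followed by `first(top_k)`. The sketch below isolates that step on a hypothetical `Scored` Struct (standing in for Backend::Result) and compares it with `Enumerable#max_by(n)`, an alternative that the backend does not use as far as the listing above shows.

```ruby
# Hypothetical stand-in for Backend::Result.
Scored = Struct.new(:id, :score)

scored = [
  Scored.new('a', 0.2),
  Scored.new('b', 0.9),
  Scored.new('c', 0.5),
  Scored.new('d', 0.7)
]

# The form used in #query: sort everything by descending score, keep the first k.
via_sort = scored.sort_by { |r| -r.score }.first(2).map(&:id)

# Alternative from the standard library: the k largest, in descending order.
via_max_by = scored.max_by(2, &:score).map(&:id)

# Asking for more results than exist returns everything rather than raising.
oversized = scored.sort_by { |r| -r.score }.first(10).size
```

With distinct scores the two forms select the same ids. The sort-then-slice version reads as a direct translation of "sort descending, take the top k", which fits the teaching goal stated in the overview.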
#replace_source(source:, chunks:, vectors:) ⇒ void
This method returns an undefined value.
Atomically replace all chunks for one source: delete the old set, then upsert the new one, under a single hold of the monitor. The incremental-reindex unit (see Indexer#reindex_file!). Holding the lock across both halves is the point — a concurrent #query sees either the old chunks or the new ones, never the empty gap between.
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 184
    def replace_source(source:, chunks:, vectors:)
      @lock.synchronize do
        delete_by_source(source)
        upsert(chunks: chunks, vectors: vectors)
      end
      nil
    end
#source_indexed?(source) ⇒ Boolean
Is source in the corpus? The scoped membership test behind Tools::Read’s gate: a short-circuiting scan rather than building the whole #sources_with_hashes map just to read one key. See the Backend protocol yardoc.
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 220
    def source_indexed?(source)
      @lock.synchronize do
        @entries.each_value.any? { |chunk, _vector| chunk.source == source }
      end
    end
#sources_with_hashes ⇒ Hash{String => String, nil}
The boot-sweep reference: a map from each indexed source to the content hash stored on its chunks. Watcher (via Indexer#reconcile_plan) diffs this against the hashes of the files currently on disk to decide what to reindex. Built from chunk metadata; a chunk indexed before the hash metadata existed maps its source to nil, which the diff treats as “changed” and reindexes — self-healing.
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 203
    def sources_with_hashes
      @lock.synchronize do
        result = {}
        @entries.each_value do |chunk, _vector|
          result[chunk.source] ||= chunk.metadata[:hash]
        end
        result
      end
    end
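To make the diff step concrete, here is a minimal sketch of how such a map might be compared against on-disk hashes. `plan_reindex` and its return shape are hypothetical; the real Indexer#reconcile_plan is not shown on this page, and the handling of sources that have vanished from disk is an assumption.

```ruby
# Hypothetical helper, not Pikuri's API.
# indexed: source => stored hash (or nil), as returned by #sources_with_hashes.
# on_disk: source => hash of the file currently on disk.
def plan_reindex(indexed, on_disk)
  # A nil stored hash never equals a real hash, so legacy chunks are reindexed.
  to_reindex = on_disk.select { |source, hash| indexed[source] != hash }.keys
  # Assumption: sources that are indexed but gone from disk should be dropped.
  to_delete = indexed.keys - on_disk.keys
  { reindex: to_reindex, delete: to_delete }
end

indexed = { 'a.md' => 'h1', 'b.md' => nil, 'c.md' => 'h3', 'gone.md' => 'h9' }
on_disk = { 'a.md' => 'h1', 'b.md' => 'h2', 'c.md' => 'h3-edited', 'new.md' => 'h4' }

plan = plan_reindex(indexed, on_disk)
```

In this example the unchanged `a.md` is skipped, while the legacy `b.md` (nil hash), the edited `c.md`, and the new `new.md` are all scheduled for reindexing. That is the self-healing behavior described above.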
#upsert(chunks:, vectors:) ⇒ void
This method returns an undefined value.
Insert-or-replace by chunk.id. Parallel arrays of equal length; raises on empty input or length mismatch. Vector dimension is locked at first upsert; raises on any subsequent vector of a different dim.
    # File 'lib/pikuri/vector_db/backend/in_memory.rb', line 87
    def upsert(chunks:, vectors:)
      raise ArgumentError, 'upsert called with empty chunks/vectors' if chunks.empty?
      if chunks.size != vectors.size
        raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
      end

      @lock.synchronize do
        expected = @dim || vectors.first.size
        vectors.each_with_index do |v, i|
          next if v.size == expected

          raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
        end
        @dim ||= expected
        chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
      end
      nil
    end
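One consequence of validating every vector before writing any of them is that a rejected batch leaves the store untouched. The sketch below illustrates this with a hypothetical `DimLockedStore` that keeps only the dimension lock and the insert-or-replace behavior. Plain string ids stand in for Chunk objects, and locking is omitted.

```ruby
# Hypothetical miniature of the upsert contract; not the real backend.
class DimLockedStore
  attr_reader :dim

  def initialize
    @entries = {} # id => vector
    @dim = nil    # locked by the first successful upsert
  end

  def upsert(ids, vectors)
    raise ArgumentError, 'empty input' if ids.empty?
    raise ArgumentError, 'size mismatch' if ids.size != vectors.size

    expected = @dim || vectors.first.size
    bad = vectors.index { |v| v.size != expected }
    raise ArgumentError, "vector #{bad} has dim #{vectors[bad].size}, expected #{expected}" if bad

    # Only reached when the whole batch passed validation.
    @dim ||= expected
    ids.zip(vectors) { |id, vector| @entries[id] = vector }
    nil
  end

  def count
    @entries.size
  end

  def vector_for(id)
    @entries[id]
  end
end

store = DimLockedStore.new
store.upsert(%w[a b], [[1.0, 0.0], [0.0, 1.0]]) # first upsert locks dim to 2
store.upsert(%w[a], [[0.5, 0.5]])               # same id: replaced, not duplicated

rejected =
  begin
    store.upsert(%w[c d], [[1.0, 1.0], [1.0, 2.0, 3.0]]) # second vector has dim 3
    false
  rescue ArgumentError
    true
  end
```

After the rejected batch, `c` is absent even though its own vector had the right dimension, because nothing is written until the whole batch validates.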