Class: Pikuri::VectorDb::Backend::InMemory

Inherits:

Object

Defined in: lib/pikuri/vector_db/backend/in_memory.rb

Overview

Pure-Ruby vector store and the educational default backend: the “small enough to audit” first stop that the demo and guide walk through before promoting users to Chroma for persistence.

What it does

Holds an in-memory Hash from chunk id to [Chunk, vector]. #query computes cosine similarity against every stored vector, sorts the scores in descending order, and returns the top-k as Backend::Result instances. Each query scores all n stored chunks (O(n) similarity computations, each linear in the vector dimension) and then sorts them (O(n log n)). Fine for thousands of chunks (a personal notes folder, a single product’s docs); slow for millions (a full corporate knowledge base; that is the Chroma use case).

What it deliberately doesn’t do

**No persistence.** RAM-only, intentional — the user who wants persistence picks Chroma. Reloads from sources on every boot, which makes the in-memory backend the natural teaching shape: the same code path the demo binary walks on startup is the one the user inspects when they’re learning what “indexing” actually means.
**No approximate search.** Exhaustive scan only. Approximate nearest neighbor indexes (HNSW, IVF) add complexity that doesn’t teach anything additional once the cosine math is clear.

Thread safety

Every public method runs under a single reentrant Monitor. The agent’s main thread calls #query while a background Watcher thread calls #replace_source / #delete_by_source, so concurrent access is real once auto-watch is wired. The lock’s load-bearing job is #replace_source: it holds the monitor across the delete-then-upsert so a concurrent #query never observes the gap where a source has zero chunks. Monitor (not a bare Mutex) because #replace_source re-enters the lock via #delete_by_source + #upsert, which a non-reentrant Mutex would deadlock on. Chroma needs no client-side lock — the server serializes — so this is the one backend that locks.
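
The reentrancy point can be seen in isolation with a minimal sketch. ToyStore and its methods are hypothetical stand-ins, not part of Pikuri; the only behavior relied on is that Ruby's Monitor allows the owning thread to re-enter, while Mutex raises ThreadError on recursive locking.

```ruby
require 'monitor'

# Hypothetical stand-in: every public method takes the same lock,
# and #replace re-enters it through #delete and #put.
class ToyStore
  def initialize(lock)
    @lock = lock
    @items = {}
  end

  def delete(key)
    @lock.synchronize { @items.delete(key) }
    nil
  end

  def put(key, value)
    @lock.synchronize { @items[key] = value }
    nil
  end

  # Holds the lock across both halves so a reader never sees the gap.
  def replace(key, value)
    @lock.synchronize do
      delete(key)
      put(key, value)
    end
    nil
  end

  def get(key)
    @lock.synchronize { @items[key] }
  end
end

# With a Monitor the nested synchronize calls succeed.
reentrant = ToyStore.new(Monitor.new)
reentrant.replace('a', 1)

# With a plain Mutex the inner synchronize raises instead of re-entering.
mutex_error = nil
begin
  ToyStore.new(Mutex.new).replace('a', 1)
rescue ThreadError => e
  mutex_error = e
end
```

In this sketch the Mutex variant fails on the first nested call, which is why a store shaped like this one needs a reentrant lock if its composite method is built from its own public methods.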

Cosine, not dot product

Some embedders return pre-normalized vectors (text-embedding-3, most sentence-transformers); others don’t. Cosine normalizes at compute time, so the backend works regardless of whether the embedder did. The readable two-pass form below (compute dot + magnitudes separately) is intentional over the single-loop micro-optimization — this is the file the newcomer reads to understand what’s happening.
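
The private cosine helper is not shown on this page. A minimal sketch of the readable form described above might look like the following; the zero-magnitude guard and the separate dot helper are assumptions added for illustration, not taken from the source.

```ruby
# Plain dot product, used here only for comparison.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Cosine similarity with the dot product and the magnitudes computed
# separately. Returning 0.0 for a zero vector is an assumption.
def cosine(a, b)
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if mag_a.zero? || mag_b.zero?

  dot(a, b) / (mag_a * mag_b)
end

query    = [1.0, 0.0]
aligned  = [1.0, 0.0] # same direction as the query, unit length
off_axis = [3.0, 4.0] # different direction, but five times longer

# Dot product rewards length: the off-axis vector scores higher.
dot_prefers_off_axis = dot(query, off_axis) > dot(query, aligned)

# Cosine divides the length out: the aligned vector scores higher.
cosine_prefers_aligned = cosine(query, aligned) > cosine(query, off_axis)
```

With vectors that are not pre-normalized, the two measures can rank the same pair in opposite orders, which is the case the backend guards against by normalizing at compute time.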

Instance Method Summary

#count ⇒ Integer

Current chunk count.
#delete_all ⇒ void

Drop every stored chunk.
#delete_by_source(source) ⇒ void

Remove every chunk whose source matches.
#initialize ⇒ InMemory constructor
#query(vector:, top_k:) ⇒ Array<Backend::Result>

Cosine-similarity nearest neighbor search.
#replace_source(source:, chunks:, vectors:) ⇒ void

Atomically replace all chunks for one source: delete the old set, then upsert the new one, under a single hold of the monitor.
#source_indexed?(source) ⇒ Boolean

Is source in the corpus? The scoped membership test behind Tools::Read’s gate: a short-circuiting scan rather than building the whole #sources_with_hashes map just to read one key.
#sources_with_hashes ⇒ Hash{String => String, nil}

The boot-sweep reference: a map from each indexed source to the content hash stored on its chunks.
#upsert(chunks:, vectors:) ⇒ void

Insert-or-replace by chunk.id.

Constructor Details

#initialize ⇒ `InMemory`

# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 63

def initialize
  # id (String) → [Chunk, vector (Array<Float>)]
  @entries = {}
  # Dimension of every stored vector. +nil+ before the first
  # +#upsert+; locked to the dim of the first vector seen and
  # enforced for every subsequent +#upsert+ + +#query+ — see
  # the Backend protocol's "Vector-dim contract" yardoc.
  @dim = nil
  # Reentrant so +#replace_source+ can call +#delete_by_source+
  # + +#upsert+ while holding the lock — see the class yardoc's
  # "Thread safety" section.
  @lock = Monitor.new
end

Instance Method Details

#count ⇒ `Integer`

Returns current chunk count.

Returns:

(Integer) —

current chunk count.



# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 149

def count
  @lock.synchronize { @entries.size }
end

#delete_all ⇒ `void`

This method returns an undefined value.

Drop every stored chunk. Used by the v1 nuke-and-reload reindex flow; the embedder dim lock is also released so a reindex with a different embedder model starts clean.

# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 140

def delete_all
  @lock.synchronize do
    @entries.clear
    @dim = nil
  end
  nil
end

#delete_by_source(source) ⇒ `void`

This method returns an undefined value.

Remove every chunk whose source matches. The scoped counterpart to #delete_all — drops one document’s chunks without touching the rest. No-op (and no error) when the source isn’t present. The dim lock is left intact: unlike #delete_all, a per-source delete doesn’t imply an embedder change.

Parameters:

source (String) —

the Chunk#source to purge, e.g. “notes/cooking.md”.

# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 163

def delete_by_source(source)
  @lock.synchronize do
    @entries.reject! { |_id, (chunk, _vector)| chunk.source == source }
  end
  nil
end

#query(vector:, top_k:) ⇒ `Array<Backend::Result>`

Cosine-similarity nearest neighbor search. Returns the top-k Results in descending score order; empty array when the store has no entries.

Parameters:

vector (Array<Float>) —

query vector; must match the stored vector dim.
top_k (Integer) —

number of results to return; must be positive.

Returns:

(Array<Backend::Result>)

Raises:

(ArgumentError) —

on top_k <= 0 or query-vector dim mismatch.

# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 118

def query(vector:, top_k:)
  raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0

  @lock.synchronize do
    return [] if @entries.empty?

    if vector.size != @dim
      raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
    end

    scored = @entries.values.map do |chunk, stored|
      Result.new(chunk: chunk, score: cosine(vector, stored))
    end
    scored.sort_by { |r| -r.score }.first(top_k)
  end
end

#replace_source(source:, chunks:, vectors:) ⇒ `void`

This method returns an undefined value.

Atomically replace all chunks for one source: delete the old set, then upsert the new one, under a single hold of the monitor. The incremental-reindex unit (see Indexer#reindex_file!). Holding the lock across both halves is the point: a concurrent #query sees either the old chunks or the new ones, never the empty gap between. Atomic here means isolation from concurrent readers, not rollback: as written, if the inner #upsert raises, the old chunks have already been deleted.

Parameters:

source (String) —

the Chunk#source being replaced.
chunks (Array<Chunk>) —

the new chunk set; every chunk.source should equal source.
vectors (Array<Array<Float>>) —

parallel to chunks.

Raises:

(ArgumentError) —

on empty input, length mismatch, or vector-dim mismatch (from the inner #upsert).

# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 184

def replace_source(source:, chunks:, vectors:)
  @lock.synchronize do
    delete_by_source(source)
    upsert(chunks: chunks, vectors: vectors)
  end
  nil
end

#source_indexed?(source) ⇒ `Boolean`

Is source in the corpus? The scoped membership test behind Tools::Read’s gate: a short-circuiting scan rather than building the whole #sources_with_hashes map just to read one key. See the Backend protocol yardoc.

Parameters:

source (String) —

the Chunk#source to test.

Returns:

(Boolean) —

true if at least one chunk has this source.

# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 220

def source_indexed?(source)
  @lock.synchronize do
    @entries.each_value.any? { |chunk, _vector| chunk.source == source }
  end
end

#sources_with_hashes ⇒ `Hash{String => String, nil}`

The boot-sweep reference: a map from each indexed source to the content hash stored on its chunks. Watcher (via Indexer#reconcile_plan) diffs this against the hashes of the files currently on disk to decide what to reindex. Built from chunk metadata; a chunk indexed before the hash metadata existed maps its source to nil, which the diff treats as “changed” and reindexes — self-healing.

Returns:

(Hash{String => String, nil}) —

source → content hash. Empty when nothing is indexed (the InMemory case at every boot, since RAM resets).

# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 203

def sources_with_hashes
  @lock.synchronize do
    result = {}
    @entries.each_value do |chunk, _vector|
      result[chunk.source] ||= chunk.metadata[:hash]
    end
    result
  end
end
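
How a caller might diff this map against the files on disk can be sketched as follows. This is not Indexer#reconcile_plan: the helper name plan_reindex, the plain-Hash inputs, the shape of the returned plan, and the delete branch are all assumptions for illustration. The only rule taken from the text above is that a nil stored hash counts as changed.

```ruby
# Hypothetical helper. `indexed` is a source => stored-hash map like the one
# #sources_with_hashes returns; `on_disk` is a source => current-hash map.
def plan_reindex(indexed, on_disk)
  # A source is reindexed when it is new, when its hash differs, or when the
  # stored hash is nil (nil never equals a real hash, so it reads as changed).
  to_reindex = on_disk.select { |source, hash| indexed[source] != hash }.keys

  # Assumption: sources that vanished from disk should be purged.
  to_delete = indexed.keys - on_disk.keys

  { reindex: to_reindex, delete: to_delete }
end

indexed = {
  'a.md'    => 'h1',  # unchanged
  'b.md'    => nil,   # indexed before hash metadata existed
  'c.md'    => 'h3',  # edited since indexing
  'gone.md' => 'h9'   # no longer on disk
}
on_disk = {
  'a.md' => 'h1',
  'b.md' => 'h2',
  'c.md' => 'h3-new',
  'd.md' => 'h4'      # new file
}

plan = plan_reindex(indexed, on_disk)
```

Here plan[:reindex] contains b.md, c.md, and d.md, and plan[:delete] contains gone.md; the unchanged a.md is left alone.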

#upsert(chunks:, vectors:) ⇒ `void`

This method returns an undefined value.

Insert-or-replace by chunk.id. Parallel arrays of equal length; raises on empty input or length mismatch. Vector dimension is locked at first upsert; raises on any subsequent vector of a different dim.

Parameters:

chunks (Array<Chunk>)
vectors (Array<Array<Float>>)

Raises:

(ArgumentError) —

on empty input, length mismatch, or vector-dim mismatch.

# File 'lib/pikuri/vector_db/backend/in_memory.rb', line 87

def upsert(chunks:, vectors:)
  raise ArgumentError, 'upsert called with empty chunks/vectors' if chunks.empty?
  if chunks.size != vectors.size
    raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
  end

  @lock.synchronize do
    expected = @dim || vectors.first.size
    vectors.each_with_index do |v, i|
      next if v.size == expected

      raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
    end
    @dim ||= expected

    chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
  end
  nil
end
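
The dimension lock can be exercised on its own with a small stand-in. DimLock, check!, and reset! are hypothetical names; the checking logic mirrors the loop in #upsert above, and reset! mirrors what #delete_all does to @dim.

```ruby
# Hypothetical stand-in that keeps only the dimension-locking logic.
class DimLock
  attr_reader :dim

  def initialize
    @dim = nil
  end

  # Validates a batch against the locked dim, locking it on first success.
  def check!(vectors)
    expected = @dim || vectors.first.size
    vectors.each_with_index do |v, i|
      next if v.size == expected

      raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
    end
    @dim ||= expected
  end

  # Releases the lock, as a full wipe of the store would.
  def reset!
    @dim = nil
  end
end

lock = DimLock.new

# A mixed-dimension first batch is rejected before the dim is locked.
mixed_rejected = begin
  lock.check!([[1.0, 2.0], [1.0, 2.0, 3.0]])
  false
rescue ArgumentError
  true
end
dim_after_failure = lock.dim

# The first valid batch locks the dimension at 3.
lock.check!([[0.1, 0.2, 0.3]])
locked_dim = lock.dim

# A later batch of a different dimension is rejected.
mismatch_rejected = begin
  lock.check!([[1.0, 2.0]])
  false
rescue ArgumentError
  true
end

# After a reset the next batch sets a new dimension.
lock.reset!
lock.check!([[1.0, 2.0]])
dim_after_reset = lock.dim
```

One consequence visible in the sketch: because validation runs before the dimension is stored, a failed first batch leaves the lock unset rather than pinning it to the first vector's size.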
