Class: Pikuri::VectorDb::Backend::Chroma
- Inherits: Object
  - Object
  - Pikuri::VectorDb::Backend::Chroma
- Defined in: lib/pikuri/vector_db/backend/chroma.rb
Overview
Thin Faraday HTTP client against a self-hosted Chroma server (v2 API). The persistent backend, behind the same duck-typed Pikuri::VectorDb::Backend protocol as InMemory: same method names, same return shapes, same ArgumentError contract on empty input + non-positive top_k. Where the two diverge is the vector-dim contract — see below.
Two ways to get one
- **Bring your own.** Backend::Chroma.new(host:, port:, collection:) against an existing chroma deployment (production cluster, docker-compose stack, a chroma already running on the host for an unrelated project). The host owns the process; this class is purely the HTTP client.
- **Let pikuri manage it.** ChromaServer.ensure_running spawns and supervises a chroma container under the pikuri-internal-chroma name, against a pinned image, with a bind-mounted volume in the user's cache dir. Its #client(collection:) returns a Backend::Chroma pre-pointed at the supervised container. The split is deliberate: docker lifecycle and HTTP wire protocol have nothing in common, so each lives in its own class.
Chroma v2 API
Endpoints used:
- POST /api/v2/tenants/{tenant}/databases/{db}/collections with get_or_create: true. Idempotent collection creation; returns {id, name, …}.
- POST /api/v2/…/collections/{id}/upsert. Insert or replace by id. Body carries parallel arrays of ids, embeddings, documents, metadatas.
- POST /api/v2/…/collections/{id}/query. k-NN search. Body: {query_embeddings, n_results, include}.
- GET /api/v2/…/collections/{id}/count. Integer count.
- DELETE /api/v2/…/collections/{id}. Drop the collection (used by #delete_all).
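The request bodies listed above can be sketched in plain Ruby. This is a minimal illustration, not the class's actual code path: SketchChunk is a hypothetical stand-in for pikuri's Chunk, and the helper names upsert_body and query_body are assumptions made for this example.

```ruby
# Hypothetical stand-in for pikuri's Chunk (assumption: the real class
# exposes id, source, text and metadata readers).
SketchChunk = Struct.new(:id, :source, :text, :metadata, keyword_init: true)

# Build the upsert body: four parallel arrays, one entry per chunk.
def upsert_body(chunks, vectors)
  {
    ids: chunks.map(&:id),
    embeddings: vectors,
    documents: chunks.map(&:text),
    # source rides along as a metadata key; user keys are stringified.
    metadatas: chunks.map do |c|
      { 'source' => c.source }.merge(c.metadata.transform_keys(&:to_s))
    end
  }
end

# Build the k-NN query body for a single query vector.
def query_body(vector, top_k)
  {
    query_embeddings: [vector],
    n_results: top_k,
    include: %w[documents metadatas distances]
  }
end
```

Keeping body construction in pure functions like these makes the wire shape easy to assert on without a running server.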
BYO embeddings (not Chroma’s embedder)
Chroma collections can carry an embedding function in their metadata — Chroma’s term for what pikuri calls an Embedder. When configured, add / query accept raw text via documents / query_texts and Chroma embeds server-side. We deliberately don’t use this: pikuri’s Embedder is the one source of truth for embedder choice, the provider-cliff visibility lives in pikuri’s config, and a parallel Chroma-side embedder config would split the truth without pikuri noticing (e.g. local embedder in pikuri + OpenAIEmbeddingFunction in Chroma — every indexed document silently lands at OpenAI). We always send pre-computed embeddings; Chroma’s collection embedder is never invoked.
Vector-dim contract diverges from InMemory
InMemory enforces vector-dim consistency client-side (locks on first upsert, raises ArgumentError on mismatch). Chroma enforces server-side — first upsert to a collection establishes the dim; mismatched subsequent upserts produce HTTP 4xx which propagates as RuntimeError. Different exception class, same loud-failure shape. Documented divergence; not worth parsing Chroma’s error envelope to coerce to ArgumentError.
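Callers that need to treat both backends uniformly can rescue both exception classes. The sketch below uses two hypothetical fake backends (FakeInMemoryBackend, FakeChromaBackend) that mimic only the failure mode described above; the helper name try_upsert and the error messages are assumptions for illustration.

```ruby
# Hypothetical fake: locks the dim client-side, raises ArgumentError.
class FakeInMemoryBackend
  def upsert(chunks:, vectors:)
    dim = vectors.first.size
    @dim ||= dim
    raise ArgumentError, "vector dim #{dim} does not match #{@dim}" unless dim == @dim

    nil
  end
end

# Hypothetical fake: simulates a server-side rejection surfacing as
# RuntimeError (a bare `raise` with a String raises RuntimeError).
class FakeChromaBackend
  def upsert(chunks:, vectors:)
    dim = vectors.first.size
    @dim ||= dim
    raise 'Backend::Chroma: upsert returned HTTP 400' unless dim == @dim

    nil
  end
end

# Returns :ok on success, or the exception class on a loud failure.
def try_upsert(backend, chunks:, vectors:)
  backend.upsert(chunks: chunks, vectors: vectors)
  :ok
rescue ArgumentError, RuntimeError => e
  e.class
end
```

Note that this rescue is broad: it also catches the ArgumentError raised for empty input or a size mismatch, so it suits "did the write fail loudly" checks rather than precise diagnosis.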
Lazy collection resolution
Backend::Chroma.new doesn’t talk to the server. The first #upsert / #query / #count call resolves (and creates if missing) the collection by name, caches the id, and uses it thereafter. #delete_all drops the collection and clears the cached id; the next #upsert re-creates from scratch.
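The resolve-once, cache, and reset pattern described here can be illustrated in isolation. This is a sketch under assumptions: LazyCollectionId and its resolver block are hypothetical names, and the block stands in for the get_or_create HTTP call.

```ruby
# Hypothetical helper: resolves a collection id on first use and
# caches it until reset.
class LazyCollectionId
  def initialize(name, &resolver)
    @name = name
    @resolver = resolver
    @collection_id = nil # nothing is resolved at construction time
  end

  # First call invokes the resolver; later calls reuse the cached id.
  def id
    @collection_id ||= @resolver.call(@name)
  end

  # Mirrors what happens after the collection is dropped: forget the
  # id so the next access resolves (and re-creates) it again.
  def reset!
    @collection_id = nil
  end
end
```

A caller would construct it with a block that performs the lookup, for example `LazyCollectionId.new('docs') { |name| lookup(name) }`, where lookup is whatever performs the request.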
Cosine distance (matches InMemory)
Collection is created with hnsw.space: 'cosine'. Chroma returns cosine distance (range [0, 2], where 0 = identical, 1 = orthogonal, 2 = opposite); #query converts to similarity via 1 - distance so the Result score has the same meaning across backends.
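A small worked example of the distance-to-score conversion, assuming cosine distance is defined as 1 minus cosine similarity. The helper names are hypothetical and not part of the class.

```ruby
# Cosine similarity of two equal-length, non-zero vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  dot / norm
end

# Cosine distance, assumed here to be 1 - similarity (range [0, 2]).
def cosine_distance(a, b)
  1.0 - cosine_similarity(a, b)
end

# The conversion applied to each returned distance.
def score_from_distance(distance)
  1.0 - distance.to_f
end
```

With unit vectors along the axes, identical vectors give a score of 1.0, orthogonal vectors 0.0, and opposite vectors -1.0, so scores land in [-1, 1].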
Metadata key normalization
Chroma serializes through JSON, so Symbol metadata keys become Strings on round-trip. #upsert converts the incoming Chunk's metadata keys to Strings before sending; #query converts them back to Symbols on the way out, so the Chunk a caller pulls from a query looks identical to one stored in InMemory. source rides as a special metadata key (Chroma has no native source concept).
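The round-trip can be demonstrated with the standard JSON library alone. This is a minimal sketch; to_wire and from_wire are hypothetical helper names, not methods of the class.

```ruby
require 'json'

# Outbound: fold source into the metadata Hash and stringify user keys.
def to_wire(source, metadata)
  { 'source' => source }.merge(metadata.transform_keys(&:to_s))
end

# Inbound: pull source back out and symbolize the remaining keys.
def from_wire(wire)
  source = wire['source'] || ''
  metadata = wire.reject { |k, _| k == 'source' }.transform_keys(&:to_sym)
  [source, metadata]
end

# Simulate the JSON hop that turns Symbol keys into Strings.
wire = JSON.parse(JSON.generate(to_wire('notes.md', { page: 3, lang: 'en' })))
source, metadata = from_wire(wire)
```

One consequence of this design worth keeping in mind: a user metadata key named source would collide with the reserved key.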
Testing posture
Specs use Faraday::Adapter::Test stubs only — they verify “we send what we think we’re sending” against the v2 API shape but don’t catch real-Chroma protocol drift. Real-Chroma smoke testing is wired into the demo binary in a later phase. Targets Chroma 0.5.x+ (v2 API).
Instance Method Summary
- #count ⇒ Integer
  Current chunk count.
- #delete_all ⇒ void
  Drop the collection.
- #initialize(host:, port:, collection:, tenant: 'default_tenant', database: 'default_database', connection: nil) ⇒ Chroma constructor
- #query(vector:, top_k:) ⇒ Array<Backend::Result>
  k-NN query by cosine similarity.
- #upsert(chunks:, vectors:) ⇒ void
  Insert-or-replace by chunk.id.
Constructor Details
#initialize(host:, port:, collection:, tenant: 'default_tenant', database: 'default_database', connection: nil) ⇒ Chroma
# File 'lib/pikuri/vector_db/backend/chroma.rb', line 128

def initialize(host:, port:, collection:, tenant: 'default_tenant', database: 'default_database', connection: nil)
  raise ArgumentError, 'host must be non-empty' if host.nil? || host.to_s.empty?
  raise ArgumentError, 'collection must be non-empty' if collection.nil? || collection.to_s.empty?

  @host = host
  @port = port
  @collection_name = collection
  @tenant = tenant
  @database = database
  @collection_id = nil
  @connection = connection || Faraday.new(url: "http://#{host}:#{port}") do |f|
    f.request :json
    f.response :json
    f.adapter Faraday.default_adapter
  end
end
Instance Method Details
#count ⇒ Integer
Returns current chunk count. Zero before the first #upsert.
# File 'lib/pikuri/vector_db/backend/chroma.rb', line 257

def count
  return 0 if @collection_id.nil? && !collection_exists?

  response = @connection.get("#{collection_path}/count")
  unless response.status == 200
    raise "Backend::Chroma: GET #{collection_path}/count returned " \
          "HTTP #{response.status}: #{response.body.inspect}"
  end
  body = response.body
  # Chroma v2 returns the count as a bare integer.
  return body if body.is_a?(Integer)
  return body['count'] if body.is_a?(Hash) && body['count'].is_a?(Integer)

  raise "Backend::Chroma: count response was not an Integer (got #{body.inspect})"
end
#delete_all ⇒ void
This method returns an undefined value.
Drop the collection. The next #upsert re-creates it from scratch; that's the v1 nuke-and-reload reindex path the Indexer drives. No-op if no collection was ever created (consistent with InMemory's clear-on-empty behaviour). A 404 on the DELETE is treated as "already gone", so the call is idempotent.
# File 'lib/pikuri/vector_db/backend/chroma.rb', line 243

def delete_all
  return nil if @collection_id.nil? && !collection_exists?

  response = @connection.delete(collection_path)
  unless [200, 204, 404].include?(response.status)
    raise "Backend::Chroma: DELETE #{collection_path} returned " \
          "HTTP #{response.status}: #{response.body.inspect}"
  end
  @collection_id = nil
  nil
end
#query(vector:, top_k:) ⇒ Array<Backend::Result>
# File 'lib/pikuri/vector_db/backend/chroma.rb', line 199

def query(vector:, top_k:)
  raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0
  # If we've never upserted, the collection doesn't
  # exist yet, so the semantic answer is "no hits."
  return [] if @collection_id.nil? && !collection_exists?

  response_body = post_json("#{collection_path}/query", {
    query_embeddings: [vector],
    n_results: top_k,
    include: %w[documents metadatas distances]
  })
  ids = (response_body['ids'] || [[]]).first || []
  docs = (response_body['documents'] || [[]]).first || []
  metadatas = (response_body['metadatas'] || [[]]).first || []
  dists = (response_body['distances'] || [[]]).first || []

  ids.each_with_index.map do |id, i|
    raw_metadata = metadatas[i] || {}
    # Pull +source+ back out of the metadata blob;
    # symbolize the remaining keys for round-trip
    # consistency with InMemory.
    source = raw_metadata['source'] || ''
    metadata = {}
    raw_metadata.each do |k, v|
      next if k == 'source'

      metadata[k.to_sym] = v
    end
    chunk = Chunk.new(id: id, source: source, text: docs[i] || '', metadata: metadata)
    Result.new(chunk: chunk, score: 1.0 - dists[i].to_f)
  end
end
#upsert(chunks:, vectors:) ⇒ void
This method returns an undefined value.
Insert-or-replace by chunk.id. chunks and vectors are parallel arrays of equal length; raises ArgumentError on empty input or length mismatch (same contract as InMemory). The Chroma server enforces vector-dim consistency; mismatched dims surface as RuntimeError from a 4xx response (the InMemory backend raises ArgumentError for the same case; documented divergence).
# File 'lib/pikuri/vector_db/backend/chroma.rb', line 161

def upsert(chunks:, vectors:)
  raise ArgumentError, 'upsert called with empty chunks/vectors' if chunks.empty?
  if chunks.size != vectors.size
    raise ArgumentError,
          "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
  end

  ensure_collection!

  metadatas = chunks.map do |c|
    # Serialize +source+ as a reserved key in Chroma's
    # +metadata+; merge in the user's metadata Hash with
    # keys stringified for JSON round-trip stability.
    base = { 'source' => c.source }
    c.metadata.each { |k, v| base[k.to_s] = v }
    base
  end
  body = {
    ids: chunks.map(&:id),
    embeddings: vectors,
    documents: chunks.map(&:text),
    metadatas: metadatas
  }
  post_json("#{collection_path}/upsert", body)
  nil
end