Class: Pikuri::VectorDb::Backend::Chroma

Inherits:
Object
Defined in:
lib/pikuri/vector_db/backend/chroma.rb

Overview

Thin Faraday HTTP client against a self-hosted Chroma server (v2 API). This is the persistent backend, behind the same duck-typed Pikuri::VectorDb::Backend protocol as InMemory: same method names, same return shapes, same ArgumentError contract on empty input and non-positive top_k. The two diverge on the vector-dim contract (see "Vector-dim contract diverges from InMemory").

Two ways to get one

  • **Bring your own.** Backend::Chroma.new(host:, port:, collection:) against an existing chroma deployment (production cluster, docker-compose stack, a chroma already running on the host for an unrelated project). The host owns the process; this class is purely the HTTP client.

  • **Let pikuri manage it.** ChromaServer.ensure_running spawns and supervises a chroma container under the pikuri-internal-chroma name, against a pinned image, with a bind-mounted volume in the user’s cache dir. Its #client(collection:) returns a Backend::Chroma pre-pointed at the supervised container. The split is deliberate: docker lifecycle and HTTP wire protocol have nothing in common, so each lives in its own class.

Chroma v2 API

Endpoints used:

  • POST /api/v2/tenants/{tenant}/databases/{db}/collections with get_or_create: true — idempotent collection creation. Returns {id, name, …}.

  • POST /api/v2/…/collections/{id}/upsert — insert or replace by id. Body carries parallel arrays of ids, embeddings, documents, metadatas.

  • POST /api/v2/…/collections/{id}/query — k-NN search. Body: {query_embeddings, n_results, include}.

  • GET /api/v2/…/collections/{id}/count — integer count.

  • DELETE /api/v2/…/collections/{id} — drop the collection (used by #delete_all).
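The request bodies named in the list above can be sketched as plain Hashes. This is a minimal illustration, not pikuri's code: upsert_body and query_body are hypothetical helper names, and only the field names come from the endpoint descriptions.

```ruby
require 'json'

# Hypothetical helper (not part of pikuri): body for the upsert endpoint.
# The four arrays are parallel, so entry i of each describes the same record.
def upsert_body(ids:, embeddings:, documents:, metadatas:)
  sizes = [ids, embeddings, documents, metadatas].map(&:size).uniq
  raise ArgumentError, 'parallel arrays must have equal length' unless sizes.size == 1

  { ids: ids, embeddings: embeddings, documents: documents, metadatas: metadatas }
end

# Hypothetical helper (not part of pikuri): body for the query endpoint.
# query_embeddings is a list of query vectors; a single query is wrapped in an Array.
def query_body(vector:, top_k:)
  { query_embeddings: [vector], n_results: top_k, include: %w[documents metadatas distances] }
end

puts JSON.generate(upsert_body(ids: ['a'], embeddings: [[0.1, 0.2]],
                               documents: ['hello'], metadatas: [{ 'source' => 'x.md' }]))
puts JSON.generate(query_body(vector: [0.1, 0.2], top_k: 3))
```

Because query_embeddings is a list of queries, the response fields come back nested one level deep, which is why #query unwraps each with .first.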

BYO embeddings (not Chroma’s embedder)

Chroma collections can carry an embedding function in their metadata — Chroma’s term for what pikuri calls an Embedder. When configured, add / query accept raw text via documents / query_texts and Chroma embeds server-side. We deliberately don’t use this: pikuri’s Embedder is the one source of truth for embedder choice, the provider-cliff visibility lives in pikuri’s config, and a parallel Chroma-side embedder config would split the truth without pikuri noticing (e.g. local embedder in pikuri + OpenAIEmbeddingFunction in Chroma — every indexed document silently lands at OpenAI). We always send pre-computed embeddings; Chroma’s collection embedder is never invoked.

Vector-dim contract diverges from InMemory

InMemory enforces vector-dim consistency client-side (locks on first upsert, raises ArgumentError on mismatch). Chroma enforces server-side — first upsert to a collection establishes the dim; mismatched subsequent upserts produce HTTP 4xx which propagates as RuntimeError. Different exception class, same loud-failure shape. Documented divergence; not worth parsing Chroma’s error envelope to coerce to ArgumentError.
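To make the divergence concrete, here is a small sketch of the two failure shapes. DimGuard is a hypothetical stand-in for a client-side dimension lock in the spirit of InMemory (it is not InMemory's actual code), and the RuntimeError is simulated rather than produced by a real Chroma response. safe_upsert is likewise a hypothetical caller-side wrapper.

```ruby
# Hypothetical client-side guard: locks the dimension on the first vector
# it sees and raises ArgumentError on any later mismatch.
class DimGuard
  def initialize
    @dim = nil
  end

  def check!(vectors)
    vectors.each do |v|
      @dim ||= v.size
      next if v.size == @dim

      raise ArgumentError, "vector dim mismatch: expected #{@dim}, got #{v.size}"
    end
  end
end

# Hypothetical caller-side wrapper: code that must work with either backend
# can rescue both exception classes, since neither inherits from the other.
def safe_upsert
  yield
  :ok
rescue ArgumentError, RuntimeError => e
  warn "upsert failed: #{e.class}: #{e.message}"
  :failed
end

guard = DimGuard.new
safe_upsert { guard.check!([[0.1, 0.2, 0.3]]) }        # client-side check passes
safe_upsert { guard.check!([[0.1, 0.2]]) }             # ArgumentError (client-side shape)
safe_upsert { raise 'simulated HTTP 4xx from server' } # RuntimeError (server-side shape)
```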

Lazy collection resolution

Backend::Chroma.new doesn’t talk to the server. The first #upsert / #query / #count call resolves (and creates if missing) the collection by name, caches the id, and uses it thereafter. #delete_all drops the collection and clears the cached id; the next #upsert re-creates from scratch.
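The resolve-once-then-cache behaviour described above can be sketched in isolation. LazyCollection and its resolver block are hypothetical names for illustration; in the real class the resolver would be the get_or_create HTTP call, which is replaced here by a counter so the example runs without a server.

```ruby
# Hypothetical sketch of lazy id resolution: construction does no work,
# the first #id call resolves and caches, #reset! forgets the cached id.
class LazyCollection
  def initialize(name, &resolver)
    @name = name
    @resolver = resolver
    @id = nil
  end

  # Resolve on first use, then reuse the cached value.
  def id
    @id ||= @resolver.call(@name)
  end

  # Mirrors what #delete_all does to the cached id.
  def reset!
    @id = nil
  end
end

calls = 0
collection = LazyCollection.new('docs') do |name|
  calls += 1
  "id-for-#{name}-#{calls}"
end

collection.id      # resolver runs once
collection.id      # cached, resolver not called again
collection.reset!
collection.id      # resolver runs a second time
puts calls
```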

Cosine distance (matches InMemory)

Collection is created with hnsw.space: 'cosine'. Chroma returns cosine distance (range [0, 2] where 0 = identical, 1 = orthogonal); #query converts to similarity via 1 - distance so the Result score has the same meaning across backends.
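A short worked example of the conversion, assuming non-zero vectors. cosine_distance and similarity are hypothetical helper names; Chroma computes the distance server-side, so the first function only reproduces the definition for illustration.

```ruby
# Cosine distance as defined above: 1 - cos(angle between a and b).
# Assumes both vectors are non-zero and of equal length.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# The conversion applied to each returned distance.
def similarity(distance)
  1.0 - distance
end

identical  = cosine_distance([1.0, 0.0], [1.0, 0.0])  # distance 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # distance 1
opposite   = cosine_distance([1.0, 0.0], [-1.0, 0.0]) # distance 2

puts [identical, orthogonal, opposite].map { |d| similarity(d) }.inspect
```

The resulting similarity therefore ranges over [-1, 1]: 1 for identical directions, 0 for orthogonal, -1 for opposite.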

Metadata key normalization

Chroma serializes through JSON, so Symbol metadata keys become Strings on round-trip. #upsert converts the incoming Chunk's metadata keys to Strings before sending; #query converts them back to Symbols on the way out, so the Chunk a caller pulls from a query looks identical to one stored in InMemory. source rides as a special metadata key (Chroma has no native source concept).
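The round-trip can be sketched with the standard JSON library standing in for the server. to_wire and from_wire are hypothetical names that mirror the key handling described above; they are not methods of Backend::Chroma.

```ruby
require 'json'

# Outbound: fold source into the metadata under a reserved key and
# stringify the caller's keys.
def to_wire(source, metadata)
  wire = { 'source' => source }
  metadata.each { |k, v| wire[k.to_s] = v }
  wire
end

# Inbound: pull source back out and symbolize the remaining keys.
def from_wire(wire)
  source = wire['source'] || ''
  metadata = wire.reject { |k, _| k == 'source' }.transform_keys(&:to_sym)
  [source, metadata]
end

# JSON.generate followed by JSON.parse simulates the trip through Chroma.
wire = JSON.parse(JSON.generate(to_wire('notes.md', { page: 3, lang: 'en' })))
source, metadata = from_wire(wire)
puts source
puts metadata.inspect
```

One consequence of reserving the key this way: a caller-supplied metadata key named source would overwrite the chunk's source on the way out and would not come back as metadata.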

Testing posture

Specs use Faraday::Adapter::Test stubs only — they verify “we send what we think we’re sending” against the v2 API shape but don’t catch real-Chroma protocol drift. Real-Chroma smoke testing is wired into the demo binary in a later phase. Targets Chroma 0.5.x+ (v2 API).

Constructor Details

#initialize(host:, port:, collection:, tenant: 'default_tenant', database: 'default_database', connection: nil) ⇒ Chroma

Parameters:

  • host (String)
  • port (Integer)
  • collection (String)

    collection name in Chroma. This is a Chroma-specific identifier, so it lives here rather than on VectorDb::Extension (where it’d be a no-op for Backend::InMemory).

  • tenant (String) (defaults to: 'default_tenant')

    Chroma v2 tenant; defaults to Chroma’s own default.

  • database (String) (defaults to: 'default_database')

    Chroma v2 database; defaults to Chroma’s own default.

  • connection (Faraday::Connection, nil) (defaults to: nil)

    optional dependency-injection point for tests.

Raises:

  • (ArgumentError)

    on empty host or empty collection.



# File 'lib/pikuri/vector_db/backend/chroma.rb', line 128

def initialize(host:, port:, collection:,
               tenant: 'default_tenant',
               database: 'default_database',
               connection: nil)
  raise ArgumentError, 'host must be non-empty' if host.nil? || host.to_s.empty?
  raise ArgumentError, 'collection must be non-empty' if collection.nil? || collection.to_s.empty?

  @host = host
  @port = port
  @collection_name = collection
  @tenant = tenant
  @database = database
  @collection_id = nil
  @connection = connection || Faraday.new(url: "http://#{host}:#{port}") do |f|
    f.request :json
    f.response :json
    f.adapter Faraday.default_adapter
  end
end

Instance Method Details

#count ⇒ Integer

Returns current chunk count. Zero before the first #upsert.

Returns:

  • (Integer)

    current chunk count. Zero before the first #upsert.



# File 'lib/pikuri/vector_db/backend/chroma.rb', line 257

def count
  return 0 if @collection_id.nil? && !collection_exists?

  response = @connection.get("#{collection_path}/count")
  unless response.status == 200
    raise "Backend::Chroma: GET #{collection_path}/count returned " \
          "HTTP #{response.status}: #{response.body.inspect}"
  end

  body = response.body
  # Chroma v2 returns the count as a bare integer.
  return body if body.is_a?(Integer)
  return body['count'] if body.is_a?(Hash) && body['count'].is_a?(Integer)

  raise "Backend::Chroma: count response was not an Integer (got #{body.inspect})"
end

#delete_all ⇒ void

This method returns an undefined value.

Drop the collection. The next #upsert re-creates it from scratch; that's the v1 nuke-and-reload reindex path the Indexer drives. No-op if no collection was ever created (consistent with InMemory's clear-on-empty behaviour). A 404 on the DELETE is treated as "already gone", so the call is idempotent.



# File 'lib/pikuri/vector_db/backend/chroma.rb', line 243

def delete_all
  return nil if @collection_id.nil? && !collection_exists?

  response = @connection.delete(collection_path)
  unless [200, 204, 404].include?(response.status)
    raise "Backend::Chroma: DELETE #{collection_path} returned " \
          "HTTP #{response.status}: #{response.body.inspect}"
  end
  @collection_id = nil
  nil
end

#query(vector:, top_k:) ⇒ Array<Backend::Result>

k-NN query by cosine similarity. Returns at most top_k Results descending by score. score is 1 - cosine_distance so the value matches InMemory's cosine-similarity scale.

Parameters:

  • vector (Array<Float>)
  • top_k (Integer)

Returns:

  • (Array<Backend::Result>)

    at most top_k Results, descending by score.

Raises:

  • (ArgumentError)

    on non-positive top_k.

  • (RuntimeError)

    on HTTP failure.



# File 'lib/pikuri/vector_db/backend/chroma.rb', line 199

def query(vector:, top_k:)
  raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0

  # If we've never upserted, the collection doesn't
  # exist yet — semantic answer is "no hits."
  return [] if @collection_id.nil? && !collection_exists?

  response_body = post_json("#{collection_path}/query", {
                              query_embeddings: [vector],
                              n_results: top_k,
                              include: %w[documents metadatas distances]
                            })

  ids = (response_body['ids']       || [[]]).first || []
  docs = (response_body['documents'] || [[]]).first || []
  metas = (response_body['metadatas'] || [[]]).first || []
  dists = (response_body['distances'] || [[]]).first || []

  ids.each_with_index.map do |id, i|
    meta = metas[i] || {}
    # Pull +source+ back out of the metadata blob;
    # symbolize the remaining keys for round-trip
    # consistency with InMemory.
    source = meta['source'] || ''
    chunk_meta = {}
    meta.each do |k, v|
      next if k == 'source'

      chunk_meta[k.to_sym] = v
    end

    chunk = Chunk.new(id: id, source: source, text: docs[i] || '', metadata: chunk_meta)
    Result.new(chunk: chunk, score: 1.0 - dists[i].to_f)
  end
end

#upsert(chunks:, vectors:) ⇒ void

This method returns an undefined value.

Insert-or-replace by chunk.id. chunks and vectors are parallel arrays of equal length; raises on empty input or length mismatch (same contract as InMemory). The Chroma server enforces vector-dim consistency; mismatched dims surface as RuntimeError from a 4xx response (the InMemory backend raises ArgumentError for the same case, a documented divergence).

Parameters:

  • chunks (Array<Chunk>)
  • vectors (Array<Array<Float>>)

Raises:

  • (ArgumentError)

    on empty input or length mismatch.

  • (RuntimeError)

    on HTTP failure.



# File 'lib/pikuri/vector_db/backend/chroma.rb', line 161

def upsert(chunks:, vectors:)
  raise ArgumentError, 'upsert called with empty chunks/vectors' if chunks.empty?
  if chunks.size != vectors.size
    raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
  end

  ensure_collection!

  metadatas = chunks.map do |c|
    # Serialize +source+ as a reserved key in Chroma's
    # +metadata+; merge in the user's metadata Hash with
    # keys stringified for JSON round-trip stability.
    base = { 'source' => c.source }
    c.metadata.each { |k, v| base[k.to_s] = v }
    base
  end

  body = {
    ids: chunks.map(&:id),
    embeddings: vectors,
    documents: chunks.map(&:text),
    metadatas: metadatas
  }

  post_json("#{collection_path}/upsert", body)
  nil
end