Class: Pikuri::VectorDb::Chunk

Inherits:
Data
  • Object
Defined in:
lib/pikuri/vector_db/chunk.rb

Overview

A unit of text after chunking, before storage. The Indexer composes a Chunk + its vector and hands the pair to a Backend for storage; Search surfaces query hits back to the LLM as Chunks wrapped in Backend::Results.

Fields

  • id — opaque internal identifier, unique within a backend. The Indexer composes it from (source, offset, content_hash) so that an eventual incremental reindex can replace specific chunks by id without scanning by text content. v1’s nuke-and-reload doesn’t depend on this, but the key scheme is forward-compatible. **Not for citation**: that’s source’s job; the LLM never sees id.

  • source — user-facing reference for what was indexed: a relative file path for filesystem sources (“notes/cooking.md”), a URL for fetched docs, or a doc ID for future connector sources (Google Drive, Notion). Surfaced verbatim by Search as the citation alongside each hit; the LLM uses it to say “see notes/cooking.md” rather than answering from a chunk of unknown origin. One source typically produces many chunks (a 512-token sliding window over a 20 KB Markdown file ≈ 10 chunks), so source is not unique within a backend; uniqueness is the id’s role.

  • text — the chunk’s text content; the unit fed to the embedder for vectorization, and surfaced verbatim to the LLM as the snippet alongside its source citation.

  • metadata — free-form Hash for the optional extras that some source types carry: { offset:, page:, anchor:, … }. Not every source type needs every field, so the open Hash beats a fixed shape. Surfaced to the LLM in the search observation when present.
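
As a rough illustration of the fields above, the following sketch builds one chunk by hand. The local Chunk definition is a stand-in for the real class, and build_chunk_id with its "source:offset:hash" format is a hypothetical helper: the documentation only states that the id is composed from (source, offset, content_hash), not how.

```ruby
require "digest"

# Stand-in for Pikuri::VectorDb::Chunk (assumption: same four members).
Chunk = Data.define(:id, :source, :text, :metadata)

# Hypothetical id scheme. The separator and the truncated SHA-256 digest
# are illustrative choices, not the Indexer's documented format.
def build_chunk_id(source, offset, text)
  content_hash = Digest::SHA256.hexdigest(text)[0, 16]
  "#{source}:#{offset}:#{content_hash}"
end

text = "Simmer the stock for two hours."

chunk = Chunk.new(
  id: build_chunk_id("notes/cooking.md", 0, text), # internal key, never shown to the LLM
  source: "notes/cooking.md",                      # citation surfaced with each hit
  text: text,                                      # embedded and surfaced verbatim
  metadata: { offset: 0 }                          # optional extras, open-ended Hash
)
```

Two chunks cut from the same file would share source but differ in id, which is what lets a backend address them individually.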

Why no vector field

The vector lives alongside the chunk only in transit between Embedder and Backend#upsert, and disappears from view on the query side (callers see a Backend::Result with a chunk and a score, never the raw vector). Keeping it off Chunk means query-result chunks carry no inert, always-nil vector field, and the upsert API takes parallel arrays, the same shape that every backend’s underlying client (Chroma’s Python collection.upsert included) already expects.
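
The parallel-array shape can be sketched with a toy in-memory backend. Everything here is an assumption made for illustration: MemoryBackend, its query method, the limit: keyword, and the dot-product scoring are hypothetical and do not describe pikuri’s actual Backend interface.

```ruby
# Stand-ins for Chunk and Backend::Result (assumed member names).
Chunk  = Data.define(:id, :source, :text, :metadata)
Result = Data.define(:chunk, :score)

# Hypothetical backend that keeps everything in a Hash.
class MemoryBackend
  def initialize
    @rows = {} # id => [chunk, vector]
  end

  # Parallel arrays: chunks[i] is stored together with vectors[i].
  # Writing to an existing id overwrites the old text and vector.
  def upsert(chunks, vectors)
    unless chunks.length == vectors.length
      raise ArgumentError, "chunks and vectors must have the same length"
    end

    chunks.zip(vectors) { |chunk, vector| @rows[chunk.id] = [chunk, vector] }
  end

  # Returns Results (chunk + score); stored vectors never leave the backend.
  def query(vector, limit: 3)
    @rows.values
         .map { |chunk, stored| Result.new(chunk: chunk, score: dot(vector, stored)) }
         .max_by(limit, &:score)
  end

  private

  def dot(a, b)
    a.zip(b).sum { |x, y| x * y }
  end
end

backend = MemoryBackend.new
chunks = [
  Chunk.new(id: "a", source: "notes/cooking.md", text: "Simmer the stock.", metadata: {}),
  Chunk.new(id: "b", source: "notes/cooking.md", text: "Salt to taste.", metadata: {})
]
vectors = [[1.0, 0.0], [0.0, 1.0]] # toy embeddings, one per chunk

backend.upsert(chunks, vectors)
hits = backend.query([0.9, 0.1], limit: 1)
```

The caller receives a chunk and a score for each hit; nothing on the returned objects exposes the vector.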

Why a Data.define

Immutable value types match pikuri’s convention for similar records (Persona, the various Event types). A Chunk cannot be mutated after construction; partial-reindex flows build new Chunks with the same id and let the backend #upsert overwrite the old text + vector.
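
A short sketch of what immutability means in practice, using a local stand-in for the class and Ruby’s built-in Data API (Ruby 3.2 or later). The id value is an arbitrary placeholder.

```ruby
# Stand-in for Pikuri::VectorDb::Chunk (assumption: same four members).
Chunk = Data.define(:id, :source, :text, :metadata)

old_chunk = Chunk.new(
  id: "chunk-1", # placeholder id for illustration
  source: "notes/cooking.md",
  text: "Simmer for one hour.",
  metadata: { offset: 0 }
)

# Data instances are frozen and expose no setters.
old_chunk.frozen?             # => true
old_chunk.respond_to?(:text=) # => false

# A reindex builds a replacement record with the same id. Data#with copies
# the record and overrides only the named members.
new_chunk = old_chunk.with(text: "Simmer for two hours.")
```

Immutability here is shallow: the Data record itself is frozen, but a Hash passed as metadata is only as frozen as the caller made it.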

Instance Attribute Summary

Instance Attribute Details

#id ⇒ Object (readonly)

Returns the value of attribute id

Returns:

  • (Object)

    the current value of id

# File 'lib/pikuri/vector_db/chunk.rb', line 58

def id
  @id
end

#metadata ⇒ Object (readonly)

Returns the value of attribute metadata

Returns:

  • (Object)

    the current value of metadata

# File 'lib/pikuri/vector_db/chunk.rb', line 58

def metadata
  @metadata
end

#source ⇒ Object (readonly)

Returns the value of attribute source

Returns:

  • (Object)

    the current value of source

# File 'lib/pikuri/vector_db/chunk.rb', line 58

def source
  @source
end

#text ⇒ Object (readonly)

Returns the value of attribute text

Returns:

  • (Object)

    the current value of text

# File 'lib/pikuri/vector_db/chunk.rb', line 58

def text
  @text
end