Class: Pikuri::VectorDb::Chunk
- Inherits:
- Data (Object > Data > Pikuri::VectorDb::Chunk)
- Defined in:
- lib/pikuri/vector_db/chunk.rb
Overview
A unit of text after chunking, before storage. The Indexer composes a Chunk + its vector and hands the pair to a Backend for storage; Search surfaces query hits back to the LLM as Chunks wrapped in Backend::Result objects.
Fields
- id: opaque internal identifier, unique within a backend. The Indexer composes it from (source, offset, content_hash) so an eventual incremental reindex can replace specific chunks by id without scanning by text content. v1’s nuke-and-reload doesn’t depend on this, but the key scheme is forward-compatible. **Not for citation**: that is the job of source; the LLM never sees id.
- source: user-facing reference for what was indexed. Relative file path for filesystem sources ("notes/cooking.md"), URL for fetched docs, doc ID for future connector sources (Google Drive, Notion). Surfaced verbatim by Search as the citation alongside each hit; the LLM uses it to say “see notes/cooking.md” rather than answering from a chunk of unknown origin. One source typically produces many chunks (a 512-token sliding window over a 20 KB Markdown file ≈ 10 chunks), so source is not unique within a backend; uniqueness is the role of id.
- text: the chunk’s text content; the unit fed to the embedder for vectorization, and surfaced verbatim to the LLM as the snippet alongside its source citation.
- metadata: free-form Hash for the optional extras that some source types carry: { offset:, page:, anchor:, … }. Not every source type needs every field, so the open Hash beats a fixed shape. Surfaced to the LLM in the search observation when present.
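The field roles above can be illustrated with a minimal sketch. It uses a local stand-in for the class rather than pikuri itself, and the build_chunk helper, the ":" separator, and the truncated SHA-256 content hash are assumptions made for illustration; the documentation states only that the id is composed from (source, offset, content_hash).

```ruby
require "digest"

# Local stand-in with the four documented fields (not the real pikuri class).
Chunk = Data.define(:id, :source, :text, :metadata)

# Hypothetical helper: the separator and hash function are assumptions,
# only the (source, offset, content_hash) composition comes from the docs.
def build_chunk(source:, offset:, text:)
  content_hash = Digest::SHA256.hexdigest(text)[0, 12]
  Chunk.new(
    id: "#{source}:#{offset}:#{content_hash}",
    source: source,
    text: text,
    metadata: { offset: offset }
  )
end

# Two chunks cut from the same file: same source, different ids.
first  = build_chunk(source: "notes/cooking.md", offset: 0,   text: "Salt the pasta water.")
second = build_chunk(source: "notes/cooking.md", offset: 512, text: "Rest the dough for an hour.")

puts first.source == second.source # same citation for both chunks
puts first.id == second.id         # ids differ, so each chunk is addressable
```

Here source is shared by both chunks and would serve as the citation, while the id stays internal and distinguishes them.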
Why no vector field
The vector lives alongside the chunk only in transit between Embedder and Backend#upsert, and disappears from view on the query side (callers see a Backend::Result with a chunk + a score, never the raw vector). Keeping it off Chunk means query-result chunks carry no inert vector: nil field, and the upsert API takes parallel arrays, the same shape that every backend’s underlying client (Chroma’s Python collection.upsert included) already expects.
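A rough sketch of the parallel-array shape described above, using a toy in-memory backend. MemoryBackend, its upsert(chunks, vectors) and query signatures, and the dot-product scoring are all hypothetical stand-ins; the real Backend interface may differ.

```ruby
# Local stand-ins (not the real pikuri classes).
Chunk  = Data.define(:id, :source, :text, :metadata)
Result = Data.define(:chunk, :score)

# Hypothetical in-memory backend, for illustration only.
class MemoryBackend
  def initialize
    @rows = {} # id => [chunk, vector]
  end

  # Parallel arrays: chunks[i] is stored with vectors[i].
  def upsert(chunks, vectors)
    raise ArgumentError, "length mismatch" unless chunks.length == vectors.length

    chunks.zip(vectors).each { |chunk, vector| @rows[chunk.id] = [chunk, vector] }
  end

  # Query results expose a chunk and a score, never the stored vector.
  def query(vector, limit: 3)
    @rows.values
         .map { |chunk, stored| Result.new(chunk: chunk, score: dot(vector, stored)) }
         .max_by(limit, &:score)
  end

  private

  def dot(a, b)
    a.zip(b).sum { |x, y| x * y }
  end
end

backend = MemoryBackend.new
chunks = [
  Chunk.new(id: "a", source: "notes/cooking.md", text: "Salt the pasta water.", metadata: {}),
  Chunk.new(id: "b", source: "notes/baking.md",  text: "Rest the dough.",       metadata: {})
]
backend.upsert(chunks, [[1.0, 0.0], [0.0, 1.0]])

hits = backend.query([0.9, 0.1], limit: 1)
puts hits.first.chunk.source
```

Because the vector is passed next to the chunk instead of inside it, the Chunk returned on the query side is the same value that went in, with no empty vector slot.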
Why a Data.define
Immutable value types match pikuri’s convention for similar records (Persona, the various Event types). A Chunk cannot be mutated after construction; partial-reindex flows build new Chunks with the same id and let the backend #upsert overwrite the old text + vector.
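As a small sketch of what this means in practice (again with a local stand-in for the class, and made-up id and text values), a Data instance has no setters and is frozen, and Data#with returns a new instance that can keep the same id:

```ruby
# Local stand-in with the four documented fields (not the real pikuri class).
Chunk = Data.define(:id, :source, :text, :metadata)

old = Chunk.new(
  id: "notes/cooking.md:0",      # illustrative id value
  source: "notes/cooking.md",
  text: "Salt the water.",
  metadata: { offset: 0 }
)

puts old.respond_to?(:text=) # Data defines readers only, no setters
puts old.frozen?             # instances are frozen on construction

# Data#with copies the instance with selected fields replaced.
updated = old.with(text: "Salt the pasta water generously.")

puts updated.id == old.id    # same id, so an upsert would overwrite the old row
puts old.text                # the original chunk is unchanged
```

Note that Data is only shallowly immutable: the metadata Hash itself can still be mutated unless it is frozen separately.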
Instance Attribute Summary
-
#id ⇒ Object
readonly
Returns the value of attribute id.
-
#metadata ⇒ Object
readonly
Returns the value of attribute metadata.
-
#source ⇒ Object
readonly
Returns the value of attribute source.
-
#text ⇒ Object
readonly
Returns the value of attribute text.
Instance Attribute Details
#id ⇒ Object (readonly)
Returns the value of attribute id
# File 'lib/pikuri/vector_db/chunk.rb', line 58
def id
  @id
end
#metadata ⇒ Object (readonly)
Returns the value of attribute metadata
# File 'lib/pikuri/vector_db/chunk.rb', line 58
def metadata
  @metadata
end
#source ⇒ Object (readonly)
Returns the value of attribute source
# File 'lib/pikuri/vector_db/chunk.rb', line 58
def source
  @source
end
#text ⇒ Object (readonly)
Returns the value of attribute text
# File 'lib/pikuri/vector_db/chunk.rb', line 58
def text
  @text
end