Class: Pikuri::VectorDb::Chunk
- Inherits: Data (Pikuri::VectorDb::Chunk < Data < Object)
- Defined in: lib/pikuri/vector_db/chunk.rb
Overview
A unit of text after chunking, before storage. The Indexer composes a Chunk + its vector and hands the pair to a Backend for storage; Tools::Search surfaces query hits back to the LLM as Chunks wrapped in Backend::Results.
Fields
- id: opaque internal identifier, unique within a backend. The Indexer composes it as “source:offset” (the relative source path plus the chunk’s ordinal within that file), which is readable in logs, deterministic, and stable across reindexes of the same file. Incremental reindex replaces a whole document by source (not by id), so the file’s content hash rides in metadata rather than in the id; see metadata below. Not for citation: that is the job of source, and the LLM never sees id.
- source: user-facing reference for what was indexed. Relative file path for filesystem sources (“notes/cooking.md”), URL for fetched docs, doc ID for future connector sources (Google Drive, Notion). Surfaced verbatim by Tools::Search as the citation alongside each hit; the LLM uses it to say “see notes/cooking.md” rather than answering from a chunk of unknown origin. One source typically produces many chunks (a 512-token sliding window over a 20 KB Markdown file ≈ 10 chunks), so source is not unique within a backend; uniqueness is the role of id.
- text: the chunk’s text content; the unit fed to the embedder for vectorization, and surfaced verbatim to the LLM as the snippet alongside its source citation.
- metadata: free-form Hash for the optional extras that some source types carry: { offset:, hash:, page:, … }. Not every source type needs every field, so the open Hash beats a fixed shape. The Indexer always sets offset: (ordinal within the file) and hash: (SHA-256 of the source file’s bytes, which is the same for every chunk of one file, so the incremental sweep can read one chunk per source and know whether the file changed). Surfaced to the LLM in the search observation when present.
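The field conventions above can be sketched as follows. This is a minimal illustration, not the Indexer's actual code: the Chunk constant is a local stand-in defined with Data.define (Ruby 3.2+), build_chunks is a hypothetical helper, and splitting on blank lines stands in for whatever chunking strategy the Indexer really uses.

```ruby
require "digest"

# Stand-in for Pikuri::VectorDb::Chunk so the sketch runs on its own.
Chunk = Data.define(:id, :source, :text, :metadata)

# Hypothetical helper: builds the chunks for one file from pre-split text pieces.
def build_chunks(source, file_bytes, pieces)
  # SHA-256 of the whole file, so it is identical for every chunk of that file.
  file_hash = Digest::SHA256.hexdigest(file_bytes)

  pieces.each_with_index.map do |text, offset|
    Chunk.new(
      id: "#{source}:#{offset}", # "source:offset", deterministic across reindexes
      source: source,            # the citation the LLM sees
      text: text,
      metadata: { offset: offset, hash: file_hash }
    )
  end
end

body = "Boil the pasta.\n\nSalt the water first."
chunks = build_chunks("notes/cooking.md", body, body.split("\n\n"))
chunks.each { |chunk| puts "#{chunk.id} -> #{chunk.text}" }

# Incremental sweep idea: reading one chunk per source is enough to detect a change.
unchanged = chunks.first.metadata[:hash] == Digest::SHA256.hexdigest(body)
puts "unchanged: #{unchanged}"
```

Because every chunk of a file carries the same hash, a sweep only needs to compare one stored chunk against the file on disk before deciding whether to replace the whole document by source.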
Why no vector field
The vector lives alongside the chunk only in transit between Embedder and Backend#upsert, and disappears from view on the query side (callers see Backend::Result with a chunk + a score, never the raw vector). Keeping it off Chunk means query-result chunks carry no inert vector: nil field, and the upsert API takes parallel arrays, the same shape that every backend’s underlying client (Chroma’s Python collection.upsert included) already expects.
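To make the parallel-array shape concrete, here is a rough sketch with a toy in-memory backend. Everything in it is an assumption for illustration: MemoryBackend, its upsert and query signatures, the dot-product scoring, and the local Chunk and Result stand-ins are not pikuri's real Backend API.

```ruby
# Stand-ins so the sketch runs on its own (Ruby 3.2+); shapes are assumed.
Chunk  = Data.define(:id, :source, :text, :metadata)
Result = Data.define(:chunk, :score)

# Hypothetical in-memory backend, for illustration only.
class MemoryBackend
  def initialize
    @rows = {} # id => [chunk, vector]
  end

  # Parallel arrays: vectors[i] belongs to chunks[i].
  def upsert(chunks, vectors)
    unless chunks.size == vectors.size
      raise ArgumentError, "chunks and vectors must be the same length"
    end

    chunks.zip(vectors).each { |chunk, vector| @rows[chunk.id] = [chunk, vector] }
  end

  # Returns Results ordered by descending score; the stored vector stays internal.
  def query(query_vector, limit: 3)
    @rows.values
         .map { |chunk, vector| Result.new(chunk: chunk, score: dot(vector, query_vector)) }
         .sort_by { |result| -result.score }
         .first(limit)
  end

  private

  # Plain dot product, used here as a simple similarity score.
  def dot(a, b)
    a.zip(b).sum { |x, y| x * y }
  end
end

backend = MemoryBackend.new
chunks = [
  Chunk.new(id: "a.md:0", source: "a.md", text: "pasta", metadata: { offset: 0 }),
  Chunk.new(id: "b.md:0", source: "b.md", text: "bread", metadata: { offset: 0 })
]
backend.upsert(chunks, [[1.0, 0.0], [0.0, 1.0]])

hit = backend.query([0.9, 0.1], limit: 1).first
puts "#{hit.chunk.source} (score #{hit.score})"
```

The vector is paired with its chunk only inside upsert; what comes back from query is a chunk plus a score, so the Chunk value itself never needs a vector member.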
Why a Data.define
Immutable value types match pikuri’s convention for similar records (Persona, the various Event types). A Chunk cannot be mutated after construction; partial-reindex flows build new Chunks with the same id and let the backend’s #upsert overwrite the old text + vector.
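The immutability described above is standard Data behavior in Ruby 3.2+ and can be checked with a small sketch. The Chunk constant here is a local stand-in with the same four members, not the library's own definition.

```ruby
# Stand-in for Pikuri::VectorDb::Chunk (Ruby 3.2+).
Chunk = Data.define(:id, :source, :text, :metadata)

old = Chunk.new(id: "notes/cooking.md:0", source: "notes/cooking.md",
                text: "Boil the pasta.", metadata: { offset: 0 })

puts old.frozen?             # Data instances are frozen on construction
puts old.respond_to?(:text=) # Data.define generates readers only, no setters

# A reindex builds a new value with the same id; Data#with copies the other members.
fresh = old.with(text: "Boil the pasta for 9 minutes.")
puts fresh.id == old.id
puts fresh == old            # value equality compares members, and the text differs
```

Note that Data freezes only the instance itself: the metadata Hash it points to stays mutable unless the caller freezes it.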
Instance Attribute Summary
- #id ⇒ Object (readonly): Returns the value of attribute id.
- #metadata ⇒ Object (readonly): Returns the value of attribute metadata.
- #source ⇒ Object (readonly): Returns the value of attribute source.
- #text ⇒ Object (readonly): Returns the value of attribute text.
Instance Attribute Details
#id ⇒ Object (readonly)
Returns the value of attribute id.
# File 'lib/pikuri/vector_db/chunk.rb', line 63
def id
  @id
end
#metadata ⇒ Object (readonly)
Returns the value of attribute metadata.
# File 'lib/pikuri/vector_db/chunk.rb', line 63
def metadata
  @metadata
end
#source ⇒ Object (readonly)
Returns the value of attribute source.
# File 'lib/pikuri/vector_db/chunk.rb', line 63
def source
  @source
end
#text ⇒ Object (readonly)
Returns the value of attribute text.
# File 'lib/pikuri/vector_db/chunk.rb', line 63
def text
  @text
end