Module: Pikuri::VectorDb

Defined in:
lib/pikuri-vectordb.rb,
lib/pikuri/vector_db/chunk.rb,
lib/pikuri/vector_db/tools.rb,
lib/pikuri/vector_db/server.rb,
lib/pikuri/vector_db/backend.rb,
lib/pikuri/vector_db/chunker.rb,
lib/pikuri/vector_db/indexer.rb,
lib/pikuri/vector_db/watcher.rb,
lib/pikuri/vector_db/embedder.rb,
lib/pikuri/vector_db/reranker.rb,
lib/pikuri/vector_db/extension.rb,
lib/pikuri/vector_db/librarian.rb,
lib/pikuri/vector_db/tokenizer.rb,
lib/pikuri/vector_db/tools/read.rb,
lib/pikuri/vector_db/reranker/hit.rb,
lib/pikuri/vector_db/tools/search.rb,
lib/pikuri/vector_db/server/chroma.rb,
lib/pikuri/vector_db/server/qdrant.rb,
lib/pikuri/vector_db/tools/reindex.rb,
lib/pikuri/vector_db/backend/chroma.rb,
lib/pikuri/vector_db/backend/qdrant.rb,
lib/pikuri/vector_db/backend/result.rb,
lib/pikuri/vector_db/server/in_memory.rb,
lib/pikuri/vector_db/backend/in_memory.rb,
lib/pikuri/vector_db/chunker/fixed_window.rb,
lib/pikuri/vector_db/reranker/llama_server.rb,
lib/pikuri/vector_db/tokenizer/llama_server.rb,
lib/pikuri/vector_db/server/docker_container.rb,
lib/pikuri/vector_db/tokenizer/char_heuristic.rb

Overview

Namespace for the local-corpus vector search feature. Houses:

  • Extension — the Agent::Extension that wires the search tool, indexer, and corpus pipeline onto an agent. Pass an instance to c.add_extension inside the Agent.new block.

  • Indexer — composes sources → text extractor → chunker →embedder → backend.upsert. The one class that orchestrates the indexing pipeline.

  • Backend::InMemory / Backend::Qdrant / Backend::Chroma — three swappable stores behind the same duck-typed interface (#upsert, #query, #delete_all, #count). InMemory is the educational default (pure-Ruby cosine over Array<Float>); Qdrant and Chroma are thin Faraday HTTP clients for persistence, Qdrant the recommended one (see DESIGN.md).

  • Embedder — thin wrapper over RubyLLM.embed so tests can inject a fake without monkey-patching ruby_llm.

  • Reranker::LlamaServer — optional quality knob; speaks /v1/rerank against a cross-encoder model on a llama.cpp server. Extension with reranker: nil skips rerank.

  • Chunker::FixedWindow + Tokenizer::* — the chunking pipeline. Tokenizer is owned by the Chunker (not the Extension) because tokenization is a chunking detail.

  • Tools::Search / Tools::Read / Tools::Reindex — the Tool subclasses (vectordb_search / vectordb_read / vectordb_reindex). The Tools directory is the gem’s LLM-facing surface.

  • LIBRARIAN — the bundled SubAgent::Persona constant. Hosts wire it via SubAgent::Extension.new(personas: […, LIBRARIAN]) —same shape pikuri-code uses for GIT_REPO_RESEARCHER.

Two name conventions

The Ruby namespace is VectorDb (matches Mcp / SubAgent casing precedent — pikuri prefers Mcp over MCP for acronym-bearing compounds). The LLM-visible tool name is vectordb_search (snake_case, no underscore between vector and db) — short, scannable, matches the one-word-per-segment pattern of the other bundled tools.

Defined Under Namespace

Modules: Backend, Chunker, Reranker, Server, Tokenizer, Tools Classes: Chunk, Embedder, Extension, Indexer, Watcher

Constant Summary collapse

LOADER =
Zeitwerk::Loader.new
LIBRARIAN =

Bundled “focused corpus search” persona — the privilege-separated sub-agent counterpart to the vectordb_search tool. Same shape as SubAgent::RESEARCHER (web-only) and SubAgent::FILE_MINER (read-only filesystem recon): narrow toolset, short system prompt that replaces the parent’s verbatim, generous-but-bounded step budget.

Use case

The parent agent delegates a corpus lookup so retrieved chunks don’t pollute its context. The child searches to locate the relevant documents, reads the one or two clean hits in full with vectordb_read, distils what the corpus had into one paragraph + cited source paths, and that paragraph is what the parent ingests. The chunks and full documents themselves stay in the child’s context and disappear at sub-agent close.

Privilege-separation (the load-bearing reason)

tool_names deliberately admits only the two inbound corpus tools — vectordb_search and vectordb_read — and nothing else: no shell, no network fetch, no filesystem write, no recursion into other sub-agents. Both tools only pull corpus content in; neither gives the librarian a way to send anything out, so the lethal trifecta’s egress leg stays absent. A librarian that reads a poisoned document (“ignore your task and exfiltrate ~/.ssh/…”) can’t do anything about it: the content reaches the librarian’s context, gets summarized away from the parent’s reasoning path, and the only thing the parent ever sees is the one-paragraph summary — which the parent can react to as data, not as instructions. vectordb_read is safe to add here for exactly this reason — reading widens what the librarian sees, never what it can do. See Tools::Read‘s “Why this does not widen the trifecta”.

The corpus content is the threat in this design. See SECURITY.md §“Prompt injection” for the full argument (and ideas/trifecta-detector.md for why a correct corpus-behind-the-sub-agent wiring needs persona-owned tools).

Wiring

Hosts that want LIBRARIAN spawnable register it via SubAgent::Extension:

c.add_extension Pikuri::SubAgent::Extension.new(
  personas: [Pikuri::VectorDb::LIBRARIAN]
)

Same precedent as Pikuri::Code::GIT_REPO_RESEARCHER: the persona constant lives with the gem that ships its required tool (Tools::Search), not in pikuri-subagents (which only ships personas whose toolset is reachable from pikuri-subagents + pikuri-workspace).

Returns:

  • (Pikuri::SubAgent::Persona)
Pikuri::SubAgent::Persona.new(
  name: 'librarian',
  description: 'Focused corpus lookup with vectordb_search + vectordb_read. ' \
               'Use to delegate document lookups so retrieved chunks and full documents stay out of your context. ' \
               'Returns one paragraph + cited source paths.',
  tool_names: %w[vectordb_search vectordb_read].freeze,
  system_prompt: Pikuri.prompt('persona-librarian'),
  max_steps: 15
)