Class: Pikuri::VectorDb::Tools::Read

Inherits: Tool

Object → Tool → Pikuri::VectorDb::Tools::Read

Defined in: lib/pikuri/vector_db/tools/read.rb

Overview

The LLM-facing “read a whole indexed document” tool, exposed as vectordb_read. It is the companion to Search: where vectordb_search returns ranked chunks (lossy fragments capped at Search::SNIPPET_LENGTH), vectordb_read pulls one full document into context by the source path that search already printed. The intended loop is *search to locate → read the one or two clean hits in full → distil*, so the agent stops re-querying with ever-more-elaborate phrasings just to reconstruct a document it already found.

Why this does not widen the trifecta

Reading is *inbound only*. The lethal trifecta needs three legs — private data, untrusted content, and an outbound channel — and the load-bearing one is egress. vectordb_read pours more (already-private, already-possibly-poisoned) corpus content into context but adds no new way to send anything out, so it is safe by construction in both wirings: the egress-free bin/pikuri-corpus agent stays egress-free, and the LIBRARIAN sub-agent’s privilege-separation argument survives intact — a poisoned full document still has no hand to act with. See LIBRARIAN and IDEAS.md §“Vector DB / RAG”.

The read domain is exactly the search domain

The source argument must be a key the Indexer actually produced — verified against Backend#source_indexed? before any disk access. So the only legal inputs are the citations vectordb_search handed the model; a poisoned chunk that says “read ../../.ssh/id_rsa” fails the membership gate (that string is not an indexed source). This is what keeps the tool from being a general-purpose file reader smuggled in under a friendly name. A lexical containment check (resolved path must sit under the corpus Indexer#root) backs the gate as defense-in-depth for any future non-filesystem source type.
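The two gates described above can be sketched as follows. This is a minimal illustration, not the code in read.rb: FakeBackend, gate_source, and the error wording are hypothetical, and only the #source_indexed? duck type is taken from this page.

```ruby
require 'pathname'
require 'set'

# Hypothetical stand-in for a Backend; only #source_indexed? comes from the docs.
class FakeBackend
  def initialize(sources)
    @sources = sources.to_set
  end

  def source_indexed?(source)
    @sources.include?(source)
  end
end

# Returns the resolved Pathname, or an "Error: ..." string when a gate rejects the input.
# The error messages are illustrative, not the tool's real wording.
def gate_source(backend:, root:, source:)
  # Gate 1: membership. Checked before anything touches the disk.
  return "Error: #{source} is not an indexed source" unless backend.source_indexed?(source)

  # Gate 2: lexical containment. cleanpath collapses ".." without reading the filesystem.
  root_clean = root.cleanpath
  resolved = (root + source).cleanpath
  inside = resolved.to_s.start_with?("#{root_clean}/")
  return "Error: #{source} resolves outside the corpus root" unless inside

  resolved
end

# '../outside.md' simulates a bad key that somehow made it into the index.
backend = FakeBackend.new(['notes/cooking.md', '../outside.md'])
root = Pathname.new('/corpus')

ok = gate_source(backend: backend, root: root, source: 'notes/cooking.md')
poisoned = gate_source(backend: backend, root: root, source: '../../.ssh/id_rsa')
escaped = gate_source(backend: backend, root: root, source: '../outside.md')
```

Here the poisoned path is stopped by the membership gate alone, while the containment check only matters for the case where an out-of-root key is already in the index, which is the defense-in-depth role the paragraph describes.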

Bounding / pagination

A full document can be far larger than the Search snippet; the Chunker exists precisely because documents are big. So reads are line-windowed by FileType.read_as_text_paged, the same windower (with the same DEFAULT_LIMIT-line / 50 KB caps) that backs Workspace::Read. It returns an Extractor::Page, which this tool formats. Unlike Workspace::Read there is no cat -n line-number prefix: nothing downstream edits these documents (the citation unit is the source path, not a line), and dropping the prefix saves tokens on exactly the operation whose point is to spend context wisely.
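A rough sketch of line windowing under a line cap and a byte cap is shown below. It is an assumption-laden illustration rather than FileType.read_as_text_paged itself: the Page struct, window_lines, format_page, and the default limit of 100 are hypothetical, and the real continuation marker may carry more detail than the `Use offset=N to continue` tail quoted in DESCRIPTION.

```ruby
# Hypothetical page shape; the real Extractor::Page fields are not shown on this page.
Page = Struct.new(:lines, :next_offset, keyword_init: true)

# Window `text` from the 1-indexed `offset`, stopping at `limit` lines or `max_bytes` bytes.
# The defaults are placeholders, not the library's actual values.
def window_lines(text, offset: 1, limit: 100, max_bytes: 50 * 1024)
  all = text.lines
  taken = []
  bytes = 0
  all.drop(offset - 1).each do |line|
    break if taken.size >= limit
    # Always emit at least one line so a single oversized line cannot stall paging.
    break if !taken.empty? && bytes + line.bytesize > max_bytes

    taken << line
    bytes += line.bytesize
  end
  consumed = offset - 1 + taken.size
  next_offset = consumed < all.size ? consumed + 1 : nil
  Page.new(lines: taken, next_offset: next_offset)
end

# No line-number prefix: the body is the raw lines, plus a continuation hint if more remain.
def format_page(page)
  body = page.lines.join
  return body if page.next_offset.nil?

  body = "#{body}\n" unless body.end_with?("\n")
  "#{body}Use offset=#{page.next_offset} to continue"
end

text = (1..5).map { |i| "line #{i}\n" }.join
first = window_lines(text, offset: 1, limit: 2)
last = window_lines(text, offset: 5, limit: 2)
byte_capped = window_lines(text, limit: 10, max_bytes: 14)
```

Whichever cap is hit first ends the window, and the caller only needs next_offset to resume.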

Errors the LLM can react to

Extraction goes through FileType.read_as_text_paged, which routes through the same Extractor registry as the Indexer's FileType.read_as_text, so what you read matches what was indexed exactly (modulo edits to the file since), “— Page N —” PDF markers included. Images, binaries, a vanished file, and a malformed PDF all come back as “Error: …” observations rather than raising, per CLAUDE.md “Errors are loud” (these are failures the agent reacts to, not pikuri bugs).
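The "observation, not exception" pattern can be sketched like this. Everything named here (UnreadableDocument, extract_text, read_observation, the NUL-byte heuristic) is a hypothetical stand-in for the Extractor registry; the point is only that expected failures are rescued into an “Error: …” string while anything else still raises.

```ruby
require 'tmpdir'

# Hypothetical error class standing in for "this file cannot be read as text".
class UnreadableDocument < StandardError; end

# Toy extractor: rejects missing files and anything containing a NUL byte.
def extract_text(path)
  raise UnreadableDocument, "#{path} no longer exists" unless File.file?(path)

  data = File.binread(path)
  raise UnreadableDocument, "#{path} looks binary" if data.include?("\x00".b)

  data.force_encoding(Encoding::UTF_8)
end

# Expected failures become a string the agent can react to; unexpected ones propagate.
def read_observation(path)
  extract_text(path)
rescue UnreadableDocument => e
  "Error: #{e.message}"
end

text_result = binary_result = missing_result = nil
Dir.mktmpdir do |dir|
  text_path = File.join(dir, 'note.md')
  File.write(text_path, "hello\n")
  bin_path = File.join(dir, 'blob.bin')
  File.binwrite(bin_path, "\x00\x01\x02".b)

  text_result = read_observation(text_path)
  binary_result = read_observation(bin_path)
  missing_result = read_observation(File.join(dir, 'gone.md'))
end
```

Rescuing only the narrow error class keeps genuine bugs visible, since they still surface as exceptions.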

Constant Summary

Default value of the limit parameter (number of lines returned per call). Aliases the shared Extractor::PAGE_DEFAULT_LIMIT. Returns: (Integer).

DEFAULT_LIMIT = Pikuri::Extractor::PAGE_DEFAULT_LIMIT

Human-readable form of the shared byte cap (Extractor::PAGE_MAX_BYTES) for the continuation marker. Returns: (String).

MAX_BYTES_LABEL = "#{Pikuri::Extractor::PAGE_MAX_BYTES / 1024} KB"

Static description shown to the LLM, opencode-shape (summary + Usage: bullets). Returns: (String).

DESCRIPTION = <<~DESC
  Read a full indexed document by its `source` path (the citation from vectordb_search).

  Usage:
  - Use after vectordb_search surfaces a clean hit you want in full, instead of re-querying for more fragments of the same document.
  - `source` must be a path a vectordb_search result returned; you cannot read arbitrary files, only indexed documents.
  - Large documents are paged: when the output ends in `Use offset=N to continue`, call again with that offset.
  - Reading a whole document spends context — read the one or two best hits in full, not every result.
DESC

Class Method Summary

.read(backend:, root:, source:, offset:, limit:) ⇒ String

Resolve source against the corpus root, enforce the membership gate + containment, extract text, and window it.

Instance Method Summary

#initialize(backend:, root:) ⇒ Read constructor

Constructor Details

#initialize(backend:, root:) ⇒ `Read`

Parameters:

backend (#source_indexed?): any Backend implementation; consulted for the membership gate.

root (Pathname): the corpus root, Indexer#root. Relative source paths resolve against it.

# File 'lib/pikuri/vector_db/tools/read.rb', line 94

def initialize(backend:, root:)
  super(
    name: 'vectordb_read',
    description: DESCRIPTION,
    parameters: Pikuri::Tool::Parameters.build { |p|
      p.required_string :source,
                        'Source path from a vectordb_search result, ' \
                        'e.g. "notes/cooking.md".'
      p.optional_integer :offset,
                         'Line to start reading from (1-indexed). ' \
                         'Defaults to 1, e.g. 200.'
      p.optional_integer :limit,
                         'Maximum number of lines to read. Defaults to ' \
                         "#{DEFAULT_LIMIT}, e.g. 500."
    },
    execute: lambda { |source:, offset: 1, limit: DEFAULT_LIMIT|
      Read.read(backend: backend, root: root, source: source, offset: offset, limit: limit)
    }
  )
end

Class Method Details

.read(backend:, root:, source:, offset:, limit:) ⇒ `String`

Resolve source against the corpus root, enforce the membership gate + containment, extract text, and window it. Public so specs can exercise the read path without a Tool wrapper.
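From the caller's side, following the continuation marker might look like the sketch below. fake_read is a hypothetical stand-in with the same keyword shape as the execute lambda, not the real Read.read, and the marker regex assumes the output ends in exactly `Use offset=N to continue` as DESCRIPTION states.

```ruby
# Stand-in corpus document: seven short lines.
DOCUMENT = (1..7).map { |i| "line #{i}\n" }

# Hypothetical reader mimicking the tool's keyword arguments and continuation tail.
fake_read = lambda do |source:, offset: 1, limit: 3|
  slice = DOCUMENT[(offset - 1), limit] || []
  out = slice.join
  last = offset - 1 + slice.size
  out += "Use offset=#{last + 1} to continue" if last < DOCUMENT.size
  out
end

# Keep calling with the advertised offset until no marker is returned.
# max_calls bounds the loop so a misbehaving reader cannot spin forever.
def read_all(reader, source:, max_calls: 10)
  offset = 1
  parts = []
  max_calls.times do
    out = reader.call(source: source, offset: offset)
    marker = out.match(/Use offset=(\d+) to continue\z/)
    if marker
      parts << marker.pre_match
      offset = Integer(marker[1])
    else
      parts << out
      break
    end
  end
  parts.join
end

full = read_all(fake_read, source: 'notes/cooking.md')
```

In practice an agent would usually stop as soon as it has what it needs rather than drain every page, in keeping with the advice to spend context wisely.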