Class: Pikuri::VectorDb::Tools::Read
- Inherits: Tool
  - Object
  - Tool
  - Pikuri::VectorDb::Tools::Read
- Defined in: lib/pikuri/vector_db/tools/read.rb
Overview
The LLM-facing “read a whole indexed document” tool, exposed as vectordb_read. It is the companion to Search: where vectordb_search returns ranked chunks (lossy fragments capped at Search::SNIPPET_LENGTH), vectordb_read pulls one full document into context by the source path that search already printed. The intended loop is *search to locate → read the one or two clean hits in full → distil*, so the agent stops re-querying with ever-more-elaborate phrasings just to reconstruct a document it already found.
Why this does not widen the trifecta
Reading is *inbound only*. The lethal trifecta needs three legs — private data, untrusted content, and an outbound channel — and the load-bearing one is egress. vectordb_read pours more (already-private, already-possibly-poisoned) corpus content into context but adds no new way to send anything out, so it is safe by construction in both wirings: the egress-free bin/pikuri-corpus agent stays egress-free, and the LIBRARIAN sub-agent’s privilege-separation argument survives intact — a poisoned full document still has no hand to act with. See LIBRARIAN and IDEAS.md §“Vector DB / RAG”.
The read domain is exactly the search domain
The source argument must be a key the Indexer actually produced — verified against Backend#source_indexed? before any disk access. So the only legal inputs are the citations vectordb_search handed the model; a poisoned chunk that says “read ../../.ssh/id_rsa” fails the membership gate (that string is not an indexed source). This is what keeps the tool from being a general-purpose file reader smuggled in under a friendly name. A lexical containment check (resolved path must sit under the corpus Indexer#root) backs the gate as defense-in-depth for any future non-filesystem source type.
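The two checks described above can be sketched in isolation. This is a minimal illustration, not the library's code: FakeBackend, gate, and this particular contained? body are hypothetical stand-ins, and only the method name source_indexed? comes from the text above.

```ruby
require 'pathname'
require 'set'

# Hypothetical stand-in for the backend: it only answers
# "did the indexer produce this key?".
class FakeBackend
  def initialize(sources)
    @sources = sources.to_set
  end

  def source_indexed?(source)
    @sources.include?(source)
  end
end

# Lexical containment (assumed implementation): normalize both paths
# without touching the disk, then require `root` to be the path itself
# or one of its ancestors.
def contained?(path, root)
  clean_root = root.cleanpath
  path.cleanpath.ascend.any? { |ancestor| ancestor == clean_root }
end

# Membership gate first, containment second, mirroring the order in the text.
def gate(backend, root, source)
  return :not_indexed unless backend.source_indexed?(source)
  return :outside_root unless contained?(root.join(source), root)

  :ok
end

root = Pathname.new('/corpus')
# '../escape.md' is deliberately "indexed" here to show the second check firing.
backend = FakeBackend.new(['notes/cooking.md', '../escape.md'])

gate(backend, root, 'notes/cooking.md')  # => :ok
gate(backend, root, '../../.ssh/id_rsa') # => :not_indexed
gate(backend, root, '../escape.md')      # => :outside_root
```

In this sketch the membership gate alone rejects the poisoned path, and the containment check only matters when an out-of-root key somehow ends up in the index, which is the defense-in-depth role the text gives it.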
Bounding / pagination
A full document can be far larger than the Search snippet — the Chunker exists precisely because documents are big. So reads are line-windowed by FileType.read_as_text_paged — the same windower (and the same DEFAULT_LIMIT-line / 50 KB caps) that backs Workspace::Read, returning a FileType::Page this tool formats. Unlike Workspace::Read there is no cat -n line-number prefix: nothing downstream edits these documents (the citation unit is the source path, not a line), and dropping the prefix saves tokens on exactly the operation whose point is to spend context wisely.
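A line window with both a line cap and a byte cap might look like the sketch below. It assumes a simplified Page struct and hypothetical window and render helpers; the real FileType.read_as_text_paged, FileType::Page, and the exact wording of the continuation marker may differ.

```ruby
# Hypothetical result type; the real FileType::Page may carry other fields.
Page = Struct.new(:lines, :next_offset, keyword_init: true)

# Take up to `limit` lines starting at the 1-indexed `offset`, stopping early
# once another line would push the window past `max_bytes`. At least one line
# is always returned when any remain, so paging cannot stall on a huge line.
# `next_offset` is nil when the window reaches the end of the text.
def window(text, offset:, limit:, max_bytes:)
  all = text.lines(chomp: true)
  taken = []
  bytes = 0
  index = offset - 1

  while index < all.length && taken.length < limit
    size = all[index].bytesize + 1 # +1 for the newline
    break if bytes + size > max_bytes && !taken.empty?

    taken << all[index]
    bytes += size
    index += 1
  end

  Page.new(lines: taken, next_offset: index < all.length ? index + 1 : nil)
end

# No line-number prefix: just the text plus a continuation hint when needed.
def render(page)
  body = page.lines.join("\n")
  return body unless page.next_offset

  "#{body}\nUse offset=#{page.next_offset} to continue"
end

doc = (1..5).map { |n| "line #{n}" }.join("\n")
first = window(doc, offset: 1, limit: 2, max_bytes: 1024)
puts render(first)
rest = window(doc, offset: first.next_offset, limit: 10, max_bytes: 1024)
puts render(rest)
```

The caller pages by feeding next_offset back in as offset, which is the same contract the tool description gives the model.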
Errors the LLM can react to
Extraction goes through FileType.read_as_text_paged, which routes the same way the Indexer's FileType.read_as_text does, so what you read matches what was indexed (modulo edits to the file since), plus “— Page N —” markers on PDFs. Images, binaries, a vanished file, and a malformed PDF all come back as “Error: …” observations rather than raising, per CLAUDE.md “Errors are loud” (these are failures the agent reacts to, not pikuri bugs).
Constant Summary
- DEFAULT_LIMIT =
  Returns default value of the limit parameter (number of lines returned per call). Aliases the shared FileType::PAGE_DEFAULT_LIMIT.
      Pikuri::FileType::PAGE_DEFAULT_LIMIT
- MAX_BYTES_LABEL =
  Returns human-readable form of the shared byte cap (FileType::PAGE_MAX_BYTES) for the continuation marker.
      "#{Pikuri::FileType::PAGE_MAX_BYTES / 1024} KB"
- DESCRIPTION =
  Returns static description shown to the LLM, opencode-shape (summary + Usage: bullets).
      <<~DESC
        Read a full indexed document by its `source` path (the citation from vectordb_search).
        Usage:
        - Use after vectordb_search surfaces a clean hit you want in full, instead of re-querying for more fragments of the same document.
        - `source` must be a path a vectordb_search result returned; you cannot read arbitrary files, only indexed documents.
        - Large documents are paged: when the output ends in `Use offset=N to continue`, call again with that offset.
        - Reading a whole document spends context — read the one or two best hits in full, not every result.
      DESC
Class Method Summary
- .read(backend:, root:, source:, offset:, limit:) ⇒ String
  Resolve source against the corpus root, enforce the membership gate + containment, extract text, and window it.
Instance Method Summary
- #initialize(backend:, root:) ⇒ Read (constructor)
Constructor Details
#initialize(backend:, root:) ⇒ Read
# File 'lib/pikuri/vector_db/tools/read.rb', line 93

    def initialize(backend:, root:)
      super(
        name: 'vectordb_read',
        description: DESCRIPTION,
        parameters: Pikuri::Tool::Parameters.build { |p|
          p.required_string :source,
                            'Source path from a vectordb_search result, ' \
                            'e.g. "notes/cooking.md".'
          p.optional_integer :offset,
                             'Line to start reading from (1-indexed). ' \
                             'Defaults to 1, e.g. 200.'
          p.optional_integer :limit,
                             'Maximum number of lines to read. Defaults to ' \
                             "#{DEFAULT_LIMIT}, e.g. 500."
        },
        execute: lambda { |source:, offset: 1, limit: DEFAULT_LIMIT|
          Read.read(backend: backend, root: root, source: source,
                    offset: offset, limit: limit)
        }
      )
    end
Class Method Details
.read(backend:, root:, source:, offset:, limit:) ⇒ String
Resolve source against the corpus root, enforce the membership gate + containment, extract text, and window it. Public so specs can exercise the read path without a Tool wrapper.
# File 'lib/pikuri/vector_db/tools/read.rb', line 126

    def self.read(backend:, root:, source:, offset:, limit:)
      return "Error: offset must be >= 1, got #{offset}" if offset < 1
      return "Error: limit must be >= 1, got #{limit}" if limit < 1

      unless backend.source_indexed?(source)
        return "Error: \"#{source}\" is not in the indexed corpus. " \
               'Use vectordb_search to find a source path, then read that.'
      end

      # NOTE: the call chained after join(source) is missing from this listing;
      # cleanpath (lexical normalization) is assumed here.
      resolved = root.join(source).cleanpath
      unless contained?(resolved, root)
        return "Error: \"#{source}\" resolves outside the corpus root."
      end

      render(Pikuri::FileType.read_as_text_paged(resolved, offset: offset, limit: limit),
             source: source)
    rescue Errno::ENOENT
      "Error: indexed document \"#{source}\" is no longer on disk; the index may be stale " \
        '(run vectordb_reindex to refresh).'
    rescue ArgumentError => e
      # read_as_text refusing an image / binary / directory.
      "Error: cannot read \"#{source}\" as text: #{e.message}"
    rescue RuntimeError => e
      # read_as_text on a malformed / unsupported PDF.
      "Error: #{e.message}"
    end