Class: Pikuri::VectorDb::Tools::Read

Inherits:
Tool
  • Object
Defined in:
lib/pikuri/vector_db/tools/read.rb

Overview

The LLM-facing “read a whole indexed document” tool, exposed as vectordb_read. The companion to Search: where vectordb_search returns ranked chunks (lossy fragments capped at Search::SNIPPET_LENGTH), vectordb_read pulls one full document into context by the source path that search already printed. The intended loop is *search to locate → read the one or two clean hits in full → distil*, so the agent stops re-querying with ever-more-elaborate phrasings just to reconstruct a document it already found.

Why this does not widen the trifecta

Reading is *inbound only*. The lethal trifecta needs three legs — private data, untrusted content, and an outbound channel — and the load-bearing one is egress. vectordb_read pours more (already-private, already-possibly-poisoned) corpus content into context but adds no new way to send anything out, so it is safe by construction in both wirings: the egress-free bin/pikuri-corpus agent stays egress-free, and the LIBRARIAN sub-agent’s privilege-separation argument survives intact — a poisoned full document still has no hand to act with. See LIBRARIAN and IDEAS.md §“Vector DB / RAG”.

The read domain is exactly the search domain

The source argument must be a key the Indexer actually produced — verified against Backend#source_indexed? before any disk access. So the only legal inputs are the citations vectordb_search handed the model; a poisoned chunk that says “read ../../.ssh/id_rsa” fails the membership gate (that string is not an indexed source). This is what keeps the tool from being a general-purpose file reader smuggled in under a friendly name. A lexical containment check (resolved path must sit under the corpus Indexer#root) backs the gate as defense-in-depth for any future non-filesystem source type.
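The two checks described above can be sketched in isolation. This is a minimal illustration, not the library's implementation: FakeBackend, lexically_contained? and gate are hypothetical names, and the real containment helper is not shown on this page, so the component-wise comparison below is an assumption about what "lexical" means here.

```ruby
require 'pathname'
require 'set'

# Hypothetical stand-in for a Backend: it only knows which source keys were indexed.
class FakeBackend
  def initialize(sources)
    @sources = sources.to_set
  end

  def source_indexed?(source)
    @sources.include?(source)
  end
end

# Lexical containment (assumed shape): compare normalized path components
# without touching the filesystem.
def lexically_contained?(resolved, root)
  root_parts = root.expand_path.each_filename.to_a
  resolved.expand_path.each_filename.to_a.first(root_parts.length) == root_parts
end

# Membership gate first, containment second, mirroring the order described above.
def gate(backend, root, source)
  return :not_indexed unless backend.source_indexed?(source)

  resolved = root.join(source).expand_path
  return :outside_root unless lexically_contained?(resolved, root)

  resolved
end
```

With a backend that indexed only "notes/cooking.md", gate rejects "../../.ssh/id_rsa" at the membership step, before any path is resolved. The containment step only matters if a key that escapes the root somehow ends up in the index.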

Bounding / pagination

A full document can be far larger than the Search snippet; the Chunker exists precisely because documents are big. So reads are line-windowed by FileType.read_as_text_paged, the same windower (with the same DEFAULT_LIMIT-line / 50 KB caps) that backs Workspace::Read; it returns an Extractor::Page, which this tool formats. Unlike Workspace::Read there is no cat -n line-number prefix: nothing downstream edits these documents (the citation unit is the source path, not a line), and dropping the prefix saves tokens on exactly the operation whose point is to spend context wisely.
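To make the windowing concrete, here is a rough sketch of a line windower with a line cap and a byte cap. It is not FileType.read_as_text_paged: the Window struct, the window_lines name and the 2000-line default are assumptions for illustration (only the 50 KB cap comes from the text above), and the real Extractor::Page fields are not shown on this page.

```ruby
# Hypothetical page shape; the real Extractor::Page may carry different fields.
Window = Struct.new(:lines, :next_offset, keyword_init: true)

# Return up to `limit` lines starting at 1-indexed `offset`, stopping early
# once adding another line would exceed `max_bytes`.
def window_lines(text, offset: 1, limit: 2000, max_bytes: 50 * 1024)
  all = text.lines
  picked = []
  bytes = 0

  all.drop(offset - 1).each do |line|
    break if picked.length >= limit
    # Always admit the first line so an oversized line cannot stall paging.
    break if !picked.empty? && bytes + line.bytesize > max_bytes

    picked << line
    bytes += line.bytesize
  end

  consumed = offset - 1 + picked.length
  # next_offset is nil when the document is exhausted.
  Window.new(lines: picked, next_offset: consumed < all.length ? consumed + 1 : nil)
end
```

A caller would format the returned lines and, when next_offset is present, append a continuation hint carrying that offset.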

Errors the LLM can react to

Extraction goes through FileType.read_as_text_paged, which routes through the same Extractor registry that the Indexer's FileType.read_as_text does, so what you read matches what was indexed exactly (modulo edits to the file since), “— Page N —” PDF markers included. Images / binaries / a vanished file / a malformed PDF all come back as “Error: …” observations rather than raising, per CLAUDE.md “Errors are loud” (these are failures the agent reacts to, not pikuri bugs).

Constant Summary

DEFAULT_LIMIT =

Returns default value of the limit parameter (number of lines returned per call). Aliases the shared Extractor::PAGE_DEFAULT_LIMIT.

Returns:

  • (Integer)

Pikuri::Extractor::PAGE_DEFAULT_LIMIT
MAX_BYTES_LABEL =

Returns human-readable form of the shared byte cap (Extractor::PAGE_MAX_BYTES) for the continuation marker.

Returns:

  • (String)

"#{Pikuri::Extractor::PAGE_MAX_BYTES / 1024} KB"
DESCRIPTION =

Returns static description shown to the LLM, opencode-shape (summary + Usage: bullets).

Returns:

  • (String)

<<~DESC
  Read a full indexed document by its `source` path (the citation from vectordb_search).

  Usage:
  - Use after vectordb_search surfaces a clean hit you want in full, instead of re-querying for more fragments of the same document.
  - `source` must be a path a vectordb_search result returned; you cannot read arbitrary files, only indexed documents.
  - Large documents are paged: when the output ends in `Use offset=N to continue`, call again with that offset.
  - Reading a whole document spends context — read the one or two best hits in full, not every result.
DESC
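
The continuation contract in the description (call again with the offset the output names) can be exercised with a small driver loop, for example from a spec. This is a sketch under assumptions: fake_read is a hypothetical stand-in for the tool, and the marker it prints only reuses the `Use offset=N to continue` wording from DESCRIPTION; the exact text around that phrase in real output is not shown on this page.

```ruby
# Hypothetical five-line document and a stand-in for the tool's execute lambda.
doc = (1..5).map { |i| "line #{i}" }

fake_read = lambda do |offset:, limit:|
  page = doc.drop(offset - 1).first(limit)
  last = offset - 1 + page.length
  out = page.join("\n")
  # Assumed marker shape, built around the phrase quoted in DESCRIPTION.
  out += "\n(Use offset=#{last + 1} to continue)" if last < doc.length
  out
end

# Keep calling the tool until an observation no longer carries the marker.
def read_all(tool, limit:)
  offset = 1
  pages = []
  loop do
    out = tool.call(offset: offset, limit: limit)
    pages << out
    match = out.match(/Use offset=(\d+) to continue/)
    break unless match

    offset = Integer(match[1])
  end
  pages
end

pages = read_all(fake_read, limit: 2)
```

An agent would normally stop as soon as it has what it needs rather than draining the document, since each page spends context.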

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(backend:, root:) ⇒ Read

Parameters:

  • backend (#source_indexed?)

    any Backend implementation; consulted for the membership gate.

  • root (Pathname)

    the corpus root — Indexer#root. Relative source paths resolve against it.



# File 'lib/pikuri/vector_db/tools/read.rb', line 94

def initialize(backend:, root:)
  super(
    name: 'vectordb_read',
    description: DESCRIPTION,
    parameters: Pikuri::Tool::Parameters.build { |p|
      p.required_string :source,
                        'Source path from a vectordb_search result, ' \
                        'e.g. "notes/cooking.md".'
      p.optional_integer :offset,
                         'Line to start reading from (1-indexed). ' \
                         'Defaults to 1, e.g. 200.'
      p.optional_integer :limit,
                         'Maximum number of lines to read. Defaults to ' \
                         "#{DEFAULT_LIMIT}, e.g. 500."
    },
    execute: lambda { |source:, offset: 1, limit: DEFAULT_LIMIT|
      Read.read(backend: backend, root: root, source: source, offset: offset, limit: limit)
    }
  )
end

Class Method Details

.read(backend:, root:, source:, offset:, limit:) ⇒ String

Resolve source against the corpus root, enforce the membership gate + containment, extract text, and window it. Public so specs can exercise the read path without a Tool wrapper.

Parameters:

  • backend (#source_indexed?)
  • root (Pathname)
  • source (String)

    the source path as supplied by the LLM

  • offset (Integer)

    1-indexed line to start at

  • limit (Integer)

    maximum lines to return

Returns:

  • (String)

    tool observation — the windowed text or an “Error: …” string.



# File 'lib/pikuri/vector_db/tools/read.rb', line 127

def self.read(backend:, root:, source:, offset:, limit:)
  return "Error: offset must be >= 1, got #{offset}" if offset < 1
  return "Error: limit must be >= 1, got #{limit}"   if limit < 1

  unless backend.source_indexed?(source)
    return "Error: \"#{source}\" is not in the indexed corpus. " \
           'Use vectordb_search to find a source path, then read that.'
  end

  resolved = root.join(source).expand_path
  unless contained?(resolved, root)
    return "Error: \"#{source}\" resolves outside the corpus root."
  end

  render(Pikuri::FileType.read_as_text_paged(resolved, offset: offset, limit: limit), source: source)
rescue Errno::ENOENT
  "Error: indexed document \"#{source}\" is no longer on disk; the index may be stale " \
    '(run vectordb_reindex to refresh).'
rescue ArgumentError => e
  # read_as_text refusing an image / binary / directory.
  "Error: cannot read \"#{source}\" as text: #{e.message}"
rescue RuntimeError => e
  # read_as_text on a malformed / unsupported PDF.
  "Error: #{e.message}"
end