Class: Pikuri::VectorDb::Indexer
- Inherits: Object (Pikuri::VectorDb::Indexer < Object)
- Defined in: lib/pikuri/vector_db/indexer.rb
Overview
The composing piece of the vectordb pipeline. Walks the configured source, enumerates indexable files (filtering out the DENYLIST and dot-files), extracts text via FileType.read_as_text, chunks via the configured Chunker, embeds via the configured Embedder, and passes the result to Backend#upsert.
Three public entry points
- #index_all!: unconditionally index every source file. The caller is responsible for first nuking the backend if a full re-index is wanted.
- #reindex!: convenience for “nuke and re-index”: backend.delete_all followed by index_all!. The v1 nuke-and-reload path; what Extension wires to the eventual vectordb_reindex tool.
- #index_if_empty!: only index if backend.count.zero?. The boot-time path: Backend::InMemory always indexes (RAM-only, always empty); Backend::Chroma only indexes on first boot or after a manual reindex.
Incremental entry points (auto-watch)
The Watcher daemon drives three further methods that touch one file at a time instead of the whole corpus:
- #reindex_file!(path): extract → chunk → embed → atomic Backend#replace_source. The unit of incremental work.
- #remove_file!(path): Backend#delete_by_source for a file that was deleted or became unindexable.
- #reconcile_plan: the boot sweep. Diffs the files on disk (by content hash) against Backend#sources_with_hashes and returns the {reindex:, remove:} work list without executing it (the Watcher feeds it through its queue).
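How the Watcher's queue is built is not shown on this page. As a rough sketch of the idea only, a last-intent-wins queue could key pending work by path, so that a newer intent replaces an older one for the same file. IntentQueue, enqueue_plan, and the :reindex / :remove symbols are hypothetical names, not part of Pikuri:

```ruby
# Hypothetical last-intent-wins queue: at most one pending action per path.
class IntentQueue
  def initialize
    @intents = {} # path => :reindex or :remove, kept in insertion order
  end

  # Record the newest intent for a path, replacing any older pending one.
  def push(path, action)
    @intents.delete(path) # re-insert so the newest intent moves to the back
    @intents[path] = action
  end

  # Yield pending work oldest-first until the queue is empty.
  def drain
    until @intents.empty?
      path, action = @intents.shift
      yield path, action
    end
  end
end

# Feed a {reindex:, remove:} plan (the shape #reconcile_plan returns) into the queue.
def enqueue_plan(queue, plan)
  plan[:reindex].each { |path| queue.push(path, :reindex) }
  plan[:remove].each { |path| queue.push(path, :remove) }
  queue
end

queue = enqueue_plan(IntentQueue.new, { reindex: ['a.md', 'b.md'], remove: ['old.md'] })
queue.push('a.md', :remove) # a live delete event supersedes the planned reindex
queue.drain { |path, action| puts "#{action} #{path}" }
```

In this sketch the planned reindex of a.md is never executed: only the later :remove intent survives, which is the behavior "last intent wins" is meant to describe.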
Logging
Each indexed file emits one INFO line through LOGGER, prefixed with a [i/total] progress counter (e.g. [3/50] cooking/risotto.md: 7 chunks). Indexing local corpora against a local llama.cpp embedder takes minutes, and the user is blocked on the agent boot, so progress visibility is load-bearing here. Hosts with a richer output channel (a future TUI, a web client) can mute or reroute the PIKURI_LOG_VECTORDB stream — see Pikuri.logger_for.
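The progress prefix can be sketched as a plain format string. This is only an illustration of the line shape quoted above; progress_line is a hypothetical helper, and the stdlib Logger here stands in for whatever Pikuri.logger_for returns:

```ruby
require 'logger'

# Format one progress line in the "[i/total] source: N chunks" shape.
def progress_line(i:, total:, source:, chunk_count:)
  format('[%d/%d] %s: %d chunks', i, total, source, chunk_count)
end

logger = Logger.new($stdout) # stand-in; the real class logs through LOGGER
logger.info(progress_line(i: 3, total: 50, source: 'cooking/risotto.md', chunk_count: 7))
```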
WARN lines surface skip reasons: image / binary file in the corpus, source path doesn’t exist, file produced no text (scanned-image PDF, empty file).
Chunk identity
Chunk.id is “source:index” — the relative source path plus the chunk’s ordinal within that file. Readable in logs, deterministic, makes the “one source, many chunks” rule visible, and stable across reindexes of the same file. Incremental reindex replaces a document by source (via Backend#replace_source), not by content-addressed id, so the file’s content hash lives in Chunk.metadata[:hash] instead — the same value on every chunk of one file, which is what lets #reconcile_plan read one chunk per source and still know whether the file changed.
Chunk.source is the path *relative to the source root the file was found under* — short, citation-friendly, survives moving the corpus. Absolute paths would tie the backend to a particular machine layout; relative travels.
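The identity scheme can be sketched as follows. SketchChunk and build_chunks are hypothetical stand-ins for the real Chunk class and the Indexer's private helpers; the SHA-256 digest and the zero-based ordinal are assumptions, since this page does not state which digest or starting index the library uses:

```ruby
require 'digest'
require 'pathname'

# Hypothetical stand-in for the real Chunk class.
SketchChunk = Struct.new(:id, :source, :text, :metadata, keyword_init: true)

# Build chunks for one file: ids are "<relative source>:<ordinal>", and every
# chunk of the file carries the same whole-file content hash in metadata[:hash].
def build_chunks(root:, path:, content:, chunk_texts:)
  source = Pathname.new(path).relative_path_from(Pathname.new(root)).to_s
  file_hash = Digest::SHA256.hexdigest(content) # assumption: SHA-256
  chunk_texts.each_with_index.map do |text, index| # assumption: ordinals start at 0
    SketchChunk.new(id: "#{source}:#{index}", source: source, text: text,
                    metadata: { hash: file_hash })
  end
end

chunks = build_chunks(
  root: '/home/me/notes',
  path: '/home/me/notes/cooking/risotto.md',
  content: "Toast the rice.\nAdd stock slowly.\n",
  chunk_texts: ['Toast the rice.', 'Add stock slowly.']
)
chunks.each { |chunk| puts "#{chunk.id} #{chunk.metadata[:hash][0, 8]}" }
```

Because every chunk of a file carries the same hash, reading any single chunk per source is enough to tell whether the file on disk has changed.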
Errors mid-indexing
If the embedder fails mid-run (network blip, provider 5xx), the exception propagates and indexing aborts. Per CLAUDE.md “Errors are loud” — caller is internal pikuri code, not the LLM. The backend is left in a partial state; the user’s recourse is reindex! which nukes and starts over. The InMemory backend resets on process restart anyway; only Chroma persists partial state, and even then a fresh reindex! recovers cleanly.
Constant Summary
- LOGGER = Pikuri.logger_for('VectorDb::Indexer')
- DENYLIST = %w[ .git node_modules __pycache__ venv target build dist out vendor ].freeze
  Basenames that are skipped during the walk. Targets the cruft people accidentally have inside a notes folder they’ve put under sources: (a cloned repo, a Python workspace, a build directory). Conservative; configurable ignore rules are a follow-up (see IDEAS.md §“Vector DB / RAG” → “Open questions”).
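The Indexer's own file walk is private and not shown on this page. A minimal sketch of a denylist-aware walk, assuming a directory source and assuming that dot-directories are pruned along with dot-files, might look like this (enumerate_indexable and SKETCH_DENYLIST are hypothetical names):

```ruby
require 'find'
require 'fileutils'
require 'pathname'
require 'tmpdir'

SKETCH_DENYLIST = %w[.git node_modules __pycache__ venv target build dist out vendor].freeze

# Return the indexable files under root, sorted. Denylisted or dot-prefixed
# directories are pruned, so their contents are never visited.
def enumerate_indexable(root)
  root = Pathname.new(root)
  files = []
  Find.find(root.to_s) do |entry|
    path = Pathname.new(entry)
    next if path == root

    name = path.basename.to_s
    if SKETCH_DENYLIST.include?(name) || name.start_with?('.')
      Find.prune if path.directory? # skips the whole subtree
      next
    end
    files << path if path.file?
  end
  files.sort
end

Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, 'cooking'))
  FileUtils.mkdir_p(File.join(dir, 'node_modules', 'pkg'))
  File.write(File.join(dir, 'cooking', 'risotto.md'), 'rice')
  File.write(File.join(dir, 'node_modules', 'pkg', 'readme.md'), 'skipped')
  File.write(File.join(dir, '.hidden.md'), 'skipped too')
  enumerate_indexable(dir).each { |path| puts path.relative_path_from(Pathname.new(dir)) }
end
```

Pruning at the directory level keeps the walk cheap for large trees such as node_modules, since none of their entries are ever enumerated.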
Instance Attribute Summary
- #source ⇒ Pathname (readonly): The configured source, tilde-expanded.
Instance Method Summary
- #index_all! ⇒ Integer: Walk every source, index every reachable non-denylisted file.
- #index_if_empty! ⇒ Integer: Index only if the backend is currently empty.
- #initialize(backend:, source:, embedder:, chunker:) ⇒ Indexer (constructor): Why a single source (not an array).
- #reconcile_plan ⇒ Hash{Symbol => Array<Pathname>}: The boot reconciliation sweep, as a plan rather than an action: walk the source tree, hash every file, diff against Backend#sources_with_hashes, and return the work list.
- #reindex! ⇒ Integer: backend.delete_all followed by #index_all!.
- #reindex_file!(path) ⇒ Integer: Re-index a single file in place: extract → chunk → embed → Backend#replace_source.
- #remove_file!(path) ⇒ void: Drop a file’s chunks from the index. This is the Watcher’s response to a delete (or move-away) event.
- #root ⇒ Pathname: The directory whose tree is indexed, which is the anchor for every relative Chunk#source.
Constructor Details
#initialize(backend:, source:, embedder:, chunker:) ⇒ Indexer
Why a single source (not an array)
v0 of this API took sources: ['~/notes', '~/docs'] and used path.relative_path_from(root) to derive Chunk#source. Two files named cooking.md under different roots would produce identical source values, hence identical "#{source}:#{offset}" IDs; Backend#upsert’s replace-by-id semantics would silently let the second file’s chunks overwrite the first’s. A single source eliminates the clash entirely. Multiple-roots support is deferred; see IDEAS.md §“Vector DB / RAG” → “Deferred”.
    # File 'lib/pikuri/vector_db/indexer.rb', line 133
    def initialize(backend:, source:, embedder:, chunker:)
      @backend = backend
      @source = Pathname.new(source).expand_path
      @embedder = embedder
      @chunker = chunker
    end
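The collision that motivated the single-source design can be reproduced with plain Pathname arithmetic. This is an illustrative sketch, not the v0 code: sketch_relative_source and upsert_first_chunks are hypothetical helpers, and a Hash stands in for a backend with replace-by-id upserts:

```ruby
require 'pathname'

# Derive a chunk source relative to whichever root the file was found under.
def sketch_relative_source(root, path)
  Pathname.new(path).relative_path_from(Pathname.new(root)).to_s
end

# Upsert the first chunk of each file into a Hash keyed by chunk id.
# Writing an existing key replaces the previous value, like replace-by-id.
def upsert_first_chunks(root_path_pairs)
  store = {}
  root_path_pairs.each do |root, path|
    store["#{sketch_relative_source(root, path)}:0"] = "first chunk of #{path}"
  end
  store
end

store = upsert_first_chunks([
  ['/home/me/notes', '/home/me/notes/cooking.md'],
  ['/home/me/docs',  '/home/me/docs/cooking.md']
])
puts store.inspect
```

Both files map to the id cooking.md:0, so the store ends up with a single entry holding the docs copy, and the notes copy is lost without any error.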
Instance Attribute Details
#source ⇒ Pathname (readonly)
Returns the configured source, tilde-expanded. A single file or a directory tree. The Watcher reads this to decide what to watch and how to filter events.
    # File 'lib/pikuri/vector_db/indexer.rb', line 143
    def source
      @source
    end
Instance Method Details
#index_all! ⇒ Integer
Walk every source, index every reachable non-denylisted file. Returns the total chunk count emitted into the backend across this invocation.
    # File 'lib/pikuri/vector_db/indexer.rb', line 161
    def index_all!
      files = enumerate_files
      if files.empty?
        LOGGER.warn("no indexable files found under source: #{@source}")
        return 0
      end

      LOGGER.info("indexing #{files.length} file(s) from #{@source}")
      started = Time.now
      total_chunks = 0
      files.each_with_index do |(root, path), i|
        total_chunks += index_file(root: root, path: path, i: i + 1, total: files.length)
      end
      LOGGER.info(format(
        'done: %d file(s), %d chunks, %.1fs',
        files.length, total_chunks, Time.now - started
      ))
      total_chunks
    end
#index_if_empty! ⇒ Integer
Index only if the backend is currently empty. The boot-time entry point — InMemory backends always re-index (RAM-only); Chroma backends only re-index on first boot or after a manual reindex!.
    # File 'lib/pikuri/vector_db/indexer.rb', line 198
    def index_if_empty!
      existing = @backend.count
      if existing.positive?
        LOGGER.info("backend already has #{existing} chunk(s); skipping boot index")
        return 0
      end
      index_all!
    end
#reconcile_plan ⇒ Hash{Symbol => Array<Pathname>}
The boot reconciliation sweep, as a plan rather than an action: walk the source tree, hash every file, diff against Backend#sources_with_hashes, and return the work list. The Watcher feeds this through its single work queue so the sweep and live events share one last-intent-wins path (and so teardown can interrupt between files).
Uniform across backends with no is_a? branch: InMemory reports an empty manifest at boot (RAM reset) so every file reads as new; Chroma reports its persisted manifest so only genuinely-changed files come back. The sweep also closes the downtime gap — changes made while no Watcher was running are invisible to the filesystem events but caught by the hash diff.
    # File 'lib/pikuri/vector_db/indexer.rb', line 283
    def reconcile_plan
      on_disk = {} # source (String) => Pathname
      enumerate_files.each { |_root, p| on_disk[relative_source(p)] = p }
      indexed = @backend.sources_with_hashes

      reindex = on_disk.select { |source, path| indexed[source] != file_hash(path) }.values
      remove = (indexed.keys - on_disk.keys).map { |source| root.join(source) }
      { reindex: reindex, remove: remove }
    end
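The diff at the heart of the sweep can be exercised without a backend or a filesystem. In this sketch plan_reconcile is a hypothetical helper that takes both sides as plain { source => hash } Hashes, with short strings standing in for real content hashes:

```ruby
# on_disk: { source => current content hash }
# indexed: { source => hash recorded when the source was last indexed }
def plan_reconcile(on_disk:, indexed:)
  # New files have no recorded hash (nil), so they also fail the equality check.
  reindex = on_disk.reject { |source, hash| indexed[source] == hash }.keys
  remove = indexed.keys - on_disk.keys
  { reindex: reindex, remove: remove }
end

plan = plan_reconcile(
  on_disk: { 'a.md' => 'h1', 'b.md' => 'h2-new', 'c.md' => 'h3' },
  indexed: { 'a.md' => 'h1', 'b.md' => 'h2-old', 'gone.md' => 'h9' }
)
puts "reindex: #{plan[:reindex].join(', ')}"
puts "remove:  #{plan[:remove].join(', ')}"
```

Here b.md (changed) and c.md (new) land in the reindex list, gone.md lands in the remove list, and the unchanged a.md appears in neither. Passing an empty indexed Hash, as an InMemory backend would report at boot, puts every on-disk file in the reindex list.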
#reindex! ⇒ Integer
backend.delete_all followed by #index_all!. The v1 nuke-and-reload reindex path.
    # File 'lib/pikuri/vector_db/indexer.rb', line 185
    def reindex!
      LOGGER.info('reindex: clearing backend')
      @backend.delete_all
      index_all!
    end
#reindex_file!(path) ⇒ Integer
Re-index a single file in place: extract → chunk → embed → Backend#replace_source. The atomic unit of incremental work the Watcher drives on a modify/add event.
The embed happens before the backend write, so an embedder outage raises here and leaves the previously-indexed chunks untouched — see Backend::Chroma#replace_source for why the ordering matters. A file that no longer yields text (emptied, replaced by a binary, deleted under us) is treated as a removal: its stale chunks are dropped and nothing is written, so the index never retains orphans for it.
    # File 'lib/pikuri/vector_db/indexer.rb', line 225
    def reindex_file!(path)
      path = Pathname.new(path).expand_path
      source = relative_source(path)
      texts =
        begin
          chunk_texts_for(path)
        rescue ArgumentError, Errno::ENOENT, RuntimeError => e
          LOGGER.info("reindex #{source}: unindexable (#{e.message}); removing from index")
          @backend.delete_by_source(source)
          return 0
        end
      if texts.empty?
        LOGGER.info("reindex #{source}: no indexable text; removing from index")
        @backend.delete_by_source(source)
        return 0
      end

      # embed_and_build is *outside* the rescue: an embedder outage
      # raises straight out, before replace_source touches the
      # backend, so the prior chunks survive (embed-before-delete).
      chunks, vectors = embed_and_build(source: source, path: path, chunk_texts: texts)
      @backend.replace_source(source: source, chunks: chunks, vectors: vectors)
      LOGGER.info("reindex #{source}: #{chunks.length} chunk(s)")
      chunks.length
    end
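The embed-before-delete ordering can be demonstrated with toy collaborators. Everything here is hypothetical: SketchBackend, SketchEmbedder, and reindex_source only mimic the shape of the real calls, and the one-dimensional "vectors" are placeholders:

```ruby
# Hypothetical in-memory backend: one entry per source.
class SketchBackend
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Swap in the new chunk/vector pairs for a source in one assignment.
  def replace_source(source:, chunks:, vectors:)
    @rows[source] = chunks.zip(vectors)
  end
end

# Hypothetical embedder that can be switched into a failing state.
class SketchEmbedder
  attr_writer :down

  def embed(texts)
    raise IOError, 'embedder unavailable' if @down

    texts.map { |text| [text.length.to_f] } # toy one-dimensional vectors
  end
end

def reindex_source(backend:, embedder:, source:, texts:)
  vectors = embedder.embed(texts) # may raise; the backend has not been touched yet
  backend.replace_source(source: source, chunks: texts, vectors: vectors)
  texts.length
end

backend = SketchBackend.new
embedder = SketchEmbedder.new
reindex_source(backend: backend, embedder: embedder, source: 'a.md', texts: ['old text'])

embedder.down = true
begin
  reindex_source(backend: backend, embedder: embedder, source: 'a.md', texts: ['new text'])
rescue IOError => e
  puts "reindex failed: #{e.message}"
end
puts backend.rows['a.md'].inspect
```

After the failed second call the backend still holds the chunk from the first call. Had the delete run before the embed, the same outage would have left a.md with no chunks at all.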
#remove_file!(path) ⇒ void
This method returns an undefined value.
Drop a file’s chunks from the index. This is the Watcher’s response to a delete (or move-away) event. Idempotent: removing a source that isn’t indexed is a no-op.
    # File 'lib/pikuri/vector_db/indexer.rb', line 258
    def remove_file!(path)
      source = relative_source(Pathname.new(path).expand_path)
      @backend.delete_by_source(source)
      LOGGER.info("removed #{source} from index")
      nil
    end
#root ⇒ Pathname
The directory whose tree is indexed — the anchor for every relative Chunk#source. A directory source is its own root; a single-file source roots at its parent (so the citation is just the basename). The Watcher watches this directory.
    # File 'lib/pikuri/vector_db/indexer.rb', line 152
    def root
      @source.directory? ? @source : @source.parent
    end
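The effect of the two cases on citations can be checked against a temporary directory. root_for and citation_for are hypothetical free-function versions of the logic above, written for illustration only:

```ruby
require 'pathname'
require 'tmpdir'

# A directory source is its own root; a single-file source roots at its parent.
def root_for(source)
  source = Pathname.new(source).expand_path
  source.directory? ? source : source.parent
end

# The citation for a file is its path relative to the source's root.
def citation_for(source, path)
  Pathname.new(path).expand_path.relative_path_from(root_for(source)).to_s
end

Dir.mktmpdir do |dir|
  Dir.mkdir(File.join(dir, 'cooking'))
  file = File.join(dir, 'cooking', 'risotto.md')
  File.write(file, 'rice')

  puts citation_for(dir, file)  # directory source: path below the directory
  puts citation_for(file, file) # single-file source: just the basename
end
```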