Class: Pikuri::VectorDb::Indexer
- Inherits: Object (Object → Pikuri::VectorDb::Indexer)
- Defined in: lib/pikuri/vector_db/indexer.rb
Overview
The composing piece of the vectordb pipeline. Walks the configured sources, enumerates indexable files (filtering out the DENYLIST and dot-files), extracts text via FileType.read_as_text, chunks via the configured Chunker, embeds via the configured Embedder, and Backend#upserts the result.
Three public entry points
- #index_all!: unconditionally index every source file. The caller is responsible for first nuking the backend if a full re-index is wanted.
- #reindex!: convenience for “nuke and re-index”: backend.delete_all followed by index_all!. The v1 nuke-and-reload path; what Extension wires to the eventual vectordb_reindex tool.
- #index_if_empty!: index only if backend.count.zero?. The boot-time path: Backend::InMemory always indexes (RAM-only, always empty); Backend::Chroma only indexes on first boot or after a manual reindex.
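The relationship between the three entry points can be sketched with stand-in classes. FakeBackend and SketchIndexer below are hypothetical illustrations, not the real Pikuri classes; only the method names index_all!, reindex!, index_if_empty!, count, and delete_all come from the description above, and the two-argument upsert is an assumption made for brevity.

```ruby
# Hypothetical stand-in for a backend with replace-by-id upserts.
class FakeBackend
  def initialize
    @chunks = {}
  end

  def upsert(id, text)
    @chunks[id] = text
  end

  def count
    @chunks.size
  end

  def delete_all
    @chunks.clear
  end
end

# Hypothetical indexer: docs maps a relative path to its pre-chunked text.
class SketchIndexer
  def initialize(backend:, docs:)
    @backend = backend
    @docs = docs
  end

  # Unconditional: upserts everything and never clears first.
  def index_all!
    total = 0
    @docs.each do |source, chunks|
      chunks.each_with_index do |text, i|
        @backend.upsert("#{source}:#{i}", text)
        total += 1
      end
    end
    total
  end

  # Nuke and reload.
  def reindex!
    @backend.delete_all
    index_all!
  end

  # Boot-time path: a populated backend is left alone.
  def index_if_empty!
    return 0 if @backend.count.positive?

    index_all!
  end
end

backend = FakeBackend.new
indexer = SketchIndexer.new(
  backend: backend,
  docs: { 'a.md' => %w[one two], 'b.md' => %w[three] }
)

first_boot = indexer.index_if_empty!   # backend is empty, so this indexes
second_boot = indexer.index_if_empty!  # backend is populated, so this skips
after_reindex = indexer.reindex!       # clears, then indexes again
```

In this sketch a persistent backend would take the second_boot path on every start after the first, which is the behaviour described for Backend::Chroma.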
Logging
Each indexed file emits one INFO line through LOGGER, prefixed with a [i/total] progress counter (e.g. [3/50] cooking/risotto.md: 7 chunks). Indexing local corpora against a local llama.cpp embedder takes minutes, and the user is blocked on the agent boot, so progress visibility is load-bearing here. Hosts with a richer output channel (a future TUI, a web client) can mute or reroute the PIKURI_LOG_VECTORDB stream — see Pikuri.logger_for.
WARN lines surface skip reasons: image / binary file in the corpus, source path doesn’t exist, file produced no text (scanned-image PDF, empty file).
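A minimal sketch of the progress-line shape, using Ruby's standard Logger rather than Pikuri.logger_for. The progress_line helper, the simplified formatter, and the wording of the WARN message are assumptions for illustration; only the [i/total] prefix and the sample INFO text come from the description above.

```ruby
require 'logger'
require 'stringio'

out = StringIO.new
logger = Logger.new(out)
# Simplified formatter so the captured output is easy to read.
logger.formatter = proc { |severity, _time, _progname, msg| "#{severity} #{msg}\n" }

# Hypothetical helper: builds one per-file progress line.
def progress_line(i, total, source, chunk_count)
  format('[%d/%d] %s: %d chunks', i, total, source, chunk_count)
end

logger.info(progress_line(3, 50, 'cooking/risotto.md', 7))
# Hypothetical skip message; the real wording is not shown in this page.
logger.warn('skipping scans/receipt.pdf: file produced no text')

log_output = out.string
```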
Chunk identity
Chunk.id is “source:index” — the relative source path plus the chunk’s ordinal within that file. Readable in logs, deterministic, makes the “one source, many chunks” rule visible. Hash-based IDs would be forward-compatible with content-addressing for incremental re-index, but v1 nuke-and-reload doesn’t need that and the readable form is more useful day-to-day.
Chunk.source is the path *relative to the source root the file was found under* — short, citation-friendly, survives moving the corpus. Absolute paths would tie the backend to a particular machine layout; relative travels.
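The ID scheme can be illustrated with Pathname alone. The paths below are made up, and starting the ordinal at 0 is an assumption; the page only says the ID combines the relative source path with the chunk's ordinal.

```ruby
require 'pathname'

# Made-up paths for illustration; nothing here touches the filesystem.
root = Pathname.new('/home/alice/notes')
path = Pathname.new('/home/alice/notes/cooking/risotto.md')

# Chunk source: the path relative to the source root.
source = path.relative_path_from(root).to_s

# Chunk IDs: "source:index" for each chunk of the file (0-based here by assumption).
ids = (0...3).map { |index| "#{source}:#{index}" }
```

Moving the corpus to another directory changes root and path together, so source and ids stay the same.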
Errors mid-indexing
If the embedder fails mid-run (network blip, provider 5xx), the exception propagates and indexing aborts. Per CLAUDE.md “Errors are loud” — caller is internal pikuri code, not the LLM. The backend is left in a partial state; the user’s recourse is reindex! which nukes and starts over. The InMemory backend resets on process restart anyway; only Chroma persists partial state, and even then a fresh reindex! recovers cleanly.
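The partial-state scenario can be sketched with a deliberately flaky embedder. FlakyEmbedder, the Hash used as a backend, and this free-standing index_all! are hypothetical stand-ins; the rescue exists only so the demo can inspect the store afterwards, whereas the behaviour described above is that the exception propagates.

```ruby
# Hypothetical embedder that raises on its Nth call.
class FlakyEmbedder
  def initialize(fail_on:)
    @fail_on = fail_on
    @calls = 0
  end

  def embed(text)
    @calls += 1
    raise IOError, 'provider returned 503' if @calls == @fail_on

    [text.length.to_f] # placeholder "embedding"
  end
end

# Hash stand-in for a backend: upsert is plain assignment by ID.
def index_all!(store, docs, embedder)
  docs.each { |id, text| store[id] = embedder.embed(text) }
  store.size
end

docs = { 'a.md:0' => 'alpha', 'b.md:0' => 'beta', 'c.md:0' => 'gamma' }
store = {}

failure = nil
begin
  index_all!(store, docs, FlakyEmbedder.new(fail_on: 2))
rescue IOError => e
  failure = e.message
end
partial = store.keys # only the chunks embedded before the failure

# Recovery: clear everything, then run again with a healthy embedder.
store.clear
recovered = index_all!(store, docs, FlakyEmbedder.new(fail_on: nil))
```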
Constant Summary
- LOGGER = Pikuri.logger_for('VectorDb::Indexer')
- DENYLIST = %w[ .git node_modules __pycache__ venv target build dist out vendor ].freeze
Basenames that are skipped during the walk. Targets the cruft people accidentally have inside a notes folder they’ve put under sources: (a cloned repo, a Python workspace, a build directory). Conservative; configurable ignore rules are a follow-up (see IDEAS.md §“Vector DB / RAG” → “Open questions”).
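One way the denylist and dot-file filtering could work is a Find walk that prunes matching directories. This is a sketch under assumptions, not the actual enumerate_files implementation, which this page does not show; the temporary directory layout is invented for the demo.

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

DENYLIST = %w[.git node_modules __pycache__ venv target build dist out vendor].freeze

# Hypothetical walk: returns every regular file under root whose path contains
# no denylisted or dot-prefixed component.
def enumerate_files(root)
  files = []
  Find.find(root) do |path|
    next if path == root

    base = File.basename(path)
    if DENYLIST.include?(base) || base.start_with?('.')
      Find.prune if File.directory?(path) # skip the whole subtree
      next
    end
    files << path if File.file?(path)
  end
  files.sort
end

found = Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, 'cooking'))
  FileUtils.mkdir_p(File.join(dir, 'node_modules'))
  File.write(File.join(dir, 'cooking', 'risotto.md'), 'stir')
  File.write(File.join(dir, 'node_modules', 'pkg.js'), '')
  File.write(File.join(dir, '.hidden'), '')
  File.write(File.join(dir, 'notes.md'), 'hello')
  enumerate_files(dir).map { |p| p.delete_prefix("#{dir}/") }
end
```

Because the match is on basenames, a plain file that happens to be called build or out is skipped as well in this sketch.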
Instance Method Summary
- #index_all! ⇒ Integer
Walk every source, index every reachable non-denylisted file.
- #index_if_empty! ⇒ Integer
Index only if the backend is currently empty.
- #initialize(backend:, source:, embedder:, chunker:) ⇒ Indexer (constructor)
Why a single source (not an array).
- #reindex! ⇒ Integer
backend.delete_all followed by #index_all!.
Constructor Details
#initialize(backend:, source:, embedder:, chunker:) ⇒ Indexer
Why a single source (not an array)
v0 of this API took sources: ['~/notes', '~/docs'] and used path.relative_path_from(root) to derive Chunk#source. Two files named cooking.md across different roots would produce identical source values, hence identical "#{source}:#{offset}" IDs; Backend#upsert's replace-by-id semantics would silently let the second file's chunks overwrite the first's. Single source eliminates the clash entirely. Multiple-roots support is deferred; see IDEAS.md §“Vector DB / RAG” → “Deferred”.
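The clash is easy to reproduce in isolation. In this sketch a plain Hash stands in for a backend with replace-by-id upserts, and the two roots and file names are made up.

```ruby
require 'pathname'

# Two made-up roots, each containing a file named cooking.md.
roots = [Pathname.new('/home/alice/notes'), Pathname.new('/home/alice/docs')]
files = roots.map { |root| [root, root.join('cooking.md')] }

store = {} # Hash assignment models replace-by-id upsert
files.each do |root, path|
  source = path.relative_path_from(root).to_s # 'cooking.md' for both roots
  store["#{source}:0"] = "first chunk of #{path}"
end

surviving = store['cooking.md:0']
```

Both files map to the ID cooking.md:0, so the store ends up with one entry and the chunk from the first root is gone without any error.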
# File 'lib/pikuri/vector_db/indexer.rb', line 115
def initialize(backend:, source:, embedder:, chunker:)
  @backend = backend
  @source = Pathname.new(source) # the listing is cut off after a trailing "." here
  @embedder = embedder
  @chunker = chunker
end
Instance Method Details
#index_all! ⇒ Integer
Walk every source, index every reachable non-denylisted file. Returns the total chunk count emitted into the backend across this invocation.
# File 'lib/pikuri/vector_db/indexer.rb', line 127
def index_all!
  files = enumerate_files
  if files.empty?
    LOGGER.warn("no indexable files found under source: #{@source}")
    return 0
  end
  LOGGER.info("indexing #{files.length} file(s) from #{@source}")
  started = Time.now
  total_chunks = 0
  files.each_with_index do |(root, path), i|
    total_chunks += index_file(root: root, path: path, i: i + 1, total: files.length)
  end
  LOGGER.info(format(
    'done: %d file(s), %d chunks, %.1fs',
    files.length, total_chunks, Time.now - started
  ))
  total_chunks
end
#index_if_empty! ⇒ Integer
Index only if the backend is currently empty. The boot-time entry point — InMemory backends always re-index (RAM-only); Chroma backends only re-index on first boot or after a manual reindex!.
# File 'lib/pikuri/vector_db/indexer.rb', line 164
def index_if_empty!
  existing = @backend.count
  if existing.positive?
    LOGGER.info("backend already has #{existing} chunk(s); skipping boot index")
    return 0
  end
  index_all!
end
#reindex! ⇒ Integer
backend.delete_all followed by #index_all!. The v1 nuke-and-reload reindex path.
# File 'lib/pikuri/vector_db/indexer.rb', line 151
def reindex!
  LOGGER.info('reindex: clearing backend')
  @backend.delete_all
  index_all!
end