Class: Pikuri::VectorDb::Indexer

Inherits:
Object
Defined in:
lib/pikuri/vector_db/indexer.rb

Overview

The composing piece of the vectordb pipeline. Walks the configured source, enumerates indexable files (filtering out DENYLIST entries and dot-files), extracts text via FileType.read_as_text, chunks via the configured Chunker, embeds via the configured Embedder, and writes the result with Backend#upsert.
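The flow described above can be sketched end to end with stand-in collaborators. Everything here is hypothetical: ToyChunker, ToyEmbedder, ToyBackend and toy_index are illustrative names rather than Pikuri classes, and the keyword arguments of upsert are an assumption. Only the chunk / embed / upsert sequence mirrors the text.

```ruby
# Hypothetical stand-ins; only the chunk -> embed -> upsert order follows the docs.
class ToyChunker
  def initialize(size)
    @size = size
  end

  # Split text into consecutive windows of at most @size characters.
  def chunk(text)
    text.scan(/.{1,#{@size}}/m)
  end
end

class ToyEmbedder
  # One fake 2-dimensional vector per input string.
  def embed(texts)
    texts.map { |t| [t.length.to_f, t.count('aeiou').to_f] }
  end
end

class ToyBackend
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Replace-by-id: a later upsert with the same id overwrites the earlier row.
  # The keyword signature is an assumption made for this sketch.
  def upsert(ids:, texts:, vectors:)
    ids.each_with_index { |id, i| @rows[id] = [texts[i], vectors[i]] }
  end

  def count
    @rows.size
  end
end

# Chunk, embed, and upsert one already-extracted document.
def toy_index(source:, text:, chunker:, embedder:, backend:)
  texts   = chunker.chunk(text)
  vectors = embedder.embed(texts)
  ids     = texts.each_index.map { |i| "#{source}:#{i}" }
  backend.upsert(ids: ids, texts: texts, vectors: vectors)
  texts.length
end

backend = ToyBackend.new
toy_index(source: 'cooking/risotto.md', text: 'Toast the rice, then add stock.',
          chunker: ToyChunker.new(12), embedder: ToyEmbedder.new, backend: backend)
puts backend.count
```

Text extraction and the file walk are left out on purpose; the sketch starts from text that has already been read.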

Three public entry points

  • #index_all! — unconditionally index every source file. Caller is responsible for first nuking the backend if a full re-index is wanted.

  • #reindex! — convenience for “nuke and re-index”: backend.delete_all followed by index_all!. The v1 nuke-and-reload path; what Extension wires to the eventual vectordb_reindex tool.

  • #index_if_empty! — only index if backend.count.zero?. The boot-time path: Backend::InMemory always indexes (RAM-only, always empty); Backend::Chroma only indexes on first boot or after a manual reindex.

Incremental entry points (auto-watch)

The Watcher daemon drives three further methods that touch one file at a time instead of the whole corpus:

  • #reindex_file!(path) — extract → chunk → embed → atomic Backend#replace_source. The unit of incremental work.

  • #remove_file!(path) — Backend#delete_by_source for a file that was deleted or became unindexable.

  • #reconcile_plan — the boot sweep: diff the files on disk (by content hash) against Backend#sources_with_hashes and return the {reindex:, remove:} work list, without executing it (the Watcher feeds it through its queue).

Logging

Each indexed file emits one INFO line through LOGGER, prefixed with a [i/total] progress counter (e.g. [3/50] cooking/risotto.md: 7 chunks). Indexing local corpora against a local llama.cpp embedder takes minutes, and the user is blocked on the agent boot, so progress visibility is load-bearing here. Hosts with a richer output channel (a future TUI, a web client) can mute or reroute the PIKURI_LOG_VECTORDB stream — see Pikuri.logger_for.

WARN lines surface skip reasons: image / binary file in the corpus, source path doesn’t exist, file produced no text (scanned-image PDF, empty file).
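As a small illustration of the progress prefix, the INFO line quoted above can be produced with a plain format string. The helper name progress_line is hypothetical, and the real log line may carry additional fields.

```ruby
# Hypothetical helper: builds the "[i/total] source: N chunks" line quoted in the docs.
def progress_line(i:, total:, source:, chunks:)
  format('[%d/%d] %s: %d chunks', i, total, source, chunks)
end

puts progress_line(i: 3, total: 50, source: 'cooking/risotto.md', chunks: 7)
# prints: [3/50] cooking/risotto.md: 7 chunks
```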

Chunk identity

Chunk.id is “source:index” — the relative source path plus the chunk’s ordinal within that file. Readable in logs, deterministic, makes the “one source, many chunks” rule visible, and stable across reindexes of the same file. Incremental reindex replaces a document by source (via Backend#replace_source), not by content-addressed id, so the file’s content hash lives in Chunk.metadata[:hash] instead — the same value on every chunk of one file, which is what lets #reconcile_plan read one chunk per source and still know whether the file changed.

Chunk.source is the path *relative to the source root the file was found under* — short, citation-friendly, survives moving the corpus. Absolute paths would tie the backend to a particular machine layout; relative travels.
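A minimal sketch of the identity scheme described above, under stated assumptions: ToyChunk and build_chunks are hypothetical names, and SHA-256 is a guess, since the text only says “content hash”.

```ruby
require 'digest'
require 'pathname'

# Hypothetical chunk record; the real Chunk class may carry more fields.
ToyChunk = Struct.new(:id, :source, :text, :metadata, keyword_init: true)

# Build the chunks of one file: the id is "source:index", the source is
# relative to the root, and every chunk carries the same file-level hash.
def build_chunks(root:, path:, content:, chunk_texts:)
  source = Pathname.new(path).relative_path_from(Pathname.new(root)).to_s
  hash   = Digest::SHA256.hexdigest(content) # assumption: the algorithm is not documented
  chunk_texts.each_with_index.map do |text, index|
    ToyChunk.new(id: "#{source}:#{index}", source: source, text: text, metadata: { hash: hash })
  end
end

chunks = build_chunks(root: '/home/u/notes', path: '/home/u/notes/cooking/risotto.md',
                      content: 'Toast the rice. Add stock.',
                      chunk_texts: ['Toast the rice.', 'Add stock.'])
puts chunks.map(&:id).inspect
# prints: ["cooking/risotto.md:0", "cooking/risotto.md:1"]
```

Because the hash is identical on every chunk of a file, reading any single chunk per source is enough to tell whether the file on disk has changed.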

Errors mid-indexing

If the embedder fails mid-run (network blip, provider 5xx), the exception propagates and indexing aborts. This follows the CLAUDE.md rule “Errors are loud”: the caller is internal pikuri code, not the LLM. The backend is left in a partial state; the user’s recourse is #reindex!, which nukes the backend and starts over. The InMemory backend resets on process restart anyway; only Chroma persists partial state, and even then a fresh #reindex! recovers cleanly.

Constant Summary

LOGGER =
Pikuri.logger_for('VectorDb::Indexer')
DENYLIST =

Basenames that are skipped during the walk. Targets the cruft people accidentally have inside a notes folder they’ve put under sources: — a cloned repo, a Python workspace, a build directory. Conservative; configurable ignore rules are a follow-up (see IDEAS.md §“Vector DB / RAG” → “Open questions”).

%w[
  .git
  node_modules
  __pycache__
  venv
  target
  build
  dist
  out
  vendor
].freeze
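One way such a denylist walk could look, using the standard library's Find module. This is a sketch, not the indexer's actual enumeration: walk_indexable and TOY_DENYLIST are hypothetical names, and pruning whole dot-directories (not only dot-files) is an assumption.

```ruby
require 'find'

TOY_DENYLIST = %w[.git node_modules __pycache__ venv target build dist out vendor].freeze

# Return the regular files under dir, skipping denylisted basenames and
# anything whose basename starts with a dot.
def walk_indexable(dir)
  files = []
  Find.find(dir) do |path|
    next if path == dir # never filter the root itself

    base = File.basename(path)
    if TOY_DENYLIST.include?(base) || base.start_with?('.')
      Find.prune # skips this entry and, for a directory, its whole subtree
    elsif File.file?(path)
      files << path
    end
  end
  files.sort
end
```

Matching on the basename means a denylisted directory is skipped at any depth, which is what keeps a nested node_modules or .git out of the corpus.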

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(backend:, source:, embedder:, chunker:) ⇒ Indexer

Why a single source (not an array)

v0 of this API took sources: ['~/notes', '~/docs'] and used path.relative_path_from(root) to derive Chunk#source. Two files named cooking.md under different roots would produce identical source values, hence identical “#{source}:#{index}” IDs; Backend#upsert’s replace-by-id semantics would silently let the second file’s chunks overwrite the first’s. A single source eliminates the clash entirely. Multiple-roots support is deferred; see IDEAS.md §“Vector DB / RAG” → “Deferred”.
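The clash is easy to reproduce in isolation. In the sketch below, toy_chunk_id is a hypothetical helper and a plain Hash stands in for a replace-by-id backend; the paths are made up for illustration.

```ruby
require 'pathname'

# Derive a "source:index" id relative to whichever root the file was found under.
def toy_chunk_id(root:, path:, index:)
  source = Pathname.new(path).relative_path_from(Pathname.new(root)).to_s
  "#{source}:#{index}"
end

id_a = toy_chunk_id(root: '/home/u/notes', path: '/home/u/notes/cooking.md', index: 0)
id_b = toy_chunk_id(root: '/home/u/docs',  path: '/home/u/docs/cooking.md',  index: 0)

store = {} # stands in for a replace-by-id backend
store[id_a] = 'chunk from notes'
store[id_b] = 'chunk from docs' # same key, so the first entry is silently overwritten
puts store.size
# prints: 1
```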

Parameters:

  • backend (#upsert, #query, #delete_all, #count)

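    any Backend implementation. The incremental entry points additionally call #replace_source, #delete_by_source and #sources_with_hashes.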

  • source (String, Pathname)

    path to index. A file indexes directly; a directory is walked recursively. Tilde-expanded.

  • embedder (#embed)

    an Embedder or anything else responding to embed(Array<String>) -> Array<Array<Float>>.

  • chunker (#chunk)

    a Chunker::FixedWindow or anything else responding to chunk(String) -> Array<String>.
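Because the embedder and chunker are duck-typed, test doubles are cheap to write. The sketch below shows one possible pair; SketchFixedWindow and SketchEmbedder are hypothetical classes, and the size: and overlap: options are illustrative rather than the documented Chunker::FixedWindow parameters.

```ruby
# Hypothetical fixed-window chunker satisfying chunk(String) -> Array<String>.
class SketchFixedWindow
  def initialize(size:, overlap: 0)
    raise ArgumentError, 'overlap must be smaller than size' unless overlap < size

    @size = size
    @step = size - overlap
  end

  # Slide a window of @size characters forward by @step each time.
  def chunk(text)
    chunks = []
    start = 0
    while start < text.length
      chunks << text[start, @size]
      start += @step
    end
    chunks
  end
end

# Hypothetical embedder satisfying embed(Array<String>) -> Array<Array<Float>>.
class SketchEmbedder
  # Deterministic fake vectors: [length, byte sum] per input string.
  def embed(texts)
    texts.map { |t| [t.length.to_f, t.bytes.sum.to_f] }
  end
end

puts SketchFixedWindow.new(size: 4).chunk('abcdefghij').inspect
# prints: ["abcd", "efgh", "ij"]
```

Deterministic fake vectors keep tests reproducible without a running embedding server.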



# File 'lib/pikuri/vector_db/indexer.rb', line 133

def initialize(backend:, source:, embedder:, chunker:)
  @backend  = backend
  @source   = Pathname.new(source).expand_path
  @embedder = embedder
  @chunker  = chunker
end

Instance Attribute Details

#source ⇒ Pathname (readonly)

Returns the configured source, tilde-expanded. A single file or a directory tree. The Watcher reads this to decide what to watch and how to filter events.

Returns:

  • (Pathname)



# File 'lib/pikuri/vector_db/indexer.rb', line 143

def source
  @source
end

Instance Method Details

#index_all! ⇒ Integer

Walk the source and index every reachable, non-denylisted file. Returns the total number of chunks written to the backend during this invocation.

Returns:

  • (Integer)


# File 'lib/pikuri/vector_db/indexer.rb', line 161

def index_all!
  files = enumerate_files
  if files.empty?
    LOGGER.warn("no indexable files found under source: #{@source}")
    return 0
  end

  LOGGER.info("indexing #{files.length} file(s) from #{@source}")
  started = Time.now
  total_chunks = 0
  files.each_with_index do |(root, path), i|
    total_chunks += index_file(root: root, path: path, i: i + 1, total: files.length)
  end
  LOGGER.info(format(
                'done: %d file(s), %d chunks, %.1fs',
                files.length, total_chunks, Time.now - started
              ))
  total_chunks
end

#index_if_empty! ⇒ Integer

Index only if the backend is currently empty. The boot-time entry point — InMemory backends always re-index (RAM-only); Chroma backends only re-index on first boot or after a manual reindex!.

Returns:

  • (Integer)

    total chunks indexed (0 if the backend was non-empty and indexing was skipped).



# File 'lib/pikuri/vector_db/indexer.rb', line 198

def index_if_empty!
  existing = @backend.count
  if existing.positive?
    LOGGER.info("backend already has #{existing} chunk(s); skipping boot index")
    return 0
  end
  index_all!
end

#reconcile_plan ⇒ Hash{Symbol => Array<Pathname>}

The boot reconciliation sweep, as a plan rather than an action: walk the source tree, hash every file, diff against Backend#sources_with_hashes, and return the work list. The Watcher feeds this through its single work queue so the sweep and live events share one last-intent-wins path (and so teardown can interrupt between files).

Uniform across backends with no is_a? branch: InMemory reports an empty manifest at boot (RAM reset) so every file reads as new; Chroma reports its persisted manifest so only genuinely-changed files come back. The sweep also closes the downtime gap — changes made while no Watcher was running are invisible to the filesystem events but caught by the hash diff.

Returns:

  • (Hash{Symbol => Array<Pathname>})

    {reindex:, remove:} — files to (re)index because they are new or changed, and files (as root/source paths) whose chunks should be dropped because they are gone from disk.



# File 'lib/pikuri/vector_db/indexer.rb', line 283

def reconcile_plan
  on_disk = {} # source (String) => Pathname
  enumerate_files.each { |_root, p| on_disk[relative_source(p)] = p }
  indexed = @backend.sources_with_hashes

  reindex = on_disk.select { |source, path| indexed[source] != file_hash(path) }.values
  remove  = (indexed.keys - on_disk.keys).map { |source| root.join(source) }
  { reindex: reindex, remove: remove }
end
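The diff at the heart of the method above can be exercised on plain data. In this sketch both sides are maps of source to content hash; toy_reconcile is a hypothetical name and SHA-256 is assumed only to have realistic-looking values.

```ruby
require 'digest'

# Pure-data version of the diff: both arguments map source => content hash.
def toy_reconcile(on_disk:, indexed:)
  reindex = on_disk.reject { |source, hash| indexed[source] == hash }.keys
  remove  = indexed.keys - on_disk.keys
  { reindex: reindex, remove: remove }
end

h = ->(s) { Digest::SHA256.hexdigest(s) }
on_disk = { 'a.md' => h.('same'), 'b.md' => h.('edited'), 'c.md' => h.('new') }
indexed = { 'a.md' => h.('same'), 'b.md' => h.('old'),    'd.md' => h.('gone') }

plan = toy_reconcile(on_disk: on_disk, indexed: indexed)
p plan
# reindex holds b.md (changed) and c.md (new); remove holds d.md (gone from disk)
```

With an empty manifest every file lands in reindex and nothing in remove, which is the InMemory boot case described in the method's overview.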

#reindex! ⇒ Integer

backend.delete_all followed by #index_all!. The v1 nuke-and-reload reindex path.

Returns:

  • (Integer)

    total chunks indexed.



# File 'lib/pikuri/vector_db/indexer.rb', line 185

def reindex!
  LOGGER.info('reindex: clearing backend')
  @backend.delete_all
  index_all!
end

#reindex_file!(path) ⇒ Integer

Re-index a single file in place: extract → chunk → embed → Backend#replace_source. The atomic unit of incremental work the Watcher drives on a modify/add event.

The embed happens before the backend write, so an embedder outage raises here and leaves the previously-indexed chunks untouched — see Backend::Chroma#replace_source for why the ordering matters. A file that no longer yields text (emptied, replaced by a binary, deleted under us) is treated as a removal: its stale chunks are dropped and nothing is written, so the index never retains orphans for it.

Parameters:

  • path (String, Pathname)

    the file to reindex; need not exist (a vanished file resolves to a removal).

Returns:

  • (Integer)

    number of chunks now stored for the file (0 if it was removed).

Raises:

  • (RuntimeError)

    if the embedder or backend fails — the Watcher logs it and moves on (loud, per CLAUDE.md).



# File 'lib/pikuri/vector_db/indexer.rb', line 225

def reindex_file!(path)
  path   = Pathname.new(path).expand_path
  source = relative_source(path)

  texts = begin
    chunk_texts_for(path)
  rescue ArgumentError, Errno::ENOENT, RuntimeError => e
    LOGGER.info("reindex #{source}: unindexable (#{e.message}); removing from index")
    @backend.delete_by_source(source)
    return 0
  end

  if texts.empty?
    LOGGER.info("reindex #{source}: no indexable text; removing from index")
    @backend.delete_by_source(source)
    return 0
  end

  # embed_and_build is *outside* the rescue: an embedder outage
  # raises straight out, before replace_source touches the
  # backend, so the prior chunks survive (embed-before-delete).
  chunks, vectors = embed_and_build(source: source, path: path, chunk_texts: texts)
  @backend.replace_source(source: source, chunks: chunks, vectors: vectors)
  LOGGER.info("reindex #{source}: #{chunks.length} chunk(s)")
  chunks.length
end
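The embed-before-delete ordering can be checked with a deliberately failing embedder. All names in this sketch (SketchBackend, FlakyEmbedder, sketch_reindex) are hypothetical, and the in-memory replace_source only imitates the documented keyword arguments.

```ruby
# Hypothetical in-memory backend keyed by source.
class SketchBackend
  attr_reader :by_source

  def initialize
    @by_source = {}
  end

  # Swap every chunk of one source in a single assignment.
  def replace_source(source:, chunks:, vectors:)
    @by_source[source] = chunks.zip(vectors)
  end
end

# Simulates an embedder outage.
class FlakyEmbedder
  def embed(_texts)
    raise 'embedder unavailable'
  end
end

# Embed first, write second: a failing embed never reaches the backend.
def sketch_reindex(backend:, embedder:, source:, chunk_texts:)
  vectors = embedder.embed(chunk_texts)
  backend.replace_source(source: source, chunks: chunk_texts, vectors: vectors)
  chunk_texts.length
end

backend = SketchBackend.new
backend.replace_source(source: 'a.md', chunks: ['old text'], vectors: [[1.0]])
begin
  sketch_reindex(backend: backend, embedder: FlakyEmbedder.new,
                 source: 'a.md', chunk_texts: ['new text'])
rescue RuntimeError => e
  puts "reindex failed: #{e.message}"
end
puts backend.by_source['a.md'].inspect
# prints: [["old text", [1.0]]]
```

Had the write (or a delete) come first, the same outage would have left the source with no chunks at all.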

#remove_file!(path) ⇒ void

This method returns an undefined value.

Drop a file’s chunks from the index. This is the Watcher’s response to a delete (or move-away) event. Idempotent: removing a source that isn’t indexed is a no-op.

Parameters:

  • path (String, Pathname)

    the (now-absent) file.



# File 'lib/pikuri/vector_db/indexer.rb', line 258

def remove_file!(path)
  source = relative_source(Pathname.new(path).expand_path)
  @backend.delete_by_source(source)
  LOGGER.info("removed #{source} from index")
  nil
end

#root ⇒ Pathname

The directory whose tree is indexed — the anchor for every relative Chunk#source. A directory source is its own root; a single-file source roots at its parent (so the citation is just the basename). The Watcher watches this directory.

Returns:

  • (Pathname)


# File 'lib/pikuri/vector_db/indexer.rb', line 152

def root
  @source.directory? ? @source : @source.parent
end
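A short sketch of how the rule above plays out for a single-file source. sketch_root is a hypothetical free-function copy of the method, run against a temporary directory so it has no lasting side effects.

```ruby
require 'pathname'
require 'tmpdir'

# Same rule as the method above, as a free function over any Pathname.
def sketch_root(source)
  source.directory? ? source : source.parent
end

Dir.mktmpdir do |dir|
  notes = Pathname.new(dir)
  file  = notes.join('todo.md')
  file.write('buy rice')

  # A single-file source roots at its parent, so the citation is the basename.
  puts file.relative_path_from(sketch_root(file))
  # prints: todo.md
end
```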