Class: Pikuri::VectorDb::Indexer

Inherits:
Object
Defined in:
lib/pikuri/vector_db/indexer.rb

Overview

The orchestrating piece of the vectordb pipeline. Walks the configured source, enumerates indexable files (skipping DENYLIST entries and dotfiles), extracts text via FileType.read_as_text, chunks it via the configured Chunker, embeds the chunks via the configured Embedder, and writes the result with Backend#upsert.

Three public entry points

  • #index_all! unconditionally indexes every source file. The caller is responsible for clearing the backend first if a full re-index is wanted.

  • #reindex! is the convenience for “nuke and re-index”: backend.delete_all followed by index_all!. This is the v1 nuke-and-reload path, and what Extension wires to the eventual vectordb_reindex tool.

  • #index_if_empty! indexes only if backend.count.zero?. This is the boot-time path: Backend::InMemory always indexes (RAM-only, so always empty at boot); Backend::Chroma indexes only on first boot or after a manual reindex.

Logging

Each indexed file emits one INFO line through LOGGER, prefixed with a [i/total] progress counter (e.g. [3/50] cooking/risotto.md: 7 chunks). Indexing local corpora against a local llama.cpp embedder takes minutes, and the user is blocked on the agent boot, so progress visibility is load-bearing here. Hosts with a richer output channel (a future TUI, a web client) can mute or reroute the PIKURI_LOG_VECTORDB stream — see Pikuri.logger_for.
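As a rough illustration of the progress-line shape described above, the following sketch formats one line per file and sends it through a standard-library Logger. The progress_line helper and the StringIO sink are hypothetical stand-ins; the real LOGGER comes from Pikuri.logger_for, and its exact output format is not shown in this page.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: builds the "[i/total] path: N chunks" progress line.
def progress_line(i:, total:, source:, chunk_count:)
  format('[%d/%d] %s: %d chunks', i, total, source, chunk_count)
end

io = StringIO.new
logger = Logger.new(io) # stand-in for the Indexer's LOGGER constant

# One INFO line per indexed file, one WARN line per skipped file.
logger.info(progress_line(i: 3, total: 50, source: 'cooking/risotto.md', chunk_count: 7))
logger.warn('skipping scan.pdf: file produced no text')

puts io.string
```

Routing the lines through a logger rather than puts is what lets a host mute or redirect the stream without touching the indexer.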

WARN lines surface skip reasons: image / binary file in the corpus, source path doesn’t exist, file produced no text (scanned-image PDF, empty file).

Chunk identity

Chunk#id is “source:index”: the relative source path plus the chunk’s ordinal within that file. This form is readable in logs, deterministic, and makes the “one source, many chunks” rule visible. Hash-based IDs would be forward-compatible with content-addressing for incremental re-index, but the v1 nuke-and-reload path doesn’t need that, and the readable form is more useful day-to-day.

Chunk#source is the path *relative to the source root the file was found under*: short, citation-friendly, and stable when the corpus is moved. Absolute paths would tie the backend to a particular machine layout; relative paths travel.
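The identity scheme can be sketched with Pathname alone. SketchChunk and build_chunks are hypothetical names used only for this illustration, and the zero-based ordinal is an assumption; the page does not say whether the real index starts at 0 or 1.

```ruby
require 'pathname'

# Hypothetical stand-in for Pikuri's Chunk; only the fields discussed here.
SketchChunk = Struct.new(:id, :source, :text, keyword_init: true)

# Derives a root-relative source and "source:index" IDs for one file's chunks.
def build_chunks(root:, path:, texts:)
  # relative_path_from is purely lexical, so no filesystem access is needed.
  source = Pathname.new(path).relative_path_from(Pathname.new(root)).to_s
  texts.each_with_index.map do |text, index|
    # Assumption: ordinals are zero-based.
    SketchChunk.new(id: "#{source}:#{index}", source: source, text: text)
  end
end

chunks = build_chunks(
  root: '/home/me/notes',
  path: '/home/me/notes/cooking/risotto.md',
  texts: ['Toast the rice.', 'Add stock slowly.']
)
chunks.map(&:id) # => ["cooking/risotto.md:0", "cooking/risotto.md:1"]
```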

Errors mid-indexing

If the embedder fails mid-run (network blip, provider 5xx), the exception propagates and indexing aborts. This follows the CLAUDE.md rule “Errors are loud”: the caller is internal pikuri code, not the LLM. The backend is left in a partial state; the user’s recourse is #reindex!, which clears the backend and starts over. The InMemory backend resets on process restart anyway; only Chroma persists partial state, and even then a fresh #reindex! recovers cleanly.
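The failure mode can be reproduced in miniature. Everything in this sketch is a hypothetical stand-in (SketchBackend, FlakyEmbedder, the index_all helper, and the upsert(id, vector) signature are all assumptions); it only illustrates that an unrescued embedder error leaves earlier writes in place, and that clearing and re-running recovers.

```ruby
# Hypothetical in-memory backend with replace-by-id semantics.
class SketchBackend
  def initialize
    @rows = {}
  end

  def upsert(id, vector)
    @rows[id] = vector
  end

  def delete_all
    @rows.clear
  end

  def count
    @rows.size
  end
end

# Hypothetical embedder that fails on its Nth call, like a provider 5xx.
class FlakyEmbedder
  def initialize(fail_on:)
    @fail_on = fail_on
    @calls = 0
  end

  def embed(texts)
    @calls += 1
    raise IOError, 'provider returned 503' if @calls == @fail_on

    texts.map { |t| [t.length.to_f] }
  end
end

def index_all(backend, embedder, files)
  files.each do |name, text|
    vector = embedder.embed([text]).first # no rescue: failures propagate
    backend.upsert("#{name}:0", vector)
  end
  backend.count
end

files = { 'a.md' => 'alpha', 'b.md' => 'beta', 'c.md' => 'gamma' }
backend = SketchBackend.new
embedder = FlakyEmbedder.new(fail_on: 2)

partial = nil
begin
  index_all(backend, embedder, files)
rescue IOError
  partial = backend.count # a.md was written before b.md failed
end

# Recovery mirrors the reindex! idea: clear everything, then start over.
backend.delete_all
recovered = index_all(backend, embedder, files)
```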

Constant Summary

LOGGER =
Pikuri.logger_for('VectorDb::Indexer')
DENYLIST =

Basenames that are skipped during the walk. Targets the cruft people accidentally have inside a notes folder they’ve put under sources: — a cloned repo, a Python workspace, a build directory. Conservative; configurable ignore rules are a follow-up (see IDEAS.md §“Vector DB / RAG” → “Open questions”).

%w[
  .git
  node_modules
  __pycache__
  venv
  target
  build
  dist
  out
  vendor
].freeze
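A minimal sketch of a denylist-aware walk, assuming the standard-library Find module and assuming that denylisted and dot-prefixed basenames are skipped for files and directories alike. The private enumeration method used by the Indexer is not shown on this page, so the helper below is illustrative only.

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

SKETCH_DENYLIST = %w[.git node_modules __pycache__ venv target build dist out vendor].freeze

# Returns sorted root-relative paths of the files that survive the filter.
def enumerate_files(root)
  found = []
  Find.find(root) do |path|
    next if path == root

    base = File.basename(path)
    if SKETCH_DENYLIST.include?(base) || base.start_with?('.')
      Find.prune if File.directory?(path) # skip the whole subtree
      next
    end
    found << path.delete_prefix("#{root}/") if File.file?(path)
  end
  found.sort
end

Dir.mktmpdir do |root|
  %w[cooking node_modules/pkg .git].each { |d| FileUtils.mkdir_p(File.join(root, d)) }
  %w[cooking/risotto.md node_modules/pkg/readme.md .git/config .hidden.md].each do |f|
    File.write(File.join(root, f), 'text')
  end
  p enumerate_files(root) # only cooking/risotto.md survives
end
```

Pruning at the directory level matters for trees like node_modules, where descending just to discard every file would dominate the walk.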

Constructor Details

#initialize(backend:, source:, embedder:, chunker:) ⇒ Indexer

Why a single source (not an array)

v0 of this API took sources: ['~/notes', '~/docs'] and used path.relative_path_from(root) to derive Chunk#source. Two files named cooking.md under different roots would produce identical source values, hence identical “#{source}:#{offset}” IDs; Backend#upsert’s replace-by-id semantics would silently let the second file’s chunks overwrite the first’s. A single source eliminates the clash entirely. Multiple-roots support is deferred; see IDEAS.md §“Vector DB / RAG” → “Deferred”.
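The clash is easy to reproduce. In this sketch a plain Hash stands in for a backend with replace-by-id upsert semantics, and chunk_id is a hypothetical helper that follows the relative-path-plus-ordinal scheme described above.

```ruby
require 'pathname'

# Hypothetical helper: root-relative path plus an ordinal.
def chunk_id(root, path, index)
  source = Pathname.new(path).relative_path_from(Pathname.new(root)).to_s
  "#{source}:#{index}"
end

notes_id = chunk_id('/home/me/notes', '/home/me/notes/cooking.md', 0)
docs_id  = chunk_id('/home/me/docs',  '/home/me/docs/cooking.md', 0)

# A Hash stands in for replace-by-id upsert semantics.
store = {}
store[notes_id] = 'chunk from ~/notes'
store[docs_id]  = 'chunk from ~/docs'

notes_id == docs_id # => true
store.size          # => 1
store[notes_id]     # => "chunk from ~/docs"
```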

Parameters:

  • backend (#upsert, #query, #delete_all, #count)

    any Backend implementation.

  • source (String, Pathname)

    path to index. A file indexes directly; a directory is walked recursively. Tilde-expanded.

  • embedder (#embed)

    an Embedder or anything else responding to embed(Array<String>) -> Array<Array<Float>>.

  • chunker (#chunk)

    a Chunker::FixedWindow or anything else responding to chunk(String) -> Array<String>.



# File 'lib/pikuri/vector_db/indexer.rb', line 115

def initialize(backend:, source:, embedder:, chunker:)
  @backend  = backend
  @source   = Pathname.new(source).expand_path
  @embedder = embedder
  @chunker  = chunker
end
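Because embedder and chunker are duck-typed, small stand-ins that honor the documented signatures are enough for tests. The two classes below are hypothetical and are not Pikuri's Embedder or Chunker::FixedWindow; the non-overlapping window and the toy two-dimensional vectors are assumptions made purely for illustration.

```ruby
# Hypothetical chunker honoring chunk(String) -> Array<String>.
class SketchWindowChunker
  def initialize(size:)
    @size = size
  end

  # Splits text into consecutive, non-overlapping windows of at most @size characters.
  def chunk(text)
    text.scan(/.{1,#{@size}}/m)
  end
end

# Hypothetical embedder honoring embed(Array<String>) -> Array<Array<Float>>.
class SketchEmbedder
  # Toy vector: [length, vowel count]. Deterministic, so assertions stay stable.
  def embed(texts)
    texts.map { |t| [t.length.to_f, t.count('aeiou').to_f] }
  end
end

chunker  = SketchWindowChunker.new(size: 4)
embedder = SketchEmbedder.new

pieces  = chunker.chunk('abcdefghij') # => ["abcd", "efgh", "ij"]
vectors = embedder.embed(pieces)      # => [[4.0, 1.0], [4.0, 1.0], [2.0, 1.0]]
```

A deterministic fake embedder keeps indexer tests fast and avoids depending on a running llama.cpp server.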

Instance Method Details

#index_all! ⇒ Integer

Walks the source and indexes every reachable non-denylisted file. Returns the total number of chunks written to the backend during this invocation.

Returns:

  • (Integer)


# File 'lib/pikuri/vector_db/indexer.rb', line 127

def index_all!
  files = enumerate_files
  if files.empty?
    LOGGER.warn("no indexable files found under source: #{@source}")
    return 0
  end

  LOGGER.info("indexing #{files.length} file(s) from #{@source}")
  started = Time.now
  total_chunks = 0
  files.each_with_index do |(root, path), i|
    total_chunks += index_file(root: root, path: path, i: i + 1, total: files.length)
  end
  LOGGER.info(format(
                'done: %d file(s), %d chunks, %.1fs',
                files.length, total_chunks, Time.now - started
              ))
  total_chunks
end

#index_if_empty! ⇒ Integer

Index only if the backend is currently empty. The boot-time entry point — InMemory backends always re-index (RAM-only); Chroma backends only re-index on first boot or after a manual reindex!.

Returns:

  • (Integer)

    total chunks indexed (0 if backend was non-empty and the indexer was skipped).



# File 'lib/pikuri/vector_db/indexer.rb', line 164

def index_if_empty!
  existing = @backend.count
  if existing.positive?
    LOGGER.info("backend already has #{existing} chunk(s); skipping boot index")
    return 0
  end
  index_all!
end

#reindex! ⇒ Integer

backend.delete_all followed by #index_all!. The v1 nuke-and-reload reindex path.

Returns:

  • (Integer)

    total chunks indexed.



# File 'lib/pikuri/vector_db/indexer.rb', line 151

def reindex!
  LOGGER.info('reindex: clearing backend')
  @backend.delete_all
  index_all!
end