Class: Woods::Embedding::Indexer

Inherits:
Object
Defined in:
lib/woods/embedding/indexer.rb

Overview

Orchestrates the indexing pipeline: reads extracted units, prepares text, generates embeddings, and stores vectors. Supports full and incremental modes with checkpoint-based resumability.
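The batch-and-checkpoint flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Indexer's actual implementation: the helper name index_in_batches, the JSON checkpoint file, and the use of plain strings as units are all hypothetical.

```ruby
require "json"

# Hypothetical sketch: process units in batches and save a checkpoint every
# `checkpoint_interval` batches so an interrupted run can resume.
# The checkpoint format (a JSON array of finished unit ids) is an assumption.
def index_in_batches(units, batch_size:, checkpoint_interval:, checkpoint_path:)
  done = File.exist?(checkpoint_path) ? JSON.parse(File.read(checkpoint_path)) : []
  stats = { processed: 0, skipped: 0, errors: 0 }

  units.each_slice(batch_size).with_index(1) do |batch, batch_number|
    batch.each do |unit|
      if done.include?(unit)
        stats[:skipped] += 1 # already handled in an earlier run
        next
      end
      begin
        yield unit # stand-in for "prepare text, embed, store vector"
        done << unit
        stats[:processed] += 1
      rescue StandardError
        stats[:errors] += 1
      end
    end
    # Persist progress every N batches.
    File.write(checkpoint_path, JSON.generate(done)) if (batch_number % checkpoint_interval).zero?
  end

  File.write(checkpoint_path, JSON.generate(done)) # final checkpoint
  stats
end
```

Running the helper a second time against the same checkpoint file counts every unit as skipped, which is the resumability property the overview refers to.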

When the vector store is an in-memory adapter (responds to #each_entry and #bulk_load) and output_dir is set, a successful #index_all run also persists the stores to disk via the Snapshotter pair and atomically flips the dumps/latest pointer. Persistent backends (pgvector, Qdrant) see zero behaviour change — no Snapshotter is invoked.
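The in-memory check described above is a duck-typing test. A minimal sketch, assuming a hypothetical helper named in_memory_adapter? and two toy store classes that exist only for this illustration (none of these names come from Woods):

```ruby
# Hypothetical helper: treats a store as "in-memory" when it exposes both
# #each_entry and #bulk_load, mirroring the condition stated in the overview.
def in_memory_adapter?(store)
  store.respond_to?(:each_entry) && store.respond_to?(:bulk_load)
end

# Toy in-memory store, for illustration only.
class ToyMemoryStore
  def initialize
    @entries = {}
  end

  def each_entry(&block)
    @entries.each(&block)
  end

  def bulk_load(entries)
    @entries.merge!(entries)
  end
end

# Toy persistent-style store: no #each_entry or #bulk_load, so it is never snapshotted.
class ToyRemoteStore
  def store(id, vector); end
end

in_memory_adapter?(ToyMemoryStore.new) # true
in_memory_adapter?(ToyRemoteStore.new) # false
```

Because the check is by method presence rather than by class, any adapter that implements both methods would opt in to snapshotting under this sketch.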

Instance Method Summary

Constructor Details

#initialize(provider:, text_preparer:, vector_store:, output_dir:, chunker: Chunking::SemanticChunker.new, batch_size: 32, checkpoint_interval: 10, metadata_store: nil, resolved_config: nil, dump_retention_count: 3) ⇒ Indexer

Returns a new instance of Indexer.

Parameters:

  • chunker (Chunking::SemanticChunker, nil) (defaults to: Chunking::SemanticChunker.new)

    Splits oversize units into semantically coherent chunks before embedding. nil disables chunking — units go to the provider whole (useful in tests).

  • checkpoint_interval (Integer) (defaults to: 10)

    Save checkpoint every N batches (default: 10)

  • metadata_store (#each_entry, #bulk_load, nil) (defaults to: nil)

    Optional metadata store. When present alongside an in-memory vector store, both are persisted at the end of a successful #index_all run.

  • resolved_config (Woods::ResolvedConfig, nil) (defaults to: nil)

    Captured config for woods.json — written to output_dir on #index_all completion.

  • dump_retention_count (Integer) (defaults to: 3)

    Number of completed dump directories to keep under output_dir/dumps/. Older dumps are removed after a successful #index_all run (default: 3).
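The retention behaviour of dump_retention_count could look something like the sketch below. It assumes that dump directory names are timestamps that sort lexicographically in chronological order; the helper name prune_dumps is hypothetical and the real cleanup logic may differ.

```ruby
require "fileutils"

# Hypothetical sketch: keep the `keep` newest dump directories and delete the rest.
# Assumes directory names sort chronologically (e.g. "20240101T120000").
def prune_dumps(dumps_dir, keep)
  dirs = Dir.children(dumps_dir)
            .map { |name| File.join(dumps_dir, name) }
            .select { |path| File.directory?(path) && !File.symlink?(path) }
            .sort
  stale = dirs.first([dirs.size - keep, 0].max) # everything except the newest `keep`
  stale.each { |path| FileUtils.rm_rf(path) }
  stale.map { |path| File.basename(path) }
end
```

Symlinks and plain files (such as a latest pointer) are skipped, so only real dump directories are candidates for removal.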



# File 'lib/woods/embedding/indexer.rb', line 34

def initialize(provider:, text_preparer:, vector_store:, output_dir:, # rubocop:disable Metrics/ParameterLists
               chunker: Chunking::SemanticChunker.new,
               batch_size: 32, checkpoint_interval: 10,
               metadata_store: nil,
               resolved_config: nil,
               dump_retention_count: 3)
  @provider = provider
  @text_preparer = text_preparer
  @vector_store = vector_store
  @output_dir = output_dir
  @chunker = chunker
  @batch_size = batch_size
  @checkpoint_interval = checkpoint_interval
  @metadata_store = metadata_store
  @resolved_config = resolved_config
  @dump_retention_count = dump_retention_count
end

Instance Method Details

#index_all ⇒ Hash

Index all extracted units (full mode). Returns stats hash.

When the vector store is an in-memory adapter, persists the embedded vectors (and metadata, if a metadata store was provided) to disk under output_dir/dumps/<timestamp>/ and atomically flips the latest pointer. Writes woods.json when resolved_config was supplied.

Returns:

  • (Hash)

    Stats with :processed, :skipped, :errors counts



# File 'lib/woods/embedding/indexer.rb', line 60

def index_all
  stats = process_units(load_units, incremental: false)
  persist_snapshot if persistable?
  stats
end
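One common way to flip a pointer atomically is to write the new value to a temporary file and rename it over the old one. The sketch below assumes the latest pointer is a plain file holding the dump directory name; the actual Snapshotter may use a symlink or another mechanism, and flip_latest_pointer is a hypothetical name.

```ruby
# Hypothetical sketch: atomically point `dumps/latest` at a completed dump.
# Assumes `latest` is a plain file containing the dump directory name.
def flip_latest_pointer(dumps_dir, dump_name)
  pointer = File.join(dumps_dir, "latest")
  tmp = "#{pointer}.tmp-#{Process.pid}"
  File.write(tmp, dump_name)
  # rename(2) replaces the target atomically on POSIX systems when both
  # paths live on the same filesystem, so readers never see a partial write.
  File.rename(tmp, pointer)
  pointer
end
```

With this approach a reader sees either the previous dump name or the new one, never an empty or half-written pointer.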

#index_incremental ⇒ Hash

Index only changed units (incremental mode). Returns stats hash.

Returns:

  • (Hash)

    Stats with :processed, :skipped, :errors counts



# File 'lib/woods/embedding/indexer.rb', line 68

def index_incremental
  process_units(load_units, incremental: true)
end