Class: Woods::Embedding::Indexer
Inherits: Object
- Object
- Woods::Embedding::Indexer
Defined in: lib/woods/embedding/indexer.rb
Overview
Orchestrates the indexing pipeline: reads extracted units, prepares text, generates embeddings, and stores vectors. Supports full and incremental modes with checkpoint-based resumability.
When the vector store is an in-memory adapter (responds to #each_entry and #bulk_load) and output_dir is set, a successful #index_all run also persists the stores to disk via the Snapshotter pair and atomically flips the dumps/latest pointer. Persistent backends (pgvector, Qdrant) see zero behaviour change — no Snapshotter is invoked.
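The snapshot condition described above is a duck-typed check, and it can be sketched in isolation. The following is a minimal illustration, not the library's code: `InMemoryStore`, `RemoteStore`, and `snapshot_eligible?` are hypothetical names invented for this example (the real check is the private `persistable?` method, whose body is not shown on this page).

```ruby
# Hypothetical stand-in for an in-memory adapter: it exposes the two methods
# the overview names, #each_entry and #bulk_load.
class InMemoryStore
  def initialize
    @entries = {}
  end

  # Yields each stored [id, vector] pair.
  def each_entry(&block)
    @entries.each(&block)
  end

  # Replaces the store contents with the given id => vector hash.
  def bulk_load(entries)
    @entries = entries.dup
  end
end

# Hypothetical stand-in for a persistent backend (pgvector, Qdrant, ...):
# it has neither #each_entry nor #bulk_load.
class RemoteStore
  def store(id, vector); end
end

# Assumed shape of the check: an output_dir must be set and the store must
# respond to both snapshot methods.
def snapshot_eligible?(vector_store, output_dir)
  return false if output_dir.nil? || output_dir.to_s.empty?

  vector_store.respond_to?(:each_entry) && vector_store.respond_to?(:bulk_load)
end

puts snapshot_eligible?(InMemoryStore.new, "tmp/woods") # true
puts snapshot_eligible?(RemoteStore.new, "tmp/woods")   # false
puts snapshot_eligible?(InMemoryStore.new, nil)         # false
```

Checking capabilities rather than class names means a new in-memory adapter would opt in simply by implementing both methods.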
Instance Method Summary
- #index_all ⇒ Hash
  Index all extracted units (full mode).
- #index_incremental ⇒ Hash
  Index only changed units (incremental mode).
- #initialize(provider:, text_preparer:, vector_store:, output_dir:, chunker: Chunking::SemanticChunker.new, batch_size: 32, checkpoint_interval: 10, metadata_store: nil, resolved_config: nil, dump_retention_count: 3) ⇒ Indexer (constructor)
  A new instance of Indexer.
Constructor Details
#initialize(provider:, text_preparer:, vector_store:, output_dir:, chunker: Chunking::SemanticChunker.new, batch_size: 32, checkpoint_interval: 10, metadata_store: nil, resolved_config: nil, dump_retention_count: 3) ⇒ Indexer
Returns a new instance of Indexer.
# File 'lib/woods/embedding/indexer.rb', line 34
def initialize(provider:, text_preparer:, vector_store:, output_dir:, # rubocop:disable Metrics/ParameterLists
               chunker: Chunking::SemanticChunker.new, batch_size: 32, checkpoint_interval: 10,
               metadata_store: nil, resolved_config: nil, dump_retention_count: 3)
  @provider = provider
  @text_preparer = text_preparer
  @vector_store = vector_store
  @output_dir = output_dir
  @chunker = chunker
  @batch_size = batch_size
  @checkpoint_interval = checkpoint_interval
  @metadata_store = metadata_store
  @resolved_config = resolved_config
  @dump_retention_count = dump_retention_count
end
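How `batch_size` and `checkpoint_interval` interact is not spelled out on this page. One plausible reading, shown here purely as an assumption, is that units are processed in slices of `batch_size` and progress is recorded after every `checkpoint_interval` batches. `process_in_batches` is a hypothetical helper, not part of the Indexer API.

```ruby
# Assumption: the interval counts batches, not individual units, and a final
# checkpoint is written once all units are processed.
def process_in_batches(units, batch_size: 32, checkpoint_interval: 10)
  checkpoints = []
  processed = 0

  units.each_slice(batch_size).with_index(1) do |batch, batch_number|
    processed += batch.size # stand-in for "embed and store this batch"
    checkpoints << processed if (batch_number % checkpoint_interval).zero?
  end

  # Record the tail end of the run unless it already landed on a checkpoint.
  checkpoints << processed if processed.positive? && checkpoints.last != processed
  { processed: processed, checkpoints: checkpoints }
end

result = process_in_batches((1..1000).to_a)
puts result[:checkpoints].inspect # [320, 640, 960, 1000]
```

Under this reading, the defaults would bound the work lost on an interruption to at most 320 units (10 batches of 32).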
Instance Method Details
#index_all ⇒ Hash
Index all extracted units (full mode). Returns stats hash.
When the vector store is an in-memory adapter, persists the embedded vectors (and metadata, if a metadata store was provided) to disk under output_dir/dumps/<timestamp>/ and atomically flips the latest pointer. Writes woods.json when resolved_config was supplied.
# File 'lib/woods/embedding/indexer.rb', line 60
def index_all
  stats = process_units(load_units, incremental: false)
  persist_snapshot if persistable?
  stats
end
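The "write a timestamped dump, then atomically flip latest" pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Snapshotter implementation: it assumes `latest` is a plain file holding the dump directory name, that timestamps sort lexicographically, and that pruning keeps the newest `dump_retention_count` directories. `write_snapshot` and `vectors.json` are hypothetical names.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: writes one dump directory, flips dumps/latest, prunes.
def write_snapshot(output_dir, stamp, payload, retention: 3)
  dumps = File.join(output_dir, "dumps")
  dump_dir = File.join(dumps, stamp)
  FileUtils.mkdir_p(dump_dir)
  File.write(File.join(dump_dir, "vectors.json"), payload)

  # Write the pointer to a temp file, then rename it over the old pointer.
  # On POSIX, rename within one filesystem replaces the target atomically,
  # so readers see either the old or the new pointer, never a partial write.
  pointer = File.join(dumps, "latest")
  tmp = "#{pointer}.tmp"
  File.write(tmp, stamp)
  File.rename(tmp, pointer)

  # Prune the oldest dump directories beyond the retention count.
  dirs = Dir.children(dumps).select { |c| File.directory?(File.join(dumps, c)) }.sort
  dirs.first([dirs.size - retention, 0].max).each do |old|
    FileUtils.rm_rf(File.join(dumps, old))
  end
  dump_dir
end

Dir.mktmpdir do |dir|
  %w[20240101T000000 20240102T000000 20240103T000000 20240104T000000].each do |stamp|
    write_snapshot(dir, stamp, "{}")
  end
  puts File.read(File.join(dir, "dumps", "latest")) # 20240104T000000
end
```

Flipping the pointer only after the dump is fully written means an interrupted run leaves the previous snapshot in place as `latest`.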
#index_incremental ⇒ Hash
Index only changed units (incremental mode). Returns stats hash.
# File 'lib/woods/embedding/indexer.rb', line 68
def index_incremental
  process_units(load_units, incremental: true)
end
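This page does not show how `process_units` decides which units count as changed. One common approach, offered here only as an assumption, is to compare a content digest of each unit against the digest recorded on the previous run. `changed_units` and the hash shapes below are hypothetical.

```ruby
require "digest"

# Hypothetical shapes: units as { id => source_text }, and the previous
# run's record as { id => sha256_hex }.
def changed_units(units, previous_digests)
  units.reject do |id, text|
    previous_digests[id] == Digest::SHA256.hexdigest(text)
  end
end

previous = { "User" => Digest::SHA256.hexdigest("class User; end") }
units = {
  "User" => "class User; end",  # unchanged since the last run
  "Order" => "class Order; end" # new, so it has no recorded digest
}

puts changed_units(units, previous).keys.inspect # ["Order"]
```

Note that `index_incremental`, as listed above, does not call `persist_snapshot`; only `index_all` does.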