Class: Woods::Builder
- Inherits: Object
  - Object
  - Woods::Builder
- Defined in: lib/woods/builder.rb
Overview
Builder reads a Configuration and instantiates the appropriate adapters, returning a fully wired Retriever ready for use.
Named presets are provided for common deployment scenarios. All presets can be further customized with a block passed to configure_with_preset.
Constant Summary
- PRESETS =
Named presets mapping to default adapter types.
- :local - fully local, no external services required (requires the sqlite3 gem).
- :shared_filesystem - Shape 2: rake embed → separate MCP server reads from disk. All stores are in-memory and persisted to output_dir via the Snapshotter. No sqlite3 gem needed. Requires output_dir to be set and readable by both the embed process and the MCP server.
- :postgresql - pgvector for vectors, OpenAI for embeddings.
- :production - Qdrant for vectors, OpenAI for embeddings.
{
  local: {
    vector_store: :in_memory,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :ollama
  },
  shared_filesystem: {
    vector_store: :in_memory,
    metadata_store: :in_memory,
    graph_store: :in_memory,
    embedding_provider: :ollama
  },
  postgresql: {
    vector_store: :pgvector,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :openai
  },
  production: {
    vector_store: :qdrant,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :openai
  }
}.freeze
Class Method Summary
- .preset_config(name) ⇒ Configuration
  Build a Configuration populated with the named preset’s adapter types.
Instance Method Summary
- #build_chunker(provider) ⇒ Chunking::SemanticChunker
  Build a Chunking::SemanticChunker sized to a given provider.
- #build_embedding_provider ⇒ Embedding::Provider::Interface
  Instantiate the embedding provider specified by the configuration.
- #build_graph_store ⇒ Storage::GraphStore::Interface
  Instantiate the graph store adapter specified by the configuration.
- #build_metadata_store ⇒ Storage::MetadataStore::Interface
  Instantiate the metadata store adapter specified by the configuration.
- #build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil) ⇒ Retriever, Cache::CachedRetriever
  Build a Retriever wired with adapters from the configuration.
- #build_text_preparer(provider) ⇒ Embedding::TextPreparer
  Build an Embedding::TextPreparer calibrated to a given provider.
- #build_vector_store ⇒ Storage::VectorStore::Interface
  Instantiate the vector store adapter specified by the configuration.
- #initialize(config = Woods.configuration) ⇒ Builder (constructor)
  A new instance of Builder.
Constructor Details
#initialize(config = Woods.configuration) ⇒ Builder
Returns a new instance of Builder.
# File 'lib/woods/builder.rb', line 85
def initialize(config = Woods.configuration)
  @config = config
end
Class Method Details
.preset_config(name) ⇒ Configuration
Build a Configuration populated with the named preset’s adapter types.
# File 'lib/woods/builder.rb', line 75
def self.preset_config(name)
  preset = PRESETS.fetch(name) do
    raise ArgumentError, "Unknown preset: #{name}. Valid: #{PRESETS.keys.join(', ')}"
  end
  config = Configuration.new
  preset.each { |key, value| config.public_send(:"#{key}=", value) }
  config
end
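The lookup-and-assign pattern in preset_config can be tried in isolation. The sketch below is not the library's code: `DemoConfiguration`, `DEMO_PRESETS`, and `demo_preset_config` are hypothetical stand-ins, with a Struct in place of the real Configuration class and only two of the presets copied over.

```ruby
# Hypothetical stand-in for Configuration; the real class has more fields.
DemoConfiguration = Struct.new(:vector_store, :metadata_store, :graph_store, :embedding_provider)

# Two of the documented presets, copied for illustration only.
DEMO_PRESETS = {
  local: { vector_store: :in_memory, metadata_store: :sqlite,
           graph_store: :in_memory, embedding_provider: :ollama },
  production: { vector_store: :qdrant, metadata_store: :sqlite,
                graph_store: :in_memory, embedding_provider: :openai }
}.freeze

def demo_preset_config(name)
  # Hash#fetch runs its block only when the key is missing.
  preset = DEMO_PRESETS.fetch(name) do
    raise ArgumentError, "Unknown preset: #{name}. Valid: #{DEMO_PRESETS.keys.join(', ')}"
  end

  config = DemoConfiguration.new
  # Each preset key maps onto the writer method of the same name.
  preset.each { |key, value| config.public_send(:"#{key}=", value) }
  config
end

config = demo_preset_config(:production)
puts config.vector_store

begin
  demo_preset_config(:staging)
rescue ArgumentError => e
  puts e.message
end
```

Using `fetch` with a block rather than `[]` means an unknown preset name fails immediately with the list of valid names, instead of returning nil and failing later.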
Instance Method Details
#build_chunker(provider) ⇒ Chunking::SemanticChunker
Build a Chunking::SemanticChunker sized to a given provider.
`max_chars` is derived from the provider’s input budget and the matching chars-per-token ratio, minus the context-prefix allowance the Indexer accounts for separately. Units that exceed this ceiling get sliced so no single chunk can blow the provider’s input cap.
For Ollama (and other BERT/WordPiece-backed models), char-based estimation is unreliable: CamelCase, `::` separators, and symbol literals tokenize much more densely than chars/token averages suggest. When the optional `tokenizers` gem is installed, pass an Embedding::TokenCounter and `max_tokens` so the chunker can verify every slice with the real tokenizer and re-split any piece that still exceeds `num_ctx`. See docs/EMBEDDING_MODELS.md.
Ollama v0.13.5+ stopped honouring `truncate: true` on `/api/embed` (ollama/ollama#14186), so any chunk that exceeds `num_ctx` returns a 400 rather than being silently truncated. Exact client-side sizing is the only reliable path until the regression is fixed upstream.
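The verify-and-re-split idea described above can be sketched as follows. This is an illustration under loud assumptions, not SemanticChunker's implementation: `DEMO_TOKEN_COUNTER` is a hypothetical regex-based stand-in for a real tokenizer, and `demo_fit_to_tokens` simply halves an oversized piece, whereas the real chunker may well split on line boundaries.

```ruby
# Hypothetical stand-in for a real tokenizer: counts lowercase runs, CamelCase
# humps, and each punctuation character as separate tokens.
DEMO_TOKEN_COUNTER = ->(text) { text.scan(/[A-Z]?[a-z0-9]+|[A-Z]+|[^\sA-Za-z0-9]/).size }

# Re-split a piece until every resulting piece fits within max_tokens.
def demo_fit_to_tokens(piece, max_tokens, counter: DEMO_TOKEN_COUNTER)
  # Stop when the piece fits, or when it cannot be divided any further.
  return [piece] if counter.call(piece) <= max_tokens || piece.length <= 1

  mid = piece.length / 2
  [piece[0...mid], piece[mid..]].flat_map do |half|
    demo_fit_to_tokens(half, max_tokens, counter: counter)
  end
end

sample = "ActionMailer::Base.default_url_options = { host: 'example.com' }\n" * 4
pieces = demo_fit_to_tokens(sample, 16)

puts pieces.size
puts pieces.map { |piece| DEMO_TOKEN_COUNTER.call(piece) }.max
```

The property that matters is that every piece is checked with the counter itself rather than a chars/token estimate, so no piece sent to the provider can exceed the limit.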
# File 'lib/woods/builder.rb', line 217
def build_chunker(provider)
  budget = provider.respond_to?(:max_input_tokens) ? provider.max_input_tokens : nil
  max_chars = ((budget * chars_per_token_for(provider)).floor - CHUNKER_PREFIX_ALLOWANCE if budget)

  # Guard against a budget so small that the prefix allowance leaves
  # no room for content. Without this, SemanticChunker#slice_by_lines
  # passes a negative repeat count to String#scan, which returns []
  # — every chunk becomes empty and is silently dropped, producing
  # zero embeddings with no error. Surface the misconfiguration loudly.
  raise ArgumentError, (provider, budget) if max_chars && max_chars <= 0

  token_counter = token_counter_for(provider)
  max_tokens = token_counter && budget ? budget - PREFIX_TOKEN_ALLOWANCE : nil

  Chunking::SemanticChunker.new(
    max_chars: max_chars,
    token_counter: token_counter,
    max_tokens: max_tokens
  )
end
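The `max_chars` derivation and its guard can be traced with concrete numbers. In this sketch `DEMO_PREFIX_ALLOWANCE` is an assumed value chosen for illustration, since the real CHUNKER_PREFIX_ALLOWANCE is not shown on this page, and `demo_max_chars` is a hypothetical helper rather than part of Builder.

```ruby
# Assumed value for illustration; the real CHUNKER_PREFIX_ALLOWANCE is not shown here.
DEMO_PREFIX_ALLOWANCE = 512

def demo_max_chars(budget, chars_per_token, allowance: DEMO_PREFIX_ALLOWANCE)
  # A provider that reports no input budget yields no character ceiling.
  return nil unless budget

  max_chars = (budget * chars_per_token).floor - allowance
  # Fail loudly rather than hand the chunker a zero or negative ceiling.
  raise ArgumentError, "budget #{budget} leaves no room for content" if max_chars <= 0

  max_chars
end

puts demo_max_chars(8192, 1.5) # (8192 * 1.5).floor - 512
puts demo_max_chars(8192, 4.0) # (8192 * 4.0).floor - 512
p demo_max_chars(nil, 1.5)

begin
  demo_max_chars(256, 1.5) # 384 - 512 is negative
rescue ArgumentError => e
  puts e.message
end
```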
#build_embedding_provider ⇒ Embedding::Provider::Interface
Instantiate the embedding provider specified by the configuration.
Strips `embedding_options` keys that belong to the ResolvedConfig layer (like `:dimension`) before splatting into the provider’s constructor. Those keys are useful for the Snapshotter’s schema header but aren’t part of the provider’s API.
# File 'lib/woods/builder.rb', line 147
def build_embedding_provider
  opts = provider_kwargs
  case @config.embedding_provider
  when :openai then Embedding::Provider::OpenAI.new(**opts)
  when :ollama then Embedding::Provider::Ollama.new(**opts)
  else raise ArgumentError, "Unknown embedding_provider: #{@config.embedding_provider}"
  end
end
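Why the extra keys have to be stripped before the splat can be seen with a small stand-in. Everything named `Demo…` or `demo_…` here is hypothetical, including the key list and the provider's keyword arguments; the source only names `:dimension` as an example of a ResolvedConfig-layer key.

```ruby
# Hypothetical list of keys owned by the ResolvedConfig layer.
DEMO_RESOLVED_ONLY_KEYS = %i[dimension].freeze

# Hypothetical provider whose constructor accepts only its own keywords.
class DemoProvider
  attr_reader :model, :host

  def initialize(model:, host: 'http://localhost:11434')
    @model = model
    @host = host
  end
end

# Drop the keys the provider's constructor does not know about.
def demo_provider_kwargs(embedding_options)
  (embedding_options || {}).reject { |key, _value| DEMO_RESOLVED_ONLY_KEYS.include?(key) }
end

options = { model: 'nomic-embed-text', dimension: 768 }

provider = DemoProvider.new(**demo_provider_kwargs(options))
puts provider.model

begin
  DemoProvider.new(**options) # :dimension is not a constructor keyword
rescue ArgumentError => e
  puts e.message
end
```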
#build_graph_store ⇒ Storage::GraphStore::Interface
Instantiate the graph store adapter specified by the configuration.
# File 'lib/woods/builder.rb', line 304
def build_graph_store
  case @config.graph_store
  when :in_memory then Storage::GraphStore::Memory.new
  else raise ArgumentError, "Unknown graph_store: #{@config.graph_store}"
  end
end
#build_metadata_store ⇒ Storage::MetadataStore::Interface
Instantiate the metadata store adapter specified by the configuration.
# File 'lib/woods/builder.rb', line 292
def build_metadata_store
  case @config.metadata_store
  when :in_memory then Storage::MetadataStore::InMemory.new
  when :sqlite then Storage::MetadataStore::SQLite.new(**(@config. || {}))
  else raise ArgumentError, "Unknown metadata_store: #{@config.metadata_store}"
  end
end
#build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil) ⇒ Retriever, Cache::CachedRetriever
Build a Retriever wired with adapters from the configuration.
When `cache_enabled` is true, the embedding provider is wrapped with Cache::CachedEmbeddingProvider and the retriever is wrapped with Cache::CachedRetriever for transparent caching of expensive operations.
Callers that need stores pre-populated from a dump (the Shape-2 MCP-serve path) can inject them via vector_store: / metadata_store:. Without these, fresh empty stores are constructed from config. This is how the Bootstrapper hydrates from `Snapshotter.load_or_empty` without Builder needing to know the Snapshotter exists.
# File 'lib/woods/builder.rb', line 109
def build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil)
  provider = build_embedding_provider
  cache = build_cache_store
  provider = (provider, cache) if cache

  retriever = Retriever.new(
    vector_store: vector_store || build_vector_store,
    metadata_store: metadata_store || build_metadata_store,
    graph_store: graph_store || build_graph_store,
    embedding_provider: provider
  )

  cache ? wrap_with_retriever_cache(retriever, cache) : retriever
end
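The injection behaviour described above (use the store the caller passes in, otherwise build a fresh one) can be sketched with stand-ins. `DemoStore`, `DemoRetriever`, and `DemoBuilder` are hypothetical and leave out the embedding provider and the cache wrapping entirely.

```ruby
# Hypothetical stand-ins for the real store and retriever classes.
DemoStore = Struct.new(:label)
DemoRetriever = Struct.new(:vector_store, :metadata_store, :graph_store, keyword_init: true)

class DemoBuilder
  # An injected store wins; nil falls back to a freshly built one.
  def build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil)
    DemoRetriever.new(
      vector_store: vector_store || build_store(:vector),
      metadata_store: metadata_store || build_store(:metadata),
      graph_store: graph_store || build_store(:graph)
    )
  end

  private

  def build_store(kind)
    DemoStore.new("fresh #{kind} store")
  end
end

# A store hydrated elsewhere, for example from a snapshot on disk.
hydrated = DemoStore.new('vector store loaded from snapshot')
retriever = DemoBuilder.new.build_retriever(vector_store: hydrated)

puts retriever.vector_store.label
puts retriever.metadata_store.label
```

Because the builder only sees keyword arguments, whatever produced the hydrated store stays outside it, which matches the stated goal of keeping Builder unaware of the Snapshotter.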
#build_text_preparer(provider) ⇒ Embedding::TextPreparer
Build an Embedding::TextPreparer calibrated to a given provider.
OpenAI embedders use tiktoken (cl100k_base), for which 4.0 chars/token is a good conservative average. Ollama BERT/WordPiece tokenizers (nomic-embed-text, bge-*) run much hotter on dense Ruby/Rails source: long CamelCase constants, docstrings, callback DSLs, and heavy symbol use all sit below 2.0 chars/token in practice. Empirically, a 16 KB chunk of `ActionMailer::Base` still blows the 8192-token budget at 2.0 chars/token, so we budget at 1.5 to stay clear of tokenizer surprises even on the densest Rails internals.
`max_tokens` tracks the provider’s actual input budget when it reports one, falling back to the TextPreparer default otherwise.
# File 'lib/woods/builder.rb', line 185
def build_text_preparer(provider)
  chars_per_token = chars_per_token_for(provider)
  budget = provider.respond_to?(:max_input_tokens) ? provider.max_input_tokens : nil
  max_tokens = budget || Embedding::TextPreparer::DEFAULT_MAX_TOKENS

  Embedding::TextPreparer.new(max_tokens: max_tokens, chars_per_token: chars_per_token)
end
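The effect of the chars-per-token ratio on the character ceiling is plain arithmetic, worked here for the 8192-token budget and the three ratios mentioned above. `demo_char_ceiling` is a hypothetical helper, not a TextPreparer method.

```ruby
# Character ceiling implied by a token budget at a given chars-per-token ratio.
def demo_char_ceiling(budget_tokens, chars_per_token)
  (budget_tokens * chars_per_token).floor
end

budget = 8192 # the token budget quoted in the description above

[4.0, 2.0, 1.5].each do |ratio|
  puts "#{ratio} chars/token -> #{demo_char_ceiling(budget, ratio)} chars"
end
```

At 2.0 chars/token the ceiling is 16384 characters, exactly 16 KB, so a 16 KB chunk that tokenizes more densely than 2.0 chars/token overflows the budget. At 1.5 the ceiling drops to 12288 characters, which leaves headroom for denser source.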
#build_vector_store ⇒ Storage::VectorStore::Interface
Instantiate the vector store adapter specified by the configuration.
# File 'lib/woods/builder.rb', line 129
def build_vector_store
  case @config.vector_store
  when :in_memory then Storage::VectorStore::InMemory.new
  when :pgvector then Storage::VectorStore::Pgvector.new(**(@config. || {}))
  when :qdrant then Storage::VectorStore::Qdrant.new(**(@config. || {}))
  else raise ArgumentError, "Unknown vector_store: #{@config.vector_store}"
  end
end
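The symbol-to-adapter dispatch used by the store builders can be sketched with stand-in classes. `DemoInMemoryStore`, `DemoRemoteStore`, and `demo_build_vector_store` are hypothetical, as is the `url:` keyword; the real adapters' constructor arguments are not shown on this page.

```ruby
# Hypothetical adapters standing in for the real vector store classes.
class DemoInMemoryStore; end

class DemoRemoteStore
  attr_reader :url

  def initialize(url: 'http://localhost:6333')
    @url = url
  end
end

def demo_build_vector_store(type, options = nil)
  case type
  when :in_memory then DemoInMemoryStore.new
  # `options || {}` lets callers leave the options unset and get the defaults.
  when :qdrant then DemoRemoteStore.new(**(options || {}))
  else raise ArgumentError, "Unknown vector_store: #{type}"
  end
end

puts demo_build_vector_store(:in_memory).class
puts demo_build_vector_store(:qdrant).url
puts demo_build_vector_store(:qdrant, { url: 'http://qdrant.internal:6333' }).url

begin
  demo_build_vector_store(:faiss)
rescue ArgumentError => e
  puts e.message
end
```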