Class: Woods::Builder

Inherits:
Object
Defined in:
lib/woods/builder.rb

Overview

Builder reads a Configuration and instantiates the appropriate adapters, returning a fully wired Retriever ready for use.

Named presets are provided for common deployment scenarios. All presets can be further customized with a block passed to configure_with_preset.

Examples:

Using a preset

Woods.configure_with_preset(:local)
result = Woods.retrieve("How does the User model work?")

Using a preset with block customization

Woods.configure_with_preset(:production) do |config|
  config.embedding_options = { api_key: ENV['OPENAI_API_KEY'] }
  config.vector_store_options = { url: ENV['QDRANT_URL'], collection: 'myapp' }
end

Constant Summary

PRESETS =

Named presets mapping to default adapter types.

  • :local - fully local, no external services required (requires the sqlite3 gem)
  • :shared_filesystem - Shape 2: rake embed, then a separate MCP server reads from disk. All stores are in-memory and persisted to output_dir via the Snapshotter. No sqlite3 gem needed. Requires output_dir to be set and readable by both the embed process and the MCP server.
  • :postgresql - pgvector for vectors, OpenAI for embeddings
  • :production - Qdrant for vectors, OpenAI for embeddings

{
  local: {
    vector_store: :in_memory,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :ollama
  },
  shared_filesystem: {
    vector_store: :in_memory,
    metadata_store: :in_memory,
    graph_store: :in_memory,
    embedding_provider: :ollama
  },
  postgresql: {
    vector_store: :pgvector,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :openai
  },
  production: {
    vector_store: :qdrant,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :openai
  }
}.freeze

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(config = Woods.configuration) ⇒ Builder

Returns a new instance of Builder.

Parameters:

  • config (Configuration) (defaults to: Woods.configuration)

    Configuration to read adapter types from



# File 'lib/woods/builder.rb', line 85

def initialize(config = Woods.configuration)
  @config = config
end

Class Method Details

.preset_config(name) ⇒ Configuration

Build a Configuration populated with the named preset’s adapter types.

Parameters:

  • name (Symbol)

    Preset name, one of :local, :shared_filesystem, :postgresql, or :production

Returns:

  • (Configuration)

    A new Configuration with preset values applied

Raises:

  • (ArgumentError)

    if the preset name is not recognized



# File 'lib/woods/builder.rb', line 75

def self.preset_config(name)
  preset = PRESETS.fetch(name) do
    raise ArgumentError, "Unknown preset: #{name}. Valid: #{PRESETS.keys.join(', ')}"
  end
  config = Configuration.new
  preset.each { |key, value| config.public_send(:"#{key}=", value) }
  config
end
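
The fetch-with-block pattern above can be tried in isolation. This is a minimal sketch rather than the library's code: `DemoConfig`, `DEMO_PRESETS`, and `demo_preset_config` are hypothetical stand-ins for Configuration, PRESETS, and .preset_config, so the snippet runs without Woods loaded.

```ruby
# Hypothetical stand-ins so the sketch runs without Woods loaded.
DemoConfig = Struct.new(:vector_store, :metadata_store, :graph_store, :embedding_provider)

DEMO_PRESETS = {
  local: { vector_store: :in_memory, metadata_store: :sqlite,
           graph_store: :in_memory, embedding_provider: :ollama }
}.freeze

def demo_preset_config(name)
  # Hash#fetch yields to the block only when the key is missing.
  preset = DEMO_PRESETS.fetch(name) do
    raise ArgumentError, "Unknown preset: #{name}. Valid: #{DEMO_PRESETS.keys.join(', ')}"
  end
  config = DemoConfig.new
  # Each preset key maps onto a same-named writer on the config object.
  preset.each { |key, value| config.public_send(:"#{key}=", value) }
  config
end

config = demo_preset_config(:local)
puts config.embedding_provider # prints "ollama"

begin
  demo_preset_config(:staging)
rescue ArgumentError => e
  puts e.message # prints "Unknown preset: staging. Valid: local"
end
```

Because the error message lists the valid keys, a typo in a preset name fails at configuration time instead of surfacing later as a missing adapter.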

Instance Method Details

#build_chunker(provider) ⇒ Chunking::SemanticChunker

Build a Chunking::SemanticChunker sized to a given provider.

`max_chars` is derived from the provider’s input budget and the matching chars-per-token ratio, minus the context-prefix allowance the Indexer accounts for separately. Units that exceed this ceiling are sliced so that no single chunk can exceed the provider’s input cap.

For Ollama (and other BERT/WordPiece-backed models), char-based estimation is unreliable: CamelCase, `::` separators, and symbol literals tokenize much more densely than chars-per-token averages suggest. When the optional `tokenizers` gem is installed, pass an Embedding::TokenCounter and `max_tokens` so the chunker can verify every slice with the real tokenizer and re-split any piece that still exceeds `num_ctx`. See docs/EMBEDDING_MODELS.md.

Ollama v0.13.5+ stopped honouring `truncate: true` on `/api/embed` (ollama/ollama#14186), so any chunk that exceeds `num_ctx` returns a 400 rather than being silently truncated. Exact client-side sizing is the only reliable path until the regression is fixed upstream.

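Parameters:

  • provider (Embedding::Provider::Interface)

    Embedding provider whose input budget sizes the chunks

Returns:

  • (Chunking::SemanticChunker)

Raises:

  • (ArgumentError)

    if the provider’s budget is so small that the prefix allowance leaves no room for content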


# File 'lib/woods/builder.rb', line 217

def build_chunker(provider)
  budget = provider.respond_to?(:max_input_tokens) ? provider.max_input_tokens : nil
  max_chars = ((budget * chars_per_token_for(provider)).floor - CHUNKER_PREFIX_ALLOWANCE if budget)

  # Guard against a budget so small that the prefix allowance leaves
  # no room for content. Without this, SemanticChunker#slice_by_lines
  # passes a negative repeat count to String#scan, which returns []
  # — every chunk becomes empty and is silently dropped, producing
  # zero embeddings with no error. Surface the misconfiguration loudly.
  raise ArgumentError, chunker_budget_message(provider, budget) if max_chars && max_chars <= 0

  token_counter = token_counter_for(provider)
  max_tokens = token_counter && budget ? budget - PREFIX_TOKEN_ALLOWANCE : nil

  Chunking::SemanticChunker.new(
    max_chars: max_chars,
    token_counter: token_counter,
    max_tokens: max_tokens
  )
end
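
The `max_chars` arithmetic and its guard can be worked through with concrete numbers. This is a sketch under stated assumptions: `DEMO_PREFIX_ALLOWANCE` is an invented value (the real CHUNKER_PREFIX_ALLOWANCE is not shown on this page), and `demo_max_chars` is a hypothetical helper that mirrors only the two relevant lines of #build_chunker.

```ruby
# Assumed value for illustration only; the real CHUNKER_PREFIX_ALLOWANCE
# lives in lib/woods/builder.rb and is not shown on this page.
DEMO_PREFIX_ALLOWANCE = 512

def demo_max_chars(budget, chars_per_token)
  # Providers that report no budget leave max_chars as nil.
  return nil unless budget

  max_chars = (budget * chars_per_token).floor - DEMO_PREFIX_ALLOWANCE
  # Fail loudly instead of letting a non-positive ceiling reach the chunker.
  raise ArgumentError, "budget #{budget} leaves no room for content" if max_chars <= 0

  max_chars
end

puts demo_max_chars(8192, 1.5)        # prints 11776
puts demo_max_chars(nil, 4.0).inspect # prints nil

begin
  demo_max_chars(256, 1.5) # 384 chars minus the 512-char allowance is negative
rescue ArgumentError => e
  puts e.message # prints "budget 256 leaves no room for content"
end
```

With these assumed numbers, an 8192-token budget at 1.5 chars/token gives 12288 characters, and subtracting the allowance leaves 11776 for content.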

#build_embedding_provider ⇒ Embedding::Provider::Interface

Instantiate the embedding provider specified by the configuration.

Strips `embedding_options` keys that belong to the ResolvedConfig layer (like `:dimension`) before splatting into the provider’s constructor. Those keys are useful for the Snapshotter’s schema header but aren’t part of the provider’s API.

Returns:

  • (Embedding::Provider::Interface)

Raises:

  • (ArgumentError)

    if the configured type is not recognized



# File 'lib/woods/builder.rb', line 147

def build_embedding_provider
  opts = provider_kwargs
  case @config.embedding_provider
  when :openai then Embedding::Provider::OpenAI.new(**opts)
  when :ollama then Embedding::Provider::Ollama.new(**opts)
  else raise ArgumentError, "Unknown embedding_provider: #{@config.embedding_provider}"
  end
end
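
The key-stripping step described above matters because Ruby keyword arguments are strict: an unexpected key raises ArgumentError. The sketch below illustrates the idea only. `provider_kwargs` is not shown on this page, so `demo_provider_kwargs`, `DEMO_NON_PROVIDER_KEYS`, and `DemoProvider` are all hypothetical, and the real list of stripped keys may be longer than `:dimension`.

```ruby
# Hypothetical list of keys that are not provider constructor arguments.
# The description names :dimension as one example; the full list is not shown.
DEMO_NON_PROVIDER_KEYS = %i[dimension].freeze

# Hypothetical provider with a strict keyword signature.
class DemoProvider
  attr_reader :model, :api_key

  def initialize(model:, api_key: nil)
    @model = model
    @api_key = api_key
  end
end

def demo_provider_kwargs(embedding_options)
  # Hash#except (Ruby 3.0+) returns a copy without the listed keys,
  # so the caller's options hash is left untouched.
  (embedding_options || {}).except(*DEMO_NON_PROVIDER_KEYS)
end

options = { model: 'nomic-embed-text', dimension: 768 }
provider = DemoProvider.new(**demo_provider_kwargs(options))
puts provider.model # prints "nomic-embed-text"

begin
  DemoProvider.new(**options) # :dimension is not a constructor keyword
rescue ArgumentError => e
  puts e.class # prints "ArgumentError"
end
```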

#build_graph_store ⇒ Storage::GraphStore::Interface

Instantiate the graph store adapter specified by the configuration.

Returns:

  • (Storage::GraphStore::Interface)

Raises:

  • (ArgumentError)

    if the configured type is not recognized



# File 'lib/woods/builder.rb', line 304

def build_graph_store
  case @config.graph_store
  when :in_memory then Storage::GraphStore::Memory.new
  else raise ArgumentError, "Unknown graph_store: #{@config.graph_store}"
  end
end

#build_metadata_store ⇒ Storage::MetadataStore::Interface

Instantiate the metadata store adapter specified by the configuration.

Returns:

  • (Storage::MetadataStore::Interface)

Raises:

  • (ArgumentError)

    if the configured type is not recognized



# File 'lib/woods/builder.rb', line 292

def build_metadata_store
  case @config.metadata_store
  when :in_memory then Storage::MetadataStore::InMemory.new
  when :sqlite then Storage::MetadataStore::SQLite.new(**(@config.metadata_store_options || {}))
  else raise ArgumentError, "Unknown metadata_store: #{@config.metadata_store}"
  end
end

#build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil) ⇒ Retriever, Cache::CachedRetriever

Build a Retriever wired with adapters from the configuration.

When `cache_enabled` is true, the embedding provider is wrapped with Cache::CachedEmbeddingProvider and the retriever is wrapped with Cache::CachedRetriever for transparent caching of expensive operations.

Callers that need stores pre-populated from a dump (the Shape-2 MCP-serve path) can inject them via vector_store: / metadata_store:. Without these, fresh empty stores are constructed from config. This is how the Bootstrapper hydrates from `Snapshotter.load_or_empty` without Builder needing to know the Snapshotter exists.

Parameters:

  • vector_store (Storage::VectorStore::Interface, nil) (defaults to: nil)
  • metadata_store (Storage::MetadataStore::Interface, nil) (defaults to: nil)
  • graph_store (Storage::GraphStore::Interface, nil) (defaults to: nil)

    Pre-populated graph store. Without this, the retriever gets a fresh empty graph, which silently degrades :hybrid retrieval (graph expansion returns no candidates). The Bootstrapper hydrates from dependency_graph.json on disk and passes the populated store here.

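Returns:

  • (Retriever, Cache::CachedRetriever)

    A Cache::CachedRetriever when caching is enabled, otherwise a plain Retriever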



# File 'lib/woods/builder.rb', line 109

def build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil)
  provider = build_embedding_provider
  cache = build_cache_store

  provider = wrap_with_embedding_cache(provider, cache) if cache

  retriever = Retriever.new(
    vector_store: vector_store || build_vector_store,
    metadata_store: metadata_store || build_metadata_store,
    graph_store: graph_store || build_graph_store,
    embedding_provider: provider
  )

  cache ? wrap_with_retriever_cache(retriever, cache) : retriever
end
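
The injection-with-fallback and conditional wrapping in #build_retriever can be sketched without any Woods classes. Everything named `Demo*` below is a hypothetical stand-in, and the `cache:` keyword is an assumption made so the sketch is self-contained (the real method builds its cache store internally).

```ruby
# Hypothetical stand-ins; none of these classes exist in Woods.
DemoStore = Struct.new(:label)
DemoRetriever = Struct.new(:vector_store, :graph_store)
DemoCachedRetriever = Struct.new(:inner, :cache)

def demo_build_retriever(vector_store: nil, graph_store: nil, cache: nil)
  retriever = DemoRetriever.new(
    # An injected store wins; otherwise a fresh, empty one is built.
    vector_store || DemoStore.new('fresh vector store'),
    graph_store || DemoStore.new('fresh graph store')
  )
  # Wrap only when a cache is available, so callers get the same interface either way.
  cache ? DemoCachedRetriever.new(retriever, cache) : retriever
end

hydrated = DemoStore.new('hydrated from snapshot')
plain = demo_build_retriever(vector_store: hydrated)
puts plain.vector_store.label # prints "hydrated from snapshot"
puts plain.graph_store.label  # prints "fresh graph store"

cached = demo_build_retriever(cache: {})
puts cached.class # prints "DemoCachedRetriever"
```

The second line of output is the failure mode the graph_store parameter description warns about: when no populated graph store is injected, the retriever quietly receives an empty one.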

#build_text_preparer(provider) ⇒ Embedding::TextPreparer

Build an Embedding::TextPreparer calibrated to a given provider.

OpenAI embedders use tiktoken (cl100k_base), for which 4.0 chars/token is a good conservative average. Ollama BERT/WordPiece tokenizers (nomic-embed-text, bge-*) tokenize dense Ruby/Rails source far more heavily: long CamelCase constants, docstrings, callback DSLs, and heavy symbol use all sit below 2.0 chars/token in practice. Empirically, a 16 KB chunk of `ActionMailer::Base` still blows the 8192-token budget at 2.0 chars/token, so we budget at 1.5 to stay clear of tokenizer surprises even on the densest Rails internals.

`max_tokens` tracks the provider’s actual input budget when it reports one, falling back to the TextPreparer default otherwise.

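Parameters:

  • provider (Embedding::Provider::Interface)

    Embedding provider to calibrate the preparer for

Returns:

  • (Embedding::TextPreparer)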



# File 'lib/woods/builder.rb', line 185

def build_text_preparer(provider)
  chars_per_token = chars_per_token_for(provider)
  budget = provider.respond_to?(:max_input_tokens) ? provider.max_input_tokens : nil
  max_tokens = budget || Embedding::TextPreparer::DEFAULT_MAX_TOKENS

  Embedding::TextPreparer.new(max_tokens: max_tokens, chars_per_token: chars_per_token)
end
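
The chars-per-token figures quoted in the description can be checked with a little arithmetic. This sketch assumes a plain `chars / chars_per_token` estimate, which may differ from what Embedding::TextPreparer does internally; both helper methods are hypothetical.

```ruby
BUDGET_TOKENS = 8192 # token budget quoted in the description above

# Estimated token count for a text of `chars` characters at a given ratio.
def demo_estimated_tokens(chars, chars_per_token)
  (chars / chars_per_token.to_f).ceil
end

# Largest text, in characters, that the estimate admits under the budget.
def demo_char_ceiling(budget_tokens, chars_per_token)
  (budget_tokens * chars_per_token).floor
end

chunk_chars = 16 * 1024 # the 16 KB chunk from the description

puts demo_estimated_tokens(chunk_chars, 2.0) # prints 8192
puts demo_estimated_tokens(chunk_chars, 1.5) # prints 10923
puts demo_char_ceiling(BUDGET_TOKENS, 1.5)   # prints 12288
puts demo_char_ceiling(BUDGET_TOKENS, 4.0)   # prints 32768
```

At 2.0 chars/token the 16 KB chunk is estimated at exactly 8192 tokens, so any text denser than the estimate overflows the budget. Budgeting at 1.5 caps the text at 12288 characters, which leaves headroom.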

#build_vector_store ⇒ Storage::VectorStore::Interface

Instantiate the vector store adapter specified by the configuration.

Returns:

  • (Storage::VectorStore::Interface)

Raises:

  • (ArgumentError)

    if the configured type is not recognized



# File 'lib/woods/builder.rb', line 129

def build_vector_store
  case @config.vector_store
  when :in_memory then Storage::VectorStore::InMemory.new
  when :pgvector then Storage::VectorStore::Pgvector.new(**(@config.vector_store_options || {}))
  when :qdrant then Storage::VectorStore::Qdrant.new(**(@config.vector_store_options || {}))
  else raise ArgumentError, "Unknown vector_store: #{@config.vector_store}"
  end
end