Class: Woods::Builder

Inherits: Object

Defined in: lib/woods/builder.rb

Overview

Builder reads a Configuration and instantiates the appropriate adapters, returning a fully wired Retriever ready for use.

Named presets are provided for common deployment scenarios. All presets can be further customized with a block passed to configure_with_preset.

Examples:

Using a preset

Woods.configure_with_preset(:local)
result = Woods.retrieve("How does the User model work?")

Using a preset with block customization

Woods.configure_with_preset(:production) do |config|
  config.embedding_options = { api_key: ENV['OPENAI_API_KEY'] }
  config.vector_store_options = { url: ENV['QDRANT_URL'], collection: 'myapp' }
end

Constant Summary

PRESETS = Named presets mapping to default adapter types:

:local - fully local, no external services required (requires the sqlite3 gem).
:shared_filesystem - Shape 2: rake embed writes to disk and a separate MCP server reads from it. All stores are in-memory and persisted to output_dir via the Snapshotter. No sqlite3 gem needed. Requires output_dir to be set and readable by both the embed process and the MCP server.
:postgresql - pgvector for vectors, OpenAI for embeddings.
:production - Qdrant for vectors, OpenAI for embeddings.

{
  local: {
    vector_store: :in_memory,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :ollama
  },
  shared_filesystem: {
    vector_store: :in_memory,
    metadata_store: :in_memory,
    graph_store: :in_memory,
    embedding_provider: :ollama
  },
  postgresql: {
    vector_store: :pgvector,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :openai
  },
  production: {
    vector_store: :qdrant,
    metadata_store: :sqlite,
    graph_store: :in_memory,
    embedding_provider: :openai
  }
}.freeze

Class Method Summary

.preset_config(name) ⇒ Configuration

Build a Configuration populated with the named preset’s adapter types.

Instance Method Summary

#build_chunker(provider) ⇒ Chunking::SemanticChunker

Build a Chunking::SemanticChunker sized to a given provider.
#build_embedding_provider ⇒ Embedding::Provider::Interface

Instantiate the embedding provider specified by the configuration.
#build_graph_store ⇒ Storage::GraphStore::Interface

Instantiate the graph store adapter specified by the configuration.
#build_metadata_store ⇒ Storage::MetadataStore::Interface

Instantiate the metadata store adapter specified by the configuration.
#build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil) ⇒ Retriever, Cache::CachedRetriever

Build a Retriever wired with adapters from the configuration.
#build_text_preparer(provider) ⇒ Embedding::TextPreparer

Build an Embedding::TextPreparer calibrated to a given provider.
#build_vector_store ⇒ Storage::VectorStore::Interface

Instantiate the vector store adapter specified by the configuration.
#initialize(config = Woods.configuration) ⇒ Builder constructor

A new instance of Builder.

Constructor Details

#initialize(config = Woods.configuration) ⇒ `Builder`

Returns a new instance of Builder.

Parameters:

config (Configuration) (defaults to: Woods.configuration) —

Configuration to read adapter types from




# File 'lib/woods/builder.rb', line 85

def initialize(config = Woods.configuration)
  @config = config
end

Class Method Details

.preset_config(name) ⇒ `Configuration`

Build a Configuration populated with the named preset’s adapter types.

Parameters:

name (Symbol) —

Preset name, one of :local, :shared_filesystem, :postgresql, or :production

Returns:

(Configuration) —

A new Configuration with preset values applied

Raises:

(ArgumentError) —

if the preset name is not recognized

# File 'lib/woods/builder.rb', line 75

def self.preset_config(name)
  preset = PRESETS.fetch(name) do
    raise ArgumentError, "Unknown preset: #{name}. Valid: #{PRESETS.keys.join(', ')}"
  end
  config = Configuration.new
  preset.each { |key, value| config.public_send(:"#{key}=", value) }
  config
end
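
The fetch-with-block pattern above can be tried in isolation. The following is a minimal sketch rather than the library's code: `DemoConfig` is a hypothetical stand-in for Configuration, `demo_preset_config` is an illustrative name, and `DEMO_PRESETS` holds only two of the entries from PRESETS.

```ruby
# Hypothetical stand-in for Woods::Configuration: four writable attributes.
DemoConfig = Struct.new(:vector_store, :metadata_store, :graph_store, :embedding_provider)

# Two entries copied from PRESETS; the others are omitted for brevity.
DEMO_PRESETS = {
  local: { vector_store: :in_memory, metadata_store: :sqlite,
           graph_store: :in_memory, embedding_provider: :ollama },
  production: { vector_store: :qdrant, metadata_store: :sqlite,
                graph_store: :in_memory, embedding_provider: :openai }
}.freeze

def demo_preset_config(name)
  # Hash#fetch runs the block only when the key is missing.
  preset = DEMO_PRESETS.fetch(name) do
    raise ArgumentError, "Unknown preset: #{name}. Valid: #{DEMO_PRESETS.keys.join(', ')}"
  end
  config = DemoConfig.new
  # Each preset key maps onto the setter of the same name.
  preset.each { |key, value| config.public_send(:"#{key}=", value) }
  config
end

config = demo_preset_config(:production)
config.vector_store # :qdrant

begin
  demo_preset_config(:staging)
rescue ArgumentError => e
  e.message # "Unknown preset: staging. Valid: local, production"
end
```

Because the error message is built from the hash keys, adding a preset to the hash updates the list of valid names automatically.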

Instance Method Details

#build_chunker(provider) ⇒ `Chunking::SemanticChunker`

Build a Chunking::SemanticChunker sized to a given provider.

`max_chars` is derived from the provider’s input budget and the matching chars-per-token ratio, minus the context-prefix allowance the Indexer accounts for separately. Units that exceed this ceiling are sliced so that no single chunk can exceed the provider’s input cap.

For Ollama (and other BERT/WordPiece-backed models), char-based estimation is unreliable: CamelCase, `::` separators, and symbol literals tokenize much more densely than chars/token averages suggest. When the optional `tokenizers` gem is installed, pass an Embedding::TokenCounter and `max_tokens` so the chunker can verify every slice with the real tokenizer and re-split any piece that still exceeds `num_ctx`. See docs/EMBEDDING_MODELS.md.

Ollama v0.13.5+ stopped honouring `truncate: true` on `/api/embed` (ollama/ollama#14186), so any chunk that exceeds `num_ctx` returns a 400 rather than being silently truncated. Exact client-side sizing is the only reliable path until the regression is fixed upstream.

Parameters:

provider (Embedding::Provider::Interface)

Returns:

(Chunking::SemanticChunker)

Raises:

(ArgumentError)

# File 'lib/woods/builder.rb', line 217

def build_chunker(provider)
  budget = provider.respond_to?(:max_input_tokens) ? provider.max_input_tokens : nil
  max_chars = ((budget * chars_per_token_for(provider)).floor - CHUNKER_PREFIX_ALLOWANCE if budget)

  # Guard against a budget so small that the prefix allowance leaves
  # no room for content. Without this, SemanticChunker#slice_by_lines
  # passes a negative repeat count to String#scan, which returns []
  # — every chunk becomes empty and is silently dropped, producing
  # zero embeddings with no error. Surface the misconfiguration loudly.
  raise ArgumentError, chunker_budget_message(provider, budget) if max_chars && max_chars <= 0

  token_counter = token_counter_for(provider)
  max_tokens = token_counter && budget ? budget - PREFIX_TOKEN_ALLOWANCE : nil

  Chunking::SemanticChunker.new(
    max_chars: max_chars,
    token_counter: token_counter,
    max_tokens: max_tokens
  )
end
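
The `max_chars` arithmetic and its guard can be checked by hand. This is a sketch under stated assumptions: `demo_max_chars` is an illustrative name, and `DEMO_PREFIX_ALLOWANCE` is a made-up value, since this page does not show what CHUNKER_PREFIX_ALLOWANCE is set to. The 8192-token budget and the 1.5 and 4.0 ratios are the figures quoted in the build_text_preparer notes.

```ruby
# Hypothetical value: the real CHUNKER_PREFIX_ALLOWANCE is not shown on this page.
DEMO_PREFIX_ALLOWANCE = 500

# Mirrors the max_chars arithmetic in build_chunker: token budget times the
# chars-per-token ratio, floored, minus the prefix allowance.
def demo_max_chars(budget, chars_per_token, allowance = DEMO_PREFIX_ALLOWANCE)
  return nil unless budget # provider reports no budget, so no character ceiling

  max_chars = (budget * chars_per_token).floor - allowance
  if max_chars <= 0
    raise ArgumentError, "budget #{budget} leaves no room for content after the prefix allowance"
  end

  max_chars
end

demo_max_chars(8192, 1.5) # 8192 * 1.5 = 12288, minus 500 = 11788
demo_max_chars(8192, 4.0) # 8192 * 4.0 = 32768, minus 500 = 32268
demo_max_chars(nil, 1.5)  # nil
# demo_max_chars(256, 1.5) raises ArgumentError: 384 - 500 is negative.
```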

#build_embedding_provider ⇒ `Embedding::Provider::Interface`

Instantiate the embedding provider specified by the configuration.

Strips `embedding_options` keys that belong to the ResolvedConfig layer (like `:dimension`) before splatting into the provider’s constructor. Those keys are useful for the Snapshotter’s schema header but aren’t part of the provider’s API.

Returns:

(Embedding::Provider::Interface) —

Embedding provider instance

Raises:

(ArgumentError) —

if the configured type is not recognized

# File 'lib/woods/builder.rb', line 147

def build_embedding_provider
  opts = provider_kwargs
  case @config.embedding_provider
  when :openai then Embedding::Provider::OpenAI.new(**opts)
  when :ollama then Embedding::Provider::Ollama.new(**opts)
  else raise ArgumentError, "Unknown embedding_provider: #{@config.embedding_provider}"
  end
end
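
The private `provider_kwargs` helper is not shown on this page. The sketch below only illustrates why filtering before the splat matters. `DemoProvider`, `demo_provider_kwargs`, and the contents of `DEMO_RESOLVED_ONLY_KEYS` are assumptions; the source names `:dimension` as one example of a filtered key.

```ruby
# Assumed list of ResolvedConfig-only keys; only :dimension is named in the docs.
DEMO_RESOLVED_ONLY_KEYS = %i[dimension].freeze

# Hypothetical provider whose constructor accepts only the keywords it knows.
class DemoProvider
  attr_reader :api_key, :model

  def initialize(api_key: nil, model: "demo-model")
    @api_key = api_key
    @model = model
  end
end

def demo_provider_kwargs(embedding_options)
  # nil options behave like an empty hash; filtered keys never reach the constructor.
  (embedding_options || {}).reject { |key, _| DEMO_RESOLVED_ONLY_KEYS.include?(key) }
end

options = { api_key: "sk-demo", dimension: 1536 }
provider = DemoProvider.new(**demo_provider_kwargs(options))
provider.api_key # "sk-demo"
# DemoProvider.new(**options) would raise ArgumentError (unknown keyword: :dimension).
```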

#build_graph_store ⇒ `Storage::GraphStore::Interface`

Instantiate the graph store adapter specified by the configuration.

Returns:

(Storage::GraphStore::Interface) —

Graph store adapter instance

Raises:

(ArgumentError) —

if the configured type is not recognized

# File 'lib/woods/builder.rb', line 304

def build_graph_store
  case @config.graph_store
  when :in_memory then Storage::GraphStore::Memory.new
  else raise ArgumentError, "Unknown graph_store: #{@config.graph_store}"
  end
end

#build_metadata_store ⇒ `Storage::MetadataStore::Interface`

Instantiate the metadata store adapter specified by the configuration.

Returns:

(Storage::MetadataStore::Interface) —

Metadata store adapter instance

Raises:

(ArgumentError) —

if the configured type is not recognized

# File 'lib/woods/builder.rb', line 292

def build_metadata_store
  case @config.metadata_store
  when :in_memory then Storage::MetadataStore::InMemory.new
  when :sqlite then Storage::MetadataStore::SQLite.new(**(@config.metadata_store_options || {}))
  else raise ArgumentError, "Unknown metadata_store: #{@config.metadata_store}"
  end
end

#build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil) ⇒ `Retriever`, `Cache::CachedRetriever`

Build a Retriever wired with adapters from the configuration.

When `cache_enabled` is true, the embedding provider is wrapped with Cache::CachedEmbeddingProvider and the retriever is wrapped with Cache::CachedRetriever for transparent caching of expensive operations.

Callers that need stores pre-populated from a dump (the Shape-2 MCP-serve path) can inject them via vector_store:, metadata_store:, and graph_store:. Any store that is not injected is constructed fresh and empty from config. This is how the Bootstrapper hydrates from `Snapshotter.load_or_empty` without Builder needing to know the Snapshotter exists.

Parameters:

vector_store (Storage::VectorStore::Interface, nil) (defaults to: nil)
metadata_store (Storage::MetadataStore::Interface, nil) (defaults to: nil)
graph_store (Storage::GraphStore::Interface, nil) (defaults to: nil) —

Pre-populated graph store. Without this, the retriever gets a fresh empty graph, which silently degrades :hybrid retrieval (graph expansion returns no candidates). The Bootstrapper hydrates from dependency_graph.json on disk and passes the populated store here.

Returns:

(Retriever, Cache::CachedRetriever) —

A fully wired retriever

# File 'lib/woods/builder.rb', line 109

def build_retriever(vector_store: nil, metadata_store: nil, graph_store: nil)
  provider = build_embedding_provider
  cache = build_cache_store

  provider = wrap_with_embedding_cache(provider, cache) if cache

  retriever = Retriever.new(
    vector_store: vector_store || build_vector_store,
    metadata_store: metadata_store || build_metadata_store,
    graph_store: graph_store || build_graph_store,
    embedding_provider: provider
  )

  cache ? wrap_with_retriever_cache(retriever, cache) : retriever
end
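
The injected-or-fresh fallback can be seen in miniature below. This is a sketch, not the library's wiring: `DemoStore`, `DemoRetriever`, and `demo_build_retriever` are hypothetical stand-ins, and caching is left out.

```ruby
# Hypothetical in-memory store; stands in for any Storage::*::Interface adapter.
class DemoStore
  attr_reader :entries

  def initialize(entries = {})
    @entries = entries
  end
end

# Hypothetical retriever that just holds its collaborators.
DemoRetriever = Struct.new(:vector_store, :graph_store, keyword_init: true)

# Same shape as build_retriever: an injected store wins, otherwise a fresh
# empty one is built.
def demo_build_retriever(vector_store: nil, graph_store: nil)
  DemoRetriever.new(
    vector_store: vector_store || DemoStore.new,
    graph_store: graph_store || DemoStore.new
  )
end

hydrated_graph = DemoStore.new({ "User" => ["Account"] })
retriever = demo_build_retriever(graph_store: hydrated_graph)
retriever.graph_store.equal?(hydrated_graph) # true: the injected store is used as-is
retriever.vector_store.entries.empty?        # true: nothing injected, so a fresh store
```

The caller decides where a populated store comes from, so the builder needs no knowledge of how it was loaded.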

#build_text_preparer(provider) ⇒ `Embedding::TextPreparer`

Build an Embedding::TextPreparer calibrated to a given provider.

OpenAI embedders use tiktoken (cl100k_base), for which 4.0 chars/token is a good conservative average. Ollama BERT/WordPiece tokenizers (nomic-embed-text, bge-*) tokenize dense Ruby/Rails source far more heavily: long CamelCase constants, docstrings, callback DSLs, and heavy symbol use all sit below 2.0 chars/token in practice. Empirically, a 16 KB chunk of `ActionMailer::Base` still blows the 8192-token budget at 2.0 chars/token, so we budget at 1.5 to stay clear of tokenizer surprises even on the densest Rails internals.

`max_tokens` tracks the provider’s actual input budget when it reports one, falling back to the TextPreparer default otherwise.

Parameters:

provider (Embedding::Provider::Interface)

Returns:

(Embedding::TextPreparer)

# File 'lib/woods/builder.rb', line 185

def build_text_preparer(provider)
  chars_per_token = chars_per_token_for(provider)
  budget = provider.respond_to?(:max_input_tokens) ? provider.max_input_tokens : nil
  max_tokens = budget || Embedding::TextPreparer::DEFAULT_MAX_TOKENS

  Embedding::TextPreparer.new(max_tokens: max_tokens, chars_per_token: chars_per_token)
end
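
The figures quoted above can be reproduced with simple arithmetic. The sketch assumes tokens are estimated as characters divided by the chars-per-token ratio; the helper names are illustrative, and how Embedding::TextPreparer applies the ratio internally is not shown on this page.

```ruby
# Estimated token count for a text of char_count characters at a given ratio.
def demo_estimated_tokens(char_count, chars_per_token)
  (char_count / chars_per_token).ceil
end

# Largest text, in characters, that the estimate admits under max_tokens.
def demo_char_ceiling(max_tokens, chars_per_token)
  (max_tokens * chars_per_token).floor
end

sixteen_kb = 16 * 1024
demo_estimated_tokens(sixteen_kb, 2.0) # 8192: exactly at the budget, so no margin
demo_estimated_tokens(sixteen_kb, 1.5) # 10923: over budget, so the text must shrink
demo_char_ceiling(8192, 1.5)           # 12288 characters
demo_char_ceiling(8192, 4.0)           # 32768 characters
```

At 2.0 chars/token the 16 KB chunk is estimated to land exactly on 8192 tokens, so any text denser than the estimate overflows. At 1.5 the same budget admits only 12288 characters, which leaves headroom.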

#build_vector_store ⇒ `Storage::VectorStore::Interface`

Instantiate the vector store adapter specified by the configuration.

Returns:

(Storage::VectorStore::Interface) —

Vector store adapter instance

Raises:

(ArgumentError) —

if the configured type is not recognized

# File 'lib/woods/builder.rb', line 129

def build_vector_store
  case @config.vector_store
  when :in_memory then Storage::VectorStore::InMemory.new
  when :pgvector then Storage::VectorStore::Pgvector.new(**(@config.vector_store_options || {}))
  when :qdrant then Storage::VectorStore::Qdrant.new(**(@config.vector_store_options || {}))
  else raise ArgumentError, "Unknown vector_store: #{@config.vector_store}"
  end
end
