Class: Woods::Retriever

Inherits:
Object
Defined in:
lib/woods/retriever.rb

Overview

Retriever orchestrates the full retrieval pipeline: classify, execute, rank, and assemble context from a natural language query.

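Coordinates four internal components:

  • Retrieval::QueryClassifier: classifies the query
  • Retrieval::SearchExecutor: executes the search against the stores
  • Retrieval::Ranker: ranks the candidates
  • Retrieval::ContextAssembler: assembles context within the token budget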

Optionally builds a structural context overview (codebase unit counts by type) that is prepended to the assembled context.

Examples:

retriever = Woods::Retriever.new(
  vector_store: vector_store,
  metadata_store: metadata_store,
  graph_store: graph_store,
  embedding_provider: embedding_provider
)
result = retriever.retrieve("How does the User model work?")
result.context        # => "Codebase: 42 units (10 models, ...)\n\n---\n\n## User (model)..."
result.strategy       # => :vector
result.tokens_used    # => 4200

Defined Under Namespace

Classes: RetrievalResult, RetrievalTrace

Constant Summary

OLLAMA_EMBEDDING_MODELS =

BERT / WordPiece-family embedders that Ollama commonly serves. Matched against `provider.model_name` to decide whether to use the 1.5 chars/token ratio and wire in an exact Embedding::TokenCounter. Extend this list when new WordPiece-family models become popular; the tiktoken 4.0 default remains the safe fallback for unknown models.

Regexp.union(
  /\Anomic-embed/, /\Abge-/, /\Amxbai-embed/,
  /\Asnowflake-arctic/, /\Aall-minilm/, /\Aparaphrase-/,
  /\Ae5-/, /\Agte-/, /\Astella/,
  /\Agranite-embedding/, /\Ajina-embeddings/
).freeze
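
The matching rule described for OLLAMA_EMBEDDING_MODELS can be sketched roughly as follows. This is an illustration only: the helper `chars_per_token_for` and the two ratio constants are hypothetical names, not part of Woods, and the real `infer_chars_per_token` may differ.

```ruby
# Pattern list copied from the constant documented above.
OLLAMA_EMBEDDING_MODELS = Regexp.union(
  /\Anomic-embed/, /\Abge-/, /\Amxbai-embed/,
  /\Asnowflake-arctic/, /\Aall-minilm/, /\Aparaphrase-/,
  /\Ae5-/, /\Agte-/, /\Astella/,
  /\Agranite-embedding/, /\Ajina-embeddings/
).freeze

# Hypothetical constant names for the two ratios mentioned in the docs.
WORDPIECE_CHARS_PER_TOKEN = 1.5
DEFAULT_CHARS_PER_TOKEN = 4.0

# Hypothetical helper: pick a chars/token ratio from a model name.
# A missing or unrecognised name falls back to the 4.0 default.
def chars_per_token_for(model_name)
  return DEFAULT_CHARS_PER_TOKEN if model_name.nil?

  if OLLAMA_EMBEDDING_MODELS.match?(model_name.to_s)
    WORDPIECE_CHARS_PER_TOKEN
  else
    DEFAULT_CHARS_PER_TOKEN
  end
end

chars_per_token_for("nomic-embed-text")        # => 1.5
chars_per_token_for("text-embedding-3-small")  # => 4.0
chars_per_token_for(nil)                       # => 4.0
```

Because every pattern is anchored with `\A`, only the start of the model name is tested, so a tagged name such as "bge-m3:latest" would still match in this sketch.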
STRUCTURAL_TYPES =

Unit types queried for the structural context overview.

%w[model controller service job mailer component graphql].freeze
DEFAULT_EXCLUDE_TYPES =

Unit types excluded from retrieval by default. test_mapping units make up ~33% of a typical index and lexically dominate semantic rank for production queries (“stripe webhook” often surfaces stripe_webhook_spec.rb above the actual controller). Callers can override by passing types: (include-only) or an explicit exclude_types:.

%w[test_mapping].freeze

Constructor Details

#initialize(vector_store:, metadata_store:, graph_store:, embedding_provider:, formatter: nil) ⇒ Retriever

Returns a new instance of Retriever.

Parameters:



# File 'lib/woods/retriever.rb', line 106

def initialize(vector_store:, metadata_store:, graph_store:, embedding_provider:, formatter: nil)
  @vector_store = vector_store
  @metadata_store = metadata_store
  @graph_store = graph_store
  @formatter = formatter

  @classifier = Retrieval::QueryClassifier.new
  @executor = Retrieval::SearchExecutor.new(
    vector_store: vector_store,
    metadata_store: metadata_store,
    graph_store: graph_store,
    embedding_provider: embedding_provider
  )
  @ranker = Retrieval::Ranker.new(metadata_store: metadata_store, graph_store: graph_store)
  # Match truncation sizing to the embedding provider's tokenizer so
  # Ollama-indexed corpora (ratio ~1.5) don't get over-truncated by
  # an OpenAI-sized default (4.0). Unknown/missing providers fall
  # back to the OpenAI-friendly default.
  chars_per_token = infer_chars_per_token(embedding_provider)
  @assembler = Retrieval::ContextAssembler.new(
    metadata_store: metadata_store,
    chars_per_token: chars_per_token,
    token_counter: infer_token_counter(embedding_provider)
  )
end

Instance Attribute Details

#graph_store ⇒ Object (readonly)

Direct handles to the injected stores. The sub-components (Woods::Retrieval::SearchExecutor, Woods::Retrieval::Ranker, Woods::Retrieval::ContextAssembler) hold their own references too, but those are implementation details — callers that want to mutate store contents (e.g. the MCP reload tool) read through these accessors. All three refer to the same Ruby objects the sub-components were initialised with, so in-place #clear! + #bulk_load propagates through the entire pipeline without re-instantiating sub-components.



# File 'lib/woods/retriever.rb', line 99

def graph_store
  @graph_store
end

#metadata_store ⇒ Object (readonly)

Direct handle to the injected metadata store. It is the same Ruby object the sub-components were initialised with; see #graph_store for the full notes on in-place reloads.



# File 'lib/woods/retriever.rb', line 99

def metadata_store
  @metadata_store
end

#vector_store ⇒ Object (readonly)

Direct handle to the injected vector store. It is the same Ruby object the sub-components were initialised with; see #graph_store for the full notes on in-place reloads.



# File 'lib/woods/retriever.rb', line 99

def vector_store
  @vector_store
end
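
The shared-reference behaviour described for these accessors can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `ToyStore`, `ToyRanker`, and `ToyRetriever` are stand-ins, and the argument shape of `bulk_load` is an assumption; only the method names #clear! and #bulk_load come from the text above.

```ruby
# Hypothetical in-memory store. Only the method names #clear! and
# #bulk_load are taken from the documentation; the hash payload is assumed.
class ToyStore
  def initialize
    @units = {}
  end

  def bulk_load(units)
    @units.merge!(units)
    self
  end

  def clear!
    @units.clear
    self
  end

  def size
    @units.size
  end
end

# Stand-in for a sub-component that keeps its own reference to the store.
class ToyRanker
  def initialize(store)
    @store = store
  end

  def visible_units
    @store.size
  end
end

# Stand-in for the retriever: exposes the store and wires the same
# object into its sub-component.
class ToyRetriever
  attr_reader :metadata_store, :ranker

  def initialize(metadata_store:)
    @metadata_store = metadata_store
    @ranker = ToyRanker.new(metadata_store)
  end
end

retriever = ToyRetriever.new(metadata_store: ToyStore.new.bulk_load("User" => :model))

# Reload in place through the accessor; no sub-component is rebuilt.
retriever.metadata_store.clear!
retriever.metadata_store.bulk_load("Order" => :model, "OrdersController" => :controller)

retriever.ranker.visible_units # => 2
```

The sub-component sees the reloaded contents because the store object is mutated rather than replaced. Assigning a brand-new store to the retriever would not have that effect in this sketch, since `ToyRanker` would still hold the old object.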

Instance Method Details

#retrieve(query, budget: 8000, types: nil, exclude_types: nil) ⇒ RetrievalResult

Execute the full retrieval pipeline for a natural language query.

Pipeline: classify -> execute -> rank -> filter -> within-type fallback (only when the filter removed every candidate) -> assemble -> format.

When types: is set, the response carries type_rank_context: per-type rank metadata that lets the caller tell a strong match from a weak one without Woods imposing a score threshold.

Parameters:

  • query (String)

    Natural language query

  • budget (Integer) (defaults to: 8000)

    Token budget for context assembly

  • types (Array<String, Symbol>, nil) (defaults to: nil)

    If set, restrict results to these unit types (overrides DEFAULT_EXCLUDE_TYPES).

  • exclude_types (Array<String, Symbol>, nil) (defaults to: nil)

    Additional types to exclude. Applied on top of DEFAULT_EXCLUDE_TYPES unless types: is set.
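
The interaction between types: and exclude_types: described in the parameter list can be sketched as below. This is a minimal illustration under stated assumptions, not the library's apply_type_filter: the `Candidate` struct, the `filter_by_type` helper, and the sample identifiers are all hypothetical, and the within-type fallback is not modelled.

```ruby
# Hypothetical candidate shape; real ranked results may carry more fields.
Candidate = Struct.new(:identifier, :type)

# Value copied from the constant documented on this page.
DEFAULT_EXCLUDE_TYPES = %w[test_mapping].freeze

# Hypothetical helper illustrating the documented precedence:
# - types: is include-only and bypasses the default excludes
# - otherwise exclude_types: is added on top of the defaults
def filter_by_type(candidates, types: nil, exclude_types: nil)
  if types
    wanted = Array(types).map(&:to_s)
    return candidates.select { |c| wanted.include?(c.type.to_s) }
  end

  excluded = DEFAULT_EXCLUDE_TYPES + Array(exclude_types).map(&:to_s)
  candidates.reject { |c| excluded.include?(c.type.to_s) }
end

ranked = [
  Candidate.new("stripe_webhook_spec.rb", "test_mapping"),
  Candidate.new("StripeWebhooksController", "controller"),
  Candidate.new("StripeWebhookJob", "job")
]

filter_by_type(ranked).map(&:identifier)
# => ["StripeWebhooksController", "StripeWebhookJob"]

filter_by_type(ranked, types: [:test_mapping]).map(&:identifier)
# => ["stripe_webhook_spec.rb"]

filter_by_type(ranked, exclude_types: ["job"]).map(&:identifier)
# => ["StripeWebhooksController"]
```

Strings and symbols are both normalised with `to_s`, mirroring the `Array<String, Symbol>` parameter types listed above.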

Returns:

  • (RetrievalResult)

# File 'lib/woods/retriever.rb', line 202

def retrieve(query, budget: 8000, types: nil, exclude_types: nil)
  start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  classification = @classifier.classify(query)
  execution_result = @executor.execute(query: query, classification: classification)
  ranked = @ranker.rank(execution_result.candidates, classification: classification)

  type_list = normalize_type_list(types)
  filtered, fallback_ran = apply_type_filter(
    ranked, query, classification, types: types, type_list: type_list, exclude_types: exclude_types
  )
  type_rank_context = type_list ? build_type_rank_context(ranked, type_list, fallback_ran: fallback_ran) : nil

  assembled = assemble_context(filtered, classification, budget)
  trace = build_trace(classification, execution_result, filtered, assembled, start_time)

  build_result(
    assembled: assembled, classification: classification, strategy: execution_result.strategy,
    budget: budget, trace: trace, type_rank_context: type_rank_context
  )
end