Class: Woods::Retrieval::ContextAssembler

Inherits:
Object
Defined in:
lib/woods/retrieval/context_assembler.rb

Overview

Transforms ranked search candidates into a token-budgeted context string for LLM consumption.

Allocates a fixed token budget across four sections:

  • Structural (10%): Always-included codebase overview

  • Primary (50%): Direct query results

  • Supporting (25%): Dependencies and related context

  • Framework (15%): Rails/gem source when query has framework context

When framework context is not needed, primary and supporting sections receive the framework allocation proportionally.
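As a rough illustration of that redistribution, the sketch below splits the framework share between primary and supporting in proportion to their existing shares. It is not the class's actual `compute_section_budgets` (which is not shown on this page and operates on the tokens left after the structural section); `SKETCH_ALLOCATION` and `sketch_section_budgets` are hypothetical names, and the sketch applies the percentages to the whole budget for simplicity.

```ruby
# Hypothetical sketch; mirrors the documented percentages only.
SKETCH_ALLOCATION = {
  structural: 0.10,
  primary: 0.50,
  supporting: 0.25,
  framework: 0.15
}.freeze

def sketch_section_budgets(total, framework_needed:)
  shares = SKETCH_ALLOCATION.dup
  unless framework_needed
    freed = shares[:framework]
    pool = SKETCH_ALLOCATION[:primary] + SKETCH_ALLOCATION[:supporting]
    # Hand the freed share to primary and supporting in a 2:1 ratio,
    # matching their 50% / 25% baseline allocations.
    shares[:primary] += freed * (SKETCH_ALLOCATION[:primary] / pool)
    shares[:supporting] += freed * (SKETCH_ALLOCATION[:supporting] / pool)
    shares[:framework] = 0.0
  end
  shares.transform_values { |share| (total * share).round }
end

with_framework = sketch_section_budgets(8000, framework_needed: true)
without_framework = sketch_section_budgets(8000, framework_needed: false)

with_framework[:primary]       # => 4000
with_framework[:framework]     # => 1200
without_framework[:primary]    # => 4800
without_framework[:supporting] # => 2400
```

Under these assumptions the effective split without framework context is 10% / 60% / 30%, and the section budgets still sum to the full 8000 tokens.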

Examples:

assembler = ContextAssembler.new(metadata_store: store)
result = assembler.assemble(candidates: ranked, classification: cls)
result.context     # => "## User (model)\n..."
result.tokens_used # => 4200
result.sections    # => [:structural, :primary, :supporting]

Constant Summary

DEFAULT_BUDGET =

Default total token budget, in tokens.

8000
BUDGET_ALLOCATION =
{
  structural: 0.10,
  primary: 0.50,
  supporting: 0.25,
  framework: 0.15
}.freeze
MIN_USEFUL_TOKENS =

Minimum token count for a section to be worth including.

200
DEFAULT_CHARS_PER_TOKEN =

Default chars-per-token ratio. Delegates to TokenUtils, the single source of truth, which uses 4.0 (the OpenAI / tiktoken cl100k_base average for Ruby source; see docs/TOKEN_BENCHMARK.md). Callers embedding with BERT/WordPiece tokenizers (nomic-embed-text, bge-*) should pass the tighter ratio from their TextPreparer (~1.5–2.5) so that truncation stays accurate for that provider, or use TokenUtils.chars_per_token_for(:ollama) for the shipped default.

TokenUtils::DEFAULT_CHARS_PER_TOKEN
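To see why the ratio matters, the sketch below estimates tokens for the same text at two ratios. The helper names are hypothetical and the arithmetic (length divided by ratio, rounded up) is an assumption about how such a heuristic is typically written; TokenUtils' real implementation is not shown on this page.

```ruby
# Hypothetical helpers illustrating a chars-per-token heuristic.
def sketch_estimate_tokens(text, chars_per_token)
  (text.length / chars_per_token.to_f).ceil
end

# Largest character count that fits a token budget at a given ratio.
def sketch_max_chars(token_budget, chars_per_token)
  (token_budget * chars_per_token).floor
end

source = "x" * 4000

sketch_estimate_tokens(source, 4.0) # => 1000
sketch_estimate_tokens(source, 2.0) # => 2000
sketch_max_chars(200, 4.0)          # => 800
sketch_max_chars(200, 2.0)          # => 400
```

With a 4.0 ratio applied to a tokenizer that really averages 2.0 characters per token, a section sized to 200 tokens would carry roughly twice the tokens intended, which is the mismatch the tighter ratio avoids.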

Constructor Details

#initialize(metadata_store:, budget: DEFAULT_BUDGET, chars_per_token: DEFAULT_CHARS_PER_TOKEN, token_counter: nil) ⇒ ContextAssembler

Returns a new instance of ContextAssembler.

Parameters:

  • metadata_store (#find, #find_batch)

    Store that resolves identifiers to unit data

  • budget (Integer) (defaults to: DEFAULT_BUDGET)

    Total token budget

  • chars_per_token (Float) (defaults to: DEFAULT_CHARS_PER_TOKEN)

    Tokenizer-calibrated char/token ratio used for truncation sizing. Match this to the embedding provider in use; Embedding::TextPreparer#chars_per_token exposes the live value from the indexing-time preparer.

  • token_counter (#count, nil) (defaults to: nil)

    Optional exact tokenizer (typically Embedding::TokenCounter). When provided, token estimation uses the model's real WordPiece/BPE output instead of the `chars / chars_per_token` heuristic. This matters most for the Ollama path, where ratios vary widely across Rails source (1.5–2.5). The heuristic remains the fallback when the counter is nil or the tokenizer gem isn't installed.
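The counter-versus-heuristic choice can be pictured with the minimal sketch below. Only the `#count` duck type comes from the parameter docs; that `count` takes the text as its single argument is an assumption, and `WhitespaceCounter` and `sketch_token_estimate` are hypothetical stand-ins, not part of the library.

```ruby
# Hypothetical stand-in for an exact tokenizer. A real counter would
# run the model's tokenizer; this one just counts whitespace-separated words.
class WhitespaceCounter
  def count(text)
    text.split.size
  end
end

# Prefer the injected counter; otherwise fall back to the ratio heuristic.
def sketch_token_estimate(text, chars_per_token:, token_counter: nil)
  return token_counter.count(text) if token_counter

  (text.length / chars_per_token.to_f).ceil
end

text = "def show; end"

sketch_token_estimate(text, chars_per_token: 4.0)
# => 4 (13 characters / 4.0, rounded up)
sketch_token_estimate(text, chars_per_token: 4.0, token_counter: WhitespaceCounter.new)
# => 3
```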



# File 'lib/woods/retrieval/context_assembler.rb', line 62

def initialize(metadata_store:, budget: DEFAULT_BUDGET,
               chars_per_token: DEFAULT_CHARS_PER_TOKEN,
               token_counter: nil)
  @metadata_store = metadata_store
  @budget = budget
  # Guard against 0 / negative / NaN ratios — any of those would make
  # `estimate_tokens` div-by-zero or return a negative budget, which
  # would silently truncate every section to empty. Fall back to the
  # default ratio rather than propagate the bogus input.
  ratio = chars_per_token.to_f
  @chars_per_token = ratio.positive? ? ratio : DEFAULT_CHARS_PER_TOKEN
  @token_counter = token_counter
end
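The ratio guard in the constructor can be exercised in isolation. The sketch below copies the same `to_f` / `positive?` check into a hypothetical helper (`sketch_sanitize_ratio`, with `SKETCH_DEFAULT_RATIO` standing in for the 4.0 default) to show which inputs fall back.

```ruby
# Hypothetical stand-in for DEFAULT_CHARS_PER_TOKEN.
SKETCH_DEFAULT_RATIO = 4.0

def sketch_sanitize_ratio(chars_per_token)
  ratio = chars_per_token.to_f
  # Float#positive? is false for 0.0, negatives, and NaN alike.
  ratio.positive? ? ratio : SKETCH_DEFAULT_RATIO
end

sketch_sanitize_ratio(2.5)        # => 2.5
sketch_sanitize_ratio(0)          # => 4.0
sketch_sanitize_ratio(-1)         # => 4.0
sketch_sanitize_ratio(Float::NAN) # => 4.0
sketch_sanitize_ratio(nil)        # => 4.0 (nil.to_f is 0.0)
```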

Instance Attribute Details

#chars_per_token ⇒ Float (readonly)

Returns the configured chars-per-token ratio.

Returns:

  • (Float)

    the configured chars-per-token ratio



# File 'lib/woods/retrieval/context_assembler.rb', line 77

def chars_per_token
  @chars_per_token
end

#token_counter ⇒ #count, nil (readonly)

Returns the exact tokenizer, if one was injected.

Returns:

  • (#count, nil)

    the exact tokenizer, if one was injected



# File 'lib/woods/retrieval/context_assembler.rb', line 80

def token_counter
  @token_counter
end

Instance Method Details

#assemble(candidates:, classification:, structural_context: nil, budget: nil) ⇒ AssembledContext

Assemble context from ranked candidates within token budget.

Parameters:

  • candidates (Array<Candidate>)

    Ranked search candidates

  • classification (QueryClassifier::Classification)

    Query classification

  • structural_context (String, nil) (defaults to: nil)

    Optional codebase overview text

  • budget (Integer, nil) (defaults to: nil)

    Per-call token budget override; when nil, the budget given to the constructor is used

Returns:

  • (AssembledContext)

    the assembled result, exposing context, tokens_used, and sections



# File 'lib/woods/retrieval/context_assembler.rb', line 89

def assemble(candidates:, classification:, structural_context: nil, budget: nil)
  effective_budget = budget || @budget
  sections = []
  sources = []
  tokens_used = 0

  # Collapse +User#chunk_0+, +User#chunk_1+, … back to their base unit
  # BEFORE metadata lookup and section assembly. Chunk IDs are an
  # embedding-side concern — the metadata store is keyed by the base
  # identifier, and callers don't want the same unit formatted twice
  # just because multiple chunks matched the query.
  candidates = collapse_chunk_candidates(candidates)

  # Pre-fetch all candidate metadata in one batch query
  @unit_cache = @metadata_store.find_batch(candidates.map(&:identifier))

  # 1. Structural context (always first if provided)
  tokens_used = add_structural_section(sections, structural_context, tokens_used, effective_budget)

  # 2. Compute per-section budgets from remaining tokens
  budgets = compute_section_budgets(effective_budget - tokens_used, classification)

  # 3. Primary, supporting, and framework sections
  add_candidate_section(sections, sources, :primary,
                        candidates.reject { |c| c.source == :graph_expansion }, budgets[:primary])
  add_candidate_section(sections, sources, :supporting,
                        candidates.select { |c| c.source == :graph_expansion }, budgets[:supporting])
  if budgets[:framework].positive?
    add_candidate_section(sections, sources, :framework,
                          candidates.select { |c| framework_candidate?(c) }, budgets[:framework])
  end

  build_result(sections, sources, effective_budget)
end
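The chunk-collapsing step and the primary/supporting split can be sketched as follows. This is not the private `collapse_chunk_candidates` itself, which is not shown on this page. The `#chunk_N` suffix format comes from the comment in the listing; `SketchCandidate`, its `score` field, and the keep-the-best-scoring-chunk rule are assumptions made for illustration.

```ruby
# Hypothetical candidate shape; only identifier and source appear in the listing.
SketchCandidate = Struct.new(:identifier, :score, :source, keyword_init: true)

# Strip a trailing "#chunk_N" and keep one candidate per base identifier,
# preserving the order in which each base unit first appears in the ranking.
def sketch_collapse_chunks(candidates)
  candidates
    .group_by { |c| c.identifier.sub(/#chunk_\d+\z/, "") }
    .map do |base_id, group|
      best = group.max_by(&:score)
      SketchCandidate.new(identifier: base_id, score: best.score, source: best.source)
    end
end

ranked = [
  SketchCandidate.new(identifier: "User#chunk_1", score: 0.9, source: :vector),
  SketchCandidate.new(identifier: "Order", score: 0.8, source: :graph_expansion),
  SketchCandidate.new(identifier: "User#chunk_0", score: 0.7, source: :vector)
]

collapsed = sketch_collapse_chunks(ranked)
collapsed.map(&:identifier) # => ["User", "Order"]

# Same split that #assemble applies: graph-expanded units become supporting context.
supporting, primary = collapsed.partition { |c| c.source == :graph_expansion }
primary.map(&:identifier)    # => ["User"]
supporting.map(&:identifier) # => ["Order"]
```

Collapsing before the batch lookup means the metadata store is queried once per base unit rather than once per matching chunk.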