Module: SqlChatbot::Grammar::IntentExtractor

Defined in:: lib/sql_chatbot/grammar/intent_extractor.rb

Constant Summary collapse

PRIMITIVE_DESCRIPTIONS =

<<~TEXT.strip
  COUNT — how many rows of X
  LIST — show/list rows of X
  SUM — total of numeric field on X
  AVG — average of numeric field on X
  MIN_MAX — lowest/highest value of field on X
  TOP_N — top N rows of X ordered by a ranking field
  RANK — window-ranked rows of X within groups
TEXT

MODIFIER_DESCRIPTIONS =

<<~TEXT.strip
  where     — filter by a field value (op: eq/neq/lt/lte/gt/gte/like/in)
              shape: {"kind":"where","field":"<col>","op":"<op>","value":<lit>}
  time      — filter by time window on a timestamp field
              shape: {"kind":"time","field":"<timestamp_col>","window":"<keyword>"}
              windows: today | yesterday | last_7_days | last_30_days | this_week | this_month | this_year
  join      — include a related entity via association
              shape: {"kind":"join","association":"<assoc_name>"}
  group_by  — group results by field
              shape: {"kind":"group_by","field":"<col>"}
  having    — filter grouped results by aggregate op+value
  order_by  — order results by field
              shape: {"kind":"order_by","field":"<col>","direction":"asc|desc"}
  limit     — cap result count
              shape: {"kind":"limit","value":<int>}
  distinct  — deduplicate rows
              shape: {"kind":"distinct"}
  scope     — apply a named scope from the entity. Use this when the user's word
              matches an entity scope name; the grammar compiles the filter for you.
              shape: {"kind":"scope","name":"<scope_name>"}
TEXT

Class Method Summary collapse

.build_system_prompt ⇒ Object
.build_user_prompt(question, candidates, history) ⇒ Object
.extract(question:, registry:, history:, call_llm:, confidence_threshold: 0.7) ⇒ Object

call_llm: a proc/lambda taking Array<Hashcontent:> and returning the raw LLM string.
.format_entity_brief(candidate, is_primary = false) ⇒ Object

Class Method Details

.build_system_prompt ⇒ `Object`

# File 'lib/sql_chatbot/grammar/intent_extractor.rb', line 111

def self.build_system_prompt
  <<~PROMPT.strip
    You are an intent classifier for a SQL chatbot. Given a user question and a list of available entities, extract a structured intent.

    Primitives:
    #{PRIMITIVE_DESCRIPTIONS}

    Modifiers:
    #{MODIFIER_DESCRIPTIONS}

    You MUST output JSON. If the question doesn't fit any primitive cleanly, set status=unmatched.

    Output shape — generic placeholders, fill in real values from the candidates:
    - COUNT: {"status":"matched","primitive":"COUNT","entity":"<E>","modifiers":[],"confidence":0.9}
    - LIST:  {"status":"matched","primitive":"LIST","entity":"<E>","modifiers":[],"confidence":0.9}
    - SUM:   {"status":"matched","primitive":"SUM","entity":"<E>","field":"<numeric_col>","modifiers":[],"confidence":0.9}
    - AVG:   {"status":"matched","primitive":"AVG","entity":"<E>","field":"<numeric_col>","modifiers":[],"confidence":0.9}
    - MIN_MAX: {"status":"matched","primitive":"MIN_MAX","entity":"<E>","field":"<numeric_col>","which":"MAX","modifiers":[],"confidence":0.9}
    - TOP_N: {"status":"matched","primitive":"TOP_N","entity":"<E>","rankField":"<numeric_col>","n":5,"modifiers":[],"confidence":0.9}
    - unmatched: {"status":"unmatched","confidence":0.3,"reason":"<short reason>"}

    SUM/AVG/MIN_MAX REQUIRE a "field" slot (a top-level "field" key, not inside modifiers).
    TOP_N REQUIRES a "rankField" (a top-level "rankField" key, not an order_by modifier).
    Pick "<E>" by reading the entity candidates — never invent a name. Pluralized
    forms in the user's question ("orders", "users") map to the singular entity
    name from the candidates list.

    Casual phrasing maps to primitives like this (use the same patterns when the user speaks informally):
    - "stuff we got" / "what we have" / "show me X" → LIST X
    - "how many", "count of", "number of" → COUNT
    - "biggest" / "largest" / "top" / "smallest" / "lowest" / "least" → TOP_N (returns the row with name + value). Use TOP_N for ANY question asking for an entity (a row) ranked by some field — even when n=1 ("smallest repo", "biggest order"). For ASC ranking, set order_by direction=asc; for DESC, direction=desc.
    - MIN_MAX is ONLY for questions asking for the VALUE of an aggregate, not the row. "what's the highest price?" → MIN_MAX of price. "biggest order" → TOP_N of orders (the user wants the row, not just a number).
    - "how's the team doing" / "how are things looking" → unmatched (vague — no clear primitive)

    Rules:
    - Only reference entities from the provided candidates. Never invent.
    - Only use fields listed under the chosen entity.
    - For enum filters, pass the enum key (e.g. "active"), NOT the underlying integer.
    - When the user's word matches an enum-value KEY of any field on the chosen entity (case-insensitive), bind it via a "where" modifier with op:"eq" and value:"<the-key>". Example: user says "urgent" and field priority_id has enum={"Urgent":4,...} → emit {"kind":"where","field":"priority_id","op":"eq","value":"Urgent"}.
    - When the user's word matches a scope name listed under the entity, prefer a {"kind":"scope","name":"<name>"} modifier over reconstructing the scope's filter manually. The grammar already knows how to compile the scope.
    - Set confidence honestly: 0.9+ only when every slot has a clear mapping. If the question is vague and you're guessing the entity or aggregation, return unmatched.
    - ENTITY-BINDING RULE: If a user word matches the canonical name OR any listed alias of exactly one candidate entity, the entity IS bound. Commit to a primitive — default to COUNT for "any/how many/got any" phrasings and LIST for "show me/list/which". Do NOT return unmatched just because the question is casual; the alias resolved the ambiguity. The "When in doubt, return unmatched" rule applies to ambiguous PRIMITIVES (e.g., "what's up with X"), not to ambiguous ENTITIES — entity ambiguity is already resolved by the candidate set.
    - PRIMARY MATCH RULE: When a candidate is annotated "[PRIMARY MATCH — prefer this entity ...]", use that entity unless the question explicitly names something only a different candidate has. Do NOT pick a non-primary candidate just because it has more rows or a more specific token match — the scorer already weighed those factors. Example: "smallest repo" with the "repository" entity marked PRIMARY (alias match) and "repo_unit" as alternate (token match) → pick "repository". The user said "repo", and "repo" is the alias of "repository".
    - When a candidate is annotated with "fuzzy-matched from '<typo>' → '<entity>'", the entity has already been chosen as the resolution for that typed word. Use this entity at confidence ≥ 0.85. Do NOT pick a different candidate just because its name appears to contain the user's word — unrelated tables (audit logs, bot tables, event tables) can have similar names. The fuzzy resolution was done before you saw these candidates.
    - For any time phrasing ("today", "yesterday", "last 7 days", "last week", "this week", "last month", "this year", etc.), use a "time" modifier with a named window. NEVER put a literal date string OR a window keyword into a "where" modifier value — both produce broken SQL. Supported windows: today, yesterday, last_7_days, last_30_days, this_week, this_month, this_year. The "time" modifier shape is {"kind":"time","field":"<timestamp_col>","window":"<keyword>"} — do NOT use "value" instead of "window", and ALWAYS include "field". If the user names a window not in this list, return unmatched.
    - When in doubt, return unmatched.
  PROMPT
end

.build_user_prompt(question, candidates, history) ⇒ `Object`

# File 'lib/sql_chatbot/grammar/intent_extractor.rb', line 160

def self.build_user_prompt(question, candidates, history)
  history_text = (history || []).last(2).map { |m| "#{m[:role] || m['role']}: #{m[:content] || m['content']}" }.join("\n")
  # Mark the top candidate PRIMARY when there's more than one. Mirror
  # of TS buildUserPrompt — see comments there for the rationale.
  entity_blocks = candidates.each_with_index.map { |c, i| format_entity_brief(c, i.zero? && candidates.length > 1) }.join("\n\n")
  "History:\n#{history_text}\n\nEntity candidates:\n#{entity_blocks}\n\nQuestion: #{question}"
end

.extract(question:, registry:, history:, call_llm:, confidence_threshold: 0.7) ⇒ `Object`

call_llm: a proc/lambda taking Array<Hashcontent:> and returning the raw LLM string. Returns a Hash with :status and related keys (mirrors TS Intent discriminated union).

# File 'lib/sql_chatbot/grammar/intent_extractor.rb', line 44

def self.extract(question:, registry:, history:, call_llm:, confidence_threshold: 0.7)
  candidates = EntityCandidates.select_with_meta(question: question, registry: registry, top_n: 5)
  messages = [
    { role: "system", content: build_system_prompt },
    { role: "user", content: build_user_prompt(question, candidates, history) },
  ]
  raw =
    begin
      call_llm.call(messages)
    rescue => e
      return { status: "unmatched", confidence: 0, reason: "llm_error: #{e.message}" }
    end

  parsed =
    begin
      JSON.parse(raw)
    rescue JSON::ParserError
      return { status: "unmatched", confidence: 0, reason: "malformed_json" }
    end

  if parsed["status"] == "unmatched"
    return {
      status: "unmatched",
      confidence: parsed["confidence"] || 0,
      reason: parsed["reason"] || "unmatched",
    }
  end

  conf = parsed["confidence"]
  if !conf.is_a?(Numeric) || conf < confidence_threshold
    return { status: "unmatched", confidence: conf || 0, reason: "low_confidence:#{conf}" }
  end

  # V1.3-R: validate that the LLM's chosen entity is actually in the
  # candidate set we showed it. See TS extractIntent for rationale.
  if parsed["entity"].is_a?(String)
    candidate_names = Set.new
    candidates.each do |c|
      candidate_names << c[:entity].name
      registry.aliases.each do |alias_term, target|
        candidate_names << alias_term if target == c[:entity].name
      end
    end
    unless candidate_names.include?(parsed["entity"])
      return {
        status: "unmatched",
        confidence: conf,
        reason: "entity_not_in_candidates: '#{parsed["entity"]}' not among #{candidate_names.first(5).to_a.join(", ")}",
      }
    end
  end

  # Normalize keys to symbols for consumers
  {
    status: "matched",
    primitive: parsed["primitive"]&.to_sym,
    entity: parsed["entity"],
    modifiers: (parsed["modifiers"] || []).map { |m| m.transform_keys(&:to_sym) },
    field: parsed["field"],
    which: parsed["which"]&.to_sym,
    n: parsed["n"],
    rank_field: parsed["rankField"] || parsed["rank_field"],
    group_by: parsed["groupBy"] || parsed["group_by"],
    confidence: conf,
  }.compact
end

.format_entity_brief(candidate, is_primary = false) ⇒ `Object`

# File 'lib/sql_chatbot/grammar/intent_extractor.rb', line 168

def self.format_entity_brief(candidate, is_primary = false)
  e = candidate[:entity]
  fields = e.fields.first(15).map do |n, f|
    enum_part = f.enum_values ? " enum=#{JSON.generate(f.enum_values)}" : ""
    "  #{n}: #{f.type}#{enum_part}"
  end.join("\n")
  assocs = e.associations.keys.first(8).join(", ")
  assocs = "(none)" if assocs.empty?
  scope_names = e.scopes.keys.first(8)
  scopes_text =
    if scope_names.any?
      %(Scopes (when a user word matches one of these names, emit {"kind":"scope","name":"<name>"} as a modifier): #{scope_names.join(", ")})
    else
      "Scopes: (none)"
    end
  fuzzy_hint = candidate[:fuzzy_match] ?
    %(\nNOTE: fuzzy-matched from "#{candidate[:fuzzy_match][:typed]}" → "#{candidate[:fuzzy_match][:corrected]}" (treat the user word as a typo of this entity).) :
    ""
  primary_marker = is_primary ? "  [PRIMARY MATCH — prefer this entity unless the question explicitly names a different one]\n" : ""
  "Entity: #{e.name} (table=#{e.table}, rows=#{e.row_count})#{fuzzy_hint}\n#{primary_marker}Fields:\n#{fields}\nAssociations: #{assocs}\n#{scopes_text}"
end

Module: SqlChatbot::Grammar::IntentExtractor

Constant Summary collapse

Class Method Summary collapse

Class Method Details

.build_system_prompt ⇒ Object

.build_user_prompt(question, candidates, history) ⇒ Object

.extract(question:, registry:, history:, call_llm:, confidence_threshold: 0.7) ⇒ Object

.format_entity_brief(candidate, is_primary = false) ⇒ Object

.build_system_prompt ⇒ `Object`

.build_user_prompt(question, candidates, history) ⇒ `Object`

.extract(question:, registry:, history:, call_llm:, confidence_threshold: 0.7) ⇒ `Object`

.format_entity_brief(candidate, is_primary = false) ⇒ `Object`