Module: SqlChatbot::Grammar::IntentExtractor
- Defined in:
- lib/sql_chatbot/grammar/intent_extractor.rb
Constant Summary collapse
- PRIMITIVE_DESCRIPTIONS =
<<~TEXT.strip COUNT — how many rows of X LIST — show/list rows of X SUM — total of numeric field on X AVG — average of numeric field on X MIN_MAX — lowest/highest value of field on X TOP_N — top N rows of X ordered by a ranking field RANK — window-ranked rows of X within groups TEXT
- MODIFIER_DESCRIPTIONS =
<<~TEXT.strip where — filter by a field value (op: eq/neq/lt/lte/gt/gte/like/in) shape: {"kind":"where","field":"<col>","op":"<op>","value":<lit>} time — filter by time window on a timestamp field shape: {"kind":"time","field":"<timestamp_col>","window":"<keyword>"} windows: today | yesterday | last_7_days | last_30_days | this_week | this_month | this_year join — include a related entity via association shape: {"kind":"join","association":"<assoc_name>"} group_by — group results by field shape: {"kind":"group_by","field":"<col>"} having — filter grouped results by aggregate op+value order_by — order results by field shape: {"kind":"order_by","field":"<col>","direction":"asc|desc"} limit — cap result count shape: {"kind":"limit","value":<int>} distinct — deduplicate rows shape: {"kind":"distinct"} scope — apply a named scope from the entity. Use this when the user's word matches an entity scope name; the grammar compiles the filter for you. shape: {"kind":"scope","name":"<scope_name>"} TEXT
Class Method Summary collapse
- .build_system_prompt ⇒ Object
- .build_user_prompt(question, candidates, history) ⇒ Object
-
.extract(question:, registry:, history:, call_llm:, confidence_threshold: 0.7) ⇒ Object
call_llm: a proc/lambda taking Array<Hashcontent:> and returning the raw LLM string.
- .format_entity_brief(candidate, is_primary = false) ⇒ Object
Class Method Details
.build_system_prompt ⇒ Object
111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 |
# File 'lib/sql_chatbot/grammar/intent_extractor.rb', line 111 def self.build_system_prompt <<~PROMPT.strip You are an intent classifier for a SQL chatbot. Given a user question and a list of available entities, extract a structured intent. Primitives: #{PRIMITIVE_DESCRIPTIONS} Modifiers: #{MODIFIER_DESCRIPTIONS} You MUST output JSON. If the question doesn't fit any primitive cleanly, set status=unmatched. Output shape — generic placeholders, fill in real values from the candidates: - COUNT: {"status":"matched","primitive":"COUNT","entity":"<E>","modifiers":[],"confidence":0.9} - LIST: {"status":"matched","primitive":"LIST","entity":"<E>","modifiers":[],"confidence":0.9} - SUM: {"status":"matched","primitive":"SUM","entity":"<E>","field":"<numeric_col>","modifiers":[],"confidence":0.9} - AVG: {"status":"matched","primitive":"AVG","entity":"<E>","field":"<numeric_col>","modifiers":[],"confidence":0.9} - MIN_MAX: {"status":"matched","primitive":"MIN_MAX","entity":"<E>","field":"<numeric_col>","which":"MAX","modifiers":[],"confidence":0.9} - TOP_N: {"status":"matched","primitive":"TOP_N","entity":"<E>","rankField":"<numeric_col>","n":5,"modifiers":[],"confidence":0.9} - unmatched: {"status":"unmatched","confidence":0.3,"reason":"<short reason>"} SUM/AVG/MIN_MAX REQUIRE a "field" slot (a top-level "field" key, not inside modifiers). TOP_N REQUIRES a "rankField" (a top-level "rankField" key, not an order_by modifier). Pick "<E>" by reading the entity candidates — never invent a name. Pluralized forms in the user's question ("orders", "users") map to the singular entity name from the candidates list. Casual phrasing maps to primitives like this (use the same patterns when the user speaks informally): - "stuff we got" / "what we have" / "show me X" → LIST X - "how many", "count of", "number of" → COUNT - "biggest" / "largest" / "top" / "smallest" / "lowest" / "least" → TOP_N (returns the row with name + value). Use TOP_N for ANY question asking for an entity (a row) ranked by some field — even when n=1 ("smallest repo", "biggest order"). For ASC ranking, set order_by direction=asc; for DESC, direction=desc. - MIN_MAX is ONLY for questions asking for the VALUE of an aggregate, not the row. "what's the highest price?" → MIN_MAX of price. "biggest order" → TOP_N of orders (the user wants the row, not just a number). - "how's the team doing" / "how are things looking" → unmatched (vague — no clear primitive) Rules: - Only reference entities from the provided candidates. Never invent. - Only use fields listed under the chosen entity. - For enum filters, pass the enum key (e.g. "active"), NOT the underlying integer. - When the user's word matches an enum-value KEY of any field on the chosen entity (case-insensitive), bind it via a "where" modifier with op:"eq" and value:"<the-key>". Example: user says "urgent" and field priority_id has enum={"Urgent":4,...} → emit {"kind":"where","field":"priority_id","op":"eq","value":"Urgent"}. - When the user's word matches a scope name listed under the entity, prefer a {"kind":"scope","name":"<name>"} modifier over reconstructing the scope's filter manually. The grammar already knows how to compile the scope. - Set confidence honestly: 0.9+ only when every slot has a clear mapping. If the question is vague and you're guessing the entity or aggregation, return unmatched. - ENTITY-BINDING RULE: If a user word matches the canonical name OR any listed alias of exactly one candidate entity, the entity IS bound. Commit to a primitive — default to COUNT for "any/how many/got any" phrasings and LIST for "show me/list/which". Do NOT return unmatched just because the question is casual; the alias resolved the ambiguity. The "When in doubt, return unmatched" rule applies to ambiguous PRIMITIVES (e.g., "what's up with X"), not to ambiguous ENTITIES — entity ambiguity is already resolved by the candidate set. - PRIMARY MATCH RULE: When a candidate is annotated "[PRIMARY MATCH — prefer this entity ...]", use that entity unless the question explicitly names something only a different candidate has. Do NOT pick a non-primary candidate just because it has more rows or a more specific token match — the scorer already weighed those factors. Example: "smallest repo" with the "repository" entity marked PRIMARY (alias match) and "repo_unit" as alternate (token match) → pick "repository". The user said "repo", and "repo" is the alias of "repository". - When a candidate is annotated with "fuzzy-matched from '<typo>' → '<entity>'", the entity has already been chosen as the resolution for that typed word. Use this entity at confidence ≥ 0.85. Do NOT pick a different candidate just because its name appears to contain the user's word — unrelated tables (audit logs, bot tables, event tables) can have similar names. The fuzzy resolution was done before you saw these candidates. - For any time phrasing ("today", "yesterday", "last 7 days", "last week", "this week", "last month", "this year", etc.), use a "time" modifier with a named window. NEVER put a literal date string OR a window keyword into a "where" modifier value — both produce broken SQL. Supported windows: today, yesterday, last_7_days, last_30_days, this_week, this_month, this_year. The "time" modifier shape is {"kind":"time","field":"<timestamp_col>","window":"<keyword>"} — do NOT use "value" instead of "window", and ALWAYS include "field". If the user names a window not in this list, return unmatched. - When in doubt, return unmatched. PROMPT end |
.build_user_prompt(question, candidates, history) ⇒ Object
160 161 162 163 164 165 166 |
# File 'lib/sql_chatbot/grammar/intent_extractor.rb', line 160 def self.build_user_prompt(question, candidates, history) history_text = (history || []).last(2).map { |m| "#{m[:role] || m['role']}: #{m[:content] || m['content']}" }.join("\n") # Mark the top candidate PRIMARY when there's more than one. Mirror # of TS buildUserPrompt — see comments there for the rationale. entity_blocks = candidates.each_with_index.map { |c, i| format_entity_brief(c, i.zero? && candidates.length > 1) }.join("\n\n") "History:\n#{history_text}\n\nEntity candidates:\n#{entity_blocks}\n\nQuestion: #{question}" end |
.extract(question:, registry:, history:, call_llm:, confidence_threshold: 0.7) ⇒ Object
call_llm: a proc/lambda taking Array<Hashcontent:> and returning the raw LLM string. Returns a Hash with :status and related keys (mirrors TS Intent discriminated union).
44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 |
# File 'lib/sql_chatbot/grammar/intent_extractor.rb', line 44 def self.extract(question:, registry:, history:, call_llm:, confidence_threshold: 0.7) candidates = EntityCandidates.(question: question, registry: registry, top_n: 5) = [ { role: "system", content: build_system_prompt }, { role: "user", content: build_user_prompt(question, candidates, history) }, ] raw = begin call_llm.call() rescue => e return { status: "unmatched", confidence: 0, reason: "llm_error: #{e.}" } end parsed = begin JSON.parse(raw) rescue JSON::ParserError return { status: "unmatched", confidence: 0, reason: "malformed_json" } end if parsed["status"] == "unmatched" return { status: "unmatched", confidence: parsed["confidence"] || 0, reason: parsed["reason"] || "unmatched", } end conf = parsed["confidence"] if !conf.is_a?(Numeric) || conf < confidence_threshold return { status: "unmatched", confidence: conf || 0, reason: "low_confidence:#{conf}" } end # V1.3-R: validate that the LLM's chosen entity is actually in the # candidate set we showed it. See TS extractIntent for rationale. if parsed["entity"].is_a?(String) candidate_names = Set.new candidates.each do |c| candidate_names << c[:entity].name registry.aliases.each do |alias_term, target| candidate_names << alias_term if target == c[:entity].name end end unless candidate_names.include?(parsed["entity"]) return { status: "unmatched", confidence: conf, reason: "entity_not_in_candidates: '#{parsed["entity"]}' not among #{candidate_names.first(5).to_a.join(", ")}", } end end # Normalize keys to symbols for consumers { status: "matched", primitive: parsed["primitive"]&.to_sym, entity: parsed["entity"], modifiers: (parsed["modifiers"] || []).map { |m| m.transform_keys(&:to_sym) }, field: parsed["field"], which: parsed["which"]&.to_sym, n: parsed["n"], rank_field: parsed["rankField"] || parsed["rank_field"], group_by: parsed["groupBy"] || parsed["group_by"], confidence: conf, }.compact end |
.format_entity_brief(candidate, is_primary = false) ⇒ Object
168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 |
# File 'lib/sql_chatbot/grammar/intent_extractor.rb', line 168 def self.format_entity_brief(candidate, is_primary = false) e = candidate[:entity] fields = e.fields.first(15).map do |n, f| enum_part = f.enum_values ? " enum=#{JSON.generate(f.enum_values)}" : "" " #{n}: #{f.type}#{enum_part}" end.join("\n") assocs = e.associations.keys.first(8).join(", ") assocs = "(none)" if assocs.empty? scope_names = e.scopes.keys.first(8) scopes_text = if scope_names.any? %(Scopes (when a user word matches one of these names, emit {"kind":"scope","name":"<name>"} as a modifier): #{scope_names.join(", ")}) else "Scopes: (none)" end fuzzy_hint = candidate[:fuzzy_match] ? %(\nNOTE: fuzzy-matched from "#{candidate[:fuzzy_match][:typed]}" → "#{candidate[:fuzzy_match][:corrected]}" (treat the user word as a typo of this entity).) : "" primary_marker = is_primary ? " [PRIMARY MATCH — prefer this entity unless the question explicitly names a different one]\n" : "" "Entity: #{e.name} (table=#{e.table}, rows=#{e.row_count})#{fuzzy_hint}\n#{primary_marker}Fields:\n#{fields}\nAssociations: #{assocs}\n#{scopes_text}" end |