Module: SqlChatbot::Grammar::EntityCandidates

Defined in:
lib/sql_chatbot/grammar/entity_candidates.rb

Class Method Summary collapse

Class Method Details

.levenshtein(a, b) ⇒ Object

Damerau-Levenshtein edit distance: insertions, deletions, substitutions, and adjacent transpositions. Transposition counted as 1 (vs 2 in plain Levenshtein) because keyboard typos like “lables” ↔ “labels” are extremely common and should match at distance 1.



102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
# File 'lib/sql_chatbot/grammar/entity_candidates.rb', line 102

def self.levenshtein(a, b)
  # Damerau-Levenshtein (optimal string alignment variant): unit cost for
  # insert, delete, substitute, and adjacent transposition. Full (m+1)x(n+1)
  # table; dist[i][j] is the distance between a[0, i] and b[0, j].
  return 0 if a == b
  return b.length if a.empty?
  return a.length if b.empty?

  m = a.length
  n = b.length
  dist = Array.new(m + 1) { Array.new(n + 1, 0) }
  (0..m).each { |i| dist[i][0] = i }
  (0..n).each { |j| dist[0][j] = j }

  1.upto(m) do |i|
    1.upto(n) do |j|
      substitution = a[i - 1] == b[j - 1] ? 0 : 1
      best = [
        dist[i - 1][j] + 1,                  # deletion
        dist[i][j - 1] + 1,                  # insertion
        dist[i - 1][j - 1] + substitution,   # substitution / match
      ].min
      # Adjacent transposition ("lables" <-> "labels") counts as one edit.
      if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
        transposed = dist[i - 2][j - 2] + 1
        best = transposed if transposed < best
      end
      dist[i][j] = best
    end
  end
  dist[m][n]
end

.name_segments(entity) ⇒ Object



131
132
133
# File 'lib/sql_chatbot/grammar/entity_candidates.rb', line 131

def self.name_segments(entity)
  # Number of underscore-separated segments in the entity name
  # (e.g. "projects_project" -> 2). Used as a sort tie-breaker.
  segments = entity.name.to_s.split("_")
  segments.size
end

.pluralize_simple(word) ⇒ Object



125
126
127
128
129
# File 'lib/sql_chatbot/grammar/entity_candidates.rb', line 125

def self.pluralize_simple(word)
  # Minimal English pluralizer — covers the common table-name cases only
  # ("box" -> "boxes", "city" -> "cities", "user" -> "users").
  if word.end_with?("s", "x", "ch", "sh")
    "#{word}es"
  elsif word.end_with?("y") && !word[-2]&.match?(/[aeiou]/)
    # Consonant + "y" -> "ies"; safe-navigation keeps the bare "y" edge case.
    "#{word[0..-2]}ies"
  else
    "#{word}s"
  end
end

.score_entity(question, entity, registry) ⇒ Object

Score an entity against the question. Tokenizes the name on `_` so e.g. the `projects_project` token “project” matches the question “how many projects”. Tie-breakers: fewer name segments, then higher row count.



12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
# File 'lib/sql_chatbot/grammar/entity_candidates.rb', line 12

def self.score_entity(question, entity, registry)
  # Score `entity` against `question`. Returns `[score, fuzzy]` where
  # `fuzzy` is nil or `{ typed:, corrected: }` when the only signal was a
  # typo-tolerant match. Higher score = stronger binding.
  text = question.to_s.downcase
  name = entity.name.to_s.downcase
  table = entity.table.to_s.downcase
  total = 0

  # Standalone singular name is the strongest direct signal.
  standalone =
    text.include?(" #{name} ") ||
    text.start_with?("#{name} ") ||
    text.end_with?(" #{name}")
  total += 12 if standalone
  total += 10 if text.include?(table)
  total += 5 if text.include?(name)

  # Registered aliases that point at this entity.
  registry.aliases.each do |term, target|
    total += 8 if target == entity.name && text.include?(term.to_s.downcase)
  end

  # Token-level matching for compound names like `projects_project`.
  tokens = name.split("_").select { |t| t.length >= 3 }
  tokens.each do |tok|
    pattern = /\b(#{Regexp.escape(tok)}|#{Regexp.escape(pluralize_simple(tok))})\b/
    total += 4 if pattern.match?(text)
  end

  # Whitespace-collapsed match — length-weighted so longer matches win.
  squeezed = text.gsub(/\s+/, "")
  longest = 0
  tokens.each do |tok|
    next if tok.length < 5
    [tok, pluralize_simple(tok)].each do |candidate|
      longest = candidate.length if candidate.length > longest && squeezed.include?(candidate)
    end
  end
  total += longest

  # Fuzzy fallback for typos, only when nothing above fired. Two tiers:
  # aliases (score 5 — strongest, same tier as exact-alias just approximate),
  # then name/plural/tokens (score 3 — weaker, tokens recur across entities
  # in compound-named schemas). Minimum length 4 catches "usrs" -> "users"
  # while rejecting 3-char noise; distance <= 25% of length filters further.
  min_len = 4
  if total.zero?
    alias_targets = []
    registry.aliases.each do |term, target|
      next unless target == entity.name
      alias_targets << term.to_s.downcase if term.to_s.length >= min_len
    end

    name_targets = []
    name_targets << name  if name.length >= min_len
    name_targets << table if table.length >= min_len
    tokens.each do |tok|
      next if tok.length < min_len
      name_targets << tok
      plural_tok = pluralize_simple(tok)
      name_targets << plural_tok if plural_tok.length >= min_len
    end

    words = text.split(/\W+/).select { |w| w.length >= min_len }
    [[alias_targets, 5], [name_targets, 3]].each do |targets, points|
      words.each do |word|
        targets.each do |target|
          dist = levenshtein(word, target)
          next if dist.zero?
          longest_side = [word.length, target.length].max
          if dist <= 2 && dist.to_f / longest_side <= 0.25
            return [points, { typed: word, corrected: target }]
          end
        end
      end
    end
  end

  [total, nil]
end

.select(question:, registry:, top_n:) ⇒ Object



135
136
137
# File 'lib/sql_chatbot/grammar/entity_candidates.rb', line 135

def self.select(question:, registry:, top_n:)
  # Convenience wrapper: same candidates as select_with_meta, entities only.
  rows = select_with_meta(question: question, registry: registry, top_n: top_n)
  rows.map { |row| row[:entity] }
end

.select_with_meta(question:, registry:, top_n:) ⇒ Object

Returns rows of `{ entity:, score:, fuzzy_match: nil | { typed:, corrected: } }`. Used by the intent-extractor prompt to tell the LLM “the user word `<typo>` is likely a typo of `<entity>`” so it commits to the candidate instead of returning unmatched on a stray typo.



143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
# File 'lib/sql_chatbot/grammar/entity_candidates.rb', line 143

def self.select_with_meta(question:, registry:, top_n:)
  # Rank all registry entities against the question and return up to `top_n`
  # rows of `{ entity:, score:, fuzzy_match: }`. The fuzzy annotation lets
  # the intent-extractor prompt flag "<typo> is likely <entity>" so the LLM
  # commits instead of bailing on a stray typo.
  scored = registry.entities.values.map do |entity|
    points, fuzzy = score_entity(question, entity, registry)
    {
      entity: entity,
      score: points,
      fuzzy_match: fuzzy,
      segments: name_segments(entity),
      row_count: entity.row_count,
    }
  end

  # Best score first; ties broken by fewer name segments, then bigger table.
  scored = scored.sort_by { |row| [-row[:score], row[:segments], -row[:row_count]] }

  # Nothing matched at all: fall back to the largest tables by row count.
  if scored.first && scored.first[:score].zero?
    fallback = registry.entities.values.sort_by { |e| -e.row_count }.first(top_n)
    return fallback.map { |entity| { entity: entity, score: 0, fuzzy_match: nil } }
  end

  # When a typed word resolved strongly via an alias (score 5), drop OTHER
  # candidates that only scored via a weaker token-fuzzy match (score 3) on
  # that same word — they are unrelated tables whose presence tempts the LLM
  # to override the resolution.
  strong_typed = {}
  scored.each do |row|
    fuzzy = row[:fuzzy_match]
    strong_typed[fuzzy[:typed]] = true if fuzzy && row[:score] == 5
  end
  unless strong_typed.empty?
    scored = scored.reject do |row|
      fuzzy = row[:fuzzy_match]
      fuzzy && row[:score] == 3 && strong_typed[fuzzy[:typed]]
    end
  end

  # V1.3-R: alternate-suppression when a primary clearly dominates (mirror
  # of TS selectEntityCandidatesWithMeta). With a strong top match
  # (score >= 8), drop alternates scoring at most half of it so the LLM
  # stays focused on the right binding.
  dominance_floor = 8
  keep_ratio = 0.5
  leader = scored.first
  if leader && leader[:score] >= dominance_floor
    cutoff = leader[:score] * keep_ratio
    scored = [leader] + scored.drop(1).reject { |row| row[:score] <= cutoff }
  end

  # Each typed word keeps its fuzzy annotation on the first (best) row only.
  seen_typed = {}
  scored.first(top_n).map do |row|
    fuzzy = row[:fuzzy_match]
    if fuzzy && seen_typed[fuzzy[:typed]]
      fuzzy = nil
    elsif fuzzy
      seen_typed[fuzzy[:typed]] = true
    end
    { entity: row[:entity], score: row[:score], fuzzy_match: fuzzy }
  end
end