Class: SmartCsvImport::Strategies::Llm

Inherits:
SmartCsvImport::Strategy
Includes:
Logging
Defined in:
lib/smart_csv_import/strategies/llm.rb

Instance Method Summary

Instance Method Details

#match(csv_headers:, form_class:, sample_rows: []) ⇒ Object

Why we do NOT use HyDE (Hypothetical Document Embeddings) here:

HyDE would ask the LLM to generate a description of each header in isolation, then compare those descriptions to field descriptions via embeddings. It was trialled and rejected for two reasons:

  1. It throws away the best signal we have. The LLM here already sees both sides — all headers AND all field definitions — in one prompt. That cross-field context is what disambiguates genuinely ambiguous headers. “Cell” next to first_name/last_name/email is clearly a phone number. “Cell” described in isolation could be a phone, a prison cell, or a biological cell — the LLM can’t know which.

  2. It adds indirection without benefit. Direct matching lets the LLM reason holistically. HyDE turns that into a blind embedding lookup that loses the reasoning context.

The right path for genuinely ambiguous headers is: enrich this prompt with business context (csv_source, csv_context on the form class) so the LLM has more signal — not strip signal away via HyDE. If even that isn’t enough, surface the header as UnmatchedResult for human review.
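As a rough illustration of the enrichment described above, the sketch below assembles a single prompt that carries the business context, every header, and every field definition together. It is not the gem's build_prompt. The ContactForm class, the build_enriched_prompt helper, the Hash shape returned by csv_fields, and the choice to expose csv_source and csv_context as class methods are all assumptions made for the example.

```ruby
# Hypothetical form class. Only the names csv_fields, csv_source and
# csv_context come from the documentation; their shapes are assumed here.
class ContactForm
  def self.csv_fields
    {
      first_name: "Given name",
      last_name: "Family name",
      phone: "Mobile or landline number"
    }
  end

  def self.csv_source
    "CRM contact export"
  end

  def self.csv_context
    "Rows are people; 'Cell' columns hold mobile numbers."
  end
end

# Hypothetical helper: puts business context, all headers and all field
# definitions into one prompt so the model sees both sides at once.
def build_enriched_prompt(csv_headers, form_class)
  lines = []
  lines << "Source: #{form_class.csv_source}" if form_class.respond_to?(:csv_source)
  lines << "Context: #{form_class.csv_context}" if form_class.respond_to?(:csv_context)
  lines << "CSV headers: #{csv_headers.join(', ')}"
  lines << "Target fields:"
  form_class.csv_fields.each do |name, description|
    lines << "- #{name}: #{description}"
  end
  lines << "Map each header to one target field, or to null if none fits."
  lines.join("\n")
end

prompt = build_enriched_prompt(["First", "Last", "Cell"], ContactForm)
puts prompt
```

Because "Cell" appears alongside the name fields and the context line, the model has the cross-field signal that an isolated per-header description would lack.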



# File 'lib/smart_csv_import/strategies/llm.rb', line 32

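# Note: sample_rows is accepted but not used in this method body.
def match(csv_headers:, form_class:, sample_rows: [])
  field_definitions = form_class.csv_fields
  # Nothing to match against, so skip the LLM call entirely.
  return {} if field_definitions.empty?

  prompt = build_prompt(csv_headers, field_definitions, form_class)
  response = fetch_llm_response(prompt)
  parse_response(response, csv_headers)
rescue StandardError => e
  # Fail soft: log the error and return an empty mapping.
  log_error("LLM strategy failed: #{e.message}")
  {}
end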
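
The fail-soft contract of #match can be exercised with a stand-in class, sketched below. SketchLlmStrategy and StubForm are hypothetical. The injected lambda replaces the real fetch_llm_response and parse_response steps, whose behavior is not shown in the source, and warn stands in for log_error.

```ruby
# Hypothetical stand-in that mirrors the control flow of #match.
# The LLM round trip is replaced by an injected callable.
class SketchLlmStrategy
  def initialize(llm)
    @llm = llm
  end

  def match(csv_headers:, form_class:, sample_rows: [])
    field_definitions = form_class.csv_fields
    return {} if field_definitions.empty?

    response = @llm.call(csv_headers, field_definitions)
    # Assumed parsing step: keep only mappings for headers we were given.
    response.select { |header, _field| csv_headers.include?(header) }
  rescue StandardError => e
    warn("LLM strategy failed: #{e.message}")
    {}
  end
end

# Minimal form double exposing csv_fields.
class StubForm
  attr_reader :csv_fields

  def initialize(csv_fields)
    @csv_fields = csv_fields
  end
end

form = StubForm.new({ phone: "Mobile number" })
working = SketchLlmStrategy.new(->(_headers, _fields) { { "Cell" => :phone } })
broken = SketchLlmStrategy.new(->(_headers, _fields) { raise "timeout" })

matched = working.match(csv_headers: ["Cell"], form_class: form)
failed = broken.match(csv_headers: ["Cell"], form_class: form)
empty = working.match(csv_headers: ["Cell"], form_class: StubForm.new({}))
```

Both the no-fields case and the failure case return an empty hash, so a caller cannot tell them apart from the return value alone; the logged message is the only trace of a failure.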