Class: SmartCsvImport::Strategies::Llm
- Inherits:
- SmartCsvImport::Strategy
- Inheritance chain:
- SmartCsvImport::Strategies::Llm < SmartCsvImport::Strategy < Object
- Includes:
- Logging
- Defined in:
- lib/smart_csv_import/strategies/llm.rb
Instance Method Summary
- #match(csv_headers:, form_class:, sample_rows: []) ⇒ Object
Why we do NOT use HyDE (Hypothetical Document Embeddings) here.
Instance Method Details
#match(csv_headers:, form_class:, sample_rows: []) ⇒ Object
Why we do NOT use HyDE (Hypothetical Document Embeddings) here:
HyDE would ask the LLM to generate a description of each header in isolation, then compare those descriptions to field descriptions via embeddings. It was trialled and rejected for two reasons:
- It throws away the best signal we have. The LLM here already sees both sides, all headers AND all field definitions, in one prompt. That cross-field context is what disambiguates genuinely ambiguous headers. “Cell” next to first_name/last_name/email is clearly a phone number. “Cell” described in isolation could be a phone, a prison cell, or a biological cell, and the LLM can’t know which.
- It adds indirection without benefit. Direct matching lets the LLM reason holistically. HyDE turns that into a blind embedding lookup that loses the reasoning context.
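The "both sides in one prompt" idea can be sketched as follows. This is a minimal illustration, not the gem's build_prompt: the name build_prompt_sketch, the `{ name:, description: }` shape of a field definition, and the prompt wording are all assumptions made for the example.

```ruby
# Hypothetical sketch: every header and every field definition goes into a
# single prompt, so neighbouring columns can disambiguate a header like "Cell".
# The field-definition shape ({ name:, description: }) is an assumption.
def build_prompt_sketch(csv_headers, field_definitions)
  header_lines = csv_headers.map { |header| "- #{header}" }
  field_lines = field_definitions.map { |f| "- #{f[:name]}: #{f[:description]}" }

  <<~PROMPT
    Map each CSV header to at most one target field, or to null if none fits.

    CSV headers:
    #{header_lines.join("\n")}

    Target fields:
    #{field_lines.join("\n")}

    Answer with a JSON object of the form {"<header>": "<field name or null>"}.
  PROMPT
end
```

Because "Cell" is listed alongside headers such as "First Name" and "Email", the model reads it in the context of a contact record, which is exactly the signal a per-header HyDE description would discard.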
The right path for genuinely ambiguous headers is to enrich this prompt with business context (csv_source, csv_context on the form class) so the LLM has more signal, not to strip signal away via HyDE. If even that isn’t enough, surface the header as an UnmatchedResult for human review.
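A rough sketch of that two-step fallback is shown below. Only the names csv_source, csv_context, and UnmatchedResult come from the text above. Their class-method form, the example strings, the ContactImportForm class, the UnmatchedResultSketch struct, and both helper methods are assumptions made for illustration.

```ruby
# Hypothetical form class: csv_source and csv_context are assumed to be
# class-level methods returning plain strings.
class ContactImportForm
  def self.csv_source
    "CRM contact export"
  end

  def self.csv_context
    "One row per customer contact; phone columns hold mobile numbers."
  end
end

# Builds an optional prompt preamble from whatever context the form class offers.
def business_context_for(form_class)
  lines = []
  lines << "Source: #{form_class.csv_source}" if form_class.respond_to?(:csv_source)
  lines << "Context: #{form_class.csv_context}" if form_class.respond_to?(:csv_context)
  lines.join("\n")
end

# Stand-in for the gem's UnmatchedResult; the real class's attributes are not
# shown in this page, so this struct is purely illustrative.
UnmatchedResultSketch = Struct.new(:header, keyword_init: true)

# Splits a header => field mapping into matches and headers needing human review.
def partition_matches(csv_headers, mapping)
  matched = {}
  unmatched = []
  csv_headers.each do |header|
    field = mapping[header]
    if field.nil?
      unmatched << UnmatchedResultSketch.new(header: header)
    else
      matched[header] = field
    end
  end
  [matched, unmatched]
end
```

A form class without either method simply contributes an empty preamble, so the extra context stays opt-in.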
# File 'lib/smart_csv_import/strategies/llm.rb', line 32
def match(csv_headers:, form_class:, sample_rows: [])
  field_definitions = form_class.csv_fields
  return {} if field_definitions.empty?

  prompt = build_prompt(csv_headers, field_definitions, form_class)
  response = fetch_llm_response(prompt)
  parse_response(response, csv_headers)
rescue StandardError => e
  log_error("LLM strategy failed: #{e.message}")
  {}
end
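The method's contract, an empty hash both when the form class has no fields and when anything in the LLM round trip raises, can be exercised with a small stand-in. The sketch below mirrors the control flow only. LlmStrategySketch, DemoForm, the injected `llm` callable, and the JSON reply format are assumptions, and `warn` substitutes for the Logging mixin's log_error.

```ruby
require "json"

# Stand-in that mirrors the control flow of #match; it is not the gem's class.
# The injected `llm` callable and the JSON reply format are assumptions.
class LlmStrategySketch
  def initialize(llm:)
    @llm = llm
  end

  def match(csv_headers:, form_class:, sample_rows: [])
    field_definitions = form_class.csv_fields
    return {} if field_definitions.empty?

    reply = @llm.call("headers=#{csv_headers.inspect} fields=#{field_definitions.inspect}")
    # Keep only headers that were actually asked about and got a non-null field.
    JSON.parse(reply).select { |header, field| csv_headers.include?(header) && !field.nil? }
  rescue StandardError => e
    warn("LLM strategy failed: #{e.message}")
    {}
  end
end

# Hypothetical form class exposing csv_fields, the one method #match calls on it.
class DemoForm
  def self.csv_fields
    [{ name: "phone" }, { name: "email" }]
  end
end

ok = LlmStrategySketch.new(llm: ->(_prompt) { '{"Cell": "phone", "Notes": null}' })
broken = LlmStrategySketch.new(llm: ->(_prompt) { raise "timeout" })

ok_result = ok.match(csv_headers: %w[Cell Notes], form_class: DemoForm)
broken_result = broken.match(csv_headers: %w[Cell Notes], form_class: DemoForm)
```

Returning `{}` on failure means a caller cannot tell "the LLM matched nothing" from "the LLM call failed" without reading the log, which is a trade-off worth knowing when chaining this strategy with others.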