SmartCsvImport
A Rails Engine for CSV importing with AI-powered header matching. Drop in a form object, describe your fields in plain English, and SmartCsvImport automatically maps whatever column names show up in the CSV — messy, abbreviated, or domain-specific — to your fields. No brittle column-name checks to maintain.
Matching uses a two-tier strategy: vector similarity (fast, cached embeddings) handles the common case, and LLM fallback handles ambiguous headers that need cross-field reasoning. Row data is never sent to an AI service — only headers and field descriptions.
Table of Contents
- Quickstart
- Reviewing Failed Rows
- Form Object DSL
- Configuration
- Matching Strategies
- Processing Modes
- Advanced
- Contributing
Quickstart
1. Configure an AI provider
SmartCsvImport delegates all AI calls to ruby_llm. Configure it with your provider credentials before using vector or LLM matching strategies.
# config/initializers/ruby_llm.rb
RubyLLM.configure do |config|
# OpenAI
config.openai_api_key = ENV["OPENAI_API_KEY"]
# — or Anthropic —
config.anthropic_api_key = ENV["ANTHROPIC_API_KEY"]
# — or Google (Gemini embedding is free tier, good for getting started) —
config.gemini_api_key = ENV["GEMINI_API_KEY"]
end
Free tier option: The default models (
gemini-embedding-001for embeddings,claude-haiku-4-5-20251001for LLM fallback) work well for development and small workloads. A Google AI Studio key gives you free Gemini embeddings. See Model Selection for upgrading.
2. Install the gem
# Gemfile
gem "smart_csv_import"
bundle install
3. Run the install generator
rails generate smart_csv_import:install
This creates:
config/initializers/smart_csv_import.rb— configuration with all options commented out at their defaultsdb/migrate/YYYYMMDDHHMMSS_create_smart_csv_import_imports.rb— import tracking tabletmp/smart_csv_import/andtmp/smart_csv_import/embeddings_cache/— storage directories
rails db:migrate
4. Generate an import form
rails generate smart_csv_import:import Employee first_name last_name email
This creates app/forms/employee_import_row.rb:
class EmployeeImportRow
include ActiveModel::Validations
include SmartCsvImport::Matchable
attr_accessor :first_name, :last_name, :email
# The description is what the AI matches against — be specific.
# "Employee first/given name" beats "first name" for ambiguous CSVs.
csv_field :first_name, description: "Employee first/given name", required: true
csv_field :last_name, description: "Employee last/family name", required: true
csv_field :email, description: "Employee email address"
validates :first_name, :last_name, presence: true
def save
Employee.create!(first_name: first_name, last_name: last_name, email: email)
true
rescue ActiveRecord::RecordInvalid
false
end
end
Edit the save method to persist however your app needs. SmartCsvImport calls save once per row.
5. Process a CSV
result = SmartCsvImport.process("path/to/employees.csv", form_class: EmployeeImportRow)
result.completed? # => true
result.imported # => 98
result.failed # => 2
result.total # => 100
result.errors # => [#<RowError row=42 column=:email messages=["is invalid"]>]
result.warnings # => [#<RowWarning message="Column 'Nickname' was not mapped">]
result.header_mappings # => {"First Name" => "first_name", "Surname" => "last_name", ...}
That's it. SmartCsvImport reads the CSV headers, maps them to your fields, and calls save on a form object for each row.
Reviewing Failed Rows
Every row that fails — whether due to CSV parsing (malformed quotes, encoding issues) or form validation — is captured as structured data so you can build whatever review experience you want: a paginated admin UI, a corrective re-export, a retry workflow.
In-memory: the Result object (sync callers)
result = SmartCsvImport.process("employees.csv", form_class: EmployeeImportRow)
result.errors # => [#<RowError row=42 column=:email messages=["is invalid"]>, ...]
result.parse_errors # => [#<ParseError line_number=87 raw_line='"Bob,Jones...' error_message="Unclosed quoted field">, ...]
RowError(row, column, messages)— one per field that failed validation. A single CSV row that fails on multiple columns produces multipleRowErrorrecords.ParseError(line_number, raw_line, error_message)— one per CSV row that couldn't be parsed at all.
Persistent: the Import#row_errors association (async, later, or both)
Every failure is persisted to the smart_csv_import_import_row_errors table, giving you an ActiveRecord association to query at any time — including after an async ImportJob has completed and the in-memory Result is gone.
import = SmartCsvImport::Import.find(import_id)
import.row_errors.count # => 7
import.row_errors.validation_errors.count # => 5
import.row_errors.parse_errors.count # => 2
# Pagination, filtering, grouping — all plain ActiveRecord:
import.row_errors.validation_errors.where(column_name: "email").limit(50).offset(100)
import.row_errors.group(:error_type, :column_name).count
Each SmartCsvImport::ImportRowError record:
| Field | Type | Populated for | Description |
|---|---|---|---|
import_id |
integer | both | FK to smart_csv_import_imports (cascade delete) |
row_number |
integer | both | Physical CSV line number (1-indexed, headers on line 1) |
error_type |
string | both | "validation" or "parse" |
column_name |
string | validation | Form field that failed (e.g. "email") |
messages |
json (array) | validation | Error messages from the form object (e.g. ["is invalid"]) |
raw_line |
text | parse | Literal CSV row text that failed to parse |
error_message |
text | parse | Parser's description of the failure |
Downloadable "fix and re-upload" CSV
Re-export failed rows to a CSV with the original headers plus an _error column — drop it in front of a user, let them fix the rows, and re-upload.
# From a sync Result:
output_path = SmartCsvImport::FailedRowExporter.new(
result: result,
csv_path: "path/to/original.csv"
).call
# Or from a persisted Import (e.g. after an async job):
output_path = SmartCsvImport::FailedRowExporter.new(
import: SmartCsvImport::Import.find(import_id),
csv_path: "path/to/original.csv"
).call
# => "tmp/smart_csv_import/failed_rows/20260423142530_failed.csv"
The exporter writes only validation failures — parse errors keep their raw_line intact on the row_errors record, which is usually more useful for inspecting malformed CSV than re-exporting.
Form Object DSL
Include SmartCsvImport::Matchable in any class with attr_accessor declarations and use csv_field to register fields for matching.
csv_field
csv_field :field_name, description: "...", required: false
| Parameter | Required | Description |
|---|---|---|
name |
yes | Must match an attr_accessor on the class |
description: |
yes | Plain-English description — this is what the AI matches CSV headers against |
required: |
no | When true, a failed match transitions the import to mapping_review instead of processing rows (see Import Tracking) |
Write good descriptions. The description is the only signal the AI has. Vague descriptions produce weaker matches.
# Weaker — too generic
csv_field :amount, description: "amount"
# Stronger — specific and unambiguous
csv_field :amount, description: "Invoice total amount in dollars"
csv_source and csv_context
Optional class-level hints that give the LLM domain knowledge for disambiguating headers it can't resolve from descriptions alone.
class EmployeeImportRow
include SmartCsvImport::Matchable
# Where the CSV comes from
csv_source "ADP Workforce payroll export"
# The business domain of your app
csv_context "HR platform for staffing agencies"
csv_field :mobile_phone, description: "Employee mobile phone number"
# ...
end
Without context, the LLM sees "Cell" and has no way to know if it means a mobile number, a prison cell, or a biological cell. With csv_source and csv_context, it can reason correctly.
Configuration
# config/initializers/smart_csv_import.rb
SmartCsvImport.configure do |config|
config.confidence_threshold = 0.80
config.batch_size = 500
config.storage_path = "tmp/smart_csv_import"
config.default_strategy = :vector
config.llm_model = "claude-haiku-4-5-20251001"
config. = "gemini-embedding-001"
config.value_hint_rows = 5
end
| Option | Default | Description |
|---|---|---|
confidence_threshold |
0.80 |
Minimum cosine similarity score to accept a vector or LLM match. Headers below this threshold fall through to the next strategy tier or become unmatched. |
batch_size |
500 |
How often (in rows) the import record is updated with progress counts during processing. |
storage_path |
"tmp/smart_csv_import" |
Root directory for stored CSV files and the embedding cache. |
default_strategy |
:vector |
Which strategy tier to start with when no custom strategy is set on the form class. |
llm_model |
"claude-haiku-4-5-20251001" |
The LLM used for fallback matching. Any model supported by ruby_llm can be used. |
embedding_model |
"gemini-embedding-001" |
The embedding model used by the vector strategy. |
value_hint_rows |
5 |
Number of sample rows inspected to apply value-based confidence adjustments (e.g. boosting confidence when cell values look like dates or emails). |
Model selection
Better models produce better matching accuracy on ambiguous or domain-specific headers. The defaults are suitable for development and light workloads. To upgrade:
config. = "text-embedding-3-large" # OpenAI — higher dimensionality
config.llm_model = "claude-sonnet-4-6" # Anthropic — stronger reasoning
Any model listed in the ruby_llm documentation works — no other changes needed.
Matching Strategies
How the fallback chain works
For each unmatched header, SmartCsvImport tries strategies in order and accepts the first result that meets the confidence threshold:
CSV headers
│
▼
Custom strategy (if set on form class)
│ unmatched or below threshold
▼
Vector strategy (embedding cosine similarity)
│ unmatched or below threshold
▼
LLM strategy (structured prompt)
│ unmatched
▼
UnmatchedResult → warning on the result object
Once a header is matched, it does not pass to the next tier. A header that clears all three tiers unmatched becomes an UnmatchedResult and generates a warning — it does not cause the import to fail.
Vector strategy
Computes embeddings for your field descriptions and the incoming CSV headers, then accepts the highest-scoring mutual match above the confidence threshold.
Field embeddings are cached to disk (keyed by your field definitions) so the API is only called once per unique set of fields — subsequent imports of the same type are fast.
Only headers and field descriptions are sent to the embedding API. Row data is never transmitted.
LLM strategy
Fires for headers that the vector strategy couldn't match confidently. Sends all remaining headers and all field definitions together in a single prompt, letting the LLM reason across the full set:
"Cell" next to
first_name,last_name,
"Cell" in isolation → ambiguous.
Cross-field context is what makes this effective. Only headers and descriptions are sent — never row data.
Lookup strategy
Zero AI. For systems with fixed, known column names.
class HrSystemMapping < SmartCsvImport::Strategies::Lookup
mappings(
"EMP_ID" => :employee_id,
"FNAME" => :first_name,
"LNAME" => :last_name,
"DOB" => :date_of_birth
)
end
class EmployeeImportRow
include SmartCsvImport::Matchable
self.matching_strategy = HrSystemMapping.new
# ...
end
Because a custom strategy runs first in the chain, matches from the Lookup table skip vector and LLM entirely.
Custom strategy
Subclass SmartCsvImport::Strategy and implement match:
class MyStrategy < SmartCsvImport::Strategy
def match(csv_headers:, form_class:, sample_rows: [])
csv_headers.each_with_object({}) do |header, results|
next unless header.downcase == "emp_id"
results[header] = SmartCsvImport::MatchResult.matched(
target_field: :employee_id,
confidence: 1.0,
strategy_name: "my_strategy"
)
end
end
end
Return only the headers your strategy can confidently match. Headers you omit fall through to the next tier.
Processing Modes
Synchronous (default)
Blocks until all rows are processed. Returns a result object.
result = SmartCsvImport.process("file.csv", form_class: MyImportRow)
result.completed? # => true
Use for small files, scripts, or rake tasks.
Asynchronous
Enqueues a background job and returns immediately.
result = SmartCsvImport.process("file.csv", form_class: MyImportRow, mode: :async)
result.queued? # => true
result.import_id # => 42
The job runs on the :smart_csv_import queue. Requires a queue backend — Sidekiq, GoodJob, or any Active Job adapter. Use for user-facing uploads where you don't want to block a web request.
Dry run
Validates every row without persisting anything.
result = SmartCsvImport.process("file.csv", form_class: MyImportRow, dry_run: true)
result.dry_run? # => true
result.imported # => 95 (would succeed)
result.failed # => 5 (would fail, with errors)
Use to preview results before committing an import.
Advanced
Header matching only
Inspect the raw mapping decisions without processing any rows:
mappings = SmartCsvImport.match_headers("file.csv", form_class: MyImportRow)
# => {
# "First Name" => #<MatchResult target_field=:first_name confidence=0.97 strategy="vector">,
# "Cell" => #<MatchResult target_field=:mobile_phone confidence=0.91 strategy="llm">,
# "Nickname" => #<UnmatchedResult csv_header="Nickname" attempted_strategies=["vector", "llm"]>
# }
Useful for building a review UI before committing large imports.
Import tracking
Every SmartCsvImport.process call creates a SmartCsvImport::Import record:
| Status | Meaning |
|---|---|
pending |
Created, not yet started |
processing |
Actively running |
completed |
All rows processed successfully |
partial_failure |
Some rows failed validation |
failed |
Processing stopped due to a database error |
mapping_review |
A required: field could not be matched — no rows were processed |
The record also stores the header mappings used, row counts, and a SHA-256 hash of the file. Duplicate file detection compares this hash before processing — a warning is added to the result if a match is found.
Stability analysis
After running several imports of the same type, check which header mappings have solidified:
report = SmartCsvImport::StabilityReport.new(import_type: "EmployeeImportRow")
analysis = report.analyze
analysis.imports_analyzed # => 20
analysis.stable_fields # => fields consistent >= 90% of the time
analysis.unstable_fields # => fields with varying resolutions
puts report.summary
# Stability report for EmployeeImportRow (20 imports analyzed):
# Stable fields (3):
# - First Name -> first_name (100.0% consistent)
# - Last Name -> last_name (100.0% consistent)
# - Email -> email (95.0% consistent)
Fields stable at >= 90% are good candidates for promotion to a Lookup strategy. Doing so eliminates AI calls for those fields entirely on future imports.
Normalizers
Built-in converters for common CSV data types. Use them in your save method:
SmartCsvImport::Normalizers::DateConverter.call("03/15/2024") # => #<Date: 2024-03-15>
SmartCsvImport::Normalizers::DateConverter.call("2024-03-15") # => #<Date: 2024-03-15>
SmartCsvImport::Normalizers::BooleanConverter.call("yes") # => true
SmartCsvImport::Normalizers::BooleanConverter.call("0") # => false
# In your form object:
def save
Employee.create!(
name: name,
hired_on: SmartCsvImport::Normalizers::DateConverter.call(hired_on),
active: SmartCsvImport::Normalizers::BooleanConverter.call(active)
)
true
rescue ActiveRecord::RecordInvalid
false
end
Contributing
Architecture overview
SmartCsvImport.process(file, form_class:)
└── Processor
├── FileStorage stores file, computes hash, checks duplicates
├── Matcher runs strategy chain, returns header → MatchResult map
│ ├── Strategies::Vector cosine similarity on embeddings (cached)
│ └── Strategies::Llm structured LLM prompt
└── (per row) form_class.new(attrs).save
Key files:
| File | Purpose |
|---|---|
lib/smart_csv_import.rb |
Public API: .process, .match_headers, .configure |
lib/smart_csv_import/processor.rb |
Orchestrates matching + row processing + result building |
lib/smart_csv_import/matcher.rb |
Runs the strategy chain, applies value hints |
lib/smart_csv_import/matchable.rb |
csv_field DSL mixed into form objects |
lib/smart_csv_import/strategies/ |
Vector, LLM, Lookup, and base Strategy class |
lib/smart_csv_import/result.rb |
Result value objects returned from .process |
The Configuration object is a global singleton accessed via SmartCsvImport.configuration. The Engine class hooks it into Rails' initializer and migration loading.
Getting started
git clone https://github.com/Nroulston/smart_csv_import
cd smart_csv_import
bin/setup
bin/rake # runs the full test suite
bin/console # IRB with all gem code loaded
Design decisions
Before sending a PR that touches the matching strategies, read ROADMAP.md. It documents two approaches that were evaluated and explicitly rejected (HyDE, asymmetric embedding augmentation) with the reasoning — understanding why they were ruled out will save you from rediscovering the same dead ends.