SmartCsvImport

CI Gem Version License: MIT

A Rails Engine for CSV importing with AI-powered header matching. Drop in a form object, describe your fields in plain English, and SmartCsvImport automatically maps whatever column names show up in the CSV — messy, abbreviated, or domain-specific — to your fields. No brittle column-name checks to maintain.

Matching uses a two-tier strategy: vector similarity (fast, cached embeddings) handles the common case, and LLM fallback handles ambiguous headers that need cross-field reasoning. Row data is never sent to an AI service — only headers and field descriptions.


Table of Contents


Quickstart

1. Configure an AI provider

SmartCsvImport delegates all AI calls to ruby_llm. Configure it with your provider credentials before using vector or LLM matching strategies.

# config/initializers/ruby_llm.rb
RubyLLM.configure do |config|
  # OpenAI
  config.openai_api_key = ENV["OPENAI_API_KEY"]

  # — or Anthropic —
  config.anthropic_api_key = ENV["ANTHROPIC_API_KEY"]

  # — or Google (Gemini embedding is free tier, good for getting started) —
  config.gemini_api_key = ENV["GEMINI_API_KEY"]
end

Free tier option: The default models (gemini-embedding-001 for embeddings, claude-haiku-4-5-20251001 for LLM fallback) work well for development and small workloads. A Google AI Studio key gives you free Gemini embeddings. See Model Selection for upgrading.

2. Install the gem

# Gemfile
gem "smart_csv_import"
bundle install

3. Run the install generator

rails generate smart_csv_import:install

This creates:

  • config/initializers/smart_csv_import.rb — configuration with all options commented out at their defaults
  • db/migrate/YYYYMMDDHHMMSS_create_smart_csv_import_imports.rb — import tracking table
  • tmp/smart_csv_import/ and tmp/smart_csv_import/embeddings_cache/ — storage directories
rails db:migrate

4. Generate an import form

rails generate smart_csv_import:import Employee first_name last_name email

This creates app/forms/employee_import_row.rb:

class EmployeeImportRow
  include ActiveModel::Validations
  include SmartCsvImport::Matchable

  attr_accessor :first_name, :last_name, :email

  # The description is what the AI matches against — be specific.
  # "Employee first/given name" beats "first name" for ambiguous CSVs.
  csv_field :first_name, description: "Employee first/given name", required: true
  csv_field :last_name,  description: "Employee last/family name",  required: true
  csv_field :email,      description: "Employee email address"

  validates :first_name, :last_name, presence: true

  def save
    Employee.create!(first_name: first_name, last_name: last_name, email: email)
    true
  rescue ActiveRecord::RecordInvalid
    false
  end
end

Edit the save method to persist however your app needs. SmartCsvImport calls save once per row.

5. Process a CSV

result = SmartCsvImport.process("path/to/employees.csv", form_class: EmployeeImportRow)

result.completed?       # => true
result.imported         # => 98
result.failed           # => 2
result.total            # => 100
result.errors           # => [#<RowError row=42 column=:email messages=["is invalid"]>]
result.warnings         # => [#<RowWarning message="Column 'Nickname' was not mapped">]
result.header_mappings  # => {"First Name" => "first_name", "Surname" => "last_name", ...}

That's it. SmartCsvImport reads the CSV headers, maps them to your fields, and calls save on a form object for each row.


Reviewing Failed Rows

Every row that fails — whether due to CSV parsing (malformed quotes, encoding issues) or form validation — is captured as structured data so you can build whatever review experience you want: a paginated admin UI, a corrective re-export, a retry workflow.

In-memory: the Result object (sync callers)

result = SmartCsvImport.process("employees.csv", form_class: EmployeeImportRow)

result.errors        # => [#<RowError row=42 column=:email messages=["is invalid"]>, ...]
result.parse_errors  # => [#<ParseError line_number=87 raw_line='"Bob,Jones...' error_message="Unclosed quoted field">, ...]
  • RowError(row, column, messages) — one per field that failed validation. A single CSV row that fails on multiple columns produces multiple RowError records.
  • ParseError(line_number, raw_line, error_message) — one per CSV row that couldn't be parsed at all.

Persistent: the Import#row_errors association (async, later, or both)

Every failure is persisted to the smart_csv_import_import_row_errors table, giving you an ActiveRecord association to query at any time — including after an async ImportJob has completed and the in-memory Result is gone.

import = SmartCsvImport::Import.find(import_id)

import.row_errors.count                        # => 7
import.row_errors.validation_errors.count      # => 5
import.row_errors.parse_errors.count           # => 2

# Pagination, filtering, grouping — all plain ActiveRecord:
import.row_errors.validation_errors.where(column_name: "email").limit(50).offset(100)
import.row_errors.group(:error_type, :column_name).count

Each SmartCsvImport::ImportRowError record:

Field Type Populated for Description
import_id integer both FK to smart_csv_import_imports (cascade delete)
row_number integer both Physical CSV line number (1-indexed, headers on line 1)
error_type string both "validation" or "parse"
column_name string validation Form field that failed (e.g. "email")
messages json (array) validation Error messages from the form object (e.g. ["is invalid"])
raw_line text parse Literal CSV row text that failed to parse
error_message text parse Parser's description of the failure

Downloadable "fix and re-upload" CSV

Re-export failed rows to a CSV with the original headers plus an _error column — drop it in front of a user, let them fix the rows, and re-upload.

# From a sync Result:
output_path = SmartCsvImport::FailedRowExporter.new(
  result: result,
  csv_path: "path/to/original.csv"
).call

# Or from a persisted Import (e.g. after an async job):
output_path = SmartCsvImport::FailedRowExporter.new(
  import: SmartCsvImport::Import.find(import_id),
  csv_path: "path/to/original.csv"
).call
# => "tmp/smart_csv_import/failed_rows/20260423142530_failed.csv"

The exporter writes only validation failures — parse errors keep their raw_line intact on the row_errors record, which is usually more useful for inspecting malformed CSV than re-exporting.


Form Object DSL

Include SmartCsvImport::Matchable in any class with attr_accessor declarations and use csv_field to register fields for matching.

csv_field

csv_field :field_name, description: "...", required: false
Parameter Required Description
name yes Must match an attr_accessor on the class
description: yes Plain-English description — this is what the AI matches CSV headers against
required: no When true, a failed match transitions the import to mapping_review instead of processing rows (see Import Tracking)

Write good descriptions. The description is the only signal the AI has. Vague descriptions produce weaker matches.

# Weaker — too generic
csv_field :amount, description: "amount"

# Stronger — specific and unambiguous
csv_field :amount, description: "Invoice total amount in dollars"

csv_source and csv_context

Optional class-level hints that give the LLM domain knowledge for disambiguating headers it can't resolve from descriptions alone.

class EmployeeImportRow
  include SmartCsvImport::Matchable

  # Where the CSV comes from
  csv_source "ADP Workforce payroll export"

  # The business domain of your app
  csv_context "HR platform for staffing agencies"

  csv_field :mobile_phone, description: "Employee mobile phone number"
  # ...
end

Without context, the LLM sees "Cell" and has no way to know if it means a mobile number, a prison cell, or a biological cell. With csv_source and csv_context, it can reason correctly.


Configuration

# config/initializers/smart_csv_import.rb
SmartCsvImport.configure do |config|
  config.confidence_threshold = 0.80
  config.batch_size            = 500
  config.storage_path          = "tmp/smart_csv_import"
  config.default_strategy      = :vector
  config.llm_model             = "claude-haiku-4-5-20251001"
  config.embedding_model       = "gemini-embedding-001"
  config.value_hint_rows       = 5
end
Option Default Description
confidence_threshold 0.80 Minimum cosine similarity score to accept a vector or LLM match. Headers below this threshold fall through to the next strategy tier or become unmatched.
batch_size 500 How often (in rows) the import record is updated with progress counts during processing.
storage_path "tmp/smart_csv_import" Root directory for stored CSV files and the embedding cache.
default_strategy :vector Which strategy tier to start with when no custom strategy is set on the form class.
llm_model "claude-haiku-4-5-20251001" The LLM used for fallback matching. Any model supported by ruby_llm can be used.
embedding_model "gemini-embedding-001" The embedding model used by the vector strategy.
value_hint_rows 5 Number of sample rows inspected to apply value-based confidence adjustments (e.g. boosting confidence when cell values look like dates or emails).

Model selection

Better models produce better matching accuracy on ambiguous or domain-specific headers. The defaults are suitable for development and light workloads. To upgrade:

config.embedding_model = "text-embedding-3-large"  # OpenAI — higher dimensionality
config.llm_model       = "claude-sonnet-4-6"        # Anthropic — stronger reasoning

Any model listed in the ruby_llm documentation works — no other changes needed.


Matching Strategies

How the fallback chain works

For each unmatched header, SmartCsvImport tries strategies in order and accepts the first result that meets the confidence threshold:

CSV headers
    │
    ▼
Custom strategy (if set on form class)
    │ unmatched or below threshold
    ▼
Vector strategy (embedding cosine similarity)
    │ unmatched or below threshold
    ▼
LLM strategy (structured prompt)
    │ unmatched
    ▼
UnmatchedResult → warning on the result object

Once a header is matched, it does not pass to the next tier. A header that clears all three tiers unmatched becomes an UnmatchedResult and generates a warning — it does not cause the import to fail.

Vector strategy

Computes embeddings for your field descriptions and the incoming CSV headers, then accepts the highest-scoring mutual match above the confidence threshold.

Field embeddings are cached to disk (keyed by your field definitions) so the API is only called once per unique set of fields — subsequent imports of the same type are fast.

Only headers and field descriptions are sent to the embedding API. Row data is never transmitted.

LLM strategy

Fires for headers that the vector strategy couldn't match confidently. Sends all remaining headers and all field definitions together in a single prompt, letting the LLM reason across the full set:

"Cell" next to first_name, last_name, email → clearly a phone number.
"Cell" in isolation → ambiguous.

Cross-field context is what makes this effective. Only headers and descriptions are sent — never row data.

Lookup strategy

Zero AI. For systems with fixed, known column names.

class HrSystemMapping < SmartCsvImport::Strategies::Lookup
  mappings(
    "EMP_ID"  => :employee_id,
    "FNAME"   => :first_name,
    "LNAME"   => :last_name,
    "DOB"     => :date_of_birth
  )
end

class EmployeeImportRow
  include SmartCsvImport::Matchable

  self.matching_strategy = HrSystemMapping.new
  # ...
end

Because a custom strategy runs first in the chain, matches from the Lookup table skip vector and LLM entirely.

Custom strategy

Subclass SmartCsvImport::Strategy and implement match:

class MyStrategy < SmartCsvImport::Strategy
  def match(csv_headers:, form_class:, sample_rows: [])
    csv_headers.each_with_object({}) do |header, results|
      next unless header.downcase == "emp_id"

      results[header] = SmartCsvImport::MatchResult.matched(
        target_field: :employee_id,
        confidence:   1.0,
        strategy_name: "my_strategy"
      )
    end
  end
end

Return only the headers your strategy can confidently match. Headers you omit fall through to the next tier.


Processing Modes

Synchronous (default)

Blocks until all rows are processed. Returns a result object.

result = SmartCsvImport.process("file.csv", form_class: MyImportRow)
result.completed?  # => true

Use for small files, scripts, or rake tasks.

Asynchronous

Enqueues a background job and returns immediately.

result = SmartCsvImport.process("file.csv", form_class: MyImportRow, mode: :async)
result.queued?    # => true
result.import_id  # => 42

The job runs on the :smart_csv_import queue. Requires a queue backend — Sidekiq, GoodJob, or any Active Job adapter. Use for user-facing uploads where you don't want to block a web request.

Dry run

Validates every row without persisting anything.

result = SmartCsvImport.process("file.csv", form_class: MyImportRow, dry_run: true)
result.dry_run?  # => true
result.imported  # => 95  (would succeed)
result.failed    # => 5   (would fail, with errors)

Use to preview results before committing an import.


Advanced

Header matching only

Inspect the raw mapping decisions without processing any rows:

mappings = SmartCsvImport.match_headers("file.csv", form_class: MyImportRow)
# => {
#      "First Name" => #<MatchResult target_field=:first_name confidence=0.97 strategy="vector">,
#      "Cell"       => #<MatchResult target_field=:mobile_phone confidence=0.91 strategy="llm">,
#      "Nickname"   => #<UnmatchedResult csv_header="Nickname" attempted_strategies=["vector", "llm"]>
#    }

Useful for building a review UI before committing large imports.

Import tracking

Every SmartCsvImport.process call creates a SmartCsvImport::Import record:

Status Meaning
pending Created, not yet started
processing Actively running
completed All rows processed successfully
partial_failure Some rows failed validation
failed Processing stopped due to a database error
mapping_review A required: field could not be matched — no rows were processed

The record also stores the header mappings used, row counts, and a SHA-256 hash of the file. Duplicate file detection compares this hash before processing — a warning is added to the result if a match is found.

Stability analysis

After running several imports of the same type, check which header mappings have solidified:

report   = SmartCsvImport::StabilityReport.new(import_type: "EmployeeImportRow")
analysis = report.analyze

analysis.imports_analyzed  # => 20
analysis.stable_fields     # => fields consistent >= 90% of the time
analysis.unstable_fields   # => fields with varying resolutions

puts report.summary
# Stability report for EmployeeImportRow (20 imports analyzed):
#   Stable fields (3):
#     - First Name -> first_name (100.0% consistent)
#     - Last Name  -> last_name  (100.0% consistent)
#     - Email      -> email      (95.0% consistent)

Fields stable at >= 90% are good candidates for promotion to a Lookup strategy. Doing so eliminates AI calls for those fields entirely on future imports.

Normalizers

Built-in converters for common CSV data types. Use them in your save method:

SmartCsvImport::Normalizers::DateConverter.call("03/15/2024")  # => #<Date: 2024-03-15>
SmartCsvImport::Normalizers::DateConverter.call("2024-03-15")  # => #<Date: 2024-03-15>
SmartCsvImport::Normalizers::BooleanConverter.call("yes")      # => true
SmartCsvImport::Normalizers::BooleanConverter.call("0")        # => false

# In your form object:
def save
  Employee.create!(
    name:       name,
    hired_on:   SmartCsvImport::Normalizers::DateConverter.call(hired_on),
    active:     SmartCsvImport::Normalizers::BooleanConverter.call(active)
  )
  true
rescue ActiveRecord::RecordInvalid
  false
end

Contributing

Architecture overview

SmartCsvImport.process(file, form_class:)
    └── Processor
            ├── FileStorage        stores file, computes hash, checks duplicates
            ├── Matcher            runs strategy chain, returns header → MatchResult map
            │       ├── Strategies::Vector   cosine similarity on embeddings (cached)
            │       └── Strategies::Llm      structured LLM prompt
            └── (per row) form_class.new(attrs).save

Key files:

File Purpose
lib/smart_csv_import.rb Public API: .process, .match_headers, .configure
lib/smart_csv_import/processor.rb Orchestrates matching + row processing + result building
lib/smart_csv_import/matcher.rb Runs the strategy chain, applies value hints
lib/smart_csv_import/matchable.rb csv_field DSL mixed into form objects
lib/smart_csv_import/strategies/ Vector, LLM, Lookup, and base Strategy class
lib/smart_csv_import/result.rb Result value objects returned from .process

The Configuration object is a global singleton accessed via SmartCsvImport.configuration. The Engine class hooks it into Rails' initializer and migration loading.

Getting started

git clone https://github.com/Nroulston/smart_csv_import
cd smart_csv_import
bin/setup
bin/rake        # runs the full test suite
bin/console     # IRB with all gem code loaded

Design decisions

Before sending a PR that touches the matching strategies, read ROADMAP.md. It documents two approaches that were evaluated and explicitly rejected (HyDE, asymmetric embedding augmentation) with the reasoning — understanding why they were ruled out will save you from rediscovering the same dead ends.


License

MIT