Module: SmartCsvImport::HeaderNormalizer

Defined in:
lib/smart_csv_import/header_normalizer.rb

Constant Summary

ABBREVIATIONS =

Unambiguous abbreviations only — terms that reliably mean one thing in a business CSV context. This list is intentionally small and conservative.

DO NOT add entries that could mean two different things depending on domain (e.g. “ext” = file extension or phone extension, “co” = company or county, “apt” = apartment or adjective, “sal” / “val” = names).

This list is not meant to be comprehensive. The LLM fallback strategy handles the long tail of ambiguous and domain-specific abbreviations far better than any static dictionary can.

{
  # Personal
  "dob"    => "date of birth",
  "dod"    => "date of death",
  "ssn"    => "social security number",
  "nin"    => "national insurance number",
  "dba"    => "doing business as",
  # Contact
  "tel"    => "telephone",
  # Location
  "addr"   => "address",
  "zip"    => "zip code",
  "ste"    => "suite",
  # Organisation / HR
  "dept"   => "department",
  "mgr"    => "manager",
  "emp"    => "employee",
  "org"    => "organization",
  "corp"   => "corporation",
  # Quantities / identifiers
  "qty"    => "quantity",
  "amt"    => "amount",
  "num"    => "number",
  "ref"    => "reference",
  "acct"   => "account",
  # Finance
  "bal"    => "balance",
  "pmt"    => "payment",
  "inv"    => "invoice",
  # Misc
  "desc"   => "description",
  "info"   => "information",
  "misc"   => "miscellaneous",
}.freeze
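
The whole-word constraint described above can be illustrated with a minimal sketch. It assumes a hypothetical three-entry subset of the table (`SUBSET`) and a hypothetical helper (`expand_words`) that mirrors the lookup used by `.normalize`; neither name is part of the gem.

```ruby
# Hypothetical subset of ABBREVIATIONS, for illustration only.
SUBSET = {
  "dob"  => "date of birth",
  "qty"  => "quantity",
  "acct" => "account",
}.freeze

# Hypothetical helper: expands a word only when the whole downcased
# word is a key in the table; everything else passes through unchanged.
def expand_words(text, table = SUBSET)
  text.split(" ").map { |word| table[word.downcase] || word }.join(" ")
end

puts expand_words("Acct Ext")  # "ext" is deliberately absent, so it is kept
puts expand_words("Accts")     # contains "acct" but is not a whole-word match
```

Tokens that are left out of the table, such as "ext", reach the next strategy untouched instead of being expanded to a guess.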

Class Method Summary

Class Method Details

.normalize(header) ⇒ Object



# File 'lib/smart_csv_import/header_normalizer.rb', line 50

def self.normalize(header)
  text = header.to_s

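  # Split camelCase and PascalCase: "CustomerDOB" → "Customer DOB".
  # The second pattern separates an acronym from a following capitalized
  # word: "HTTPStatus" → "HTTP Status".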
  text = text
    .gsub(/([a-z])([A-Z])/, '\1 \2')
    .gsub(/([A-Z]{2,})([A-Z][a-z])/, '\1 \2')

  # Underscores, dashes, dots, slashes → spaces
  text = text.tr("_./\\-", " ")

  # Replace every remaining non-alphanumeric character with a space
  # (#, *, (, ), etc.; non-ASCII letters are replaced too)
  text = text.gsub(/[^a-zA-Z0-9\s]/, " ")

  # Collapse whitespace
  text = text.gsub(/\s+/, " ").strip

  # Expand abbreviations — whole-word, case-insensitive
  text = text.split(" ").map do |word|
    ABBREVIATIONS[word.downcase] || word
  end.join(" ")

  # Final collapse in case expansions introduced extra spaces
  text.gsub(/\s+/, " ").strip
end
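
A few sample inputs, traced through the steps above. To keep the snippet runnable on its own, it uses a condensed copy of the method under the hypothetical name `normalize_header`, with a hypothetical four-entry subset of the table (`ABBREVS`); with the gem loaded you would presumably call `SmartCsvImport::HeaderNormalizer.normalize` instead.

```ruby
# Hypothetical subset of ABBREVIATIONS, for illustration only.
ABBREVS = {
  "dob"  => "date of birth",
  "acct" => "account",
  "num"  => "number",
  "qty"  => "quantity",
}.freeze

# Condensed copy of the normalization steps shown above.
def normalize_header(header)
  text = header.to_s
  # camelCase / PascalCase and acronym boundaries
  text = text
    .gsub(/([a-z])([A-Z])/, '\1 \2')
    .gsub(/([A-Z]{2,})([A-Z][a-z])/, '\1 \2')
  # separators and other punctuation become spaces
  text = text.tr("_./\\-", " ").gsub(/[^a-zA-Z0-9\s]/, " ")
  text = text.gsub(/\s+/, " ").strip
  # whole-word, case-insensitive expansion
  text.split(" ").map { |word| ABBREVS[word.downcase] || word }.join(" ")
end

p normalize_header("CustomerDOB")    # => "Customer date of birth"
p normalize_header("acct_num")       # => "account number"
p normalize_header("Order Qty (#)")  # => "Order quantity"
p normalize_header(:HTTPStatus)      # => "HTTP Status"
p normalize_header(nil)              # => ""
```

Note that the original casing of unexpanded words is preserved, while expansions are always lowercase; callers that compare normalized headers may want to downcase the result first.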