philiprehberger-fuzzy_match


Fuzzy string matching with Levenshtein, Damerau-Levenshtein, Jaro-Winkler, Hamming, LCS, token-based, and phonetic algorithms.

Requirements

  • Ruby >= 3.1

Installation

Add to your Gemfile:

gem "philiprehberger-fuzzy_match"

Or install directly:

gem install philiprehberger-fuzzy_match

Usage

require "philiprehberger/fuzzy_match"

# Individual algorithms
Philiprehberger::FuzzyMatch.levenshtein('kitten', 'sitting')   # => 3
Philiprehberger::FuzzyMatch.jaro_winkler('martha', 'marhta')   # => ~0.96
Philiprehberger::FuzzyMatch.dice_coefficient('night', 'nacht') # => 0.25

# Normalized ratio (0.0 to 1.0)
Philiprehberger::FuzzyMatch.ratio('kitten', 'sitting')  # => ~0.57
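
To show how the edit distance and the normalized ratio above relate, here is a minimal pure-Ruby sketch. It is not the gem's implementation: `levenshtein_sketch` and `ratio_sketch` are hypothetical names, and dividing by the longer string's length is an assumption that happens to agree with the ~0.57 figure shown above.

```ruby
# Illustrative sketch only; not the gem's code.
# Classic two-row dynamic programming for Levenshtein distance.
def levenshtein_sketch(a, b)
  a = a.downcase
  b = b.downcase
  prev = (0..b.length).to_a # distances from the empty prefix of a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution (or match)
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumption: similarity = 1 - distance / length of the longer string.
def ratio_sketch(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_sketch(a, b).to_f / longest
end

levenshtein_sketch('kitten', 'sitting') # => 3
ratio_sketch('kitten', 'sitting')       # 1 - 3/7, roughly 0.571
```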

Damerau-Levenshtein (Transposition-Aware)

# Counts adjacent transpositions as 1 edit (Levenshtein counts them as 2)
Philiprehberger::FuzzyMatch.damerau_levenshtein('teh', 'the')   # => 1
Philiprehberger::FuzzyMatch.damerau_ratio('teh', 'the')         # => ~0.667
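
The transposition rule can be sketched as one extra case on top of the Levenshtein recurrence. The code below implements the restricted variant (optimal string alignment); whether the gem uses the restricted or the unrestricted form is not stated here, so treat `osa_distance_sketch` as a hypothetical illustration rather than the gem's algorithm.

```ruby
# Illustrative sketch only; not the gem's code.
# Optimal string alignment: Levenshtein plus adjacent transpositions.
def osa_distance_sketch(a, b)
  a = a.downcase.chars
  b = b.downcase.chars
  d = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| d[i][0] = i }
  (0..b.length).each { |j| d[0][j] = j }

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
      # Extra case: the last two characters are swapped between a and b.
      if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
        d[i][j] = [d[i][j], d[i - 2][j - 2] + 1].min
      end
    end
  end
  d[a.length][b.length]
end

osa_distance_sketch('teh', 'the') # => 1
```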

Longest Common Subsequence

Philiprehberger::FuzzyMatch.lcs('kitten', 'sitting')       # => 4
Philiprehberger::FuzzyMatch.lcs_ratio('kitten', 'sitting')  # => ~0.615
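
As a rough illustration of where these two numbers could come from, the sketch below computes the LCS length with dynamic programming. The names are hypothetical, and the normalization `2 * lcs / (a.length + b.length)` is an assumption chosen because it reproduces the ~0.615 above (8 / 13); the gem may define it differently.

```ruby
# Illustrative sketch only; not the gem's code.
def lcs_length_sketch(a, b)
  a = a.downcase
  b = b.downcase
  prev = Array.new(b.length + 1, 0)
  a.each_char do |ca|
    curr = [0]
    b.each_char.with_index(1) do |cb, j|
      # Extend the subsequence on a match, otherwise carry the best so far.
      curr << (ca == cb ? prev[j - 1] + 1 : [prev[j], curr[j - 1]].max)
    end
    prev = curr
  end
  prev.last
end

# Assumption: ratio = 2 * LCS length / combined length of both strings.
def lcs_ratio_sketch(a, b)
  total = a.length + b.length
  return 1.0 if total.zero?

  2.0 * lcs_length_sketch(a, b) / total
end

lcs_length_sketch('kitten', 'sitting') # => 4
```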

Best Match

candidates = %w[Ruby Python Rust JavaScript]
result = Philiprehberger::FuzzyMatch.best('rubyy', candidates)
result[:match]  # => "Ruby"
result[:score]  # => 0.8

# Search: every candidate at or above the threshold, ranked by score
candidates = %w[commit comment command compare]
results = Philiprehberger::FuzzyMatch.search('comit', candidates, threshold: 0.5)
# => [{ match: "commit", score: 0.8333 }, { match: "comment", score: 0.7143 }, ...]

Did-You-Mean Suggestions

Philiprehberger::FuzzyMatch.suggest('comit', %w[commit comment zebra], threshold: 0.6, max: 3)
# => ["commit", "comment"]
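
Conceptually, a suggestion list like this is a filter, a sort, and a limit. The sketch below shows that pipeline with the scoring function passed in as a block. `suggest_sketch` is a hypothetical helper and the scores in the usage lines are stand-in values for illustration, not output computed by the gem.

```ruby
# Illustrative sketch only; not the gem's code.
# The block receives (query, candidate) and returns a similarity in 0.0..1.0.
def suggest_sketch(query, candidates, threshold: 0.6, max: 5)
  scored = candidates.map { |candidate| [candidate, yield(query, candidate)] }
  scored.select { |_, score| score >= threshold } # drop weak matches
        .sort_by { |_, score| -score }            # best score first
        .first(max)                               # cap the list length
        .map(&:first)                             # keep only the strings
end

# Stand-in scores, for illustration only.
scores = { 'commit' => 0.83, 'comment' => 0.71, 'zebra' => 0.0 }
suggestions = suggest_sketch('comit', scores.keys, threshold: 0.6, max: 3) { |_q, c| scores[c] }
# suggestions == ["commit", "comment"]
```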

Phonetic Matching

Philiprehberger::FuzzyMatch.soundex('Robert')    # => "R163"
Philiprehberger::FuzzyMatch.metaphone('Smith')    # => "SM0"
Philiprehberger::FuzzyMatch.phonetic_match?('Robert', 'Rupert')  # => true
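
To see why 'Robert' and 'Rupert' count as a phonetic match, it helps to look at how a Soundex code is built. The following is a minimal sketch of the standard American Soundex rules; it is not the gem's implementation, and `soundex_sketch` is a hypothetical name.

```ruby
# Illustrative sketch only; not the gem's code.
SOUNDEX_CODES = {
  'b' => '1', 'f' => '1', 'p' => '1', 'v' => '1',
  'c' => '2', 'g' => '2', 'j' => '2', 'k' => '2',
  'q' => '2', 's' => '2', 'x' => '2', 'z' => '2',
  'd' => '3', 't' => '3',
  'l' => '4',
  'm' => '5', 'n' => '5',
  'r' => '6'
}.freeze

def soundex_sketch(word)
  letters = word.downcase.gsub(/[^a-z]/, '').chars
  return '' if letters.empty?

  code = letters.first.upcase          # the first letter is kept as-is
  last = SOUNDEX_CODES[letters.first]  # its digit still suppresses a repeat
  letters.drop(1).each do |ch|
    digit = SOUNDEX_CODES[ch]
    if digit
      code << digit unless digit == last # collapse adjacent equal digits
      last = digit
    elsif ch != 'h' && ch != 'w'
      last = nil # vowels (and y) separate repeats; h and w do not
    end
    break if code.length == 4
  end
  code.ljust(4, '0') # pad short codes with zeros
end

soundex_sketch('Robert') # => "R163"
soundex_sketch('Rupert') # => "R163"
```

Both names reduce to the same code because 'b' and 'p' share digit 1 and the vowels are dropped.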

Deduplication

Philiprehberger::FuzzyMatch.deduplicate(%w[hello helo world wrld], threshold: 0.8)
# => ["hello", "world"]
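
One simple way to get a result like this is a greedy pass that keeps an item only when it is not similar enough to anything already kept. The sketch below assumes that strategy; the gem's actual grouping logic is not described here, and `deduplicate_sketch` and `same_initial` are hypothetical names.

```ruby
# Illustrative sketch only; not the gem's code.
# The block receives two strings and returns a similarity in 0.0..1.0.
def deduplicate_sketch(items, threshold:)
  items.each_with_object([]) do |item, kept|
    duplicate = kept.any? { |existing| yield(existing, item) >= threshold }
    kept << item unless duplicate # the first member of each group wins
  end
end

# Toy similarity for demonstration: strings match if they share a first letter.
same_initial = ->(a, b) { a[0] == b[0] ? 1.0 : 0.0 }
unique = deduplicate_sketch(%w[hello helo world wrld], threshold: 0.8, &same_initial)
# unique == ["hello", "world"]
```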

Hamming Distance

Philiprehberger::FuzzyMatch.hamming('karolin', 'kathrin')  # => 3
Philiprehberger::FuzzyMatch.hamming('abc', 'abc')          # => 0
# Raises Error for different-length strings
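
Hamming distance is just a count of positions where the two strings differ, which is why it is only defined for equal lengths. A minimal sketch follows; `hamming_sketch` is a hypothetical name, and raising `ArgumentError` is an assumption, since the gem's own error class is not shown here.

```ruby
# Illustrative sketch only; not the gem's code.
def hamming_sketch(a, b)
  a = a.downcase
  b = b.downcase
  # Assumption: ArgumentError stands in for whatever error the gem raises.
  raise ArgumentError, 'strings must have equal length' unless a.length == b.length

  a.chars.zip(b.chars).count { |x, y| x != y }
end

hamming_sketch('karolin', 'kathrin') # => 3
```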

Token-Based Matching

# Token sort: reorder tokens alphabetically before comparing
Philiprehberger::FuzzyMatch.token_sort_ratio('john smith jr', 'jr john smith')  # => 1.0

# Token set: compare based on token set intersection/union
Philiprehberger::FuzzyMatch.token_set_ratio('new york mets', 'new york mets vs atlanta braves')
# => high score (shared tokens boost similarity)
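
The two token strategies can be sketched without any string-distance code. `sorted_tokens` shows the normalization step behind token sorting, and `token_jaccard` uses the Jaccard index as one simple intersection-over-union measure. Both names are hypothetical, and the gem's token-set formula is not stated here, so its scores may well differ from plain Jaccard.

```ruby
# Illustrative sketch only; not the gem's code.

# Token sort: lowercase, split on whitespace, sort, and rejoin, so that
# word order no longer matters when the results are compared.
def sorted_tokens(text)
  text.downcase.split.sort.join(' ')
end

# Token set (Jaccard index): shared tokens divided by all distinct tokens.
def token_jaccard(a, b)
  ta = a.downcase.split.uniq
  tb = b.downcase.split.uniq
  union = ta | tb
  return 1.0 if union.empty?

  (ta & tb).size.to_f / union.size
end

sorted_tokens('john smith jr') # => "john jr smith"
sorted_tokens('jr john smith') # => "john jr smith"
token_jaccard('new york mets', 'new york mets vs atlanta braves') # => 0.5
```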

Weighted Scoring

Philiprehberger::FuzzyMatch.weighted_score('kitten', 'sitting',
  weights: { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 })
# => weighted combination of algorithm scores
# Supported keys: :jaro_winkler, :dice, :levenshtein_ratio, :lcs_ratio, :damerau_ratio
# Weights must sum to 1.0
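
The combination step itself is a weighted sum with a check on the weights. The sketch below assumes the per-algorithm scores have already been computed; `combine_scores` is a hypothetical helper, and the tolerance and the `ArgumentError` are assumptions rather than the gem's documented behavior.

```ruby
# Illustrative sketch only; not the gem's code.
# scores and weights are hashes keyed by algorithm name.
def combine_scores(scores, weights)
  total = weights.values.sum(0.0)
  # Allow for floating-point rounding when checking the sum.
  raise ArgumentError, "weights must sum to 1.0 (got #{total})" unless (total - 1.0).abs < 1e-9

  weights.sum(0.0) { |name, weight| weight * scores.fetch(name) }
end

# Made-up scores, for illustration only.
combined = combine_scores({ jaro_winkler: 0.8, dice: 0.5 }, { jaro_winkler: 0.5, dice: 0.5 })
# combined is 0.5 * 0.8 + 0.5 * 0.5, roughly 0.65
```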

API

Philiprehberger::FuzzyMatch

| Method | Description |
|--------|-------------|
| .levenshtein(a, b) | Levenshtein edit distance (integer) |
| .jaro_winkler(a, b) | Jaro-Winkler similarity (0.0 to 1.0) |
| .dice_coefficient(a, b) | Dice coefficient from bigram overlap (0.0 to 1.0) |
| .damerau_levenshtein(a, b) | Damerau-Levenshtein distance with transpositions (integer) |
| .damerau_ratio(a, b) | Normalized Damerau-Levenshtein similarity (0.0 to 1.0) |
| .lcs(a, b) | Longest common subsequence length (integer) |
| .lcs_ratio(a, b) | Normalized LCS similarity (0.0 to 1.0) |
| .ratio(a, b) | Normalized Levenshtein ratio (0.0 to 1.0) |
| .best(query, candidates, threshold: 0.0) | Best match as { match:, score: } |
| .search(query, candidates, threshold: 0.3) | Ranked array of { match:, score: } |
| .suggest(query, candidates, threshold: 0.6, max: 5) | Array of match strings |
| .soundex(string) | Generate 4-character Soundex code |
| .metaphone(string) | Generate Metaphone phonetic code |
| .phonetic_match?(a, b) | Check if two strings match phonetically |
| .hamming(a, b) | Hamming distance for equal-length strings (integer) |
| .token_sort_ratio(a, b) | Token-sorted Jaro-Winkler similarity (0.0 to 1.0) |
| .token_set_ratio(a, b) | Token-set-based similarity (0.0 to 1.0) |
| .weighted_score(a, b, weights:) | Weighted multi-algorithm score (0.0 to 1.0) |
| .deduplicate(array, threshold:, algorithm:) | Group and deduplicate similar strings |

All methods are case-insensitive by default.

Development

bundle install
bundle exec rspec
bundle exec rubocop

Support

If you find this project useful:

  • Star the repo
  • 🐛 Report issues
  • 💡 Suggest features
  • ❤️ Sponsor development
  • 🌐 All Open Source Projects
  • 💻 GitHub Profile
  • 🔗 LinkedIn Profile

License

MIT