philiprehberger-fuzzy_match
Fuzzy string matching with Levenshtein, Damerau-Levenshtein, Jaro-Winkler, Hamming, LCS, token-based, and phonetic algorithms
Requirements
- Ruby >= 3.1
Installation
Add to your Gemfile:
gem "philiprehberger-fuzzy_match"
Or install directly:
gem install philiprehberger-fuzzy_match
Usage
require "philiprehberger/fuzzy_match"
# Individual algorithms
Philiprehberger::FuzzyMatch.levenshtein('kitten', 'sitting') # => 3
Philiprehberger::FuzzyMatch.jaro_winkler('martha', 'marhta') # => ~0.96
Philiprehberger::FuzzyMatch.dice_coefficient('night', 'nacht') # => 0.25
# Normalized ratio (0.0 to 1.0)
Philiprehberger::FuzzyMatch.ratio('kitten', 'sitting') # => ~0.57
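To make the relationship between the edit distance and the normalized ratio concrete, here is a minimal plain-Ruby sketch. It is not the gem's implementation: the method names `levenshtein_sketch` and `ratio_sketch` are hypothetical, and normalizing by the longer string's length is an assumption that happens to agree with the values shown above.

```ruby
# Illustrative sketch only; not the gem's code.
# Two-row dynamic-programming Levenshtein distance, case-insensitive.
def levenshtein_sketch(a, b)
  a = a.downcase
  b = b.downcase
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a # distances from the empty prefix of a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution (or match)
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed normalization: 1 - distance / length of the longer string.
def ratio_sketch(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_sketch(a, b).to_f / longest
end

levenshtein_sketch('kitten', 'sitting')    # => 3
ratio_sketch('kitten', 'sitting').round(2) # => 0.57
```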
Damerau-Levenshtein (Transposition-Aware)
# Counts adjacent transpositions as 1 edit (Levenshtein counts them as 2)
Philiprehberger::FuzzyMatch.damerau_levenshtein('teh', 'the') # => 1
Philiprehberger::FuzzyMatch.damerau_ratio('teh', 'the') # => ~0.667
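The transposition rule can be sketched as one extra case on top of the Levenshtein recurrence. The version below is the "optimal string alignment" variant, chosen here for brevity; whether the gem uses this variant or the unrestricted Damerau-Levenshtein distance is not stated in this README, and `osa_distance_sketch` is a hypothetical name.

```ruby
# Illustrative sketch only; not the gem's code.
# Optimal string alignment distance: Levenshtein plus adjacent transpositions.
def osa_distance_sketch(a, b)
  a = a.downcase
  b = b.downcase
  # d[i][j] = edits needed to turn the first i chars of a into the first j of b
  d = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| d[i][0] = i }
  (0..b.length).each { |j| d[0][j] = j }

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      best = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
      # Extra case: the last two characters are swapped relative to each other.
      if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
        best = [best, d[i - 2][j - 2] + 1].min
      end
      d[i][j] = best
    end
  end
  d[a.length][b.length]
end

osa_distance_sketch('teh', 'the') # => 1
```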
Longest Common Subsequence
Philiprehberger::FuzzyMatch.lcs('kitten', 'sitting') # => 4
Philiprehberger::FuzzyMatch.lcs_ratio('kitten', 'sitting') # => ~0.615
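A rough sketch of how these two numbers relate, in plain Ruby. The method names are hypothetical, and the ratio formula (twice the LCS length divided by the combined length) is an assumption that reproduces the value shown above, not a confirmed detail of the gem.

```ruby
# Illustrative sketch only; not the gem's code.
# Length of the longest common subsequence, computed row by row.
def lcs_length_sketch(a, b)
  a = a.downcase
  b = b.downcase
  prev = Array.new(b.length + 1, 0)
  a.each_char do |ca|
    curr = [0]
    b.each_char.with_index(1) do |cb, j|
      curr << (ca == cb ? prev[j - 1] + 1 : [prev[j], curr[j - 1]].max)
    end
    prev = curr
  end
  prev.last
end

# Assumed normalization: 2 * LCS length / (length of a + length of b).
def lcs_ratio_sketch(a, b)
  total = a.length + b.length
  return 1.0 if total.zero?

  2.0 * lcs_length_sketch(a, b) / total
end

lcs_length_sketch('kitten', 'sitting')         # => 4
lcs_ratio_sketch('kitten', 'sitting').round(3) # => 0.615
```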
Best Match
candidates = %w[Ruby Python Rust JavaScript]
result = Philiprehberger::FuzzyMatch.best('rubyy', candidates)
result[:match] # => "Ruby"
result[:score] # => 0.8
Ranked Search
candidates = %w[commit comment command compare]
results = Philiprehberger::FuzzyMatch.search('comit', candidates, threshold: 0.5)
# => [{ match: "commit", score: 0.8333 }, { match: "comment", score: 0.7143 }, ...]
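The score, filter, and sort pipeline behind a ranked search can be sketched as follows. This README does not say which algorithm `search` scores with, so the sketch substitutes a bigram Dice coefficient as a stand-in scorer; its scores will differ from the gem's output above, and all method names are hypothetical.

```ruby
# Illustrative sketch only; not the gem's code.
def bigrams_sketch(str)
  str.downcase.each_char.each_cons(2).map(&:join)
end

# Dice coefficient over bigrams: 2 * shared / (bigrams in a + bigrams in b).
def dice_sketch(a, b)
  x = bigrams_sketch(a)
  y = bigrams_sketch(b)
  return (a.downcase == b.downcase ? 1.0 : 0.0) if x.empty? || y.empty?

  remaining = y.tally # each bigram in b can be matched only once
  shared = x.count do |gram|
    next false unless remaining.fetch(gram, 0).positive?

    remaining[gram] -= 1
    true
  end
  2.0 * shared / (x.length + y.length)
end

# Score every candidate, drop those below the threshold, best first.
def search_sketch(query, candidates, threshold: 0.3)
  candidates
    .map { |candidate| { match: candidate, score: dice_sketch(query, candidate) } }
    .select { |result| result[:score] >= threshold }
    .sort_by { |result| -result[:score] }
end

dice_sketch('night', 'nacht') # => 0.25
search_sketch('comit', %w[commit comment command compare], threshold: 0.5)
  .map { |result| result[:match] } # => ["commit"]
```

With this stand-in scorer only "commit" clears the 0.5 threshold, which shows how much the choice of algorithm affects what a given threshold lets through.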
Did-You-Mean Suggestions
Philiprehberger::FuzzyMatch.suggest('comit', %w[commit comment zebra], threshold: 0.6, max: 3)
# => ["commit", "comment"]
Phonetic Matching
Philiprehberger::FuzzyMatch.soundex('Robert') # => "R163"
Philiprehberger::FuzzyMatch.metaphone('Smith') # => "SM0"
Philiprehberger::FuzzyMatch.phonetic_match?('Robert', 'Rupert') # => true
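For readers unfamiliar with Soundex, the standard American Soundex rules can be sketched in a few lines of plain Ruby. This is not the gem's implementation, the names are hypothetical, and treating `phonetic_match?` as a comparison of Soundex codes is an assumption; the gem may also consult Metaphone.

```ruby
# Illustrative sketch only; not the gem's code.
SOUNDEX_CODES_SKETCH = {
  'B' => '1', 'F' => '1', 'P' => '1', 'V' => '1',
  'C' => '2', 'G' => '2', 'J' => '2', 'K' => '2',
  'Q' => '2', 'S' => '2', 'X' => '2', 'Z' => '2',
  'D' => '3', 'T' => '3',
  'L' => '4',
  'M' => '5', 'N' => '5',
  'R' => '6'
}.freeze

def soundex_sketch(word)
  letters = word.upcase.gsub(/[^A-Z]/, '')
  return '' if letters.empty?

  code = letters[0].dup
  last = SOUNDEX_CODES_SKETCH[letters[0]]
  letters[1..].each_char do |ch|
    digit = SOUNDEX_CODES_SKETCH[ch]
    if digit
      code << digit unless digit == last # collapse runs of the same code
      last = digit
    elsif ch != 'H' && ch != 'W'
      last = nil # vowels break a run; H and W do not
    end
    break if code.length == 4
  end
  code.ljust(4, '0') # pad short codes with zeros
end

# Assumed definition: two words match when their Soundex codes are equal.
def soundex_match_sketch?(a, b)
  soundex_sketch(a) == soundex_sketch(b)
end

soundex_sketch('Robert')                  # => "R163"
soundex_match_sketch?('Robert', 'Rupert') # => true
```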
Deduplication
Philiprehberger::FuzzyMatch.deduplicate(%w[hello helo world wrld], threshold: 0.8)
# => ["hello", "world"]
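One way to picture deduplication is a greedy pass that keeps an item only if it is not similar enough to anything already kept. The sketch below assumes that strategy (the README does not describe the gem's grouping logic) and takes the similarity function as a block. `char_overlap_sketch` is a deliberately crude, hypothetical scorer used only to keep the example self-contained.

```ruby
# Illustrative sketch only; not the gem's code.
# Keep the first member of each group of similar strings.
def deduplicate_sketch(items, threshold:)
  items.each_with_object([]) do |item, kept|
    duplicate = kept.any? { |existing| yield(item, existing) >= threshold }
    kept << item unless duplicate
  end
end

# Toy scorer: Dice-style overlap of individual characters, ignoring order.
def char_overlap_sketch(a, b)
  counts = b.downcase.chars.tally
  shared = a.downcase.chars.count { |ch| (counts[ch] = counts.fetch(ch, 0) - 1) >= 0 }
  2.0 * shared / (a.length + b.length)
end

deduplicate_sketch(%w[hello helo world wrld], threshold: 0.8) do |a, b|
  char_overlap_sketch(a, b)
end
# => ["hello", "world"]
```

Because the pass is greedy, the result depends on input order: the first spelling seen in each group is the one that survives.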
Hamming Distance
Philiprehberger::FuzzyMatch.hamming('karolin', 'kathrin') # => 3
Philiprehberger::FuzzyMatch.hamming('abc', 'abc') # => 0
# Raises Error for different-length strings
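Hamming distance is simply a count of positions where the two strings differ, which is why it is undefined for strings of different lengths. A minimal sketch follows; the method name is hypothetical, and `ArgumentError` is an assumption, since the README does not name the error class the gem raises.

```ruby
# Illustrative sketch only; not the gem's code.
def hamming_sketch(a, b)
  a = a.downcase
  b = b.downcase
  # ArgumentError is an assumed error class for this sketch.
  raise ArgumentError, 'strings must have equal length' unless a.length == b.length

  a.chars.zip(b.chars).count { |x, y| x != y }
end

hamming_sketch('karolin', 'kathrin') # => 3
hamming_sketch('abc', 'abc')         # => 0
```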
Token-Based Matching
# Token sort: reorder tokens alphabetically before comparing
Philiprehberger::FuzzyMatch.token_sort_ratio('john smith jr', 'jr john smith') # => 1.0
# Token set: compare based on token set intersection/union
Philiprehberger::FuzzyMatch.token_set_ratio('new york mets', 'new york mets vs atlanta braves')
# => high score (shared tokens boost similarity)
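Both token-based approaches start by splitting the input into tokens. The sketch below shows the sort-then-compare idea and a plain Jaccard index (intersection over union) for token sets. The names are hypothetical, and Jaccard is a simplification: the gem evidently weights shared tokens more generously, since it reports a high score for the example above while plain Jaccard gives 0.5.

```ruby
# Illustrative sketch only; not the gem's code.
def tokens_sketch(str)
  str.downcase.scan(/[[:alnum:]]+/)
end

# Token sort: a canonical form that ignores word order.
def token_sort_key_sketch(str)
  tokens_sketch(str).sort.join(' ')
end

# Token set: plain Jaccard index over unique tokens (an assumed simplification).
def token_jaccard_sketch(a, b)
  x = tokens_sketch(a).uniq
  y = tokens_sketch(b).uniq
  union = x | y
  return 1.0 if union.empty?

  (x & y).length.to_f / union.length
end

token_sort_key_sketch('john smith jr') # => "john jr smith"
token_sort_key_sketch('jr john smith') # => "john jr smith"
token_jaccard_sketch('new york mets', 'new york mets vs atlanta braves') # => 0.5
```

Once both strings are reduced to the same sorted key, any string similarity applied to the keys returns a perfect score, which explains the 1.0 in the token sort example.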
Weighted Scoring
Philiprehberger::FuzzyMatch.weighted_score('kitten', 'sitting',
weights: { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 })
# => weighted combination of algorithm scores
# Supported keys: :jaro_winkler, :dice, :levenshtein_ratio, :lcs_ratio, :damerau_ratio
# Weights must sum to 1.0
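The combination step itself is a weighted sum. The sketch below takes precomputed per-algorithm scores so that it stays self-contained; the method name, the tolerance, and the use of `ArgumentError` are assumptions for illustration, and the input scores are made-up numbers rather than output from the gem.

```ruby
# Illustrative sketch only; not the gem's code.
# scores and weights are hashes keyed by algorithm name.
def weighted_score_sketch(scores, weights)
  total = weights.values.sum
  raise ArgumentError, 'weights must sum to 1.0' unless (total - 1.0).abs < 1e-9

  missing = weights.keys - scores.keys
  raise ArgumentError, "no score for: #{missing.join(', ')}" unless missing.empty?

  weights.sum { |name, weight| scores.fetch(name) * weight }
end

# Hypothetical per-algorithm scores, not produced by the gem.
scores  = { jaro_winkler: 0.9, dice: 0.5, levenshtein_ratio: 0.6 }
weights = { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 }
weighted_score_sketch(scores, weights).round(2) # => 0.72
```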
API
Philiprehberger::FuzzyMatch
| Method | Description |
|---|---|
| `.levenshtein(a, b)` | Levenshtein edit distance (integer) |
| `.jaro_winkler(a, b)` | Jaro-Winkler similarity (0.0 to 1.0) |
| `.dice_coefficient(a, b)` | Dice coefficient from bigram overlap (0.0 to 1.0) |
| `.damerau_levenshtein(a, b)` | Damerau-Levenshtein distance with transpositions (integer) |
| `.damerau_ratio(a, b)` | Normalized Damerau-Levenshtein similarity (0.0 to 1.0) |
| `.lcs(a, b)` | Longest common subsequence length (integer) |
| `.lcs_ratio(a, b)` | Normalized LCS similarity (0.0 to 1.0) |
| `.ratio(a, b)` | Normalized Levenshtein ratio (0.0 to 1.0) |
| `.best(query, candidates, threshold: 0.0)` | Best match as `{ match:, score: }` |
| `.search(query, candidates, threshold: 0.3)` | Ranked array of `{ match:, score: }` |
| `.suggest(query, candidates, threshold: 0.6, max: 5)` | Array of match strings |
| `.soundex(string)` | Generate 4-character Soundex code |
| `.metaphone(string)` | Generate Metaphone phonetic code |
| `.phonetic_match?(a, b)` | Check if two strings match phonetically |
| `.hamming(a, b)` | Hamming distance for equal-length strings (integer) |
| `.token_sort_ratio(a, b)` | Token-sorted Jaro-Winkler similarity (0.0 to 1.0) |
| `.token_set_ratio(a, b)` | Token-set-based similarity (0.0 to 1.0) |
| `.weighted_score(a, b, weights:)` | Weighted multi-algorithm score (0.0 to 1.0) |
| `.deduplicate(array, threshold:, algorithm:)` | Group and deduplicate similar strings |

All methods are case-insensitive by default.
Development
bundle install
bundle exec rspec
bundle exec rubocop