amatch - Approximate Matching Extension for Ruby 📏

Description 📝

amatch is a high-performance collection of classes used for approximate matching, searching, and comparing strings. It provides an efficient Ruby interface to several industry-standard algorithms for calculating edit distance and string similarity.

Supported Algorithms 🧩

The library implements a wide array of metrics to suit different matching needs:

Levenshtein Distance: The classic "edit distance" (insertions, deletions, substitutions).
Sellers Algorithm: A variation of Levenshtein optimized for searching a pattern within a longer text.
Damerau-Levenshtein: Similar to Levenshtein but considers transpositions of two adjacent characters as a single edit.
Hamming Distance: Measures the number of positions at which corresponding symbols are different (only for strings of equal length).
Jaro-Winkler: A metric geared towards short strings like names, giving more weight to prefix matches.
Pair Distance: A flexible distance metric based on character pairs (also available as Amatch::DiceCoefficient). Unlike set-based measures, this implementation uses multisets, meaning it is sensitive to the frequency of repeated character pairs.
Longest Common Subsequence/Substring: Finds the longest shared sequences between two strings.

Installation 📦

You can install the extension as a gem:

gem install amatch

Alternatively, if you prefer manual installation:

ruby install.rb
# or
rake install

Usage 🛠️

Basic Setup

To get started, simply require the library and include the Amatch module to add similarity methods directly to the String class.

require 'amatch'
include Amatch

Edit Distance Algorithms 📉

These algorithms return the "cost" to transform one string into another. Lower values indicate higher similarity.

Levenshtein & Damerau-Levenshtein

# Standard Levenshtein
m = Levenshtein.new("pattern")
m.match("pattren") # => 2
"pattern language".levenshtein_similar("language of patterns") # => 0.2

# Damerau-Levenshtein (handles transpositions)
m = Amatch::DamerauLevenshtein.new("pattern")
m.match("pattren") # => 1
"pattern language".damerau_levenshtein_similar("language of patterns") # => 0.2

Sellers (Pattern Searching)

Sellers is particularly useful for finding the best match of a pattern within a larger body of text.

m = Sellers.new("pattern")
m.match("pattren") # => 2.0

# You can customize weights for different edit types
m.substitution = m.insertion = 3
m.match("pattren") # => 4.0

m.reset_weights
m.search("abcpattrendef") # => 2.0

Hamming Distance

Used primarily for strings of equal length to count substitutions.

m = Hamming.new("pattern")
m.match("pattren") # => 2
"pattern language".hamming_similar("language of patterns") # => 0.1

Similarity Metrics 📈

These algorithms typically return a score between 0.0 and 1.0, where 1.0 is a perfect match.

Jaro-Winkler

Highly effective for record linkage and matching names.

m = JaroWinkler.new("pattern")
m.match("paTTren") # => 0.9714...
m.ignore_case = false
m.match("paTTren") # => 0.7942...

# Custom scaling factor for prefix bonus
m.scaling_factor = 0.05
m.match("pattren") # => 0.9619...

"pattern language".jarowinkler_similar("language of patterns") # => 0.6722...

Jaro

The base metric for the Winkler variation.

m = Jaro.new("pattern")
m.match("paTTren") # => 0.9523...
"pattern language".jaro_similar("language of patterns") # => 0.6722...

Other Metrics (Pair Distance, LCS, Longest Substring)

# Pair Distance
# Note: This implementation uses multisets, meaning it considers character 
# frequencies rather than just unique pairs.
m = PairDistance.new("pattern")
m.match("pattr en") # => 0.5454...

# Pro Tip: Pass a regex as the second argument to match based on tokens
#  (e.g., words) rather than individual characters. This is particularly
# useful for natural language.
m.match("language of patterns", /\s+/)
"pattern language".pair_distance_similar("language of patterns", /\s+/) # => 0.9285...

# Longest Common Subsequence
m = LongestSubsequence.new("pattern")
m.match("pattren") # => 6
"pattern language".longest_subsequence_similar("language of patterns") # => 0.4

# Longest Common Substring
m = LongestSubstring.new("pattern")
m.match("pattren") # => 4
"pattern language".longest_substring_similar("language of patterns") # => 0.4

Performance ⚡

amatch is implemented as a C extension to ensure maximum throughput when processing large datasets or complex string comparisons.

performance

Download 📥

The homepage of this library is located at:

https://github.com/flori/amatch

Author 👨‍💻

Florian Frank

License 📄

Apache License, Version 2.0 – See the COPYING file in the source archive.