Class: SimilarityEngine
- Inherits:
- Object
- Inheritance chain:
- SimilarityEngine < Object
- Includes:
- SimilarityEngineProtocol
- Defined in:
- lib/kotoshu/embeddings/similarity_engine.rb
Overview
SimilarityEngine - Compute similarity between embedding vectors
Provides various similarity/distance metrics with optimizations like norm caching and pre-normalized vector support.
Constant Summary
- DEFAULT_CACHE_SIZE =
Default norm cache size. The constructor shown on this page does not reference this constant.
10_000
Instance Attribute Summary
-
#cache_hits ⇒ Integer
readonly
Number of cache hits.
-
#cache_misses ⇒ Integer
readonly
Number of cache misses.
-
#pre_normalize ⇒ Boolean
readonly
Whether input vectors are treated as already normalized to unit length.
Instance Method Summary
-
#cache_stats ⇒ Hash
Get cache statistics.
-
#clear_cache ⇒ self
Clear the norm cache.
-
#compute_all_pairs(vectors) ⇒ Array<Array<Float>>
Compute all pairwise similarities for a set of vectors.
-
#cosine(vec1, vec2) ⇒ Float
Compute cosine similarity between two vectors.
-
#cosine_batch(pairs) ⇒ Array<Float>
Compute cosine similarity for a batch of vector pairs.
-
#dot_product(vec1, vec2) ⇒ Float
Compute dot product between two vectors.
-
#euclidean(vec1, vec2) ⇒ Float
Compute Euclidean distance between two vectors.
-
#initialize(pre_normalize: false, cache_norms: true) ⇒ SimilarityEngine
constructor
Create a new similarity engine.
-
#is_normalized?(vec) ⇒ Boolean
Check whether a vector is normalized (unit length).
-
#manhattan(vec1, vec2) ⇒ Float
Compute Manhattan (L1) distance between two vectors.
-
#normalization_required? ⇒ Boolean
Check if normalization is required for accurate similarity.
-
#normalize_and_compute(vec1, vec2) ⇒ Float
Compute similarity, using the plain dot product when vectors are pre-normalized.
Methods included from Protocol
#assert_implemented_by!, #compliance_errors, #optional_methods, #required_methods
Constructor Details
#initialize(pre_normalize: false, cache_norms: true) ⇒ SimilarityEngine
Create a new similarity engine
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 38

    def initialize(pre_normalize: false, cache_norms: true)
      @pre_normalize = pre_normalize
      @cache_norms = cache_norms
      @norm_cache = cache_norms ? {} : nil
      @cache_hits = 0
      @cache_misses = 0
    end
Instance Attribute Details
#cache_hits ⇒ Integer (readonly)
Returns the number of cache hits.
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 28

    def cache_hits
      @cache_hits
    end
#cache_misses ⇒ Integer (readonly)
Returns the number of cache misses.
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 31

    def cache_misses
      @cache_misses
    end
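#pre_normalize ⇒ Boolean (readonly)
Returns the pre_normalize flag passed to the constructor. This reader takes no arguments and does not normalize anything itself; it reports whether input vectors are treated as already having unit length.
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 25

    def pre_normalize
      @pre_normalize
    end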
Instance Method Details
#cache_stats ⇒ Hash
Get cache statistics
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 177

    def cache_stats
      total = @cache_hits + @cache_misses
      {
        hits: @cache_hits,
        misses: @cache_misses,
        hit_rate: total.zero? ? 0.0 : @cache_hits.to_f / total,
        cache_size: @norm_cache&.size || 0
      }
    end
#clear_cache ⇒ self
Clear the norm cache
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 166

    def clear_cache
      @norm_cache&.clear
      @cache_hits = 0
      @cache_misses = 0
      self
    end
#compute_all_pairs(vectors) ⇒ Array<Array<Float>>
Compute all pairwise similarities for a set of vectors
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 203

    def compute_all_pairs(vectors)
      n = vectors.length
      matrix = Array.new(n) { Array.new(n, 0.0) }

      (0...n).each do |i|
        matrix[i][i] = 1.0
        ((i + 1)...n).each do |j|
          sim = cosine(vectors[i], vectors[j])
          matrix[i][j] = sim
          matrix[j][i] = sim
        end
      end

      matrix
    end
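The loop above calls cosine once per unordered pair, which is n(n-1)/2 calls for n vectors, and mirrors each result across the diagonal. The diagonal is set to 1.0 without calling cosine, so a zero vector gets 1.0 there even though cosine returns 0.0 for zero-norm input. The following is a self-contained sketch of the same fill pattern; `sketch_cosine` and `sketch_all_pairs` are illustrative names and not part of the library.

```ruby
# Standalone cosine helper so the example runs without the library.
def sketch_cosine(v1, v2)
  n1 = Math.sqrt(v1.sum(0.0) { |x| x * x })
  n2 = Math.sqrt(v2.sum(0.0) { |x| x * x })
  return 0.0 if n1.zero? || n2.zero?

  v1.zip(v2).sum(0.0) { |a, b| a * b } / (n1 * n2)
end

# Fill the upper triangle and mirror it, as compute_all_pairs does.
def sketch_all_pairs(vectors)
  n = vectors.length
  matrix = Array.new(n) { Array.new(n, 0.0) }

  vectors.each_index do |i|
    matrix[i][i] = 1.0
    ((i + 1)...n).each do |j|
      matrix[i][j] = matrix[j][i] = sketch_cosine(vectors[i], vectors[j])
    end
  end

  matrix
end

sketch_all_pairs([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
# => [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
```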
#cosine(vec1, vec2) ⇒ Float
Compute cosine similarity between two vectors
Cosine similarity = dot(v1, v2) / (||v1|| * ||v2||)
Range: -1.0 (opposite direction) to 1.0 (same direction). Returns 0.0 if either vector is nil, empty, or has zero norm.
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 55

    def cosine(vec1, vec2)
      return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?

      norm1 = get_norm(vec1)
      norm2 = get_norm(vec2)

      return 0.0 if norm1.zero? || norm2.zero?

      dot = dot_product(vec1, vec2)
      dot / (norm1 * norm2)
    end
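The cosine method relies on get_norm, which is not shown on this page. As a rough illustration of how norm caching with hit and miss counters could work, here is a minimal sketch. It assumes the cache is a Hash keyed by the vector's contents; that keying, the class name `NormCacheSketch`, and its method names are assumptions made for the example, not the library's implementation.

```ruby
# Hypothetical stand-in for a norm cache; not the library's code.
class NormCacheSketch
  attr_reader :hits, :misses

  def initialize
    @cache = {}
    @hits = 0
    @misses = 0
  end

  # Return the L2 norm of vec, reusing a cached value when one exists.
  # Arrays hash by content, so equal vectors share one cache entry.
  def norm(vec)
    if @cache.key?(vec)
      @hits += 1
      return @cache[vec]
    end

    @misses += 1
    @cache[vec] = Math.sqrt(vec.sum(0.0) { |x| x * x })
  end

  # Cosine similarity built on the cached norms.
  def cosine(vec1, vec2)
    n1 = norm(vec1)
    n2 = norm(vec2)
    return 0.0 if n1.zero? || n2.zero?

    vec1.zip(vec2).sum(0.0) { |a, b| a * b } / (n1 * n2)
  end
end

sketch = NormCacheSketch.new
sketch.cosine([1.0, 0.0], [0.0, 1.0]) # two misses, result 0.0
sketch.cosine([1.0, 0.0], [1.0, 0.0]) # two hits, result 1.0
```

Keying by content means hashing each vector costs time proportional to its length, and mutating an array after it has been cached would leave a stale entry, so a real implementation may well choose a different key.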
#cosine_batch(pairs) ⇒ Array<Float>
Compute similarity for a batch of vector pairs
Convenience wrapper: the implementation maps cosine over the pairs, so it performs the same work as calling cosine once per pair.
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 194

    def cosine_batch(pairs)
      pairs.map { |v1, v2| cosine(v1, v2) }
    end
#dot_product(vec1, vec2) ⇒ Float
Compute dot product between two vectors
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 73

    def dot_product(vec1, vec2)
      return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?

      vec1.zip(vec2).sum { |a, b| a * b }
    end
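The method above does not compare vector lengths. Array#zip always yields as many pairs as the receiver has elements: if vec2 is shorter it is padded with nil, and multiplying a Float by nil raises TypeError; if vec2 is longer, its extra elements are silently ignored. Callers who want a clear error can check dimensions first. The wrapper below is a sketch of that idea; `sketch_checked_dot` is a hypothetical name, not a library method.

```ruby
# Hypothetical length-checked dot product; not part of the library.
def sketch_checked_dot(vec1, vec2)
  unless vec1.length == vec2.length
    raise ArgumentError, "dimension mismatch: #{vec1.length} vs #{vec2.length}"
  end

  # Passing 0.0 as the initial value keeps the result a Float.
  vec1.zip(vec2).sum(0.0) { |a, b| a * b }
end

sketch_checked_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]) # => 32.0
```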
#euclidean(vec1, vec2) ⇒ Float
Compute Euclidean distance between two vectors
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 85

    def euclidean(vec1, vec2)
      return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?
      return 0.0 if vec1.equal?(vec2)

      sum = 0.0
      vec1.zip(vec2) do |a, b|
        diff = a - b
        sum += diff * diff
      end
      Math.sqrt(sum)
    end
#is_normalized?(vec) ⇒ Boolean
Check whether a vector is normalized (unit length). The norm must be within 10 * Float::EPSILON of 1.0; nil and empty vectors are reported as normalized.
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 147

    def is_normalized?(vec)
      return true if vec.nil? || vec.empty?

      norm = get_norm(vec)
      (norm - 1.0).abs < Float::EPSILON * 10
    end
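The tolerance used above, 10 * Float::EPSILON, is roughly 2.2e-15, so a vector whose norm is off by even 1e-12 is reported as not normalized. The sketch below shows normalization to unit length and a check with an adjustable tolerance. The helper names and the `tolerance:` parameter with its 1e-9 default are assumptions for illustration, not part of the library.

```ruby
# Hypothetical helpers; names and the tolerance default are illustrative.
def sketch_l2_norm(vec)
  Math.sqrt(vec.sum(0.0) { |x| x * x })
end

# Scale vec to unit length; a zero vector is returned unchanged (copied).
def sketch_normalize(vec)
  norm = sketch_l2_norm(vec)
  return vec.dup if norm.zero?

  vec.map { |x| x / norm }
end

# True when the L2 norm is within `tolerance` of 1.0.
def sketch_unit_length?(vec, tolerance: 1e-9)
  (sketch_l2_norm(vec) - 1.0).abs < tolerance
end

sketch_unit_length?(sketch_normalize([3.0, 4.0]))               # => true
sketch_unit_length?([3.0, 4.0])                                 # => false
sketch_unit_length?([1.0, 1e-6], tolerance: Float::EPSILON * 10) # => false
```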
#manhattan(vec1, vec2) ⇒ Float
Compute Manhattan (L1) distance between two vectors
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 103

    def manhattan(vec1, vec2)
      return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?

      vec1.zip(vec2).sum { |a, b| (a - b).abs }
    end
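To make the difference between the two distance metrics concrete, the sketch below computes both for the same pair of points. Note that the library methods return 0.0 for nil or empty input, which is indistinguishable from the distance between identical vectors; the standalone helpers here (`sketch_euclidean`, `sketch_manhattan`, both illustrative names) omit those guards for brevity.

```ruby
# Straight-line (L2) distance: square root of summed squared differences.
def sketch_euclidean(v1, v2)
  Math.sqrt(v1.zip(v2).sum(0.0) { |a, b| (a - b)**2 })
end

# Manhattan (L1) distance: sum of absolute differences per coordinate.
def sketch_manhattan(v1, v2)
  v1.zip(v2).sum(0.0) { |a, b| (a - b).abs }
end

sketch_euclidean([0.0, 0.0], [3.0, 4.0]) # => 5.0
sketch_manhattan([0.0, 0.0], [3.0, 4.0]) # => 7.0
```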
#normalization_required? ⇒ Boolean
Check if normalization is required for accurate similarity
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 158

    def normalization_required?
      !@pre_normalize
    end
#normalize_and_compute(vec1, vec2) ⇒ Float
Compute similarity, skipping normalization when it is not needed
When the engine was created with pre_normalize: true, the inputs are assumed to have unit length and the dot product is returned directly, which avoids computing the norms. Otherwise this delegates to cosine. The method does not normalize its inputs itself.
    # File 'lib/kotoshu/embeddings/similarity_engine.rb', line 131

    def normalize_and_compute(vec1, vec2)
      return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?

      if @pre_normalize
        # For normalized vectors, cosine similarity = dot product
        dot_product(vec1, vec2)
      else
        cosine(vec1, vec2)
      end
    end
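The shortcut is only valid when both inputs really have unit length. The sketch below, using illustrative standalone helpers rather than the library's methods, shows that the dot product matches cosine similarity for unit vectors and diverges from it for unnormalized ones.

```ruby
# Illustrative helpers; not the library's implementation.
def sketch_dot(v1, v2)
  v1.zip(v2).sum(0.0) { |a, b| a * b }
end

def sketch_cosine_full(v1, v2)
  n1 = Math.sqrt(sketch_dot(v1, v1))
  n2 = Math.sqrt(sketch_dot(v2, v2))
  return 0.0 if n1.zero? || n2.zero?

  sketch_dot(v1, v2) / (n1 * n2)
end

# Unit-length inputs: the two values agree up to floating-point rounding.
unit_a = [0.6, 0.8]
unit_b = [1.0, 0.0]
(sketch_dot(unit_a, unit_b) - sketch_cosine_full(unit_a, unit_b)).abs < 1e-12 # => true

# Unnormalized inputs: the dot product is not a cosine similarity.
raw = [3.0, 4.0]
sketch_dot(raw, raw)         # => 25.0
sketch_cosine_full(raw, raw) # => 1.0
```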