Class: SimilarityEngine

Inherits:
Object
Includes:
SimilarityEngineProtocol
Defined in:
lib/kotoshu/embeddings/similarity_engine.rb

Overview

SimilarityEngine - Compute similarity between embedding vectors

Provides cosine similarity, dot product, and Euclidean and Manhattan distance, with optional norm caching and a dot-product shortcut for pre-normalized vectors.

Examples:

Basic usage

engine = SimilarityEngine.new
engine.cosine([1.0, 0.0], [1.0, 0.0])  # => 1.0

Pre-normalized vectors (faster)

engine = SimilarityEngine.new(pre_normalize: true)
engine.pre_normalize([1.0, 0.0])  # => [1.0, 0.0]

Constant Summary

DEFAULT_CACHE_SIZE = 10_000

Default embedding dimension for norm cache initialization

Methods included from Protocol

#assert_implemented_by!, #compliance_errors, #optional_methods, #required_methods

Constructor Details

#initialize(pre_normalize: false, cache_norms: true) ⇒ SimilarityEngine

Create a new similarity engine

Parameters:

  • pre_normalize (Boolean) (defaults to: false)

    Whether to pre-normalize vectors

  • cache_norms (Boolean) (defaults to: true)

    Whether to cache vector norms



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 38

def initialize(pre_normalize: false, cache_norms: true)
  @pre_normalize = pre_normalize
  @cache_norms = cache_norms
  @norm_cache = cache_norms ? {} : nil
  @cache_hits = 0
  @cache_misses = 0
end

Instance Attribute Details

#cache_hits ⇒ Integer (readonly)

Returns the number of cache hits.

Returns:

  • (Integer)

    Number of cache hits



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 28

def cache_hits
  @cache_hits
end

#cache_misses ⇒ Integer (readonly)

Returns the number of cache misses.

Returns:

  • (Integer)

    Number of cache misses



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 31

def cache_misses
  @cache_misses
end

#pre_normalize(vec) ⇒ Array<Float> (readonly)

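Pre-normalize a vector to unit length.

Note that the source listed for this entry is the argument-less reader for the pre_normalize flag set in the constructor; it does not show the vec-taking implementation described here.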

Parameters:

  • vec (Array<Float>)

    Vector to normalize

Returns:

  • (Array<Float>)

    Normalized vector



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 25

def pre_normalize
  @pre_normalize
end

Instance Method Details

#cache_stats ⇒ Hash

Get cache statistics

Returns:

  • (Hash)

    Cache statistics with the keys :hits, :misses, :hit_rate, and :cache_size



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 177

def cache_stats
  total = @cache_hits + @cache_misses
  {
    hits: @cache_hits,
    misses: @cache_misses,
    hit_rate: total.zero? ? 0.0 : @cache_hits.to_f / total,
    cache_size: @norm_cache&.size || 0
  }
end
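The private norm lookup that feeds these counters is not shown on this page. As a rough sketch only, assuming norms are memoized in a Hash keyed by the vector itself, the counters and the hit-rate formula above would behave as follows. The names `cached_norm`, `counts`, and `cache` are hypothetical and are not part of SimilarityEngine.

```ruby
# Hypothetical stand-in for a norm cache; the library's own lookup may differ.
cache  = {}
counts = { hits: 0, misses: 0 }

cached_norm = lambda do |vec|
  if cache.key?(vec)
    counts[:hits] += 1
    cache[vec]
  else
    counts[:misses] += 1
    cache[vec] = Math.sqrt(vec.sum { |x| x * x })
  end
end

cached_norm.call([3.0, 4.0])  # miss: computed and stored
cached_norm.call([3.0, 4.0])  # hit: equal arrays hash to the same key
cached_norm.call([1.0, 0.0])  # miss

# Same shape as the Hash built by cache_stats.
total = counts[:hits] + counts[:misses]
stats = {
  hits: counts[:hits],
  misses: counts[:misses],
  hit_rate: total.zero? ? 0.0 : counts[:hits].to_f / total,
  cache_size: cache.size
}
```

Here one lookup out of three is served from the cache, so hit_rate is one third. The zero-total guard is what keeps a fresh engine from dividing by zero.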

#clear_cache ⇒ self

Clear the norm cache and reset the hit and miss counters

Returns:

  • (self)


# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 166

def clear_cache
  @norm_cache&.clear
  @cache_hits = 0
  @cache_misses = 0
  self
end

#compute_all_pairs(vectors) ⇒ Array<Array<Float>>

Compute all pairwise similarities for a set of vectors

Parameters:

  • vectors (Array<Array<Float>>)

    Array of vectors

Returns:

  • (Array<Array<Float>>)

    Symmetric n × n cosine similarity matrix with 1.0 on the diagonal



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 203

def compute_all_pairs(vectors)
  n = vectors.length
  matrix = Array.new(n) { Array.new(n, 0.0) }

  (0...n).each do |i|
    matrix[i][i] = 1.0
    ((i + 1)...n).each do |j|
      sim = cosine(vectors[i], vectors[j])
      matrix[i][j] = sim
      matrix[j][i] = sim
    end
  end

  matrix
end
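To make the shape of the result concrete, here is a standalone sketch of the same upper-triangle loop. It uses a local `cosine` lambda as a stand-in for the engine method, so nothing here calls the library. Only n(n - 1)/2 similarities are computed, and each is mirrored across the diagonal.

```ruby
# Illustrative stand-in for the engine's cosine method.
cosine = lambda do |a, b|
  dot = a.zip(b).sum { |x, y| x * y }
  na  = Math.sqrt(a.sum { |x| x * x })
  nb  = Math.sqrt(b.sum { |x| x * x })
  na.zero? || nb.zero? ? 0.0 : dot / (na * nb)
end

vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
n = vectors.length
matrix = Array.new(n) { Array.new(n, 0.0) }

(0...n).each do |i|
  matrix[i][i] = 1.0                      # diagonal is set, not computed
  ((i + 1)...n).each do |j|
    matrix[i][j] = matrix[j][i] = cosine.call(vectors[i], vectors[j])
  end
end

pair_count = n * (n - 1) / 2              # similarities actually computed
```

Because the diagonal is assigned 1.0 unconditionally, a zero vector would still get 1.0 against itself, even though cosine returns 0.0 for zero-norm input.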

#cosine(vec1, vec2) ⇒ Float

Compute cosine similarity between two vectors

Cosine similarity = dot(v1, v2) / (||v1|| * ||v2||)

Range: -1.0 (opposite direction) to 1.0 (same direction)

Parameters:

  • vec1 (Array<Float>)

    First vector

  • vec2 (Array<Float>)

    Second vector

Returns:

  • (Float)

    Cosine similarity, or 0.0 if either vector is nil, empty, or has zero norm



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 55

def cosine(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?

  norm1 = get_norm(vec1)
  norm2 = get_norm(vec2)

  return 0.0 if norm1.zero? || norm2.zero?

  dot = dot_product(vec1, vec2)
  dot / (norm1 * norm2)
end
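The formula can be checked by hand with a few two-dimensional vectors. The sketch below is standalone. The helpers `dot`, `norm`, and `cosine_sketch` are illustrative names rather than the engine's methods, and it omits the nil/empty guard and the norm cache.

```ruby
# Standalone helpers for illustration only.
def dot(v1, v2)
  v1.zip(v2).sum { |a, b| a * b }
end

def norm(v)
  Math.sqrt(v.sum { |x| x * x })
end

def cosine_sketch(v1, v2)
  n1 = norm(v1)
  n2 = norm(v2)
  return 0.0 if n1.zero? || n2.zero?  # same zero-norm guard as the listing

  dot(v1, v2) / (n1 * n2)
end

same       = cosine_sketch([1.0, 0.0], [2.0, 0.0])   # => 1.0, same direction
opposite   = cosine_sketch([1.0, 0.0], [-3.0, 0.0])  # => -1.0, opposite direction
orthogonal = cosine_sketch([1.0, 0.0], [0.0, 5.0])   # => 0.0, perpendicular
```

Vector length does not affect the result: scaling either input changes the dot product and the norm product by the same factor.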

#cosine_batch(pairs) ⇒ Array<Float>

Compute similarity for a batch of vector pairs

As listed, this maps cosine over each pair, so it is a convenience wrapper rather than a separate fast path.

Parameters:

  • pairs (Array<Array<Array<Float>>>)

    Array of [vec1, vec2] pairs

Returns:

  • (Array<Float>)

    Array of similarities



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 194

def cosine_batch(pairs)
  pairs.map { |v1, v2| cosine(v1, v2) }
end

#dot_product(vec1, vec2) ⇒ Float

Compute dot product between two vectors

Parameters:

  • vec1 (Array<Float>)

    First vector

  • vec2 (Array<Float>)

    Second vector

Returns:

  • (Float)

    Dot product



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 73

def dot_product(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?

  vec1.zip(vec2).sum { |a, b| a * b }
end
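The listing does not compare the two lengths, so the result for mismatched dimensions depends on how Array#zip pairs elements. The sketch below shows that pairing and one possible caller-side guard. `checked_dot` is a hypothetical helper, not part of SimilarityEngine.

```ruby
longer  = [1.0, 2.0, 3.0]
shorter = [4.0, 5.0]

# zip follows the receiver's length: missing partners become nil,
# surplus elements of the argument are dropped.
padded    = longer.zip(shorter)   # => [[1.0, 4.0], [2.0, 5.0], [3.0, nil]]
truncated = shorter.zip(longer)   # => [[4.0, 1.0], [5.0, 2.0]]

# Hypothetical guard a caller could apply before computing a dot product.
def checked_dot(v1, v2)
  unless v1.length == v2.length
    raise ArgumentError, "dimension mismatch: #{v1.length} vs #{v2.length}"
  end

  v1.zip(v2).sum { |a, b| a * b }
end

result = checked_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # 4 + 10 + 18 => 32.0
```

With the unguarded form, a longer first vector leads to multiplying a Float by nil, which raises TypeError, while a shorter first vector silently ignores the extra components of the second.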

#euclidean(vec1, vec2) ⇒ Float

Compute Euclidean distance between two vectors

Parameters:

  • vec1 (Array<Float>)

    First vector

  • vec2 (Array<Float>)

    Second vector

Returns:

  • (Float)

    Euclidean distance, or 0.0 if either vector is nil or empty



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 85

def euclidean(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?
  return 0.0 if vec1.equal?(vec2)

  sum = 0.0
  vec1.zip(vec2) do |a, b|
    diff = a - b
    sum += diff * diff
  end
  Math.sqrt(sum)
end

#is_normalized?(vec) ⇒ Boolean

Check if a vector is normalized (unit length)

Parameters:

  • vec (Array<Float>)

    Vector to check

Returns:

  • (Boolean)

    True if the vector's norm is within Float::EPSILON * 10 of 1.0; also true for nil or empty input



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 147

def is_normalized?(vec)
  return true if vec.nil? || vec.empty?

  norm = get_norm(vec)
  (norm - 1.0).abs < Float::EPSILON * 10
end
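The tolerance used here, Float::EPSILON * 10, is about 2.2e-15, so only vectors whose norm matches 1.0 to roughly fifteen decimal places pass. A standalone sketch of the same comparison follows. `unit_length?` is an illustrative helper that computes the norm directly instead of going through the engine's private get_norm.

```ruby
# Illustrative re-statement of the check; tolerance defaults to the value in the listing.
def unit_length?(vec, tolerance = Float::EPSILON * 10)
  norm = Math.sqrt(vec.sum { |x| x * x })
  (norm - 1.0).abs < tolerance
end

exact  = unit_length?([1.0, 0.0])     # norm is exactly 1.0
scaled = unit_length?([3.0, 4.0])     # norm is 5.0
nearly = unit_length?([1.0, 1.0e-6])  # norm exceeds 1.0 by roughly 5e-13
```

The first call passes and the other two fail. The last case shows how strict the threshold is: a vector that is unit length to twelve decimal places is still reported as not normalized.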

#manhattan(vec1, vec2) ⇒ Float

Compute Manhattan (L1) distance between two vectors

Parameters:

  • vec1 (Array<Float>)

    First vector

  • vec2 (Array<Float>)

    Second vector

Returns:

  • (Float)

    Manhattan distance, or 0.0 if either vector is nil or empty



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 103

def manhattan(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?

  vec1.zip(vec2).sum { |a, b| (a - b).abs }
end
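For comparison with the Euclidean distance documented earlier on this page, the classic 3-4-5 triangle gives a quick check of both metrics. This is a standalone sketch in plain Ruby and does not call the engine.

```ruby
a = [0.0, 0.0]
b = [3.0, 4.0]

# L2: straight-line distance between the points.
euclidean = Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })  # => 5.0

# L1: sum of the per-axis distances.
manhattan = a.zip(b).sum { |x, y| (x - y).abs }            # => 7.0
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, and the two agree only when the points differ along at most one axis.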

#normalization_required? ⇒ Boolean

Check if normalization is required for accurate similarity

Returns:

  • (Boolean)

    True unless the engine was created with pre_normalize: true



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 158

def normalization_required?
  !@pre_normalize
end

#normalize_and_compute(vec1, vec2) ⇒ Float

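Compute cosine similarity, skipping normalization when the engine is configured for pre-normalized input

When the engine was created with pre_normalize: true, the vectors are assumed to be unit length already and the result is just their dot product (no norms are computed). Otherwise this delegates to cosine. The method does not itself rescale either vector.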

Parameters:

  • vec1 (Array<Float>)

    First vector

  • vec2 (Array<Float>)

    Second vector

Returns:

  • (Float)

    Cosine similarity



# File 'lib/kotoshu/embeddings/similarity_engine.rb', line 131

def normalize_and_compute(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil? || vec1.empty? || vec2.empty?

  if @pre_normalize
    # For normalized vectors, cosine similarity = dot product
    dot_product(vec1, vec2)
  else
    cosine(vec1, vec2)
  end
end
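The shortcut relies on the identity that, for unit-length vectors, the dot product equals the cosine similarity. The standalone sketch below checks that identity and shows what happens if raw vectors are passed through the dot-product branch. `unit` and `dot` are illustrative helpers, not engine methods.

```ruby
# Scale a non-zero vector to unit length (illustrative helper).
def unit(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  vec.map { |x| x / norm }
end

def dot(v1, v2)
  v1.zip(v2).sum { |a, b| a * b }
end

a = [3.0, 4.0]
b = [4.0, 3.0]

full_cosine = dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))  # 24 / 25
fast_path   = dot(unit(a), unit(b))   # dot product of the unit vectors
agree       = (full_cosine - fast_path).abs < 1e-12

raw_dot = dot(a, b)                   # => 24.0, not a cosine value
```

The two routes agree to within floating-point rounding. The last line is the caveat: with pre_normalize: true the engine trusts its input, so passing vectors that are not unit length yields a plain dot product that can fall outside the -1.0 to 1.0 range.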