Class: Kotoshu::Embeddings::SimilaritySearch

Inherits:
Object
Defined in:
lib/kotoshu/embeddings/similarity_search.rb

Overview

Similarity search for embedding-based nearest neighbor lookup.

Efficiently finds semantically similar words using cosine similarity. Supports both on-the-fly computation and pre-computed embedding matrices.

Examples:

Basic usage

search = SimilaritySearch.new(
  vocabulary: vocab,
  model: model
)
neighbors = search.find_nearest('hello', k: 10)

With pre-loaded embedding matrix (faster)

search = SimilaritySearch.new(
  vocabulary: vocab,
  model: model,
  preload_embeddings: true
)
neighbors = search.find_nearest('hello', k: 10)

Constructor Details

#initialize(vocabulary:, model:, preload_embeddings: false, max_cache_size: 1000) ⇒ SimilaritySearch

Create a new similarity search instance.

Parameters:

  • vocabulary (Vocabulary)

    Word vocabulary

  • model (OnnxRuntimeModel)

    ONNX model for embeddings

  • preload_embeddings (Boolean) (defaults to: false)

    Whether to preload all embeddings

  • max_cache_size (Integer) (defaults to: 1000)

    Maximum embeddings to cache (if not preloading)



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 43

def initialize(vocabulary:, model:, preload_embeddings: false, max_cache_size: 1000)
  @vocabulary = vocabulary
  @model = model
  @preload_embeddings = preload_embeddings
  @max_cache_size = max_cache_size

  # Embedding cache (word -> vector)
  @embedding_cache = {}

  # Pre-loaded embedding matrix (for faster search)
  @embedding_matrix = nil

  # Track whether embeddings are preloaded
  @embeddings_loaded = false

  # Load embeddings if requested
  preload_embeddings! if preload_embeddings
end
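
The constructor accepts max_cache_size, but how the word-to-vector cache is bounded is not shown on this page. The following is a minimal sketch of one way such a bound could work, assuming least-recently-used eviction. BoundedCache and its methods are hypothetical names, not part of Kotoshu.

```ruby
# Hypothetical bounded cache. The eviction policy of SimilaritySearch is not
# shown on this page, so least-recently-used eviction here is an assumption.
class BoundedCache
  def initialize(max_size)
    @max_size = max_size
    @store = {} # Ruby hashes keep insertion order, oldest entry first
  end

  # Return the cached value for key, or compute it with the block and store it.
  def fetch(key)
    if @store.key?(key)
      value = @store.delete(key) # re-insert to mark as most recently used
      return @store[key] = value
    end

    value = yield(key)
    @store.shift if @store.size >= @max_size # drop the oldest entry
    @store[key] = value
  end

  def size
    @store.size
  end

  def key?(key)
    @store.key?(key)
  end
end

cache = BoundedCache.new(2)
cache.fetch('a') { [1.0] }
cache.fetch('b') { [2.0] }
cache.fetch('a') { raise 'should be cached' } # hit: 'a' becomes most recent
cache.fetch('c') { [3.0] }                    # evicts 'b', the least recent
```

Relying on hash insertion order keeps the sketch short; a production cache might also want thread safety, which is out of scope here.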

Instance Attribute Details

#embeddings_loaded ⇒ Boolean (readonly)

Returns whether embeddings are pre-loaded.

Returns:

  • (Boolean)

    Whether embeddings are pre-loaded



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 35

def embeddings_loaded
  @embeddings_loaded
end

#model ⇒ OnnxRuntimeModel (readonly)

Returns the ONNX model.

Returns:

  • (OnnxRuntimeModel)

    The ONNX model



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 32

def model
  @model
end

#vocabulary ⇒ Vocabulary (readonly)

Returns the vocabulary.

Returns:

  • (Vocabulary)

    The vocabulary



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 29

def vocabulary
  @vocabulary
end

Class Method Details

.from_cache(language_code, cache: nil, preload: false) ⇒ SimilaritySearch?

Create an instance from the cached vocabulary and model for a language.

Parameters:

  • language_code (String)

    ISO 639-1 language code

  • cache (Cache::ModelCache, nil) (defaults to: nil)

    Optional cache instance

  • preload (Boolean) (defaults to: false)

    Whether to preload embeddings

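Returns:

  • (SimilaritySearch, nil)

    New instance, or nil if the vocabulary or model could not be loaded from the cache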



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 317

def self.from_cache(language_code, cache: nil, preload: false)
  vocab = Vocabulary.from_cache(language_code, cache: cache)
  model = OnnxRuntimeModel.from_cache(language_code, cache: cache)

  return nil unless vocab && model

  new(
    vocabulary: vocab,
    model: model,
    preload_embeddings: preload
  )
end

Instance Method Details

#cache_stats ⇒ Hash

Get cache statistics.

Returns:

  • (Hash)

    Cache statistics: :size and :max_size, plus :hit_rate once the hit counters are defined



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 166

def cache_stats
  stats = {
    size: @embedding_cache.size,
    max_size: @max_cache_size
  }
  stats[:hit_rate] = @cache_hits.to_f / (@cache_hits + @cache_misses) if defined?(@cache_hits)
  stats
end
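
The listing above computes :hit_rate whenever @cache_hits is defined. Where the counters are initialized is not shown on this page, so whether both can be zero at that point is unknown; if they can, the Float division yields NaN rather than raising. The sketch below shows a guarded variant. safe_hit_rate is a hypothetical helper, not part of the library.

```ruby
# Hypothetical helper, not part of the library: a hit rate that is nil
# until at least one lookup has been counted.
def safe_hit_rate(hits, misses)
  total = hits.to_i + misses.to_i # nil.to_i is 0, so unset counters are tolerated
  return nil if total.zero?

  hits.to_f / total
end

# Float division by an Integer zero yields NaN instead of raising an error.
unguarded = 0.to_f / (0 + 0)

guarded_empty = safe_hit_rate(0, 0) # nil: no lookups recorded yet
guarded_used  = safe_hit_rate(3, 1) # three hits out of four lookups
```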

#clear_cache ⇒ self

Clear the embedding cache and discard any pre-loaded embedding matrix. After this call, embeddings_loaded is false.

Returns:

  • (self)

    Self for chaining



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 156

def clear_cache
  @embedding_cache.clear
  @embedding_matrix = nil
  @embeddings_loaded = false
  self
end

#cosine_similarity(vec1, vec2) ⇒ Float

Compute cosine similarity between two embedding vectors. Returns 0.0 if either vector is nil or has zero magnitude.

Parameters:

  • vec1 (Array<Float>)

    First vector

  • vec2 (Array<Float>)

    Second vector

Returns:

  • (Float)

    Cosine similarity (-1.0 to 1.0)



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 112

def cosine_similarity(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil?

  # Compute dot product
  dot = vec1.zip(vec2).sum { |a, b| a * b }

  # Compute magnitudes
  norm1 = Math.sqrt(vec1.sum { |x| x * x })
  norm2 = Math.sqrt(vec2.sum { |x| x * x })

  return 0.0 if norm1.zero? || norm2.zero?

  dot / (norm1 * norm2)
end
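
The method above can be exercised in isolation. The sketch below copies its logic into a standalone function, purely for illustration, and runs it on a few small vectors. Both vectors are assumed to have the same length.

```ruby
# Standalone copy of the cosine similarity logic shown above (illustrative only).
def cosine_similarity(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil?

  dot = vec1.zip(vec2).sum { |a, b| a * b }
  norm1 = Math.sqrt(vec1.sum { |x| x * x })
  norm2 = Math.sqrt(vec2.sum { |x| x * x })
  return 0.0 if norm1.zero? || norm2.zero?

  dot / (norm1 * norm2)
end

same     = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # parallel: 1.0
orthog   = cosine_similarity([1.0, 0.0], [0.0, 3.0])   # orthogonal: 0.0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0]) # anti-parallel: close to -1.0
zero_vec = cosine_similarity([0.0, 0.0], [1.0, 1.0])   # zero-magnitude guard: 0.0
```

Because of floating-point rounding, results such as the anti-parallel case are best compared against a small tolerance rather than tested for exact equality.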

#find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>

Find k nearest neighbors for a word.

Parameters:

  • query_word (String)

    The query word

  • k (Integer) (defaults to: 10)

    Number of neighbors to return

  • exclude_self (Boolean) (defaults to: true)

    Whether to exclude the query word itself

  • min_similarity (Float) (defaults to: 0.0)

    Minimum similarity threshold (0.0 to 1.0)

Returns:

  • (Array<Hash>)

    Array of similarity hashes; empty if no embedding is available for the query word



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 69

def find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0)
  # Get query embedding
  query_vec = get_embedding(query_word)
  return [] unless query_vec

  # Find neighbors
  if @embedding_matrix
    nearest_from_matrix(query_vec, k, exclude_self, min_similarity)
  else
    nearest_from_cache(query_vec, k, exclude_self, min_similarity)
  end
end
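
The private helpers nearest_from_matrix and nearest_from_cache are not shown on this page. As a rough illustration of what a k-nearest lookup over an index-to-vector hash involves, here is a brute-force sketch. The function brute_force_nearest and the :index and :similarity result keys are assumptions made for this example, not the library's actual implementation or return format.

```ruby
# Cosine similarity between two equal-length vectors (0.0 for zero vectors).
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  na = Math.sqrt(a.sum { |x| x * x })
  nb = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if na.zero? || nb.zero?

  dot / (na * nb)
end

# Hypothetical helper: score every vector against the query, filter, keep the top k.
# This is NOT the library's nearest_from_matrix, only a sketch of the idea.
def brute_force_nearest(matrix, query_idx, k: 10, exclude_self: true, min_similarity: 0.0)
  query_vec = matrix.fetch(query_idx)
  scored = matrix.filter_map do |idx, vec|
    next if exclude_self && idx == query_idx

    sim = cosine(query_vec, vec)
    { index: idx, similarity: sim } if sim >= min_similarity
  end
  scored.max_by(k) { |h| h[:similarity] } # highest similarity first
end

matrix = {
  0 => [1.0, 0.0],
  1 => [0.9, 0.1],
  2 => [0.0, 1.0],
  3 => [-1.0, 0.0]
}
neighbors = brute_force_nearest(matrix, 0, k: 2)
```

With the default min_similarity of 0.0, the anti-parallel vector at index 3 is filtered out, so the two neighbors returned are indices 1 and 2. A full scan like this costs one similarity computation per vocabulary entry.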

#find_nearest_batch(query_words, k: 10) ⇒ Hash<String, Array<Hash>>

Find k nearest neighbors for multiple words.

Parameters:

  • query_words (Array<String>)

    Query words

  • k (Integer) (defaults to: 10)

    Number of neighbors per word

Returns:

  • (Hash<String, Array<Hash>>)

    Word to neighbors mapping



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 87

def find_nearest_batch(query_words, k: 10)
  query_words.each_with_object({}) do |word, result|
    result[word] = find_nearest(word, k: k)
  end
end

#preload_embeddings! ⇒ Boolean

Preload all embeddings into memory for faster search.

Returns:

  • (Boolean)

    True if loaded successfully; false if embeddings were already loaded or loading failed



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 130

def preload_embeddings!
  return false if @embedding_matrix

  # Get all indices
  all_indices = (0...@vocabulary.size).to_a

  # Batch load embeddings
  vectors = @model.get_embeddings(all_indices)
  return false if vectors.nil? || vectors.empty?

  # Store as hash for now (could use Numo::SFloat for efficiency)
  @embedding_matrix = {}
  all_indices.zip(vectors).each do |idx, vec|
    @embedding_matrix[idx] = vec
  end

  @embeddings_loaded = true
  true
rescue StandardError => e
  warn "Failed to preload embeddings: #{e.message}"
  false
end
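
The preload step above boils down to asking the model for one vector per vocabulary index and storing the result keyed by index. The sketch below reproduces that step with stand-in objects. StubVocabulary, StubModel and build_matrix are hypothetical and exist only so the example runs without an ONNX model.

```ruby
# Stand-ins for the real Vocabulary and OnnxRuntimeModel (hypothetical stubs).
StubVocabulary = Struct.new(:size)

class StubModel
  # Mirrors the get_embeddings(indices) call used above; returns one vector per index.
  def get_embeddings(indices)
    indices.map { |i| [i.to_f, 1.0] }
  end
end

# Build an index => vector hash, or nil when the model returns nothing.
def build_matrix(vocabulary, model)
  all_indices = (0...vocabulary.size).to_a
  vectors = model.get_embeddings(all_indices)
  return nil if vectors.nil? || vectors.empty?

  all_indices.zip(vectors).to_h
end

matrix = build_matrix(StubVocabulary.new(3), StubModel.new)
empty  = build_matrix(StubVocabulary.new(0), StubModel.new) # nil: nothing to load
```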

#similarity(word1, word2) ⇒ Float?

Compute similarity between two words.

Parameters:

  • word1 (String)

    First word

  • word2 (String)

    Second word

Returns:

  • (Float, nil)

    Cosine similarity (-1.0 to 1.0), or nil if either word is not found



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 98

def similarity(word1, word2)
  vec1 = get_embedding(word1)
  vec2 = get_embedding(word2)

  return nil unless vec1 && vec2

  cosine_similarity(vec1, vec2)
end

#to_s ⇒ String

Also known as: inspect

String representation.

Returns:

  • (String)

    String representation



# File 'lib/kotoshu/embeddings/similarity_search.rb', line 178

def to_s
  "SimilaritySearch(vocab_size: #{@vocabulary.size}, loaded: #{@embeddings_loaded})"
end