Class: Kotoshu::Embeddings::SimilaritySearch
- Inherits: Object
  - Object
    - Kotoshu::Embeddings::SimilaritySearch
- Defined in: lib/kotoshu/embeddings/similarity_search.rb
Overview
Similarity search for embedding-based nearest neighbor lookup.
Efficiently finds semantically similar words using cosine similarity. Supports both on-the-fly computation and pre-computed embedding matrices.
Instance Attribute Summary
- #embeddings_loaded ⇒ Boolean (readonly)
  Whether embeddings are pre-loaded.
- #model ⇒ OnnxRuntimeModel (readonly)
  The ONNX model.
- #vocabulary ⇒ Vocabulary (readonly)
  The vocabulary.
Class Method Summary
- .from_cache(language_code, cache: nil, preload: false) ⇒ SimilaritySearch?
  Create an instance from a cached vocabulary and model.
Instance Method Summary
- #cache_stats ⇒ Hash
  Get cache statistics.
- #clear_cache ⇒ self
  Clear the embedding cache and any preloaded embedding matrix.
- #cosine_similarity(vec1, vec2) ⇒ Float
  Compute cosine similarity between two embedding vectors.
- #find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>
  Find the k nearest neighbors for a word.
- #find_nearest_batch(query_words, k: 10) ⇒ Hash<String, Array<Hash>>
  Find the k nearest neighbors for each of several words.
- #initialize(vocabulary:, model:, preload_embeddings: false, max_cache_size: 1000) ⇒ SimilaritySearch (constructor)
  Create a new similarity search instance.
- #preload_embeddings! ⇒ Boolean
  Preload all embeddings into memory for faster search.
- #similarity(word1, word2) ⇒ Float
  Compute cosine similarity between two words.
- #to_s ⇒ String (also: #inspect)
  String representation.
Constructor Details
#initialize(vocabulary:, model:, preload_embeddings: false, max_cache_size: 1000) ⇒ SimilaritySearch
Create a new similarity search instance.
# File 'lib/kotoshu/embeddings/similarity_search.rb', line 43

def initialize(vocabulary:, model:, preload_embeddings: false, max_cache_size: 1000)
  @vocabulary = vocabulary
  @model = model
  @preload_embeddings = preload_embeddings
  @max_cache_size = max_cache_size

  # Embedding cache (word -> vector)
  @embedding_cache = {}

  # Pre-loaded embedding matrix (for faster search)
  @embedding_matrix = nil

  # Track whether embeddings are preloaded
  @embeddings_loaded = false

  # Load embeddings if requested
  preload_embeddings! if preload_embeddings
end
Instance Attribute Details
#embeddings_loaded ⇒ Boolean (readonly)
Returns whether embeddings are pre-loaded.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 35

def embeddings_loaded
  @embeddings_loaded
end
#model ⇒ OnnxRuntimeModel (readonly)
Returns the ONNX model.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 32

def model
  @model
end
#vocabulary ⇒ Vocabulary (readonly)
Returns the vocabulary.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 29

def vocabulary
  @vocabulary
end
Class Method Details
.from_cache(language_code, cache: nil, preload: false) ⇒ SimilaritySearch?
Create an instance from a cached vocabulary and model. Returns nil if either the vocabulary or the model cannot be loaded from the cache.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 317

def self.from_cache(language_code, cache: nil, preload: false)
  vocab = Vocabulary.from_cache(language_code, cache: cache)
  model = OnnxRuntimeModel.from_cache(language_code, cache: cache)

  return nil unless vocab && model

  new(
    vocabulary: vocab,
    model: model,
    preload_embeddings: preload
  )
end
Instance Method Details
#cache_stats ⇒ Hash
Get cache statistics: the current cache size, the maximum cache size, and, when hit counters have been initialized, the hit rate.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 166

def cache_stats
  stats = {
    size: @embedding_cache.size,
    max_size: @max_cache_size
  }
  stats[:hit_rate] = @cache_hits.to_f / (@cache_hits + @cache_misses) if defined?(@cache_hits)
  stats
end
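The hit-rate expression above divides the hit count by the total number of lookups. In Ruby, dividing a Float by zero yields NaN rather than raising, so if both counters exist but are still zero, the reported rate would be NaN. The sketch below illustrates that arithmetic with an explicit guard; the `hit_rate` helper is hypothetical and not part of Kotoshu.

```ruby
# Hypothetical helper illustrating the hit-rate arithmetic used above,
# with an explicit guard for the "no lookups yet" case.
def hit_rate(hits, misses)
  total = hits + misses
  return nil if total.zero? # nothing to report before the first lookup

  hits.to_f / total
end

unguarded = 0.to_f / (0 + 0) # Float division by zero gives NaN, not an error
guarded   = hit_rate(0, 0)   # nil instead of NaN
typical   = hit_rate(3, 1)   # 3 hits out of 4 lookups
```

Whether to report nil, 0.0, or omit the key entirely before the first lookup is a design choice; the listing above omits `:hit_rate` only when the counter has never been defined.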
#clear_cache ⇒ self
Clear the embedding cache. This also discards any preloaded embedding matrix and resets the loaded flag.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 156

def clear_cache
  @embedding_cache.clear
  @embedding_matrix = nil
  @embeddings_loaded = false
  self
end
#cosine_similarity(vec1, vec2) ⇒ Float
Compute cosine similarity between two embedding vectors. Returns 0.0 if either vector is nil or has zero magnitude.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 112

def cosine_similarity(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil?

  # Compute dot product
  dot = vec1.zip(vec2).sum { |a, b| a * b }

  # Compute magnitudes
  norm1 = Math.sqrt(vec1.sum { |x| x * x })
  norm2 = Math.sqrt(vec2.sum { |x| x * x })

  return 0.0 if norm1.zero? || norm2.zero?

  dot / (norm1 * norm2)
end
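As a quick check of the formula, the following standalone sketch reproduces the same computation outside the class and applies it to a few hand-picked vectors. The free-standing method name `cosine` and the sample vectors are illustrative assumptions, not part of Kotoshu.

```ruby
# Standalone copy of the cosine formula shown above.
# `cosine` is a hypothetical helper name used only for this sketch.
def cosine(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil?

  dot = vec1.zip(vec2).sum { |a, b| a * b }
  norm1 = Math.sqrt(vec1.sum { |x| x * x })
  norm2 = Math.sqrt(vec2.sum { |x| x * x })
  return 0.0 if norm1.zero? || norm2.zero?

  dot / (norm1 * norm2)
end

parallel   = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) # same direction, close to 1.0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])           # perpendicular, 0.0
opposite   = cosine([1.0, 0.0], [-1.0, 0.0])          # opposite direction, -1.0
zero_vec   = cosine([0.0, 0.0], [1.0, 2.0])           # zero magnitude, guarded to 0.0
```

Note that `Array#zip` pads with nil when its argument is shorter than the receiver, so a `vec2` shorter than `vec1` raises a TypeError in the dot product, while extra components in a longer `vec2` are ignored by the dot product but still counted in its norm. Callers would normally pass vectors of equal dimension.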
#find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>
Find the k nearest neighbors for a word. Returns an empty array if no embedding vector is available for the query word. Searches the preloaded embedding matrix when one exists, and the embedding cache otherwise.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 69

def find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0)
  # Get query embedding
  query_vec = (query_word)
  return [] unless query_vec

  # Find neighbors
  if @embedding_matrix
    nearest_from_matrix(query_vec, k, exclude_self, min_similarity)
  else
    nearest_from_cache(query_vec, k, exclude_self, min_similarity)
  end
end
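The helpers `nearest_from_matrix` and `nearest_from_cache` are not shown on this page. The sketch below is one way a brute-force top-k search with the same options could look. It is an illustration under stated assumptions, not Kotoshu's implementation: the `embeddings` Hash (word to vector), the `nearest` and `cosine` helpers, and the `:word`/`:similarity` result keys are all hypothetical.

```ruby
# Brute-force nearest-neighbor sketch. The data layout, helper names, and
# result keys are assumptions for illustration, not Kotoshu's internals.
def cosine(vec1, vec2)
  dot = vec1.zip(vec2).sum { |a, b| a * b }
  norm1 = Math.sqrt(vec1.sum { |x| x * x })
  norm2 = Math.sqrt(vec2.sum { |x| x * x })
  return 0.0 if norm1.zero? || norm2.zero?

  dot / (norm1 * norm2)
end

def nearest(embeddings, query_word, k: 10, exclude_self: true, min_similarity: 0.0)
  query_vec = embeddings[query_word]
  return [] unless query_vec

  # Score every candidate, dropping the query itself and weak matches.
  scored = embeddings.filter_map do |word, vec|
    next if exclude_self && word == query_word

    sim = cosine(query_vec, vec)
    { word: word, similarity: sim } if sim >= min_similarity
  end

  # max_by(k) returns the k highest-scoring entries, best first.
  scored.max_by(k) { |entry| entry[:similarity] }
end

embeddings = {
  "cat"    => [1.0, 0.0],
  "kitten" => [0.9, 0.1],
  "dog"    => [0.6, 0.8],
  "car"    => [0.0, 1.0]
}

neighbors = nearest(embeddings, "cat", k: 2)
strict    = nearest(embeddings, "cat", min_similarity: 0.9)
```

Scoring every vocabulary entry costs time proportional to the vocabulary size per query, which is why holding all vectors in memory ahead of time can pay off for repeated searches.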
#find_nearest_batch(query_words, k: 10) ⇒ Hash<String, Array<Hash>>
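Find the k nearest neighbors for each of several words. Each word is looked up with #find_nearest using the default exclude_self and min_similarity values; the result maps each query word to its list of neighbors.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 87

def find_nearest_batch(query_words, k: 10)
  query_words.each_with_object({}) do |word, result|
    result[word] = find_nearest(word, k: k)
  end
end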
#preload_embeddings! ⇒ Boolean
Preload all embeddings into memory for faster search. Returns true on success. Returns false if the embeddings are already loaded, if the model returns no vectors, or if an error is raised while loading.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 130

def preload_embeddings!
  return false if @embedding_matrix

  # Get all indices
  all_indices = (0...@vocabulary.size).to_a

  # Batch load embeddings
  vectors = @model.(all_indices)
  return false if vectors.nil? || vectors.empty?

  # Store as hash for now (could use Numo::SFloat for efficiency)
  @embedding_matrix = {}
  all_indices.zip(vectors).each do |idx, vec|
    @embedding_matrix[idx] = vec
  end

  @embeddings_loaded = true
  true
rescue StandardError => e
  warn "Failed to preload embeddings: #{e.}"
  false
end
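The preload flow above (skip if already loaded, fetch all vectors in one batch, index them by vocabulary position, report failure as false) can be exercised in isolation with a stub model. This is a sketch under assumptions: `StubModel`, its `embeddings_for` method, and `PreloadSketch` are hypothetical names, since the real batch call on the model is not visible on this page.

```ruby
# Stub standing in for the ONNX model; `embeddings_for` is a hypothetical
# method name chosen for this sketch only.
class StubModel
  def embeddings_for(indices)
    indices.map { |i| [i.to_f, 1.0] }
  end
end

# Minimal stand-in for the preload logic, not the Kotoshu class itself.
class PreloadSketch
  attr_reader :embedding_matrix

  def initialize(model, vocab_size)
    @model = model
    @vocab_size = vocab_size
    @embedding_matrix = nil
  end

  def preload!
    return false if @embedding_matrix # already loaded: nothing to do

    indices = (0...@vocab_size).to_a
    vectors = @model.embeddings_for(indices)
    return false if vectors.nil? || vectors.empty?

    # zip pairs each index with its vector; to_h builds index => vector.
    @embedding_matrix = indices.zip(vectors).to_h
    true
  rescue StandardError => e
    warn "Failed to preload embeddings: #{e.message}"
    false
  end
end

sketch = PreloadSketch.new(StubModel.new, 3)
first_call  = sketch.preload! # true: matrix built
second_call = sketch.preload! # false: matrix already present
```

Because false covers both "already loaded" and "failed", a caller that needs to tell the two apart would check the loaded state (here `embedding_matrix`, in the documented class `#embeddings_loaded`) rather than rely on the return value alone.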
#similarity(word1, word2) ⇒ Float
Compute cosine similarity between two words. Returns nil when no embedding vector is available for either word.

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 98

def similarity(word1, word2)
  vec1 = (word1)
  vec2 = (word2)

  return nil unless vec1 && vec2

  cosine_similarity(vec1, vec2)
end
#to_s ⇒ String Also known as: inspect
String representation.
# File 'lib/kotoshu/embeddings/similarity_search.rb', line 178

def to_s
  "SimilaritySearch(vocab_size: #{@vocabulary.size}, loaded: #{@embeddings_loaded})"
end