Class: Search
- Inherits: Object
  - Object
    - Search
- Defined in: lib/kotoshu/embeddings/search.rb
Overview
Search - Brute-force nearest neighbor search
Performs an exhaustive search over all vocabulary entries. Uses a min-heap for top-k selection, which costs O(n log k) instead of the O(n log n) of a full sort.
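The top-k selection described above can be sketched with a bounded min-heap. This is only an illustration: `BoundedMinHeap` is a hypothetical stand-in, since the nested `MinHeap` class is not shown on this page and may differ. The heap holds at most k items with the smallest similarity at the root, so each push costs O(log k).

```ruby
# Hypothetical stand-in for the nested MinHeap class (assumption, not the
# library's implementation). Items are hashes with a :similarity key.
class BoundedMinHeap
  def initialize(capacity)
    @capacity = capacity
    @items = [] # binary heap, smallest similarity at index 0
  end

  # Keep only the `capacity` highest-similarity items seen so far.
  def push(item)
    if @items.size < @capacity
      @items << item
      sift_up(@items.size - 1)
    elsif @capacity.positive? && item[:similarity] > @items[0][:similarity]
      @items[0] = item # evict the current minimum
      sift_down(0)
    end
    self
  end

  def to_a
    @items.dup
  end

  private

  def sift_up(i)
    while i.positive?
      parent = (i - 1) / 2
      break if @items[parent][:similarity] <= @items[i][:similarity]

      @items[parent], @items[i] = @items[i], @items[parent]
      i = parent
    end
  end

  def sift_down(i)
    loop do
      left = 2 * i + 1
      right = left + 1
      smallest = i
      smallest = left if left < @items.size && @items[left][:similarity] < @items[smallest][:similarity]
      smallest = right if right < @items.size && @items[right][:similarity] < @items[smallest][:similarity]
      break if smallest == i

      @items[smallest], @items[i] = @items[i], @items[smallest]
      i = smallest
    end
  end
end

heap = BoundedMinHeap.new(2)
[0.1, 0.9, 0.4, 0.7].each_with_index do |sim, i|
  heap.push(word: "w#{i}", similarity: sim)
end

# Heap order is not sorted order, so sort the k survivors at the end.
top = heap.to_a.sort_by { |r| -r[:similarity] }
```

Only the k retained items are sorted at the end, which is why the overall cost stays at O(n log k) for n vocabulary entries.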
Defined Under Namespace
Classes: MinHeap
Instance Attribute Summary
- #embeddings_loaded ⇒ Boolean readonly
  Whether embeddings are preloaded.
- #model ⇒ EmbeddingModel readonly
- #similarity_engine ⇒ SimilarityEngine readonly
- #vocabulary ⇒ Vocabulary readonly
Instance Method Summary
- #clear_cache ⇒ self
  Clear embedding cache.
- #find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>
  Find k nearest neighbors for a word.
- #find_nearest_batch(query_words, k: 10) ⇒ Hash<String, Array<Hash>>
  Find nearest neighbors for multiple words.
- #initialize(vocabulary:, model:, similarity_engine:, pre_normalize: false) ⇒ Search constructor
  Create a new exact search.
- #preload_embeddings! ⇒ self
  Preload all embeddings into memory.
- #similarity(word1, word2) ⇒ Float?
  Compute similarity between two words.
- #to_s ⇒ String (also: #inspect)
  String representation.
Constructor Details
#initialize(vocabulary:, model:, similarity_engine:, pre_normalize: false) ⇒ Search
Create a new exact search
    # File 'lib/kotoshu/embeddings/search.rb', line 66
    def initialize(vocabulary:, model:, similarity_engine:, pre_normalize: false)
      @vocabulary = vocabulary
      @model = model
      @similarity_engine = similarity_engine
      @pre_normalize = pre_normalize
      @embedding_cache = {}
      @embeddings_loaded = false
    end
Instance Attribute Details
#embeddings_loaded ⇒ Boolean (readonly)
Returns whether embeddings are preloaded.
    # File 'lib/kotoshu/embeddings/search.rb', line 57
    def embeddings_loaded
      @embeddings_loaded
    end
#model ⇒ EmbeddingModel (readonly)
    # File 'lib/kotoshu/embeddings/search.rb', line 51
    def model
      @model
    end
#similarity_engine ⇒ SimilarityEngine (readonly)
    # File 'lib/kotoshu/embeddings/search.rb', line 54
    def similarity_engine
      @similarity_engine
    end
#vocabulary ⇒ Vocabulary (readonly)
    # File 'lib/kotoshu/embeddings/search.rb', line 48
    def vocabulary
      @vocabulary
    end
Instance Method Details
#clear_cache ⇒ self
Clear embedding cache
    # File 'lib/kotoshu/embeddings/search.rb', line 153
    def clear_cache
      @embedding_cache.clear
      @embeddings_loaded = false
      self
    end
#find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>
Find k nearest neighbors for a word
    # File 'lib/kotoshu/embeddings/search.rb', line 84
    def find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0)
      query_vec = (query_word)
      return [] unless query_vec

      heap = MinHeap.new(k)

      @vocabulary.words.each do |word|
        next if exclude_self && word == query_word

        vec = (word)
        next unless vec

        similarity = @similarity_engine.cosine(query_vec, vec)
        next if similarity < min_similarity

        index = @vocabulary.lookup(word)
        heap.push(word: word, similarity: similarity, index: index)
      end

      # Return sorted by similarity descending
      heap.to_a.sort_by { |r| -r[:similarity] }
    end
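As a rough, self-contained illustration of the same brute-force loop, the sketch below scores every candidate by cosine similarity and keeps the best k. It makes several assumptions: `EMBEDDINGS`, `cosine`, and `brute_force_nearest` are hypothetical names with toy data, not part of this library, and the heap is swapped for Ruby's `Enumerable#max_by(k)` to keep the example short.

```ruby
# Cosine similarity of two equal-length numeric arrays; nil for a zero vector.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? nil : dot / norm
end

# Hypothetical toy data: word => embedding vector.
EMBEDDINGS = {
  "cat" => [1.0, 0.0],
  "kitten" => [0.9, 0.1],
  "dog" => [0.6, 0.8],
  "car" => [0.0, 1.0]
}.freeze

# Mirrors the keyword arguments of find_nearest, but is not the library method.
def brute_force_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0)
  query_vec = EMBEDDINGS[query_word]
  return [] unless query_vec # unknown query word

  candidates = EMBEDDINGS.filter_map do |word, vec|
    next if exclude_self && word == query_word

    sim = cosine(query_vec, vec)
    next if sim.nil? || sim < min_similarity

    { word: word, similarity: sim }
  end

  # max_by(k) returns the k best candidates, highest similarity first.
  candidates.max_by(k) { |r| r[:similarity] }
end

nearest = brute_force_nearest("cat", k: 2)
```

Each result is a hash with `:word` and `:similarity` keys; the documented method additionally records the vocabulary `:index` of each match.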
#find_nearest_batch(query_words, k: 10) ⇒ Hash<String, Array<Hash>>
Find nearest neighbors for multiple words
    # File 'lib/kotoshu/embeddings/search.rb', line 113
    def find_nearest_batch(query_words, k: 10)
      query_words.each_with_object({}) do |word, results|
        results[word] = find_nearest(word, k: k)
      end
    end
#preload_embeddings! ⇒ self
Preload all embeddings into memory
    # File 'lib/kotoshu/embeddings/search.rb', line 137
    def preload_embeddings!
      all_indices = (0...@vocabulary.size).to_a
      = @model.(all_indices)

      @vocabulary.words.each_with_index do |word, i|
        @embedding_cache[word] = [i]
      end

      @embeddings_loaded = true
      self
    end
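The preloading pattern (one batched fetch for every vocabulary index, then a word-keyed cache) might look like the sketch below. `ToyModel` and its `fetch_vectors` method are hypothetical names chosen for illustration; the actual model method called here is not recoverable from this page.

```ruby
# Hypothetical stand-in for an embedding model: returns one vector per index.
class ToyModel
  attr_reader :calls

  def initialize(vectors)
    @vectors = vectors
    @calls = 0 # counts batched fetches, to show that only one is needed
  end

  # Hypothetical method name (assumption); the real model API is not shown.
  def fetch_vectors(indices)
    @calls += 1
    indices.map { |i| @vectors[i] }
  end
end

words = %w[cat dog car]
model = ToyModel.new([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])

# One batched fetch for the whole vocabulary, then index the result by word.
cache = {}
vectors = model.fetch_vectors((0...words.size).to_a)
words.each_with_index { |word, i| cache[word] = vectors[i] }
```

After preloading, lookups during a search are plain hash reads, at the cost of holding every vector in memory.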
#similarity(word1, word2) ⇒ Float?
Compute similarity between two words
    # File 'lib/kotoshu/embeddings/search.rb', line 125
    def similarity(word1, word2)
      vec1 = (word1)
      vec2 = (word2)

      return nil unless vec1 && vec2

      @similarity_engine.cosine(vec1, vec2)
    end
#to_s ⇒ String Also known as: inspect
String representation
    # File 'lib/kotoshu/embeddings/search.rb', line 163
    def to_s
      "ExactSearch(vocab: #{@vocabulary.size}, loaded: #{@embeddings_loaded})"
    end