Class: Search

Inherits:
Object
Defined in:
lib/kotoshu/embeddings/search.rb

Overview

Search: brute-force nearest-neighbor search

Performs an exhaustive search over all vocabulary entries. Uses a min-heap bounded at k entries for top-k selection, which costs O(n log k) for n vocabulary words instead of the O(n log n) of a full sort.

Examples:

search = Search.new(
  vocabulary: vocab,
  model: model,
  similarity_engine: engine
)
neighbors = search.find_nearest('hello', k: 5)

Defined Under Namespace

Classes: MinHeap
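
The nested MinHeap class is not documented on this page. As a rough illustration of the bounded top-k idea described in the overview, the sketch below uses a hypothetical BoundedMinHeap (the class name, the capacity argument, and the top_k helper are assumptions for illustration, not the library's code). It keeps at most k items and evicts the current minimum whenever a better candidate arrives, so each push costs O(log k).

```ruby
# Hypothetical stand-in for the nested MinHeap; not the library's implementation.
class BoundedMinHeap
  def initialize(capacity)
    @capacity = capacity
    @items = []
  end

  # Keep the item only if it belongs among the `capacity` highest similarities.
  def push(item)
    if @items.size < @capacity
      @items << item
      sift_up(@items.size - 1)
    elsif @capacity.positive? && item[:similarity] > @items[0][:similarity]
      @items[0] = item # evict the current minimum
      sift_down(0)
    end
    self
  end

  def to_a
    @items.dup
  end

  private

  # Move the item at index i up until its parent is no larger.
  def sift_up(i)
    while i.positive?
      parent = (i - 1) / 2
      break if @items[parent][:similarity] <= @items[i][:similarity]

      @items[parent], @items[i] = @items[i], @items[parent]
      i = parent
    end
  end

  # Move the item at index i down until both children are no smaller.
  def sift_down(i)
    loop do
      smallest = i
      [2 * i + 1, 2 * i + 2].each do |child|
        next if child >= @items.size

        smallest = child if @items[child][:similarity] < @items[smallest][:similarity]
      end
      break if smallest == i

      @items[smallest], @items[i] = @items[i], @items[smallest]
      i = smallest
    end
  end
end

# Select the k highest-scoring hashes, best first.
def top_k(scored, k)
  heap = BoundedMinHeap.new(k)
  scored.each { |item| heap.push(item) }
  heap.to_a.sort_by { |r| -r[:similarity] }
end
```

The final sort touches only k items, so the total cost stays at O(n log k) when k is much smaller than the vocabulary.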

Constructor Details

#initialize(vocabulary:, model:, similarity_engine:, pre_normalize: false) ⇒ Search

Create a new exact search

Parameters:

  • vocabulary (Vocabulary)

    Word vocabulary

  • model (EmbeddingModel)

    Embedding provider

  • similarity_engine (SimilarityEngine)

    Similarity calculator

  • pre_normalize (Boolean) (defaults to: false)

    Pre-normalize vectors for faster similarity



# File 'lib/kotoshu/embeddings/search.rb', line 66

def initialize(vocabulary:, model:, similarity_engine:, pre_normalize: false)
  @vocabulary = vocabulary
  @model = model
  @similarity_engine = similarity_engine
  @pre_normalize = pre_normalize

  @embedding_cache = {}
  @embeddings_loaded = false
end
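
This page does not show how pre_normalize is used internally. The usual reason for such an option is that cosine similarity between unit-length vectors reduces to a plain dot product, which skips two norm computations per comparison. The sketch below illustrates that identity with hypothetical helpers (dot, norm, cosine, and normalize are assumptions for illustration, not part of the SimilarityEngine API).

```ruby
# Hypothetical helpers; they only illustrate why pre-normalizing can help.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

def norm(v)
  Math.sqrt(dot(v, v))
end

# Full cosine similarity: two norms and one dot product per call.
def cosine(a, b)
  denom = norm(a) * norm(b)
  denom.zero? ? 0.0 : dot(a, b) / denom
end

# Scale a vector to unit length (zero vectors are returned unchanged).
def normalize(v)
  n = norm(v)
  n.zero? ? v : v.map { |x| x / n }
end

a = [3.0, 4.0]
b = [4.0, 3.0]

full = cosine(a, b)                     # norms computed on every comparison
fast = dot(normalize(a), normalize(b))  # norms computed once, up front

puts (full - fast).abs < 1e-12
```

In a brute-force scan the normalization cost is paid once per vocabulary vector, while the saving applies to every query.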

Instance Attribute Details

#embeddings_loaded ⇒ Boolean (readonly)

Returns whether embeddings are preloaded.

Returns:

  • (Boolean)

    Whether embeddings are preloaded



# File 'lib/kotoshu/embeddings/search.rb', line 57

def embeddings_loaded
  @embeddings_loaded
end

#model ⇒ EmbeddingModel (readonly)

Returns:

  • (EmbeddingModel)


# File 'lib/kotoshu/embeddings/search.rb', line 51

def model
  @model
end

#similarity_engine ⇒ SimilarityEngine (readonly)

Returns:

  • (SimilarityEngine)

# File 'lib/kotoshu/embeddings/search.rb', line 54

def similarity_engine
  @similarity_engine
end

#vocabulary ⇒ Vocabulary (readonly)

Returns:

  • (Vocabulary)

# File 'lib/kotoshu/embeddings/search.rb', line 48

def vocabulary
  @vocabulary
end

Instance Method Details

#clear_cache ⇒ self

Clear embedding cache

Returns:

  • (self)


# File 'lib/kotoshu/embeddings/search.rb', line 153

def clear_cache
  @embedding_cache.clear
  @embeddings_loaded = false
  self
end

#find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>

Find k nearest neighbors for a word

Parameters:

  • query_word (String)

    Query word

  • k (Integer) (defaults to: 10)

    Number of neighbors to return

  • exclude_self (Boolean) (defaults to: true)

    Exclude query word from results

  • min_similarity (Float) (defaults to: 0.0)

    Minimum similarity threshold

Returns:

  • (Array<Hash>)

    Array of hashes with the keys :word, :similarity, and :index, sorted by similarity in descending order



# File 'lib/kotoshu/embeddings/search.rb', line 84

def find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0)
  query_vec = get_embedding_for_word(query_word)
  return [] unless query_vec

  heap = MinHeap.new(k)

  @vocabulary.words.each do |word|
    next if exclude_self && word == query_word

    vec = get_embedding_for_word(word)
    next unless vec

    similarity = @similarity_engine.cosine(query_vec, vec)
    next if similarity < min_similarity

    index = @vocabulary.lookup(word)
    heap.push(word: word, similarity: similarity, index: index)
  end

  # Return sorted by similarity descending
  heap.to_a.sort_by { |r| -r[:similarity] }
end
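
To see how k, exclude_self, and min_similarity interact, here is a standalone sketch of the same scan on toy data. It is not the library's code: a plain Hash of word to vector (the hypothetical vectors argument) stands in for the vocabulary and model, a local cosine helper stands in for the similarity engine, and Enumerable#max_by(k) replaces the MinHeap for brevity.

```ruby
# Local cosine helper; stands in for the similarity engine.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  denom = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  denom.zero? ? 0.0 : dot / denom
end

# `vectors` is a hypothetical Hash of word => Array<Float>.
def nearest(vectors, query_word, k: 10, exclude_self: true, min_similarity: 0.0)
  query_vec = vectors[query_word]
  return [] unless query_vec # unknown query word: no results

  candidates = []
  vectors.each_with_index do |(word, vec), index|
    next if exclude_self && word == query_word

    similarity = cosine(query_vec, vec)
    next if similarity < min_similarity

    candidates << { word: word, similarity: similarity, index: index }
  end

  # Pick the k best, then order them best first.
  candidates.max_by(k) { |r| r[:similarity] }.sort_by { |r| -r[:similarity] }
end

vectors = {
  'king'   => [1.0, 0.0],
  'queen'  => [0.9, 0.1],
  'apple'  => [0.0, 1.0],
  'prince' => [0.8, 0.3]
}

nearest(vectors, 'king', k: 2).each do |r|
  puts format('%-6s %.4f (index %d)', r[:word], r[:similarity], r[:index])
end
```

Note that with the default min_similarity of 0.0, candidates with negative cosine similarity are dropped, while a similarity of exactly 0.0 still passes the filter.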

#find_nearest_batch(query_words, k: 10) ⇒ Hash<String, Array<Hash>>

Find nearest neighbors for multiple words

Parameters:

  • query_words (Array<String>)

    Query words

  • k (Integer) (defaults to: 10)

    Number of neighbors per word

Returns:

  • (Hash<String, Array<Hash>>)

    Word to results mapping



# File 'lib/kotoshu/embeddings/search.rb', line 113

def find_nearest_batch(query_words, k: 10)
  query_words.each_with_object({}) do |word, results|
    results[word] = find_nearest(word, k: k)
  end
end

#preload_embeddings! ⇒ self

Preload all embeddings into memory

Returns:

  • (self)


# File 'lib/kotoshu/embeddings/search.rb', line 137

def preload_embeddings!
  all_indices = (0...@vocabulary.size).to_a
  embeddings = @model.get_embeddings(all_indices)

  @vocabulary.words.each_with_index do |word, i|
    @embedding_cache[word] = embeddings[i]
  end

  @embeddings_loaded = true
  self
end

#similarity(word1, word2) ⇒ Float?

Compute similarity between two words

Parameters:

  • word1 (String)

    First word

  • word2 (String)

    Second word

Returns:

  • (Float, nil)

    Similarity or nil if either word not found



# File 'lib/kotoshu/embeddings/search.rb', line 125

def similarity(word1, word2)
  vec1 = get_embedding_for_word(word1)
  vec2 = get_embedding_for_word(word2)
  return nil unless vec1 && vec2

  @similarity_engine.cosine(vec1, vec2)
end

#to_s ⇒ String (also known as: inspect)

String representation

Returns:

  • (String)


# File 'lib/kotoshu/embeddings/search.rb', line 163

def to_s
  "ExactSearch(vocab: #{@vocabulary.size}, loaded: #{@embeddings_loaded})"
end