Class: Search

Inherits:

Object

Defined in: lib/kotoshu/embeddings/search.rb

Overview

Search - Brute-force nearest-neighbor search

Performs an exhaustive search over all vocabulary entries. Uses a min-heap for efficient top-k selection (O(n log k) instead of the O(n log n) of a full sort).

Examples:

search = Search.new(
  vocabulary: vocab,
  model: model,
  similarity_engine: engine
)
neighbors = search.find_nearest('hello', k: 5)

Defined Under Namespace

Classes: MinHeap
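
The source of MinHeap is not listed on this page. As a rough illustration of the top-k selection mentioned in the overview, the following sketch keeps only the k highest-scoring items in a bounded min-heap. The class name BoundedMinHeap and its internals are assumptions made for this example, not the library's implementation.

```ruby
# Hypothetical bounded min-heap for top-k selection (not the library's MinHeap).
# The root is always the lowest-scoring item currently kept.
class BoundedMinHeap
  def initialize(capacity)
    @capacity = capacity
    @items = []
  end

  # Keep only the `capacity` highest-similarity items seen so far.
  def push(item)
    if @items.size < @capacity
      @items << item
      sift_up(@items.size - 1)
    elsif @capacity.positive? && item[:similarity] > @items[0][:similarity]
      # The new item beats the current minimum, so it replaces the root.
      @items[0] = item
      sift_down(0)
    end
    self
  end

  def to_a
    @items.dup
  end

  private

  def sift_up(i)
    while i.positive?
      parent = (i - 1) / 2
      break if @items[parent][:similarity] <= @items[i][:similarity]

      @items[parent], @items[i] = @items[i], @items[parent]
      i = parent
    end
  end

  def sift_down(i)
    loop do
      smallest = i
      [2 * i + 1, 2 * i + 2].each do |child|
        next if child >= @items.size

        smallest = child if @items[child][:similarity] < @items[smallest][:similarity]
      end
      break if smallest == i

      @items[smallest], @items[i] = @items[i], @items[smallest]
      i = smallest
    end
  end
end

heap = BoundedMinHeap.new(2)
[0.1, 0.9, 0.4, 0.7].each_with_index do |score, i|
  heap.push(word: "w#{i}", similarity: score)
end
top = heap.to_a.sort_by { |r| -r[:similarity] }
```

Each push costs at most O(log k), so scanning n candidates costs O(n log k), and only the final k items need sorting.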

Instance Attribute Summary

#embeddings_loaded ⇒ Boolean readonly

Whether embeddings are preloaded.
#model ⇒ EmbeddingModel readonly
#similarity_engine ⇒ SimilarityEngine readonly
#vocabulary ⇒ Vocabulary readonly

Instance Method Summary

#clear_cache ⇒ self

Clear embedding cache.
#find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>

Find k nearest neighbors for a word.
#find_nearest_batch(query_words, k: 10) ⇒ Hash<String, Array<Hash>>

Find nearest neighbors for multiple words.
#initialize(vocabulary:, model:, similarity_engine:, pre_normalize: false) ⇒ Search constructor

Create a new exact search.
#preload_embeddings! ⇒ self

Preload all embeddings into memory.
#similarity(word1, word2) ⇒ Float?

Compute similarity between two words.
#to_s ⇒ String (also: #inspect)

String representation.

Constructor Details

#initialize(vocabulary:, model:, similarity_engine:, pre_normalize: false) ⇒ `Search`

Create a new exact search

Parameters:

vocabulary (Vocabulary) —

Word vocabulary
model (EmbeddingModel) —

Embedding provider
similarity_engine (SimilarityEngine) —

Similarity calculator
pre_normalize (Boolean) (defaults to: false) —

Pre-normalize vectors for faster similarity

# File 'lib/kotoshu/embeddings/search.rb', line 66

def initialize(vocabulary:, model:, similarity_engine:, pre_normalize: false)
  @vocabulary = vocabulary
  @model = model
  @similarity_engine = similarity_engine
  @pre_normalize = pre_normalize

  @embedding_cache = {}
  @embeddings_loaded = false
end
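
The listing above only stores pre_normalize; how the flag is applied is not shown on this page. The sketch below illustrates the general idea behind pre-normalization: once vectors are scaled to unit length, cosine similarity reduces to a plain dot product, so the norms do not have to be recomputed for every comparison. The helper names dot, normalize, and cosine are assumptions for this example and are not part of the documented API.

```ruby
# Hypothetical helpers illustrating pre-normalization (not the library's code).

# Dot product of two equal-length numeric arrays.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Scale a vector to unit length; a zero vector is returned unchanged.
def normalize(vec)
  norm = Math.sqrt(dot(vec, vec))
  return vec.dup if norm.zero?

  vec.map { |x| x / norm }
end

# Full cosine similarity: both norms are computed on every call.
def cosine(a, b)
  denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))
  return 0.0 if denom.zero?

  dot(a, b) / denom
end

a = [3.0, 4.0]
b = [4.0, 3.0]

full = cosine(a, b)                     # norms computed per comparison
fast = dot(normalize(a), normalize(b))  # norms computed once, up front
```

Both values agree up to floating-point rounding; the saving comes from normalizing each vocabulary vector once instead of once per query.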

Instance Attribute Details

#embeddings_loaded ⇒ `Boolean` (readonly)

Returns whether embeddings are preloaded.

Returns:

(Boolean) —

Whether embeddings are preloaded




# File 'lib/kotoshu/embeddings/search.rb', line 57

def embeddings_loaded
  @embeddings_loaded
end

#model ⇒ `EmbeddingModel` (readonly)

Returns:

(EmbeddingModel)




# File 'lib/kotoshu/embeddings/search.rb', line 51

def model
  @model
end

#similarity_engine ⇒ `SimilarityEngine` (readonly)

Returns:

(SimilarityEngine)




# File 'lib/kotoshu/embeddings/search.rb', line 54

def similarity_engine
  @similarity_engine
end

#vocabulary ⇒ `Vocabulary` (readonly)

Returns:

(Vocabulary)




# File 'lib/kotoshu/embeddings/search.rb', line 48

def vocabulary
  @vocabulary
end

Instance Method Details

#clear_cache ⇒ `self`

Clear embedding cache

Returns:

(self)

# File 'lib/kotoshu/embeddings/search.rb', line 153

def clear_cache
  @embedding_cache.clear
  @embeddings_loaded = false
  self
end

#find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ `Array<Hash>`

Find k nearest neighbors for a word

Parameters:

query_word (String) —

Query word
k (Integer) (defaults to: 10) —

Number of neighbors to return
exclude_self (Boolean) (defaults to: true) —

Exclude query word from results
min_similarity (Float) (defaults to: 0.0) —

Minimum similarity threshold

Returns:

(Array<Hash>) —

Array of hashes with :word, :similarity, and :index keys, sorted by similarity in descending order

# File 'lib/kotoshu/embeddings/search.rb', line 84

def find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0)
  query_vec = get_embedding_for_word(query_word)
  return [] unless query_vec

  heap = MinHeap.new(k)

  @vocabulary.words.each do |word|
    next if exclude_self && word == query_word

    vec = get_embedding_for_word(word)
    next unless vec

    similarity = @similarity_engine.cosine(query_vec, vec)
    next if similarity < min_similarity

    index = @vocabulary.lookup(word)
    heap.push(word: word, similarity: similarity, index: index)
  end

  # Return sorted by similarity descending
  heap.to_a.sort_by { |r| -r[:similarity] }
end
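
As the listing above shows, each result is a Hash with :word, :similarity, and :index keys. The following sketch shows one way such a result array might be consumed. The words, scores, and indices are made-up illustrative values, not output from a real model.

```ruby
# Illustrative data only: hashes in the shape pushed onto the heap by find_nearest.
neighbors = [
  { word: 'hi',      similarity: 0.91, index: 12 },
  { word: 'hey',     similarity: 0.88, index: 40 },
  { word: 'goodbye', similarity: 0.35, index: 7 }
]

# Keep only strong matches and pull out the words.
strong = neighbors.select { |n| n[:similarity] >= 0.5 }
words  = strong.map { |n| n[:word] }

# Build a word => similarity lookup table.
score_by_word = neighbors.to_h { |n| [n[:word], n[:similarity]] }
```

The same filtering can be pushed into the search itself by passing min_similarity: 0.5, which skips weak candidates before they reach the heap.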

#find_nearest_batch(query_words, k: 10) ⇒ `Hash<String, Array<Hash>>`

Find nearest neighbors for multiple words

Parameters:

query_words (Array<String>) —

Query words
k (Integer) (defaults to: 10) —

Number of neighbors per word

Returns:

(Hash<String, Array<Hash>>) —

Word to results mapping

# File 'lib/kotoshu/embeddings/search.rb', line 113

def find_nearest_batch(query_words, k: 10)
  query_words.each_with_object({}) do |word, results|
    results[word] = find_nearest(word, k: k)
  end
end

#preload_embeddings! ⇒ `self`

Preload all embeddings into memory

Returns:

(self)

# File 'lib/kotoshu/embeddings/search.rb', line 137

def preload_embeddings!
  all_indices = (0...@vocabulary.size).to_a
  embeddings = @model.get_embeddings(all_indices)

  @vocabulary.words.each_with_index do |word, i|
    @embedding_cache[word] = embeddings[i]
  end

  @embeddings_loaded = true
  self
end

#similarity(word1, word2) ⇒ `Float`?

Compute similarity between two words

Parameters:

word1 (String) —

First word
word2 (String) —

Second word

Returns:

(Float, nil) —

Similarity or nil if either word not found

# File 'lib/kotoshu/embeddings/search.rb', line 125

def similarity(word1, word2)
  vec1 = get_embedding_for_word(word1)
  vec2 = get_embedding_for_word(word2)
  return nil unless vec1 && vec2

  @similarity_engine.cosine(vec1, vec2)
end

#to_s ⇒ `String` Also known as: inspect

String representation

Returns:

(String)




# File 'lib/kotoshu/embeddings/search.rb', line 163

def to_s
  "ExactSearch(vocab: #{@vocabulary.size}, loaded: #{@embeddings_loaded})"
end
