Class: Kotoshu::Embeddings::SimilaritySearch

Inherits:

Object


Defined in: lib/kotoshu/embeddings/similarity_search.rb

Overview

Similarity search for embedding-based nearest neighbor lookup.

Efficiently finds semantically similar words using cosine similarity. Supports both on-the-fly computation and pre-computed embedding matrices.

Examples:

Basic usage

search = SimilaritySearch.new(
  vocabulary: vocab,
  model: model
)
neighbors = search.find_nearest('hello', k: 10)

With pre-loaded embedding matrix (faster)

search = SimilaritySearch.new(
  vocabulary: vocab,
  model: model,
  preload_embeddings: true
)
neighbors = search.find_nearest('hello', k: 10)

Instance Attribute Summary

#embeddings_loaded ⇒ Boolean readonly

Whether embeddings are pre-loaded.
#model ⇒ OnnxRuntimeModel readonly

The ONNX model.
#vocabulary ⇒ Vocabulary readonly

The vocabulary.

Class Method Summary

.from_cache(language_code, cache: nil, preload: false) ⇒ SimilaritySearch?

Create from cache.

Instance Method Summary

#cache_stats ⇒ Hash

Get cache statistics.
#clear_cache ⇒ self

Clear the embedding cache.
#cosine_similarity(vec1, vec2) ⇒ Float

Compute similarity between two embedding vectors.
#find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>

Find k nearest neighbors for a word.
#find_nearest_batch(query_words, k: 10) ⇒ Hash<String, Array<Hash>>

Find k nearest neighbors for multiple words.
#initialize(vocabulary:, model:, preload_embeddings: false, max_cache_size: 1000) ⇒ SimilaritySearch constructor

Create a new similarity search instance.
#preload_embeddings! ⇒ Boolean

Preload all embeddings into memory for faster search.
#similarity(word1, word2) ⇒ Float

Compute similarity between two words.
#to_s ⇒ String (also: #inspect)

String representation.

Constructor Details

#initialize(vocabulary:, model:, preload_embeddings: false, max_cache_size: 1000) ⇒ `SimilaritySearch`

Create a new similarity search instance.

Parameters:

vocabulary (Vocabulary) —

Word vocabulary
model (OnnxRuntimeModel) —

ONNX model for embeddings
preload_embeddings (Boolean) (defaults to: false) —

Whether to preload all embeddings
max_cache_size (Integer) (defaults to: 1000) —

Maximum embeddings to cache (if not preloading)

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 43

def initialize(vocabulary:, model:, preload_embeddings: false, max_cache_size: 1000)
  @vocabulary = vocabulary
  @model = model
  @preload_embeddings = preload_embeddings
  @max_cache_size = max_cache_size

  # Embedding cache (word -> vector)
  @embedding_cache = {}

  # Pre-loaded embedding matrix (for faster search)
  @embedding_matrix = nil

  # Track whether embeddings are preloaded
  @embeddings_loaded = false

  # Load embeddings if requested
  preload_embeddings! if preload_embeddings
end

Instance Attribute Details

#embeddings_loaded ⇒ `Boolean` (readonly)

Returns whether embeddings are pre-loaded.

Returns:

(Boolean) —

Whether embeddings are pre-loaded




# File 'lib/kotoshu/embeddings/similarity_search.rb', line 35

def embeddings_loaded
  @embeddings_loaded
end

#model ⇒ `OnnxRuntimeModel` (readonly)

Returns the ONNX model.

Returns:

(OnnxRuntimeModel) —

The ONNX model




# File 'lib/kotoshu/embeddings/similarity_search.rb', line 32

def model
  @model
end

#vocabulary ⇒ `Vocabulary` (readonly)

Returns the vocabulary.

Returns:

(Vocabulary) —

The vocabulary




# File 'lib/kotoshu/embeddings/similarity_search.rb', line 29

def vocabulary
  @vocabulary
end

Class Method Details

.from_cache(language_code, cache: nil, preload: false) ⇒ `SimilaritySearch`?

Create from cache.

Parameters:

language_code (String) —

ISO 639-1 language code
cache (Cache::ModelCache, nil) (defaults to: nil) —

Optional cache instance
preload (Boolean) (defaults to: false) —

Whether to preload embeddings

Returns:

(SimilaritySearch, nil) —

New search instance or nil if not available

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 317

def self.from_cache(language_code, cache: nil, preload: false)
  vocab = Vocabulary.from_cache(language_code, cache: cache)
  model = OnnxRuntimeModel.from_cache(language_code, cache: cache)

  return nil unless vocab && model

  new(
    vocabulary: vocab,
    model: model,
    preload_embeddings: preload
  )
end

Instance Method Details

#cache_stats ⇒ `Hash`

Get cache statistics.

Returns:

(Hash) —

Cache statistics

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 166

def cache_stats
  stats = {
    size: @embedding_cache.size,
    max_size: @max_cache_size
  }
  stats[:hit_rate] = @cache_hits.to_f / (@cache_hits + @cache_misses) if defined?(@cache_hits)
  stats
end
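
In the listing above, the hit_rate entry is added only when @cache_hits has been defined, and it divides hits by the total number of lookups. The following is a minimal sketch of that arithmetic, assuming integer counters; hit_rate here is a hypothetical helper, not part of the library. It shows why a guard for the empty case can be useful: in Ruby, dividing a Float by zero yields NaN rather than raising.

```ruby
# Sketch of a hit-rate calculation with a guard for the empty case.
# `hit_rate` is a hypothetical helper used only for illustration.
def hit_rate(hits, misses)
  total = hits + misses
  return nil if total.zero? # no lookups yet: report "unknown" rather than NaN

  hits.to_f / total
end

unguarded = 0.to_f / (0 + 0) # Float division by zero gives NaN, not an error
guarded   = hit_rate(0, 0)   # nil, because no lookups were recorded
typical   = hit_rate(3, 1)   # 3 hits out of 4 lookups
```

Returning nil for the empty case is one possible design choice; omitting the key or returning 0.0 would work equally well, depending on what callers expect.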

#clear_cache ⇒ `self`

Clear the embedding cache. This also discards any pre-loaded embedding matrix and resets embeddings_loaded to false.

Returns:

(self) —

Self for chaining

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 156

def clear_cache
  @embedding_cache.clear
  @embedding_matrix = nil
  @embeddings_loaded = false
  self
end

#cosine_similarity(vec1, vec2) ⇒ `Float`

Compute the cosine similarity between two embedding vectors. Returns 0.0 if either vector is nil or has zero magnitude.

Parameters:

vec1 (Array<Float>) —

First vector
vec2 (Array<Float>) —

Second vector

Returns:

(Float) —

Cosine similarity (-1.0 to 1.0)

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 112

def cosine_similarity(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil?

  # Compute dot product
  dot = vec1.zip(vec2).sum { |a, b| a * b }

  # Compute magnitudes
  norm1 = Math.sqrt(vec1.sum { |x| x * x })
  norm2 = Math.sqrt(vec2.sum { |x| x * x })

  return 0.0 if norm1.zero? || norm2.zero?

  dot / (norm1 * norm2)
end
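
As a quick check of the formula, the following standalone sketch mirrors the logic of the listing above outside the class. The method name cosine and the sample vectors are illustrative only and are not part of the library.

```ruby
# Standalone sketch mirroring the cosine similarity logic in the listing above.
# `cosine` is a hypothetical helper name used only for this illustration.
def cosine(vec1, vec2)
  return 0.0 if vec1.nil? || vec2.nil?

  # Dot product of the two vectors
  dot = vec1.zip(vec2).sum { |a, b| a * b }

  # Euclidean magnitudes
  norm1 = Math.sqrt(vec1.sum { |x| x * x })
  norm2 = Math.sqrt(vec2.sum { |x| x * x })
  return 0.0 if norm1.zero? || norm2.zero?

  dot / (norm1 * norm2)
end

parallel   = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) # same direction, close to 1.0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])           # perpendicular, 0.0
opposite   = cosine([1.0, 1.0], [-1.0, -1.0])         # opposite direction, close to -1.0
zero_vec   = cosine([0.0, 0.0], [1.0, 2.0])           # zero magnitude, guarded to 0.0
```

Both vectors are assumed to have the same length. Array#zip pads a shorter argument with nil, so a shorter second vector would raise a TypeError during the multiplication.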

#find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ `Array<Hash>`

Find k nearest neighbors for a word. Returns an empty array if no embedding is available for the query word.

Parameters:

query_word (String) —

The query word
k (Integer) (defaults to: 10) —

Number of neighbors to return
exclude_self (Boolean) (defaults to: true) —

Whether to exclude the query word itself
min_similarity (Float) (defaults to: 0.0) —

Minimum similarity threshold (0.0 to 1.0)

Returns:

(Array<Hash>) —

Array of similarity hashes

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 69

def find_nearest(query_word, k: 10, exclude_self: true, min_similarity: 0.0)
  # Get query embedding
  query_vec = get_embedding(query_word)
  return [] unless query_vec

  # Find neighbors
  if @embedding_matrix
    nearest_from_matrix(query_vec, k, exclude_self, min_similarity)
  else
    nearest_from_cache(query_vec, k, exclude_self, min_similarity)
  end
end
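
The private helpers nearest_from_matrix and nearest_from_cache are not shown on this page. The following is a minimal sketch of what a brute-force search over an index-to-vector hash could look like, assuming the same parameters as find_nearest. The method name brute_force_nearest and the result keys :index and :similarity are assumptions for illustration; the library's actual result hashes may differ.

```ruby
# Hypothetical brute-force k-nearest search over an index => vector hash.
# The result keys (:index, :similarity) are assumptions for illustration.
def cosine(v1, v2)
  dot = v1.zip(v2).sum { |a, b| a * b }
  n1 = Math.sqrt(v1.sum { |x| x * x })
  n2 = Math.sqrt(v2.sum { |x| x * x })
  return 0.0 if n1.zero? || n2.zero?

  dot / (n1 * n2)
end

def brute_force_nearest(matrix, query_index, k: 10, exclude_self: true, min_similarity: 0.0)
  query_vec = matrix[query_index]
  return [] unless query_vec

  # Score every candidate, skipping the query itself and weak matches.
  scored = matrix.filter_map do |idx, vec|
    next if exclude_self && idx == query_index

    sim = cosine(query_vec, vec)
    next if sim < min_similarity

    { index: idx, similarity: sim }
  end

  # Highest similarity first, then keep the top k.
  scored.sort_by { |h| -h[:similarity] }.first(k)
end

matrix = {
  0 => [1.0, 0.0],
  1 => [0.9, 0.1],
  2 => [0.0, 1.0],
  3 => [-1.0, 0.0]
}
neighbors = brute_force_nearest(matrix, 0, k: 2)
strict    = brute_force_nearest(matrix, 0, k: 2, min_similarity: 0.5)
```

This sketch scores every entry and sorts the full list, which is linear in the vocabulary size per query. With the default min_similarity of 0.0, vectors pointing away from the query (negative cosine) are filtered out.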

#find_nearest_batch(query_words, k: 10) ⇒ `Hash<String, Array<Hash>>`

Find k nearest neighbors for multiple words. Each word is looked up with find_nearest using the default exclude_self and min_similarity values.

Parameters:

query_words (Array<String>) —

Query words
k (Integer) (defaults to: 10) —

Number of neighbors per word

Returns:

(Hash<String, Array<Hash>>) —

Word to neighbors mapping

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 87

def find_nearest_batch(query_words, k: 10)
  query_words.each_with_object({}) do |word, result|
    result[word] = find_nearest(word, k: k)
  end
end
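
The accumulation pattern used above can be tried in isolation. In this sketch, lookup is a hypothetical stand-in for find_nearest that returns canned data, and batch is an illustrative name; neither is part of the library.

```ruby
# `lookup` is a hypothetical stand-in for find_nearest; it returns canned data.
lookup = ->(word, k:) { %w[alpha beta gamma].reject { |w| w == word }.first(k) }

# Build a word => neighbors hash, one lookup per input word.
def batch(words, k:, lookup:)
  words.each_with_object({}) do |word, result|
    result[word] = lookup.call(word, k: k)
  end
end

results = batch(%w[alpha beta alpha], k: 1, lookup: lookup)
```

Because the result is keyed by word, a word that appears more than once in the input produces a single entry.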

#preload_embeddings! ⇒ `Boolean`

Preload all embeddings into memory for faster search.

Returns:

(Boolean) —

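true if the embeddings were loaded by this call; false if they were already loaded, the model returned no vectors, or an error was raised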

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 130

def preload_embeddings!
  return false if @embedding_matrix

  # Get all indices
  all_indices = (0...@vocabulary.size).to_a

  # Batch load embeddings
  vectors = @model.get_embeddings(all_indices)
  return false if vectors.nil? || vectors.empty?

  # Store as hash for now (could use Numo::SFloat for efficiency)
  @embedding_matrix = {}
  all_indices.zip(vectors).each do |idx, vec|
    @embedding_matrix[idx] = vec
  end

  @embeddings_loaded = true
  true
rescue StandardError => e
  warn "Failed to preload embeddings: #{e.message}"
  false
end
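
The index-to-vector layout built above can be illustrated without a real model. In this sketch, StubModel is a hypothetical stand-in for OnnxRuntimeModel that imitates only the get_embeddings(indices) call seen in the listing; the vector values are made up.

```ruby
# StubModel is a hypothetical stand-in for OnnxRuntimeModel; only the
# get_embeddings(indices) call seen in the listing above is imitated.
class StubModel
  def get_embeddings(indices)
    indices.map { |i| [i.to_f, 1.0] }
  end
end

vocab_size  = 3
all_indices = (0...vocab_size).to_a
vectors     = StubModel.new.get_embeddings(all_indices)

# Same index => vector layout as the listing, built in one step with to_h.
embedding_matrix = all_indices.zip(vectors).to_h
```

A plain Hash keeps the sketch dependency-free. As the comment in the listing notes, a dense numeric array type could be more efficient for large vocabularies.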

#similarity(word1, word2) ⇒ `Float`

Compute similarity between two words.

Parameters:

word1 (String) —

First word
word2 (String) —

Second word

Returns:

(Float, nil) —

Cosine similarity (-1.0 to 1.0), or nil if either word is not found

# File 'lib/kotoshu/embeddings/similarity_search.rb', line 98

def similarity(word1, word2)
  vec1 = get_embedding(word1)
  vec2 = get_embedding(word2)

  return nil unless vec1 && vec2

  cosine_similarity(vec1, vec2)
end

#to_s ⇒ `String` Also known as: inspect

String representation.

Returns:

(String) —

String representation




# File 'lib/kotoshu/embeddings/similarity_search.rb', line 178

def to_s
  "SimilaritySearch(vocab_size: #{@vocabulary.size}, loaded: #{@embeddings_loaded})"
end
