Class: EmbeddingPipeline

Inherits:

Object

Defined in: lib/kotoshu/embeddings/embedding_pipeline.rb

Overview

EmbeddingPipeline - Unified API for embedding-based similarity search

Provides a simple, unified interface for loading vocabulary and models, and performing similarity search. This is the recommended entry point.

Examples:

Simple usage (one line)

pipeline = EmbeddingPipeline.from_cache(language: 'en')

Full configuration

pipeline = EmbeddingPipeline.new(
  vocabulary: vocab,
  model: model,
  preload: true
)

Finding similar words

neighbors = pipeline.find_nearest('semantic', k: 5)
neighbors.each { |r| puts "#{r[:word]}: #{r[:similarity].round(4)}" }

Instance Attribute Summary

#model ⇒ EmbeddingModel readonly
#search ⇒ Search readonly
#similarity_engine ⇒ SimilarityEngine readonly
#vocabulary ⇒ Vocabulary readonly

Class Method Summary

.from_cache(language:, cache: nil, preload: false, index: :exact) ⇒ EmbeddingPipeline (also: [])

Create pipeline from cache (one-line initialization).
.from_files(vocab_path:, model_path:, language:, preload: false, index: :exact) ⇒ EmbeddingPipeline

Create pipeline from files.

Instance Method Summary

#find_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>

Find k nearest neighbors for a word.
#find_nearest_batch(words, k: 10) ⇒ Hash<String, Array<Hash>>

Find nearest neighbors for multiple words.
#get_embedding(word) ⇒ Array<Float>?

Get embedding for a word.
#get_embedding_by_index(index) ⇒ Array<Float>?

Get embedding by index.
#include?(word) ⇒ Boolean

Check if word exists in vocabulary.
#initialize(vocabulary:, model:, preload: false, index: :exact, pre_normalize: false, cache_size: 1000) ⇒ EmbeddingPipeline constructor

Create pipeline with full configuration.
#model_info ⇒ Hash

Get model information.
#preload_embeddings! ⇒ self

Preload all embeddings into memory.
#similarity(word1, word2) ⇒ Float?

Compute similarity between two words.
#stats ⇒ Hash

Get pipeline statistics.
#to_s ⇒ String (also: #inspect)

String representation.
#unload! ⇒ self

Unload model from memory.

Constructor Details

#initialize(vocabulary:, model:, preload: false, index: :exact, pre_normalize: false, cache_size: 1000) ⇒ `EmbeddingPipeline`

Create pipeline with full configuration

Parameters:

vocabulary (Vocabulary) —

Vocabulary instance
model (EmbeddingModel) —

Model instance
preload (Boolean) (defaults to: false) —

Preload embeddings
index (:exact, :ann) (defaults to: :exact) —

Search index type (:exact = brute force, :ann = FAISS/HNSW)
pre_normalize (Boolean) (defaults to: false) —

Pre-normalize vectors
cache_size (Integer) (defaults to: 1000) —

Embedding cache size

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 103

def initialize(vocabulary:, model:, preload: false, index: :exact, pre_normalize: false, cache_size: 1000)
  @vocabulary = vocabulary
  @model = model
  @similarity_engine = SimilarityEngine.new(pre_normalize: pre_normalize)
  @cache_size = cache_size

  # Create search engine
  @search = Search.new(
    vocabulary: vocabulary,
    model: model,
    similarity_engine: @similarity_engine,
    pre_normalize: pre_normalize
  )

  preload_embeddings! if preload
end
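
The `pre_normalize` option is described above only as "Pre-normalize vectors". A common reason for such an option is that cosine similarity between unit-length vectors reduces to a plain dot product, so the norms need to be computed only once. The sketch below illustrates that equivalence with hypothetical helpers (`l2_normalize`, `dot`, `cosine`); it is not taken from `SimilarityEngine`, whose internals are not shown here.

```ruby
# Hypothetical helpers for illustration; none of these names come from Kotoshu.

# Scale a vector to unit length. A zero vector is returned unchanged.
def l2_normalize(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  return vec if norm.zero?

  vec.map { |x| x / norm }
end

# Plain dot product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Cosine similarity computed from raw (unnormalized) vectors.
def cosine(a, b)
  denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))
  denom.zero? ? 0.0 : dot(a, b) / denom
end

a = [3.0, 4.0]
b = [4.0, 3.0]

on_the_fly = cosine(a, b)
pre_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both routes agree up to floating-point rounding.
puts((on_the_fly - pre_normalized).abs < 1e-12)
```

Normalizing up front trades a one-time pass over the vectors for cheaper comparisons afterwards, which tends to pay off when the same vectors are compared many times.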

Instance Attribute Details

#model ⇒ `EmbeddingModel` (readonly)

Returns:

(EmbeddingModel)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 33

def model
  @model
end

#search ⇒ `Search` (readonly)

Returns:

(Search)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 39

def search
  @search
end

#similarity_engine ⇒ `SimilarityEngine` (readonly)

Returns:

(SimilarityEngine)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 36

def similarity_engine
  @similarity_engine
end

#vocabulary ⇒ `Vocabulary` (readonly)

Returns:

(Vocabulary)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 30

def vocabulary
  @vocabulary
end

Class Method Details

.from_cache(language:, cache: nil, preload: false, index: :exact) ⇒ `EmbeddingPipeline` Also known as: []

Create pipeline from cache (one-line initialization)

Parameters:

language (String) —

ISO 639-1 language code
cache (Cache::ModelCache) (defaults to: nil) —

Cache instance
preload (Boolean) (defaults to: false) —

Preload embeddings into memory
index (:exact, :auto) (defaults to: :exact) —

Search index type

Returns:

(EmbeddingPipeline)

Raises:

(ArgumentError) —

If no cached model found for language

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 51

def self.from_cache(language:, cache: nil, preload: false, index: :exact)
  require_relative 'cache/model_cache'

  cache ||= Cache::ModelCache.new

  vocab_path = cache.find_vocab(language)
  model_path = cache.find_model(language, :onnx)

  unless vocab_path && model_path
    raise ArgumentError, "No cached model for language: #{language}. " \
                         "Run: ruby scripts/extract_vocabularies.rb --languages=#{language}"
  end

  from_files(
    vocab_path: vocab_path,
    model_path: model_path,
    language: language,
    preload: preload,
    index: index
  )
end
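
The listing above calls only two methods on the cache, `find_vocab(language)` and `find_model(language, :onnx)`, and raises `ArgumentError` when either returns nil. The sketch below isolates that lookup-then-raise step using a hypothetical in-memory stand-in (`FakeModelCache`) and a hypothetical helper (`resolve_paths`); the real `Cache::ModelCache` may behave differently in ways not shown on this page.

```ruby
# Hypothetical in-memory cache that answers the two lookups used by from_cache.
class FakeModelCache
  def initialize(entries)
    @entries = entries
  end

  def find_vocab(language)
    @entries.dig(language, :vocab)
  end

  # The format argument is accepted but ignored in this stand-in.
  def find_model(language, _format)
    @entries.dig(language, :model)
  end
end

# Hypothetical helper mirroring the guard in from_cache:
# both paths must be present, otherwise ArgumentError is raised.
def resolve_paths(language, cache)
  vocab_path = cache.find_vocab(language)
  model_path = cache.find_model(language, :onnx)

  unless vocab_path && model_path
    raise ArgumentError, "No cached model for language: #{language}"
  end

  { vocab_path: vocab_path, model_path: model_path }
end

cache = FakeModelCache.new({ 'en' => { vocab: 'en_vocab.json', model: 'en.onnx' } })

begin
  resolve_paths('fr', cache)
rescue ArgumentError => e
  warn "Falling back: #{e.message}"
end
```

Callers that cannot guarantee a populated cache may want to rescue `ArgumentError` in the same way around `EmbeddingPipeline.from_cache`.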

.from_files(vocab_path:, model_path:, language:, preload: false, index: :exact) ⇒ `EmbeddingPipeline`

Create pipeline from files

Parameters:

vocab_path (String) —

Path to vocabulary JSON file
model_path (String) —

Path to ONNX model file
language (String) —

Language code
preload (Boolean) (defaults to: false) —

Preload embeddings
index (:exact, :auto) (defaults to: :exact) —

Search index type

Returns:

(EmbeddingPipeline)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 82

def self.from_files(vocab_path:, model_path:, language:, preload: false, index: :exact)
  vocab = Vocabulary.from_file(vocab_path, language_code: language)
  model = OnnxRuntimeModel.from_file(model_path, language_code: language)

  new(
    vocabulary: vocab,
    model: model,
    preload: preload,
    index: index
  )
end

Instance Method Details

#find_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ `Array<Hash>`

Find k nearest neighbors for a word

Parameters:

word (String) —

Query word
k (Integer) (defaults to: 10) —

Number of neighbors
exclude_self (Boolean) (defaults to: true) —

Exclude query word
min_similarity (Float) (defaults to: 0.0) —

Minimum similarity threshold

Returns:

(Array<Hash>) —

Array of hashes with :word, :similarity, and :index keys

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 128

def find_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0)
  @search.find_nearest(word, k: k, exclude_self: exclude_self, min_similarity: min_similarity)
end
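
The method delegates to `Search#find_nearest`, whose body is not shown on this page. As a rough illustration of what an exact (brute-force) search with these parameters could look like, the sketch below scores every vocabulary entry by cosine similarity, applies `exclude_self` and `min_similarity`, and keeps the top `k`. The data, `brute_force_nearest`, and the empty-array result for unknown words are all assumptions made for the example.

```ruby
# Hypothetical toy vocabulary; a real pipeline would read vectors from the model.
EMBEDDINGS = {
  'cat' => [1.0, 0.0],
  'kitten' => [0.9, 0.1],
  'dog' => [0.6, 0.8],
  'car' => [0.0, 1.0]
}.freeze

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Hypothetical exact search: compare the query against every entry.
def brute_force_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0)
  query = EMBEDDINGS[word]
  return [] if query.nil? # assumption: unknown words yield no neighbors

  scored = EMBEDDINGS.each_with_index.filter_map do |(candidate, vector), index|
    next if exclude_self && candidate == word

    score = cosine_similarity(query, vector)
    next if score < min_similarity

    { word: candidate, similarity: score, index: index }
  end

  # max_by(k) returns up to k entries, highest similarity first.
  scored.max_by(k) { |r| r[:similarity] }
end

brute_force_nearest('cat', k: 2).each do |r|
  puts "#{r[:word]}: #{r[:similarity].round(4)}"
end
```

Brute force costs one similarity computation per vocabulary entry for each query, which is why approximate indexes are often offered as an alternative for large vocabularies.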

#find_nearest_batch(words, k: 10) ⇒ `Hash<String, Array<Hash>>`

Find nearest neighbors for multiple words

Parameters:

words (Array<String>) —

Query words
k (Integer) (defaults to: 10) —

Neighbors per word

Returns:

(Hash<String, Array<Hash>>)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 138

def find_nearest_batch(words, k: 10)
  @search.find_nearest_batch(words, k: k)
end
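
The return type `Hash<String, Array<Hash>>` maps each query word to its own neighbor list. A minimal sketch of that shape follows, assuming a hypothetical per-word lookup (`lookup_neighbors`) backed by a hard-coded table; how `Search#find_nearest_batch` actually computes its results is not shown on this page.

```ruby
# Hypothetical single-word lookup; a real pipeline would run a similarity search here.
NEIGHBOR_TABLE = {
  'cat' => %w[kitten dog pet],
  'car' => %w[truck bus van]
}.freeze

def lookup_neighbors(word, k)
  NEIGHBOR_TABLE.fetch(word, []).first(k).map { |w| { word: w } }
end

# Build a Hash keyed by query word, with one Array<Hash> per word.
def nearest_batch(words, k: 10)
  words.each_with_object({}) do |word, results|
    results[word] = lookup_neighbors(word, k)
  end
end

nearest_batch(%w[cat car], k: 2).each do |word, neighbors|
  puts "#{word}: #{neighbors.map { |n| n[:word] }.join(', ')}"
end
```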

#get_embedding(word) ⇒ `Array<Float>`?

Get embedding for a word

Parameters:

word (String) —

Word

Returns:

(Array<Float>, nil)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 157

def get_embedding(word)
  @model.get_embedding_for_word(word, @vocabulary)
end

#get_embedding_by_index(index) ⇒ `Array<Float>`?

Get embedding by index

Parameters:

index (Integer) —

Word index

Returns:

(Array<Float>, nil)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 166

def get_embedding_by_index(index)
  @model.get_embedding(index)
end

#include?(word) ⇒ `Boolean`

Check if word exists in vocabulary

Parameters:

word (String) —

Word

Returns:

(Boolean)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 175

def include?(word)
  @vocabulary.include?(word)
end

#model_info ⇒ `Hash`

Get model information

Returns:

(Hash)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 218

def model_info
  @model.model_info
end

#preload_embeddings! ⇒ `self`

Preload all embeddings into memory

Returns:

(self)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 183

def preload_embeddings!
  @model.load!
  @search.preload_embeddings!
  self
end

#similarity(word1, word2) ⇒ `Float`?

Compute similarity between two words

Parameters:

word1 (String) —

First word
word2 (String) —

Second word

Returns:

(Float, nil) —

Similarity, or nil if either word is not found


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 148

def similarity(word1, word2)
  @search.similarity(word1, word2)
end
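
Because the result is nil when either word is missing, callers need to guard before formatting or comparing the score. The sketch below shows one way to handle that, using a hypothetical two-word table (`WORD_VECTORS`) and a hypothetical `word_similarity` helper that assumes cosine similarity; the metric actually used by `Search#similarity` is not stated on this page.

```ruby
# Hypothetical toy vectors standing in for model embeddings.
WORD_VECTORS = {
  'king' => [0.8, 0.6],
  'queen' => [0.6, 0.8]
}.freeze

# Returns a Float, or nil when either word has no vector.
def word_similarity(word1, word2)
  a = WORD_VECTORS[word1]
  b = WORD_VECTORS[word2]
  return nil if a.nil? || b.nil?

  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

[%w[king queen], %w[king zebra]].each do |w1, w2|
  score = word_similarity(w1, w2)
  label = score ? format('%.4f', score) : 'not in vocabulary'
  puts "#{w1} / #{w2}: #{label}"
end
```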

#stats ⇒ `Hash`

Get pipeline statistics

Returns:

(Hash)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 203

def stats
  {
    language: @vocabulary.language_code,
    vocabulary_size: @vocabulary.size,
    embedding_dimension: @model.dimension,
    model_loaded: @model.loaded?,
    embeddings_preloaded: @search.embeddings_loaded,
    cache_stats: @search.instance_variable_get(:@embedding_cache)&.stats
  }
end

#to_s ⇒ `String` Also known as: inspect

String representation

Returns:

(String)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 226

def to_s
  "EmbeddingPipeline(language: #{@vocabulary.language_code}, " \
    "vocab_size: #{@vocabulary.size}, " \
    "dimension: #{@model.dimension}, " \
    "loaded: #{@model.loaded?})"
end

#unload! ⇒ `self`

Unload model from memory

Returns:

(self)

# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 193

def unload!
  @model.unload!
  @search.clear_cache
  self
end
