Class: EmbeddingPipeline

Inherits:
Object
Defined in:
lib/kotoshu/embeddings/embedding_pipeline.rb

Overview

EmbeddingPipeline - Unified API for embedding-based similarity search

Provides a single interface for loading a vocabulary and a model and for running similarity searches. This is the recommended entry point.

Examples:

Simple usage (one line)

pipeline = EmbeddingPipeline.from_cache(language: 'en')

Full configuration

pipeline = EmbeddingPipeline.new(
  vocabulary: vocab,
  model: model,
  preload: true
)

Finding similar words

neighbors = pipeline.find_nearest('semantic', k: 5)
neighbors.each { |r| puts "#{r[:word]}: #{r[:similarity].round(4)}" }


Constructor Details

#initialize(vocabulary:, model:, preload: false, index: :exact, pre_normalize: false, cache_size: 1000) ⇒ EmbeddingPipeline

Create pipeline with full configuration

Parameters:

  • vocabulary (Vocabulary)

    Vocabulary instance

  • model (EmbeddingModel)

    Model instance

  • preload (Boolean) (defaults to: false)

    Preload embeddings

  • index (:exact, :ann) (defaults to: :exact)

    Search index type (:exact = brute force, :ann = FAISS/HNSW). The constructor source shown below accepts this argument but does not reference it.

  • pre_normalize (Boolean) (defaults to: false)

    Pre-normalize vectors

  • cache_size (Integer) (defaults to: 1000)

    Embedding cache size



# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 103

def initialize(vocabulary:, model:, preload: false, index: :exact, pre_normalize: false, cache_size: 1000)
  @vocabulary = vocabulary
  @model = model
  @similarity_engine = SimilarityEngine.new(pre_normalize: pre_normalize)
  @cache_size = cache_size

  # Create search engine
  @search = Search.new(
    vocabulary: vocabulary,
    model: model,
    similarity_engine: @similarity_engine,
    pre_normalize: pre_normalize
  )

  preload_embeddings! if preload
end
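
The pre_normalize option is documented only as "Pre-normalize vectors". One common motivation for such an option, assumed here rather than confirmed by the source, is that cosine similarity between unit-length vectors reduces to a plain dot product, so norms can be computed once instead of on every query. The sketch below is standalone Ruby and does not touch the Kotoshu classes; normalize, dot, and cosine are hypothetical helper names.

```ruby
# Standalone illustration; these helpers are hypothetical and not part of Kotoshu.

# Scale a vector to unit length (a zero vector is returned unchanged).
def normalize(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  return vec.dup if norm.zero?

  vec.map { |x| x / norm }
end

# Plain dot product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Full cosine similarity, computing both norms on every call.
def cosine(a, b)
  denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))
  denom.zero? ? 0.0 : dot(a, b) / denom
end

a = [3.0, 4.0]
b = [4.0, 3.0]

full = cosine(a, b)                     # norms computed per query
fast = dot(normalize(a), normalize(b))  # norms computed once, up front

puts format('cosine: %.4f, pre-normalized dot: %.4f', full, fast)
```

Both values agree up to floating-point rounding, which is why normalizing ahead of time can pay off when the same vectors are compared many times.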

Instance Attribute Details

#model ⇒ EmbeddingModel (readonly)

Returns:

  • (EmbeddingModel)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 33

def model
  @model
end

#search ⇒ Search (readonly)

Returns:

  • (Search)



# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 39

def search
  @search
end

#similarity_engine ⇒ SimilarityEngine (readonly)

Returns:

  • (SimilarityEngine)



# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 36

def similarity_engine
  @similarity_engine
end

#vocabulary ⇒ Vocabulary (readonly)

Returns:

  • (Vocabulary)



# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 30

def vocabulary
  @vocabulary
end

Class Method Details

.from_cache(language:, cache: nil, preload: false, index: :exact) ⇒ EmbeddingPipeline Also known as: []

Create pipeline from cache (one-line initialization)

Parameters:

  • language (String)

    ISO 639-1 language code

  • cache (Cache::ModelCache) (defaults to: nil)

    Cache instance

  • preload (Boolean) (defaults to: false)

    Preload embeddings into memory

  • index (:exact, :ann) (defaults to: :exact)

    Search index type

Returns:

  • (EmbeddingPipeline)

Raises:

  • (ArgumentError)

    If no cached model found for language



# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 51

def self.from_cache(language:, cache: nil, preload: false, index: :exact)
  require_relative 'cache/model_cache'

  cache ||= Cache::ModelCache.new

  vocab_path = cache.find_vocab(language)
  model_path = cache.find_model(language, :onnx)

  unless vocab_path && model_path
    raise ArgumentError, "No cached model for language: #{language}. " \
                         "Run: ruby scripts/extract_vocabularies.rb --languages=#{language}"
  end

  from_files(
    vocab_path: vocab_path,
    model_path: model_path,
    language: language,
    preload: preload,
    index: index
  )
end
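
Judging only from the source above, from_cache calls two methods on the cache object: find_vocab(language) and find_model(language, :onnx). The sketch below mirrors that lookup-then-raise flow with a hypothetical in-memory StubCache so the control flow can be tried without any model files. StubCache, resolve_paths, and the example paths are illustrative assumptions, not part of the library.

```ruby
# Hypothetical in-memory cache exposing the two lookups that from_cache calls.
class StubCache
  def initialize(entries)
    @entries = entries
  end

  # Returns the vocabulary path for a language, or nil if absent.
  def find_vocab(language)
    @entries.dig(language, :vocab)
  end

  # Returns the model path for a language and format, or nil if absent.
  def find_model(language, format)
    @entries.dig(language, format)
  end
end

# Mirrors the lookup-then-raise flow of from_cache without loading a model.
def resolve_paths(cache, language)
  vocab_path = cache.find_vocab(language)
  model_path = cache.find_model(language, :onnx)

  unless vocab_path && model_path
    raise ArgumentError, "No cached model for language: #{language}"
  end

  { vocab_path: vocab_path, model_path: model_path }
end

# The paths here are made up for illustration.
cache = StubCache.new({ 'en' => { vocab: 'en/vocab.json', onnx: 'en/model.onnx' } })

puts resolve_paths(cache, 'en').inspect

begin
  resolve_paths(cache, 'fr')
rescue ArgumentError => e
  puts "Lookup failed: #{e.message}"
end
```

Callers of from_cache can rescue ArgumentError in the same way to fall back to another language or prompt for the extraction script to be run.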

.from_files(vocab_path:, model_path:, language:, preload: false, index: :exact) ⇒ EmbeddingPipeline

Create pipeline from files

Parameters:

  • vocab_path (String)

    Path to vocabulary JSON file

  • model_path (String)

    Path to ONNX model file

  • language (String)

    Language code

  • preload (Boolean) (defaults to: false)

    Preload embeddings

  • index (:exact, :ann) (defaults to: :exact)

    Search index type

Returns:

  • (EmbeddingPipeline)



# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 82

def self.from_files(vocab_path:, model_path:, language:, preload: false, index: :exact)
  vocab = Vocabulary.from_file(vocab_path, language_code: language)
  model = OnnxRuntimeModel.from_file(model_path, language_code: language)

  new(
    vocabulary: vocab,
    model: model,
    preload: preload,
    index: index
  )
end

Instance Method Details

#find_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>

Find k nearest neighbors for a word

Parameters:

  • word (String)

    Query word

  • k (Integer) (defaults to: 10)

    Number of neighbors

  • exclude_self (Boolean) (defaults to: true)

    Exclude query word

  • min_similarity (Float) (defaults to: 0.0)

    Minimum similarity threshold

Returns:

  • (Array<Hash>)

    Array of result hashes with :word, :similarity, and :index keys



# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 128

def find_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0)
  @search.find_nearest(word, k: k, exclude_self: exclude_self, min_similarity: min_similarity)
end
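
The method delegates to Search, whose implementation is not shown here. As a rough guide to what the k, exclude_self, and min_similarity options mean under exact (brute-force) search, the standalone sketch below scores every vocabulary entry against the query and keeps the best k. EMBEDDINGS, cosine, and nearest are hypothetical, the vectors are toy values, and returning an empty array for an unknown word is an assumption.

```ruby
# Standalone sketch of exact (brute-force) nearest-neighbor search.
# EMBEDDINGS and nearest are hypothetical; Kotoshu's Search class may differ.
EMBEDDINGS = {
  'cat' => [1.0, 0.0],
  'kitten' => [0.9, 0.1],
  'dog' => [0.7, 0.7],
  'car' => [0.0, 1.0]
}.freeze

def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

def nearest(word, k: 10, exclude_self: true, min_similarity: 0.0)
  query = EMBEDDINGS[word]
  return [] if query.nil? # assumption: unknown words yield no neighbors

  EMBEDDINGS.each_with_index
            .reject { |(w, _), _| exclude_self && w == word }
            .map { |(w, vec), i| { word: w, similarity: cosine(query, vec), index: i } }
            .select { |r| r[:similarity] >= min_similarity }
            .max_by(k) { |r| r[:similarity] } # k best, highest similarity first
end

nearest('cat', k: 2).each do |r|
  puts "#{r[:word]}: #{r[:similarity].round(4)}"
end
```

Because every vector is scored, the cost grows linearly with vocabulary size, which is the usual trade-off against an approximate index.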

#find_nearest_batch(words, k: 10) ⇒ Hash<String, Array<Hash>>

Find nearest neighbors for multiple words

Parameters:

  • words (Array<String>)

    Query words

  • k (Integer) (defaults to: 10)

    Neighbors per word

Returns:

  • (Hash<String, Array<Hash>>)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 138

def find_nearest_batch(words, k: 10)
  @search.find_nearest_batch(words, k: k)
end
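
The documented return type is a hash keyed by query word, with an array of neighbor hashes per key. The sketch below shows one way to consume a value of that shape. SAMPLE is a hand-written stand-in for a real return value (the words and scores are made up), top_neighbors is a hypothetical helper, and the :word key is taken from the usage example in the overview.

```ruby
# SAMPLE stands in for a find_nearest_batch return value; the words and
# scores are invented for illustration only.
SAMPLE = {
  'king' => [{ word: 'queen', similarity: 0.81 }, { word: 'prince', similarity: 0.74 }],
  'apple' => [{ word: 'pear', similarity: 0.69 }],
  'xyzzy' => []
}.freeze

# Reduce each neighbor list to its first entry's word (nil when the list is empty).
def top_neighbors(batch)
  batch.transform_values { |neighbors| neighbors.first&.fetch(:word) }
end

top_neighbors(SAMPLE).each do |query, best|
  puts "#{query} -> #{best || '(no neighbors)'}"
end
```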

#get_embedding(word) ⇒ Array<Float>?

Get embedding for a word

Parameters:

  • word (String)

    Word

Returns:

  • (Array<Float>, nil)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 157

def get_embedding(word)
  @model.get_embedding_for_word(word, @vocabulary)
end

#get_embedding_by_index(index) ⇒ Array<Float>?

Get embedding by index

Parameters:

  • index (Integer)

    Word index

Returns:

  • (Array<Float>, nil)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 166

def get_embedding_by_index(index)
  @model.get_embedding(index)
end

#include?(word) ⇒ Boolean

Check if word exists in vocabulary

Parameters:

  • word (String)

    Word

Returns:

  • (Boolean)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 175

def include?(word)
  @vocabulary.include?(word)
end

#model_info ⇒ Hash

Get model information

Returns:

  • (Hash)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 218

def model_info
  @model.model_info
end

#preload_embeddings! ⇒ self

Preload all embeddings into memory

Returns:

  • (self)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 183

def preload_embeddings!
  @model.load!
  @search.preload_embeddings!
  self
end

#similarity(word1, word2) ⇒ Float?

Compute similarity between two words

Parameters:

  • word1 (String)

    First word

  • word2 (String)

    Second word

Returns:

  • (Float, nil)

    Similarity or nil if either word not found



# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 148

def similarity(word1, word2)
  @search.similarity(word1, word2)
end
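
Since the documented return value is nil when either word is missing, callers need to handle that case before formatting or comparing the score. The standalone sketch below illustrates the pattern. VECTORS, lookup_similarity, and describe are hypothetical stand-ins with toy vectors, and the use of cosine similarity is an assumption about what Search computes.

```ruby
# Hypothetical stand-in for pipeline.similarity, backed by toy vectors.
VECTORS = {
  'tea' => [1.0, 2.0],
  'coffee' => [2.0, 4.0]
}.freeze

# Returns cosine similarity, or nil if either word is unknown.
def lookup_similarity(word1, word2)
  a = VECTORS[word1]
  b = VECTORS[word2]
  return nil if a.nil? || b.nil?

  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Formats a score, handling the nil case explicitly.
def describe(word1, word2)
  score = lookup_similarity(word1, word2)
  return "#{word1}/#{word2}: not in vocabulary" if score.nil?

  format('%s/%s: %.4f', word1, word2, score)
end

puts describe('tea', 'coffee')
puts describe('tea', 'unknown')
```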

#stats ⇒ Hash

Get pipeline statistics

Returns:

  • (Hash)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 203

def stats
  {
    language: @vocabulary.language_code,
    vocabulary_size: @vocabulary.size,
    embedding_dimension: @model.dimension,
    model_loaded: @model.loaded?,
    embeddings_preloaded: @search.embeddings_loaded,
    cache_stats: @search.instance_variable_get(:@embedding_cache)&.stats
  }
end

#to_s ⇒ String Also known as: inspect

String representation

Returns:

  • (String)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 226

def to_s
  "EmbeddingPipeline(language: #{@vocabulary.language_code}, " \
    "vocab_size: #{@vocabulary.size}, " \
    "dimension: #{@model.dimension}, " \
    "loaded: #{@model.loaded?})"
end

#unload! ⇒ self

Unload model from memory

Returns:

  • (self)


# File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 193

def unload!
  @model.unload!
  @search.clear_cache
  self
end