Class: EmbeddingPipeline
- Inherits: Object
- Defined in: lib/kotoshu/embeddings/embedding_pipeline.rb
Overview
EmbeddingPipeline - Unified API for embedding-based similarity search
Provides a simple, unified interface for loading vocabulary and models, and performing similarity search. This is the recommended entry point.
Instance Attribute Summary
- #model ⇒ EmbeddingModel readonly
- #search ⇒ Search readonly
- #similarity_engine ⇒ SimilarityEngine readonly
- #vocabulary ⇒ Vocabulary readonly
Class Method Summary
- .from_cache(language:, cache: nil, preload: false, index: :exact) ⇒ EmbeddingPipeline (also: [])
  Create pipeline from cache (one-line initialization).
- .from_files(vocab_path:, model_path:, language:, preload: false, index: :exact) ⇒ EmbeddingPipeline
  Create pipeline from files.
Instance Method Summary
- #find_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>
  Find k nearest neighbors for a word.
- #find_nearest_batch(words, k: 10) ⇒ Hash<String, Array<Hash>>
  Find nearest neighbors for multiple words.
- #get_embedding(word) ⇒ Array<Float>?
  Get embedding for a word.
- #get_embedding_by_index(index) ⇒ Array<Float>?
  Get embedding by index.
- #include?(word) ⇒ Boolean
  Check if word exists in vocabulary.
- #initialize(vocabulary:, model:, preload: false, index: :exact, pre_normalize: false, cache_size: 1000) ⇒ EmbeddingPipeline (constructor)
  Create pipeline with full configuration.
- #model_info ⇒ Hash
  Get model information.
- #preload_embeddings! ⇒ self
  Preload all embeddings into memory.
- #similarity(word1, word2) ⇒ Float?
  Compute similarity between two words.
- #stats ⇒ Hash
  Get pipeline statistics.
- #to_s ⇒ String (also: #inspect)
  String representation.
- #unload! ⇒ self
  Unload model from memory.
Constructor Details
#initialize(vocabulary:, model:, preload: false, index: :exact, pre_normalize: false, cache_size: 1000) ⇒ EmbeddingPipeline
Create pipeline with full configuration
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 103
    def initialize(vocabulary:, model:, preload: false, index: :exact,
                   pre_normalize: false, cache_size: 1000)
      @vocabulary = vocabulary
      @model = model
      @similarity_engine = SimilarityEngine.new(pre_normalize: pre_normalize)
      @cache_size = cache_size

      # Create search engine
      @search = Search.new(
        vocabulary: vocabulary,
        model: model,
        similarity_engine: @similarity_engine,
        pre_normalize: pre_normalize
      )

      preload_embeddings! if preload
    end
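The constructor forwards `pre_normalize` to both `SimilarityEngine` and `Search`, but the listing does not show what those classes do with it. One common motivation for such a flag, assumed here rather than confirmed by the source, is that once vectors are scaled to unit length, cosine similarity reduces to a plain dot product. The sketch below illustrates that idea in plain Ruby; `dot`, `normalize`, and `cosine` are hypothetical helpers, not Kotoshu APIs.

```ruby
# Hypothetical helpers illustrating why pre-normalizing can help;
# none of these are Kotoshu APIs.

# Dot product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Scale a vector to unit length (zero vectors are returned unchanged).
def normalize(v)
  norm = Math.sqrt(dot(v, v))
  norm.zero? ? v : v.map { |x| x / norm }
end

# Cosine similarity computed directly from raw vectors.
def cosine(a, b)
  denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))
  denom.zero? ? 0.0 : dot(a, b) / denom
end

a = [3.0, 4.0]
b = [4.0, 3.0]

raw = cosine(a, b)                     # norms recomputed for this pair
pre = dot(normalize(a), normalize(b))  # norms applied once, up front

# Both routes agree up to floating-point rounding.
puts((raw - pre).abs < 1e-12)
```

If the real engine works this way, normalizing once at load time would trade a one-off pass over the vocabulary for cheaper comparisons on every query.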
Instance Attribute Details
#model ⇒ EmbeddingModel (readonly)
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 33
    def model
      @model
    end
#search ⇒ Search (readonly)
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 39
    def search
      @search
    end
#similarity_engine ⇒ SimilarityEngine (readonly)
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 36
    def similarity_engine
      @similarity_engine
    end
#vocabulary ⇒ Vocabulary (readonly)
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 30
    def vocabulary
      @vocabulary
    end
Class Method Details
.from_cache(language:, cache: nil, preload: false, index: :exact) ⇒ EmbeddingPipeline Also known as: []
Create pipeline from cache (one-line initialization)
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 51
    def self.from_cache(language:, cache: nil, preload: false, index: :exact)
      require_relative 'cache/model_cache'

      cache ||= Cache::ModelCache.new
      vocab_path = cache.find_vocab(language)
      model_path = cache.find_model(language, :onnx)

      unless vocab_path && model_path
        raise ArgumentError,
              "No cached model for language: #{language}. " \
              "Run: ruby scripts/extract_vocabularies.rb --languages=#{language}"
      end

      from_files(
        vocab_path: vocab_path,
        model_path: model_path,
        language: language,
        preload: preload,
        index: index
      )
    end
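The guard in `from_cache` means a missing vocabulary or a missing model both surface as `ArgumentError`. The sketch below mirrors that guard outside the library so the behavior can be tried in isolation. `StubCache`, `resolve_paths`, the file names, and the use of string language codes are all hypothetical; only the `find_vocab` and `find_model` method names come from the listing above.

```ruby
# StubCache and resolve_paths are hypothetical stand-ins; only the method
# names find_vocab and find_model come from the listing above.
StubCache = Struct.new(:vocabs, :models) do
  def find_vocab(language)
    vocabs[language]
  end

  def find_model(language, _format)
    models[language]
  end
end

# Mirrors the guard in from_cache: both paths must be present.
def resolve_paths(cache, language)
  vocab_path = cache.find_vocab(language)
  model_path = cache.find_model(language, :onnx)
  unless vocab_path && model_path
    raise ArgumentError, "No cached model for language: #{language}"
  end
  [vocab_path, model_path]
end

cache = StubCache.new({ "en" => "en.vocab" }, { "en" => "en.onnx" })

# A cached language resolves to both paths.
found = resolve_paths(cache, "en")

# An uncached language raises; callers can rescue and report the message.
missing = begin
  resolve_paths(cache, "xx")
rescue ArgumentError => e
  e.message
end

puts found.inspect
puts missing
```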
.from_files(vocab_path:, model_path:, language:, preload: false, index: :exact) ⇒ EmbeddingPipeline
Create pipeline from files
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 82
    def self.from_files(vocab_path:, model_path:, language:, preload: false, index: :exact)
      vocab = Vocabulary.from_file(vocab_path, language_code: language)
      model = OnnxRuntimeModel.from_file(model_path, language_code: language)

      new(
        vocabulary: vocab,
        model: model,
        preload: preload,
        index: index
      )
    end
Instance Method Details
#find_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0) ⇒ Array<Hash>
Find k nearest neighbors for a word
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 128
    def find_nearest(word, k: 10, exclude_self: true, min_similarity: 0.0)
      @search.find_nearest(word, k: k, exclude_self: exclude_self, min_similarity: min_similarity)
    end
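`find_nearest` delegates to `Search`, whose implementation is not shown here. To make the three keyword options concrete, the following is a brute-force sketch over a toy vocabulary. It assumes cosine similarity, result hashes with `:word` and `:similarity` keys, and an empty array for unknown words; none of these are confirmed by the source, and `EMBEDDINGS`, `cosine`, and `nearest` are hypothetical names.

```ruby
# Toy, hypothetical data and helpers; not Kotoshu's implementation.
EMBEDDINGS = {
  "cat"    => [1.0, 0.0],
  "kitten" => [0.8, 0.6],
  "car"    => [0.0, 1.0]
}.freeze

# Cosine similarity between two equal-length vectors.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Score every candidate, filter, then keep the k highest scores.
def nearest(word, k: 10, exclude_self: true, min_similarity: 0.0)
  query = EMBEDDINGS[word]
  return [] unless query

  EMBEDDINGS
    .reject { |other, _| exclude_self && other == word }
    .map { |other, vec| { word: other, similarity: cosine(query, vec) } }
    .select { |hit| hit[:similarity] >= min_similarity }
    .max_by(k) { |hit| hit[:similarity] } # highest similarity first
end

puts nearest("cat", k: 1).inspect
puts nearest("cat", min_similarity: 0.5).inspect
```

A scan like this is linear in the vocabulary size for each query, which is presumably what an `:exact` index option refers to, though the source does not say so.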
#find_nearest_batch(words, k: 10) ⇒ Hash<String, Array<Hash>>
Find nearest neighbors for multiple words
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 138
    def find_nearest_batch(words, k: 10)
      @search.find_nearest_batch(words, k: k)
    end
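The declared return type, `Hash<String, Array<Hash>>`, suggests one entry per input word. The sketch below shows that shape by mapping each word through a single-word lookup. `lookup` and `nearest_batch` are hypothetical, the neighbor data is invented for illustration, and the real `Search#find_nearest_batch` may batch its work differently.

```ruby
# Hypothetical single-word lookup standing in for a real search call.
lookup = lambda do |word, k|
  neighbours = { "cat" => %w[kitten dog], "car" => %w[truck bus] }
  neighbours.fetch(word, []).first(k).map { |w| { word: w } }
end

# Build a Hash keyed by input word, each value being that word's hits.
def nearest_batch(words, lookup, k: 10)
  words.to_h { |word| [word, lookup.call(word, k)] }
end

results = nearest_batch(%w[cat car], lookup, k: 1)
puts results.inspect
```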
#get_embedding(word) ⇒ Array<Float>?
Get embedding for a word
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 157
    def get_embedding(word)
      @model.(word, @vocabulary)
    end
#get_embedding_by_index(index) ⇒ Array<Float>?
Get embedding by index
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 166
    def get_embedding_by_index(index)
      @model.(index)
    end
#include?(word) ⇒ Boolean
Check if word exists in vocabulary
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 175
    def include?(word)
      @vocabulary.include?(word)
    end
#model_info ⇒ Hash
Get model information
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 218
    def model_info
      @model.model_info
    end
#preload_embeddings! ⇒ self
Preload all embeddings into memory
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 183
    def preload_embeddings!
      @model.load!
      @search.
      self
    end
#similarity(word1, word2) ⇒ Float?
Compute similarity between two words
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 148
    def similarity(word1, word2)
      @search.similarity(word1, word2)
    end
#stats ⇒ Hash
Get pipeline statistics
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 203
    def stats
      {
        language: @vocabulary.language_code,
        vocabulary_size: @vocabulary.size,
        embedding_dimension: @model.dimension,
        model_loaded: @model.loaded?,
        embeddings_preloaded: @search.,
        cache_stats: @search.instance_variable_get(:@embedding_cache)&.stats
      }
    end
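The `cache_stats` entry reads `@embedding_cache` out of the search object with `instance_variable_get` and guards the result with the safe-navigation operator `&.`, so a search object that never set that variable yields `nil` rather than raising `NoMethodError`. A minimal illustration follows; `CacheStub`, `WithCache`, and `WithoutCache` are hypothetical stand-ins, and the keys in the stats hash are invented.

```ruby
# Hypothetical stand-ins for a cache and for two kinds of search object.
class CacheStub
  def stats
    { hits: 0, misses: 0 }
  end
end

class WithCache
  def initialize
    @embedding_cache = CacheStub.new
  end
end

class WithoutCache
end

# An unset instance variable reads as nil, and &. short-circuits on nil.
with    = WithCache.new.instance_variable_get(:@embedding_cache)&.stats
without = WithoutCache.new.instance_variable_get(:@embedding_cache)&.stats

puts with.inspect
puts without.inspect
```

Reading another object's instance variable this way bypasses its public interface, so the result depends on the variable name staying the same inside `Search`.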
#to_s ⇒ String Also known as: inspect
String representation
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 226
    def to_s
      "EmbeddingPipeline(language: #{@vocabulary.language_code}, " \
        "vocab_size: #{@vocabulary.size}, " \
        "dimension: #{@model.dimension}, " \
        "loaded: #{@model.loaded?})"
    end
#unload! ⇒ self
Unload model from memory
    # File 'lib/kotoshu/embeddings/embedding_pipeline.rb', line 193
    def unload!
      @model.unload!
      @search.clear_cache
      self
    end
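Because `unload!` and `preload_embeddings!` both return `self`, calls can be chained on the same pipeline object. The toy class below shows the pattern in isolation; `ToyPipeline` and its `preload!` method are hypothetical and only imitate the return-`self` convention.

```ruby
# Hypothetical class imitating the return-self convention.
class ToyPipeline
  attr_reader :loaded

  def initialize
    @loaded = false
  end

  def preload!
    @loaded = true
    self
  end

  def unload!
    @loaded = false
    self
  end
end

pipeline = ToyPipeline.new

# Each call returns the receiver, so the chain ends on the same object.
same = pipeline.preload!.unload!

puts same.equal?(pipeline)
puts pipeline.loaded
```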