Class: Kotoshu::Models::OnnxModel
- Inherits:
- EmbeddingModel
- Inheritance chain: Kotoshu::Models::OnnxModel < EmbeddingModel < Object
- Defined in:
- lib/kotoshu/models/onnx_model.rb
Overview
ONNX embedding model implementation.
Loads FastText models converted to ONNX format for faster inference. Uses ONNX Runtime for efficient embedding lookup.
Defined Under Namespace
Classes: OnnxUnavailable
Constant Summary
- ONNX_LOADED =
Soft-load onnxruntime. The gem is intentionally NOT a hard runtime dependency — it fails to build on some platforms and would block install for users who only want traditional spell-checking. Semantic features light up automatically when the gem is present.
KOTOSHU_NO_ONNX=1 forces semantic analysis off even when the gem is installed (useful for benchmarks / CI determinism).
    begin
      if ENV["KOTOSHU_NO_ONNX"] == "1"
        false
      else
        require "onnxruntime"
        true
      end
    rescue LoadError
      false
    end
- DEFAULT_DIMENSION =
Default dimension for FastText models
300
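The soft-load pattern behind ONNX_LOADED can be sketched in isolation. The helper name `soft_require` and its `disable_env:` parameter are hypothetical and not part of Kotoshu; the sketch only relies on `require` raising LoadError when a gem is missing.

```ruby
# Hypothetical helper (not part of Kotoshu): try to load an optional
# dependency and report whether it is available.
def soft_require(feature, disable_env: nil)
  # An environment switch set to "1" forces the feature off.
  return false if disable_env && ENV[disable_env] == "1"

  require feature
  true
rescue LoadError
  # Missing gem: degrade gracefully instead of failing at load time.
  false
end

# "json" ships with Ruby, so it stands in for an optional gem that is present.
JSON_AVAILABLE = soft_require("json", disable_env: "SKETCH_NO_JSON")
# A name that should not resolve to any installed gem.
MISSING_AVAILABLE = soft_require("surely_not_an_installed_gem_for_this_sketch")
```

Callers can then branch on the boolean constant instead of rescuing LoadError at every call site.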
Instance Attribute Summary
- #embedding_matrix ⇒ Object (readonly)
  Returns the value of attribute embedding_matrix.
- #onnx_path ⇒ Object (readonly)
  Returns the value of attribute onnx_path.
- #vocabulary ⇒ Array<String> (readonly)
  Get the vocabulary (all words in the model).
Attributes inherited from EmbeddingModel
#dimension, #language_code, #vocabulary_size
Class Method Summary
- .detect_language_from_path(path) ⇒ String
  Detect language code from file path.
- .from_file(onnx_path, language_code: nil) ⇒ OnnxModel
  Load ONNX model from a file.
- .from_github(language_code, cache: nil) ⇒ OnnxModel
  Load ONNX model from GitHub (via ModelCache).
Instance Method Summary
- #batch_embeddings(words) ⇒ Hash<String, WordEmbedding>
  Batch lookup of embeddings for multiple words.
- #embedding_for(word) ⇒ WordEmbedding?
  Get embedding vector for a word.
- #initialize(language_code:, dimension: DEFAULT_DIMENSION, onnx_path:, vocabulary:, embedding_matrix: nil) ⇒ OnnxModel (constructor)
  Create a new ONNX model.
- #loaded? ⇒ Boolean
  Check if model is loaded.
- #nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>
  Find k nearest neighbors for a word.
- #preload_embedding_matrix ⇒ Boolean
  Preload the embedding matrix into memory for faster nearest neighbor search.
Methods inherited from EmbeddingModel
#distance, #has_word?, #metadata, #nearest_neighbors_for_embedding, #similarity, #statistics, #to_s
Constructor Details
#initialize(language_code:, dimension: DEFAULT_DIMENSION, onnx_path:, vocabulary:, embedding_matrix: nil) ⇒ OnnxModel
Create a new ONNX model.
    # File 'lib/kotoshu/models/onnx_model.rb', line 60
    def initialize(language_code:, dimension: DEFAULT_DIMENSION, onnx_path:, vocabulary:, embedding_matrix: nil)
      super(language_code: language_code, dimension: dimension)
      @onnx_path = onnx_path
      @vocabulary = vocabulary.freeze
      @vocabulary_size = @vocabulary.size

      # Pre-load embedding matrix if provided (for faster nearest neighbor search)
      @embedding_matrix = embedding_matrix

      # Lazy load session
      @session = nil
      @loaded = false
    end
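The constructor defers session creation (`@session = nil`, `@loaded = false`). A minimal sketch of that lazy-initialization pattern follows, using a hypothetical `LazySession` class and a block in place of the real ONNX Runtime session; how Kotoshu actually builds its session is not shown on this page.

```ruby
# Hypothetical stand-in for a model that builds an expensive resource on first use.
class LazySession
  def initialize(&loader)
    @loader = loader
    @session = nil
    @loaded = false
  end

  def loaded?
    @loaded
  end

  # Returns the session, building it on the first call only.
  def session
    ensure_session_loaded
    @session
  end

  private

  def ensure_session_loaded
    return if @loaded

    @session = @loader.call
    @loaded = true
  end
end

calls = 0
lazy = LazySession.new { calls += 1; :fake_session }
lazy.loaded?   # false: nothing has been built yet
lazy.session   # builds the session
lazy.session   # reuses it; the loader block ran only once
```

Deferring the load keeps object construction cheap, so a model can be created even when it is never queried.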
Instance Attribute Details
#embedding_matrix ⇒ Object (readonly)
Returns the value of attribute embedding_matrix.
    # File 'lib/kotoshu/models/onnx_model.rb', line 51
    def embedding_matrix
      @embedding_matrix
    end
#onnx_path ⇒ Object (readonly)
Returns the value of attribute onnx_path.
    # File 'lib/kotoshu/models/onnx_model.rb', line 51
    def onnx_path
      @onnx_path
    end
#vocabulary ⇒ Array<String> (readonly)
Get the vocabulary (all words in the model).
    # File 'lib/kotoshu/models/onnx_model.rb', line 150
    def vocabulary
      @vocabulary
    end
Class Method Details
.detect_language_from_path(path) ⇒ String
Detect language code from file path.
    # File 'lib/kotoshu/models/onnx_model.rb', line 323
    def self.detect_language_from_path(path)
      # Extract from path like "fasttext.en.onnx"
      if path =~ /\.([a-z]{2})\./i
        Regexp.last_match(1).downcase
      else
        'en' # Default to English
      end
    end
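To illustrate how this pattern behaves, the sketch below copies the regular expression into a standalone method (the name `detect_language` exists only for this example) and applies it to a few paths. The first `.xx.` segment anywhere in the path wins, including one inside a directory name, and three-letter codes fall through to the default.

```ruby
# Standalone copy of the matching logic, for illustration only.
def detect_language(path)
  if path =~ /\.([a-z]{2})\./i
    Regexp.last_match(1).downcase
  else
    'en'
  end
end

detect_language('models/fasttext.de.onnx')       # => "de"
detect_language('fasttext.FR.onnx')              # => "fr"
detect_language('fasttext.onnx')                 # => "en" (no two-letter segment)
detect_language('fasttext.deu.onnx')             # => "en" (three letters do not match)
detect_language('v1.ab.cache/fasttext.de.onnx')  # => "ab" (directory segment matches first)
```

Passing `language_code:` explicitly to `.from_file` avoids relying on this heuristic.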
.from_file(onnx_path, language_code: nil) ⇒ OnnxModel
Load ONNX model from a file.
    # File 'lib/kotoshu/models/onnx_model.rb', line 80
    def self.from_file(onnx_path, language_code: nil)
      raise ArgumentError, "File not found: #{onnx_path}" unless File.exist?(onnx_path)

      # Detect language from filename if not provided
      language_code ||= detect_language_from_path(onnx_path)

      # Load vocabulary from .vocab.json file
      vocab_path = onnx_path.sub('.onnx', '.vocab.json')
      unless File.exist?(vocab_path)
        raise ArgumentError, "Vocabulary file not found: #{vocab_path}"
      end

      require 'json'
      vocabulary = JSON.parse(File.read(vocab_path))

      # Load metadata
      metadata_path = onnx_path.sub('.onnx', '.metadata.json')
      dimension = DEFAULT_DIMENSION

      if File.exist?(metadata_path)
        metadata = JSON.parse(File.read(metadata_path))
        dimension = metadata['dimension']
      end

      new(
        language_code: language_code,
        dimension: dimension,
        onnx_path: onnx_path,
        vocabulary: vocabulary
      )
    end
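The sidecar-file convention used by `.from_file` (a `.vocab.json` and an optional `.metadata.json` next to the `.onnx` file) can be sketched on its own. The helpers `sidecar_paths` and `read_dimension` and the constant `FALLBACK_DIMENSION` are hypothetical names for this example. Unlike the listing above, the sketch also falls back to the default when the metadata file exists but has no "dimension" key; that is an assumption about desirable behavior, not something the source does.

```ruby
require 'json'
require 'tmpdir'

FALLBACK_DIMENSION = 300 # same value as DEFAULT_DIMENSION

# Hypothetical helper: derive sidecar paths with the same String#sub calls.
def sidecar_paths(onnx_path)
  {
    vocab: onnx_path.sub('.onnx', '.vocab.json'),
    metadata: onnx_path.sub('.onnx', '.metadata.json')
  }
end

# Hypothetical helper: read the dimension, falling back when the
# metadata file or its "dimension" key is absent.
def read_dimension(metadata_path)
  return FALLBACK_DIMENSION unless File.exist?(metadata_path)

  JSON.parse(File.read(metadata_path)).fetch('dimension', FALLBACK_DIMENSION)
end

Dir.mktmpdir do |dir|
  paths = sidecar_paths(File.join(dir, 'fasttext.en.onnx'))
  File.write(paths[:metadata], JSON.generate('dimension' => 100))
  puts read_dimension(paths[:metadata])                        # prints 100
  puts read_dimension(File.join(dir, 'absent.metadata.json'))  # prints 300
end
```

`String#sub` with a string pattern replaces only the first occurrence, so a directory whose name contains `.onnx` would be rewritten instead of the file extension; anchoring on the end of the path would be more robust.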
.from_github(language_code, cache: nil) ⇒ OnnxModel
Load ONNX model from GitHub (via ModelCache).
Downloads the .onnx file from kotoshu/dictionaries repository.
    # File 'lib/kotoshu/models/onnx_model.rb', line 120
    def self.from_github(language_code, cache: nil)
      require_relative '../cache/model_cache'

      cache ||= Cache::ModelCache.new

      # Get the .onnx file path from cache
      onnx_file = cache.get_onnx_model(language_code)

      from_file(onnx_file, language_code: language_code)
    end
Instance Method Details
#batch_embeddings(words) ⇒ Hash<String, WordEmbedding>
Batch lookup of embeddings for multiple words.
More efficient than individual lookups when using ONNX.
    # File 'lib/kotoshu/models/onnx_model.rb', line 187
    def batch_embeddings(words)
      ensure_session_loaded

      indices = words.map { |w| @vocabulary[w] }
      vectors = (indices)

      # Collect results in the accumulator hash; each_with_object ignores the block's return value
      words.zip(indices, vectors).each_with_object({}) do |(word, idx, vec), result|
        next unless idx && vec
        result[word] = WordEmbedding.new(word, vec, @language_code, dimension: @dimension)
      end
    end
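A minimal sketch of the batch-lookup shape, assuming a vocabulary that maps each word to a row index and an in-memory table of vectors. Both are stand-ins (`VOCAB`, `VECTORS`, and `batch_lookup` are hypothetical names), since the ONNX call made by the real method is not visible in this listing. Words missing from the vocabulary are skipped rather than mapped to nil.

```ruby
# Hypothetical data: word => row index, and one vector per row.
VOCAB = { 'cat' => 0, 'dog' => 1 }.freeze
VECTORS = [[0.1, 0.2], [0.3, 0.4]].freeze

# Resolve all words in one pass, accumulating hits into a hash.
def batch_lookup(words, vocab: VOCAB, vectors: VECTORS)
  words.each_with_object({}) do |word, result|
    idx = vocab[word]
    next unless idx # unknown word: leave it out of the result

    result[word] = vectors[idx]
  end
end

batch_lookup(%w[cat bird dog])
# => {"cat"=>[0.1, 0.2], "dog"=>[0.3, 0.4]}
```

Because unknown words are omitted, callers should check `result.key?(word)` rather than assume every input has an entry.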
#embedding_for(word) ⇒ WordEmbedding?
Get embedding vector for a word.
    # File 'lib/kotoshu/models/onnx_model.rb', line 135
    def embedding_for(word)
      return nil if word.nil? || word.empty?

      index = @vocabulary[word]
      return nil unless index

      # Get embedding from ONNX model
      vector = (index)

      WordEmbedding.new(word, vector, @language_code, dimension: @dimension)
    end
#loaded? ⇒ Boolean
Check if model is loaded.
    # File 'lib/kotoshu/models/onnx_model.rb', line 157
    def loaded?
      @loaded
    end
#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>
Find k nearest neighbors for a word.
    # File 'lib/kotoshu/models/onnx_model.rb', line 166
    def nearest_neighbors(word, k: 10)
      ensure_session_loaded

      # Get query embedding
      query = embedding_for(word)
      return [] unless query

      # If embedding matrix is pre-loaded, use it for faster search
      if @embedding_matrix
        nearest_neighbors_from_matrix(query, k)
      else
        super
      end
    end
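A k-nearest-neighbor search over an in-memory table can be sketched in plain Ruby. This assumes cosine similarity as the ranking metric, which this page does not confirm, and uses a hypothetical `word => vector` hash in place of the preloaded embedding matrix; `cosine` and `nearest` are names invented for the example.

```ruby
# Cosine similarity between two equal-length vectors (0.0 for a zero vector).
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Returns up to k [word, score] pairs, most similar first,
# or [] when the query word is unknown.
def nearest(query_word, table, k: 10)
  query = table[query_word]
  return [] unless query

  table.reject { |word, _| word == query_word }
       .map { |word, vec| [word, cosine(query, vec)] }
       .max_by(k) { |_, score| score }
end

table = {
  'king'  => [1.0, 0.0],
  'queen' => [0.9, 0.1],
  'apple' => [0.0, 1.0]
}
nearest('king', table, k: 2).map(&:first) # => ["queen", "apple"]
```

This brute-force scan costs one similarity computation per vocabulary entry, which is why holding all vectors in memory pays off when many queries are issued.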
#preload_embedding_matrix ⇒ Boolean
Preload the embedding matrix into memory for faster nearest neighbor search.
Useful when doing many nearest neighbor queries.
    # File 'lib/kotoshu/models/onnx_model.rb', line 205
    def preload_embedding_matrix
      ensure_session_loaded

      # Get all embeddings at once
      all_indices = (0...@vocabulary_size).to_a
      vectors = (all_indices)

      # Convert to matrix (using Numo::SFloat for efficiency)
      require 'numo/narray'
      @embedding_matrix = Numo::SFloat.cast(vectors).reshape(@vocabulary_size, @dimension)

      true
    rescue StandardError => e
      warn "Failed to preload embedding matrix: #{e.message}"
      false
    end