Class: Kotoshu::Models::OnnxModel

Inherits:
EmbeddingModel
Defined in:
lib/kotoshu/models/onnx_model.rb

Overview

ONNX embedding model implementation.

Loads FastText models converted to ONNX format for faster inference. Uses ONNX Runtime for efficient embedding lookup.

Examples:

Loading from file

model = OnnxModel.from_file('fasttext.en.onnx')
embedding = model.embedding_for('hello')

Loading from GitHub (via ModelCache)

model = OnnxModel.from_github('en')
neighbors = model.nearest_neighbors('hello', k: 10)

Defined Under Namespace

Classes: OnnxUnavailable

Constant Summary

ONNX_LOADED =

Soft-load onnxruntime. The gem is intentionally not a hard runtime dependency: it fails to build on some platforms and would block installation for users who only want traditional spell-checking. Semantic features are enabled automatically when the gem is present.

KOTOSHU_NO_ONNX=1 forces semantic analysis off even when the gem is installed (useful for benchmarks / CI determinism).

begin
  if ENV["KOTOSHU_NO_ONNX"] == "1"
    false
  else
    require "onnxruntime"
    true
  end
rescue LoadError
  false
end

DEFAULT_DIMENSION =

Default dimension for FastText models

300
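
The soft-load pattern used for ONNX_LOADED can be reduced to a small helper. The following is a minimal sketch, not Kotoshu's code: `soft_require`, its `opt_out_var` and `env` parameters, and the `DEMO_NO_JSON` variable are hypothetical names introduced for illustration, and the standard-library `json` stands in for an optional gem.

```ruby
# Hypothetical helper illustrating the soft-load pattern; not part of Kotoshu.
# Returns true when the library could be required, false when it is missing
# or when the opt-out environment variable is set to "1".
def soft_require(library, opt_out_var:, env: ENV)
  return false if env[opt_out_var] == "1"

  require library
  true
rescue LoadError
  false
end

# "json" ships with Ruby, so it stands in for an optional gem here.
JSON_LOADED = soft_require("json", opt_out_var: "DEMO_NO_JSON")
```

Callers can then branch on the constant instead of rescuing LoadError at every use site.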

Instance Attribute Summary

Attributes inherited from EmbeddingModel

#dimension, #language_code, #vocabulary_size

Instance Method Summary

Methods inherited from EmbeddingModel

#distance, #has_word?, #metadata, #nearest_neighbors_for_embedding, #similarity, #statistics, #to_s

Constructor Details

#initialize(language_code:, dimension: DEFAULT_DIMENSION, onnx_path:, vocabulary:, embedding_matrix: nil) ⇒ OnnxModel

Create a new ONNX model.

Parameters:

  • language_code (String)

    ISO 639-1 language code

  • dimension (Integer) (defaults to: DEFAULT_DIMENSION)

    Vector dimension

  • onnx_path (String)

    Path to .onnx file

  • vocabulary (Hash<String, Integer>)

    Word-to-index mapping

  • embedding_matrix (Numo::SFloat) (defaults to: nil)

    Pre-loaded embeddings (optional)



# File 'lib/kotoshu/models/onnx_model.rb', line 60

def initialize(language_code:, dimension: DEFAULT_DIMENSION, onnx_path:, vocabulary:, embedding_matrix: nil)
  super(language_code: language_code, dimension: dimension)
  @onnx_path = onnx_path
  @vocabulary = vocabulary.freeze
  @vocabulary_size = @vocabulary.size

  # Pre-load embedding matrix if provided (for faster nearest neighbor search)
  @embedding_matrix = embedding_matrix

  # Lazy load session
  @session = nil
  @loaded = false
end

Instance Attribute Details

#embedding_matrix ⇒ Object (readonly)

Returns the value of attribute embedding_matrix.



# File 'lib/kotoshu/models/onnx_model.rb', line 51

def embedding_matrix
  @embedding_matrix
end

#onnx_path ⇒ Object (readonly)

Returns the value of attribute onnx_path.



# File 'lib/kotoshu/models/onnx_model.rb', line 51

def onnx_path
  @onnx_path
end

#vocabulary ⇒ Array<String> (readonly)

Get the vocabulary (all words in the model).

Returns:

  • (Array<String>)

    Vocabulary words



# File 'lib/kotoshu/models/onnx_model.rb', line 150

def vocabulary
  @vocabulary
end

Class Method Details

.detect_language_from_path(path) ⇒ String

Detect language code from file path.

Parameters:

  • path (String)

    File path

Returns:

  • (String)

    Detected language code



# File 'lib/kotoshu/models/onnx_model.rb', line 323

def self.detect_language_from_path(path)
  # Extract from path like "fasttext.en.onnx"
  if path =~ /\.([a-z]{2})\./i
    Regexp.last_match(1).downcase
  else
    'en'  # Default to English
  end
end
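
To see how the detection rule behaves on different paths, here is a standalone copy of the same regular expression. `detect_language` is a hypothetical free function used only for experimentation, and the sample paths are made up.

```ruby
# Standalone copy of the detection rule: the first two-letter segment that is
# enclosed by dots wins; anything else falls back to "en".
def detect_language(path)
  match = path.match(/\.([a-z]{2})\./i)
  match ? match[1].downcase : "en"
end

detect_language("fasttext.en.onnx")        # => "en"
detect_language("models/fasttext.DE.onnx") # => "de"
detect_language("model.onnx")              # => "en" (fallback)
detect_language("cache.db.fr.onnx")        # => "db" (first match, not "fr")
```

The last call shows a limitation of the pattern: any earlier two-letter, dot-delimited segment is taken as the language, so passing language_code explicitly is the safer choice for unusual file names.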

.from_file(onnx_path, language_code: nil) ⇒ OnnxModel

Load ONNX model from a file.

Parameters:

  • onnx_path (String)

    Path to .onnx file

  • language_code (String) (defaults to: nil)

    Language code (auto-detected from filename)

Returns:

  • (OnnxModel)

Raises:

  • (ArgumentError)

    if the model file or its .vocab.json vocabulary file doesn't exist



# File 'lib/kotoshu/models/onnx_model.rb', line 80

def self.from_file(onnx_path, language_code: nil)
  raise ArgumentError, "File not found: #{onnx_path}" unless File.exist?(onnx_path)

  # Detect language from filename if not provided
  language_code ||= detect_language_from_path(onnx_path)

  # Load vocabulary from .vocab.json file
  vocab_path = onnx_path.sub('.onnx', '.vocab.json')
  unless File.exist?(vocab_path)
    raise ArgumentError, "Vocabulary file not found: #{vocab_path}"
  end

  require 'json'
  vocabulary = JSON.parse(File.read(vocab_path))

  # Load metadata
  metadata_path = onnx_path.sub('.onnx', '.metadata.json')
  dimension = DEFAULT_DIMENSION

  if File.exist?(metadata_path)
    metadata = JSON.parse(File.read(metadata_path))
    dimension = metadata['dimension']
  end

  new(
    language_code: language_code,
    dimension: dimension,
    onnx_path: onnx_path,
    vocabulary: vocabulary
  )
end
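
from_file expects two sidecar files next to the model: a required .vocab.json and an optional .metadata.json. The sketch below shows how those paths are derived and how the dimension could be read. `sidecar_paths` and `read_dimension` are hypothetical helpers, and falling back to the default when the "dimension" key is absent is an addition of this sketch, not behavior shown in the source.

```ruby
require "json"

DEFAULT_DIMENSION = 300

# Derive the sidecar paths the same way from_file does: the first ".onnx"
# in the path is replaced by the sidecar suffix.
def sidecar_paths(onnx_path)
  {
    vocab: onnx_path.sub(".onnx", ".vocab.json"),
    metadata: onnx_path.sub(".onnx", ".metadata.json")
  }
end

# Read the dimension from the metadata sidecar. Falls back to the default
# when the file is missing or (an assumption of this sketch) when the
# "dimension" key is absent.
def read_dimension(metadata_path)
  return DEFAULT_DIMENSION unless File.exist?(metadata_path)

  JSON.parse(File.read(metadata_path)).fetch("dimension", DEFAULT_DIMENSION)
end

sidecar_paths("models/fasttext.en.onnx")
# :vocab    is "models/fasttext.en.vocab.json"
# :metadata is "models/fasttext.en.metadata.json"
```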

.from_github(language_code, cache: nil) ⇒ OnnxModel

Load ONNX model from GitHub (via ModelCache).

Downloads the .onnx file from kotoshu/dictionaries repository.

Parameters:

  • language_code (String)

    ISO 639-1 language code (de, en, es, fr, pt, ru)

  • cache (ModelCache, nil) (defaults to: nil)

    Optional cache instance

Returns:

  • (OnnxModel)

Raises:

  • (ArgumentError)

    if language not supported



# File 'lib/kotoshu/models/onnx_model.rb', line 120

def self.from_github(language_code, cache: nil)
  require_relative '../cache/model_cache'

  cache ||= Cache::ModelCache.new

  # Get the .onnx file path from cache
  onnx_file = cache.get_onnx_model(language_code)

  from_file(onnx_file, language_code: language_code)
end

Instance Method Details

#batch_embeddings(words) ⇒ Hash<String, WordEmbedding>

Batch lookup of embeddings for multiple words.

More efficient than individual lookups when using ONNX.

Parameters:

  • words (Array<String>)

    Words to lookup

Returns:

  • (Hash<String, WordEmbedding>)



# File 'lib/kotoshu/models/onnx_model.rb', line 187

def batch_embeddings(words)
  ensure_session_loaded

  indices = words.map { |w| @vocabulary[w] }
  vectors = batch_get_embeddings(indices)

  # Collect results into the Hash; words missing from the vocabulary are skipped
  words.zip(indices, vectors).each_with_object({}) do |(word, idx, vec), result|
    next unless idx && vec

    result[word] = WordEmbedding.new(word, vec, @language_code, dimension: @dimension)
  end
end
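
The shape of the batch result can be illustrated without ONNX Runtime. In this sketch a plain Array of vectors stands in for the model output; `VOCABULARY`, `VECTORS`, and `batch_lookup` are hypothetical names, and raw arrays replace WordEmbedding objects to keep the example self-contained.

```ruby
# Stand-in for the model: a word-to-index Hash and one vector per index.
VOCABULARY = { "hello" => 0, "world" => 1 }.freeze
VECTORS = [[0.1, 0.2], [0.3, 0.4]].freeze

# Look up several words at once, skipping words missing from the vocabulary.
def batch_lookup(words, vocabulary: VOCABULARY, vectors: VECTORS)
  words.each_with_object({}) do |word, result|
    index = vocabulary[word]
    next unless index

    result[word] = vectors[index]
  end
end

# Returns a Hash with entries for "hello" and "world" only;
# "unknown" is silently dropped.
batch_lookup(%w[hello unknown world])
```

Because unknown words are omitted rather than mapped to nil, callers should check for a key's presence instead of assuming one entry per input word.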

#embedding_for(word) ⇒ WordEmbedding?

Get embedding vector for a word.

Parameters:

  • word (String)

    The word to lookup

Returns:

  • (WordEmbedding, nil)



# File 'lib/kotoshu/models/onnx_model.rb', line 135

def embedding_for(word)
  return nil if word.nil? || word.empty?

  index = @vocabulary[word]
  return nil unless index

  # Get embedding from ONNX model
  vector = get_embedding_vector(index)

  WordEmbedding.new(word, vector, @language_code, dimension: @dimension)
end

#loaded? ⇒ Boolean

Check if model is loaded.

Returns:

  • (Boolean)

    True if ONNX session is loaded



# File 'lib/kotoshu/models/onnx_model.rb', line 157

def loaded?
  @loaded
end
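
The constructor leaves the session unset and loaded? reports whether it has been created since. How OnnxModel builds the session is not shown on this page, so the following is only a generic sketch of deferred loading; `LazyResource` and its methods are hypothetical.

```ruby
# Hypothetical class showing deferred loading; the real session setup in
# OnnxModel is not shown on this page.
class LazyResource
  def initialize(&loader)
    @loader = loader
    @resource = nil
    @loaded = false
  end

  def loaded?
    @loaded
  end

  # Run the loader on first access and reuse the result afterwards.
  def resource
    ensure_loaded
    @resource
  end

  private

  def ensure_loaded
    return if @loaded

    @resource = @loader.call
    @loaded = true
  end
end

lazy = LazyResource.new { :expensive_session }
lazy.loaded?  # false before first use
lazy.resource # triggers the loader once
lazy.loaded?  # true afterwards
```

Deferring the expensive step keeps construction cheap, which matters when a model object is created but only its vocabulary is ever consulted.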

#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>

Find k nearest neighbors for a word.

Parameters:

  • word (String)

    The query word

  • k (Integer) (defaults to: 10)

    Number of neighbors to return

Returns:

  • (Array<NearestNeighbor>)



# File 'lib/kotoshu/models/onnx_model.rb', line 166

def nearest_neighbors(word, k: 10)
  ensure_session_loaded

  # Get query embedding
  query = embedding_for(word)
  return [] unless query

  # If embedding matrix is pre-loaded, use it for faster search
  if @embedding_matrix
    nearest_neighbors_from_matrix(query, k)
  else
    super
  end
end
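
Neither nearest_neighbors_from_matrix nor the inherited fallback is shown on this page. As a rough picture of what a k-nearest-neighbor search over embeddings involves, here is a brute-force sketch in plain Ruby. It assumes cosine similarity as the ranking measure, which the source does not confirm; `cosine_similarity`, `brute_force_neighbors`, and the toy `EMBEDDINGS` data are hypothetical.

```ruby
# Cosine similarity between two equal-length numeric arrays.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Return [word, score] pairs for the k words most similar to query_word,
# excluding the query word itself. Unknown words yield an empty Array.
def brute_force_neighbors(query_word, embeddings, k: 10)
  query = embeddings[query_word]
  return [] unless query

  embeddings
    .reject { |word, _| word == query_word }
    .map { |word, vector| [word, cosine_similarity(query, vector)] }
    .max_by(k) { |_, score| score }
end

# Toy two-dimensional data, made up for illustration.
EMBEDDINGS = {
  "cat" => [1.0, 0.0],
  "kitten" => [0.9, 0.1],
  "car" => [0.0, 1.0]
}.freeze

brute_force_neighbors("cat", EMBEDDINGS, k: 1).map(&:first) # => ["kitten"]
```

This scan costs one similarity computation per vocabulary entry for every query, which is why a pre-loaded matrix that can score all rows in one vectorized operation pays off for repeated queries.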

#preload_embedding_matrix ⇒ Boolean

Preload the embedding matrix into memory for faster nearest neighbor search.

Useful when doing many nearest neighbor queries.

Returns:

  • (Boolean)

    True if loaded successfully



# File 'lib/kotoshu/models/onnx_model.rb', line 205

def preload_embedding_matrix
  ensure_session_loaded

  # Get all embeddings at once
  all_indices = (0...@vocabulary_size).to_a
  vectors = batch_get_embeddings(all_indices)

  # Convert to matrix (using Numo::SFloat for efficiency)
  require 'numo/narray'
  @embedding_matrix = Numo::SFloat.cast(vectors).reshape(@vocabulary_size, @dimension)

  true
rescue StandardError => e
  warn "Failed to preload embedding matrix: #{e.message}"
  false
end