Class: Kotoshu::Models::OnnxModel

Inherits: EmbeddingModel

Inheritance chain: Object, EmbeddingModel, Kotoshu::Models::OnnxModel

Defined in: lib/kotoshu/models/onnx_model.rb

Overview

ONNX embedding model implementation.

Loads FastText models converted to ONNX format for faster inference. Uses ONNX Runtime for efficient embedding lookup.

Examples:

Loading from file

model = OnnxModel.from_file('fasttext.en.onnx')
embedding = model.embedding_for('hello')

Loading from GitHub (via ModelCache)

model = OnnxModel.from_github('en')
neighbors = model.nearest_neighbors('hello', k: 10)

Defined Under Namespace

Classes: OnnxUnavailable

Constant Summary

ONNX_LOADED = Soft-load onnxruntime. The gem is intentionally NOT a hard runtime dependency — it fails to build on some platforms and would block install for users who only want traditional spell-checking. Semantic features light up automatically when the gem is present. KOTOSHU_NO_ONNX=1 forces semantic analysis off even when the gem is installed (useful for benchmarks / CI determinism).

begin
  if ENV["KOTOSHU_NO_ONNX"] == "1"
    false
  else
    require "onnxruntime"
    true
  end
rescue LoadError
  false
end
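
The soft-load pattern above applies to any optional dependency. The following is a minimal sketch, not Kotoshu code: the helper `soft_require` and the flag name `EXAMPLE_NO_JSON` are hypothetical, and the standard library's `json` stands in for an optional gem.

```ruby
# Hypothetical helper (not part of Kotoshu): try to load an optional
# library while honouring an opt-out environment variable.
def soft_require(feature, opt_out_env:)
  # An explicit opt-out wins even when the library is installed.
  return false if ENV[opt_out_env] == "1"

  require feature
  true
rescue LoadError
  # The library is missing or failed to load: report it as unavailable.
  false
end

# "json" is used here only because it ships with Ruby.
JSON_LOADED = soft_require("json", opt_out_env: "EXAMPLE_NO_JSON")
puts "json available: #{JSON_LOADED}"
```

Callers can then branch on the boolean constant instead of rescuing LoadError at every call site.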

DEFAULT_DIMENSION = Default dimension for FastText models

Instance Attribute Summary

#embedding_matrix ⇒ Object readonly

Returns the value of attribute embedding_matrix.
#onnx_path ⇒ Object readonly

Returns the value of attribute onnx_path.
#vocabulary ⇒ Array<String> readonly

Get the vocabulary (all words in the model).

Attributes inherited from EmbeddingModel

#dimension, #language_code, #vocabulary_size

Class Method Summary

.detect_language_from_path(path) ⇒ String

Detect language code from file path.
.from_file(onnx_path, language_code: nil) ⇒ OnnxModel

Load ONNX model from a file.
.from_github(language_code, cache: nil) ⇒ OnnxModel

Load ONNX model from GitHub (via ModelCache).

Instance Method Summary

#batch_embeddings(words) ⇒ Hash<String, WordEmbedding>

Batch lookup of embeddings for multiple words.
#embedding_for(word) ⇒ WordEmbedding?

Get embedding vector for a word.
#initialize(language_code:, dimension: DEFAULT_DIMENSION, onnx_path:, vocabulary:, embedding_matrix: nil) ⇒ OnnxModel constructor

Create a new ONNX model.
#loaded? ⇒ Boolean

Check if model is loaded.
#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>

Find k nearest neighbors for a word.
#preload_embedding_matrix ⇒ Boolean

Preload the embedding matrix into memory for faster nearest neighbor search.

Methods inherited from EmbeddingModel

#distance, #has_word?, #metadata, #nearest_neighbors_for_embedding, #similarity, #statistics, #to_s

Constructor Details

#initialize(language_code:, dimension: DEFAULT_DIMENSION, onnx_path:, vocabulary:, embedding_matrix: nil) ⇒ `OnnxModel`

Create a new ONNX model.

Parameters:

language_code (String) —

ISO 639-1 language code
dimension (Integer) (defaults to: DEFAULT_DIMENSION) —

Vector dimension
onnx_path (String) —

Path to .onnx file
vocabulary (Hash<String, Integer>) —

Word-to-index mapping
embedding_matrix (Numo::SFloat) (defaults to: nil) —

Pre-loaded embeddings (optional)

# File 'lib/kotoshu/models/onnx_model.rb', line 60

def initialize(language_code:, dimension: DEFAULT_DIMENSION, onnx_path:, vocabulary:, embedding_matrix: nil)
  super(language_code: language_code, dimension: dimension)
  @onnx_path = onnx_path
  @vocabulary = vocabulary.freeze
  @vocabulary_size = @vocabulary.size

  # Pre-load embedding matrix if provided (for faster nearest neighbor search)
  @embedding_matrix = embedding_matrix

  # Lazy load session
  @session = nil
  @loaded = false
end

Instance Attribute Details

#embedding_matrix ⇒ `Object` (readonly)

Returns the value of attribute embedding_matrix.




# File 'lib/kotoshu/models/onnx_model.rb', line 51

def embedding_matrix
  @embedding_matrix
end

#onnx_path ⇒ `Object` (readonly)

Returns the value of attribute onnx_path.




# File 'lib/kotoshu/models/onnx_model.rb', line 51

def onnx_path
  @onnx_path
end

#vocabulary ⇒ `Array<String>` (readonly)

Get the vocabulary (all words in the model).

Returns:

(Array<String>) —

Vocabulary words




# File 'lib/kotoshu/models/onnx_model.rb', line 150

def vocabulary
  @vocabulary
end

Class Method Details

.detect_language_from_path(path) ⇒ `String`

Detect language code from file path.

Parameters:

path (String) —

File path

Returns:

(String) —

Detected language code

# File 'lib/kotoshu/models/onnx_model.rb', line 323

def self.detect_language_from_path(path)
  # Extract from path like "fasttext.en.onnx"
  if path =~ /\.([a-z]{2})\./i
    Regexp.last_match(1).downcase
  else
    'en'  # Default to English
  end
end
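
To see how the pattern behaves on different file names, the sketch below copies the detection logic into a standalone method. The name `detect_language` is hypothetical and exists only for experimentation.

```ruby
# Standalone copy of the detection logic shown above (illustration only).
def detect_language(path)
  # Match a two-letter segment enclosed by dots, e.g. ".en."
  if path =~ /\.([a-z]{2})\./i
    Regexp.last_match(1).downcase
  else
    'en' # Same fallback as the original method
  end
end

puts detect_language('fasttext.en.onnx')        # en
puts detect_language('models/fasttext.DE.onnx') # de
puts detect_language('model.onnx')              # en (fallback)
puts detect_language('fasttext.eng.onnx')       # en (fallback: three letters do not match)
```

Because the fallback is 'en', a failed detection cannot be told apart from a genuine English model; passing language_code explicitly avoids that ambiguity.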

.from_file(onnx_path, language_code: nil) ⇒ `OnnxModel`

Load ONNX model from a file.

Parameters:

onnx_path (String) —

Path to .onnx file
language_code (String) (defaults to: nil) —

Language code (auto-detected from filename)

Returns:

(OnnxModel) —

Loaded model

Raises:

(ArgumentError) —

if the model file or its .vocab.json file doesn't exist

# File 'lib/kotoshu/models/onnx_model.rb', line 80

def self.from_file(onnx_path, language_code: nil)
  raise ArgumentError, "File not found: #{onnx_path}" unless File.exist?(onnx_path)

  # Detect language from filename if not provided
  language_code ||= detect_language_from_path(onnx_path)

  # Load vocabulary from .vocab.json file
  vocab_path = onnx_path.sub('.onnx', '.vocab.json')
  unless File.exist?(vocab_path)
    raise ArgumentError, "Vocabulary file not found: #{vocab_path}"
  end

  require 'json'
  vocabulary = JSON.parse(File.read(vocab_path))

  # Load metadata
  metadata_path = onnx_path.sub('.onnx', '.metadata.json')
  dimension = DEFAULT_DIMENSION

  if File.exist?(metadata_path)
    metadata = JSON.parse(File.read(metadata_path))
    # Keep the default when the metadata file omits the dimension
    dimension = metadata['dimension'] || DEFAULT_DIMENSION
  end

  new(
    language_code: language_code,
    dimension: dimension,
    onnx_path: onnx_path,
    vocabulary: vocabulary
  )
end
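
from_file expects sidecar files next to the model: a required .vocab.json and an optional .metadata.json. The sketch below derives those paths the same way and writes illustrative contents into a temporary directory. The helper `sidecar_paths`, the two sample words, and the value 300 are assumptions made for the example; only the file suffixes, the word-to-index shape, and the 'dimension' key come from the code above.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: derive the sidecar paths the way from_file does.
def sidecar_paths(onnx_path)
  {
    vocab: onnx_path.sub('.onnx', '.vocab.json'),
    metadata: onnx_path.sub('.onnx', '.metadata.json')
  }
end

Dir.mktmpdir do |dir|
  onnx_path = File.join(dir, 'fasttext.en.onnx')
  paths = sidecar_paths(onnx_path)

  # Illustrative contents: a word-to-index hash and a dimension entry.
  File.write(paths[:vocab], JSON.generate('hello' => 0, 'world' => 1))
  File.write(paths[:metadata], JSON.generate('dimension' => 300))

  vocabulary = JSON.parse(File.read(paths[:vocab]))
  dimension = JSON.parse(File.read(paths[:metadata]))['dimension']

  puts File.basename(paths[:vocab])
  puts vocabulary['world']
  puts dimension
end
```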

.from_github(language_code, cache: nil) ⇒ `OnnxModel`

Load ONNX model from GitHub (via ModelCache).

Downloads the .onnx file from kotoshu/dictionaries repository.

Parameters:

language_code (String) —

ISO 639-1 language code (de, en, es, fr, pt, ru)
cache (ModelCache, nil) (defaults to: nil) —

Optional cache instance

Returns:

(OnnxModel) —

Loaded model

Raises:

(ArgumentError) —

if the language is not supported

# File 'lib/kotoshu/models/onnx_model.rb', line 120

def self.from_github(language_code, cache: nil)
  require_relative '../cache/model_cache'

  cache ||= Cache::ModelCache.new

  # Get the .onnx file path from cache
  onnx_file = cache.get_onnx_model(language_code)

  from_file(onnx_file, language_code: language_code)
end

Instance Method Details

#batch_embeddings(words) ⇒ `Hash<String, WordEmbedding>`

Batch lookup of embeddings for multiple words.

More efficient than individual lookups when using ONNX.

Parameters:

words (Array<String>) —

Words to lookup

Returns:

(Hash<String, WordEmbedding>) —

Word to embedding mapping

# File 'lib/kotoshu/models/onnx_model.rb', line 187

def batch_embeddings(words)
  ensure_session_loaded

  indices = words.map { |w| @vocabulary[w] }
  vectors = batch_get_embeddings(indices)

  # Collect results into the accumulator hash; skip out-of-vocabulary words
  words.zip(indices, vectors).each_with_object({}) do |(word, idx, vec), result|
    next unless idx && vec

    result[word] = WordEmbedding.new(word, vec, @language_code, dimension: @dimension)
  end
end
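
The returned hash contains only words found in the vocabulary. The sketch below illustrates that filtering with stand-in data: `lookup_many` and `fake_vectors` are hypothetical, and plain arrays take the place of WordEmbedding objects and of vectors fetched from the ONNX session.

```ruby
# Stand-in data; real vectors would come from the ONNX session.
vocabulary = { 'hello' => 0, 'world' => 1 }
fake_vectors = { 0 => [0.1, 0.2], 1 => [0.3, 0.4] }

# Hypothetical helper: build a word => vector hash, skipping unknown words.
def lookup_many(words, vocabulary, vectors_by_index)
  words.each_with_object({}) do |word, result|
    index = vocabulary[word]
    next unless index # out-of-vocabulary words are left out of the result

    result[word] = vectors_by_index[index]
  end
end

found = lookup_many(%w[hello unknown world], vocabulary, fake_vectors)
p found.keys
```

Callers should therefore check for a key with `key?` rather than assume every requested word appears in the result.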

#embedding_for(word) ⇒ `WordEmbedding`?

Get embedding vector for a word.

Parameters:

word (String) —

The word to lookup

Returns:

(WordEmbedding, nil) —

Embedding vector or nil if not found

# File 'lib/kotoshu/models/onnx_model.rb', line 135

def embedding_for(word)
  return nil if word.nil? || word.empty?

  index = @vocabulary[word]
  return nil unless index

  # Get embedding from ONNX model
  vector = get_embedding_vector(index)

  WordEmbedding.new(word, vector, @language_code, dimension: @dimension)
end

#loaded? ⇒ `Boolean`

Check if model is loaded.

Returns:

(Boolean) —

True if ONNX session is loaded




# File 'lib/kotoshu/models/onnx_model.rb', line 157

def loaded?
  @loaded
end
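
The constructor sets @session to nil and @loaded to false, which suggests the session is opened on first use. The private ensure_session_loaded method is not shown on this page, so the following is only a sketch of how such lazy loading is commonly written. The class `LazyResource` is hypothetical.

```ruby
# Hypothetical stand-in for a class that opens an expensive resource lazily.
class LazyResource
  def initialize(&opener)
    @opener = opener
    @resource = nil
    @loaded = false
  end

  # False until the resource has been opened once.
  def loaded?
    @loaded
  end

  # Opens the resource on first access and reuses it afterwards.
  def resource
    ensure_loaded
    @resource
  end

  private

  def ensure_loaded
    return if @loaded

    @resource = @opener.call
    @loaded = true
  end
end

opens = 0
lazy = LazyResource.new { opens += 1; :session }
puts lazy.loaded? # false
lazy.resource
lazy.resource
puts lazy.loaded? # true
puts opens        # 1
```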

#nearest_neighbors(word, k: 10) ⇒ `Array<NearestNeighbor>`

Find k nearest neighbors for a word.

Parameters:

word (String) —

The query word
k (Integer) (defaults to: 10) —

Number of neighbors to return

Returns:

(Array<NearestNeighbor>) —

Nearest neighbors sorted by similarity

# File 'lib/kotoshu/models/onnx_model.rb', line 166

def nearest_neighbors(word, k: 10)
  ensure_session_loaded

  # Get query embedding
  query = embedding_for(word)
  return [] unless query

  # If embedding matrix is pre-loaded, use it for faster search
  if @embedding_matrix
    nearest_neighbors_from_matrix(query, k)
  else
    super
  end
end
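
This page does not show how neighbors are ranked. As an illustration only, the sketch below performs a brute-force top-k search in plain Ruby, assuming cosine similarity as the metric and assuming the query word is excluded from its own results. Neither assumption is confirmed by the source, and the names `cosine` and `top_k` are hypothetical.

```ruby
# Cosine similarity between two equal-length vectors.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Brute-force top-k search over a word => vector hash.
def top_k(query_word, embeddings, k)
  query = embeddings.fetch(query_word)
  embeddings
    .reject { |word, _| word == query_word }        # drop the query itself
    .map { |word, vec| [word, cosine(query, vec)] } # score every candidate
    .max_by(k) { |_, score| score }                 # k best, highest first
end

embeddings = {
  'cat' => [1.0, 0.0],
  'kitten' => [0.9, 0.1],
  'car' => [0.0, 1.0]
}
p top_k('cat', embeddings, 1).map(&:first)
```

A matrix-based search such as the pre-loaded path above computes the same kind of scores in bulk, which is why preloading pays off when many queries are made.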

#preload_embedding_matrix ⇒ `Boolean`

Preload the embedding matrix into memory for faster nearest neighbor search.

Useful when doing many nearest neighbor queries.

Returns:

(Boolean) —

True if loaded successfully

# File 'lib/kotoshu/models/onnx_model.rb', line 205

def preload_embedding_matrix
  ensure_session_loaded

  # Get all embeddings at once
  all_indices = (0...@vocabulary_size).to_a
  vectors = batch_get_embeddings(all_indices)

  # Convert to matrix (using Numo::SFloat for efficiency)
  require 'numo/narray'
  @embedding_matrix = Numo::SFloat.cast(vectors).reshape(@vocabulary_size, @dimension)

  true
rescue StandardError => e
  warn "Failed to preload embedding matrix: #{e.message}"
  false
end
