Class: Kotoshu::Models::EmbeddingModel Abstract

Inherits:
Object
Defined in:
lib/kotoshu/models/embedding_model.rb

Overview

This class is abstract.

Subclasses must implement #embedding_for and #vocabulary

Abstract base class for word embedding models.

Provides a unified interface for loading and querying word embeddings from different sources (FastText, Word2Vec, GloVe, ONNX, etc.).

Examples:

Using an embedding model

model = FastTextModel.new('cc.en.300.vec')
embedding = model.embedding_for('hello')
similarity = model.similarity('hello', 'world')
neighbors = model.nearest_neighbors('hello', k: 10)

Direct Known Subclasses

FastTextModel, OnnxModel


Constructor Details

#initialize(language_code:, dimension:) ⇒ EmbeddingModel

Create a new embedding model.

Parameters:

  • language_code (String)

    ISO 639-1 language code

  • dimension (Integer)

    Vector dimensionality (e.g., 300)

Raises:

  • (ArgumentError)

    If language_code is nil, or if dimension is nil or not positive


# File 'lib/kotoshu/models/embedding_model.rb', line 24

def initialize(language_code:, dimension:)
  raise ArgumentError, "Language code cannot be nil" if language_code.nil?
  raise ArgumentError, "Dimension must be positive" unless dimension&.positive?

  @language_code = language_code
  @dimension = dimension
  @vocabulary_size = 0
  freeze
end
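
One consequence of the trailing `freeze` in the constructor above is that a subclass cannot assign instance variables after calling `super`, because the instance is already frozen at that point. The following is a minimal sketch of a subclass that assigns its state first. `SketchBase` is a hypothetical stand-in whose constructor is copied from the listing above, and `TinyModel` and `LateModel` are hypothetical in-memory subclasses; none of these names are part of Kotoshu.

```ruby
# Stand-in for the documented base class (hypothetical name); the
# constructor body is copied from the listing above.
class SketchBase
  attr_reader :language_code, :dimension, :vocabulary_size

  def initialize(language_code:, dimension:)
    raise ArgumentError, "Language code cannot be nil" if language_code.nil?
    raise ArgumentError, "Dimension must be positive" unless dimension&.positive?

    @language_code = language_code
    @dimension = dimension
    @vocabulary_size = 0
    freeze
  end
end

# Hypothetical in-memory subclass: state is assigned BEFORE super,
# because super freezes the instance.
class TinyModel < SketchBase
  def initialize(vectors, language_code:)
    @vectors = vectors
    super(language_code: language_code, dimension: vectors.values.first.size)
  end

  def embedding_for(word)
    @vectors[word]
  end

  def vocabulary
    @vectors.keys
  end
end

# Counter-example: assigning after super fails on the frozen instance.
class LateModel < SketchBase
  def initialize(language_code:, dimension:)
    super
    @vectors = {} # raises FrozenError
  end
end

model = TinyModel.new({ "hello" => [1.0, 0.0], "world" => [0.0, 1.0] }, language_code: "en")
model.frozen? # => true
```

In this sketch `vocabulary_size` stays at 0, since the base constructor sets it just before freezing. How the real subclasses (FastTextModel, OnnxModel) handle this is not shown on this page.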

Instance Attribute Details

#dimension ⇒ Object (readonly)

Returns the value of attribute dimension.



# File 'lib/kotoshu/models/embedding_model.rb', line 18

def dimension
  @dimension
end

#language_code ⇒ Object (readonly)

Returns the value of attribute language_code.



# File 'lib/kotoshu/models/embedding_model.rb', line 18

def language_code
  @language_code
end

#vocabulary_size ⇒ Object (readonly)

Returns the value of attribute vocabulary_size.



# File 'lib/kotoshu/models/embedding_model.rb', line 18

def vocabulary_size
  @vocabulary_size
end

Instance Method Details

#distance(word1, word2) ⇒ Float?

Calculate Euclidean distance between two words.

Parameters:

  • word1 (String)

    First word

  • word2 (String)

    Second word

Returns:

  • (Float, nil)

    Distance, or nil if either word is not found



# File 'lib/kotoshu/models/embedding_model.rb', line 70

def distance(word1, word2)
  emb1 = embedding_for(word1)
  emb2 = embedding_for(word2)

  return nil unless emb1 && emb2

  emb1.distance(emb2)
end
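
`WordEmbedding#distance` itself is not shown on this page. As a rough illustration of what a Euclidean distance between two vectors computes, here is a sketch over plain Ruby arrays. `euclidean_distance` is a hypothetical helper, not part of Kotoshu.

```ruby
# Hypothetical helper: Euclidean (L2) distance between two
# equal-length arrays of numbers.
def euclidean_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  # Square root of the sum of squared component differences.
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

euclidean_distance([0.0, 0.0], [3.0, 4.0]) # => 5.0
```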

#embedding_for(word) ⇒ WordEmbedding?

This method is abstract.

Subclass must implement

Get embedding vector for a word.

Parameters:

  • word (String)

    The word to lookup

Returns:

  • (WordEmbedding, nil)

    The embedding, or nil if the word is not found

Raises:

  • (NotImplementedError)


# File 'lib/kotoshu/models/embedding_model.rb', line 39

def embedding_for(word)
  raise NotImplementedError, "#{self.class} must implement #embedding_for"
end

#has_word?(word) ⇒ Boolean

Check if a word is in the vocabulary.

Parameters:

  • word (String)

    The word to check

Returns:

  • (Boolean)

    True if word exists in vocabulary



# File 'lib/kotoshu/models/embedding_model.rb', line 47

def has_word?(word)
  vocabulary.include?(word)
end

#loaded? ⇒ Boolean

Check if model is loaded.

Returns:

  • (Boolean)

    True if model is loaded and ready



# File 'lib/kotoshu/models/embedding_model.rb', line 157

def loaded?
  @vocabulary_size&.positive? || vocabulary&.any?
end

#metadata ⇒ Hash

Get model metadata.

Returns:

  • (Hash)

    Model metadata



# File 'lib/kotoshu/models/embedding_model.rb', line 137

def metadata
  {
    language_code: @language_code,
    dimension: @dimension,
    vocabulary_size: @vocabulary_size,
    model_type: self.class.name
  }
end

#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>

Find k nearest neighbors for a word.

Parameters:

  • word (String)

    The query word

  • k (Integer) (defaults to: 10)

    Number of neighbors to return

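Returns:

  • (Array<NearestNeighbor>)

    Up to k neighbors, sorted by similarity (descending); empty if the word is not found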



# File 'lib/kotoshu/models/embedding_model.rb', line 84

def nearest_neighbors(word, k: 10)
  embedding = embedding_for(word)
  return [] unless embedding

  # Calculate similarity with all words in vocabulary
  neighbors = vocabulary.map do |vocab_word|
    next if vocab_word == word

    vocab_embedding = embedding_for(vocab_word)
    next unless vocab_embedding

    sim = embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      distance: embedding.distance(vocab_embedding),
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end

#nearest_neighbors_for_embedding(embedding, k: 10) ⇒ Array<NearestNeighbor>

Find k nearest neighbors for an embedding vector.

Parameters:

  • embedding (WordEmbedding)

    The query embedding

  • k (Integer) (defaults to: 10)

    Number of neighbors to return

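Returns:

  • (Array<NearestNeighbor>)

    Up to k neighbors, sorted by similarity (descending); empty if embedding is nil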



# File 'lib/kotoshu/models/embedding_model.rb', line 113

def nearest_neighbors_for_embedding(embedding, k: 10)
  return [] unless embedding

  # Calculate similarity with all words in vocabulary
  neighbors = vocabulary.map do |vocab_word|
    vocab_embedding = embedding_for(vocab_word)
    next unless vocab_embedding

    sim = embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      distance: embedding.distance(vocab_embedding),
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end
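
Both neighbor methods score every vocabulary word against the query, so each call costs time proportional to the vocabulary size times the dimension. A minimal, self-contained sketch of the same brute-force search over plain arrays follows. `Neighbor`, `cosine`, and `top_k` are hypothetical names, not Kotoshu's API, and the sketch uses `Enumerable#max_by(k)` in place of an explicit `sort.reverse.first(k)`.

```ruby
# Hypothetical result type; the real NearestNeighbor class is not shown here.
Neighbor = Struct.new(:word, :similarity, keyword_init: true)

# Cosine similarity over plain arrays (assumes non-zero vectors).
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Brute-force top-k: score every word except the query, keep the k best.
def top_k(vectors, query_word, k: 10)
  query = vectors[query_word]
  return [] unless query

  vectors
    .reject { |word, _| word == query_word }
    .map { |word, vec| Neighbor.new(word: word, similarity: cosine(query, vec)) }
    .max_by(k, &:similarity)
end

vectors = {
  "king" => [1.0, 0.0],
  "queen" => [2.0, 0.0],
  "apple" => [0.0, 1.0],
  "pear" => [-1.0, 0.0]
}
top_k(vectors, "king", k: 2).map(&:word) # => ["queen", "apple"]
```

For large vocabularies, an approximate nearest-neighbor index is the usual alternative to a full scan. Whether any Kotoshu subclass does this is not stated on this page.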

#similarity(word1, word2) ⇒ Float?

Calculate cosine similarity between two words.

Parameters:

  • word1 (String)

    First word

  • word2 (String)

    Second word

Returns:

  • (Float, nil)

    Similarity score, or nil if either word is not found (documented range 0.0 to 1.0; unclamped cosine similarity can go as low as -1.0)



# File 'lib/kotoshu/models/embedding_model.rb', line 56

def similarity(word1, word2)
  emb1 = embedding_for(word1)
  emb2 = embedding_for(word2)

  return nil unless emb1 && emb2

  emb1.similarity(emb2)
end
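
`WordEmbedding#similarity` is not shown on this page. As an illustration of the underlying formula (dot product divided by the product of the vector norms), here is a sketch over plain Ruby arrays. `cosine_similarity` is a hypothetical helper, not part of Kotoshu, and returning nil for a zero vector is an assumption of this sketch.

```ruby
# Hypothetical helper: cosine similarity between two equal-length arrays.
# Returns nil when either vector has zero length, where the cosine is undefined.
def cosine_similarity(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return nil if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

cosine_similarity([1.0, 0.0], [2.0, 0.0])  # => 1.0  (same direction)
cosine_similarity([1.0, 0.0], [0.0, 3.0])  # => 0.0  (orthogonal)
cosine_similarity([1.0, 0.0], [-1.0, 0.0]) # => -1.0 (opposite direction)
```

The last call shows that the raw formula can produce negative values. Whether Kotoshu clamps or rescales the result is not visible from this page.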

#statistics ⇒ Hash

Get model statistics.

Returns:

  • (Hash)

    Statistics about the model



# File 'lib/kotoshu/models/embedding_model.rb', line 164

def statistics
  {
    language: @language_code,
    dimension: @dimension,
    vocabulary_size: @vocabulary_size,
    loaded: loaded?
  }
end

#to_s ⇒ String Also known as: inspect

String representation.

Returns:

  • (String)

    Human-readable representation



# File 'lib/kotoshu/models/embedding_model.rb', line 176

def to_s
  "#{self.class.name}(language: #{@language_code}, dim: #{@dimension}, vocab: #{@vocabulary_size})"
end

#vocabulary ⇒ Array<String>

This method is abstract.

Subclass must implement

Get the vocabulary (all words in the model).

Returns:

  • (Array<String>)

    Vocabulary words

Raises:

  • (NotImplementedError)


# File 'lib/kotoshu/models/embedding_model.rb', line 150

def vocabulary
  raise NotImplementedError, "#{self.class} must implement #vocabulary"
end