Class: Kotoshu::Models::EmbeddingModel Abstract
- Inherits: Object
  - Object
  - Kotoshu::Models::EmbeddingModel
- Defined in: lib/kotoshu/models/embedding_model.rb
Overview
Abstract base class for word embedding models.
Provides a unified interface for loading and querying word embeddings from different sources (FastText, Word2Vec, GloVe, ONNX, etc.).
This class is abstract: subclasses must implement #embedding_for and #vocabulary.
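The contract above can be sketched with a small in-memory subclass. This is only an illustration under stated assumptions: `StubEmbeddingModel` is a stand-in for the real base class so the snippet runs without the gem, `InMemoryModel` is a hypothetical name, and plain arrays take the place of the `WordEmbedding` objects that the real interface returns.

```ruby
# Stand-in for Kotoshu::Models::EmbeddingModel so the sketch runs on its own.
# It mirrors only the two abstract methods and has_word? from this page.
class StubEmbeddingModel
  def embedding_for(word)
    raise NotImplementedError, "#{self.class} must implement #embedding_for"
  end

  def vocabulary
    raise NotImplementedError, "#{self.class} must implement #vocabulary"
  end

  def has_word?(word)
    vocabulary.include?(word)
  end
end

# Hypothetical subclass backed by a Hash of word => vector.
class InMemoryModel < StubEmbeddingModel
  def initialize(vectors)
    super()
    @vectors = vectors
  end

  # Returns the stored vector, or nil for unknown words.
  def embedding_for(word)
    @vectors[word]
  end

  # All words known to the model.
  def vocabulary
    @vectors.keys
  end
end

model = InMemoryModel.new("cat" => [1.0, 0.0], "dog" => [0.8, 0.6])
p model.vocabulary         # ["cat", "dog"]
p model.has_word?("bird")  # false
```

Once the two abstract methods exist, inherited helpers such as #has_word? work without further code in the subclass.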
Direct Known Subclasses
Instance Attribute Summary
- #dimension ⇒ Object (readonly)
  Returns the value of attribute dimension.
- #language_code ⇒ Object (readonly)
  Returns the value of attribute language_code.
- #vocabulary_size ⇒ Object (readonly)
  Returns the value of attribute vocabulary_size.
Instance Method Summary
- #distance(word1, word2) ⇒ Float?
  Calculate Euclidean distance between two words.
- #embedding_for(word) ⇒ WordEmbedding? (abstract)
  Get embedding vector for a word.
- #has_word?(word) ⇒ Boolean
  Check if a word is in the vocabulary.
- #initialize(language_code:, dimension:) ⇒ EmbeddingModel (constructor)
  Create a new embedding model.
- #loaded? ⇒ Boolean
  Check if model is loaded.
- #metadata ⇒ Hash
  Get model metadata.
- #nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>
  Find k nearest neighbors for a word.
- #nearest_neighbors_for_embedding(embedding, k: 10) ⇒ Array<NearestNeighbor>
  Find k nearest neighbors for an embedding vector.
- #similarity(word1, word2) ⇒ Float?
  Calculate cosine similarity between two words.
- #statistics ⇒ Hash
  Get model statistics.
- #to_s ⇒ String (also: #inspect)
  String representation.
- #vocabulary ⇒ Array<String> (abstract)
  Get the vocabulary (all words in the model).
Constructor Details
#initialize(language_code:, dimension:) ⇒ EmbeddingModel
Create a new embedding model.
# File 'lib/kotoshu/models/embedding_model.rb', line 24
def initialize(language_code:, dimension:)
  raise ArgumentError, "Language code cannot be nil" if language_code.nil?
  raise ArgumentError, "Dimension must be positive" unless dimension&.positive?

  @language_code = language_code
  @dimension = dimension
  @vocabulary_size = 0
  freeze
end
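Because the constructor ends with `freeze`, no instance variable can be assigned once it returns. A subclass that keeps its own state would therefore have to assign it before calling `super`. The sketch below illustrates this with a stand-in `BaseModel` that copies the constructor listing; `BaseModel`, `HashModel`, and `late_assignment` are hypothetical names, not part of the library.

```ruby
# Stand-in that copies the constructor shown above; not the real class.
class BaseModel
  attr_reader :language_code, :dimension, :vocabulary_size

  def initialize(language_code:, dimension:)
    raise ArgumentError, "Language code cannot be nil" if language_code.nil?
    raise ArgumentError, "Dimension must be positive" unless dimension&.positive?

    @language_code = language_code
    @dimension = dimension
    @vocabulary_size = 0
    freeze
  end
end

# Hypothetical subclass: state is assigned before super, because the
# instance is already frozen by the time super returns.
class HashModel < BaseModel
  def initialize(vectors, language_code:, dimension:)
    @vectors = vectors
    super(language_code: language_code, dimension: dimension)
  end

  # Assigning a new instance variable on a frozen object raises FrozenError.
  def late_assignment
    @cache = {}
  end
end

model = HashModel.new({ "cat" => [1.0, 0.0] }, language_code: "en", dimension: 2)
puts model.frozen? # true

begin
  model.late_assignment
rescue FrozenError => e
  puts "late assignment failed: #{e.class}"
end

begin
  BaseModel.new(language_code: "en", dimension: 0)
rescue ArgumentError => e
  puts e.message # Dimension must be positive
end
```

Freezing protects only the instance variables themselves. Objects they reference, such as the hash in `@vectors`, stay mutable unless they are frozen separately.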
Instance Attribute Details
#dimension ⇒ Object (readonly)
Returns the value of attribute dimension.
# File 'lib/kotoshu/models/embedding_model.rb', line 18
def dimension
  @dimension
end
#language_code ⇒ Object (readonly)
Returns the value of attribute language_code.
# File 'lib/kotoshu/models/embedding_model.rb', line 18
def language_code
  @language_code
end
#vocabulary_size ⇒ Object (readonly)
Returns the value of attribute vocabulary_size.
# File 'lib/kotoshu/models/embedding_model.rb', line 18
def vocabulary_size
  @vocabulary_size
end
Instance Method Details
#distance(word1, word2) ⇒ Float?
Calculate Euclidean distance between two words.
# File 'lib/kotoshu/models/embedding_model.rb', line 70
def distance(word1, word2)
  emb1 = embedding_for(word1)
  emb2 = embedding_for(word2)

  return nil unless emb1 && emb2

  emb1.distance(emb2)
end
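The method delegates the arithmetic to `WordEmbedding#distance`, whose source is not shown on this page. Assuming it computes the standard Euclidean distance, the calculation on raw vectors would look roughly like the sketch below; `euclidean_distance` is a hypothetical helper, not a library method.

```ruby
# Euclidean distance between two equal-length numeric vectors:
# the square root of the sum of squared component differences.
def euclidean_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

puts euclidean_distance([0.0, 0.0], [3.0, 4.0]) # 5.0
```

A smaller value means the two words sit closer together in the embedding space, which is the opposite orientation to a similarity score.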
#embedding_for(word) ⇒ WordEmbedding?
Get embedding vector for a word.
This method is abstract: subclasses must implement it.
# File 'lib/kotoshu/models/embedding_model.rb', line 39
def embedding_for(word)
  raise NotImplementedError, "#{self.class} must implement #embedding_for"
end
#has_word?(word) ⇒ Boolean
Check if a word is in the vocabulary.
# File 'lib/kotoshu/models/embedding_model.rb', line 47
def has_word?(word)
  vocabulary.include?(word)
end
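Since #vocabulary returns an Array, `include?` is a linear scan over every word. For large vocabularies a subclass might prefer a hash-based lookup. The sketch below shows the idea with a hypothetical `VocabularyIndex` wrapper around Ruby's `Set`; it is not part of the library.

```ruby
require "set"

# Hypothetical helper: stores the word list in a Set so that membership
# checks use hashing instead of scanning the whole list.
class VocabularyIndex
  def initialize(words)
    @words = words.to_set
  end

  def has_word?(word)
    @words.include?(word)
  end
end

index = VocabularyIndex.new(%w[cat dog bird])
puts index.has_word?("dog")  # true
puts index.has_word?("fish") # false
```

Given that the base constructor freezes the instance, such an index would need to be built before `super` is called rather than memoized lazily in an instance variable.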
#loaded? ⇒ Boolean
Check if model is loaded.
# File 'lib/kotoshu/models/embedding_model.rb', line 157
def loaded?
  @vocabulary_size&.positive? || vocabulary&.any?
end
#metadata ⇒ Hash
Get model metadata.
# File 'lib/kotoshu/models/embedding_model.rb', line 137
def metadata
  {
    language_code: @language_code,
    dimension: @dimension,
    vocabulary_size: @vocabulary_size,
    model_type: self.class.name
  }
end
#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>
Find k nearest neighbors for a word.
# File 'lib/kotoshu/models/embedding_model.rb', line 84
def nearest_neighbors(word, k: 10)
  target_embedding = embedding_for(word)
  return [] unless target_embedding

  # Calculate similarity with all words in vocabulary
  neighbors = vocabulary.map do |vocab_word|
    next if vocab_word == word

    vocab_embedding = embedding_for(vocab_word)
    next unless vocab_embedding

    sim = target_embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      distance: target_embedding.distance(vocab_embedding),
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end
#nearest_neighbors_for_embedding(embedding, k: 10) ⇒ Array<NearestNeighbor>
Find k nearest neighbors for an embedding vector.
# File 'lib/kotoshu/models/embedding_model.rb', line 113
def nearest_neighbors_for_embedding(embedding, k: 10)
  return [] unless embedding

  # Calculate similarity with all words in vocabulary
  neighbors = vocabulary.map do |vocab_word|
    vocab_embedding = embedding_for(vocab_word)
    next unless vocab_embedding

    sim = embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      distance: embedding.distance(vocab_embedding),
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end
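Both neighbor searches end with `neighbors.sort.reverse.first(k)`, which only works if `NearestNeighbor` defines `<=>`. Its definition is not shown on this page, so the sketch below assumes it orders by similarity. `Neighbor` is a hypothetical stand-in, and `max_by(k)` is shown as one possible alternative for the final selection, not as the library's approach.

```ruby
# Hypothetical stand-in for NearestNeighbor, assumed comparable by similarity.
Neighbor = Struct.new(:word, :similarity) do
  include Comparable

  def <=>(other)
    similarity <=> other.similarity
  end
end

neighbors = [
  Neighbor.new("dog", 0.80),
  Neighbor.new("car", 0.05),
  Neighbor.new("kitten", 0.95)
]

# Same selection as the listings: full ascending sort, reverse, take k.
by_sort = neighbors.sort.reverse.first(2)

# Alternative: ask for the k largest directly, already in descending order.
by_max = neighbors.max_by(2, &:similarity)

puts by_sort.map(&:word).inspect # ["kitten", "dog"]
puts by_max.map(&:word).inspect  # ["kitten", "dog"]
```

Either way the search is brute force: every call scores the query against each word in the vocabulary, so the cost grows linearly with vocabulary size.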
#similarity(word1, word2) ⇒ Float?
Calculate cosine similarity between two words.
# File 'lib/kotoshu/models/embedding_model.rb', line 56
def similarity(word1, word2)
  emb1 = embedding_for(word1)
  emb2 = embedding_for(word2)

  return nil unless emb1 && emb2

  emb1.similarity(emb2)
end
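The cosine computation itself lives in `WordEmbedding#similarity`, which is not shown on this page. As a rough sketch of what cosine similarity involves, the hypothetical helper below works on raw arrays and, like the method above, returns nil when either input is missing. Returning 0.0 for a zero-length vector is an assumption made for this example only.

```ruby
# Cosine similarity: dot(a, b) / (|a| * |b|).
# Returns nil when a vector is missing, mirroring the nil result for unknown words.
def cosine_similarity(a, b)
  return nil unless a && b

  dot = a.zip(b).sum { |x, y| x * y }
  norm_product = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_product.zero? # assumption: treat zero vectors as dissimilar

  dot / norm_product
end

puts cosine_similarity([1.0, 0.0], [0.0, 1.0])          # orthogonal: 0.0
puts cosine_similarity([1.0, 2.0], [2.0, 4.0]).round(6) # parallel: 1.0
p cosine_similarity([1.0, 0.0], nil)                    # nil
```

Callers of #similarity should be prepared for the nil case, for example by checking #has_word? first or by using safe navigation on the result.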
#statistics ⇒ Hash
Get model statistics.
# File 'lib/kotoshu/models/embedding_model.rb', line 164
def statistics
  {
    language: @language_code,
    dimension: @dimension,
    vocabulary_size: @vocabulary_size,
    loaded: loaded?
  }
end
#to_s ⇒ String Also known as: inspect
String representation.
# File 'lib/kotoshu/models/embedding_model.rb', line 176
def to_s
  "#{self.class.name}(language: #{@language_code}, dim: #{@dimension}, vocab: #{@vocabulary_size})"
end
#vocabulary ⇒ Array<String>
Get the vocabulary (all words in the model).
This method is abstract: subclasses must implement it.
# File 'lib/kotoshu/models/embedding_model.rb', line 150
def vocabulary
  raise NotImplementedError, "#{self.class} must implement #vocabulary"
end