Class: Kotoshu::Models::FastTextModel
- Inherits:
- EmbeddingModel
- Inheritance chain: Object, EmbeddingModel, Kotoshu::Models::FastTextModel
- Defined in:
- lib/kotoshu/models/fasttext_model.rb
Overview
FastText embedding model implementation.
Loads FastText pre-trained word vectors from .vec files. Supports Common Crawl and Wikipedia trained vectors.
Constant Summary
- DEFAULT_DIMENSION =
Standard FastText dimension for crawl vectors.
300
- DEFAULT_MAX_VECTORS =
Number of vectors to load when reading from a file. FastText .vec files contain up to 2M words; a subset is loaded by default.
1_000_000
Instance Attribute Summary
-
#embeddings ⇒ Object
readonly
Returns the value of attribute embeddings.
-
#max_vectors ⇒ Object
readonly
Returns the value of attribute max_vectors.
Attributes inherited from EmbeddingModel
#dimension, #language_code, #vocabulary_size
Class Method Summary
-
.detect_language_from_path(path) ⇒ String
Detect language code from file path.
-
.from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil) ⇒ FastTextModel
Load FastText model from a .vec file.
-
.from_github(language_code, max_vectors: 500_000, cache: nil) ⇒ FastTextModel
Load FastText model from GitHub (via ModelCache).
Instance Method Summary
-
#batch_embeddings(words) ⇒ Hash<String, WordEmbedding>
Get batch embeddings for multiple words.
-
#batch_similarities(pairs) ⇒ Array<Float>
Get batch similarities for word pairs.
-
#embedding_for(word) ⇒ WordEmbedding?
Get embedding vector for a word.
-
#initialize(language_code:, dimension: DEFAULT_DIMENSION, embeddings: {}, max_vectors: DEFAULT_MAX_VECTORS) ⇒ FastTextModel
constructor
Create a new FastText model.
-
#loaded? ⇒ Boolean
Check if model is loaded.
-
#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>
Find k nearest neighbors for a word (optimized version).
-
#nearest_neighbors_for_embedding(embedding, k: 10) ⇒ Array<NearestNeighbor>
Find k nearest neighbors for an embedding vector (optimized version).
-
#vocabulary ⇒ Array<String>
Get the vocabulary (all words in the model).
Methods inherited from EmbeddingModel
#distance, #has_word?, #metadata, #similarity, #statistics, #to_s
Constructor Details
#initialize(language_code:, dimension: DEFAULT_DIMENSION, embeddings: {}, max_vectors: DEFAULT_MAX_VECTORS) ⇒ FastTextModel
Create a new FastText model.
# File 'lib/kotoshu/models/fasttext_model.rb', line 40
def initialize(language_code:, dimension: DEFAULT_DIMENSION, embeddings: {}, max_vectors: DEFAULT_MAX_VECTORS)
  super(language_code: language_code, dimension: dimension)
  @embeddings = embeddings.freeze
  @max_vectors = max_vectors
  @vocabulary_size = @embeddings.size
end
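The constructor freezes the embeddings hash it receives. As a minimal sketch of what that implies in plain Ruby (independent of Kotoshu; the table contents are made up for illustration), `freeze` returns the receiver itself and is shallow, so the caller's hash becomes immutable while the values inside it do not:

```ruby
# Hypothetical stand-in data; a real model maps words to WordEmbedding objects.
table = { 'cat' => [0.1, 0.2], 'dog' => [0.3, 0.4] }
frozen = table.freeze

# freeze returns the receiver itself, not a copy.
same_object = frozen.equal?(table)

# Adding a key to a frozen Hash raises FrozenError (Ruby 2.5+).
insert_rejected =
  begin
    table['bird'] = [0.5, 0.6]
    false
  rescue FrozenError
    true
  end

# The freeze is shallow: the value arrays themselves stay mutable.
values_still_mutable = !table['cat'].frozen?
```

If a caller needs to keep mutating its own hash, passing a `dup` of it is one way to avoid the side effect.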
Instance Attribute Details
#embeddings ⇒ Object (readonly)
Returns the value of attribute embeddings.
# File 'lib/kotoshu/models/fasttext_model.rb', line 32
def embeddings
  @embeddings
end
#max_vectors ⇒ Object (readonly)
Returns the value of attribute max_vectors.
# File 'lib/kotoshu/models/fasttext_model.rb', line 32
def max_vectors
  @max_vectors
end
Class Method Details
.detect_language_from_path(path) ⇒ String
Detect language code from file path.
# File 'lib/kotoshu/models/fasttext_model.rb', line 210
def self.detect_language_from_path(path)
  # Extract from path like "cc.en.300.vec" or "wiki.de.vec"
  if path =~ /\.([a-z]{2})\./i
    Regexp.last_match(1).downcase
  else
    'en' # Default to English
  end
end
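To see how the pattern above behaves on different paths, here is a small standalone sketch. It reuses the same regular expression; the helper name `language_from` and the sample paths are hypothetical. Note that the first two-letter segment enclosed in dots wins, so a dotted directory name can shadow the file name:

```ruby
# Same pattern as in the listing above, wrapped in a standalone helper
# (the helper name is hypothetical).
LANGUAGE_PATTERN = /\.([a-z]{2})\./i

def language_from(path)
  match = LANGUAGE_PATTERN.match(path)
  match ? match[1].downcase : 'en'
end

standard  = language_from('cc.en.300.vec')
wiki      = language_from('wiki.de.vec')
uppercase = language_from('cc.FR.300.vec')
fallback  = language_from('vectors.vec')

# The first two-letter dotted segment wins, even inside a directory name.
surprise  = language_from('my.db.backup/cc.fr.300.vec')
```

When the path layout is not under your control, passing `language_code:` explicitly to `from_file` sidesteps this detection.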
.from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil) ⇒ FastTextModel
Load FastText model from a .vec file.
# File 'lib/kotoshu/models/fasttext_model.rb', line 54
def self.from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil)
  raise ArgumentError, "File not found: #{file_path}" unless File.exist?(file_path)

  # Detect language from filename if not provided
  language_code ||= detect_language_from_path(file_path)

  # Parse the .vec file
  embeddings = {}
  dimension = nil
  count = 0

  File.open(file_path, 'r', encoding: 'UTF-8') do |file|
    # First line: vocab_size dimension
    first_line = file.gets
    header = first_line.split
    _vocab_size = header[0].to_i
    dimension = header[1].to_i

    # Read vectors
    file.each_line do |line|
      break if count >= max_vectors

      parts = line.split
      word = parts[0]
      vector = parts[1..-1].map(&:to_f)

      next unless vector.size == dimension

      embeddings[word] = WordEmbedding.new(word, vector, language_code, dimension: dimension)
      count += 1
    end
  end

  new(language_code: language_code, dimension: dimension, embeddings: embeddings, max_vectors: max_vectors)
end
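The .vec layout that this loader expects (a header line with vocabulary size and dimension, then one word and its values per line) can be illustrated without Kotoshu. The sketch below is a simplified stand-in, not the library's code: `parse_vec` and the sample payload are hypothetical, and it stores plain float arrays rather than `WordEmbedding` objects.

```ruby
require 'stringio'

# A tiny, made-up .vec payload: header line, then one word per line.
SAMPLE_VEC = <<~VEC
  3 2
  cat 0.1 0.2
  dog 0.3 0.4
  broken 0.5
VEC

# Parse .vec text from any IO-like object into [dimension, { word => [floats] }].
def parse_vec(io, max_vectors: 1_000_000)
  _vocab_size, dimension = io.gets.split.map(&:to_i)
  vectors = {}

  io.each_line do |line|
    break if vectors.size >= max_vectors

    word, *values = line.split
    next unless values.size == dimension # skip rows with the wrong length

    vectors[word] = values.map(&:to_f)
  end

  [dimension, vectors]
end

dimension, vectors = parse_vec(StringIO.new(SAMPLE_VEC))
limited = parse_vec(StringIO.new(SAMPLE_VEC), max_vectors: 1).last
```

Rows whose value count does not match the header dimension are skipped silently, so the loaded vocabulary can be smaller than the header's vocabulary size even when `max_vectors` is not reached.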
.from_github(language_code, max_vectors: 500_000, cache: nil) ⇒ FastTextModel
Load FastText model from GitHub (via ModelCache).
Downloads the .vec file from the kotoshu/dictionaries repository.
# File 'lib/kotoshu/models/fasttext_model.rb', line 99
def self.from_github(language_code, max_vectors: 500_000, cache: nil)
  require_relative '../cache/model_cache'

  cache ||= Cache::ModelCache.new

  # Get the .vec file path from cache
  vec_file = cache.get_fasttext_model(language_code)

  from_file(vec_file, max_vectors: max_vectors, language_code: language_code)
end
Instance Method Details
#batch_embeddings(words) ⇒ Hash<String, WordEmbedding>
Get batch embeddings for multiple words.
# File 'lib/kotoshu/models/fasttext_model.rb', line 189
def batch_embeddings(words)
  words.each_with_object({}) do |word, hash|
    emb = embedding_for(word)
    hash[word] = emb if emb
  end
end
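One consequence of the `each_with_object` pattern above is that words without an embedding are left out of the result rather than mapped to `nil`. A minimal standalone sketch of the same pattern, using a hypothetical lookup table and helper name in place of the model:

```ruby
# Hypothetical lookup table standing in for the model's embeddings.
TABLE = { 'cat' => [0.1, 0.2], 'dog' => [0.3, 0.4] }.freeze

# Collect only the words that are present in the table.
def lookup_many(table, words)
  words.each_with_object({}) do |word, found|
    vector = table[word]
    found[word] = vector if vector
  end
end

result = lookup_many(TABLE, %w[cat unicorn dog])
```

Callers that need to know which words were missing can compare the input list against `result.keys`.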
#batch_similarities(pairs) ⇒ Array<Float>
Get batch similarities for word pairs.
# File 'lib/kotoshu/models/fasttext_model.rb', line 200
def batch_similarities(pairs)
  pairs.map { |word1, word2| similarity(word1, word2) }
end
#embedding_for(word) ⇒ WordEmbedding?
Get embedding vector for a word.
# File 'lib/kotoshu/models/fasttext_model.rb', line 114
def embedding_for(word)
  return nil if word.nil? || word.empty?

  # Direct lookup
  @embeddings[word]
end
#loaded? ⇒ Boolean
Check if model is loaded.
# File 'lib/kotoshu/models/fasttext_model.rb', line 131
def loaded?
  @embeddings&.any?
end
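The safe-navigation call in `loaded?` has three possible outcomes in plain Ruby. The sketch below uses a hypothetical helper, `loaded_like?`, to show them; whether the `nil` case can actually occur in a constructed model is not something this page confirms.

```ruby
# Hypothetical helper mirroring the expression used by loaded?.
def loaded_like?(embeddings)
  embeddings&.any?
end

populated = loaded_like?({ 'cat' => [0.1] }) # at least one entry
empty     = loaded_like?({})                 # no entries
missing   = loaded_like?(nil)                # safe navigation short-circuits
```

The last case yields `nil` rather than `false`, which is still falsy in conditionals but matters if the result is compared with `== false`.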
#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>
Find k nearest neighbors for a word (optimized version).
Overrides the base implementation for better performance using pre-loaded embeddings instead of repeated lookups.
# File 'lib/kotoshu/models/fasttext_model.rb', line 143
def nearest_neighbors(word, k: 10)
  target_embedding = embedding_for(word)
  return [] unless target_embedding

  # Calculate similarity with all words in vocabulary
  neighbors = @embeddings.map do |vocab_word, vocab_embedding|
    next if vocab_word == word

    sim = target_embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end
#nearest_neighbors_for_embedding(embedding, k: 10) ⇒ Array<NearestNeighbor>
Find k nearest neighbors for an embedding vector (optimized version).
# File 'lib/kotoshu/models/fasttext_model.rb', line 168
def nearest_neighbors_for_embedding(embedding, k: 10)
  return [] unless embedding

  # Calculate similarity with all words in vocabulary
  neighbors = @embeddings.map do |vocab_word, vocab_embedding|
    sim = embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end
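Both nearest-neighbor methods above score every vocabulary entry and keep the top k. The following standalone sketch shows that brute-force idea on plain arrays. It assumes cosine similarity as the measure (this page does not show how `WordEmbedding#similarity` is computed), uses hypothetical helper names and made-up vectors, and selects with `max_by(k)` in place of the full sort used in the listings.

```ruby
# Cosine similarity between two equal-length numeric arrays.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  norm.zero? ? 0.0 : dot / norm
end

# Return the k [word, similarity] pairs most similar to `query`,
# optionally skipping one word (for example the query word itself).
def top_k(vectors, query, k: 10, exclude: nil)
  scored = vectors.filter_map do |word, vector|
    [word, cosine(query, vector)] unless word == exclude
  end
  scored.max_by(k) { |_word, similarity| similarity }
end

# Made-up two-dimensional vectors for illustration.
VECTORS = {
  'cat'   => [1.0, 0.0],
  'tiger' => [0.9, 0.1],
  'car'   => [0.0, 1.0]
}.freeze

neighbors = top_k(VECTORS, VECTORS['cat'], k: 2, exclude: 'cat')
```

Every query still visits the whole vocabulary, so the cost grows linearly with the number of loaded vectors; `max_by(k)` only avoids sorting the full list when k is small.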
#vocabulary ⇒ Array<String>
Get the vocabulary (all words in the model).
# File 'lib/kotoshu/models/fasttext_model.rb', line 124
def vocabulary
  @embeddings.keys
end