Class: Kotoshu::Models::FastTextModel

Inherits:
EmbeddingModel
Defined in:
lib/kotoshu/models/fasttext_model.rb

Overview

FastText embedding model implementation.

Loads pre-trained FastText word vectors from .vec files. Supports vectors trained on Common Crawl and on Wikipedia.

Examples:

Loading from file

model = FastTextModel.from_file('cc.en.300.vec')
model.embedding_for('hello')

Loading from GitHub

model = FastTextModel.from_github('en')
model.nearest_neighbors('hello', k: 10)

Constant Summary

DEFAULT_DIMENSION = 300

Standard FastText dimension for crawl vectors.

DEFAULT_MAX_VECTORS = 1_000_000

Number of vectors to load when reading from a file. FastText .vec files contain up to 2M words; only a subset is loaded by default.

Instance Attribute Summary

Attributes inherited from EmbeddingModel

#dimension, #language_code, #vocabulary_size

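Class Method Summary

  • .detect_language_from_path(path) ⇒ String
  • .from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil) ⇒ FastTextModel
  • .from_github(language_code, max_vectors: 500_000, cache: nil) ⇒ FastTextModel

Instance Method Summary

  • #batch_embeddings(words) ⇒ Hash<String, WordEmbedding>
  • #batch_similarities(pairs) ⇒ Array<Float>
  • #embedding_for(word) ⇒ WordEmbedding?
  • #loaded? ⇒ Boolean
  • #nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>
  • #nearest_neighbors_for_embedding(embedding, k: 10) ⇒ Array<NearestNeighbor>
  • #vocabulary ⇒ Array<String>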

Methods inherited from EmbeddingModel

#distance, #has_word?, #metadata, #similarity, #statistics, #to_s

Constructor Details

#initialize(language_code:, dimension: DEFAULT_DIMENSION, embeddings: {}, max_vectors: DEFAULT_MAX_VECTORS) ⇒ FastTextModel

Create a new FastText model.

Parameters:

  • language_code (String)

    ISO 639-1 language code

  • dimension (Integer) (defaults to: DEFAULT_DIMENSION)

    Vector dimension (default: 300)

  • embeddings (Hash<String, WordEmbedding>) (defaults to: {})

    Pre-loaded embeddings

  • max_vectors (Integer) (defaults to: DEFAULT_MAX_VECTORS)

    Maximum vectors to load from file



# File 'lib/kotoshu/models/fasttext_model.rb', line 40

def initialize(language_code:, dimension: DEFAULT_DIMENSION, embeddings: {}, max_vectors: DEFAULT_MAX_VECTORS)
  super(language_code: language_code, dimension: dimension)
  @embeddings = embeddings.freeze
  @max_vectors = max_vectors
  @vocabulary_size = @embeddings.size
end
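
One consequence of the constructor above is that `freeze` is called on the hash it receives, not on a copy, so the caller's own hash becomes read-only as well. The sketch below illustrates that Ruby behaviour with a hypothetical `TinyModel` stand-in that mirrors only the freeze and size logic; it is not the real class.

```ruby
# Hypothetical stand-in: copies only the freeze/size lines of the constructor.
class TinyModel
  attr_reader :embeddings, :vocabulary_size

  def initialize(embeddings: {})
    @embeddings = embeddings.freeze      # freezes the caller's object, not a copy
    @vocabulary_size = @embeddings.size  # computed once, at construction time
  end
end

source = { 'hello' => [0.1, 0.2] }
model = TinyModel.new(embeddings: source)

puts model.vocabulary_size               # size of the hash passed in
puts source.equal?(model.embeddings)     # same object, so it is frozen too

begin
  source['world'] = [0.3, 0.4]
rescue FrozenError => e
  puts "cannot modify: #{e.class}"
end
```

If the caller needs to keep mutating its hash, passing `source.dup` would be one way to avoid the shared freeze.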

Instance Attribute Details

#embeddings ⇒ Object (readonly)

Returns the value of attribute embeddings.



# File 'lib/kotoshu/models/fasttext_model.rb', line 32

def embeddings
  @embeddings
end

#max_vectors ⇒ Object (readonly)

Returns the value of attribute max_vectors.



# File 'lib/kotoshu/models/fasttext_model.rb', line 32

def max_vectors
  @max_vectors
end

Class Method Details

.detect_language_from_path(path) ⇒ String

Detect language code from file path.

Parameters:

  • path (String)

    File path

Returns:

  • (String)

    Detected language code



# File 'lib/kotoshu/models/fasttext_model.rb', line 210

def self.detect_language_from_path(path)
  # Extract from path like "cc.en.300.vec" or "wiki.de.vec"
  if path =~ /\.([a-z]{2})\./i
    Regexp.last_match(1).downcase
  else
    'en'  # Default to English
  end
end
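
To see which paths the pattern above recognises, here is a standalone copy of the same regular expression under a hypothetical name (`detect_language`), tried on a few sample file names. Note that only two-letter codes enclosed by dots match; anything else falls back to `'en'`.

```ruby
# Standalone copy of the pattern shown above, for experimentation only.
def detect_language(path)
  if path =~ /\.([a-z]{2})\./i
    Regexp.last_match(1).downcase
  else
    'en'
  end
end

samples = ['cc.en.300.vec', 'wiki.de.vec', 'cc.FR.300.vec', 'vectors.vec', 'cc.ast.300.vec']
samples.each { |path| puts "#{path} -> #{detect_language(path)}" }
```

The last two samples have no two-letter segment between dots (one has no code at all, the other a three-letter code), so both resolve to the English default. Passing `language_code:` explicitly to `from_file` avoids relying on the file name.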

.from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil) ⇒ FastTextModel

Load FastText model from a .vec file.

Parameters:

  • file_path (String)

    Path to FastText .vec file

  • max_vectors (Integer) (defaults to: DEFAULT_MAX_VECTORS)

    Maximum vectors to load (default: 1M)

  • language_code (String) (defaults to: nil)

    Language code (auto-detected from the filename when nil)

Returns:

  • (FastTextModel)

    Model holding the loaded vectors

Raises:

  • (ArgumentError)

    if file doesn’t exist



# File 'lib/kotoshu/models/fasttext_model.rb', line 54

def self.from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil)
  raise ArgumentError, "File not found: #{file_path}" unless File.exist?(file_path)

  # Detect language from filename if not provided
  language_code ||= detect_language_from_path(file_path)

  # Parse the .vec file
  embeddings = {}
  dimension = nil
  count = 0

  File.open(file_path, 'r', encoding: 'UTF-8') do |file|
    # First line: vocab_size dimension
    first_line = file.gets
    header = first_line.split
    _vocab_size = header[0].to_i
    dimension = header[1].to_i

    # Read vectors
    file.each_line do |line|
      break if count >= max_vectors

      parts = line.split
      word = parts[0]
      vector = parts[1..-1].map(&:to_f)

      next unless vector.size == dimension

      embeddings[word] = WordEmbedding.new(word, vector, language_code, dimension: dimension)
      count += 1
    end
  end

  new(language_code: language_code, dimension: dimension, embeddings: embeddings, max_vectors: max_vectors)
end
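
The text layout that the method above expects is a header line with the vocabulary size and the dimension, followed by one `word v1 v2 ... vN` line per entry. The sketch below parses a tiny in-memory sample the same way. `parse_vec` is a hypothetical helper, and plain Float arrays stand in for `WordEmbedding` objects.

```ruby
require 'stringio'

# Tiny sample in the .vec layout; the last row is deliberately too short.
sample_vec = <<~VEC
  3 4
  hello 0.1 0.2 0.3 0.4
  world 0.5 0.6 0.7 0.8
  broken 0.9 1.0
VEC

# Hypothetical helper: returns [dimension, { word => Array<Float> }].
def parse_vec(io, max_vectors: 1_000_000)
  header = io.gets.split          # first line: "vocab_size dimension"
  dimension = header[1].to_i
  vectors = {}

  io.each_line do |line|
    break if vectors.size >= max_vectors

    word, *values = line.split
    vector = values.map(&:to_f)
    next unless vector.size == dimension   # skip rows of the wrong length

    vectors[word] = vector
  end

  [dimension, vectors]
end

dimension, vectors = parse_vec(StringIO.new(sample_vec))
puts dimension
puts vectors.keys.inspect

_, limited = parse_vec(StringIO.new(sample_vec), max_vectors: 1)
puts limited.keys.inspect
```

Rows whose length does not match the header's dimension are skipped silently, so the number of loaded vectors can be smaller than both the header count and `max_vectors`.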

.from_github(language_code, max_vectors: 500_000, cache: nil) ⇒ FastTextModel

Load FastText model from GitHub (via ModelCache).

Downloads the .vec file from kotoshu/dictionaries repository.

Parameters:

  • language_code (String)

    ISO 639-1 language code (de, en, es, fr, pt, ru)

  • max_vectors (Integer) (defaults to: 500_000)

    Maximum vectors to load (default: 500K for GitHub)

  • cache (ModelCache, nil) (defaults to: nil)

    Optional cache instance

Returns:

  • (FastTextModel)

    Model loaded from the cached .vec file

Raises:

  • (ArgumentError)

    if language not supported



# File 'lib/kotoshu/models/fasttext_model.rb', line 99

def self.from_github(language_code, max_vectors: 500_000, cache: nil)
  require_relative '../cache/model_cache'

  cache ||= Cache::ModelCache.new

  # Get the .vec file path from cache
  vec_file = cache.get_fasttext_model(language_code)

  from_file(vec_file, max_vectors: max_vectors, language_code: language_code)
end

Instance Method Details

#batch_embeddings(words) ⇒ Hash<String, WordEmbedding>

Get batch embeddings for multiple words.

Parameters:

  • words (Array<String>)

    Words to lookup

Returns:

  • (Hash<String, WordEmbedding>)

    Embeddings keyed by word; words without an embedding are omitted



# File 'lib/kotoshu/models/fasttext_model.rb', line 189

def batch_embeddings(words)
  words.each_with_object({}) do |word, hash|
    emb = embedding_for(word)
    hash[word] = emb if emb
  end
end
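
Because the method above only stores a word when a lookup succeeds, the returned hash can be smaller than the input list. A minimal sketch of the same `each_with_object` pattern follows, with a plain hash and a lambda (both hypothetical) standing in for the model and `embedding_for`:

```ruby
# Plain hash standing in for the model's embedding table (made-up data).
table = { 'cat' => [1.0, 0.0], 'dog' => [0.9, 0.1] }

# Mirrors the nil/empty guard and direct lookup of embedding_for.
lookup = ->(word) { word.nil? || word.empty? ? nil : table[word] }

words = %w[cat unicorn dog]

found = words.each_with_object({}) do |word, hash|
  emb = lookup.call(word)
  hash[word] = emb if emb   # unknown words are dropped, not mapped to nil
end

missing = words - found.keys

puts found.keys.inspect
puts missing.inspect
```

Comparing the input against `found.keys`, as in the last step, is one way to find out which words were not in the vocabulary.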

#batch_similarities(pairs) ⇒ Array<Float>

Get batch similarities for word pairs.

Parameters:

  • pairs (Array<Array<String, String>>)

    Word pairs

Returns:

  • (Array<Float>)

    Similarity scores



# File 'lib/kotoshu/models/fasttext_model.rb', line 200

def batch_similarities(pairs)
  pairs.map { |word1, word2| similarity(word1, word2) }
end

#embedding_for(word) ⇒ WordEmbedding?

Get embedding vector for a word.

Parameters:

  • word (String)

    The word to lookup

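Returns:

  • (WordEmbedding, nil)

    Embedding for the word, or nil if the word is nil, empty, or not in the vocabulary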



# File 'lib/kotoshu/models/fasttext_model.rb', line 114

def embedding_for(word)
  return nil if word.nil? || word.empty?

  # Direct lookup
  @embeddings[word]
end

#loaded? ⇒ Boolean

Check if model is loaded.

Returns:

  • (Boolean)

    True if embeddings are loaded



# File 'lib/kotoshu/models/fasttext_model.rb', line 131

def loaded?
  @embeddings&.any?
end

#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>

Find k nearest neighbors for a word (optimized version).

Overrides the base implementation for better performance using pre-loaded embeddings instead of repeated lookups.

Parameters:

  • word (String)

    The query word

  • k (Integer) (defaults to: 10)

    Number of neighbors to return

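Returns:

  • (Array<NearestNeighbor>)

    Up to k neighbors, most similar first; empty if the word has no embedding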



# File 'lib/kotoshu/models/fasttext_model.rb', line 143

def nearest_neighbors(word, k: 10)
  embedding = embedding_for(word)
  return [] unless embedding

  # Calculate similarity with all words in vocabulary
  neighbors = @embeddings.map do |vocab_word, vocab_embedding|
    next if vocab_word == word

    sim = embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end
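
The search above is a brute-force scan: one similarity computation per vocabulary entry, then a sort. The sketch below shows the same idea on plain Float arrays. It assumes cosine similarity, which this page does not confirm for `WordEmbedding#similarity`, and the names `cosine` and `top_k` are hypothetical. It also uses `max_by(k)` in place of a full sort, which is a substitution and not what the method above does.

```ruby
# Cosine similarity between two equal-length Float arrays (assumed metric).
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Made-up 2-dimensional vocabulary; real FastText vectors have 300 dimensions.
vectors = {
  'king'  => [1.0, 0.0],
  'queen' => [0.8, 0.6],
  'apple' => [0.0, 1.0],
  'pear'  => [-1.0, 0.0]
}

# Brute-force top-k: score every other word, keep the k highest scores.
def top_k(vectors, word, k)
  query = vectors[word]
  return [] unless query

  vectors
    .reject { |other, _| other == word }              # skip the query word itself
    .map { |other, vec| [other, cosine(query, vec)] }
    .max_by(k) { |_, sim| sim }                       # k best, highest first
end

names = top_k(vectors, 'king', 2).map(&:first)
puts names.inspect
puts top_k(vectors, 'missing', 2).inspect
```

Each query costs time proportional to vocabulary size times dimension, which is worth keeping in mind when loading the default one million vectors.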

#nearest_neighbors_for_embedding(embedding, k: 10) ⇒ Array<NearestNeighbor>

Find k nearest neighbors for an embedding vector (optimized version).

Parameters:

  • embedding (WordEmbedding)

    The query embedding

  • k (Integer) (defaults to: 10)

    Number of neighbors to return

Returns:

  • (Array<NearestNeighbor>)

    Up to k neighbors, most similar first; empty if the embedding is nil



# File 'lib/kotoshu/models/fasttext_model.rb', line 168

def nearest_neighbors_for_embedding(embedding, k: 10)
  return [] unless embedding

  # Calculate similarity with all words in vocabulary
  neighbors = @embeddings.map do |vocab_word, vocab_embedding|
    sim = embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end

#vocabulary ⇒ Array<String>

Get the vocabulary (all words in the model).

Returns:

  • (Array<String>)

    Vocabulary words



# File 'lib/kotoshu/models/fasttext_model.rb', line 124

def vocabulary
  @embeddings.keys
end