Class: Kotoshu::Models::FastTextModel

Inherits:

EmbeddingModel

Object
EmbeddingModel
Kotoshu::Models::FastTextModel

Defined in: lib/kotoshu/models/fasttext_model.rb

Overview

FastText embedding model implementation.

Loads pre-trained FastText word vectors from .vec files. Supports vectors trained on Common Crawl and on Wikipedia.

Examples:

Loading from file

model = FastTextModel.from_file('cc.en.300.vec')
model.embedding_for('hello')

Loading from GitHub

model = FastTextModel.from_github('en')
model.nearest_neighbors('hello', k: 10)

Constant Summary

DEFAULT_DIMENSION = 300

Standard FastText dimension for Common Crawl vectors.

DEFAULT_MAX_VECTORS = 1_000_000

Number of vectors to load when reading from a file. FastText .vec files contain up to 2M words; only a subset is loaded by default.

Instance Attribute Summary

#embeddings ⇒ Object readonly

Returns the value of attribute embeddings.
#max_vectors ⇒ Object readonly

Returns the value of attribute max_vectors.

Attributes inherited from EmbeddingModel

#dimension, #language_code, #vocabulary_size

Class Method Summary

.detect_language_from_path(path) ⇒ String

Detect language code from file path.
.from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil) ⇒ FastTextModel

Load FastText model from a .vec file.
.from_github(language_code, max_vectors: 500_000, cache: nil) ⇒ FastTextModel

Load FastText model from GitHub (via ModelCache).

Instance Method Summary

#batch_embeddings(words) ⇒ Hash<String, WordEmbedding>

Get batch embeddings for multiple words.
#batch_similarities(pairs) ⇒ Array<Float>

Get batch similarities for word pairs.
#embedding_for(word) ⇒ WordEmbedding?

Get embedding vector for a word.
#initialize(language_code:, dimension: DEFAULT_DIMENSION, embeddings: {}, max_vectors: DEFAULT_MAX_VECTORS) ⇒ FastTextModel constructor

Create a new FastText model.
#loaded? ⇒ Boolean

Check if model is loaded.
#nearest_neighbors(word, k: 10) ⇒ Array<NearestNeighbor>

Find k nearest neighbors for a word (optimized version).
#nearest_neighbors_for_embedding(embedding, k: 10) ⇒ Array<NearestNeighbor>

Find k nearest neighbors for an embedding vector (optimized version).
#vocabulary ⇒ Array<String>

Get the vocabulary (all words in the model).

Methods inherited from EmbeddingModel

#distance, #has_word?, #metadata, #similarity, #statistics, #to_s

Constructor Details

#initialize(language_code:, dimension: DEFAULT_DIMENSION, embeddings: {}, max_vectors: DEFAULT_MAX_VECTORS) ⇒ `FastTextModel`

Create a new FastText model.

Parameters:

language_code (String) —

ISO 639-1 language code
dimension (Integer) (defaults to: DEFAULT_DIMENSION) —

Vector dimension (default: 300)
embeddings (Hash<String, WordEmbedding>) (defaults to: {}) —

Pre-loaded embeddings
max_vectors (Integer) (defaults to: DEFAULT_MAX_VECTORS) —

Maximum vectors to load from file

# File 'lib/kotoshu/models/fasttext_model.rb', line 40

def initialize(language_code:, dimension: DEFAULT_DIMENSION, embeddings: {}, max_vectors: DEFAULT_MAX_VECTORS)
  super(language_code: language_code, dimension: dimension)
  @embeddings = embeddings.freeze
  @max_vectors = max_vectors
  @vocabulary_size = @embeddings.size
end
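
One consequence of `@embeddings = embeddings.freeze` is worth spelling out: `freeze` returns its receiver, so the hash passed by the caller is itself frozen rather than copied. The sketch below reproduces that line in a stand-in class (`TinyModel` is hypothetical and exists only for this illustration; it is not part of Kotoshu).

```ruby
# Hypothetical stand-in that mirrors only the constructor's freeze behaviour.
class TinyModel
  attr_reader :embeddings

  def initialize(embeddings: {})
    @embeddings = embeddings.freeze
  end
end

source = { 'hello' => [0.1, 0.2] }
model = TinyModel.new(embeddings: source)

model.embeddings.equal?(source) # => true, same object, no copy is made
source.frozen?                  # => true, the caller's hash is now frozen

begin
  source['world'] = [0.3, 0.4]  # adding a key to a frozen hash
rescue FrozenError => e
  puts "cannot modify: #{e.class}"
end

# The freeze is shallow: the value objects are not frozen by this call.
source['hello'].frozen?         # => false
```

Callers that need to keep mutating their own hash can pass `source.dup` instead.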

Instance Attribute Details

#embeddings ⇒ `Object` (readonly)

Returns the value of attribute embeddings.



# File 'lib/kotoshu/models/fasttext_model.rb', line 32

def embeddings
  @embeddings
end

#max_vectors ⇒ `Object` (readonly)

Returns the value of attribute max_vectors.



# File 'lib/kotoshu/models/fasttext_model.rb', line 32

def max_vectors
  @max_vectors
end

Class Method Details

.detect_language_from_path(path) ⇒ `String`

Detect language code from file path.

Parameters:

path (String) —

File path

Returns:

(String) —

Detected language code

# File 'lib/kotoshu/models/fasttext_model.rb', line 210

def self.detect_language_from_path(path)
  # Extract from path like "cc.en.300.vec" or "wiki.de.vec"
  if path =~ /\.([a-z]{2})\./i
    Regexp.last_match(1).downcase
  else
    'en'  # Default to English
  end
end
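
To see what the pattern accepts, the sketch below copies the same regular expression into a standalone method (`detect_language` is a hypothetical name used here so the snippet runs without Kotoshu) and tries it on a few paths. Note that the pattern scans the whole path, not just the basename.

```ruby
# Same regular expression as detect_language_from_path, in a standalone method.
def detect_language(path)
  path =~ /\.([a-z]{2})\./i ? Regexp.last_match(1).downcase : 'en'
end

detect_language('cc.en.300.vec')             # => "en"
detect_language('wiki.de.vec')               # => "de"
detect_language('cc.PT.300.vec')             # => "pt" (match is case-insensitive)
detect_language('vectors.vec')               # => "en" (no match, falls back to English)
detect_language('data.fr.old/cc.de.300.vec') # => "fr" (directory segment matches first)
```

When the path may contain a misleading segment, passing `language_code:` explicitly to `from_file` sidesteps detection.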

.from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil) ⇒ `FastTextModel`

Load FastText model from a .vec file.

Parameters:

file_path (String) —

Path to FastText .vec file
max_vectors (Integer) (defaults to: DEFAULT_MAX_VECTORS) —

Maximum vectors to load (default: 1M)
language_code (String) (defaults to: nil) —

Language code (auto-detected from filename)

Returns:

(FastTextModel) —

Loaded model

Raises:

(ArgumentError) —

if file doesn’t exist

# File 'lib/kotoshu/models/fasttext_model.rb', line 54

def self.from_file(file_path, max_vectors: DEFAULT_MAX_VECTORS, language_code: nil)
  raise ArgumentError, "File not found: #{file_path}" unless File.exist?(file_path)

  # Detect language from filename if not provided
  language_code ||= detect_language_from_path(file_path)

  # Parse the .vec file
  embeddings = {}
  dimension = nil
  count = 0

  File.open(file_path, 'r', encoding: 'UTF-8') do |file|
    # First line: vocab_size dimension
    first_line = file.gets
    metadata = first_line.split
    _vocab_size = metadata[0].to_i
    dimension = metadata[1].to_i

    # Read vectors
    file.each_line do |line|
      break if count >= max_vectors

      parts = line.split
      word = parts[0]
      vector = parts[1..-1].map(&:to_f)

      next unless vector.size == dimension

      embeddings[word] = WordEmbedding.new(word, vector, language_code, dimension: dimension)
      count += 1
    end
  end

  new(language_code: language_code, dimension: dimension, embeddings: embeddings, max_vectors: max_vectors)
end
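
The file layout that `from_file` expects (a header line with vocabulary size and dimension, then one word and its vector per line) can be exercised on a tiny sample. The sketch below is a simplified stand-in, not the library code: `load_vec`, `demo_load`, and `SAMPLE_VEC` are hypothetical names, and plain Float arrays replace `WordEmbedding`.

```ruby
require 'tempfile'

# Parse a .vec-style text file into [dimension, { word => [Float, ...] }].
def load_vec(path, max_vectors:)
  vectors = {}
  dimension = nil
  File.open(path, 'r', encoding: 'UTF-8') do |file|
    header = file.gets
    raise ArgumentError, "Empty file: #{path}" if header.nil?

    _vocab_size, dimension = header.split.map(&:to_i)
    file.each_line do |line|
      break if vectors.size >= max_vectors

      word, *values = line.split
      next unless values.size == dimension # skip malformed rows

      vectors[word] = values.map(&:to_f)
    end
  end
  [dimension, vectors]
end

SAMPLE_VEC = <<~VEC
  4 3
  the 0.1 0.2 0.3
  of 0.4 0.5 0.6
  broken 0.7 0.8
  and 0.9 1.0 1.1
VEC

# Write the sample to a temporary file and load it with the given cap.
def demo_load(max_vectors)
  Tempfile.create(['sample.en', '.vec']) do |f|
    f.write(SAMPLE_VEC)
    f.flush
    load_vec(f.path, max_vectors: max_vectors)
  end
end

dimension, vectors = demo_load(10)
puts dimension               # 3
puts vectors.keys.inspect    # ["the", "of", "and"]; the short "broken" row is skipped
puts demo_load(2)[1].keys.inspect # ["the", "of"]; the cap stops reading early
```

Because .vec files are typically ordered by word frequency, a cap like `max_vectors` tends to keep the most common words.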

.from_github(language_code, max_vectors: 500_000, cache: nil) ⇒ `FastTextModel`

Load FastText model from GitHub (via ModelCache).

Downloads the .vec file from kotoshu/dictionaries repository.

Parameters:

language_code (String) —

ISO 639-1 language code (de, en, es, fr, pt, ru)
max_vectors (Integer) (defaults to: 500_000) —

Maximum vectors to load (default: 500K for GitHub)
cache (ModelCache, nil) (defaults to: nil) —

Optional cache instance

Returns:

(FastTextModel) —

Loaded model

Raises:

(ArgumentError) —

if language not supported

# File 'lib/kotoshu/models/fasttext_model.rb', line 99

def self.from_github(language_code, max_vectors: 500_000, cache: nil)
  require_relative '../cache/model_cache'

  cache ||= Cache::ModelCache.new

  # Get the .vec file path from cache
  vec_file = cache.get_fasttext_model(language_code)

  from_file(vec_file, max_vectors: max_vectors, language_code: language_code)
end

Instance Method Details

#batch_embeddings(words) ⇒ `Hash<String, WordEmbedding>`

Get batch embeddings for multiple words.

Parameters:

words (Array<String>) —

Words to lookup

Returns:

(Hash<String, WordEmbedding>) —

Mapping of word to embedding

# File 'lib/kotoshu/models/fasttext_model.rb', line 189

def batch_embeddings(words)
  words.each_with_object({}) do |word, hash|
    emb = embedding_for(word)
    hash[word] = emb if emb
  end
end
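
Since unknown words are silently omitted from the result, callers often want to know which inputs were dropped. The sketch below shows the same skip-on-miss pattern over a plain Hash and one way to recover the misses; `ANIMAL_TABLE`, `batch_lookup`, and `missing_words` are hypothetical names for illustration only.

```ruby
# Plain-Hash stand-in for the model's embedding table (made-up vectors).
ANIMAL_TABLE = { 'cat' => [1.0, 0.0], 'dog' => [0.9, 0.1] }.freeze

# Same shape as batch_embeddings: words without an entry are left out.
def batch_lookup(table, words)
  words.each_with_object({}) do |word, found|
    next if word.nil? || word.empty?

    vector = table[word]
    found[word] = vector if vector
  end
end

# Words from the request that produced no entry.
def missing_words(table, words)
  words - batch_lookup(table, words).keys
end

batch_lookup(ANIMAL_TABLE, %w[cat unicorn dog]).keys # => ["cat", "dog"]
missing_words(ANIMAL_TABLE, %w[cat unicorn dog])     # => ["unicorn"]
```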

#batch_similarities(pairs) ⇒ `Array<Float>`

Get batch similarities for word pairs.

Parameters:

pairs (Array<Array<String, String>>) —

Word pairs

Returns:

(Array<Float>) —

Similarity scores



# File 'lib/kotoshu/models/fasttext_model.rb', line 200

def batch_similarities(pairs)
  pairs.map { |word1, word2| similarity(word1, word2) }
end

#embedding_for(word) ⇒ `WordEmbedding`?

Get embedding vector for a word.

Parameters:

word (String) —

The word to lookup

Returns:

(WordEmbedding, nil) —

Embedding vector or nil if not found

# File 'lib/kotoshu/models/fasttext_model.rb', line 114

def embedding_for(word)
  return nil if word.nil? || word.empty?

  # Direct lookup
  @embeddings[word]
end

#loaded? ⇒ `Boolean`

Check if model is loaded.

Returns:

(Boolean) —

True if embeddings are loaded



# File 'lib/kotoshu/models/fasttext_model.rb', line 131

def loaded?
  @embeddings&.any?
end

#nearest_neighbors(word, k: 10) ⇒ `Array<NearestNeighbor>`

Find k nearest neighbors for a word (optimized version).

Overrides the base implementation for better performance using pre-loaded embeddings instead of repeated lookups.

Parameters:

word (String) —

The query word
k (Integer) (defaults to: 10) —

Number of neighbors to return

Returns:

(Array<NearestNeighbor>) —

Nearest neighbors sorted by similarity

# File 'lib/kotoshu/models/fasttext_model.rb', line 143

def nearest_neighbors(word, k: 10)
  embedding = embedding_for(word)
  return [] unless embedding

  # Calculate similarity with all words in vocabulary
  neighbors = @embeddings.map do |vocab_word, vocab_embedding|
    next if vocab_word == word

    sim = embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end
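
The brute-force search above can be illustrated without the library. The sketch below assumes cosine similarity (the source does not state which measure `WordEmbedding#similarity` uses) and works on plain Float arrays; `cosine`, `top_k`, and `TOY_VECTORS` are hypothetical names. It also uses `max_by(k)` to select the top k, as an alternative to sorting the full list and reversing it.

```ruby
# Cosine similarity between two equal-length Float arrays.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Top-k neighbours of `word` in a { word => vector } table, most similar first.
def top_k(table, word, k: 10)
  query = table[word]
  return [] unless query

  table
    .reject { |other, _| other == word }             # never return the query itself
    .map { |other, vec| [other, cosine(query, vec)] }
    .max_by(k) { |_, sim| sim }                      # k largest, in descending order
end

# Made-up 2-dimensional vectors, chosen so the ranking is easy to check by hand.
TOY_VECTORS = {
  'cat'    => [1.0, 0.0],
  'kitten' => [0.9, 0.1],
  'dog'    => [0.6, 0.8],
  'car'    => [0.0, 1.0]
}.freeze

top_k(TOY_VECTORS, 'cat', k: 2).map(&:first) # => ["kitten", "dog"]
top_k(TOY_VECTORS, 'unknown')                # => []
```

Either way the cost is one similarity computation per vocabulary entry, so query time grows linearly with the number of loaded vectors.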

#nearest_neighbors_for_embedding(embedding, k: 10) ⇒ `Array<NearestNeighbor>`

Find k nearest neighbors for an embedding vector (optimized version).

Parameters:

embedding (WordEmbedding) —

The query embedding
k (Integer) (defaults to: 10) —

Number of neighbors to return

Returns:

(Array<NearestNeighbor>) —

Nearest neighbors sorted by similarity

# File 'lib/kotoshu/models/fasttext_model.rb', line 168

def nearest_neighbors_for_embedding(embedding, k: 10)
  return [] unless embedding

  # Calculate similarity with all words in vocabulary
  neighbors = @embeddings.map do |vocab_word, vocab_embedding|
    sim = embedding.similarity(vocab_embedding)
    NearestNeighbor.new(
      word: vocab_word,
      similarity: sim,
      embedding: vocab_embedding
    )
  end.compact

  # Sort by similarity (descending) and take top k
  neighbors.sort.reverse.first(k)
end

#vocabulary ⇒ `Array<String>`

Get the vocabulary (all words in the model).

Returns:

(Array<String>) —

Vocabulary words



# File 'lib/kotoshu/models/fasttext_model.rb', line 124

def vocabulary
  @embeddings.keys
end
