Class: Kotoshu::Models::WordEmbedding

Inherits:
Object
Defined in:
lib/kotoshu/models/word_embedding.rb

Overview

Immutable value object for word embeddings.

Represents a word and its vector representation in a semantic space. Used for semantic similarity calculations and nearest neighbor searches.

Examples:

Creating an embedding

# dimension: must equal the vector length; it defaults to 300
embedding = WordEmbedding.new("hello", [0.1, -0.2, 0.3], "en", dimension: 3)
embedding.similarity(other_embedding)  # => 0.85
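
The overview mentions nearest neighbor searches. A minimal sketch of that idea follows, assuming plain arrays in place of WordEmbedding objects. The `cosine` and `nearest_neighbors` helpers and the toy vectors are hypothetical and not part of Kotoshu.

```ruby
# Cosine similarity on plain arrays (hypothetical helper, not Kotoshu API).
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Rank every other word by similarity to the query and keep the top k.
def nearest_neighbors(query, vocabulary, k: 2)
  query_vector = vocabulary.fetch(query)
  vocabulary
    .reject { |word, _| word == query }
    .map { |word, vector| [word, cosine(query_vector, vector)] }
    .max_by(k) { |_, score| score }
    .map(&:first)
end

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
  "cat" => [0.9, 0.1, 0.0],
  "dog" => [0.8, 0.2, 0.1],
  "car" => [0.1, 0.9, 0.3],
  "bus" => [0.0, 0.8, 0.4]
}

neighbors = nearest_neighbors("cat", toy_vectors, k: 2)
# neighbors => ["dog", "car"]
```

A linear scan like this is adequate for small vocabularies; larger ones usually call for an approximate index.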

Constructor Details

#initialize(word, vector, language_code, dimension: 300) ⇒ WordEmbedding

Create a new word embedding.

Parameters:

  • word (String)

    The word

  • vector (Array<Float>)

    The word’s vector representation

  • language_code (String)

    ISO 639-1 language code

  • dimension (Integer) (defaults to: 300)

    Vector dimension (default: 300 for FastText)

Raises:

  • (ArgumentError)

    if vector.size does not equal dimension



# File 'lib/kotoshu/models/word_embedding.rb', line 25

def initialize(word, vector, language_code, dimension: 300)
  raise ArgumentError, "Vector dimension mismatch" unless vector.size == dimension

  @word = word
  @vector = vector.freeze
  @language_code = language_code
  @dimension = dimension

  freeze
end
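
A minimal sketch of how this constructor behaves, assuming a stand-in class (`EmbeddingSketch`, a hypothetical name) that copies the listing above so the example runs without the gem:

```ruby
# Stand-in that copies the constructor shown above; not the real Kotoshu class.
class EmbeddingSketch
  attr_reader :word, :vector, :language_code, :dimension

  def initialize(word, vector, language_code, dimension: 300)
    raise ArgumentError, "Vector dimension mismatch" unless vector.size == dimension

    @word = word
    @vector = vector.freeze
    @language_code = language_code
    @dimension = dimension

    freeze
  end
end

# A 3-element vector does not match the default dimension of 300.
mismatch_raised =
  begin
    EmbeddingSketch.new("hello", [0.1, -0.2, 0.3], "en")
    false
  rescue ArgumentError
    true
  end

# Passing dimension: explicitly makes the same vector acceptable.
small = EmbeddingSketch.new("hello", [0.1, -0.2, 0.3], "en", dimension: 3)

# The object and its vector are both frozen, so mutation raises FrozenError.
mutation_raised =
  begin
    small.vector << 0.4
    false
  rescue FrozenError
    true
  end
```

Note that `vector.freeze` freezes the array passed in by the caller rather than a copy, so the caller's own reference becomes immutable as well.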

Instance Attribute Details

#dimension ⇒ Object (readonly)

Returns the value of attribute dimension.



# File 'lib/kotoshu/models/word_embedding.rb', line 16

def dimension
  @dimension
end

#language_code ⇒ Object (readonly)

Returns the value of attribute language_code.



# File 'lib/kotoshu/models/word_embedding.rb', line 16

def language_code
  @language_code
end

#vector ⇒ Object (readonly)

Returns the value of attribute vector.



# File 'lib/kotoshu/models/word_embedding.rb', line 16

def vector
  @vector
end

#word ⇒ Object (readonly)

Returns the value of attribute word.



# File 'lib/kotoshu/models/word_embedding.rb', line 16

def word
  @word
end

Instance Method Details

#==(other) ⇒ Boolean Also known as: eql?

Check if this embedding is equal to another.

Parameters:

  • other (Object)

    Another object

Returns:

  • (Boolean)

    true if both word and language_code match; the vector is not compared



# File 'lib/kotoshu/models/word_embedding.rb', line 75

def ==(other)
  return false unless other.is_a?(WordEmbedding)

  @word == other.word && @language_code == other.language_code
end

#distance(other) ⇒ Float

Calculate Euclidean distance from another embedding.

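Parameters:

  • other (WordEmbedding)

    The embedding to measure the distance to

Returns:

  • (Float)

    Euclidean distance, or Float::INFINITY if the dimensions differ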

Raises:

  • (TypeError)

    if other is not a WordEmbedding



# File 'lib/kotoshu/models/word_embedding.rb', line 63

def distance(other)
  raise TypeError, "Must be WordEmbedding" unless other.is_a?(WordEmbedding)

  return Float::INFINITY if @dimension != other.dimension

  Math.sqrt(@vector.zip(other.vector).map { |a, b| (a - b)**2 }.sum)
end
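
A worked example of the same calculation, written as a standalone function on plain arrays so that it runs without the gem. The name `euclidean_distance` is hypothetical; the body mirrors the listing above.

```ruby
# Euclidean distance on plain arrays, mirroring the method body above.
def euclidean_distance(a, b)
  # Same fallback as the listing: mismatched dimensions yield infinity.
  return Float::INFINITY if a.size != b.size

  Math.sqrt(a.zip(b).map { |x, y| (x - y)**2 }.sum)
end

# The classic 3-4-5 right triangle: sqrt(3^2 + 4^2) = 5.
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])        # => 5.0

# Vectors of different length are treated as infinitely far apart.
mismatch = euclidean_distance([1.0], [1.0, 2.0])      # => Infinity
```

Returning Float::INFINITY instead of raising lets mismatched embeddings sort last when candidates are ranked by distance.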

#hash ⇒ Integer

Hash code derived from word and language_code, consistent with #==, so embeddings can serve as Hash keys.

Returns:

  • (Integer)

    Hash code



# File 'lib/kotoshu/models/word_embedding.rb', line 85

def hash
  [@word, @language_code].hash
end
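
Because #== and #hash both use only the word and the language code, two embeddings with different vectors can still collapse into one Hash key. A minimal sketch of that consequence, assuming a stand-in class (`KeyedEmbeddingSketch`, a hypothetical name) that copies the two listings:

```ruby
# Stand-in that copies #== and #hash from the listings; not the real class.
class KeyedEmbeddingSketch
  attr_reader :word, :vector, :language_code

  def initialize(word, vector, language_code)
    @word = word
    @vector = vector
    @language_code = language_code
  end

  def ==(other)
    return false unless other.is_a?(KeyedEmbeddingSketch)

    @word == other.word && @language_code == other.language_code
  end
  alias_method :eql?, :==

  def hash
    [@word, @language_code].hash
  end
end

a = KeyedEmbeddingSketch.new("hello", [0.1, 0.2], "en")
b = KeyedEmbeddingSketch.new("hello", [0.9, 0.8], "en") # same key, other vector
c = KeyedEmbeddingSketch.new("hello", [0.1, 0.2], "fr") # same vector, other key

same_key = (a == b)            # => true, vectors are ignored
other_language = (a == c)      # => false
unique_count = [a, b, c].uniq.size # => 2

lookup = { a => "first" }
lookup[b] = "second"           # replaces the value stored under a
```

This suits a vocabulary lookup keyed by word and language, but it means equality says nothing about whether two vectors agree.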

#similarity(other) ⇒ Float

Calculate cosine similarity with another embedding.

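Cosine similarity is the cosine of the angle between two vectors: 1.0 for vectors pointing in the same direction, 0.0 for orthogonal vectors, and -1.0 for vectors pointing in opposite directions. Returns 0.0 if the dimensions differ or if either vector has zero magnitude.

Parameters:

  • other (WordEmbedding)

    The embedding to compare with

Returns:

  • (Float)

    Similarity score (-1.0 to 1.0)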

Raises:

  • (TypeError)

    if other is not a WordEmbedding



# File 'lib/kotoshu/models/word_embedding.rb', line 44

def similarity(other)
  raise TypeError, "Must be WordEmbedding" unless other.is_a?(WordEmbedding)

  return 0.0 if @dimension != other.dimension

  dot_product = @vector.zip(other.vector).map { |a, b| a * b }.sum
  magnitude_a = vector_magnitude
  magnitude_b = other.vector_magnitude

  return 0.0 if magnitude_a.zero? || magnitude_b.zero?

  dot_product / (magnitude_a * magnitude_b)
end
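
A worked example of the calculation on plain arrays. The function name `cosine_similarity` is hypothetical, and since `vector_magnitude` is not shown on this page, the sketch assumes it is the Euclidean norm of the vector.

```ruby
# Cosine similarity on plain arrays, mirroring the method body above.
def cosine_similarity(a, b)
  return 0.0 if a.size != b.size

  dot_product = a.zip(b).map { |x, y| x * y }.sum
  # Assumption: vector_magnitude is the Euclidean norm.
  magnitude_a = Math.sqrt(a.map { |x| x * x }.sum)
  magnitude_b = Math.sqrt(b.map { |x| x * x }.sum)

  return 0.0 if magnitude_a.zero? || magnitude_b.zero?

  dot_product / (magnitude_a * magnitude_b)
end

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # => 1.0
orthogonal     = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # => 0.0
opposite       = cosine_similarity([1.0, 0.0], [-1.0, 0.0])  # => -1.0
zero_vector    = cosine_similarity([0.0, 0.0], [1.0, 0.0])   # => 0.0
```

The `opposite` case shows that the result can be negative. A 0.0 result is ambiguous on its own: it can mean orthogonal vectors, a zero vector, or mismatched dimensions.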

#to_s ⇒ String Also known as: inspect

String representation.

Returns:

  • (String)

    Human-readable representation



# File 'lib/kotoshu/models/word_embedding.rb', line 92

def to_s
  "#{self.class.name}[#{@word}, #{@language_code}, #{@dimension}D]"
end