Module: RobotLab::TextAnalysis

Defined in:
lib/robot_lab/text_analysis.rb

Overview

Shared TF-IDF text analysis utilities.

Wraps Classifier::TFIDF from the optional ‘classifier’ gem (~> 2.3). Call require_classifier! before any analysis method: if the gem is missing, it raises a descriptive DependencyError instead of letting a bare NameError surface later.

Vectors returned by transform are L2-normalized, so the cosine similarity of two such vectors equals their dot product; no magnitude division is needed.
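The claim about L2-normalized vectors can be checked with a minimal sketch in plain Ruby (no gem required). The module NormDemo and its methods are hypothetical names used only for this illustration; the snippet compares the textbook cosine formula with a bare dot product on two vectors that already have unit length.

```ruby
# Illustrative only: NormDemo is a hypothetical helper, not part of RobotLab.
module NormDemo
  # Dot product over the keys both sparse vectors share.
  def self.dot(vec_a, vec_b)
    (vec_a.keys & vec_b.keys).sum(0.0) { |k| vec_a[k] * vec_b[k] }
  end

  # Euclidean (L2) length of a sparse vector.
  def self.magnitude(vec)
    Math.sqrt(vec.values.sum(0.0) { |x| x * x })
  end

  # Textbook cosine similarity, including the magnitude division.
  def self.full_cosine(vec_a, vec_b)
    dot(vec_a, vec_b) / (magnitude(vec_a) * magnitude(vec_b))
  end
end

# Two unit-length vectors: 0.6**2 + 0.8**2 is 1.0 up to float rounding.
a = { cat: 0.6, dog: 0.8 }
b = { cat: 0.8, dog: 0.6 }

puts NormDemo.dot(a, b)          # dot product only
puts NormDemo.full_cosine(a, b)  # same value up to float rounding
```

Because both magnitudes are 1.0, the division changes nothing, which is why the module can skip it.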

Class Method Details

.cosine_similarity(vec_a, vec_b) ⇒ Float

Cosine similarity between two L2-normalized sparse vectors.

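Since Classifier::TFIDF returns L2-normalized vectors, this is just a dot product. The result is capped at 1.0 to absorb float noise; it stays at or above 0.0 as long as the term weights are non-negative. If either vector is empty, the method returns 0.0.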

Parameters:

  • vec_a (Hash{Symbol => Float})
  • vec_b (Hash{Symbol => Float})

Returns:

  • (Float)

    in [0.0, 1.0]



# File 'lib/robot_lab/text_analysis.rb', line 58

def self.cosine_similarity(vec_a, vec_b)
  return 0.0 if vec_a.empty? || vec_b.empty?

  [dot(vec_a, vec_b), 1.0].min
end
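A short sketch of the two edge cases handled here: empty input and a dot product that may drift slightly past 1.0. The module CosineSketch is a hypothetical standalone copy of the two methods, so the snippet runs without RobotLab or the classifier gem.

```ruby
# Standalone copy for illustration; CosineSketch is not part of RobotLab.
module CosineSketch
  def self.dot(vec_a, vec_b)
    (vec_a.keys & vec_b.keys).sum { |k| vec_a[k] * vec_b[k] }.to_f
  end

  def self.cosine_similarity(vec_a, vec_b)
    # An empty vector carries no terms, so similarity is defined as 0.0.
    return 0.0 if vec_a.empty? || vec_b.empty?

    # Cap at 1.0 in case rounding pushes the dot product just above it.
    [dot(vec_a, vec_b), 1.0].min
  end
end

unit = { robot: 0.6, arm: 0.8 } # already L2-normalized

puts CosineSketch.cosine_similarity(unit, {})   # empty input short-circuits to 0.0
puts CosineSketch.cosine_similarity(unit, unit) # self-similarity, never above 1.0
```

Note that only the upper bound is enforced; the lower bound relies on the weights being non-negative.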

.dot(vec_a, vec_b) ⇒ Float

Dot product of two sparse vectors (shared keys only).

Parameters:

  • vec_a (Hash{Symbol => Float})
  • vec_b (Hash{Symbol => Float})

Returns:

  • (Float)


# File 'lib/robot_lab/text_analysis.rb', line 69

def self.dot(vec_a, vec_b)
  (vec_a.keys & vec_b.keys).sum { |k| vec_a[k] * vec_b[k] }.to_f
end
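The "shared keys only" behaviour can be illustrated as follows. DotSketch is a hypothetical standalone copy of the method above, included so the snippet runs by itself.

```ruby
# Standalone copy for illustration; DotSketch is not part of RobotLab.
module DotSketch
  def self.dot(vec_a, vec_b)
    (vec_a.keys & vec_b.keys).sum { |k| vec_a[k] * vec_b[k] }.to_f
  end
end

a = { robot: 1.0, arm: 2.0 }
b = { arm: 3.0, sensor: 4.0 }

p DotSketch.dot(a, b)             # only :arm is shared: 2.0 * 3.0 = 6.0
p DotSketch.dot(a, { lidar: 1.0 }) # no shared keys, so the result is 0.0
```

Terms present in only one vector are treated as having weight zero in the other, so they contribute nothing to the sum. The trailing to_f keeps the return type a Float even when the key intersection is empty and sum returns the Integer 0.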

.fit(corpus) ⇒ Classifier::TFIDF

Fit a TF-IDF model on a corpus of strings.

Parameters:

  • corpus (Array<String>)

    non-empty array of document strings

Returns:

  • (Classifier::TFIDF)

    fitted model



# File 'lib/robot_lab/text_analysis.rb', line 35

def self.fit(corpus)
  model = Classifier::TFIDF.new(min_df: 1)
  model.fit(corpus)
  model
end
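For readers unfamiliar with what fitting a TF-IDF model involves, here is a rough, self-contained sketch. It is not the classifier gem's implementation: the class TinyTfidf, its whitespace tokenizer, and the smoothed IDF formula log((1 + N) / (1 + df)) + 1 are all assumptions chosen for illustration, and the gem may tokenize, stem, and weight terms differently.

```ruby
# Generic TF-IDF sketch, NOT the classifier gem's implementation.
# Assumes whitespace tokenization and a smoothed IDF formula.
class TinyTfidf
  # Learn one IDF weight per term from the corpus.
  def fit(corpus)
    n = corpus.size
    doc_freq = Hash.new(0)
    corpus.each { |doc| tokenize(doc).uniq.each { |t| doc_freq[t] += 1 } }
    @idf = doc_freq.transform_values { |df| Math.log((1.0 + n) / (1.0 + df)) + 1.0 }
    self
  end

  # Weight known terms by count * IDF, then scale to unit length.
  def transform(text)
    counts = tokenize(text).tally.select { |t, _| @idf.key?(t) }
    weights = counts.to_h { |t, c| [t, c * @idf[t]] }
    norm = Math.sqrt(weights.values.sum(0.0) { |w| w * w })
    return {} if norm.zero?

    weights.transform_values { |w| w / norm }
  end

  private

  def tokenize(text)
    text.downcase.split.map(&:to_sym)
  end
end

model = TinyTfidf.new.fit(["robot arm moves", "robot sensor reads", "arm stops"])
p model.transform("robot arm")   # two known terms, unit-length vector
p model.transform("weather")     # no known terms, so {}
```

In this sketch, terms that appear in fewer documents receive a larger IDF weight, and text containing no term from the corpus maps to an empty vector.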

.l2_normalize(vec) ⇒ Hash{Symbol => Float}

L2-normalize a sparse vector.

Parameters:

  • vec (Hash{Symbol => Numeric})

Returns:

  • (Hash{Symbol => Float})

    normalized vector; {} if magnitude is zero



# File 'lib/robot_lab/text_analysis.rb', line 77

def self.l2_normalize(vec)
  magnitude = Math.sqrt(vec.values.sum { |v| v * v }.to_f)
  return {} if magnitude.zero?

  vec.transform_values { |v| v.to_f / magnitude }
end
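A small worked example, using a hypothetical standalone copy of the method (NormalizeSketch) so it runs without RobotLab. The 3-4-5 triangle makes the arithmetic easy to follow.

```ruby
# Standalone copy for illustration; NormalizeSketch is not part of RobotLab.
module NormalizeSketch
  def self.l2_normalize(vec)
    magnitude = Math.sqrt(vec.values.sum { |v| v * v }.to_f)
    return {} if magnitude.zero?

    vec.transform_values { |v| v.to_f / magnitude }
  end
end

# Magnitude is sqrt(9 + 16) = 5.0, so the values become 0.6 and 0.8.
p NormalizeSketch.l2_normalize({ robot: 3, arm: 4 })

# Zero magnitude cannot be scaled to unit length, so {} is returned.
p NormalizeSketch.l2_normalize({ robot: 0 })
p NormalizeSketch.l2_normalize({})
```

Returning {} for a zero vector avoids a division by zero and lets cosine_similarity treat the result as an empty input.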

.load_classifier_gem ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Load the classifier gem. Extracted for testability.



# File 'lib/robot_lab/text_analysis.rb', line 27

def self.load_classifier_gem
  require "classifier"
end

.require_classifier! ⇒ Object

Attempt to load the classifier gem.

Raises:

  • (DependencyError)

    if the classifier gem cannot be loaded


# File 'lib/robot_lab/text_analysis.rb', line 16

def self.require_classifier!
  load_classifier_gem
rescue LoadError
  raise DependencyError,
        "The 'classifier' gem is required for text analysis features. " \
        "Add it to your Gemfile: gem 'classifier', '~> 2.3'"
end
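The pattern used here, turning a LoadError into a descriptive domain error, can be sketched in isolation. SketchDependencyError, OptionalGemSketch, and require_optional! are hypothetical stand-ins so the snippet runs by itself; in RobotLab the real DependencyError and require_classifier! already exist.

```ruby
# Hypothetical stand-ins; these names are not part of RobotLab.
class SketchDependencyError < StandardError; end

module OptionalGemSketch
  # Wraps `require` so a missing gem surfaces as a descriptive error.
  def self.require_optional!(gem_name)
    require gem_name
  rescue LoadError
    raise SketchDependencyError,
          "The '#{gem_name}' gem is required for this feature. " \
          "Add it to your Gemfile: gem '#{gem_name}'"
  end
end

begin
  # Deliberately bogus name, so the LoadError path is exercised.
  OptionalGemSketch.require_optional!("no_such_gem_for_this_sketch")
rescue SketchDependencyError => e
  puts e.message
end
```

Callers can rescue the one domain error and show the installation hint, rather than handling LoadError or a later NameError themselves.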

.tf_cosine_similarity(text_a, text_b) ⇒ Float

Cosine similarity between two texts using stemmed term-frequency vectors.

Uses String#word_hash from the classifier gem (stems, removes stopwords) and L2-normalized term frequencies. Unlike TF-IDF, this does not require a reference corpus, making it reliable for direct 2-text comparison. Returns 0.0 when either text is too short to produce a term vector.

Parameters:

  • text_a (String)
  • text_b (String)

Returns:

  • (Float)

    in [0.0, 1.0]



# File 'lib/robot_lab/text_analysis.rb', line 94

def self.tf_cosine_similarity(text_a, text_b)
  require_classifier!

  vec_a = l2_normalize(text_a.word_hash)
  vec_b = l2_normalize(text_b.word_hash)

  cosine_similarity(vec_a, vec_b)
end
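The following sketch mirrors the flow of this method without the classifier gem. TfSketch and its term_counts method are hypothetical: term_counts is a crude stand-in for String#word_hash that only lowercases and counts words, with no stemming or stopword removal, so its scores will differ from the real method on most inputs.

```ruby
# Illustration only; TfSketch is not part of RobotLab.
module TfSketch
  # Crude stand-in for String#word_hash: lowercase, split on non-letters,
  # count terms. No stemming or stopword removal, unlike the gem.
  def self.term_counts(text)
    text.downcase.scan(/[a-z]+/).tally.transform_keys(&:to_sym)
  end

  def self.l2_normalize(vec)
    magnitude = Math.sqrt(vec.values.sum { |v| v * v }.to_f)
    return {} if magnitude.zero?

    vec.transform_values { |v| v.to_f / magnitude }
  end

  def self.similarity(text_a, text_b)
    a = l2_normalize(term_counts(text_a))
    b = l2_normalize(term_counts(text_b))
    return 0.0 if a.empty? || b.empty?

    [(a.keys & b.keys).sum { |k| a[k] * b[k] }.to_f, 1.0].min
  end
end

puts TfSketch.similarity("robot arm moves", "robot arm stops") # 2 of 3 terms shared
puts TfSketch.similarity("robot arm", "")                      # empty text gives 0.0
puts TfSketch.similarity("robot arm", "weather report")        # no shared terms gives 0.0
```

With three equally weighted terms per text and two in common, the first score works out to 2/3, which shows how raw term overlap drives the result when no corpus-wide IDF weighting is applied.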

.transform(model, text) ⇒ Hash{Symbol => Float}

Transform a string into an L2-normalized TF-IDF term vector.

Parameters:

  • model (Classifier::TFIDF)

    a fitted model

  • text (String)

Returns:

  • (Hash{Symbol => Float})

    sparse term vector; empty if no known terms



# File 'lib/robot_lab/text_analysis.rb', line 46

def self.transform(model, text)
  model.transform(text.to_s) || {}
end
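The wrapper adds two small guards: to_s on the input and a {} fallback on the output. The sketch below shows both using StubModel, a hypothetical stand-in for a fitted model. Whether the real Classifier::TFIDF ever returns nil is not stated in this page; the stub does so only to exercise the fallback.

```ruby
# Hypothetical stub that returns nil for text it does not know.
StubModel = Struct.new(:vectors) do
  def transform(text)
    vectors[text] # nil when the text is not in the lookup table
  end
end

# Standalone copy for illustration; TransformSketch is not part of RobotLab.
module TransformSketch
  def self.transform(model, text)
    model.transform(text.to_s) || {}
  end
end

model = StubModel.new({ "robot arm" => { robot: 0.6, arm: 0.8 } })

p TransformSketch.transform(model, "robot arm") # known text: the stored vector
p TransformSketch.transform(model, "unknown")   # stub returns nil, wrapper returns {}
p TransformSketch.transform(model, nil)         # nil.to_s is "", which also yields {}
```

Callers therefore always receive a Hash and can pass it straight to cosine_similarity, which treats an empty vector as similarity 0.0.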