Module: RobotLab::TextAnalysis
Defined in: lib/robot_lab/text_analysis.rb
Overview
Shared TF-IDF text analysis utilities.
Wraps Classifier::TFIDF from the optional ‘classifier’ gem (~> 2.3). Call require_classifier! before any analysis method to get a descriptive error if the gem is missing rather than a bare NameError.
Vectors returned by transform are L2-normalized, so cosine similarity equals the dot product of two vectors — no magnitude division needed.
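The identity behind that statement can be written out as a short derivation. Here a and b stand for any two term vectors; the symbols are illustrative and do not appear in the source.

```latex
\cos(a, b) \;=\; \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}
\;=\; \frac{a \cdot b}{1 \cdot 1}
\;=\; \sum_{k} a_k \, b_k
\qquad \text{when } \lVert a \rVert_2 = \lVert b \rVert_2 = 1 .
```

For sparse vectors the sum only needs to run over terms present in both vectors, since every other product is zero.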
Class Method Summary
- .cosine_similarity(vec_a, vec_b) ⇒ Float
  Cosine similarity between two L2-normalized sparse vectors.
- .dot(vec_a, vec_b) ⇒ Float
  Dot product of two sparse vectors (shared keys only).
- .fit(corpus) ⇒ Classifier::TFIDF
  Fit a TF-IDF model on a corpus of strings.
- .l2_normalize(vec) ⇒ Hash{Symbol => Float}
  L2-normalize a sparse vector.
- .load_classifier_gem ⇒ Object (private)
  Load the classifier gem.
- .require_classifier! ⇒ Object
  Attempt to load the classifier gem.
- .tf_cosine_similarity(text_a, text_b) ⇒ Float
  Cosine similarity between two texts using stemmed term-frequency vectors.
- .transform(model, text) ⇒ Hash{Symbol => Float}
  Transform a string into an L2-normalized TF-IDF term vector.
Class Method Details
.cosine_similarity(vec_a, vec_b) ⇒ Float
Cosine similarity between two L2-normalized sparse vectors.
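Since Classifier::TFIDF returns L2-normalized vectors, this is just a dot product. The result is capped at 1.0 to absorb floating-point noise. No lower bound is applied, so the value lies in [0.0, 1.0] only when all weights are non-negative, as TF-IDF weights normally are. Returns 0.0 if either vector is empty.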
    # File 'lib/robot_lab/text_analysis.rb', line 58
    def self.cosine_similarity(vec_a, vec_b)
      return 0.0 if vec_a.empty? || vec_b.empty?

      [dot(vec_a, vec_b), 1.0].min
    end
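A minimal sketch of how this behaves on hand-built vectors. The module name TextAnalysisSketch and the sample vectors are hypothetical; the two method bodies are copied from the listings on this page so the example runs without RobotLab or the classifier gem.

```ruby
# Hypothetical stand-in module; the method bodies mirror the documented source.
module TextAnalysisSketch
  def self.dot(vec_a, vec_b)
    (vec_a.keys & vec_b.keys).sum { |k| vec_a[k] * vec_b[k] }.to_f
  end

  def self.cosine_similarity(vec_a, vec_b)
    return 0.0 if vec_a.empty? || vec_b.empty?

    [dot(vec_a, vec_b), 1.0].min
  end
end

# Hand-built vectors of unit length (0.6**2 + 0.8**2 is 1 up to rounding).
doc_a = { robot: 0.6, arm: 0.8 }
doc_b = { robot: 0.8, sensor: 0.6 }

identical = TextAnalysisSketch.cosine_similarity(doc_a, doc_a)          # close to 1.0, never above it
partial   = TextAnalysisSketch.cosine_similarity(doc_a, doc_b)          # only :robot is shared, roughly 0.6 * 0.8
disjoint  = TextAnalysisSketch.cosine_similarity(doc_a, { lidar: 1.0 }) # no shared terms
empty     = TextAnalysisSketch.cosine_similarity(doc_a, {})             # early return

puts [identical, partial, disjoint, empty].inspect
```

Passing vectors that are not unit length still returns a number, but it is then a capped dot product and not a true cosine.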
.dot(vec_a, vec_b) ⇒ Float
Dot product of two sparse vectors (shared keys only).
    # File 'lib/robot_lab/text_analysis.rb', line 69
    def self.dot(vec_a, vec_b)
      (vec_a.keys & vec_b.keys).sum { |k| vec_a[k] * vec_b[k] }.to_f
    end
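To illustrate the "shared keys only" behavior, here is a small standalone sketch. The helper name sparse_dot and the sample hashes are hypothetical; the body is the one shown above.

```ruby
# Hypothetical standalone copy of the documented method body.
def sparse_dot(vec_a, vec_b)
  (vec_a.keys & vec_b.keys).sum { |k| vec_a[k] * vec_b[k] }.to_f
end

a = { robot: 1, arm: 2 }
b = { arm: 3, sensor: 4 }

# Only :arm appears in both hashes, so the result is 2 * 3.
puts sparse_dot(a, b)  # prints 6.0

# With no shared keys the block never runs and the sum is 0, converted to 0.0.
puts sparse_dot(a, { lidar: 5 })  # prints 0.0
```

Keys missing from either side are treated as zero weight, which is what makes the hash representation work as a sparse vector.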
.fit(corpus) ⇒ Classifier::TFIDF
Fit a TF-IDF model on a corpus of strings.
    # File 'lib/robot_lab/text_analysis.rb', line 35
    def self.fit(corpus)
      model = Classifier::TFIDF.new(min_df: 1)
      model.fit(corpus)
      model
    end
.l2_normalize(vec) ⇒ Hash{Symbol => Float}
L2-normalize a sparse vector.
    # File 'lib/robot_lab/text_analysis.rb', line 77
    def self.l2_normalize(vec)
      magnitude = Math.sqrt(vec.values.sum { |v| v * v }.to_f)
      return {} if magnitude.zero?

      vec.transform_values { |v| v.to_f / magnitude }
    end
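A short worked example, written as a standalone copy of the method above so it runs on its own. The sample counts are hypothetical.

```ruby
# Standalone copy of the documented method body.
def l2_normalize(vec)
  magnitude = Math.sqrt(vec.values.sum { |v| v * v }.to_f)
  return {} if magnitude.zero?

  vec.transform_values { |v| v.to_f / magnitude }
end

# Raw term counts; the magnitude is sqrt(3*3 + 4*4) = 5.
counts = { robot: 3, arm: 4 }
p l2_normalize(counts)

# An all-zero or empty vector has no direction, so the result is an empty hash.
p l2_normalize({ robot: 0 })
p l2_normalize({})
```

Returning an empty hash for a zero vector avoids a division by zero and lets cosine_similarity fall through to its early 0.0 return.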
.load_classifier_gem ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Load the classifier gem. Extracted for testability.
    # File 'lib/robot_lab/text_analysis.rb', line 27
    def self.load_classifier_gem
      require "classifier"
    end
.require_classifier! ⇒ Object
Attempt to load the classifier gem. Raises DependencyError with installation instructions if the gem cannot be loaded.
    # File 'lib/robot_lab/text_analysis.rb', line 16
    def self.require_classifier!
      load_classifier_gem
    rescue LoadError
      raise DependencyError,
            "The 'classifier' gem is required for text analysis features. " \
            "Add it to your Gemfile: gem 'classifier', '~> 2.3'"
    end
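The optional-dependency pattern used here can be sketched in isolation. Everything named below (OptionalGemSketch, GEM_NAME, load_gem, require_gem!) is hypothetical, and DependencyError is defined locally on the assumption that RobotLab declares its own elsewhere. The gem name is deliberately one that should not be installed, so the failure path runs.

```ruby
module OptionalGemSketch
  # Local stand-in; the real DependencyError is assumed to live in RobotLab.
  class DependencyError < StandardError; end

  # Hypothetical gem name, chosen so that the require is expected to fail.
  GEM_NAME = "a_gem_that_is_not_installed_sketch"

  # Kept as a separate method so a test can stub it.
  def self.load_gem
    require GEM_NAME
  end

  # Translate the bare LoadError into an error that tells the user what to do.
  def self.require_gem!
    load_gem
  rescue LoadError
    raise DependencyError,
          "The '#{GEM_NAME}' gem is required for this feature. " \
          "Add it to your Gemfile."
  end
end

begin
  OptionalGemSketch.require_gem!
rescue OptionalGemSketch::DependencyError => e
  puts e.message
end
```

Isolating the require call in its own method is what lets the failure path be exercised in tests without uninstalling the gem.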
.tf_cosine_similarity(text_a, text_b) ⇒ Float
Cosine similarity between two texts using stemmed term-frequency vectors.
Uses String#word_hash from the classifier gem (stems, removes stopwords) and L2-normalized term frequencies. Unlike TF-IDF, this does not require a reference corpus, making it reliable for direct 2-text comparison. Returns 0.0 when either text is too short to produce a term vector.
    # File 'lib/robot_lab/text_analysis.rb', line 94
    def self.tf_cosine_similarity(text_a, text_b)
      require_classifier!

      vec_a = l2_normalize(text_a.word_hash)
      vec_b = l2_normalize(text_b.word_hash)

      cosine_similarity(vec_a, vec_b)
    end
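The same pipeline can be sketched without the classifier gem by swapping in a naive tokenizer. This is an assumption-laden stand-in: naive_word_hash below is hypothetical and does no stemming or stopword removal, so its scores will differ from those produced with String#word_hash.

```ruby
# Hypothetical stand-in for String#word_hash: lowercase, split on non-letters,
# count occurrences. No stemming and no stopword removal.
def naive_word_hash(text)
  text.to_s.downcase.scan(/[a-z]+/).tally.transform_keys(&:to_sym)
end

def l2_normalize(vec)
  magnitude = Math.sqrt(vec.values.sum { |v| v * v }.to_f)
  return {} if magnitude.zero?

  vec.transform_values { |v| v.to_f / magnitude }
end

# Term-frequency cosine: normalize both count vectors, then take the capped
# dot product over shared terms.
def tf_cosine(text_a, text_b)
  vec_a = l2_normalize(naive_word_hash(text_a))
  vec_b = l2_normalize(naive_word_hash(text_b))
  return 0.0 if vec_a.empty? || vec_b.empty?

  shared = vec_a.keys & vec_b.keys
  [shared.sum { |k| vec_a[k] * vec_b[k] }.to_f, 1.0].min
end

puts tf_cosine("robot arm", "robot sensor")  # one of two terms shared, roughly 0.5
puts tf_cosine("robot arm", "lidar array")   # nothing shared
puts tf_cosine("", "robot arm")              # empty text yields no term vector
```

Because no corpus is involved, every term carries equal weight regardless of how common it is, which is the trade-off against the TF-IDF path.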
.transform(model, text) ⇒ Hash{Symbol => Float}
Transform a string into an L2-normalized TF-IDF term vector.
    # File 'lib/robot_lab/text_analysis.rb', line 46
    def self.transform(model, text)
      model.transform(text.to_s) || {}
    end
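The two guards in this wrapper, text.to_s and the `|| {}` fallback, can be seen with a stub model. StubModel is hypothetical and is not Classifier::TFIDF; whether the real model ever returns nil is not confirmed by this page, only suggested by the fallback.

```ruby
# Standalone copy of the documented wrapper.
def transform(model, text)
  model.transform(text.to_s) || {}
end

# Hypothetical stub: returns nil for blank input, a one-term vector otherwise.
class StubModel
  def transform(text)
    term = text.strip.downcase
    return nil if term.empty?

    { term.to_sym => 1.0 }
  end
end

model = StubModel.new

p transform(model, "Robot")  # the stub's vector is passed through unchanged
p transform(model, nil)      # nil becomes "", the stub returns nil, the wrapper returns {}
```

Callers can therefore pass the result straight to cosine_similarity, which already treats an empty hash as no similarity.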