Class: Classifier::TFIDF
- Includes: Streaming
- Defined in: lib/classifier/tfidf.rb
Overview
TF-IDF vectorizer: transforms text to weighted feature vectors. Downweights common words, upweights discriminative terms.
Example:
tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent"])
tfidf.transform("Dogs are loyal") # => {:dog=>1.0} ("loyal" is not in the fitted vocabulary, so transform skips it)
Constant Summary
Constants included from Streaming
Instance Attribute Summary
- #idf ⇒ Object (readonly)
  Returns the value of attribute idf.
- #num_documents ⇒ Object (readonly)
  Returns the value of attribute num_documents.
- #storage ⇒ Object
  Returns the value of attribute storage.
- #vocabulary ⇒ Object (readonly)
  Returns the value of attribute vocabulary.
Class Method Summary
- .from_json(json) ⇒ Object
  Loads a vectorizer from JSON.
- .load(storage:) ⇒ Object
  Loads a vectorizer from the configured storage.
- .load_checkpoint(storage:, checkpoint_id:) ⇒ Object
  Loads a vectorizer from a checkpoint.
- .load_from_file(path) ⇒ Object
  Loads a vectorizer from a file.
Instance Method Summary
- #as_json(_options = nil) ⇒ Object
- #dirty? ⇒ Boolean
  Returns true if there are unsaved changes.
- #feature_names ⇒ Object
  Returns vocabulary terms in index order.
- #fit(documents) ⇒ Object
  Learns vocabulary and IDF weights from the corpus.
- #fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object
  Fits the vectorizer from an IO stream.
- #fit_transform(documents) ⇒ Object
  Fits and transforms in one step.
- #fitted? ⇒ Boolean
- #initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false, min_word_length: Classifier.config.min_word_length) ⇒ TFIDF (constructor)
  Creates a new TF-IDF vectorizer.
- #marshal_dump ⇒ Object
- #marshal_load(data) ⇒ Object
- #reload ⇒ Object
  Reloads the vectorizer from storage, raising if there are unsaved changes.
- #reload! ⇒ Object
  Force reloads the vectorizer from storage, discarding any unsaved changes.
- #save ⇒ Object
  Saves the vectorizer to the configured storage.
- #save_to_file(path) ⇒ Object
  Saves the vectorizer state to a file.
- #to_json(_options = nil) ⇒ Object
- #train_batch ⇒ Object
  TFIDF doesn’t support train_batch (use fit instead).
- #train_from_stream ⇒ Object
  TFIDF doesn’t support train_from_stream (use fit_from_stream instead).
- #transform(document) ⇒ Object
  Transforms a document into a normalized TF-IDF vector.
Methods included from Streaming
#delete_checkpoint, #list_checkpoints, #save_checkpoint
Constructor Details
#initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false, min_word_length: Classifier.config.min_word_length) ⇒ TFIDF
Creates a new TF-IDF vectorizer.
- min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
- ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
- sublinear_tf: use 1 + log(tf) instead of raw term frequency
- min_word_length: minimum word length filter in tokenization
# File 'lib/classifier/tfidf.rb', line 44

def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false,
               min_word_length: Classifier.config.min_word_length)
  validate_df!(min_df, 'min_df')
  validate_df!(max_df, 'max_df')
  validate_ngram_range!(ngram_range)

  @min_df = min_df
  @max_df = max_df
  @ngram_range = ngram_range
  @sublinear_tf = sublinear_tf
  @vocabulary = {}
  @idf = {}
  @num_documents = 0
  @fitted = false
  @dirty = false
  @storage = nil
  @min_word_length = min_word_length
end
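The Integer-versus-Float convention for min_df and max_df can be illustrated with a small standalone sketch. The helper name df_in_bounds? is hypothetical; the library's private within_df_bounds? is not shown on this page and may treat edge cases differently.

```ruby
# Hypothetical helper, not part of Classifier::TFIDF.
# Assumption: an Integer bound is an absolute document count,
# and a Float bound is a proportion of the corpus size.
def df_in_bounds?(df, num_documents, min_df: 1, max_df: 1.0)
  min_count = min_df.is_a?(Float) ? min_df * num_documents : min_df
  max_count = max_df.is_a?(Float) ? max_df * num_documents : max_df
  df >= min_count && df <= max_count
end

too_rare   = df_in_bounds?(1, 10, min_df: 2)              # below the 2-document minimum
too_common = df_in_bounds?(9, 10, max_df: 0.8)            # above 80% of 10 documents
kept       = df_in_bounds?(5, 10, min_df: 2, max_df: 0.8) # inside both bounds
```

Under these assumptions, too_rare and too_common are false and kept is true.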
Instance Attribute Details
#idf ⇒ Object (readonly)
Returns the value of attribute idf.
# File 'lib/classifier/tfidf.rb', line 33

def idf
  @idf
end
#num_documents ⇒ Object (readonly)
Returns the value of attribute num_documents.
# File 'lib/classifier/tfidf.rb', line 33

def num_documents
  @num_documents
end
#storage ⇒ Object
Returns the value of attribute storage.
# File 'lib/classifier/tfidf.rb', line 34

def storage
  @storage
end
#vocabulary ⇒ Object (readonly)
Returns the value of attribute vocabulary.
# File 'lib/classifier/tfidf.rb', line 33

def vocabulary
  @vocabulary
end
Class Method Details
.from_json(json) ⇒ Object
Loads a vectorizer from JSON.
# File 'lib/classifier/tfidf.rb', line 223

def self.from_json(json)
  data = json.is_a?(String) ? JSON.parse(json) : json
  raise ArgumentError, "Invalid vectorizer type: #{data['type']}" unless data['type'] == 'tfidf'

  instance = new(
    min_df: data['min_df'],
    max_df: data['max_df'],
    ngram_range: data['ngram_range'],
    sublinear_tf: data['sublinear_tf'],
    min_word_length: data['min_word_length'] || Classifier.config.min_word_length
  )

  instance.instance_variable_set(:@vocabulary, symbolize_keys(data['vocabulary']))
  instance.instance_variable_set(:@idf, symbolize_keys(data['idf']))
  instance.instance_variable_set(:@num_documents, data['num_documents'])
  instance.instance_variable_set(:@fitted, data['fitted'])
  instance.instance_variable_set(:@dirty, false)
  instance.instance_variable_set(:@storage, nil)

  instance
end
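The symbolize_keys calls in .from_json are needed because JSON object keys come back as Strings, while the vectorizer stores terms as Symbols. The following standalone sketch shows that round trip using Hash#transform_keys; the library's private symbolize_keys helper is not shown on this page, so this is only an assumption about what it does.

```ruby
require 'json'

# Made-up vocabulary for illustration; not produced by the library.
vocabulary = { dog: 0, cat: 1 }

# Serializing turns Symbol keys into JSON strings, and parsing keeps them as Strings.
parsed = JSON.parse(JSON.generate(vocabulary))

# Converting the keys back restores the Symbol-keyed hash.
restored = parsed.transform_keys(&:to_sym)
```

Here parsed has the keys "dog" and "cat", and restored is equal to the original vocabulary hash.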
.load(storage:) ⇒ Object
Loads a vectorizer from the configured storage.
# File 'lib/classifier/tfidf.rb', line 157

def self.load(storage:)
  data = storage.read
  raise StorageError, 'No saved state found' unless data

  instance = from_json(data)
  instance.storage = storage
  instance
end
.load_checkpoint(storage:, checkpoint_id:) ⇒ Object
Loads a vectorizer from a checkpoint.
# File 'lib/classifier/tfidf.rb', line 262

def self.load_checkpoint(storage:, checkpoint_id:)
  raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)

  dir = File.dirname(storage.path)
  base = File.basename(storage.path, '.*')
  ext = File.extname(storage.path)
  checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")

  checkpoint_storage = Storage::File.new(path: checkpoint_path)
  instance = load(storage: checkpoint_storage)
  instance.storage = storage
  instance
end
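The checkpoint file name is derived from the main storage path. The standalone sketch below repeats only the path arithmetic from .load_checkpoint; the method name checkpoint_path_for, the path, and the checkpoint id are made up for illustration.

```ruby
# Hypothetical helper that mirrors the path construction in .load_checkpoint.
def checkpoint_path_for(storage_path, checkpoint_id)
  dir  = File.dirname(storage_path)         # directory of the main file
  base = File.basename(storage_path, '.*')  # file name without extension
  ext  = File.extname(storage_path)         # extension, including the dot
  File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
end

path = checkpoint_path_for('models/tfidf.json', '50pct')
# => "models/tfidf_checkpoint_50pct.json"
```

Note that the loaded instance is re-pointed at the original storage, so a later save writes to the main file rather than to the checkpoint file.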
.load_from_file(path) ⇒ Object
Loads a vectorizer from a file.
# File 'lib/classifier/tfidf.rb', line 168

def self.load_from_file(path)
  from_json(File.read(path))
end
Instance Method Details
#as_json(_options = nil) ⇒ Object
# File 'lib/classifier/tfidf.rb', line 200

def as_json(_options = nil)
  {
    version: 1,
    type: 'tfidf',
    min_df: @min_df,
    max_df: @max_df,
    ngram_range: @ngram_range,
    sublinear_tf: @sublinear_tf,
    vocabulary: @vocabulary,
    idf: @idf,
    num_documents: @num_documents,
    fitted: @fitted,
    min_word_length: @min_word_length
  }
end
#dirty? ⇒ Boolean
Returns true if there are unsaved changes.
# File 'lib/classifier/tfidf.rb', line 134

def dirty?
  @dirty
end
#feature_names ⇒ Object
Returns vocabulary terms in index order.
# File 'lib/classifier/tfidf.rb', line 123

def feature_names
  @vocabulary.keys.sort_by { |term| @vocabulary[term] }
end
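Because the vocabulary maps each term to its column index, sorting the keys by their values recovers the column order. A minimal sketch with a made-up vocabulary hash:

```ruby
# Made-up vocabulary (term => index), deliberately listed out of index order.
vocabulary = { pet: 2, dog: 0, great: 1 }

# Same expression as in #feature_names, applied to a local hash.
names = vocabulary.keys.sort_by { |term| vocabulary[term] }
# => [:dog, :great, :pet]
```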
#fit(documents) ⇒ Object
Learns vocabulary and IDF weights from the corpus.
# File 'lib/classifier/tfidf.rb', line 65

def fit(documents)
  raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
  raise ArgumentError, 'documents cannot be empty' if documents.empty?

  @num_documents = documents.size
  document_frequencies = Hash.new(0)

  documents.each do |doc|
    terms = extract_terms(doc)
    terms.each_key { |term| document_frequencies[term] += 1 }
  end

  @vocabulary = {}
  @idf = {}
  vocab_index = 0

  document_frequencies.each do |term, df|
    next unless within_df_bounds?(df, @num_documents)

    @vocabulary[term] = vocab_index
    vocab_index += 1

    # IDF: log((N + 1) / (df + 1)) + 1
    @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
  end

  @fitted = true
  @dirty = true
  self
end
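The smoothed IDF formula in the listing above, log((N + 1) / (df + 1)) + 1, can be checked by hand for a two-document corpus. The function name smoothed_idf is chosen for this sketch and is not part of the library.

```ruby
# Standalone version of the IDF expression used inside #fit.
# num_documents is N (corpus size); df is the number of documents containing the term.
def smoothed_idf(num_documents, df)
  Math.log((num_documents + 1).to_f / (df + 1)) + 1
end

common = smoothed_idf(2, 2) # term in both documents: log(3/3) + 1 = 1.0
rare   = smoothed_idf(2, 1) # term in one document:   log(3/2) + 1, about 1.405
```

The added 1 inside the logarithm avoids division by zero, and the trailing + 1 keeps the weight of a term that appears in every document at 1.0 rather than 0.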
#fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object
Fits the vectorizer from an IO stream. Collects all documents from the stream, then fits the model. Note: All documents must be collected in memory for IDF calculation.
# File 'lib/classifier/tfidf.rb', line 289

def fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
  reader = Streaming::LineReader.new(io, batch_size: batch_size)
  total = reader.estimate_line_count
  progress = Streaming::Progress.new(total: total)

  documents = [] #: Array[String]

  reader.each_batch do |batch|
    documents.concat(batch)
    progress.completed += batch.size
    progress.current_batch += 1
    yield progress if block_given?
  end

  fit(documents) unless documents.empty?
  self
end
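The collect-then-fit pattern can be sketched without the library by reading an IO line by line in batches. This sketch assumes one document per line and uses Enumerable#each_slice as a stand-in for Streaming::LineReader, whose actual behavior is not shown on this page.

```ruby
require 'stringio'

# Made-up corpus: three one-line documents.
io = StringIO.new("Dogs are great pets\nCats are independent\nBirds can sing\n")

documents = []
batches = 0

# Read lines in batches of 2, accumulating every document in memory,
# because document frequencies need the whole corpus before IDF can be computed.
io.each_line(chomp: true).each_slice(2) do |batch|
  documents.concat(batch)
  batches += 1
end
```

After the loop, documents holds all three lines and batches is 2. Batching here only controls how often progress could be reported; it does not reduce peak memory use.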
#fit_transform(documents) ⇒ Object
Fits and transforms in one step.
# File 'lib/classifier/tfidf.rb', line 116

def fit_transform(documents)
  fit(documents)
  documents.map { |doc| transform(doc) }
end
#fitted? ⇒ Boolean
# File 'lib/classifier/tfidf.rb', line 128

def fitted?
  @fitted
end
#marshal_dump ⇒ Object
# File 'lib/classifier/tfidf.rb', line 246

def marshal_dump
  [@min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted,
   @min_word_length]
end
#marshal_load(data) ⇒ Object
# File 'lib/classifier/tfidf.rb', line 252

def marshal_load(data)
  @min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted,
    @min_word_length = data
  @dirty = false
  @storage = nil
end
#reload ⇒ Object
Reloads the vectorizer from storage, raising if there are unsaved changes.
# File 'lib/classifier/tfidf.rb', line 174

def reload
  raise ArgumentError, 'No storage configured' unless storage
  raise UnsavedChangesError, 'Unsaved changes would be lost. Call save first or use reload!' if @dirty

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end
#reload! ⇒ Object
Force reloads the vectorizer from storage, discarding any unsaved changes.
# File 'lib/classifier/tfidf.rb', line 188

def reload!
  raise ArgumentError, 'No storage configured' unless storage

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end
#save ⇒ Object
Saves the vectorizer to the configured storage.
# File 'lib/classifier/tfidf.rb', line 140

def save
  raise ArgumentError, 'No storage configured' unless storage

  storage.write(to_json)
  @dirty = false
end
#save_to_file(path) ⇒ Object
Saves the vectorizer state to a file.
# File 'lib/classifier/tfidf.rb', line 149

def save_to_file(path)
  result = File.write(path, to_json)
  @dirty = false
  result
end
#to_json(_options = nil) ⇒ Object
# File 'lib/classifier/tfidf.rb', line 217

def to_json(_options = nil)
  JSON.generate(as_json)
end
#train_batch ⇒ Object
TFIDF doesn’t support train_batch (use fit instead). This method raises NotImplementedError with guidance.
# File 'lib/classifier/tfidf.rb', line 319

def train_batch(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit instead of train_batch'
end
#train_from_stream ⇒ Object
TFIDF doesn’t support train_from_stream (use fit_from_stream instead). This method raises NotImplementedError with guidance.
# File 'lib/classifier/tfidf.rb', line 311

def train_from_stream(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit_from_stream instead of train_from_stream'
end
#transform(document) ⇒ Object
Transforms a document into a normalized TF-IDF vector.
# File 'lib/classifier/tfidf.rb', line 98

def transform(document)
  raise NotFittedError, 'TFIDF has not been fitted. Call fit first.' unless @fitted

  terms = extract_terms(document)
  result = {} #: Hash[Symbol, Float]

  terms.each do |term, tf|
    next unless @vocabulary.key?(term)

    tf_value = @sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
    result[term] = (tf_value * @idf[term]).to_f
  end

  normalize_vector(result)
end
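The weighting step in #transform can be reproduced as a standalone sketch. The function weigh_and_normalize and its input hashes are made up for illustration, and the final step assumes L2 (unit-length) normalization; the library's private normalize_vector is not shown on this page and may differ.

```ruby
# Hypothetical standalone version of the weighting logic in #transform.
# term_counts maps term => raw count; idf maps term => IDF weight.
def weigh_and_normalize(term_counts, idf, sublinear_tf: false)
  weights = {}
  term_counts.each do |term, tf|
    next unless idf.key?(term) # terms outside the fitted vocabulary are skipped

    tf_value = sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
    weights[term] = tf_value * idf[term]
  end

  # Assumption: L2 normalization, so the resulting vector has length 1.
  norm = Math.sqrt(weights.values.sum { |w| w * w })
  return weights if norm.zero?

  weights.transform_values { |w| w / norm }
end

idf = { dog: 1.0, cat: 1.0 }
vector = weigh_and_normalize({ dog: 1, cat: 1, loyal: 3 }, idf)
```

With these inputs, :loyal is dropped because it has no IDF entry, and :dog and :cat each receive 1/sqrt(2), about 0.7071.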