Class: Classifier::TFIDF

Inherits:
Object
Includes:
Streaming
Defined in:
lib/classifier/tfidf.rb

Overview

TF-IDF vectorizer that transforms text into weighted feature vectors. Words that are common across the corpus are downweighted, and terms that discriminate between documents are upweighted.

Example:

tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent"])
tfidf.transform("Dogs are loyal")  # => {:dog=>1.0} ("loyal" was not seen during fit, so it is dropped)

Constant Summary

Constants included from Streaming

Streaming::DEFAULT_BATCH_SIZE

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Streaming

#delete_checkpoint, #list_checkpoints, #save_checkpoint

Constructor Details

#initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false, min_word_length: Classifier.config.min_word_length) ⇒ TFIDF

Creates a new TF-IDF vectorizer.

  • min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)

  • ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams

  • sublinear_tf: use 1 + log(tf) instead of raw term frequency

  • min_word_length: minimum word length filter in tokenization
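The min_df/max_df semantics above can be sketched as follows. This is a hypothetical stand-alone helper, not the library's private within_df_bounds? (its body is not shown on this page). It assumes an Integer bound is an absolute document count and a Float bound is a proportion of the corpus, as the parameter list states.

```ruby
# Hypothetical helpers, not part of Classifier::TFIDF.
# Assumption: Integer bound = document count, Float bound = proportion.
def df_threshold(bound, num_documents)
  bound.is_a?(Float) ? bound * num_documents : bound
end

# Keeps a term only if its document frequency lies within both bounds.
def df_within_bounds?(df, num_documents, min_df: 1, max_df: 1.0)
  df >= df_threshold(min_df, num_documents) &&
    df <= df_threshold(max_df, num_documents)
end

df_within_bounds?(1, 10, min_df: 2)               # false: too rare
df_within_bounds?(9, 10, max_df: 0.8)             # false: 9 > 0.8 * 10
df_within_bounds?(5, 10, min_df: 2, max_df: 0.8)  # true
```

With the defaults (min_df: 1, max_df: 1.0) every term that occurs in at least one document passes under this reading.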



# File 'lib/classifier/tfidf.rb', line 44

def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false,
               min_word_length: Classifier.config.min_word_length)
  validate_df!(min_df, 'min_df')
  validate_df!(max_df, 'max_df')
  validate_ngram_range!(ngram_range)

  @min_df = min_df
  @max_df = max_df
  @ngram_range = ngram_range
  @sublinear_tf = sublinear_tf
  @vocabulary = {}
  @idf = {}
  @num_documents = 0
  @fitted = false
  @dirty = false
  @storage = nil
  @min_word_length = min_word_length
end

Instance Attribute Details

#idf ⇒ Object (readonly)

Returns the value of attribute idf.



# File 'lib/classifier/tfidf.rb', line 33

def idf
  @idf
end

#num_documents ⇒ Object (readonly)

Returns the value of attribute num_documents.



# File 'lib/classifier/tfidf.rb', line 33

def num_documents
  @num_documents
end

#storage ⇒ Object

Returns the value of attribute storage.



# File 'lib/classifier/tfidf.rb', line 34

def storage
  @storage
end

#vocabulary ⇒ Object (readonly)

Returns the value of attribute vocabulary.



# File 'lib/classifier/tfidf.rb', line 33

def vocabulary
  @vocabulary
end

Class Method Details

.from_json(json) ⇒ Object

Loads a vectorizer from JSON.

Raises:

  • (ArgumentError)


# File 'lib/classifier/tfidf.rb', line 223

def self.from_json(json)
  data = json.is_a?(String) ? JSON.parse(json) : json
  raise ArgumentError, "Invalid vectorizer type: #{data['type']}" unless data['type'] == 'tfidf'

  instance = new(
    min_df: data['min_df'],
    max_df: data['max_df'],
    ngram_range: data['ngram_range'],
    sublinear_tf: data['sublinear_tf'],
    min_word_length: data['min_word_length'] || Classifier.config.min_word_length
  )

  instance.instance_variable_set(:@vocabulary, symbolize_keys(data['vocabulary']))
  instance.instance_variable_set(:@idf, symbolize_keys(data['idf']))
  instance.instance_variable_set(:@num_documents, data['num_documents'])
  instance.instance_variable_set(:@fitted, data['fitted'])
  instance.instance_variable_set(:@dirty, false)
  instance.instance_variable_set(:@storage, nil)

  instance
end
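One reason from_json passes vocabulary and idf through symbolize_keys is that JSON object keys are always strings, so Symbol keys do not survive a round trip. The sketch below uses only the standard json library. The hash contents are made up, and symbolized is a stand-in for the private symbolize_keys helper, whose body is not shown here.

```ruby
require 'json'

# Stand-in for the library's private symbolize_keys (an assumption).
def symbolized(hash)
  hash.transform_keys(&:to_sym)
end

vocabulary = { dog: 0, cat: 1 }                  # illustrative data only
restored = JSON.parse(JSON.generate(vocabulary))

restored.keys                        # => ["dog", "cat"]
symbolized(restored) == vocabulary   # => true
```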

.load(storage:) ⇒ Object

Loads a vectorizer from the configured storage.

Raises:

  • (StorageError)


# File 'lib/classifier/tfidf.rb', line 157

def self.load(storage:)
  data = storage.read
  raise StorageError, 'No saved state found' unless data

  instance = from_json(data)
  instance.storage = storage
  instance
end

.load_checkpoint(storage:, checkpoint_id:) ⇒ Object

Loads a vectorizer from a checkpoint.

Raises:

  • (ArgumentError)


# File 'lib/classifier/tfidf.rb', line 262

def self.load_checkpoint(storage:, checkpoint_id:)
  raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)

  dir = File.dirname(storage.path)
  base = File.basename(storage.path, '.*')
  ext = File.extname(storage.path)
  checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")

  checkpoint_storage = Storage::File.new(path: checkpoint_path)
  instance = load(storage: checkpoint_storage)
  instance.storage = storage
  instance
end
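The checkpoint file name is derived from the main storage path. The following sketch repeats the same File calls in isolation so the naming scheme is easy to see. The method name checkpoint_path_for and the sample paths are illustrative only.

```ruby
# Mirrors the path arithmetic in load_checkpoint above.
# checkpoint_path_for is a hypothetical name, not a library method.
def checkpoint_path_for(path, checkpoint_id)
  dir  = File.dirname(path)          # "models"
  base = File.basename(path, '.*')   # "tfidf"
  ext  = File.extname(path)          # ".json"
  File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
end

checkpoint_path_for('models/tfidf.json', '50pct')
# => "models/tfidf_checkpoint_50pct.json"
```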

.load_from_file(path) ⇒ Object

Loads a vectorizer from a file.



# File 'lib/classifier/tfidf.rb', line 168

def self.load_from_file(path)
  from_json(File.read(path))
end

Instance Method Details

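#as_json(_options = nil) ⇒ Object

Returns the vectorizer state as a Hash suitable for JSON serialization: format version, type tag, constructor options, vocabulary, IDF weights, document count, and fitted flag.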



# File 'lib/classifier/tfidf.rb', line 200

def as_json(_options = nil)
  {
    version: 1,
    type: 'tfidf',
    min_df: @min_df,
    max_df: @max_df,
    ngram_range: @ngram_range,
    sublinear_tf: @sublinear_tf,
    vocabulary: @vocabulary,
    idf: @idf,
    num_documents: @num_documents,
    fitted: @fitted,
    min_word_length: @min_word_length
  }
end

#dirty? ⇒ Boolean

Returns true if there are unsaved changes.

Returns:

  • (Boolean)


# File 'lib/classifier/tfidf.rb', line 134

def dirty?
  @dirty
end

#feature_names ⇒ Object

Returns vocabulary terms in index order.



# File 'lib/classifier/tfidf.rb', line 123

def feature_names
  @vocabulary.keys.sort_by { |term| @vocabulary[term] }
end
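Because the vocabulary maps each term to an integer index, index order gives a stable column order. As a sketch of one possible use, the sparse Hash returned by transform can be expanded into a dense array aligned with that order. The helper names and data below are illustrative and not part of the library.

```ruby
# Illustrative data; in practice this comes from a fitted vectorizer.
vocabulary = { cat: 1, dog: 0, pet: 2 }

# Same ordering rule as feature_names above.
def ordered_terms(vocabulary)
  vocabulary.keys.sort_by { |term| vocabulary[term] }
end

# Expands a sparse {term => weight} Hash into a dense Array in index order,
# filling absent terms with 0.0.
def to_dense(sparse, vocabulary)
  ordered_terms(vocabulary).map { |term| sparse.fetch(term, 0.0) }
end

ordered_terms(vocabulary)                     # => [:dog, :cat, :pet]
to_dense({ pet: 0.6, dog: 0.8 }, vocabulary)  # => [0.8, 0.0, 0.6]
```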

#fit(documents) ⇒ Object

Learns vocabulary and IDF weights from the corpus.

Raises:

  • (ArgumentError)


# File 'lib/classifier/tfidf.rb', line 65

def fit(documents)
  raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
  raise ArgumentError, 'documents cannot be empty' if documents.empty?

  @num_documents = documents.size
  document_frequencies = Hash.new(0)

  documents.each do |doc|
    terms = extract_terms(doc)
    terms.each_key { |term| document_frequencies[term] += 1 }
  end

  @vocabulary = {}
  @idf = {}
  vocab_index = 0

  document_frequencies.each do |term, df|
    next unless within_df_bounds?(df, @num_documents)

    @vocabulary[term] = vocab_index
    vocab_index += 1

    # IDF: log((N + 1) / (df + 1)) + 1
    @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
  end

  @fitted = true
  @dirty = true
  self
end
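The smoothed IDF used in fit can be checked by hand. The sketch below isolates the formula from the source above in a stand-alone method (the name smoothed_idf is illustrative) and evaluates it for a two-document corpus.

```ruby
# Same expression as in fit: log((N + 1) / (df + 1)) + 1.
# The +1 terms keep the ratio finite; the trailing +1 keeps terms that
# occur in every document from getting a weight of zero.
def smoothed_idf(num_documents, df)
  Math.log((num_documents + 1).to_f / (df + 1)) + 1
end

smoothed_idf(2, 2)  # term in both documents:   log(3/3) + 1 = 1.0
smoothed_idf(2, 1)  # term in one of two docs:  log(3/2) + 1 ≈ 1.405
```

A term that occurs in fewer documents gets a larger weight, which is the "upweights discriminative terms" behavior described in the overview.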

#fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object

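Fits the vectorizer from an IO stream, read in batches via Streaming::LineReader. Because IDF depends on corpus-wide document frequencies, all documents are collected in memory before the model is fitted. If the stream yields no documents, fit is not called and the vectorizer is left unchanged.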

Examples:

Fit from a file

File.open('corpus.txt') { |io| tfidf.fit_from_stream(io) }

With progress tracking

tfidf.fit_from_stream(io, batch_size: 500) do |progress|
  puts "#{progress.completed} documents loaded"
end


# File 'lib/classifier/tfidf.rb', line 289

def fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
  reader = Streaming::LineReader.new(io, batch_size: batch_size)
  total = reader.estimate_line_count
  progress = Streaming::Progress.new(total: total)

  documents = [] #: Array[String]

  reader.each_batch do |batch|
    documents.concat(batch)
    progress.completed += batch.size
    progress.current_batch += 1
    yield progress if block_given?
  end

  fit(documents) unless documents.empty?
  self
end
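The collect-then-fit pattern can be tried without the library. In this sketch each_line_batch is a hypothetical stand-in for Streaming::LineReader#each_batch (the real reader's behavior may differ), and the block receives a running count instead of a Progress object.

```ruby
require 'stringio'

# Hypothetical stand-in for Streaming::LineReader: yields arrays of up to
# batch_size lines with trailing newlines removed.
def each_line_batch(io, batch_size:)
  io.each_line.each_slice(batch_size) { |lines| yield lines.map(&:chomp) }
end

# Accumulates every batch in memory, reporting the running total after each.
def collect_documents(io, batch_size: 2)
  documents = []
  each_line_batch(io, batch_size: batch_size) do |batch|
    documents.concat(batch)
    yield documents.size if block_given?
  end
  documents
end

io = StringIO.new("Dogs are great pets\nCats are independent\nBirds sing\n")
collect_documents(io) { |loaded| puts "#{loaded} documents loaded" }
```

The returned array is what would then be handed to fit in one call, which is why memory use grows with corpus size regardless of batch_size.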

#fit_transform(documents) ⇒ Object

Fits and transforms in one step.



# File 'lib/classifier/tfidf.rb', line 116

def fit_transform(documents)
  fit(documents)
  documents.map { |doc| transform(doc) }
end

#fitted? ⇒ Boolean

Returns true once the vectorizer has been fitted.

Returns:

  • (Boolean)


# File 'lib/classifier/tfidf.rb', line 128

def fitted?
  @fitted
end

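#marshal_dump ⇒ Object

Returns the state serialized by Marshal.dump: constructor options, vocabulary, IDF weights, document count, and fitted flag. The dirty flag and storage are not included.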



# File 'lib/classifier/tfidf.rb', line 246

def marshal_dump
  [@min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted,
   @min_word_length]
end

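#marshal_load(data) ⇒ Object

Restores the state produced by marshal_dump. The loaded instance is marked clean and has no storage configured.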



# File 'lib/classifier/tfidf.rb', line 252

def marshal_load(data)
  @min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted,
    @min_word_length = data
  @dirty = false
  @storage = nil
end
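The marshal_dump/marshal_load pair lets an object choose which state is serialized and reset the rest on load. A toy class (not the library's) shows the same pattern of omitting the dirty flag and clearing it when the object is restored.

```ruby
# Toy class for illustration only; TinyModel is not part of the library.
class TinyModel
  attr_reader :weights, :dirty

  def initialize(weights)
    @weights = weights
    @dirty = true
  end

  # Only @weights is serialized; @dirty is deliberately left out.
  def marshal_dump
    [@weights]
  end

  # A freshly loaded object has no unsaved changes.
  def marshal_load(data)
    @weights = data.first
    @dirty = false
  end
end

copy = Marshal.load(Marshal.dump(TinyModel.new({ dog: 1.4 })))
copy.weights  # => {:dog=>1.4}
copy.dirty    # => false
```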

#reload ⇒ Object

Reloads the vectorizer from storage, raising if there are unsaved changes.

Raises:

  • (ArgumentError)

  • (UnsavedChangesError)

  • (StorageError)


# File 'lib/classifier/tfidf.rb', line 174

def reload
  raise ArgumentError, 'No storage configured' unless storage
  raise UnsavedChangesError, 'Unsaved changes would be lost. Call save first or use reload!' if @dirty

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#reload! ⇒ Object

Force reloads the vectorizer from storage, discarding any unsaved changes.

Raises:

  • (ArgumentError)

  • (StorageError)


# File 'lib/classifier/tfidf.rb', line 188

def reload!
  raise ArgumentError, 'No storage configured' unless storage

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#save ⇒ Object

Saves the vectorizer to the configured storage.

Raises:

  • (ArgumentError)


# File 'lib/classifier/tfidf.rb', line 140

def save
  raise ArgumentError, 'No storage configured' unless storage

  storage.write(to_json)
  @dirty = false
end

#save_to_file(path) ⇒ Object

Saves the vectorizer state to a file.



# File 'lib/classifier/tfidf.rb', line 149

def save_to_file(path)
  result = File.write(path, to_json)
  @dirty = false
  result
end

#to_json(_options = nil) ⇒ Object

Serializes the vectorizer to a JSON string built from as_json.



# File 'lib/classifier/tfidf.rb', line 217

def to_json(_options = nil)
  JSON.generate(as_json)
end

#train_batch ⇒ Object

TFIDF doesn’t support train_batch (use fit instead). This method raises NotImplementedError with guidance.

Raises:

  • (NotImplementedError)


# File 'lib/classifier/tfidf.rb', line 319

def train_batch(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit instead of train_batch'
end

#train_from_stream ⇒ Object

TFIDF doesn’t support train_from_stream (use fit_from_stream instead). This method raises NotImplementedError with guidance.

Raises:

  • (NotImplementedError)


# File 'lib/classifier/tfidf.rb', line 311

def train_from_stream(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit_from_stream instead of train_from_stream'
end

#transform(document) ⇒ Object

Transforms a document into a normalized TF-IDF vector.

Raises:

  • (NotFittedError)


# File 'lib/classifier/tfidf.rb', line 98

def transform(document)
  raise NotFittedError, 'TFIDF has not been fitted. Call fit first.' unless @fitted

  terms = extract_terms(document)
  result = {} #: Hash[Symbol, Float]

  terms.each do |term, tf|
    next unless @vocabulary.key?(term)

    tf_value = @sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
    result[term] = (tf_value * @idf[term]).to_f
  end

  normalize_vector(result)
end
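The weighting step can be reproduced outside the library. The sketch below assumes that normalize_vector performs L2 (Euclidean) normalization. That is an assumption, since its body is not shown on this page, and all method names here are illustrative.

```ruby
# Raw or sublinear term frequency, as in transform above.
def tf_weight(tf, sublinear: false)
  sublinear && tf.positive? ? 1 + Math.log(tf) : tf.to_f
end

# Assumption: L2 normalization, so the resulting vector has unit length.
def l2_normalize(vector)
  norm = Math.sqrt(vector.values.sum { |w| w * w })
  return vector if norm.zero?

  vector.transform_values { |w| w / norm }
end

# term_counts: {term => count} for one document; idf: {term => weight}.
def weigh(term_counts, idf, sublinear: false)
  raw = term_counts.each_with_object({}) do |(term, tf), acc|
    next unless idf.key?(term) # terms unseen during fit are dropped

    acc[term] = tf_weight(tf, sublinear: sublinear) * idf[term]
  end
  l2_normalize(raw)
end

weigh({ dog: 1, loyal: 1 }, { dog: 1.0, cat: 1.0 })  # => {:dog=>1.0}
```

A document that shares no terms with the fitted vocabulary produces an empty Hash under this sketch, because nothing survives the vocabulary check.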