Class: Classifier::TFIDF

Inherits: Object

Includes: Streaming

Defined in: lib/classifier/tfidf.rb

Overview

TF-IDF vectorizer: transforms text to weighted feature vectors. Downweights common words, upweights discriminative terms.

Example:

tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent"])
tfidf.transform("Dogs are loyal")  # => Hash of weights for fitted terms only, e.g. {:dog=>1.0}; "loyal" was not seen during fit and is dropped

Constant Summary

Constants included from Streaming

Streaming::DEFAULT_BATCH_SIZE

Instance Attribute Summary

#idf ⇒ Object readonly

Returns the Hash of IDF weights learned by fit, keyed by term.
#num_documents ⇒ Object readonly

Returns the number of documents in the corpus passed to the most recent fit.
#storage ⇒ Object

Returns the storage backend used by save and reload (nil until one is assigned).
#vocabulary ⇒ Object readonly

Returns the Hash that maps each term to its feature index.

Class Method Summary

.from_json(json) ⇒ Object

Loads a vectorizer from JSON.
.load(storage:) ⇒ Object

Loads a vectorizer from the configured storage.
.load_checkpoint(storage:, checkpoint_id:) ⇒ Object

Loads a vectorizer from a checkpoint.
.load_from_file(path) ⇒ Object

Loads a vectorizer from a file.

Instance Method Summary

#as_json(_options = nil) ⇒ Object
#dirty? ⇒ Boolean

Returns true if there are unsaved changes.
#feature_names ⇒ Object

Returns vocabulary terms in index order.
#fit(documents) ⇒ Object

Learns vocabulary and IDF weights from the corpus.
#fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object

Fits the vectorizer from an IO stream.
#fit_transform(documents) ⇒ Object

Fits and transforms in one step.
#fitted? ⇒ Boolean
#initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false, min_word_length: Classifier.config.min_word_length) ⇒ TFIDF constructor

Creates a new TF-IDF vectorizer.
#marshal_dump ⇒ Object
#marshal_load(data) ⇒ Object
#reload ⇒ Object

Reloads the vectorizer from storage, raising if there are unsaved changes.
#reload! ⇒ Object

Force reloads the vectorizer from storage, discarding any unsaved changes.
#save ⇒ Object

Saves the vectorizer to the configured storage.
#save_to_file(path) ⇒ Object

Saves the vectorizer state to a file.
#to_json(_options = nil) ⇒ Object
#train_batch ⇒ Object

TFIDF doesn’t support train_batch (use fit instead).
#train_from_stream ⇒ Object

TFIDF doesn’t support train_from_stream (use fit_from_stream instead).
#transform(document) ⇒ Object

Transforms a document into a normalized TF-IDF vector.

Methods included from Streaming

#delete_checkpoint, #list_checkpoints, #save_checkpoint

Constructor Details

#initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false, min_word_length: Classifier.config.min_word_length) ⇒ `TFIDF`

Creates a new TF-IDF vectorizer.

min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
sublinear_tf: use 1 + log(tf) instead of raw term frequency
min_word_length: minimum word length filter in tokenization

# File 'lib/classifier/tfidf.rb', line 44

def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false,
               min_word_length: Classifier.config.min_word_length)
  validate_df!(min_df, 'min_df')
  validate_df!(max_df, 'max_df')
  validate_ngram_range!(ngram_range)

  @min_df = min_df
  @max_df = max_df
  @ngram_range = ngram_range
  @sublinear_tf = sublinear_tf
  @vocabulary = {}
  @idf = {}
  @num_documents = 0
  @fitted = false
  @dirty = false
  @storage = nil
  @min_word_length = min_word_length
end
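
The private helpers that apply these options are not shown on this page, so the following is only a sketch of the documented semantics: an Integer bound is read as an absolute document count, a Float bound as a proportion of the corpus, and ngram_range as an inclusive range of n-gram sizes. The helper names df_within_bounds? and ngrams, and the underscore used to join n-gram tokens, are assumptions for illustration and not part of the library.

```ruby
# Hypothetical helper: Integer bounds are document counts,
# Float bounds are proportions of the corpus size.
def df_within_bounds?(df, num_documents, min_df: 1, max_df: 1.0)
  min_count = min_df.is_a?(Float) ? min_df * num_documents : min_df
  max_count = max_df.is_a?(Float) ? max_df * num_documents : max_df
  df >= min_count && df <= max_count
end

# Hypothetical helper: builds every n-gram whose size falls in ngram_range.
# Joining tokens with '_' is an assumption made for this sketch.
def ngrams(tokens, ngram_range)
  min_n, max_n = ngram_range
  (min_n..max_n).flat_map do |n|
    tokens.each_cons(n).map { |gram| gram.join('_').to_sym }
  end
end

df_within_bounds?(1, 10, min_df: 2)   # => false (fewer than 2 documents)
df_within_bounds?(9, 10, max_df: 0.8) # => false (more than 80% of documents)
ngrams(%w[dog great pet], [1, 2])
# => [:dog, :great, :pet, :dog_great, :great_pet]
```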

Instance Attribute Details

#idf ⇒ `Object` (readonly)

Returns the Hash of IDF weights learned by fit, keyed by term. It is empty until the vectorizer has been fitted.

# File 'lib/classifier/tfidf.rb', line 33

def idf
  @idf
end

#num_documents ⇒ `Object` (readonly)

Returns the number of documents in the corpus passed to the most recent fit (0 before fitting).

# File 'lib/classifier/tfidf.rb', line 33

def num_documents
  @num_documents
end

#storage ⇒ `Object`

Returns the storage backend used by save, reload, and reload!. It is nil until one is assigned.

# File 'lib/classifier/tfidf.rb', line 34

def storage
  @storage
end

#vocabulary ⇒ `Object` (readonly)

Returns the Hash that maps each term to its feature index. It is empty until the vectorizer has been fitted.

# File 'lib/classifier/tfidf.rb', line 33

def vocabulary
  @vocabulary
end

Class Method Details

.from_json(json) ⇒ `Object`

Loads a vectorizer from JSON.

Raises:

(ArgumentError)

# File 'lib/classifier/tfidf.rb', line 223

def self.from_json(json)
  data = json.is_a?(String) ? JSON.parse(json) : json
  raise ArgumentError, "Invalid vectorizer type: #{data['type']}" unless data['type'] == 'tfidf'

  instance = new(
    min_df: data['min_df'],
    max_df: data['max_df'],
    ngram_range: data['ngram_range'],
    sublinear_tf: data['sublinear_tf'],
    min_word_length: data['min_word_length'] || Classifier.config.min_word_length
  )

  instance.instance_variable_set(:@vocabulary, symbolize_keys(data['vocabulary']))
  instance.instance_variable_set(:@idf, symbolize_keys(data['idf']))
  instance.instance_variable_set(:@num_documents, data['num_documents'])
  instance.instance_variable_set(:@fitted, data['fitted'])
  instance.instance_variable_set(:@dirty, false)
  instance.instance_variable_set(:@storage, nil)

  instance
end
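
The symbolize_keys helper called above is private and not shown here. The sketch below only illustrates why such a step is needed: JSON has no Symbol type, so the Symbol keys of vocabulary and idf come back as Strings after a round trip. Hash#transform_keys is used as a stand-in for the library's helper.

```ruby
require 'json'

vocabulary = { dog: 0, cat: 1 }

# JSON has no Symbol type, so keys come back as Strings after a round trip.
parsed = JSON.parse(JSON.generate(vocabulary))
# parsed => {"dog"=>0, "cat"=>1}

# Converting the keys back to Symbols restores the original lookups.
restored = parsed.transform_keys(&:to_sym)
# restored == vocabulary => true
```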

.load(storage:) ⇒ `Object`

Loads a vectorizer from the configured storage.

Raises:

(StorageError)

# File 'lib/classifier/tfidf.rb', line 157

def self.load(storage:)
  data = storage.read
  raise StorageError, 'No saved state found' unless data

  instance = from_json(data)
  instance.storage = storage
  instance
end

.load_checkpoint(storage:, checkpoint_id:) ⇒ `Object`

Loads a vectorizer from a checkpoint.

Raises:

(ArgumentError) if storage is not a Storage::File

(StorageError) if nothing can be read from the checkpoint file

# File 'lib/classifier/tfidf.rb', line 262

def self.load_checkpoint(storage:, checkpoint_id:)
  raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)

  dir = File.dirname(storage.path)
  base = File.basename(storage.path, '.*')
  ext = File.extname(storage.path)
  checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")

  checkpoint_storage = Storage::File.new(path: checkpoint_path)
  instance = load(storage: checkpoint_storage)
  instance.storage = storage
  instance
end
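
The checkpoint file name is derived from the main storage path. The path arithmetic can be tried in isolation with Ruby's File helpers; the method name checkpoint_path_for and the sample path and checkpoint id below are made up for illustration.

```ruby
# Mirrors the path arithmetic in load_checkpoint using only File helpers.
def checkpoint_path_for(path, checkpoint_id)
  dir = File.dirname(path)
  base = File.basename(path, '.*')
  ext = File.extname(path)
  File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
end

checkpoint_path_for('models/tfidf.json', 'batch_10')
# => "models/tfidf_checkpoint_batch_10.json"
```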

.load_from_file(path) ⇒ `Object`

Loads a vectorizer from a JSON file, such as one written by save_to_file. The returned instance has no storage configured.

# File 'lib/classifier/tfidf.rb', line 168

def self.load_from_file(path)
  from_json(File.read(path))
end

Instance Method Details

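#as_json(_options = nil) ⇒ `Object`

Returns a Hash holding the vectorizer's configuration and fitted state, ready for JSON serialization.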

# File 'lib/classifier/tfidf.rb', line 200

def as_json(_options = nil)
  {
    version: 1,
    type: 'tfidf',
    min_df: @min_df,
    max_df: @max_df,
    ngram_range: @ngram_range,
    sublinear_tf: @sublinear_tf,
    vocabulary: @vocabulary,
    idf: @idf,
    num_documents: @num_documents,
    fitted: @fitted,
    min_word_length: @min_word_length
  }
end

#dirty? ⇒ `Boolean`

Returns true if there are unsaved changes.

Returns:

(Boolean)

# File 'lib/classifier/tfidf.rb', line 134

def dirty?
  @dirty
end

#feature_names ⇒ `Object`

Returns the vocabulary terms as an Array, sorted by the feature index assigned during fit.

# File 'lib/classifier/tfidf.rb', line 123

def feature_names
  @vocabulary.keys.sort_by { |term| @vocabulary[term] }
end
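
One use for index order is turning the sparse Hash returned by transform into a dense array. The sketch below uses a hand-written vocabulary and made-up weights rather than a fitted vectorizer.

```ruby
# Hand-written vocabulary: term => feature index (illustrative values).
vocabulary = { pet: 2, dog: 0, great: 1 }

# Same ordering rule as feature_names: sort terms by their index.
feature_names = vocabulary.keys.sort_by { |term| vocabulary[term] }
# => [:dog, :great, :pet]

# A sparse vector of the kind transform returns (weights are made up).
sparse = { dog: 0.6, pet: 0.8 }

# Missing terms become 0.0 in the dense representation.
dense = feature_names.map { |term| sparse.fetch(term, 0.0) }
# => [0.6, 0.0, 0.8]
```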

#fit(documents) ⇒ `Object`

Learns vocabulary and IDF weights from the corpus.

Raises:

(ArgumentError)

# File 'lib/classifier/tfidf.rb', line 65

def fit(documents)
  raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
  raise ArgumentError, 'documents cannot be empty' if documents.empty?

  @num_documents = documents.size
  document_frequencies = Hash.new(0)

  documents.each do |doc|
    terms = extract_terms(doc)
    terms.each_key { |term| document_frequencies[term] += 1 }
  end

  @vocabulary = {}
  @idf = {}
  vocab_index = 0

  document_frequencies.each do |term, df|
    next unless within_df_bounds?(df, @num_documents)

    @vocabulary[term] = vocab_index
    vocab_index += 1

    # IDF: log((N + 1) / (df + 1)) + 1
    @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
  end

  @fitted = true
  @dirty = true
  self
end
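
To get a feel for the smoothed IDF formula used in fit, it can be evaluated on its own. The document frequencies below are made-up numbers for a four-document corpus, and smoothed_idf is a name chosen for this sketch.

```ruby
# Smoothed IDF as computed in fit: log((N + 1) / (df + 1)) + 1
def smoothed_idf(num_documents, df)
  Math.log((num_documents + 1).to_f / (df + 1)) + 1
end

# Toy document frequencies for a 4-document corpus (made-up numbers).
document_frequencies = { the: 4, dog: 2, loyal: 1 }
idf = document_frequencies.transform_values { |df| smoothed_idf(4, df) }

idf[:the]   # => 1.0 (appears in every document, so the log term is 0)
idf[:dog]   # log(5/3) + 1, roughly 1.51
idf[:loyal] # log(5/2) + 1, roughly 1.92
```

A term that occurs in every document still receives a weight of 1.0 rather than 0, so it is downweighted but not discarded.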

#fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ `Object`

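Fits the vectorizer from an IO stream. Collects all documents from the stream, then fits the model. Note: all documents are held in memory, because IDF weights can only be computed once the whole corpus is known. If the stream yields no documents, fit is not called and the vectorizer stays unfitted.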

Examples:

Fit from a file

# The block form closes the file once fitting is done.
File.open('corpus.txt') { |io| tfidf.fit_from_stream(io) }

With progress tracking

tfidf.fit_from_stream(io, batch_size: 500) do |progress|
  puts "#{progress.completed} documents loaded"
end

# File 'lib/classifier/tfidf.rb', line 289

def fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
  reader = Streaming::LineReader.new(io, batch_size: batch_size)
  total = reader.estimate_line_count
  progress = Streaming::Progress.new(total: total)

  documents = [] #: Array[String]

  reader.each_batch do |batch|
    documents.concat(batch)
    progress.completed += batch.size
    progress.current_batch += 1
    yield progress if block_given?
  end

  fit(documents) unless documents.empty?
  self
end
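
Streaming::LineReader is not documented on this page. As a rough sketch of the batching pattern, assuming one document per line, the loop can be imitated with StringIO and each_slice. The helper each_line_batch is hypothetical and stands in for the reader.

```ruby
require 'stringio'

# Hypothetical stand-in for the stream reader:
# yields arrays of up to batch_size chomped lines.
def each_line_batch(io, batch_size:)
  io.each_line.each_slice(batch_size) { |lines| yield lines.map(&:chomp) }
end

io = StringIO.new("Dogs are great pets\nCats are independent\nBirds can fly\n")
documents = []
batches = 0

each_line_batch(io, batch_size: 2) do |batch|
  documents.concat(batch) # every line is retained, as in fit_from_stream
  batches += 1
end

documents.size # => 3
batches        # => 2
```

Batching here only controls how often progress is reported; it does not reduce peak memory, since every document is kept until fit runs.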

#fit_transform(documents) ⇒ `Object`

Fits and transforms in one step.

# File 'lib/classifier/tfidf.rb', line 116

def fit_transform(documents)
  fit(documents)
  documents.map { |doc| transform(doc) }
end

#fitted? ⇒ `Boolean`

Returns true once the vectorizer has been fitted, either by calling fit or by loading a fitted state.

Returns:

(Boolean)

# File 'lib/classifier/tfidf.rb', line 128

def fitted?
  @fitted
end

#marshal_dump ⇒ `Object`

# File 'lib/classifier/tfidf.rb', line 246

def marshal_dump
  [@min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted,
   @min_word_length]
end

#marshal_load(data) ⇒ `Object`

# File 'lib/classifier/tfidf.rb', line 252

def marshal_load(data)
  @min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted,
    @min_word_length = data
  @dirty = false
  @storage = nil
end
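
The two Marshal hooks above dump persistent state as a plain Array and reset the transient fields on load. A miniature, hypothetical class (TinyVectorizer, not part of the library) shows the same pattern end to end.

```ruby
# Hypothetical miniature class using the same array-based Marshal hooks.
class TinyVectorizer
  attr_reader :vocabulary, :idf

  def initialize(vocabulary, idf)
    @vocabulary = vocabulary
    @idf = idf
    @dirty = true
  end

  def dirty?
    @dirty
  end

  # Only persistent state is dumped; the dirty flag is left out.
  def marshal_dump
    [@vocabulary, @idf]
  end

  # Restores the state and marks the copy as clean.
  def marshal_load(data)
    @vocabulary, @idf = data
    @dirty = false
  end
end

original = TinyVectorizer.new({ dog: 0 }, { dog: 1.0 })
copy = Marshal.load(Marshal.dump(original))

copy.vocabulary # => {:dog=>0}
copy.dirty?     # => false
```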

#reload ⇒ `Object`

Reloads the vectorizer from storage, raising if there are unsaved changes.

Raises:

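(ArgumentError) if no storage is configured

(UnsavedChangesError) if there are unsaved changes

(StorageError) if the storage holds no saved state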

# File 'lib/classifier/tfidf.rb', line 174

def reload
  raise ArgumentError, 'No storage configured' unless storage
  raise UnsavedChangesError, 'Unsaved changes would be lost. Call save first or use reload!' if @dirty

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#reload! ⇒ `Object`

Force reloads the vectorizer from storage, discarding any unsaved changes.

Raises:

(ArgumentError) if no storage is configured

(StorageError) if the storage holds no saved state

# File 'lib/classifier/tfidf.rb', line 188

def reload!
  raise ArgumentError, 'No storage configured' unless storage

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end
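
The difference between reload and reload! comes down to the dirty flag. The sketch below reproduces that contract with hypothetical stand-ins (MemoryStorage and TinyModel, plus a locally defined UnsavedChangesError) and does not use the library's own classes.

```ruby
# Defined locally so the sketch is self-contained.
class UnsavedChangesError < StandardError; end

# Hypothetical in-memory stand-in for a storage backend.
class MemoryStorage
  def initialize
    @data = nil
  end

  def write(data)
    @data = data
  end

  def read
    @data
  end
end

# Hypothetical miniature model with the same dirty-flag contract.
class TinyModel
  attr_reader :value

  def initialize(storage)
    @storage = storage
    @value = nil
    @dirty = false
  end

  def update(value)
    @value = value
    @dirty = true
  end

  def save
    @storage.write(@value)
    @dirty = false
  end

  # Refuses to overwrite unsaved changes.
  def reload
    raise UnsavedChangesError, 'Unsaved changes would be lost' if @dirty

    reload!
  end

  # Always restores the stored state.
  def reload!
    @value = @storage.read
    @dirty = false
    self
  end
end

model = TinyModel.new(MemoryStorage.new)
model.update('v1')
model.save
model.update('v2')

begin
  model.reload
rescue UnsavedChangesError
  model.reload! # discards 'v2' and restores 'v1'
end

model.value # => "v1"
```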

#save ⇒ `Object`

Saves the vectorizer to the configured storage.

Raises:

(ArgumentError)

# File 'lib/classifier/tfidf.rb', line 140

def save
  raise ArgumentError, 'No storage configured' unless storage

  storage.write(to_json)
  @dirty = false
end

#save_to_file(path) ⇒ `Object`

Saves the vectorizer state to a file.

# File 'lib/classifier/tfidf.rb', line 149

def save_to_file(path)
  result = File.write(path, to_json)
  @dirty = false
  result
end

#to_json(_options = nil) ⇒ `Object`

Serializes the Hash returned by as_json to a JSON string.

# File 'lib/classifier/tfidf.rb', line 217

def to_json(_options = nil)
  JSON.generate(as_json)
end

#train_batch ⇒ `Object`

TFIDF doesn’t support train_batch (use fit instead). This method raises NotImplementedError with guidance.

Raises:

(NotImplementedError)

# File 'lib/classifier/tfidf.rb', line 319

def train_batch(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit instead of train_batch'
end

#train_from_stream ⇒ `Object`

TFIDF doesn’t support train_from_stream (use fit_from_stream instead). This method raises NotImplementedError with guidance.

Raises:

(NotImplementedError)

# File 'lib/classifier/tfidf.rb', line 311

def train_from_stream(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit_from_stream instead of train_from_stream'
end

#transform(document) ⇒ `Object`

Transforms a document into a normalized TF-IDF vector.

Raises:

(NotFittedError)

# File 'lib/classifier/tfidf.rb', line 98

def transform(document)
  raise NotFittedError, 'TFIDF has not been fitted. Call fit first.' unless @fitted

  terms = extract_terms(document)
  result = {} #: Hash[Symbol, Float]

  terms.each do |term, tf|
    next unless @vocabulary.key?(term)

    tf_value = @sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
    result[term] = (tf_value * @idf[term]).to_f
  end

  normalize_vector(result)
end
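
The weighting step can be followed with a small standalone sketch. Two things here are assumptions: normalize_vector is not shown on this page, and L2 normalization is assumed because it matches the 0.7071 values in the Overview example. The helper names l2_normalize and weigh and the IDF values are made up for illustration.

```ruby
# Assumption: normalize_vector performs L2 normalization.
def l2_normalize(vector)
  norm = Math.sqrt(vector.values.sum { |v| v * v })
  return vector if norm.zero?

  vector.transform_values { |v| v / norm }
end

# Hypothetical helper mirroring the loop in transform.
def weigh(term_counts, idf, sublinear_tf: false)
  weights = {}
  term_counts.each do |term, tf|
    next unless idf.key?(term) # terms unseen during fit are skipped

    tf_value = sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
    weights[term] = tf_value * idf[term]
  end
  l2_normalize(weights)
end

idf = { dog: 1.0, pet: 1.0 } # made-up IDF weights
vector = weigh({ dog: 3, pet: 4, loyal: 1 }, idf)
# => {:dog=>0.6, :pet=>0.8}
```

With sublinear_tf enabled, a count of 3 contributes 1 + log(3) instead of 3, which dampens the effect of repeated terms.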
