Class: Classifier::Bayes
Includes: Streaming, Mutex_m
Defined in: lib/classifier/bayes.rb
Constant Summary
Constants included from Streaming
Instance Attribute Summary
-
#storage ⇒ Object
Returns the value of attribute storage.
Class Method Summary
-
.from_json(json) ⇒ Object
Loads a classifier from a JSON string or a Hash created by #to_json or #as_json.
-
.load(storage:) ⇒ Object
Loads a classifier from the configured storage.
-
.load_checkpoint(storage:, checkpoint_id:) ⇒ Object
Loads a classifier from a checkpoint.
-
.load_from_file(path) ⇒ Object
Loads a classifier from a file (legacy API).
Instance Method Summary
-
#add_category(category) ⇒ Object
(also: #append_category)
Allows you to add categories to the classifier.
-
#as_json(_options = nil) ⇒ Object
Returns a hash representation of the classifier state.
-
#categories ⇒ Object
Provides a list of category names, for example ['This', 'That', 'the_other'].
-
#classifications(text) ⇒ Object
Returns the scores in each category for the provided text.
-
#classify(text) ⇒ Object
Returns the classification of the provided text, which is one of the categories given in the initializer.
-
#dirty? ⇒ Boolean
Returns true if there are unsaved changes.
-
#initialize(*categories, min_word_length: Classifier.config.min_word_length) ⇒ Bayes
constructor
The class can be created with one or more categories, each of which will be initialized and given a training method.
-
#marshal_dump ⇒ Object
Custom marshal serialization to exclude mutex state.
-
#marshal_load(data) ⇒ Object
Custom marshal deserialization to recreate mutex.
-
#method_missing(name, *args) ⇒ Object
Provides training and untraining methods for the categories specified in Bayes#new, such as b.train_this "This text" or b.untrain_that "That text".
-
#reload ⇒ Object
Reloads the classifier from the configured storage.
-
#reload! ⇒ Object
Force reloads the classifier from storage, discarding any unsaved changes.
-
#remove_category(category) ⇒ Object
Allows you to remove categories from the classifier.
-
#respond_to_missing?(name, include_private = false) ⇒ Boolean
-
#save ⇒ Object
Saves the classifier to the configured storage.
-
#save_to_file(path) ⇒ Object
Saves the classifier state to a file (legacy API).
-
#to_json(_options = nil) ⇒ Object
Serializes the classifier state to a JSON string.
-
#train(category = nil, text = nil, **categories) ⇒ Object
Trains the classifier with text for a category.
-
#train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) ⇒ Object
Trains the classifier with an array of documents in batches.
-
#train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object
Trains the classifier from an IO stream.
-
#untrain(category = nil, text = nil, **categories) ⇒ Object
Removes training data.
Methods included from Streaming
#delete_checkpoint, #list_checkpoints, #save_checkpoint
Constructor Details
#initialize(*categories, min_word_length: Classifier.config.min_word_length) ⇒ Bayes
The class can be created with one or more categories, each of which will be initialized and given a training method. E.g.,
b = Classifier::Bayes.new 'Interesting', 'Uninteresting', 'Spam'
b = Classifier::Bayes.new ['Interesting', 'Uninteresting', 'Spam']
b = Classifier::Bayes.new 'Spam', min_word_length: 1
# File 'lib/classifier/bayes.rb', line 33
def initialize(*categories, min_word_length: Classifier.config.min_word_length)
  super()
  @categories = {}
  categories.flatten.each { |category| @categories[category.prepare_category_name] = {} }
  @total_words = 0
  @category_counts = Hash.new(0)
  @category_word_count = Hash.new(0)
  @cached_training_count = nil
  @cached_vocab_size = nil
  @dirty = false
  @storage = nil
  @min_word_length = min_word_length
end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method.
#method_missing(name, *args) ⇒ Object
Provides training and untraining methods for the categories specified in Bayes#new. For example:
b = Classifier::Bayes.new 'This', 'That', 'the_other'
b.train_this "This text"
b.train_that "That text"
b.untrain_that "That text"
b.train_the_other "The other text"
# File 'lib/classifier/bayes.rb', line 234
def method_missing(name, *args)
  return super unless name.to_s =~ /(un)?train_(\w+)/

  category = name.to_s.gsub(/(un)?train_(\w+)/, '\2').prepare_category_name
  raise StandardError, "No such category: #{category}" unless @categories.key?(category)

  method = name.to_s.start_with?('untrain_') ? :untrain : :train
  args.each { |text| send(method, category, text) }
end
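The dispatch pattern above can be tried without the gem. The following is a minimal sketch, not the library's code: TinyTrainer and its log array are hypothetical names, and category normalization is simplified to capitalize.to_sym because the behavior of prepare_category_name is not shown on this page.

```ruby
# Hypothetical stand-in for illustration; not part of Classifier.
class TinyTrainer
  # Anchored on purpose; the pattern in the source above is unanchored.
  PATTERN = /\A(un)?train_(\w+)\z/

  attr_reader :log

  def initialize(*categories)
    @categories = categories.map { |c| normalize(c) }
    @log = []
  end

  def train(category, text)
    @log << [:train, category, text]
  end

  def untrain(category, text)
    @log << [:untrain, category, text]
  end

  # Routes train_<category> / untrain_<category> calls to train / untrain.
  def method_missing(name, *args)
    match = PATTERN.match(name.to_s)
    return super unless match

    category = normalize(match[2])
    raise StandardError, "No such category: #{category}" unless @categories.include?(category)

    action = match[1] ? :untrain : :train
    args.each { |text| send(action, category, text) }
  end

  # Keeps respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    PATTERN.match?(name.to_s) || super
  end

  private

  # Simplified normalization (assumption): "this" becomes :This.
  def normalize(name)
    name.to_s.capitalize.to_sym
  end
end

trainer = TinyTrainer.new('This', 'That')
trainer.train_this('first text', 'second text') # one train call per argument
trainer.untrain_that('old text')
```

The sketch anchors its pattern with \A and \z. The regex in the source listing has no anchors, so it would also match a name that merely contains train_ somewhere in the middle.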
Instance Attribute Details
#storage ⇒ Object
Returns the value of attribute storage.
# File 'lib/classifier/bayes.rb', line 25
def storage
  @storage
end
Class Method Details
.from_json(json) ⇒ Object
Loads a classifier from a JSON string or a Hash created by #to_json or #as_json.
# File 'lib/classifier/bayes.rb', line 139
def self.from_json(json)
  data = json.is_a?(String) ? JSON.parse(json) : json
  raise ArgumentError, "Invalid classifier type: #{data['type']}" unless data['type'] == 'bayes'

  instance = allocate
  instance.send(:restore_state, data)
  instance
end
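To see how the type check interacts with key types, here is a small standalone sketch. The state hash is a hypothetical miniature of the serialized form, and load_state is an illustrative lambda that copies only the first two lines of from_json; restore_state is not modelled.

```ruby
require 'json'

# Hypothetical miniature of the serialized state (symbol keys at the top level).
state = {
  version: 1,
  type: 'bayes',
  categories: { 'Spam' => { 'buy' => 2 } },
  total_words: 2
}

# Copies the parsing and type check from the listing above.
load_state = lambda do |input|
  data = input.is_a?(String) ? JSON.parse(input) : input
  raise ArgumentError, "Invalid classifier type: #{data['type']}" unless data['type'] == 'bayes'

  data
end

# A JSON string round-trips: JSON.parse yields string keys.
restored = load_state.call(state.to_json)

# A symbol-keyed hash has no 'type' string key, so the check rejects it.
rejected =
  begin
    load_state.call(state)
    false
  rescue ArgumentError
    true
  end
```

In this sketch the symbol-keyed hash fails the type check because data['type'] looks up a string key. Going by the from_json code shown above, the same would presumably hold for a Hash taken straight from #as_json, so a JSON string or a hash produced by JSON.parse is the safer input.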
.load(storage:) ⇒ Object
Loads a classifier from the configured storage. The storage is set on the returned instance.
# File 'lib/classifier/bayes.rb', line 210
def self.load(storage:)
  data = storage.read
  raise StorageError, 'No saved state found' unless data

  instance = from_json(data)
  instance.storage = storage
  instance
end
.load_checkpoint(storage:, checkpoint_id:) ⇒ Object
Loads a classifier from a checkpoint.
# File 'lib/classifier/bayes.rb', line 376
def self.load_checkpoint(storage:, checkpoint_id:)
  raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)

  dir = File.dirname(storage.path)
  base = File.basename(storage.path, '.*')
  ext = File.extname(storage.path)
  checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")

  checkpoint_storage = Storage::File.new(path: checkpoint_path)
  instance = load(storage: checkpoint_storage)
  instance.storage = storage
  instance
end
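The checkpoint file name is derived from the main storage path with plain File helpers. As a quick illustration, the sketch below repeats that path arithmetic in isolation; checkpoint_path_for is a hypothetical helper name and the example paths are made up.

```ruby
# Hypothetical helper that mirrors the path arithmetic in load_checkpoint.
def checkpoint_path_for(storage_path, checkpoint_id)
  dir  = File.dirname(storage_path)
  base = File.basename(storage_path, '.*') # file name without its extension
  ext  = File.extname(storage_path)
  File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
end

path = checkpoint_path_for('models/spam.json', '50pct')
# => "models/spam_checkpoint_50pct.json"
```

The checkpoint sits next to the main file and keeps its extension, so only the base name changes.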
.load_from_file(path) ⇒ Object
Loads a classifier from a file (legacy API).
# File 'lib/classifier/bayes.rb', line 222
def self.load_from_file(path)
  from_json(File.read(path))
end
Instance Method Details
#add_category(category) ⇒ Object Also known as: append_category
Allows you to add categories to the classifier. For example:
b.add_category "Not spam"
WARNING: Adding categories to a trained classifier will result in an undertrained category that will tend to match more criteria than the trained, more selective categories. In short, try to define all of your categories when you create the classifier.
# File 'lib/classifier/bayes.rb', line 269
def add_category(category)
  synchronize do
    invalidate_caches
    @dirty = true
    @categories[category.prepare_category_name] = {}
  end
end
#as_json(_options = nil) ⇒ Object
Returns a hash representation of the classifier state. This can be converted to JSON or used directly.
# File 'lib/classifier/bayes.rb', line 116
def as_json(_options = nil)
  {
    version: 1,
    type: 'bayes',
    categories: @categories.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) },
    total_words: @total_words,
    category_counts: @category_counts.transform_keys(&:to_s),
    category_word_count: @category_word_count.transform_keys(&:to_s),
    min_word_length: @min_word_length
  }
end
#categories ⇒ Object
Provides a list of category names. For example:
b.categories
=> ['This', 'That', 'the_other']
# File 'lib/classifier/bayes.rb', line 255
def categories
  synchronize { @categories.keys.collect(&:to_s) }
end
#classifications(text) ⇒ Object
Returns the scores in each category for the provided text. E.g.,
b.classifications "I hate bad words and you"
=> {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
The largest of these scores (the one closest to 0) is the one picked out by #classify.
# File 'lib/classifier/bayes.rb', line 81
def classifications(text)
  words = text.word_hash(@min_word_length).keys
  synchronize do
    training_count = cached_training_count
    vocab_size = cached_vocab_size

    @categories.to_h do |category, category_words|
      smoothed_total = ((@category_word_count[category] || 0) + vocab_size).to_f

      # Laplace smoothing: P(word|category) = (count + α) / (total + α * V)
      word_score = words.sum { |w| Math.log(((category_words[w] || 0) + 1) / smoothed_total) }
      prior_score = Math.log((@category_counts[category] || 0.1) / training_count)

      [category.to_s, word_score + prior_score]
    end
  end
end
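The scoring formula can be followed on toy numbers. The sketch below is a standalone illustration with α = 1, not the gem's code: the counts are hypothetical, the vocabulary size is taken to be the number of distinct words across all categories, and the per-category total is taken to be the sum of that category's word counts. Both are assumptions about what cached_vocab_size and @category_word_count hold.

```ruby
# Toy counts (hypothetical data, for illustration only).
category_words = {
  'Spam' => { 'buy' => 3, 'now' => 2 },
  'Ham'  => { 'meeting' => 2, 'now' => 1 }
}
category_counts = { 'Spam' => 2, 'Ham' => 1 } # documents trained per category

# Assumption: vocabulary = distinct words across all categories (3 here).
vocab_size = category_words.values.flat_map(&:keys).uniq.size
training_count = category_counts.values.sum.to_f

# Log-probability score with Laplace (add-one) smoothing.
score = lambda do |words, counts, docs_in_category|
  smoothed_total = (counts.values.sum + vocab_size).to_f
  word_score = words.sum { |w| Math.log((counts.fetch(w, 0) + 1) / smoothed_total) }
  prior_score = Math.log(docs_in_category / training_count)
  word_score + prior_score
end

scores = category_words.to_h do |category, counts|
  [category, score.call(%w[buy now], counts, category_counts[category])]
end

# The highest score (closest to 0) wins.
best = scores.max_by { |_category, value| value }.first
```

With these numbers the Spam score is log(4/8) + log(3/8) + log(2/3) = log(0.125), and the Ham score is log(1/6) + log(2/6) + log(1/3) = log(1/54), so Spam is selected. The word "buy" never occurs in Ham, and add-one smoothing is what keeps its term finite.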
#classify(text) ⇒ Object
Returns the classification of the provided text, which is one of the categories given in the initializer. E.g.,
b.classify "I hate bad words and you"
=> 'Uninteresting'
# File 'lib/classifier/bayes.rb', line 105
def classify(text)
  best = classifications(text).min_by { |a| -a[1] }
  raise StandardError, 'No classifications available' unless best

  best.first.to_s
end
#dirty? ⇒ Boolean
Returns true if there are unsaved changes.
# File 'lib/classifier/bayes.rb', line 202
def dirty?
  @dirty
end
#marshal_dump ⇒ Object
Custom marshal serialization to exclude mutex state.
# File 'lib/classifier/bayes.rb', line 281
def marshal_dump
  [@categories, @total_words, @category_counts, @category_word_count, @dirty]
end
#marshal_load(data) ⇒ Object
Custom marshal deserialization to recreate mutex.
# File 'lib/classifier/bayes.rb', line 287
def marshal_load(data)
  mu_initialize
  @categories, @total_words, @category_counts, @category_word_count, @dirty = data
  @cached_training_count = nil
  @cached_vocab_size = nil
  @storage = nil
end
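The reason for the custom hooks is that a mutex cannot be marshalled, so only plain data is dumped and the lock is rebuilt on load. Below is a minimal sketch of the same idea using a plain Mutex instance variable instead of Mutex_m; Counter is a hypothetical class, not part of the gem.

```ruby
# Hypothetical class for illustration; the real Bayes includes Mutex_m instead.
class Counter
  attr_reader :counts

  def initialize
    @lock = Mutex.new
    @counts = Hash.new(0)
  end

  def add(word)
    @lock.synchronize { @counts[word] += 1 }
  end

  # Dump only plain data; the Mutex is deliberately left out.
  def marshal_dump
    [@counts]
  end

  # Restore the data and build a fresh lock for the new object.
  def marshal_load(data)
    @counts, = data
    @lock = Mutex.new
  end
end

original = Counter.new
original.add('spam')

copy = Marshal.load(Marshal.dump(original))
copy.add('spam') # works because marshal_load recreated the lock
```

Anything left out of marshal_dump is not restored. In the listing above, @min_word_length is not among the dumped values, so it presumably comes back as nil after a Marshal round trip.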
#reload ⇒ Object
Reloads the classifier from the configured storage. Raises UnsavedChangesError if there are unsaved changes. Use reload! to force reload and discard changes.
# File 'lib/classifier/bayes.rb', line 173
def reload
  raise ArgumentError, 'No storage configured' unless storage
  raise UnsavedChangesError, 'Unsaved changes would be lost. Call save first or use reload!' if @dirty

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end
#reload! ⇒ Object
Force reloads the classifier from storage, discarding any unsaved changes.
# File 'lib/classifier/bayes.rb', line 188
def reload!
  raise ArgumentError, 'No storage configured' unless storage

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end
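The difference between reload and reload! comes down to one guard on the dirty flag. The following sketch reproduces that guard with an in-memory store so it runs without the gem. MemoryStorage, Model, and the UnsavedChangesError defined here are hypothetical stand-ins, and the stored state is a plain array rather than JSON.

```ruby
# Hypothetical stand-ins for illustration; none of these come from the gem.
class UnsavedChangesError < StandardError; end

class MemoryStorage
  def read
    @data
  end

  def write(data)
    @data = data
  end
end

class Model
  attr_reader :words

  def initialize(storage)
    @storage = storage
    @words = []
    @dirty = false
  end

  def add(word)
    @words << word
    @dirty = true
  end

  def dirty?
    @dirty
  end

  def save
    @storage.write(@words.dup)
    @dirty = false
  end

  # Refuses to overwrite unsaved changes.
  def reload
    raise UnsavedChangesError, 'Unsaved changes would be lost' if @dirty

    reload!
  end

  # Always replaces in-memory state with the stored state.
  def reload!
    data = @storage.read
    raise 'No saved state found' unless data

    @words = data.dup
    @dirty = false
    self
  end
end

model = Model.new(MemoryStorage.new)
model.add('spam')
model.save
model.add('ham') # unsaved change

blocked =
  begin
    model.reload
    false
  rescue UnsavedChangesError
    true
  end

model.reload! # discards 'ham'
```

After the last call the model holds only the saved word and is no longer dirty, while the earlier reload was refused.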
#remove_category(category) ⇒ Object
Allows you to remove categories from the classifier. For example:
b.remove_category "Spam"
WARNING: Removing categories from a trained classifier will result in the loss of all training data for that category. Make sure you really want to do this before calling this method.
# File 'lib/classifier/bayes.rb', line 304
def remove_category(category)
  category = category.prepare_category_name
  synchronize do
    raise StandardError, "No such category: #{category}" unless @categories.key?(category)

    invalidate_caches
    @dirty = true
    @total_words -= @category_word_count[category].to_i

    @categories.delete(category)
    @category_counts.delete(category)
    @category_word_count.delete(category)
  end
end
#respond_to_missing?(name, include_private = false) ⇒ Boolean
# File 'lib/classifier/bayes.rb', line 245
def respond_to_missing?(name, include_private = false)
  !!(name.to_s =~ /(un)?train_(\w+)/) || super
end
#save ⇒ Object
Saves the classifier to the configured storage. Raises ArgumentError if no storage is configured.
# File 'lib/classifier/bayes.rb', line 152
def save
  raise ArgumentError, 'No storage configured. Use save_to_file(path) or set storage=' unless storage

  storage.write(to_json)
  @dirty = false
end
#save_to_file(path) ⇒ Object
Saves the classifier state to a file (legacy API).
# File 'lib/classifier/bayes.rb', line 162
def save_to_file(path)
  result = File.write(path, to_json)
  @dirty = false
  result
end
#to_json(_options = nil) ⇒ Object
Serializes the classifier state to a JSON string. This can be saved to a file and later loaded with Bayes.from_json.
# File 'lib/classifier/bayes.rb', line 132
def to_json(_options = nil)
  as_json.to_json
end
#train(category = nil, text = nil, **categories) ⇒ Object
Trains the classifier with text for a category.
b.train(spam: "Buy now!", ham: ["Hello", "Meeting tomorrow"])
b.train(:spam, "legacy positional API")
# File 'lib/classifier/bayes.rb', line 53
def train(category = nil, text = nil, **categories)
  return train_single(category, text) if category && text

  categories.each do |cat, texts|
    (texts.is_a?(Array) ? texts : [texts]).each { |t| train_single(cat, t) }
  end
end
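Both calling conventions reduce to a list of category and text pairs. The sketch below isolates that normalization so the argument handling can be inspected on its own; training_pairs is a hypothetical helper, and it uses Array() where the source uses an explicit is_a?(Array) check.

```ruby
# Hypothetical helper: flattens train-style arguments into [category, text] pairs.
def training_pairs(category = nil, text = nil, **categories)
  # Positional form: a single category with a single text.
  return [[category, text]] if category && text

  # Keyword form: each value may be one text or an array of texts.
  categories.flat_map do |cat, texts|
    Array(texts).map { |t| [cat, t] }
  end
end

keyword_pairs = training_pairs(spam: 'Buy now!', ham: ['Hello', 'Meeting tomorrow'])
positional_pairs = training_pairs(:spam, 'legacy positional API')
incomplete = training_pairs(:spam) # category without text
```

One consequence visible in the sketch: a positional category with no text yields an empty list. Reading the train listing above, the same call would presumably fall through to the empty keyword hash and train nothing without raising.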
#train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) ⇒ Object
Trains the classifier with an array of documents in batches. Reduces lock contention by processing multiple documents per synchronize call.
# File 'lib/classifier/bayes.rb', line 363
def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
  if category && documents
    train_batch_for_category(category, documents, batch_size: batch_size, &block)
  else
    categories.each do |cat, docs|
      train_batch_for_category(cat, Array(docs), batch_size: batch_size, &block)
    end
  end
end
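The claim about reduced lock contention can be made concrete by counting lock acquisitions. This is a sketch of the general idea only: BatchCounter is a hypothetical class, word counting stands in for real training, and train_batch_for_category is not shown on this page, so its actual behavior is not modelled here.

```ruby
# Hypothetical class for illustration; not the gem's implementation.
class BatchCounter
  attr_reader :counts, :lock_acquisitions

  def initialize
    @lock = Mutex.new
    @counts = Hash.new(0)
    @lock_acquisitions = 0
  end

  # Takes the lock once per batch rather than once per document.
  def train_batch(documents, batch_size: 2)
    documents.each_slice(batch_size) do |batch|
      @lock.synchronize do
        @lock_acquisitions += 1
        batch.each { |doc| doc.split.each { |word| @counts[word] += 1 } }
      end
      # Report progress outside the lock so the callback cannot block other threads.
      yield batch.size if block_given?
    end
  end
end

counter = BatchCounter.new
sizes = []
documents = ['buy now', 'buy cheap', 'free now', 'hello', 'bye']
counter.train_batch(documents, batch_size: 2) { |n| sizes << n }
```

Five documents with a batch size of 2 take the lock three times instead of five. A larger batch size lowers that number further, at the cost of holding the lock longer each time.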
#train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object
Trains the classifier from an IO stream. Each line in the stream is treated as a separate document. This is memory-efficient for large corpora.
# File 'lib/classifier/bayes.rb', line 332
def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
  category = category.prepare_category_name
  raise StandardError, "No such category: #{category}" unless @categories.key?(category)

  reader = Streaming::LineReader.new(io, batch_size: batch_size)
  total = reader.estimate_line_count
  progress = Streaming::Progress.new(total: total)

  reader.each_batch do |batch|
    train_batch_internal(category, batch)
    progress.completed += batch.size
    progress.current_batch += 1
    yield progress if block_given?
  end
end
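Line-by-line batching of an IO can be sketched with the standard library alone. The code below is an approximation of the reading loop, not the gem's Streaming::LineReader: Progress is a hypothetical two-field struct, each_line_batch is an illustrative helper, and the total line estimate is left out.

```ruby
require 'stringio'

# Hypothetical progress record; the real Streaming::Progress may carry more fields.
Progress = Struct.new(:completed, :current_batch)

# Reads the IO lazily, one line at a time, and yields fixed-size batches.
def each_line_batch(io, batch_size:)
  progress = Progress.new(0, 0)
  io.each_line(chomp: true).each_slice(batch_size) do |batch|
    progress.completed += batch.size
    progress.current_batch += 1
    yield batch, progress
  end
end

io = StringIO.new("buy now\ncheap pills\nfree money\n")
seen = []
each_line_batch(io, batch_size: 2) do |batch, progress|
  seen << [batch, progress.completed]
end
```

Because each_line without a block returns an enumerator, each_slice pulls lines as it goes and only one batch is held in memory at a time. The same helper would work with a File opened for reading.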
#untrain(category = nil, text = nil, **categories) ⇒ Object
Removes training data. Be careful with this method.
b.untrain(spam: "Buy now!")
b.untrain(:spam, "legacy positional API")
# File 'lib/classifier/bayes.rb', line 67
def untrain(category = nil, text = nil, **categories)
  return untrain_single(category, text) if category && text

  categories.each do |cat, texts|
    (texts.is_a?(Array) ? texts : [texts]).each { |t| untrain_single(cat, t) }
  end
end