Class: Classifier::LSI

Inherits: Object

Includes: Streaming, Mutex_m

Defined in: lib/classifier/lsi.rb, lib/classifier/lsi/incremental_svd.rb

Overview

This class implements a Latent Semantic Indexer, which can search, classify and cluster data based on underlying semantic relations. For more information on the algorithms used, please consult Wikipedia.

Defined Under Namespace

Modules: IncrementalSVD

Constant Summary

DEFAULT_MAX_RANK = Default maximum rank for incremental SVD

Constants included from Streaming

Streaming::DEFAULT_BATCH_SIZE

Class Attribute Summary

.backend ⇒ Object

Returns the value of attribute backend.

Instance Attribute Summary

#auto_rebuild ⇒ Object

Returns the value of attribute auto_rebuild.
#singular_values ⇒ Object (readonly)

Returns the value of attribute singular_values.
#storage ⇒ Object

Returns the value of attribute storage.
#word_list ⇒ Object (readonly)

Returns the value of attribute word_list.

Class Method Summary

.from_json(json) ⇒ Object

Loads an LSI index from a JSON string or Hash created by #to_json or #as_json.
.load(storage:) ⇒ Object

Loads an LSI index from the configured storage.
.load_checkpoint(storage:, checkpoint_id:) ⇒ Object

Loads an LSI index from a checkpoint.
.load_from_file(path) ⇒ Object

Loads an LSI index from a file (legacy API).
.matrix_class ⇒ Object

Get the Matrix class for the current backend.
.native_available? ⇒ Boolean

Check if using native C extension.
.vector_class ⇒ Object

Get the Vector class for the current backend.

Instance Method Summary

#<<(item) ⇒ Object

A less flexible shorthand for add_item that assumes you are passing in a string with no categories.
#add(**items) ⇒ Object

Adds items to the index using hash-style syntax.
#add_batch(batch_size: Streaming::DEFAULT_BATCH_SIZE, **items) ⇒ Object

Adds items to the index in batches from an array.
#add_item(item, *categories, &block) ⇒ Object (deprecated)

Deprecated. Use #add instead for clearer hash-style syntax.
#as_json ⇒ Object

Returns a hash representation of the LSI index.
#build_index(cutoff = 0.75, force: false) ⇒ Object

This function rebuilds the index if needs_rebuild? returns true.
#categories_for(item) ⇒ Object

Returns the categories for a given indexed item.
#classify(doc, cutoff = 0.30, &block) ⇒ Object

This function uses a voting system to categorize documents, based on the categories of other documents.
#classify_with_confidence(doc, cutoff = 0.30, &block) ⇒ Object

Returns the same category as classify() but also returns a confidence value derived from the vote share that the winning category got.
#current_rank ⇒ Object

Returns the current rank of the incremental SVD (number of singular values kept).
#dirty? ⇒ Boolean

Returns true if there are unsaved changes.
#disable_incremental_mode! ⇒ Object

Disables incremental mode.
#enable_incremental_mode!(max_rank: DEFAULT_MAX_RANK) ⇒ Object

Enables incremental mode with optional max_rank setting.
#find_related(doc, max_nearest = 3, &block) ⇒ Object

This function takes content and finds other documents that are semantically “close”, returning an array of documents sorted from most to least relevant.
#highest_ranked_stems(doc, count = 3) ⇒ Object

Prototype, only works on indexed documents.
#highest_relative_content(max_chunks = 10) ⇒ Object

This method returns max_chunks entries, ordered by their average semantic rating.
#incremental_enabled? ⇒ Boolean

Returns true if incremental mode is enabled and active.
#initialize(options = {}) ⇒ LSI constructor

Create a fresh index.
#items ⇒ Object

Returns an array of items that are indexed.
#marshal_dump ⇒ Object

Custom marshal serialization to exclude mutex state.
#marshal_load(data) ⇒ Object

Custom marshal deserialization to recreate mutex.
#needs_rebuild? ⇒ Boolean

Returns true if the index needs to be rebuilt.
#proximity_array_for_content(doc, &block) ⇒ Object

This function is the primitive that find_related and classify build upon.
#proximity_norms_for_content(doc, &block) ⇒ Object

Similar to proximity_array_for_content, this function takes similar arguments and returns a similar array.
#reload ⇒ Object

Reloads the LSI index from the configured storage.
#reload! ⇒ Object

Force reloads the LSI index from storage, discarding any unsaved changes.
#remove_item(item) ⇒ Object

Removes an item from the database, if it is indexed.
#save ⇒ Object

Saves the LSI index to the configured storage.
#save_to_file(path) ⇒ Object

Saves the LSI index to a file (legacy API).
#search(string, max_nearest = 3) ⇒ Object

This function allows for text-based search of your index.
#singular_value_spectrum ⇒ Object
#to_json ⇒ Object

Serializes the LSI index to a JSON string.
#train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) ⇒ Object

Alias train_batch to add_batch for API consistency with other classifiers.
#train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object

Trains the LSI index from an IO stream.
#vote(doc, cutoff = 0.30, &block) ⇒ Object

Methods included from Streaming

#delete_checkpoint, #list_checkpoints, #save_checkpoint

Constructor Details

#initialize(options = {}) ⇒ `LSI`

Create a fresh index. If you want to call #build_index manually, use

Classifier::LSI.new auto_rebuild: false

For incremental SVD mode (adds documents without full rebuild):

Classifier::LSI.new incremental: true, max_rank: 100

# File 'lib/classifier/lsi.rb', line 99

def initialize(options = {})
  super()
  @auto_rebuild = true unless options[:auto_rebuild] == false
  @word_list = WordList.new
  @items = {}
  @version = 0
  @built_at_version = -1
  @dirty = false
  @storage = nil

  # Incremental SVD settings
  @incremental_mode = options[:incremental] == true
  @max_rank = options[:max_rank] || DEFAULT_MAX_RANK
  @u_matrix = nil
  @initial_vocab_size = nil
  @min_word_length = options[:min_word_length] || Classifier.config.min_word_length
end

Class Attribute Details

.backend ⇒ `Object`

Returns the value of attribute backend.




# File 'lib/classifier/lsi.rb', line 15

def backend
  @backend
end

Instance Attribute Details

#auto_rebuild ⇒ `Object`

Returns the value of attribute auto_rebuild.




# File 'lib/classifier/lsi.rb', line 86

def auto_rebuild
  @auto_rebuild
end

#singular_values ⇒ `Object` (readonly)

Returns the value of attribute singular_values.




# File 'lib/classifier/lsi.rb', line 85

def singular_values
  @singular_values
end

#storage ⇒ `Object`

Returns the value of attribute storage.




# File 'lib/classifier/lsi.rb', line 86

def storage
  @storage
end

#word_list ⇒ `Object` (readonly)

Returns the value of attribute word_list.




# File 'lib/classifier/lsi.rb', line 85

def word_list
  @word_list
end

Class Method Details

.from_json(json) ⇒ `Object`

Loads an LSI index from a JSON string or Hash created by #to_json or #as_json. The index will be rebuilt after loading.

Raises:

(ArgumentError)

# File 'lib/classifier/lsi.rb', line 537

def self.from_json(json)
  data = json.is_a?(String) ? JSON.parse(json) : json
  raise ArgumentError, "Invalid classifier type: #{data['type']}" unless data['type'] == 'lsi'

  # Create instance with auto_rebuild disabled during loading
  instance = new(auto_rebuild: false)

  # Restore items (categories stay as strings, matching original storage)
  data['items'].each do |item_key, item_data|
    word_hash = item_data['word_hash'].transform_keys(&:to_sym)
    categories = item_data['categories']
    instance.instance_variable_get(:@items)[item_key] = ContentNode.new(word_hash, *categories)
    instance.instance_variable_set(:@version, instance.instance_variable_get(:@version) + 1)
  end

  # Restore auto_rebuild setting and rebuild index
  instance.auto_rebuild = data['auto_rebuild']
  instance.build_index
  instance
end

.load(storage:) ⇒ `Object`

Loads an LSI index from the configured storage. The storage is set on the returned instance.

Raises:

(StorageError)

# File 'lib/classifier/lsi.rb', line 620

def self.load(storage:)
  data = storage.read
  raise StorageError, 'No saved state found' unless data

  instance = from_json(data)
  instance.storage = storage
  instance
end

.load_checkpoint(storage:, checkpoint_id:) ⇒ `Object`

Loads an LSI index from a checkpoint.

Raises:

(ArgumentError)
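
The checkpoint file name is derived from the storage path, as the source listing for this method shows. A minimal standalone sketch of that derivation, using only `File` from the standard library; the helper name `checkpoint_path_for` and the example path are hypothetical and not part of the library:

```ruby
# Hypothetical helper mirroring the path logic in load_checkpoint.
def checkpoint_path_for(storage_path, checkpoint_id)
  dir  = File.dirname(storage_path)
  base = File.basename(storage_path, '.*') # file name without its extension
  ext  = File.extname(storage_path)        # extension, including the leading dot
  File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")
end

puts checkpoint_path_for('/tmp/models/lsi.json', '50pct')
# prints /tmp/models/lsi_checkpoint_50pct.json
```

The checkpoint sits next to the main storage file, which is why the method requires File storage.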

# File 'lib/classifier/lsi.rb', line 639

def self.load_checkpoint(storage:, checkpoint_id:)
  raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)

  dir = File.dirname(storage.path)
  base = File.basename(storage.path, '.*')
  ext = File.extname(storage.path)
  checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")

  checkpoint_storage = Storage::File.new(path: checkpoint_path)
  instance = load(storage: checkpoint_storage)
  instance.storage = storage
  instance
end

.load_from_file(path) ⇒ `Object`

Loads an LSI index from a file (legacy API).




# File 'lib/classifier/lsi.rb', line 632

def self.load_from_file(path)
  from_json(File.read(path))
end

.matrix_class ⇒ `Object`

Get the Matrix class for the current backend




# File 'lib/classifier/lsi.rb', line 31

def matrix_class
  backend == :native ? Classifier::Linalg::Matrix : ::Matrix
end

.native_available? ⇒ `Boolean`

Check if using native C extension

Returns:

(Boolean)




# File 'lib/classifier/lsi.rb', line 19

def native_available?
  backend == :native
end

.vector_class ⇒ `Object`

Get the Vector class for the current backend




# File 'lib/classifier/lsi.rb', line 25

def vector_class
  backend == :native ? Classifier::Linalg::Vector : ::Vector
end

Instance Method Details

#<<(item) ⇒ `Object`

A less flexible shorthand for add_item that assumes you are passing in a string with no categories. item will be duck typed via to_s.




# File 'lib/classifier/lsi.rb', line 248

def <<(item)
  add_item(item)
end

#add(**items) ⇒ `Object`

Adds items to the index using hash-style syntax. The hash keys are categories, and values are items (or arrays of items).

For example:

lsi = Classifier::LSI.new
lsi.add("Dog" => "Dogs are loyal pets")
lsi.add("Cat" => "Cats are independent")
lsi.add(Bird: "Birds can fly")  # Symbol keys work too

Multiple items with the same category:

lsi.add("Dog" => ["Dogs are loyal", "Puppies are cute"])

Batch operations with multiple categories:

lsi.add(
  "Dog" => ["Dogs are loyal", "Puppies are cute"],
  "Cat" => ["Cats are independent", "Kittens are playful"]
)
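
The reason a single string and an array of strings are both accepted is the `Array(value)` conversion visible in the source listing. A standalone sketch of that normalization, independent of the library; the helper name `category_document_pairs` is hypothetical:

```ruby
# Flattens a category => document(s) hash into [category, document] pairs.
def category_document_pairs(**items)
  pairs = []
  items.each do |category, value|
    # Array("text") => ["text"]; Array(["a", "b"]) => ["a", "b"]
    Array(value).each { |doc| pairs << [category.to_s, doc] }
  end
  pairs
end

pairs = category_document_pairs(
  Dog: ['Dogs are loyal', 'Puppies are cute'],
  Cat: 'Cats are independent'
)
pairs.each { |category, doc| puts "#{category}: #{doc}" }
```

Categories are converted with `to_s`, so symbol and string keys end up as the same category name.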

# File 'lib/classifier/lsi.rb', line 198

def add(**items)
  items.each do |category, value|
    Array(value).each { |doc| add_item(doc, category.to_s) }
  end
end

#add_batch(batch_size: Streaming::DEFAULT_BATCH_SIZE, **items) ⇒ `Object`

Adds items to the index in batches from an array. Documents are added without rebuilding, then the index is rebuilt at the end.

Examples:

Batch add with progress

lsi.add_batch(Dog: documents, batch_size: 100) do |progress|
  puts "#{progress.percent}% complete"
end
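
The batching pattern itself can be sketched without the library. In the sketch below, `SketchProgress` is a hypothetical stand-in for `Streaming::Progress` (only the `completed`, `current_batch` and `percent` members used on this page are modelled), and `process_in_batches` is an illustrative name:

```ruby
# Hypothetical stand-in for Streaming::Progress, for illustration only.
SketchProgress = Struct.new(:total, :completed, :current_batch) do
  def percent
    return 0.0 if total.zero?

    (completed * 100.0 / total).round(1)
  end
end

# Walks the documents in slices of batch_size and yields progress
# after each slice, the same shape as the loop in add_batch.
def process_in_batches(documents, batch_size:)
  progress = SketchProgress.new(documents.size, 0, 0)
  documents.each_slice(batch_size) do |batch|
    progress.completed += batch.size
    progress.current_batch += 1
    yield progress if block_given?
  end
  progress
end

process_in_batches(%w[a b c d e], batch_size: 2) do |progress|
  puts "batch #{progress.current_batch}: #{progress.percent}% complete"
end
```

Five documents with a batch size of 2 produce three batches, the last one holding a single document.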

# File 'lib/classifier/lsi.rb', line 696

def add_batch(batch_size: Streaming::DEFAULT_BATCH_SIZE, **items)
  original_auto_rebuild = @auto_rebuild
  @auto_rebuild = false

  begin
    total_docs = items.values.sum { |v| Array(v).size }
    progress = Streaming::Progress.new(total: total_docs)

    items.each do |category, documents|
      Array(documents).each_slice(batch_size) do |batch|
        batch.each { |doc| add_item(doc, category.to_s) }
        progress.completed += batch.size
        progress.current_batch += 1
        yield progress if block_given?
      end
    end
  ensure
    @auto_rebuild = original_auto_rebuild
    build_index if original_auto_rebuild
  end
end

#add_item(item, *categories, &block) ⇒ `Object`

Deprecated.

Use #add instead for clearer hash-style syntax.

Adds an item to the index. item is assumed to be a string, but any item may be indexed so long as it responds to #to_s or if you provide an optional block explaining how the indexer can fetch fresh string data. This optional block is passed the item, so the item may only be a reference to a URL or file name.

For example:

lsi = Classifier::LSI.new
lsi.add_item "This is just plain text"
lsi.add_item("/home/me/filename.txt") { |x| File.read x }
ar = ActiveRecordObject.find( :all )
lsi.add_item(ar, *ar.categories) { |x| ar.content }

# File 'lib/classifier/lsi.rb', line 220

def add_item(item, *categories, &block)
  clean_word_hash =
    if block
      block.call(item).clean_word_hash(@min_word_length)
    else
      item.to_s.clean_word_hash(@min_word_length)
    end

  node = nil

  synchronize do
    node = ContentNode.new(clean_word_hash, *categories)
    @items[item] = node
    @version += 1
    @dirty = true
  end

  # Use incremental update if enabled and we have a U matrix
  return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix

  build_index if @auto_rebuild
end

#as_json ⇒ `Object`

Returns a hash representation of the LSI index. Only source data (word_hash, categories) is included, not computed vectors. This can be converted to JSON or used directly.
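
To make the serialized shape concrete, here is a standalone sketch that builds a hash in the layout shown in the source listing for this method, converts it to JSON and reads it back. The item text and word counts are made-up illustration data, and `restore_items` is a hypothetical helper, not a library method:

```ruby
require 'json'

# Hand-built payload in the same layout as_json produces.
SAMPLE_PAYLOAD = {
  version: 1,
  type: 'lsi',
  auto_rebuild: true,
  items: {
    'Dogs are loyal pets' => {
      word_hash: { 'dog' => 1, 'loyal' => 1 },
      categories: ['Dog']
    }
  }
}.freeze

# Parses the JSON text and turns word_hash keys back into symbols,
# similar to what from_json does before rebuilding the index.
def restore_items(json_text)
  data = JSON.parse(json_text) # all keys come back as strings
  data['items'].transform_values do |item|
    {
      word_hash: item['word_hash'].transform_keys(&:to_sym),
      categories: item['categories']
    }
  end
end

p restore_items(SAMPLE_PAYLOAD.to_json)
```

Because only word hashes and categories are stored, the computed vectors have to be rebuilt after loading.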

# File 'lib/classifier/lsi.rb', line 508

def as_json(*)
  items_data = @items.transform_values do |node|
    {
      word_hash: node.word_hash.transform_keys(&:to_s),
      categories: node.categories.map(&:to_s)
    }
  end

  {
    version: 1,
    type: 'lsi',
    auto_rebuild: @auto_rebuild,
    items: items_data
  }
end

#build_index(cutoff = 0.75, force: false) ⇒ `Object`

This function rebuilds the index if needs_rebuild? returns true. For very large document spaces, this indexing operation may take some time to complete, so it may be wise to place the operation in another thread.

As a rule, indexing will be fairly swift on modern machines until you have well over 500 documents indexed, or have an incredibly diverse vocabulary for your documents.

The optional parameter “cutoff” is a tuning parameter. When the index is built, a certain number of s-values are discarded from the system. The cutoff parameter tells the indexer how many of these values to keep. A value of 1 for cutoff means that no semantic analysis will take place, turning the LSI class into a simple vector search engine.
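
The reduction step itself (build_reduced_matrix_with_u) is not shown on this page, so the following is only one plausible reading of the cutoff parameter, written as a standalone sketch: keep the largest `cutoff` fraction of the singular values and zero the rest. The helper name, the accepted range and the rounding rule are all assumptions:

```ruby
# Assumption: keep the largest (count * cutoff).round singular values
# and zero out the others. The library's actual rule may differ.
def truncate_singular_values(values, cutoff)
  raise ArgumentError, 'cutoff must be in (0, 1]' unless cutoff.positive? && cutoff <= 1

  keep = [(values.size * cutoff).round, 1].max
  threshold = values.sort.reverse[keep - 1]
  # Values tied with the threshold are all kept.
  values.map { |v| v >= threshold ? v : 0.0 }
end

p truncate_singular_values([9.0, 4.0, 2.0, 0.5], 0.75) # smallest value is zeroed
p truncate_singular_values([9.0, 4.0, 2.0, 0.5], 1)    # nothing is discarded
```

With a cutoff of 1 nothing is discarded, which matches the description above of LSI degrading into a plain vector search.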

# File 'lib/classifier/lsi.rb', line 301

def build_index(cutoff = 0.75, force: false)
  validate_cutoff!(cutoff)

  synchronize do
    return unless force || needs_rebuild_unlocked?

    make_word_list

    doc_list = @items.values
    tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

    if self.class.native_available?
      # Convert vectors to arrays for matrix construction
      tda_arrays = tda.map { |v| v.respond_to?(:to_a) ? v.to_a : v }
      tdm = self.class.matrix_class.alloc(*tda_arrays).trans
      ntdm, u_mat = build_reduced_matrix_with_u(tdm, cutoff)
      assign_native_ext_lsi_vectors(ntdm, doc_list)
    else
      tdm = Matrix.rows(tda).trans
      ntdm, u_mat = build_reduced_matrix_with_u(tdm, cutoff)
      assign_ruby_lsi_vectors(ntdm, doc_list)
    end

    # Store U matrix for incremental mode
    if @incremental_mode
      @u_matrix = u_mat
      @initial_vocab_size = @word_list.size
    end

    @built_at_version = @version
  end
end

#categories_for(item) ⇒ `Object`

Returns the categories for a given indexed item. You are free to add and remove items from this as you see fit. It does not invalidate an index to change its categories.

# File 'lib/classifier/lsi.rb', line 256

def categories_for(item)
  synchronize do
    return [] unless @items[item]

    @items[item].categories
  end
end

#classify(doc, cutoff = 0.30, &block) ⇒ `Object`

This function uses a voting system to categorize documents, based on the categories of other documents. It uses the same logic as the find_related function to find related documents, then returns the most obvious category from this list.
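
The vote tally (vote_unlocked) is not listed on this page. The sketch below assumes that each related document adds its proximity score to every category it carries; that rule, the helper names and the input shapes are assumptions for illustration. Only the final selection step (sort by votes, take the last) is taken from the source listing:

```ruby
# Assumption: each neighbour adds its score to each of its categories.
# neighbours is [[doc, score], ...]; categories_by_doc is doc => [category, ...].
def tally_votes(neighbours, categories_by_doc)
  votes = Hash.new(0.0)
  neighbours.each do |doc, score|
    categories_by_doc.fetch(doc, []).each { |category| votes[category] += score }
  end
  votes
end

# Same selection as in classify: order categories by votes, take the last.
def winning_category(votes)
  votes.keys.sort_by { |category| votes[category] }.last
end

neighbours = [['d1', 0.75], ['d2', 0.5], ['c1', 0.5]]
categories = { 'd1' => ['Dog'], 'd2' => ['Dog'], 'c1' => ['Cat'] }
puts winning_category(tally_votes(neighbours, categories))
```

When no votes are cast, the selection step returns nil.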

# File 'lib/classifier/lsi.rb', line 429

def classify(doc, cutoff = 0.30, &block)
  validate_cutoff!(cutoff)

  synchronize do
    votes = vote_unlocked(doc, cutoff, &block)

    ranking = votes.keys.sort_by { |x| votes[x] }
    ranking[-1]
  end
end

#classify_with_confidence(doc, cutoff = 0.30, &block) ⇒ `Object`

Returns the same category as classify() but also returns a confidence value derived from the vote share that the winning category got.

For example:

category, confidence = classify_with_confidence(doc)
if confidence && confidence < 0.3
  category = nil
end

See classify() for argument docs
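
The confidence value is the winning category's share of all votes, and the method returns [nil, nil] when no votes were cast (both visible in the source listing). A standalone sketch of that arithmetic, taking a votes hash directly; `winner_with_share` and `confident_category` are hypothetical helper names:

```ruby
# Stand-in for the tail of classify_with_confidence.
def winner_with_share(votes)
  total = votes.values.sum
  return [nil, nil] if total.zero?

  winner = votes.keys.sort_by { |category| votes[category] }.last
  [winner, votes[winner] / total.to_f]
end

# Caller-side threshold that tolerates the [nil, nil] result.
def confident_category(votes, minimum: 0.3)
  category, confidence = winner_with_share(votes)
  confidence && confidence >= minimum ? category : nil
end

p winner_with_share('Dog' => 3.0, 'Cat' => 1.0)
p confident_category({})
```

A caller that compares the confidence against a threshold needs the nil check, since nil does not respond to `<`.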

# File 'lib/classifier/lsi.rb', line 459

def classify_with_confidence(doc, cutoff = 0.30, &block)
  validate_cutoff!(cutoff)

  synchronize do
    votes = vote_unlocked(doc, cutoff, &block)
    votes_sum = votes.values.sum
    return [nil, nil] if votes_sum.zero?

    ranking = votes.keys.sort_by { |x| votes[x] }
    winner = ranking[-1]
    vote_share = votes[winner] / votes_sum.to_f
    [winner, vote_share]
  end
end

#current_rank ⇒ `Object`

Returns the current rank of the incremental SVD (number of singular values kept). Returns nil if incremental mode is not active.




# File 'lib/classifier/lsi.rb', line 157

def current_rank
  @singular_values&.count(&:positive?)
end

#dirty? ⇒ `Boolean`

Returns true if there are unsaved changes.

Returns:

(Boolean)




# File 'lib/classifier/lsi.rb', line 612

def dirty?
  @dirty
end

#disable_incremental_mode! ⇒ `Object`

Disables incremental mode. Subsequent adds will trigger full rebuilds.

# File 'lib/classifier/lsi.rb', line 164

def disable_incremental_mode!
  @incremental_mode = false
  @u_matrix = nil
  @initial_vocab_size = nil
end

#enable_incremental_mode!(max_rank: DEFAULT_MAX_RANK) ⇒ `Object`

Enables incremental mode with optional max_rank setting. The next build_index call will store the U matrix for incremental updates.

# File 'lib/classifier/lsi.rb', line 174

def enable_incremental_mode!(max_rank: DEFAULT_MAX_RANK)
  @incremental_mode = true
  @max_rank = max_rank
end

#find_related(doc, max_nearest = 3, &block) ⇒ `Object`

This function takes content and finds other documents that are semantically “close”, returning an array of documents sorted from most to least relevant. max_nearest specifies the number of documents to return. A value of 0 means that it returns all the indexed documents, sorted by relevance.

This is particularly useful for identifying clusters in your document space. For example you may want to identify several “What’s Related” items for weblog articles, or find paragraphs that relate to each other in an essay.
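
The "0 returns everything" behaviour follows from the range slice in the source listing: `0..(max_nearest - 1)` becomes `0..-1`, which covers the whole array. A standalone sketch with made-up document names; `take_nearest` is a hypothetical helper:

```ruby
# Mirrors the slice used at the end of find_related and search.
def take_nearest(ranked, max_nearest)
  ranked[0..(max_nearest - 1)]
end

ranked = %w[doc_a doc_b doc_c doc_d]
p take_nearest(ranked, 3) # first three entries
p take_nearest(ranked, 0) # 0..-1 selects every entry
```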

# File 'lib/classifier/lsi.rb', line 414

def find_related(doc, max_nearest = 3, &block)
  synchronize do
    carry =
      proximity_array_for_content_unlocked(doc, &block).reject { |pair| pair[0] == doc }
    result = carry.collect { |x| x[0] }
    result[0..(max_nearest - 1)]
  end
end

#highest_ranked_stems(doc, count = 3) ⇒ `Object`

Prototype, only works on indexed documents. I have no clue if this is going to work, but in theory it’s supposed to.

# File 'lib/classifier/lsi.rb', line 478

def highest_ranked_stems(doc, count = 3)
  synchronize do
    raise 'Requested stem ranking on non-indexed content!' unless @items[doc]

    arr = node_for_content_unlocked(doc).lsi_vector.to_a
    top_n = arr.sort.reverse[0..(count - 1)]
    top_n.collect { |x| @word_list.word_for_index(arr.index(x)) }
  end
end

#highest_relative_content(max_chunks = 10) ⇒ `Object`

This method returns max_chunks entries, ordered by their average semantic rating. Essentially, the average distance of each entry from all other entries is calculated, and the highest are returned.

This can be used to build a summary service, or to provide more information about your dataset’s general content. For example, if you were to use classify on the results of this data, you could gather information on what your dataset is generally about.

# File 'lib/classifier/lsi.rb', line 344

def highest_relative_content(max_chunks = 10)
  synchronize do
    return [] if needs_rebuild_unlocked?

    avg_density = {}
    @items.each_key { |x| avg_density[x] = proximity_array_for_content_unlocked(x).sum { |pair| pair[1] } }

    avg_density.keys.sort_by { |x| avg_density[x] }.reverse[0..(max_chunks - 1)].map
  end
end

#incremental_enabled? ⇒ `Boolean`

Returns true if incremental mode is enabled and active. Incremental mode becomes active after the first build_index call.

Returns:

(Boolean)




# File 'lib/classifier/lsi.rb', line 149

def incremental_enabled?
  @incremental_mode && !@u_matrix.nil?
end

#items ⇒ `Object`

Returns an array of items that are indexed.




# File 'lib/classifier/lsi.rb', line 281

def items
  synchronize { @items.keys }
end

#marshal_dump ⇒ `Object`

Custom marshal serialization to exclude mutex state




# File 'lib/classifier/lsi.rb', line 490

def marshal_dump
  [@auto_rebuild, @word_list, @items, @version, @built_at_version, @dirty, @min_word_length]
end

#marshal_load(data) ⇒ `Object`

Custom marshal deserialization to recreate mutex
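
The general pattern behind marshal_dump and marshal_load can be shown with a small standalone class. This sketch uses a plain `Mutex` instance variable rather than `Mutex_m`, and the class name `SketchIndex` is hypothetical; it only illustrates why lock state is left out of the dump and recreated on load:

```ruby
# Minimal class whose lock is excluded from Marshal round trips.
class SketchIndex
  attr_reader :items

  def initialize
    @lock = Mutex.new
    @items = {}
  end

  def add(key, value)
    @lock.synchronize { @items[key] = value }
  end

  # Dump only plain data; a Mutex cannot be marshalled.
  def marshal_dump
    [@items]
  end

  # Restore the data and create a fresh lock.
  def marshal_load(data)
    @items = data.first
    @lock = Mutex.new
  end
end

original = SketchIndex.new
original.add('doc', ['Dog'])
copy = Marshal.load(Marshal.dump(original))
p copy.items
```

The restored object is usable immediately because marshal_load installs a new lock before any synchronized method runs.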

# File 'lib/classifier/lsi.rb', line 496

def marshal_load(data)
  mu_initialize
  @auto_rebuild, @word_list, @items, @version, @built_at_version, @dirty,
    @min_word_length = data
  @storage = nil
end

#needs_rebuild? ⇒ `Boolean`

Returns true if the index needs to be rebuilt. The index needs to be built after all information is added, but before you start using it for search, classification and cluster detection.

Returns:

(Boolean)
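
The check compares a version counter against the version recorded at the last build, and it is only true once more than one item is indexed. A standalone sketch of that bookkeeping; the class `VersionedIndex` and its `add` and `build` methods are hypothetical simplifications, not the library's API:

```ruby
# Version-stamp bookkeeping in the style of needs_rebuild?.
class VersionedIndex
  def initialize
    @items = {}
    @version = 0
    @built_at_version = -1
  end

  def add(key, value)
    @items[key] = value
    @version += 1 # every mutation bumps the version
  end

  def build
    @built_at_version = @version # remember which version the build reflects
  end

  def needs_rebuild?
    @items.size > 1 && @version != @built_at_version
  end
end

index = VersionedIndex.new
index.add(:a, 'first')
puts index.needs_rebuild? # a single item never requires a build
index.add(:b, 'second')
puts index.needs_rebuild?
index.build
puts index.needs_rebuild?
```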




# File 'lib/classifier/lsi.rb', line 122

def needs_rebuild?
  synchronize { (@items.keys.size > 1) && (@version != @built_at_version) }
end

#proximity_array_for_content(doc, &block) ⇒ `Object`

This function is the primitive that find_related and classify build upon. It returns an array of 2-element arrays. The first element of this array is a document, and the second is its “score”, defining how “close” it is to other indexed items.

These values are somewhat arbitrary, having to do with the vector space created by your content, so the magnitude is interpretable but not always meaningful between indexes.

The parameter doc is the content to compare. If that content is not indexed, you can pass an optional block to define how to create the text data. See add_item for examples of how this works.




# File 'lib/classifier/lsi.rb', line 369

def proximity_array_for_content(doc, &block)
  synchronize { proximity_array_for_content_unlocked(doc, &block) }
end

#proximity_norms_for_content(doc, &block) ⇒ `Object`

Similar to proximity_array_for_content, this function takes similar arguments and returns a similar array. However, it uses the normalized calculated vectors instead of their full versions. This is useful when you’re trying to perform operations on content that is much smaller than the text you’re working with. search uses this primitive.
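
The scoring code is not listed on this page. As a rough illustration of why normalization helps with short queries, the sketch below assumes the score is a dot product of unit-length vectors (cosine similarity); the helper names and that scoring rule are assumptions:

```ruby
# Scales a vector to unit length; an all-zero vector is returned as is.
def normalize(vector)
  length = Math.sqrt(vector.sum { |x| x * x })
  return vector if length.zero?

  vector.map { |x| x / length }
end

# Dot product of two unit vectors, i.e. cosine similarity.
def cosine_similarity(a, b)
  normalize(a).zip(normalize(b)).sum { |x, y| x * y }
end

short_query = [3.0, 4.0]
long_document = [30.0, 40.0] # same direction, ten times the magnitude
puts cosine_similarity(short_query, long_document)
```

After normalization only the direction of a vector matters, so a short query is not penalised for containing fewer words than the documents it is compared with.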




# File 'lib/classifier/lsi.rb', line 380

def proximity_norms_for_content(doc, &block)
  synchronize { proximity_norms_for_content_unlocked(doc, &block) }
end

#reload ⇒ `Object`

Reloads the LSI index from the configured storage. Raises UnsavedChangesError if there are unsaved changes. Use reload! to force reload and discard changes.

Raises:

(ArgumentError)

# File 'lib/classifier/lsi.rb', line 583

def reload
  raise ArgumentError, 'No storage configured' unless storage
  raise UnsavedChangesError, 'Unsaved changes would be lost. Call save first or use reload!' if @dirty

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#reload! ⇒ `Object`

Force reloads the LSI index from storage, discarding any unsaved changes.

Raises:

(ArgumentError)

# File 'lib/classifier/lsi.rb', line 598

def reload!
  raise ArgumentError, 'No storage configured' unless storage

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#remove_item(item) ⇒ `Object`

Removes an item from the database, if it is indexed.

# File 'lib/classifier/lsi.rb', line 267

def remove_item(item)
  removed = synchronize do
    next false unless @items.key?(item)

    @items.delete(item)
    @version += 1
    @dirty = true
    true
  end
  build_index if removed && @auto_rebuild
end

#save ⇒ `Object`

Saves the LSI index to the configured storage. Raises ArgumentError if no storage is configured.

Raises:

(ArgumentError)

# File 'lib/classifier/lsi.rb', line 562

def save
  raise ArgumentError, 'No storage configured. Use save_to_file(path) or set storage=' unless storage

  storage.write(to_json)
  @dirty = false
end

#save_to_file(path) ⇒ `Object`

Saves the LSI index to a file (legacy API).

# File 'lib/classifier/lsi.rb', line 572

def save_to_file(path)
  result = File.write(path, to_json)
  @dirty = false
  result
end

#search(string, max_nearest = 3) ⇒ `Object`

This function allows for text-based search of your index. Unlike other functions like find_related and classify, search only takes short strings. It will also ignore factors like repeated words. It is best for short, Google-like search terms. A search will first prioritize lexical relationships, then semantic ones.

While this may seem backwards compared to the other functions that LSI supports, it is actually the same algorithm, just applied on a smaller document.

# File 'lib/classifier/lsi.rb', line 393

def search(string, max_nearest = 3)
  synchronize do
    return [] if needs_rebuild_unlocked?

    carry = proximity_norms_for_content_unlocked(string)
    result = carry.collect { |x| x[0] }
    result[0..(max_nearest - 1)]
  end
end

#singular_value_spectrum ⇒ `Object`

# File 'lib/classifier/lsi.rb', line 127

def singular_value_spectrum
  return nil unless @singular_values

  total = @singular_values.sum
  return nil if total.zero?

  cumulative = 0.0
  @singular_values.map.with_index do |value, i|
    cumulative += value
    {
      dimension: i,
      value: value,
      percentage: value / total,
      cumulative_percentage: cumulative / total
    }
  end
end
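
One way to use the spectrum is to pick how many dimensions are needed to reach a target cumulative share. The sketch below copies the calculation from the listing above into a standalone function that takes the singular values directly; `spectrum_for` and `dimensions_for` are hypothetical names and the input values are made up:

```ruby
# Standalone copy of the spectrum calculation.
def spectrum_for(singular_values)
  total = singular_values.sum
  return nil if total.zero?

  cumulative = 0.0
  singular_values.map.with_index do |value, i|
    cumulative += value
    {
      dimension: i,
      value: value,
      percentage: value / total,
      cumulative_percentage: cumulative / total
    }
  end
end

# Smallest number of dimensions whose cumulative share reaches target.
def dimensions_for(singular_values, target)
  entry = spectrum_for(singular_values)&.find { |e| e[:cumulative_percentage] >= target }
  entry && entry[:dimension] + 1
end

p dimensions_for([5.0, 3.0, 1.5, 0.5], 0.75)
```

Note that the percentages are shares of the sum of the singular values themselves, not of their squares.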

#to_json ⇒ `Object`

Serializes the LSI index to a JSON string. Only source data (word_hash, categories) is serialized, not computed vectors. On load, the index will be rebuilt automatically.




# File 'lib/classifier/lsi.rb', line 529

def to_json(*)
  as_json.to_json
end

#train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) ⇒ `Object`

Alias train_batch to add_batch for API consistency with other classifiers. Note: LSI uses categories differently (items have categories, not the training call).

# File 'lib/classifier/lsi.rb', line 722

def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
  if category && documents
    add_batch(batch_size: batch_size, **{ category.to_sym => documents }, &block)
  else
    add_batch(batch_size: batch_size, **categories, &block)
  end
end

#train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ `Object`

Trains the LSI index from an IO stream. Each line in the stream is treated as a separate document. Documents are added without rebuilding, then the index is rebuilt at the end.

Examples:

Train from a file

lsi.train_from_stream(:category, File.open('corpus.txt'))

With progress tracking

lsi.train_from_stream(:category, io, batch_size: 500) do |progress|
  puts "#{progress.completed} documents processed"
end
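
The line-per-document batching can be tried without the library by reading from a `StringIO`. In this sketch `each_line_batch` is a hypothetical helper standing in for `Streaming::LineReader`; whether the real reader strips newlines or skips blank lines is not shown on this page, so the `chomp: true` here is an assumption:

```ruby
require 'stringio'

# Reads an IO line by line and yields arrays of up to batch_size lines.
def each_line_batch(io, batch_size:)
  io.each_line(chomp: true).each_slice(batch_size) { |batch| yield batch }
end

io = StringIO.new("first doc\nsecond doc\nthird doc\n")
each_line_batch(io, batch_size: 2) { |batch| p batch }
```

Any object that responds to `each_line`, such as an open `File`, can be passed in place of the `StringIO`.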

# File 'lib/classifier/lsi.rb', line 666

def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
  original_auto_rebuild = @auto_rebuild
  @auto_rebuild = false

  begin
    reader = Streaming::LineReader.new(io, batch_size: batch_size)
    total = reader.estimate_line_count
    progress = Streaming::Progress.new(total: total)

    reader.each_batch do |batch|
      batch.each { |text| add_item(text, category) }
      progress.completed += batch.size
      progress.current_batch += 1
      yield progress if block_given?
    end
  ensure
    @auto_rebuild = original_auto_rebuild
    build_index if original_auto_rebuild
  end
end

#vote(doc, cutoff = 0.30, &block) ⇒ `Object`

# File 'lib/classifier/lsi.rb', line 441

def vote(doc, cutoff = 0.30, &block)
  validate_cutoff!(cutoff)

  synchronize { vote_unlocked(doc, cutoff, &block) }
end
