Class: Rubino::Memory::Deduplicator

Inherits:
Object
Defined in:
lib/rubino/memory/deduplicator.rb

Overview

Prevents duplicate memories from being stored. Uses content similarity to detect duplicates.

The check is scoped as a read-then-write WITHIN one extraction, not a write-time uniqueness constraint (#49). #duplicate? reads existing rows and Store#create inserts without a unique index, so two concurrent rubino instances that extract the SAME fact in the same instant can each pass the check and each write one row. The result is two identical rows, with no data loss. This matches the field: mem0 likewise dedups per extraction (exact/MD5 + similarity) with no cross-writer locking (mem0ai/mem0#4896). #deduplicate_all! can collapse any such pair on demand. The same-instant cross-instance collision is a rare, benign edge case: it is by design, not a bug that would justify gating every write on a lock.
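The interleaving described above can be reproduced deterministically without threads. This is a minimal sketch, not the library's code: InterleavedStore, its create signature, and the :preference kind are hypothetical stand-ins, and plain string equality stands in for the similarity check.

```ruby
# Hypothetical in-memory stand-in for Store. Only by_kind(kind, limit:) appears
# in this page; the create signature and row shape are assumptions.
class InterleavedStore
  def initialize
    @rows = []
  end

  def by_kind(kind, limit: 100)
    @rows.select { |row| row[:kind] == kind }.first(limit)
  end

  def create(kind:, content:)
    @rows << { kind: kind, content: content }
  end

  def count
    @rows.size
  end
end

# Two writers both run the read step before either one runs the write step.
def simulate_same_instant_writes(fact)
  store = InterleavedStore.new

  # Read phase: neither writer sees the fact, because nothing is stored yet.
  a_sees_dup = store.by_kind(:preference).any? { |m| m[:content] == fact }
  b_sees_dup = store.by_kind(:preference).any? { |m| m[:content] == fact }

  # Write phase: both checks passed, so both writers insert.
  store.create(kind: :preference, content: fact) unless a_sees_dup
  store.create(kind: :preference, content: fact) unless b_sees_dup

  store.count
end

puts simulate_same_instant_writes("User prefers dark mode") # prints 2
```

Both rows carry the same content, so nothing is lost; a later #deduplicate_all! pass is what would collapse the pair.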

Constant Summary

SIMILARITY_THRESHOLD =

Similarity threshold (0.0 to 1.0); a similarity score above this value is considered a duplicate.

0.85
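As a rough illustration of how a 0.85 threshold behaves, here is a word-set Jaccard sketch. The page describes the near-dup check as Jaccard elsewhere, but the private similar? method is not shown, so the tokenization, the strict > comparison, and the helper names jaccard and near_duplicate? are all assumptions.

```ruby
require "set"

SIMILARITY_THRESHOLD = 0.85

# Word-set Jaccard similarity: |A ∩ B| / |A ∪ B|.
# Assumption: words are lowercased runs of \w characters.
def jaccard(a, b)
  words_a = a.downcase.scan(/\w+/).to_set
  words_b = b.downcase.scan(/\w+/).to_set
  # Assumption: two empty texts count as identical.
  return 1.0 if words_a.empty? && words_b.empty?

  (words_a & words_b).size.to_f / (words_a | words_b).size
end

# Assumption: "above this" means a strict greater-than comparison.
def near_duplicate?(a, b)
  jaccard(a, b) > SIMILARITY_THRESHOLD
end

puts jaccard("the user likes tea", "likes tea the user") # prints 1.0
puts jaccard("a b c d", "a b c e")                       # prints 0.6
```

Under these assumptions a pure word reordering scores 1.0, while changing one word out of four drops the score to 0.6, well below the threshold.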

Constructor Details

#initialize(store: nil) ⇒ Deduplicator

Returns a new instance of Deduplicator.



# File 'lib/rubino/memory/deduplicator.rb', line 33

def initialize(store: nil)
  @store = store || Store.new
end

Class Method Details

.normalize_verbatim(text) ⇒ Object

Normalizes a fact for an EXACT-verbatim compare: collapses runs of whitespace to one space, strips the ends, and case-folds (#Y4). Two facts with the same normalized form are close enough to byte-equal to count as one fact, so a second save is a no-op. This is distinct from the 0.85 Jaccard near-dup check (which a word-reordering rephrase can satisfy, but which #93-F4 showed misses the trivial “saved twice” repeat after the live set churns) and from the cross-instance semantic merge (#49). This method is the single source of truth for what “the same fact” means at the write seam, shared by every backend.



# File 'lib/rubino/memory/deduplicator.rb', line 29

def self.normalize_verbatim(text)
  text.to_s.gsub(/\s+/, " ").strip.downcase
end
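
A short usage sketch of the normalization above. The one-liner is copied into a standalone top-level method so the example runs without the rest of the library; the sample strings are illustrative only.

```ruby
# Same body as Deduplicator.normalize_verbatim, defined standalone for the example.
def normalize_verbatim(text)
  text.to_s.gsub(/\s+/, " ").strip.downcase
end

# Whitespace runs (spaces, tabs, newlines) collapse and case is ignored.
puts normalize_verbatim("  User   likes\tTea \n").inspect # prints "user likes tea"

# nil is tolerated because of to_s.
puts normalize_verbatim(nil).inspect                      # prints ""

# A reordered rephrase is NOT a verbatim match.
puts normalize_verbatim("Tea likes user") == normalize_verbatim("User likes tea") # prints false
```

Only whitespace and case differences are absorbed here; reordered or reworded facts are left to the similarity check.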

Instance Method Details

#deduplicate_all! ⇒ Object

Removes duplicate memories across every kind in Store::VALID_KINDS, keeping the highest-confidence version of each, and returns the total number of memories removed.



# File 'lib/rubino/memory/deduplicator.rb', line 44

def deduplicate_all!
  removed = 0
  Store::VALID_KINDS.each do |kind|
    removed += deduplicate_kind(kind)
  end
  removed
end
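
The private deduplicate_kind helper is not shown on this page. The following is one possible sketch of a "keep the highest-confidence version" pass, under loud assumptions: rows are hashes with :id, :content, and :confidence keys, and same_fact? (a hypothetical helper) uses normalized equality in place of the real similarity check.

```ruby
# Hypothetical stand-in for the similarity check: normalized string equality.
def same_fact?(a, b)
  normalize = ->(text) { text.to_s.gsub(/\s+/, " ").strip.downcase }
  normalize.call(a) == normalize.call(b)
end

# Greedy pass: visit rows from highest to lowest confidence and keep a row
# only if no already-kept row states the same fact.
# Returns [kept_rows, removed_rows].
def collapse_duplicates(rows)
  kept = []
  removed = []
  rows.sort_by { |row| -row[:confidence] }.each do |row|
    if kept.any? { |k| same_fact?(k[:content], row[:content]) }
      removed << row
    else
      kept << row
    end
  end
  [kept, removed]
end

rows = [
  { id: 1, content: "User likes tea",  confidence: 0.6 },
  { id: 2, content: "user  likes TEA", confidence: 0.9 },
  { id: 3, content: "User owns a cat", confidence: 0.7 }
]
kept, removed = collapse_duplicates(rows)
puts kept.map { |r| r[:id] }.inspect    # prints [2, 3]
puts removed.map { |r| r[:id] }.inspect # prints [1]
```

The size of the removed list corresponds to the per-kind count that #deduplicate_all! sums into its return value.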

#duplicate?(kind:, content:) ⇒ Boolean

Returns true if a similar memory of the given kind already exists among the rows (at most 100) fetched from the store.

Returns:

  • (Boolean)


# File 'lib/rubino/memory/deduplicator.rb', line 38

def duplicate?(kind:, content:)
  existing = @store.by_kind(kind, limit: 100)
  existing.any? { |m| similar?(m[:content], content) }
end
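
Because the check compares against at most 100 rows, a matching memory outside that window goes unseen. The sketch below illustrates this under stated assumptions: WindowedStore is a hypothetical stand-in that returns the most recently added rows (the real Store#by_kind ordering is not documented here), the :fact kind is made up, and string equality replaces the private similar? method.

```ruby
# Hypothetical store. ASSUMPTION: by_kind returns the most recently added rows.
class WindowedStore
  def initialize(rows)
    @rows = rows
  end

  def by_kind(kind, limit: 100)
    @rows.select { |row| row[:kind] == kind }.last(limit)
  end
end

# Mirrors the shape of #duplicate?, with equality standing in for similar?.
def duplicate_in_window?(store, kind:, content:)
  store.by_kind(kind, limit: 100).any? { |m| m[:content] == content }
end

# One old fact followed by 100 newer rows pushes the old fact out of the window.
rows = [{ kind: :fact, content: "old fact" }] +
       Array.new(100) { |i| { kind: :fact, content: "filler #{i}" } }
store = WindowedStore.new(rows)

puts duplicate_in_window?(store, kind: :fact, content: "filler 99") # prints true
puts duplicate_in_window?(store, kind: :fact, content: "old fact")  # prints false
```

This is the kind of churn in which an exact repeat can slip past a windowed similarity check, which is the case .normalize_verbatim is described as covering.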