Class: Rubino::Memory::Deduplicator
- Inherits: Object
- Object
- Rubino::Memory::Deduplicator
- Defined in: lib/rubino/memory/deduplicator.rb
Overview
Prevents duplicate memories from being stored. Uses content similarity to detect duplicates.
The scope is read-then-write WITHIN one extraction, not a write-time uniqueness constraint (#49): #duplicate? reads existing rows and Store#create inserts without a unique index, so two concurrent rubino instances that extract the SAME fact at the same instant can each pass the check and each write one row. The result is two identical rows, with no data loss. This matches the field: mem0 likewise deduplicates per extraction (exact/MD5 plus similarity) with no cross-writer locking (mem0ai/mem0#4896). #deduplicate_all! can collapse any such pair on demand. The same-instant, cross-instance collision is a rare, benign edge case; it is accepted by design, not a bug that justifies gating every write on a lock.
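The interleaving described above can be replayed deterministically. This is only a sketch: `FakeStore`, `exists?`, and `create` are hypothetical in-memory stand-ins, not the real Store API, and an exact-match check stands in for the similarity check.

```ruby
# Hypothetical in-memory stand-in for the store (not the real Store class).
class FakeStore
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Stand-in for the duplicate check: exact match only.
  def exists?(content)
    @rows.include?(content)
  end

  # Stand-in for an insert with no unique index.
  def create(content)
    @rows << content
  end
end

store = FakeStore.new
fact = "User lives in Lisbon"

# Both writers run their check before either one inserts.
writer_a_sees_duplicate = store.exists?(fact)
writer_b_sees_duplicate = store.exists?(fact)

store.create(fact) unless writer_a_sees_duplicate
store.create(fact) unless writer_b_sees_duplicate
puts store.rows.size # prints 2: two identical rows, nothing lost

# An on-demand cleanup pass collapses the pair afterwards.
store.rows.uniq!
puts store.rows.size # prints 1
```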
Constant Summary
- SIMILARITY_THRESHOLD = 0.85
  Similarity threshold (0.0 to 1.0); a score above this is considered a duplicate.
Class Method Summary
- .normalize_verbatim(text) ⇒ Object
  Normalize a fact for an EXACT-verbatim compare: collapse runs of whitespace to one space, strip the ends, and case-fold (#Y4).
Instance Method Summary
- #deduplicate_all! ⇒ Object
  Removes duplicate memories, keeping the highest-confidence version.
- #duplicate?(kind:, content:) ⇒ Boolean
  Returns true if a similar memory already exists.
- #initialize(store: nil) ⇒ Deduplicator (constructor)
  A new instance of Deduplicator.
Constructor Details
#initialize(store: nil) ⇒ Deduplicator
Returns a new instance of Deduplicator.
# File 'lib/rubino/memory/deduplicator.rb', line 33
def initialize(store: nil)
  @store = store || Store.new
end
Class Method Details
.normalize_verbatim(text) ⇒ Object
Normalize a fact for an EXACT-verbatim compare: collapse runs of whitespace to one space, strip the ends, and case-fold (#Y4). Two facts with the same normalized form are byte-equal-enough to be one fact, so a second save is a no-op. This is distinct from the 0.85 Jaccard near-dup check, which a word-reordering rephrase can satisfy but which, as #93-F4 showed, misses the trivial “saved twice” repeat after the live set churns. It is also distinct from the cross-instance semantic merge (#49). This method is the single source of truth for what “the same fact” means at the write seam, shared by every backend.
# File 'lib/rubino/memory/deduplicator.rb', line 29
def self.normalize_verbatim(text)
  text.to_s.gsub(/\s+/, " ").strip.downcase
end
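A short usage sketch of the normalization above. The method body is copied from the listing and defined at top level so the example runs without the rest of the library; the sample strings are made up for illustration.

```ruby
# Same body as Deduplicator.normalize_verbatim, defined standalone here.
def normalize_verbatim(text)
  text.to_s.gsub(/\s+/, " ").strip.downcase
end

a = normalize_verbatim("  User prefers   DARK mode\n")
b = normalize_verbatim("user prefers dark mode")
c = normalize_verbatim("dark mode user prefers")

puts a                               # prints: user prefers dark mode
puts a == b                          # prints: true  (whitespace and case differences vanish)
puts a == c                          # prints: false (word order still matters for the exact compare)
puts normalize_verbatim(nil).inspect # prints: "" (nil is coerced via to_s)
```

A reordered phrase such as `c` is therefore left to the similarity check rather than the exact-verbatim compare.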
Instance Method Details
#deduplicate_all! ⇒ Object
Removes duplicate memories, keeping the highest-confidence version.
# File 'lib/rubino/memory/deduplicator.rb', line 44
def deduplicate_all!
  removed = 0
  Store::VALID_KINDS.each do |kind|
    removed += deduplicate_kind(kind)
  end
  removed
end
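The private `deduplicate_kind` helper is not shown on this page. The following is a minimal sketch of one way a per-kind pass could keep the highest-confidence row, assuming rows are hashes with `:kind`, `:content`, and `:confidence` keys. The `KINDS` list, the `dedupe_kind` and `dedupe_all` names, and the exact-match grouping are all assumptions for illustration; the real helper may group by similarity instead.

```ruby
# Hypothetical kinds; the real list lives in Store::VALID_KINDS.
KINDS = %w[fact preference].freeze

# Same normalization as normalize_verbatim, repeated so the sketch is standalone.
def normalize(text)
  text.to_s.gsub(/\s+/, " ").strip.downcase
end

# Keep the highest-confidence row per normalized content.
# Returns [kept_rows, removed_count].
def dedupe_kind(rows)
  kept = rows.group_by { |r| normalize(r[:content]) }
             .map { |_, group| group.max_by { |r| r[:confidence] } }
  [kept, rows.size - kept.size]
end

# Run the per-kind pass over every kind and total the removals,
# mirroring the loop in deduplicate_all!.
def dedupe_all(rows)
  removed = 0
  survivors = KINDS.flat_map do |kind|
    kept, dropped = dedupe_kind(rows.select { |r| r[:kind] == kind })
    removed += dropped
    kept
  end
  [survivors, removed]
end

rows = [
  { kind: "fact", content: "Lives in Lisbon", confidence: 0.6 },
  { kind: "fact", content: "lives  in lisbon", confidence: 0.9 },
  { kind: "preference", content: "Prefers dark mode", confidence: 0.8 }
]

survivors, removed = dedupe_all(rows)
puts removed                                      # prints: 1
puts survivors.map { |r| r[:confidence] }.inspect # prints: [0.9, 0.8]
```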
#duplicate?(kind:, content:) ⇒ Boolean
Returns true if a similar memory already exists.
# File 'lib/rubino/memory/deduplicator.rb', line 38
def duplicate?(kind:, content:)
  existing = @store.by_kind(kind, limit: 100)
  existing.any? { |m| similar?(m[:content], content) }
end
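The private `similar?` predicate is not shown on this page. Since the overview mentions a 0.85 Jaccard near-dup check, the following is a sketch of a word-set Jaccard comparison under that assumption. The tokenization (lowercased `\w+` runs), the `jaccard` helper name, and the strict `>` comparison are guesses, not the library's confirmed behavior.

```ruby
require "set"

THRESHOLD = 0.85 # mirrors SIMILARITY_THRESHOLD

# Hypothetical helper: Jaccard index over lowercased word sets,
# |A ∩ B| / |A ∪ B|.
def jaccard(a, b)
  set_a = a.to_s.downcase.scan(/\w+/).to_set
  set_b = b.to_s.downcase.scan(/\w+/).to_set
  return 1.0 if set_a.empty? && set_b.empty?

  (set_a & set_b).size.to_f / (set_a | set_b).size
end

# Assumed shape of the predicate used by duplicate?.
def similar?(a, b)
  jaccard(a, b) > THRESHOLD
end

existing = [{ content: "User prefers dark mode in the editor" }]

# Same words in a different order: Jaccard is 1.0, so this counts as a duplicate.
puts existing.any? { |m| similar?(m[:content], "in the editor the user prefers dark mode") } # prints: true

# One word changed: 6 shared words out of 8 distinct gives 0.75, below the threshold.
puts existing.any? { |m| similar?(m[:content], "User prefers light mode in the editor") } # prints: false
```

Note that `duplicate?` only compares against the rows returned by `by_kind(kind, limit: 100)`, so under this sketch a match outside that window would not be detected.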