Class: BloomFit

Inherits:
Object
  • Object
show all
Extended by:
Forwardable
Defined in:
lib/bloom_fit.rb,
lib/bloom_fit/version.rb

Overview

BloomFit is an in-memory Bloom filter with a small, Set-like API.

Bloom filters are probabilistic membership structures: they can report false positives, but they do not report false negatives for values that have been added. That makes BloomFit useful for cheaply ruling out missing values before doing more expensive work, while keeping memory usage low.

The class wraps the native CBloomFilter implementation in Ruby-friendly methods such as add, include?, merge, &, and |. Instances can be serialized to msgpack with to_msgpack or persisted to disk with save and later restored with BloomFit.unpack or BloomFit.load.

Filters can only be combined when they were created with the same size and hashes values; otherwise the native extension raises ArgumentError.

filter = BloomFit.new(size: 10_000, hashes: 6)
filter.add("cat")
filter.include?("cat") # => true
filter.include?("dog") # => false

Choose size and hashes based on the expected number of inserts and the false-positive rate you can tolerate.

Constant Summary collapse

LN2 =
Math.log(2.0).freeze
VERSION =
"1.2.0".freeze

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(capacity: 100, false_positive_rate: 0.001, size: nil, hashes: 4) ⇒ BloomFit

Creates an empty Bloom filter.

The defaults are a reasonable starting point for small in-memory filters, but the best values depend on how many keys you expect to insert and how many false positives you can tolerate.

Parameters:

  • capacity (Integer) (defaults to: 100)

    expected number of elements to store in the set

  • false_positive_rate (Integer) (defaults to: 0.001)

    expected number of elements to store in the set

  • size (Integer) (defaults to: nil)

    number of buckets in a bloom filter

  • hashes (Integer) (defaults to: 4)

    number of hash functions



50
51
52
53
54
55
56
57
58
59
60
# File 'lib/bloom_fit.rb', line 50

def initialize(capacity: 100, false_positive_rate: 0.001, size: nil, hashes: 4)
  if size.nil? || hashes.nil?
    raise ArgumentError, "capacity must be > 0" unless capacity.positive?
    raise ArgumentError, "false_positive_rate must be between 0 and 1" if false_positive_rate <= 0.0 || false_positive_rate >= 1.0

    size = (-capacity.to_f * Math.log(false_positive_rate) / (LN2**2)).ceil
    hashes = (size / capacity * LN2).ceil
  end

  @bf = CBloomFilter.new(size, hashes)
end

Instance Attribute Details

#bfObject (readonly)

The wrapped native CBloomFilter instance.

This is mostly useful for low-level integrations and internal filter operations such as merge, union, and intersection.



38
39
40
# File 'lib/bloom_fit.rb', line 38

def bf
  @bf
end

Class Method Details

.load(filename) ⇒ Object

Loads a filter from a file previously written by save.

The file contents are decoded from msgpack.



260
261
262
# File 'lib/bloom_fit.rb', line 260

def self.load(filename)
  unpack(File.binread(filename))
end

.unpack(msg) ⇒ Object

Rebuilds a filter from the serialized data returned by to_msgpack.

The payload stores size, hashes, and the raw bitmap in msgpack format, making it suitable for compact transport or persistence outside Ruby’s Marshal.



243
244
245
246
247
# File 'lib/bloom_fit.rb', line 243

def self.unpack(msg)
  BloomFit.allocate.tap do |bf|
    bf.marshal_load(MessagePack.unpack(msg))
  end
end

Instance Method Details

#&(other) ⇒ Object Also known as: intersection

Returns a new filter containing the bitwise intersection of two filters.

Both filters must have the same size and hashes values or the native extension raises ArgumentError.

Like all Bloom filter operations, membership checks on the result remain probabilistic and may still produce false positives.



187
188
189
190
191
# File 'lib/bloom_fit.rb', line 187

def &(other)
  self.class.new(size:, hashes:).tap do |result|
    result.instance_variable_set(:@bf, @bf.&(other.bf))
  end
end

#[]=(key, value) ⇒ Object

Adds key to the filter when value is truthy.

This makes BloomFit behave like a write-only membership hash: truthy values add the key, while false and nil are ignored.



125
126
127
# File 'lib/bloom_fit.rb', line 125

def []=(key, value)
  @bf.add(key) if value
end

#add(key) ⇒ Object Also known as: <<

Adds key to the filter and returns self.

This mimics the behavior of Set#add and allows chaining with #<<.



109
110
111
112
# File 'lib/bloom_fit.rb', line 109

def add(key)
  @bf.add(key)
  self
end

#add?(key) ⇒ Boolean

Adds key only if it does not already appear to be present.

Returns self when the key is added and nil when include? is already true. This mimics Set#add?.

Because Bloom filters can return false positives, add? may occasionally return nil for a key that has not actually been inserted before.

Returns:

  • (Boolean)


136
137
138
139
# File 'lib/bloom_fit.rb', line 136

def add?(key)
  return nil if include?(key) # rubocop:disable Style/ReturnNilInPredicateMethodDefinition
  add(key)
end

#clearObject

Clears the filter by resetting all bits to 0 and returns self.



116
117
118
119
# File 'lib/bloom_fit.rb', line 116

def clear
  @bf.clear
  self
end

#empty?Boolean

Returns true when no bits are set.

This is an exact check on the filter state, unlike include?, which is probabilistic for positive matches.

Returns:

  • (Boolean)


102
103
104
# File 'lib/bloom_fit.rb', line 102

def empty?
  set_bits.zero?
end

#marshal_dumpObject

Returns the data Ruby’s Marshal uses to serialize this filter.



234
235
236
# File 'lib/bloom_fit.rb', line 234

def marshal_dump
  [size, hashes, bitmap]
end

#marshal_load(ary) ⇒ Object

Rebuilds the filter from the serialized data returned by marshal_dump.

This hook is used by Ruby’s Marshal support.



226
227
228
229
230
231
# File 'lib/bloom_fit.rb', line 226

def marshal_load(ary)
  size, hashes, bitmap = *ary

  initialize(size:, hashes:)
  @bf.load(bitmap) if bitmap
end

#merge(other) ⇒ Object

Merges another filter or collection of keys into this filter.

When other is a BloomFit, the merge is performed bitwise and both filters must have the same size and hashes values. When other behaves like a hash, only keys with truthy values are added. Any other enumerable is treated as a list of keys.

This method mutates the receiver and mimics Set#merge.



166
167
168
169
170
171
172
173
174
175
176
177
178
# File 'lib/bloom_fit.rb', line 166

def merge(other)
  if other.is_a?(BloomFit)
    @bf.merge(other.bf)
  elsif other.respond_to?(:each_key)
    other.each { |k, v| add(k) if v }
  elsif other.is_a?(Enumerable)
    other.each { |k| add(k) }
  else
    raise ArgumentError, "value must be enumerable or another BloomFit filter"
  end

  self
end

#save(filename) ⇒ Object

Writes the filter to filename using msgpack format.

This produces a compact binary payload that can be restored with BloomFit.load.



268
269
270
# File 'lib/bloom_fit.rb', line 268

def save(filename)
  File.binwrite(filename, to_msgpack)
end

#statsObject

Returns a human-readable summary of the filter’s current state.

The report includes the configured width (m), the current number of set bits (n), the hash count (k), and the predicted false-positive rate based on the current fill level.



212
213
214
215
216
217
218
219
220
221
# File 'lib/bloom_fit.rb', line 212

def stats
  fpr = ((n.to_f / m)**k) * 100

  format <<~STATS, m, n, k, fpr
    Number of filter buckets (m):  %d
    Number of set bits (n):        %d
    Number of filter hashes (k):   %d
    Predicted false positive rate: %.2f%%
  STATS
end

#to_binaryObject

Returns the bitmap as a binary string of 0 and 1 characters.

The output is truncated to the configured filter width, so it omits any trailing padding present in the native bitmap.



154
155
156
# File 'lib/bloom_fit.rb', line 154

def to_binary
  bitmap.unpack1("B*")[0...size]
end

#to_hexObject

Returns the bitmap as a hexadecimal string.

This is useful for debugging, logging, or comparing filter state in a more compact form than to_binary.



145
146
147
148
# File 'lib/bloom_fit.rb', line 145

def to_hex
  length = ((size / 8.0).ceil * 8 / 4)
  bitmap.unpack1("H*")[0...length]
end

#to_msgpackObject

Returns this filter serialized in msgpack format.

The encoded payload contains the same three values as marshal_dump: size, hashes, and the raw bitmap.



253
254
255
# File 'lib/bloom_fit.rb', line 253

def to_msgpack
  MessagePack.pack(marshal_dump)
end

#|(other) ⇒ Object Also known as: union

Returns a new filter containing the bitwise union of two filters.

Both filters must have the same size and hashes values or the native extension raises ArgumentError.

The receiver and other are left unchanged.



200
201
202
203
204
# File 'lib/bloom_fit.rb', line 200

def |(other)
  self.class.new(size:, hashes:).tap do |result|
    result.instance_variable_set(:@bf, @bf.|(other.bf))
  end
end