Class: BloomFit
- Inherits:
-
Object
- Object
- BloomFit
- Extended by:
- Forwardable
- Defined in:
- lib/bloom_fit.rb,
lib/bloom_fit/version.rb
Overview
BloomFit is an in-memory Bloom filter with a small, Set-like API.
Bloom filters are probabilistic membership structures: they can report false positives, but they do not report false negatives for values that have been added. That makes BloomFit useful for cheaply ruling out missing values before doing more expensive work, while keeping memory usage low.
The class wraps the native CBloomFilter implementation in Ruby-friendly methods such as add, include?, merge, &, and |. Instances can be serialized to msgpack with to_msgpack or persisted to disk with save and later restored with BloomFit.unpack or BloomFit.load.
Filters can only be combined when they were created with the same size and hashes values; otherwise the native extension raises ArgumentError.
filter = BloomFit.new(size: 10_000, hashes: 6)
filter.add("cat")
filter.include?("cat") # => true
filter.include?("dog") # => false
Choose size and hashes based on the expected number of inserts and the false-positive rate you can tolerate.
Constant Summary collapse
- LN2 =
Math.log(2.0).freeze
- VERSION =
"1.2.0".freeze
Instance Attribute Summary collapse
-
#bf ⇒ Object
readonly
The wrapped native
CBloomFilterinstance.
Class Method Summary collapse
-
.load(filename) ⇒ Object
Loads a filter from a file previously written by
save. -
.unpack(msg) ⇒ Object
Rebuilds a filter from the serialized data returned by
to_msgpack.
Instance Method Summary collapse
-
#&(other) ⇒ Object
(also: #intersection)
Returns a new filter containing the bitwise intersection of two filters.
-
#[]=(key, value) ⇒ Object
Adds
keyto the filter whenvalueis truthy. -
#add(key) ⇒ Object
(also: #<<)
Adds
keyto the filter and returnsself. -
#add?(key) ⇒ Boolean
Adds
keyonly if it does not already appear to be present. -
#clear ⇒ Object
Clears the filter by resetting all bits to
0and returnsself. -
#empty? ⇒ Boolean
Returns
truewhen no bits are set. -
#initialize(capacity: 100, false_positive_rate: 0.001, size: nil, hashes: 4) ⇒ BloomFit
constructor
Creates an empty Bloom filter.
-
#marshal_dump ⇒ Object
Returns the data Ruby’s
Marshaluses to serialize this filter. -
#marshal_load(ary) ⇒ Object
Rebuilds the filter from the serialized data returned by
marshal_dump. -
#merge(other) ⇒ Object
Merges another filter or collection of keys into this filter.
-
#save(filename) ⇒ Object
Writes the filter to
filenameusing msgpack format. -
#stats ⇒ Object
Returns a human-readable summary of the filter’s current state.
-
#to_binary ⇒ Object
Returns the bitmap as a binary string of
0and1characters. -
#to_hex ⇒ Object
Returns the bitmap as a hexadecimal string.
-
#to_msgpack ⇒ Object
Returns this filter serialized in msgpack format.
-
#|(other) ⇒ Object
(also: #union)
Returns a new filter containing the bitwise union of two filters.
Constructor Details
#initialize(capacity: 100, false_positive_rate: 0.001, size: nil, hashes: 4) ⇒ BloomFit
Creates an empty Bloom filter.
The defaults are a reasonable starting point for small in-memory filters, but the best values depend on how many keys you expect to insert and how many false positives you can tolerate.
50 51 52 53 54 55 56 57 58 59 60 |
# File 'lib/bloom_fit.rb', line 50 def initialize(capacity: 100, false_positive_rate: 0.001, size: nil, hashes: 4) if size.nil? || hashes.nil? raise ArgumentError, "capacity must be > 0" unless capacity.positive? raise ArgumentError, "false_positive_rate must be between 0 and 1" if false_positive_rate <= 0.0 || false_positive_rate >= 1.0 size = (-capacity.to_f * Math.log(false_positive_rate) / (LN2**2)).ceil hashes = (size / capacity * LN2).ceil end @bf = CBloomFilter.new(size, hashes) end |
Instance Attribute Details
#bf ⇒ Object (readonly)
The wrapped native CBloomFilter instance.
This is mostly useful for low-level integrations and internal filter operations such as merge, union, and intersection.
38 39 40 |
# File 'lib/bloom_fit.rb', line 38 def bf @bf end |
Class Method Details
.load(filename) ⇒ Object
Loads a filter from a file previously written by save.
The file contents are decoded from msgpack.
260 261 262 |
# File 'lib/bloom_fit.rb', line 260 def self.load(filename) unpack(File.binread(filename)) end |
.unpack(msg) ⇒ Object
Rebuilds a filter from the serialized data returned by to_msgpack.
The payload stores size, hashes, and the raw bitmap in msgpack format, making it suitable for compact transport or persistence outside Ruby’s Marshal.
243 244 245 246 247 |
# File 'lib/bloom_fit.rb', line 243 def self.unpack(msg) BloomFit.allocate.tap do |bf| bf.marshal_load(MessagePack.unpack(msg)) end end |
Instance Method Details
#&(other) ⇒ Object Also known as: intersection
Returns a new filter containing the bitwise intersection of two filters.
Both filters must have the same size and hashes values or the native extension raises ArgumentError.
Like all Bloom filter operations, membership checks on the result remain probabilistic and may still produce false positives.
187 188 189 190 191 |
# File 'lib/bloom_fit.rb', line 187 def &(other) self.class.new(size:, hashes:).tap do |result| result.instance_variable_set(:@bf, @bf.&(other.bf)) end end |
#[]=(key, value) ⇒ Object
Adds key to the filter when value is truthy.
This makes BloomFit behave like a write-only membership hash: truthy values add the key, while false and nil are ignored.
125 126 127 |
# File 'lib/bloom_fit.rb', line 125 def []=(key, value) @bf.add(key) if value end |
#add(key) ⇒ Object Also known as: <<
Adds key to the filter and returns self.
This mimics the behavior of Set#add and allows chaining with #<<.
109 110 111 112 |
# File 'lib/bloom_fit.rb', line 109 def add(key) @bf.add(key) self end |
#add?(key) ⇒ Boolean
Adds key only if it does not already appear to be present.
Returns self when the key is added and nil when include? is already true. This mimics Set#add?.
Because Bloom filters can return false positives, add? may occasionally return nil for a key that has not actually been inserted before.
136 137 138 139 |
# File 'lib/bloom_fit.rb', line 136 def add?(key) return nil if include?(key) # rubocop:disable Style/ReturnNilInPredicateMethodDefinition add(key) end |
#clear ⇒ Object
Clears the filter by resetting all bits to 0 and returns self.
116 117 118 119 |
# File 'lib/bloom_fit.rb', line 116 def clear @bf.clear self end |
#empty? ⇒ Boolean
Returns true when no bits are set.
This is an exact check on the filter state, unlike include?, which is probabilistic for positive matches.
102 103 104 |
# File 'lib/bloom_fit.rb', line 102 def empty? set_bits.zero? end |
#marshal_dump ⇒ Object
Returns the data Ruby’s Marshal uses to serialize this filter.
234 235 236 |
# File 'lib/bloom_fit.rb', line 234 def marshal_dump [size, hashes, bitmap] end |
#marshal_load(ary) ⇒ Object
Rebuilds the filter from the serialized data returned by marshal_dump.
This hook is used by Ruby’s Marshal support.
226 227 228 229 230 231 |
# File 'lib/bloom_fit.rb', line 226 def marshal_load(ary) size, hashes, bitmap = *ary initialize(size:, hashes:) @bf.load(bitmap) if bitmap end |
#merge(other) ⇒ Object
Merges another filter or collection of keys into this filter.
When other is a BloomFit, the merge is performed bitwise and both filters must have the same size and hashes values. When other behaves like a hash, only keys with truthy values are added. Any other enumerable is treated as a list of keys.
This method mutates the receiver and mimics Set#merge.
166 167 168 169 170 171 172 173 174 175 176 177 178 |
# File 'lib/bloom_fit.rb', line 166 def merge(other) if other.is_a?(BloomFit) @bf.merge(other.bf) elsif other.respond_to?(:each_key) other.each { |k, v| add(k) if v } elsif other.is_a?(Enumerable) other.each { |k| add(k) } else raise ArgumentError, "value must be enumerable or another BloomFit filter" end self end |
#save(filename) ⇒ Object
Writes the filter to filename using msgpack format.
This produces a compact binary payload that can be restored with BloomFit.load.
268 269 270 |
# File 'lib/bloom_fit.rb', line 268 def save(filename) File.binwrite(filename, to_msgpack) end |
#stats ⇒ Object
Returns a human-readable summary of the filter’s current state.
The report includes the configured width (m), the current number of set bits (n), the hash count (k), and the predicted false-positive rate based on the current fill level.
212 213 214 215 216 217 218 219 220 221 |
# File 'lib/bloom_fit.rb', line 212 def stats fpr = ((n.to_f / m)**k) * 100 format <<~STATS, m, n, k, fpr Number of filter buckets (m): %d Number of set bits (n): %d Number of filter hashes (k): %d Predicted false positive rate: %.2f%% STATS end |
#to_binary ⇒ Object
Returns the bitmap as a binary string of 0 and 1 characters.
The output is truncated to the configured filter width, so it omits any trailing padding present in the native bitmap.
154 155 156 |
# File 'lib/bloom_fit.rb', line 154 def to_binary bitmap.unpack1("B*")[0...size] end |
#to_hex ⇒ Object
Returns the bitmap as a hexadecimal string.
This is useful for debugging, logging, or comparing filter state in a more compact form than to_binary.
145 146 147 148 |
# File 'lib/bloom_fit.rb', line 145 def to_hex length = ((size / 8.0).ceil * 8 / 4) bitmap.unpack1("H*")[0...length] end |
#to_msgpack ⇒ Object
Returns this filter serialized in msgpack format.
The encoded payload contains the same three values as marshal_dump: size, hashes, and the raw bitmap.
253 254 255 |
# File 'lib/bloom_fit.rb', line 253 def to_msgpack MessagePack.pack(marshal_dump) end |
#|(other) ⇒ Object Also known as: union
Returns a new filter containing the bitwise union of two filters.
Both filters must have the same size and hashes values or the native extension raises ArgumentError.
The receiver and other are left unchanged.
200 201 202 203 204 |
# File 'lib/bloom_fit.rb', line 200 def |(other) self.class.new(size:, hashes:).tap do |result| result.instance_variable_set(:@bf, @bf.|(other.bf)) end end |