Class: Philiprehberger::BloomFilter::Filter
- Inherits:
-
Object
- Object
- Philiprehberger::BloomFilter::Filter
- Defined in:
- lib/philiprehberger/bloom_filter/filter.rb
Overview
A space-efficient probabilistic set membership data structure.
Instance Attribute Summary collapse
-
#count ⇒ Object
readonly
Returns the value of attribute count.
Class Method Summary collapse
-
.deserialize(data) ⇒ Filter
Deserialize a filter from a hash.
-
.from_json(str) ⇒ Filter
Deserialize a filter from a JSON string.
Instance Method Summary collapse
-
#==(other) ⇒ Boolean
(also: #eql?)
Check structural equality with another filter.
-
#add(item) ⇒ self
Add an item to the filter.
-
#bulk_add(items) ⇒ self
Add all items from an enumerable.
-
#clear ⇒ self
Reset the filter, removing all items.
-
#compatible?(other) ⇒ Boolean
Check whether another filter has compatible structure for merge, intersection, union, and superset operations.
-
#copy ⇒ Filter
Create an independent deep copy of this filter.
-
#count_estimate ⇒ Float
Estimate the number of unique items using the fill rate.
-
#empty? ⇒ Boolean
Check if the filter is empty (no items added).
-
#false_positive_rate ⇒ Float
Calculate the actual false positive rate based on current fill rate.
-
#fill_rate ⇒ Float
Return the proportion of set bits in the bit array.
-
#hash ⇒ Integer
Hash code consistent with #==, allowing filters to be used as Hash keys or Set members.
-
#include?(item) ⇒ Boolean
Check if an item might be in the filter.
-
#initialize(expected_items:, false_positive_rate: 0.01) ⇒ Filter
constructor
A new instance of Filter.
-
#inspect ⇒ String
Human-readable representation for debugging.
-
#intersection(other) ⇒ Filter
(also: #&)
Create a new filter containing only elements present in both filters (AND).
-
#memory_usage ⇒ Integer
Return the memory usage of the bit array in bytes.
-
#merge(other) ⇒ self
Merge another bloom filter into this one.
-
#saturated?(threshold: 0.5) ⇒ Boolean
Check whether the filter has reached or exceeded a fill threshold.
-
#serialize ⇒ Hash
Serialize the filter to a hash.
-
#subset?(other) ⇒ Boolean
Check if this filter is a subset of another.
-
#superset?(other) ⇒ Boolean
Check if this filter is a superset of another.
-
#to_json(*_args) ⇒ String
Serialize the filter to a JSON string.
-
#union(other) ⇒ Filter
(also: #|)
Create a new filter containing all items present in either filter (OR).
Constructor Details
#initialize(expected_items:, false_positive_rate: 0.01) ⇒ Filter
Returns a new instance of Filter.
16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 16 def initialize(expected_items:, false_positive_rate: 0.01) raise Error, 'expected_items must be positive' unless expected_items.is_a?(Integer) && expected_items.positive? unless false_positive_rate.positive? && false_positive_rate < 1 raise Error, 'false_positive_rate must be between 0 and 1' end @expected_items = expected_items @false_positive_rate = false_positive_rate @bit_size = optimal_bit_size(expected_items, false_positive_rate) @hash_count = optimal_hash_count(@bit_size, expected_items) @bits = "\0".b * ((@bit_size + 7) / 8) @count = 0 end |
Instance Attribute Details
#count ⇒ Object (readonly)
Returns the value of attribute count.
12 13 14 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 12 def count @count end |
Class Method Details
.deserialize(data) ⇒ Filter
Deserialize a filter from a hash.
313 314 315 316 317 318 319 320 321 322 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 313 def self.deserialize(data) filter = allocate filter.instance_variable_set(:@expected_items, data['expected_items']) filter.instance_variable_set(:@false_positive_rate, data['false_positive_rate']) filter.instance_variable_set(:@bit_size, data['bit_size']) filter.instance_variable_set(:@hash_count, data['hash_count']) filter.instance_variable_set(:@bits, [data['bits']].pack('H*')) filter.instance_variable_set(:@count, data['count']) filter end |
.from_json(str) ⇒ Filter
Deserialize a filter from a JSON string.
181 182 183 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 181 def self.from_json(str) deserialize(JSON.parse(str)) end |
Instance Method Details
#==(other) ⇒ Boolean Also known as: eql?
Check structural equality with another filter.
136 137 138 139 140 141 142 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 136 def ==(other) return false unless other.is_a?(self.class) @bit_size == other.instance_variable_get(:@bit_size) && @hash_count == other.instance_variable_get(:@hash_count) && @bits == other.instance_variable_get(:@bits) end |
#add(item) ⇒ self
Add an item to the filter.
35 36 37 38 39 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 35 def add(item) hash_indices(item.to_s).each { |i| set_bit(i) } @count += 1 self end |
#bulk_add(items) ⇒ self
Add all items from an enumerable
88 89 90 91 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 88 def bulk_add(items) items.each { |item| add(item) } self end |
#clear ⇒ self
Reset the filter, removing all items.
71 72 73 74 75 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 71 def clear @bits = "\0".b * ((@bit_size + 7) / 8) @count = 0 self end |
#compatible?(other) ⇒ Boolean
Check whether another filter has compatible structure for merge, intersection, union, and superset operations.
242 243 244 245 246 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 242 def compatible?(other) other.is_a?(self.class) && @bit_size == other.instance_variable_get(:@bit_size) && @hash_count == other.instance_variable_get(:@hash_count) end |
#copy ⇒ Filter
Create an independent deep copy of this filter.
147 148 149 150 151 152 153 154 155 156 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 147 def copy result = self.class.allocate result.instance_variable_set(:@expected_items, @expected_items) result.instance_variable_set(:@false_positive_rate, @false_positive_rate) result.instance_variable_set(:@bit_size, @bit_size) result.instance_variable_set(:@hash_count, @hash_count) result.instance_variable_set(:@bits, @bits.dup) result.instance_variable_set(:@count, @count) result end |
#count_estimate ⇒ Float
Estimate the number of unique items using the fill rate
96 97 98 99 100 101 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 96 def count_estimate bits_set = count_set_bits return 0.0 if bits_set.zero? -(@bit_size.to_f / @hash_count) * Math.log(1.0 - (bits_set.to_f / @bit_size)) end |
#empty? ⇒ Boolean
Check if the filter is empty (no items added).
209 210 211 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 209 def empty? @count.zero? end |
#false_positive_rate ⇒ Float
Calculate the actual false positive rate based on current fill rate.
Uses the formula: (1 - e^(-k*n/m))^k where k = hash_count, n = count, m = bit_size.
164 165 166 167 168 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 164 def false_positive_rate return 0.0 if @count.zero? (1.0 - Math.exp(-@hash_count.to_f * @count / @bit_size))**@hash_count end |
#fill_rate ⇒ Float
Return the proportion of set bits in the bit array
128 129 130 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 128 def fill_rate count_set_bits.to_f / @bit_size end |
#hash ⇒ Integer
Hash code consistent with #==, allowing filters to be used as Hash keys or Set members.
277 278 279 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 277 def hash [self.class, @bit_size, @hash_count, @bits].hash end |
#include?(item) ⇒ Boolean
Check if an item might be in the filter.
45 46 47 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 45 def include?(item) hash_indices(item.to_s).all? { |i| get_bit(i) } end |
#inspect ⇒ String
Human-readable representation for debugging.
287 288 289 290 291 292 293 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 287 def inspect format( '#<%<class>s count=%<count>d bit_size=%<bit_size>d hash_count=%<hash_count>d fill_rate=%<fill>.4f>', class: self.class.name, count: @count, bit_size: @bit_size, hash_count: @hash_count, fill: fill_rate ) end |
#intersection(other) ⇒ Filter Also known as: &
Create a new filter containing only elements present in both filters (AND)
108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 108 def intersection(other) unless @bit_size == other.instance_variable_get(:@bit_size) raise Error, 'cannot intersect filters with different bit sizes' end result = self.class.new(expected_items: @expected_items, false_positive_rate: @false_positive_rate) other_bits = other.instance_variable_get(:@bits) result_bits = result.instance_variable_get(:@bits) @bits.bytesize.times do |i| result_bits.setbyte(i, @bits.getbyte(i) & other_bits.getbyte(i)) end result end |
#memory_usage ⇒ Integer
Return the memory usage of the bit array in bytes.
80 81 82 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 80 def memory_usage @bits.bytesize end |
#merge(other) ⇒ self
Merge another bloom filter into this one.
54 55 56 57 58 59 60 61 62 63 64 65 66 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 54 def merge(other) unless @bit_size == other.instance_variable_get(:@bit_size) raise Error, 'cannot merge filters with different bit sizes' end other_bits = other.instance_variable_get(:@bits) @bits.bytesize.times do |i| @bits.setbyte(i, @bits.getbyte(i) | other_bits.getbyte(i)) end @count += other.count self end |
#saturated?(threshold: 0.5) ⇒ Boolean
Check whether the filter has reached or exceeded a fill threshold.
269 270 271 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 269 def saturated?(threshold: 0.5) fill_rate >= threshold end |
#serialize ⇒ Hash
Serialize the filter to a hash.
298 299 300 301 302 303 304 305 306 307 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 298 def serialize { 'expected_items' => @expected_items, 'false_positive_rate' => @false_positive_rate, 'bit_size' => @bit_size, 'hash_count' => @hash_count, 'bits' => @bits.unpack1('H*'), 'count' => @count } end |
#subset?(other) ⇒ Boolean
Check if this filter is a subset of another.
Returns true if every set bit in ‘self` is also set in `other`.
255 256 257 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 255 def subset?(other) other.superset?(self) end |
#superset?(other) ⇒ Boolean
Check if this filter is a superset of another.
Returns true if every set bit in ‘other` is also set in `self`.
192 193 194 195 196 197 198 199 200 201 202 203 204 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 192 def superset?(other) unless @bit_size == other.instance_variable_get(:@bit_size) raise Error, 'cannot compare filters with different bit sizes' end other_bits = other.instance_variable_get(:@bits) @bits.bytesize.times do |i| ob = other_bits.getbyte(i) return false unless @bits.getbyte(i) & ob == ob end true end |
#to_json(*_args) ⇒ String
Serialize the filter to a JSON string.
173 174 175 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 173 def to_json(*_args) JSON.generate(serialize) end |
#union(other) ⇒ Filter Also known as: |
Create a new filter containing all items present in either filter (OR).
Non-mutating counterpart to #merge.
220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 |
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 220 def union(other) unless @bit_size == other.instance_variable_get(:@bit_size) raise Error, 'cannot union filters with different bit sizes' end result = self.class.new(expected_items: @expected_items, false_positive_rate: @false_positive_rate) other_bits = other.instance_variable_get(:@bits) result_bits = result.instance_variable_get(:@bits) @bits.bytesize.times do |i| result_bits.setbyte(i, @bits.getbyte(i) | other_bits.getbyte(i)) end result.instance_variable_set(:@count, @count + other.count) result end |