Class: Philiprehberger::BloomFilter::Filter

Inherits:
Object
  • Object
show all
Defined in:
lib/philiprehberger/bloom_filter/filter.rb

Overview

A space-efficient probabilistic set membership data structure.

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(expected_items:, false_positive_rate: 0.01) ⇒ Filter

Returns a new instance of Filter.

Parameters:

  • expected_items (Integer)

    expected number of items

  • false_positive_rate (Float) (defaults to: 0.01)

    desired false positive rate

Raises:



16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 16

def initialize(expected_items:, false_positive_rate: 0.01)
  raise Error, 'expected_items must be positive' unless expected_items.is_a?(Integer) && expected_items.positive?
  unless false_positive_rate.positive? && false_positive_rate < 1
    raise Error,
          'false_positive_rate must be between 0 and 1'
  end

  @expected_items = expected_items
  @false_positive_rate = false_positive_rate
  @bit_size = BloomFilter.optimal_size(
    expected_items: expected_items,
    false_positive_rate: false_positive_rate
  )
  @hash_count = BloomFilter.optimal_hash_count(
    size: @bit_size,
    expected_items: expected_items
  )
  @bits = "\0".b * ((@bit_size + 7) / 8)
  @count = 0
end

Instance Attribute Details

#countObject (readonly)

Returns the value of attribute count.



12
13
14
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 12

def count
  @count
end

Class Method Details

.deserialize(data) ⇒ Filter

Deserialize a filter from a hash.

Parameters:

  • data (Hash)

    serialized filter data

Returns:



319
320
321
322
323
324
325
326
327
328
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 319

def self.deserialize(data)
  filter = allocate
  filter.instance_variable_set(:@expected_items, data['expected_items'])
  filter.instance_variable_set(:@false_positive_rate, data['false_positive_rate'])
  filter.instance_variable_set(:@bit_size, data['bit_size'])
  filter.instance_variable_set(:@hash_count, data['hash_count'])
  filter.instance_variable_set(:@bits, [data['bits']].pack('H*'))
  filter.instance_variable_set(:@count, data['count'])
  filter
end

.from_json(str) ⇒ Filter

Deserialize a filter from a JSON string.

Parameters:

  • str (String)

    JSON string

Returns:



187
188
189
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 187

def self.from_json(str)
  deserialize(JSON.parse(str))
end

Instance Method Details

#==(other) ⇒ Boolean Also known as: eql?

Check structural equality with another filter.

Parameters:

  • other (Filter)

    another bloom filter

Returns:

  • (Boolean)

    true if both filters have the same structure and bit array



142
143
144
145
146
147
148
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 142

def ==(other)
  return false unless other.is_a?(self.class)

  @bit_size == other.instance_variable_get(:@bit_size) &&
    @hash_count == other.instance_variable_get(:@hash_count) &&
    @bits == other.instance_variable_get(:@bits)
end

#add(item) ⇒ self

Add an item to the filter.

Parameters:

  • item (Object)

    the item to add (uses #to_s for hashing)

Returns:

  • (self)


41
42
43
44
45
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 41

def add(item)
  hash_indices(item.to_s).each { |i| set_bit(i) }
  @count += 1
  self
end

#bulk_add(items) ⇒ self

Add all items from an enumerable

Parameters:

  • items (Enumerable)

    items to add

Returns:

  • (self)


94
95
96
97
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 94

def bulk_add(items)
  items.each { |item| add(item) }
  self
end

#clearself

Reset the filter, removing all items.

Returns:

  • (self)


77
78
79
80
81
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 77

def clear
  @bits = "\0".b * ((@bit_size + 7) / 8)
  @count = 0
  self
end

#compatible?(other) ⇒ Boolean

Check whether another filter has compatible structure for merge, intersection, union, and superset operations.

Parameters:

  • other (Object)

    candidate filter

Returns:

  • (Boolean)


248
249
250
251
252
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 248

def compatible?(other)
  other.is_a?(self.class) &&
    @bit_size == other.instance_variable_get(:@bit_size) &&
    @hash_count == other.instance_variable_get(:@hash_count)
end

#copyFilter

Create an independent deep copy of this filter.

Returns:

  • (Filter)

    a new filter with the same state



153
154
155
156
157
158
159
160
161
162
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 153

def copy
  result = self.class.allocate
  result.instance_variable_set(:@expected_items, @expected_items)
  result.instance_variable_set(:@false_positive_rate, @false_positive_rate)
  result.instance_variable_set(:@bit_size, @bit_size)
  result.instance_variable_set(:@hash_count, @hash_count)
  result.instance_variable_set(:@bits, @bits.dup)
  result.instance_variable_set(:@count, @count)
  result
end

#count_estimateFloat

Estimate the number of unique items using the fill rate

Returns:

  • (Float)

    estimated cardinality



102
103
104
105
106
107
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 102

def count_estimate
  bits_set = count_set_bits
  return 0.0 if bits_set.zero?

  -(@bit_size.to_f / @hash_count) * Math.log(1.0 - (bits_set.to_f / @bit_size))
end

#empty?Boolean

Check if the filter is empty (no items added).

Returns:

  • (Boolean)

    true if count is zero



215
216
217
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 215

def empty?
  @count.zero?
end

#false_positive_rateFloat

Calculate the actual false positive rate based on current fill rate.

Uses the formula: (1 - e^(-k*n/m))^k where k = hash_count, n = count, m = bit_size.

Returns:

  • (Float)

    estimated false positive rate



170
171
172
173
174
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 170

def false_positive_rate
  return 0.0 if @count.zero?

  (1.0 - Math.exp(-@hash_count.to_f * @count / @bit_size))**@hash_count
end

#fill_rateFloat

Return the proportion of set bits in the bit array

Returns:

  • (Float)

    fill rate between 0.0 and 1.0



134
135
136
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 134

def fill_rate
  count_set_bits.to_f / @bit_size
end

#hashInteger

Hash code consistent with #==, allowing filters to be used as Hash keys or Set members.

Returns:

  • (Integer)


283
284
285
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 283

def hash
  [self.class, @bit_size, @hash_count, @bits].hash
end

#include?(item) ⇒ Boolean

Check if an item might be in the filter.

Parameters:

  • item (Object)

    the item to check

Returns:

  • (Boolean)

    true if the item might be present, false if definitely absent



51
52
53
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 51

def include?(item)
  hash_indices(item.to_s).all? { |i| get_bit(i) }
end

#inspectString

Human-readable representation for debugging.

Returns:

  • (String)


293
294
295
296
297
298
299
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 293

def inspect
  format(
    '#<%<class>s count=%<count>d bit_size=%<bit_size>d hash_count=%<hash_count>d fill_rate=%<fill>.4f>',
    class: self.class.name, count: @count, bit_size: @bit_size,
    hash_count: @hash_count, fill: fill_rate
  )
end

#intersection(other) ⇒ Filter Also known as: &

Create a new filter containing only elements present in both filters (AND)

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Filter)

    new intersection filter

Raises:

  • (Error)

    if the filters have different bit sizes



114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 114

def intersection(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot intersect filters with different bit sizes'
  end

  result = self.class.new(expected_items: @expected_items, false_positive_rate: @false_positive_rate)
  other_bits = other.instance_variable_get(:@bits)
  result_bits = result.instance_variable_get(:@bits)

  @bits.bytesize.times do |i|
    result_bits.setbyte(i, @bits.getbyte(i) & other_bits.getbyte(i))
  end

  result
end

#memory_usageInteger

Return the memory usage of the bit array in bytes.

Returns:

  • (Integer)

    memory usage in bytes



86
87
88
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 86

def memory_usage
  @bits.bytesize
end

#merge(other) ⇒ self

Merge another bloom filter into this one.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (self)

Raises:

  • (Error)

    if the filters have different bit sizes



60
61
62
63
64
65
66
67
68
69
70
71
72
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 60

def merge(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot merge filters with different bit sizes'
  end

  other_bits = other.instance_variable_get(:@bits)
  @bits.bytesize.times do |i|
    @bits.setbyte(i, @bits.getbyte(i) | other_bits.getbyte(i))
  end
  @count += other.count
  self
end

#saturated?(threshold: 0.5) ⇒ Boolean

Check whether the filter has reached or exceeded a fill threshold.

Parameters:

  • threshold (Float) (defaults to: 0.5)

    fill rate threshold between 0.0 and 1.0

Returns:

  • (Boolean)


275
276
277
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 275

def saturated?(threshold: 0.5)
  fill_rate >= threshold
end

#serializeHash

Serialize the filter to a hash.

Returns:

  • (Hash)

    serialized representation



304
305
306
307
308
309
310
311
312
313
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 304

def serialize
  {
    'expected_items' => @expected_items,
    'false_positive_rate' => @false_positive_rate,
    'bit_size' => @bit_size,
    'hash_count' => @hash_count,
    'bits' => @bits.unpack1('H*'),
    'count' => @count
  }
end

#subset?(other) ⇒ Boolean

Check if this filter is a subset of another.

Returns true if every set bit in ‘self` is also set in `other`.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Boolean)

Raises:

  • (Error)

    if the filters have different bit sizes



261
262
263
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 261

def subset?(other)
  other.superset?(self)
end

#superset?(other) ⇒ Boolean

Check if this filter is a superset of another.

Returns true if every set bit in ‘other` is also set in `self`.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Boolean)

Raises:

  • (Error)

    if the filters have different bit sizes



198
199
200
201
202
203
204
205
206
207
208
209
210
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 198

def superset?(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot compare filters with different bit sizes'
  end

  other_bits = other.instance_variable_get(:@bits)
  @bits.bytesize.times do |i|
    ob = other_bits.getbyte(i)
    return false unless @bits.getbyte(i) & ob == ob
  end
  true
end

#to_json(*_args) ⇒ String

Serialize the filter to a JSON string.

Returns:

  • (String)

    JSON representation



179
180
181
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 179

def to_json(*_args)
  JSON.generate(serialize)
end

#union(other) ⇒ Filter Also known as: |

Create a new filter containing all items present in either filter (OR).

Non-mutating counterpart to #merge.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Filter)

    new union filter

Raises:

  • (Error)

    if the filters have different bit sizes



226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 226

def union(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot union filters with different bit sizes'
  end

  result = self.class.new(expected_items: @expected_items, false_positive_rate: @false_positive_rate)
  other_bits = other.instance_variable_get(:@bits)
  result_bits = result.instance_variable_get(:@bits)

  @bits.bytesize.times do |i|
    result_bits.setbyte(i, @bits.getbyte(i) | other_bits.getbyte(i))
  end
  result.instance_variable_set(:@count, @count + other.count)
  result
end