Class: Philiprehberger::BloomFilter::Filter

Inherits:
Object
  • Object
show all
Defined in:
lib/philiprehberger/bloom_filter/filter.rb

Overview

A space-efficient probabilistic set membership data structure.

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(expected_items:, false_positive_rate: 0.01) ⇒ Filter

Returns a new instance of Filter.

Parameters:

  • expected_items (Integer)

    expected number of items

  • false_positive_rate (Float) (defaults to: 0.01)

    desired false positive rate

Raises:



16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 16

def initialize(expected_items:, false_positive_rate: 0.01)
  raise Error, 'expected_items must be positive' unless expected_items.is_a?(Integer) && expected_items.positive?
  unless false_positive_rate.positive? && false_positive_rate < 1
    raise Error,
          'false_positive_rate must be between 0 and 1'
  end

  @expected_items = expected_items
  @false_positive_rate = false_positive_rate
  @bit_size = BloomFilter.optimal_size(
    expected_items: expected_items,
    false_positive_rate: false_positive_rate
  )
  @hash_count = BloomFilter.optimal_hash_count(
    size: @bit_size,
    expected_items: expected_items
  )
  @bits = "\0".b * ((@bit_size + 7) / 8)
  @count = 0
end

Instance Attribute Details

#countObject (readonly)

Returns the value of attribute count.



12
13
14
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 12

def count
  @count
end

Class Method Details

.deserialize(data) ⇒ Filter

Deserialize a filter from a hash.

Parameters:

  • data (Hash)

    serialized filter data

Returns:



327
328
329
330
331
332
333
334
335
336
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 327

def self.deserialize(data)
  filter = allocate
  filter.instance_variable_set(:@expected_items, data['expected_items'])
  filter.instance_variable_set(:@false_positive_rate, data['false_positive_rate'])
  filter.instance_variable_set(:@bit_size, data['bit_size'])
  filter.instance_variable_set(:@hash_count, data['hash_count'])
  filter.instance_variable_set(:@bits, [data['bits']].pack('H*'))
  filter.instance_variable_set(:@count, data['count'])
  filter
end

.from_json(str) ⇒ Filter

Deserialize a filter from a JSON string.

Parameters:

  • str (String)

    JSON string

Returns:



195
196
197
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 195

def self.from_json(str)
  deserialize(JSON.parse(str))
end

Instance Method Details

#==(other) ⇒ Boolean Also known as: eql?

Check structural equality with another filter.

Parameters:

  • other (Filter)

    another bloom filter

Returns:

  • (Boolean)

    true if both filters have the same structure and bit array



150
151
152
153
154
155
156
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 150

def ==(other)
  return false unless other.is_a?(self.class)

  @bit_size == other.instance_variable_get(:@bit_size) &&
    @hash_count == other.instance_variable_get(:@hash_count) &&
    @bits == other.instance_variable_get(:@bits)
end

#add(item) ⇒ self

Add an item to the filter.

Parameters:

  • item (Object)

    the item to add (uses #to_s for hashing)

Returns:

  • (self)


41
42
43
44
45
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 41

def add(item)
  hash_indices(item.to_s).each { |i| set_bit(i) }
  @count += 1
  self
end

#bulk_add(items) ⇒ self

Add all items from an enumerable

Parameters:

  • items (Enumerable)

    items to add

Returns:

  • (self)


94
95
96
97
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 94

def bulk_add(items)
  items.each { |item| add(item) }
  self
end

#bulk_include?(items) ⇒ Array<Boolean>

Check membership for many items at once.

Parameters:

  • items (Enumerable)

    items to test

Returns:

  • (Array<Boolean>)

    one boolean per input item, in order



103
104
105
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 103

def bulk_include?(items)
  items.map { |item| include?(item) }
end

#clearself

Reset the filter, removing all items.

Returns:

  • (self)


77
78
79
80
81
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 77

def clear
  @bits = "\0".b * ((@bit_size + 7) / 8)
  @count = 0
  self
end

#compatible?(other) ⇒ Boolean

Check whether another filter has compatible structure for merge, intersection, union, and superset operations.

Parameters:

  • other (Object)

    candidate filter

Returns:

  • (Boolean)


256
257
258
259
260
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 256

def compatible?(other)
  other.is_a?(self.class) &&
    @bit_size == other.instance_variable_get(:@bit_size) &&
    @hash_count == other.instance_variable_get(:@hash_count)
end

#copyFilter

Create an independent deep copy of this filter.

Returns:

  • (Filter)

    a new filter with the same state



161
162
163
164
165
166
167
168
169
170
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 161

def copy
  result = self.class.allocate
  result.instance_variable_set(:@expected_items, @expected_items)
  result.instance_variable_set(:@false_positive_rate, @false_positive_rate)
  result.instance_variable_set(:@bit_size, @bit_size)
  result.instance_variable_set(:@hash_count, @hash_count)
  result.instance_variable_set(:@bits, @bits.dup)
  result.instance_variable_set(:@count, @count)
  result
end

#count_estimateFloat

Estimate the number of unique items using the fill rate

Returns:

  • (Float)

    estimated cardinality



110
111
112
113
114
115
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 110

def count_estimate
  bits_set = count_set_bits
  return 0.0 if bits_set.zero?

  -(@bit_size.to_f / @hash_count) * Math.log(1.0 - (bits_set.to_f / @bit_size))
end

#empty?Boolean

Check if the filter is empty (no items added).

Returns:

  • (Boolean)

    true if count is zero



223
224
225
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 223

def empty?
  @count.zero?
end

#false_positive_rateFloat

Calculate the actual false positive rate based on current fill rate.

Uses the formula: (1 - e^(-k*n/m))^k where k = hash_count, n = count, m = bit_size.

Returns:

  • (Float)

    estimated false positive rate



178
179
180
181
182
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 178

def false_positive_rate
  return 0.0 if @count.zero?

  (1.0 - Math.exp(-@hash_count.to_f * @count / @bit_size))**@hash_count
end

#fill_rateFloat

Return the proportion of set bits in the bit array

Returns:

  • (Float)

    fill rate between 0.0 and 1.0



142
143
144
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 142

def fill_rate
  count_set_bits.to_f / @bit_size
end

#hashInteger

Hash code consistent with #==, allowing filters to be used as Hash keys or Set members.

Returns:

  • (Integer)


291
292
293
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 291

def hash
  [self.class, @bit_size, @hash_count, @bits].hash
end

#include?(item) ⇒ Boolean

Check if an item might be in the filter.

Parameters:

  • item (Object)

    the item to check

Returns:

  • (Boolean)

    true if the item might be present, false if definitely absent



51
52
53
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 51

def include?(item)
  hash_indices(item.to_s).all? { |i| get_bit(i) }
end

#inspectString

Human-readable representation for debugging.

Returns:

  • (String)


301
302
303
304
305
306
307
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 301

def inspect
  format(
    '#<%<class>s count=%<count>d bit_size=%<bit_size>d hash_count=%<hash_count>d fill_rate=%<fill>.4f>',
    class: self.class.name, count: @count, bit_size: @bit_size,
    hash_count: @hash_count, fill: fill_rate
  )
end

#intersection(other) ⇒ Filter Also known as: &

Create a new filter containing only elements present in both filters (AND)

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Filter)

    new intersection filter

Raises:

  • (Error)

    if the filters have different bit sizes



122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 122

def intersection(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot intersect filters with different bit sizes'
  end

  result = self.class.new(expected_items: @expected_items, false_positive_rate: @false_positive_rate)
  other_bits = other.instance_variable_get(:@bits)
  result_bits = result.instance_variable_get(:@bits)

  @bits.bytesize.times do |i|
    result_bits.setbyte(i, @bits.getbyte(i) & other_bits.getbyte(i))
  end

  result
end

#memory_usageInteger

Return the memory usage of the bit array in bytes.

Returns:

  • (Integer)

    memory usage in bytes



86
87
88
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 86

def memory_usage
  @bits.bytesize
end

#merge(other) ⇒ self

Merge another bloom filter into this one.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (self)

Raises:

  • (Error)

    if the filters have different bit sizes



60
61
62
63
64
65
66
67
68
69
70
71
72
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 60

def merge(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot merge filters with different bit sizes'
  end

  other_bits = other.instance_variable_get(:@bits)
  @bits.bytesize.times do |i|
    @bits.setbyte(i, @bits.getbyte(i) | other_bits.getbyte(i))
  end
  @count += other.count
  self
end

#saturated?(threshold: 0.5) ⇒ Boolean

Check whether the filter has reached or exceeded a fill threshold.

Parameters:

  • threshold (Float) (defaults to: 0.5)

    fill rate threshold between 0.0 and 1.0

Returns:

  • (Boolean)


283
284
285
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 283

def saturated?(threshold: 0.5)
  fill_rate >= threshold
end

#serializeHash

Serialize the filter to a hash.

Returns:

  • (Hash)

    serialized representation



312
313
314
315
316
317
318
319
320
321
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 312

def serialize
  {
    'expected_items' => @expected_items,
    'false_positive_rate' => @false_positive_rate,
    'bit_size' => @bit_size,
    'hash_count' => @hash_count,
    'bits' => @bits.unpack1('H*'),
    'count' => @count
  }
end

#subset?(other) ⇒ Boolean

Check if this filter is a subset of another.

Returns true if every set bit in ‘self` is also set in `other`.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Boolean)

Raises:

  • (Error)

    if the filters have different bit sizes



269
270
271
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 269

def subset?(other)
  other.superset?(self)
end

#superset?(other) ⇒ Boolean

Check if this filter is a superset of another.

Returns true if every set bit in ‘other` is also set in `self`.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Boolean)

Raises:

  • (Error)

    if the filters have different bit sizes



206
207
208
209
210
211
212
213
214
215
216
217
218
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 206

def superset?(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot compare filters with different bit sizes'
  end

  other_bits = other.instance_variable_get(:@bits)
  @bits.bytesize.times do |i|
    ob = other_bits.getbyte(i)
    return false unless @bits.getbyte(i) & ob == ob
  end
  true
end

#to_json(*_args) ⇒ String

Serialize the filter to a JSON string.

Returns:

  • (String)

    JSON representation



187
188
189
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 187

def to_json(*_args)
  JSON.generate(serialize)
end

#union(other) ⇒ Filter Also known as: |

Create a new filter containing all items present in either filter (OR).

Non-mutating counterpart to #merge.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Filter)

    new union filter

Raises:

  • (Error)

    if the filters have different bit sizes



234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 234

def union(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot union filters with different bit sizes'
  end

  result = self.class.new(expected_items: @expected_items, false_positive_rate: @false_positive_rate)
  other_bits = other.instance_variable_get(:@bits)
  result_bits = result.instance_variable_get(:@bits)

  @bits.bytesize.times do |i|
    result_bits.setbyte(i, @bits.getbyte(i) | other_bits.getbyte(i))
  end
  result.instance_variable_set(:@count, @count + other.count)
  result
end