Class: Philiprehberger::BloomFilter::Filter

Inherits:
Object
  • Object
show all
Defined in:
lib/philiprehberger/bloom_filter/filter.rb

Overview

A space-efficient probabilistic set membership data structure.

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(expected_items:, false_positive_rate: 0.01) ⇒ Filter

Returns a new instance of Filter.

Parameters:

  • expected_items (Integer)

    expected number of items

  • false_positive_rate (Float) (defaults to: 0.01)

    desired false positive rate

Raises:



16
17
18
19
20
21
22
23
24
25
26
27
28
29
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 16

def initialize(expected_items:, false_positive_rate: 0.01)
  raise Error, 'expected_items must be positive' unless expected_items.is_a?(Integer) && expected_items.positive?
  unless false_positive_rate.positive? && false_positive_rate < 1
    raise Error,
          'false_positive_rate must be between 0 and 1'
  end

  @expected_items = expected_items
  @false_positive_rate = false_positive_rate
  @bit_size = optimal_bit_size(expected_items, false_positive_rate)
  @hash_count = optimal_hash_count(@bit_size, expected_items)
  @bits = "\0".b * ((@bit_size + 7) / 8)
  @count = 0
end

Instance Attribute Details

#countObject (readonly)

Returns the value of attribute count.



12
13
14
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 12

def count
  @count
end

Class Method Details

.deserialize(data) ⇒ Filter

Deserialize a filter from a hash.

Parameters:

  • data (Hash)

    serialized filter data

Returns:



313
314
315
316
317
318
319
320
321
322
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 313

def self.deserialize(data)
  filter = allocate
  filter.instance_variable_set(:@expected_items, data['expected_items'])
  filter.instance_variable_set(:@false_positive_rate, data['false_positive_rate'])
  filter.instance_variable_set(:@bit_size, data['bit_size'])
  filter.instance_variable_set(:@hash_count, data['hash_count'])
  filter.instance_variable_set(:@bits, [data['bits']].pack('H*'))
  filter.instance_variable_set(:@count, data['count'])
  filter
end

.from_json(str) ⇒ Filter

Deserialize a filter from a JSON string.

Parameters:

  • str (String)

    JSON string

Returns:



181
182
183
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 181

def self.from_json(str)
  deserialize(JSON.parse(str))
end

Instance Method Details

#==(other) ⇒ Boolean Also known as: eql?

Check structural equality with another filter.

Parameters:

  • other (Filter)

    another bloom filter

Returns:

  • (Boolean)

    true if both filters have the same structure and bit array



136
137
138
139
140
141
142
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 136

def ==(other)
  return false unless other.is_a?(self.class)

  @bit_size == other.instance_variable_get(:@bit_size) &&
    @hash_count == other.instance_variable_get(:@hash_count) &&
    @bits == other.instance_variable_get(:@bits)
end

#add(item) ⇒ self

Add an item to the filter.

Parameters:

  • item (Object)

    the item to add (uses #to_s for hashing)

Returns:

  • (self)


35
36
37
38
39
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 35

def add(item)
  hash_indices(item.to_s).each { |i| set_bit(i) }
  @count += 1
  self
end

#bulk_add(items) ⇒ self

Add all items from an enumerable

Parameters:

  • items (Enumerable)

    items to add

Returns:

  • (self)


88
89
90
91
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 88

def bulk_add(items)
  items.each { |item| add(item) }
  self
end

#clearself

Reset the filter, removing all items.

Returns:

  • (self)


71
72
73
74
75
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 71

def clear
  @bits = "\0".b * ((@bit_size + 7) / 8)
  @count = 0
  self
end

#compatible?(other) ⇒ Boolean

Check whether another filter has compatible structure for merge, intersection, union, and superset operations.

Parameters:

  • other (Object)

    candidate filter

Returns:

  • (Boolean)


242
243
244
245
246
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 242

def compatible?(other)
  other.is_a?(self.class) &&
    @bit_size == other.instance_variable_get(:@bit_size) &&
    @hash_count == other.instance_variable_get(:@hash_count)
end

#copyFilter

Create an independent deep copy of this filter.

Returns:

  • (Filter)

    a new filter with the same state



147
148
149
150
151
152
153
154
155
156
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 147

def copy
  result = self.class.allocate
  result.instance_variable_set(:@expected_items, @expected_items)
  result.instance_variable_set(:@false_positive_rate, @false_positive_rate)
  result.instance_variable_set(:@bit_size, @bit_size)
  result.instance_variable_set(:@hash_count, @hash_count)
  result.instance_variable_set(:@bits, @bits.dup)
  result.instance_variable_set(:@count, @count)
  result
end

#count_estimateFloat

Estimate the number of unique items using the fill rate

Returns:

  • (Float)

    estimated cardinality



96
97
98
99
100
101
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 96

def count_estimate
  bits_set = count_set_bits
  return 0.0 if bits_set.zero?

  -(@bit_size.to_f / @hash_count) * Math.log(1.0 - (bits_set.to_f / @bit_size))
end

#empty?Boolean

Check if the filter is empty (no items added).

Returns:

  • (Boolean)

    true if count is zero



209
210
211
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 209

def empty?
  @count.zero?
end

#false_positive_rateFloat

Calculate the actual false positive rate based on current fill rate.

Uses the formula: (1 - e^(-k*n/m))^k where k = hash_count, n = count, m = bit_size.

Returns:

  • (Float)

    estimated false positive rate



164
165
166
167
168
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 164

def false_positive_rate
  return 0.0 if @count.zero?

  (1.0 - Math.exp(-@hash_count.to_f * @count / @bit_size))**@hash_count
end

#fill_rateFloat

Return the proportion of set bits in the bit array

Returns:

  • (Float)

    fill rate between 0.0 and 1.0



128
129
130
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 128

def fill_rate
  count_set_bits.to_f / @bit_size
end

#hashInteger

Hash code consistent with #==, allowing filters to be used as Hash keys or Set members.

Returns:

  • (Integer)


277
278
279
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 277

def hash
  [self.class, @bit_size, @hash_count, @bits].hash
end

#include?(item) ⇒ Boolean

Check if an item might be in the filter.

Parameters:

  • item (Object)

    the item to check

Returns:

  • (Boolean)

    true if the item might be present, false if definitely absent



45
46
47
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 45

def include?(item)
  hash_indices(item.to_s).all? { |i| get_bit(i) }
end

#inspectString

Human-readable representation for debugging.

Returns:

  • (String)


287
288
289
290
291
292
293
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 287

def inspect
  format(
    '#<%<class>s count=%<count>d bit_size=%<bit_size>d hash_count=%<hash_count>d fill_rate=%<fill>.4f>',
    class: self.class.name, count: @count, bit_size: @bit_size,
    hash_count: @hash_count, fill: fill_rate
  )
end

#intersection(other) ⇒ Filter Also known as: &

Create a new filter containing only elements present in both filters (AND)

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Filter)

    new intersection filter

Raises:

  • (Error)

    if the filters have different bit sizes



108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 108

def intersection(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot intersect filters with different bit sizes'
  end

  result = self.class.new(expected_items: @expected_items, false_positive_rate: @false_positive_rate)
  other_bits = other.instance_variable_get(:@bits)
  result_bits = result.instance_variable_get(:@bits)

  @bits.bytesize.times do |i|
    result_bits.setbyte(i, @bits.getbyte(i) & other_bits.getbyte(i))
  end

  result
end

#memory_usageInteger

Return the memory usage of the bit array in bytes.

Returns:

  • (Integer)

    memory usage in bytes



80
81
82
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 80

def memory_usage
  @bits.bytesize
end

#merge(other) ⇒ self

Merge another bloom filter into this one.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (self)

Raises:

  • (Error)

    if the filters have different bit sizes



54
55
56
57
58
59
60
61
62
63
64
65
66
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 54

def merge(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot merge filters with different bit sizes'
  end

  other_bits = other.instance_variable_get(:@bits)
  @bits.bytesize.times do |i|
    @bits.setbyte(i, @bits.getbyte(i) | other_bits.getbyte(i))
  end
  @count += other.count
  self
end

#saturated?(threshold: 0.5) ⇒ Boolean

Check whether the filter has reached or exceeded a fill threshold.

Parameters:

  • threshold (Float) (defaults to: 0.5)

    fill rate threshold between 0.0 and 1.0

Returns:

  • (Boolean)


269
270
271
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 269

def saturated?(threshold: 0.5)
  fill_rate >= threshold
end

#serializeHash

Serialize the filter to a hash.

Returns:

  • (Hash)

    serialized representation



298
299
300
301
302
303
304
305
306
307
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 298

def serialize
  {
    'expected_items' => @expected_items,
    'false_positive_rate' => @false_positive_rate,
    'bit_size' => @bit_size,
    'hash_count' => @hash_count,
    'bits' => @bits.unpack1('H*'),
    'count' => @count
  }
end

#subset?(other) ⇒ Boolean

Check if this filter is a subset of another.

Returns true if every set bit in ‘self` is also set in `other`.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Boolean)

Raises:

  • (Error)

    if the filters have different bit sizes



255
256
257
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 255

def subset?(other)
  other.superset?(self)
end

#superset?(other) ⇒ Boolean

Check if this filter is a superset of another.

Returns true if every set bit in ‘other` is also set in `self`.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Boolean)

Raises:

  • (Error)

    if the filters have different bit sizes



192
193
194
195
196
197
198
199
200
201
202
203
204
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 192

def superset?(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot compare filters with different bit sizes'
  end

  other_bits = other.instance_variable_get(:@bits)
  @bits.bytesize.times do |i|
    ob = other_bits.getbyte(i)
    return false unless @bits.getbyte(i) & ob == ob
  end
  true
end

#to_json(*_args) ⇒ String

Serialize the filter to a JSON string.

Returns:

  • (String)

    JSON representation



173
174
175
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 173

def to_json(*_args)
  JSON.generate(serialize)
end

#union(other) ⇒ Filter Also known as: |

Create a new filter containing all items present in either filter (OR).

Non-mutating counterpart to #merge.

Parameters:

  • other (Filter)

    another bloom filter with the same parameters

Returns:

  • (Filter)

    new union filter

Raises:

  • (Error)

    if the filters have different bit sizes



220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
# File 'lib/philiprehberger/bloom_filter/filter.rb', line 220

def union(other)
  unless @bit_size == other.instance_variable_get(:@bit_size)
    raise Error,
          'cannot union filters with different bit sizes'
  end

  result = self.class.new(expected_items: @expected_items, false_positive_rate: @false_positive_rate)
  other_bits = other.instance_variable_get(:@bits)
  result_bits = result.instance_variable_get(:@bits)

  @bits.bytesize.times do |i|
    result_bits.setbyte(i, @bits.getbyte(i) | other_bits.getbyte(i))
  end
  result.instance_variable_set(:@count, @count + other.count)
  result
end