Class: Kotoshu::DataStructures::BloomFilter

Inherits:
Object
  • Object
show all
Defined in:
lib/kotoshu/data_structures/bloom_filter.rb

Overview

Bloom filter - probabilistic data structure for fast membership testing.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

Examples:

Basic usage

filter = BloomFilter.new
filter.add("hello")
filter.include?("hello")  # => true (definitely in set)
filter.include?("world")  # => false (probably not in set)

See Also:

Constant Summary collapse

DEFAULT_FALSE_POSITIVE_RATE =

Default false positive rate (1%)

0.01
DEFAULT_EXPECTED_SIZE =

Default expected number of elements

10_000

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(expected_size: DEFAULT_EXPECTED_SIZE, false_positive_rate: DEFAULT_FALSE_POSITIVE_RATE, case_sensitive: false) ⇒ BloomFilter

Create a new Bloom filter.

Parameters:

  • expected_size (Integer) (defaults to: DEFAULT_EXPECTED_SIZE)

    Expected number of elements (default: 10_000)

  • false_positive_rate (Float) (defaults to: DEFAULT_FALSE_POSITIVE_RATE)

    Desired false positive rate (default: 0.01)

  • case_sensitive (Boolean) (defaults to: false)

    Whether lookups are case-sensitive (default: false)



39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
# File 'lib/kotoshu/data_structures/bloom_filter.rb', line 39

def initialize(expected_size: DEFAULT_EXPECTED_SIZE,
               false_positive_rate: DEFAULT_FALSE_POSITIVE_RATE,
               case_sensitive: false)
  @case_sensitive = case_sensitive
  @item_count = 0

  # Calculate optimal size and hash count
  # m = -n * ln(p) / (ln(2)^2)
  # k = (m/n) * ln(2)
  @size = calculate_size(expected_size, false_positive_rate)
  @hash_count = calculate_hash_count(@size, expected_size)

  # Initialize bit array
  @bits = Array.new(@size, false)
end

Instance Attribute Details

#hash_countInteger (readonly)

Returns Number of hash functions.

Returns:

  • (Integer)

    Number of hash functions



29
30
31
# File 'lib/kotoshu/data_structures/bloom_filter.rb', line 29

def hash_count
  @hash_count
end

#item_countInteger (readonly)

Returns Number of items added.

Returns:

  • (Integer)

    Number of items added



32
33
34
# File 'lib/kotoshu/data_structures/bloom_filter.rb', line 32

def item_count
  @item_count
end

#sizeInteger (readonly)

Returns Size of the bit array.

Returns:

  • (Integer)

    Size of the bit array



26
27
28
# File 'lib/kotoshu/data_structures/bloom_filter.rb', line 26

def size
  @size
end

Instance Method Details

#add(item) ⇒ self

Add an element to the filter.

Parameters:

  • item (String)

    The item to add

Returns:

  • (self)

    Self for chaining



59
60
61
62
63
64
65
66
67
68
69
# File 'lib/kotoshu/data_structures/bloom_filter.rb', line 59

def add(item)
  normalized_item = normalize_item(item)

  @hash_count.times do |i|
    index = hash_index(normalized_item, i)
    @bits[index] = true
  end

  @item_count += 1
  self
end

#clearself

Clear all elements from the filter.

Returns:

  • (self)

    Self for chaining



110
111
112
113
114
# File 'lib/kotoshu/data_structures/bloom_filter.rb', line 110

def clear
  @bits = Array.new(@size, false)
  @item_count = 0
  self
end

#merge(other) ⇒ self

Merge another bloom filter into this one.

Parameters:

  • other (BloomFilter)

    Another bloom filter with same parameters

Returns:

  • (self)

    Self for chaining

Raises:

  • (ArgumentError)


95
96
97
98
99
100
101
102
103
104
105
# File 'lib/kotoshu/data_structures/bloom_filter.rb', line 95

def merge(other)
  raise ArgumentError, "Cannot merge filters with different sizes" unless other.size == @size
  raise ArgumentError, "Cannot merge filters with different hash counts" unless other.hash_count == @hash_count

  @size.times do |i|
    @bits[i] = @bits[i] || other.instance_variable_get(:@bits)[i]
  end

  @item_count += other.item_count
  self
end

#statsHash

Get filter statistics.

Returns:

  • (Hash)

    Statistics including :size, :hash_count, :item_count



119
120
121
122
123
124
125
# File 'lib/kotoshu/data_structures/bloom_filter.rb', line 119

def stats
  {
    size: @size,
    hash_count: @hash_count,
    item_count: @item_count
  }
end