ruby-fst

CI Gem Version License Ruby

Ruby bindings for the fst crate by Andrew Gallant. Provides finite state transducer backed ordered maps and sets with fast lookup, prefix and range scans, floor/ceiling lookups, and Levenshtein fuzzy search.

Requirements

  • Ruby >= 3.2
  • For source installs: a Rust toolchain. Precompiled gems are published for Linux (x86_64, aarch64, glibc and musl), macOS (x86_64, arm64), and Windows (x64) — these install with no Rust toolchain required.

Installation

gem install ruby-fst

Or add to your Gemfile:

gem 'ruby-fst'

Keys are byte strings

All FST keys are arbitrary byte strings (Encoding::BINARY). When a key is returned from the gem (e.g. via each, get_le, range) it always carries the Encoding::BINARY encoding — be careful when interpolating into UTF-8 strings or puts-ing keys that contain non-ASCII bytes.

Keys must be inserted into a builder in strictly ascending lexicographic byte order, with no duplicates. Out-of-order or duplicate inserts raise.

Map

Ordered map from byte string keys to unsigned 64-bit integer values.

require 'ruby_fst'

builder = RubyFst::MapBuilder.new
builder.insert('bar', 2)
builder.insert('baz', 3)
builder.insert('foo', 1)
map = RubyFst::Map.new(builder.finish)

map['foo']             # => 1
map.get('missing')     # => nil
map.contains?('bar')   # => true
map.length             # => 3

map.each { |key, value| puts("#{key}: #{value}") }

Set

Ordered set of byte string keys.

builder = RubyFst::SetBuilder.new
builder.insert('bar')
builder.insert('baz')
builder.insert('foo')
set = RubyFst::Set.new(builder.finish)

set.contains?('foo')   # => true
set.length             # => 3

set.each { |key| puts(key) }

Range queries

Map#range and Set#range stream every entry whose key falls in [ge, le]. Either bound may be omitted.

map.range(ge: 'b', le: 'f') { |key, value| puts("#{key} -> #{value}") }
map.range(ge: 'b').to_a                      # Enumerator without a block
set.range(le: 'baz') { |key| puts(key) }

Map#starts_with and Set#starts_with stream every entry whose key begins with the given prefix.

map.starts_with('foo') { |key, value| puts("#{key} -> #{value}") }
set.starts_with('app').to_a

Floor and ceiling lookups

get_le returns the greatest key less than or equal to the query. get_ge returns the smallest key greater than or equal to the query. Both return [key, value] or nil.

builder = RubyFst::MapBuilder.new
builder.insert('bar', 1)
builder.insert('foo', 2)
builder.insert('qux', 3)
map = RubyFst::Map.new(builder.finish)

map.get_le('dog')   # => ["bar", 1]
map.get_ge('dog')   # => ["foo", 2]
map.get_le('aaa')   # => nil

For range lookups where you only need the value, get_le_value and get_ge_value skip key reconstruction and return only the Integer (or nil):

map.get_le_value('dog')   # => 1
map.get_ge_value('dog')   # => 2
map.get_le_value('aaa')   # => nil

This is useful for IP range lookups. Encode range starts as 4-byte big-endian keys and use get_le_value to find which range an IP falls into:

require 'ipaddr'

builder = RubyFst::MapBuilder.new
builder.insert([167_772_160].pack('N'), 1)   # 10.0.0.0
builder.insert([3_232_235_520].pack('N'), 2) # 192.168.0.0
map = RubyFst::Map.new(builder.finish)

ip = IPAddr.new('10.0.0.100').to_i
label_id = map.get_le_value([ip].pack('N'))

Find all keys within a given edit distance. The search runs as an automaton intersection with the FST, visiting only reachable states.

The query must be valid UTF-8 (the Levenshtein automaton is defined over Unicode codepoints), but the keys themselves can be any bytes — the automaton matches against the UTF-8 interpretation of the keys.

builder = RubyFst::MapBuilder.new
%w(bar baz cat foo fun).each_with_index { |w, i| builder.insert(w, i) }
map = RubyFst::Map.new(builder.finish)

map.search_levenshtein('far', 1) { |key, value| puts(key) }
# => bar

Works on sets too:

set.search_levenshtein('university', 2) { |key| puts(key) }

Serialization

# Bytes
bytes = map.to_bytes              # binary string
map = RubyFst::Map.new(bytes)

# File: read entire file into memory (good for small/medium FSTs)
map.save('/path/to/file.fst')
map = RubyFst::Map.from_path('/path/to/file.fst')

# File: memory-map the file (good for large FSTs; lookups page in only what they touch)
map = RubyFst::Map.from_path_mmap('/path/to/file.fst')

When using from_path_mmap, the file must remain unchanged on disk for the lifetime of the resulting Map or Set — modifying or truncating it causes undefined behavior.

Development

bundle install
bundle exec rake          # compile + test + rubocop

Individual tasks: rake compile, rake test, rake rubocop. Bump the version (and CHANGELOG) with rake bump[patch] / rake bump[minor] / rake bump[major]. Tagging vX.Y.Z on GitHub triggers cross-compile and publish to RubyGems.

License

MIT. See LICENSE.txt for details.

This gem wraps the fst crate by Andrew Gallant, also MIT licensed.