rzstd

Ractor-safe Zstandard bindings for Ruby with persistent contexts.

rzstd provides a stateful FrameCodec that reuses ZSTD_CCtx / ZSTD_DCtx state across calls instead of allocating a fresh ~256 KB context every time. That reuse is what makes it viable for small-message workloads, where the upstream zstd-ruby gem loses to LZ4 purely on context-allocation overhead.

API

Two classes plus one utility module function — API shape mirrors rlz4 0.4.x:

Name                                 Purpose
RZstd::Dictionary                    Value type: dict bytes + 4-byte id
RZstd::FrameCodec                    Stateful frame-format codec, optional dict
RZstd.get_frame_content_size(bytes)  Header parse, no decode

Errors:

  • RZstd::DecompressError < StandardError: malformed frame, wrong dict, checksum mismatch.
  • RZstd::MissingContentSizeError < DecompressError: the caller passed max_output_size: but the frame header omits Frame_Content_Size.
  • RZstd::OutputSizeLimitError < DecompressError: the frame's declared Frame_Content_Size exceeds the caller's limit.

RZstd::Dictionary

Pure value type — just dict bytes plus a 4-byte id. Built on Data.define, so it's immutable, has value equality, and is shareable across Ractors.

# Raw-content dict: id synthesised from sha256(bytes) mapped into
# the public 32_768..(2**31 - 1) range.
d = RZstd::Dictionary.new(bytes: "schema=v1 type=message field1=")

# ZDICT-format dict (produced by `zstd --train` or Dictionary.train):
# id is read from the header, matching what zstd writes into every
# compressed frame via FLG.DictID.
d = RZstd::Dictionary.new(bytes: File.binread("schema.dict"))

d.bytes  # => frozen binary bytes
d.id     # => u32
d.size   # => dict size

# Override the id (e.g. from an out-of-band registrar):
d = RZstd::Dictionary.new(bytes: raw, id: 0xDEAD_BEEF)
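
The value-type properties named above (immutability, value equality, Ractor shareability) follow from Data.define. The sketch below shows them on a stand-in type; DictValue is a hypothetical name used only for illustration, not RZstd::Dictionary itself, and it assumes Ruby 3.2 or later.

```ruby
# Stand-in for illustration only; this is NOT RZstd::Dictionary.
DictValue = Data.define(:bytes, :id)

a = DictValue.new(bytes: "schema=v1".b.freeze, id: 40_000)
b = DictValue.new(bytes: "schema=v1".b.freeze, id: 40_000)

same      = (a == b)               # value equality: equal members compare equal
frozen    = a.frozen?              # Data instances are frozen on creation
shareable = Ractor.shareable?(a)   # holds because every member is frozen as well

# There are no setters; "changing" a field builds a new value instead.
c = a.with(id: 40_001)
```

Note that shareability depends on the members: a Data instance holding an unfrozen String would not be shareable, which is presumably why the README describes Dictionary#bytes as frozen.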

Training

# ZDICT_trainFromBuffer: 100 KiB of sample data in total and ≥ 10 samples
# are recommended. Returns a ZDICT-format Dictionary.
samples = 1000.times.map { generate_sample_message }
dict    = RZstd::Dictionary.train(samples, capacity: 64 * 1024)

dict.bytes[0, 4] # => "\x37\xA4\x30\xEC" (ZDICT magic)
dict.id          # => the id zstd put in the header; same as on the wire

Dictionary IDs — the long version

Dictionary#id follows the Zstandard spec's Dictionary_ID semantics:

  • ZDICT-format dicts (the output of Dictionary.train, or any bytes starting with the ZDICT magic 0xEC30A437 LE): the id is read straight out of header bytes [4..7]. This is the same id zstd writes into every compressed frame header via ZSTD_c_dictIDFlag (on by default), so Dictionary#id and the on-wire frame Dictionary_ID always agree. Receivers can therefore route incoming frames to the right dictionary purely by parsing the frame header — no side channel required.
  • Raw-content dicts (opaque bytes with no ZDICT header): the spec requires the on-wire frame Dictionary_ID to be 0, so rzstd synthesises a local id from sha256(bytes) mapped into the public range 32_768..(2**31 - 1) — avoiding both reserved ranges (0..32_767, reserved for a future registrar, and >= 2**31). This id is useful as an in-process handle; it is not on the wire, so peers that need to agree on raw-content dicts must share them out-of-band.
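
The two cases above can be sketched in plain Ruby. This is an illustration under stated assumptions, not rzstd's actual code: the method name, the constant names, and in particular the way the SHA-256 digest is folded into the range (first four digest bytes, modulo the range size) are all hypothetical. Only the ZDICT header layout (magic at bytes 0..3, little-endian id at bytes 4..7) and the range bounds come from the text.

```ruby
require "digest"

ZDICT_MAGIC = "\x37\xA4\x30\xEC".b  # 0xEC30A437 stored little-endian
USER_ID_MIN = 32_768                # first id above the low reserved range
USER_ID_MAX = 2**31 - 1             # last id below the high reserved range

# Hypothetical helper: derive a dictionary id from dictionary bytes.
def sketch_dictionary_id(bytes)
  bytes = bytes.b
  if bytes.bytesize >= 8 && bytes.byteslice(0, 4) == ZDICT_MAGIC
    # ZDICT format: the id is the little-endian u32 at header bytes 4..7.
    bytes.byteslice(4, 4).unpack1("V")
  else
    # Raw content: ASSUMED folding of the digest into the public range.
    span   = USER_ID_MAX - USER_ID_MIN + 1
    prefix = Digest::SHA256.digest(bytes).unpack1("N")
    USER_ID_MIN + (prefix % span)
  end
end

zdict_like = ZDICT_MAGIC + [123_456].pack("V") + "dictionary content"
zdict_id   = sketch_dictionary_id(zdict_like)
raw_id     = sketch_dictionary_id("schema=v1 type=message field1=")
```

Whatever the real mapping is, the property that matters is the one shown here: the synthesised id is deterministic for given bytes and always lands inside the range.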

Public constants RZstd::Dictionary::USER_DICT_ID_MIN / USER_DICT_ID_MAX / USER_DICT_ID_SIZE expose that range for callers that generate their own ids.

RZstd::FrameCodec

Stateful frame-format codec. Holds a CCtx and a DCtx across calls, avoiding the ~256 KB per-call allocation overhead that bites the upstream zstd-ruby gem on small messages.

# No-dict codec, default level (3).
codec = RZstd::FrameCodec.new

ct = codec.compress("the quick brown fox" * 10)
pt = codec.decompress(ct)

# Explicit level (negative = Zstd's fast strategy):
codec = RZstd::FrameCodec.new(level: -3)

Dict-bound

Pass a Dictionary (or raw bytes as a shortcut):

codec = RZstd::FrameCodec.new(dict: dict,    level: -3)
codec = RZstd::FrameCodec.new(dict: "bytes", level: -3)

codec.has_dict?  # => true
codec.id         # => u32 (the dict's id)
codec.level      # => -3
codec.size       # => dict size in bytes

Decoding with the wrong dictionary is caught by the content checksum the encoder enables: the call raises RZstd::DecompressError instead of returning corrupt bytes.

Bounded decompression

# max_output_size: enforces an upper bound on the declared
# Frame_Content_Size before allocating the output buffer or
# invoking the decoder.
codec.decompress(bytes, max_output_size: 1_048_576)

When max_output_size: is set, a frame whose header omits Frame_Content_Size raises MissingContentSizeError, and a declared size over the limit raises OutputSizeLimitError.
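
The order of checks described here can be sketched without the gem. The error classes below are local stand-ins that mirror the hierarchy listed under Errors, and check_declared_size! is a hypothetical helper; the real codec performs the equivalent check internally before it allocates anything.

```ruby
# Stand-in error classes so the sketch runs on its own.
class SketchDecompressError    < StandardError; end
class SketchMissingContentSize < SketchDecompressError; end
class SketchOutputSizeLimit    < SketchDecompressError; end

# declared_size:   Integer from the frame header, or nil if the header omits it.
# max_output_size: Integer limit, or nil when the caller requested no limit.
def check_declared_size!(declared_size, max_output_size)
  return declared_size if max_output_size.nil?

  if declared_size.nil?
    raise SketchMissingContentSize, "frame header omits Frame_Content_Size"
  end
  if declared_size > max_output_size
    raise SketchOutputSizeLimit,
          "declared #{declared_size} bytes exceeds limit of #{max_output_size}"
  end

  declared_size
end

within_limit = check_declared_size!(4_096, 1_048_576)
```

Because both limit errors descend from the general decompress error, a caller that does not care about the distinction can rescue the parent class alone.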

Frame header utility

RZstd.get_frame_content_size(bytes)  # => Integer, or nil if header omits FCS

Useful for a receiver that wants to inspect a frame's declared size before calling #decompress (e.g. for routing, accounting, or pre-sizing).
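
For readers curious what such a header parse involves, here is a pure-Ruby sketch based on the Zstandard frame format. It is not rzstd's implementation (which presumably delegates to libzstd), the method names are hypothetical, and it does not handle skippable frames. It also extracts the frame's Dictionary_ID, which is the field a receiver would use to route frames to the right dictionary.

```ruby
ZSTD_MAGIC = "\x28\xB5\x2F\xFD".b  # 0xFD2FB528 stored little-endian

# Reads `len` bytes at `pos` as a little-endian integer; nil if len is 0.
def read_le(str, pos, len)
  return nil if len.zero?
  str.byteslice(pos, len).bytes.each_with_index.sum { |b, i| b << (8 * i) }
end

# Hypothetical helper: returns { dict_id:, content_size: }, nil for omitted fields.
def parse_zstd_frame_header(frame)
  frame = frame.b
  raise ArgumentError, "not a Zstandard frame" unless frame.byteslice(0, 4) == ZSTD_MAGIC
  raise ArgumentError, "truncated frame header" if frame.bytesize < 5

  descriptor     = frame.getbyte(4)
  fcs_flag       = descriptor >> 6        # bits 7-6: Frame_Content_Size_flag
  single_segment = descriptor[5] == 1     # bit 5: Single_Segment_flag
  dict_id_flag   = descriptor & 0b11      # bits 1-0: Dictionary_ID_flag

  pos = 5
  pos += 1 unless single_segment          # skip Window_Descriptor when present

  dict_len = [0, 1, 2, 4][dict_id_flag]
  fcs_len  = fcs_flag.zero? ? (single_segment ? 1 : 0) : [nil, 2, 4, 8][fcs_flag]
  raise ArgumentError, "truncated frame header" if frame.bytesize < pos + dict_len + fcs_len

  dict_id      = read_le(frame, pos, dict_len)
  content_size = read_le(frame, pos + dict_len, fcs_len)
  content_size += 256 if fcs_len == 2     # the 2-byte form stores size - 256
  { dict_id: dict_id, content_size: content_size }
end

# Hand-built headers: single-segment frame declaring 5 bytes, then a frame
# with a 4-byte Dictionary_ID and a 2-byte Frame_Content_Size field.
small  = parse_zstd_frame_header(ZSTD_MAGIC + [0x20, 0x05].pack("C*"))
routed = parse_zstd_frame_header(
  ZSTD_MAGIC + [0x43, 0x00].pack("C*") + [65_536].pack("V") + [16].pack("v")
)
```

A nil content_size here corresponds to the nil return documented for RZstd.get_frame_content_size.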

Ractor safety

Module functions, Dictionary values, and FrameCodec instances are all shareable across Ractors. FrameCodec serializes compress / decompress calls on its internal Mutexes — for parallel throughput, allocate one FrameCodec per Ractor.

ractors = 4.times.map do |i|
  Ractor.new(i) do |idx|
    codec = RZstd::FrameCodec.new
    pt    = "ractor #{idx} payload " * 1000
    1000.times do
      ct = codec.compress(pt)
      raise "mismatch" unless codec.decompress(ct) == pt
    end
    :ok
  end
end
ractors.map(&:value) # => [:ok, :ok, :ok, :ok]

Non-goals

  • Streaming / chunked compression.
  • Preservation of string encoding on decompress (output is always binary).

License

MIT