omq-lz4

Status: 0.1.0 — first landable release. See RFC.md for the wire-format spec and CHANGELOG.md for what's included.

LZ4-compressed TCP transport for OMQ, complementary to omq-zstd. Pick lz4+tcp:// instead of tcp:// or zstd+tcp:// when you want cheap per-message compression with a small per-connection footprint.

When to pick lz4+tcp:// over zstd+tcp://

LZ4 has no entropy stage (no Huffman, no FSE) and keeps only ~16 KiB of encoder state per connection; it trades a worse compression ratio for ~4–8× faster encodes and roughly an order of magnitude less memory per connection.

                                zstd+tcp://   lz4+tcp://
Encode, 1 KiB, no dict          ~3 µs         ~0.4 µs
Encode, 1 KiB, with dict        ~3.5 µs       ~0.5 µs
Memory per connection           ~256 KiB      ~16 KiB + dict
Ratio, 1 KiB JSON, no dict      ~45%          ~65%
Ratio, 1 KiB JSON, with dict    ~20%          ~35%
Auto-trained dictionaries       yes           — (user-supplied only)

Pick omq-lz4 for CPU- or memory-scarce deployments (edge gateways, IoT concentrators, high-fanout scenarios where per-connection state matters more than ratio). Pick omq-zstd for bandwidth-bound deployments where CPU is cheap.

Install

# Gemfile
gem "omq-lz4"

# or install directly:
gem install omq-lz4

Usage

require "omq"
require "omq/lz4"

pull = OMQ::PULL.new
push = OMQ::PUSH.new

uri = pull.bind("lz4+tcp://127.0.0.1:0")
push.connect(uri.to_s)

push << ["hello, compressed world"]
pull.receive  # => ["hello, compressed world"]

Both peers must use lz4+tcp://. A tcp:// peer cannot talk to an lz4+tcp:// peer — they speak different transports.

Dictionary compression

Small messages don't compress well on their own. A shared dictionary gives 2–5× better ratios on payloads with a common prefix. Supply a user-trained dictionary (LZ4 has no auto-training — use omq-zstd for that):

dict = File.binread("schema.dict")
push.connect("lz4+tcp://127.0.0.1:5555", dict: dict)

The sender ships the dictionary to the receiver in-band, prefixed with the dictionary sentinel (4C 5A 34 44, "LZ4D" in ASCII), on the first outgoing message. The receiver installs the dictionary and decompresses subsequent messages against it. Dictionary size is capped at 8 KiB — tighter than omq-zstd's 64 KiB cap, to let constrained peers accept shipments without allocating tens of KB of scratch.
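The shipment framing above can be sketched in a few lines of plain Ruby. This is an illustrative model of the wire format as described, not the gem's internals; the function names are hypothetical:

```ruby
DICT_SENTINEL = "LZ4D".b          # 4C 5A 34 44, the dictionary sentinel
MAX_DICT_SIZE = 8 * 1024          # 8 KiB cap on shipped dictionaries

# Prefix a dictionary with the sentinel for the first outgoing message.
def frame_dictionary(dict)
  raise ArgumentError, "dictionary exceeds 8 KiB cap" if dict.bytesize > MAX_DICT_SIZE
  DICT_SENTINEL + dict
end

# Receiver side: classify the first frame as a dictionary shipment or a
# regular message by checking the 4-byte prefix.
def parse_first_frame(frame)
  if frame.byteslice(0, 4) == DICT_SENTINEL
    dict = frame.byteslice(4, frame.bytesize - 4)
    raise "over-cap dictionary shipment" if dict.bytesize > MAX_DICT_SIZE
    [:dictionary, dict]
  else
    [:message, frame]
  end
end

parse_first_frame(frame_dictionary("schema bytes".b))  # => [:dictionary, "schema bytes"]
```

A real decoder would also have to handle a payload that happens to begin with the sentinel bytes; the sketch ignores that escaping concern.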

Compression thresholds

To avoid pessimizing tiny frames, the sender skips compression below:

Mode              Threshold
No dictionary     512 B
With dictionary   32 B

Below the threshold the part is sent uncompressed (4-byte zero sentinel + plaintext).
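The sender-side decision can be sketched as follows. This is an illustrative model, not the gem's API; `encode_part` and the stand-in compressor are hypothetical names:

```ruby
RAW_SENTINEL = "\x00\x00\x00\x00".b          # 4-byte zero sentinel: part is plaintext
THRESHOLD    = { no_dict: 512, with_dict: 32 }

# Compress a part only when it is large enough to be worth it; otherwise
# send it uncompressed behind the zero sentinel.
def encode_part(payload, dict: nil, compressor:)
  limit = dict ? THRESHOLD[:with_dict] : THRESHOLD[:no_dict]
  if payload.bytesize < limit
    RAW_SENTINEL + payload       # below threshold: raw plaintext
  else
    compressor.call(payload, dict)
  end
end

# A placeholder "compressor" (not real LZ4) makes the branch visible:
passthrough = ->(p, _dict) { "C".b + p }
encode_part("hi".b, compressor: passthrough)  # => "\x00\x00\x00\x00hi" (sent raw)
```

With a dictionary installed the threshold drops to 32 B, since dictionary hits make even tiny parts compress usefully.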

Security limits

The receiver bounds decompression by the socket's max_message_size (the same knob you'd use on a plain tcp:// socket): it caps the total decompressed size of all parts in a single message. If a peer sends an over-budget message, the receiver drops the connection — OMQ::SocketDeadError surfaces on the next receive.

Independent of that, the dictionary itself is capped at 8 KiB; a larger shipment drops the connection.
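The budget check described above can be modeled in a few lines. A minimal sketch, assuming a per-message accounting object (the class name and method are illustrative, not the gem's code):

```ruby
# Tracks the total decompressed bytes of one message against the socket's
# max_message_size.
class DecompressBudget
  def initialize(max_message_size)
    @remaining = max_message_size
  end

  # Charge one part's decompressed size. A false return means the peer is
  # over budget and the connection should be dropped.
  def charge(decompressed_size)
    @remaining -= decompressed_size
    @remaining >= 0
  end
end

budget = DecompressBudget.new(1024)
budget.charge(600)   # => true  (within budget)
budget.charge(600)   # => false (over budget: drop the connection)
```

Charging *decompressed* sizes is the important part — a tiny compressed frame can expand enormously, so bounding only the wire bytes would leave the receiver open to decompression bombs.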

See the plan roadmap (../OMQ-LZ4.plan) for history and open questions.

Performance

Measured on x86_64 scalar, Ruby 4.0 + YJIT, on dict-friendly (repeated Lorem ipsum prefix) input.

OMQ::LZ4::Codec (pure encode/decode, no I/O):

Input size   No dict encode   Dict encode   No dict decode   Dict decode
64 B         ~0.9 µs          ~1.0 µs       ~0.4 µs          ~0.6 µs
256 B        ~1.1 µs          ~0.8 µs       ~0.4 µs          ~0.5 µs
1 KiB        ~1.5 µs          ~0.9 µs       ~0.9 µs          ~1.0 µs
16 KiB       ~3.2 µs          ~2.4 µs       ~3.9 µs          ~3.0 µs
1 MiB        ~89 µs           ~87 µs        ~173 µs          ~303 µs

End-to-end PUSH → PULL over lz4+tcp:// (loopback):

Message size   Throughput
64 B           ~67k msg/s
256 B          ~94k msg/s
1 KiB          ~92k msg/s

Run the benchmarks yourself:

OMQ_DEV=1 bundle exec ruby --yjit bench/codec_micro.rb
OMQ_DEV=1 bundle exec ruby --yjit bench/transport_throughput.rb
OMQ_DEV=1 bundle exec ruby --yjit bench/head_to_head.rb   # lz4 vs zstd

Head-to-head vs omq-zstd and plain tcp

End-to-end PUSH → PULL throughput, Ruby 4.0 + YJIT. Input: UUID-sprinkled Lorem ipsum — a fresh UUID between each Lorem paragraph. This approximates realistic workloads where a schema repeats but values vary (event logs, protobuf records, JSON events), so a fraction of every message is inherently incompressible.
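An input like that is easy to generate. A sketch of one way to build it (illustrative only; the gem's bench scripts may construct theirs differently):

```ruby
require "securerandom"

LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 8

# Repeat a Lorem paragraph with a fresh (incompressible) UUID after each
# one, then trim to the target payload size.
def uuid_sprinkled_payload(target_bytes)
  out = +""
  out << LOREM << SecureRandom.uuid << "\n" while out.bytesize < target_bytes
  out.byteslice(0, target_bytes)
end

payload = uuid_sprinkled_payload(128 * 1024)  # 128 KiB, as in the table below
payload.bytesize                              # => 131072
```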

The link between PUSH and PULL is loopback, rate-shaped with tc netem rate Xmbit on dev lo to simulate bandwidth-limited networks. zstd+tcp shown at level -3 (default, fast) and level 3 (tighter ratio, more CPU).

The table below reports plaintext MiB/s (application-level throughput) and wire MiB/s (bytes on the socket) at a 128 KiB payload, across three bandwidth regimes.

Link                  Metric    tcp     lz4+tcp   zstd -3   zstd 3
100 Mbit              plain     11.8    105       114       197
(cap ≈ 12 MiB/s)      wire      11.8    12        12        12
                      speedup   1.00×   8.89×     9.70×     16.74×
1 Gbit                plain     117     794       900       603
(cap ≈ 125 MiB/s)     wire      117     93        94        36
                      speedup   1.00×   6.81×     7.73×     5.17×
Unlimited loopback    plain     1 064   869       972       626
(kernel-copy-bound)   wire      1 064   99        101       37
                      speedup   1.00×   0.82×     0.91×     0.59×

Three regimes are visible:

  • 100 Mbit — all compressed transports saturate the wire at ~12 MiB/s, so plaintext throughput = wire cap × (1 / compression ratio). The tighter the ratio, the bigger the win: zstd 3's ~6% wire ratio translates to a ~17× throughput multiplier over plain tcp.
  • 1 Gbit — compressed transports shift from wire-saturated to CPU-limited. zstd -3 reaches ~75% of wire cap; zstd 3 only 29% (deep CPU-bound). Both beat plain tcp (which is pinned at the wire cap) by 6–8×. zstd 3's tighter wire no longer helps — there's no wire saturation to trade CPU for.
  • Unlimited loopback — no wire cap. All three are CPU-limited. Plain tcp doesn't pay compression CPU, so skip compression on loopback.
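The wire-saturated arithmetic in the first bullet can be checked directly against the 100 Mbit row of the table:

```ruby
# In the wire-saturated regime: plaintext throughput ≈ wire cap ÷ wire ratio,
# and the speedup over plain tcp follows directly. Numbers from the table.
wire_cap  = 12.0    # MiB/s on a ~100 Mbit link
plain_tcp = 11.8    # MiB/s, uncompressed baseline

{ "lz4+tcp" => 105.0, "zstd -3" => 114.0, "zstd 3" => 197.0 }.each do |name, plaintext|
  wire_ratio = wire_cap / plaintext   # compressed bytes as a fraction of plaintext
  speedup    = plaintext / plain_tcp
  printf("%-8s wire ratio %5.1f%%  speedup %5.2fx\n", name, wire_ratio * 100, speedup)
end
```

Running it reproduces the table's speedup column (8.9×, 9.7×, 16.7×) and shows zstd 3 putting about 6% of the plaintext bytes on the wire.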

Rate-shape your own link to reproduce:

sudo tc qdisc add dev lo root netem rate 100mbit  # or 1gbit, 10mbit, etc.
OMQ_DEV=1 bundle exec ruby --yjit bench/head_to_head.rb
sudo tc qdisc del dev lo root

Or use a veth pair in a network namespace so shaping doesn't touch your host's real loopback (see tc-netem(8), ip-netns(8)).

Full sweeps (8 sizes from 256 B to 512 KiB) for each regime live in bench/head_to_head.rb output — run it yourself; the headline numbers above are stable across repeats but small sizes and very large sizes vary a bit run-to-run.

Takeaway:

  • Pick lz4+tcp:// for bandwidth-limited links (any real network — even 1 Gbit LAN). 6–9× throughput multiplier over plain tcp, minimal memory (~16 KiB/connection), modest CPU. Stays within ~15% of zstd -3's throughput at 1 Gbit while using far less memory; loses the ratio race to zstd 3 at 100 Mbit and below.
  • Pick zstd+tcp:// (level ≥ 3) when the wire is the precious resource (≤ 100 Mbit links, WAN, or you're paying for egress). ~17× throughput multiplier at 100 Mbit for 128 KiB messages is hard to argue with.
  • Pick plain tcp:// when the link is not the bottleneck (localhost IPC, loopback, datacenter-fast inter-host connections where the bandwidth ceiling is above the CPU's compress/decompress speed — typically 10+ Gbit), or when the payload is already high-entropy (encrypted, already compressed, random binary) and compression only adds overhead.

Development

OMQ_DEV=1 bundle install
OMQ_DEV=1 bundle exec rake test

License

ISC