gte

gte is a Ruby gem with a Rust extension for fast text embeddings with ONNX Runtime. Inspired by https://github.com/fbilhaut/gte-rs

Quick Start

require "gte"

model = GTE.config(ENV.fetch("GTE_MODEL_DIR"))

# String input => GTE::Tensor (1 row)
tensor = model.embed("query: hello world")
vector = tensor.row(0)

# [] with string => Array<Float> (single vector)
single = model["query: nearest coffee shop"]

# [] with array => GTE::Tensor (batch)
batch = model[["query: hello", "query: world"]]

Embedding Config (`GTE.config`)

GTE.config(model_dir) builds (and caches) a GTE::Model.

default_model = GTE.config(ENV.fetch("GTE_MODEL_DIR"))

raw_model = GTE.config(ENV.fetch("GTE_MODEL_DIR")) do |config|
  config.with(normalize: false)
end

custom = GTE.config(ENV.fetch("GTE_MODEL_DIR")) do |config|
  config.with(
    output_tensor: "last_hidden_state",
    max_length: 256,
    padding: "batch_longest",
    optimization_level: 3
  )
end

Config fields and defaults:

model_dir: absolute path to model directory
optimization_level: 3
model_name: nil
normalize: true (L2 normalization at Ruby-facing API)
output_tensor: nil (auto-select output tensor)
max_length: nil (uses tokenizer/model defaults)
padding: nil (auto; accepts auto, batch_longest, fixed)
execution_providers: nil (falls back to GTE_EXECUTION_PROVIDERS / CPU default)

Notes:

Return a Config::Text from the block (for example, config.with(...)).
Model instances are cached by full config key; different config values create different cached instances.
GTE.warmup(model, threads:) pre-warms thread-local ONNX sessions eagerly at boot. Useful in multi-threaded servers (Puma, Sidekiq) to avoid ~100-500ms cold-start latency.

Common model presets:

e5 = GTE.config(ENV.fetch("GTE_MODEL_DIR")) do |config|
  config.with(
    model_name: "model.onnx",
    output_tensor: "last_hidden_state",
    max_length: 512,
    execution_providers: "cpu"
  )
end

siglip2 = GTE.config(ENV.fetch("GTE_SIGLIP2_DIR")) do |config|
  config.with(
    model_name: "text_model.onnx",
    output_tensor: "pooler_output",
    max_length: 64,
    execution_providers: "cpu"
  )
end

clip = GTE.config(ENV.fetch("GTE_CLIP_DIR")) do |config|
  config.with(
    output_tensor: "sentence_embedding",
    max_length: 512,
    execution_providers: "cpu"
  )
end

Picking a specific layer:

Use output_tensor: to request a named model output.
last_hidden_state gives token-level hidden states and is mean-pooled by gte when the tensor is rank 3.
pooler_output, sentence_embedding, and similar 2D tensors are returned directly and then L2-normalized by default.
If the requested tensor is not present in the model, gte raises an error instead of silently falling back.

Low-level embedder setup (without model cache):

embedder = GTE::Embedder.config(ENV.fetch("GTE_MODEL_DIR")) do |config|
  config.with(execution_providers: "cpu")
end

Reranker

Use GTE::Reranker.config(model_dir) for cross-encoder reranking.

reranker = GTE::Reranker.config(ENV.fetch("GTE_RERANK_DIR")) do |config|
  config.with(sigmoid: true)
end

query = "how to train a neural network?"
candidates = [
  "Backpropagation and gradient descent are core techniques.",
  "This recipe uses flour and eggs."
]

# Raw scores aligned with input order
scores = reranker.score(query, candidates)
# => [0.93, 0.07]

# Ranked output sorted by score desc
ranked = reranker.rerank(query: query, candidates: candidates)
# => [
#      { index: 0, score: 0.93, text: "Backpropagation and gradient descent are core techniques." },
#      { index: 1, score: 0.07, text: "This recipe uses flour and eggs." }
#    ]

Reranker config fields and defaults:

model_dir: absolute path to model directory
optimization_level: 3
model_name: nil
sigmoid: false (set true if you want bounded [0,1] style scores)
output_tensor: nil
max_length: nil
padding: nil (auto; accepts auto, batch_longest, fixed)
execution_providers: nil

Session pool sizing:

GTE_SESSION_POOL_CAP: optional positive integer cap for internal ONNX session pool size.
Unset by default; runtime uses available CPU parallelism.

Automatic Tuning

gte automatically adapts to the hardware — no configuration required.

ONNX Intra-op Threads

Auto-detected via std::thread::available_parallelism() capped at 4.
Prevents oversubscription on high-concurrency workloads.
Override with GTE_INTRA_OP_NUM_THREADS env var.

ONNX Inter-op Threads

Defaults to 1 (text embedding graphs are linear chains with no independent parallel nodes).
Override with GTE_INTER_OP_NUM_THREADS env var.

Execution Providers

gte automatically tries XNNPACK for optimized CPU inference. Falls back to ORT's default CPU provider if unavailable.

ARM64 (Apple Silicon, AWS Graviton): XNNPACK is typically ~25% faster than plain CPU while producing identical embeddings (cos=1.0, max_abs=0.0).
x86/x64 (Intel, AMD): XNNPACK offers minimal benefit — ORT's default CPU provider already uses MKL-DNN/oneDNN, which are better tuned for these chips. The auto-detect silently falls back to the default provider.

Configure providers explicitly with GTE_EXECUTION_PROVIDERS (comma-separated):

export GTE_EXECUTION_PROVIDERS=xnnpack,coreml

Set cpu or none to skip auto-detect and use ORT's default CPU provider.

Session Pre-Warming

ONNX sessions are created lazily per OS thread. In multi-threaded servers (Puma, Sidekiq), each thread creates its own session on first use (~100-500ms cold start). Pre-warm sessions eagerly at boot:

model = GTE.config(ENV.fetch("GTE_MODEL_DIR"))

# Pre-warm thread-local sessions for a Puma server with 5 threads:
GTE.warmup(model, threads: 5)

Runtime + Result Examples

Process-local reuse (recommended for Puma/web servers):

EMBEDDER = GTE.config(ENV.fetch("GTE_MODEL_DIR"))

def embed_query(text)
  EMBEDDER[text] # Array<Float>
end

Model Directory

A model directory must include tokenizer.json and one ONNX model, resolved in this order:

onnx/text_model.onnx
text_model.onnx
onnx/model.onnx
model.onnx

Input policy is text-only. Graphs requiring unsupported multimodal inputs (such as pixel_values) are intentionally rejected.

Development

Run commands inside nix develop via Make targets:

make setup
make compile
make test
make lint
make ci

Benchmarks

Docker Rails+Puma+wrk (Real-World HTTP)

The bench/rails/ directory contains a full-stack benchmark: Rails 7.1 API app served by Puma, loaded with wrk (randomized text queries, 135 diverse texts).

Run for all models:

make bench-docker-compare

Run for a single model:

make bench-docker-sweep-siglip2
make bench-docker-validate  # cross-validation checks

Siglip2 (768-dim, pooler_output)

Concurrency	GTE p90	Pure Ruby p90	Ratio	GTE RPS	Pure Ruby RPS
c=1	~12ms	~120ms	9-10×	~95	~10
c=4	~39ms	~503ms	10-13×	~228	~10
c=8	~146ms	~613ms	3-4×	~224	~10
c=16	~430ms	~611ms	1-1.5×	~226	~11

E5 (384-dim, last_hidden_state + mean pool)

Concurrency	GTE p90	Pure Ruby p90	Ratio	GTE RPS	Pure Ruby RPS
c=1	~7ms	~120ms	16-17×	~160	~10
c=4	~12ms	~430ms	35-40×	~477	~10
c=8	~64ms	~530ms	8-9×	~503	~10
c=16	~205ms	~534ms	2-3×	~509	~11

GTE releases the GVL during ONNX inference, enabling true parallelism across Puma threads. Pure Ruby is GVL-bound (~10 RPS regardless of concurrency).

The Puma thread pool (min=2, max=5) limits throughput at c=16+. GTE's pipelining and GVL release already saturate the available threads at c=4.

In-Process Benchmarks

make bench
nix develop -c bundle exec rake bench:pure_compare
nix develop -c bundle exec rake bench:matrix_sweep
nix develop -c bundle exec ruby bench/memory_probe.rb --compare-pure

make bench: Puma-like single-request comparison at concurrency 16
rake bench:pure_compare: batch amortization comparison
rake bench:matrix_sweep: GTE provider sweep using the shared result schema
Optional Python comparisons use bench/python_onnxruntime.py and are skipped automatically if local dependencies are unavailable.

To run benchmark + append a RUNS.md entry + enforce goal checks:

make bench-record

bench/runs_ledger.rb check is goal-focused by default:

Enforces the goal metric (response_time_p95) across every enabled competitor.
Does not require current-version coverage in RUNS.md unless explicitly enabled.

gte