GRX-Tensor
Ruby speaks. C computes.
A tensor framework for Ruby with automatic differentiation, a C+SIMD compute core, and neural network primitives — all behind a clean, expressive Ruby API.
What is GRX?
GRX is a tensor computation library for Ruby. The numeric core is written in C and compiled with AVX2 + FMA SIMD instructions, operating on 4 doubles per 256-bit vector register with fused multiply-add. Ruby handles the high-level API: shape validation, computation graph construction, and orchestration. C handles everything else.
Key features
| Feature | Details |
|---|---|
| C+SIMD kernel | AVX2+FMA, SSE2 fallback, scalar fallback — auto-detected at compile time |
| Autograd | Automatic differentiation via a topological computation graph |
| Optimizers | SGD (momentum, weight decay) and Adam (inner loop in C with FMA) |
| NN layers | Linear, Sequential, Dropout, BatchNorm1d |
| Activations | ReLU, Leaky ReLU, Tanh, Sigmoid, Softmax |
| Loss functions | MSE, MAE, BCE, CrossEntropy, Huber |
| Weight init | Xavier uniform, He normal (Box-Muller in C) |
| Cross-platform | .so on Linux, .dylib on macOS, .dll on Windows |
| Pure Ruby fallback | Works without compilation — slower but always correct |
Installation
gem install grx-tensor
# Gemfile
gem "grx-tensor"
The C extension compiles automatically on gem install. No extra steps needed.
Quick start
require "grx"
a = GRX.tensor([1.0, 2.0, 3.0], [3], requires_grad: true)
b = GRX.tensor([4.0, 5.0, 6.0], [3], requires_grad: true)
c = a + b # [5.0, 7.0, 9.0] — computed in C with AVX2
c.backward # propagates gradients through the graph
a.grad.to_a # [1.0, 1.0, 1.0]
b.grad.to_a # [1.0, 1.0, 1.0]
Tensors
# From array + shape
t = GRX.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2, 3])
t.shape # [2, 3]
t.numel # 6
t.to_a # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
t.item # only for single-element tensors → Float
# Factories
GRX.zeros([3]) # [0.0, 0.0, 0.0]
GRX.ones([2, 2]) # [1.0, 1.0, 1.0, 1.0]
GRX.rand([4]) # uniform [0, 1)
GRX.randn([4]) # normal N(0, 1)
GRX::Tensor.zeros_like(t) # same shape, all zeros
GRX::Tensor.ones_like(t) # same shape, all ones
Arithmetic
All operations run in C. Scalar operands are supported on both sides.
a = GRX.tensor([1.0, 2.0, 3.0, 4.0], [4])
b = GRX.tensor([4.0, 3.0, 2.0, 1.0], [4])
(a + b).to_a # [5.0, 5.0, 5.0, 5.0]
(a - b).to_a # [-3.0, -1.0, 1.0, 3.0]
(a * b).to_a # [4.0, 6.0, 6.0, 4.0]
(a / b).to_a # [0.25, 0.666, 1.5, 4.0]
(-a).to_a # [-1.0, -2.0, -3.0, -4.0]
# Tensor OP scalar
(a + 10.0).to_a # [11.0, 12.0, 13.0, 14.0]
(a * 3.0).to_a # [3.0, 6.0, 9.0, 12.0]
(a / 2.0).to_a # [0.5, 1.0, 1.5, 2.0]
(a - 1.0).to_a # [0.0, 1.0, 2.0, 3.0]
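The examples above put the scalar on the right, but the note at the top of this section says scalar operands work on either side, so the reversed forms should be equivalent. A quick sketch under that assumption:
# Scalar OP tensor (sketch; relies on the both-sides scalar support noted above)
(10.0 + a).to_a # [11.0, 12.0, 13.0, 14.0]
(3.0 * a).to_a # [3.0, 6.0, 9.0, 12.0]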
Math operations
x = GRX.tensor([1.0, 4.0, 9.0, 16.0], [4])
x.sqrt.to_a # [1.0, 2.0, 3.0, 4.0]
x.square.to_a # [1.0, 16.0, 81.0, 256.0]
x.abs.to_a # absolute value element-wise
x.log.to_a # natural logarithm
x.exp.to_a # e^x
x.pow(3).to_a # [1.0, 64.0, 729.0, 4096.0]
x.clip(2.0, 10.0).to_a # [2.0, 4.0, 9.0, 10.0]
# Reductions → Float
x.sum # 30.0
x.mean # 7.5
x.max # 16.0
x.min # 1.0
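Because the reductions return plain Floats, they compose directly with Ruby's Math module. For example, the Euclidean norm of x, built only from calls shown above:
# L2 norm: square element-wise, sum to a Float, then Math.sqrt
Math.sqrt(x.square.sum) # sqrt(1 + 16 + 81 + 256) = sqrt(354) ≈ 18.81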
Linear algebra
u = GRX.tensor([1.0, 2.0, 3.0], [3])
v = GRX.tensor([4.0, 5.0, 6.0], [3])
u.dot(v) # 32.0 → 1×4 + 2×5 + 3×6
# Matrix multiplication — tiled for cache efficiency
a = GRX.tensor([1.0, 2.0, 3.0, 4.0], [2, 2])
b = GRX.tensor([5.0, 6.0, 7.0, 8.0], [2, 2])
a.matmul(b).to_a # [19.0, 22.0, 43.0, 50.0]
# Non-square: [2×3] × [3×2] → [2×2]
a3 = GRX.tensor([1.0,2.0,3.0, 4.0,5.0,6.0], [2, 3])
b3 = GRX.tensor([7.0,8.0, 9.0,10.0, 11.0,12.0], [3, 2])
a3.matmul(b3).to_a # [58.0, 64.0, 139.0, 154.0]
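Each output entry is a row-by-column dot product, e.g. 58 = 1×7 + 2×9 + 3×11 (row 0 of a3 against column 0 of b3) and 139 = 4×7 + 5×9 + 6×11.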
Zero-copy geometry
reshape and transpose return views over the same memory — no data is copied.
m = GRX.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2, 3])
m.get(1, 2) # 6.0
m.reshape([3, 2]) # new view, same data
m.flatten # shape [6], same data
m.transpose # shape [3, 2], same data
# Transpose is a true view
sq = GRX.tensor([1.0, 2.0, 3.0, 4.0], [2, 2])
tr = sq.transpose
tr.get(0, 1) # 3.0 (was sq[1, 0])
tr.get(1, 0) # 2.0 (was sq[0, 1])
tr.to_a # [1.0, 3.0, 2.0, 4.0]
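reshape obeys the same rule. Assuming the row-major layout that get already implies, the element at m.get(1, 2) stays reachable through any view of the same buffer; a sketch:
# Same buffer seen through a different shape (row-major, no copy)
r = m.reshape([3, 2])
r.get(2, 1) # 6.0, the same element as m.get(1, 2)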
Activations
x = GRX.tensor([-3.0, -1.0, 0.0, 1.0, 3.0], [5])
x.relu.to_a # [0.0, 0.0, 0.0, 1.0, 3.0]
x.leaky_relu(0.1).to_a # [-0.3, -0.1, 0.0, 1.0, 3.0]
x.sigmoid.to_a # [0.047, 0.268, 0.5, 0.731, 0.952]
x.tanh.to_a # [-0.995, -0.761, 0.0, 0.761, 0.995]
GRX.tensor([1.0, 2.0, 3.0, 4.0], [4]).softmax.to_a
# [0.032, 0.087, 0.236, 0.643] — always sums to 1.0
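The sums-to-1 property can be checked directly with the sum reduction:
probs = GRX.tensor([1.0, 2.0, 3.0, 4.0], [4]).softmax
probs.sum # 1.0 (up to floating-point rounding)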
Autograd
Every operation builds a computation graph automatically. Call .backward to propagate gradients back through the graph.
# --- Simple gradient ---
a = GRX.tensor([2.0, 3.0], [2], requires_grad: true)
b = GRX.tensor([4.0, 5.0], [2], requires_grad: true)
c = a + b
c.backward
a.grad.to_a # [1.0, 1.0] — d(a+b)/da = 1
b.grad.to_a # [1.0, 1.0] — d(a+b)/db = 1
# --- Chained operations ---
x = GRX.tensor([1.0, 2.0], [2], requires_grad: true)
y = GRX.tensor([3.0, 4.0], [2], requires_grad: true)
z = (x + y) * y # z = xy + y²
z.backward
x.grad.to_a # [3.0, 4.0] — dz/dx = y
y.grad.to_a # [7.0, 10.0] — dz/dy = x + 2y
# Reset gradients before next step
x.zero_grad!
y.zero_grad!
Operations with autograd support:
+ - * / negate scale square sqrt log exp pow
relu leaky_relu tanh sigmoid matmul transpose
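The unary ops in the list follow the same mechanics. A small sketch with square, using the standard derivative d(x²)/dx = 2x:
x = GRX.tensor([2.0, 5.0], [2], requires_grad: true)
y = x.square
y.backward
x.grad.to_a # [4.0, 10.0], i.e. 2x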
Neural networks
# Build a network with Sequential
net = GRX::NN::Sequential.new(
GRX::NN::Linear.new(4, 64),
GRX::NN::ReLU.new,
GRX::NN::Linear.new(64, 32),
GRX::NN::Tanh.new,
GRX::NN::Linear.new(32, 1),
GRX::NN::Sigmoid.new
)
puts net
# Sequential(
# (0): Linear(4 → 64, bias: true)
# (1): ReLU()
# (2): Linear(64 → 32, bias: true)
# (3): Tanh()
# (4): Linear(32 → 1, bias: true)
# (5): Sigmoid()
# )
# Forward pass — batch of 8 samples, 4 features each
x = GRX.randn([8, 4])
pred = net.call(x) # shape [8, 1]
# Access all trainable parameters
params = net.parameters # Array of Tensors with requires_grad: true
params.size # 6 (3 weights + 3 biases)
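The parameter count is easy to sanity-check: the three Linear layers hold (4×64 + 64) + (64×32 + 32) + (32×1 + 1) = 2433 learnable scalars.
params.sum { |p| p.numel } # 2433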
Training loop
require "grx"
# --- Dataset: learn y = 2x + 1 ---
train_x = GRX.tensor((1..8).map(&:to_f), [8, 1])
train_y = GRX.tensor((1..8).map { |x| 2.0 * x + 1.0 }, [8, 1])
# --- Network ---
net = GRX::NN::Sequential.new(
GRX::NN::Linear.new(1, 8),
GRX::NN::Tanh.new,
GRX::NN::Linear.new(8, 1)
)
opt = GRX::Optim::Adam.new(net.parameters, lr: 0.05)
loss_fn = GRX::Loss::MSELoss.new
300.times do |epoch|
opt.zero_grad
pred = net.call(train_x)
loss_val = loss_fn.call(pred, train_y)
# The MSE gradient d(loss)/d(pred) = 2 · (pred − target) / N is computed
# by hand and injected into the graph at pred
grad = pred.to_a.zip(train_y.to_a).map { |p, t| 2.0 * (p - t) / pred.numel }
pred.agregar_gradiente(GRX.tensor(grad, pred.shape))
pred.backward
opt.step
puts "epoch #{epoch + 1} loss: #{loss_val.round(6)}" if (epoch + 1) % 100 == 0
end
# epoch 100 loss: 0.312...
# epoch 200 loss: 0.041...
# epoch 300 loss: 0.005...
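With the loss down around 0.005, predictions on the training inputs should track y = 2x + 1 closely. A quick check (exact values vary from run to run):
net.call(GRX.tensor([4.0], [1, 1])).to_a # ≈ [9.0], i.e. 2×4 + 1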
Layers
| Class | Description |
|---|---|
| `GRX::NN::Linear` | Dense layer — y = x @ Wᵀ + b, Xavier uniform init |
| `GRX::NN::Sequential` | Ordered chain of layers |
| `GRX::NN::ReLU` | Rectified Linear Unit |
| `GRX::NN::LeakyReLU` | Leaky ReLU with configurable alpha (default 0.01) |
| `GRX::NN::Tanh` | Hyperbolic tangent |
| `GRX::NN::Sigmoid` | Logistic sigmoid |
| `GRX::NN::Softmax` | Normalized exponential |
| `GRX::NN::Dropout` | Inverted dropout — train! / eval! modes |
| `GRX::NN::BatchNorm1d` | Batch normalization with running statistics |
Loss functions
| Class | Formula | Use case |
|---|---|---|
| `GRX::Loss::MSELoss` | mean((pred − target)²) | Regression |
| `GRX::Loss::MAELoss` | mean(\|pred − target\|) | Regression |
| `GRX::Loss::BCELoss` | -mean(t·log(p) + (1−t)·log(1−p)) | Binary classification |
| `GRX::Loss::CrossEntropyLoss` | Softmax + NLL | Multi-class classification |
| `GRX::Loss::HuberLoss` | Smooth L1 (configurable delta) | Regression with outliers |
Optimizers
# SGD with momentum and weight decay
opt = GRX::Optim::SGD.new(net.parameters,
lr: 0.01,
momentum: 0.9,
weight_decay: 1e-4
)
# Adam — the standard choice for deep networks
opt = GRX::Optim::Adam.new(net.parameters,
lr: 0.001,
beta1: 0.9,
beta2: 0.999,
epsilon: 1e-8,
weight_decay: 0.0
)
# Training step
opt.zero_grad # clear gradients
# ... forward + backward ...
opt.step # update parameters
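Per parameter theta with gradient g, the Adam step is the standard update (presumably what the C inner loop implements; with weight_decay > 0, weight_decay·theta is conventionally added to g first):
# t-th step, element-wise (Ruby-style pseudocode, standard Adam with bias correction)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
theta = theta - lr * m_hat / (Math.sqrt(v_hat) + epsilon)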
Weight initialization
# Xavier uniform — recommended for tanh / sigmoid layers
GRX::Tensor.xavier_uniform([64, 32], requires_grad: true)
# He normal — recommended for ReLU layers
GRX::Tensor.he_normal([64, 32], requires_grad: true)
# Manual
GRX::Tensor.zeros([64], requires_grad: true)
GRX::Tensor.ones([64], requires_grad: true)
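Both initializers follow the usual formulas: Xavier uniform draws from U(−limit, +limit) with limit = √(6 / (fan_in + fan_out)), He normal from N(0, √(2 / fan_in)). A rough sanity check of the Xavier bound, assuming the two dimensions are treated as fan_in and fan_out:
w = GRX::Tensor.xavier_uniform([64, 32])
limit = Math.sqrt(6.0 / (64 + 32)) # 0.25
w.max <= limit && w.min >= -limit # true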
Dropout & BatchNorm
# Dropout — different behavior in train vs eval
drop = GRX::NN::Dropout.new(0.5)
drop.train! # activates dropout
drop.eval! # passes input through unchanged
# BatchNorm1d — normalizes across the batch dimension
bn = GRX::NN::BatchNorm1d.new(16)
bn.train!
bn.eval!
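In eval mode Dropout is a pass-through; in train mode roughly p of the elements are zeroed and the survivors are scaled by 1/(1 − p), which is what "inverted dropout" in the layer table means. A sketch, assuming Dropout exposes call directly like Sequential does (which elements drop is random):
x = GRX.ones([8])
drop = GRX::NN::Dropout.new(0.5)
drop.eval!
drop.call(x).to_a # [1.0, 1.0, ...] unchanged
drop.train!
drop.call(x).to_a # about half zeros, the rest 2.0 (1 / (1 - 0.5))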
Architecture
grx-tensor/
├── ext/
│ ├── grx/
│ │ ├── grx_core.c # C kernel
│ │ │ # AVX2+FMA element-wise ops (unroll ×2)
│ │ │ # Cache-tiled matmul (TILE=8, 64-byte cache lines)
│ │ │ # Adam optimizer inner loop with FMA
│ │ │ # Xavier uniform + He normal (Box-Muller in C)
│ │ │ # 32-byte aligned memory (posix_memalign / _aligned_malloc)
│ │ ├── grx_core.h # Public C API with GRX_API export macro
│ │ └── extconf.rb # mkmf config — auto-detects AVX2, SSE2, scalar
│ ├── unix/
│ │ └── Makefile # Manual build → lib/grx/libgrx_core.so / .dylib
│ └── windows/
│ └── Makefile.mingw # Manual build → lib/grx/grx_core.dll
│
├── lib/
│ ├── grx.rb # require "grx" ← entry point
│ └── grx/
│ ├── c_api.rb # Fiddle bridge — finds and loads the binary
│ │ # Searches: lib/grx/, lib/, ext/grx/ (all install methods)
│ ├── storage.rb # Native memory buffer (Fiddle::Pointer, 32-byte aligned)
│ ├── tensor.rb # Tensor: zero-copy views + autograd node
│ ├── nn.rb # NN layers
│ ├── optim.rb # Optimizers
│ ├── loss.rb # Loss functions
│ └── errors.rb # ShapeError, DimensionError, StorageError
│
└── test/
├── test_full.rb # 104-test integration suite
├── test_tensor.rb
├── test_nn.rb
└── benchmark.rb
How the binary is found
c_api.rb searches for the compiled binary in this order:
| Priority | Path | When |
|---|---|---|
| 1 | `lib/grx/libgrx_core.so` | `make -C ext/unix` (manual) |
| 2 | `lib/grx_core.so` | `gem install` via rake-compiler |
| 3 | `lib/grx_core.bundle` | `gem install` on macOS |
| 4 | `ext/grx/libgrx_core.so` | local development |
If none is found, GRX falls back to pure Ruby automatically — no crash, no configuration needed.
Benchmark
Measured on Ruby 3.3, Linux x86_64, AVX2+FMA active.
| Operation | n = 1M elements | Throughput |
|---|---|---|
| `add` | ~4ms / iter | ~250M doubles/s |
| `dot` | ~2ms / iter | ~500M doubles/s |
| `relu` | ~4ms / iter | ~250M doubles/s |
| `matmul` 256×256 | ~6ms | — |
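The element-wise rows are easy to reproduce with Ruby's Benchmark module; a minimal sketch (timings depend on your CPU and on whether the AVX2 path was compiled in):
require "benchmark"
require "grx"

a = GRX.rand([1_000_000])
b = GRX.rand([1_000_000])
t = Benchmark.realtime { 10.times { a + b } }
puts "add: #{(t / 10 * 1000).round(2)} ms / iter"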
Roadmap
- [ ] OpenMP — parallelize element-wise ops across all CPU cores
- [ ] BLAS (`cblas_dgemm`) — production-grade matmul
- [ ] Broadcasting — automatic shape expansion
- [ ] `float32` support — 8 values per AVX2 register
- [ ] Move autograd graph to C — eliminate Ruby GC overhead for large networks
- [ ] `Conv2d`, `LSTM`, `MultiheadAttention`
- [ ] CUDA extension (`grx-tensor-cuda`)
License
MIT — see LICENSE.txt