GRX-Tensor
Ruby speaks. C computes.
A tensor framework for Ruby with automatic differentiation, a C+SIMD compute core, and neural network primitives — all behind a clean, expressive Ruby API.
What is GRX?
GRX is a tensor computation library for Ruby. The numeric core is written in C and compiled with AVX2 + FMA SIMD instructions, operating on 4 doubles per 256-bit vector register with fused multiply-add. Ruby handles the high-level API: shape validation, computation graph construction, and orchestration. C handles everything else.
Key features
| Feature | Details |
|---|---|
| C+SIMD kernel | AVX2+FMA, SSE2 fallback, scalar fallback — auto-detected at compile time |
| Autograd | Automatic differentiation via a topological computation graph |
| Optimizers | SGD (momentum, weight decay) and Adam (inner loop in C with FMA) |
| NN layers | Linear, Sequential, Dropout, BatchNorm1d |
| Activations | ReLU, Leaky ReLU, Tanh, Sigmoid, Softmax |
| Loss functions | MSE, MAE, BCE, CrossEntropy, Huber |
| Weight init | Xavier uniform, He normal (Box-Muller in C) |
| Cross-platform | .so on Linux, .dylib on macOS, .dll on Windows |
| Pure Ruby fallback | Works without compilation — slower but always correct |
Installation
gem install grx-tensor
# Gemfile
gem "grx-tensor"
The C extension compiles automatically on gem install. No extra steps needed.
Quick start
require "grx"
a = GRX.tensor([1.0, 2.0, 3.0], [3], requires_grad: true)
b = GRX.tensor([4.0, 5.0, 6.0], [3], requires_grad: true)
c = a + b # [5.0, 7.0, 9.0] — computed in C with AVX2
c.backward # propagates gradients through the graph
a.grad.to_a # [1.0, 1.0, 1.0]
b.grad.to_a # [1.0, 1.0, 1.0]
Tensors
# From array + shape
t = GRX.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2, 3])
t.shape # [2, 3]
t.numel # 6
t.to_a # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
t.item # only for single-element tensors → Float
# Factories
GRX.zeros([3]) # [0.0, 0.0, 0.0]
GRX.ones([2, 2]) # [1.0, 1.0, 1.0, 1.0]
GRX.rand([4]) # uniform [0, 1)
GRX.randn([4]) # normal N(0, 1)
GRX::Tensor.zeros_like(t) # same shape, all zeros
GRX::Tensor.ones_like(t) # same shape, all ones
Arithmetic
All operations run in C. Scalar operands are supported on both sides.
a = GRX.tensor([1.0, 2.0, 3.0, 4.0], [4])
b = GRX.tensor([4.0, 3.0, 2.0, 1.0], [4])
(a + b).to_a # [5.0, 5.0, 5.0, 5.0]
(a - b).to_a # [-3.0, -1.0, 1.0, 3.0]
(a * b).to_a # [4.0, 6.0, 6.0, 4.0]
(a / b).to_a # [0.25, 0.666, 1.5, 4.0]
(-a).to_a # [-1.0, -2.0, -3.0, -4.0]
# Tensor OP scalar
(a + 10.0).to_a # [11.0, 12.0, 13.0, 14.0]
(a * 3.0).to_a # [3.0, 6.0, 9.0, 12.0]
(a / 2.0).to_a # [0.5, 1.0, 1.5, 2.0]
(a - 1.0).to_a # [0.0, 1.0, 2.0, 3.0]
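The examples above put the scalar on the right, but the note at the top of this section says scalar operands work on either side, so the reversed forms should be equivalent. A quick sketch under that assumption:
# Scalar OP tensor (sketch; relies on the both-sides scalar support noted above)
(10.0 + a).to_a # [11.0, 12.0, 13.0, 14.0]
(3.0 * a).to_a # [3.0, 6.0, 9.0, 12.0]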
Math operations
x = GRX.tensor([1.0, 4.0, 9.0, 16.0], [4])
x.sqrt.to_a # [1.0, 2.0, 3.0, 4.0]
x.square.to_a # [1.0, 16.0, 81.0, 256.0]
x.abs.to_a # absolute value element-wise
x.log.to_a # natural logarithm
x.exp.to_a # e^x
x.pow(3).to_a # [1.0, 64.0, 729.0, 4096.0]
x.clip(2.0, 10.0).to_a # [2.0, 4.0, 9.0, 10.0]
# Reductions → Float
x.sum # 30.0
x.mean # 7.5
x.max # 16.0
x.min # 1.0
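Because the reductions return plain Floats, they compose directly with Ruby's Math module. For example, the Euclidean norm of x, built only from calls shown above:
# L2 norm: square element-wise, sum to a Float, then Math.sqrt
Math.sqrt(x.square.sum) # sqrt(1 + 16 + 81 + 256) = sqrt(354) ≈ 18.81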
Linear algebra
u = GRX.tensor([1.0, 2.0, 3.0], [3])
v = GRX.tensor([4.0, 5.0, 6.0], [3])
u.dot(v) # 32.0 → 1×4 + 2×5 + 3×6
# Matrix multiplication — tiled for cache efficiency
a = GRX.tensor([1.0, 2.0, 3.0, 4.0], [2, 2])
b = GRX.tensor([5.0, 6.0, 7.0, 8.0], [2, 2])
a.matmul(b).to_a # [19.0, 22.0, 43.0, 50.0]
# Non-square: [2×3] × [3×2] → [2×2]
a3 = GRX.tensor([1.0,2.0,3.0, 4.0,5.0,6.0], [2, 3])
b3 = GRX.tensor([7.0,8.0, 9.0,10.0, 11.0,12.0], [3, 2])
a3.matmul(b3).to_a # [58.0, 64.0, 139.0, 154.0]
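Each output entry is a row-by-column dot product, e.g. 58 = 1×7 + 2×9 + 3×11 (row 0 of a3 against column 0 of b3) and 139 = 4×7 + 5×9 + 6×11.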
Zero-copy geometry
reshape and transpose return views over the same memory — no data is copied.
m = GRX.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2, 3])
m.get(1, 2) # 6.0
m.reshape([3, 2]) # new view, same data
m.flatten # shape [6], same data
m.transpose # shape [3, 2], same data
# Transpose is a true view
sq = GRX.tensor([1.0, 2.0, 3.0, 4.0], [2, 2])
tr = sq.transpose
tr.get(0, 1) # 3.0 (was sq[1, 0])
tr.get(1, 0) # 2.0 (was sq[0, 1])
tr.to_a # [1.0, 3.0, 2.0, 4.0]
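reshape obeys the same rule. Assuming the row-major layout that get already implies, the element at m.get(1, 2) stays reachable through any view of the same buffer; a sketch:
# Same buffer seen through a different shape (row-major, no copy)
r = m.reshape([3, 2])
r.get(2, 1) # 6.0, the same element as m.get(1, 2)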
Activations
x = GRX.tensor([-3.0, -1.0, 0.0, 1.0, 3.0], [5])
x.relu.to_a # [0.0, 0.0, 0.0, 1.0, 3.0]
x.leaky_relu(0.1).to_a # [-0.3, -0.1, 0.0, 1.0, 3.0]
x.sigmoid.to_a # [0.047, 0.268, 0.5, 0.731, 0.952]
x.tanh.to_a # [-0.995, -0.761, 0.0, 0.761, 0.995]
GRX.tensor([1.0, 2.0, 3.0, 4.0], [4]).softmax.to_a
# [0.032, 0.087, 0.236, 0.643] — always sums to 1.0
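The sums-to-1 property can be checked directly with the sum reduction:
probs = GRX.tensor([1.0, 2.0, 3.0, 4.0], [4]).softmax
probs.sum # 1.0 (up to floating-point rounding)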
Autograd
Every operation builds a computation graph automatically. Call .backward to propagate gradients back through the graph.
# --- Simple gradient ---
a = GRX.tensor([2.0, 3.0], [2], requires_grad: true)
b = GRX.tensor([4.0, 5.0], [2], requires_grad: true)
c = a + b
c.backward
a.grad.to_a # [1.0, 1.0] — d(a+b)/da = 1
b.grad.to_a # [1.0, 1.0] — d(a+b)/db = 1
# --- Chained operations ---
x = GRX.tensor([1.0, 2.0], [2], requires_grad: true)
y = GRX.tensor([3.0, 4.0], [2], requires_grad: true)
z = (x + y) * y # z = xy + y²
z.backward
x.grad.to_a # [3.0, 4.0] — dz/dx = y
y.grad.to_a # [7.0, 10.0] — dz/dy = x + 2y
# Reset gradients before next step
x.zero_grad!
y.zero_grad!
Operations with autograd support:
+ - * / negate scale square sqrt log exp pow
relu leaky_relu tanh sigmoid matmul transpose
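The unary ops in the list follow the same mechanics. A small sketch with square, using the standard derivative d(x²)/dx = 2x:
x = GRX.tensor([2.0, 5.0], [2], requires_grad: true)
y = x.square
y.backward
x.grad.to_a # [4.0, 10.0], i.e. 2x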
Neural networks
# Build a network with Sequential
net = GRX::NN::Sequential.new(
GRX::NN::Linear.new(4, 64),
GRX::NN::ReLU.new,
GRX::NN::Linear.new(64, 32),
GRX::NN::Tanh.new,
GRX::NN::Linear.new(32, 1),
GRX::NN::Sigmoid.new
)
puts net
# Sequential(
# (0): Linear(4 → 64, bias: true)
# (1): ReLU()
# (2): Linear(64 → 32, bias: true)
# (3): Tanh()
# (4): Linear(32 → 1, bias: true)
# (5): Sigmoid()
# )
# Forward pass — batch of 8 samples, 4 features each
x = GRX.randn([8, 4])
pred = net.call(x) # shape [8, 1]
# Access all trainable parameters
params = net.parameters # Array of Tensors with requires_grad: true
params.size # 6 (3 weights + 3 biases)
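The parameter count is easy to sanity-check: the three Linear layers hold (4×64 + 64) + (64×32 + 32) + (32×1 + 1) = 2433 learnable scalars.
params.sum { |p| p.numel } # 2433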
Training loop
require "grx"
# --- Dataset: learn y = 2x + 1 ---
train_x = GRX.tensor((1..8).map(&:to_f), [8, 1])
train_y = GRX.tensor((1..8).map { |x| 2.0 * x + 1.0 }, [8, 1])
# --- Network ---
net = GRX::NN::Sequential.new(
GRX::NN::Linear.new(1, 8),
GRX::NN::Tanh.new,
GRX::NN::Linear.new(8, 1)
)
opt = GRX::Optim::Adam.new(net.parameters, lr: 0.05)
loss_fn = GRX::Loss::MSELoss.new
300.times do |epoch|
opt.zero_grad
pred = net.call(train_x)
loss_val = loss_fn.call(pred, train_y)
# The MSE gradient d(loss)/d(pred) = 2 · (pred − target) / N is computed
# by hand and injected into the graph at pred
grad = pred.to_a.zip(train_y.to_a).map { |p, t| 2.0 * (p - t) / pred.numel }
pred.agregar_gradiente(GRX.tensor(grad, pred.shape))
pred.backward
opt.step
puts "epoch #{epoch + 1} loss: #{loss_val.round(6)}" if (epoch + 1) % 100 == 0
end
# epoch 100 loss: 0.312...
# epoch 200 loss: 0.041...
# epoch 300 loss: 0.005...
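With the loss down around 0.005, predictions on the training inputs should track y = 2x + 1 closely. A quick check (exact values vary from run to run):
net.call(GRX.tensor([4.0], [1, 1])).to_a # ≈ [9.0], i.e. 2×4 + 1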
Layers
| Class | Description |
|---|---|
| `GRX::NN::Linear` | Dense layer — y = x @ Wᵀ + b, Xavier uniform init |
| `GRX::NN::Sequential` | Ordered chain of layers |
| `GRX::NN::ReLU` | Rectified Linear Unit |
| `GRX::NN::LeakyReLU` | Leaky ReLU with configurable alpha (default 0.01) |
| `GRX::NN::Tanh` | Hyperbolic tangent |
| `GRX::NN::Sigmoid` | Logistic sigmoid |
| `GRX::NN::Softmax` | Normalized exponential |
| `GRX::NN::Dropout` | Inverted dropout — train! / eval! modes |
| `GRX::NN::BatchNorm1d` | Batch normalization with running statistics |
Loss functions
| Class | Formula | Use case |
|---|---|---|
| `GRX::Loss::MSELoss` | mean((pred − target)²) | Regression |
| `GRX::Loss::MAELoss` | mean(\|pred − target\|) | Regression |
| `GRX::Loss::BCELoss` | -mean(t·log(p) + (1−t)·log(1−p)) | Binary classification |
| `GRX::Loss::CrossEntropyLoss` | Softmax + NLL | Multi-class classification |
| `GRX::Loss::HuberLoss` | Smooth L1 (configurable delta) | Regression with outliers |
Optimizers
# SGD with momentum and weight decay
opt = GRX::Optim::SGD.new(net.parameters,
lr: 0.01,
momentum: 0.9,
weight_decay: 1e-4
)
# Adam — the standard choice for deep networks
opt = GRX::Optim::Adam.new(net.parameters,
lr: 0.001,
beta1: 0.9,
beta2: 0.999,
epsilon: 1e-8,
weight_decay: 0.0
)
# Training step
opt.zero_grad # clear gradients
# ... forward + backward ...
opt.step # update parameters
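Per parameter theta with gradient g, the Adam step is the standard update (presumably what the C inner loop implements; with weight_decay > 0, weight_decay·theta is conventionally added to g first):
# t-th step, element-wise (Ruby-style pseudocode, standard Adam with bias correction)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
theta = theta - lr * m_hat / (Math.sqrt(v_hat) + epsilon)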
Weight initialization
# Xavier uniform — recommended for tanh / sigmoid layers
GRX::Tensor.xavier_uniform([64, 32], requires_grad: true)
# He normal — recommended for ReLU layers
GRX::Tensor.he_normal([64, 32], requires_grad: true)
# Manual
GRX::Tensor.zeros([64], requires_grad: true)
GRX::Tensor.ones([64], requires_grad: true)
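Both initializers follow the usual formulas: Xavier uniform draws from U(−limit, +limit) with limit = √(6 / (fan_in + fan_out)), He normal from N(0, √(2 / fan_in)). A rough sanity check of the Xavier bound, assuming the two dimensions are treated as fan_in and fan_out:
w = GRX::Tensor.xavier_uniform([64, 32])
limit = Math.sqrt(6.0 / (64 + 32)) # 0.25
w.max <= limit && w.min >= -limit # true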
Dropout & BatchNorm
# Dropout — different behavior in train vs eval
drop = GRX::NN::Dropout.new(0.5)
drop.train! # activates dropout
drop.eval! # passes input through unchanged
# BatchNorm1d — normalizes across the batch dimension
bn = GRX::NN::BatchNorm1d.new(16)
bn.train!
bn.eval!
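In eval mode Dropout is a pass-through; in train mode roughly p of the elements are zeroed and the survivors are scaled by 1/(1 − p), which is what "inverted dropout" in the layer table means. A sketch, assuming Dropout exposes call directly like Sequential does (which elements drop is random):
x = GRX.ones([8])
drop = GRX::NN::Dropout.new(0.5)
drop.eval!
drop.call(x).to_a # [1.0, 1.0, ...] unchanged
drop.train!
drop.call(x).to_a # about half zeros, the rest 2.0 (1 / (1 - 0.5))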
Architecture
grx-tensor/
├── ext/
│ ├── grx/
│ │ ├── grx_core.c # C kernel
│ │ │ # AVX2+FMA element-wise ops (unroll ×2)
│ │ │ # Cache-tiled matmul (TILE=8, 64-byte cache lines)
│ │ │ # Adam optimizer inner loop with FMA
│ │ │ # Xavier uniform + He normal (Box-Muller in C)
│ │ │ # 32-byte aligned memory (posix_memalign / _aligned_malloc)
│ │ ├── grx_core.h # Public C API with GRX_API export macro
│ │ └── extconf.rb # mkmf config — auto-detects AVX2, SSE2, scalar
│ ├── unix/
│ │ └── Makefile # Manual build → lib/grx/libgrx_core.so / .dylib
│ └── windows/
│ └── Makefile.mingw # Manual build → lib/grx/grx_core.dll
│
├── lib/
│ ├── grx.rb # require "grx" ← entry point
│ └── grx/
│ ├── c_api.rb # Fiddle bridge — finds and loads the binary
│ │ # Searches: lib/grx/, lib/, ext/grx/ (all install methods)
│ ├── storage.rb # Native memory buffer (Fiddle::Pointer, 32-byte aligned)
│ ├── tensor.rb # Tensor: zero-copy views + autograd node
│ ├── nn.rb # NN layers
│ ├── optim.rb # Optimizers
│ ├── loss.rb # Loss functions
│ └── errors.rb # ShapeError, DimensionError, StorageError
│
└── test/
├── test_full.rb # 104-test integration suite
├── test_tensor.rb
├── test_nn.rb
└── benchmark.rb
How the binary is found
c_api.rb searches for the compiled binary in this order:
| Priority | Path | When |
|---|---|---|
| 1 | `lib/grx/libgrx_core.so` | `make -C ext/unix` (manual) |
| 2 | `lib/grx_core.so` | `gem install` via rake-compiler |
| 3 | `lib/grx_core.bundle` | `gem install` on macOS |
| 4 | `ext/grx/libgrx_core.so` | local development |
If none is found, GRX falls back to pure Ruby automatically — no crash, no configuration needed.
Benchmark
Measured on Ruby 3.3, Linux x86_64, AVX2+FMA active.
| Operation | n = 1M elements | Throughput |
|---|---|---|
| `add` | ~4ms / iter | ~250M doubles/s |
| `dot` | ~2ms / iter | ~500M doubles/s |
| `relu` | ~4ms / iter | ~250M doubles/s |
| `matmul` 256×256 | ~6ms | — |
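The element-wise rows are easy to reproduce with Ruby's Benchmark module; a minimal sketch (timings depend on your CPU and on whether the AVX2 path was compiled in):
require "benchmark"
require "grx"

a = GRX.rand([1_000_000])
b = GRX.rand([1_000_000])
t = Benchmark.realtime { 10.times { a + b } }
puts "add: #{(t / 10 * 1000).round(2)} ms / iter"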
Roadmap
- [ ] OpenMP — parallelize element-wise ops across all CPU cores
- [ ] BLAS (`cblas_dgemm`) — production-grade matmul
- [ ] Broadcasting — automatic shape expansion
- [ ] `float32` support — 8 values per AVX2 register
- [ ] Move autograd graph to C — eliminate Ruby GC overhead for large networks
- [ ] `Conv2d`, `LSTM`, `MultiheadAttention`
- [ ] CUDA extension (`grx-tensor-cuda`)
License
MIT — see LICENSE.txt