Module: TinyNN

Defined in: lib/toy/ffi/tinynn.rb

Class Method Summary

.adam_step(param, grad, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object

Adam optimizer step.
.add(a, b) ⇒ Object

Element-wise a + b.
.alloc_1d_i32(sess, n) ⇒ Object
.alloc_2d(sess, rows, cols) ⇒ Object
.build_add(sess, ta, tb) ⇒ Object
.build_gelu(sess, ta) ⇒ Object
.build_matmul(sess, ta, tb) ⇒ Object
.build_rms_norm(sess, tx, tgamma, eps) ⇒ Object
.build_scale(sess, ta, s) ⇒ Object
.build_softmax(sess, ta) ⇒ Object
.compute(sess) ⇒ Object
.cross_entropy_grad(logits, targets, n_pred) ⇒ Object

Fused softmax-cross-entropy gradient: dlogits[i, v] = (softmax(logits)[i, v] - one_hot(targets[i])[v]) / n_pred.
.download_matmul(sess, tensor, m, n) ⇒ Object

Download a matmul result.
.download_row_major(sess, dl_handle, rows, cols) ⇒ Object

Download a tensor whose data is row-major (output of elementwise ops like add, gelu, rms_norm, softmax, scale).
.download_to_mat(sess, dl_handle, rows, cols) ⇒ Object

Chunked Mat-roundtrip for large tensors.
.embed_back(d_out, indices, vocab_size) ⇒ Object

Embedding backward: scatter-add d_out rows into a vocab-sized table.
.embed_lookup(table, indices) ⇒ Object

Embedding lookup: gather table rows by indices.
.ffn_pipeline(h, w1, w2) ⇒ Object

FFN-shaped chain: result = gelu(h * w1) * w2.
.gelu(a) ⇒ Object

Element-wise GeLU (tanh approximation, matches project’s feed_forward).
.gelu_back(x, dh) ⇒ Object

GeLU backward: dx = dh * d/dx GeLU(x) (tanh approx).
.matmul(a, b) ⇒ Object

a * b where both are project Mats (row-major f64).
.matmul_t(a, b) ⇒ Object

a * b^T natively (matches Mat#matmul_t).
.mul(a, b) ⇒ Object

Element-wise multiply c = a * b.
.persistent_free(sess) ⇒ Object
.persistent_new(prefer_cuda) ⇒ Object

Persistent-session API: build a graph once, run it many times.
.realize(sess, result) ⇒ Object
.rms_norm(x, gamma, eps) ⇒ Object

RMSNorm(x) * gamma.
.rms_norm_back(x, dy, eps) ⇒ Object

d/dx of plain RMSNorm(x) given dy (= grad of normalized output).
.scale(a, s) ⇒ Object

Element-wise a * s for scalar s.
.sgd_step(param, grad, lr) ⇒ Object

SGD parameter update: param_new = param - lr * grad.
.silu(a) ⇒ Object

Element-wise SiLU (x * sigmoid(x)), llama-family activation.
.silu_back(x, dy) ⇒ Object

Backward for SiLU: given x (the input to silu) and dy (gradient from upstream), returns dx.
.softmax(a) ⇒ Object

Per-row softmax.
.softmax_back(a_softmax, dy) ⇒ Object

d/dx of per-row softmax.
.stage_row_major_and_upload(sess, target, m) ⇒ Object

Internal: stage `m` row-major into scratch, then bulk-upload to `target`.
.stage_transposed_and_upload(sess, target, b) ⇒ Object

Internal: stage b TRANSPOSED into scratch, then bulk-upload to `target`.
.t_matmul(a, b) ⇒ Object

a^T * b (matches Mat#t_matmul).
.transpose(a) ⇒ Object

Transpose.
.upload_int_array(sess, tensor, indices) ⇒ Object

Upload an Array<Int> to a 1D int32 tensor in one FFI call.
.upload_row_major(sess, tensor, mat) ⇒ Object

Stage a Mat row-major into scratch and upload to `tensor`.
.upload_transposed(sess, tensor, mat) ⇒ Object

Stage a Mat TRANSPOSED into scratch and upload.

Class Method Details

.adam_step(param, grad, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ `Object`

Adam optimizer step. Matches the project’s adam_step_mat.

Returns an AdamStepResult wrapping three new Mats (param_new, m_new, v_new). The caller is responsible for swapping them back into wherever they came from: there is no persistent storage yet, but once persistent sessions are wired into transformer.rb, m and v can stay on-device.

omc1, omc2 are pre-computed bias-correction divisors:

omc1 = 1 - beta1^t,  omc2 = 1 - beta2^t

where t is the step number. (The project tracks them as running products in AdamState.bc1 / bc2; both conventions work.)

# File 'lib/toy/ffi/tinynn.rb', line 1206

def self.adam_step(param, grad, m, v, lr, b1, b2, eps, omc1, omc2)
  sess = TinyNN.tnn_session_new(0)
  n = param.nrows * param.ncols
  # Stage param at [0..n), grad at [n..2n), m at [2n..3n), v at [3n..4n).
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, param.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, n + i, grad.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, 2 * n + i, m.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, 3 * n + i, v.flat[i])
    i = i + 1
  end

  TinyNN.tnn_adam_step_scratch(sess, n, lr, b1, b2, eps, omc1, omc2)

  new_param = Mat.new(param.nrows, param.ncols)
  new_mom_m = Mat.new(param.nrows, param.ncols)
  new_mom_v = Mat.new(param.nrows, param.ncols)
  i = 0
  while i < n
    new_param.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    new_mom_m.flat[i] = TinyNN.tnn_scratch_get(sess, 2 * n + i)
    new_mom_v.flat[i] = TinyNN.tnn_scratch_get(sess, 3 * n + i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  AdamStepResult.new(new_param, new_mom_m, new_mom_v)
end
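
For reference, the update can be sketched in plain Ruby on flat Float arrays. This is the textbook Adam form with the bias-correction divisors passed in as omc1 / omc2; that tnn_adam_step_scratch matches it term for term is an assumption, and adam_step_ref is a hypothetical name used only for illustration.

```ruby
# Textbook Adam update on flat Float arrays (assumed, not confirmed, to
# match the C kernel). omc1 = 1 - b1^t and omc2 = 1 - b2^t are divisors.
def adam_step_ref(param, grad, m, v, lr, b1, b2, eps, omc1, omc2)
  new_p = Array.new(param.length, 0.0)
  new_m = Array.new(param.length, 0.0)
  new_v = Array.new(param.length, 0.0)
  param.each_index do |i|
    new_m[i] = b1 * m[i] + (1.0 - b1) * grad[i]
    new_v[i] = b2 * v[i] + (1.0 - b2) * grad[i] * grad[i]
    m_hat = new_m[i] / omc1          # bias-corrected first moment
    v_hat = new_v[i] / omc2          # bias-corrected second moment
    new_p[i] = param[i] - lr * m_hat / (Math.sqrt(v_hat) + eps)
  end
  [new_p, new_m, new_v]
end

# Step t = 1 with zero-initialised moments.
t = 1
b1 = 0.9
b2 = 0.999
omc1 = 1.0 - b1**t
omc2 = 1.0 - b2**t
p1, m1, v1 = adam_step_ref([1.0], [0.5], [0.0], [0.0], 0.1, b1, b2, 1e-8, omc1, omc2)
```

At the first step the bias correction cancels the (1 - b) factors, so the parameter moves by roughly lr in the direction opposite the gradient's sign.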

.add(a, b) ⇒ `Object`

Element-wise a + b. Both Mats must have the same shape.

# File 'lib/toy/ffi/tinynn.rb', line 683

def self.add(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_add(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, b.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result is row-major same shape as a (ne0=cols, ne1=rows, flat
  # is row-major already since ggml_add preserves layout).
  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.alloc_1d_i32(sess, n) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 968

def self.alloc_1d_i32(sess, n)
  TinyNN.tnn_input_1d_i32(sess, n)
end

.alloc_2d(sess, rows, cols) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 964

def self.alloc_2d(sess, rows, cols)
  TinyNN.tnn_input_2d_f32(sess, rows, cols)
end

.build_add(sess, ta, tb) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 976

def self.build_add(sess, ta, tb)
  TinyNN.tnn_add(sess, ta, tb)
end

.build_gelu(sess, ta) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 980

def self.build_gelu(sess, ta)
  TinyNN.tnn_gelu(sess, ta)
end

.build_matmul(sess, ta, tb) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 972

def self.build_matmul(sess, ta, tb)
  TinyNN.tnn_matmul(sess, ta, tb)
end

.build_rms_norm(sess, tx, tgamma, eps) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 992

def self.build_rms_norm(sess, tx, tgamma, eps)
  TinyNN.tnn_rms_norm(sess, tx, tgamma, eps)
end

.build_scale(sess, ta, s) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 988

def self.build_scale(sess, ta, s)
  TinyNN.tnn_scale(sess, ta, s)
end

.build_softmax(sess, ta) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 984

def self.build_softmax(sess, ta)
  TinyNN.tnn_softmax(sess, ta)
end

.compute(sess) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 1000

def self.compute(sess)
  TinyNN.tnn_compute(sess)
end

.cross_entropy_grad(logits, targets, n_pred) ⇒ `Object`

Fused softmax-cross-entropy gradient:

dlogits[i, v] = (softmax(logits)[i, v] - one_hot(targets[i])[v]) / n_pred

Composable from existing ops:

sm  = softmax(logits)
oh  = one_hot mat (built on the Ruby side; cheap — n_pred sets)
dlg = (sm - oh) / n_pred = scale(sm, 1/n_pred) + scale(oh, -1/n_pred)

`logits` is (n_pred, vocab); `targets` is Array<Int> of length n_pred where targets[i] in [0, vocab) is the desired class at row i.

# File 'lib/toy/ffi/tinynn.rb', line 1179

def self.cross_entropy_grad(logits, targets, n_pred)
  # 1. one-hot in Ruby.
  oh = Mat.new(logits.nrows, logits.ncols)
  i = 0
  while i < n_pred
    oh.flat[i * logits.ncols + targets[i]] = 1.0
    i = i + 1
  end
  # 2. softmax + scale + scale + add through FFI.
  sm = TinyNN.softmax(logits)
  inv_n = 1.0 / n_pred.to_f
  sm_s  = TinyNN.scale(sm, inv_n)
  oh_s  = TinyNN.scale(oh, -inv_n)
  TinyNN.add(sm_s, oh_s)
end
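
A pure-Ruby reference for the same gradient can be useful when checking the FFI path. The sketch below works on nested Arrays rather than Mats, and cross_entropy_grad_ref is a hypothetical helper name, not part of this module.

```ruby
# Reference for dlogits[i][v] = (softmax(logits)[i][v] - one_hot(targets[i])[v]) / n_pred.
# logits is an Array of rows (Array<Float>); targets is Array<Integer>.
def cross_entropy_grad_ref(logits, targets)
  n_pred = logits.length
  logits.each_with_index.map do |row, i|
    mx = row.max                                # subtract the max for stability
    exps = row.map { |x| Math.exp(x - mx) }
    sum = exps.sum
    row.each_index.map do |v|
      one_hot = v == targets[i] ? 1.0 : 0.0
      (exps[v] / sum - one_hot) / n_pred
    end
  end
end

grad = cross_entropy_grad_ref([[0.0, 0.0], [1.0, 1.0]], [0, 1])
```

Each row of the result sums to zero, since both the softmax row and the one-hot row sum to one.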

.download_matmul(sess, tensor, m, n) ⇒ `Object`

Download a matmul result. ggml’s mul_mat result has ne0=m, ne1=n, so reading it into a row-major Mat (rows=m, cols=n) means taking element (i, j) from scratch[j*m + i].

# File 'lib/toy/ffi/tinynn.rb', line 1076

def self.download_matmul(sess, tensor, m, n)
  TinyNN.tnn_download(sess, tensor)
  out = Mat.new(m, n)
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end
  out
end
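
The index mapping can be checked without a session. The sketch below replays the same loop on a plain Array standing in for the scratch buffer; read_matmul_result is a hypothetical name.

```ruby
# Re-read a buffer laid out as scratch[j*m + i] into row-major out[i*n + j].
def read_matmul_result(scratch, m, n)
  out = Array.new(m * n, 0.0)
  m.times do |i|
    n.times do |j|
      out[i * n + j] = scratch[j * m + i]
    end
  end
  out
end

# m = 2, n = 3; the logical result is [[1, 2, 3], [4, 5, 6]].
scratch = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
row_major = read_matmul_result(scratch, 2, 3)
```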

.download_row_major(sess, dl_handle, rows, cols) ⇒ `Object`

Download a tensor whose data is row-major (output of elementwise ops like add, gelu, rms_norm, softmax, scale).

The parameter is named `dl_handle` (not `tensor`) intentionally: Spinel unifies param-name types across the whole program, and `tensor` collides with a dead `upload_transposed` definition whose param got mistyped as mrb_int. With the name `tensor`, download_row_major’s tensor arg gets boxed at call sites and the (void *) cast inside fails.

# File 'lib/toy/ffi/tinynn.rb', line 1047

def self.download_row_major(sess, dl_handle, rows, cols)
  TinyNN.tnn_download(sess, dl_handle)
  out = Mat.new(rows, cols)
  n = rows * cols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  out
end

.download_to_mat(sess, dl_handle, rows, cols) ⇒ `Object`

Chunked Mat-roundtrip for large tensors. Unlike download_row_major this bypasses the 16 MiB scratch (via tnn_download_to_f64_array’s internal chunking) and so works on weight-sized tensors loaded via the direct GGUF→FFI path. Mirrors `upload_row_major`.

Use this when you want a Mat copy of a persistent FFI tensor: inspection, Mat-side fine-tuning, export. For small graph intermediates (norms / per-step logits) the scratch-based download_row_major is fine and slightly faster.

# File 'lib/toy/ffi/tinynn.rb', line 1068

def self.download_to_mat(sess, dl_handle, rows, cols)
  out = Mat.new(rows, cols)
  TinyNN.tnn_download_to_f64_array(sess, dl_handle, out.flat, rows * cols)
  out
end

.embed_back(d_out, indices, vocab_size) ⇒ `Object`

Embedding backward: scatter-add d_out rows into a vocab-sized table. `d_out` is (n_idx, d_model). `indices` is Array<Int>. Returns a (vocab_size, d_model) Mat where out[indices[i]] += d_out[i].

# File 'lib/toy/ffi/tinynn.rb', line 1298

def self.embed_back(d_out, indices, vocab_size)
  n_idx = indices.length
  sess  = TinyNN.tnn_session_new(0)
  td    = TinyNN.tnn_input_2d_f32(sess, d_out.nrows, d_out.ncols)
  tidx  = TinyNN.tnn_input_1d_i32(sess, n_idx)
  # Shape reference for the result: a freshly-allocated (vocab, d) tensor.
  tshape = TinyNN.tnn_input_2d_f32(sess, vocab_size, d_out.ncols)
  tout  = TinyNN.tnn_get_rows_back(sess, td, tidx, tshape)
  TinyNN.tnn_realize(sess, tout)

  TinyNN.stage_row_major_and_upload(sess, td, d_out)

  i = 0
  while i < n_idx
    TinyNN.tnn_scratch_set_i32(sess, i, indices[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tidx)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tout)

  out = Mat.new(vocab_size, d_out.ncols)
  n = vocab_size * d_out.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end
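
The scatter-add semantics, including what happens when an index repeats, can be sketched in plain Ruby. This uses nested Arrays instead of Mats, and embed_back_ref is a hypothetical name.

```ruby
# Scatter-add: row i of d_out is accumulated into row indices[i] of the result.
def embed_back_ref(d_out, indices, vocab_size)
  d_model = d_out[0].length
  table_grad = Array.new(vocab_size) { Array.new(d_model, 0.0) }
  indices.each_with_index do |row_id, i|
    d_model.times { |k| table_grad[row_id][k] += d_out[i][k] }
  end
  table_grad
end

# Index 2 appears twice, so its two gradient rows are summed; row 1 stays zero.
g = embed_back_ref([[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]], [2, 0, 2], 3)
```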

.embed_lookup(table, indices) ⇒ `Object`

Embedding lookup: gather table rows by indices. `table` is (vocab, d_model) Mat; `indices` is Array<Int>. Returns (indices.length, d_model) Mat with table[indices[i]] in row i.

# File 'lib/toy/ffi/tinynn.rb', line 1264

def self.embed_lookup(table, indices)
  n_idx = indices.length
  sess  = TinyNN.tnn_session_new(0)
  ttab  = TinyNN.tnn_input_2d_f32(sess, table.nrows, table.ncols)
  tidx  = TinyNN.tnn_input_1d_i32(sess, n_idx)
  tout  = TinyNN.tnn_get_rows(sess, ttab, tidx)
  TinyNN.tnn_realize(sess, tout)

  TinyNN.stage_row_major_and_upload(sess, ttab, table)

  i = 0
  while i < n_idx
    TinyNN.tnn_scratch_set_i32(sess, i, indices[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tidx)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tout)

  out = Mat.new(n_idx, table.ncols)
  n = n_idx * table.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end
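
The gather itself is simple enough to state as a one-line reference, here on nested Arrays with the hypothetical name embed_lookup_ref.

```ruby
# Gather: row i of the result is a copy of table[indices[i]].
def embed_lookup_ref(table, indices)
  indices.map { |row_id| table[row_id].dup }
end

table = [[0.1, 0.2], [1.1, 1.2], [2.1, 2.2]]
rows = embed_lookup_ref(table, [2, 0, 2])
```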

.ffn_pipeline(h, w1, w2) ⇒ `Object`

FFN-shaped chain: result = gelu(h * w1) * w2.

Runs three op-sized sessions, each reusing the cached engine (the backend + scheduler init runs once, not three times). Chaining everything in a single ggml graph is theoretically possible but needs explicit intermediate transposes, because mul_mat’s result has ne0 swapped relative to the next op’s k-dim. This sticks to three sessions until there is a clean chain-friendly layout convention.

# File 'lib/toy/ffi/tinynn.rb', line 1385

def self.ffn_pipeline(h, w1, w2)
  pre    = TinyNN.matmul(h, w1)
  hidden = TinyNN.gelu(pre)
  TinyNN.matmul(hidden, w2)
end

.gelu(a) ⇒ `Object`

Element-wise GeLU (tanh approximation, matches project’s feed_forward).

# File 'lib/toy/ffi/tinynn.rb', line 722

def self.gelu(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_gelu(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.gelu_back(x, dh) ⇒ `Object`

GeLU backward: dx = dh * d/dx GeLU(x) (tanh approx). Skips ggml entirely — uses tnn_gelu_back_scratch which operates on the session’s scratch buffer directly. CPU-only.

# File 'lib/toy/ffi/tinynn.rb', line 1144

def self.gelu_back(x, dh)
  sess = TinyNN.tnn_session_new(0)
  n = x.nrows * x.ncols
  # Stage x at [0..n), dh at [n..2n)
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, x.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, n + i, dh.flat[i])
    i = i + 1
  end
  TinyNN.tnn_gelu_back_scratch(sess, n)
  out = Mat.new(x.nrows, x.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, 2 * n + i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end
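
The tanh approximation and its derivative can be written out as scalar functions and checked against a finite difference. The constants below are the usual ones for the tanh form; that tnn_gelu_back_scratch uses exactly these is an assumption, and gelu_ref / gelu_back_ref are hypothetical names.

```ruby
GELU_C = Math.sqrt(2.0 / Math::PI)
GELU_A = 0.044715

# GeLU(x) = 0.5 * x * (1 + tanh(c * (x + a * x^3)))
def gelu_ref(x)
  0.5 * x * (1.0 + Math.tanh(GELU_C * (x + GELU_A * x**3)))
end

# dx = dh * d/dx GeLU(x), by the product and chain rules.
def gelu_back_ref(x, dh)
  t = Math.tanh(GELU_C * (x + GELU_A * x**3))
  du = GELU_C * (1.0 + 3.0 * GELU_A * x * x)   # derivative of the tanh argument
  dh * (0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du)
end

# Central finite difference as an independent check.
x = 0.7
h = 1e-5
numeric = (gelu_ref(x + h) - gelu_ref(x - h)) / (2.0 * h)
analytic = gelu_back_ref(x, 1.0)
```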

.matmul(a, b) ⇒ `Object`

a * b where both are project Mats (row-major f64). Returns a Mat (rows = a.nrows, cols = b.ncols).

Implementation note: ggml_mul_mat computes A * B^T. To get A * B we upload b TRANSPOSED: b is (br x bc) row-major; we present it to ggml as a (bc x br) tensor whose rows are b’s columns. Then ggml’s A * B^T = A * B (because the “B^T” inside ggml lines up with the original b shape).

# File 'lib/toy/ffi/tinynn.rb', line 628

def self.matmul(a, b)
  sess = TinyNN.tnn_session_new(0)   # 0 = CPU

  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  # ggml-side tensor for b^T: rows=b.ncols, cols=b.nrows.
  tb_t = TinyNN.tnn_input_2d_f32(sess, b.ncols, b.nrows)
  tc = TinyNN.tnn_matmul(sess, ta, tb_t)
  TinyNN.tnn_realize(sess, tc)

  # Upload a (row-major flat).
  i = 0
  na = a.nrows * a.ncols
  while i < na
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  # Upload b TRANSPOSED into scratch: scratch[j*b.nrows + i] = b[i,j].
  bc = b.ncols
  br = b.nrows
  i = 0
  while i < br
    j = 0
    while j < bc
      TinyNN.tnn_scratch_set(sess, j * br + i, b.flat[i * bc + j])
      j = j + 1
    end
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb_t)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result tensor ggml shape: ne0=m=a.nrows, ne1=n=b.ncols. Read into
  # row-major Mat[i][j] (= flat[i*ncols+j]) from scratch[j*m + i].
  out = Mat.new(a.nrows, b.ncols)
  m = a.nrows
  n = b.ncols
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end
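
The transposed staging can be verified in plain Ruby: staging b with scratch[j*br + i] = b[i*bc + j] and then taking rows-dot-rows products gives the ordinary A * B. Both helper names below are hypothetical, and the rows-dot-rows loop is only a stand-in for what ggml_mul_mat is described as doing above.

```ruby
# Stage a row-major (br x bc) matrix as its transpose, as matmul does for b.
def stage_transposed(b_flat, br, bc)
  scratch = Array.new(br * bc, 0.0)
  br.times do |i|
    bc.times do |j|
      scratch[j * br + i] = b_flat[i * bc + j]
    end
  end
  scratch
end

# out[i][j] = sum_k a[i][k] * bt[j][k], i.e. A times the transpose of bt.
def matmul_rows_dot_rows(a_flat, bt_flat, m, k, n)
  out = Array.new(m * n, 0.0)
  m.times do |i|
    n.times do |j|
      acc = 0.0
      k.times { |kk| acc += a_flat[i * k + kk] * bt_flat[j * k + kk] }
      out[i * n + j] = acc
    end
  end
  out
end

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # 2 x 3
b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]     # 3 x 2
bt = stage_transposed(b, 3, 2)            # 2 x 3
c = matmul_rows_dot_rows(a, bt, 2, 3, 2)  # 2 x 2, equals a * b
```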

.matmul_t(a, b) ⇒ `Object`

a * b^T natively (matches Mat#matmul_t). Faster than TinyNN.matmul for the same shapes because there’s no Ruby-side transpose of b on upload.

# File 'lib/toy/ffi/tinynn.rb', line 1393

def self.matmul_t(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_matmul(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  TinyNN.stage_row_major_and_upload(sess, ta, a)
  TinyNN.stage_row_major_and_upload(sess, tb, b)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, b.nrows)
  m = a.nrows
  n = b.nrows
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.mul(a, b) ⇒ `Object`

Element-wise multiply c = a * b. Matching shape required. One-shot wrapper. Used in SwiGLU between silu(gate) and up.

# File 'lib/toy/ffi/tinynn.rb', line 784

def self.mul(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_mul(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, b.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.persistent_free(sess) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 960

def self.persistent_free(sess)
  TinyNN.tnn_session_free(sess)
end

.persistent_new(prefer_cuda) ⇒ `Object`

Persistent-session API: build a graph once, run it many times.

Workflow:

sess = TinyNN.persistent_new(0)
ta   = TinyNN.alloc_2d(sess, rows, cols)
tb   = TinyNN.alloc_2d(sess, rows, cols)
tc   = TinyNN.build_matmul(sess, ta, tb)   # or build_add / build_gelu / ...
TinyNN.realize(sess, tc)                    # allocates all backend buffers
# Upload weights once:
TinyNN.upload_row_major(sess, tb, w_mat)
# Per training step:
loop do
  TinyNN.upload_row_major(sess, ta, input_mat)
  TinyNN.compute(sess)
  result = TinyNN.download_matmul(sess, tc, m, n)    # transposed readback
end
TinyNN.persistent_free(sess)

The win over the one-shot wrappers (TinyNN.matmul etc.) is that ggml_init / ggml_backend_sched_alloc_graph runs once instead of per op, and backend buffers (the CUDA-side storage for tensors) are allocated once instead of per call. CUDA currently loses to the native path at small shapes; at the toy LM’s transformer shapes (see ab_smoke_big), the persistent session should flip that.



# File 'lib/toy/ffi/tinynn.rb', line 956

def self.persistent_new(prefer_cuda)
  TinyNN.tnn_session_new(prefer_cuda)
end

.realize(sess, result) ⇒ `Object`



# File 'lib/toy/ffi/tinynn.rb', line 996

def self.realize(sess, result)
  TinyNN.tnn_realize(sess, result)
end

.rms_norm(x, gamma, eps) ⇒ `Object`

RMSNorm(x) * gamma. x is (T, d_model), gamma is Array<Float> of length d_model. eps is a required argument in this signature; 1e-5 matches the project’s rms_norm helper.

# File 'lib/toy/ffi/tinynn.rb', line 822

def self.rms_norm(x, gamma, eps)
  sess = TinyNN.tnn_session_new(0)
  tx = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  # gamma as a 1-row tensor: shape (1, d_model). ggml will broadcast
  # across x's leading dimension during the mul.
  tg = TinyNN.tnn_input_2d_f32(sess, 1, x.ncols)
  tc = TinyNN.tnn_rms_norm(sess, tx, tg, eps)
  TinyNN.tnn_realize(sess, tc)

  # Upload x.
  nx = x.nrows * x.ncols
  i = 0
  while i < nx
    TinyNN.tnn_scratch_set(sess, i, x.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tx)

  # Upload gamma (length d_model).
  i = 0
  while i < x.ncols
    TinyNN.tnn_scratch_set(sess, i, gamma[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tg)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(x.nrows, x.ncols)
  i = 0
  while i < nx
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end
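
As a cross-check for the FFI result, the forward pass can be sketched on nested Arrays. This assumes the common definition y = x / sqrt(mean(x^2) + eps) * gamma, applied per row; rms_norm_ref is a hypothetical name.

```ruby
# Per-row RMSNorm followed by an elementwise multiply with gamma.
def rms_norm_ref(x, gamma, eps)
  x.map do |row|
    mean_sq = row.sum { |val| val * val } / row.length
    inv_rms = 1.0 / Math.sqrt(mean_sq + eps)
    row.each_with_index.map { |val, j| val * inv_rms * gamma[j] }
  end
end

x = [[3.0, 4.0]]
y_plain = rms_norm_ref(x, [1.0, 1.0], 0.0)   # eps = 0 only to make the check exact
y_gamma = rms_norm_ref(x, [2.0, 0.5], 0.0)
```

With gamma set to ones and eps set to zero, each output row has a mean square of one, which makes a convenient sanity check.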

.rms_norm_back(x, dy, eps) ⇒ `Object`

d/dx of plain RMSNorm(x) given dy (= grad of normalized output). No gamma — caller is responsible for the gamma part of the chain rule.

Note on arg order: ggml’s header says “a - x, b - dy” but the CPU source (ggml-cpu/ops.cpp ggml_compute_forward_rms_norm_back_f32) treats src0 as gradients and src1 as the forward input. We pass (dy, x) to match the source.

# File 'lib/toy/ffi/tinynn.rb', line 1120

def self.rms_norm_back(x, dy, eps)
  sess = TinyNN.tnn_session_new(0)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  tx  = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  tc  = TinyNN.tnn_rms_norm_back(sess, tdy, tx, eps)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.stage_row_major_and_upload(sess, tx,  x)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(x.nrows, x.ncols)
  n = x.nrows * x.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.scale(a, s) ⇒ `Object`

Element-wise a * s for scalar s. Returns a new Mat (out-of-place).

# File 'lib/toy/ffi/tinynn.rb', line 1459

def self.scale(a, s)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_scale(sess, ta, s)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.sgd_step(param, grad, lr) ⇒ `Object`

SGD parameter update: param_new = param - lr * grad. Returns a fresh Mat with the updated parameter (the caller is responsible for swapping it back into wherever param came from; there is no persistent-session storage yet).

Composed from TinyNN.add and TinyNN.scale rather than ggml_opt_step_sgd (which would need an sgd_params tensor with (alpha, weight_decay)). A single fused op would be faster; this version is the cleanest one with the primitives we already have.



# File 'lib/toy/ffi/tinynn.rb', line 1257

def self.sgd_step(param, grad, lr)
  TinyNN.add(param, TinyNN.scale(grad, -lr))
end
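
The same composition can be sketched on flat Float arrays. The three helper names are hypothetical stand-ins for TinyNN.scale, TinyNN.add and TinyNN.sgd_step.

```ruby
# Elementwise a * s for scalar s.
def scale_ref(a, s)
  a.map { |x| x * s }
end

# Elementwise a + b.
def add_ref(a, b)
  a.zip(b).map { |x, y| x + y }
end

# param - lr * grad, expressed as add(param, scale(grad, -lr)).
def sgd_step_ref(param, grad, lr)
  add_ref(param, scale_ref(grad, -lr))
end

updated = sgd_step_ref([1.0, 2.0], [0.5, -1.0], 0.1)
```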

.silu(a) ⇒ `Object`

Element-wise SiLU (x * sigmoid(x)), llama-family activation. One-shot wrapper (slow per-call: session + graph + free); used by ab_smoke_silu and as a building block. The persistent-session FFN path doesn’t go through this — it builds silu into a fused graph.

# File 'lib/toy/ffi/tinynn.rb', line 754

def self.silu(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_silu(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.silu_back(x, dy) ⇒ `Object`

Backward for SiLU: given x (the input to silu) and dy (gradient from upstream), returns dx. dx = dy * (sigmoid(x) * (1 + x * (1 - sigmoid(x)))).

# File 'lib/toy/ffi/tinynn.rb', line 1333

def self.silu_back(x, dy)
  sess = TinyNN.tnn_session_new(0)
  tx  = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  tc  = TinyNN.tnn_silu_back(sess, tx, tdy)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tx,  x)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(x.nrows, x.ncols)
  n = x.nrows * x.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end
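
The derivative formula above can be checked numerically with scalar functions. The names sigmoid_ref, silu_ref and silu_back_ref are hypothetical.

```ruby
def sigmoid_ref(x)
  1.0 / (1.0 + Math.exp(-x))
end

# SiLU(x) = x * sigmoid(x)
def silu_ref(x)
  x * sigmoid_ref(x)
end

# dx = dy * (sigmoid(x) * (1 + x * (1 - sigmoid(x))))
def silu_back_ref(x, dy)
  s = sigmoid_ref(x)
  dy * (s * (1.0 + x * (1.0 - s)))
end

# Central finite difference as an independent check.
x = -1.3
h = 1e-5
numeric = (silu_ref(x + h) - silu_ref(x - h)) / (2.0 * h)
analytic = silu_back_ref(x, 1.0)
```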

.softmax(a) ⇒ `Object`

Per-row softmax. Matches the project’s softmax_rows! (out-of-place).

# File 'lib/toy/ffi/tinynn.rb', line 863

def self.softmax(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_softmax(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.softmax_back(a_softmax, dy) ⇒ `Object`

d/dx of per-row softmax. `a_softmax` is the softmax output; `dy` is grad of output. (ggml source: src0=dy, src1=y_softmax.)

# File 'lib/toy/ffi/tinynn.rb', line 1356

def self.softmax_back(a_softmax, dy)
  sess = TinyNN.tnn_session_new(0)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  ta  = TinyNN.tnn_input_2d_f32(sess, a_softmax.nrows, a_softmax.ncols)
  tc  = TinyNN.tnn_softmax_back(sess, tdy, ta)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.stage_row_major_and_upload(sess, ta,  a_softmax)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(a_softmax.nrows, a_softmax.ncols)
  n = a_softmax.nrows * a_softmax.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end
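
For comparison, the standard softmax backward is dx[j] = y[j] * (dy[j] - sum_k dy[k] * y[k]) per row. The sketch below implements that on nested Arrays; that ggml's kernel matches it term for term is an assumption, and softmax_back_ref is a hypothetical name.

```ruby
# y is the softmax output, dy the upstream gradient, both as Arrays of rows.
def softmax_back_ref(y, dy)
  y.each_with_index.map do |y_row, i|
    dot = y_row.zip(dy[i]).sum { |yv, g| yv * g }
    y_row.each_with_index.map { |yv, j| yv * (dy[i][j] - dot) }
  end
end

dx = softmax_back_ref([[0.25, 0.75]], [[1.0, 0.0]])
```

Each row of dx sums to zero, because softmax outputs are constrained to sum to one.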

.stage_row_major_and_upload(sess, target, m) ⇒ `Object`

Internal: stage `m` row-major into scratch, then bulk-upload to `target`.

# File 'lib/toy/ffi/tinynn.rb', line 1103

def self.stage_row_major_and_upload(sess, target, m)
  n = m.nrows * m.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, m.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, target)
end

.stage_transposed_and_upload(sess, target, b) ⇒ `Object`

Internal: stage b TRANSPOSED into scratch, then bulk-upload to `target`. The C side does both the transpose and a chunked upload so the call works for tensors larger than the 16 MiB scratch buffer (Qwen2.5-0.5B’s ffn_gate is 17.4 MB; the old per-element + single bulk-upload path silently truncated at the 4M-float boundary, leaving the tail uninitialised and producing 1e+37 magnitudes in the subsequent matmul output).



# File 'lib/toy/ffi/tinynn.rb', line 1098

def self.stage_transposed_and_upload(sess, target, b)
  TinyNN.tnn_upload_transposed_f64(sess, target, b.flat, b.nrows, b.ncols)
end
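
The arithmetic behind the 4M-float boundary, and the kind of chunking that avoids it, can be sketched as follows. The chunk_ranges helper is hypothetical and does not describe the actual C implementation; the element count is an assumed value chosen to be consistent with the 17.4 MB figure above.

```ruby
SCRATCH_BYTES = 16 * 1024 * 1024
BYTES_PER_F32 = 4
FLOATS_PER_CHUNK = SCRATCH_BYTES / BYTES_PER_F32   # 4_194_304 floats fit in scratch

# Split `total` elements into [offset, length] pieces of at most `chunk` elements.
def chunk_ranges(total, chunk)
  ranges = []
  offset = 0
  while offset < total
    len = [chunk, total - offset].min
    ranges << [offset, len]
    offset += len
  end
  ranges
end

# Assumed element count: 4_358_144 f32 values is about 17.4 MB.
n_floats = 4_358_144
ranges = chunk_ranges(n_floats, FLOATS_PER_CHUNK)
```

A single upload capped at FLOATS_PER_CHUNK would drop the second piece here, which is the truncation described above.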

.t_matmul(a, b) ⇒ `Object`

a^T * b (matches Mat#t_matmul). Both inputs uploaded transposed so ggml’s ne0 lines up with the summed-over K dimension.

# File 'lib/toy/ffi/tinynn.rb', line 1425

def self.t_matmul(a, b)
  sess = TinyNN.tnn_session_new(0)
  # Both tensors created as their transposed shape:
  #   ta_t: ne0=a.nrows (=K), ne1=a.ncols (=M)
  #   tb_t: ne0=b.nrows (=K), ne1=b.ncols (=N)
  ta_t = TinyNN.tnn_input_2d_f32(sess, a.ncols, a.nrows)
  tb_t = TinyNN.tnn_input_2d_f32(sess, b.ncols, b.nrows)
  tc = TinyNN.tnn_matmul(sess, ta_t, tb_t)
  TinyNN.tnn_realize(sess, tc)

  TinyNN.stage_transposed_and_upload(sess, ta_t, a)
  TinyNN.stage_transposed_and_upload(sess, tb_t, b)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.ncols, b.ncols)
  m = a.ncols
  n = b.ncols
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.transpose(a) ⇒ `Object`

Transpose. Returns a Mat with rows/cols swapped.

# File 'lib/toy/ffi/tinynn.rb', line 892

def self.transpose(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_transpose(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result shape: (a.ncols, a.nrows), i.e. rows and cols swapped.
  # ggml stores it contiguous after ggml_cont; row-major readout is
  # straightforward since the transposed tensor's ne0/ne1 already
  # match the target Mat's cols/rows.
  out = Mat.new(a.ncols, a.nrows)
  rin  = a.nrows
  cin  = a.ncols
  i = 0
  while i < cin
    j = 0
    while j < rin
      out.flat[i * rin + j] = TinyNN.tnn_scratch_get(sess, i * rin + j)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.upload_int_array(sess, tensor, indices) ⇒ `Object`

Upload an Array<Int> to a 1D int32 tensor in one FFI call. Uses Spinel’s :int_array spec (matz/spinel#474).



# File 'lib/toy/ffi/tinynn.rb', line 1017

def self.upload_int_array(sess, tensor, indices)
  TinyNN.tnn_upload_from_int_array(sess, tensor, indices, indices.length)
end

.upload_row_major(sess, tensor, mat) ⇒ `Object`

Stage a Mat row-major into scratch and upload to `tensor`. Use for elementwise inputs or for matmul’s A operand. For matmul’s B we also have upload_transposed below.

Uses Spinel’s :float_array spec (matz/spinel#474) for zero-copy transfer of mat.flat — single FFI call replaces O(n) per-element tnn_scratch_set loop.



# File 'lib/toy/ffi/tinynn.rb', line 1011

def self.upload_row_major(sess, tensor, mat)
  TinyNN.tnn_upload_from_float_array(sess, tensor, mat.flat, mat.nrows * mat.ncols)
end

.upload_transposed(sess, tensor, mat) ⇒ `Object`

Stage a Mat TRANSPOSED into scratch and upload. Use this for the `b` operand of build_matmul to get logical A*B semantics (ggml’s mul_mat is A*B^T natively).

# File 'lib/toy/ffi/tinynn.rb', line 1024

def self.upload_transposed(sess, tensor, mat)
  br = mat.nrows
  bc = mat.ncols
  i = 0
  while i < br
    j = 0
    while j < bc
      TinyNN.tnn_scratch_set(sess, j * br + i, mat.flat[i * bc + j])
      j = j + 1
    end
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tensor)
end
