Module: TinyNN

Defined in:
lib/toy/ffi/tinynn.rb

Class Method Details

.adam_step(param, grad, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object

Adam optimizer step. Matches the project’s adam_step_mat.

Returns an AdamStepResult wrapping three new Mats: the updated param, m, and v. The caller is responsible for swapping them back into wherever they came from (there is no persistent storage yet; once persistent sessions are wired into transformer.rb, m/v can stay on-device).

omc1, omc2 are pre-computed bias-correction divisors:

omc1 = 1 - beta1^t,  omc2 = 1 - beta2^t

where t is the step number. (The project tracks them as running products in AdamState.bc1 / bc2; both conventions work.)
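
As a rough illustration of the two bias-correction conventions, here is a minimal pure-Ruby sketch. The helper names (`adam_step_ref`, `omc_closed`, `omc_running`) are hypothetical and not part of TinyNN, and the update rule is the textbook Adam form, assumed rather than confirmed to match what tnn_adam_step_scratch computes (in particular where eps is added).

```ruby
# Textbook Adam step on flat Float arrays (hypothetical reference helper).
# omc1 / omc2 are the divisors 1 - b1**t and 1 - b2**t.
def adam_step_ref(param, grad, m, v, lr, b1, b2, eps, omc1, omc2)
  n = param.length
  new_param = Array.new(n, 0.0)
  new_m = Array.new(n, 0.0)
  new_v = Array.new(n, 0.0)
  n.times do |i|
    g = grad[i]
    new_m[i] = b1 * m[i] + (1.0 - b1) * g
    new_v[i] = b2 * v[i] + (1.0 - b2) * g * g
    m_hat = new_m[i] / omc1          # bias-corrected first moment
    v_hat = new_v[i] / omc2          # bias-corrected second moment
    new_param[i] = param[i] - lr * m_hat / (Math.sqrt(v_hat) + eps)
  end
  [new_param, new_m, new_v]
end

# Closed form: recompute the power at every step.
def omc_closed(beta, t)
  1.0 - beta**t
end

# Running-product form: multiply an accumulator by beta once per step.
# (Assumed to be what a bc1 / bc2 style accumulator holds.)
def omc_running(beta, t)
  bc = 1.0
  t.times { bc *= beta }
  1.0 - bc
end

b1 = 0.9
b2 = 0.999
p1, m1, v1 = adam_step_ref([1.0, -2.0], [0.5, -0.25], [0.0, 0.0], [0.0, 0.0],
                           0.01, b1, b2, 1e-8,
                           omc_closed(b1, 1), omc_closed(b2, 1))
```

On the first step with zero moments, the corrected update reduces to roughly lr * sign(grad), which makes a handy sanity check for either convention.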



# File 'lib/toy/ffi/tinynn.rb', line 1187

def self.adam_step(param, grad, m, v, lr, b1, b2, eps, omc1, omc2)
  sess = TinyNN.tnn_session_new(0)
  n = param.nrows * param.ncols
  # Stage param at [0..n), grad at [n..2n), m at [2n..3n), v at [3n..4n).
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, param.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, n + i, grad.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, 2 * n + i, m.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, 3 * n + i, v.flat[i])
    i = i + 1
  end

  TinyNN.tnn_adam_step_scratch(sess, n, lr, b1, b2, eps, omc1, omc2)

  new_param = Mat.new(param.nrows, param.ncols)
  new_mom_m = Mat.new(param.nrows, param.ncols)
  new_mom_v = Mat.new(param.nrows, param.ncols)
  i = 0
  while i < n
    new_param.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    new_mom_m.flat[i] = TinyNN.tnn_scratch_get(sess, 2 * n + i)
    new_mom_v.flat[i] = TinyNN.tnn_scratch_get(sess, 3 * n + i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  AdamStepResult.new(new_param, new_mom_m, new_mom_v)
end

.add(a, b) ⇒ Object

Element-wise a + b. Both Mats must have the same shape.



# File 'lib/toy/ffi/tinynn.rb', line 664

def self.add(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_add(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, b.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result is row-major same shape as a (ne0=cols, ne1=rows, flat
  # is row-major already since ggml_add preserves layout).
  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.alloc_1d_i32(sess, n) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 949

def self.alloc_1d_i32(sess, n)
  TinyNN.tnn_input_1d_i32(sess, n)
end

.alloc_2d(sess, rows, cols) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 945

def self.alloc_2d(sess, rows, cols)
  TinyNN.tnn_input_2d_f32(sess, rows, cols)
end

.build_add(sess, ta, tb) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 957

def self.build_add(sess, ta, tb)
  TinyNN.tnn_add(sess, ta, tb)
end

.build_gelu(sess, ta) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 961

def self.build_gelu(sess, ta)
  TinyNN.tnn_gelu(sess, ta)
end

.build_matmul(sess, ta, tb) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 953

def self.build_matmul(sess, ta, tb)
  TinyNN.tnn_matmul(sess, ta, tb)
end

.build_rms_norm(sess, tx, tgamma, eps) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 973

def self.build_rms_norm(sess, tx, tgamma, eps)
  TinyNN.tnn_rms_norm(sess, tx, tgamma, eps)
end

.build_scale(sess, ta, s) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 969

def self.build_scale(sess, ta, s)
  TinyNN.tnn_scale(sess, ta, s)
end

.build_softmax(sess, ta) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 965

def self.build_softmax(sess, ta)
  TinyNN.tnn_softmax(sess, ta)
end

.compute(sess) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 981

def self.compute(sess)
  TinyNN.tnn_compute(sess)
end

.cross_entropy_grad(logits, targets, n_pred) ⇒ Object

Fused softmax-cross-entropy gradient:

dlogits[i, v] = (softmax(logits)[i, v] - one_hot(targets[i])[v]) / n_pred

Composable from existing ops:

sm  = softmax(logits)
oh  = one_hot mat (built on the Ruby side; cheap: n_pred sets)
dlg = (sm - oh) / n_pred = scale(sm, 1/n_pred) + scale(oh, -1/n_pred)

`logits` is (n_pred, vocab); `targets` is an Array<Int> of length n_pred, where targets[i] in [0, vocab) is the desired class at row i.
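
A pure-Ruby reference for the same gradient can be useful when checking the FFI result. The sketch below works on nested Arrays rather than Mats; `softmax_row` and `cross_entropy_grad_ref` are hypothetical names introduced only for illustration.

```ruby
# Numerically stable softmax over one row (subtract the row max first).
def softmax_row(row)
  mx = row.max
  exps = row.map { |x| Math.exp(x - mx) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

# dlogits[i][v] = (softmax(logits)[i][v] - one_hot(targets[i])[v]) / n_pred
def cross_entropy_grad_ref(logits, targets)
  n_pred = logits.length
  logits.each_with_index.map do |row, i|
    sm = softmax_row(row)
    sm.each_with_index.map do |p, v|
      one_hot = (v == targets[i]) ? 1.0 : 0.0
      (p - one_hot) / n_pred
    end
  end
end

logits  = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
targets = [2, 0]
dlogits = cross_entropy_grad_ref(logits, targets)
```

Because each softmax row and each one-hot row both sum to 1, every row of the gradient sums to zero, which is a cheap invariant to assert on.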



# File 'lib/toy/ffi/tinynn.rb', line 1160

def self.cross_entropy_grad(logits, targets, n_pred)
  # 1. one-hot in Ruby.
  oh = Mat.new(logits.nrows, logits.ncols)
  i = 0
  while i < n_pred
    oh.flat[i * logits.ncols + targets[i]] = 1.0
    i = i + 1
  end
  # 2. softmax + scale + scale + add through FFI.
  sm = TinyNN.softmax(logits)
  inv_n = 1.0 / n_pred.to_f
  sm_s  = TinyNN.scale(sm, inv_n)
  oh_s  = TinyNN.scale(oh, -inv_n)
  TinyNN.add(sm_s, oh_s)
end

.download_matmul(sess, tensor, m, n) ⇒ Object

Download a matmul result. ggml’s mul_mat result has ne0=m, ne1=n; reading row-major (rows=m, cols=n) means scratch[j*m + i].
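
The index mapping can be tried out without a session. This is a minimal sketch on plain Arrays, with `read_matmul_scratch` as a hypothetical stand-in for the readback loop; it assumes only the layout stated above (element (i, j) of the m x n result lives at scratch[j*m + i]).

```ruby
# Copy an m x n result out of a buffer where element (i, j) is stored at
# j * m + i, into a row-major flat array where it lives at i * n + j.
def read_matmul_scratch(scratch, m, n)
  out = Array.new(m * n, 0.0)
  m.times do |i|
    n.times do |j|
      out[i * n + j] = scratch[j * m + i]
    end
  end
  out
end

m = 2
n = 3
# Logical result [[1, 2, 3], [4, 5, 6]] as it would sit in scratch:
scratch = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
out = read_matmul_scratch(scratch, m, n)
```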



# File 'lib/toy/ffi/tinynn.rb', line 1057

def self.download_matmul(sess, tensor, m, n)
  TinyNN.tnn_download(sess, tensor)
  out = Mat.new(m, n)
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end
  out
end

.download_row_major(sess, dl_handle, rows, cols) ⇒ Object

Download a tensor whose data is row-major (output of elementwise ops like add, gelu, rms_norm, softmax, scale).

The parameter is named `dl_handle` (not `tensor`) intentionally: Spinel unifies param-name types across the whole program, and `tensor` collides with a dead `upload_transposed` definition whose param got mistyped as mrb_int. With the name `tensor`, download_row_major’s tensor arg gets boxed at call sites and the (void *) cast inside fails.



# File 'lib/toy/ffi/tinynn.rb', line 1028

def self.download_row_major(sess, dl_handle, rows, cols)
  TinyNN.tnn_download(sess, dl_handle)
  out = Mat.new(rows, cols)
  n = rows * cols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  out
end

.download_to_mat(sess, dl_handle, rows, cols) ⇒ Object

Chunked Mat-roundtrip for large tensors. Unlike download_row_major, this bypasses the 16 MiB scratch (via tnn_download_to_f64_array’s internal chunking) and so works on weight-sized tensors loaded via the direct GGUF→FFI path. Mirrors `upload_row_major`.

Use this when you want a Mat copy of a persistent FFI tensor: inspection, Mat-side fine-tuning, export. For small graph intermediates (norms / per-step logits) the scratch-based download_row_major is fine and slightly faster.



# File 'lib/toy/ffi/tinynn.rb', line 1049

def self.download_to_mat(sess, dl_handle, rows, cols)
  out = Mat.new(rows, cols)
  TinyNN.tnn_download_to_f64_array(sess, dl_handle, out.flat, rows * cols)
  out
end

.embed_back(d_out, indices, vocab_size) ⇒ Object

Embedding backward: scatter-add d_out rows into a vocab-sized table. `d_out` is (n_idx, d_model). `indices` is Array<Int>. Returns a (vocab_size, d_model) Mat where out[indices[i]] += d_out[i] for each i.
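
The scatter-add semantics matter most when an index repeats. A minimal pure-Ruby sketch on nested Arrays, with the hypothetical name `embed_back_ref`, shows the accumulation:

```ruby
# Scatter-add: row i of d_out is added into row indices[i] of the result.
# Rows that no index points at stay zero; repeated indices accumulate.
def embed_back_ref(d_out, indices, vocab_size)
  d_model = d_out.first.length
  out = Array.new(vocab_size) { Array.new(d_model, 0.0) }
  indices.each_with_index do |tok, i|
    d_model.times { |k| out[tok][k] += d_out[i][k] }
  end
  out
end

d_out   = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
indices = [1, 3, 1]   # token 1 appears twice
grad    = embed_back_ref(d_out, indices, 4)
```

Here row 1 of `grad` receives the sum of d_out rows 0 and 2, while rows 0 and 2 of `grad` remain zero.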



# File 'lib/toy/ffi/tinynn.rb', line 1279

def self.embed_back(d_out, indices, vocab_size)
  n_idx = indices.length
  sess  = TinyNN.tnn_session_new(0)
  td    = TinyNN.tnn_input_2d_f32(sess, d_out.nrows, d_out.ncols)
  tidx  = TinyNN.tnn_input_1d_i32(sess, n_idx)
  # Shape reference for the result: a freshly-allocated (vocab, d) tensor.
  tshape = TinyNN.tnn_input_2d_f32(sess, vocab_size, d_out.ncols)
  tout  = TinyNN.tnn_get_rows_back(sess, td, tidx, tshape)
  TinyNN.tnn_realize(sess, tout)

  TinyNN.stage_row_major_and_upload(sess, td, d_out)

  i = 0
  while i < n_idx
    TinyNN.tnn_scratch_set_i32(sess, i, indices[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tidx)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tout)

  out = Mat.new(vocab_size, d_out.ncols)
  n = vocab_size * d_out.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.embed_lookup(table, indices) ⇒ Object

Embedding lookup: gather table rows by indices. `table` is a (vocab, d_model) Mat; `indices` is Array<Int>. Returns an (indices.length, d_model) Mat with table[indices[i]] in row i.
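
For comparison against the FFI path, the gather can be written in one line of plain Ruby. `embed_lookup_ref` is a hypothetical helper operating on nested Arrays, not on Mats:

```ruby
# Gather: output row i is a copy of table row indices[i].
def embed_lookup_ref(table, indices)
  indices.map { |tok| table[tok].dup }
end

table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
rows  = embed_lookup_ref(table, [2, 0, 2])
```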



# File 'lib/toy/ffi/tinynn.rb', line 1245

def self.embed_lookup(table, indices)
  n_idx = indices.length
  sess  = TinyNN.tnn_session_new(0)
  ttab  = TinyNN.tnn_input_2d_f32(sess, table.nrows, table.ncols)
  tidx  = TinyNN.tnn_input_1d_i32(sess, n_idx)
  tout  = TinyNN.tnn_get_rows(sess, ttab, tidx)
  TinyNN.tnn_realize(sess, tout)

  TinyNN.stage_row_major_and_upload(sess, ttab, table)

  i = 0
  while i < n_idx
    TinyNN.tnn_scratch_set_i32(sess, i, indices[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tidx)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tout)

  out = Mat.new(n_idx, table.ncols)
  n = n_idx * table.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.ffn_pipeline(h, w1, w2) ⇒ Object

FFN-shaped chain: result = gelu(h * w1) * w2.

Calls three op-sized sessions, each reusing the cached engine (the backend + scheduler init runs once, not three times). Chaining all three ops in one ggml graph is theoretically possible but needs explicit intermediate transposes, because mul_mat’s result has ne0 swapped relative to the next op’s k-dim. Sticking to three sessions until we have a clean chain-friendly layout convention.



# File 'lib/toy/ffi/tinynn.rb', line 1366

def self.ffn_pipeline(h, w1, w2)
  pre    = TinyNN.matmul(h, w1)
  hidden = TinyNN.gelu(pre)
  TinyNN.matmul(hidden, w2)
end

.gelu(a) ⇒ Object

Element-wise GeLU (tanh approximation, matches project’s feed_forward).



# File 'lib/toy/ffi/tinynn.rb', line 703

def self.gelu(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_gelu(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.gelu_back(x, dh) ⇒ Object

GeLU backward: dx = dh * d/dx GeLU(x) (tanh approx). Skips ggml entirely — uses tnn_gelu_back_scratch which operates on the session’s scratch buffer directly. CPU-only.
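
For reference, the derivative of the tanh-approximated GeLU can be written out and checked against a finite difference. This is a sketch of the standard formula with hypothetical helper names; it is assumed, not confirmed, to be what tnn_gelu_back_scratch evaluates.

```ruby
GELU_C = Math.sqrt(2.0 / Math::PI)
GELU_A = 0.044715

# Forward: 0.5 * x * (1 + tanh(c * (x + a * x^3)))
def gelu_tanh(x)
  0.5 * x * (1.0 + Math.tanh(GELU_C * (x + GELU_A * x**3)))
end

# Backward: dh times the analytic derivative of gelu_tanh at x.
def gelu_back_ref(x, dh)
  t  = Math.tanh(GELU_C * (x + GELU_A * x**3))
  dt = (1.0 - t * t) * GELU_C * (1.0 + 3.0 * GELU_A * x * x)
  dh * (0.5 * (1.0 + t) + 0.5 * x * dt)
end

# Central finite difference, used only to sanity-check the formula.
def gelu_fd(x, h = 1e-6)
  (gelu_tanh(x + h) - gelu_tanh(x - h)) / (2.0 * h)
end
```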



# File 'lib/toy/ffi/tinynn.rb', line 1125

def self.gelu_back(x, dh)
  sess = TinyNN.tnn_session_new(0)
  n = x.nrows * x.ncols
  # Stage x at [0..n), dh at [n..2n)
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, x.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, n + i, dh.flat[i])
    i = i + 1
  end
  TinyNN.tnn_gelu_back_scratch(sess, n)
  out = Mat.new(x.nrows, x.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, 2 * n + i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.matmul(a, b) ⇒ Object

a * b where both are project Mats (row-major f64). Returns a Mat (rows = a.nrows, cols = b.ncols).

Implementation note: ggml_mul_mat computes A * B^T. To get A * B we upload b TRANSPOSED: b is (br x bc) row-major; we present it to ggml as a (bc x br) tensor whose rows are b’s columns. ggml then computes A * (b^T)^T = A * b (the transpose inside ggml undoes the one applied on upload).
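
The transpose trick can be reproduced on plain Arrays. In this sketch, `mul_rows_by_rows` is a hypothetical stand-in for a rows-against-rows contraction (A times the transpose of its second operand), and `stage_transposed` mirrors the scratch[j*br + i] = b[i*bc + j] staging. It returns a row-major result and does not model ggml's output layout.

```ruby
# Contract rows of a (a_rows x k) against rows of bt (bt_rows x k):
# out[i, j] = sum_p a[i, p] * bt[j, p], i.e. A * Bt^T.
def mul_rows_by_rows(a, a_rows, k, bt, bt_rows)
  out = Array.new(a_rows * bt_rows, 0.0)
  a_rows.times do |i|
    bt_rows.times do |j|
      acc = 0.0
      k.times { |p| acc += a[i * k + p] * bt[j * k + p] }
      out[i * bt_rows + j] = acc
    end
  end
  out
end

# Write b (br x bc, row-major) into a buffer as its transpose (bc x br).
def stage_transposed(b, br, bc)
  staged = Array.new(br * bc, 0.0)
  br.times do |i|
    bc.times { |j| staged[j * br + i] = b[i * bc + j] }
  end
  staged
end

a  = [1.0, 2.0, 3.0, 4.0]               # 2 x 2
b  = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]    # 2 x 3
bt = stage_transposed(b, 2, 3)          # 3 x 2
c  = mul_rows_by_rows(a, 2, 2, bt, 3)   # 2 x 3, equals a * b
```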



# File 'lib/toy/ffi/tinynn.rb', line 609

def self.matmul(a, b)
  sess = TinyNN.tnn_session_new(0)   # 0 = CPU

  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  # ggml-side tensor for b^T: rows=b.ncols, cols=b.nrows.
  tb_t = TinyNN.tnn_input_2d_f32(sess, b.ncols, b.nrows)
  tc = TinyNN.tnn_matmul(sess, ta, tb_t)
  TinyNN.tnn_realize(sess, tc)

  # Upload a (row-major flat).
  i = 0
  na = a.nrows * a.ncols
  while i < na
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  # Upload b TRANSPOSED into scratch: scratch[j*b.nrows + i] = b[i,j].
  bc = b.ncols
  br = b.nrows
  i = 0
  while i < br
    j = 0
    while j < bc
      TinyNN.tnn_scratch_set(sess, j * br + i, b.flat[i * bc + j])
      j = j + 1
    end
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb_t)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result tensor ggml shape: ne0=m=a.nrows, ne1=n=b.ncols. Read into
  # row-major Mat[i][j] (= flat[i*ncols+j]) from scratch[j*m + i].
  out = Mat.new(a.nrows, b.ncols)
  m = a.nrows
  n = b.ncols
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.matmul_t(a, b) ⇒ Object

a * b^T natively (matches Mat#matmul_t). Faster than .matmul(b) for the same shapes because there’s no Ruby-side transpose of b on upload.



# File 'lib/toy/ffi/tinynn.rb', line 1374

def self.matmul_t(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_matmul(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  TinyNN.stage_row_major_and_upload(sess, ta, a)
  TinyNN.stage_row_major_and_upload(sess, tb, b)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, b.nrows)
  m = a.nrows
  n = b.nrows
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.mul(a, b) ⇒ Object

Element-wise multiply c = a * b. Matching shape required. One-shot wrapper. Used in SwiGLU between silu(gate) and up.



# File 'lib/toy/ffi/tinynn.rb', line 765

def self.mul(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_mul(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, b.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.persistent_free(sess) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 941

def self.persistent_free(sess)
  TinyNN.tnn_session_free(sess)
end

.persistent_new(prefer_cuda) ⇒ Object


Persistent-session API: build a graph once, run it many times.

Workflow:

sess = TinyNN.persistent_new(0)
ta   = TinyNN.alloc_2d(sess, rows, cols)
tb   = TinyNN.alloc_2d(sess, rows, cols)
tc   = TinyNN.build_matmul(sess, ta, tb)   # or build_add / build_gelu / ...
TinyNN.realize(sess, tc)                    # allocates all backend buffers
# Upload weights once:
TinyNN.upload_row_major(sess, tb, w_mat)
# Per training step:
loop do
  TinyNN.upload_row_major(sess, ta, input_mat)
  TinyNN.compute(sess)
  result = TinyNN.download_matmul(sess, tc, m, n)    # transposed readback
end
TinyNN.persistent_free(sess)

The win over the one-shot wrappers (TinyNN.matmul etc.) is that ggml_init / ggml_backend_sched_alloc_graph runs once instead of per op, and backend buffers (the CUDA-side storage for tensors) are allocated once instead of per call. At the toy LM’s transformer shapes (see ab_smoke_big), this should be enough to flip CUDA from losing to the native path (as it does at small shapes) to winning.



# File 'lib/toy/ffi/tinynn.rb', line 937

def self.persistent_new(prefer_cuda)
  TinyNN.tnn_session_new(prefer_cuda)
end

.realize(sess, result) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 977

def self.realize(sess, result)
  TinyNN.tnn_realize(sess, result)
end

.rms_norm(x, gamma, eps) ⇒ Object

RMSNorm(x) * gamma. x is (T, d_model), gamma is Array<Float> of length d_model. eps is a required argument; pass 1e-5 to match the project’s rms_norm helper.
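
A pure-Ruby reference on nested Arrays makes the per-row computation explicit. `rms_norm_ref` is a hypothetical helper, and placing eps inside the square root is an assumption about the backend rather than something stated here.

```ruby
# Per row: y[k] = x[k] / sqrt(mean(x^2) + eps) * gamma[k]
def rms_norm_ref(x, gamma, eps)
  x.map do |row|
    mean_sq = row.sum { |v| v * v } / row.length
    inv_rms = 1.0 / Math.sqrt(mean_sq + eps)
    row.each_with_index.map { |v, k| v * inv_rms * gamma[k] }
  end
end

y = rms_norm_ref([[3.0, 4.0]], [1.0, 2.0], 0.0)
```

With gamma set to all ones and eps at zero, the mean square of each output row is 1, which gives a quick check on any implementation.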



# File 'lib/toy/ffi/tinynn.rb', line 803

def self.rms_norm(x, gamma, eps)
  sess = TinyNN.tnn_session_new(0)
  tx = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  # gamma as a 1-row tensor: shape (1, d_model). ggml will broadcast
  # across x's leading dimension during the mul.
  tg = TinyNN.tnn_input_2d_f32(sess, 1, x.ncols)
  tc = TinyNN.tnn_rms_norm(sess, tx, tg, eps)
  TinyNN.tnn_realize(sess, tc)

  # Upload x.
  nx = x.nrows * x.ncols
  i = 0
  while i < nx
    TinyNN.tnn_scratch_set(sess, i, x.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tx)

  # Upload gamma (length d_model).
  i = 0
  while i < x.ncols
    TinyNN.tnn_scratch_set(sess, i, gamma[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tg)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(x.nrows, x.ncols)
  i = 0
  while i < nx
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.rms_norm_back(x, dy, eps) ⇒ Object

d/dx of plain RMSNorm(x) given dy (= grad of normalized output). No gamma — caller is responsible for the gamma part of the chain rule.

Note on arg order: ggml’s header says “a - x, b - dy” but the CPU source (ggml-cpu/ops.cpp ggml_compute_forward_rms_norm_back_f32) treats src0 as gradients and src1 as the forward input. We pass (dy, x) to match the source.



# File 'lib/toy/ffi/tinynn.rb', line 1101

def self.rms_norm_back(x, dy, eps)
  sess = TinyNN.tnn_session_new(0)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  tx  = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  tc  = TinyNN.tnn_rms_norm_back(sess, tdy, tx, eps)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.stage_row_major_and_upload(sess, tx,  x)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(x.nrows, x.ncols)
  n = x.nrows * x.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.scale(a, s) ⇒ Object

Element-wise a * s for scalar s. Returns a new Mat (out-of-place).



# File 'lib/toy/ffi/tinynn.rb', line 1440

def self.scale(a, s)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_scale(sess, ta, s)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.sgd_step(param, grad, lr) ⇒ Object

SGD parameter update: param_new = param - lr * grad. Returns a fresh Mat with the updated parameter. The caller is responsible for swapping it back into wherever param came from (we don’t have persistent-session storage yet).

Composed from TinyNN.add and TinyNN.scale rather than ggml_opt_step_sgd (which would need an sgd_params tensor with (alpha, weight_decay)). A single fused op would be faster; this version is the cleanest one with the primitives we already have.
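
The same composition can be sketched on flat Arrays to show that the update is out-of-place. `scale_ref`, `add_ref` and `sgd_step_ref` are hypothetical stand-ins for the TinyNN wrappers:

```ruby
# Out-of-place scalar multiply.
def scale_ref(a, s)
  a.map { |v| v * s }
end

# Out-of-place element-wise add; both inputs must have the same length.
def add_ref(a, b)
  raise ArgumentError, "shape mismatch" unless a.length == b.length
  a.each_index.map { |i| a[i] + b[i] }
end

# param_new = param + (-lr) * grad
def sgd_step_ref(param, grad, lr)
  add_ref(param, scale_ref(grad, -lr))
end

param   = [1.0, 2.0]
updated = sgd_step_ref(param, [0.5, -1.0], 0.1)
```

`param` is left untouched, so the caller has to store `updated` back itself.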



# File 'lib/toy/ffi/tinynn.rb', line 1238

def self.sgd_step(param, grad, lr)
  TinyNN.add(param, TinyNN.scale(grad, -lr))
end

.silu(a) ⇒ Object

Element-wise SiLU (x * sigmoid(x)), llama-family activation. One-shot wrapper (slow per-call: session + graph + free); used by ab_smoke_silu and as a building block. The persistent-session FFN path doesn’t go through this — it builds silu into a fused graph.



# File 'lib/toy/ffi/tinynn.rb', line 735

def self.silu(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_silu(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.silu_back(x, dy) ⇒ Object

Backward for SiLU: given x (the input to silu) and dy (gradient from upstream), returns dx. dx = dy * (sigmoid(x) * (1 + x * (1 - sigmoid(x)))).
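
The formula is easy to verify numerically. The scalar sketch below uses hypothetical helper names and compares the analytic derivative with a central finite difference:

```ruby
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

def silu(x)
  x * sigmoid(x)
end

# dx = dy * (sigmoid(x) * (1 + x * (1 - sigmoid(x))))
def silu_back_ref(x, dy)
  s = sigmoid(x)
  dy * (s * (1.0 + x * (1.0 - s)))
end

# Central finite difference of silu, used only as a cross-check.
def silu_fd(x, h = 1e-6)
  (silu(x + h) - silu(x - h)) / (2.0 * h)
end
```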



# File 'lib/toy/ffi/tinynn.rb', line 1314

def self.silu_back(x, dy)
  sess = TinyNN.tnn_session_new(0)
  tx  = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  tc  = TinyNN.tnn_silu_back(sess, tx, tdy)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tx,  x)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(x.nrows, x.ncols)
  n = x.nrows * x.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.softmax(a) ⇒ Object

Per-row softmax. Matches the project’s softmax_rows! (out-of-place).



# File 'lib/toy/ffi/tinynn.rb', line 844

def self.softmax(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_softmax(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.softmax_back(a_softmax, dy) ⇒ Object

d/dx of per-row softmax. `a_softmax` is the softmax output; `dy` is the gradient of the output. (ggml source: src0=dy, src1=y_softmax.)
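
For one row, the standard softmax backward is dx[k] = y[k] * (dy[k] - sum_j y[j] * dy[j]). The sketch below implements that textbook form under the hypothetical name `softmax_back_row`; it is assumed to agree with the ggml op but has not been checked against it.

```ruby
# y is one row of softmax output, dy the upstream gradient for that row.
def softmax_back_row(y, dy)
  dot = 0.0
  y.each_index { |k| dot += y[k] * dy[k] }
  y.each_index.map { |k| y[k] * (dy[k] - dot) }
end

y  = [0.2, 0.3, 0.5]
dx = softmax_back_row(y, [1.0, 0.0, 0.0])
```

Two properties make good assertions: the result always sums to zero along the row, and a constant dy yields a zero gradient, since shifting all logits equally leaves softmax unchanged.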



# File 'lib/toy/ffi/tinynn.rb', line 1337

def self.softmax_back(a_softmax, dy)
  sess = TinyNN.tnn_session_new(0)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  ta  = TinyNN.tnn_input_2d_f32(sess, a_softmax.nrows, a_softmax.ncols)
  tc  = TinyNN.tnn_softmax_back(sess, tdy, ta)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.stage_row_major_and_upload(sess, ta,  a_softmax)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(a_softmax.nrows, a_softmax.ncols)
  n = a_softmax.nrows * a_softmax.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.stage_row_major_and_upload(sess, target, m) ⇒ Object

Internal: stage `m` row-major into scratch, then bulk-upload to `target`.



# File 'lib/toy/ffi/tinynn.rb', line 1084

def self.stage_row_major_and_upload(sess, target, m)
  n = m.nrows * m.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, m.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, target)
end

.stage_transposed_and_upload(sess, target, b) ⇒ Object

Internal: stage b TRANSPOSED into scratch, then bulk-upload to `target`. The C side does both the transpose and a chunked upload, so the call works for tensors larger than the 16 MiB scratch buffer. (Qwen2.5-0.5B’s ffn_gate is 17.4 MB; the old per-element + single bulk-upload path silently truncated at the 4M-float boundary, leaving the tail uninitialised and producing 1e+37 magnitudes in the subsequent matmul output.)
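
The arithmetic behind the truncation, and the shape of a chunked transfer, can be sketched in plain Ruby. The constants assume the scratch holds 4-byte floats (consistent with 16 MiB mapping to a 4M-float boundary), and `chunk_ranges` is a hypothetical helper; the actual chunking lives on the C side and may differ.

```ruby
SCRATCH_BYTES  = 16 * 1024 * 1024                # 16 MiB scratch buffer
BYTES_PER_F32  = 4                               # assumed element size
SCRATCH_FLOATS = SCRATCH_BYTES / BYTES_PER_F32   # 4_194_304 floats

# Split `total` elements into [offset, count] pieces of at most `chunk`.
def chunk_ranges(total, chunk)
  ranges = []
  offset = 0
  while offset < total
    count = [chunk, total - offset].min
    ranges << [offset, count]
    offset += count
  end
  ranges
end

# Roughly 17.4 MB of f32 data does not fit in a single scratch pass:
n_floats = 17_400_000 / BYTES_PER_F32
ranges   = chunk_ranges(n_floats, SCRATCH_FLOATS)
```

A single bulk upload would cover only the first range; the second range is the tail that the old path left uninitialised.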



# File 'lib/toy/ffi/tinynn.rb', line 1079

def self.stage_transposed_and_upload(sess, target, b)
  TinyNN.tnn_upload_transposed_f64(sess, target, b.flat, b.nrows, b.ncols)
end

.t_matmul(a, b) ⇒ Object

a^T * b (matches Mat#t_matmul). Both inputs uploaded transposed so ggml’s ne0 lines up with the summed-over K dimension.



# File 'lib/toy/ffi/tinynn.rb', line 1406

def self.t_matmul(a, b)
  sess = TinyNN.tnn_session_new(0)
  # Both tensors created as their transposed shape:
  #   ta_t: ne0=a.nrows (=K), ne1=a.ncols (=M)
  #   tb_t: ne0=b.nrows (=K), ne1=b.ncols (=N)
  ta_t = TinyNN.tnn_input_2d_f32(sess, a.ncols, a.nrows)
  tb_t = TinyNN.tnn_input_2d_f32(sess, b.ncols, b.nrows)
  tc = TinyNN.tnn_matmul(sess, ta_t, tb_t)
  TinyNN.tnn_realize(sess, tc)

  TinyNN.stage_transposed_and_upload(sess, ta_t, a)
  TinyNN.stage_transposed_and_upload(sess, tb_t, b)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.ncols, b.ncols)
  m = a.ncols
  n = b.ncols
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.transpose(a) ⇒ Object

Transpose. Returns a Mat with rows/cols swapped.



# File 'lib/toy/ffi/tinynn.rb', line 873

def self.transpose(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_transpose(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result shape: (a.ncols, a.nrows), i.e. rows and cols swapped.
  # ggml stores it contiguous after ggml_cont; row-major readout is
  # straightforward since the transposed tensor's ne0/ne1 already
  # match the target Mat's cols/rows.
  out = Mat.new(a.ncols, a.nrows)
  rin  = a.nrows
  cin  = a.ncols
  i = 0
  while i < cin
    j = 0
    while j < rin
      out.flat[i * rin + j] = TinyNN.tnn_scratch_get(sess, i * rin + j)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.upload_int_array(sess, tensor, indices) ⇒ Object

Upload an Array<Int> to a 1D int32 tensor in one FFI call. Uses Spinel’s :int_array spec (matz/spinel#474).



# File 'lib/toy/ffi/tinynn.rb', line 998

def self.upload_int_array(sess, tensor, indices)
  TinyNN.tnn_upload_from_int_array(sess, tensor, indices, indices.length)
end

.upload_row_major(sess, tensor, mat) ⇒ Object

Stage a Mat row-major into scratch and upload to `tensor`. Use for elementwise inputs or for matmul’s A operand. For matmul’s B we also have upload_transposed below.

Uses Spinel’s :float_array spec (matz/spinel#474) for zero-copy transfer of mat.flat — single FFI call replaces O(n) per-element tnn_scratch_set loop.



# File 'lib/toy/ffi/tinynn.rb', line 992

def self.upload_row_major(sess, tensor, mat)
  TinyNN.tnn_upload_from_float_array(sess, tensor, mat.flat, mat.nrows * mat.ncols)
end

.upload_transposed(sess, tensor, mat) ⇒ Object

Stage a Mat TRANSPOSED into scratch and upload. Use this for the `b` operand of build_matmul to get logical A*B semantics (ggml’s mul_mat is A*B^T natively).



# File 'lib/toy/ffi/tinynn.rb', line 1005

def self.upload_transposed(sess, tensor, mat)
  br = mat.nrows
  bc = mat.ncols
  i = 0
  while i < br
    j = 0
    while j < bc
      TinyNN.tnn_scratch_set(sess, j * br + i, mat.flat[i * bc + j])
      j = j + 1
    end
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tensor)
end