Module: TinyNN

Defined in:
lib/toy/ffi/tinynn.rb

Class Method Summary

Class Method Details

.adam_step(param, grad, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object

Adam optimizer step. Matches the project’s adam_step_mat.

Returns an AdamStepResult wrapping three new Mats, in the order param_new, m_new, v_new. The caller is responsible for swapping them back into wherever they came from (there is no persistent storage yet; once persistent sessions are wired into transformer.rb, m and v can stay on-device).

omc1, omc2 are pre-computed bias-correction divisors:

omc1 = 1 - beta1^t,  omc2 = 1 - beta2^t

where t is the step number. (The project tracks them as running products in AdamState.bc1 / bc2; both conventions work.)
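
For reference, the arithmetic of one step can be sketched in plain Ruby on flat Float arrays. This is a minimal sketch that assumes the textbook Adam update with omc1 and omc2 used as divisors, as described above; `adam_step_ref` is a hypothetical helper (not part of TinyNN), and the placement of eps outside the square root is an assumption about tnn_adam_step_scratch rather than something the source confirms.

```ruby
# Plain-Ruby reference for one Adam step on flat Float arrays.
# Hypothetical helper for illustration only; assumes textbook Adam
# with eps added outside the square root.
def adam_step_ref(param, grad, m, v, lr, b1, b2, eps, omc1, omc2)
  n = param.length
  new_param = Array.new(n, 0.0)
  new_m = Array.new(n, 0.0)
  new_v = Array.new(n, 0.0)
  n.times do |i|
    g = grad[i]
    new_m[i] = b1 * m[i] + (1.0 - b1) * g
    new_v[i] = b2 * v[i] + (1.0 - b2) * g * g
    m_hat = new_m[i] / omc1   # bias-corrected first moment
    v_hat = new_v[i] / omc2   # bias-corrected second moment
    new_param[i] = param[i] - lr * m_hat / (Math.sqrt(v_hat) + eps)
  end
  [new_param, new_m, new_v]
end

# Step t = 1, so omc1 = 1 - b1**1 and omc2 = 1 - b2**1.
b1 = 0.9
b2 = 0.999
t = 1
p1, m1, v1 = adam_step_ref([1.0], [0.5], [0.0], [0.0], 0.1, b1, b2, 1e-8,
                           1.0 - b1**t, 1.0 - b2**t)
```

At t = 1 the bias correction exactly undoes the (1 - beta) factors, so the first update moves the parameter by roughly lr in the direction opposite to the gradient's sign.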



# File 'lib/toy/ffi/tinynn.rb', line 1206

def self.adam_step(param, grad, m, v, lr, b1, b2, eps, omc1, omc2)
  sess = TinyNN.tnn_session_new(0)
  n = param.nrows * param.ncols
  # Stage param at [0..n), grad at [n..2n), m at [2n..3n), v at [3n..4n).
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, param.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, n + i, grad.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, 2 * n + i, m.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, 3 * n + i, v.flat[i])
    i = i + 1
  end

  TinyNN.tnn_adam_step_scratch(sess, n, lr, b1, b2, eps, omc1, omc2)

  new_param = Mat.new(param.nrows, param.ncols)
  new_mom_m = Mat.new(param.nrows, param.ncols)
  new_mom_v = Mat.new(param.nrows, param.ncols)
  i = 0
  while i < n
    new_param.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    new_mom_m.flat[i] = TinyNN.tnn_scratch_get(sess, 2 * n + i)
    new_mom_v.flat[i] = TinyNN.tnn_scratch_get(sess, 3 * n + i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  AdamStepResult.new(new_param, new_mom_m, new_mom_v)
end

.add(a, b) ⇒ Object

Element-wise a + b. Both Mats must have the same shape.



# File 'lib/toy/ffi/tinynn.rb', line 683

def self.add(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_add(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, b.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result is row-major same shape as a (ne0=cols, ne1=rows, flat
  # is row-major already since ggml_add preserves layout).
  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.alloc_1d_i32(sess, n) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 968

def self.alloc_1d_i32(sess, n)
  TinyNN.tnn_input_1d_i32(sess, n)
end

.alloc_2d(sess, rows, cols) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 964

def self.alloc_2d(sess, rows, cols)
  TinyNN.tnn_input_2d_f32(sess, rows, cols)
end

.build_add(sess, ta, tb) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 976

def self.build_add(sess, ta, tb)
  TinyNN.tnn_add(sess, ta, tb)
end

.build_gelu(sess, ta) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 980

def self.build_gelu(sess, ta)
  TinyNN.tnn_gelu(sess, ta)
end

.build_matmul(sess, ta, tb) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 972

def self.build_matmul(sess, ta, tb)
  TinyNN.tnn_matmul(sess, ta, tb)
end

.build_rms_norm(sess, tx, tgamma, eps) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 992

def self.build_rms_norm(sess, tx, tgamma, eps)
  TinyNN.tnn_rms_norm(sess, tx, tgamma, eps)
end

.build_scale(sess, ta, s) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 988

def self.build_scale(sess, ta, s)
  TinyNN.tnn_scale(sess, ta, s)
end

.build_softmax(sess, ta) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 984

def self.build_softmax(sess, ta)
  TinyNN.tnn_softmax(sess, ta)
end

.compute(sess) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 1000

def self.compute(sess)
  TinyNN.tnn_compute(sess)
end

.cross_entropy_grad(logits, targets, n_pred) ⇒ Object

Fused softmax-cross-entropy gradient:

dlogits[i, v] = (softmax(logits)[i, v] - one_hot(targets[i])[v]) / n_pred

Composable from existing ops:

sm  = softmax(logits)
oh  = one_hot mat (built on the Ruby side; cheap: n_pred sets)
dlg = (sm - oh) / n_pred = scale(sm, 1/n_pred) + scale(oh, -1/n_pred)

`logits` is (n_pred, vocab); `targets` is Array<Int> of length n_pred, where targets[i] in [0, vocab) is the desired class at row i.
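
The formula above can be checked against a plain-Ruby reference on nested Arrays. This is a sketch only: `cross_entropy_grad_ref` is a hypothetical helper that computes the fused expression directly instead of composing FFI ops, and it takes n_pred from the number of rows.

```ruby
# Plain-Ruby reference for dlogits = (softmax(logits) - one_hot) / n_pred.
# Hypothetical helper; mirrors the formula, not the FFI code path.
def cross_entropy_grad_ref(logits, targets)
  n_pred = logits.length
  logits.each_with_index.map do |row, i|
    mx = row.max                              # subtract the max for stability
    exps = row.map { |z| Math.exp(z - mx) }
    total = exps.sum
    row.each_index.map do |v|
      one_hot = (v == targets[i]) ? 1.0 : 0.0
      (exps[v] / total - one_hot) / n_pred
    end
  end
end

dlg = cross_entropy_grad_ref([[0.0, 0.0], [1.0, 1.0]], [0, 1])
```

Each row of the result sums to zero, because the softmax row and the one-hot row both sum to one.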



# File 'lib/toy/ffi/tinynn.rb', line 1179

def self.cross_entropy_grad(logits, targets, n_pred)
  # 1. one-hot in Ruby.
  oh = Mat.new(logits.nrows, logits.ncols)
  i = 0
  while i < n_pred
    oh.flat[i * logits.ncols + targets[i]] = 1.0
    i = i + 1
  end
  # 2. softmax + scale + scale + add through FFI.
  sm = TinyNN.softmax(logits)
  inv_n = 1.0 / n_pred.to_f
  sm_s  = TinyNN.scale(sm, inv_n)
  oh_s  = TinyNN.scale(oh, -inv_n)
  TinyNN.add(sm_s, oh_s)
end

.download_matmul(sess, tensor, m, n) ⇒ Object

Download a matmul result. ggml’s mul_mat result has ne0=m, ne1=n, so element (i, j) of the row-major Mat (rows=m, cols=n) is read from scratch[j*m + i].
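
The index mapping can be illustrated without a session. In this sketch, `read_matmul_result` is a hypothetical helper and `scratch` is an ordinary Ruby Array standing in for the FFI scratch buffer.

```ruby
# Reorder a flat buffer addressed as scratch[j*m + i] into a
# row-major (m x n) flat Array. Hypothetical helper for illustration.
def read_matmul_result(scratch, m, n)
  out = Array.new(m * n, 0.0)
  m.times do |i|
    n.times do |j|
      out[i * n + j] = scratch[j * m + i]
    end
  end
  out
end

# m = 2, n = 3; each value encodes its (row, col) as row*10 + col.
scratch = [0.0, 10.0, 1.0, 11.0, 2.0, 12.0]
row_major = read_matmul_result(scratch, 2, 3)
```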



# File 'lib/toy/ffi/tinynn.rb', line 1076

def self.download_matmul(sess, tensor, m, n)
  TinyNN.tnn_download(sess, tensor)
  out = Mat.new(m, n)
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end
  out
end

.download_row_major(sess, dl_handle, rows, cols) ⇒ Object

Download a tensor whose data is row-major (output of elementwise ops like add, gelu, rms_norm, softmax, scale).

The param is intentionally named `dl_handle` (not `tensor`): Spinel unifies param-name types across the whole program, and `tensor` collides with a dead `upload_transposed` definition whose param got mistyped as mrb_int. With the name `tensor`, download_row_major’s tensor arg gets boxed at call sites and the (void *) cast inside fails.



# File 'lib/toy/ffi/tinynn.rb', line 1047

def self.download_row_major(sess, dl_handle, rows, cols)
  TinyNN.tnn_download(sess, dl_handle)
  out = Mat.new(rows, cols)
  n = rows * cols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  out
end

.download_to_mat(sess, dl_handle, rows, cols) ⇒ Object

Chunked Mat-roundtrip for large tensors. Unlike download_row_major, this bypasses the 16 MiB scratch (via tnn_download_to_f64_array’s internal chunking) and so works on weight-sized tensors loaded via the direct GGUF→FFI path. Mirrors `upload_row_major`.

Use this when you want a Mat copy of a persistent FFI tensor: inspection, Mat-side fine-tuning, export. For small graph intermediates (norms, per-step logits) the scratch-based download_row_major is fine and slightly faster.



# File 'lib/toy/ffi/tinynn.rb', line 1068

def self.download_to_mat(sess, dl_handle, rows, cols)
  out = Mat.new(rows, cols)
  TinyNN.tnn_download_to_f64_array(sess, dl_handle, out.flat, rows * cols)
  out
end

.embed_back(d_out, indices, vocab_size) ⇒ Object

Embedding backward: scatter-add d_out rows into a vocab-sized table. `d_out` is (n_idx, d_model). `indices` is Array<Int>. Returns a (vocab_size, d_model) Mat where out[indices[i]] += d_out[i].
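
A plain-Ruby sketch of the scatter-add, useful as a reference when an index repeats. `embed_back_ref` is a hypothetical helper working on nested Arrays rather than Mats.

```ruby
# Scatter-add rows of d_out into a (vocab_size x d_model) table.
# Hypothetical helper for illustration only.
def embed_back_ref(d_out, indices, vocab_size)
  d_model = d_out[0].length
  out = Array.new(vocab_size) { Array.new(d_model, 0.0) }
  indices.each_with_index do |tok, i|
    d_model.times { |k| out[tok][k] += d_out[i][k] }
  end
  out
end

# Token 2 appears twice, so its row receives the sum of rows 0 and 2.
grad = embed_back_ref([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [2, 0, 2], 4)
```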



# File 'lib/toy/ffi/tinynn.rb', line 1298

def self.embed_back(d_out, indices, vocab_size)
  n_idx = indices.length
  sess  = TinyNN.tnn_session_new(0)
  td    = TinyNN.tnn_input_2d_f32(sess, d_out.nrows, d_out.ncols)
  tidx  = TinyNN.tnn_input_1d_i32(sess, n_idx)
  # Shape reference for the result: a freshly-allocated (vocab, d) tensor.
  tshape = TinyNN.tnn_input_2d_f32(sess, vocab_size, d_out.ncols)
  tout  = TinyNN.tnn_get_rows_back(sess, td, tidx, tshape)
  TinyNN.tnn_realize(sess, tout)

  TinyNN.stage_row_major_and_upload(sess, td, d_out)

  i = 0
  while i < n_idx
    TinyNN.tnn_scratch_set_i32(sess, i, indices[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tidx)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tout)

  out = Mat.new(vocab_size, d_out.ncols)
  n = vocab_size * d_out.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.embed_lookup(table, indices) ⇒ Object

Embedding lookup: gather table rows by indices. `table` is a (vocab, d_model) Mat; `indices` is Array<Int>. Returns an (indices.length, d_model) Mat with table[indices[i]] in row i.
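
The gather itself is a one-liner in plain Ruby, shown here as a reference for the expected output shape. `embed_lookup_ref` is a hypothetical helper on nested Arrays.

```ruby
# Gather rows of table by index; row i of the result is table[indices[i]].
# Hypothetical helper for illustration only.
def embed_lookup_ref(table, indices)
  indices.map { |tok| table[tok].dup }
end

rows = embed_lookup_ref([[0.1, 0.2], [1.1, 1.2], [2.1, 2.2]], [2, 0, 2])
```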



# File 'lib/toy/ffi/tinynn.rb', line 1264

def self.embed_lookup(table, indices)
  n_idx = indices.length
  sess  = TinyNN.tnn_session_new(0)
  ttab  = TinyNN.tnn_input_2d_f32(sess, table.nrows, table.ncols)
  tidx  = TinyNN.tnn_input_1d_i32(sess, n_idx)
  tout  = TinyNN.tnn_get_rows(sess, ttab, tidx)
  TinyNN.tnn_realize(sess, tout)

  TinyNN.stage_row_major_and_upload(sess, ttab, table)

  i = 0
  while i < n_idx
    TinyNN.tnn_scratch_set_i32(sess, i, indices[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tidx)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tout)

  out = Mat.new(n_idx, table.ncols)
  n = n_idx * table.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.ffn_pipeline(h, w1, w2) ⇒ Object

FFN-shaped chain: result = gelu(h * w1) * w2.

Runs three op-sized sessions, each reusing the cached engine (the backend + scheduler init runs once, not three times). Chaining all three ops in a single ggml graph is possible in principle but needs explicit intermediate transposes, because mul_mat’s result has ne0 swapped relative to the next op’s k-dim. Sticking to three sessions until we have a clean chain-friendly layout convention.



# File 'lib/toy/ffi/tinynn.rb', line 1385

def self.ffn_pipeline(h, w1, w2)
  pre    = TinyNN.matmul(h, w1)
  hidden = TinyNN.gelu(pre)
  TinyNN.matmul(hidden, w2)
end

.gelu(a) ⇒ Object

Element-wise GeLU (tanh approximation, matches project’s feed_forward).



# File 'lib/toy/ffi/tinynn.rb', line 722

def self.gelu(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_gelu(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.gelu_back(x, dh) ⇒ Object

GeLU backward: dx = dh * d/dx GeLU(x) (tanh approx). Skips ggml entirely — uses tnn_gelu_back_scratch which operates on the session’s scratch buffer directly. CPU-only.
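
The derivative can be written out per element in plain Ruby. This sketch assumes the usual tanh-approximation constants (sqrt(2/pi) and 0.044715); whether tnn_gelu_back_scratch uses exactly these is an assumption, and `gelu_ref` / `gelu_back_ref` are hypothetical helpers.

```ruby
# Inner argument of the tanh approximation: sqrt(2/pi) * (x + 0.044715 x^3).
def gelu_inner(x)
  Math.sqrt(2.0 / Math::PI) * (x + 0.044715 * x**3)
end

# Forward: GeLU(x) = 0.5 * x * (1 + tanh(u)).
def gelu_ref(x)
  0.5 * x * (1.0 + Math.tanh(gelu_inner(x)))
end

# Backward: dx = dh * d/dx GeLU(x), by the product and chain rules.
def gelu_back_ref(x, dh)
  t = Math.tanh(gelu_inner(x))
  du = Math.sqrt(2.0 / Math::PI) * (1.0 + 3.0 * 0.044715 * x * x)
  dh * (0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du)
end
```

A central finite difference of `gelu_ref` is a convenient way to cross-check the backward value at a few sample points.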



# File 'lib/toy/ffi/tinynn.rb', line 1144

def self.gelu_back(x, dh)
  sess = TinyNN.tnn_session_new(0)
  n = x.nrows * x.ncols
  # Stage x at [0..n), dh at [n..2n)
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, x.flat[i])
    i = i + 1
  end
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, n + i, dh.flat[i])
    i = i + 1
  end
  TinyNN.tnn_gelu_back_scratch(sess, n)
  out = Mat.new(x.nrows, x.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, 2 * n + i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.matmul(a, b) ⇒ Object

a * b where both are project Mats (row-major f64). Returns a Mat (rows = a.nrows, cols = b.ncols).

Implementation note: ggml_mul_mat computes A * B^T. To get A * B we upload b TRANSPOSED: b is (br x bc) row-major; we present it to ggml as a (bc x br) tensor whose rows are b’s columns. Then ggml’s A * B^T = A * B (because the “B^T” inside ggml lines up with the original b shape).
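
The identity behind the note can be checked in plain Ruby. In this sketch `mul_mat_model` is a hypothetical stand-in for the A * B^T behaviour the note attributes to ggml_mul_mat (it is not a binding to ggml), and `matmul_naive` is the textbook product used for comparison.

```ruby
# Model of "A * Bt^T": every output entry is a dot product of a row
# of a with a row of bt. Hypothetical stand-in, not a ggml binding.
def mul_mat_model(a, bt)
  a.map { |arow| bt.map { |brow| arow.zip(brow).sum { |x, y| x * y } } }
end

# Textbook A * B for comparison.
def matmul_naive(a, b)
  a.map do |arow|
    b[0].each_index.map { |j| arow.each_index.sum { |k| arow[k] * b[k][j] } }
  end
end

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # 2 x 3
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # 3 x 2
via_model = mul_mat_model(a, b.transpose)  # present b transposed
direct = matmul_naive(a, b)
```

Passing `b.transpose` to the model plays the role of the transposed upload: the rows handed over are b's columns, so the two results agree.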



# File 'lib/toy/ffi/tinynn.rb', line 628

def self.matmul(a, b)
  sess = TinyNN.tnn_session_new(0)   # 0 = CPU

  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  # ggml-side tensor for b^T: rows=b.ncols, cols=b.nrows.
  tb_t = TinyNN.tnn_input_2d_f32(sess, b.ncols, b.nrows)
  tc = TinyNN.tnn_matmul(sess, ta, tb_t)
  TinyNN.tnn_realize(sess, tc)

  # Upload a (row-major flat).
  i = 0
  na = a.nrows * a.ncols
  while i < na
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  # Upload b TRANSPOSED into scratch: scratch[j*b.nrows + i] = b[i,j].
  bc = b.ncols
  br = b.nrows
  i = 0
  while i < br
    j = 0
    while j < bc
      TinyNN.tnn_scratch_set(sess, j * br + i, b.flat[i * bc + j])
      j = j + 1
    end
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb_t)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result tensor ggml shape: ne0=m=a.nrows, ne1=n=b.ncols. Read into
  # row-major Mat[i][j] (= flat[i*ncols+j]) from scratch[j*m + i].
  out = Mat.new(a.nrows, b.ncols)
  m = a.nrows
  n = b.ncols
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.matmul_t(a, b) ⇒ Object

a * b^T natively (matches Mat#matmul_t). Faster than transposing b and calling .matmul for the same shapes, because there’s no Ruby-side transpose of b on upload.



# File 'lib/toy/ffi/tinynn.rb', line 1393

def self.matmul_t(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_matmul(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  TinyNN.stage_row_major_and_upload(sess, ta, a)
  TinyNN.stage_row_major_and_upload(sess, tb, b)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, b.nrows)
  m = a.nrows
  n = b.nrows
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.mul(a, b) ⇒ Object

Element-wise multiply c = a * b. Matching shape required. One-shot wrapper. Used in SwiGLU between silu(gate) and up.



# File 'lib/toy/ffi/tinynn.rb', line 784

def self.mul(a, b)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tb = TinyNN.tnn_input_2d_f32(sess, b.nrows, b.ncols)
  tc = TinyNN.tnn_mul(sess, ta, tb)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, b.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tb)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.persistent_free(sess) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 960

def self.persistent_free(sess)
  TinyNN.tnn_session_free(sess)
end

.persistent_new(prefer_cuda) ⇒ Object


Persistent-session API: build a graph once, run it many times.

Workflow:

sess = TinyNN.persistent_new(0)
ta   = TinyNN.alloc_2d(sess, rows, cols)
tb   = TinyNN.alloc_2d(sess, rows, cols)
tc   = TinyNN.build_matmul(sess, ta, tb)   # or build_add / build_gelu / ...
TinyNN.realize(sess, tc)                    # allocates all backend buffers
# Upload weights once:
TinyNN.upload_row_major(sess, tb, w_mat)
# Per training step:
loop do
  TinyNN.upload_row_major(sess, ta, input_mat)
  TinyNN.compute(sess)
  result = TinyNN.download_matmul(sess, tc, m, n)    # transposed readback
end
TinyNN.persistent_free(sess)

The win over the one-shot wrappers (TinyNN.matmul etc.) is that ggml_init / ggml_backend_sched_alloc_graph runs once instead of per op, and backend buffers (the CUDA-side storage for tensors) are allocated once instead of per call. At the toy LM’s transformer shapes (see ab_smoke_big), this should flip the outcome for CUDA, which currently loses to native at small shapes.



# File 'lib/toy/ffi/tinynn.rb', line 956

def self.persistent_new(prefer_cuda)
  TinyNN.tnn_session_new(prefer_cuda)
end

.realize(sess, result) ⇒ Object



# File 'lib/toy/ffi/tinynn.rb', line 996

def self.realize(sess, result)
  TinyNN.tnn_realize(sess, result)
end

.rms_norm(x, gamma, eps) ⇒ Object

RMSNorm(x) * gamma. x is (T, d_model), gamma is Array<Float> of length d_model. eps is a required argument here (the signature has no default); 1e-5 matches the project’s rms_norm helper.
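
A per-row reference in plain Ruby, as a sketch of the expected numbers. It assumes eps is added to the mean square inside the square root; `rms_norm_ref` is a hypothetical helper on nested Arrays.

```ruby
# y[i][j] = x[i][j] / sqrt(mean_j(x[i][j]^2) + eps) * gamma[j]
# Hypothetical helper; eps placement is an assumption.
def rms_norm_ref(x, gamma, eps)
  x.map do |row|
    mean_sq = row.sum { |e| e * e } / row.length
    inv_rms = 1.0 / Math.sqrt(mean_sq + eps)
    row.each_with_index.map { |e, j| e * inv_rms * gamma[j] }
  end
end

# Row [3, 4] has mean square 12.5, so each entry is divided by sqrt(12.5).
y = rms_norm_ref([[3.0, 4.0]], [1.0, 2.0], 0.0)
```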



# File 'lib/toy/ffi/tinynn.rb', line 822

def self.rms_norm(x, gamma, eps)
  sess = TinyNN.tnn_session_new(0)
  tx = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  # gamma as a 1-row tensor: shape (1, d_model). ggml will broadcast
  # across x's leading dimension during the mul.
  tg = TinyNN.tnn_input_2d_f32(sess, 1, x.ncols)
  tc = TinyNN.tnn_rms_norm(sess, tx, tg, eps)
  TinyNN.tnn_realize(sess, tc)

  # Upload x.
  nx = x.nrows * x.ncols
  i = 0
  while i < nx
    TinyNN.tnn_scratch_set(sess, i, x.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tx)

  # Upload gamma (length d_model).
  i = 0
  while i < x.ncols
    TinyNN.tnn_scratch_set(sess, i, gamma[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tg)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(x.nrows, x.ncols)
  i = 0
  while i < nx
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.rms_norm_back(x, dy, eps) ⇒ Object

d/dx of plain RMSNorm(x) given dy (= grad of normalized output). No gamma — caller is responsible for the gamma part of the chain rule.

Note on arg order: ggml’s header says “a - x, b - dy” but the CPU source (ggml-cpu/ops.cpp ggml_compute_forward_rms_norm_back_f32) treats src0 as gradients and src1 as the forward input. We pass (dy, x) to match the source.
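
Because the operand order is easy to get wrong, a plain-Ruby reference for one row is handy for cross-checking. This sketch derives the gradient of y = x * r with r = 1 / sqrt(mean(x^2) + eps); both helpers are hypothetical, and the eps placement is an assumption.

```ruby
# Forward for one row: y_j = x_j * r, r = 1 / sqrt(mean(x^2) + eps).
def rms_norm_plain(row, eps)
  r = 1.0 / Math.sqrt(row.sum { |e| e * e } / row.length + eps)
  row.map { |e| e * r }
end

# Backward for one row:
#   dx_k = r * dy_k - x_k * r^3 / d * sum_j(dy_j * x_j)
def rms_norm_back_ref(row, dy, eps)
  d = row.length
  r = 1.0 / Math.sqrt(row.sum { |e| e * e } / d + eps)
  dot = row.zip(dy).sum { |x, g| x * g }
  row.each_with_index.map { |x, k| r * dy[k] - x * r**3 / d * dot }
end

dx = rms_norm_back_ref([1.0, -2.0, 0.5], [0.3, 0.1, -0.7], 1e-5)
```

Swapping the two arguments gives a visibly different result, which makes this a quick check on the (dy, x) ordering.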



# File 'lib/toy/ffi/tinynn.rb', line 1120

def self.rms_norm_back(x, dy, eps)
  sess = TinyNN.tnn_session_new(0)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  tx  = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  tc  = TinyNN.tnn_rms_norm_back(sess, tdy, tx, eps)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.stage_row_major_and_upload(sess, tx,  x)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(x.nrows, x.ncols)
  n = x.nrows * x.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.scale(a, s) ⇒ Object

Element-wise a * s for scalar s. Returns a new Mat (out-of-place).



# File 'lib/toy/ffi/tinynn.rb', line 1459

def self.scale(a, s)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_scale(sess, ta, s)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.sgd_step(param, grad, lr) ⇒ Object

SGD parameter update: param_new = param - lr * grad. Returns a fresh Mat with the updated parameter (the caller is responsible for swapping it back into wherever param came from; we don’t have persistent-session storage yet).

Composed from TinyNN.add and TinyNN.scale rather than ggml_opt_step_sgd (which would need an sgd_params tensor with (alpha, weight_decay)). A single fused op would be faster; this version is the cleanest one with the primitives we already have.
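
The same composition, written on flat Ruby Arrays so it runs without the FFI library. `scale_ref`, `add_ref` and `sgd_step_ref` are hypothetical stand-ins for TinyNN.scale, TinyNN.add and TinyNN.sgd_step.

```ruby
# Element-wise a * s.
def scale_ref(a, s)
  a.map { |e| e * s }
end

# Element-wise a + b (same length assumed).
def add_ref(a, b)
  a.zip(b).map { |x, y| x + y }
end

# param_new = param + (-lr) * grad
def sgd_step_ref(param, grad, lr)
  add_ref(param, scale_ref(grad, -lr))
end

updated = sgd_step_ref([1.0, 2.0], [0.5, -1.0], 0.1)
```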



# File 'lib/toy/ffi/tinynn.rb', line 1257

def self.sgd_step(param, grad, lr)
  TinyNN.add(param, TinyNN.scale(grad, -lr))
end

.silu(a) ⇒ Object

Element-wise SiLU (x * sigmoid(x)), llama-family activation. One-shot wrapper (slow per-call: session + graph + free); used by ab_smoke_silu and as a building block. The persistent-session FFN path doesn’t go through this — it builds silu into a fused graph.



# File 'lib/toy/ffi/tinynn.rb', line 754

def self.silu(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_silu(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.silu_back(x, dy) ⇒ Object

Backward for SiLU: given x (the input to silu) and dy (gradient from upstream), returns dx. dx = dy * (sigmoid(x) * (1 + x * (1 - sigmoid(x)))).
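
The closed form follows from the product rule on x * sigmoid(x). A per-element sketch in plain Ruby, with hypothetical helper names:

```ruby
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

# Forward: SiLU(x) = x * sigmoid(x).
def silu_ref(x)
  x * sigmoid(x)
end

# Backward: d/dx [x * s] = s + x * s * (1 - s) = s * (1 + x * (1 - s)).
def silu_back_ref(x, dy)
  s = sigmoid(x)
  dy * (s * (1.0 + x * (1.0 - s)))
end
```

At x = 0 the derivative is 0.5, which is a quick sanity check for any implementation.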



# File 'lib/toy/ffi/tinynn.rb', line 1333

def self.silu_back(x, dy)
  sess = TinyNN.tnn_session_new(0)
  tx  = TinyNN.tnn_input_2d_f32(sess, x.nrows, x.ncols)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  tc  = TinyNN.tnn_silu_back(sess, tx, tdy)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tx,  x)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(x.nrows, x.ncols)
  n = x.nrows * x.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.softmax(a) ⇒ Object

Per-row softmax. Matches the project’s softmax_rows! (out-of-place).



# File 'lib/toy/ffi/tinynn.rb', line 863

def self.softmax(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_softmax(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.nrows, a.ncols)
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.softmax_back(a_softmax, dy) ⇒ Object

d/dx of per-row softmax. `a_softmax` is the softmax output; `dy` is the gradient of the output. (ggml source: src0=dy, src1=y_softmax.)
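
For one row, the standard softmax Jacobian-vector product is dx_i = y_i * (dy_i - sum_j dy_j * y_j). A plain-Ruby sketch, assuming ggml's op computes this standard form; `softmax_back_ref` is a hypothetical helper.

```ruby
# One-row softmax backward: dx_i = y_i * (dy_i - dot(dy, y)).
# Hypothetical helper; y is the softmax output, dy the upstream gradient.
def softmax_back_ref(y, dy)
  dot = y.zip(dy).sum { |prob, g| prob * g }
  y.each_with_index.map { |prob, i| prob * (dy[i] - dot) }
end

dx = softmax_back_ref([0.2, 0.3, 0.5], [1.0, 0.0, -1.0])
```

The entries of dx always sum to zero when y sums to one, since shifting all logits by a constant leaves the softmax unchanged.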



# File 'lib/toy/ffi/tinynn.rb', line 1356

def self.softmax_back(a_softmax, dy)
  sess = TinyNN.tnn_session_new(0)
  tdy = TinyNN.tnn_input_2d_f32(sess, dy.nrows, dy.ncols)
  ta  = TinyNN.tnn_input_2d_f32(sess, a_softmax.nrows, a_softmax.ncols)
  tc  = TinyNN.tnn_softmax_back(sess, tdy, ta)
  TinyNN.tnn_realize(sess, tc)
  TinyNN.stage_row_major_and_upload(sess, tdy, dy)
  TinyNN.stage_row_major_and_upload(sess, ta,  a_softmax)
  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)
  out = Mat.new(a_softmax.nrows, a_softmax.ncols)
  n = a_softmax.nrows * a_softmax.ncols
  i = 0
  while i < n
    out.flat[i] = TinyNN.tnn_scratch_get(sess, i)
    i = i + 1
  end
  TinyNN.tnn_session_free(sess)
  out
end

.stage_row_major_and_upload(sess, target, m) ⇒ Object

Internal: stage `m` row-major into scratch, then bulk-upload to `target`.



# File 'lib/toy/ffi/tinynn.rb', line 1103

def self.stage_row_major_and_upload(sess, target, m)
  n = m.nrows * m.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, m.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, target)
end

.stage_transposed_and_upload(sess, target, b) ⇒ Object

Internal: stage b TRANSPOSED into scratch, then bulk-upload to `target`. The C side does both the transpose and a chunked upload, so the call works for tensors larger than the 16 MiB scratch buffer. (Qwen2.5-0.5B’s ffn_gate is 17.4 MB; the old per-element + single bulk-upload path silently truncated at the 4M-float boundary, leaving the tail uninitialised and producing 1e+37 magnitudes in the subsequent matmul output.)
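
The arithmetic behind the 4M-float boundary, plus one way a chunked transfer could be split. This is only a sketch: `chunk_ranges` is a hypothetical helper, and the C side's actual chunking scheme is not shown in this page.

```ruby
# Split n elements into [offset, count] pairs of at most `capacity` each.
# Hypothetical helper; the real C-side chunking may differ.
def chunk_ranges(n, capacity)
  ranges = []
  offset = 0
  while offset < n
    count = [capacity, n - offset].min
    ranges << [offset, count]
    offset += count
  end
  ranges
end

scratch_floats = 16 * 1024 * 1024 / 4     # 16 MiB of f32 = 4_194_304 floats
tensor_floats = 17_400_000 / 4            # a 17.4 MB f32 tensor
chunks = chunk_ranges(tensor_floats, scratch_floats)
```

With these numbers the tensor needs two passes through the scratch buffer, which is why a single bulk upload could not cover it.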



# File 'lib/toy/ffi/tinynn.rb', line 1098

def self.stage_transposed_and_upload(sess, target, b)
  TinyNN.tnn_upload_transposed_f64(sess, target, b.flat, b.nrows, b.ncols)
end

.t_matmul(a, b) ⇒ Object

a^T * b (matches Mat#t_matmul). Both inputs uploaded transposed so ggml’s ne0 lines up with the summed-over K dimension.



# File 'lib/toy/ffi/tinynn.rb', line 1425

def self.t_matmul(a, b)
  sess = TinyNN.tnn_session_new(0)
  # Both tensors created as their transposed shape:
  #   ta_t: ne0=a.nrows (=K), ne1=a.ncols (=M)
  #   tb_t: ne0=b.nrows (=K), ne1=b.ncols (=N)
  ta_t = TinyNN.tnn_input_2d_f32(sess, a.ncols, a.nrows)
  tb_t = TinyNN.tnn_input_2d_f32(sess, b.ncols, b.nrows)
  tc = TinyNN.tnn_matmul(sess, ta_t, tb_t)
  TinyNN.tnn_realize(sess, tc)

  TinyNN.stage_transposed_and_upload(sess, ta_t, a)
  TinyNN.stage_transposed_and_upload(sess, tb_t, b)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  out = Mat.new(a.ncols, b.ncols)
  m = a.ncols
  n = b.ncols
  i = 0
  while i < m
    j = 0
    while j < n
      out.flat[i * n + j] = TinyNN.tnn_scratch_get(sess, j * m + i)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.transpose(a) ⇒ Object

Transpose. Returns a Mat with rows/cols swapped.



# File 'lib/toy/ffi/tinynn.rb', line 892

def self.transpose(a)
  sess = TinyNN.tnn_session_new(0)
  ta = TinyNN.tnn_input_2d_f32(sess, a.nrows, a.ncols)
  tc = TinyNN.tnn_transpose(sess, ta)
  TinyNN.tnn_realize(sess, tc)

  n = a.nrows * a.ncols
  i = 0
  while i < n
    TinyNN.tnn_scratch_set(sess, i, a.flat[i])
    i = i + 1
  end
  TinyNN.tnn_upload(sess, ta)

  TinyNN.tnn_compute(sess)
  TinyNN.tnn_download(sess, tc)

  # Result shape: (a.ncols, a.nrows), i.e. rows and cols swapped.
  # ggml stores it contiguous after ggml_cont; row-major readout is
  # straightforward since the transposed tensor's ne0/ne1 already
  # match the target Mat's cols/rows.
  out = Mat.new(a.ncols, a.nrows)
  rin  = a.nrows
  cin  = a.ncols
  i = 0
  while i < cin
    j = 0
    while j < rin
      out.flat[i * rin + j] = TinyNN.tnn_scratch_get(sess, i * rin + j)
      j = j + 1
    end
    i = i + 1
  end

  TinyNN.tnn_session_free(sess)
  out
end

.upload_int_array(sess, tensor, indices) ⇒ Object

Upload an Array<Int> to a 1D int32 tensor in one FFI call. Uses Spinel’s :int_array spec (matz/spinel#474).



# File 'lib/toy/ffi/tinynn.rb', line 1017

def self.upload_int_array(sess, tensor, indices)
  TinyNN.tnn_upload_from_int_array(sess, tensor, indices, indices.length)
end

.upload_row_major(sess, tensor, mat) ⇒ Object

Stage a Mat row-major into scratch and upload to `tensor`. Use for elementwise inputs or for matmul’s A operand. For matmul’s B we also have upload_transposed below.

Uses Spinel’s :float_array spec (matz/spinel#474) for zero-copy transfer of mat.flat: a single FFI call replaces the O(n) per-element tnn_scratch_set loop.



# File 'lib/toy/ffi/tinynn.rb', line 1011

def self.upload_row_major(sess, tensor, mat)
  TinyNN.tnn_upload_from_float_array(sess, tensor, mat.flat, mat.nrows * mat.ncols)
end

.upload_transposed(sess, tensor, mat) ⇒ Object

Stage a Mat TRANSPOSED into scratch and upload. Use this for the `b` operand of build_matmul to get logical A*B semantics (ggml’s mul_mat is A*B^T natively).
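
The staging rule is scratch[j * br + i] = b[i, j]. A plain-Ruby sketch of that index mapping, using an ordinary Array in place of the session scratch; `stage_transposed` is a hypothetical helper.

```ruby
# Write a row-major (br x bc) flat Array into a buffer so that the
# buffer holds the transpose, row-major. Hypothetical helper.
def stage_transposed(flat, br, bc)
  scratch = Array.new(br * bc, 0.0)
  br.times do |i|
    bc.times do |j|
      scratch[j * br + i] = flat[i * bc + j]
    end
  end
  scratch
end

# b = [[1, 2, 3], [4, 5, 6]] (2 x 3), stored row-major.
staged = stage_transposed([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3)
```

Read row by row as a (3 x 2) matrix, the staged buffer holds b's columns as rows.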



# File 'lib/toy/ffi/tinynn.rb', line 1024

def self.upload_transposed(sess, tensor, mat)
  br = mat.nrows
  bc = mat.ncols
  i = 0
  while i < br
    j = 0
    while j < bc
      TinyNN.tnn_scratch_set(sess, j * br + i, mat.flat[i * bc + j])
      j = j + 1
    end
    i = i + 1
  end
  TinyNN.tnn_upload(sess, tensor)
end