Class: TransformerLM
- Inherits:
- Object
- Defined in:
- lib/toy/models/transformer.rb
Overview
TransformerLM
Instance Attribute Summary
- #blocks ⇒ Object
Returns the value of attribute blocks.
- #cache ⇒ Object
Returns the value of attribute cache.
- #context_length ⇒ Object
Returns the value of attribute context_length.
- #d_ff ⇒ Object
Returns the value of attribute d_ff.
- #d_head ⇒ Object
Returns the value of attribute d_head.
- #d_model ⇒ Object
Returns the value of attribute d_model.
- #ffn_ffi_caches ⇒ Object
Returns the value of attribute ffn_ffi_caches.
- #layer_caches ⇒ Object
Returns the value of attribute layer_caches.
- #n_heads ⇒ Object
Returns the value of attribute n_heads.
- #n_layers ⇒ Object
Returns the value of attribute n_layers.
- #norm_final_gamma ⇒ Object
Returns the value of attribute norm_final_gamma.
- #pos_embed ⇒ Object
Returns the value of attribute pos_embed.
- #token_embed ⇒ Object
Returns the value of attribute token_embed.
- #vocab_size ⇒ Object
Returns the value of attribute vocab_size.
- #vocabulary ⇒ Object
Returns the value of attribute vocabulary.
Instance Method Summary
- #adam_step_block(p_block, g_block, m_block, v_block, lr, b1, b2, eps, omc1, omc2) ⇒ Object
- #adam_step_mat(p, g, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object
- #adam_step_vec(p, g, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object
- #apply_causal_mask!(scores, query_offset) ⇒ Object
Causal mask: for each row i, set scores[i, j] = -1e30 for j > query_offset + i.
- #apply_gradients_adam(grads, state, lr, beta1, beta2, eps) ⇒ Object
Adam: per-coordinate adaptive learning rate driven by running estimates of the gradient mean (m) and squared mean (v).
- #apply_gradients_sgd(grads, lr) ⇒ Object
Plain SGD update; opens the Optimization section and is used by the train_minimal smoke test.
- #backward(input_ids, target_grads) ⇒ Object
Full backward pass.
- #cross_entropy_grad(logits, token_ids) ⇒ Object
Cross-entropy on next-token prediction.
- #embed(token_ids, start_pos) ⇒ Object
x = token_embed[token_ids] + pos_embed[start_pos + i].
- #embed_backward(token_ids, dx, target_grads) ⇒ Object
Embedding backward: each row of dx routes to its token's embedding row and to position i's positional embedding row.
- #feed_forward(h, block) ⇒ Object
FFN: gelu(h · W_ff1) · W_ff2.
- #feed_forward_backward(d_ff_out, h, ff_cache, block, target_block) ⇒ Object
FFN backward.
- #feed_forward_ffi(h, block, ffi_cache) ⇒ Object
Persistent-session FFI variant of feed_forward.
- #forward(token_ids) ⇒ Object
Full forward pass.
- #generate_from_ids(start_ids, max_tokens, temperature) ⇒ Object
GENERATION: autoregressive sampling from a starting token-id list.
- #hsplit_heads(d_concat) ⇒ Object
Split a (T × d_model) matrix back into n_heads × (T × d_head) heads.
- #hstack_heads(per_head) ⇒ Object
Concatenate per-head outputs side by side: n_heads × (T × d_head) → (T × d_model).
- #initialize(vocab_size, d_model, d_ff, n_heads, n_layers, context_length) ⇒ TransformerLM
Constructor: returns a new instance of TransformerLM.
- #rms_norm(x, gamma) ⇒ Object
RMSNorm: y_j = gamma_j * x_j / sqrt(mean(x²) + eps), per row.
- #rms_norm_backward(x, gamma, rms, dy, target_dgamma) ⇒ Object
RMSNorm backward.
- #sample_logits_row(logits, row, temperature) ⇒ Object
Sample a token ID from row `row` of `logits` (T × vocab_size flat).
- #self_attention(h_in, block) ⇒ Object
Multi-head self-attention.
- #self_attention_backward(d_proj, h_in, attn_cache, block, target_block) ⇒ Object
Self-attention backward.
- #self_attention_head(h_in, block, head_idx, inv_sqrt) ⇒ Object
- #sgd_step_block(p_block, g_block, lr) ⇒ Object
- #sgd_step_mat(p, g, lr) ⇒ Object
- #sgd_step_vec(p, g, lr) ⇒ Object
- #softmax_rows!(m) ⇒ Object
Row-wise softmax with numerical-stability max-shift, in place on `m`.
- #softmax_rows_backward(softmax_out, d_softmax) ⇒ Object
Row-wise softmax backward (for attention).
- #transformer_block(x, block) ⇒ Object
One transformer block (pre-norm).
- #transformer_block_backward(dx_out, x_in, block, layer_cache, target_block_grads) ⇒ Object
Backward through one block.
- #transformer_block_into(x, block, cache, ffi_cache) ⇒ Object
Same as transformer_block but writes into a pre-existing LayerCache.
- #x_in_for_layer(li) ⇒ Object
No `train_step` here: Spinel compiles every class method whether or not it has callers.
Constructor Details
#initialize(vocab_size, d_model, d_ff, n_heads, n_layers, context_length) ⇒ TransformerLM
Returns a new instance of TransformerLM.
# File 'lib/toy/models/transformer.rb', line 510
def initialize(vocab_size, d_model, d_ff, n_heads, n_layers, context_length)
  @vocab_size = vocab_size
  @d_model = d_model
  @d_ff = d_ff
  @n_heads = n_heads
  @d_head = d_model / n_heads
  @n_layers = n_layers
  @context_length = context_length

  s = 1.0 / Math.sqrt(d_model)
  @token_embed = Mat.new(vocab_size, d_model)
  @token_embed.fill_random(s)
  @pos_embed = Mat.new(context_length, d_model)
  @pos_embed.fill_random(s)
  # Tied embeddings: lm_head is @token_embed used in transposed form
  # at unembed time (logits = x_final · token_embedᵀ). No separate
  # @lm_head matrix; the unembed gradient accumulates into the same
  # token_embed grad slot as the input-side embedding lookup.
  @norm_final_gamma = Array.new(d_model, 1.0)

  # Vocabulary: seeded with one placeholder so Spinel infers it as
  # an array of strings; callers should set the real vocab after construction.
  @vocabulary = ["?"]

  # Inline Block.new in the literal: Spinel's scan_ivars runs before
  # local-variable types are inferred, so storing through a temp would
  # mistype @blocks's element class.
  @blocks = [Block.new(d_model, @d_head, d_ff, n_heads)]
  @blocks[0].fill_random_all(s)
  li = 1
  while li < n_layers
    @blocks.push(Block.new(d_model, @d_head, d_ff, n_heads))
    @blocks[li].fill_random_all(s)
    li += 1
  end

  # Pre-allocate layer caches so the array's element type is fixed at
  # construction time. Forward populates fields on these existing objects.
  @layer_caches = [LayerCache.new]
  li = 1
  while li < n_layers
    @layer_caches.push(LayerCache.new)
    li += 1
  end

  # Per-block persistent FFI caches for feed_forward. Lazily realized
  # on first call so we don't need to decide T (sequence length) at
  # model-construction time. With USE_FFI_MATMUL=false they sit
  # unused; the cost is one cheap object alloc per block.
  @ffn_ffi_caches = [FFNFFICache.new]
  li = 1
  while li < n_layers
    @ffn_ffi_caches.push(FFNFFICache.new)
    li += 1
  end
end
Instance Attribute Details
#blocks ⇒ Object
Returns the value of attribute blocks.
# File 'lib/toy/models/transformer.rb', line 505
def blocks
  @blocks
end
#cache ⇒ Object
Returns the value of attribute cache.
# File 'lib/toy/models/transformer.rb', line 823
def cache
  @cache
end
#context_length ⇒ Object
Returns the value of attribute context_length.
# File 'lib/toy/models/transformer.rb', line 505
def context_length
  @context_length
end
#d_ff ⇒ Object
Returns the value of attribute d_ff.
# File 'lib/toy/models/transformer.rb', line 505
def d_ff
  @d_ff
end
#d_head ⇒ Object
Returns the value of attribute d_head.
# File 'lib/toy/models/transformer.rb', line 505
def d_head
  @d_head
end
#d_model ⇒ Object
Returns the value of attribute d_model.
# File 'lib/toy/models/transformer.rb', line 505
def d_model
  @d_model
end
#ffn_ffi_caches ⇒ Object
Returns the value of attribute ffn_ffi_caches.
# File 'lib/toy/models/transformer.rb', line 571
def ffn_ffi_caches
  @ffn_ffi_caches
end
#layer_caches ⇒ Object
Returns the value of attribute layer_caches.
# File 'lib/toy/models/transformer.rb', line 573
def layer_caches
  @layer_caches
end
#n_heads ⇒ Object
Returns the value of attribute n_heads.
# File 'lib/toy/models/transformer.rb', line 505
def n_heads
  @n_heads
end
#n_layers ⇒ Object
Returns the value of attribute n_layers.
# File 'lib/toy/models/transformer.rb', line 505
def n_layers
  @n_layers
end
#norm_final_gamma ⇒ Object
Returns the value of attribute norm_final_gamma.
# File 'lib/toy/models/transformer.rb', line 505
def norm_final_gamma
  @norm_final_gamma
end
#pos_embed ⇒ Object
Returns the value of attribute pos_embed.
# File 'lib/toy/models/transformer.rb', line 505
def pos_embed
  @pos_embed
end
#token_embed ⇒ Object
Returns the value of attribute token_embed.
# File 'lib/toy/models/transformer.rb', line 505
def token_embed
  @token_embed
end
#vocab_size ⇒ Object
Returns the value of attribute vocab_size.
# File 'lib/toy/models/transformer.rb', line 505
def vocab_size
  @vocab_size
end
#vocabulary ⇒ Object
Returns the value of attribute vocabulary.
# File 'lib/toy/models/transformer.rb', line 505
def vocabulary
  @vocabulary
end
Instance Method Details
#adam_step_block(p_block, g_block, m_block, v_block, lr, b1, b2, eps, omc1, omc2) ⇒ Object
# File 'lib/toy/models/transformer.rb', line 1249 def adam_step_block(p_block, g_block, m_block, v_block, lr, b1, b2, eps, omc1, omc2) self.adam_step_vec(p_block.norm1_gamma, g_block.norm1_gamma, m_block.norm1_gamma, v_block.norm1_gamma, lr, b1, b2, eps, omc1, omc2) self.adam_step_vec(p_block.norm2_gamma, g_block.norm2_gamma, m_block.norm2_gamma, v_block.norm2_gamma, lr, b1, b2, eps, omc1, omc2) self.adam_step_mat(p_block.w_o, g_block.w_o, m_block.w_o, v_block.w_o, lr, b1, b2, eps, omc1, omc2) self.adam_step_mat(p_block.w_ff1, g_block.w_ff1, m_block.w_ff1, v_block.w_ff1, lr, b1, b2, eps, omc1, omc2) self.adam_step_mat(p_block.w_ff2, g_block.w_ff2, m_block.w_ff2, v_block.w_ff2, lr, b1, b2, eps, omc1, omc2) h = 0 while h < @n_heads self.adam_step_mat(p_block.w_q[h], g_block.w_q[h], m_block.w_q[h], v_block.w_q[h], lr, b1, b2, eps, omc1, omc2) self.adam_step_mat(p_block.w_k[h], g_block.w_k[h], m_block.w_k[h], v_block.w_k[h], lr, b1, b2, eps, omc1, omc2) self.adam_step_mat(p_block.w_v[h], g_block.w_v[h], m_block.w_v[h], v_block.w_v[h], lr, b1, b2, eps, omc1, omc2) h += 1 end end |
#adam_step_mat(p, g, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object
# File 'lib/toy/models/transformer.rb', line 1213 def adam_step_mat(p, g, m, v, lr, b1, b2, eps, omc1, omc2) one_minus_b1 = 1.0 - b1 one_minus_b2 = 1.0 - b2 n = p.flat.length i = 0 while i < n gi = g.flat[i] new_m = b1 * m.flat[i] + one_minus_b1 * gi new_v = b2 * v.flat[i] + one_minus_b2 * gi * gi m.flat[i] = new_m v.flat[i] = new_v m_hat = new_m / omc1 v_hat = new_v / omc2 p.flat[i] -= lr * m_hat / (Math.sqrt(v_hat) + eps) i += 1 end end |
#adam_step_vec(p, g, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object
# File 'lib/toy/models/transformer.rb', line 1231 def adam_step_vec(p, g, m, v, lr, b1, b2, eps, omc1, omc2) one_minus_b1 = 1.0 - b1 one_minus_b2 = 1.0 - b2 n = p.length i = 0 while i < n gi = g[i] new_m = b1 * m[i] + one_minus_b1 * gi new_v = b2 * v[i] + one_minus_b2 * gi * gi m[i] = new_m v[i] = new_v m_hat = new_m / omc1 v_hat = new_v / omc2 p[i] -= lr * m_hat / (Math.sqrt(v_hat) + eps) i += 1 end end |
#apply_causal_mask!(scores, query_offset) ⇒ Object
Causal mask: for each row i, set scores[i, j] = -1e30 for j > query_offset + i.
# File 'lib/toy/models/transformer.rb', line 661
def apply_causal_mask!(scores, query_offset)
  t = scores.nrows
  n = scores.ncols
  i = 0
  while i < t
    first_masked = query_offset + i + 1
    j = first_masked
    while j < n
      scores.flat[i * n + j] = NEG_INF_SCORE
      j += 1
    end
    i += 1
  end
end
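The masking rule can be tried in isolation. The following is a minimal sketch, assuming plain nested Ruby arrays in place of the project's Mat class; `NEG_INF` and `causal_mask!` are hypothetical names, and the -1e30 value is taken from the doc comment above.

```ruby
# Hypothetical stand-in for NEG_INF_SCORE (the doc comment gives -1e30).
NEG_INF = -1.0e30

# scores: Array of T rows, each an Array of N floats; mutated in place.
# Row i may attend to columns 0..query_offset + i; later columns are masked.
def causal_mask!(scores, query_offset)
  scores.each_with_index do |row, i|
    first_masked = query_offset + i + 1
    # The range is empty when first_masked >= row.length, so nothing is masked.
    (first_masked...row.length).each { |j| row[j] = NEG_INF }
  end
  scores
end

scores = Array.new(3) { Array.new(3, 0.5) }
causal_mask!(scores, 0)
scores.each { |row| p row }
```

With query_offset = 0 the surviving entries form a lower triangle. A positive offset shifts the diagonal to the right, which is presumably what a cached-keys decode path would need.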
#apply_gradients_adam(grads, state, lr, beta1, beta2, eps) ⇒ Object
Adam: per-coordinate adaptive learning rate driven by running estimates of the gradient mean (m) and squared mean (v).
m ← β1·m + (1−β1)·g
v ← β2·v + (1−β2)·g²
m̂ = m / (1 − β1ᵗ)
v̂ = v / (1 − β2ᵗ)
p -= lr · m̂ / (√v̂ + ε)
bc1 / bc2 are kept as running products in AdamState (one multiply per step) rather than recomputing β**t (one pow() per step).
# File 'lib/toy/models/transformer.rb', line 1188
def apply_gradients_adam(grads, state, lr, beta1, beta2, eps)
  state.bc1 = state.bc1 * beta1
  state.bc2 = state.bc2 * beta2
  omc1 = 1.0 - state.bc1
  omc2 = 1.0 - state.bc2

  self.adam_step_mat(@token_embed, grads.token_embed,
                     state.m.token_embed, state.v.token_embed,
                     lr, beta1, beta2, eps, omc1, omc2)
  self.adam_step_mat(@pos_embed, grads.pos_embed,
                     state.m.pos_embed, state.v.pos_embed,
                     lr, beta1, beta2, eps, omc1, omc2)
  self.adam_step_vec(@norm_final_gamma, grads.norm_final_gamma,
                     state.m.norm_final_gamma, state.v.norm_final_gamma,
                     lr, beta1, beta2, eps, omc1, omc2)
  li = 0
  while li < @n_layers
    self.adam_step_block(@blocks[li], grads.blocks[li],
                         state.m.blocks[li], state.v.blocks[li],
                         lr, beta1, beta2, eps, omc1, omc2)
    li += 1
  end
end
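To see the running-product bias correction on a small case, here is a sketch over a flat Float array. `AdamDemoState` and `adam_demo_step!` are hypothetical names, and the sketch assumes bc1 and bc2 start at 1.0, so that after t steps they equal β1ᵗ and β2ᵗ; the real AdamState class is not shown on this page.

```ruby
# Hypothetical state container: moment arrays plus running products of the betas.
AdamDemoState = Struct.new(:m, :v, :bc1, :bc2)

def adam_demo_step!(params, grads, state, lr, beta1, beta2, eps)
  # One multiply per step instead of beta ** t.
  state.bc1 *= beta1
  state.bc2 *= beta2
  omc1 = 1.0 - state.bc1
  omc2 = 1.0 - state.bc2
  params.each_index do |i|
    g = grads[i]
    state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * g
    state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * g * g
    m_hat = state.m[i] / omc1
    v_hat = state.v[i] / omc2
    params[i] -= lr * m_hat / (Math.sqrt(v_hat) + eps)
  end
  params
end

params = [1.0, -2.0]
state = AdamDemoState.new([0.0, 0.0], [0.0, 0.0], 1.0, 1.0) # assumed initial values
adam_demo_step!(params, [0.5, -0.25], state, 0.1, 0.9, 0.999, 1e-8)
p params
```

On the first step the bias-corrected moments reduce to m̂ = g and v̂ = g², so each parameter moves by roughly lr in the direction opposite its gradient, whatever the gradient's magnitude.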
#apply_gradients_sgd(grads, lr) ⇒ Object
----- Optimization -----
Two optimizers live side-by-side. Plain SGD (apply_gradients_sgd) is what the train_minimal smoke test uses — a few dozen steps to prove forward/backward/update compile and converge, with no extra state. Adam (apply_gradients_adam, below) is what the TinyStories run uses, walking the same parameter inventory but with parallel m/v moment shadows held in AdamState.
# File 'lib/toy/models/transformer.rb', line 1133
def apply_gradients_sgd(grads, lr)
  self.sgd_step_mat(@token_embed, grads.token_embed, lr)
  self.sgd_step_mat(@pos_embed, grads.pos_embed, lr)
  self.sgd_step_vec(@norm_final_gamma, grads.norm_final_gamma, lr)
  li = 0
  while li < @n_layers
    self.sgd_step_block(@blocks[li], grads.blocks[li], lr)
    li += 1
  end
end
#backward(input_ids, target_grads) ⇒ Object
Full backward pass. Fills `target_grads` with this example's gradients and the loss. The caller is responsible for calling forward(token_ids) first.
# File 'lib/toy/models/transformer.rb', line 1318
def backward(input_ids, target_grads)
  n_pred = input_ids.length - 1
  if n_pred <= 0
    target_grads.loss = 0.0
    return
  end

  loss_res = self.cross_entropy_grad(@cache.logits, input_ids)
  target_grads.loss = loss_res.loss

  # Tied unembed: logits = x_final · token_embedᵀ.
  #   d_token_embed[v,d] += Σ_t dlogits[t,v] · x_final[t,d]
  #     ⇒ dlogits.t_matmul(x_final)   (vocab × d_model)
  #   d_x_final[t,d] = Σ_v dlogits[t,v] · token_embed[v,d]
  #     ⇒ dlogits.matmul(token_embed) (T × d_model)
  # The unembed gradient is added directly into target_grads.token_embed;
  # embed_backward later adds the input-side row contributions on top.
  d_token_embed = loss_res.dlogits.t_matmul(@cache.x_final)
  target_grads.token_embed.add!(d_token_embed)
  dx_final = loss_res.dlogits.matmul(@token_embed)

  # Final RMSNorm. Use `self.` so Spinel's call-site parameter inference
  # picks up the typed args (only fires for explicit-receiver calls).
  dx = self.rms_norm_backward(@cache.x_block_out, @norm_final_gamma,
                              @cache.rms_final, dx_final,
                              target_grads.norm_final_gamma)

  # Each block in reverse.
  li = @n_layers - 1
  while li >= 0
    dx = self.transformer_block_backward(dx, self.x_in_for_layer(li),
                                         @blocks[li], @cache.layers[li],
                                         target_grads.blocks[li])
    li -= 1
  end

  self.embed_backward(input_ids, dx, target_grads)
end
#cross_entropy_grad(logits, token_ids) ⇒ Object
Cross-entropy on next-token prediction. dL/dlogits = softmax(logits) - one_hot(target). Loss is averaged over the (T-1) prediction positions.
# File 'lib/toy/models/transformer.rb', line 856 def cross_entropy_grad(logits, token_ids) n_pred = token_ids.length - 1 dlogits = Mat.new(logits.nrows, logits.ncols) total_loss = 0.0 if n_pred <= 0 return LossResult.new(dlogits, 0.0) end inv_n = 1.0 / n_pred v = logits.ncols i = 0 while i < n_pred base = i * v mx = logits.flat[base] j = 1 while j < v val = logits.flat[base + j] if val > mx mx = val end j += 1 end sum = 0.0 j = 0 while j < v e = Math.exp(logits.flat[base + j] - mx) sum += e j += 1 end target = token_ids[i + 1] target_logit = logits.flat[base + target] pt = Math.exp(target_logit - mx) / sum if pt < LOG_PROB_FLOOR pt = LOG_PROB_FLOOR end total_loss -= Math.log(pt) j = 0 while j < v p = Math.exp(logits.flat[base + j] - mx) / sum dlogits.flat[base + j] = p * inv_n j += 1 end ti = base + target dlogits.flat[ti] = dlogits.flat[ti] - inv_n i += 1 end LossResult.new(dlogits, total_loss / n_pred) end |
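The identity dL/dlogits = softmax(logits) - one_hot(target) is easiest to check on a single prediction position. The sketch below uses plain arrays and hypothetical helper names (`ce_softmax`, `cross_entropy_row`); it leaves out the 1/(T-1) averaging and the LOG_PROB_FLOOR clamp that the method above applies.

```ruby
# Max-shifted softmax over one row of logits.
def ce_softmax(logits)
  mx = logits.max
  exps = logits.map { |z| Math.exp(z - mx) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

# Returns [loss, dlogits] for one prediction position.
def cross_entropy_row(logits, target)
  probs = ce_softmax(logits)
  loss = -Math.log(probs[target])
  dlogits = probs.dup
  dlogits[target] -= 1.0 # softmax minus one-hot
  [loss, dlogits]
end

loss, dlogits = cross_entropy_row([2.0, 1.0, 0.1], 0)
puts "loss=#{loss}"
p dlogits
```

Because the softmax sums to 1 and the one-hot vector sums to 1, the gradient entries for a position always sum to zero: only the target entry is negative.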
#embed(token_ids, start_pos) ⇒ Object
x = token_embed[token_ids] + pos_embed[start_pos + i]
# File 'lib/toy/models/transformer.rb', line 578
def embed(token_ids, start_pos)
  t = token_ids.length
  out = Mat.new(t, @d_model)
  i = 0
  while i < t
    tok_id = token_ids[i]
    j = 0
    while j < @d_model
      out.flat[i * @d_model + j] =
        @token_embed.flat[tok_id * @d_model + j] +
        @pos_embed.flat[(start_pos + i) * @d_model + j]
      j += 1
    end
    i += 1
  end
  out
end
#embed_backward(token_ids, dx, target_grads) ⇒ Object
Embedding backward: each row of dx routes to its token’s embedding row and to position i’s positional embedding row. Repeats accumulate.
# File 'lib/toy/models/transformer.rb', line 1295
def embed_backward(token_ids, dx, target_grads)
  t_seq = token_ids.length
  i = 0
  while i < t_seq
    # Spinel-pin: `.to_i` forces Int when this method is dead code
    # (qwen25_kv doesn't call it). Without a caller to constrain
    # `token_ids`, Spinel boxes `tok_id` as RbVal and breaks the
    # int-context use below. The explicit cast is a no-op at
    # runtime when token_ids[i] is already Int.
    tok_id = token_ids[i].to_i
    j = 0
    while j < @d_model
      pi = i * @d_model + j
      target_grads.token_embed.flat[tok_id * @d_model + j] += dx.flat[pi]
      target_grads.pos_embed.flat[pi] += dx.flat[pi]
      j += 1
    end
    i += 1
  end
end
#feed_forward(h, block) ⇒ Object
FFN: gelu(h · W_ff1) · W_ff2. Returns (out_mat, FFCache). GeLU uses the tanh approximation: 0.5 x (1 + tanh(c (x + 0.044715 x³))), c = √(2/π).
# File 'lib/toy/models/transformer.rb', line 738
def feed_forward(h, block)
  pre = h.matmul(block.w_ff1)
  hidden = Mat.new(pre.nrows, pre.ncols)
  n = pre.nrows * pre.ncols
  i = 0
  while i < n
    x = pre.flat[i]
    u = GELU_C * (x + GELU_K * x * x * x)
    hidden.flat[i] = 0.5 * x * (1.0 + Math.tanh(u))
    i += 1
  end
  out = hidden.matmul(block.w_ff2)
  FFResult.new(out, FFCache.new(pre, hidden))
end
#feed_forward_backward(d_ff_out, h, ff_cache, block, target_block) ⇒ Object
FFN backward. Writes w_ff1, w_ff2 grads into target_block. Returns d_h.
# File 'lib/toy/models/transformer.rb', line 1007 def feed_forward_backward(d_ff_out, h, ff_cache, block, target_block) t_seq = h.nrows # type hint: h is a Mat d_w_ff2 = ff_cache.hidden.t_matmul(d_ff_out) d_hidden = d_ff_out.matmul_t(block.w_ff2) # GeLU' (tanh approximation; see top-of-file GELU_* constants): # gelu(x) = 0.5 x (1 + t), t = tanh(u), u = C (x + K x³) # gelu'(x) = 0.5 (1 + t) + 0.5 x (1 - t²) · C (1 + DK x²) # where C = sqrt(2/π), K = 0.044715, DK = 3 K = 0.134145. d_pre = Mat.new(d_hidden.nrows, d_hidden.ncols) n = d_hidden.nrows * d_hidden.ncols i = 0 while i < n x = ff_cache.pre.flat[i] u = GELU_C * (x + GELU_K * x * x * x) t = Math.tanh(u) du_dx = GELU_C * (1.0 + GELU_DK * x * x) deriv = 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx d_pre.flat[i] = d_hidden.flat[i] * deriv i += 1 end d_w_ff1 = h.t_matmul(d_pre) d_h = d_pre.matmul_t(block.w_ff1) target_block.w_ff1 = d_w_ff1 target_block.w_ff2 = d_w_ff2 d_h end |
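The GeLU derivative used in this backward pass can be sanity-checked against a central finite difference. The sketch below is standalone: the `_DEMO` constants are hypothetical stand-ins for the file's GELU_C and GELU_K, set to the values quoted in the comments (√(2/π) and 0.044715).

```ruby
GELU_C_DEMO = Math.sqrt(2.0 / Math::PI)
GELU_K_DEMO = 0.044715

# Tanh-approximation GeLU: 0.5 x (1 + tanh(C (x + K x^3))).
def gelu_tanh(x)
  u = GELU_C_DEMO * (x + GELU_K_DEMO * x * x * x)
  0.5 * x * (1.0 + Math.tanh(u))
end

# Analytic derivative, following the formula in the comment above.
def gelu_tanh_deriv(x)
  u = GELU_C_DEMO * (x + GELU_K_DEMO * x * x * x)
  t = Math.tanh(u)
  du_dx = GELU_C_DEMO * (1.0 + 3.0 * GELU_K_DEMO * x * x)
  0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx
end

# Central difference approximation of d gelu / dx.
def gelu_central_diff(x, h = 1.0e-5)
  (gelu_tanh(x + h) - gelu_tanh(x - h)) / (2.0 * h)
end

[-2.0, -0.5, 0.0, 0.7, 3.0].each do |x|
  puts format("x=%5.2f analytic=%.6f numeric=%.6f",
              x, gelu_tanh_deriv(x), gelu_central_diff(x))
end
```

The two columns should agree to several decimal places. A mismatch would usually point at the 3K factor in du/dx, which the source precomputes as GELU_DK.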
#feed_forward_ffi(h, block, ffi_cache) ⇒ Object
Persistent-session FFI variant of feed_forward. Single ggml session runs the chain `mul_mat(w1_t, h) -> gelu -> mul_mat(w2_t, hidden)` in one dispatch; activations live in ggml memory between matmul1 and matmul2 (no host round-trip for GeLU).
Operand-order trick: with matmul1 = mul_mat(w1_t, h), the result’s ne0 is d_ff – which equals matmul2’s k – so the chain composes without an intermediate transpose. All three result tensors then read back as a straight row-major memcpy.
# File 'lib/toy/models/transformer.rb', line 762 def feed_forward_ffi(h, block, ffi_cache) t_seq = h.nrows d_model = h.ncols d_ff = block.w_ff1.ncols if !ffi_cache.realized ffi_cache.realize_for(t_seq, d_model, d_ff) end TinyNN.upload_row_major(ffi_cache.sess, ffi_cache.t_h, h) TinyNN.stage_transposed_and_upload(ffi_cache.sess, ffi_cache.t_w1_t, block.w_ff1) TinyNN.stage_transposed_and_upload(ffi_cache.sess, ffi_cache.t_w2_t, block.w_ff2) TinyNN.tnn_compute(ffi_cache.sess) pre = TinyNN.download_row_major(ffi_cache.sess, ffi_cache.t_pre, t_seq, d_ff) hidden = TinyNN.download_row_major(ffi_cache.sess, ffi_cache.t_hidden, t_seq, d_ff) out = TinyNN.download_row_major(ffi_cache.sess, ffi_cache.t_out, t_seq, d_model) FFResult.new(out, FFCache.new(pre, hidden)) end |
#forward(token_ids) ⇒ Object
Full forward pass. Writes intermediates into @layer_caches and @cache, which are pre-allocated so their types are unambiguous to Spinel. Returns the logits Mat (T × vocab_size).
# File 'lib/toy/models/transformer.rb', line 796
def forward(token_ids)
  cache = ForwardCache.new
  cache.token_ids = token_ids

  x = embed(token_ids, 0)
  cache. = x

  x_cur = x
  li = 0
  while li < @n_layers
    transformer_block_into(x_cur, @blocks[li], @layer_caches[li], @ffn_ffi_caches[li])
    x_cur = @layer_caches[li].x_out
    li += 1
  end
  cache.layers = @layer_caches
  cache.x_block_out = x_cur

  nr = rms_norm(x_cur, @norm_final_gamma)
  cache.x_final = nr.y
  cache.rms_final = nr.rms

  # Tied unembed: logits[t,v] = Σ_d x_final[t,d] · token_embed[v,d]
  cache.logits = nr.y.matmul_t(@token_embed)

  @cache = cache
  cache.logits
end
#generate_from_ids(start_ids, max_tokens, temperature) ⇒ Object
GENERATION — autoregressive sampling from a starting token-id list.
Tokenizing a prompt string would drag the French tokenizer (which
uses unicode_normalize and complex regex) into the Spinel-compiled
binary; instead we let the caller pre-tokenize and pass IDs.
# File 'lib/toy/models/transformer.rb', line 1393 def generate_from_ids(start_ids, max_tokens, temperature) # Anchor start_ids as an IntArray for Spinel param-type inference. n_start = start_ids.length # Copy start_ids into a fresh IntArray we'll grow. tokens = [start_ids[0]] i = 1 while i < n_start tokens.push(start_ids[i]) i += 1 end step = 0 while step < max_tokens ctx_len = tokens.length if ctx_len > @context_length ctx_len = @context_length end # Build the trailing-window context. ctx = [tokens[tokens.length - ctx_len]] j = 1 while j < ctx_len ctx.push(tokens[tokens.length - ctx_len + j]) j += 1 end logits = self.forward(ctx) next_id = self.sample_logits_row(logits, ctx_len - 1, temperature) tokens.push(next_id) step += 1 end tokens end |
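The trailing-window logic is independent of the model and can be sketched on its own. Below, `trailing_window` and `generate_demo` are hypothetical helpers, and the block passed to `generate_demo` stands in for the forward pass plus sample_logits_row; it is a toy rule, not a model.

```ruby
# Last min(tokens.length, context_length) token IDs.
def trailing_window(tokens, context_length)
  ctx_len = [tokens.length, context_length].min
  tokens[tokens.length - ctx_len, ctx_len]
end

# Autoregressive loop: the block receives the context and returns the next ID.
def generate_demo(start_ids, max_tokens, context_length)
  tokens = start_ids.dup
  max_tokens.times do
    ctx = trailing_window(tokens, context_length)
    tokens.push(yield(ctx))
  end
  tokens
end

# Toy "model": next ID is the sum of the context modulo 10.
out = generate_demo([1, 2, 3], 4, 3) { |ctx| ctx.sum % 10 }
p out
```

Once the sequence outgrows context_length, the earliest tokens silently drop out of the window, so each new token depends only on the most recent context_length IDs.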
#hsplit_heads(d_concat) ⇒ Object
Split a (T × d_model) matrix back into n_heads × (T × d_head) heads.
# File 'lib/toy/models/transformer.rb', line 980 def hsplit_heads(d_concat) t_seq = d_concat.nrows out = [Mat.new(t_seq, @d_head)] h = 1 while h < @n_heads out.push(Mat.new(t_seq, @d_head)) h += 1 end h = 0 while h < @n_heads base = h * @d_head m = out[h] i = 0 while i < t_seq j = 0 while j < @d_head m.flat[i * @d_head + j] = d_concat.flat[i * @d_model + (base + j)] j += 1 end i += 1 end h += 1 end out end |
#hstack_heads(per_head) ⇒ Object
Concatenate per-head outputs side by side: n_heads × (T × d_head) → (T × d_model)
# File 'lib/toy/models/transformer.rb', line 677 def hstack_heads(per_head) t = per_head[0].head_out.nrows out = Mat.new(t, @d_model) h = 0 while h < @n_heads head = per_head[h].head_out base = h * @d_head i = 0 while i < t j = 0 while j < @d_head out.flat[i * @d_model + (base + j)] = head.flat[i * @d_head + j] j += 1 end i += 1 end h += 1 end out end |
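hstack_heads and hsplit_heads are inverses of each other, which a round trip makes visible. This sketch uses nested arrays (rows of floats) instead of the flat Mat layout; `hstack_demo` and `hsplit_demo` are hypothetical names.

```ruby
# heads: Array of n_heads matrices, each T rows of d_head floats.
# Returns T rows of n_heads * d_head floats.
def hstack_demo(heads)
  t = heads[0].length
  Array.new(t) { |i| heads.flat_map { |head| head[i] } }
end

# Inverse: slice each row into n_heads contiguous chunks of width d_head.
def hsplit_demo(concat, n_heads)
  d_head = concat[0].length / n_heads
  Array.new(n_heads) do |h|
    concat.map { |row| row[h * d_head, d_head] }
  end
end

heads = [
  [[1.0, 2.0], [3.0, 4.0]], # head 0, T = 2, d_head = 2
  [[5.0, 6.0], [7.0, 8.0]]  # head 1
]
concat = hstack_demo(heads)
p concat
p hsplit_demo(concat, 2)
```

Head h occupies columns h * d_head through (h + 1) * d_head - 1 of the concatenated matrix, the same `base + j` offset used in the flat-array loops above.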
#rms_norm(x, gamma) ⇒ Object
RMSNorm: y_j = gamma_j * x_j / sqrt(mean(x²) + eps), per row. Returns a NormResult holding the normed Mat and the per-row rms.
# File 'lib/toy/models/transformer.rb', line 598 def rms_norm(x, gamma) eps = RMS_EPS_DEFAULT d = gamma.length t = x.nrows rms = Array.new(t, 0.0) out = Mat.new(t, d) i = 0 while i < t sumsq = 0.0 j = 0 while j < d v = x.flat[i * d + j] sumsq += v * v j += 1 end r = Math.sqrt(sumsq / d + eps) rms[i] = r j = 0 while j < d out.flat[i * d + j] = x.flat[i * d + j] * gamma[j] / r j += 1 end i += 1 end NormResult.new(out, rms) end |
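For a single row, the normalization reduces to a few lines. In this sketch `RMS_EPS_DEMO` is an assumed value (the value of RMS_EPS_DEFAULT is not shown on this page) and `rms_norm_row` is a hypothetical helper.

```ruby
RMS_EPS_DEMO = 1.0e-6 # assumption; the source constant's value is not shown

# Returns [y, r] where r = sqrt(mean(x^2) + eps) and y_j = gamma_j * x_j / r.
def rms_norm_row(x, gamma, eps = RMS_EPS_DEMO)
  d = x.length
  r = Math.sqrt(x.sum { |v| v * v } / d + eps)
  y = x.each_with_index.map { |v, j| v * gamma[j] / r }
  [y, r]
end

y, r = rms_norm_row([3.0, 4.0], [1.0, 1.0])
puts "r=#{r}"
p y
```

With gamma set to all ones, the output row has mean square very close to 1. Returning r alongside y mirrors the NormResult described above, so the backward pass can reuse it.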
#rms_norm_backward(x, gamma, rms, dy, target_dgamma) ⇒ Object
RMSNorm backward.
For y = gamma * x / r, with r = sqrt(mean(x²) + eps):
dL/dx_k = (dy_k * gamma_k - x_k * coef) / r,
coef = (Σ_j dy_j * gamma_j * x_j) / (d * r²)
dL/dgamma_j (summed over rows) += dy_j * x_j / r
`rms` is the FloatArray of per-row r values cached from the forward pass, which saves recomputing sumsq.
# File 'lib/toy/models/transformer.rb', line 922 def rms_norm_backward(x, gamma, rms, dy, target_dgamma) d = gamma.length t_seq = x.nrows dx = Mat.new(t_seq, d) i = 0 while i < t_seq r = rms[i] inner = 0.0 j = 0 while j < d inner += dy.flat[i * d + j] * gamma[j] * x.flat[i * d + j] j += 1 end coef = inner / (d * r * r) j = 0 while j < d dx.flat[i * d + j] = (dy.flat[i * d + j] * gamma[j] - x.flat[i * d + j] * coef) / r target_dgamma[j] = target_dgamma[j] + dy.flat[i * d + j] * x.flat[i * d + j] / r j += 1 end i += 1 end dx end |
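The dL/dx formula above can be verified numerically for one row. The sketch below is self-contained, with hypothetical helper names and an assumed eps of 1e-6; it uses the scalar loss L = Σ_j dy_j · y_j, so that dL/dy equals the chosen `dy` vector.

```ruby
RMS_BWD_EPS = 1.0e-6 # assumed epsilon for the demo

def rms_forward_row(x, gamma)
  d = x.length
  r = Math.sqrt(x.sum { |v| v * v } / d + RMS_BWD_EPS)
  [x.each_index.map { |j| gamma[j] * x[j] / r }, r]
end

# Analytic backward for one row, following the formulas in the doc comment.
def rms_backward_row(x, gamma, r, dy)
  d = x.length
  inner = (0...d).sum { |j| dy[j] * gamma[j] * x[j] }
  coef = inner / (d * r * r)
  dx = (0...d).map { |k| (dy[k] * gamma[k] - x[k] * coef) / r }
  dgamma = (0...d).map { |j| dy[j] * x[j] / r }
  [dx, dgamma]
end

# Scalar loss whose gradient with respect to y is exactly dy.
def rms_demo_loss(x, gamma, dy)
  y, = rms_forward_row(x, gamma)
  (0...y.length).sum { |j| dy[j] * y[j] }
end

x = [0.5, -1.2, 2.0]
gamma = [1.0, 0.8, 1.5]
dy = [0.3, -0.7, 0.1]
_, r = rms_forward_row(x, gamma)
dx, dgamma = rms_backward_row(x, gamma, r, dy)

# Central differences with respect to each x_k.
h = 1.0e-5
numeric_dx = (0...x.length).map do |k|
  xp = x.dup
  xp[k] += h
  xm = x.dup
  xm[k] -= h
  (rms_demo_loss(xp, gamma, dy) - rms_demo_loss(xm, gamma, dy)) / (2.0 * h)
end
p dx
p numeric_dx
```

The `coef` term carries the dependence of r on every x_k; dropping it would give the gradient of a fixed rescaling rather than of the normalization.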
#sample_logits_row(logits, row, temperature) ⇒ Object
Sample a token ID from row `row` of `logits` (T × vocab_size flat). If temperature <= 0, returns the argmax; otherwise samples from the temperature-scaled softmax by walking the cumulative sum. `rand(N).to_f / N` gives a uniform [0,1) under both Spinel (where bare `rand` returns C's int rand) and CRuby.
# File 'lib/toy/models/transformer.rb', line 1430 def sample_logits_row(logits, row, temperature) v = logits.ncols base = row * v if temperature <= 0.0 best_id = 0 best_val = logits.flat[base] j = 1 while j < v val = logits.flat[base + j] if val > best_val best_val = val best_id = j end j += 1 end return best_id end inv_t = 1.0 / temperature # Stable-softmax: subtract the max before exp. mx = logits.flat[base] j = 1 while j < v val = logits.flat[base + j] if val > mx mx = val end j += 1 end sum = 0.0 j = 0 while j < v sum = sum + Math.exp((logits.flat[base + j] - mx) * inv_t) j += 1 end r = (rand(1_000_000).to_f / 1_000_000.0) * sum cum = 0.0 j = 0 while j < v cum = cum + Math.exp((logits.flat[base + j] - mx) * inv_t) if r < cum return j end j += 1 end v - 1 end |
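A condensed sketch of the same sampling scheme, for one row held in a plain array. `sample_row` is a hypothetical helper, and the uniform draw `u` is passed in as an argument (rather than drawn with rand) so that the example is deterministic.

```ruby
# logits: Array of floats; u: uniform draw in [0, 1).
def sample_row(logits, temperature, u)
  # Non-positive temperature: greedy argmax (first maximum wins).
  return logits.index(logits.max) if temperature <= 0.0

  inv_t = 1.0 / temperature
  mx = logits.max
  # Unnormalized softmax weights, shifted by the max for stability.
  weights = logits.map { |z| Math.exp((z - mx) * inv_t) }
  r = u * weights.sum
  cum = 0.0
  weights.each_with_index do |w, j|
    cum += w
    return j if r < cum
  end
  logits.length - 1 # fallback for rounding at the top end
end

p sample_row([1.0, 3.0, 2.0], 0.0, 0.5)
p sample_row([0.0, 0.0], 1.0, 0.25)
p sample_row([0.0, 0.0], 1.0, 0.75)
```

Scaling the draw by the weight sum avoids a separate normalization pass: comparing `u * sum` against the running total is equivalent to comparing `u` against the cumulative probabilities.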
#self_attention(h_in, block) ⇒ Object
Multi-head self-attention. Returns AttnResult.
# File 'lib/toy/models/transformer.rb', line 699 def self_attention(h_in, block) # Force h_in's type inference via an early Mat-typed access. t_seq = h_in.nrows inv_sqrt = 1.0 / Math.sqrt(@d_head) # Build per-head caches with the seed-then-push pattern. head0 = self_attention_head(h_in, block, 0, inv_sqrt) per_head = [head0] hi = 1 while hi < @n_heads per_head.push(self_attention_head(h_in, block, hi, inv_sqrt)) hi += 1 end concat = hstack_heads(per_head) proj = concat.matmul(block.w_o) AttnResult.new(proj, AttnCache.new(per_head, concat)) end |
#self_attention_backward(d_proj, h_in, attn_cache, block, target_block) ⇒ Object
Self-attention backward. Writes per-head w_q/k/v + w_o grads into target_block. Returns d_h_in.
# File 'lib/toy/models/transformer.rb', line 1039 def self_attention_backward(d_proj, h_in, attn_cache, block, target_block) t_seq = h_in.nrows # type hint inv_sqrt = 1.0 / Math.sqrt(@d_head) # proj = concat · w_o d_w_o = attn_cache.concat.t_matmul(d_proj) d_concat = d_proj.matmul_t(block.w_o) target_block.w_o = d_w_o d_outs = self.hsplit_heads(d_concat) # Build per-head Q/K/V grads (Mat per head). Seed-then-push for typing. d_w_q_heads = [Mat.new(@d_model, @d_head)] d_w_k_heads = [Mat.new(@d_model, @d_head)] d_w_v_heads = [Mat.new(@d_model, @d_head)] h = 1 while h < @n_heads d_w_q_heads.push(Mat.new(@d_model, @d_head)) d_w_k_heads.push(Mat.new(@d_model, @d_head)) d_w_v_heads.push(Mat.new(@d_model, @d_head)) h += 1 end d_h_in = Mat.new(t_seq, @d_model) h = 0 while h < @n_heads head = attn_cache.per_head[h] d_out_h = d_outs[h] # out = attn · V d_attn = d_out_h.matmul_t(head.v) d_v = head.attn.t_matmul(d_out_h) # softmax row-wise (masked entries had attn = 0 so contribute nothing) d_scores = self.softmax_rows_backward(head.attn, d_attn) d_scores.scale!(inv_sqrt) # scores = Q · Kᵀ d_q = d_scores.matmul(head.k) d_k = d_scores.transpose.matmul(head.q) d_w_q_heads[h] = h_in.t_matmul(d_q) d_w_k_heads[h] = h_in.t_matmul(d_k) d_w_v_heads[h] = h_in.t_matmul(d_v) d_h_in.add!(d_q.matmul_t(block.w_q[h])) d_h_in.add!(d_k.matmul_t(block.w_k[h])) d_h_in.add!(d_v.matmul_t(block.w_v[h])) h += 1 end target_block.w_q = d_w_q_heads target_block.w_k = d_w_k_heads target_block.w_v = d_w_v_heads d_h_in end |
#self_attention_head(h_in, block, head_idx, inv_sqrt) ⇒ Object
# File 'lib/toy/models/transformer.rb', line 719
def self_attention_head(h_in, block, head_idx, inv_sqrt)
  q = h_in.matmul(block.w_q[head_idx])
  k = h_in.matmul(block.w_k[head_idx])
  v = h_in.matmul(block.w_v[head_idx])

  # scores = (Q · Kᵀ) / sqrt(d_head)
  scores = q.matmul_t(k)
  scores.scale!(inv_sqrt)

  apply_causal_mask!(scores, 0)
  softmax_rows!(scores)

  head_out = scores.matmul(v)
  HeadCache.new(q, k, v, scores, head_out)
end
#sgd_step_block(p_block, g_block, lr) ⇒ Object
# File 'lib/toy/models/transformer.rb', line 1163 def sgd_step_block(p_block, g_block, lr) self.sgd_step_vec(p_block.norm1_gamma, g_block.norm1_gamma, lr) self.sgd_step_vec(p_block.norm2_gamma, g_block.norm2_gamma, lr) self.sgd_step_mat(p_block.w_o, g_block.w_o, lr) self.sgd_step_mat(p_block.w_ff1, g_block.w_ff1, lr) self.sgd_step_mat(p_block.w_ff2, g_block.w_ff2, lr) h = 0 while h < @n_heads self.sgd_step_mat(p_block.w_q[h], g_block.w_q[h], lr) self.sgd_step_mat(p_block.w_k[h], g_block.w_k[h], lr) self.sgd_step_mat(p_block.w_v[h], g_block.w_v[h], lr) h += 1 end end |
#sgd_step_mat(p, g, lr) ⇒ Object
# File 'lib/toy/models/transformer.rb', line 1145
def sgd_step_mat(p, g, lr)
  n = p.flat.length
  i = 0
  while i < n
    p.flat[i] -= lr * g.flat[i]
    i += 1
  end
end
#sgd_step_vec(p, g, lr) ⇒ Object
# File 'lib/toy/models/transformer.rb', line 1154
def sgd_step_vec(p, g, lr)
  n = p.length
  i = 0
  while i < n
    p[i] -= lr * g[i]
    i += 1
  end
end
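Both SGD helpers apply the same in-place rule, p[i] ← p[i] − lr · g[i], to every coordinate. A minimal stand-alone sketch on a plain Array, where `sgd_step` is a hypothetical name and the values are chosen so the arithmetic is exact in floating point:

```ruby
# Hypothetical stand-alone version of the per-coordinate SGD update.
# Mutates `params` in place, as sgd_step_vec does above.
def sgd_step(params, grads, lr)
  params.each_index { |i| params[i] -= lr * grads[i] }
  params
end

params = [1.0, -2.0, 0.5]
grads  = [0.5, -1.0, 0.0]
sgd_step(params, grads, 0.5)
# params is now [0.75, -1.5, 0.5]
```

A coordinate with zero gradient is left unchanged, and the update moves each parameter against the sign of its gradient.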
#softmax_rows!(m) ⇒ Object
Row-wise softmax with numerical-stability max-shift, in place on `m`.
# File 'lib/toy/models/transformer.rb', line 628
def softmax_rows!(m)
  t = m.nrows
  n = m.ncols
  i = 0
  while i < t
    base = i * n
    mx = m.flat[base]
    j = 1
    while j < n
      v = m.flat[base + j]
      if v > mx
        mx = v
      end
      j += 1
    end
    sum = 0.0
    j = 0
    while j < n
      e = Math.exp(m.flat[base + j] - mx)
      m.flat[base + j] = e
      sum += e
      j += 1
    end
    j = 0
    while j < n
      m.flat[base + j] = m.flat[base + j] / sum
      j += 1
    end
    i += 1
  end
end
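The max-shift matters because `Math.exp` overflows to `Infinity` for large inputs, while subtracting the row maximum leaves the result mathematically unchanged. A small sketch of the effect on one row, where `softmax_row` is a hypothetical Array-based helper rather than part of TransformerLM:

```ruby
# Hypothetical single-row softmax using the same max-shift as softmax_rows!.
def softmax_row(row)
  mx = row.max
  exps = row.map { |x| Math.exp(x - mx) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

# Without the shift, both exponentials overflow to Infinity.
naive = [1000.0, 1001.0].map { |x| Math.exp(x) }

# With the shift, the largest exponent is exp(0) = 1, so nothing overflows.
shifted  = softmax_row([1000.0, 1001.0])
baseline = softmax_row([0.0, 1.0])
```

Here `shifted` and `baseline` are equal, since the two rows differ only by a constant offset.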
#softmax_rows_backward(softmax_out, d_softmax) ⇒ Object
Row-wise softmax backward (for attention).
d_scores[i,j] = attn[i,j] * (d_attn[i,j] - Σk attn[i,k]·d_attn[i,k])
# File 'lib/toy/models/transformer.rb', line 955
def softmax_rows_backward(softmax_out, d_softmax)
  t_seq = softmax_out.nrows
  n = softmax_out.ncols
  out = Mat.new(t_seq, n)
  i = 0
  while i < t_seq
    base = i * n
    s = 0.0
    j = 0
    while j < n
      s += softmax_out.flat[base + j] * d_softmax.flat[base + j]
      j += 1
    end
    j = 0
    while j < n
      out.flat[base + j] = softmax_out.flat[base + j] * (d_softmax.flat[base + j] - s)
      j += 1
    end
    i += 1
  end
  out
end
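One way to sanity-check the backward formula is a central finite-difference comparison on a single row. The sketch below assumes a scalar loss of the form Σj d_y[j]·softmax(x)[j]; `softmax_row` and `softmax_row_backward` are hypothetical Array-based helpers, not the `Mat` methods above.

```ruby
# Hypothetical single-row softmax (max-shifted).
def softmax_row(row)
  mx = row.max
  exps = row.map { |x| Math.exp(x - mx) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

# Analytic backward for one row: d_x[j] = y[j] * (d_y[j] - Σk y[k]·d_y[k])
def softmax_row_backward(y, d_y)
  s = y.zip(d_y).sum { |a, b| a * b }
  y.zip(d_y).map { |a, b| a * (b - s) }
end

x   = [0.3, -1.2, 0.8]
d_y = [1.0, -0.5, 2.0]
analytic = softmax_row_backward(softmax_row(x), d_y)

# Numeric gradient of loss(x) = Σj d_y[j] * softmax(x)[j] by central differences.
eps = 1e-6
numeric = x.each_index.map do |i|
  plus = x.dup
  plus[i] += eps
  minus = x.dup
  minus[i] -= eps
  loss_plus  = softmax_row(plus).zip(d_y).sum { |a, b| a * b }
  loss_minus = softmax_row(minus).zip(d_y).sum { |a, b| a * b }
  (loss_plus - loss_minus) / (2 * eps)
end

max_err = analytic.zip(numeric).map { |a, n| (a - n).abs }.max
```

The analytic gradient also sums to approximately zero across the row, which follows from softmax being invariant to adding a constant to every input.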
#transformer_block(x, block) ⇒ Object
One transformer block (pre-norm). Returns BlockResult. Locals are explicit so Spinel can type-trace argument types into the called methods (passing `nr1.y` directly through doesn't propagate).
# File 'lib/toy/models/transformer.rb', line 1360
def transformer_block(x, block)
  cache = LayerCache.new

  nr1 = rms_norm(x, block.norm1_gamma)
  h1 = nr1.y
  cache.h_norm1 = h1
  cache.rms1 = nr1.rms

  sa = self_attention(h1, block)
  cache.attn_cache = sa.cache
  x_attn = x.plus(sa.proj)
  cache.x_attn = x_attn

  nr2 = rms_norm(x_attn, block.norm2_gamma)
  h2 = nr2.y
  cache.h_norm2 = h2
  cache.rms2 = nr2.rms

  ff = feed_forward(h2, block)
  cache.ff_cache = ff.cache
  x_out = x_attn.plus(ff.out)
  cache.x_out = x_out

  BlockResult.new(x_out, cache)
end
#transformer_block_backward(dx_out, x_in, block, layer_cache, target_block_grads) ⇒ Object
Backward through one block. Writes grads into target_block_grads. Returns d_x_in (Mat).
# File 'lib/toy/models/transformer.rb', line 1100
def transformer_block_backward(dx_out, x_in, block, layer_cache, target_block_grads)
  # x_in is only passed as an arg below — never accessed directly. Spinel's
  # body-usage parameter inference needs at least one method call to type
  # the param. `.nrows` is a Mat-only method, so this anchors x_in's type.
  _x_t = x_in.nrows

  # FFN sublayer residual: x_out = x_attn + ff_out → grad flows to both branches.
  d_h_norm2 = self.feed_forward_backward(dx_out, layer_cache.h_norm2,
                                         layer_cache.ff_cache, block, target_block_grads)
  d_x_attn_via_norm = self.rms_norm_backward(layer_cache.x_attn, block.norm2_gamma,
                                             layer_cache.rms2, d_h_norm2,
                                             target_block_grads.norm2_gamma)
  d_x_attn = dx_out.plus(d_x_attn_via_norm)

  # Attention sublayer residual: x_attn = x_in + attn_proj.
  d_h_norm1 = self.self_attention_backward(d_x_attn, layer_cache.h_norm1,
                                           layer_cache.attn_cache, block, target_block_grads)
  d_x_in_via_norm = self.rms_norm_backward(x_in, block.norm1_gamma,
                                           layer_cache.rms1, d_h_norm1,
                                           target_block_grads.norm1_gamma)
  d_x_attn.plus(d_x_in_via_norm)
end
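The residual pattern in this backward pass (the incoming gradient is added to the gradient that flows through the sublayer) can be illustrated with scalars. In the sketch below, `f` and `g` are hypothetical stand-ins for the norm + attention and norm + FFN sublayers; they are not the real sublayers, only simple functions with known derivatives.

```ruby
# Scalar stand-ins (hypothetical) for the two sublayers and their derivatives.
def f(x)
  x * x         # stands in for attention(rms_norm(x))
end

def f_grad(x)
  2.0 * x
end

def g(x)
  2.0 * x       # stands in for feed_forward(rms_norm(x))
end

def g_grad(_x)
  2.0
end

# Forward: two stacked residual connections, as in transformer_block.
def block_forward(x_in)
  x_attn = x_in + f(x_in)
  x_out  = x_attn + g(x_attn)
  [x_attn, x_out]
end

# Backward: each residual adds the upstream gradient to the sublayer gradient.
def block_backward(d_x_out, x_in, x_attn)
  d_x_attn = d_x_out + g_grad(x_attn) * d_x_out
  d_x_attn + f_grad(x_in) * d_x_attn
end

x_in = 1.5
x_attn, x_out = block_forward(x_in)
d_x_in = block_backward(1.0, x_in, x_attn)
# x_out = 3x + 3x^2 here, so d_x_in should be 3 + 6x = 12.0 at x = 1.5
```

The closed form 3 + 6x confirms the two-step accumulation: omitting either residual term would drop part of the gradient.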
#transformer_block_into(x, block, cache, ffi_cache) ⇒ Object
Same as transformer_block but writes into a pre-existing LayerCache. ffi_cache is the persistent-session FFNFFICache for this block; used only when USE_FFI_MATMUL is true.
# File 'lib/toy/models/transformer.rb', line 828
def transformer_block_into(x, block, cache, ffi_cache)
  nr1 = rms_norm(x, block.norm1_gamma)
  h1 = nr1.y
  cache.h_norm1 = h1
  cache.rms1 = nr1.rms

  sa = self_attention(h1, block)
  cache.attn_cache = sa.cache
  x_attn = x.plus(sa.proj)
  cache.x_attn = x_attn

  nr2 = rms_norm(x_attn, block.norm2_gamma)
  h2 = nr2.y
  cache.h_norm2 = h2
  cache.rms2 = nr2.rms

  if USE_FFI_MATMUL
    ff = feed_forward_ffi(h2, block, ffi_cache)
  else
    ff = feed_forward(h2, block)
  end
  cache.ff_cache = ff.cache
  x_out = x_attn.plus(ff.out)
  cache.x_out = x_out
end
#x_in_for_layer(li) ⇒ Object
No `train_step` here: Spinel compiles every class method whether or not it has callers. With no callers in the current program its IntArray param defaults to `mrb_int`, and the body's `forward(seq_ids)` then fails to type-check. Each driver inlines the forward / backward / optimizer-step sequence at its top level, which is short and makes the per-step cost obvious.
Block i's input is the previous block's output, or the embedded input for block 0.
# File 'lib/toy/models/transformer.rb', line 1289
def x_in_for_layer(li)
  li == 0 ? @cache. : @cache.layers[li - 1].x_out
end