Class: FullForwardFFICacheCuda
- Inherits: Object
  - Object
    - FullForwardFFICacheCuda
- Defined in: lib/toy/ffi/tinynn_cuda.rb
Overview
Full forward of a TransformerLM as one persistent ggml graph. Built incrementally; M1.1 covered embed + positional embedding + tied unembed (the bookends). M1.2 adds one full transformer block: pre-RMSNorm, multi-head causal attention, residual, pre-RMSNorm, FFN, residual. M1.3+ will scale to n_layers blocks.
Layout conventions (see project_chained_ffn_2026_05_14):
- Mat (rows, cols) row-major upload -> ggml ne=[cols, rows]
- Per-block intermediates carry ne=[d_model, T]: elem(d, t) is the logical value at (row=t, col=d).
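The mapping between a row-major Mat and a ggml `ne` shape can be checked with plain index arithmetic. The sketch below assumes ggml's contiguous layout, where `ne0` is the fastest-varying dimension; the helper names are hypothetical and not part of TinyNNCuda.

```ruby
# Hypothetical helpers for illustration; not part of TinyNNCuda.

# Flat offset of logical (row, col) in a row-major Mat with `cols` columns.
def row_major_offset(row, col, cols)
  row * cols + col
end

# Flat offset of element (i0, i1) in a contiguous 2-D tensor with ne = [ne0, ne1].
# Assumption: ne0 is the fastest-varying dimension.
def ggml_offset(i0, i1, ne0)
  i0 + i1 * ne0
end

rows = 3            # e.g. T
cols = 4            # e.g. d_model
ne = [cols, rows]   # Mat (rows, cols) -> ggml ne=[cols, rows]

mismatches = 0
rows.times do |r|
  cols.times do |c|
    # Logical (row=r, col=c) is read back as ggml element (i0=c, i1=r).
    mismatches += 1 unless row_major_offset(r, c, cols) == ggml_offset(c, r, ne[0])
  end
end
puts "mismatches: #{mismatches}"
```

Under that assumption no bytes move during upload; only the order in which the two dimensions are named changes.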
Persistent (ctx_w):
- t_token_embed (vocab, d_model)
- t_pos_slice (T, d_model)
- t_final_norm_gamma (d_model)
- per block (in @blocks_ffi):
  - t_norm1_gamma, t_norm2_gamma (d_model)
  - t_w_q[h], t_w_k[h], t_w_v[h] (d_model, d_head) per head
  - t_w_o (d_model, d_model)
  - t_w_ff1 (d_model, d_ff), t_w_ff2 (d_ff, d_model)
Compute (ctx): t_token_ids (T int32), intermediates, t_logits
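To get a feel for how much persistent storage this inventory implies, the logical shapes can be enumerated in plain Ruby. This is a minimal sketch: `persistent_shapes` and its key names are hypothetical, and it only mirrors the list above rather than querying any real session.

```ruby
# Hypothetical helper: mirrors the persistent tensor inventory using the
# logical (rows, cols) shapes from the overview. It allocates nothing.
def persistent_shapes(t_seq:, d_model:, d_ff:, n_heads:, n_layers:, vocab_size:)
  d_head = d_model / n_heads
  shapes = {
    "t_token_embed" => [vocab_size, d_model],
    "t_pos_slice" => [t_seq, d_model],
    "t_final_norm_gamma" => [d_model]
  }
  n_layers.times do |li|
    shapes["block#{li}.t_norm1_gamma"] = [d_model]
    shapes["block#{li}.t_norm2_gamma"] = [d_model]
    n_heads.times do |h|
      %w[t_w_q t_w_k t_w_v].each do |name|
        shapes["block#{li}.#{name}[#{h}]"] = [d_model, d_head]
      end
    end
    shapes["block#{li}.t_w_o"] = [d_model, d_model]
    shapes["block#{li}.t_w_ff1"] = [d_model, d_ff]
    shapes["block#{li}.t_w_ff2"] = [d_ff, d_model]
  end
  shapes
end

shapes = persistent_shapes(t_seq: 8, d_model: 16, d_ff: 64,
                           n_heads: 4, n_layers: 2, vocab_size: 100)
total = shapes.values.sum { |dims| dims.reduce(:*) }
puts "#{shapes.size} tensors, #{total} float32 elements"
```

Because the unembed is tied to t_token_embed, no separate output-projection matrix appears in the count.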
Instance Attribute Summary
- #blocks_ffi ⇒ Object
  Returns the value of attribute blocks_ffi.
- #d_ff ⇒ Object
  Returns the value of attribute d_ff.
- #d_head ⇒ Object
  Returns the value of attribute d_head.
- #d_model ⇒ Object
  Returns the value of attribute d_model.
- #n_heads ⇒ Object
  Returns the value of attribute n_heads.
- #n_layers ⇒ Object
  Returns the value of attribute n_layers.
- #realized ⇒ Object
  Returns the value of attribute realized.
- #sess ⇒ Object
  Returns the value of attribute sess.
- #t_final_norm_gamma ⇒ Object
  Returns the value of attribute t_final_norm_gamma.
- #t_logits ⇒ Object
  Returns the value of attribute t_logits.
- #t_pos_slice ⇒ Object
  Returns the value of attribute t_pos_slice.
- #t_seq ⇒ Object
  Returns the value of attribute t_seq.
- #t_token_embed ⇒ Object
  Returns the value of attribute t_token_embed.
- #t_token_ids ⇒ Object
  Returns the value of attribute t_token_ids.
- #t_x_embed ⇒ Object
  Returns the value of attribute t_x_embed.
- #t_x_final ⇒ Object
  Returns the value of attribute t_x_final.
- #vocab_size ⇒ Object
  Returns the value of attribute vocab_size.
Instance Method Summary
- #build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object
  Single attention head, given pre-normed x and the head’s persistent Q/K/V weights.
- #build_block(t_x, blk, eps, scale) ⇒ Object
  Build one transformer block’s graph nodes.
- #initialize ⇒ FullForwardFFICacheCuda (constructor)
  A new instance of FullForwardFFICacheCuda.
- #realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object
Constructor Details
#initialize ⇒ FullForwardFFICacheCuda
Returns a new instance of FullForwardFFICacheCuda.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 927

def initialize
  @realized = false
  @t_seq = 0
  @d_model = 0
  @d_ff = 0
  @n_heads = 0
  @d_head = 0
  @n_layers = 0
  @vocab_size = 0
  @sess = TinyNNCuda.tnn_null_ptr
  @t_token_embed = TinyNNCuda.tnn_null_ptr
  @t_pos_slice = TinyNNCuda.tnn_null_ptr
  @t_token_ids = TinyNNCuda.tnn_null_ptr
  @t_final_norm_gamma = TinyNNCuda.tnn_null_ptr
  @t_x_embed = TinyNNCuda.tnn_null_ptr
  @t_x_final = TinyNNCuda.tnn_null_ptr
  @t_logits = TinyNNCuda.tnn_null_ptr
  @blocks_ffi = [BlockFFICacheCuda.new]
end
Instance Attribute Details
#blocks_ffi ⇒ Object
Returns the value of attribute blocks_ffi.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def blocks_ffi
  @blocks_ffi
end
#d_ff ⇒ Object
Returns the value of attribute d_ff.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def d_ff
  @d_ff
end
#d_head ⇒ Object
Returns the value of attribute d_head.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def d_head
  @d_head
end
#d_model ⇒ Object
Returns the value of attribute d_model.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def d_model
  @d_model
end
#n_heads ⇒ Object
Returns the value of attribute n_heads.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def n_heads
  @n_heads
end
#n_layers ⇒ Object
Returns the value of attribute n_layers.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def n_layers
  @n_layers
end
#realized ⇒ Object
Returns the value of attribute realized.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def realized
  @realized
end
#sess ⇒ Object
Returns the value of attribute sess.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def sess
  @sess
end
#t_final_norm_gamma ⇒ Object
Returns the value of attribute t_final_norm_gamma.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def t_final_norm_gamma
  @t_final_norm_gamma
end
#t_logits ⇒ Object
Returns the value of attribute t_logits.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def t_logits
  @t_logits
end
#t_pos_slice ⇒ Object
Returns the value of attribute t_pos_slice.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def t_pos_slice
  @t_pos_slice
end
#t_seq ⇒ Object
Returns the value of attribute t_seq.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def t_seq
  @t_seq
end
#t_token_embed ⇒ Object
Returns the value of attribute t_token_embed.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def t_token_embed
  @t_token_embed
end
#t_token_ids ⇒ Object
Returns the value of attribute t_token_ids.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def t_token_ids
  @t_token_ids
end
#t_x_embed ⇒ Object
Returns the value of attribute t_x_embed.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def t_x_embed
  @t_x_embed
end
#t_x_final ⇒ Object
Returns the value of attribute t_x_final.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def t_x_final
  @t_x_final
end
#vocab_size ⇒ Object
Returns the value of attribute vocab_size.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def vocab_size
  @vocab_size
end
Instance Method Details
#build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object
Single attention head, given pre-normed x and the head’s persistent Q/K/V weights. See build_block’s docstring for the math.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 1087

def build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale)
  t_q = TinyNNCuda.tnn_matmul(@sess, t_w_q, t_x)  # ne=[d_head, T]
  t_k = TinyNNCuda.tnn_matmul(@sess, t_w_k, t_x)  # ne=[d_head, T]
  # v in Pattern A (ne=[T, d_head]) so head_out's k_dim matches.
  # mul_mat(x, w_v_t) where x.ne=[d_model, T] and w_v_t.ne=[d_model, d_head]
  # yields ne=[T, d_head]. ✓
  t_v = TinyNNCuda.tnn_matmul(@sess, t_x, t_w_v)
  t_scores = TinyNNCuda.tnn_matmul(@sess, t_k, t_q)  # ne=[T_key, T_query]
  t_scaled = TinyNNCuda.tnn_scale(@sess, t_scores, scale)
  t_masked = TinyNNCuda.tnn_diag_mask_inf(@sess, t_scaled, 0)
  t_attn = TinyNNCuda.tnn_softmax(@sess, t_masked)  # softmax along ne0 = key dim
  TinyNNCuda.tnn_matmul(@sess, t_v, t_attn)  # ne=[d_head, T_query]
end
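For checking a head's output against something independent of the graph, a plain-Ruby reference of the same math can be useful. The sketch below is an assumption-laden stand-in: nested Arrays replace tensors, weights use the logical (d_model, d_head) shape, and `matmul`, `softmax`, and `causal_attention_head` are hypothetical names that do not exist in TinyNNCuda.

```ruby
# Hypothetical plain-Ruby reference; nested Arrays stand in for tensors.

# a is (n, k), b is (k, m); returns (n, m).
def matmul(a, b)
  bt = b.transpose
  a.map { |row| bt.map { |col| row.zip(col).sum { |p, q| p * q } } }
end

# Numerically stable softmax over one row; -Infinity entries become 0.0.
def softmax(row)
  m = row.max
  exps = row.map { |v| Math.exp(v - m) }
  s = exps.sum
  exps.map { |e| e / s }
end

# x is (T, d_model); w_q, w_k, w_v are (d_model, d_head). Returns (T, d_head).
def causal_attention_head(x, w_q, w_k, w_v)
  scale = 1.0 / Math.sqrt(w_q[0].length.to_f)
  q = matmul(x, w_q)
  k = matmul(x, w_k)
  v = matmul(x, w_v)
  attn = q.each_with_index.map do |q_row, tq|
    scores = k.each_with_index.map do |k_row, tk|
      next -Float::INFINITY if tk > tq  # causal mask: no attention to future keys
      q_row.zip(k_row).sum { |p, r| p * r } * scale
    end
    softmax(scores)                     # per-query softmax over keys
  end
  matmul(attn, v)
end

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
out = causal_attention_head(x, identity, identity, identity)
puts out.inspect
```

The first query can only see the first key, so its output row equals the first value row; later rows are convex combinations of the value rows seen so far.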
#build_block(t_x, blk, eps, scale) ⇒ Object
Build one transformer block’s graph nodes. Returns the block’s output tensor (post-FFN residual). Mathematics:
  h1 = rms_norm(x, norm1_gamma)
  per head h:
    q_h        = w_q[h]^T @ h1    (mul_mat(w_q_t_h, h1)  ne=[d_head, T])
    k_h        = w_k[h]^T @ h1
    v_h        = h1 @ w_v[h]      (mul_mat(h1, w_v_t_h)  ne=[T, d_head])
    scores_h   = mul_mat(k_h, q_h)                ne=[T_key, T_query]
    scaled_h   = scale(scores_h, 1/sqrt(d_head))
    masked_h   = diag_mask_inf(scaled_h, 0)       -- causal
    attn_h     = soft_max(masked_h)               -- per-query softmax over keys
    head_out_h = mul_mat(v_h, attn_h)             ne=[d_head, T_query]
  concat   = concat_along_d(head_out_h for h in heads)   ne=[d_model, T]
  out_proj = mul_mat(w_o_t, concat)                      ne=[d_model, T]
  x_attn   = x + out_proj
  h2 = rms_norm(x_attn, norm2_gamma)
  ffn:
    pre     = mul_mat(w_ff1_t, h2)      ne=[d_ff, T]
    hidden  = gelu(pre)
    ffn_out = mul_mat(w_ff2_t, hidden)  ne=[d_model, T]
  x_out = x_attn + ffn_out
# File 'lib/toy/ffi/tinynn_cuda.rb', line 1049

def build_block(t_x, blk, eps, scale)
  # Pre-norm before attention.
  t_h1 = TinyNNCuda.tnn_rms_norm(@sess, t_x, blk.t_norm1_gamma, eps)

  # Per-head attention. Build each head's output, then concat.
  t_head_outs = [build_attention_head(t_h1, blk.t_w_q[0], blk.t_w_k[0], blk.t_w_v[0], scale)]
  h = 1
  while h < @n_heads
    t_head_outs.push(build_attention_head(t_h1, blk.t_w_q[h], blk.t_w_k[h], blk.t_w_v[h], scale))
    h = h + 1
  end

  # Concat along ne0 (d_head -> d_model).
  t_concat = t_head_outs[0]
  h = 1
  while h < @n_heads
    t_concat = TinyNNCuda.tnn_concat(@sess, t_concat, t_head_outs[h], 0)
    h = h + 1
  end

  # Output projection + residual.
  t_out_proj = TinyNNCuda.tnn_matmul(@sess, blk.t_w_o, t_concat)
  t_x_attn = TinyNNCuda.tnn_add(@sess, t_x, t_out_proj)

  # Pre-norm before FFN.
  t_h2 = TinyNNCuda.tnn_rms_norm(@sess, t_x_attn, blk.t_norm2_gamma, eps)

  # FFN (matches FFNFFICache's chained design).
  t_pre = TinyNNCuda.tnn_matmul(@sess, blk.t_w_ff1, t_h2)
  t_hidden = TinyNNCuda.tnn_gelu(@sess, t_pre)
  t_ffn = TinyNNCuda.tnn_matmul(@sess, blk.t_w_ff2, t_hidden)

  # Second residual.
  TinyNNCuda.tnn_add(@sess, t_x_attn, t_ffn)
end
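The `ne` annotations in the block's docstring all follow from one shape rule, which can be traced without touching the GPU. The sketch below assumes the ggml convention that mul_mat(a, b) requires a.ne[0] == b.ne[0] and yields ne=[a.ne[1], b.ne[1]]; `mul_mat_ne` is a hypothetical shape-only helper, and the dimensions are arbitrary illustrative values.

```ruby
# Hypothetical shape-only model of mul_mat; it computes no values.
# Assumed rule: a.ne[0] must equal b.ne[0]; result is ne=[a.ne[1], b.ne[1]].
def mul_mat_ne(a_ne, b_ne)
  raise ArgumentError, "inner dims differ: #{a_ne} vs #{b_ne}" unless a_ne[0] == b_ne[0]
  [a_ne[1], b_ne[1]]
end

d_model = 16
d_ff = 64
n_heads = 4
t = 8
d_head = d_model / n_heads

x = [d_model, t]                    # block input, ne=[d_model, T]
w_head = [d_model, d_head]          # per-head Q/K/V weight as stored in ggml

q = mul_mat_ne(w_head, x)           # ne=[d_head, T]
k = mul_mat_ne(w_head, x)           # ne=[d_head, T]
v = mul_mat_ne(x, w_head)           # ne=[T, d_head]
scores = mul_mat_ne(k, q)           # ne=[T_key, T_query]
head_out = mul_mat_ne(v, scores)    # ne=[d_head, T_query]

concat = [head_out[0] * n_heads, head_out[1]]          # concat along ne0
out_proj = mul_mat_ne([d_model, d_model], concat)      # ne=[d_model, T]
pre = mul_mat_ne([d_model, d_ff], out_proj)            # ne=[d_ff, T]
ffn_out = mul_mat_ne([d_ff, d_model], pre)             # ne=[d_model, T]

puts "head_out=#{head_out} concat=#{concat} pre=#{pre} ffn_out=#{ffn_out}"
```

Swapping the operands of the value projection is what makes `head_out` line up: v has to carry T in ne0 so that it contracts against the key dimension of the attention weights.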
#realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object
# File 'lib/toy/ffi/tinynn_cuda.rb', line 947

def realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size)
  @t_seq = t_seq
  @d_model = d_model
  @d_ff = d_ff
  @n_heads = n_heads
  @d_head = d_model / n_heads
  @n_layers = n_layers
  @vocab_size = vocab_size

  @sess = TinyNNCuda.tnn_session_new(1)

  # === Persistent weights (ctx_w) ===
  @t_token_embed = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, vocab_size, d_model)
  @t_pos_slice = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, t_seq, d_model)
  @t_final_norm_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)

  # Build per-block tensor handles (seed-then-push for Spinel's
  # Array<BlockFFICache> inference).
  @blocks_ffi = [BlockFFICacheCuda.new]
  li = 1
  while li < n_layers
    @blocks_ffi.push(BlockFFICacheCuda.new)
    li = li + 1
  end

  li = 0
  while li < n_layers
    blk = @blocks_ffi[li]
    blk.t_norm1_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)
    blk.t_norm2_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)
    # Per-head Q/K/V: shape (d_model, d_head). Uploaded TRANSPOSED so
    # ggml ne=[d_model, d_head] holds w.elem(r, c) = w[r][c].
    blk.t_w_q = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_k = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_v = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    h = 1
    while h < n_heads
      blk.t_w_q.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_k.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_v.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      h = h + 1
    end
    blk.t_w_o = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_model, d_model)
    blk.t_w_ff1 = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_ff, d_model)
    blk.t_w_ff2 = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_model, d_ff)
    li = li + 1
  end

  TinyNNCuda.tnn_finalize_weights(@sess)

  # === Compute input ===
  @t_token_ids = TinyNNCuda.tnn_input_1d_i32(@sess, t_seq)

  # === Forward graph ===
  # x_embed = token_embed[ids] + pos_slice   (ne=[d_model, T])
  # NOTE: the local name t_tok_embed is a placeholder; the identifier is
  # missing from the rendered source.
  t_tok_embed = TinyNNCuda.tnn_get_rows(@sess, @t_token_embed, @t_token_ids)
  @t_x_embed = TinyNNCuda.tnn_add(@sess, t_tok_embed, @t_pos_slice)
  TinyNNCuda.tnn_set_output(@t_x_embed)

  # Through each block.
  t_cur = @t_x_embed
  eps = 1.0e-5
  scale = 1.0 / Math.sqrt(d_head.to_f)
  li = 0
  while li < n_layers
    t_cur = build_block(t_cur, @blocks_ffi[li], eps, scale)
    li = li + 1
  end

  # Final RMSNorm on the post-blocks x.
  @t_x_final = TinyNNCuda.tnn_rms_norm(@sess, t_cur, @t_final_norm_gamma, eps)
  TinyNNCuda.tnn_set_output(@t_x_final)

  # Tied unembed: logits = mul_mat(token_embed, x_final)  ne=[vocab, T]
  @t_logits = TinyNNCuda.tnn_matmul(@sess, @t_token_embed, @t_x_final)
  TinyNNCuda.tnn_set_output(@t_logits)

  TinyNNCuda.tnn_realize(@sess, @t_logits)
  @realized = true
end
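realize_for derives d_head with integer division, so a d_model that is not a multiple of n_heads would silently truncate, and the concatenated heads would no longer add up to d_model ahead of the output projection. The rendered source shows no such check; a caller-side guard might look like the following sketch, where `attention_dims` is a hypothetical helper and not part of the class.

```ruby
# Hypothetical caller-side guard; not part of FullForwardFFICacheCuda.
# Returns the per-head width and the attention scale used by realize_for,
# or raises if the heads would not tile d_model exactly.
def attention_dims(d_model, n_heads)
  unless n_heads.positive? && (d_model % n_heads).zero?
    raise ArgumentError, "d_model (#{d_model}) must be a positive multiple of n_heads (#{n_heads})"
  end
  d_head = d_model / n_heads
  { d_head: d_head, scale: 1.0 / Math.sqrt(d_head.to_f) }
end

dims = attention_dims(16, 4)
puts "d_head=#{dims[:d_head]} scale=#{dims[:scale]}"
```

Running the check before calling realize_for keeps a shape mismatch from surfacing later, during graph construction.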