Class: FullForwardFFICacheCuda
- Inherits: Object
- Hierarchy: Object, FullForwardFFICacheCuda
- Defined in: lib/toy/ffi/tinynn_cuda.rb
Overview
Full forward pass of a TransformerLM as one persistent ggml graph, built incrementally. M1.1 covered the bookends: token embedding, positional embedding, and tied unembed. M1.2 adds one full transformer block: pre-RMSNorm, multi-head causal attention, residual, pre-RMSNorm, FFN, residual. M1.3+ will scale to n_layers blocks.
Layout conventions (see project_chained_ffn_2026_05_14):
- Mat (rows, cols) row-major upload -> ggml ne=[cols, rows]
- Per-block intermediates carry ne=[d_model, T]: elem(d, t) is the logical value at (row=t, col=d).
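The two layout rules above describe one flat buffer read two ways. A minimal sketch in plain Ruby, assuming a contiguous ggml tensor in which ne0 is the fastest-varying dimension; `flat_index_row_major` and `flat_index_ggml` are hypothetical helpers for illustration, not part of TinyNNCuda:

```ruby
# Hypothetical helpers illustrating the layout convention; not part of TinyNNCuda.

# Row-major Mat (rows, cols): element (r, c) lives at r * cols + c.
def flat_index_row_major(r, c, cols)
  r * cols + c
end

# ggml tensor with ne = [ne0, ne1]: element (i0, i1) lives at i0 + ne0 * i1,
# assuming a contiguous tensor where ne0 varies fastest.
def flat_index_ggml(i0, i1, ne0)
  i0 + ne0 * i1
end

rows = 3                     # e.g. T = 3
cols = 4                     # e.g. d_model = 4
ne = [cols, rows]            # the same buffer as seen from ggml

# Every (row, col) of the Mat maps to ggml index (i0 = col, i1 = row).
layout_ok = (0...rows).all? do |r|
  (0...cols).all? do |c|
    flat_index_row_major(r, c, cols) == flat_index_ggml(c, r, ne[0])
  end
end
puts layout_ok
```

Under this reading, a row-major upload needs no data movement: only the order in which the two dimensions are named changes.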
Persistent (ctx_w):
- t_token_embed (vocab, d_model)
- t_pos_slice (T, d_model)
- t_final_norm_gamma (d_model)
- per block (in @blocks_ffi):
  - t_norm1_gamma, t_norm2_gamma (d_model)
  - t_w_q[h], t_w_k[h], t_w_v[h] (d_model, d_head) per head
  - t_w_o (d_model, d_model)
  - t_w_ff1 (d_model, d_ff), t_w_ff2 (d_ff, d_model)
Compute (ctx): t_token_ids (T int32), intermediates, t_logits
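As a rough sizing aid, the shapes listed above can be summed to estimate how many f32 elements the persistent context holds. This is a sketch under the assumption that the list is complete; `persistent_element_count` is a hypothetical helper, not part of TinyNNCuda, and the dimensions in the usage line are made up for illustration:

```ruby
# Hypothetical helper: number of f32 elements held in the persistent context,
# derived only from the shapes listed in the overview. Not part of TinyNNCuda.
def persistent_element_count(t_seq:, d_model:, d_ff:, n_heads:, n_layers:, vocab_size:)
  d_head = d_model / n_heads

  # t_token_embed + t_pos_slice + t_final_norm_gamma
  bookends = vocab_size * d_model + t_seq * d_model + d_model

  # Two norm gammas, Q/K/V per head, output projection, two FFN matrices.
  per_block = 2 * d_model +
              3 * n_heads * d_model * d_head +
              d_model * d_model +
              2 * d_model * d_ff

  bookends + n_layers * per_block
end

count = persistent_element_count(
  t_seq: 8, d_model: 16, d_ff: 64, n_heads: 4, n_layers: 2, vocab_size: 32
)
puts count
```

Multiplying the result by 4 gives an approximate byte count for f32 storage, before any allocator padding.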
Instance Attribute Summary
- #blocks_ffi ⇒ Object
  Returns the value of attribute blocks_ffi.
- #d_ff ⇒ Object
  Returns the value of attribute d_ff.
- #d_head ⇒ Object
  Returns the value of attribute d_head.
- #d_model ⇒ Object
  Returns the value of attribute d_model.
- #n_heads ⇒ Object
  Returns the value of attribute n_heads.
- #n_layers ⇒ Object
  Returns the value of attribute n_layers.
- #realized ⇒ Object
  Returns the value of attribute realized.
- #sess ⇒ Object
  Returns the value of attribute sess.
- #t_final_norm_gamma ⇒ Object
  Returns the value of attribute t_final_norm_gamma.
- #t_logits ⇒ Object
  Returns the value of attribute t_logits.
- #t_pos_slice ⇒ Object
  Returns the value of attribute t_pos_slice.
- #t_seq ⇒ Object
  Returns the value of attribute t_seq.
- #t_token_embed ⇒ Object
  Returns the value of attribute t_token_embed.
- #t_token_ids ⇒ Object
  Returns the value of attribute t_token_ids.
- #t_x_embed ⇒ Object
  Returns the value of attribute t_x_embed.
- #t_x_final ⇒ Object
  Returns the value of attribute t_x_final.
- #vocab_size ⇒ Object
  Returns the value of attribute vocab_size.
Instance Method Summary
- #build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object
  Single attention head, given pre-normed x and the head’s persistent Q/K/V weights.
- #build_block(t_x, blk, eps, scale) ⇒ Object
  Build one transformer block’s graph nodes.
- #initialize ⇒ FullForwardFFICacheCuda (constructor)
  A new instance of FullForwardFFICacheCuda.
- #realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object
Constructor Details
#initialize ⇒ FullForwardFFICacheCuda
Returns a new instance of FullForwardFFICacheCuda.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 920
def initialize
  @realized = false
  @t_seq = 0
  @d_model = 0
  @d_ff = 0
  @n_heads = 0
  @d_head = 0
  @n_layers = 0
  @vocab_size = 0
  @sess = TinyNNCuda.tnn_null_ptr
  @t_token_embed = TinyNNCuda.tnn_null_ptr
  @t_pos_slice = TinyNNCuda.tnn_null_ptr
  @t_token_ids = TinyNNCuda.tnn_null_ptr
  @t_final_norm_gamma = TinyNNCuda.tnn_null_ptr
  @t_x_embed = TinyNNCuda.tnn_null_ptr
  @t_x_final = TinyNNCuda.tnn_null_ptr
  @t_logits = TinyNNCuda.tnn_null_ptr
  @blocks_ffi = [BlockFFICacheCuda.new]
end
Instance Attribute Details
#blocks_ffi ⇒ Object
Returns the value of attribute blocks_ffi.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def blocks_ffi
  @blocks_ffi
end
#d_ff ⇒ Object
Returns the value of attribute d_ff.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def d_ff
  @d_ff
end
#d_head ⇒ Object
Returns the value of attribute d_head.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def d_head
  @d_head
end
#d_model ⇒ Object
Returns the value of attribute d_model.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def d_model
  @d_model
end
#n_heads ⇒ Object
Returns the value of attribute n_heads.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def n_heads
  @n_heads
end
#n_layers ⇒ Object
Returns the value of attribute n_layers.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def n_layers
  @n_layers
end
#realized ⇒ Object
Returns the value of attribute realized.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def realized
  @realized
end
#sess ⇒ Object
Returns the value of attribute sess.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def sess
  @sess
end
#t_final_norm_gamma ⇒ Object
Returns the value of attribute t_final_norm_gamma.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def t_final_norm_gamma
  @t_final_norm_gamma
end
#t_logits ⇒ Object
Returns the value of attribute t_logits.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def t_logits
  @t_logits
end
#t_pos_slice ⇒ Object
Returns the value of attribute t_pos_slice.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def t_pos_slice
  @t_pos_slice
end
#t_seq ⇒ Object
Returns the value of attribute t_seq.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def t_seq
  @t_seq
end
#t_token_embed ⇒ Object
Returns the value of attribute t_token_embed.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def t_token_embed
  @t_token_embed
end
#t_token_ids ⇒ Object
Returns the value of attribute t_token_ids.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def t_token_ids
  @t_token_ids
end
#t_x_embed ⇒ Object
Returns the value of attribute t_x_embed.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def t_x_embed
  @t_x_embed
end
#t_x_final ⇒ Object
Returns the value of attribute t_x_final.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def t_x_final
  @t_x_final
end
#vocab_size ⇒ Object
Returns the value of attribute vocab_size.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 913
def vocab_size
  @vocab_size
end
Instance Method Details
#build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object
Single attention head, given pre-normed x and the head’s persistent Q/K/V weights. See build_block’s docstring for the math.
# File 'lib/toy/ffi/tinynn_cuda.rb', line 1080
def build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale)
  t_q = TinyNNCuda.tnn_matmul(@sess, t_w_q, t_x)   # ne=[d_head, T]
  t_k = TinyNNCuda.tnn_matmul(@sess, t_w_k, t_x)   # ne=[d_head, T]
  # v in Pattern A (ne=[T, d_head]) so head_out's k_dim matches.
  # mul_mat(x, w_v_t) where x.ne=[d_model, T] and w_v_t.ne=[d_model, d_head]
  # yields ne=[T, d_head]. ✓
  t_v = TinyNNCuda.tnn_matmul(@sess, t_x, t_w_v)

  t_scores = TinyNNCuda.tnn_matmul(@sess, t_k, t_q)   # ne=[T_key, T_query]
  t_scaled = TinyNNCuda.tnn_scale(@sess, t_scores, scale)
  t_masked = TinyNNCuda.tnn_diag_mask_inf(@sess, t_scaled, 0)
  t_attn = TinyNNCuda.tnn_softmax(@sess, t_masked)    # softmax along ne0 = key dim

  TinyNNCuda.tnn_matmul(@sess, t_v, t_attn)           # ne=[d_head, T_query]
end
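For checking the graph against a CPU result, the same head can be written out in plain Ruby. This is a reference sketch only: matrices are arrays of rows in logical (row, col) orientation rather than ggml's ne order, and `matmul`, `softmax`, and `reference_attention_head` are hypothetical helpers, not part of TinyNNCuda.

```ruby
# Plain-Ruby reference for one causal attention head (illustrative only).
# Matrices are arrays of rows: x is (T, d_model), each weight is (d_model, d_head).

def matmul(a, b)
  b_cols = b.transpose
  a.map { |row| b_cols.map { |col| row.zip(col).sum { |p, q| p * q } } }
end

# Numerically stable softmax over one row.
def softmax(row)
  m = row.max
  exps = row.map { |v| Math.exp(v - m) }
  total = exps.sum
  exps.map { |e| e / total }
end

def reference_attention_head(x, w_q, w_k, w_v, scale)
  q = matmul(x, w_q)                 # (T, d_head)
  k = matmul(x, w_k)                 # (T, d_head)
  v = matmul(x, w_v)                 # (T, d_head)
  scores = matmul(q, k.transpose)    # (T_query, T_key)

  attn = scores.each_with_index.map do |row, tq|
    # Causal mask: a query attends only to keys at positions <= its own.
    masked = row.each_with_index.map do |s, tk|
      tk > tq ? -Float::INFINITY : s * scale
    end
    softmax(masked)
  end

  matmul(attn, v)                    # (T_query, d_head)
end

identity = [[1.0, 0.0], [0.0, 1.0]]
out = reference_attention_head(identity, identity, identity, identity, 1.0 / Math.sqrt(2.0))
p out
```

The first query position can only see itself, so its output row equals the first row of v; that property makes a convenient sanity check when comparing against the GPU logits path.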
#build_block(t_x, blk, eps, scale) ⇒ Object
Build one transformer block’s graph nodes. Returns the block’s output tensor (post-FFN residual). Mathematics:
h1 = rms_norm(x, norm1_gamma)
per head h:
  q_h        = w_q[h]^T @ h1                  (mul_mat(w_q_t_h, h1), ne=[d_head, T])
  k_h        = w_k[h]^T @ h1
  v_h        = h1 @ w_v[h]                    (mul_mat(h1, w_v_t_h), ne=[T, d_head])
  scores_h   = mul_mat(k_h, q_h)              ne=[T_key, T_query]
  scaled_h   = scale(scores_h, 1/sqrt(d_head))
  masked_h   = diag_mask_inf(scaled_h, 0)     -- causal
  attn_h     = soft_max(masked_h)             -- per-query softmax over keys
  head_out_h = mul_mat(v_h, attn_h)           ne=[d_head, T_query]
concat   = concat_along_d(head_out_h for h in heads)   ne=[d_model, T]
out_proj = mul_mat(w_o_t, concat)                      ne=[d_model, T]
x_attn   = x + out_proj
h2 = rms_norm(x_attn, norm2_gamma)
ffn:
  pre     = mul_mat(w_ff1_t, h2)       ne=[d_ff, T]
  hidden  = gelu(pre)
  ffn_out = mul_mat(w_ff2_t, hidden)   ne=[d_model, T]
x_out = x_attn + ffn_out
# File 'lib/toy/ffi/tinynn_cuda.rb', line 1042
def build_block(t_x, blk, eps, scale)
  # Pre-norm before attention.
  t_h1 = TinyNNCuda.tnn_rms_norm(@sess, t_x, blk.t_norm1_gamma, eps)

  # Per-head attention. Build each head's output, then concat.
  t_head_outs = [build_attention_head(t_h1, blk.t_w_q[0], blk.t_w_k[0], blk.t_w_v[0], scale)]
  h = 1
  while h < @n_heads
    t_head_outs.push(build_attention_head(t_h1, blk.t_w_q[h], blk.t_w_k[h], blk.t_w_v[h], scale))
    h = h + 1
  end

  # Concat along ne0 (d_head -> d_model).
  t_concat = t_head_outs[0]
  h = 1
  while h < @n_heads
    t_concat = TinyNNCuda.tnn_concat(@sess, t_concat, t_head_outs[h], 0)
    h = h + 1
  end

  # Output projection + residual.
  t_out_proj = TinyNNCuda.tnn_matmul(@sess, blk.t_w_o, t_concat)
  t_x_attn = TinyNNCuda.tnn_add(@sess, t_x, t_out_proj)

  # Pre-norm before FFN.
  t_h2 = TinyNNCuda.tnn_rms_norm(@sess, t_x_attn, blk.t_norm2_gamma, eps)

  # FFN (matches FFNFFICache's chained design).
  t_pre = TinyNNCuda.tnn_matmul(@sess, blk.t_w_ff1, t_h2)
  t_hidden = TinyNNCuda.tnn_gelu(@sess, t_pre)
  t_ffn = TinyNNCuda.tnn_matmul(@sess, blk.t_w_ff2, t_hidden)

  # Second residual.
  TinyNNCuda.tnn_add(@sess, t_x_attn, t_ffn)
end
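The ne annotations in the block's docstring can be checked mechanically. The sketch below assumes ggml's two-dimensional mul_mat rule (a.ne=[k, n] and b.ne=[k, m] give a result with ne=[n, m]) and that tnn_matmul forwards to it with the same argument order; `mul_mat_ne` is a hypothetical shape helper, not part of TinyNNCuda, and the dimensions are made up for illustration.

```ruby
# Hypothetical shape helper mirroring ggml's 2-D mul_mat rule:
#   a.ne = [k, n], b.ne = [k, m]  ->  result.ne = [n, m]
def mul_mat_ne(a_ne, b_ne)
  unless a_ne[0] == b_ne[0]
    raise ArgumentError, "ne0 mismatch: #{a_ne[0]} vs #{b_ne[0]}"
  end
  [a_ne[1], b_ne[1]]
end

t = 8
d_model = 16
d_ff = 64
n_heads = 4
d_head = d_model / n_heads

x = [d_model, t]                     # block input; rms_norm keeps this shape
w_qkv = [d_model, d_head]            # per-head weight as stored (ne order)

q = mul_mat_ne(w_qkv, x)             # [d_head, T]
k = mul_mat_ne(w_qkv, x)             # [d_head, T]
v = mul_mat_ne(x, w_qkv)             # [T, d_head]
scores = mul_mat_ne(k, q)            # [T_key, T_query]; scale/mask/softmax keep it
head_out = mul_mat_ne(v, scores)     # [d_head, T_query]

concat = [head_out[0] * n_heads, head_out[1]]        # concat along ne0
out_proj = mul_mat_ne([d_model, d_model], concat)    # [d_model, T]
pre = mul_mat_ne([d_model, d_ff], out_proj)          # [d_ff, T]
ffn_out = mul_mat_ne([d_ff, d_model], pre)           # [d_model, T]

p q: q, v: v, scores: scores, head_out: head_out, pre: pre, ffn_out: ffn_out
```

Swapping the operands of the v projection is what lets the final mul_mat contract over the key dimension without an explicit transpose node.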
#realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object
# File 'lib/toy/ffi/tinynn_cuda.rb', line 940
def realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size)
  @t_seq = t_seq
  @d_model = d_model
  @d_ff = d_ff
  @n_heads = n_heads
  @d_head = d_model / n_heads
  @n_layers = n_layers
  @vocab_size = vocab_size
  @sess = TinyNNCuda.tnn_session_new(1)

  # === Persistent weights (ctx_w) ===
  @t_token_embed = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, vocab_size, d_model)
  @t_pos_slice = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, t_seq, d_model)
  @t_final_norm_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)

  # Build per-block tensor handles (seed-then-push for Spinel's
  # Array<BlockFFICache> inference).
  @blocks_ffi = [BlockFFICacheCuda.new]
  li = 1
  while li < n_layers
    @blocks_ffi.push(BlockFFICacheCuda.new)
    li = li + 1
  end
  li = 0
  while li < n_layers
    blk = @blocks_ffi[li]
    blk.t_norm1_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)
    blk.t_norm2_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)
    # Per-head Q/K/V: shape (d_model, d_head). Uploaded TRANSPOSED so
    # ggml ne=[d_model, d_head] holds w.elem(r, c) = w[r][c].
    blk.t_w_q = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_k = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_v = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    h = 1
    while h < n_heads
      blk.t_w_q.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_k.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_v.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      h = h + 1
    end
    blk.t_w_o = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_model, d_model)
    blk.t_w_ff1 = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_ff, d_model)
    blk.t_w_ff2 = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_model, d_ff)
    li = li + 1
  end

  TinyNNCuda.tnn_finalize_weights(@sess)

  # === Compute input ===
  @t_token_ids = TinyNNCuda.tnn_input_1d_i32(@sess, t_seq)

  # === Forward graph ===
  # x_embed = token_embed[ids] + pos_slice (ne=[d_model, T])
  # The local's name is missing from this listing; t_embed is a placeholder.
  t_embed = TinyNNCuda.tnn_get_rows(@sess, @t_token_embed, @t_token_ids)
  @t_x_embed = TinyNNCuda.tnn_add(@sess, t_embed, @t_pos_slice)
  TinyNNCuda.tnn_set_output(@t_x_embed)

  # Through each block.
  t_cur = @t_x_embed
  eps = 1.0e-5
  scale = 1.0 / Math.sqrt(d_head.to_f)
  li = 0
  while li < n_layers
    t_cur = build_block(t_cur, @blocks_ffi[li], eps, scale)
    li = li + 1
  end

  # Final RMSNorm on the post-blocks x.
  @t_x_final = TinyNNCuda.tnn_rms_norm(@sess, t_cur, @t_final_norm_gamma, eps)
  TinyNNCuda.tnn_set_output(@t_x_final)

  # Tied unembed: logits = mul_mat(token_embed, x_final) ne=[vocab, T]
  @t_logits = TinyNNCuda.tnn_matmul(@sess, @t_token_embed, @t_x_final)
  TinyNNCuda.tnn_set_output(@t_logits)

  TinyNNCuda.tnn_realize(@sess, @t_logits)
  @realized = true
end
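realize_for computes d_head as d_model / n_heads, which is integer division in Ruby and truncates silently when d_model is not a multiple of n_heads; the concatenated heads would then no longer add up to d_model. A caller could guard against that before building the graph. The sketch below is one possible check; `checked_d_head` is a hypothetical helper, not part of TinyNNCuda.

```ruby
# Hypothetical guard, not part of TinyNNCuda: returns d_head, or raises when the
# head split would silently truncate under integer division.
def checked_d_head(d_model, n_heads)
  raise ArgumentError, "n_heads must be positive" unless n_heads.positive?
  unless (d_model % n_heads).zero?
    raise ArgumentError, "d_model (#{d_model}) is not divisible by n_heads (#{n_heads})"
  end
  d_model / n_heads
end

d_head = checked_d_head(16, 4)
scale = 1.0 / Math.sqrt(d_head.to_f)   # same attention scale realize_for derives
puts d_head
puts scale
```

Running the check once up front is cheaper than diagnosing a shape mismatch after the persistent weights have been allocated.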