Class: FullForwardFFICache
- Inherits: Object
  - Object
  - FullForwardFFICache
- Defined in: lib/toy/ffi/tinynn.rb
Overview
Full forward of a TransformerLM as one persistent ggml graph. Built incrementally; M1.1 covered embed + positional embedding + tied unembed (the bookends). M1.2 adds one full transformer block: pre-RMSNorm, multi-head causal attention, residual, pre-RMSNorm, FFN, residual. M1.3+ will scale to n_layers blocks.
Layout conventions (see project_chained_ffn_2026_05_14):
- Mat (rows, cols) row-major upload -> ggml ne=[cols, rows]
- Per-block intermediates carry ne=[d_model, T]: elem(d, t) is the logical value at (row=t, col=d).
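The layout conventions above can be checked with shape-only bookkeeping. The sketch below is a minimal illustration, not part of TinyNN: `ne_for` and `mul_mat_ne` are hypothetical helpers, and it assumes `tnn_matmul` follows ggml's `mul_mat` shape rule (contract over ne0 of both operands), as the source comments in this class suggest.

```ruby
# Hypothetical shape-only helpers (not part of TinyNN); no tensor data involved.

# A (rows, cols) row-major matrix is seen by ggml as ne=[cols, rows].
def ne_for(rows, cols)
  [cols, rows]
end

# Assumed ggml mul_mat rule: a.ne=[k, n], b.ne=[k, m] -> result.ne=[n, m].
def mul_mat_ne(a_ne, b_ne)
  raise ArgumentError, "ne0 mismatch: #{a_ne[0]} vs #{b_ne[0]}" unless a_ne[0] == b_ne[0]
  [a_ne[1], b_ne[1]]
end

d_model = 8
d_head  = 4
t_seq   = 3

x_ne = ne_for(t_seq, d_model)    # activations: ne=[d_model, T]
w_ne = ne_for(d_head, d_model)   # transposed Q/K/V upload: ne=[d_model, d_head]

q_ne      = mul_mat_ne(w_ne, x_ne)       # ne=[d_head, T]
k_ne      = mul_mat_ne(w_ne, x_ne)       # ne=[d_head, T]
v_ne      = mul_mat_ne(x_ne, w_ne)       # ne=[T, d_head]
scores_ne = mul_mat_ne(k_ne, q_ne)       # ne=[T_key, T_query]
head_ne   = mul_mat_ne(v_ne, scores_ne)  # ne=[d_head, T_query]
```

Swapping the operand order for v is what makes the final product line up: its ne0 is T, matching the key dimension of the scores.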
Persistent (ctx_w):
- t_token_embed (vocab, d_model)
- t_pos_slice (T, d_model)
- t_final_norm_gamma (d_model)
- per block (in @blocks_ffi):
  - t_norm1_gamma, t_norm2_gamma (d_model)
  - t_w_q[h], t_w_k[h], t_w_v[h] (d_model, d_head) per head
  - t_w_o (d_model, d_model)
  - t_w_ff1 (d_model, d_ff), t_w_ff2 (d_ff, d_model)
Compute (ctx): t_token_ids (T int32), intermediates, t_logits
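As a rough sizing aid, the persistent tensors listed above can be totalled from the model dimensions. This is a sketch under stated assumptions: `persistent_element_count` is a hypothetical helper, the example dimensions are made up, and the byte figure assumes 4-byte f32 elements with no padding or allocator overhead.

```ruby
# Hypothetical helper: counts f32 elements in the persistent tensors
# listed in the overview (bookends plus per-block weights).
def persistent_element_count(t_seq:, d_model:, d_ff:, n_heads:, n_layers:, vocab_size:)
  d_head = d_model / n_heads

  # t_token_embed + t_pos_slice + t_final_norm_gamma
  bookends = vocab_size * d_model + t_seq * d_model + d_model

  per_block =
    2 * d_model +                     # t_norm1_gamma, t_norm2_gamma
    3 * n_heads * d_model * d_head +  # t_w_q, t_w_k, t_w_v for every head
    d_model * d_model +               # t_w_o
    2 * d_model * d_ff                # t_w_ff1, t_w_ff2

  bookends + n_layers * per_block
end

# Illustrative dimensions only; not taken from the project.
count = persistent_element_count(
  t_seq: 16, d_model: 64, d_ff: 256, n_heads: 4, n_layers: 2, vocab_size: 100
)
bytes = count * 4 # assuming 4 bytes per f32 element
puts "#{count} elements, #{bytes} bytes"
```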
Instance Attribute Summary
- #blocks_ffi ⇒ Object
  Returns the value of attribute blocks_ffi.
- #d_ff ⇒ Object
  Returns the value of attribute d_ff.
- #d_head ⇒ Object
  Returns the value of attribute d_head.
- #d_model ⇒ Object
  Returns the value of attribute d_model.
- #n_heads ⇒ Object
  Returns the value of attribute n_heads.
- #n_layers ⇒ Object
  Returns the value of attribute n_layers.
- #realized ⇒ Object
  Returns the value of attribute realized.
- #sess ⇒ Object
  Returns the value of attribute sess.
- #t_final_norm_gamma ⇒ Object
  Returns the value of attribute t_final_norm_gamma.
- #t_logits ⇒ Object
  Returns the value of attribute t_logits.
- #t_pos_slice ⇒ Object
  Returns the value of attribute t_pos_slice.
- #t_seq ⇒ Object
  Returns the value of attribute t_seq.
- #t_token_embed ⇒ Object
  Returns the value of attribute t_token_embed.
- #t_token_ids ⇒ Object
  Returns the value of attribute t_token_ids.
- #t_x_embed ⇒ Object
  Returns the value of attribute t_x_embed.
- #t_x_final ⇒ Object
  Returns the value of attribute t_x_final.
- #vocab_size ⇒ Object
  Returns the value of attribute vocab_size.
Instance Method Summary
- #build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object
  Single attention head, given pre-normed x and the head’s persistent Q/K/V weights.
- #build_block(t_x, blk, eps, scale) ⇒ Object
  Build one transformer block’s graph nodes.
- #initialize ⇒ FullForwardFFICache (constructor)
  A new instance of FullForwardFFICache.
- #realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object
Constructor Details
#initialize ⇒ FullForwardFFICache
Returns a new instance of FullForwardFFICache.
# File 'lib/toy/ffi/tinynn.rb', line 138
def initialize
  @realized = false
  @t_seq = 0
  @d_model = 0
  @d_ff = 0
  @n_heads = 0
  @d_head = 0
  @n_layers = 0
  @vocab_size = 0
  @sess = TinyNN.tnn_null_ptr
  @t_token_embed = TinyNN.tnn_null_ptr
  @t_pos_slice = TinyNN.tnn_null_ptr
  @t_token_ids = TinyNN.tnn_null_ptr
  @t_final_norm_gamma = TinyNN.tnn_null_ptr
  @t_x_embed = TinyNN.tnn_null_ptr
  @t_x_final = TinyNN.tnn_null_ptr
  @t_logits = TinyNN.tnn_null_ptr
  @blocks_ffi = [BlockFFICache.new]
end
Instance Attribute Details
#blocks_ffi ⇒ Object
Returns the value of attribute blocks_ffi.
# File 'lib/toy/ffi/tinynn.rb', line 131
def blocks_ffi
  @blocks_ffi
end
#d_ff ⇒ Object
Returns the value of attribute d_ff.
# File 'lib/toy/ffi/tinynn.rb', line 131
def d_ff
  @d_ff
end
#d_head ⇒ Object
Returns the value of attribute d_head.
# File 'lib/toy/ffi/tinynn.rb', line 131
def d_head
  @d_head
end
#d_model ⇒ Object
Returns the value of attribute d_model.
# File 'lib/toy/ffi/tinynn.rb', line 131
def d_model
  @d_model
end
#n_heads ⇒ Object
Returns the value of attribute n_heads.
# File 'lib/toy/ffi/tinynn.rb', line 131
def n_heads
  @n_heads
end
#n_layers ⇒ Object
Returns the value of attribute n_layers.
# File 'lib/toy/ffi/tinynn.rb', line 131
def n_layers
  @n_layers
end
#realized ⇒ Object
Returns the value of attribute realized.
# File 'lib/toy/ffi/tinynn.rb', line 131
def realized
  @realized
end
#sess ⇒ Object
Returns the value of attribute sess.
# File 'lib/toy/ffi/tinynn.rb', line 131
def sess
  @sess
end
#t_final_norm_gamma ⇒ Object
Returns the value of attribute t_final_norm_gamma.
# File 'lib/toy/ffi/tinynn.rb', line 131
def t_final_norm_gamma
  @t_final_norm_gamma
end
#t_logits ⇒ Object
Returns the value of attribute t_logits.
# File 'lib/toy/ffi/tinynn.rb', line 131
def t_logits
  @t_logits
end
#t_pos_slice ⇒ Object
Returns the value of attribute t_pos_slice.
# File 'lib/toy/ffi/tinynn.rb', line 131
def t_pos_slice
  @t_pos_slice
end
#t_seq ⇒ Object
Returns the value of attribute t_seq.
# File 'lib/toy/ffi/tinynn.rb', line 131
def t_seq
  @t_seq
end
#t_token_embed ⇒ Object
Returns the value of attribute t_token_embed.
# File 'lib/toy/ffi/tinynn.rb', line 131
def t_token_embed
  @t_token_embed
end
#t_token_ids ⇒ Object
Returns the value of attribute t_token_ids.
# File 'lib/toy/ffi/tinynn.rb', line 131
def t_token_ids
  @t_token_ids
end
#t_x_embed ⇒ Object
Returns the value of attribute t_x_embed.
# File 'lib/toy/ffi/tinynn.rb', line 131
def t_x_embed
  @t_x_embed
end
#t_x_final ⇒ Object
Returns the value of attribute t_x_final.
# File 'lib/toy/ffi/tinynn.rb', line 131
def t_x_final
  @t_x_final
end
#vocab_size ⇒ Object
Returns the value of attribute vocab_size.
# File 'lib/toy/ffi/tinynn.rb', line 131
def vocab_size
  @vocab_size
end
Instance Method Details
#build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object
Single attention head, given pre-normed x and the head’s persistent Q/K/V weights. See build_block’s docstring for the math.
# File 'lib/toy/ffi/tinynn.rb', line 298
def build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale)
  t_q = TinyNN.tnn_matmul(@sess, t_w_q, t_x)  # ne=[d_head, T]
  t_k = TinyNN.tnn_matmul(@sess, t_w_k, t_x)  # ne=[d_head, T]
  # v in Pattern A (ne=[T, d_head]) so head_out's k_dim matches.
  # mul_mat(x, w_v_t) where x.ne=[d_model, T] and w_v_t.ne=[d_model, d_head]
  # yields ne=[T, d_head]. ✓
  t_v = TinyNN.tnn_matmul(@sess, t_x, t_w_v)

  t_scores = TinyNN.tnn_matmul(@sess, t_k, t_q)  # ne=[T_key, T_query]
  t_scaled = TinyNN.tnn_scale(@sess, t_scores, scale)
  t_masked = TinyNN.tnn_diag_mask_inf(@sess, t_scaled, 0)
  t_attn = TinyNN.tnn_softmax(@sess, t_masked)  # softmax along ne0 = key dim
  TinyNN.tnn_matmul(@sess, t_v, t_attn)  # ne=[d_head, T_query]
end
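For cross-checking the graph above, a plain-Ruby reference of a single causal attention head may be useful. This is only a sketch: `matmul` and `causal_attention_head` are hypothetical names, the arrays use the logical layout (row = token, column = feature) rather than ggml's ne ordering, and the causal mask is modeled by skipping future keys instead of writing -inf.

```ruby
# Plain-Ruby reference in logical layout; not part of TinyNN.

# a is n x k, b is k x m; returns n x m.
def matmul(a, b)
  cols = b.transpose
  a.map { |row| cols.map { |col| row.zip(col).sum { |p, q| p * q } } }
end

# x: T x d_model; w_q, w_k, w_v: d_model x d_head.
# Returns [attention weights (T x T), head output (T x d_head)].
def causal_attention_head(x, w_q, w_k, w_v)
  d_head = w_q[0].length
  scale = 1.0 / Math.sqrt(d_head.to_f)
  q = matmul(x, w_q)
  k = matmul(x, w_k)
  v = matmul(x, w_v)
  t_len = x.length

  attn = (0...t_len).map do |tq|
    # Query tq only sees keys 0..tq (causal mask).
    scores = (0..tq).map { |tk| scale * q[tq].zip(k[tk]).sum { |a, b| a * b } }
    max = scores.max # subtract the max for numerical stability
    exps = scores.map { |s| Math.exp(s - max) }
    total = exps.sum
    exps.map { |e| e / total } + Array.new(t_len - tq - 1, 0.0)
  end

  [attn, matmul(attn, v)]
end

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn, head_out = causal_attention_head(x, identity, identity, identity)
```

Because the first query can attend only to itself, its attention row is a one-hot vector and its output equals its own value vector, which makes a convenient sanity check.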
#build_block(t_x, blk, eps, scale) ⇒ Object
Build one transformer block’s graph nodes. Returns the block’s output tensor (post-FFN residual). Mathematics:
  h1 = rms_norm(x, norm1_gamma)
  per head h:
    q_h        = w_q[h]^T @ h1            (mul_mat(w_q_t_h, h1) ne=[d_head, T])
    k_h        = w_k[h]^T @ h1
    v_h        = h1 @ w_v[h]              (mul_mat(h1, w_v_t_h) ne=[T, d_head])
    scores_h   = mul_mat(k_h, q_h)        ne=[T_key, T_query]
    scaled_h   = scale(scores_h, 1/sqrt(d_head))
    masked_h   = diag_mask_inf(scaled_h, 0)   -- causal
    attn_h     = soft_max(masked_h)           -- per-query softmax over keys
    head_out_h = mul_mat(v_h, attn_h)     ne=[d_head, T_query]
  concat   = concat_along_d(head_out_h for h in heads)   ne=[d_model, T]
  out_proj = mul_mat(w_o_t, concat)       ne=[d_model, T]
  x_attn   = x + out_proj
  h2 = rms_norm(x_attn, norm2_gamma)
  ffn:
    pre     = mul_mat(w_ff1_t, h2)        ne=[d_ff, T]
    hidden  = gelu(pre)
    ffn_out = mul_mat(w_ff2_t, hidden)    ne=[d_model, T]
  x_out = x_attn + ffn_out
# File 'lib/toy/ffi/tinynn.rb', line 260
def build_block(t_x, blk, eps, scale)
  # Pre-norm before attention.
  t_h1 = TinyNN.tnn_rms_norm(@sess, t_x, blk.t_norm1_gamma, eps)

  # Per-head attention. Build each head's output, then concat.
  t_head_outs = [build_attention_head(t_h1, blk.t_w_q[0], blk.t_w_k[0], blk.t_w_v[0], scale)]
  h = 1
  while h < @n_heads
    t_head_outs.push(build_attention_head(t_h1, blk.t_w_q[h], blk.t_w_k[h], blk.t_w_v[h], scale))
    h = h + 1
  end

  # Concat along ne0 (d_head -> d_model).
  t_concat = t_head_outs[0]
  h = 1
  while h < @n_heads
    t_concat = TinyNN.tnn_concat(@sess, t_concat, t_head_outs[h], 0)
    h = h + 1
  end

  # Output projection + residual.
  t_out_proj = TinyNN.tnn_matmul(@sess, blk.t_w_o, t_concat)
  t_x_attn = TinyNN.tnn_add(@sess, t_x, t_out_proj)

  # Pre-norm before FFN.
  t_h2 = TinyNN.tnn_rms_norm(@sess, t_x_attn, blk.t_norm2_gamma, eps)

  # FFN (matches FFNFFICache's chained design).
  t_pre = TinyNN.tnn_matmul(@sess, blk.t_w_ff1, t_h2)
  t_hidden = TinyNN.tnn_gelu(@sess, t_pre)
  t_ffn = TinyNN.tnn_matmul(@sess, blk.t_w_ff2, t_hidden)

  # Second residual.
  TinyNN.tnn_add(@sess, t_x_attn, t_ffn)
end
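The pre-norm residual pattern used twice in the block (normalize, run a sub-layer, add the result back to the un-normalized input) can be sketched in plain Ruby. The exact arithmetic inside `tnn_rms_norm` is not shown in this file, so the formula below is an assumption: the common RMSNorm definition, x / sqrt(mean(x^2) + eps) scaled elementwise by gamma. `rms_norm` and `pre_norm_residual` are hypothetical names.

```ruby
# Assumed RMSNorm definition (not confirmed by the source):
#   out[d] = row[d] / sqrt(mean(row^2) + eps) * gamma[d]
def rms_norm(row, gamma, eps = 1.0e-5)
  mean_sq = row.sum { |v| v * v } / row.length.to_f
  inv = 1.0 / Math.sqrt(mean_sq + eps)
  row.each_with_index.map { |v, d| v * inv * gamma[d] }
end

# Pre-norm residual: x_out[t] = x[t] + sublayer(rms_norm(x[t])).
# The block passed in stands in for attention or the FFN.
def pre_norm_residual(x, gamma, eps = 1.0e-5)
  x.map do |row|
    normed = rms_norm(row, gamma, eps)
    branch = yield(normed)
    row.zip(branch).map { |a, b| a + b }
  end
end

x = [[3.0, 4.0], [1.0, -1.0]]
gamma = [1.0, 1.0]

# A toy sub-layer that doubles its input, purely for illustration.
x_out = pre_norm_residual(x, gamma) { |h| h.map { |v| 2.0 * v } }
```

Note that the residual adds to the raw input, not the normalized one, which mirrors `tnn_add(@sess, t_x, t_out_proj)` and `tnn_add(@sess, t_x_attn, t_ffn)` in the listing.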
#realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object
# File 'lib/toy/ffi/tinynn.rb', line 158
def realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size)
  @t_seq = t_seq
  @d_model = d_model
  @d_ff = d_ff
  @n_heads = n_heads
  @d_head = d_model / n_heads
  @n_layers = n_layers
  @vocab_size = vocab_size

  @sess = TinyNN.tnn_session_new(0)

  # === Persistent weights (ctx_w) ===
  @t_token_embed = TinyNN.tnn_input_2d_f32_persistent(@sess, vocab_size, d_model)
  @t_pos_slice = TinyNN.tnn_input_2d_f32_persistent(@sess, t_seq, d_model)
  @t_final_norm_gamma = TinyNN.tnn_input_1d_f32_persistent(@sess, d_model)

  # Build per-block tensor handles (seed-then-push for Spinel's
  # Array<BlockFFICache> inference).
  @blocks_ffi = [BlockFFICache.new]
  li = 1
  while li < n_layers
    @blocks_ffi.push(BlockFFICache.new)
    li = li + 1
  end

  li = 0
  while li < n_layers
    blk = @blocks_ffi[li]
    blk.t_norm1_gamma = TinyNN.tnn_input_1d_f32_persistent(@sess, d_model)
    blk.t_norm2_gamma = TinyNN.tnn_input_1d_f32_persistent(@sess, d_model)
    # Per-head Q/K/V: shape (d_model, d_head). Uploaded TRANSPOSED so
    # ggml ne=[d_model, d_head] holds w.elem(r, c) = w[r][c].
    blk.t_w_q = [TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_k = [TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_v = [TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    h = 1
    while h < n_heads
      blk.t_w_q.push(TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_k.push(TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_v.push(TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      h = h + 1
    end
    blk.t_w_o = TinyNN.tnn_input_2d_f32_persistent(@sess, d_model, d_model)
    blk.t_w_ff1 = TinyNN.tnn_input_2d_f32_persistent(@sess, d_ff, d_model)
    blk.t_w_ff2 = TinyNN.tnn_input_2d_f32_persistent(@sess, d_model, d_ff)
    li = li + 1
  end

  TinyNN.tnn_finalize_weights(@sess)

  # === Compute input ===
  @t_token_ids = TinyNN.tnn_input_1d_i32(@sess, t_seq)

  # === Forward graph ===
  # x_embed = token_embed[ids] + pos_slice (ne=[d_model, T])
  # NOTE: the local variable's name is missing from this listing;
  # t_tok_embed is a stand-in.
  t_tok_embed = TinyNN.tnn_get_rows(@sess, @t_token_embed, @t_token_ids)
  @t_x_embed = TinyNN.tnn_add(@sess, t_tok_embed, @t_pos_slice)
  TinyNN.tnn_set_output(@t_x_embed)

  # Through each block.
  t_cur = @t_x_embed
  eps = 1.0e-5
  scale = 1.0 / Math.sqrt(d_head.to_f)
  li = 0
  while li < n_layers
    t_cur = build_block(t_cur, @blocks_ffi[li], eps, scale)
    li = li + 1
  end

  # Final RMSNorm on the post-blocks x.
  @t_x_final = TinyNN.tnn_rms_norm(@sess, t_cur, @t_final_norm_gamma, eps)
  TinyNN.tnn_set_output(@t_x_final)

  # Tied unembed: logits = mul_mat(token_embed, x_final) ne=[vocab, T]
  @t_logits = TinyNN.tnn_matmul(@sess, @t_token_embed, @t_x_final)
  TinyNN.tnn_set_output(@t_logits)

  TinyNN.tnn_realize(@sess, @t_logits)
  @realized = true
end
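The two bookends of the graph, embedding lookup plus positional add and the tied unembed, reuse the same token_embed matrix. A plain-Ruby sketch of that idea follows; `embed` and `tied_logits` are hypothetical names, the arrays use the logical (row = token) layout, and the toy values are made up for illustration.

```ruby
# Plain-Ruby reference for the bookends; not part of TinyNN.

# x_embed[t] = token_embed[ids[t]] + pos_slice[t]
def embed(token_embed, pos_slice, token_ids)
  token_ids.each_with_index.map do |id, t|
    token_embed[id].zip(pos_slice[t]).map { |e, p| e + p }
  end
end

# Tied unembed: logits[t][v] = dot(token_embed[v], x_final[t]).
# The same (vocab, d_model) matrix serves as input and output embedding.
def tied_logits(token_embed, x_final)
  x_final.map do |row|
    token_embed.map { |emb| emb.zip(row).sum { |e, v| e * v } }
  end
end

token_embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] # vocab = 3, d_model = 2
pos_slice   = [[0.0, 0.0], [0.0, 0.0]]             # T = 2, zeros for clarity
token_ids   = [2, 0]

x = embed(token_embed, pos_slice, token_ids)
logits = tied_logits(token_embed, x) # T x vocab, skipping the blocks entirely
```

In the real graph the transformer blocks and the final RMSNorm sit between these two steps; the sketch skips them to isolate the weight tying.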