Class: FullForwardFFICache

Inherits:
Object
Defined in:
lib/toy/ffi/tinynn.rb

Overview

Full forward pass of a TransformerLM as one persistent ggml graph, built incrementally. M1.1 covered the bookends: token embedding, positional embedding, and the tied unembed. M1.2 adds one full transformer block: pre-RMSNorm, multi-head causal attention, residual, pre-RMSNorm, FFN, residual. M1.3+ will scale to n_layers blocks.

Layout conventions (see project_chained_ffn_2026_05_14):

- Mat (rows, cols) row-major upload  -> ggml ne=[cols, rows]
- Per-block intermediates carry ne=[d_model, T]: elem(d, t) is the
  logical value at (row=t, col=d).
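The two layout rules above describe the same bytes from two points of view. A minimal sketch in plain Ruby, assuming the upload buffer is a flat row-major array; `row_major_offset` and `ggml_offset` are hypothetical helpers for illustration, not part of TinyNN:

```ruby
# Hypothetical helper: flat offset of logical (row, col) in a row-major
# buffer that has `cols` columns.
def row_major_offset(row, col, cols)
  row * cols + col
end

# Hypothetical helper: flat offset for a 2-D tensor described as
# ne = [ne0, ne1], where ne0 is the fastest-varying axis.
def ggml_offset(i0, i1, ne0)
  i1 * ne0 + i0
end

rows, cols = 3, 4                       # e.g. T = 3 tokens, d_model = 4
mat  = Array.new(rows) { |r| Array.new(cols) { |c| 10 * r + c } }
flat = mat.flatten                      # row-major upload buffer
ne   = [cols, rows]                     # the same buffer seen as ne=[cols, rows]

t, d = 2, 1
a = flat[row_major_offset(t, d, cols)]  # logical (row=t, col=d)
b = flat[ggml_offset(d, t, ne[0])]      # elem(d, t) in the ne=[d_model, T] view
puts "ne=#{ne.inspect} row-major=#{a} ggml-view=#{b} match=#{a == b}"
```

Both lookups land on the same offset, which is why a (T, d_model) matrix needs no copy or transpose to be read as ne=[d_model, T].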

Persistent (ctx_w):

- t_token_embed (vocab, d_model)
- t_pos_slice   (T, d_model)
- t_final_norm_gamma (d_model)
- per block (in @blocks_ffi):
  - t_norm1_gamma, t_norm2_gamma (d_model)
  - t_w_q[h], t_w_k[h], t_w_v[h] (d_model, d_head) per head
  - t_w_o   (d_model, d_model)
  - t_w_ff1 (d_model, d_ff), t_w_ff2 (d_ff, d_model)

Compute (ctx): t_token_ids (T int32), intermediates, t_logits
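The persistent tensors listed above determine how many f32 values live in ctx_w. A rough sketch of that arithmetic, assuming d_head = d_model / n_heads and ignoring any per-tensor overhead or alignment the backend may add; `persistent_f32_count` is a hypothetical helper, not part of this class:

```ruby
# Hypothetical helper: count the f32 elements of the persistent tensors
# named in the overview. Backend overhead and padding are not included.
def persistent_f32_count(t_seq:, d_model:, d_ff:, n_heads:, n_layers:, vocab_size:)
  d_head = d_model / n_heads
  bookends = vocab_size * d_model +     # t_token_embed
             t_seq * d_model +          # t_pos_slice
             d_model                    # t_final_norm_gamma
  per_block = 2 * d_model +                     # t_norm1_gamma, t_norm2_gamma
              3 * n_heads * d_model * d_head +  # t_w_q, t_w_k, t_w_v for every head
              d_model * d_model +               # t_w_o
              2 * d_model * d_ff                # t_w_ff1, t_w_ff2
  bookends + n_layers * per_block
end

count = persistent_f32_count(t_seq: 8, d_model: 16, d_ff: 64,
                             n_heads: 4, n_layers: 2, vocab_size: 100)
puts "#{count} f32 values (#{count * 4} bytes of raw data)"
```

The sizes in the call are illustrative only.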


Constructor Details

#initialize ⇒ FullForwardFFICache

Returns a new instance of FullForwardFFICache.

# File 'lib/toy/ffi/tinynn.rb', line 138

def initialize
  @realized   = false
  @t_seq      = 0
  @d_model    = 0
  @d_ff       = 0
  @n_heads    = 0
  @d_head     = 0
  @n_layers   = 0
  @vocab_size = 0
  @sess               = TinyNN.tnn_null_ptr
  @t_token_embed      = TinyNN.tnn_null_ptr
  @t_pos_slice        = TinyNN.tnn_null_ptr
  @t_token_ids        = TinyNN.tnn_null_ptr
  @t_final_norm_gamma = TinyNN.tnn_null_ptr
  @t_x_embed          = TinyNN.tnn_null_ptr
  @t_x_final          = TinyNN.tnn_null_ptr
  @t_logits           = TinyNN.tnn_null_ptr
  @blocks_ffi         = [BlockFFICache.new]
end

Instance Attribute Details

#blocks_ffi ⇒ Object

Returns the value of attribute blocks_ffi.

# File 'lib/toy/ffi/tinynn.rb', line 131

def blocks_ffi
  @blocks_ffi
end

#d_ff ⇒ Object

Returns the value of attribute d_ff.

# File 'lib/toy/ffi/tinynn.rb', line 131

def d_ff
  @d_ff
end

#d_head ⇒ Object

Returns the value of attribute d_head.

# File 'lib/toy/ffi/tinynn.rb', line 131

def d_head
  @d_head
end

#d_model ⇒ Object

Returns the value of attribute d_model.

# File 'lib/toy/ffi/tinynn.rb', line 131

def d_model
  @d_model
end

#n_heads ⇒ Object

Returns the value of attribute n_heads.

# File 'lib/toy/ffi/tinynn.rb', line 131

def n_heads
  @n_heads
end

#n_layers ⇒ Object

Returns the value of attribute n_layers.

# File 'lib/toy/ffi/tinynn.rb', line 131

def n_layers
  @n_layers
end

#realized ⇒ Object

Returns the value of attribute realized.

# File 'lib/toy/ffi/tinynn.rb', line 131

def realized
  @realized
end

#sess ⇒ Object

Returns the value of attribute sess.

# File 'lib/toy/ffi/tinynn.rb', line 131

def sess
  @sess
end

#t_final_norm_gamma ⇒ Object

Returns the value of attribute t_final_norm_gamma.

# File 'lib/toy/ffi/tinynn.rb', line 131

def t_final_norm_gamma
  @t_final_norm_gamma
end

#t_logits ⇒ Object

Returns the value of attribute t_logits.

# File 'lib/toy/ffi/tinynn.rb', line 131

def t_logits
  @t_logits
end

#t_pos_slice ⇒ Object

Returns the value of attribute t_pos_slice.

# File 'lib/toy/ffi/tinynn.rb', line 131

def t_pos_slice
  @t_pos_slice
end

#t_seq ⇒ Object

Returns the value of attribute t_seq.

# File 'lib/toy/ffi/tinynn.rb', line 131

def t_seq
  @t_seq
end

#t_token_embed ⇒ Object

Returns the value of attribute t_token_embed.

# File 'lib/toy/ffi/tinynn.rb', line 131

def t_token_embed
  @t_token_embed
end

#t_token_ids ⇒ Object

Returns the value of attribute t_token_ids.

# File 'lib/toy/ffi/tinynn.rb', line 131

def t_token_ids
  @t_token_ids
end

#t_x_embed ⇒ Object

Returns the value of attribute t_x_embed.

# File 'lib/toy/ffi/tinynn.rb', line 131

def t_x_embed
  @t_x_embed
end

#t_x_final ⇒ Object

Returns the value of attribute t_x_final.

# File 'lib/toy/ffi/tinynn.rb', line 131

def t_x_final
  @t_x_final
end

#vocab_size ⇒ Object

Returns the value of attribute vocab_size.

# File 'lib/toy/ffi/tinynn.rb', line 131

def vocab_size
  @vocab_size
end

Instance Method Details

#build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object

Single attention head, given pre-normed x and the head’s persistent Q/K/V weights. See build_block’s docstring for the math.



# File 'lib/toy/ffi/tinynn.rb', line 298

def build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale)
  t_q = TinyNN.tnn_matmul(@sess, t_w_q, t_x)   # ne=[d_head, T]
  t_k = TinyNN.tnn_matmul(@sess, t_w_k, t_x)   # ne=[d_head, T]
  # v in Pattern A (ne=[T, d_head]) so head_out's k_dim matches.
  # mul_mat(x, w_v_t) where x.ne=[d_model, T] and w_v_t.ne=[d_model, d_head]
  # yields ne=[T, d_head]. ✓
  t_v = TinyNN.tnn_matmul(@sess, t_x, t_w_v)

  t_scores = TinyNN.tnn_matmul(@sess, t_k, t_q)            # ne=[T_key, T_query]
  t_scaled = TinyNN.tnn_scale(@sess, t_scores, scale)
  t_masked = TinyNN.tnn_diag_mask_inf(@sess, t_scaled, 0)
  t_attn   = TinyNN.tnn_softmax(@sess, t_masked)           # softmax along ne0 = key dim

  TinyNN.tnn_matmul(@sess, t_v, t_attn)                    # ne=[d_head, T_query]
end
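The shape comments in build_attention_head follow from ggml's mul_mat rule for 2-D operands: the two inputs must agree on ne[0], and the result has ne = [a.ne[1], b.ne[1]]. The sketch below traces only shapes, with no tensors involved; `mul_mat_ne` is a hypothetical helper and the sizes are illustrative:

```ruby
# Shape-only model of mul_mat(a, b) for 2-D operands given as [ne0, ne1].
def mul_mat_ne(a, b)
  raise ArgumentError, "inner dims differ: #{a[0]} vs #{b[0]}" unless a[0] == b[0]
  [a[1], b[1]]
end

d_model, d_head, t = 16, 4, 8
x   = [d_model, t]                      # pre-normed activations, ne=[d_model, T]
w_q = w_k = w_v = [d_model, d_head]     # per-head weights as stored

q      = mul_mat_ne(w_q, x)             # ne=[d_head, T]
k      = mul_mat_ne(w_k, x)             # ne=[d_head, T]
v      = mul_mat_ne(x, w_v)             # ne=[T, d_head], operands swapped
scores = mul_mat_ne(k, q)               # ne=[T_key, T_query]
out    = mul_mat_ne(v, scores)          # ne=[d_head, T_query]; scale, mask and
                                        # softmax leave the scores shape unchanged

puts({ q: q, k: k, v: v, scores: scores, out: out }.inspect)
```

Swapping the operands for v is what makes its ne[0] equal T, so that it can be contracted against the key axis of the attention weights in the last step.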

#build_block(t_x, blk, eps, scale) ⇒ Object

Build one transformer block’s graph nodes. Returns the block’s output tensor (post-FFN residual). Mathematics:

h1 = rms_norm(x, norm1_gamma)
per head h:
  q_h = w_q[h]^T @ h1     (mul_mat(w_q_t_h, h1)  ne=[d_head, T])
  k_h = w_k[h]^T @ h1
  v_h = h1 @ w_v[h]       (mul_mat(h1, w_v_t_h)  ne=[T, d_head])
  scores_h = mul_mat(k_h, q_h)   ne=[T_key, T_query]
  scaled_h = scale(scores_h, 1/sqrt(d_head))
  masked_h = diag_mask_inf(scaled_h, 0)         -- causal
  attn_h   = soft_max(masked_h)  -- per-query softmax over keys
  head_out_h = mul_mat(v_h, attn_h)  ne=[d_head, T_query]
concat = concat_along_d(head_out_h for h in heads)  ne=[d_model, T]
out_proj = mul_mat(w_o_t, concat)  ne=[d_model, T]
x_attn = x + out_proj
h2 = rms_norm(x_attn, norm2_gamma)
ffn:
  pre    = mul_mat(w_ff1_t, h2)   ne=[d_ff,    T]
  hidden = gelu(pre)
  ffn_out= mul_mat(w_ff2_t, hidden) ne=[d_model, T]
x_out = x_attn + ffn_out
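The scale, diag_mask_inf and soft_max steps above amount to a per-query softmax over the keys at or before that query. A plain-Ruby reference sketch, useful for checking small cases by hand; it assumes scores are given as nested arrays indexed scores[t_query][t_key], and `causal_softmax` is a hypothetical helper, not a TinyNN call:

```ruby
# scores[tq][tk]: raw score of query position tq against key position tk.
def causal_softmax(scores, scale)
  scores.each_with_index.map do |row, tq|
    # Keys after the query position are masked (they would be -inf).
    visible = row[0..tq].map { |s| s * scale }
    m    = visible.max                  # subtract the max for numerical stability
    exps = visible.map { |s| Math.exp(s - m) }
    z    = exps.sum
    # Masked positions get probability zero.
    exps.map { |e| e / z } + Array.new(row.size - tq - 1, 0.0)
  end
end

scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 5.0],
          [0.0, 0.0, 0.0]]
attn = causal_softmax(scores, 1.0 / Math.sqrt(4))   # d_head = 4
attn.each { |row| puts row.map { |p| p.round(3) }.inspect }
```

The first query can only attend to itself, so its row collapses to a single weight of one regardless of the raw scores.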


# File 'lib/toy/ffi/tinynn.rb', line 260

def build_block(t_x, blk, eps, scale)
  # Pre-norm before attention.
  t_h1 = TinyNN.tnn_rms_norm(@sess, t_x, blk.t_norm1_gamma, eps)

  # Per-head attention. Build each head's output, then concat.
  t_head_outs = [build_attention_head(t_h1, blk.t_w_q[0], blk.t_w_k[0], blk.t_w_v[0], scale)]
  h = 1
  while h < @n_heads
    t_head_outs.push(build_attention_head(t_h1, blk.t_w_q[h], blk.t_w_k[h], blk.t_w_v[h], scale))
    h = h + 1
  end

  # Concat along ne0 (d_head -> d_model).
  t_concat = t_head_outs[0]
  h = 1
  while h < @n_heads
    t_concat = TinyNN.tnn_concat(@sess, t_concat, t_head_outs[h], 0)
    h = h + 1
  end

  # Output projection + residual.
  t_out_proj = TinyNN.tnn_matmul(@sess, blk.t_w_o, t_concat)
  t_x_attn   = TinyNN.tnn_add(@sess, t_x, t_out_proj)

  # Pre-norm before FFN.
  t_h2 = TinyNN.tnn_rms_norm(@sess, t_x_attn, blk.t_norm2_gamma, eps)

  # FFN (matches FFNFFICache's chained design).
  t_pre    = TinyNN.tnn_matmul(@sess, blk.t_w_ff1, t_h2)
  t_hidden = TinyNN.tnn_gelu(@sess, t_pre)
  t_ffn    = TinyNN.tnn_matmul(@sess, blk.t_w_ff2, t_hidden)

  # Second residual.
  TinyNN.tnn_add(@sess, t_x_attn, t_ffn)
end
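For checking the two element-wise pieces of the block by hand, here is a plain-Ruby sketch of RMSNorm over one token's feature vector and of GELU. It assumes eps is added inside the square root and uses the tanh approximation of GELU; the source does not state which variants tnn_rms_norm and tnn_gelu implement, so treat both as assumptions. `rms_norm` and `gelu_tanh` are hypothetical helpers:

```ruby
# RMSNorm for one token: x / sqrt(mean(x^2) + eps), then scaled by gamma.
# Assumption: eps sits inside the square root.
def rms_norm(x, gamma, eps)
  mean_sq = x.sum { |v| v * v } / x.size.to_f
  inv     = 1.0 / Math.sqrt(mean_sq + eps)
  x.each_with_index.map { |v, i| v * inv * gamma[i] }
end

# tanh approximation of GELU. Assumption: the backend may use a
# different variant (for example the exact erf form).
def gelu_tanh(v)
  0.5 * v * (1.0 + Math.tanh(Math.sqrt(2.0 / Math::PI) * (v + 0.044715 * v**3)))
end

normed = rms_norm([3.0, 4.0], [1.0, 1.0], 1.0e-5)
puts normed.map { |v| v.round(4) }.inspect
puts [-2.0, 0.0, 2.0].map { |v| gelu_tanh(v).round(4) }.inspect
```

With gamma set to all ones, the normalized vector has a root mean square close to one, which is a quick sanity check when comparing against graph outputs.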

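#realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object

Creates a TinyNN session, declares the persistent weight tensors, builds the forward graph for the given dimensions (embedding, n_layers blocks, final RMSNorm, tied unembed), and realizes it. Sets realized to true on completion.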



# File 'lib/toy/ffi/tinynn.rb', line 158

def realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size)
  @t_seq      = t_seq
  @d_model    = d_model
  @d_ff       = d_ff
  @n_heads    = n_heads
  @d_head     = d_model / n_heads
  @n_layers   = n_layers
  @vocab_size = vocab_size

  @sess = TinyNN.tnn_session_new(0)

  # === Persistent weights (ctx_w) ===
  @t_token_embed      = TinyNN.tnn_input_2d_f32_persistent(@sess, vocab_size, d_model)
  @t_pos_slice        = TinyNN.tnn_input_2d_f32_persistent(@sess, t_seq,      d_model)
  @t_final_norm_gamma = TinyNN.tnn_input_1d_f32_persistent(@sess, d_model)

  # Build per-block tensor handles (seed-then-push for Spinel's
  # Array<BlockFFICache> inference).
  @blocks_ffi = [BlockFFICache.new]
  li = 1
  while li < n_layers
    @blocks_ffi.push(BlockFFICache.new)
    li = li + 1
  end

  li = 0
  while li < n_layers
    blk = @blocks_ffi[li]
    blk.t_norm1_gamma = TinyNN.tnn_input_1d_f32_persistent(@sess, d_model)
    blk.t_norm2_gamma = TinyNN.tnn_input_1d_f32_persistent(@sess, d_model)
    # Per-head Q/K/V: shape (d_model, d_head). Uploaded TRANSPOSED so
    # ggml ne=[d_model, d_head] holds w.elem(r, c) = w[r][c].
    blk.t_w_q = [TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_k = [TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_v = [TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    h = 1
    while h < n_heads
      blk.t_w_q.push(TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_k.push(TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_v.push(TinyNN.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      h = h + 1
    end
    blk.t_w_o   = TinyNN.tnn_input_2d_f32_persistent(@sess, d_model, d_model)
    blk.t_w_ff1 = TinyNN.tnn_input_2d_f32_persistent(@sess, d_ff,    d_model)
    blk.t_w_ff2 = TinyNN.tnn_input_2d_f32_persistent(@sess, d_model, d_ff)
    li = li + 1
  end

  TinyNN.tnn_finalize_weights(@sess)

  # === Compute input ===
  @t_token_ids = TinyNN.tnn_input_1d_i32(@sess, t_seq)

  # === Forward graph ===
  # x_embed = token_embed[ids] + pos_slice  (ne=[d_model, T])
  t_embedded = TinyNN.tnn_get_rows(@sess, @t_token_embed, @t_token_ids)
  @t_x_embed = TinyNN.tnn_add(@sess, t_embedded, @t_pos_slice)
  TinyNN.tnn_set_output(@t_x_embed)

  # Through each block.
  t_cur = @t_x_embed
  eps   = 1.0e-5
  scale = 1.0 / Math.sqrt(d_head.to_f)
  li = 0
  while li < n_layers
    t_cur = build_block(t_cur, @blocks_ffi[li], eps, scale)
    li = li + 1
  end

  # Final RMSNorm on the post-blocks x.
  @t_x_final = TinyNN.tnn_rms_norm(@sess, t_cur, @t_final_norm_gamma, eps)
  TinyNN.tnn_set_output(@t_x_final)

  # Tied unembed: logits = mul_mat(token_embed, x_final)  ne=[vocab, T]
  @t_logits = TinyNN.tnn_matmul(@sess, @t_token_embed, @t_x_final)
  TinyNN.tnn_set_output(@t_logits)

  TinyNN.tnn_realize(@sess, @t_logits)
  @realized = true
end
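realize_for derives d_head with Integer division, so a d_model that is not a multiple of n_heads silently truncates: the concatenated head outputs would then be n_heads * d_head wide rather than d_model, which no longer matches t_w_o. A caller could guard against this before realizing. The sketch below is a hypothetical pre-flight check, not part of FullForwardFFICache:

```ruby
# Hypothetical pre-flight check mirroring the d_head and scale arithmetic
# in realize_for, with an explicit divisibility guard added.
def head_dims(d_model, n_heads)
  raise ArgumentError, "n_heads must be positive" unless n_heads.positive?
  d_head = d_model / n_heads            # Integer division, as in realize_for
  unless d_head * n_heads == d_model
    raise ArgumentError, "d_model=#{d_model} is not divisible by n_heads=#{n_heads}"
  end
  { d_head: d_head, scale: 1.0 / Math.sqrt(d_head.to_f) }
end

p head_dims(16, 4)
begin
  head_dims(10, 3)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Whether the real graph build fails loudly or misbehaves on such input depends on the backend, so checking up front is the safer choice.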