Class: FullForwardFFICacheCuda

Inherits:

Object

Defined in: lib/toy/ffi/tinynn_cuda.rb

Overview

Full forward pass of a TransformerLM as one persistent ggml graph, built incrementally. M1.1 covered the bookends: token embedding, positional embedding, and tied unembed. M1.2 adds one full transformer block: pre-RMSNorm, multi-head causal attention, residual, pre-RMSNorm, FFN, residual. M1.3+ will scale to n_layers blocks.
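The M1.1 bookends can be illustrated on the CPU with nested arrays. This is only a sketch: `embed` and `tied_unembed` are hypothetical helper names, the transformer blocks and the final RMSNorm are left out, and nothing here calls TinyNNCuda.

```ruby
# token_embed: vocab x d_model, pos_slice: T x d_model (nested arrays).
def embed(token_embed, pos_slice, token_ids)
  token_ids.each_with_index.map do |id, t|
    # Row lookup plus the positional row for position t.
    token_embed[id].each_with_index.map { |v, d| v + pos_slice[t][d] }
  end
end

# Tied unembed: reuse token_embed, logits[t][v] = dot(token_embed[v], x[t]).
def tied_unembed(token_embed, x)
  x.map do |row|
    token_embed.map { |e| e.each_with_index.sum { |v, d| v * row[d] } }
  end
end

token_embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # vocab = 3, d_model = 2
pos_slice   = [[0.0, 0.0], [0.5, 0.5]]               # T = 2
x      = embed(token_embed, pos_slice, [2, 0])
logits = tied_unembed(token_embed, x)
logits.each { |row| puts row.inspect }
```

The same matrix serves both ends, which is why the class keeps a single t_token_embed tensor.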

Layout conventions (see project_chained_ffn_2026_05_14):

- Mat (rows, cols) row-major upload  -> ggml ne=[cols, rows]
- Per-block intermediates carry ne=[d_model, T]: elem(d, t) is the
  logical value at (row=t, col=d).
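The first convention can be checked without a GPU. The sketch below is plain Ruby and assumes ggml's usual contiguous layout, in which ne0 is the fastest-varying dimension; `row_major_offset` and `ggml_offset` are hypothetical helper names, not part of TinyNNCuda.

```ruby
# Offset of logical element (r, c) in a row-major (rows, cols) buffer.
def row_major_offset(r, c, cols)
  r * cols + c
end

# Offset of element (i0, i1) in a contiguous 2-D tensor with ne = [ne0, ne1],
# assuming ne0 is the fastest-varying dimension.
def ggml_offset(i0, i1, ne0)
  i1 * ne0 + i0
end

rows = 3
cols = 4
mat  = Array.new(rows) { |r| Array.new(cols) { |c| 10 * r + c } }
flat = mat.flatten   # the row-major upload buffer

# Reading the same buffer as ne = [cols, rows]: elem(i0 = c, i1 = r) is mat[r][c].
same = true
rows.times do |r|
  cols.times do |c|
    same &&= flat[ggml_offset(c, r, cols)] == mat[r][c]
    same &&= ggml_offset(c, r, cols) == row_major_offset(r, c, cols)
  end
end
puts same
```

No data is moved by the upload; only the order in which the two dimensions are named changes.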

Persistent (ctx_w):

- t_token_embed (vocab, d_model)
- t_pos_slice   (T, d_model)
- t_final_norm_gamma (d_model)
- per block (in @blocks_ffi):
  - t_norm1_gamma, t_norm2_gamma (d_model)
  - t_w_q[h], t_w_k[h], t_w_v[h] (d_model, d_head) per head
  - t_w_o   (d_model, d_model)
  - t_w_ff1 (d_model, d_ff), t_w_ff2 (d_ff, d_model)
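As a rough sizing aid, the persistent tensors listed above can be totalled from the model dimensions. `persistent_element_count` is a hypothetical helper written for this illustration; it counts elements only and assumes d_model is a multiple of n_heads.

```ruby
# Number of elements across the persistent tensors listed above.
def persistent_element_count(t_seq:, d_model:, d_ff:, n_heads:, n_layers:, vocab_size:)
  d_head = d_model / n_heads
  # t_token_embed + t_pos_slice + t_final_norm_gamma
  bookends = vocab_size * d_model + t_seq * d_model + d_model
  per_block =
    2 * d_model +                     # t_norm1_gamma, t_norm2_gamma
    3 * n_heads * d_model * d_head +  # t_w_q, t_w_k, t_w_v for every head
    d_model * d_model +               # t_w_o
    2 * d_model * d_ff                # t_w_ff1, t_w_ff2
  bookends + n_layers * per_block
end

count = persistent_element_count(
  t_seq: 8, d_model: 16, d_ff: 64, n_heads: 4, n_layers: 2, vocab_size: 32
)
puts count
```

Because the tensors are declared through the `*_f32_persistent` calls, multiplying the count by 4 gives an estimate of the weight bytes.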

Compute (ctx): t_token_ids (T int32), intermediates, t_logits

Instance Attribute Summary

#blocks_ffi ⇒ Object

Returns the value of attribute blocks_ffi.
#d_ff ⇒ Object

Returns the value of attribute d_ff.
#d_head ⇒ Object

Returns the value of attribute d_head.
#d_model ⇒ Object

Returns the value of attribute d_model.
#n_heads ⇒ Object

Returns the value of attribute n_heads.
#n_layers ⇒ Object

Returns the value of attribute n_layers.
#realized ⇒ Object

Returns the value of attribute realized.
#sess ⇒ Object

Returns the value of attribute sess.
#t_final_norm_gamma ⇒ Object

Returns the value of attribute t_final_norm_gamma.
#t_logits ⇒ Object

Returns the value of attribute t_logits.
#t_pos_slice ⇒ Object

Returns the value of attribute t_pos_slice.
#t_seq ⇒ Object

Returns the value of attribute t_seq.
#t_token_embed ⇒ Object

Returns the value of attribute t_token_embed.
#t_token_ids ⇒ Object

Returns the value of attribute t_token_ids.
#t_x_embed ⇒ Object

Returns the value of attribute t_x_embed.
#t_x_final ⇒ Object

Returns the value of attribute t_x_final.
#vocab_size ⇒ Object

Returns the value of attribute vocab_size.

Instance Method Summary

#build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object

Single attention head, given pre-normed x and the head’s persistent Q/K/V weights.
#build_block(t_x, blk, eps, scale) ⇒ Object

Build one transformer block’s graph nodes.
#initialize ⇒ FullForwardFFICacheCuda constructor

A new instance of FullForwardFFICacheCuda.
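#realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object

Declares the persistent weights, builds the full forward graph for the given dimensions, and realizes it.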

Constructor Details

#initialize ⇒ `FullForwardFFICacheCuda`

Returns a new instance of FullForwardFFICacheCuda.

# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def initialize
  @realized   = false
  @t_seq      = 0
  @d_model    = 0
  @d_ff       = 0
  @n_heads    = 0
  @d_head     = 0
  @n_layers   = 0
  @vocab_size = 0
  @sess               = TinyNNCuda.tnn_null_ptr
  @t_token_embed      = TinyNNCuda.tnn_null_ptr
  @t_pos_slice        = TinyNNCuda.tnn_null_ptr
  @t_token_ids        = TinyNNCuda.tnn_null_ptr
  @t_final_norm_gamma = TinyNNCuda.tnn_null_ptr
  @t_x_embed          = TinyNNCuda.tnn_null_ptr
  @t_x_final          = TinyNNCuda.tnn_null_ptr
  @t_logits           = TinyNNCuda.tnn_null_ptr
  @blocks_ffi         = [BlockFFICacheCuda.new]
end

Instance Attribute Details

#blocks_ffi ⇒ `Object`

Returns the value of attribute blocks_ffi.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def blocks_ffi
  @blocks_ffi
end

#d_ff ⇒ `Object`

Returns the value of attribute d_ff.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def d_ff
  @d_ff
end

#d_head ⇒ `Object`

Returns the value of attribute d_head.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def d_head
  @d_head
end

#d_model ⇒ `Object`

Returns the value of attribute d_model.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def d_model
  @d_model
end

#n_heads ⇒ `Object`

Returns the value of attribute n_heads.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def n_heads
  @n_heads
end

#n_layers ⇒ `Object`

Returns the value of attribute n_layers.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def n_layers
  @n_layers
end

#realized ⇒ `Object`

Returns the value of attribute realized.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def realized
  @realized
end

#sess ⇒ `Object`

Returns the value of attribute sess.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def sess
  @sess
end

#t_final_norm_gamma ⇒ `Object`

Returns the value of attribute t_final_norm_gamma.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_final_norm_gamma
  @t_final_norm_gamma
end

#t_logits ⇒ `Object`

Returns the value of attribute t_logits.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_logits
  @t_logits
end

#t_pos_slice ⇒ `Object`

Returns the value of attribute t_pos_slice.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_pos_slice
  @t_pos_slice
end

#t_seq ⇒ `Object`

Returns the value of attribute t_seq.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_seq
  @t_seq
end

#t_token_embed ⇒ `Object`

Returns the value of attribute t_token_embed.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_token_embed
  @t_token_embed
end

#t_token_ids ⇒ `Object`

Returns the value of attribute t_token_ids.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_token_ids
  @t_token_ids
end

#t_x_embed ⇒ `Object`

Returns the value of attribute t_x_embed.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_x_embed
  @t_x_embed
end

#t_x_final ⇒ `Object`

Returns the value of attribute t_x_final.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_x_final
  @t_x_final
end

#vocab_size ⇒ `Object`

Returns the value of attribute vocab_size.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def vocab_size
  @vocab_size
end

Instance Method Details

#build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ `Object`

Single attention head, given pre-normed x and the head’s persistent Q/K/V weights. See build_block’s docstring for the math.

# File 'lib/toy/ffi/tinynn_cuda.rb', line 1080

def build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale)
  t_q = TinyNNCuda.tnn_matmul(@sess, t_w_q, t_x)   # ne=[d_head, T]
  t_k = TinyNNCuda.tnn_matmul(@sess, t_w_k, t_x)   # ne=[d_head, T]
  # v in Pattern A (ne=[T, d_head]) so head_out's k_dim matches.
  # mul_mat(x, w_v_t) where x.ne=[d_model, T] and w_v_t.ne=[d_model, d_head]
  # yields ne=[T, d_head]. ✓
  t_v = TinyNNCuda.tnn_matmul(@sess, t_x, t_w_v)

  t_scores = TinyNNCuda.tnn_matmul(@sess, t_k, t_q)            # ne=[T_key, T_query]
  t_scaled = TinyNNCuda.tnn_scale(@sess, t_scores, scale)
  t_masked = TinyNNCuda.tnn_diag_mask_inf(@sess, t_scaled, 0)
  t_attn   = TinyNNCuda.tnn_softmax(@sess, t_masked)           # softmax along ne0 = key dim

  TinyNNCuda.tnn_matmul(@sess, t_v, t_attn)                    # ne=[d_head, T_query]
end
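The ne comments in this method can be traced mechanically. The sketch below assumes tnn_matmul follows the ggml_mul_mat shape rule that the comments imply (a.ne = [k, n] and b.ne = [k, m] give result.ne = [n, m]); `mul_mat_ne` is a hypothetical helper that works on shapes only.

```ruby
# Shape rule assumed for tnn_matmul, mirroring ggml_mul_mat:
# a.ne = [k, n], b.ne = [k, m]  ->  result.ne = [n, m].
def mul_mat_ne(a_ne, b_ne)
  raise ArgumentError, "ne0 mismatch: #{a_ne[0]} vs #{b_ne[0]}" unless a_ne[0] == b_ne[0]
  [a_ne[1], b_ne[1]]
end

d_model = 16
d_head  = 4
t       = 8

x_ne = [d_model, t]        # pre-normed input
w_ne = [d_model, d_head]   # each of w_q, w_k, w_v

q_ne      = mul_mat_ne(w_ne, x_ne)       # [d_head, T]
k_ne      = mul_mat_ne(w_ne, x_ne)       # [d_head, T]
v_ne      = mul_mat_ne(x_ne, w_ne)       # [T, d_head], operands swapped
scores_ne = mul_mat_ne(k_ne, q_ne)       # [T_key, T_query]
out_ne    = mul_mat_ne(v_ne, scores_ne)  # [d_head, T_query]

puts [q_ne, k_ne, v_ne, scores_ne, out_ne].inspect
```

Swapping the operands for v is what puts T on ne0, so that the final product contracts over the key dimension.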

#build_block(t_x, blk, eps, scale) ⇒ `Object`

Build one transformer block’s graph nodes. Returns the block’s output tensor (post-FFN residual). Mathematics:

h1 = rms_norm(x, norm1_gamma)
per head h:
  q_h = w_q[h]^T @ h1     (mul_mat(w_q_t_h, h1)  ne=[d_head, T])
  k_h = w_k[h]^T @ h1
  v_h = h1 @ w_v[h]       (mul_mat(h1, w_v_t_h)  ne=[T, d_head])
  scores_h = mul_mat(k_h, q_h)   ne=[T_key, T_query]
  scaled_h = scale(scores_h, 1/sqrt(d_head))
  masked_h = diag_mask_inf(scaled_h, 0)         -- causal
  attn_h   = soft_max(masked_h)  -- per-query softmax over keys
  head_out_h = mul_mat(v_h, attn_h)  ne=[d_head, T_query]
concat = concat_along_d(head_out_h for h in heads)  ne=[d_model, T]
out_proj = mul_mat(w_o_t, concat)  ne=[d_model, T]
x_attn = x + out_proj
h2 = rms_norm(x_attn, norm2_gamma)
ffn:
  pre    = mul_mat(w_ff1_t, h2)   ne=[d_ff,    T]
  hidden = gelu(pre)
  ffn_out= mul_mat(w_ff2_t, hidden) ne=[d_model, T]
x_out = x_attn + ffn_out
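The scale, mask, and softmax steps for one head can be written as a small CPU reference. This is a sketch, not the library's kernel: `causal_attention_weights` is a hypothetical name, and `scores[q][k]` holds the raw dot product of query q with key k, so each inner array corresponds to one ne0 run of the ne=[T_key, T_query] tensor.

```ruby
# scores[q][k] is the raw dot product of query q with key k.
def causal_attention_weights(scores, d_head)
  scale = 1.0 / Math.sqrt(d_head.to_f)
  scores.each_with_index.map do |row, q|
    # Keys after the query position are masked out (treated as -infinity).
    visible = row.each_with_index.select { |_, k| k <= q }.map { |s, _| s * scale }
    max  = visible.max                          # subtract max for numerical stability
    exps = visible.map { |s| Math.exp(s - max) }
    sum  = exps.sum
    # Masked positions receive exactly zero weight.
    exps.map { |e| e / sum } + Array.new(row.length - visible.length, 0.0)
  end
end

weights = causal_attention_weights(
  [[1.0, 2.0, 3.0], [0.0, 0.0, 5.0], [2.0, 2.0, 2.0]], 4
)
weights.each { |row| puts row.inspect }
```

The first query can only attend to itself, so its weight row is a one-hot vector regardless of the raw scores.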

# File 'lib/toy/ffi/tinynn_cuda.rb', line 1042

def build_block(t_x, blk, eps, scale)
  # Pre-norm before attention.
  t_h1 = TinyNNCuda.tnn_rms_norm(@sess, t_x, blk.t_norm1_gamma, eps)

  # Per-head attention. Build each head's output, then concat.
  t_head_outs = [build_attention_head(t_h1, blk.t_w_q[0], blk.t_w_k[0], blk.t_w_v[0], scale)]
  h = 1
  while h < @n_heads
    t_head_outs.push(build_attention_head(t_h1, blk.t_w_q[h], blk.t_w_k[h], blk.t_w_v[h], scale))
    h = h + 1
  end

  # Concat along ne0 (d_head -> d_model).
  t_concat = t_head_outs[0]
  h = 1
  while h < @n_heads
    t_concat = TinyNNCuda.tnn_concat(@sess, t_concat, t_head_outs[h], 0)
    h = h + 1
  end

  # Output projection + residual.
  t_out_proj = TinyNNCuda.tnn_matmul(@sess, blk.t_w_o, t_concat)
  t_x_attn   = TinyNNCuda.tnn_add(@sess, t_x, t_out_proj)

  # Pre-norm before FFN.
  t_h2 = TinyNNCuda.tnn_rms_norm(@sess, t_x_attn, blk.t_norm2_gamma, eps)

  # FFN (matches FFNFFICache's chained design).
  t_pre    = TinyNNCuda.tnn_matmul(@sess, blk.t_w_ff1, t_h2)
  t_hidden = TinyNNCuda.tnn_gelu(@sess, t_pre)
  t_ffn    = TinyNNCuda.tnn_matmul(@sess, blk.t_w_ff2, t_hidden)

  # Second residual.
  TinyNNCuda.tnn_add(@sess, t_x_attn, t_ffn)
end
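The pre-norm residual wiring appears twice in this method, once around attention and once around the FFN. The sketch below shows it on a single d_model-length column. It assumes the common RMSNorm definition y[d] = x[d] / sqrt(mean(x^2) + eps) * gamma[d]; the exact formula inside tnn_rms_norm is not shown in this page, and `rms_norm` and `pre_norm_residual` are hypothetical helper names.

```ruby
# RMSNorm over one d_model-length column (assumed definition, see lead-in).
def rms_norm(x, gamma, eps)
  mean_sq = x.sum { |v| v * v } / x.length.to_f
  inv_rms = 1.0 / Math.sqrt(mean_sq + eps)
  x.each_with_index.map { |v, d| v * inv_rms * gamma[d] }
end

# Pre-norm residual: x + f(rms_norm(x)), where f is the given block
# (attention or FFN in the method above).
def pre_norm_residual(x, gamma, eps)
  y = yield rms_norm(x, gamma, eps)
  x.each_with_index.map { |v, d| v + y[d] }
end

x      = [3.0, 4.0]
gamma  = [1.0, 1.0]
normed = rms_norm(x, gamma, 0.0)
out    = pre_norm_residual(x, gamma, 0.0) { |h| h.map { |v| 2.0 * v } }
puts normed.inspect
puts out.inspect
```

The residual adds the un-normalized x, matching `tnn_add(@sess, t_x, t_out_proj)` and `tnn_add(@sess, t_x_attn, t_ffn)` in the listing.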

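#realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ `Object`

Records the model dimensions, creates the session, declares the persistent weight tensors, builds the forward graph (embed, n_layers blocks, final RMSNorm, tied unembed), realizes it, and sets realized to true.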

# File 'lib/toy/ffi/tinynn_cuda.rb', line 940

def realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size)
  @t_seq      = t_seq
  @d_model    = d_model
  @d_ff       = d_ff
  @n_heads    = n_heads
  @d_head     = d_model / n_heads
  @n_layers   = n_layers
  @vocab_size = vocab_size

  @sess = TinyNNCuda.tnn_session_new(1)

  # === Persistent weights (ctx_w) ===
  @t_token_embed      = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, vocab_size, d_model)
  @t_pos_slice        = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, t_seq,      d_model)
  @t_final_norm_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)

  # Build per-block tensor handles (seed-then-push for Spinel's
  # Array<BlockFFICacheCuda> inference).
  @blocks_ffi = [BlockFFICacheCuda.new]
  li = 1
  while li < n_layers
    @blocks_ffi.push(BlockFFICacheCuda.new)
    li = li + 1
  end

  li = 0
  while li < n_layers
    blk = @blocks_ffi[li]
    blk.t_norm1_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)
    blk.t_norm2_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)
    # Per-head Q/K/V: shape (d_model, d_head). Uploaded TRANSPOSED so
    # ggml ne=[d_model, d_head] holds w.elem(r, c) = w[r][c].
    blk.t_w_q = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_k = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_v = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    h = 1
    while h < n_heads
      blk.t_w_q.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_k.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_v.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      h = h + 1
    end
    blk.t_w_o   = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_model, d_model)
    blk.t_w_ff1 = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_ff,    d_model)
    blk.t_w_ff2 = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_model, d_ff)
    li = li + 1
  end

  TinyNNCuda.tnn_finalize_weights(@sess)

  # === Compute input ===
  @t_token_ids = TinyNNCuda.tnn_input_1d_i32(@sess, t_seq)

  # === Forward graph ===
  # x_embed = token_embed[ids] + pos_slice  (ne=[d_model, T])
  t_embedded = TinyNNCuda.tnn_get_rows(@sess, @t_token_embed, @t_token_ids)
  @t_x_embed = TinyNNCuda.tnn_add(@sess, t_embedded, @t_pos_slice)
  TinyNNCuda.tnn_set_output(@t_x_embed)

  # Through each block.
  t_cur = @t_x_embed
  eps   = 1.0e-5
  scale = 1.0 / Math.sqrt(d_head.to_f)
  li = 0
  while li < n_layers
    t_cur = build_block(t_cur, @blocks_ffi[li], eps, scale)
    li = li + 1
  end

  # Final RMSNorm on the post-blocks x.
  @t_x_final = TinyNNCuda.tnn_rms_norm(@sess, t_cur, @t_final_norm_gamma, eps)
  TinyNNCuda.tnn_set_output(@t_x_final)

  # Tied unembed: logits = mul_mat(token_embed, x_final)  ne=[vocab, T]
  @t_logits = TinyNNCuda.tnn_matmul(@sess, @t_token_embed, @t_x_final)
  TinyNNCuda.tnn_set_output(@t_logits)

  TinyNNCuda.tnn_realize(@sess, @t_logits)
  @realized = true
end
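The method computes d_head with integer division, so a d_model that is not a multiple of n_heads would truncate, and the concatenated heads would then have fewer than d_model rows. A caller could guard against that before calling realize_for. The sketch below is a hypothetical pre-flight check, not part of the documented class.

```ruby
# Hypothetical pre-flight check; returns the derived per-head values
# that realize_for would compute from the same inputs.
def head_dims(d_model, n_heads)
  raise ArgumentError, "n_heads must be positive" unless n_heads.positive?
  unless (d_model % n_heads).zero?
    raise ArgumentError, "d_model (#{d_model}) is not divisible by n_heads (#{n_heads})"
  end
  d_head = d_model / n_heads
  { d_head: d_head, scale: 1.0 / Math.sqrt(d_head.to_f) }
end

dims = head_dims(16, 4)
puts dims.inspect

begin
  head_dims(10, 4)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```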
