Class: FullForwardFFICacheCuda

Inherits:
Object
Defined in:
lib/toy/ffi/tinynn_cuda.rb

Overview

Full forward pass of a TransformerLM as one persistent ggml graph, built incrementally. M1.1 covered the bookends: embed, positional embedding, and tied unembed. M1.2 adds one full transformer block: pre-RMSNorm, multi-head causal attention, residual, pre-RMSNorm, FFN, residual. M1.3+ will scale to n_layers blocks.

Layout conventions (see project_chained_ffn_2026_05_14):

- Mat (rows, cols) row-major upload  -> ggml ne=[cols, rows]
- Per-block intermediates carry ne=[d_model, T]: elem(d, t) is the
  logical value at (row=t, col=d).
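
The two conventions above describe one flat buffer read two ways. A minimal sketch, assuming a contiguous f32 buffer and ggml's usual ordering in which ne[0] is the fastest-varying extent; the helper names are illustrative and not part of TinyNNCuda:

```ruby
# Flat index of logical element (r, c) in a row-major Mat with `cols` columns.
def row_major_index(r, c, cols)
  r * cols + c
end

# Flat index of ggml element (i0, i1) in a contiguous 2-D tensor whose
# fastest-varying extent is ne0.
def ggml_index(i0, i1, ne0)
  i1 * ne0 + i0
end

rows = 3
cols = 4
ne = [cols, rows]   # Mat (rows, cols) -> ne=[cols, rows]

# Every logical (r, c) lands on the same flat slot as ggml element (c, r).
LAYOUT_MATCHES = (0...rows).all? do |r|
  (0...cols).all? { |c| row_major_index(r, c, cols) == ggml_index(c, r, ne[0]) }
end
```

Under these assumptions no data movement is needed on upload: only the order in which the two extents are named changes.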

Persistent (ctx_w):

- t_token_embed (vocab, d_model)
- t_pos_slice   (T, d_model)
- t_final_norm_gamma (d_model)
- per block (in @blocks_ffi):
  - t_norm1_gamma, t_norm2_gamma (d_model)
  - t_w_q[h], t_w_k[h], t_w_v[h] (d_model, d_head) per head
  - t_w_o   (d_model, d_model)
  - t_w_ff1 (d_model, d_ff), t_w_ff2 (d_ff, d_model)
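
As a rough sizing aid, the persistent tensors listed above can be tallied from the model dimensions. This is a sketch only: persistent_element_count is a hypothetical helper, it counts f32 elements rather than bytes, and it assumes d_head = d_model / n_heads as realize_for computes it.

```ruby
# Number of f32 elements held by the persistent tensors listed above.
# Hypothetical helper; not part of FullForwardFFICacheCuda.
def persistent_element_count(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size)
  d_head = d_model / n_heads

  bookends = vocab_size * d_model +   # t_token_embed
             t_seq * d_model +        # t_pos_slice
             d_model                  # t_final_norm_gamma

  per_block = 2 * d_model +                    # t_norm1_gamma, t_norm2_gamma
              3 * n_heads * d_model * d_head + # t_w_q, t_w_k, t_w_v over all heads
              d_model * d_model +              # t_w_o
              2 * d_model * d_ff               # t_w_ff1, t_w_ff2

  bookends + n_layers * per_block
end
```

Multiplying the result by 4 gives an approximate byte count for the f32 weights, ignoring any allocator padding.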

Compute (ctx): t_token_ids (T int32), intermediates, t_logits


Constructor Details

#initialize ⇒ FullForwardFFICacheCuda

Returns a new instance of FullForwardFFICacheCuda.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 920

def initialize
  @realized   = false
  @t_seq      = 0
  @d_model    = 0
  @d_ff       = 0
  @n_heads    = 0
  @d_head     = 0
  @n_layers   = 0
  @vocab_size = 0
  @sess               = TinyNNCuda.tnn_null_ptr
  @t_token_embed      = TinyNNCuda.tnn_null_ptr
  @t_pos_slice        = TinyNNCuda.tnn_null_ptr
  @t_token_ids        = TinyNNCuda.tnn_null_ptr
  @t_final_norm_gamma = TinyNNCuda.tnn_null_ptr
  @t_x_embed          = TinyNNCuda.tnn_null_ptr
  @t_x_final          = TinyNNCuda.tnn_null_ptr
  @t_logits           = TinyNNCuda.tnn_null_ptr
  @blocks_ffi         = [BlockFFICacheCuda.new]
end

Instance Attribute Details

#blocks_ffi ⇒ Object

Returns the value of attribute blocks_ffi.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def blocks_ffi
  @blocks_ffi
end

#d_ff ⇒ Object

Returns the value of attribute d_ff.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def d_ff
  @d_ff
end

#d_head ⇒ Object

Returns the value of attribute d_head.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def d_head
  @d_head
end

#d_model ⇒ Object

Returns the value of attribute d_model.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def d_model
  @d_model
end

#n_heads ⇒ Object

Returns the value of attribute n_heads.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def n_heads
  @n_heads
end

#n_layers ⇒ Object

Returns the value of attribute n_layers.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def n_layers
  @n_layers
end

#realized ⇒ Object

Returns the value of attribute realized.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def realized
  @realized
end

#sess ⇒ Object

Returns the value of attribute sess.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def sess
  @sess
end

#t_final_norm_gamma ⇒ Object

Returns the value of attribute t_final_norm_gamma.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_final_norm_gamma
  @t_final_norm_gamma
end

#t_logits ⇒ Object

Returns the value of attribute t_logits.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_logits
  @t_logits
end

#t_pos_slice ⇒ Object

Returns the value of attribute t_pos_slice.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_pos_slice
  @t_pos_slice
end

#t_seq ⇒ Object

Returns the value of attribute t_seq.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_seq
  @t_seq
end

#t_token_embed ⇒ Object

Returns the value of attribute t_token_embed.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_token_embed
  @t_token_embed
end

#t_token_ids ⇒ Object

Returns the value of attribute t_token_ids.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_token_ids
  @t_token_ids
end

#t_x_embed ⇒ Object

Returns the value of attribute t_x_embed.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_x_embed
  @t_x_embed
end

#t_x_final ⇒ Object

Returns the value of attribute t_x_final.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def t_x_final
  @t_x_final
end

#vocab_size ⇒ Object

Returns the value of attribute vocab_size.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 913

def vocab_size
  @vocab_size
end

Instance Method Details

#build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale) ⇒ Object

Single attention head, given pre-normed x and the head’s persistent Q/K/V weights. See build_block’s docstring for the math.



# File 'lib/toy/ffi/tinynn_cuda.rb', line 1080

def build_attention_head(t_x, t_w_q, t_w_k, t_w_v, scale)
  t_q = TinyNNCuda.tnn_matmul(@sess, t_w_q, t_x)   # ne=[d_head, T]
  t_k = TinyNNCuda.tnn_matmul(@sess, t_w_k, t_x)   # ne=[d_head, T]
  # v in Pattern A (ne=[T, d_head]) so head_out's k_dim matches.
  # mul_mat(x, w_v_t) where x.ne=[d_model, T] and w_v_t.ne=[d_model, d_head]
  # yields ne=[T, d_head]. ✓
  t_v = TinyNNCuda.tnn_matmul(@sess, t_x, t_w_v)

  t_scores = TinyNNCuda.tnn_matmul(@sess, t_k, t_q)            # ne=[T_key, T_query]
  t_scaled = TinyNNCuda.tnn_scale(@sess, t_scores, scale)
  t_masked = TinyNNCuda.tnn_diag_mask_inf(@sess, t_scaled, 0)
  t_attn   = TinyNNCuda.tnn_softmax(@sess, t_masked)           # softmax along ne0 = key dim

  TinyNNCuda.tnn_matmul(@sess, t_v, t_attn)                    # ne=[d_head, T_query]
end
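
For checking the head numerically, a CPU reference on plain Ruby arrays can help. The sketch below is an independent reimplementation of the same math, not the FFI path: it uses logical (row=t) layout throughout, takes w_q, w_k, w_v as d_model rows of d_head floats, and assumes the mask hides keys positioned after the query. The names matmul and causal_attention_head are illustrative.

```ruby
# Plain matrix product: a is (n, k), b is (k, m), result is (n, m).
def matmul(a, b)
  b_cols = b.transpose
  a.map { |row| b_cols.map { |col| row.zip(col).sum { |p, q| p * q } } }
end

# Reference single-head causal attention.
# x: T rows of d_model floats. Returns T rows of d_head floats.
def causal_attention_head(x, w_q, w_k, w_v)
  q = matmul(x, w_q)   # (T, d_head)
  k = matmul(x, w_k)   # (T, d_head)
  v = matmul(x, w_v)   # (T, d_head)
  scale = 1.0 / Math.sqrt(w_q[0].length.to_f)

  q.each_with_index.map do |q_row, t|
    # Query t may attend to keys 0..t only (the causal mask).
    scores = (0..t).map { |s| scale * q_row.zip(k[s]).sum { |a, b| a * b } }
    # Softmax over the visible keys, shifted by the max for stability.
    peak    = scores.max
    exps    = scores.map { |sc| Math.exp(sc - peak) }
    total   = exps.sum
    weights = exps.map { |e| e / total }
    # Weighted sum of the visible value rows.
    (0...v[0].length).map { |d| (0..t).sum { |s| weights[s] * v[s][d] } }
  end
end
```

Row t of the result corresponds to column t of the graph's ne=[d_head, T_query] head output, so a comparison against downloaded GPU values would need to account for that layout.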

#build_block(t_x, blk, eps, scale) ⇒ Object

Build one transformer block’s graph nodes. Returns the block’s output tensor (post-FFN residual). Mathematics:

h1 = rms_norm(x, norm1_gamma)
per head h:
  q_h = w_q[h]^T @ h1     (mul_mat(w_q_t_h, h1)  ne=[d_head, T])
  k_h = w_k[h]^T @ h1
  v_h = h1 @ w_v[h]       (mul_mat(h1, w_v_t_h)  ne=[T, d_head])
  scores_h = mul_mat(k_h, q_h)   ne=[T_key, T_query]
  scaled_h = scale(scores_h, 1/sqrt(d_head))
  masked_h = diag_mask_inf(scaled_h, 0)         -- causal
  attn_h   = soft_max(masked_h)  -- per-query softmax over keys
  head_out_h = mul_mat(v_h, attn_h)  ne=[d_head, T_query]
concat = concat_along_d(head_out_h for h in heads)  ne=[d_model, T]
out_proj = mul_mat(w_o_t, concat)  ne=[d_model, T]
x_attn = x + out_proj
h2 = rms_norm(x_attn, norm2_gamma)
ffn:
  pre    = mul_mat(w_ff1_t, h2)   ne=[d_ff,    T]
  hidden = gelu(pre)
  ffn_out= mul_mat(w_ff2_t, hidden) ne=[d_model, T]
x_out = x_attn + ffn_out
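
The rms_norm steps above can be sketched on a single column of d_model values. This assumes the common definition (divide by the root of the mean square plus eps, then scale elementwise by gamma); whether tnn_rms_norm applies gamma in exactly this way is an assumption, and the function below is a plain-Ruby illustration rather than the FFI call.

```ruby
# RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * gamma, elementwise.
def rms_norm(row, gamma, eps = 1.0e-5)
  mean_sq = row.sum { |v| v * v }.to_f / row.length
  inv_rms = 1.0 / Math.sqrt(mean_sq + eps)
  row.each_with_index.map { |v, i| v * inv_rms * gamma[i] }
end
```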


# File 'lib/toy/ffi/tinynn_cuda.rb', line 1042

def build_block(t_x, blk, eps, scale)
  # Pre-norm before attention.
  t_h1 = TinyNNCuda.tnn_rms_norm(@sess, t_x, blk.t_norm1_gamma, eps)

  # Per-head attention. Build each head's output, then concat.
  t_head_outs = [build_attention_head(t_h1, blk.t_w_q[0], blk.t_w_k[0], blk.t_w_v[0], scale)]
  h = 1
  while h < @n_heads
    t_head_outs.push(build_attention_head(t_h1, blk.t_w_q[h], blk.t_w_k[h], blk.t_w_v[h], scale))
    h = h + 1
  end

  # Concat along ne0 (d_head -> d_model).
  t_concat = t_head_outs[0]
  h = 1
  while h < @n_heads
    t_concat = TinyNNCuda.tnn_concat(@sess, t_concat, t_head_outs[h], 0)
    h = h + 1
  end

  # Output projection + residual.
  t_out_proj = TinyNNCuda.tnn_matmul(@sess, blk.t_w_o, t_concat)
  t_x_attn   = TinyNNCuda.tnn_add(@sess, t_x, t_out_proj)

  # Pre-norm before FFN.
  t_h2 = TinyNNCuda.tnn_rms_norm(@sess, t_x_attn, blk.t_norm2_gamma, eps)

  # FFN (matches FFNFFICache's chained design).
  t_pre    = TinyNNCuda.tnn_matmul(@sess, blk.t_w_ff1, t_h2)
  t_hidden = TinyNNCuda.tnn_gelu(@sess, t_pre)
  t_ffn    = TinyNNCuda.tnn_matmul(@sess, blk.t_w_ff2, t_hidden)

  # Second residual.
  TinyNNCuda.tnn_add(@sess, t_x_attn, t_ffn)
end
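
The ne annotations in this block can be checked mechanically. The sketch below assumes tnn_matmul(sess, a, b) follows ggml's mul_mat shape rule (shared dimension is ne0 of both operands, result ne=[a.ne1, b.ne1]) and that weight shapes are the ne values implied by realize_for's uploads. mul_mat_ne and trace_block_ne are hypothetical helpers that track shapes only, with no tensors involved.

```ruby
# Shape rule for mul_mat(a, b): ne0 must match, result is ne=[a.ne1, b.ne1].
def mul_mat_ne(a_ne, b_ne)
  raise ArgumentError, "k mismatch: #{a_ne} vs #{b_ne}" unless a_ne[0] == b_ne[0]
  [a_ne[1], b_ne[1]]
end

# Trace the ne of each matmul result in one block. rms_norm, scale, mask,
# softmax, gelu and add preserve shape, so they are skipped here.
def trace_block_ne(t_seq, d_model, d_ff, n_heads)
  d_head = d_model / n_heads
  x      = [d_model, t_seq]
  w_qkv  = [d_model, d_head]

  q      = mul_mat_ne(w_qkv, x)                    # [d_head, T]
  k      = mul_mat_ne(w_qkv, x)                    # [d_head, T]
  v      = mul_mat_ne(x, w_qkv)                    # [T, d_head]
  scores = mul_mat_ne(k, q)                        # [T_key, T_query]
  head   = mul_mat_ne(v, scores)                   # [d_head, T_query]
  concat = [head[0] * n_heads, head[1]]            # concat along ne0
  out    = mul_mat_ne([d_model, d_model], concat)  # w_o   -> [d_model, T]
  pre    = mul_mat_ne([d_model, d_ff], out)        # w_ff1 -> [d_ff, T]
  ffn    = mul_mat_ne([d_ff, d_model], pre)        # w_ff2 -> [d_model, T]

  { q: q, v: v, scores: scores, head: head,
    concat: concat, out: out, pre: pre, ffn: ffn }
end
```

Calling trace_block_ne with a d_model that is not a multiple of n_heads raises at the w_o step, which is the same mismatch the real graph would hit.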

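#realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size) ⇒ Object

Records the model dimensions (with d_head = d_model / n_heads, using integer division), opens a session, declares the persistent weight tensors, then builds and realizes the forward graph: token embedding plus positional slice, n_layers blocks, final RMSNorm, and the tied unembed into t_logits. Sets realized to true on completion.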



# File 'lib/toy/ffi/tinynn_cuda.rb', line 940

def realize_for(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size)
  @t_seq      = t_seq
  @d_model    = d_model
  @d_ff       = d_ff
  @n_heads    = n_heads
  @d_head     = d_model / n_heads
  @n_layers   = n_layers
  @vocab_size = vocab_size

  @sess = TinyNNCuda.tnn_session_new(1)

  # === Persistent weights (ctx_w) ===
  @t_token_embed      = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, vocab_size, d_model)
  @t_pos_slice        = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, t_seq,      d_model)
  @t_final_norm_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)

  # Build per-block tensor handles (seed-then-push for Spinel's
  # Array<BlockFFICacheCuda> inference).
  @blocks_ffi = [BlockFFICacheCuda.new]
  li = 1
  while li < n_layers
    @blocks_ffi.push(BlockFFICacheCuda.new)
    li = li + 1
  end

  li = 0
  while li < n_layers
    blk = @blocks_ffi[li]
    blk.t_norm1_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)
    blk.t_norm2_gamma = TinyNNCuda.tnn_input_1d_f32_persistent(@sess, d_model)
    # Per-head Q/K/V: shape (d_model, d_head). Uploaded TRANSPOSED so
    # ggml ne=[d_model, d_head] holds w.elem(r, c) = w[r][c].
    blk.t_w_q = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_k = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    blk.t_w_v = [TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model)]
    h = 1
    while h < n_heads
      blk.t_w_q.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_k.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      blk.t_w_v.push(TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_head, d_model))
      h = h + 1
    end
    blk.t_w_o   = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_model, d_model)
    blk.t_w_ff1 = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_ff,    d_model)
    blk.t_w_ff2 = TinyNNCuda.tnn_input_2d_f32_persistent(@sess, d_model, d_ff)
    li = li + 1
  end

  TinyNNCuda.tnn_finalize_weights(@sess)

  # === Compute input ===
  @t_token_ids = TinyNNCuda.tnn_input_1d_i32(@sess, t_seq)

  # === Forward graph ===
  # x_embed = token_embed[ids] + pos_slice  (ne=[d_model, T])
  t_embedded = TinyNNCuda.tnn_get_rows(@sess, @t_token_embed, @t_token_ids)
  @t_x_embed = TinyNNCuda.tnn_add(@sess, t_embedded, @t_pos_slice)
  TinyNNCuda.tnn_set_output(@t_x_embed)

  # Through each block.
  t_cur = @t_x_embed
  eps   = 1.0e-5
  scale = 1.0 / Math.sqrt(d_head.to_f)
  li = 0
  while li < n_layers
    t_cur = build_block(t_cur, @blocks_ffi[li], eps, scale)
    li = li + 1
  end

  # Final RMSNorm on the post-blocks x.
  @t_x_final = TinyNNCuda.tnn_rms_norm(@sess, t_cur, @t_final_norm_gamma, eps)
  TinyNNCuda.tnn_set_output(@t_x_final)

  # Tied unembed: logits = mul_mat(token_embed, x_final)  ne=[vocab, T]
  @t_logits = TinyNNCuda.tnn_matmul(@sess, @t_token_embed, @t_x_final)
  TinyNNCuda.tnn_set_output(@t_logits)

  TinyNNCuda.tnn_realize(@sess, @t_logits)
  @realized = true
end
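
Because realize_for derives d_head by integer division, a d_model that is not a multiple of n_heads would leave the concatenated heads narrower than d_model, and the output projection would presumably fail on a shape mismatch. A caller could guard against that before building the graph. The helper below is hypothetical and not part of FullForwardFFICacheCuda:

```ruby
# Hypothetical pre-flight check for the arguments of realize_for.
# Returns the d_head that realize_for would compute.
def check_forward_dims(t_seq, d_model, d_ff, n_heads, n_layers, vocab_size)
  dims = { t_seq: t_seq, d_model: d_model, d_ff: d_ff,
           n_heads: n_heads, n_layers: n_layers, vocab_size: vocab_size }
  dims.each do |name, value|
    unless value.is_a?(Integer) && value > 0
      raise ArgumentError, "#{name} must be a positive Integer"
    end
  end
  unless (d_model % n_heads).zero?
    raise ArgumentError, "d_model (#{d_model}) must be divisible by n_heads (#{n_heads})"
  end
  d_model / n_heads
end
```

Requiring n_layers to be positive also matches the seed-then-push construction above, which always creates at least one block.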