Module: ToyGGUFFuser

Defined in: lib/toy/train/toy_gguf_fuse.rb

Overview

P2.6: head-fusing GGUF writer helper. ToyGGUFFuser converts a random_init Toy::LLM::Engine::LlamaSeqEngine (whose attention weights are named PER-HEAD: “blk.N.attn_q.head_H.weight”, each a contiguous [d_head, d_model] F32 tensor) into the FUSED llama.cpp naming (“blk.N.attn_q.weight”, a single [n_heads*d_head, d_model] tensor) that realize_for_mmap expects.

Why this is the identity layout (NOT a reorder):

Each per-head tensor is allocated tnn_input_2d_f32_persistent(sess,
rows=d_head, cols=d_model): a fully-contiguous ggml tensor ne0=d_model,
ne1=d_head, i.e. d_head*d_model contiguous f32 in storage-element order.
On reload, realize_for_mmap reads head h at q_off_base +
h*head_nbytes(F32) where head_nbytes == d_head*d_model*4, and rebuilds a
view ne=[d_model,d_head] at that address. So the fused tensor on disk
must be head-0's d_head*d_model f32 block, then head-1's, ... — which is
exactly a single contiguous tensor ne0=d_model, ne1=n_heads*d_head
(Ruby rows=n_heads*d_head, cols=d_model). No transpose, no reorder.
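
The identity-layout argument above can be checked on a toy example. The following is a minimal pure-Ruby sketch, not the module's code: the dimensions, the `read_head` helper, and the use of packed little-endian f32 strings as a stand-in for the on-disk tensor are all assumptions made for illustration.

```ruby
# Toy dimensions (hypothetical values, chosen only for illustration).
n_heads = 3
d_head  = 2
d_model = 4

# Per-head weights as flat row-major arrays of d_head*d_model values.
# Every value is a small integer, so it is exactly representable in f32.
heads = Array.new(n_heads) do |h|
  Array.new(d_head * d_model) { |e| (h * 100 + e).to_f }
end

# Stand-in for the on-disk image: each head packed as little-endian f32,
# heads laid back to back in head order.
fused_bytes = heads.map { |hd| hd.pack("e*") }.join

head_nbytes = d_head * d_model * 4 # bytes per head for F32

# Reload side: read head h at byte offset h*head_nbytes.
def read_head(fused_bytes, h, head_nbytes)
  fused_bytes.byteslice(h * head_nbytes, head_nbytes).unpack("e*")
end

# The same bytes read as ONE row-major [n_heads*d_head, d_model] matrix.
fused_rows = fused_bytes.unpack("e*").each_slice(d_model).to_a
```

Row `h*d_head` of `fused_rows` is the first row of head `h`, which is why no transpose or reorder is needed between the per-head and fused layouts.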

Lossless f32 round-trip: tnn_download_to_f64_array does dst = (double)f32_storage (exact f32->f64 widening); tnn_upload_from_float_array does scratch = (float)data (f64->f32 narrowing of an exactly-widened f32 returns the identical f32 bits). Both walk the LINEAR data buffer in storage-element order, so no transpose is introduced by the round-trip.
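
The round-trip claim can be illustrated without TinyNN. This is a sketch using Ruby's `Array#pack` and `String#unpack1` to emulate f32 storage; the helper names `to_f32` and `f32_bits` are hypothetical and are not part of the module.

```ruby
# Round a Ruby Float (f64) to the nearest f32 and return it widened to f64.
# pack("e") narrows to little-endian f32 bytes; unpack1("e") widens exactly.
def to_f32(x)
  [x].pack("e").unpack1("e")
end

# Raw 32-bit pattern of the f32 nearest to x.
def f32_bits(x)
  [x].pack("e").unpack1("V")
end

stored   = to_f32(0.1)    # the value as it sits in f32 storage, seen as f64
narrowed = to_f32(stored) # narrowing the exactly-widened value again
```

`stored` differs from the f64 literal `0.1` because 0.1 is not representable in f32, but narrowing `stored` again returns the same f32 bits, which is the property the download/upload pair relies on.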

F32-ONLY: this helper serialises the F32 params the random_init path produces. Q8 (the type-8 branch of head_nbytes) would need quantize-on-write, which the writer lacks, and is explicitly out of scope.

Spinel notes:

- No Struct.new (landmine #16); positional methods, no default args.
- The returned plist is built by pushing :ptr handles onto an array
  seeded `[TinyNN.tnn_null_ptr]; pop` — the same pattern ToyDriftGrad
  uses; Spinel infers sp_*_ptr_array. We do NOT construct an Array<:ptr>
  literal inside the module (landmine #1).
- tnn_tensor_set_name (:str) is only ever called at runtime against a
  passed session's finalized tensor, never at class-load scope
  (project_step_bind_landmine_2026_05_28).
- Uniquely-prefixed locals (tgf_*) to dodge type-inference collisions.

Class Method Details

.build_fused_into_write_session(src_cache, write_sess, untied) ⇒ Object

Allocate every FUSED-name tensor in `write_sess`, finalize the write session, then copy the F32 values across from `src_cache` (head-major concat for attention weights, verbatim for everything else). Returns the param-ordered Array<:ptr> of FUSED tensors living in `write_sess`, ready to hand to ToyGGUFWriter.write.

Args (no default args — Spinel):

src_cache  : a realized Toy::LLM::Engine::LlamaSeqEngine (random_init, F32).
write_sess : a fresh TinyNN.tnn_session_new(0); MUST stay alive
             until ToyGGUFWriter.write finalizes (gguf_add_tensor
             reads host data ptrs at finalize time).
untied     : true => emit "output.weight"; false => tied.

NOTE: src_cache.sess must ALSO stay alive across the whole call (we download from it after write_sess is finalized). Both sessions are held by the caller; we only read handles here.
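
As a reading aid, the order in which the returned array lists its tensors can be sketched as a list of names. The helper `fused_tensor_names` is hypothetical and is not part of the module; it only mirrors the names and push order visible in the method source.

```ruby
# Names of the fused tensors in the order the returned plist holds them:
# globals first, then nine tensors per block.
def fused_tensor_names(n_layers, untied)
  names = ["token_embd.weight", "output_norm.weight"]
  names << "output.weight" if untied
  per_block = %w[attn_norm ffn_norm attn_q attn_k attn_v
                 attn_output ffn_gate ffn_up ffn_down]
  n_layers.times do |li|
    per_block.each { |suffix| names << "blk.#{li}.#{suffix}.weight" }
  end
  names
end
```

For a two-layer untied model this gives 3 global names followed by 18 per-block names.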



# File 'lib/toy/train/toy_gguf_fuse.rb', line 56

def self.build_fused_into_write_session(src_cache, write_sess, untied)
  tgf_d_model  = src_cache.seq_d_model
  tgf_d_ff     = src_cache.seq_d_ff
  tgf_d_head   = src_cache.seq_d_head
  tgf_n_heads  = src_cache.seq_n_heads
  tgf_n_kv     = src_cache.seq_n_kv
  tgf_vocab    = src_cache.seq_vocab_size
  tgf_layers   = src_cache.seq_n_layers

  # --- Phase 1: ALLOCATE fused tensors in write_sess (pre-finalize) ---
  # Arch-level globals first (mirrors realize_for_random_init order).
  tgf_w_embed = TinyNN.tnn_input_2d_f32_persistent(write_sess, tgf_vocab, tgf_d_model)
  tgf_w_fnorm = TinyNN.tnn_input_1d_f32_persistent(write_sess, tgf_d_model)
  tgf_w_out   = TinyNN.tnn_null_ptr
  if untied
    tgf_w_out = TinyNN.tnn_input_2d_f32_persistent(write_sess, tgf_vocab, tgf_d_model)
  end

  # Per-block fused tensors. Q is [n_heads*d_head, d_model]; K/V are
  # [n_kv*d_head, d_model]; o/gate/up/down keep their full 2D shapes.
  tgf_blk_rn1  = [TinyNN.tnn_null_ptr]; tgf_blk_rn1.pop
  tgf_blk_rn2  = [TinyNN.tnn_null_ptr]; tgf_blk_rn2.pop
  tgf_blk_q    = [TinyNN.tnn_null_ptr]; tgf_blk_q.pop
  tgf_blk_k    = [TinyNN.tnn_null_ptr]; tgf_blk_k.pop
  tgf_blk_v    = [TinyNN.tnn_null_ptr]; tgf_blk_v.pop
  tgf_blk_o    = [TinyNN.tnn_null_ptr]; tgf_blk_o.pop
  tgf_blk_gate = [TinyNN.tnn_null_ptr]; tgf_blk_gate.pop
  tgf_blk_up   = [TinyNN.tnn_null_ptr]; tgf_blk_up.pop
  tgf_blk_down = [TinyNN.tnn_null_ptr]; tgf_blk_down.pop

  tgf_li = 0
  while tgf_li < tgf_layers
    tgf_blk_rn1.push(TinyNN.tnn_input_1d_f32_persistent(write_sess, tgf_d_model))
    tgf_blk_rn2.push(TinyNN.tnn_input_1d_f32_persistent(write_sess, tgf_d_model))
    tgf_blk_q.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                     tgf_n_heads * tgf_d_head, tgf_d_model))
    tgf_blk_k.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                     tgf_n_kv * tgf_d_head, tgf_d_model))
    tgf_blk_v.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                     tgf_n_kv * tgf_d_head, tgf_d_model))
    tgf_blk_o.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                     tgf_d_model, tgf_d_model))
    tgf_blk_gate.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                        tgf_d_ff, tgf_d_model))
    tgf_blk_up.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                      tgf_d_ff, tgf_d_model))
    tgf_blk_down.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                        tgf_d_model, tgf_d_ff))
    tgf_li = tgf_li + 1
  end

  TinyNN.tnn_finalize_weights(write_sess)

  # --- Phase 2: COPY values across + set FUSED names ---
  # Globals — verbatim element-for-element (same shape both sides).
  copy_verbatim(src_cache.sess, src_cache.t_seq_token_embed,
                write_sess, tgf_w_embed, tgf_vocab * tgf_d_model)
  TinyNN.tnn_tensor_set_name(tgf_w_embed, "token_embd.weight")

  copy_verbatim(src_cache.sess, src_cache.t_seq_final_norm_gamma,
                write_sess, tgf_w_fnorm, tgf_d_model)
  TinyNN.tnn_tensor_set_name(tgf_w_fnorm, "output_norm.weight")

  if untied
    copy_verbatim(src_cache.sess, src_cache.t_seq_output,
                  write_sess, tgf_w_out, tgf_vocab * tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_w_out, "output.weight")
  end

  tgf_li2 = 0
  while tgf_li2 < tgf_layers
    tgf_src_blk = src_cache.seq_blocks_ffi[tgf_li2]
    tgf_prefix  = "blk." + tgf_li2.to_s + "."

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_rn1_gamma,
                  write_sess, tgf_blk_rn1[tgf_li2], tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_rn1[tgf_li2], tgf_prefix + "attn_norm.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_rn2_gamma,
                  write_sess, tgf_blk_rn2[tgf_li2], tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_rn2[tgf_li2], tgf_prefix + "ffn_norm.weight")

    # Head-major concat: head h's d_head*d_model block lands at element
    # offset h*d_head*d_model == byte offset h*head_nbytes(F32) — exactly
    # the slice offset realize_for_mmap re-reads.
    copy_heads_concat(src_cache.sess, tgf_src_blk.t_seq_w_q, tgf_n_heads,
                      write_sess, tgf_blk_q[tgf_li2], tgf_d_head, tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_q[tgf_li2], tgf_prefix + "attn_q.weight")

    copy_heads_concat(src_cache.sess, tgf_src_blk.t_seq_w_k, tgf_n_kv,
                      write_sess, tgf_blk_k[tgf_li2], tgf_d_head, tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_k[tgf_li2], tgf_prefix + "attn_k.weight")

    copy_heads_concat(src_cache.sess, tgf_src_blk.t_seq_w_v, tgf_n_kv,
                      write_sess, tgf_blk_v[tgf_li2], tgf_d_head, tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_v[tgf_li2], tgf_prefix + "attn_v.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_w_o,
                  write_sess, tgf_blk_o[tgf_li2], tgf_d_model * tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_o[tgf_li2], tgf_prefix + "attn_output.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_w_gate,
                  write_sess, tgf_blk_gate[tgf_li2], tgf_d_ff * tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_gate[tgf_li2], tgf_prefix + "ffn_gate.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_w_up,
                  write_sess, tgf_blk_up[tgf_li2], tgf_d_ff * tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_up[tgf_li2], tgf_prefix + "ffn_up.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_w_down,
                  write_sess, tgf_blk_down[tgf_li2], tgf_d_model * tgf_d_ff)
    TinyNN.tnn_tensor_set_name(tgf_blk_down[tgf_li2], tgf_prefix + "ffn_down.weight")

    tgf_li2 = tgf_li2 + 1
  end

  # --- Phase 3: build the param-ordered plist (push, never literal) ---
  tgf_plist = [TinyNN.tnn_null_ptr]; tgf_plist.pop
  tgf_plist.push(tgf_w_embed)
  tgf_plist.push(tgf_w_fnorm)
  if untied
    tgf_plist.push(tgf_w_out)
  end
  tgf_li3 = 0
  while tgf_li3 < tgf_layers
    tgf_plist.push(tgf_blk_rn1[tgf_li3])
    tgf_plist.push(tgf_blk_rn2[tgf_li3])
    tgf_plist.push(tgf_blk_q[tgf_li3])
    tgf_plist.push(tgf_blk_k[tgf_li3])
    tgf_plist.push(tgf_blk_v[tgf_li3])
    tgf_plist.push(tgf_blk_o[tgf_li3])
    tgf_plist.push(tgf_blk_gate[tgf_li3])
    tgf_plist.push(tgf_blk_up[tgf_li3])
    tgf_plist.push(tgf_blk_down[tgf_li3])
    tgf_li3 = tgf_li3 + 1
  end
  tgf_plist
end

.build_lens_folded_into_write_session(src_cache, write_sess, untied) ⇒ Object

P4 — projection-lens variant of build_fused_into_write_session, for the from-scratch / warm-start RANDOM-INIT recipes that train under a projection lens (cfg.donor_d_in > 0). In that recipe the on-session token_embed is a FROZEN donor table [vocab, donor_d_in] and the TRAINABLE lens.proj.weight [donor_d_in, d_model] sits between get_rows and the first block (matmul(W_proj, embed) → d_model). The plain fuser would emit a [vocab, donor_d_in] embed + a lens.proj tensor that realize_for_mmap does not know how to load.

This method FOLDS the lens into the embedding at write time so the checkpoint is a STANDARD fused-llama GGUF (token_embd.weight is the already-projected [vocab, d_model] table, NO lens.proj). The fold is mathematically EXACT and matches the train-forward lens:

ggml matmul(W_proj, x) with W_proj ne=[donor, d_model] and
x=embed_donor ne=[donor, T] gives out[r,t] = sum_c W_proj[c,r]*embed[c,t]
(contraction on ne[0]=donor). Per-row v:
  embed_eff[v, r] = sum_c embed_donor[v, c] * W_proj[c, r]
In ggml storage order (ne0 = inner contiguous):
  embed_donor element [v*donor + c]   (ne0=donor, ne1=vocab)
  W_proj      element [r*donor + c]   (ne0=donor, ne1=d_model)
  embed_eff   element [v*d_model + r] (ne0=d_model, ne1=vocab)
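
A small worked instance of the fold, using the flat storage indices listed above. This is a sketch in plain Ruby with made-up sizes and values; `fold_lens` is a hypothetical name, and the real method works on Mat buffers with explicit while loops.

```ruby
# Fold a donor embedding through a projection, on flat ggml-style storage.
# embed_donor: vocab*donor values,   element [v*donor + c]
# proj:        d_model*donor values, element [r*donor + c]
# returns:     vocab*d_model values, element [v*d_model + r]
def fold_lens(embed_donor, proj, vocab, donor, d_model)
  eff = Array.new(vocab * d_model, 0.0)
  vocab.times do |v|
    d_model.times do |r|
      acc = 0.0
      donor.times { |c| acc += embed_donor[v * donor + c] * proj[r * donor + c] }
      eff[v * d_model + r] = acc
    end
  end
  eff
end

vocab, donor, d_model = 2, 3, 2
embed_donor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] # rows v=0 and v=1
proj        = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0] # rows r=0 and r=1
eff = fold_lens(embed_donor, proj, vocab, donor, d_model)
```

With these inputs, output column r=0 picks the first donor component of each vocab row and column r=1 sums the other two.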

Everything ELSE (per-block fused attention + FFN + norms + untied output) is byte-identical to build_fused_into_write_session — only the embed copy is replaced by the fold, and lens.proj is dropped.

Args (no default args — Spinel):

src_cache  : a realized Toy::LLM::Engine::LlamaSeqEngine, donor_d_in > 0, F32.
write_sess : fresh TinyNN.tnn_session_new(0); MUST stay alive until
             ToyGGUFWriter.write finalizes.
untied     : true => emit "output.weight" (required when donor>0).


# File 'lib/toy/train/toy_gguf_fuse.rb', line 226

def self.build_lens_folded_into_write_session(src_cache, write_sess, untied)
  tgf_d_model  = src_cache.seq_d_model
  tgf_d_ff     = src_cache.seq_d_ff
  tgf_d_head   = src_cache.seq_d_head
  tgf_n_heads  = src_cache.seq_n_heads
  tgf_n_kv     = src_cache.seq_n_kv
  tgf_vocab    = src_cache.seq_vocab_size
  tgf_layers   = src_cache.seq_n_layers
  tgf_donor    = src_cache.seq_donor_d_in

  # --- Fold the lens into an effective [vocab, d_model] embedding ---
  # Download the donor table (ne0=donor, ne1=vocab) and the lens
  # (ne0=donor, ne1=d_model), both f32->f64 (exact), linear storage.
  tgf_embed_n = tgf_vocab * tgf_donor
  tgf_proj_n  = tgf_d_model * tgf_donor
  tgf_embed_donor = Mat.new(1, tgf_embed_n)
  tgf_proj        = Mat.new(1, tgf_proj_n)
  TinyNN.tnn_download_to_f64_array(src_cache.sess, src_cache.t_seq_token_embed,
                                   tgf_embed_donor.flat, tgf_embed_n)
  TinyNN.tnn_download_to_f64_array(src_cache.sess, src_cache.t_seq_w_proj,
                                   tgf_proj.flat, tgf_proj_n)

  # embed_eff[v*d_model + r] = sum_c donor[v*donor+c] * proj[r*donor+c]
  tgf_eff_n  = tgf_vocab * tgf_d_model
  tgf_embed_eff = Mat.new(1, tgf_eff_n)
  tgf_v = 0
  while tgf_v < tgf_vocab
    tgf_vbase = tgf_v * tgf_donor
    tgf_obase = tgf_v * tgf_d_model
    tgf_r = 0
    while tgf_r < tgf_d_model
      tgf_rbase = tgf_r * tgf_donor
      tgf_acc = 0.0
      tgf_c = 0
      while tgf_c < tgf_donor
        tgf_acc = tgf_acc + tgf_embed_donor.flat[tgf_vbase + tgf_c] *
                            tgf_proj.flat[tgf_rbase + tgf_c]
        tgf_c = tgf_c + 1
      end
      tgf_embed_eff.flat[tgf_obase + tgf_r] = tgf_acc
      tgf_r = tgf_r + 1
    end
    tgf_v = tgf_v + 1
  end

  # --- Phase 1: ALLOCATE fused tensors in write_sess (pre-finalize) ---
  # token_embd is now the STANDARD [vocab, d_model] table — NO lens.
  tgf_w_embed = TinyNN.tnn_input_2d_f32_persistent(write_sess, tgf_vocab, tgf_d_model)
  tgf_w_fnorm = TinyNN.tnn_input_1d_f32_persistent(write_sess, tgf_d_model)
  tgf_w_out   = TinyNN.tnn_null_ptr
  if untied
    tgf_w_out = TinyNN.tnn_input_2d_f32_persistent(write_sess, tgf_vocab, tgf_d_model)
  end

  tgf_blk_rn1  = [TinyNN.tnn_null_ptr]; tgf_blk_rn1.pop
  tgf_blk_rn2  = [TinyNN.tnn_null_ptr]; tgf_blk_rn2.pop
  tgf_blk_q    = [TinyNN.tnn_null_ptr]; tgf_blk_q.pop
  tgf_blk_k    = [TinyNN.tnn_null_ptr]; tgf_blk_k.pop
  tgf_blk_v    = [TinyNN.tnn_null_ptr]; tgf_blk_v.pop
  tgf_blk_o    = [TinyNN.tnn_null_ptr]; tgf_blk_o.pop
  tgf_blk_gate = [TinyNN.tnn_null_ptr]; tgf_blk_gate.pop
  tgf_blk_up   = [TinyNN.tnn_null_ptr]; tgf_blk_up.pop
  tgf_blk_down = [TinyNN.tnn_null_ptr]; tgf_blk_down.pop

  tgf_li = 0
  while tgf_li < tgf_layers
    tgf_blk_rn1.push(TinyNN.tnn_input_1d_f32_persistent(write_sess, tgf_d_model))
    tgf_blk_rn2.push(TinyNN.tnn_input_1d_f32_persistent(write_sess, tgf_d_model))
    tgf_blk_q.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                     tgf_n_heads * tgf_d_head, tgf_d_model))
    tgf_blk_k.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                     tgf_n_kv * tgf_d_head, tgf_d_model))
    tgf_blk_v.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                     tgf_n_kv * tgf_d_head, tgf_d_model))
    tgf_blk_o.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                     tgf_d_model, tgf_d_model))
    tgf_blk_gate.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                        tgf_d_ff, tgf_d_model))
    tgf_blk_up.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                      tgf_d_ff, tgf_d_model))
    tgf_blk_down.push(TinyNN.tnn_input_2d_f32_persistent(write_sess,
                        tgf_d_model, tgf_d_ff))
    tgf_li = tgf_li + 1
  end

  TinyNN.tnn_finalize_weights(write_sess)

  # --- Phase 2: COPY values across + set FUSED names ---
  # token_embd is the FOLDED embed_eff (upload directly, NOT verbatim).
  TinyNN.tnn_upload_from_float_array(write_sess, tgf_w_embed,
                                     tgf_embed_eff.flat, tgf_eff_n)
  TinyNN.tnn_tensor_set_name(tgf_w_embed, "token_embd.weight")

  copy_verbatim(src_cache.sess, src_cache.t_seq_final_norm_gamma,
                write_sess, tgf_w_fnorm, tgf_d_model)
  TinyNN.tnn_tensor_set_name(tgf_w_fnorm, "output_norm.weight")

  if untied
    copy_verbatim(src_cache.sess, src_cache.t_seq_output,
                  write_sess, tgf_w_out, tgf_vocab * tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_w_out, "output.weight")
  end

  tgf_li2 = 0
  while tgf_li2 < tgf_layers
    tgf_src_blk = src_cache.seq_blocks_ffi[tgf_li2]
    tgf_prefix  = "blk." + tgf_li2.to_s + "."

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_rn1_gamma,
                  write_sess, tgf_blk_rn1[tgf_li2], tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_rn1[tgf_li2], tgf_prefix + "attn_norm.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_rn2_gamma,
                  write_sess, tgf_blk_rn2[tgf_li2], tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_rn2[tgf_li2], tgf_prefix + "ffn_norm.weight")

    copy_heads_concat(src_cache.sess, tgf_src_blk.t_seq_w_q, tgf_n_heads,
                      write_sess, tgf_blk_q[tgf_li2], tgf_d_head, tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_q[tgf_li2], tgf_prefix + "attn_q.weight")

    copy_heads_concat(src_cache.sess, tgf_src_blk.t_seq_w_k, tgf_n_kv,
                      write_sess, tgf_blk_k[tgf_li2], tgf_d_head, tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_k[tgf_li2], tgf_prefix + "attn_k.weight")

    copy_heads_concat(src_cache.sess, tgf_src_blk.t_seq_w_v, tgf_n_kv,
                      write_sess, tgf_blk_v[tgf_li2], tgf_d_head, tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_v[tgf_li2], tgf_prefix + "attn_v.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_w_o,
                  write_sess, tgf_blk_o[tgf_li2], tgf_d_model * tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_o[tgf_li2], tgf_prefix + "attn_output.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_w_gate,
                  write_sess, tgf_blk_gate[tgf_li2], tgf_d_ff * tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_gate[tgf_li2], tgf_prefix + "ffn_gate.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_w_up,
                  write_sess, tgf_blk_up[tgf_li2], tgf_d_ff * tgf_d_model)
    TinyNN.tnn_tensor_set_name(tgf_blk_up[tgf_li2], tgf_prefix + "ffn_up.weight")

    copy_verbatim(src_cache.sess, tgf_src_blk.t_seq_w_down,
                  write_sess, tgf_blk_down[tgf_li2], tgf_d_model * tgf_d_ff)
    TinyNN.tnn_tensor_set_name(tgf_blk_down[tgf_li2], tgf_prefix + "ffn_down.weight")

    tgf_li2 = tgf_li2 + 1
  end

  # --- Phase 3: build the param-ordered plist (push, never literal) ---
  tgf_plist = [TinyNN.tnn_null_ptr]; tgf_plist.pop
  tgf_plist.push(tgf_w_embed)
  tgf_plist.push(tgf_w_fnorm)
  if untied
    tgf_plist.push(tgf_w_out)
  end
  tgf_li3 = 0
  while tgf_li3 < tgf_layers
    tgf_plist.push(tgf_blk_rn1[tgf_li3])
    tgf_plist.push(tgf_blk_rn2[tgf_li3])
    tgf_plist.push(tgf_blk_q[tgf_li3])
    tgf_plist.push(tgf_blk_k[tgf_li3])
    tgf_plist.push(tgf_blk_v[tgf_li3])
    tgf_plist.push(tgf_blk_o[tgf_li3])
    tgf_plist.push(tgf_blk_gate[tgf_li3])
    tgf_plist.push(tgf_blk_up[tgf_li3])
    tgf_plist.push(tgf_blk_down[tgf_li3])
    tgf_li3 = tgf_li3 + 1
  end
  tgf_plist
end

.copy_heads_concat(src_sess, src_head_arr, n_heads, dst_sess, dst_t, d_head, d_model) ⇒ Object

Concatenate `n_heads` per-head [d_head, d_model] tensors (head order 0..n_heads-1) into one linear buffer, then upload into the fused dst tensor [n_heads*d_head, d_model]. Head h's d_head*d_model block lands at element offset h*d_head*d_model.
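
The write-side placement rule can be sketched on plain arrays. This is an illustrative analogue only: `concat_heads` is a hypothetical helper, and the length check is an addition that the method itself does not perform.

```ruby
# Place each per-head flat array at element offset h * d_head * d_model.
def concat_heads(head_arrays, d_head, d_model)
  per = d_head * d_model
  buf = Array.new(head_arrays.length * per, 0.0)
  head_arrays.each_with_index do |head, h|
    unless head.length == per
      raise ArgumentError, "head #{h} has #{head.length} elements, expected #{per}"
    end
    buf[h * per, per] = head # same-length range assignment keeps buf's size
  end
  buf
end
```

For example, two heads with d_head=1 and d_model=2 produce a four-element buffer with head 1 starting at element 2.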



# File 'lib/toy/train/toy_gguf_fuse.rb', line 410

def self.copy_heads_concat(src_sess, src_head_arr, n_heads, dst_sess, dst_t, d_head, d_model)
  tgf_per   = d_head * d_model
  tgf_total = n_heads * tgf_per
  tgf_buf   = Mat.new(1, tgf_total)
  tgf_tmp   = Mat.new(1, tgf_per)
  tgf_h = 0
  while tgf_h < n_heads
    TinyNN.tnn_download_to_f64_array(src_sess, src_head_arr[tgf_h], tgf_tmp.flat, tgf_per)
    tgf_base = tgf_h * tgf_per
    tgf_e = 0
    while tgf_e < tgf_per
      tgf_buf.flat[tgf_base + tgf_e] = tgf_tmp.flat[tgf_e]
      tgf_e = tgf_e + 1
    end
    tgf_h = tgf_h + 1
  end
  TinyNN.tnn_upload_from_float_array(dst_sess, dst_t, tgf_buf.flat, tgf_total)
end

.copy_verbatim(src_sess, src_t, dst_sess, dst_t, n) ⇒ Object

Download `n` f32 elements from the src tensor (f32->f64), then upload them into dst (f64->f32). Both walk linear storage order, so this is an exact element-for-element copy when src and dst have the same total element count.



# File 'lib/toy/train/toy_gguf_fuse.rb', line 400

def self.copy_verbatim(src_sess, src_t, dst_sess, dst_t, n)
  tgf_buf = Mat.new(1, n)
  TinyNN.tnn_download_to_f64_array(src_sess, src_t, tgf_buf.flat, n)
  TinyNN.tnn_upload_from_float_array(dst_sess, dst_t, tgf_buf.flat, n)
end