Module: GGUFLoad

Defined in:
lib/toy/io/loaders/toy_smollm2_loader.rb,
lib/toy/io/gguf_load.rb,
lib/toy/io/loaders/toy_gpt2_loader.rb

Overview

Read llama-family hyperparameters from a GGUF’s kv metadata. Mirrors GPT2ConfigLoader but for `llama.*` keys (set by convert_smollm2_to_gguf.py).

Defined Under Namespace

Classes: SmolLM2Flags

Class Method Details

.detect_smollm2_flags(path) ⇒ Object
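
The listing for this method computes the Gemma 2 embedding scale, sqrt(d_model), with a fixed number of Newton iterations instead of calling Math.sqrt. A minimal standalone sketch of that loop follows; `newton_sqrt` is a hypothetical helper name (the source inlines the loop), and the early return for non-positive input is an added guard, since the source only runs the loop when d_model > 0.

```ruby
# Fixed-iteration Newton (Heron) square root, mirroring the inline loop.
# `newton_sqrt` is a hypothetical name used only for this illustration.
def newton_sqrt(x, iters = 30)
  return 0.0 if x <= 0.0        # added guard; the source checks d_model > 0
  s = x > 1.0 ? x : 1.0         # start at or above the root
  i = 0
  while i < iters
    s = 0.5 * (s + x / s)       # average the guess with x / guess
    i = i + 1
  end
  s
end

# d_model = 2304 is just an example value.
embed_scale = newton_sqrt(2304.0)
puts "embed_scale ~ " + embed_scale.to_s
```

Each step roughly doubles the number of correct digits once the guess is close, so 30 iterations is far more than enough for realistic d_model values.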



# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 177

def self.detect_smollm2_flags(path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    return SmolLM2Flags.new(false, false, false, 0, false, 0, 0, 0)
  end
  # Gemma 2 ties embeddings (no separate output.weight), but the
  # convention varies. We detect tie via tensor presence, not arch.
  untied   = TinyNN.tnn_gguf_find_index(handle, "output.weight")       >= 0
  qkv_bias = TinyNN.tnn_gguf_find_index(handle, "blk.0.attn_q.bias")   >= 0
  # I-Gemma (#113): post-norm tensors. Their presence is the
  # sentinel for "Gemma 2-shaped block" even if the metadata arch
  # name varies. attn_q_norm-style models (Qwen3) don't have these.
  has_post_norms = TinyNN.tnn_gguf_find_index(handle, "blk.0.post_attention_norm.weight") >= 0
  # M1 + #110: QK-norm — presence of attn_q_norm tensors signals
  # "apply RMSNorm to Q,K before RoPE". The gamma shape distinguishes
  # the two known dialects:
  #   ne[0] == d_head  → Qwen3-style (shared per-head gamma, applied
  #                      after the head split; equivalent across heads).
  #   ne[0] == d_model → OLMoE / Granite-style (full-Q gamma, applied
  #                      to the concatenated d_model Q vector BEFORE
  #                      the head split; variance is over d_model dims).
  # These are mathematically distinct: in the full-Q form, RMSNorm
  # variance pools across all heads, so per-head behavior differs.
  qn_idx   = TinyNN.tnn_gguf_find_index(handle, "blk.0.attn_q_norm.weight")
  qk_norm  = qn_idx >= 0
  qk_norm_kind = 0
  if qk_norm
    gamma_ne0 = TinyNN.tnn_gguf_tensor_ne(handle, qn_idx, 0)
    # Probe d_model and the head count to derive d_head. Multi-arch
    # prefix logic — try each known arch in order.
    ap = "llama"
    if TinyNN.tnn_gguf_get_u32(handle, "llama.embedding_length") < 0
      if TinyNN.tnn_gguf_get_u32(handle, "olmoe.embedding_length") >= 0
        ap = "olmoe"
      elsif TinyNN.tnn_gguf_get_u32(handle, "gemma2.embedding_length") >= 0
        ap = "gemma2"
      end
    end
    d_model_v = TinyNN.tnn_gguf_get_u32(handle, ap + ".embedding_length")
    n_heads_v = TinyNN.tnn_gguf_get_u32(handle, ap + ".attention.head_count")
    head_dim  = TinyNN.tnn_gguf_get_u32(handle, ap + ".attention.key_length")
    if head_dim <= 0 && n_heads_v > 0
      head_dim = d_model_v / n_heads_v
    end
    if gamma_ne0 == head_dim
      qk_norm_kind = 1   # per-head shared
    elsif gamma_ne0 == d_model_v
      qk_norm_kind = 2   # full-Q
    else
      # Unknown shape; warn loudly and default to per-head shared.
      # If this fires the model output will be wrong.
      puts "WARN: blk.0.attn_q_norm.weight has ne[0]=" + gamma_ne0.to_s +
           " (expected d_head=" + head_dim.to_s + " or d_model=" +
           d_model_v.to_s + "). Defaulting to per-head shared."
      qk_norm_kind = 1
    end
  end
  # M3 + I-Gemma: sliding-window attention. llama.cpp emits the
  # window size as `<arch>.attention.sliding_window`. Treat -1 /
  # missing as 0. Try each known arch prefix.
  sw = TinyNN.tnn_gguf_get_u32(handle, "llama.attention.sliding_window")
  if sw < 0
    sw = TinyNN.tnn_gguf_get_u32(handle, "olmoe.attention.sliding_window")
  end
  if sw < 0
    sw = TinyNN.tnn_gguf_get_u32(handle, "gemma2.attention.sliding_window")
  end
  if sw < 0; sw = 0; end
  # I-Gemma: Gemma 2 applies SWA on alternating layers (the
  # `sliding_window_pattern=2` HF config; layers alternate between
  # full attention and sliding). llama.cpp encodes this implicitly
  # by setting attention.sliding_window AND using the gemma2 arch
  # prefix — there's no metadata key for the pattern itself, it's
  # inferred from `general.architecture == "gemma2"`.
  swa_alternates = false
  arch_name      = TinyNN.tnn_gguf_get_str(handle, "general.architecture")
  if arch_name == "gemma2" && sw > 0
    swa_alternates = true
  end
  # I-Gemma: soft-cap parameters for attention logits and the final
  # output logits. Read as f32; default 0.0 (no softcap).
  attn_softcap  = TinyNN.tnn_gguf_get_f32(handle, "gemma2.attn_logit_softcapping")
  final_softcap = TinyNN.tnn_gguf_get_f32(handle, "gemma2.final_logit_softcapping")
  if attn_softcap  <  0.0; attn_softcap  = 0.0; end
  if final_softcap <  0.0; final_softcap = 0.0; end
  # I-Gemma: embedding scale. Gemma 2 multiplies token embeddings
  # by sqrt(d_model) post-lookup. Other archs use 1.0.
  embed_scale = 1.0
  if arch_name == "gemma2"
    d_model_g = TinyNN.tnn_gguf_get_u32(handle, "gemma2.embedding_length")
    if d_model_g > 0
      # Newton sqrt avoids the Math.sqrt poly-dispatch landmine.
      x = d_model_g.to_f
      s = x > 1.0 ? x : 1.0
      ni = 0
      while ni < 30
        s = 0.5 * (s + x / s)
        ni = ni + 1
      end
      embed_scale = s
    end
  end
  # M2.3: MoE detection. Presence of ffn_gate_inp.weight on layer 0
  # is the sentinel. n_experts / n_experts_used live in <arch>.*
  # metadata keys; we try llama.* then fall back to olmoe.* (and
  # any future arch the same way). We don't *need* to know the arch
  # name itself — just the values.
  is_moe = TinyNN.tnn_gguf_find_index(handle, "blk.0.ffn_gate_inp.weight") >= 0
  n_experts      = 0
  n_experts_used = 0
  if is_moe
    ne_v = TinyNN.tnn_gguf_get_u32(handle, "llama.expert_count")
    nu_v = TinyNN.tnn_gguf_get_u32(handle, "llama.expert_used_count")
    if ne_v < 0
      ne_v = TinyNN.tnn_gguf_get_u32(handle, "olmoe.expert_count")
      nu_v = TinyNN.tnn_gguf_get_u32(handle, "olmoe.expert_used_count")
    end
    n_experts      = ne_v > 0 ? ne_v : 0
    n_experts_used = nu_v > 0 ? nu_v : 0
  end
  TinyNN.tnn_gguf_free(handle)
  SmolLM2Flags.new(untied, qkv_bias, qk_norm, sw,
                   is_moe, n_experts, n_experts_used, qk_norm_kind,
                   has_post_norms, embed_scale,
                   attn_softcap, final_softcap, swa_alternates)
end
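
The comments in this method distinguish two QK-norm dialects by the shape of the gamma tensor. The following is a minimal plain-Ruby sketch, not the project's kernel, of why the two are not interchangeable. The helper names (`rms_norm`, `qk_norm_per_head`, `qk_norm_full_q`) and the `eps` value are assumptions for illustration, and gamma is taken as all ones so that only the variance pooling differs.

```ruby
# RMSNorm with unit gamma: x / sqrt(mean(x^2) + eps).
def rms_norm(xs, eps = 1e-6)
  mean_sq = xs.sum { |v| v * v } / xs.length
  scale = 1.0 / Math.sqrt(mean_sq + eps)
  xs.map { |v| v * scale }
end

# Per-head dialect (qk_norm_kind = 1): normalize each d_head slice on its own.
def qk_norm_per_head(q, d_head)
  q.each_slice(d_head).flat_map { |head| rms_norm(head) }
end

# Full-Q dialect (qk_norm_kind = 2): one variance over the whole d_model vector.
def qk_norm_full_q(q)
  rms_norm(q)
end

q = [1.0, 1.0, 10.0, 10.0]   # two heads with d_head = 2 and very different scale
puts "per-head: " + qk_norm_per_head(q, 2).inspect
puts "full-Q:   " + qk_norm_full_q(q).inspect
```

In the per-head form both heads come out with roughly unit RMS. In the full-Q form the large head dominates the pooled variance, so the 10:1 ratio between the heads survives normalization.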

.detect_weight_type(path) ⇒ Object

Detect the GGUF’s 2D linear weight type. Peeks at blk.0.attn_q.weight (always present for llama-family models) and returns its ggml type integer (0=F32, 8=Q8_0); returns 0 if the file cannot be opened or the tensor is missing. Callers should pass this to kv.set_weight_type before kv.realize_for to enable the Q8-stays-Q8 path.



# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 614

def self.detect_weight_type(path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    return 0
  end
  idx = TinyNN.tnn_gguf_find_index(handle, "blk.0.attn_q.weight")
  t   = if idx >= 0
          TinyNN.tnn_gguf_tensor_type(handle, idx)
        else
          0
        end
  TinyNN.tnn_gguf_free(handle)
  t
end
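
As a hedged illustration of how a caller might act on the returned integer, the sketch below names the two type codes mentioned in the description and mirrors the `weight_type != 0` test that load_kv_cache_directly_native uses to pick its verbatim copy path. `weight_type_name` and `use_verbatim?` are hypothetical helpers, not part of this module.

```ruby
# Human-readable label for the two ggml type codes this page mentions.
def weight_type_name(t)
  case t
  when 0 then "F32"
  when 8 then "Q8_0"
  else "other(" + t.to_s + ")"
  end
end

# Mirrors the native loader's check: any non-F32 type takes the
# type-preserving (verbatim) copy path.
def use_verbatim?(weight_type)
  weight_type != 0
end

[0, 8].each do |t|
  puts weight_type_name(t) + " -> verbatim copy: " + use_verbatim?(t).to_s
end
```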

.find_index(handle, name, n_tensors) ⇒ Object

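Linear-scan tensor lookup: returns the index of the tensor called name, or -1 if none matches. At roughly 100 tensors × ~50 lookups that is about 5000 string compares per load, which is cheap enough. A hash map would force Spinel into a polymorphic value type, so it is not worth it.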



# File 'lib/toy/io/gguf_load.rb', line 63

def self.find_index(handle, name, n_tensors)
  i = 0
  while i < n_tensors
    if TinyNN.tnn_gguf_tensor_name(handle, i) == name
      return i
    end
    i = i + 1
  end
  -1
end
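
A minimal self-contained sketch of the same linear scan, with a plain Ruby array standing in for the FFI tensor-name lookup. `find_index_in` and the sample names are assumptions for illustration; the real method reads each name through TinyNN.tnn_gguf_tensor_name.

```ruby
# Linear scan over an array of tensor names; -1 means "not present".
def find_index_in(names, name)
  i = 0
  while i < names.length
    return i if names[i] == name
    i = i + 1
  end
  -1
end

names = ["token_embd.weight", "output_norm.weight", "blk.0.attn_q.weight"]
puts find_index_in(names, "output_norm.weight")   # index of the match
puts find_index_in(names, "output.weight")        # absent tensor
```

Callers treat a negative result as "tensor absent", which is how load_toy_smollm2 detects an untied output projection and Q/K/V biases.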

.load_gpt2(model, path) ⇒ Object

Load a distilgpt2-shaped GGUF (also fits gpt2-small/medium/large) into a caller-constructed GPT2LM. Returns true on success, false if the file cannot be opened.



# File 'lib/toy/io/gguf_load.rb', line 274

def self.load_gpt2(model, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " (" + n_tensors.to_s + " tensors)"

  d_model = model.d_model
  d_head  = model.d_head
  n_heads = model.n_heads

  # Globals
  read_mat(handle,   "token_embd.weight",    model.token_embed, n_tensors)
  read_mat(handle,   "position_embd.weight", model.pos_embed,   n_tensors)
  read_array(handle, "output_norm.weight",   model.ln_f_gamma,  n_tensors)
  read_array(handle, "output_norm.bias",     model.ln_f_beta,   n_tensors)

  # Per-block
  li = 0
  while li < model.n_layers
    blk    = model.gpt2_blocks[li]
    prefix = "blk." + li.to_s

    read_array(handle, prefix + ".attn_norm.weight", blk.ln1_gamma, n_tensors)
    read_array(handle, prefix + ".attn_norm.bias",   blk.ln1_beta,  n_tensors)
    read_array(handle, prefix + ".ffn_norm.weight",  blk.ln2_gamma, n_tensors)
    read_array(handle, prefix + ".ffn_norm.bias",    blk.ln2_beta,  n_tensors)

    read_split_heads_weight(handle, prefix + ".attn_q.weight",
                             blk.w_q, n_heads, d_model, d_head, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_k.weight",
                             blk.w_k, n_heads, d_model, d_head, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_v.weight",
                             blk.w_v, n_heads, d_model, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_q.bias",
                           blk.b_q, n_heads, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_k.bias",
                           blk.b_k, n_heads, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_v.bias",
                           blk.b_v, n_heads, d_head, n_tensors)

    read_mat(handle,   prefix + ".attn_output.weight", blk.w_o, n_tensors)
    read_array(handle, prefix + ".attn_output.bias",   blk.b_o, n_tensors)

    read_mat(handle,   prefix + ".ffn_up.weight",   blk.w_ff1, n_tensors)
    read_array(handle, prefix + ".ffn_up.bias",     blk.b_ff1, n_tensors)
    read_mat(handle,   prefix + ".ffn_down.weight", blk.w_ff2, n_tensors)
    read_array(handle, prefix + ".ffn_down.bias",   blk.b_ff2, n_tensors)

    li = li + 1
  end

  TinyNN.tnn_gguf_free(handle)
  true
end

.load_kv_cache_auto(kv_cache, path) ⇒ Object

Auto-dispatcher: peek at the toy.ggml_native metadata key and pick the matching loader. Keeps callers ignorant of the file layout.



# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 631

def self.load_kv_cache_auto(kv_cache, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  is_native = TinyNN.tnn_gguf_get_bool(handle, "toy.ggml_native") == 1
  TinyNN.tnn_gguf_free(handle)
  if is_native
    load_kv_cache_directly_native(kv_cache, path)
  else
    load_kv_cache_directly(kv_cache, path)
  end
end

.load_kv_cache_directly(kv_cache, path) ⇒ Object

Inference-only loader: stream GGUF weights directly into the FFI persistent buffers, skipping the Ruby Float64 Mat allocation. That is about 4 bytes per weight instead of the Mat-mediated path's 12, a saving that 7B-class models require.

The kv_cache MUST already be realized via realize_for. We do not construct Toy::SmolLM2 at all; callers that need `describe` / `algorithm_card` should still use the Mat-mediated path on a 1×1-stub config.
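
To make the memory figures concrete, here is a rough back-of-envelope sketch. It assumes the 12 bytes per weight on the Mat-mediated path is an 8-byte Float64 in the Ruby Mat plus a 4-byte F32 copy in the backend buffer (the source gives only the totals), and it uses a round 7 billion parameters as the example size. `weight_bytes` is a hypothetical helper.

```ruby
# Total bytes needed to hold n_params weights at a given per-weight cost.
def weight_bytes(n_params, bytes_per_weight)
  n_params * bytes_per_weight
end

n_params  = 7_000_000_000                          # example: a 7B-class model
direct_gb = weight_bytes(n_params, 4)  / 1.0e9     # direct FFI path
mat_gb    = weight_bytes(n_params, 12) / 1.0e9     # Mat-mediated path
puts "direct: " + direct_gb.to_s + " GB, Mat-mediated: " + mat_gb.to_s + " GB"
```

Under these assumptions the direct path needs about 28 GB for F32 weights against about 84 GB for the Mat-mediated one, which is the gap the description refers to.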



# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 312

def self.load_kv_cache_directly(kv_cache, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " → FFI direct (" + n_tensors.to_s + " tensors)"

  sess     = kv_cache.sess
  n_heads  = kv_cache.n_heads
  n_kv     = kv_cache.n_kv
  d_model  = kv_cache.d_model
  d_head   = kv_cache.d_head
  d_ff     = kv_cache.d_ff

  # --- Globals -----
  embed_idx = TinyNN.tnn_gguf_find_index(handle, "token_embd.weight")
  TinyNN.tnn_gguf_copy_to_persistent(handle, embed_idx,
                                      sess, kv_cache.t_token_embed)

  fn_idx = TinyNN.tnn_gguf_find_index(handle, "output_norm.weight")
  TinyNN.tnn_gguf_copy_1d_to_persistent(handle, fn_idx,
                                         sess, kv_cache.t_final_norm_gamma)

  if kv_cache.has_untied_output
    out_idx = TinyNN.tnn_gguf_find_index(handle, "output.weight")
    TinyNN.tnn_gguf_copy_to_persistent(handle, out_idx,
                                        sess, kv_cache.t_output)
  end

  # --- Per-block -----
  li = 0
  while li < kv_cache.n_layers
    blk_f  = kv_cache.kv_blocks_ffi[li]
    prefix = "blk." + li.to_s

    rn1_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_norm.weight")
    rn2_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_norm.weight")
    TinyNN.tnn_gguf_copy_1d_to_persistent(handle, rn1_idx, sess, blk_f.t_rn1_gamma)
    TinyNN.tnn_gguf_copy_1d_to_persistent(handle, rn2_idx, sess, blk_f.t_rn2_gamma)

    # Q (n_heads per-head slices of attn_q.weight)
    q_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_q.weight")
    hq = 0
    while hq < n_heads
      TinyNN.tnn_gguf_copy_head_slice_to_persistent(handle, q_idx, sess,
                                                     blk_f.t_w_q[hq],
                                                     hq, n_heads, d_model, d_head)
      hq = hq + 1
    end

    # K, V (n_kv per-head slices each)
    k_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_k.weight")
    v_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_v.weight")
    hkv = 0
    while hkv < n_kv
      TinyNN.tnn_gguf_copy_head_slice_to_persistent(handle, k_idx, sess,
                                                     blk_f.t_w_k[hkv],
                                                     hkv, n_kv, d_model, d_head)
      TinyNN.tnn_gguf_copy_head_slice_to_persistent(handle, v_idx, sess,
                                                     blk_f.t_w_v[hkv],
                                                     hkv, n_kv, d_model, d_head)
      hkv = hkv + 1
    end

    # Optional Q/K/V biases (Qwen2.x)
    if kv_cache.has_qkv_bias
      qb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_q.bias")
      kb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_k.bias")
      vb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_v.bias")
      hq = 0
      while hq < n_heads
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, qb_idx, sess,
                                                            blk_f.t_b_q[hq], hq, d_head)
        hq = hq + 1
      end
      hkv = 0
      while hkv < n_kv
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, kb_idx, sess,
                                                            blk_f.t_b_k[hkv], hkv, d_head)
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, vb_idx, sess,
                                                            blk_f.t_b_v[hkv], hkv, d_head)
        hkv = hkv + 1
      end
    end

    # O (attn_output.weight) — single transposed
    o_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_output.weight")
    TinyNN.tnn_gguf_copy_transposed_to_persistent(handle, o_idx, sess,
                                                   blk_f.t_w_o, d_model, d_model)

    # FFN — gate, up, down (each single transposed)
    gate_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_gate.weight")
    up_idx   = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_up.weight")
    down_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_down.weight")
    TinyNN.tnn_gguf_copy_transposed_to_persistent(handle, gate_idx, sess,
                                                   blk_f.t_w_gate, d_model, d_ff)
    TinyNN.tnn_gguf_copy_transposed_to_persistent(handle, up_idx,   sess,
                                                   blk_f.t_w_up,   d_model, d_ff)
    TinyNN.tnn_gguf_copy_transposed_to_persistent(handle, down_idx, sess,
                                                   blk_f.t_w_down, d_ff, d_model)

    li = li + 1
  end

  # Zero-init K/V buffers (matches the Mat-mediated path's kv_zero_*
  # uploads — without this the persistent K/V tensors contain
  # garbage from the backend's initial allocation).
  # P5.2: K and V share layout ne=[d_head, max_T] now, so the
  # zero-init Mat is shared too. Same Q8 skip rule for both.
  kv_zero = Mat.new(kv_cache.max_T, d_head)
  li = 0
  while li < kv_cache.n_layers
    blk_f = kv_cache.kv_blocks_ffi[li]
    hkv = 0
    while hkv < n_kv
      if kv_cache.kv_type_k != 8
        TinyNN.upload_row_major(sess, blk_f.t_K[hkv], kv_zero)
      end
      if kv_cache.kv_type_v != 8
        TinyNN.upload_row_major(sess, blk_f.t_V[hkv], kv_zero)
      end
      hkv = hkv + 1
    end
    li = li + 1
  end

  TinyNN.tnn_gguf_free(handle)
  true
end

.load_kv_cache_directly_native(kv_cache, path) ⇒ Object

Native-layout direct loader. Same shape as load_kv_cache_directly, but the source GGUF was written with --ggml-native: 2D linear weights are stored in HF-native [out, in] row-major, which already matches ggml’s column-major ne=[in, out] byte order. All transposes are gone; per-head Q/K/V slices are contiguous byte ranges.

See [[project_mmap_phase1_2026_05_18]] / docs/memory-design.md for the rationale.
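
The layout claim can be checked with a small sketch. It assumes ggml's convention that ne[0] is the fastest-varying dimension, and that each head owns d_head consecutive output rows of the HF matrix. The helper names (`row_major_offset`, `ggml_offset`, `head_slice_range`) are hypothetical, and offsets are counted in elements rather than bytes.

```ruby
# Offset of element (o, i) in an HF-style [out, in] row-major matrix.
def row_major_offset(o, i, n_in)
  o * n_in + i
end

# Offset of element (i0, i1) in a ggml tensor with ne = [n_in, n_out],
# where index 0 varies fastest.
def ggml_offset(i0, i1, n_in)
  i0 + i1 * n_in
end

# Element range holding head h when each head owns d_head consecutive
# output rows: one contiguous block, so no transpose or gather is needed.
def head_slice_range(h, d_head, n_in)
  first = h * d_head * n_in
  first...(first + d_head * n_in)
end

n_in = 3
same = (0...4).all? do |o|
  (0...n_in).all? { |i| row_major_offset(o, i, n_in) == ggml_offset(i, o, n_in) }
end
puts "layouts agree: " + same.to_s
puts "head 1 elements: " + head_slice_range(1, 2, n_in).to_s
```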



# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 452

def self.load_kv_cache_directly_native(kv_cache, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " → FFI direct (native, " + n_tensors.to_s + " tensors)"

  sess     = kv_cache.sess
  n_heads  = kv_cache.n_heads
  n_kv     = kv_cache.n_kv
  d_model  = kv_cache.d_model
  d_head   = kv_cache.d_head
  d_ff     = kv_cache.d_ff

  # Globals (token_embd, output_norm, optional untied output) — these
  # were already non-transposed even in the old converter; loader is
  # identical to the legacy path.
  embed_idx = TinyNN.tnn_gguf_find_index(handle, "token_embd.weight")
  TinyNN.tnn_gguf_copy_to_persistent(handle, embed_idx,
                                      sess, kv_cache.t_token_embed)

  fn_idx = TinyNN.tnn_gguf_find_index(handle, "output_norm.weight")
  TinyNN.tnn_gguf_copy_1d_to_persistent(handle, fn_idx,
                                         sess, kv_cache.t_final_norm_gamma)

  if kv_cache.has_untied_output
    out_idx = TinyNN.tnn_gguf_find_index(handle, "output.weight")
    if kv_cache.weight_type != 0
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, out_idx,
                                                   sess, kv_cache.t_output)
    else
      TinyNN.tnn_gguf_copy_to_persistent(handle, out_idx,
                                          sess, kv_cache.t_output)
    end
  end

  li = 0
  while li < kv_cache.n_layers
    blk_f  = kv_cache.kv_blocks_ffi[li]
    prefix = "blk." + li.to_s

    rn1_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_norm.weight")
    rn2_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_norm.weight")
    TinyNN.tnn_gguf_copy_1d_to_persistent(handle, rn1_idx, sess, blk_f.t_rn1_gamma)
    TinyNN.tnn_gguf_copy_1d_to_persistent(handle, rn2_idx, sess, blk_f.t_rn2_gamma)

    # Per-head Q/K/V — native layout: contiguous byte range. When the
    # cache is in Q8 mode (Phase 3) we use the verbatim head-slice
    # helper, which is type-agnostic and just memcpys the right
    # contiguous range. For F32 mode the f32 helper does the same
    # plus a dequant fallback (in case the GGUF is Q8 but the cache
    # is F32 — old behavior).
    use_verbatim = kv_cache.weight_type != 0
    q_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_q.weight")
    hq = 0
    while hq < n_heads
      if use_verbatim
        TinyNN.tnn_gguf_copy_verbatim_head_slice_to_persistent(handle, q_idx, sess,
                                                                blk_f.t_w_q[hq],
                                                                hq, n_heads)
      else
        TinyNN.tnn_gguf_copy_head_slice_to_persistent_native(handle, q_idx, sess,
                                                              blk_f.t_w_q[hq],
                                                              hq, n_heads, d_model, d_head)
      end
      hq = hq + 1
    end

    k_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_k.weight")
    v_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_v.weight")
    hkv = 0
    while hkv < n_kv
      if use_verbatim
        TinyNN.tnn_gguf_copy_verbatim_head_slice_to_persistent(handle, k_idx, sess,
                                                                blk_f.t_w_k[hkv], hkv, n_kv)
        TinyNN.tnn_gguf_copy_verbatim_head_slice_to_persistent(handle, v_idx, sess,
                                                                blk_f.t_w_v[hkv], hkv, n_kv)
      else
        TinyNN.tnn_gguf_copy_head_slice_to_persistent_native(handle, k_idx, sess,
                                                              blk_f.t_w_k[hkv],
                                                              hkv, n_kv, d_model, d_head)
        TinyNN.tnn_gguf_copy_head_slice_to_persistent_native(handle, v_idx, sess,
                                                              blk_f.t_w_v[hkv],
                                                              hkv, n_kv, d_model, d_head)
      end
      hkv = hkv + 1
    end

    # Q/K/V biases: 1-D, identical loader (biases were already
    # take()'d untransposed even in the legacy converter).
    if kv_cache.has_qkv_bias
      qb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_q.bias")
      kb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_k.bias")
      vb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_v.bias")
      hq = 0
      while hq < n_heads
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, qb_idx, sess,
                                                            blk_f.t_b_q[hq], hq, d_head)
        hq = hq + 1
      end
      hkv = 0
      while hkv < n_kv
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, kb_idx, sess,
                                                            blk_f.t_b_k[hkv], hkv, d_head)
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, vb_idx, sess,
                                                            blk_f.t_b_v[hkv], hkv, d_head)
        hkv = hkv + 1
      end
    end

    # O / FFN gate / up / down — native: plain memcpy. Q8 mode
    # uses the verbatim primitive (same shape; type-preserving).
    o_idx    = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_output.weight")
    gate_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_gate.weight")
    up_idx   = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_up.weight")
    down_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_down.weight")
    if use_verbatim
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, o_idx,    sess, blk_f.t_w_o)
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, gate_idx, sess, blk_f.t_w_gate)
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, up_idx,   sess, blk_f.t_w_up)
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, down_idx, sess, blk_f.t_w_down)
    else
      TinyNN.tnn_gguf_copy_to_persistent(handle, o_idx,    sess, blk_f.t_w_o)
      TinyNN.tnn_gguf_copy_to_persistent(handle, gate_idx, sess, blk_f.t_w_gate)
      TinyNN.tnn_gguf_copy_to_persistent(handle, up_idx,   sess, blk_f.t_w_up)
      TinyNN.tnn_gguf_copy_to_persistent(handle, down_idx, sess, blk_f.t_w_down)
    end

    li = li + 1
  end

  # Zero-init K/V buffers (same as the legacy path).
  # P5.2: K and V share layout ne=[d_head, max_T] now, so the
  # zero-init Mat is shared too. Same Q8 skip rule for both.
  kv_zero = Mat.new(kv_cache.max_T, d_head)
  li = 0
  while li < kv_cache.n_layers
    blk_f = kv_cache.kv_blocks_ffi[li]
    hkv = 0
    while hkv < n_kv
      if kv_cache.kv_type_k != 8
        TinyNN.upload_row_major(sess, blk_f.t_K[hkv], kv_zero)
      end
      if kv_cache.kv_type_v != 8
        TinyNN.upload_row_major(sess, blk_f.t_V[hkv], kv_zero)
      end
      hkv = hkv + 1
    end
    li = li + 1
  end

  TinyNN.tnn_gguf_free(handle)
  true
end

.load_toy_gpt2(model, path) ⇒ Object

Same GGUF layout, loaded into a Toy::GPT2. The weights live under sub-modules now (`blk.attn.w_q`, `blk.ln1.gamma`, …), so this mirrors load_gpt2 with the new path expressions.



# File 'lib/toy/io/loaders/toy_gpt2_loader.rb', line 13

def self.load_toy_gpt2(model, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " (" + n_tensors.to_s + " tensors)"

  cfg     = model.cfg
  d_model = cfg.d_model
  n_heads = cfg.n_heads
  d_head  = d_model / n_heads

  read_mat(handle,   "token_embd.weight",    model.token_embed.weight, n_tensors)
  read_mat(handle,   "position_embd.weight", model.pos_embed.weight,   n_tensors)
  read_array(handle, "output_norm.weight",   model.final_norm.gamma,   n_tensors)
  read_array(handle, "output_norm.bias",     model.final_norm.beta,    n_tensors)

  li = 0
  while li < cfg.n_layers
    blk   = model.stack[li]
    prefix = "blk." + li.to_s

    read_array(handle, prefix + ".attn_norm.weight", blk.ln1.gamma, n_tensors)
    read_array(handle, prefix + ".attn_norm.bias",   blk.ln1.beta,  n_tensors)
    read_array(handle, prefix + ".ffn_norm.weight",  blk.ln2.gamma, n_tensors)
    read_array(handle, prefix + ".ffn_norm.bias",    blk.ln2.beta,  n_tensors)

    read_split_heads_weight(handle, prefix + ".attn_q.weight",
                             blk.attn.w_q, n_heads, d_model, d_head, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_k.weight",
                             blk.attn.w_k, n_heads, d_model, d_head, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_v.weight",
                             blk.attn.w_v, n_heads, d_model, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_q.bias",
                           blk.attn.b_q, n_heads, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_k.bias",
                           blk.attn.b_k, n_heads, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_v.bias",
                           blk.attn.b_v, n_heads, d_head, n_tensors)

    read_mat(handle,   prefix + ".attn_output.weight", blk.attn.w_o, n_tensors)
    read_array(handle, prefix + ".attn_output.bias",   blk.attn.b_o, n_tensors)

    read_mat(handle,   prefix + ".ffn_up.weight",   blk.ffn.w1, n_tensors)
    read_array(handle, prefix + ".ffn_up.bias",     blk.ffn.b1, n_tensors)
    read_mat(handle,   prefix + ".ffn_down.weight", blk.ffn.w2, n_tensors)
    read_array(handle, prefix + ".ffn_down.bias",   blk.ffn.b2, n_tensors)

    li = li + 1
  end

  TinyNN.tnn_gguf_free(handle)
  true
end

.load_toy_smollm2(model, path) ⇒ Object

Llama-family weight load into a Toy::SmolLM2.

Tensor name conventions match prep/convert_smollm2_to_gguf.py. The converter has already transposed every nn.Linear weight from HF’s [out, in] to our [in, out] orientation.
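
As an illustration of the fused [in, out] layout described here, the sketch below splits such a matrix into per-head column blocks. It is an assumption about what the split-heads readers do, not their actual code (which is not shown on this page); `split_heads` is a hypothetical name and nested arrays stand in for Mat.

```ruby
# fused: d_model rows, each with n * d_head columns ([in, out] orientation).
# Returns n matrices of shape [d_model, d_head], one per head.
def split_heads(fused, n, d_head)
  heads = []
  h = 0
  while h < n
    heads << fused.map { |row| row[h * d_head, d_head] }
    h = h + 1
  end
  heads
end

fused = [[1, 2, 3, 4],
         [5, 6, 7, 8]]            # d_model = 2, two heads with d_head = 2
split_heads(fused, 2, 2).each_with_index do |m, h|
  puts "head " + h.to_s + ": " + m.inspect
end
```

With grouped-query attention the same split would run with n_kv heads for K and V, whose fused matrices have only n_kv * d_head columns.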



# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 15

def self.load_toy_smollm2(model, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " (" + n_tensors.to_s + " tensors)"

  cfg     = model.cfg
  d_model = cfg.d_model
  n_heads = cfg.n_heads
  n_kv    = cfg.n_kv
  d_head  = d_model / n_heads

  read_mat(handle,   "token_embd.weight",  model.token_embed.weight, n_tensors)
  read_array(handle, "output_norm.weight", model.final_norm.gamma,   n_tensors)

  # Untied output (`output.weight`) is present for TinyLlama / Llama-2
  # but not for SmolLM2 / Qwen2.5. Detect via tensor presence; the
  # converter omits it for tied models.
  output_idx = find_index(handle, "output.weight", n_tensors)
  if output_idx >= 0
    puts "  untied output: output.weight present"
    model.enable_untied_output!
    read_mat(handle, "output.weight", model.output_proj, n_tensors)
  end

  # Q/K/V biases are a Qwen2.x trait (Llama / SmolLM2 / TinyLlama lack
  # them). Detect via attn_q.bias in block 0; the converter writes all
  # three when any are present in the HF safetensors. The per-head
  # variant (toy from-scratch checkpoints) carries blk.0.attn_q.head_0.bias.
  has_qkv_bias = (find_index(handle, "blk.0.attn_q.bias", n_tensors) >= 0) ||
                 (find_index(handle, "blk.0.attn_q.head_0.bias", n_tensors) >= 0)
  if has_qkv_bias
    puts "  Q/K/V biases present (Qwen2.x-style)"
  end

  # toy#gguf-checkpoint-reload (#153) — from-scratch checkpoints written
  # by ToyGGUFWriter store one tensor PER HEAD (blk.N.attn_q.head_H.weight)
  # rather than the fused llama.cpp shape. Detect via the head_0 sentinel.
  per_head = find_index(handle, "blk.0.attn_q.head_0.weight", n_tensors) >= 0
  if per_head
    puts "  per-head tensors (toy from-scratch checkpoint format)"
  end

  li = 0
  while li < cfg.n_layers
    blk    = model.stack[li]
    prefix = "blk." + li.to_s

    read_array(handle, prefix + ".attn_norm.weight", blk.rn1.gamma, n_tensors)
    read_array(handle, prefix + ".ffn_norm.weight",  blk.rn2.gamma, n_tensors)

    if per_head
      read_per_head_weight(handle, prefix + ".attn_q",
                            blk.attn.w_q, n_heads, d_model, d_head, n_tensors)
      read_per_head_weight(handle, prefix + ".attn_k",
                            blk.attn.w_k, n_kv,    d_model, d_head, n_tensors)
      read_per_head_weight(handle, prefix + ".attn_v",
                            blk.attn.w_v, n_kv,    d_model, d_head, n_tensors)
    else
      # Q: full [d_model, n_heads * d_head] = [d_model, d_model]
      read_split_heads_weight(handle, prefix + ".attn_q.weight",
                               blk.attn.w_q, n_heads, d_model, d_head, n_tensors)
      # K, V: narrower [d_model, n_kv * d_head] — uses the GQA reader.
      read_split_kv_weight(handle, prefix + ".attn_k.weight",
                            blk.attn.w_k, n_kv, d_model, d_head, n_tensors)
      read_split_kv_weight(handle, prefix + ".attn_v.weight",
                            blk.attn.w_v, n_kv, d_model, d_head, n_tensors)
    end
    read_mat(handle,   prefix + ".attn_output.weight", blk.attn.w_o, n_tensors)

    if has_qkv_bias
      if per_head
        read_per_head_bias(handle, prefix + ".attn_q",
                            blk.attn.b_q, n_heads, d_head, n_tensors)
        read_per_head_bias(handle, prefix + ".attn_k",
                            blk.attn.b_k, n_kv,    d_head, n_tensors)
        read_per_head_bias(handle, prefix + ".attn_v",
                            blk.attn.b_v, n_kv,    d_head, n_tensors)
      else
        # Q bias: [n_heads * d_head] split into per-Q-head arrays.
        read_split_heads_bias(handle, prefix + ".attn_q.bias",
                               blk.attn.b_q, n_heads, d_head, n_tensors)
        # K/V biases: [n_kv * d_head] split into per-KV-head arrays.
        read_split_kv_bias(handle, prefix + ".attn_k.bias",
                            blk.attn.b_k, n_kv, d_head, n_tensors)
        read_split_kv_bias(handle, prefix + ".attn_v.bias",
                            blk.attn.b_v, n_kv, d_head, n_tensors)
      end
      blk.attn.enable_qkv_bias!
    end

    read_mat(handle,   prefix + ".ffn_gate.weight", blk.ffn.w_gate, n_tensors)
    read_mat(handle,   prefix + ".ffn_up.weight",   blk.ffn.w_up,   n_tensors)
    read_mat(handle,   prefix + ".ffn_down.weight", blk.ffn.w_down, n_tensors)

    li = li + 1
  end

  TinyNN.tnn_gguf_free(handle)
  true
end
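
The loader above picks its readers purely from tensor presence. A minimal sketch of that decision, assuming the tensor names are already available as a set; `detect_layout` is a hypothetical helper for illustration and not part of GGUFLoad, which queries the file through find_index instead.

```ruby
require "set"

# Hypothetical helper: derive the three layout flags from tensor names,
# using the same sentinel tensors the loader checks in block 0.
def detect_layout(names)
  {
    untied:   names.include?("output.weight"),
    qkv_bias: names.include?("blk.0.attn_q.bias") ||
              names.include?("blk.0.attn_q.head_0.bias"),
    per_head: names.include?("blk.0.attn_q.head_0.weight")
  }
end

# Illustrative name sets only; real files carry many more tensors.
fused = Set["token_embd.weight", "blk.0.attn_q.weight", "blk.0.attn_q.bias"]
toy   = Set["token_embd.weight", "blk.0.attn_q.head_0.weight"]

p detect_layout(fused)
p detect_layout(toy)
```

Keying on block 0 alone assumes every block shares the same layout, which matches how the loader applies the flags to all layers.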

.read_array(handle, name, target, n_tensors) ⇒ Object

Read a 1-D tensor straight into an existing Array<Float>.



# File 'lib/toy/io/gguf_load.rb', line 75

def self.read_array(handle, name, target, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = target.length
  rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, target, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
  end
end

.read_mat(handle, name, mat, n_tensors) ⇒ Object

Read a 2-D tensor straight into an existing Mat (writes to mat.flat).



# File 'lib/toy/io/gguf_load.rb', line 89

def self.read_mat(handle, name, mat, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = mat.nrows * mat.ncols
  rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, mat.flat, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
  end
end

.read_per_head_bias(handle, prefix_attn, dst, n_heads, d_head, n_tensors) ⇒ Object

Per-head bias: blk.N.attn_<q|k|v>.head_H.bias, shape [d_head].



# File 'lib/toy/io/gguf_load.rb', line 254

def self.read_per_head_bias(handle, prefix_attn, dst, n_heads, d_head, n_tensors)
  h = 0
  while h < n_heads
    name = prefix_attn + ".head_" + h.to_s + ".bias"
    idx = find_index(handle, name, n_tensors)
    if idx < 0
      puts "missing: " + name
      return
    end
    rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, dst[h], d_head)
    if rc != 0
      puts "read failed: " + name + " rc=" + rc.to_s
      return
    end
    h = h + 1
  end
end

.read_per_head_weight(handle, prefix_attn, dst, n_heads, d_model, d_head, n_tensors) ⇒ Object

Toy-checkpoint variant: each head is its own tensor named blk.N.attn_<q|k|v>.head_H.weight, with shape [d_head, d_model] in ggml column-major order (equivalent to row-major [d_model × d_head] in our Mat layout). That is exactly what a per-head Mat expects, so each tensor reads straight into its slot, with no fan-out or strided extraction.

Used by toy#gguf-checkpoint-reload (#153) to load from-scratch toy GGUFs without going through the fused llama.cpp convention.



# File 'lib/toy/io/gguf_load.rb', line 233

def self.read_per_head_weight(handle, prefix_attn, dst, n_heads, d_model, d_head, n_tensors)
  h = 0
  while h < n_heads
    name = prefix_attn + ".head_" + h.to_s + ".weight"
    idx = find_index(handle, name, n_tensors)
    if idx < 0
      puts "missing: " + name
      return
    end
    mat = dst[h]
    nel = d_model * d_head
    rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, mat.flat, nel)
    if rc != 0
      puts "read failed: " + name + " rc=" + rc.to_s
      return
    end
    h = h + 1
  end
end
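
The layout equivalence described above can be checked with plain arrays. A minimal sketch, assuming ggml stores dimension 0 as the fastest-varying index (as the description states); `row_major_at` is a hypothetical accessor, not part of the Mat class.

```ruby
d_model = 3
d_head  = 2

# Fill a flat buffer the way ggml is assumed to lay out ne = [d_head, d_model]:
# index j along dimension 0 varies fastest, index i along dimension 1 slowest.
flat = Array.new(d_model * d_head, 0.0)
d_model.times do |i|
  d_head.times do |j|
    flat[j + d_head * i] = 10.0 * i + j
  end
end

# Read the same buffer as a row-major d_model x d_head matrix.
def row_major_at(flat, i, j, d_head)
  flat[i * d_head + j]
end

puts row_major_at(flat, 2, 1, d_head) # prints 21.0
```

Both offset formulas are the same expression, which is why the reader can hand mat.flat directly to the native read call.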

.read_split_heads_bias(handle, name, dst, n_heads, d_head, n_tensors) ⇒ Object

Read a [d_model] concatenated-heads bias into n_heads × Array<Float>(d_head).



# File 'lib/toy/io/gguf_load.rb', line 137

def self.read_split_heads_bias(handle, name, dst, n_heads, d_head, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  d_model = n_heads * d_head
  tmp = Array.new(d_model, 0.0)
  rc  = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, tmp, d_model)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
    return
  end
  h = 0
  while h < n_heads
    arr = dst[h]
    j = 0
    while j < d_head
      arr[j] = tmp[h * d_head + j]
      j = j + 1
    end
    h = h + 1
  end
end
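
The split itself is a contiguous slice per head. A minimal sketch using slice assignment in place of the explicit inner loop; `split_bias_into` is a hypothetical name, and the flat input stands in for the buffer filled by the native read.

```ruby
# Hypothetical helper: copy consecutive d_head-sized chunks of a flat bias
# into the preallocated per-head arrays in dst (one array per head).
def split_bias_into(flat, dst, d_head)
  dst.each_with_index do |arr, h|
    arr[0, d_head] = flat[h * d_head, d_head]
  end
  dst
end

n_heads = 3
d_head  = 2
flat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
dst  = Array.new(n_heads) { Array.new(d_head, 0.0) }

p split_bias_into(flat, dst, d_head)
# prints [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
```

The same slicing applies to K/V biases when dst holds n_kv arrays instead of n_heads.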

.read_split_heads_weight(handle, name, dst, n_heads, d_model, d_head, n_tensors) ⇒ Object

Read a [d_model, d_model] concatenated-heads weight tensor into an Array<Mat> of n_heads × (d_model, d_head). Column block [h*d_head : (h+1)*d_head] of the source becomes head h’s matrix.



# File 'lib/toy/io/gguf_load.rb', line 105

def self.read_split_heads_weight(handle, name, dst, n_heads, d_model, d_head, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = d_model * d_model
  # Stage via a temporary flat buffer (~2.4 MB for distilgpt2);
  # the strided per-head copy can't run while ggml writes to dst.
  tmp = Array.new(nel, 0.0)
  rc  = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, tmp, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
    return
  end
  h = 0
  while h < n_heads
    mat = dst[h]
    i = 0
    while i < d_model
      j = 0
      while j < d_head
        mat.flat[i * d_head + j] = tmp[i * d_model + h * d_head + j]
        j = j + 1
      end
      i = i + 1
    end
    h = h + 1
  end
end
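
The strided copy is easier to follow on a small buffer. A minimal sketch of the same index arithmetic on plain arrays; `extract_head` is a hypothetical helper that returns a new flat array rather than writing into a Mat.

```ruby
# Hypothetical helper: copy column block h of a row-major buffer with
# src_cols columns into a flat row-major [d_model, d_head] array.
def extract_head(src, h, d_model, d_head, src_cols)
  out = Array.new(d_model * d_head, 0.0)
  d_model.times do |i|
    d_head.times do |j|
      out[i * d_head + j] = src[i * src_cols + h * d_head + j]
    end
  end
  out
end

d_model = 4
d_head  = 2
# Row i of the 4 x 4 source holds the values 4*i .. 4*i + 3.
src = (0...(d_model * d_model)).map(&:to_f)

p extract_head(src, 0, d_model, d_head, d_model)
# prints [0.0, 1.0, 4.0, 5.0, 8.0, 9.0, 12.0, 13.0]
p extract_head(src, 1, d_model, d_head, d_model)
# prints [2.0, 3.0, 6.0, 7.0, 10.0, 11.0, 14.0, 15.0]
```

Head 0 takes the left two columns of every row and head 1 the right two, so each output is contiguous even though its source elements are not.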

.read_split_kv_bias(handle, name, dst, n_kv, d_head, n_tensors) ⇒ Object

GQA variant of read_split_heads_bias for K/V: the source is a 1-D bias of length n_kv * d_head, split into n_kv arrays of d_head. Used for Qwen2.x attn_k.bias / attn_v.bias.



# File 'lib/toy/io/gguf_load.rb', line 165

def self.read_split_kv_bias(handle, name, dst, n_kv, d_head, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = n_kv * d_head
  tmp = Array.new(nel, 0.0)
  rc  = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, tmp, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
    return
  end
  h = 0
  while h < n_kv
    arr = dst[h]
    j = 0
    while j < d_head
      arr[j] = tmp[h * d_head + j]
      j = j + 1
    end
    h = h + 1
  end
end

.read_split_kv_weight(handle, name, dst, n_kv, d_model, d_head, n_tensors) ⇒ Object

GQA variant of read_split_heads_weight: the source tensor is [d_model, n_kv * d_head] (not square), and we want to split it into n_kv per-head matrices of shape (d_model, d_head). Mirrors the logic of read_split_heads_weight but with the narrower output dim.



# File 'lib/toy/io/gguf_load.rb', line 194

def self.read_split_kv_weight(handle, name, dst, n_kv, d_model, d_head, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = d_model * n_kv * d_head
  tmp = Array.new(nel, 0.0)
  rc  = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, tmp, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
    return
  end
  # Source row stride = n_kv * d_head; column block h is [h*d_head, (h+1)*d_head).
  src_cols = n_kv * d_head
  h = 0
  while h < n_kv
    mat = dst[h]
    i = 0
    while i < d_model
      j = 0
      while j < d_head
        mat.flat[i * d_head + j] = tmp[i * src_cols + h * d_head + j]
        j = j + 1
      end
      i = i + 1
    end
    h = h + 1
  end
end
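
With GQA the only change is the source row stride, which becomes n_kv * d_head instead of d_model. A minimal sketch of the split using row slicing rather than explicit index arithmetic; `split_kv` is a hypothetical helper that returns flat per-head arrays instead of filling Mat objects.

```ruby
# Hypothetical helper: split a row-major [d_model, n_kv * d_head] buffer
# into n_kv flat row-major [d_model, d_head] arrays.
def split_kv(src, n_kv, d_model, d_head)
  src_cols = n_kv * d_head
  rows = src.each_slice(src_cols).first(d_model)
  (0...n_kv).map do |h|
    # Take columns h*d_head ... (h+1)*d_head from every row, in row order.
    rows.flat_map { |row| row[h * d_head, d_head] }
  end
end

d_model = 3
n_kv    = 2
d_head  = 2
# Rows are [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11].
src = (0...(d_model * n_kv * d_head)).map(&:to_f)

p split_kv(src, n_kv, d_model, d_head)
# prints [[0.0, 1.0, 4.0, 5.0, 8.0, 9.0], [2.0, 3.0, 6.0, 7.0, 10.0, 11.0]]
```

This version allocates intermediate row arrays, so it trades some memory for readability compared with the in-place loop above.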