Module: GGUFLoad
- Defined in:
- lib/toy/io/loaders/toy_smollm2_loader.rb,
lib/toy/io/gguf_load.rb,
lib/toy/io/loaders/toy_gpt2_loader.rb
Overview
Read llama-family hyperparameters from a GGUF’s key-value (KV) metadata. Mirrors GPT2ConfigLoader but for `llama.*` keys (set by convert_smollm2_to_gguf.py).
Defined Under Namespace
Classes: SmolLM2Flags
Class Method Summary
- .detect_smollm2_flags(path) ⇒ Object
- .detect_weight_type(path) ⇒ Object
  Detect the GGUF’s 2D linear weight type.
- .find_index(handle, name, n_tensors) ⇒ Object
  Linear-scan tensor lookup.
- .load_gpt2(model, path) ⇒ Object
  Load distilgpt2-shaped GGUF (also fits gpt2-small/medium/large) into a caller-constructed GPT2LM.
- .load_kv_cache_auto(kv_cache, path) ⇒ Object
  Auto-dispatcher: peek at the toy.ggml_native metadata key and pick the matching loader.
- .load_kv_cache_directly(kv_cache, path) ⇒ Object
  Inference-only loader: stream GGUF weights directly into the FFI persistent buffers, skipping the Ruby Float64 Mat allocation.
- .load_kv_cache_directly_native(kv_cache, path) ⇒ Object
  Native-layout direct loader.
- .load_toy_gpt2(model, path) ⇒ Object
  Same GGUF layout, loaded into a Toy::GPT2.
- .load_toy_smollm2(model, path) ⇒ Object
  Llama-family weight load into a Toy::SmolLM2.
- .read_array(handle, name, target, n_tensors) ⇒ Object
  Read a 1-D tensor straight into an existing Array<Float>.
- .read_mat(handle, name, mat, n_tensors) ⇒ Object
  Read a 2-D tensor straight into an existing Mat (writes to mat.flat).
- .read_per_head_bias(handle, prefix_attn, dst, n_heads, d_head, n_tensors) ⇒ Object
  Per-head bias: blk.N.attn_<q|k|v>.head_H.bias, shape [d_head].
- .read_per_head_weight(handle, prefix_attn, dst, n_heads, d_model, d_head, n_tensors) ⇒ Object
  toy-checkpoint variant: each head is its own tensor named blk.N.attn_<q|k|v>.head_H.weight, shape [d_head, d_model] in ggml column-major (== row-major [d_model × d_head] in our Mat layout).
- .read_split_heads_bias(handle, name, dst, n_heads, d_head, n_tensors) ⇒ Object
  Read a [d_model] concatenated-heads bias into n_heads × Array<Float>(d_head).
- .read_split_heads_weight(handle, name, dst, n_heads, d_model, d_head, n_tensors) ⇒ Object
  Read a [d_model, d_model] concatenated-heads weight tensor into an Array<Mat> of n_heads × (d_model, d_head).
- .read_split_kv_bias(handle, name, dst, n_kv, d_head, n_tensors) ⇒ Object
  GQA variant of read_split_heads_bias for K/V: the source is a 1-D bias of length n_kv * d_head, split into n_kv arrays of d_head.
- .read_split_kv_weight(handle, name, dst, n_kv, d_model, d_head, n_tensors) ⇒ Object
  GQA variant of read_split_heads_weight: the source tensor is [d_model, n_kv * d_head] (not square), and we want to split it into n_kv per-head matrices of shape (d_model, d_head).
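The split readers summarized above all perform the same kind of fan-out: one concatenated tensor becomes one matrix per head. The sketch below illustrates that idea on plain Ruby arrays. It is not the module's implementation: `split_heads` is a hypothetical helper, and it assumes the source is a flat, row-major [d_model, n * d_head] array in which head h owns columns h * d_head through (h + 1) * d_head - 1.

```ruby
# Hypothetical helper (not part of GGUFLoad): split a flat, row-major
# [d_model, n * d_head] matrix into n flat, row-major (d_model, d_head)
# matrices. Assumes head h owns a contiguous block of d_head columns.
def split_heads(flat, d_model, n, d_head)
  row_len = n * d_head
  heads = Array.new(n) { Array.new(d_model * d_head, 0.0) }
  h = 0
  while h < n
    r = 0
    while r < d_model
      c = 0
      while c < d_head
        # Column offset h * d_head selects this head's block in the row.
        heads[h][r * d_head + c] = flat[r * row_len + h * d_head + c]
        c += 1
      end
      r += 1
    end
    h += 1
  end
  heads
end

# Square case (Q): d_model = 4, n_heads = 2, d_head = 2.
q_heads = split_heads((0...16).map(&:to_f), 4, 2, 2)
p q_heads[0]

# GQA case (K/V): same d_model and d_head, but only n_kv = 1 head,
# so the source is the narrower [4, 2] matrix.
kv_heads = split_heads((0...8).map(&:to_f), 4, 1, 2)
p kv_heads[0]
```

The only difference between the square and GQA cases in this sketch is the head count passed in, which matches how the summary describes the `read_split_kv_*` readers as variants of the `read_split_heads_*` ones.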
Class Method Details
.detect_smollm2_flags(path) ⇒ Object
# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 177
def self.detect_smollm2_flags(path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    return SmolLM2Flags.new(false, false, false, 0, false, 0, 0, 0)
  end
  # Gemma 2 ties embeddings (no separate output.weight), but the
  # convention varies. We detect tie via tensor presence, not arch.
  untied = TinyNN.tnn_gguf_find_index(handle, "output.weight") >= 0
  qkv_bias = TinyNN.tnn_gguf_find_index(handle, "blk.0.attn_q.bias") >= 0
  # I-Gemma (#113): post-norm tensors. Their presence is the
  # sentinel for "Gemma 2-shaped block" even if the metadata arch
  # name varies. attn_q_norm-style models (Qwen3) don't have these.
  has_post_norms = TinyNN.tnn_gguf_find_index(handle, "blk.0.post_attention_norm.weight") >= 0
  # M1 + #110: QK-norm. Presence of attn_q_norm tensors signals
  # "apply RMSNorm to Q,K before RoPE". The gamma shape distinguishes
  # the two known dialects:
  #   ne[0] == d_head  → Qwen3-style (shared per-head gamma, applied
  #                      after the head split; equivalent across heads).
  #   ne[0] == d_model → OLMoE / Granite-style (full-Q gamma, applied
  #                      to the concatenated d_model Q vector BEFORE
  #                      the head split; variance is over d_model dims).
  # These are mathematically distinct: in the full-Q form, RMSNorm
  # variance pools across all heads, so per-head behavior differs.
  qn_idx = TinyNN.tnn_gguf_find_index(handle, "blk.0.attn_q_norm.weight")
  qk_norm = qn_idx >= 0
  qk_norm_kind = 0
  if qk_norm
    gamma_ne0 = TinyNN.tnn_gguf_tensor_ne(handle, qn_idx, 0)
    # Probe d_model and the head count to derive d_head. Multi-arch
    # prefix logic: try each known arch in order.
    ap = "llama"
    if TinyNN.tnn_gguf_get_u32(handle, "llama.embedding_length") < 0
      if TinyNN.tnn_gguf_get_u32(handle, "olmoe.embedding_length") >= 0
        ap = "olmoe"
      elsif TinyNN.tnn_gguf_get_u32(handle, "gemma2.embedding_length") >= 0
        ap = "gemma2"
      end
    end
    d_model_v = TinyNN.tnn_gguf_get_u32(handle, ap + ".embedding_length")
    n_heads_v = TinyNN.tnn_gguf_get_u32(handle, ap + ".attention.head_count")
    head_dim = TinyNN.tnn_gguf_get_u32(handle, ap + ".attention.key_length")
    if head_dim <= 0 && n_heads_v > 0
      head_dim = d_model_v / n_heads_v
    end
    if gamma_ne0 == head_dim
      qk_norm_kind = 1 # per-head shared
    elsif gamma_ne0 == d_model_v
      qk_norm_kind = 2 # full-Q
    else
      # Unknown shape; warn loudly and default to per-head shared.
      # If this fires the model output will be wrong.
      puts "WARN: blk.0.attn_q_norm.weight has ne[0]=" + gamma_ne0.to_s +
           " (expected d_head=" + head_dim.to_s +
           " or d_model=" + d_model_v.to_s +
           "). Defaulting to per-head shared."
      qk_norm_kind = 1
    end
  end
  # M3 + I-Gemma: sliding-window attention. llama.cpp emits the
  # window size as `<arch>.attention.sliding_window`. Treat -1 /
  # missing as 0. Try each known arch prefix.
  sw = TinyNN.tnn_gguf_get_u32(handle, "llama.attention.sliding_window")
  if sw < 0
    sw = TinyNN.tnn_gguf_get_u32(handle, "olmoe.attention.sliding_window")
  end
  if sw < 0
    sw = TinyNN.tnn_gguf_get_u32(handle, "gemma2.attention.sliding_window")
  end
  if sw < 0; sw = 0; end
  # I-Gemma: Gemma 2 applies SWA on alternating layers (the
  # `sliding_window_pattern=2` HF config; layers alternate between
  # full attention and sliding). llama.cpp encodes this implicitly
  # by setting attention.sliding_window AND using the gemma2 arch
  # prefix. There's no metadata key for the pattern itself; it's
  # inferred from `general.architecture == "gemma2"`.
  swa_alternates = false
  arch_name = TinyNN.tnn_gguf_get_str(handle, "general.architecture")
  if arch_name == "gemma2" && sw > 0
    swa_alternates = true
  end
  # I-Gemma: soft-cap parameters for attention logits and the final
  # output logits. Read as f32; default 0.0 (no softcap).
  attn_softcap = TinyNN.tnn_gguf_get_f32(handle, "gemma2.attn_logit_softcapping")
  final_softcap = TinyNN.tnn_gguf_get_f32(handle, "gemma2.final_logit_softcapping")
  if attn_softcap < 0.0; attn_softcap = 0.0; end
  if final_softcap < 0.0; final_softcap = 0.0; end
  # I-Gemma: embedding scale. Gemma 2 multiplies token embeddings
  # by sqrt(d_model) post-lookup. Other archs use 1.0.
   = 1.0
  if arch_name == "gemma2"
    d_model_g = TinyNN.tnn_gguf_get_u32(handle, "gemma2.embedding_length")
    if d_model_g > 0
      # Newton sqrt avoids the Math.sqrt poly-dispatch landmine.
      x = d_model_g.to_f
      s = x > 1.0 ? x : 1.0
      ni = 0
      while ni < 30
        s = 0.5 * (s + x / s)
        ni = ni + 1
      end
       = s
    end
  end
  # M2.3: MoE detection. Presence of ffn_gate_inp.weight on layer 0
  # is the sentinel. n_experts / n_experts_used live in <arch>.*
  # metadata keys; we try llama.* then fall back to olmoe.* (and
  # any future arch the same way). We don't *need* to know the arch
  # name itself, just the values.
  is_moe = TinyNN.tnn_gguf_find_index(handle, "blk.0.ffn_gate_inp.weight") >= 0
  n_experts = 0
  n_experts_used = 0
  if is_moe
    ne_v = TinyNN.tnn_gguf_get_u32(handle, "llama.expert_count")
    nu_v = TinyNN.tnn_gguf_get_u32(handle, "llama.expert_used_count")
    if ne_v < 0
      ne_v = TinyNN.tnn_gguf_get_u32(handle, "olmoe.expert_count")
      nu_v = TinyNN.tnn_gguf_get_u32(handle, "olmoe.expert_used_count")
    end
    n_experts = ne_v > 0 ? ne_v : 0
    n_experts_used = nu_v > 0 ? nu_v : 0
  end
  TinyNN.tnn_gguf_free(handle)
  SmolLM2Flags.new(untied, qkv_bias, qk_norm, sw, is_moe, n_experts, n_experts_used, qk_norm_kind, has_post_norms, , attn_softcap, final_softcap, swa_alternates)
end
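Two small computations inside detect_smollm2_flags are easier to follow in isolation: classifying the QK-norm dialect from the gamma length, and the fixed-iteration Newton square root. The sketch below restates both as standalone functions with no FFI calls. The function names are hypothetical and chosen for illustration; only the logic is taken from the listing above.

```ruby
# Hypothetical standalone helpers; none of these names exist in GGUFLoad.

# Prefer the explicit key_length; otherwise fall back to d_model / n_heads,
# as the detector does when the metadata key is missing (non-positive).
def derive_head_dim(key_length, d_model, n_heads)
  return key_length if key_length > 0
  return d_model / n_heads if n_heads > 0
  key_length
end

# 1 = per-head shared gamma, 2 = full-Q gamma.
# Unknown lengths fall back to 1, mirroring the WARN branch.
def qk_norm_kind(gamma_ne0, head_dim, d_model)
  return 1 if gamma_ne0 == head_dim
  return 2 if gamma_ne0 == d_model
  1
end

# Newton iteration for sqrt(x): s <- (s + x / s) / 2, run 30 times
# from a starting guess of max(x, 1.0).
def newton_sqrt(x)
  s = x > 1.0 ? x : 1.0
  30.times { s = 0.5 * (s + x / s) }
  s
end

# Illustrative values only: d_model = 576 with 9 heads and no key_length.
head_dim = derive_head_dim(-1, 576, 9)
puts qk_norm_kind(head_dim, head_dim, 576)
puts qk_norm_kind(576, head_dim, 576)
puts newton_sqrt(576.0)
```

Because the per-head check runs first, a model with a single head (where d_head equals d_model) is classified as per-head shared in this sketch, which is also what the original branch order implies.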
.detect_weight_type(path) ⇒ Object
Detect the GGUF’s 2D linear weight type. Peeks at blk.0.attn_q.weight (always present for llama-family models). Returns the ggml type integer (0=F32, 8=Q8_0). Callers should pass this to kv.set_weight_type before kv.realize_for to enable the Q8-stays-Q8 path.
# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 614
def self.detect_weight_type(path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    return 0
  end
  idx = TinyNN.tnn_gguf_find_index(handle, "blk.0.attn_q.weight")
  t = if idx >= 0
    TinyNN.tnn_gguf_tensor_type(handle, idx)
  else
    0
  end
  TinyNN.tnn_gguf_free(handle)
  t
end
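When logging or debugging, it can help to turn the returned type integer into a label. The sketch below covers only the two values named in the description (0 = F32, 8 = Q8_0); the constant and helper are hypothetical, not part of the module. Note from the listing that 0 is also returned when the file cannot be opened or the tensor is missing, so a label of "F32" does not by itself confirm a successful probe.

```ruby
# Hypothetical helper: label the two ggml type integers named in the
# description of detect_weight_type. Anything else is reported numerically.
GGML_TYPE_LABELS = { 0 => "F32", 8 => "Q8_0" }

def weight_type_label(t)
  GGML_TYPE_LABELS.fetch(t, "other(" + t.to_s + ")")
end

puts weight_type_label(0)
puts weight_type_label(8)
```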
.find_index(handle, name, n_tensors) ⇒ Object
Linear-scan tensor lookup. 100 tensors × ~50 reads = 5000 string compares — fine. A hash map would force Spinel into a polymorphic value type; not worth it.
# File 'lib/toy/io/gguf_load.rb', line 63
def self.find_index(handle, name, n_tensors)
  i = 0
  while i < n_tensors
    if TinyNN.tnn_gguf_tensor_name(handle, i) == name
      return i
    end
    i = i + 1
  end
  -1
end
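The same linear scan can be tried without a GGUF handle by running it over a plain array of names. This is only an illustration: `find_index_in` and `worst_case_compares` are hypothetical helpers, and the tensor names are examples taken from elsewhere on this page.

```ruby
# Hypothetical stand-in for find_index that scans an Array of names
# instead of calling into the FFI layer.
def find_index_in(names, name)
  i = 0
  while i < names.length
    return i if names[i] == name
    i += 1
  end
  -1
end

# Upper bound on string compares: every lookup scans every tensor.
def worst_case_compares(n_tensors, n_lookups)
  n_tensors * n_lookups
end

names = ["token_embd.weight", "output_norm.weight", "blk.0.attn_q.weight"]
puts find_index_in(names, "output_norm.weight")
puts find_index_in(names, "output.weight")
puts worst_case_compares(100, 50)
```

A return value of -1 is the "not found" signal that the loaders on this page test with `idx < 0` or `>= 0`.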
.load_gpt2(model, path) ⇒ Object
Load distilgpt2-shaped GGUF (also fits gpt2-small/medium/large) into a caller-constructed GPT2LM. Returns true on success.
# File 'lib/toy/io/gguf_load.rb', line 274
def self.load_gpt2(model, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " (" + n_tensors.to_s + " tensors)"
  d_model = model.d_model
  d_head = model.d_head
  n_heads = model.n_heads
  # Globals
  read_mat(handle, "token_embd.weight", model., n_tensors)
  read_mat(handle, "position_embd.weight", model., n_tensors)
  read_array(handle, "output_norm.weight", model.ln_f_gamma, n_tensors)
  read_array(handle, "output_norm.bias", model.ln_f_beta, n_tensors)
  # Per-block
  li = 0
  while li < model.n_layers
    blk = model.gpt2_blocks[li]
    prefix = "blk." + li.to_s
    read_array(handle, prefix + ".attn_norm.weight", blk.ln1_gamma, n_tensors)
    read_array(handle, prefix + ".attn_norm.bias", blk.ln1_beta, n_tensors)
    read_array(handle, prefix + ".ffn_norm.weight", blk.ln2_gamma, n_tensors)
    read_array(handle, prefix + ".ffn_norm.bias", blk.ln2_beta, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_q.weight", blk.w_q, n_heads, d_model, d_head, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_k.weight", blk.w_k, n_heads, d_model, d_head, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_v.weight", blk.w_v, n_heads, d_model, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_q.bias", blk.b_q, n_heads, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_k.bias", blk.b_k, n_heads, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_v.bias", blk.b_v, n_heads, d_head, n_tensors)
    read_mat(handle, prefix + ".attn_output.weight", blk.w_o, n_tensors)
    read_array(handle, prefix + ".attn_output.bias", blk.b_o, n_tensors)
    read_mat(handle, prefix + ".ffn_up.weight", blk.w_ff1, n_tensors)
    read_array(handle, prefix + ".ffn_up.bias", blk.b_ff1, n_tensors)
    read_mat(handle, prefix + ".ffn_down.weight", blk.w_ff2, n_tensors)
    read_array(handle, prefix + ".ffn_down.bias", blk.b_ff2, n_tensors)
    li = li + 1
  end
  TinyNN.tnn_gguf_free(handle)
  true
end
.load_kv_cache_auto(kv_cache, path) ⇒ Object
Auto-dispatcher: peek at the toy.ggml_native metadata key and pick the matching loader. Keeps callers ignorant of the file layout.
# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 631
def self.load_kv_cache_auto(kv_cache, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  is_native = TinyNN.tnn_gguf_get_bool(handle, "toy.ggml_native") == 1
  TinyNN.tnn_gguf_free(handle)
  if is_native
    load_kv_cache_directly_native(kv_cache, path)
  else
    load_kv_cache_directly(kv_cache, path)
  end
end
.load_kv_cache_directly(kv_cache, path) ⇒ Object
Inference-only loader: stream GGUF weights directly into the FFI persistent buffers, skipping the Ruby Float64 Mat allocation. This costs 4 bytes per weight (4 B/w) versus 12 B/w on the Mat-mediated path, and is required for 7B-class models.
The kv_cache MUST already be realized via realize_for. Toy::SmolLM2 is not constructed at all; callers that need `describe` / `algorithm_card` should still use the Mat-mediated path on a 1×1-stub config.
# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 312
def self.load_kv_cache_directly(kv_cache, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " → FFI direct (" + n_tensors.to_s + " tensors)"
  sess = kv_cache.sess
  n_heads = kv_cache.n_heads
  n_kv = kv_cache.n_kv
  d_model = kv_cache.d_model
  d_head = kv_cache.d_head
  d_ff = kv_cache.d_ff
  # --- Globals -----
   = TinyNN.tnn_gguf_find_index(handle, "token_embd.weight")
  TinyNN.tnn_gguf_copy_to_persistent(handle, , sess, kv_cache.)
  fn_idx = TinyNN.tnn_gguf_find_index(handle, "output_norm.weight")
  TinyNN.tnn_gguf_copy_1d_to_persistent(handle, fn_idx, sess, kv_cache.t_final_norm_gamma)
  if kv_cache.has_untied_output
    out_idx = TinyNN.tnn_gguf_find_index(handle, "output.weight")
    TinyNN.tnn_gguf_copy_to_persistent(handle, out_idx, sess, kv_cache.t_output)
  end
  # --- Per-block -----
  li = 0
  while li < kv_cache.n_layers
    blk_f = kv_cache.kv_blocks_ffi[li]
    prefix = "blk." + li.to_s
    rn1_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_norm.weight")
    rn2_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_norm.weight")
    TinyNN.tnn_gguf_copy_1d_to_persistent(handle, rn1_idx, sess, blk_f.t_rn1_gamma)
    TinyNN.tnn_gguf_copy_1d_to_persistent(handle, rn2_idx, sess, blk_f.t_rn2_gamma)
    # Q (n_heads per-head slices of attn_q.weight)
    q_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_q.weight")
    hq = 0
    while hq < n_heads
      TinyNN.tnn_gguf_copy_head_slice_to_persistent(handle, q_idx, sess, blk_f.t_w_q[hq], hq, n_heads, d_model, d_head)
      hq = hq + 1
    end
    # K, V (n_kv per-head slices each)
    k_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_k.weight")
    v_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_v.weight")
    hkv = 0
    while hkv < n_kv
      TinyNN.tnn_gguf_copy_head_slice_to_persistent(handle, k_idx, sess, blk_f.t_w_k[hkv], hkv, n_kv, d_model, d_head)
      TinyNN.tnn_gguf_copy_head_slice_to_persistent(handle, v_idx, sess, blk_f.t_w_v[hkv], hkv, n_kv, d_model, d_head)
      hkv = hkv + 1
    end
    # Optional Q/K/V biases (Qwen2.x)
    if kv_cache.has_qkv_bias
      qb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_q.bias")
      kb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_k.bias")
      vb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_v.bias")
      hq = 0
      while hq < n_heads
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, qb_idx, sess, blk_f.t_b_q[hq], hq, d_head)
        hq = hq + 1
      end
      hkv = 0
      while hkv < n_kv
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, kb_idx, sess, blk_f.t_b_k[hkv], hkv, d_head)
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, vb_idx, sess, blk_f.t_b_v[hkv], hkv, d_head)
        hkv = hkv + 1
      end
    end
    # O (attn_output.weight): single transposed
    o_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_output.weight")
    TinyNN.tnn_gguf_copy_transposed_to_persistent(handle, o_idx, sess, blk_f.t_w_o, d_model, d_model)
    # FFN: gate, up, down (each single transposed)
    gate_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_gate.weight")
    up_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_up.weight")
    down_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_down.weight")
    TinyNN.tnn_gguf_copy_transposed_to_persistent(handle, gate_idx, sess, blk_f.t_w_gate, d_model, d_ff)
    TinyNN.tnn_gguf_copy_transposed_to_persistent(handle, up_idx, sess, blk_f.t_w_up, d_model, d_ff)
    TinyNN.tnn_gguf_copy_transposed_to_persistent(handle, down_idx, sess, blk_f.t_w_down, d_ff, d_model)
    li = li + 1
  end
  # Zero-init K/V buffers (matches the Mat-mediated path's kv_zero_*
  # uploads; without this the persistent K/V tensors contain
  # garbage from the backend's initial allocation).
  # P5.2: K and V share layout ne=[d_head, max_T] now, so the
  # zero-init Mat is shared too. Same Q8 skip rule for both.
  kv_zero = Mat.new(kv_cache.max_T, d_head)
  li = 0
  while li < kv_cache.n_layers
    blk_f = kv_cache.kv_blocks_ffi[li]
    hkv = 0
    while hkv < n_kv
      if kv_cache.kv_type_k != 8
        TinyNN.upload_row_major(sess, blk_f.t_K[hkv], kv_zero)
      end
      if kv_cache.kv_type_v != 8
        TinyNN.upload_row_major(sess, blk_f.t_V[hkv], kv_zero)
      end
      hkv = hkv + 1
    end
    li = li + 1
  end
  TinyNN.tnn_gguf_free(handle)
  true
end
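The bytes-per-weight figures quoted for load_kv_cache_directly can be turned into a rough weight-memory estimate. The sketch below is plain arithmetic: the 4 and 12 B/w values come from the description, while the 7,000,000,000 parameter count is an illustrative assumption for a "7B-class" model, and the helper names are hypothetical. It ignores the KV cache, activations, and any quantized storage.

```ruby
# Bytes-per-weight figures quoted in the method description.
BYTES_PER_WEIGHT_DIRECT = 4
BYTES_PER_WEIGHT_MAT = 12

# Hypothetical helpers for a back-of-the-envelope estimate.
def weight_bytes(n_params, bytes_per_weight)
  n_params * bytes_per_weight
end

def to_gb(bytes)
  bytes / 1.0e9
end

# Assumed parameter count for a 7B-class model (illustrative only).
n_params = 7_000_000_000

puts "direct path:        " + to_gb(weight_bytes(n_params, BYTES_PER_WEIGHT_DIRECT)).to_s + " GB"
puts "Mat-mediated path:  " + to_gb(weight_bytes(n_params, BYTES_PER_WEIGHT_MAT)).to_s + " GB"
```

Under these assumptions the direct path needs about 28 GB for weights against about 84 GB for the Mat-mediated path, which is consistent with the description calling the direct loader a requirement at this scale.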
.load_kv_cache_directly_native(kv_cache, path) ⇒ Object
Native-layout direct loader. Same shape as load_kv_cache_directly, but the source GGUF was written with --ggml-native: 2D linear weights are stored in HF-native [out, in] row-major, which already matches ggml’s column-major ne=[in, out] byte order. No transposes are needed, and per-head Q/K/V slices are contiguous byte ranges.
See [[project_mmap_phase1_2026_05_18]] / docs/memory-design.md for the rationale.
# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 452
def self.load_kv_cache_directly_native(kv_cache, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " → FFI direct (native, " + n_tensors.to_s + " tensors)"
  sess = kv_cache.sess
  n_heads = kv_cache.n_heads
  n_kv = kv_cache.n_kv
  d_model = kv_cache.d_model
  d_head = kv_cache.d_head
  d_ff = kv_cache.d_ff
  # Globals (token_embd, output_norm, optional untied output): these
  # were already non-transposed even in the old converter; loader is
  # identical to the legacy path.
   = TinyNN.tnn_gguf_find_index(handle, "token_embd.weight")
  TinyNN.tnn_gguf_copy_to_persistent(handle, , sess, kv_cache.)
  fn_idx = TinyNN.tnn_gguf_find_index(handle, "output_norm.weight")
  TinyNN.tnn_gguf_copy_1d_to_persistent(handle, fn_idx, sess, kv_cache.t_final_norm_gamma)
  if kv_cache.has_untied_output
    out_idx = TinyNN.tnn_gguf_find_index(handle, "output.weight")
    if kv_cache.weight_type != 0
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, out_idx, sess, kv_cache.t_output)
    else
      TinyNN.tnn_gguf_copy_to_persistent(handle, out_idx, sess, kv_cache.t_output)
    end
  end
  li = 0
  while li < kv_cache.n_layers
    blk_f = kv_cache.kv_blocks_ffi[li]
    prefix = "blk." + li.to_s
    rn1_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_norm.weight")
    rn2_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_norm.weight")
    TinyNN.tnn_gguf_copy_1d_to_persistent(handle, rn1_idx, sess, blk_f.t_rn1_gamma)
    TinyNN.tnn_gguf_copy_1d_to_persistent(handle, rn2_idx, sess, blk_f.t_rn2_gamma)
    # Per-head Q/K/V, native layout: contiguous byte range. When the
    # cache is in Q8 mode (Phase 3) we use the verbatim head-slice
    # helper, which is type-agnostic and just memcpys the right
    # contiguous range. For F32 mode the f32 helper does the same
    # plus a dequant fallback (in case the GGUF is Q8 but the cache
    # is F32; old behavior).
    use_verbatim = kv_cache.weight_type != 0
    q_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_q.weight")
    hq = 0
    while hq < n_heads
      if use_verbatim
        TinyNN.tnn_gguf_copy_verbatim_head_slice_to_persistent(handle, q_idx, sess, blk_f.t_w_q[hq], hq, n_heads)
      else
        TinyNN.tnn_gguf_copy_head_slice_to_persistent_native(handle, q_idx, sess, blk_f.t_w_q[hq], hq, n_heads, d_model, d_head)
      end
      hq = hq + 1
    end
    k_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_k.weight")
    v_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_v.weight")
    hkv = 0
    while hkv < n_kv
      if use_verbatim
        TinyNN.tnn_gguf_copy_verbatim_head_slice_to_persistent(handle, k_idx, sess, blk_f.t_w_k[hkv], hkv, n_kv)
        TinyNN.tnn_gguf_copy_verbatim_head_slice_to_persistent(handle, v_idx, sess, blk_f.t_w_v[hkv], hkv, n_kv)
      else
        TinyNN.tnn_gguf_copy_head_slice_to_persistent_native(handle, k_idx, sess, blk_f.t_w_k[hkv], hkv, n_kv, d_model, d_head)
        TinyNN.tnn_gguf_copy_head_slice_to_persistent_native(handle, v_idx, sess, blk_f.t_w_v[hkv], hkv, n_kv, d_model, d_head)
      end
      hkv = hkv + 1
    end
    # Q/K/V biases: 1-D, identical loader (biases were already
    # take()'d untransposed even in the legacy converter).
    if kv_cache.has_qkv_bias
      qb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_q.bias")
      kb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_k.bias")
      vb_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_v.bias")
      hq = 0
      while hq < n_heads
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, qb_idx, sess, blk_f.t_b_q[hq], hq, d_head)
        hq = hq + 1
      end
      hkv = 0
      while hkv < n_kv
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, kb_idx, sess, blk_f.t_b_k[hkv], hkv, d_head)
        TinyNN.tnn_gguf_copy_head_bias_slice_to_persistent(handle, vb_idx, sess, blk_f.t_b_v[hkv], hkv, d_head)
        hkv = hkv + 1
      end
    end
    # O / FFN gate / up / down, native: plain memcpy. Q8 mode
    # uses the verbatim primitive (same shape; type-preserving).
    o_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".attn_output.weight")
    gate_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_gate.weight")
    up_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_up.weight")
    down_idx = TinyNN.tnn_gguf_find_index(handle, prefix + ".ffn_down.weight")
    if use_verbatim
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, o_idx, sess, blk_f.t_w_o)
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, gate_idx, sess, blk_f.t_w_gate)
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, up_idx, sess, blk_f.t_w_up)
      TinyNN.tnn_gguf_copy_verbatim_to_persistent(handle, down_idx, sess, blk_f.t_w_down)
    else
      TinyNN.tnn_gguf_copy_to_persistent(handle, o_idx, sess, blk_f.t_w_o)
      TinyNN.tnn_gguf_copy_to_persistent(handle, gate_idx, sess, blk_f.t_w_gate)
      TinyNN.tnn_gguf_copy_to_persistent(handle, up_idx, sess, blk_f.t_w_up)
      TinyNN.tnn_gguf_copy_to_persistent(handle, down_idx, sess, blk_f.t_w_down)
    end
    li = li + 1
  end
  # Zero-init K/V buffers (same as the legacy path).
  # P5.2: K and V share layout ne=[d_head, max_T] now, so the
  # zero-init Mat is shared too. Same Q8 skip rule for both.
  kv_zero = Mat.new(kv_cache.max_T, d_head)
  li = 0
  while li < kv_cache.n_layers
    blk_f = kv_cache.kv_blocks_ffi[li]
    hkv = 0
    while hkv < n_kv
      if kv_cache.kv_type_k != 8
        TinyNN.upload_row_major(sess, blk_f.t_K[hkv], kv_zero)
      end
      if kv_cache.kv_type_v != 8
        TinyNN.upload_row_major(sess, blk_f.t_V[hkv], kv_zero)
      end
      hkv = hkv + 1
    end
    li = li + 1
  end
  TinyNN.tnn_gguf_free(handle)
  true
end
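The layout claim behind the native loader can be checked with index arithmetic. The sketch below compares the flat offset of an element in a row-major [out, in] matrix with the offset of the same element in a ggml tensor with ne = [in, out], where ne[0] is the fastest-varying dimension, and then computes the element range one head occupies. All function names are hypothetical, and the head-slice part assumes each head owns d_head consecutive output rows.

```ruby
# Flat offset of W[o][i] in a row-major [n_out, n_in] matrix.
def row_major_offset(o, i, n_in)
  o * n_in + i
end

# Flat offset of element (i0, i1) in a ggml tensor with ne = [ne0, ne1];
# ne0 is the fastest-varying dimension.
def ggml_offset(i0, i1, ne0)
  i0 + i1 * ne0
end

# True when every element lands at the same offset under both views.
def layouts_match?(n_out, n_in)
  (0...n_out).all? do |o|
    (0...n_in).all? do |i|
      row_major_offset(o, i, n_in) == ggml_offset(i, o, n_in)
    end
  end
end

# Element range [start, stop) for head h, assuming each head owns
# d_head consecutive output rows of the [n_out, n_in] matrix.
def head_slice_range(h, d_head, n_in)
  start = h * d_head * n_in
  [start, start + d_head * n_in]
end

puts layouts_match?(6, 4)
p head_slice_range(1, 2, 4)
```

Since the two offsets agree for every element, no reordering is needed on load, and each head's rows form one unbroken range, which is why a single contiguous copy per head suffices in the native path.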
.load_toy_gpt2(model, path) ⇒ Object
Same GGUF layout, loaded into a Toy::GPT2. The weights now live under sub-modules (`blk.attn.w_q`, `blk.ln1.gamma`, …), so this mirrors load_gpt2 with the new path expressions.
# File 'lib/toy/io/loaders/toy_gpt2_loader.rb', line 13
def self.load_toy_gpt2(model, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " (" + n_tensors.to_s + " tensors)"
  cfg = model.cfg
  d_model = cfg.d_model
  n_heads = cfg.n_heads
  d_head = d_model / n_heads
  read_mat(handle, "token_embd.weight", model..weight, n_tensors)
  read_mat(handle, "position_embd.weight", model..weight, n_tensors)
  read_array(handle, "output_norm.weight", model.final_norm.gamma, n_tensors)
  read_array(handle, "output_norm.bias", model.final_norm.beta, n_tensors)
  li = 0
  while li < cfg.n_layers
    blk = model.stack[li]
    prefix = "blk." + li.to_s
    read_array(handle, prefix + ".attn_norm.weight", blk.ln1.gamma, n_tensors)
    read_array(handle, prefix + ".attn_norm.bias", blk.ln1.beta, n_tensors)
    read_array(handle, prefix + ".ffn_norm.weight", blk.ln2.gamma, n_tensors)
    read_array(handle, prefix + ".ffn_norm.bias", blk.ln2.beta, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_q.weight", blk.attn.w_q, n_heads, d_model, d_head, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_k.weight", blk.attn.w_k, n_heads, d_model, d_head, n_tensors)
    read_split_heads_weight(handle, prefix + ".attn_v.weight", blk.attn.w_v, n_heads, d_model, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_q.bias", blk.attn.b_q, n_heads, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_k.bias", blk.attn.b_k, n_heads, d_head, n_tensors)
    read_split_heads_bias(handle, prefix + ".attn_v.bias", blk.attn.b_v, n_heads, d_head, n_tensors)
    read_mat(handle, prefix + ".attn_output.weight", blk.attn.w_o, n_tensors)
    read_array(handle, prefix + ".attn_output.bias", blk.attn.b_o, n_tensors)
    read_mat(handle, prefix + ".ffn_up.weight", blk.ffn.w1, n_tensors)
    read_array(handle, prefix + ".ffn_up.bias", blk.ffn.b1, n_tensors)
    read_mat(handle, prefix + ".ffn_down.weight", blk.ffn.w2, n_tensors)
    read_array(handle, prefix + ".ffn_down.bias", blk.ffn.b2, n_tensors)
    li = li + 1
  end
  TinyNN.tnn_gguf_free(handle)
  true
end
.load_toy_smollm2(model, path) ⇒ Object
Llama-family weight load into a Toy::SmolLM2.
Tensor name conventions match prep/convert_smollm2_to_gguf.py. The converter has already transposed every nn.Linear weight from HF’s [out, in] to our [in, out] orientation.
# File 'lib/toy/io/loaders/toy_smollm2_loader.rb', line 15
def self.load_toy_smollm2(model, path)
  handle = TinyNN.tnn_gguf_load(path)
  if handle == nil
    puts "open failed: " + path
    return false
  end
  n_tensors = TinyNN.tnn_gguf_n_tensors(handle)
  puts "loading " + path + " (" + n_tensors.to_s + " tensors)"
  cfg = model.cfg
  d_model = cfg.d_model
  n_heads = cfg.n_heads
  n_kv = cfg.n_kv
  d_head = d_model / n_heads
  read_mat(handle, "token_embd.weight", model..weight, n_tensors)
  read_array(handle, "output_norm.weight", model.final_norm.gamma, n_tensors)
  # Untied output (`output.weight`) is present for TinyLlama / Llama-2
  # but not for SmolLM2 / Qwen2.5. Detect via tensor presence; the
  # converter omits it for tied models.
  output_idx = find_index(handle, "output.weight", n_tensors)
  if output_idx >= 0
    puts " untied output: output.weight present"
    model.enable_untied_output!
    read_mat(handle, "output.weight", model.output_proj, n_tensors)
  end
  # Q/K/V biases are a Qwen2.x trait (Llama / SmolLM2 / TinyLlama lack
  # them). Detect via attn_q.bias in block 0; the converter writes all
  # three when any are present in the HF safetensors. The per-head
  # variant (toy from-scratch checkpoints) carries blk.0.attn_q.head_0.bias.
  has_qkv_bias = (find_index(handle, "blk.0.attn_q.bias", n_tensors) >= 0) || (find_index(handle, "blk.0.attn_q.head_0.bias", n_tensors) >= 0)
  if has_qkv_bias
    puts " Q/K/V biases present (Qwen2.x-style)"
  end
  # toy#gguf-checkpoint-reload (#153): from-scratch checkpoints written
  # by ToyGGUFWriter store one tensor PER HEAD (blk.N.attn_q.head_H.weight)
  # rather than the fused llama.cpp shape. Detect via the head_0 sentinel.
  per_head = find_index(handle, "blk.0.attn_q.head_0.weight", n_tensors) >= 0
  if per_head
    puts " per-head tensors (toy from-scratch checkpoint format)"
  end
  li = 0
  while li < cfg.n_layers
    blk = model.stack[li]
    prefix = "blk." + li.to_s
    read_array(handle, prefix + ".attn_norm.weight", blk.rn1.gamma, n_tensors)
    read_array(handle, prefix + ".ffn_norm.weight", blk.rn2.gamma, n_tensors)
    if per_head
      read_per_head_weight(handle, prefix + ".attn_q", blk.attn.w_q, n_heads, d_model, d_head, n_tensors)
      read_per_head_weight(handle, prefix + ".attn_k", blk.attn.w_k, n_kv, d_model, d_head, n_tensors)
      read_per_head_weight(handle, prefix + ".attn_v", blk.attn.w_v, n_kv, d_model, d_head, n_tensors)
    else
      # Q: full [d_model, n_heads * d_head] = [d_model, d_model]
      read_split_heads_weight(handle, prefix + ".attn_q.weight", blk.attn.w_q, n_heads, d_model, d_head, n_tensors)
      # K, V: narrower [d_model, n_kv * d_head]; uses the GQA reader.
      read_split_kv_weight(handle, prefix + ".attn_k.weight", blk.attn.w_k, n_kv, d_model, d_head, n_tensors)
      read_split_kv_weight(handle, prefix + ".attn_v.weight", blk.attn.w_v, n_kv, d_model, d_head, n_tensors)
    end
    read_mat(handle, prefix + ".attn_output.weight", blk.attn.w_o, n_tensors)
    if has_qkv_bias
      if per_head
        read_per_head_bias(handle, prefix + ".attn_q", blk.attn.b_q, n_heads, d_head, n_tensors)
        read_per_head_bias(handle, prefix + ".attn_k", blk.attn.b_k, n_kv, d_head, n_tensors)
        read_per_head_bias(handle, prefix + ".attn_v", blk.attn.b_v, n_kv, d_head, n_tensors)
      else
        # Q bias: [n_heads * d_head] split into per-Q-head arrays.
        read_split_heads_bias(handle, prefix + ".attn_q.bias", blk.attn.b_q, n_heads, d_head, n_tensors)
        # K/V biases: [n_kv * d_head] split into per-KV-head arrays.
        read_split_kv_bias(handle, prefix + ".attn_k.bias", blk.attn.b_k, n_kv, d_head, n_tensors)
        read_split_kv_bias(handle, prefix + ".attn_v.bias", blk.attn.b_v, n_kv, d_head, n_tensors)
      end
      blk.attn.enable_qkv_bias!
    end
    read_mat(handle, prefix + ".ffn_gate.weight", blk.ffn.w_gate, n_tensors)
    read_mat(handle, prefix + ".ffn_up.weight", blk.ffn.w_up, n_tensors)
    read_mat(handle, prefix + ".ffn_down.weight", blk.ffn.w_down, n_tensors)
    li = li + 1
  end
  TinyNN.tnn_gguf_free(handle)
  true
end
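The description of load_toy_smollm2 relies on the converter having transposed every nn.Linear weight from [out, in] to [in, out] before this loader runs. The actual transposition happens in the Python converter, which is not shown here; the Ruby sketch below only illustrates what that reorientation does to a flat row-major buffer. `transpose_flat` is a hypothetical helper.

```ruby
# Hypothetical helper: transpose a flat, row-major [n_rows, n_cols] matrix
# into a flat, row-major [n_cols, n_rows] matrix.
def transpose_flat(flat, n_rows, n_cols)
  out = Array.new(n_rows * n_cols, 0.0)
  r = 0
  while r < n_rows
    c = 0
    while c < n_cols
      # Element (r, c) of the source becomes element (c, r) of the result.
      out[c * n_rows + r] = flat[r * n_cols + c]
      c += 1
    end
    r += 1
  end
  out
end

# An [out = 2, in = 3] weight, stored row-major.
w_out_in = [1.0, 2.0, 3.0,
            4.0, 5.0, 6.0]

# The same weight in [in = 3, out = 2] orientation.
w_in_out = transpose_flat(w_out_in, 2, 3)
p w_in_out
```

With the weight already in [in, out] orientation, a loader can copy the buffer into a Mat of matching shape without any per-element reshuffling, which fits the plain `read_mat` calls used for the FFN and output projections.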
.read_array(handle, name, target, n_tensors) ⇒ Object
Read a 1-D tensor straight into an existing Array<Float>.
# File 'lib/toy/io/gguf_load.rb', line 75
def self.read_array(handle, name, target, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = target.length
  rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, target, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
  end
end
.read_mat(handle, name, mat, n_tensors) ⇒ Object
Read a 2-D tensor straight into an existing Mat (writes to mat.flat).
# File 'lib/toy/io/gguf_load.rb', line 89

def self.read_mat(handle, name, mat, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = mat.nrows * mat.ncols
  rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, mat.flat, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
  end
end
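To illustrate what "writes to mat.flat" implies for indexing, here is a minimal sketch. `SketchMat` is a hypothetical stand-in for the project's Mat; only `nrows`, `ncols` and `flat` are taken from the code above, and the row-major offset `i * ncols + j` is inferred from the split readers, which write `mat.flat[i * d_head + j]`.

```ruby
# Hypothetical stand-in for Mat (assumption: flat is row-major).
SketchMat = Struct.new(:nrows, :ncols, :flat) do
  def at(i, j)
    flat[i * ncols + j] # row i starts at offset i * ncols
  end
end

mat = SketchMat.new(2, 3, Array.new(2 * 3, 0.0))

# Simulate a read: fill the existing flat buffer in place, nel elements.
source = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
nel = mat.nrows * mat.ncols
nel.times { |k| mat.flat[k] = source[k] }

puts mat.at(1, 0) # prints 4.0 (row 1, column 0)
```

The element count passed to the reader is derived from the destination (`nrows * ncols`), so the Mat must already be allocated with the right shape before the call.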
.read_per_head_bias(handle, prefix_attn, dst, n_heads, d_head, n_tensors) ⇒ Object
Per-head bias: blk.N.attn_<q|k|v>.head_H.bias, shape [d_head].
# File 'lib/toy/io/gguf_load.rb', line 254

def self.read_per_head_bias(handle, prefix_attn, dst, n_heads, d_head, n_tensors)
  h = 0
  while h < n_heads
    name = prefix_attn + ".head_" + h.to_s + ".bias"
    idx = find_index(handle, name, n_tensors)
    if idx < 0
      puts "missing: " + name
      return
    end
    rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, dst[h], d_head)
    if rc != 0
      puts "read failed: " + name + " rc=" + rc.to_s
      return
    end
    h = h + 1
  end
end
.read_per_head_weight(handle, prefix_attn, dst, n_heads, d_model, d_head, n_tensors) ⇒ Object
toy-checkpoint variant: each head is its own tensor named blk.N.attn_<q|k|v>.head_H.weight, shape [d_head, d_model] in ggml column-major (== row-major [d_model × d_head] in our Mat layout). That is exactly what a per-head Mat expects, so each tensor reads straight into its slot — no fan-out / strided extraction.
Used by toy#gguf-checkpoint-reload (#153) to load from-scratch toy GGUFs without going through the fused llama.cpp convention.
# File 'lib/toy/io/gguf_load.rb', line 233

def self.read_per_head_weight(handle, prefix_attn, dst, n_heads, d_model, d_head, n_tensors)
  h = 0
  while h < n_heads
    name = prefix_attn + ".head_" + h.to_s + ".weight"
    idx = find_index(handle, name, n_tensors)
    if idx < 0
      puts "missing: " + name
      return
    end
    mat = dst[h]
    nel = d_model * d_head
    rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, mat.flat, nel)
    if rc != 0
      puts "read failed: " + name + " rc=" + rc.to_s
      return
    end
    h = h + 1
  end
end
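The layout claim above (a ggml tensor of shape [d_head, d_model] occupying the same bytes as a row-major d_model × d_head matrix) can be checked with a small offset comparison. This is a sketch only: both helpers are hypothetical, and it relies on ggml's convention that the first dimension is the contiguous one.

```ruby
# Hypothetical helpers for comparing element offsets.

# ggml-style offset for a 2-D tensor with ne = [ne0, ne1]; ne0 is contiguous.
def ggml_offset(x0, x1, ne0)
  x0 + x1 * ne0
end

# Row-major offset for a matrix with ncols columns.
def row_major_offset(r, c, ncols)
  r * ncols + c
end

# True when every (row i, column j) maps to the same offset in both layouts.
def layouts_match?(d_model, d_head)
  (0...d_model).all? do |i|
    (0...d_head).all? do |j|
      ggml_offset(j, i, d_head) == row_major_offset(i, j, d_head)
    end
  end
end

puts layouts_match?(3, 2) # prints true
```

Because the offsets agree, each per-head tensor can be read into `mat.flat` as one contiguous run of `d_model * d_head` elements.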
.read_split_heads_bias(handle, name, dst, n_heads, d_head, n_tensors) ⇒ Object
Read a [d_model] concatenated-heads bias into n_heads × Array<Float>(d_head).
# File 'lib/toy/io/gguf_load.rb', line 137

def self.read_split_heads_bias(handle, name, dst, n_heads, d_head, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  d_model = n_heads * d_head
  tmp = Array.new(d_model, 0.0)
  rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, tmp, d_model)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
    return
  end
  h = 0
  while h < n_heads
    arr = dst[h]
    j = 0
    while j < d_head
      arr[j] = tmp[h * d_head + j]
      j = j + 1
    end
    h = h + 1
  end
end
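The split itself, separated from the GGUF read, could be sketched as follows. `split_bias` is a hypothetical helper; unlike the loader, which copies into the existing `dst[h]` arrays in place, it returns fresh arrays.

```ruby
# Hypothetical helper: cut a concatenated bias into n_heads chunks of d_head.
def split_bias(flat, n_heads, d_head)
  # Head h owns the contiguous run flat[h * d_head, d_head].
  Array.new(n_heads) { |h| flat[h * d_head, d_head] }
end

bias = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
p split_bias(bias, 3, 2) # prints [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
```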
.read_split_heads_weight(handle, name, dst, n_heads, d_model, d_head, n_tensors) ⇒ Object
Read a [d_model, d_model] concatenated-heads weight tensor into an Array<Mat> of n_heads × (d_model, d_head). Column block [h*d_head : (h+1)*d_head] of the source becomes head h’s matrix.
# File 'lib/toy/io/gguf_load.rb', line 105

def self.read_split_heads_weight(handle, name, dst, n_heads, d_model, d_head, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = d_model * d_model
  # Stage via a temporary flat buffer (~2.4 MB for distilgpt2);
  # the strided per-head copy can't run while ggml writes to dst.
  tmp = Array.new(nel, 0.0)
  rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, tmp, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
    return
  end
  h = 0
  while h < n_heads
    mat = dst[h]
    i = 0
    while i < d_model
      j = 0
      while j < d_head
        mat.flat[i * d_head + j] = tmp[i * d_model + h * d_head + j]
        j = j + 1
      end
      i = i + 1
    end
    h = h + 1
  end
end
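A small worked case may make the strided copy easier to follow. The sketch below applies the same index arithmetic to a 4 × 4 source with two heads. `split_heads` is a hypothetical helper that works on plain flat arrays rather than Mat objects and returns new buffers instead of writing into `dst`.

```ruby
# Hypothetical helper: extract each head's column block from a row-major source.
def split_heads(flat, n_heads, d_model, d_head)
  src_cols = n_heads * d_head # source row stride (== d_model in the square case)
  Array.new(n_heads) do |h|
    out = Array.new(d_model * d_head, 0.0)
    d_model.times do |i|
      d_head.times do |j|
        out[i * d_head + j] = flat[i * src_cols + h * d_head + j]
      end
    end
    out
  end
end

d_model = 4
# Source entry at (row i, column c) is 10*i + c, so values reveal their origin.
flat = Array.new(d_model * d_model) { |k| 10 * (k / d_model) + (k % d_model) }
heads = split_heads(flat, 2, d_model, 2)
p heads[0] # prints [0, 1, 10, 11, 20, 21, 30, 31]  (columns 0..1 of every row)
p heads[1] # prints [2, 3, 12, 13, 22, 23, 32, 33]  (columns 2..3 of every row)
```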
.read_split_kv_bias(handle, name, dst, n_kv, d_head, n_tensors) ⇒ Object
GQA variant of read_split_heads_bias for K/V: the source is a 1-D bias of length n_kv * d_head, split into n_kv arrays of d_head. Used for Qwen2.x attn_k.bias / attn_v.bias.
# File 'lib/toy/io/gguf_load.rb', line 165

def self.read_split_kv_bias(handle, name, dst, n_kv, d_head, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = n_kv * d_head
  tmp = Array.new(nel, 0.0)
  rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, tmp, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
    return
  end
  h = 0
  while h < n_kv
    arr = dst[h]
    j = 0
    while j < d_head
      arr[j] = tmp[h * d_head + j]
      j = j + 1
    end
    h = h + 1
  end
end
.read_split_kv_weight(handle, name, dst, n_kv, d_model, d_head, n_tensors) ⇒ Object
GQA variant of read_split_heads_weight: the source tensor is [d_model, n_kv * d_head] (not square), and we want to split it into n_kv per-head matrices of shape (d_model, d_head). Mirrors the logic of read_split_heads_weight but with the narrower output dim.
# File 'lib/toy/io/gguf_load.rb', line 194

def self.read_split_kv_weight(handle, name, dst, n_kv, d_model, d_head, n_tensors)
  idx = find_index(handle, name, n_tensors)
  if idx < 0
    puts "missing: " + name
    return
  end
  nel = d_model * n_kv * d_head
  tmp = Array.new(nel, 0.0)
  rc = TinyNN.tnn_gguf_read_f32_to_doubles(handle, idx, tmp, nel)
  if rc != 0
    puts "read failed: " + name + " rc=" + rc.to_s
    return
  end
  # Source row stride = n_kv * d_head; column block h is [h*d_head, (h+1)*d_head).
  src_cols = n_kv * d_head
  h = 0
  while h < n_kv
    mat = dst[h]
    i = 0
    while i < d_model
      j = 0
      while j < d_head
        mat.flat[i * d_head + j] = tmp[i * src_cols + h * d_head + j]
        j = j + 1
      end
      i = i + 1
    end
    h = h + 1
  end
end
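One way to sanity-check the non-square case is a round trip: split a [d_model, n_kv * d_head] source into heads, join the heads back row by row, and compare with the input. Both helpers below are hypothetical and work on plain flat arrays. The sizes (d_model = 8, n_kv = 2, d_head = 2) are illustrative only.

```ruby
# Hypothetical helper: same index arithmetic as the GQA reader above.
def split_kv(flat, n_kv, d_model, d_head)
  src_cols = n_kv * d_head # narrower than d_model when n_kv < n_heads
  Array.new(n_kv) do |h|
    out = Array.new(d_model * d_head, 0.0)
    d_model.times do |i|
      d_head.times { |j| out[i * d_head + j] = flat[i * src_cols + h * d_head + j] }
    end
    out
  end
end

# Hypothetical inverse: rebuild each source row from the heads' rows.
def join_kv(heads, d_model, d_head)
  Array.new(d_model) { |i| heads.map { |m| m[i * d_head, d_head] } }.flatten
end

d_model = 8
n_kv = 2
d_head = 2
nel = d_model * n_kv * d_head          # 32 elements, versus 64 for a square Q tensor
source = Array.new(nel) { |k| k.to_f }
heads = split_kv(source, n_kv, d_model, d_head)
puts heads.length                                # prints 2
puts join_kv(heads, d_model, d_head) == source   # prints true
```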