Module: ToyGGUFWriter
- Defined in:
- lib/toy/train/toy_gguf_writer.rb
Overview
tao#gguf-checkpoint-writer — GGUF snapshot writer for training runs.
Writes $TAO_RUN_DIR/weights/step_N.gguf on a schedule (CHECKPOINT_EVERY=N), plus a final snapshot at run_end. Maintains $TAO_RUN_DIR/weights/latest as a symlink to the most recent file.
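The schedule described above can be sketched as a small predicate. This is an illustration only: `checkpoint_due?` and its parameters are hypothetical names, not part of this module, and the real training loop may apply different rules (for example around step 0 or a disabled schedule).

```ruby
# Hypothetical helper: should a snapshot be written at `step`?
# `every` mirrors CHECKPOINT_EVERY; `last_step` marks run_end.
def checkpoint_due?(step, every, last_step)
  return true if step == last_step   # final snapshot at run_end
  return false if every <= 0         # assumption: non-positive disables the schedule
  step > 0 && step % every == 0      # periodic snapshot
end

# With CHECKPOINT_EVERY=4 and a 10-step run:
steps = (1..10).select { |s| checkpoint_due?(s, 4, 10) }
# steps == [4, 8, 10]
```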
Format: thin wrap around ggml’s gguf writer (tinynn/tinynn_ggml.c). Caller supplies the model arch + hyperparams (which become GGUF metadata KV pairs) and the session whose PARAM tensors should be serialised.
Loadable-by-inference status: the GGUF we write is structurally correct (parseable via gguf_init_from_file, with all metadata and tensor data present), but toy’s inference loader (lib/toy_smollm2_loader.rb) expects the llama.cpp tensor naming convention (`blk.N.attn_q.weight`, etc.) and per-layer fused tensors. Neither matches toy’s per-head training graph. Bridging that gap is tracked as toy#gguf-checkpoint-reload (filed separately).
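To make the naming mismatch concrete, here is a minimal sketch that maps per-head names to the fused per-layer names a llama.cpp-style loader looks up. `fused_name` and `group_by_fused` are hypothetical helpers, and the concrete `head_<integer>` spelling is an assumption based on the `blk.N.attn_q.head_h.weight` pattern mentioned in the `.name_params` notes. The sketch covers names only; actually merging the per-head tensor data is the harder part and is not shown.

```ruby
# Hypothetical helper: drop the ".head_<h>" segment from a per-head name.
# Names without a head segment pass through unchanged.
def fused_name(per_head_name)
  per_head_name.sub(/\.head_\d+\./, ".")
end

# Group per-head tensor names under the fused name they would merge into.
def group_by_fused(names)
  names.group_by { |n| fused_name(n) }
end

names = [
  "token_embd.weight",
  "blk.0.attn_q.head_0.weight",
  "blk.0.attn_q.head_1.weight",
]
groups = group_by_fused(names)
# groups["blk.0.attn_q.weight"] lists both per-head names
```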
Spinel notes:
- Naming uses simple String.concat; no #{} interpolation.
- Plist (Array<:ptr>) flows from the caller; we don't construct
Array<:ptr> inside the module to avoid recurring landmine #1.
Class Method Summary
- .name_params(plist) ⇒ Object
  Preserve the names set during realize (toy#semantic-tensor-names, GH#11).
- .write(cfg, plist, path, run_id, step) ⇒ Object
  Write a checkpoint.
- .write_step(cfg, plist, weights_dir, run_id, step) ⇒ Object
  Convenience: ensure $TAO_RUN_DIR/weights/ exists, then write the checkpoint + update the `latest` symlink.
Class Method Details
.name_params(plist) ⇒ Object
Preserve the names set during realize (toy#semantic-tensor-names, GH#11). The realize_for_* paths now annotate each PARAM with llama.cpp-convention names (“token_embd.weight”, “blk.N.attn_q.head_h.weight”, …), so this method no longer overwrites them with “param_N”. A tensor without a name (from older training graphs that have not been migrated to set names) keeps whatever name ggml auto-assigned, so the method is a no-op.
    # File 'lib/toy/train/toy_gguf_writer.rb', line 30
    def self.name_params(plist)
      # Intentionally empty. See header comment.
    end
.write(cfg, plist, path, run_id, step) ⇒ Object
Write a checkpoint. `cfg` carries the model hyperparams; `plist` is the param-ordered tensor pointer array (from ToyDriftGrad.params or ToyDescribeFlow’s index builder); `path` is the destination GGUF (caller manages directory creation + naming convention). Returns 0 on success, negative on failure.
    # File 'lib/toy/train/toy_gguf_writer.rb', line 39
    def self.write(cfg, plist, path, run_id, step)
      ctx = TinyNN.tnn_gguf_w_init
      if ctx == nil || ctx == TinyNN.tnn_null_ptr
        return -1
      end

      # Standard arch metadata: enough for downstream tooling to know
      # what shape was trained. We use "llama" so future Tao tooling
      # that sniffs general.architecture has something familiar.
      TinyNN.tnn_gguf_w_set_str(ctx, "general.architecture", "llama")
      TinyNN.tnn_gguf_w_set_str(ctx, "general.name", "toy-from-scratch")
      TinyNN.tnn_gguf_w_set_str(ctx, "general.run_id", run_id)
      TinyNN.tnn_gguf_w_set_u32(ctx, "general.step", step)
      TinyNN.tnn_gguf_w_set_u32(ctx, "llama.vocab_size", cfg.vocab)
      TinyNN.tnn_gguf_w_set_u32(ctx, "llama.embedding_length", cfg.d_model)
      TinyNN.tnn_gguf_w_set_u32(ctx, "llama.feed_forward_length", cfg.d_ff)
      TinyNN.tnn_gguf_w_set_u32(ctx, "llama.block_count", cfg.n_layers)
      TinyNN.tnn_gguf_w_set_u32(ctx, "llama.attention.head_count", cfg.n_heads)
      TinyNN.tnn_gguf_w_set_u32(ctx, "llama.attention.head_count_kv", cfg.n_kv)
      TinyNN.tnn_gguf_w_set_u32(ctx, "llama.context_length", cfg.ctx)
      TinyNN.tnn_gguf_w_set_f32(ctx, "llama.attention.layer_norm_rms_epsilon", cfg.rms_eps)
      TinyNN.tnn_gguf_w_set_f32(ctx, "llama.rope.freq_base", cfg.rope_base)

      # Provenance: the toy-side checkpoint format version.
      TinyNN.tnn_gguf_w_set_str(ctx, "toy.checkpoint_format", "toy-from-scratch/v1")
      TinyNN.tnn_gguf_w_set_u32(ctx, "toy.n_params_written", plist.length)

      # toy#gguf-checkpoint-reload (#153): the bytes go out in native
      # ggml column-major because we hand finalized ggml tensors directly
      # to gguf_add_tensor. Flag it so transformer_lm.rb's load_cpu picks
      # the mmap path (which understands the per-head naming convention)
      # instead of the legacy direct loader.
      TinyNN.tnn_gguf_w_set_bool(ctx, "toy.ggml_native", 1)

      # Name each param and add it.
      name_params(plist)
      i = 0
      while i < plist.length
        TinyNN.tnn_gguf_w_add_tensor(ctx, plist[i])
        i = i + 1
      end

      rc = TinyNN.tnn_gguf_w_finalize(ctx, path)
      TinyNN.tnn_gguf_w_free(ctx)
      rc
    end
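For reference, the metadata keys that `.write` sets can be summarised as a plain Hash. This is a sketch rather than the module's API: `Cfg` is a hypothetical Struct standing in for the real config object, `checkpoint_metadata` is a hypothetical helper, and every field value below is made up. The real writer passes each pair to the `TinyNN.tnn_gguf_w_set_*` calls instead of building a Hash.

```ruby
# Hypothetical stand-in for the config object; field names follow the
# accessors used by .write, values are illustrative only.
Cfg = Struct.new(:vocab, :d_model, :d_ff, :n_layers, :n_heads, :n_kv,
                 :ctx, :rms_eps, :rope_base)

# Hypothetical helper: the key/value pairs .write emits, as a Hash.
def checkpoint_metadata(cfg, run_id, step, n_params)
  {
    "general.architecture"                   => "llama",
    "general.name"                           => "toy-from-scratch",
    "general.run_id"                         => run_id,
    "general.step"                           => step,
    "llama.vocab_size"                       => cfg.vocab,
    "llama.embedding_length"                 => cfg.d_model,
    "llama.feed_forward_length"              => cfg.d_ff,
    "llama.block_count"                      => cfg.n_layers,
    "llama.attention.head_count"             => cfg.n_heads,
    "llama.attention.head_count_kv"          => cfg.n_kv,
    "llama.context_length"                   => cfg.ctx,
    "llama.attention.layer_norm_rms_epsilon" => cfg.rms_eps,
    "llama.rope.freq_base"                   => cfg.rope_base,
    "toy.checkpoint_format"                  => "toy-from-scratch/v1",
    "toy.n_params_written"                   => n_params,
    "toy.ggml_native"                        => true, # the writer passes 1 to tnn_gguf_w_set_bool
  }
end

cfg = Cfg.new(512, 64, 256, 2, 4, 4, 128, 1.0e-5, 10000.0)
md = checkpoint_metadata(cfg, "run-abc", 100, 42)
```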
.write_step(cfg, plist, weights_dir, run_id, step) ⇒ Object
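Convenience: ensure $TAO_RUN_DIR/weights/ exists, then write the checkpoint and update the `latest` symlink. `weights_dir` is the full path (e.g. “/tmp/runs/abc/weights”). Returns the result of .write: 0 on success, in which case `latest` is pointed at the new file. On failure the symlink is not touched.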
    # File 'lib/toy/train/toy_gguf_writer.rb', line 87
    def self.write_step(cfg, plist, weights_dir, run_id, step)
      TinyNN.tnn_filesystem_mkdir(weights_dir)
      fname = "step_" + step.to_s + ".gguf"
      fpath = weights_dir + "/" + fname
      rc = write(cfg, plist, fpath, run_id, step)
      if rc == 0
        # latest → step_N.gguf (relative target so it works under symlinked
        # weights dirs and across rsync moves).
        lpath = weights_dir + "/latest"
        TinyNN.tnn_filesystem_symlink(fname, lpath)
      end
      rc
    end
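The relative symlink target is what keeps `latest` valid after the run directory moves. The sketch below demonstrates this with Ruby's standard library in a temporary directory. It uses `File.symlink` in place of `TinyNN.tnn_filesystem_symlink`, whose exact behaviour is not shown in this page, and the directory and file names are made up.

```ruby
require "tmpdir"
require "fileutils"

resolved_after_move = nil
link_target = nil

Dir.mktmpdir do |root|
  weights = File.join(root, "run_a", "weights")
  FileUtils.mkdir_p(weights)
  File.write(File.join(weights, "step_100.gguf"), "fake checkpoint bytes")

  # Relative target: the link stores "step_100.gguf", not an absolute path.
  File.symlink("step_100.gguf", File.join(weights, "latest"))

  # Move the whole run directory, as a rename or rsync would.
  moved = File.join(root, "run_b")
  FileUtils.mv(File.join(root, "run_a"), moved)

  latest = File.join(moved, "weights", "latest")
  link_target = File.readlink(latest)       # the stored target is unchanged
  resolved_after_move = File.read(latest)   # still resolves inside the moved dir
end
```

An absolute target would still point into the old `run_a` path after the move. Note that `File.symlink` raises `Errno::EEXIST` when the link already exists; whether `tnn_filesystem_symlink` replaces an existing `latest` is not stated here.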