Class: Toy::LLM::Archs::LlamaArch
- Inherits:
- Object
- Inheritance chain: Object → Toy::LLM::Archs::LlamaArch
- Defined in:
- lib/toy/llm/archs/llama_arch.rb,
lib/toy/llm/archs/llama_arch_cuda.rb,
lib/toy/llm/archs/llama_arch_metal.rb
Overview
The llama-family sequence-mode arch. Owns the arch-level persistent handles (the cache realize paths allocate+assign them via cache delegators). Field names are UNCHANGED from the former cache ivars so the cache-side realize / train / decode / tap walkers keep working by accessor name.
Instance Attribute Summary
-
#seq_blocks_ffi ⇒ Object
Returns the value of attribute seq_blocks_ffi.
-
#seq_donor_d_in ⇒ Object
Returns the value of attribute seq_donor_d_in.
-
#seq_gdn_blocks_ffi ⇒ Object
Returns the value of attribute seq_gdn_blocks_ffi.
-
#seq_layer_kinds ⇒ Object
Returns the value of attribute seq_layer_kinds.
-
#seq_layer_specs ⇒ Object
Returns the value of attribute seq_layer_specs.
-
#seq_rope_cfg ⇒ Object
Returns the value of attribute seq_rope_cfg.
-
#t_seq_final_norm_gamma ⇒ Object
Returns the value of attribute t_seq_final_norm_gamma.
-
#t_seq_output ⇒ Object
Returns the value of attribute t_seq_output.
-
#t_seq_token_embed ⇒ Object
Returns the value of attribute t_seq_token_embed.
-
#t_seq_w_proj ⇒ Object
Returns the value of attribute t_seq_w_proj.
Instance Method Summary
-
#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ Object
P2-finish — the RANDOM-INIT (+ projection-lens) trainable-F32 GLOBAL alloc, lifted VERBATIM from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init (alloc + ft_add_global / ft_name_last_global ORDER unchanged → bit-identical graph; gated by train_gate from-scratch + smoke_projection_lens).
-
#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ Object
SEQ-MODE forward orchestration.
-
#initialize ⇒ LlamaArch
constructor
A new instance of LlamaArch.
-
#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ Object
Allocate the three arch-owned PERSISTENT global tensors from the mmap’d GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and — when untied — output.weight (2d, native type).
-
#seed_blocks!(n_layers) ⇒ Object
Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks.
-
#set_gdn_layer!(idx) ⇒ Object
Mark ONE layer as GDN.
Constructor Details
#initialize ⇒ LlamaArch
Returns a new instance of LlamaArch.
# File 'lib/toy/llm/archs/llama_arch.rb', line 92
def initialize
  @t_seq_token_embed = TinyNN.tnn_null_ptr
  @t_seq_final_norm_gamma = TinyNN.tnn_null_ptr
  @t_seq_output = TinyNN.tnn_null_ptr
  @t_seq_w_proj = TinyNN.tnn_null_ptr
  # Seed with one block: matches the former cache init (L112).
  @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
  # Phase 3: parallel seed, one attention spec for the seed block.
  @seq_layer_specs = [Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)]
  # Phase 5: parallel int dispatch keys (KIND_ATTENTION for the seed).
  @seq_layer_kinds = [Toy::LLM::Archs::LayerSpec::KIND_ATTENTION]
  # Phase 5: parallel GDN-block slots. Seeded with GDNBlock placeholders so
  # the array is MONOMORPHIC (all GDNBlock): the seam's KIND_GDN call site
  # never sees a mixed null/object array (Spinel poly-array landmine). At
  # KIND_ATTENTION layers the placeholder is simply never invoked.
  @seq_gdn_blocks_ffi = [Toy::LLM::Blocks::GDNBlock.new]
  @seq_donor_d_in = 0
  # The cache overwrites seq_rope_cfg with the real RoPE::Cfg before
  # build_forward runs (each realize prologue rebuilds it).
  @seq_rope_cfg = TinyNN.tnn_null_ptr
end
Instance Attribute Details
#seq_blocks_ffi ⇒ Object
Returns the value of attribute seq_blocks_ffi.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def seq_blocks_ffi
  @seq_blocks_ffi
end
#seq_donor_d_in ⇒ Object
Returns the value of attribute seq_donor_d_in.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def seq_donor_d_in
  @seq_donor_d_in
end
#seq_gdn_blocks_ffi ⇒ Object
Returns the value of attribute seq_gdn_blocks_ffi.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def seq_gdn_blocks_ffi
  @seq_gdn_blocks_ffi
end
#seq_layer_kinds ⇒ Object
Returns the value of attribute seq_layer_kinds.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def seq_layer_kinds
  @seq_layer_kinds
end
#seq_layer_specs ⇒ Object
Returns the value of attribute seq_layer_specs.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def seq_layer_specs
  @seq_layer_specs
end
#seq_rope_cfg ⇒ Object
Returns the value of attribute seq_rope_cfg.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def seq_rope_cfg
  @seq_rope_cfg
end
#t_seq_final_norm_gamma ⇒ Object
Returns the value of attribute t_seq_final_norm_gamma.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def t_seq_final_norm_gamma
  @t_seq_final_norm_gamma
end
#t_seq_output ⇒ Object
Returns the value of attribute t_seq_output.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def t_seq_output
  @t_seq_output
end
#t_seq_token_embed ⇒ Object
Returns the value of attribute t_seq_token_embed.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def t_seq_token_embed
  @t_seq_token_embed
end
#t_seq_w_proj ⇒ Object
Returns the value of attribute t_seq_w_proj.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68
def t_seq_w_proj
  @t_seq_w_proj
end
Instance Method Details
#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ Object
P2-finish: the RANDOM-INIT (+ projection-lens) trainable-F32 GLOBAL alloc, lifted VERBATIM from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init. The alloc and the ft_add_global / ft_name_last_global ORDER are unchanged, so the graph is bit-identical (gated by train_gate from-scratch + smoke_projection_lens). The arch already OWNS these handles (the load_globals_from_gguf_mmap! precedent); the engine's @ft_globals_* recorders and the frozen-embed :str namer are back-called through `cache` (the tnn_tensor_set_name :str FFI stays on the cache realize runtime path, the same discipline as ft_name_last / lora_name_q!). donor_d_in > 0 selects the projection lens (frozen donor-width embed + trainable lens.proj); donor_d_in == 0 selects the standard layout.
# File 'lib/toy/llm/archs/llama_arch.rb', line 197
def alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied)
  if donor_d_in > 0
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, donor_d_in)
    cache.name_global!(self.t_seq_token_embed, "token_embd.weight")
    self.t_seq_w_proj = TinyNN.tnn_input_2d_f32_persistent(sess, d_model, donor_d_in)
    cache.ft_add_global_2d(self.t_seq_w_proj, d_model, donor_d_in)
    cache.ft_name_last_global("lens.proj.weight")
  else
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_token_embed, vocab, d_model)
    cache.ft_name_last_global("token_embd.weight")
  end
  self.t_seq_final_norm_gamma = TinyNN.tnn_input_1d_f32_persistent(sess, d_model)
  cache.ft_add_global_1d(self.t_seq_final_norm_gamma)
  cache.ft_name_last_global("output_norm.weight")
  if untied
    self.t_seq_output = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_output, vocab, d_model)
    cache.ft_name_last_global("output.weight")
  end
end
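The registration order described for alloc_globals_trainable_f32! can be sketched without the FFI. This is an illustration only: RecordingCache and sketch_alloc_order are hypothetical stand-ins (not part of Toy::LLM), symbols replace the tensor handles TinyNN would return, and only the branch structure of the method is mirrored.

```ruby
# Hypothetical stand-in for the cache: it records names, nothing else.
class RecordingCache
  attr_reader :trainable, :frozen

  def initialize
    @trainable = [] # globals registered via ft_add_global_* then named
    @frozen = []    # globals named via name_global! (never registered as trainable)
  end

  def name_global!(_handle, name)
    @frozen << name
  end

  def ft_add_global_2d(_handle, _rows, _cols)
    @trainable << nil # slot is named by the next ft_name_last_global call
  end

  def ft_add_global_1d(_handle)
    @trainable << nil
  end

  def ft_name_last_global(name)
    @trainable[-1] = name
  end
end

# Mirrors only the control flow of alloc_globals_trainable_f32!.
def sketch_alloc_order(cache, vocab, d_model, donor_d_in, untied)
  if donor_d_in > 0
    cache.name_global!(:token_embed, "token_embd.weight") # frozen donor-width embed
    cache.ft_add_global_2d(:w_proj, d_model, donor_d_in)
    cache.ft_name_last_global("lens.proj.weight")
  else
    cache.ft_add_global_2d(:token_embed, vocab, d_model)
    cache.ft_name_last_global("token_embd.weight")
  end
  cache.ft_add_global_1d(:final_norm_gamma)
  cache.ft_name_last_global("output_norm.weight")
  if untied
    cache.ft_add_global_2d(:output, vocab, d_model)
    cache.ft_name_last_global("output.weight")
  end
  cache
end

standard = sketch_alloc_order(RecordingCache.new, 100, 16, 0, false)
lens = sketch_alloc_order(RecordingCache.new, 100, 16, 64, true)
```

In the standard case the embedding is the first trainable global; in the projection-lens case it is only named, and lens.proj.weight takes the first trainable slot.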
#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ Object
SEQ-MODE forward orchestration. The per-graph INPUT handles (token_ids, positions) are ALLOCATED BY THE CACHE before this call (cache-owned graph I/O, read by forward() and the uploaders) and passed in; ditto t_rope_freq_factors and t_attn_mask. The arch builds: get_rows(token_embed, token_ids) → x_embed (tap), optional projection-lens matmul(w_proj, x_embed) when seq_donor_d_in>0 (tap), the shared TransformerBlockCtx built ONCE, the block-stacking loop, final RMSNorm (tap), tied/untied logits matmul (tap). Returns the three per-graph output handles in a LlamaArchForwardOut.
# File 'lib/toy/llm/archs/llama_arch.rb', line 230
def build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask,
                  seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size,
                  seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled,
                  seq_t, seq_b, seq_n_layers, seq_has_untied_output)
  eps = seq_eps
  scale = 1.0 / Math.sqrt(seq_d_head.to_f)
  # Per-forward block context: the 14 config/handle values the block
  # body reads. Positional class (no keyword_init): matches the
  # TransformerBlockCtx member order exactly. Built once before the
  # block-stacking loop; shared (read-only) across all blocks.
  ctx = Toy::LLM::Blocks::TransformerBlockCtx.new(
    scale, eps, seq_n_kv, seq_n_heads, seq_group_size,
    seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled,
    t_positions, t_rope_freq_factors, self.seq_rope_cfg,
    seq_t, seq_b, t_attn_mask)
  x_embed = TinyNN.tnn_get_rows(sess, self.t_seq_token_embed, t_token_ids)
  TinyNN.tnn_set_output(x_embed)
  # E2.3: projection lens. ggml matmul(W, x) with W=[donor_d_in, d_model]
  # and x=[donor_d_in, T] gives [d_model, T] (contraction on ne[0]).
  if self.seq_donor_d_in > 0
    t_proj = TinyNN.tnn_matmul(sess, self.t_seq_w_proj, x_embed)
    TinyNN.tnn_set_output(t_proj)
    t_cur = t_proj
  else
    t_cur = x_embed
  end
  li_g = 0
  while li_g < seq_n_layers
    # Phase 3: per-layer descriptor dispatch. The branch compares a FLAT
    # INT (spec.kind) and each arm calls a CONCRETE typed block method, so
    # every .build_forward call site stays monomorphic (one receiver
    # class). KIND_ATTENTION is the only arm wired today; KIND_GDN gets its
    # own arm + its own typed block array in Phase 5. Unknown kinds fail
    # loud rather than silently building the wrong graph (never-mask rule).
    spec_kind = self.seq_layer_kinds[li_g]
    if spec_kind == Toy::LLM::Archs::LayerSpec::KIND_ATTENTION
      t_cur = self.seq_blocks_ffi[li_g].build_forward(sess, t_cur, ctx)
    elsif spec_kind == Toy::LLM::Archs::LayerSpec::KIND_GDN
      # Concrete typed call into the parallel GDN array: the GDN block reads
      # its own dims (set at alloc); seq_t/eps come from the shared ctx.
      t_cur = self.seq_gdn_blocks_ffi[li_g].build_forward(sess, t_cur, seq_t, eps)
    else
      raise "LlamaArch#build_forward: unsupported layer kind #{spec_kind} at layer #{li_g}"
    end
    li_g = li_g + 1
  end
  x_final = Toy::LLM::Primitives::RMSNorm.build(sess, t_cur, self.t_seq_final_norm_gamma, eps)
  TinyNN.tnn_set_output(x_final)
  if seq_has_untied_output
    logits = TinyNN.tnn_matmul(sess, self.t_seq_output, x_final)
  else
    logits = TinyNN.tnn_matmul(sess, self.t_seq_token_embed, x_final)
  end
  TinyNN.tnn_set_output(logits)
  Toy::LLM::Archs::LlamaArchForwardOut.new(x_embed, x_final, logits)
end
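The flat-int dispatch in the block-stacking loop can be illustrated in isolation. The sketch below is not the library's code: the KIND_* values, FakeAttentionBlock, FakeGdnBlock, and stack_layers are hypothetical, and each "block" just appends a tag to an array so the resulting order is visible.

```ruby
# Hypothetical dispatch keys; the real constants live on LayerSpec and
# their numeric values are not shown in this page.
KIND_ATTENTION = 0
KIND_GDN = 1

# Stand-in blocks: each one records that it ran.
class FakeAttentionBlock
  def build_forward(x, _ctx)
    x + ["attn"]
  end
end

class FakeGdnBlock
  def build_forward(x, _seq_t, _eps)
    x + ["gdn"]
  end
end

# Two parallel, single-class arrays indexed by layer; the int kind picks
# which array (and which method signature) is used at each layer.
def stack_layers(kinds, attn_blocks, gdn_blocks, x, ctx, seq_t, eps)
  li = 0
  while li < kinds.length
    kind = kinds[li]
    if kind == KIND_ATTENTION
      x = attn_blocks[li].build_forward(x, ctx)
    elsif kind == KIND_GDN
      x = gdn_blocks[li].build_forward(x, seq_t, eps)
    else
      # Fail loudly instead of skipping the layer.
      raise "unsupported layer kind #{kind} at layer #{li}"
    end
    li += 1
  end
  x
end

kinds = [KIND_ATTENTION, KIND_GDN, KIND_ATTENTION]
attn = Array.new(3) { FakeAttentionBlock.new }
gdn = Array.new(3) { FakeGdnBlock.new } # placeholders at attention layers are never called
trace = stack_layers(kinds, attn, gdn, [], nil, 8, 1.0e-5)
```

Keeping both arrays full length means every index is valid in either array, so the branch on the int is the only thing that decides which block runs.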
#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ Object
Allocate the three arch-owned PERSISTENT global tensors from the mmap’d GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and — when untied — output.weight (2d, native type). The cache’s realize_for_mmap formerly ran this block inline (P2.6 pass-2 Step 1); it is moved VERBATIM here (same FFI primitives, same find_index/file_offset/type LITERAL string lookups at runtime, same untied conditional which the GGUF round-trip gate exercises true). The arch already OWNS these accessors (L68), so no new class / Struct / FFI :str at class load. Called ONLY from realize_for_mmap — the random_init globals and full_finetune’s else-branch globals are structurally different and are NOT routed through this helper. Mirrors the seed_blocks! / alloc_trainable_f32_weights! precedents.
# File 'lib/toy/llm/archs/llama_arch.rb', line 167
def load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied)
  eidx = TinyNN.tnn_gguf_find_index(gguf_handle, "token_embd.weight")
  eoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, eidx)
  etyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, eidx)
  @t_seq_token_embed = TinyNN.tnn_input_2d_persistent_mmap(sess, vocab, d_model, etyp, eoff)
  fnidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output_norm.weight")
  fnoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, fnidx)
  @t_seq_final_norm_gamma = TinyNN.tnn_input_1d_persistent_mmap(sess, d_model, 0, fnoff)
  if untied
    oidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output.weight")
    ooff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, oidx)
    otyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, oidx)
    @t_seq_output = TinyNN.tnn_input_2d_persistent_mmap(sess, vocab, d_model, otyp, ooff)
  end
end
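The name → index → (type, offset) lookup pattern used by load_globals_from_gguf_mmap! can be sketched against an in-memory table. Everything here is assumed for illustration: FakeTensor, FakeGguf, and plan_globals are hypothetical, the type codes and offsets are made up, and the missing-tensor check is an addition that the listed method does not perform. Unlike the real method, which passes a literal 0 as the type for output_norm.weight, the sketch reads every type from the table.

```ruby
# Hypothetical in-memory stand-in for a GGUF tensor directory.
FakeTensor = Struct.new(:name, :type, :offset)

class FakeGguf
  def initialize(tensors)
    @tensors = tensors
  end

  # Returns the directory index for a tensor name, or -1 when absent
  # (the -1 convention is an assumption of this sketch).
  def find_index(name)
    @tensors.index { |t| t.name == name } || -1
  end

  def file_offset(idx)
    @tensors[idx].offset
  end

  def tensor_type(idx)
    @tensors[idx].type
  end
end

# Resolve the two or three arch-owned globals to [name, type, offset] triples.
def plan_globals(gguf, untied)
  names = ["token_embd.weight", "output_norm.weight"]
  names << "output.weight" if untied
  names.map do |name|
    idx = gguf.find_index(name)
    raise "missing tensor #{name}" if idx < 0
    [name, gguf.tensor_type(idx), gguf.file_offset(idx)]
  end
end

gguf = FakeGguf.new([
  FakeTensor.new("output_norm.weight", 0, 4096),
  FakeTensor.new("token_embd.weight", 8, 8192),
  FakeTensor.new("output.weight", 8, 65_536)
])
plan_tied = plan_globals(gguf, false)
plan_untied = plan_globals(gguf, true)
```

The tied plan never touches output.weight, which matches the method's untied conditional: a tied model reuses the embedding for the logits matmul.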
#seed_blocks!(n_layers) ⇒ Object
Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks (for n_layers >= 1), rebuilding the parallel @seq_layer_specs, @seq_layer_kinds, and @seq_gdn_blocks_ffi arrays in lockstep. The four cache realize paths each ran this identical loop verbatim (P2.6 Step 2): seed one block, then push n_layers-1 more. The SHAPE here therefore matches the former cache loop byte-for-byte (length == n_layers; first element is a fresh block, exactly like the cache's `[TransformerBlock.new]` seed). The arch already OWNS this array (ctor seeds it with one block at L83) and already constructs TransformerBlock.new there, so no new class / Struct / FFI :str at class load. Each realize path now calls this via the cache's seq_blocks_ffi delegator chain (self.seq_arch).
# File 'lib/toy/llm/archs/llama_arch.rb', line 137
def seed_blocks!(n_layers)
  @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
  # Phase 3: seed the parallel spec array in lockstep. Every layer is
  # KIND_ATTENTION for now (the homogeneous-Llama refactor gate); Phase 5
  # overwrites individual entries with KIND_GDN for Dragon's pattern.
  @seq_layer_specs = [Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)]
  @seq_gdn_blocks_ffi = [Toy::LLM::Blocks::GDNBlock.new]
  @seq_layer_kinds = [Toy::LLM::Archs::LayerSpec::KIND_ATTENTION]
  li_init = 1
  while li_init < n_layers
    @seq_blocks_ffi.push(Toy::LLM::Blocks::TransformerBlock.new)
    @seq_layer_specs.push(Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION))
    @seq_gdn_blocks_ffi.push(Toy::LLM::Blocks::GDNBlock.new)
    @seq_layer_kinds.push(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)
    li_init = li_init + 1
  end
end
#set_gdn_layer!(idx) ⇒ Object
Mark ONE layer as GDN. Takes an INT index (never an array param: a function-parameter array trips the Spinel #688 type-lock landmine, which here manifests as a token-id-finalize codegen miscompile). Mutates the plain int dispatch array element (proven-safe).
Phase 5 hybrid: rebuild the per-layer spec array from a per-layer GDN bool flag, using the LayerSpec CTOR (never the .kind= setter: mutating LayerSpec.kind elsewhere while build_forward reads it trips a Spinel codegen miscompile that corrupts the token-id finalize). Called after seed_blocks!, before alloc.
# File 'lib/toy/llm/archs/llama_arch.rb', line 133
def set_gdn_layer!(idx)
  @seq_layer_kinds[idx] = Toy::LLM::Archs::LayerSpec::KIND_GDN
end
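A small stand-alone model of the seed-then-mark sequence may help show the lockstep invariant. ArchSketch and the KIND_* values are hypothetical; plain Object instances stand in for TransformerBlock and GDNBlock, and the LayerSpec array is omitted for brevity.

```ruby
# Hypothetical dispatch keys (the real ones are LayerSpec constants).
KIND_ATTENTION = 0
KIND_GDN = 1

class ArchSketch
  attr_reader :blocks, :gdn_blocks, :kinds

  def initialize
    seed_blocks!(1) # the ctor seeds one layer, as the real ctor does
  end

  # Reset all parallel arrays, then grow them together to n_layers.
  def seed_blocks!(n_layers)
    @blocks = [Object.new]     # stand-in for TransformerBlock.new
    @gdn_blocks = [Object.new] # stand-in for GDNBlock.new (placeholder)
    @kinds = [KIND_ATTENTION]
    li = 1
    while li < n_layers
      @blocks.push(Object.new)
      @gdn_blocks.push(Object.new)
      @kinds.push(KIND_ATTENTION)
      li += 1
    end
  end

  # Flip one int in the dispatch array; the block arrays are untouched.
  def set_gdn_layer!(idx)
    @kinds[idx] = KIND_GDN
  end
end

arch = ArchSketch.new
arch.seed_blocks!(4)
arch.set_gdn_layer!(1)
arch.set_gdn_layer!(3)
```

Because the first element is seeded before the loop, a call with n_layers of 0 would still leave one element in each array; the length equals n_layers only for n_layers of at least 1.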