Class: Toy::LLM::Archs::LlamaArch
- Inherits: Object
  - Object
  - Toy::LLM::Archs::LlamaArch
- Defined in: lib/toy/llm/archs/llama_arch.rb, lib/toy/llm/archs/llama_arch_cuda.rb, lib/toy/llm/archs/llama_arch_metal.rb
Overview
The llama-family sequence-mode arch. It owns the arch-level persistent handles; the cache realize paths allocate and assign them through cache delegators. Field names are UNCHANGED from the former cache ivars, so the cache-side realize / train / decode / tap walkers keep working by accessor name.
Instance Attribute Summary
- #seq_blocks_ffi ⇒ Object
  Returns the value of attribute seq_blocks_ffi.
- #seq_donor_d_in ⇒ Object
  Returns the value of attribute seq_donor_d_in.
- #seq_rope_cfg ⇒ Object
  Returns the value of attribute seq_rope_cfg.
- #t_seq_final_norm_gamma ⇒ Object
  Returns the value of attribute t_seq_final_norm_gamma.
- #t_seq_output ⇒ Object
  Returns the value of attribute t_seq_output.
- #t_seq_token_embed ⇒ Object
  Returns the value of attribute t_seq_token_embed.
- #t_seq_w_proj ⇒ Object
  Returns the value of attribute t_seq_w_proj.
Instance Method Summary
- #alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ Object
  P2-finish: the RANDOM-INIT (+ projection-lens) trainable-F32 GLOBAL alloc, lifted VERBATIM from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init (alloc + ft_add_global / ft_name_last_global ORDER unchanged → bit-identical graph; gated by train_gate from-scratch + smoke_projection_lens).
- #build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ Object
  SEQ-MODE forward orchestration.
- #initialize ⇒ LlamaArch (constructor)
  A new instance of LlamaArch.
- #load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ Object
  Allocate the three arch-owned PERSISTENT global tensors from the mmap’d GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and, when untied, output.weight (2d, native type).
- #seed_blocks!(n_layers) ⇒ Object
  Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks.
Constructor Details
#initialize ⇒ LlamaArch
Returns a new instance of LlamaArch.
# File 'lib/toy/llm/archs/llama_arch.rb', line 77

def initialize
  @t_seq_token_embed = TinyNN.tnn_null_ptr
  @t_seq_final_norm_gamma = TinyNN.tnn_null_ptr
  @t_seq_output = TinyNN.tnn_null_ptr
  @t_seq_w_proj = TinyNN.tnn_null_ptr
  # Seed with one block; matches the former cache init (L112).
  @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
  @seq_donor_d_in = 0
  # The cache overwrites seq_rope_cfg with the real RoPE::Cfg before
  # build_forward runs (each realize prologue rebuilds it).
  @seq_rope_cfg = TinyNN.tnn_null_ptr
end
Instance Attribute Details
#seq_blocks_ffi ⇒ Object
Returns the value of attribute seq_blocks_ffi.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_blocks_ffi
  @seq_blocks_ffi
end
#seq_donor_d_in ⇒ Object
Returns the value of attribute seq_donor_d_in.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_donor_d_in
  @seq_donor_d_in
end
#seq_rope_cfg ⇒ Object
Returns the value of attribute seq_rope_cfg.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_rope_cfg
  @seq_rope_cfg
end
#t_seq_final_norm_gamma ⇒ Object
Returns the value of attribute t_seq_final_norm_gamma.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_final_norm_gamma
  @t_seq_final_norm_gamma
end
#t_seq_output ⇒ Object
Returns the value of attribute t_seq_output.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_output
  @t_seq_output
end
#t_seq_token_embed ⇒ Object
Returns the value of attribute t_seq_token_embed.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_token_embed
  @t_seq_token_embed
end
#t_seq_w_proj ⇒ Object
Returns the value of attribute t_seq_w_proj.
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_w_proj
  @t_seq_w_proj
end
Instance Method Details
#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ Object
P2-finish: the RANDOM-INIT (+ projection-lens) trainable-F32 GLOBAL alloc, lifted VERBATIM from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init (alloc + ft_add_global / ft_name_last_global ORDER unchanged → bit-identical graph; gated by train_gate from-scratch + smoke_projection_lens). The arch already OWNS these handles (the load_globals_from_gguf_mmap! precedent); the engine’s @ft_globals_* recorders and the frozen-embed :str namer are called back through `cache` (the tnn_tensor_set_name :str FFI stays on the cache realize runtime path, the same discipline as ft_name_last / lora_name_q!). donor_d_in > 0 selects the projection lens (frozen donor-width embed + trainable lens.proj); donor_d_in == 0 selects the standard layout.
# File 'lib/toy/llm/archs/llama_arch.rb', line 151

def alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied)
  if donor_d_in > 0
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, donor_d_in)
    cache.name_global!(self.t_seq_token_embed, "token_embd.weight")
    self.t_seq_w_proj = TinyNN.tnn_input_2d_f32_persistent(sess, d_model, donor_d_in)
    cache.ft_add_global_2d(self.t_seq_w_proj, d_model, donor_d_in)
    cache.ft_name_last_global("lens.proj.weight")
  else
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_token_embed, vocab, d_model)
    cache.ft_name_last_global("token_embd.weight")
  end

  self.t_seq_final_norm_gamma = TinyNN.tnn_input_1d_f32_persistent(sess, d_model)
  cache.ft_add_global_1d(self.t_seq_final_norm_gamma)
  cache.ft_name_last_global("output_norm.weight")

  if untied
    self.t_seq_output = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_output, vocab, d_model)
    cache.ft_name_last_global("output.weight")
  end
end
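The branches of alloc_globals_trainable_f32! decide which global tensors exist, in what order they are allocated, and which of them are registered as trainable. The following is a minimal sketch of that decision table only; `trainable_global_plan` and the `:frozen` / `:trainable` tags are hypothetical names introduced for illustration, and no TinyNN or cache call is made.

```ruby
# Hypothetical helper (not part of Toy::LLM): lists the globals in
# allocation order for one configuration. Each entry is
# [name, dims as passed to the allocator, role].
def trainable_global_plan(vocab, d_model, donor_d_in, untied)
  plan = []
  if donor_d_in > 0
    # Projection lens: the donor-width embed is only named (frozen),
    # while lens.proj is registered as a trainable global.
    plan << ["token_embd.weight", [vocab, donor_d_in], :frozen]
    plan << ["lens.proj.weight", [d_model, donor_d_in], :trainable]
  else
    plan << ["token_embd.weight", [vocab, d_model], :trainable]
  end
  plan << ["output_norm.weight", [d_model], :trainable]
  plan << ["output.weight", [vocab, d_model], :trainable] if untied
  plan
end

# Illustrative sizes only.
standard = trainable_global_plan(100, 32, 0, false)
lens     = trainable_global_plan(100, 32, 48, true)

p standard.map(&:first)  # ["token_embd.weight", "output_norm.weight"]
p lens.map(&:first)      # ["token_embd.weight", "lens.proj.weight", "output_norm.weight", "output.weight"]
```

Because the description stresses that the allocation and naming order is what keeps the graph bit-identical, a table like this can be a convenient way to reason about a configuration before touching the real allocator.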
#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ Object
SEQ-MODE forward orchestration. The per-graph INPUT handles (token_ids, positions) are ALLOCATED BY THE CACHE before this call (cache-owned graph I/O, read by forward() and the uploaders) and passed in; ditto t_rope_freq_factors and t_attn_mask. The arch builds: get_rows(token_embed, token_ids) → x_embed (tap), optional projection-lens matmul(w_proj, x_embed) when seq_donor_d_in>0 (tap), the shared TransformerBlockCtx built ONCE, the block-stacking loop, final RMSNorm (tap), tied/untied logits matmul (tap). Returns the three per-graph output handles in a LlamaArchForwardOut.
# File 'lib/toy/llm/archs/llama_arch.rb', line 184

def build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask,
                  seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size,
                  seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled,
                  seq_t, seq_b, seq_n_layers, seq_has_untied_output)
  eps = seq_eps
  scale = 1.0 / Math.sqrt(seq_d_head.to_f)

  # Per-forward block context: the 14 config/handle values the block
  # body reads. Positional class (no keyword_init); matches the
  # TransformerBlockCtx member order exactly. Built once before the
  # block-stacking loop; shared (read-only) across all blocks.
  ctx = Toy::LLM::Blocks::TransformerBlockCtx.new(
    scale, eps, seq_n_kv, seq_n_heads, seq_group_size,
    seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled,
    t_positions, t_rope_freq_factors, self.seq_rope_cfg,
    seq_t, seq_b, t_attn_mask)

  x_embed = TinyNN.tnn_get_rows(sess, self.t_seq_token_embed, t_token_ids)
  TinyNN.tnn_set_output(x_embed)

  # E2.3: projection lens. ggml matmul(W, x) with W=[donor_d_in, d_model]
  # and x=[donor_d_in, T] gives [d_model, T] (contraction on ne[0]).
  if self.seq_donor_d_in > 0
    t_proj = TinyNN.tnn_matmul(sess, self.t_seq_w_proj, x_embed)
    TinyNN.tnn_set_output(t_proj)
    t_cur = t_proj
  else
    t_cur = x_embed
  end

  li_g = 0
  while li_g < seq_n_layers
    t_cur = self.seq_blocks_ffi[li_g].build_forward(sess, t_cur, ctx)
    li_g = li_g + 1
  end

  x_final = Toy::LLM::Primitives::RMSNorm.build(sess, t_cur, self.t_seq_final_norm_gamma, eps)
  TinyNN.tnn_set_output(x_final)

  if seq_has_untied_output
    logits = TinyNN.tnn_matmul(sess, self.t_seq_output, x_final)
  else
    logits = TinyNN.tnn_matmul(sess, self.t_seq_token_embed, x_final)
  end
  TinyNN.tnn_set_output(logits)

  Toy::LLM::Archs::LlamaArchForwardOut.new(x_embed, x_final, logits)
end
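The shape bookkeeping in build_forward can be checked by hand. The sketch below applies the contraction rule stated in the E2.3 comment (matmul contracts on ne[0]) to plain dimension arrays, and also evaluates the attention scale. `matmul_shape` is a hypothetical helper, all sizes are made up, and the `[d_model, vocab]` layout for output.weight is an assumption inferred from the same convention rather than something this page states.

```ruby
# Shape rule from the E2.3 comment: matmul(a, b) contracts on ne[0],
# so a = [k, m] and b = [k, n] give [m, n].
def matmul_shape(a, b)
  raise ArgumentError, "ne[0] mismatch: #{a[0]} vs #{b[0]}" unless a[0] == b[0]
  [a[1], b[1]]
end

# Illustrative sizes only; none of these come from a real model.
vocab, d_model, donor_d_in, seq_t, d_head = 100, 32, 48, 8, 64

# Attention scale, computed the same way as at the top of build_forward.
scale = 1.0 / Math.sqrt(d_head.to_f)

# Projection-lens path (seq_donor_d_in > 0).
x_embed = [donor_d_in, seq_t]    # rows gathered at donor width
w_proj  = [donor_d_in, d_model]
t_cur   = matmul_shape(w_proj, x_embed)

# Untied logits. ASSUMPTION: output.weight is laid out as [d_model, vocab],
# and the block stack plus RMSNorm leave the [d_model, T] shape unchanged.
x_final = t_cur
logits  = matmul_shape([d_model, vocab], x_final)

puts scale   # 0.125
p t_cur      # [32, 8]
p logits     # [100, 8]
```

Under these assumptions the lens matmul maps donor-width embeddings to model width before the first block, which is why the rest of the stack needs no knowledge of donor_d_in.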
#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ Object
Allocate the three arch-owned PERSISTENT global tensors from the mmap’d GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and — when untied — output.weight (2d, native type). The cache’s realize_for_mmap formerly ran this block inline (P2.6 pass-2 Step 1); it is moved VERBATIM here (same FFI primitives, same find_index/file_offset/type LITERAL string lookups at runtime, same untied conditional which the GGUF round-trip gate exercises true). The arch already OWNS these accessors (L68), so no new class / Struct / FFI :str at class load. Called ONLY from realize_for_mmap — the random_init globals and full_finetune’s else-branch globals are structurally different and are NOT routed through this helper. Mirrors the seed_blocks! / alloc_trainable_f32_weights! precedents.
# File 'lib/toy/llm/archs/llama_arch.rb', line 121

def load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied)
  eidx = TinyNN.tnn_gguf_find_index(gguf_handle, "token_embd.weight")
  eoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, eidx)
  etyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, eidx)
  @t_seq_token_embed = TinyNN.tnn_input_2d_persistent_mmap(sess, vocab, d_model, etyp, eoff)

  fnidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output_norm.weight")
  fnoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, fnidx)
  @t_seq_final_norm_gamma = TinyNN.tnn_input_1d_persistent_mmap(sess, d_model, 0, fnoff)

  if untied
    oidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output.weight")
    ooff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, oidx)
    otyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, oidx)
    @t_seq_output = TinyNN.tnn_input_2d_persistent_mmap(sess, vocab, d_model, otyp, ooff)
  end
end
#seed_blocks!(n_layers) ⇒ Object
Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks. The four cache realize paths each ran this identical loop verbatim (P2.6 Step 2): seed one block, then push n_layers-1 more. The SHAPE here therefore matches the former cache loop byte-for-byte (length == n_layers; first element is a fresh block, exactly like the cache’s `[TransformerBlock.new]` seed). The arch already OWNS this array (ctor seeds it with one block at L83) and already constructs TransformerBlock.new there, so no new class / Struct / FFI :str at class load. Each realize path now calls this via the cache’s seq_blocks_ffi delegator chain (self.seq_arch).
# File 'lib/toy/llm/archs/llama_arch.rb', line 100

def seed_blocks!(n_layers)
  @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
  li_init = 1
  while li_init < n_layers
    @seq_blocks_ffi.push(Toy::LLM::Blocks::TransformerBlock.new)
    li_init = li_init + 1
  end
end
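The seed-then-push loop has one property worth seeing in isolation: the array always contains the seed block, so its length equals n_layers only when n_layers is at least 1. The sketch below reproduces the loop with a stand-in class; `FakeBlock` and `seed_blocks` are hypothetical names and do not exist in Toy::LLM.

```ruby
# Stand-in for Toy::LLM::Blocks::TransformerBlock (hypothetical; the real
# class is not needed to show the loop's behavior).
class FakeBlock; end

# Mirrors the documented loop: seed one block, then push n_layers - 1 more.
def seed_blocks(n_layers)
  blocks = [FakeBlock.new]
  li = 1
  while li < n_layers
    blocks.push(FakeBlock.new)
    li += 1
  end
  blocks
end

blocks = seed_blocks(4)
puts blocks.length                        # 4
puts blocks.map(&:object_id).uniq.length  # 4 (every element is a distinct instance)
puts seed_blocks(0).length                # 1 (the seed block is always present)
```

Each element is a separate instance, which matters here because the forward pass indexes the array per layer and each block is expected to hold its own handles.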