Class: Toy::LLM::Archs::LlamaArch

Inherits:
Object
Defined in:
lib/toy/llm/archs/llama_arch.rb,
lib/toy/llm/archs/llama_arch_cuda.rb,
lib/toy/llm/archs/llama_arch_metal.rb

Overview

The Llama-family sequence-mode architecture. It owns the arch-level persistent handles; the cache realize paths allocate and assign them through cache delegators. Field names are unchanged from the former cache instance variables, so the cache-side realize, train, decode, and tap walkers keep working by accessor name.


Constructor Details

#initialize ⇒ LlamaArch

Returns a new instance of LlamaArch.



# File 'lib/toy/llm/archs/llama_arch.rb', line 92

def initialize
  @t_seq_token_embed      = TinyNN.tnn_null_ptr
  @t_seq_final_norm_gamma = TinyNN.tnn_null_ptr
  @t_seq_output           = TinyNN.tnn_null_ptr
  @t_seq_w_proj           = TinyNN.tnn_null_ptr
  # Seed with one block — matches the former cache init (L112).
  @seq_blocks_ffi         = [Toy::LLM::Blocks::TransformerBlock.new]
  # Phase 3 — parallel seed: one attention spec for the seed block.
  @seq_layer_specs        = [Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)]
  # Phase 5 — parallel int dispatch keys (KIND_ATTENTION for the seed).
  @seq_layer_kinds        = [Toy::LLM::Archs::LayerSpec::KIND_ATTENTION]
  # Phase 5 — parallel GDN-block slots. Seeded with GDNBlock placeholders so
  # the array is MONOMORPHIC (all GDNBlock) — the seam's KIND_GDN call site
  # never sees a mixed null/object array (Spinel poly-array landmine). At
  # KIND_ATTENTION layers the placeholder is simply never invoked.
  @seq_gdn_blocks_ffi     = [Toy::LLM::Blocks::GDNBlock.new]
  @seq_donor_d_in         = 0
  # The cache overwrites seq_rope_cfg with the real RoPE::Cfg before
  # build_forward runs (each realize prologue rebuilds it).
  @seq_rope_cfg           = TinyNN.tnn_null_ptr
end

Instance Attribute Details

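#seq_blocks_ffi ⇒ Object

Returns the per-layer array of Toy::LLM::Blocks::TransformerBlock instances. The constructor seeds it with one block; seed_blocks! rebuilds it with n_layers entries.
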
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_blocks_ffi
  @seq_blocks_ffi
end

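#seq_donor_d_in ⇒ Object

Returns the donor embedding width used by the projection lens. The constructor sets it to 0; build_forward applies the lens matmul only when it is greater than 0.
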
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_donor_d_in
  @seq_donor_d_in
end

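#seq_gdn_blocks_ffi ⇒ Object

Returns the parallel array of Toy::LLM::Blocks::GDNBlock slots. Every slot holds a GDNBlock placeholder so the array stays monomorphic; at KIND_ATTENTION layers the placeholder is never invoked.
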
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_gdn_blocks_ffi
  @seq_gdn_blocks_ffi
end

#seq_layer_kinds ⇒ Object

Returns the parallel array of int dispatch keys (LayerSpec::KIND_ATTENTION or LayerSpec::KIND_GDN), one per layer, that build_forward branches on.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_layer_kinds
  @seq_layer_kinds
end

#seq_layer_specs ⇒ Object

Returns the parallel array of Toy::LLM::Archs::LayerSpec descriptors, one per layer.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_layer_specs
  @seq_layer_specs
end

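#seq_rope_cfg ⇒ Object

Returns the RoPE configuration handle. The constructor sets it to a null pointer; the cache overwrites it with the real RoPE::Cfg before build_forward runs.
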
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_rope_cfg
  @seq_rope_cfg
end

#t_seq_final_norm_gamma ⇒ Object

Returns the persistent tensor handle for the final RMSNorm gamma (output_norm.weight). It is a null pointer until one of the global allocators assigns it.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_final_norm_gamma
  @t_seq_final_norm_gamma
end

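#t_seq_output ⇒ Object

Returns the persistent tensor handle for the untied output projection (output.weight). The allocators documented on this page assign it only when untied is true.
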
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_output
  @t_seq_output
end

#t_seq_token_embed ⇒ Object

Returns the persistent tensor handle for the token embedding (token_embd.weight). In the tied case build_forward also uses it for the logits matmul.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_token_embed
  @t_seq_token_embed
end

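#t_seq_w_proj ⇒ Object

Returns the persistent tensor handle for the projection-lens weight (lens.proj.weight). alloc_globals_trainable_f32! allocates it only when donor_d_in > 0.
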
# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_w_proj
  @t_seq_w_proj
end

Instance Method Details

#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ Object

P2-finish: the random-init (plus projection-lens) trainable-F32 global allocation, lifted verbatim from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init. The allocation order and the ft_add_global / ft_name_last_global order are unchanged, so the graph is bit-identical (gated by train_gate from-scratch and smoke_projection_lens).

The arch already owns these handles (the load_globals_from_gguf_mmap! precedent). The engine's @ft_globals_* recorders and the frozen-embed :str namer are called back through `cache`; the tnn_tensor_set_name :str FFI stays on the cache realize runtime path, the same discipline as ft_name_last / lora_name_q!.

donor_d_in > 0 selects the projection lens (a frozen donor-width embedding plus a trainable lens.proj); donor_d_in == 0 selects the standard layout.



# File 'lib/toy/llm/archs/llama_arch.rb', line 197

def alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied)
  if donor_d_in > 0
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, donor_d_in)
    cache.name_global!(self.t_seq_token_embed, "token_embd.weight")
    self.t_seq_w_proj = TinyNN.tnn_input_2d_f32_persistent(sess, d_model, donor_d_in)
    cache.ft_add_global_2d(self.t_seq_w_proj, d_model, donor_d_in)
    cache.ft_name_last_global("lens.proj.weight")
  else
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_token_embed, vocab, d_model)
    cache.ft_name_last_global("token_embd.weight")
  end

  self.t_seq_final_norm_gamma = TinyNN.tnn_input_1d_f32_persistent(sess, d_model)
  cache.ft_add_global_1d(self.t_seq_final_norm_gamma)
  cache.ft_name_last_global("output_norm.weight")

  if untied
    self.t_seq_output = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_output, vocab, d_model)
    cache.ft_name_last_global("output.weight")
  end
end
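
The branching above can be summarised as an allocation plan. The following is a minimal sketch, not the library's code: GlobalPlan and trainable_f32_global_plan are hypothetical names, nothing here calls TinyNN, and the ft_registered flag is only a reading of which handles the method passes to cache.ft_add_global_* (the lens-mode embedding goes through cache.name_global! instead).

```ruby
# Hypothetical stand-in: records which globals the method would allocate,
# in order, with the dimensions it passes to the allocator.
GlobalPlan = Struct.new(:name, :dims, :ft_registered)

def trainable_f32_global_plan(vocab, d_model, donor_d_in, untied)
  plan = []
  if donor_d_in > 0
    # Projection lens: donor-width embedding (named only, not registered
    # with the fine-tune recorders) plus the registered lens projection.
    plan << GlobalPlan.new("token_embd.weight", [vocab, donor_d_in], false)
    plan << GlobalPlan.new("lens.proj.weight", [d_model, donor_d_in], true)
  else
    plan << GlobalPlan.new("token_embd.weight", [vocab, d_model], true)
  end
  plan << GlobalPlan.new("output_norm.weight", [d_model], true)
  plan << GlobalPlan.new("output.weight", [vocab, d_model], true) if untied
  plan
end

standard = trainable_f32_global_plan(32, 8, 0, false)
lens     = trainable_f32_global_plan(32, 8, 16, true)
puts standard.map(&:name).inspect
puts lens.map(&:name).inspect
```

The sizes (32, 8, 16) are arbitrary illustration values. The point of the sketch is the ordering: the docstring ties the bit-identical graph to this order staying unchanged.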

#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ Object

Sequence-mode forward orchestration. The per-graph input handles (token_ids, positions) are allocated by the cache before this call (cache-owned graph I/O, read by forward() and the uploaders) and passed in; the same holds for t_rope_freq_factors and t_attn_mask.

The arch builds, in order: get_rows(token_embed, token_ids) → x_embed (tap); an optional projection-lens matmul(w_proj, x_embed) when seq_donor_d_in > 0 (tap); the shared TransformerBlockCtx, built once; the block-stacking loop; the final RMSNorm (tap); and the tied or untied logits matmul (tap).

Returns the three per-graph output handles (x_embed, x_final, logits) in a LlamaArchForwardOut.



# File 'lib/toy/llm/archs/llama_arch.rb', line 230

def build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors,
                  t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads,
                  seq_group_size, seq_has_qkv_bias, seq_weight_dtype,
                  seq_lora_q_enabled, seq_t, seq_b, seq_n_layers,
                  seq_has_untied_output)
  eps   = seq_eps
  scale = 1.0 / Math.sqrt(seq_d_head.to_f)

  # Per-forward block context: the 14 config/handle values the block
  # body reads. Positional class (no keyword_init) — matches the
  # TransformerBlockCtx member order exactly. Built once before the
  # block-stacking loop; shared (read-only) across all blocks.
  ctx = Toy::LLM::Blocks::TransformerBlockCtx.new(
    scale, eps, seq_n_kv, seq_n_heads, seq_group_size,
    seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled,
    t_positions, t_rope_freq_factors, self.seq_rope_cfg,
    seq_t, seq_b, t_attn_mask)

  x_embed = TinyNN.tnn_get_rows(sess, self.t_seq_token_embed, t_token_ids)
  TinyNN.tnn_set_output(x_embed)

  # E2.3 — projection lens. ggml matmul(W, x) with W=[donor_d_in, d_model]
  # and x=[donor_d_in, T] gives [d_model, T] (contraction on ne[0]).
  if self.seq_donor_d_in > 0
    t_proj = TinyNN.tnn_matmul(sess, self.t_seq_w_proj, x_embed)
    TinyNN.tnn_set_output(t_proj)
    t_cur = t_proj
  else
    t_cur = x_embed
  end
  li_g = 0
  while li_g < seq_n_layers
    # Phase 3: per-layer descriptor dispatch. The branch compares a FLAT
    # INT (read from seq_layer_kinds) and each arm calls a CONCRETE typed
    # block method, so every .build_forward call site stays monomorphic (one
    # receiver class). KIND_ATTENTION reads seq_blocks_ffi; KIND_GDN has its
    # own arm and its own typed block array (seq_gdn_blocks_ffi). Unknown
    # kinds fail loud rather than silently building the wrong graph
    # (never-mask rule).
    spec_kind = self.seq_layer_kinds[li_g]
    if spec_kind == Toy::LLM::Archs::LayerSpec::KIND_ATTENTION
      t_cur = self.seq_blocks_ffi[li_g].build_forward(sess, t_cur, ctx)
    elsif spec_kind == Toy::LLM::Archs::LayerSpec::KIND_GDN
      # Concrete typed call into the parallel GDN array. The GDN block reads
      # its own dims (set at alloc); seq_t and eps are passed explicitly
      # (the same values the shared ctx carries).
      t_cur = self.seq_gdn_blocks_ffi[li_g].build_forward(sess, t_cur, seq_t, eps)
    else
      raise "LlamaArch#build_forward: unsupported layer kind #{spec_kind} at layer #{li_g}"
    end
    li_g = li_g + 1
  end

  x_final = Toy::LLM::Primitives::RMSNorm.build(sess, t_cur, self.t_seq_final_norm_gamma, eps)
  TinyNN.tnn_set_output(x_final)

  if seq_has_untied_output
    logits = TinyNN.tnn_matmul(sess, self.t_seq_output, x_final)
  else
    logits = TinyNN.tnn_matmul(sess, self.t_seq_token_embed, x_final)
  end
  TinyNN.tnn_set_output(logits)

  Toy::LLM::Archs::LlamaArchForwardOut.new(x_embed, x_final, logits)
end
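
The block-stacking loop's flat-int dispatch can be illustrated in isolation. This is a sketch under stated assumptions: FakeAttentionBlock, FakeGdnBlock, stack_layers, and the constant values 0 and 1 are hypothetical stand-ins (the real constants live on LayerSpec), and Strings stand in for tensor handles so the call order is visible.

```ruby
# Hypothetical stand-ins for the two block classes. Each wraps its input
# in a label instead of building graph nodes.
class FakeAttentionBlock
  def build_forward(x)
    "attn(#{x})"
  end
end

class FakeGdnBlock
  def build_forward(x)
    "gdn(#{x})"
  end
end

KIND_ATTENTION = 0 # assumed value, for illustration only
KIND_GDN       = 1 # assumed value, for illustration only

# Walk the int kinds array; each arm indexes its own typed array, so each
# call site only ever sees one receiver class.
def stack_layers(x, kinds, attn_blocks, gdn_blocks)
  i = 0
  while i < kinds.length
    kind = kinds[i]
    if kind == KIND_ATTENTION
      x = attn_blocks[i].build_forward(x)
    elsif kind == KIND_GDN
      x = gdn_blocks[i].build_forward(x)
    else
      raise "unsupported layer kind #{kind} at layer #{i}"
    end
    i += 1
  end
  x
end

kinds  = [KIND_ATTENTION, KIND_GDN, KIND_ATTENTION]
attn   = Array.new(3) { FakeAttentionBlock.new }
gdn    = Array.new(3) { FakeGdnBlock.new }
result = stack_layers("x", kinds, attn, gdn)
puts result
```

Both block arrays have one slot per layer, including layers of the other kind. The unused slots are placeholders that are never invoked, which is the trade the constructor comments describe for keeping each array monomorphic.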

#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ Object

Allocates the three arch-owned persistent global tensors from the mmap'd GGUF: token_embd.weight (2d, native type), output_norm.weight (1d, f32), and, when untied, output.weight (2d, native type).

The cache's realize_for_mmap formerly ran this block inline (P2.6 pass-2 Step 1); it is moved verbatim here: same FFI primitives, same find_index / file_offset / type literal-string lookups at runtime, and the same untied conditional, which the GGUF round-trip gate exercises as true. The arch already owns these accessors (L68), so no new class / Struct / FFI :str is introduced at class load.

Called only from realize_for_mmap. The random_init globals and full_finetune's else-branch globals are structurally different and are not routed through this helper. Mirrors the seed_blocks! / alloc_trainable_f32_weights! precedents.



# File 'lib/toy/llm/archs/llama_arch.rb', line 167

def load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied)
  eidx = TinyNN.tnn_gguf_find_index(gguf_handle, "token_embd.weight")
  eoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, eidx)
  etyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, eidx)
  @t_seq_token_embed = TinyNN.tnn_input_2d_persistent_mmap(sess,
                         vocab, d_model, etyp, eoff)

  fnidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output_norm.weight")
  fnoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, fnidx)
  @t_seq_final_norm_gamma = TinyNN.tnn_input_1d_persistent_mmap(sess,
                              d_model, 0, fnoff)

  if untied
    oidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output.weight")
    ooff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, oidx)
    otyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, oidx)
    @t_seq_output = TinyNN.tnn_input_2d_persistent_mmap(sess,
                      vocab, d_model, otyp, ooff)
  end
end
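
The lookup sequence (name → index → type and file offset) can be sketched without the FFI. Everything below is a hypothetical stand-in: FakeTensorInfo, find_index, plan_mmap_globals, the type codes, and the offsets are invented for illustration, and the missing-tensor check is an addition that the method above does not perform.

```ruby
# Hypothetical in-memory stand-in for a GGUF tensor directory. The real
# lookups go through TinyNN.tnn_gguf_* and are not reproduced here.
FakeTensorInfo = Struct.new(:name, :type, :file_offset)

def find_index(directory, name)
  directory.index { |t| t.name == name }
end

# Resolve the globals in the same order as the method above; output.weight
# is looked up only when the model has an untied output projection.
def plan_mmap_globals(directory, untied)
  names = ["token_embd.weight", "output_norm.weight"]
  names << "output.weight" if untied
  names.map do |name|
    idx = find_index(directory, name)
    raise "missing tensor #{name}" if idx.nil? # added check (assumption)
    info = directory[idx]
    { name: name, type: info.type, file_offset: info.file_offset }
  end
end

directory = [
  FakeTensorInfo.new("output_norm.weight", 0, 4096),
  FakeTensorInfo.new("token_embd.weight", 8, 8192),
  FakeTensorInfo.new("output.weight", 8, 65536)
]
tied_plan   = plan_mmap_globals(directory, false)
untied_plan = plan_mmap_globals(directory, true)
puts tied_plan.map { |g| g[:name] }.inspect
puts untied_plan.map { |g| g[:file_offset] }.inspect
```

The directory order is deliberately different from the lookup order, to show that the plan follows the literal name lookups rather than the file layout.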

#seed_blocks!(n_layers) ⇒ Object

Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks. The four cache realize paths each ran this identical loop verbatim (P2.6 Step 2): seed one block, then push n_layers-1 more. The shape here therefore matches the former cache loop byte-for-byte (length == n_layers; the first element is a fresh block, exactly like the cache's `[TransformerBlock.new]` seed). The parallel @seq_layer_specs, @seq_layer_kinds, and @seq_gdn_blocks_ffi arrays are seeded in lockstep, with every layer KIND_ATTENTION.

The arch already owns this array (the constructor seeds it with one block at L83) and already constructs TransformerBlock.new there, so no new class / Struct / FFI :str is introduced at class load. Each realize path now calls this via the cache's seq_blocks_ffi delegator chain (self.seq_arch).

# File 'lib/toy/llm/archs/llama_arch.rb', line 137

def seed_blocks!(n_layers)
  @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
  # Phase 3: seed the parallel spec array in lockstep. Every layer is
  # KIND_ATTENTION for now (the homogeneous-Llama refactor gate); Phase 5
  # overwrites individual entries with KIND_GDN for Dragon's pattern.
  @seq_layer_specs = [Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)]
  @seq_gdn_blocks_ffi = [Toy::LLM::Blocks::GDNBlock.new]
  @seq_layer_kinds = [Toy::LLM::Archs::LayerSpec::KIND_ATTENTION]
  li_init = 1
  while li_init < n_layers
    @seq_blocks_ffi.push(Toy::LLM::Blocks::TransformerBlock.new)
    @seq_layer_specs.push(Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION))
    @seq_gdn_blocks_ffi.push(Toy::LLM::Blocks::GDNBlock.new)
    @seq_layer_kinds.push(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)
    li_init = li_init + 1
  end
end

#set_gdn_layer!(idx) ⇒ Object

Phase 5 hybrid: rebuild the per-layer spec array from a per-layer GDN bool flag, using the LayerSpec constructor (never the .kind= setter: mutating LayerSpec.kind elsewhere while build_forward reads it trips a Spinel codegen miscompile that corrupts the token-id finalize). Called after seed_blocks!, before alloc.

Mark ONE layer as GDN. Takes an INT index (never an array param: a function-parameter array trips the Spinel #688 type-lock landmine, which here manifests as a token-id-finalize codegen miscompile). Mutates the plain int dispatch array element (proven-safe).



# File 'lib/toy/llm/archs/llama_arch.rb', line 133

def set_gdn_layer!(idx)
  @seq_layer_kinds[idx] = Toy::LLM::Archs::LayerSpec::KIND_GDN
end
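
The seed-then-mark sequence (seed_blocks!, then set_gdn_layer! per GDN layer) can be sketched with a small stand-in. LayerTable, seed!, and the constant values are hypothetical; the sketch keeps only the int dispatch array and two placeholder arrays, omits the LayerSpec array, and adds a bounds check that the method above does not have. It also uses Array.new where the original seeds one element and pushes the rest, so unlike the original it yields an empty array for n_layers == 0.

```ruby
# Hypothetical stand-in for the arch's parallel per-layer arrays.
class LayerTable
  KIND_ATTENTION = 0 # assumed value, for illustration only
  KIND_GDN       = 1 # assumed value, for illustration only

  attr_reader :kinds, :attn_blocks, :gdn_blocks

  def initialize
    seed!(1)
  end

  # Rebuild every parallel array with exactly n_layers entries, all
  # marked as attention layers.
  def seed!(n_layers)
    @kinds       = Array.new(n_layers) { KIND_ATTENTION }
    @attn_blocks = Array.new(n_layers) { Object.new }
    @gdn_blocks  = Array.new(n_layers) { Object.new }
  end

  # Mark one layer as GDN by int index. Only the int dispatch entry
  # changes; the block arrays keep their placeholders.
  def set_gdn_layer!(idx)
    unless idx >= 0 && idx < @kinds.length
      raise IndexError, "layer #{idx} out of range" # added check (assumption)
    end
    @kinds[idx] = KIND_GDN
  end
end

table = LayerTable.new
table.seed!(4)
table.set_gdn_layer!(2)
puts table.kinds.inspect
```

Calling seed! again resets every layer to attention, so in this sketch any GDN marks have to be reapplied after reseeding.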