Class: Toy::LLM::Archs::LlamaArch

Inherits:
Object
Defined in:
lib/toy/llm/archs/llama_arch.rb,
lib/toy/llm/archs/llama_arch_cuda.rb,
lib/toy/llm/archs/llama_arch_metal.rb

Overview

The llama-family sequence-mode architecture. It owns the arch-level persistent handles; the cache realize paths allocate and assign them through cache delegators. Field names are unchanged from the former cache ivars, so the cache-side realize / train / decode / tap walkers keep working by accessor name.
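
The delegator arrangement described above can be pictured with a minimal sketch. SketchArch and SketchCache are hypothetical stand-ins (not classes from this library), and the hand-written forwarding methods are an assumption about how such delegators might look; the point is only that callers keep using the old accessor names while the arch holds the state.

```ruby
# Hypothetical stand-ins: SketchArch and SketchCache are illustrative names,
# not classes from the library.
class SketchArch
  attr_accessor :seq_blocks_ffi, :seq_donor_d_in

  def initialize
    @seq_blocks_ffi = [:block0]
    @seq_donor_d_in = 0
  end
end

class SketchCache
  attr_reader :seq_arch

  def initialize
    @seq_arch = SketchArch.new
  end

  # Delegators keep the former accessor names, so code that reads
  # cache.seq_blocks_ffi does not need to know the arch owns the field.
  def seq_blocks_ffi
    @seq_arch.seq_blocks_ffi
  end

  def seq_donor_d_in
    @seq_arch.seq_donor_d_in
  end

  def seq_donor_d_in=(value)
    @seq_arch.seq_donor_d_in = value
  end
end

cache = SketchCache.new
cache.seq_donor_d_in = 64
puts cache.seq_arch.seq_donor_d_in # the write landed on the arch
puts cache.seq_blocks_ffi.length
```

Because the cache never copies the field, a walker that goes through the cache and one that goes through the arch observe the same object.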

Constructor Details

#initialize ⇒ LlamaArch

Returns a new instance of LlamaArch.



# File 'lib/toy/llm/archs/llama_arch.rb', line 77

def initialize
  @t_seq_token_embed      = TinyNN.tnn_null_ptr
  @t_seq_final_norm_gamma = TinyNN.tnn_null_ptr
  @t_seq_output           = TinyNN.tnn_null_ptr
  @t_seq_w_proj           = TinyNN.tnn_null_ptr
  # Seed with one block — matches the former cache init (L112).
  @seq_blocks_ffi         = [Toy::LLM::Blocks::TransformerBlock.new]
  @seq_donor_d_in         = 0
  # The cache overwrites seq_rope_cfg with the real RoPE::Cfg before
  # build_forward runs (each realize prologue rebuilds it).
  @seq_rope_cfg           = TinyNN.tnn_null_ptr
end

Instance Attribute Details

#seq_blocks_ffi ⇒ Object

Returns the value of attribute seq_blocks_ffi.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_blocks_ffi
  @seq_blocks_ffi
end

#seq_donor_d_in ⇒ Object

Returns the value of attribute seq_donor_d_in.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_donor_d_in
  @seq_donor_d_in
end

#seq_rope_cfg ⇒ Object

Returns the value of attribute seq_rope_cfg.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_rope_cfg
  @seq_rope_cfg
end

#t_seq_final_norm_gamma ⇒ Object

Returns the value of attribute t_seq_final_norm_gamma.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_final_norm_gamma
  @t_seq_final_norm_gamma
end

#t_seq_output ⇒ Object

Returns the value of attribute t_seq_output.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_output
  @t_seq_output
end

#t_seq_token_embed ⇒ Object

Returns the value of attribute t_seq_token_embed.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_token_embed
  @t_seq_token_embed
end

#t_seq_w_proj ⇒ Object

Returns the value of attribute t_seq_w_proj.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_w_proj
  @t_seq_w_proj
end

Instance Method Details

#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ Object

P2-finish: the random-init (plus projection-lens) trainable-F32 global allocation, lifted verbatim from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init. The order of the allocs and of the ft_add_global / ft_name_last_global calls is unchanged, so the resulting graph is bit-identical (gated by train_gate from-scratch and smoke_projection_lens). The arch already owns these handles (the load_globals_from_gguf_mmap! precedent). The engine's @ft_globals_* recorders and the frozen-embed :str namer are called back through `cache`: the tnn_tensor_set_name :str FFI stays on the cache realize runtime path, the same discipline as ft_name_last / lora_name_q!. donor_d_in > 0 selects the projection lens (frozen donor-width embed plus trainable lens.proj); donor_d_in == 0 selects the standard path.



# File 'lib/toy/llm/archs/llama_arch.rb', line 151

def alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied)
  if donor_d_in > 0
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, donor_d_in)
    cache.name_global!(self.t_seq_token_embed, "token_embd.weight")
    self.t_seq_w_proj = TinyNN.tnn_input_2d_f32_persistent(sess, d_model, donor_d_in)
    cache.ft_add_global_2d(self.t_seq_w_proj, d_model, donor_d_in)
    cache.ft_name_last_global("lens.proj.weight")
  else
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_token_embed, vocab, d_model)
    cache.ft_name_last_global("token_embd.weight")
  end

  self.t_seq_final_norm_gamma = TinyNN.tnn_input_1d_f32_persistent(sess, d_model)
  cache.ft_add_global_1d(self.t_seq_final_norm_gamma)
  cache.ft_name_last_global("output_norm.weight")

  if untied
    self.t_seq_output = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_output, vocab, d_model)
    cache.ft_name_last_global("output.weight")
  end
end
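
To make the registration order easier to see, here is a minimal sketch that replays the same branching against a recording stand-in. RecordingCache and sketch_alloc_globals are hypothetical names introduced for illustration, symbols take the place of the TinyNN tensor handles, and the sizes are made up; nothing here calls the real FFI.

```ruby
# Hypothetical recorder standing in for the cache; it only logs the calls.
class RecordingCache
  attr_reader :frozen, :trainable

  def initialize
    @frozen = []
    @trainable = []
  end

  def name_global!(_handle, name)
    @frozen.push(name)
  end

  def ft_add_global_2d(_handle, rows, cols)
    @trainable.push({ name: nil, dims: [rows, cols] })
  end

  def ft_add_global_1d(_handle)
    @trainable.push({ name: nil, dims: nil })
  end

  def ft_name_last_global(name)
    @trainable.last[:name] = name
  end
end

# Mirrors the branching of alloc_globals_trainable_f32!, with symbols in
# place of the persistent tensor handles.
def sketch_alloc_globals(cache, vocab, d_model, donor_d_in, untied)
  if donor_d_in > 0
    cache.name_global!(:token_embed, "token_embd.weight") # frozen, not trainable
    cache.ft_add_global_2d(:w_proj, d_model, donor_d_in)
    cache.ft_name_last_global("lens.proj.weight")
  else
    cache.ft_add_global_2d(:token_embed, vocab, d_model)
    cache.ft_name_last_global("token_embd.weight")
  end
  cache.ft_add_global_1d(:final_norm_gamma)
  cache.ft_name_last_global("output_norm.weight")
  if untied
    cache.ft_add_global_2d(:output, vocab, d_model)
    cache.ft_name_last_global("output.weight")
  end
  cache
end

lens = sketch_alloc_globals(RecordingCache.new, 1000, 64, 96, true)
std  = sketch_alloc_globals(RecordingCache.new, 1000, 64, 0, false)
puts lens.trainable.map { |g| g[:name] }.inspect
puts lens.frozen.inspect
puts std.trainable.map { |g| g[:name] }.inspect
```

In the lens case the embedding is only named, never added to the trainable list, which is what "frozen donor-width embed" amounts to in this sketch.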

#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ Object

SEQ-MODE forward orchestration. The per-graph input handles (token_ids, positions) are allocated by the cache before this call (cache-owned graph I/O, read by forward() and the uploaders) and passed in, as are t_rope_freq_factors and t_attn_mask. The arch builds, in order: get_rows(token_embed, token_ids) → x_embed (tap); an optional projection-lens matmul(w_proj, x_embed) when seq_donor_d_in > 0 (tap); the shared TransformerBlockCtx, built once; the block-stacking loop; the final RMSNorm (tap); and the tied or untied logits matmul (tap). Returns the three per-graph output handles (x_embed, x_final, logits) in a LlamaArchForwardOut.



# File 'lib/toy/llm/archs/llama_arch.rb', line 184

def build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors,
                  t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads,
                  seq_group_size, seq_has_qkv_bias, seq_weight_dtype,
                  seq_lora_q_enabled, seq_t, seq_b, seq_n_layers,
                  seq_has_untied_output)
  eps   = seq_eps
  scale = 1.0 / Math.sqrt(seq_d_head.to_f)

  # Per-forward block context: the 14 config/handle values the block
  # body reads. Positional class (no keyword_init) — matches the
  # TransformerBlockCtx member order exactly. Built once before the
  # block-stacking loop; shared (read-only) across all blocks.
  ctx = Toy::LLM::Blocks::TransformerBlockCtx.new(
    scale, eps, seq_n_kv, seq_n_heads, seq_group_size,
    seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled,
    t_positions, t_rope_freq_factors, self.seq_rope_cfg,
    seq_t, seq_b, t_attn_mask)

  x_embed = TinyNN.tnn_get_rows(sess, self.t_seq_token_embed, t_token_ids)
  TinyNN.tnn_set_output(x_embed)

  # E2.3 — projection lens. ggml matmul(W, x) with W=[donor_d_in, d_model]
  # and x=[donor_d_in, T] gives [d_model, T] (contraction on ne[0]).
  if self.seq_donor_d_in > 0
    t_proj = TinyNN.tnn_matmul(sess, self.t_seq_w_proj, x_embed)
    TinyNN.tnn_set_output(t_proj)
    t_cur = t_proj
  else
    t_cur = x_embed
  end
  li_g = 0
  while li_g < seq_n_layers
    t_cur = self.seq_blocks_ffi[li_g].build_forward(sess, t_cur, ctx)
    li_g = li_g + 1
  end

  x_final = Toy::LLM::Primitives::RMSNorm.build(sess, t_cur, self.t_seq_final_norm_gamma, eps)
  TinyNN.tnn_set_output(x_final)

  if seq_has_untied_output
    logits = TinyNN.tnn_matmul(sess, self.t_seq_output, x_final)
  else
    logits = TinyNN.tnn_matmul(sess, self.t_seq_token_embed, x_final)
  end
  TinyNN.tnn_set_output(logits)

  Toy::LLM::Archs::LlamaArchForwardOut.new(x_embed, x_final, logits)
end
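
The shape bookkeeping in the projection-lens comment can be checked with a small sketch. mul_mat_shape is a hypothetical helper that encodes the contraction rule stated in that comment (both operands share ne[0]); the sizes are made up, and the weight shapes follow the W=[donor_d_in, d_model] convention from the comment rather than anything verified against TinyNN.

```ruby
# Contraction on ne[0], as in the comment above:
# a = [k, n], b = [k, m]  ->  result = [n, m].
def mul_mat_shape(a_ne, b_ne)
  unless a_ne[0] == b_ne[0]
    raise ArgumentError, "ne[0] mismatch: #{a_ne[0]} vs #{b_ne[0]}"
  end
  [a_ne[1], b_ne[1]]
end

# Illustrative sizes only.
vocab      = 32_000
d_model    = 512
donor_d_in = 768
seq_t      = 16
d_head     = 64

x_embed = [donor_d_in, seq_t]                            # rows gathered at donor width
t_proj  = mul_mat_shape([donor_d_in, d_model], x_embed)  # lens: back to d_model
logits  = mul_mat_shape([d_model, vocab], t_proj)        # output weight at d_model width
scale   = 1.0 / Math.sqrt(d_head.to_f)                   # attention scale from build_forward

puts t_proj.inspect
puts logits.inspect
puts scale
```

Under these assumed shapes, a donor-width embedding ([donor_d_in, vocab]) would not contract with d_model-wide activations unless donor_d_in equals d_model, which is why the sketch uses a d_model-wide output weight for the logits.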

#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ Object

Allocate the three arch-owned persistent global tensors from the mmap'd GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and, when untied, output.weight (2d, native type). The cache's realize_for_mmap formerly ran this block inline (P2.6 pass-2 Step 1); it is moved here verbatim: same FFI primitives, same find_index / file_offset / type literal-string lookups at runtime, and the same untied conditional, which the GGUF round-trip gate exercises with untied true. The arch already owns these accessors (L68), so no new class / Struct / FFI :str is introduced at class load. Called only from realize_for_mmap; the random_init globals and full_finetune's else-branch globals are structurally different and are not routed through this helper. Mirrors the seed_blocks! / alloc_trainable_f32_weights! precedents.



# File 'lib/toy/llm/archs/llama_arch.rb', line 121

def load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied)
  eidx = TinyNN.tnn_gguf_find_index(gguf_handle, "token_embd.weight")
  eoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, eidx)
  etyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, eidx)
  @t_seq_token_embed = TinyNN.tnn_input_2d_persistent_mmap(sess,
                         vocab, d_model, etyp, eoff)

  fnidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output_norm.weight")
  fnoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, fnidx)
  @t_seq_final_norm_gamma = TinyNN.tnn_input_1d_persistent_mmap(sess,
                              d_model, 0, fnoff)

  if untied
    oidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output.weight")
    ooff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, oidx)
    otyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, oidx)
    @t_seq_output = TinyNN.tnn_input_2d_persistent_mmap(sess,
                      vocab, d_model, otyp, ooff)
  end
end
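
The lookup sequence above (name → index → file offset and type) can be sketched against an in-memory stand-in. SketchTensor, SketchGguf, and sketch_global_plan are hypothetical names, the -1 "not found" convention and the missing-tensor guard are assumptions of this sketch (the method above performs no such check), and the offsets are made up.

```ruby
# Hypothetical in-memory stand-in for a GGUF tensor directory.
SketchTensor = Struct.new(:name, :type, :file_offset)

class SketchGguf
  def initialize(tensors)
    @tensors = tensors
  end

  # Returns the tensor index, or -1 when absent (assumed convention).
  def find_index(name)
    @tensors.index { |t| t.name == name } || -1
  end

  def tensor_type(idx)
    @tensors[idx].type
  end

  def tensor_file_offset(idx)
    @tensors[idx].file_offset
  end
end

SKETCH_F32 = 0 # the literal 0 passed for output_norm.weight above

# Returns [name, type, offset] triples in the order the method allocates.
def sketch_global_plan(gguf, untied)
  lookup = lambda do |name, fixed_type|
    idx = gguf.find_index(name)
    raise KeyError, "missing tensor #{name}" if idx < 0 # guard added by this sketch
    [name, fixed_type || gguf.tensor_type(idx), gguf.tensor_file_offset(idx)]
  end
  plan = [lookup.call("token_embd.weight", nil),
          lookup.call("output_norm.weight", SKETCH_F32)]
  plan.push(lookup.call("output.weight", nil)) if untied
  plan
end

gguf = SketchGguf.new([
  SketchTensor.new("output_norm.weight", 0, 4096),
  SketchTensor.new("token_embd.weight", 1, 8192),
  SketchTensor.new("output.weight", 1, 65_536)
])
tied_plan   = sketch_global_plan(gguf, false)
untied_plan = sketch_global_plan(gguf, true)
puts untied_plan.inspect
```

The allocation order follows the method, not the order of tensors in the file, and the norm's type is fixed rather than looked up.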

#seed_blocks!(n_layers) ⇒ Object

Reset @seq_blocks_ffi and fill it with n_layers fresh TransformerBlocks. The four cache realize paths each ran this identical loop verbatim (P2.6 Step 2): seed one block, then push n_layers-1 more. The shape here therefore matches the former cache loop byte-for-byte: the first element is a fresh block, exactly like the cache's `[TransformerBlock.new]` seed, and length == n_layers for n_layers >= 1 (because one block is always seeded, the array never has fewer than one element). The arch already owns this array (the ctor seeds it with one block at L83) and already constructs TransformerBlock.new there, so no new class / Struct / FFI :str is introduced at class load. Each realize path now calls this via the cache's seq_blocks_ffi delegator chain (self.seq_arch).



# File 'lib/toy/llm/archs/llama_arch.rb', line 100

def seed_blocks!(n_layers)
  @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
  li_init = 1
  while li_init < n_layers
    @seq_blocks_ffi.push(Toy::LLM::Blocks::TransformerBlock.new)
    li_init = li_init + 1
  end
end
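
A minimal sketch of the same seed-then-push loop, using a hypothetical SketchBlock in place of Toy::LLM::Blocks::TransformerBlock, shows the resulting lengths, including the case where n_layers is below one.

```ruby
# Hypothetical stand-in for TransformerBlock.
class SketchBlock; end

# Same loop shape as seed_blocks!: seed one block, then push the rest.
def sketch_seed_blocks(n_layers)
  blocks = [SketchBlock.new]
  li = 1
  while li < n_layers
    blocks.push(SketchBlock.new)
    li += 1
  end
  blocks
end

four = sketch_seed_blocks(4)
zero = sketch_seed_blocks(0)
puts four.length
puts zero.length                    # the seed block is always present
puts four.uniq(&:object_id).length  # every element is a distinct object
```

Each call builds a new array, so reseeding discards the previous blocks rather than reusing them.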