Class: Toy::LLM::Archs::LlamaArch

Inherits:

Object

Defined in: lib/toy/llm/archs/llama_arch.rb,
lib/toy/llm/archs/llama_arch_cuda.rb,
lib/toy/llm/archs/llama_arch_metal.rb

Overview

The llama-family sequence-mode arch. It owns the arch-level persistent handles; the cache realize paths allocate and assign them through cache delegators. Field names are UNCHANGED from the former cache ivars, so the cache-side realize / train / decode / tap walkers keep working by accessor name.

Instance Attribute Summary

#seq_blocks_ffi ⇒ Object

Returns the value of attribute seq_blocks_ffi.

#seq_donor_d_in ⇒ Object

Returns the value of attribute seq_donor_d_in.

#seq_rope_cfg ⇒ Object

Returns the value of attribute seq_rope_cfg.

#t_seq_final_norm_gamma ⇒ Object

Returns the value of attribute t_seq_final_norm_gamma.

#t_seq_output ⇒ Object

Returns the value of attribute t_seq_output.

#t_seq_token_embed ⇒ Object

Returns the value of attribute t_seq_token_embed.

#t_seq_w_proj ⇒ Object

Returns the value of attribute t_seq_w_proj.

Instance Method Summary

#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ Object

P2-finish: the RANDOM-INIT (+ projection-lens) trainable-F32 GLOBAL alloc, lifted VERBATIM from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init (alloc + ft_add_global / ft_name_last_global ORDER unchanged → bit-identical graph; gated by train_gate from-scratch + smoke_projection_lens).

#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ Object

SEQ-MODE forward orchestration.

#initialize ⇒ LlamaArch constructor

A new instance of LlamaArch.

#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ Object

Allocate the three arch-owned PERSISTENT global tensors from the mmap’d GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and, when untied, output.weight (2d, native type).

#seed_blocks!(n_layers) ⇒ Object

Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks.

Constructor Details

#initialize ⇒ `LlamaArch`

Returns a new instance of LlamaArch.

# File 'lib/toy/llm/archs/llama_arch.rb', line 77

def initialize
  @t_seq_token_embed      = TinyNN.tnn_null_ptr
  @t_seq_final_norm_gamma = TinyNN.tnn_null_ptr
  @t_seq_output           = TinyNN.tnn_null_ptr
  @t_seq_w_proj           = TinyNN.tnn_null_ptr
  # Seed with one block — matches the former cache init (L112).
  @seq_blocks_ffi         = [Toy::LLM::Blocks::TransformerBlock.new]
  @seq_donor_d_in         = 0
  # The cache overwrites seq_rope_cfg with the real RoPE::Cfg before
  # build_forward runs (each realize prologue rebuilds it).
  @seq_rope_cfg           = TinyNN.tnn_null_ptr
end

Instance Attribute Details

#seq_blocks_ffi ⇒ `Object`

Returns the value of attribute seq_blocks_ffi.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_blocks_ffi
  @seq_blocks_ffi
end

#seq_donor_d_in ⇒ `Object`

Returns the value of attribute seq_donor_d_in.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_donor_d_in
  @seq_donor_d_in
end

#seq_rope_cfg ⇒ `Object`

Returns the value of attribute seq_rope_cfg.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_rope_cfg
  @seq_rope_cfg
end

#t_seq_final_norm_gamma ⇒ `Object`

Returns the value of attribute t_seq_final_norm_gamma.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_final_norm_gamma
  @t_seq_final_norm_gamma
end

#t_seq_output ⇒ `Object`

Returns the value of attribute t_seq_output.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_output
  @t_seq_output
end

#t_seq_token_embed ⇒ `Object`

Returns the value of attribute t_seq_token_embed.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_token_embed
  @t_seq_token_embed
end

#t_seq_w_proj ⇒ `Object`

Returns the value of attribute t_seq_w_proj.

# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_w_proj
  @t_seq_w_proj
end

Instance Method Details

#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ `Object`

P2-finish: the RANDOM-INIT (+ projection-lens) trainable-F32 GLOBAL alloc, lifted VERBATIM from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init. The alloc and ft_add_global / ft_name_last_global ORDER are unchanged, so the graph is bit-identical (gated by train_gate from-scratch + smoke_projection_lens). The arch already OWNS these handles (the load_globals_from_gguf_mmap! precedent). The engine’s @ft_globals_* recorders and the frozen-embed :str namer are back-called through `cache`; the tnn_tensor_set_name :str FFI stays on the cache realize runtime path, the same discipline as ft_name_last / lora_name_q!. donor_d_in > 0 selects the projection lens (frozen donor-width embed + trainable lens.proj); donor_d_in == 0 selects the standard layout.

# File 'lib/toy/llm/archs/llama_arch.rb', line 151

def alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied)
  if donor_d_in > 0
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, donor_d_in)
    cache.name_global!(self.t_seq_token_embed, "token_embd.weight")
    self.t_seq_w_proj = TinyNN.tnn_input_2d_f32_persistent(sess, d_model, donor_d_in)
    cache.ft_add_global_2d(self.t_seq_w_proj, d_model, donor_d_in)
    cache.ft_name_last_global("lens.proj.weight")
  else
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_token_embed, vocab, d_model)
    cache.ft_name_last_global("token_embd.weight")
  end

  self.t_seq_final_norm_gamma = TinyNN.tnn_input_1d_f32_persistent(sess, d_model)
  cache.ft_add_global_1d(self.t_seq_final_norm_gamma)
  cache.ft_name_last_global("output_norm.weight")

  if untied
    self.t_seq_output = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_output, vocab, d_model)
    cache.ft_name_last_global("output.weight")
  end
end
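
The branch structure above can be read as an ordered plan of globals. The sketch below is a hypothetical, FFI-free planner (GlobalSpec and plan_trainable_globals are illustrative names, not part of Toy::LLM). It assumes that "trainable" corresponds to registration through cache.ft_add_global_*, and it lists each shape in the argument order the source passes to the alloc call.

```ruby
# Hypothetical planner: lists the globals alloc_globals_trainable_f32! would
# create, in order, without touching TinyNN. `trainable` mirrors whether the
# source registers the tensor through cache.ft_add_global_* (an assumption
# about what that registration means).
GlobalSpec = Struct.new(:name, :shape, :trainable)

def plan_trainable_globals(vocab, d_model, donor_d_in, untied)
  specs = []
  if donor_d_in > 0
    # Projection lens: frozen donor-width embed plus trainable lens.proj.
    specs << GlobalSpec.new("token_embd.weight", [vocab, donor_d_in], false)
    specs << GlobalSpec.new("lens.proj.weight", [d_model, donor_d_in], true)
  else
    # Standard layout: the embedding itself is trainable.
    specs << GlobalSpec.new("token_embd.weight", [vocab, d_model], true)
  end
  specs << GlobalSpec.new("output_norm.weight", [d_model], true)
  specs << GlobalSpec.new("output.weight", [vocab, d_model], true) if untied
  specs
end

plan_trainable_globals(32_000, 512, 0, false).map(&:name)
# => ["token_embd.weight", "output_norm.weight"]
```

Keeping the plan as data makes the ordering easy to compare between the standard and projection-lens cases, which is the property the bit-identical-graph note depends on.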

#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ `Object`

SEQ-MODE forward orchestration. The per-graph INPUT handles (token_ids, positions) are ALLOCATED BY THE CACHE before this call (cache-owned graph I/O, read by forward() and the uploaders) and passed in; ditto t_rope_freq_factors and t_attn_mask. The arch builds: get_rows(token_embed, token_ids) → x_embed (tap), optional projection-lens matmul(w_proj, x_embed) when seq_donor_d_in>0 (tap), the shared TransformerBlockCtx built ONCE, the block-stacking loop, final RMSNorm (tap), tied/untied logits matmul (tap). Returns the three per-graph output handles in a LlamaArchForwardOut.

# File 'lib/toy/llm/archs/llama_arch.rb', line 184

def build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors,
                  t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads,
                  seq_group_size, seq_has_qkv_bias, seq_weight_dtype,
                  seq_lora_q_enabled, seq_t, seq_b, seq_n_layers,
                  seq_has_untied_output)
  eps   = seq_eps
  scale = 1.0 / Math.sqrt(seq_d_head.to_f)

  # Per-forward block context: the 14 config/handle values the block
  # body reads. Positional class (no keyword_init) — matches the
  # TransformerBlockCtx member order exactly. Built once before the
  # block-stacking loop; shared (read-only) across all blocks.
  ctx = Toy::LLM::Blocks::TransformerBlockCtx.new(
    scale, eps, seq_n_kv, seq_n_heads, seq_group_size,
    seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled,
    t_positions, t_rope_freq_factors, self.seq_rope_cfg,
    seq_t, seq_b, t_attn_mask)

  x_embed = TinyNN.tnn_get_rows(sess, self.t_seq_token_embed, t_token_ids)
  TinyNN.tnn_set_output(x_embed)

  # E2.3 — projection lens. ggml matmul(W, x) with W=[donor_d_in, d_model]
  # and x=[donor_d_in, T] gives [d_model, T] (contraction on ne[0]).
  if self.seq_donor_d_in > 0
    t_proj = TinyNN.tnn_matmul(sess, self.t_seq_w_proj, x_embed)
    TinyNN.tnn_set_output(t_proj)
    t_cur = t_proj
  else
    t_cur = x_embed
  end
  li_g = 0
  while li_g < seq_n_layers
    t_cur = self.seq_blocks_ffi[li_g].build_forward(sess, t_cur, ctx)
    li_g = li_g + 1
  end

  x_final = Toy::LLM::Primitives::RMSNorm.build(sess, t_cur, self.t_seq_final_norm_gamma, eps)
  TinyNN.tnn_set_output(x_final)

  if seq_has_untied_output
    logits = TinyNN.tnn_matmul(sess, self.t_seq_output, x_final)
  else
    logits = TinyNN.tnn_matmul(sess, self.t_seq_token_embed, x_final)
  end
  TinyNN.tnn_set_output(logits)

  Toy::LLM::Archs::LlamaArchForwardOut.new(x_embed, x_final, logits)
end
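
The E2.3 comment in the listing states the shape rule for the projection lens: matmul(W, x) contracts on ne[0]. A minimal pure-Ruby sketch of that rule follows. matmul_ne is a hypothetical helper, the sizes are made up, and the [d_model, vocab] layout of the output head is an assumption (the source only states the projection shapes).

```ruby
# Shape rule from the E2.3 comment: matmul(W, x) contracts on ne[0], so
# W = [k, n] and x = [k, t] give [n, t]. Pure Ruby; no TinyNN calls.
def matmul_ne(w_ne, x_ne)
  unless w_ne[0] == x_ne[0]
    raise ArgumentError, "ne[0] mismatch: #{w_ne[0]} vs #{x_ne[0]}"
  end
  [w_ne[1], x_ne[1]]
end

# Hypothetical sizes, chosen only for illustration.
donor_d_in = 64
d_model    = 32
vocab      = 1000
n_tokens   = 5

x_embed = [donor_d_in, n_tokens]                      # get_rows output, one column per token
t_cur   = matmul_ne([donor_d_in, d_model], x_embed)   # projection lens => [32, 5]
logits  = matmul_ne([d_model, vocab], t_cur)          # assumed head layout => [1000, 5]
```

The sketch skips the transformer blocks and the final RMSNorm, on the assumption that they preserve the [d_model, T] shape.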

#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ `Object`

Allocate the three arch-owned PERSISTENT global tensors from the mmap’d GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and, when untied, output.weight (2d, native type). The cache’s realize_for_mmap formerly ran this block inline (P2.6 pass-2 Step 1); it is moved VERBATIM here: same FFI primitives, same find_index / file_offset / type LITERAL string lookups at runtime, and the same untied conditional, which the GGUF round-trip gate exercises with untied = true. The arch already OWNS these accessors (L68), so no new class / Struct / FFI :str is introduced at class load. Called ONLY from realize_for_mmap: the random_init globals and full_finetune’s else-branch globals are structurally different and are NOT routed through this helper. Mirrors the seed_blocks! / alloc_trainable_f32_weights! precedents.

# File 'lib/toy/llm/archs/llama_arch.rb', line 121

def load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied)
  eidx = TinyNN.tnn_gguf_find_index(gguf_handle, "token_embd.weight")
  eoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, eidx)
  etyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, eidx)
  @t_seq_token_embed = TinyNN.tnn_input_2d_persistent_mmap(sess,
                         vocab, d_model, etyp, eoff)

  fnidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output_norm.weight")
  fnoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, fnidx)
  @t_seq_final_norm_gamma = TinyNN.tnn_input_1d_persistent_mmap(sess,
                              d_model, 0, fnoff)

  if untied
    oidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output.weight")
    ooff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, oidx)
    otyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, oidx)
    @t_seq_output = TinyNN.tnn_input_2d_persistent_mmap(sess,
                      vocab, d_model, otyp, ooff)
  end
end
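
The lookup pattern above (find the index by name, then read the type and file offset) can be sketched against an in-memory stand-in. Everything here is hypothetical: FakeGguf, TensorEntry and resolve_globals are illustrative names, and the -1 miss value and the explicit missing-tensor check are assumptions, since the source does not show how tnn_gguf_find_index reports a miss.

```ruby
# Hypothetical in-memory stand-in for a GGUF tensor directory; the real
# lookups go through TinyNN.tnn_gguf_* on an mmap'd handle.
TensorEntry = Struct.new(:name, :type, :file_offset)

class FakeGguf
  def initialize(entries)
    @entries = entries
  end

  # Returns the entry index, or -1 on a miss (the -1 is an assumption).
  def find_index(name)
    @entries.index { |e| e.name == name } || -1
  end

  def type(idx)
    @entries[idx].type
  end

  def file_offset(idx)
    @entries[idx].file_offset
  end
end

# Resolve the globals the loader needs; output.weight only when untied.
def resolve_globals(gguf, untied)
  names = ["token_embd.weight", "output_norm.weight"]
  names << "output.weight" if untied
  names.map do |name|
    idx = gguf.find_index(name)
    raise KeyError, "missing tensor #{name}" if idx < 0
    [name, gguf.type(idx), gguf.file_offset(idx)]
  end
end
```

Resolving all names before allocating anything is a design choice of this sketch: a file that lacks output.weight fails before any persistent tensor is created.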

#seed_blocks!(n_layers) ⇒ `Object`

Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks. The four cache realize paths each ran this identical loop verbatim (P2.6 Step 2): seed one block, then push n_layers-1 more. The SHAPE here therefore matches the former cache loop byte-for-byte (length == n_layers for n_layers ≥ 1; first element is a fresh block, exactly like the cache’s `[TransformerBlock.new]` seed). Because the seed block is always created, n_layers ≤ 0 still yields a one-element array. The arch already OWNS this array (ctor seeds it with one block at L83) and already constructs TransformerBlock.new there, so no new class / Struct / FFI :str is introduced at class load. Each realize path now calls this via the cache’s seq_blocks_ffi delegator chain (self.seq_arch).

# File 'lib/toy/llm/archs/llama_arch.rb', line 100

def seed_blocks!(n_layers)
  @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
  li_init = 1
  while li_init < n_layers
    @seq_blocks_ffi.push(Toy::LLM::Blocks::TransformerBlock.new)
    li_init = li_init + 1
  end
end
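
The seed-then-push loop can be exercised without the FFI layer. The sketch below reuses the same loop shape with a stand-in block class (FakeBlock and seed_blocks are hypothetical names) to show the resulting array length, including the n_layers = 0 edge case.

```ruby
# Stand-in for Toy::LLM::Blocks::TransformerBlock (hypothetical; no FFI).
class FakeBlock; end

# Same loop shape as seed_blocks!: seed one block, then push n_layers - 1 more.
def seed_blocks(n_layers)
  blocks = [FakeBlock.new]
  li = 1
  while li < n_layers
    blocks.push(FakeBlock.new)
    li += 1
  end
  blocks
end

seed_blocks(4).length  # => 4
seed_blocks(0).length  # => 1 (the seed block is always created)
```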
