Class: Toy::LLM::Archs::LlamaArch

Inherits: Object

Defined in: lib/toy/llm/archs/llama_arch.rb,
lib/toy/llm/archs/llama_arch_cuda.rb,
lib/toy/llm/archs/llama_arch_metal.rb

Overview

The llama-family sequence-mode arch. Owns the arch-level persistent handles (the cache realize paths allocate+assign them via cache delegators). Field names are UNCHANGED from the former cache ivars so the cache-side realize / train / decode / tap walkers keep working by accessor name.
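The delegation idea described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: SketchArch and SketchCache are hypothetical stand-ins, and the real cache delegators may well be written out by hand rather than with the stdlib Forwardable module used here.

```ruby
require "forwardable"

# Hypothetical stand-in for the arch: it owns the persistent handles.
class SketchArch
  attr_accessor :t_seq_token_embed, :t_seq_output

  def initialize
    @t_seq_token_embed = nil
    @t_seq_output      = nil
  end
end

# Hypothetical stand-in for the cache: it keeps the old accessor names but
# forwards every read and write to the arch it holds.
class SketchCache
  extend Forwardable

  attr_reader :seq_arch

  def_delegators :@seq_arch,
                 :t_seq_token_embed, :t_seq_token_embed=,
                 :t_seq_output, :t_seq_output=

  def initialize
    @seq_arch = SketchArch.new
  end
end

cache = SketchCache.new
cache.t_seq_token_embed = :embed_handle        # written through the cache
p cache.seq_arch.t_seq_token_embed             # stored on the arch
```

Because the accessor names are identical on both sides, code that walks the cache by accessor name does not need to know that the storage moved.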

Instance Attribute Summary collapse

#seq_blocks_ffi ⇒ Object

Returns the value of attribute seq_blocks_ffi.
#seq_donor_d_in ⇒ Object

Returns the value of attribute seq_donor_d_in.
#seq_gdn_blocks_ffi ⇒ Object

Returns the value of attribute seq_gdn_blocks_ffi.
#seq_layer_kinds ⇒ Object

Returns the value of attribute seq_layer_kinds.
#seq_layer_specs ⇒ Object

Returns the value of attribute seq_layer_specs.
#seq_rope_cfg ⇒ Object

Returns the value of attribute seq_rope_cfg.
#t_seq_final_norm_gamma ⇒ Object

Returns the value of attribute t_seq_final_norm_gamma.
#t_seq_output ⇒ Object

Returns the value of attribute t_seq_output.
#t_seq_token_embed ⇒ Object

Returns the value of attribute t_seq_token_embed.
#t_seq_w_proj ⇒ Object

Returns the value of attribute t_seq_w_proj.

Instance Method Summary collapse

#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ Object

P2-finish — the RANDOM-INIT (+ projection-lens) trainable-F32 GLOBAL alloc, lifted VERBATIM from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init (alloc + ft_add_global / ft_name_last_global ORDER unchanged → bit-identical graph; gated by train_gate from-scratch + smoke_projection_lens).
#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ Object

SEQ-MODE forward orchestration.
#initialize ⇒ LlamaArch constructor

A new instance of LlamaArch.
#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ Object

Allocate the three arch-owned PERSISTENT global tensors from the mmap’d GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and — when untied — output.weight (2d, native type).
#seed_blocks!(n_layers) ⇒ Object

Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks.
#set_gdn_layer!(idx) ⇒ Object

Mark ONE layer as GDN.

Constructor Details

#initialize ⇒ `LlamaArch`

Returns a new instance of LlamaArch.

# File 'lib/toy/llm/archs/llama_arch.rb', line 92

def initialize
  @t_seq_token_embed      = TinyNN.tnn_null_ptr
  @t_seq_final_norm_gamma = TinyNN.tnn_null_ptr
  @t_seq_output           = TinyNN.tnn_null_ptr
  @t_seq_w_proj           = TinyNN.tnn_null_ptr
  # Seed with one block — matches the former cache init (L112).
  @seq_blocks_ffi         = [Toy::LLM::Blocks::TransformerBlock.new]
  # Phase 3 — parallel seed: one attention spec for the seed block.
  @seq_layer_specs        = [Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)]
  # Phase 5 — parallel int dispatch keys (KIND_ATTENTION for the seed).
  @seq_layer_kinds        = [Toy::LLM::Archs::LayerSpec::KIND_ATTENTION]
  # Phase 5 — parallel GDN-block slots. Seeded with GDNBlock placeholders so
  # the array is MONOMORPHIC (all GDNBlock) — the seam's KIND_GDN call site
  # never sees a mixed null/object array (Spinel poly-array landmine). At
  # KIND_ATTENTION layers the placeholder is simply never invoked.
  @seq_gdn_blocks_ffi     = [Toy::LLM::Blocks::GDNBlock.new]
  @seq_donor_d_in         = 0
  # The cache overwrites seq_rope_cfg with the real RoPE::Cfg before
  # build_forward runs (each realize prologue rebuilds it).
  @seq_rope_cfg           = TinyNN.tnn_null_ptr
end

Instance Attribute Details

#seq_blocks_ffi ⇒ `Object`

Returns the value of attribute seq_blocks_ffi.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_blocks_ffi
  @seq_blocks_ffi
end

#seq_donor_d_in ⇒ `Object`

Returns the value of attribute seq_donor_d_in.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_donor_d_in
  @seq_donor_d_in
end

#seq_gdn_blocks_ffi ⇒ `Object`

Returns the value of attribute seq_gdn_blocks_ffi.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_gdn_blocks_ffi
  @seq_gdn_blocks_ffi
end

#seq_layer_kinds ⇒ `Object`

Returns the value of attribute seq_layer_kinds.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_layer_kinds
  @seq_layer_kinds
end

#seq_layer_specs ⇒ `Object`

Returns the value of attribute seq_layer_specs.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_layer_specs
  @seq_layer_specs
end

#seq_rope_cfg ⇒ `Object`

Returns the value of attribute seq_rope_cfg.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def seq_rope_cfg
  @seq_rope_cfg
end

#t_seq_final_norm_gamma ⇒ `Object`

Returns the value of attribute t_seq_final_norm_gamma.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_final_norm_gamma
  @t_seq_final_norm_gamma
end

#t_seq_output ⇒ `Object`

Returns the value of attribute t_seq_output.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_output
  @t_seq_output
end

#t_seq_token_embed ⇒ `Object`

Returns the value of attribute t_seq_token_embed.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_token_embed
  @t_seq_token_embed
end

#t_seq_w_proj ⇒ `Object`

Returns the value of attribute t_seq_w_proj.



# File 'lib/toy/llm/archs/llama_arch.rb', line 68

def t_seq_w_proj
  @t_seq_w_proj
end

Instance Method Details

#alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied) ⇒ `Object`

P2-finish: the RANDOM-INIT (+ projection-lens) trainable-F32 GLOBAL alloc, lifted VERBATIM from Toy::LLM::Engine::LlamaSeqEngine#realize_for_random_init (alloc + ft_add_global / ft_name_last_global ORDER unchanged → bit-identical graph; gated by train_gate from-scratch + smoke_projection_lens). The arch already OWNS these handles (the load_globals_from_gguf_mmap! precedent); the engine’s @ft_globals_* recorders + the frozen-embed :str namer are back-called through `cache` (the tnn_tensor_set_name :str FFI stays on the cache realize runtime path, the same discipline as ft_name_last / lora_name_q!). donor_d_in > 0 selects the projection lens (frozen donor-width embed + trainable lens.proj); donor_d_in == 0 is the standard path.

# File 'lib/toy/llm/archs/llama_arch.rb', line 197

def alloc_globals_trainable_f32!(sess, cache, vocab, d_model, donor_d_in, untied)
  if donor_d_in > 0
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, donor_d_in)
    cache.name_global!(self.t_seq_token_embed, "token_embd.weight")
    self.t_seq_w_proj = TinyNN.tnn_input_2d_f32_persistent(sess, d_model, donor_d_in)
    cache.ft_add_global_2d(self.t_seq_w_proj, d_model, donor_d_in)
    cache.ft_name_last_global("lens.proj.weight")
  else
    self.t_seq_token_embed = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_token_embed, vocab, d_model)
    cache.ft_name_last_global("token_embd.weight")
  end

  self.t_seq_final_norm_gamma = TinyNN.tnn_input_1d_f32_persistent(sess, d_model)
  cache.ft_add_global_1d(self.t_seq_final_norm_gamma)
  cache.ft_name_last_global("output_norm.weight")

  if untied
    self.t_seq_output = TinyNN.tnn_input_2d_f32_persistent(sess, vocab, d_model)
    cache.ft_add_global_2d(self.t_seq_output, vocab, d_model)
    cache.ft_name_last_global("output.weight")
  end
end
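To make the registration order easier to see, here is a minimal sketch that replays the same branch structure against a recording fake. RecordingCache and sketch_register_globals are hypothetical names invented for this illustration; the FFI allocations are omitted and the handles are replaced by symbols. It assumes, following the description above, that name_global! only names a frozen tensor while the ft_add_global_* calls register a trainable one.

```ruby
# Hypothetical recorder standing in for the cache back-calls.
class RecordingCache
  attr_reader :frozen, :trainable

  def initialize
    @frozen    = []
    @trainable = []
  end

  def name_global!(_handle, name)
    @frozen.push(name)
  end

  def ft_add_global_2d(_handle, _rows, _cols)
    @trainable.push("(unnamed)")
  end

  def ft_add_global_1d(_handle)
    @trainable.push("(unnamed)")
  end

  # Names the most recently registered trainable global.
  def ft_name_last_global(name)
    @trainable[-1] = name
  end
end

# Same branch structure as alloc_globals_trainable_f32!, minus the allocs.
def sketch_register_globals(cache, donor_d_in, untied)
  if donor_d_in > 0
    cache.name_global!(:embed, "token_embd.weight")
    cache.ft_add_global_2d(:w_proj, 0, 0)
    cache.ft_name_last_global("lens.proj.weight")
  else
    cache.ft_add_global_2d(:embed, 0, 0)
    cache.ft_name_last_global("token_embd.weight")
  end
  cache.ft_add_global_1d(:norm)
  cache.ft_name_last_global("output_norm.weight")
  if untied
    cache.ft_add_global_2d(:output, 0, 0)
    cache.ft_name_last_global("output.weight")
  end
  cache
end

lens = sketch_register_globals(RecordingCache.new, 64, false)
p lens.frozen     # ["token_embd.weight"]
p lens.trainable  # ["lens.proj.weight", "output_norm.weight"]
```

In the lens case the embedding stays out of the trainable list and lens.proj.weight takes its place at the front, which is the ordering the bit-identical-graph claim depends on.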

#build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors, t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads, seq_group_size, seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled, seq_t, seq_b, seq_n_layers, seq_has_untied_output) ⇒ `Object`

SEQ-MODE forward orchestration. The per-graph INPUT handles (token_ids, positions) are ALLOCATED BY THE CACHE before this call (cache-owned graph I/O, read by forward() and the uploaders) and passed in; ditto t_rope_freq_factors and t_attn_mask. The arch builds: get_rows(token_embed, token_ids) → x_embed (tap), optional projection-lens matmul(w_proj, x_embed) when seq_donor_d_in>0 (tap), the shared TransformerBlockCtx built ONCE, the block-stacking loop, final RMSNorm (tap), tied/untied logits matmul (tap). Returns the three per-graph output handles in a LlamaArchForwardOut.

# File 'lib/toy/llm/archs/llama_arch.rb', line 230

def build_forward(sess, t_token_ids, t_positions, t_rope_freq_factors,
                  t_attn_mask, seq_eps, seq_d_head, seq_n_kv, seq_n_heads,
                  seq_group_size, seq_has_qkv_bias, seq_weight_dtype,
                  seq_lora_q_enabled, seq_t, seq_b, seq_n_layers,
                  seq_has_untied_output)
  eps   = seq_eps
  scale = 1.0 / Math.sqrt(seq_d_head.to_f)

  # Per-forward block context: the 14 config/handle values the block
  # body reads. Positional class (no keyword_init) — matches the
  # TransformerBlockCtx member order exactly. Built once before the
  # block-stacking loop; shared (read-only) across all blocks.
  ctx = Toy::LLM::Blocks::TransformerBlockCtx.new(
    scale, eps, seq_n_kv, seq_n_heads, seq_group_size,
    seq_has_qkv_bias, seq_weight_dtype, seq_lora_q_enabled,
    t_positions, t_rope_freq_factors, self.seq_rope_cfg,
    seq_t, seq_b, t_attn_mask)

  x_embed = TinyNN.tnn_get_rows(sess, self.t_seq_token_embed, t_token_ids)
  TinyNN.tnn_set_output(x_embed)

  # E2.3 — projection lens. ggml matmul(W, x) with W=[donor_d_in, d_model]
  # and x=[donor_d_in, T] gives [d_model, T] (contraction on ne[0]).
  if self.seq_donor_d_in > 0
    t_proj = TinyNN.tnn_matmul(sess, self.t_seq_w_proj, x_embed)
    TinyNN.tnn_set_output(t_proj)
    t_cur = t_proj
  else
    t_cur = x_embed
  end
  li_g = 0
  while li_g < seq_n_layers
    # Phase 3: per-layer descriptor dispatch. The branch compares a FLAT
    # INT (the layer's entry in seq_layer_kinds) and each arm calls a
    # CONCRETE typed block method, so every .build_forward call site stays
    # monomorphic (one receiver class). KIND_ATTENTION reads seq_blocks_ffi;
    # KIND_GDN (Phase 5) has its own arm + its own typed block array. Unknown
    # kinds fail loud rather than silently building the wrong graph
    # (never-mask rule).
    spec_kind = self.seq_layer_kinds[li_g]
    if spec_kind == Toy::LLM::Archs::LayerSpec::KIND_ATTENTION
      t_cur = self.seq_blocks_ffi[li_g].build_forward(sess, t_cur, ctx)
    elsif spec_kind == Toy::LLM::Archs::LayerSpec::KIND_GDN
      # Concrete typed call into the parallel GDN array. The GDN block reads
      # its own dims (set at alloc); seq_t/eps are passed explicitly and
      # carry the same values as the shared ctx.
      t_cur = self.seq_gdn_blocks_ffi[li_g].build_forward(sess, t_cur, seq_t, eps)
    else
      raise "LlamaArch#build_forward: unsupported layer kind #{spec_kind} at layer #{li_g}"
    end
    li_g = li_g + 1
  end

  x_final = Toy::LLM::Primitives::RMSNorm.build(sess, t_cur, self.t_seq_final_norm_gamma, eps)
  TinyNN.tnn_set_output(x_final)

  if seq_has_untied_output
    logits = TinyNN.tnn_matmul(sess, self.t_seq_output, x_final)
  else
    logits = TinyNN.tnn_matmul(sess, self.t_seq_token_embed, x_final)
  end
  TinyNN.tnn_set_output(logits)

  Toy::LLM::Archs::LlamaArchForwardOut.new(x_embed, x_final, logits)
end
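The shape bookkeeping in the projection-lens comment can be checked with a small sketch. mul_mat_shape is a hypothetical helper that encodes the ggml mul_mat rule (both operands share ne[0]; the result is [a.ne[1], b.ne[1]]), and all sizes below are illustrative, not taken from any real model.

```ruby
# ggml-style mul_mat shape rule, with shapes written as [ne0, ne1]:
# both operands must agree on ne0, and the result is [a.ne1, b.ne1].
def mul_mat_shape(a, b)
  raise ArgumentError, "ne[0] mismatch: #{a[0]} vs #{b[0]}" unless a[0] == b[0]
  [a[1], b[1]]
end

# Illustrative sizes only.
vocab      = 1000
d_model    = 64
donor_d_in = 96
seq_t      = 8
d_head     = 16

# Attention scale, computed the same way as in build_forward.
scale = 1.0 / Math.sqrt(d_head.to_f)

# get_rows over a donor-width embedding yields [donor_d_in, T].
x_embed = [donor_d_in, seq_t]

# Projection lens: W=[donor_d_in, d_model] times x=[donor_d_in, T].
t_proj = mul_mat_shape([donor_d_in, d_model], x_embed)

# Logits against an output weight of shape [d_model, vocab].
logits = mul_mat_shape([d_model, vocab], t_proj)

p scale    # 0.25
p t_proj   # [64, 8]
p logits   # [1000, 8]
```

The same rule explains why the lens is needed at all: a donor-width x_embed cannot be multiplied directly against d_model-wide weights, because the contraction dimension would not match.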

#load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied) ⇒ `Object`

Allocate the three arch-owned PERSISTENT global tensors from the mmap’d GGUF: token_embd.weight (2d, native type), output_norm.weight (1d f32), and — when untied — output.weight (2d, native type). The cache’s realize_for_mmap formerly ran this block inline (P2.6 pass-2 Step 1); it is moved VERBATIM here (same FFI primitives, same find_index/file_offset/type LITERAL string lookups at runtime, same untied conditional which the GGUF round-trip gate exercises true). The arch already OWNS these accessors (L68), so no new class / Struct / FFI :str at class load. Called ONLY from realize_for_mmap — the random_init globals and full_finetune’s else-branch globals are structurally different and are NOT routed through this helper. Mirrors the seed_blocks! / alloc_trainable_f32_weights! precedents.

# File 'lib/toy/llm/archs/llama_arch.rb', line 167

def load_globals_from_gguf_mmap!(sess, gguf_handle, vocab, d_model, untied)
  eidx = TinyNN.tnn_gguf_find_index(gguf_handle, "token_embd.weight")
  eoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, eidx)
  etyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, eidx)
  @t_seq_token_embed = TinyNN.tnn_input_2d_persistent_mmap(sess,
                         vocab, d_model, etyp, eoff)

  fnidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output_norm.weight")
  fnoff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, fnidx)
  @t_seq_final_norm_gamma = TinyNN.tnn_input_1d_persistent_mmap(sess,
                              d_model, 0, fnoff)

  if untied
    oidx = TinyNN.tnn_gguf_find_index(gguf_handle, "output.weight")
    ooff = TinyNN.tnn_gguf_tensor_file_offset(gguf_handle, oidx)
    otyp = TinyNN.tnn_gguf_tensor_type(gguf_handle, oidx)
    @t_seq_output = TinyNN.tnn_input_2d_persistent_mmap(sess,
                      vocab, d_model, otyp, ooff)
  end
end
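The lookup pattern above (find the index by literal name, then read offset and type by index) can be sketched against an in-memory directory. Everything here is hypothetical: SKETCH_DIR, its offsets and type codes, and the -1 "not found" convention are assumptions made for the illustration, and the real TinyNN wrappers may behave differently. The guard shows one way a missing tensor could fail loud before any handle is allocated.

```ruby
# Hypothetical in-memory tensor directory standing in for the mmap'd GGUF.
# Offsets and type codes are made up for illustration.
SKETCH_DIR = [
  { name: "token_embd.weight",  offset: 0,    type: 8 },
  { name: "output_norm.weight", offset: 4096, type: 0 }
].freeze

# Assumed convention: a missing name yields -1.
def sketch_find_index(dir, name)
  idx = dir.index { |t| t[:name] == name }
  idx.nil? ? -1 : idx
end

# Resolve a tensor by name, failing loud if it is absent.
def sketch_lookup!(dir, name)
  idx = sketch_find_index(dir, name)
  raise "GGUF tensor not found: #{name}" if idx < 0
  [dir[idx][:offset], dir[idx][:type]]
end

eoff, etyp = sketch_lookup!(SKETCH_DIR, "token_embd.weight")
has_output = sketch_find_index(SKETCH_DIR, "output.weight") >= 0

p [eoff, etyp]   # [0, 8]
p has_output     # false
```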

#seed_blocks!(n_layers) ⇒ `Object`

Reset @seq_blocks_ffi and fill it with exactly n_layers fresh TransformerBlocks. The four cache realize paths each ran this identical loop verbatim (P2.6 Step 2): seed one block, then push n_layers-1 more. The SHAPE here therefore matches the former cache loop byte-for-byte (length == n_layers; first element is a fresh block, exactly like the cache’s `[TransformerBlock.new]` seed). The arch already OWNS this array (ctor seeds it with one block at L83) and already constructs TransformerBlock.new there, so no new class / Struct / FFI :str at class load. Each realize path now calls this via the cache’s seq_blocks_ffi delegator chain (self.seq_arch). The parallel @seq_layer_specs, @seq_gdn_blocks_ffi and @seq_layer_kinds arrays are seeded in lockstep.

# File 'lib/toy/llm/archs/llama_arch.rb', line 137

def seed_blocks!(n_layers)
  @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
  # Phase 3: seed the parallel spec array in lockstep. Every layer is
  # KIND_ATTENTION for now (the homogeneous-Llama refactor gate); Phase 5
  # overwrites individual entries with KIND_GDN for Dragon's pattern.
  @seq_layer_specs = [Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)]
  @seq_gdn_blocks_ffi = [Toy::LLM::Blocks::GDNBlock.new]
  @seq_layer_kinds = [Toy::LLM::Archs::LayerSpec::KIND_ATTENTION]
  li_init = 1
  while li_init < n_layers
    @seq_blocks_ffi.push(Toy::LLM::Blocks::TransformerBlock.new)
    @seq_layer_specs.push(Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION))
    @seq_gdn_blocks_ffi.push(Toy::LLM::Blocks::GDNBlock.new)
    @seq_layer_kinds.push(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)
    li_init = li_init + 1
  end
end

#set_gdn_layer!(idx) ⇒ `Object`

Phase 5 hybrid: mark ONE layer as GDN. Takes an INT index, never an array param: a function-parameter array trips the Spinel #688 type-lock landmine, which here manifests as a token-id-finalize codegen miscompile. Mutates the plain int dispatch array element (proven-safe). The per-layer spec array is rebuilt from a per-layer GDN bool flag through the LayerSpec CTOR, never the .kind= setter: mutating LayerSpec.kind elsewhere while build_forward reads it trips a Spinel codegen miscompile that corrupts the token-id finalize. Called after seed_blocks!, before alloc.

# File 'lib/toy/llm/archs/llama_arch.rb', line 133

def set_gdn_layer!(idx)
  @seq_layer_kinds[idx] = Toy::LLM::Archs::LayerSpec::KIND_GDN
end
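The seed, mark, dispatch sequence can be sketched end to end with stand-in blocks. All names here (SketchAttnBlock, SketchGdnBlock, SketchHybridArch, and the integer values of the two KIND constants) are hypothetical; the sketch uses ordinary Ruby idioms such as Array.new with a block, which the real code deliberately avoids, and it records a trace of block kinds instead of building a graph.

```ruby
KIND_ATTENTION = 0   # hypothetical values; the real constants are not shown
KIND_GDN       = 1

class SketchAttnBlock
  def build_forward(trace)
    trace + ["attn"]
  end
end

class SketchGdnBlock
  def build_forward(trace)
    trace + ["gdn"]
  end
end

class SketchHybridArch
  attr_reader :kinds

  # Seed three parallel arrays of equal length. The GDN array holds
  # placeholders at every index so it stays monomorphic.
  def seed_blocks!(n_layers)
    @attn  = Array.new(n_layers) { SketchAttnBlock.new }
    @gdn   = Array.new(n_layers) { SketchGdnBlock.new }
    @kinds = Array.new(n_layers, KIND_ATTENTION)
  end

  # Flip one entry of the plain int dispatch array.
  def set_gdn_layer!(idx)
    @kinds[idx] = KIND_GDN
  end

  # Each arm calls into exactly one concrete block class.
  def build_forward
    trace = []
    li = 0
    while li < @kinds.length
      kind = @kinds[li]
      if kind == KIND_ATTENTION
        trace = @attn[li].build_forward(trace)
      elsif kind == KIND_GDN
        trace = @gdn[li].build_forward(trace)
      else
        raise "unsupported layer kind #{kind} at layer #{li}"
      end
      li += 1
    end
    trace
  end
end

arch = SketchHybridArch.new
arch.seed_blocks!(4)
arch.set_gdn_layer!(1)
arch.set_gdn_layer!(3)
p arch.build_forward   # ["attn", "gdn", "attn", "gdn"]
```

Calling seed_blocks! again resets every layer to KIND_ATTENTION, which is why the GDN marks have to come after seeding.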
