Class: Toy::GPT2

Inherits:

Object

Object
Toy::GPT2

show all

Defined in:: lib/toy/models/toy_gpt2.rb

Overview

GPT-2: decoder-only transformer LM. ‘stack` (not `blocks`) is kept as the field name for readability — “the stack of N transformer blocks” — independent of the older Spinel field-name-collapse constraint that originally forced it.

Instance Attribute Summary collapse

#cfg ⇒ Object

Returns the value of attribute cfg.
#final_norm ⇒ Object

Returns the value of attribute final_norm.
#pos_embed ⇒ Object

Returns the value of attribute pos_embed.
#stack ⇒ Object

Returns the value of attribute stack.
#token_embed ⇒ Object

Returns the value of attribute token_embed.

Instance Method Summary collapse

#algorithm ⇒ Object

Phuong–Hutter style algorithm card for the whole model.
#algorithm_card ⇒ Object
#algorithm_card_full ⇒ Object

Recursive card — model + block + sub-ops inlined.
#forward(ids, start_pos) ⇒ Object

ids: Array<Int> (length T), start_pos: Int → logits [T, V].
#initialize(cfg) ⇒ GPT2 constructor

A new instance of GPT2.
#param_count ⇒ Object

Total trainable parameter count.

Constructor Details

#initialize(cfg) ⇒ `GPT2`

Returns a new instance of GPT2.

# File 'lib/toy/models/toy_gpt2.rb', line 78

def initialize(cfg)
  @cfg         = cfg
  @token_embed = Toy::Embedding.new(cfg.vocab, cfg.d_model)
  @pos_embed   = Toy::Embedding.new(cfg.ctx,   cfg.d_model)
  @final_norm  = Toy::LayerNorm.new(cfg.d_model)

  # Block stack: literal-seed + push so Spinel infers
  # PtrArray<Toy::GPT2Block>.
  @stack = [Toy::GPT2Block.new(cfg)]
  li = 1
  while li < cfg.n_layers
    @stack.push(Toy::GPT2Block.new(cfg))
    li += 1
  end
end

Instance Attribute Details

#cfg ⇒ `Object`

Returns the value of attribute cfg.



76
77
78

# File 'lib/toy/models/toy_gpt2.rb', line 76

def cfg
  @cfg
end

#final_norm ⇒ `Object`

Returns the value of attribute final_norm.



76
77
78

# File 'lib/toy/models/toy_gpt2.rb', line 76

def final_norm
  @final_norm
end

#pos_embed ⇒ `Object`

Returns the value of attribute pos_embed.



76
77
78

# File 'lib/toy/models/toy_gpt2.rb', line 76

def pos_embed
  @pos_embed
end

#stack ⇒ `Object`

Returns the value of attribute stack.



76
77
78

# File 'lib/toy/models/toy_gpt2.rb', line 76

def stack
  @stack
end

#token_embed ⇒ `Object`

Returns the value of attribute token_embed.



76
77
78

# File 'lib/toy/models/toy_gpt2.rb', line 76

def token_embed
  @token_embed
end

Instance Method Details

#algorithm ⇒ `Object`

Phuong–Hutter style algorithm card for the whole model. See arXiv:2207.09238 for the formalism. Mamba (arXiv:2312.00752) and FlashAttention (arXiv:2205.14135) Algorithm 1 are the modern exemplars for shape-annotated pseudocode.

‘algorithm` returns the structured form (Toy::Card); `algorithm_card` renders it to the human-readable Phuong–Hutter text. The structured form is what prep/card_to_code.rb consumes for round-trip parsing.

# File 'lib/toy/models/toy_gpt2.rb', line 133

def algorithm
  c = Toy::Card.new("Toy::GPT2.forward(x, p_start)", "HF GPT-2 family")
  c.add_input("x",       "{1..V}^T", "token IDs")
  c.add_input("p_start", "ℕ",        "absolute position of x[0]")
  c.add_output("P",      "R^{T×V}",  "logits")
  c.add_hyper("V",   @cfg.vocab.to_s)
  c.add_hyper("D",   @cfg.d_model.to_s)
  c.add_hyper("H",   @cfg.n_heads.to_s)
  c.add_hyper("D_f", @cfg.d_ff.to_s)
  c.add_hyper("N",   @cfg.n_layers.to_s)
  c.add_hyper("ctx", @cfg.ctx.to_s)
  c.add_param("W_e",         "R^{V×D}",   "token embeddings")
  c.add_param("W_p",         "R^{ctx×D}", "learned absolute positions")
  c.add_param("θ_block_ℓ",   "(ℓ=1..N)",  "per-block; see GPT2Block")
  c.add_param("γ_f, β_f",    "R^D",       "final LayerNorm")
  c.add_param_extra("(total " + Toy.fmt_count(param_count) +
                    ", embeddings tied: logits = e · W_e^⊤)")
  c.step_bind("e", "W_e[x] + W_p[p_start : p_start+T]", "e ∈ R^{T×D}")
  c.step_loop("ℓ ← 1, …, N", "")
  c.step_update("e", "e + Attn(LN(e; γ_ℓ^1, β_ℓ^1, ε); θ_ℓ^attn)",
                "e ∈ R^{T×D}", "")
  c.step_update("e", "e + FFN (LN(e; γ_ℓ^2, β_ℓ^2, ε); θ_ℓ^ffn )",
                "e ∈ R^{T×D}", "")
  c.step_loop_close
  c.step_update("e", "LN(e; γ_f, β_f, ε)", "e ∈ R^{T×D}", "")
  c.step_bind("P", "e · W_e^⊤",            "P ∈ R^{T×V}")
  c.step_return("P")
  c
end

#algorithm_card ⇒ `Object`

163	# File 'lib/toy/models/toy_gpt2.rb', line 163 def algorithm_card; algorithm.render_pseudocode; end

#algorithm_card_full ⇒ `Object`

Recursive card — model + block + sub-ops inlined.

# File 'lib/toy/models/toy_gpt2.rb', line 166

def algorithm_card_full
  blk = @stack[0]
  s = algorithm_card + "\n\n"
  s = s + "─── sub-algorithms ─────────────────────────────────────────────────────\n\n"
  s = s + blk.algorithm_card    + "\n\n"
  s = s + blk.ln1.algorithm_card  + "\n\n"
  s = s + blk.attn.algorithm_card + "\n\n"
  s = s + blk.ffn.algorithm_card
  s
end

#forward(ids, start_pos) ⇒ `Object`

ids: Array<Int> (length T), start_pos: Int → logits [T, V]

# File 'lib/toy/models/toy_gpt2.rb', line 95

def forward(ids, start_pos)
  x = @token_embed.lookup(ids)                              # [T, D]
  x.add!(@pos_embed.slice(start_pos, ids.length))           # [T, D]

  li = 0
  while li < @cfg.n_layers
    x = @stack[li].forward(x)                                   # [T, D]
    li += 1
  end

  x_final = @final_norm.forward(x)                              # [T, D]
  x_final.matmul_t(@token_embed.weight)                      # [T, V]
end

#param_count ⇒ `Object`

Total trainable parameter count. Tied embeddings counted once.

# File 'lib/toy/models/toy_gpt2.rb', line 110

def param_count
  total = @token_embed.param_count + @pos_embed.param_count +
          @final_norm.param_count
  li = 0
  while li < @cfg.n_layers
    total = total + @stack[li].param_count
    li += 1
  end
  total
end

Class: Toy::GPT2

Overview

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(cfg) ⇒ GPT2

Instance Attribute Details

#cfg ⇒ Object

#final_norm ⇒ Object

#pos_embed ⇒ Object

#stack ⇒ Object

#token_embed ⇒ Object