Class: TransformerLM

Inherits:

Object

Defined in: lib/toy/models/transformer.rb

Overview

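TransformerLM is a decoder-only transformer language model: tied token embeddings, learned positional embeddings, and pre-norm blocks (RMSNorm, multi-head causal self-attention, GeLU feed-forward), with hand-written forward and backward passes plus SGD and Adam updates.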

Instance Attribute Summary

#blocks ⇒ Object

Returns the value of attribute blocks.
#cache ⇒ Object

Returns the value of attribute cache.
#context_length ⇒ Object

Returns the value of attribute context_length.
#d_ff ⇒ Object

Returns the value of attribute d_ff.
#d_head ⇒ Object

Returns the value of attribute d_head.
#d_model ⇒ Object

Returns the value of attribute d_model.
#ffn_ffi_caches ⇒ Object

Returns the value of attribute ffn_ffi_caches.
#layer_caches ⇒ Object

Returns the value of attribute layer_caches.
#n_heads ⇒ Object

Returns the value of attribute n_heads.
#n_layers ⇒ Object

Returns the value of attribute n_layers.
#norm_final_gamma ⇒ Object

Returns the value of attribute norm_final_gamma.
#pos_embed ⇒ Object

Returns the value of attribute pos_embed.
#token_embed ⇒ Object

Returns the value of attribute token_embed.
#vocab_size ⇒ Object

Returns the value of attribute vocab_size.
#vocabulary ⇒ Object

Returns the value of attribute vocabulary.

Instance Method Summary

#adam_step_block(p_block, g_block, m_block, v_block, lr, b1, b2, eps, omc1, omc2) ⇒ Object
#adam_step_mat(p, g, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object
#adam_step_vec(p, g, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ Object
#apply_causal_mask!(scores, query_offset) ⇒ Object

Causal mask: for each row i, set scores[i, j] = -1e30 for j > query_offset + i.
#apply_gradients_adam(grads, state, lr, beta1, beta2, eps) ⇒ Object

Adam: per-coordinate adaptive learning rate driven by running estimates of the gradient mean (m) and of the mean squared gradient (v).
#apply_gradients_sgd(grads, lr) ⇒ Object

Plain SGD: p -= lr · g over every parameter.
#backward(input_ids, target_grads) ⇒ Object

Full backward pass.
#cross_entropy_grad(logits, token_ids) ⇒ Object

Cross-entropy on next-token prediction.
#embed(token_ids, start_pos) ⇒ Object

x = token_embed[token_ids] + pos_embed[start_pos + i].
#embed_backward(token_ids, dx, target_grads) ⇒ Object

Embedding backward: each row of dx routes to its token’s embedding row and to position i’s positional embedding row.
#feed_forward(h, block) ⇒ Object

FFN: gelu(h · W_ff1) · W_ff2.
#feed_forward_backward(d_ff_out, h, ff_cache, block, target_block) ⇒ Object

FFN backward.
#feed_forward_ffi(h, block, ffi_cache) ⇒ Object

Persistent-session FFI variant of feed_forward.
#forward(token_ids) ⇒ Object

Full forward pass.
#generate_from_ids(start_ids, max_tokens, temperature) ⇒ Object

Autoregressive sampling from a starting token-ID list.
#hsplit_heads(d_concat) ⇒ Object

Split a (T × d_model) matrix back into n_heads × (T × d_head) heads.
#hstack_heads(per_head) ⇒ Object

Concatenate per-head outputs side by side: n_heads × (T × d_head) → (T × d_model).
#initialize(vocab_size, d_model, d_ff, n_heads, n_layers, context_length) ⇒ TransformerLM constructor

A new instance of TransformerLM.
#rms_norm(x, gamma) ⇒ Object

RMSNorm: y_j = gamma_j * x_j / sqrt(mean(x²) + eps), per row.
#rms_norm_backward(x, gamma, rms, dy, target_dgamma) ⇒ Object

RMSNorm backward.
#sample_logits_row(logits, row, temperature) ⇒ Object

Sample a token ID from row `row` of `logits` (T × vocab_size flat).
#self_attention(h_in, block) ⇒ Object

Multi-head self-attention.
#self_attention_backward(d_proj, h_in, attn_cache, block, target_block) ⇒ Object

Self-attention backward.
#self_attention_head(h_in, block, head_idx, inv_sqrt) ⇒ Object
#sgd_step_block(p_block, g_block, lr) ⇒ Object
#sgd_step_mat(p, g, lr) ⇒ Object
#sgd_step_vec(p, g, lr) ⇒ Object
#softmax_rows!(m) ⇒ Object

Row-wise softmax with numerical-stability max-shift, in place on `m`.
#softmax_rows_backward(softmax_out, d_softmax) ⇒ Object

Row-wise softmax backward (for attention).
#transformer_block(x, block) ⇒ Object

One transformer block (pre-norm).
#transformer_block_backward(dx_out, x_in, block, layer_cache, target_block_grads) ⇒ Object

Backward through one block.
#transformer_block_into(x, block, cache, ffi_cache) ⇒ Object

Same as transformer_block but writes into a pre-existing LayerCache.
#x_in_for_layer(li) ⇒ Object

No `train_step` here: Spinel compiles every class method whether or not it has callers.

Constructor Details

#initialize(vocab_size, d_model, d_ff, n_heads, n_layers, context_length) ⇒ `TransformerLM`

Returns a new instance of TransformerLM.

# File 'lib/toy/models/transformer.rb', line 510

def initialize(vocab_size, d_model, d_ff, n_heads, n_layers, context_length)
  @vocab_size     = vocab_size
  @d_model        = d_model
  @d_ff           = d_ff
  @n_heads        = n_heads
  @d_head         = d_model / n_heads
  @n_layers       = n_layers
  @context_length = context_length

  s = 1.0 / Math.sqrt(d_model)

  @token_embed = Mat.new(vocab_size, d_model)
  @token_embed.fill_random(s)

  @pos_embed = Mat.new(context_length, d_model)
  @pos_embed.fill_random(s)

  # Tied embeddings: lm_head is @token_embed used in transposed form
  # at unembed time (logits = x_final · token_embedᵀ). No separate
  # @lm_head matrix; the unembed gradient accumulates into the same
  # token_embed grad slot as the input-side embedding lookup.

  @norm_final_gamma = Array.new(d_model, 1.0)

  # Vocabulary: seeded with one placeholder so Spinel infers it as
  # an array of strings; callers should set the real vocab after construction.
  @vocabulary = ["?"]

  # Inline Block.new in the literal — Spinel's scan_ivars runs before
  # local-variable types are inferred, so storing through a temp would
  # mistype @blocks's element class.
  @blocks = [Block.new(d_model, @d_head, d_ff, n_heads)]
  @blocks[0].fill_random_all(s)
  li = 1
  while li < n_layers
    @blocks.push(Block.new(d_model, @d_head, d_ff, n_heads))
    @blocks[li].fill_random_all(s)
    li += 1
  end

  # Pre-allocate layer caches so the array's element type is fixed at
  # construction time. Forward populates fields on these existing objects.
  @layer_caches = [LayerCache.new]
  li = 1
  while li < n_layers
    @layer_caches.push(LayerCache.new)
    li += 1
  end

  # Per-block persistent FFI caches for feed_forward. Lazily realized
  # on first call so we don't need to decide T (sequence length) at
  # model-construction time. With USE_FFI_MATMUL=false they sit
  # unused; the cost is one cheap object alloc per block.
  @ffn_ffi_caches = [FFNFFICache.new]
  li = 1
  while li < n_layers
    @ffn_ffi_caches.push(FFNFFICache.new)
    li += 1
  end
end

Instance Attribute Details

#blocks ⇒ `Object`

Returns the value of attribute blocks.

# File 'lib/toy/models/transformer.rb', line 505

def blocks
  @blocks
end

#cache ⇒ `Object`

Returns the value of attribute cache.

# File 'lib/toy/models/transformer.rb', line 823

def cache
  @cache
end

#context_length ⇒ `Object`

Returns the value of attribute context_length.

# File 'lib/toy/models/transformer.rb', line 505

def context_length
  @context_length
end

#d_ff ⇒ `Object`

Returns the value of attribute d_ff.

# File 'lib/toy/models/transformer.rb', line 505

def d_ff
  @d_ff
end

#d_head ⇒ `Object`

Returns the value of attribute d_head.

# File 'lib/toy/models/transformer.rb', line 505

def d_head
  @d_head
end

#d_model ⇒ `Object`

Returns the value of attribute d_model.

# File 'lib/toy/models/transformer.rb', line 505

def d_model
  @d_model
end

#ffn_ffi_caches ⇒ `Object`

Returns the value of attribute ffn_ffi_caches.

# File 'lib/toy/models/transformer.rb', line 571

def ffn_ffi_caches
  @ffn_ffi_caches
end

#layer_caches ⇒ `Object`

Returns the value of attribute layer_caches.

# File 'lib/toy/models/transformer.rb', line 573

def layer_caches
  @layer_caches
end

#n_heads ⇒ `Object`

Returns the value of attribute n_heads.

# File 'lib/toy/models/transformer.rb', line 505

def n_heads
  @n_heads
end

#n_layers ⇒ `Object`

Returns the value of attribute n_layers.

# File 'lib/toy/models/transformer.rb', line 505

def n_layers
  @n_layers
end

#norm_final_gamma ⇒ `Object`

Returns the value of attribute norm_final_gamma.

# File 'lib/toy/models/transformer.rb', line 505

def norm_final_gamma
  @norm_final_gamma
end

#pos_embed ⇒ `Object`

Returns the value of attribute pos_embed.

# File 'lib/toy/models/transformer.rb', line 505

def pos_embed
  @pos_embed
end

#token_embed ⇒ `Object`

Returns the value of attribute token_embed.

# File 'lib/toy/models/transformer.rb', line 505

def token_embed
  @token_embed
end

#vocab_size ⇒ `Object`

Returns the value of attribute vocab_size.

# File 'lib/toy/models/transformer.rb', line 505

def vocab_size
  @vocab_size
end

#vocabulary ⇒ `Object`

Returns the value of attribute vocabulary.

# File 'lib/toy/models/transformer.rb', line 505

def vocabulary
  @vocabulary
end

Instance Method Details

#adam_step_block(p_block, g_block, m_block, v_block, lr, b1, b2, eps, omc1, omc2) ⇒ `Object`

# File 'lib/toy/models/transformer.rb', line 1249

def adam_step_block(p_block, g_block, m_block, v_block, lr, b1, b2, eps, omc1, omc2)
  self.adam_step_vec(p_block.norm1_gamma, g_block.norm1_gamma,
                     m_block.norm1_gamma, v_block.norm1_gamma,
                     lr, b1, b2, eps, omc1, omc2)
  self.adam_step_vec(p_block.norm2_gamma, g_block.norm2_gamma,
                     m_block.norm2_gamma, v_block.norm2_gamma,
                     lr, b1, b2, eps, omc1, omc2)
  self.adam_step_mat(p_block.w_o,   g_block.w_o,
                     m_block.w_o,   v_block.w_o,
                     lr, b1, b2, eps, omc1, omc2)
  self.adam_step_mat(p_block.w_ff1, g_block.w_ff1,
                     m_block.w_ff1, v_block.w_ff1,
                     lr, b1, b2, eps, omc1, omc2)
  self.adam_step_mat(p_block.w_ff2, g_block.w_ff2,
                     m_block.w_ff2, v_block.w_ff2,
                     lr, b1, b2, eps, omc1, omc2)

  h = 0
  while h < @n_heads
    self.adam_step_mat(p_block.w_q[h], g_block.w_q[h],
                       m_block.w_q[h], v_block.w_q[h],
                       lr, b1, b2, eps, omc1, omc2)
    self.adam_step_mat(p_block.w_k[h], g_block.w_k[h],
                       m_block.w_k[h], v_block.w_k[h],
                       lr, b1, b2, eps, omc1, omc2)
    self.adam_step_mat(p_block.w_v[h], g_block.w_v[h],
                       m_block.w_v[h], v_block.w_v[h],
                       lr, b1, b2, eps, omc1, omc2)
    h += 1
  end
end

#adam_step_mat(p, g, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ `Object`

# File 'lib/toy/models/transformer.rb', line 1213

def adam_step_mat(p, g, m, v, lr, b1, b2, eps, omc1, omc2)
  one_minus_b1 = 1.0 - b1
  one_minus_b2 = 1.0 - b2
  n = p.flat.length
  i = 0
  while i < n
    gi = g.flat[i]
    new_m = b1 * m.flat[i] + one_minus_b1 * gi
    new_v = b2 * v.flat[i] + one_minus_b2 * gi * gi
    m.flat[i] = new_m
    v.flat[i] = new_v
    m_hat = new_m / omc1
    v_hat = new_v / omc2
    p.flat[i] -= lr * m_hat / (Math.sqrt(v_hat) + eps)
    i += 1
  end
end

#adam_step_vec(p, g, m, v, lr, b1, b2, eps, omc1, omc2) ⇒ `Object`

# File 'lib/toy/models/transformer.rb', line 1231

def adam_step_vec(p, g, m, v, lr, b1, b2, eps, omc1, omc2)
  one_minus_b1 = 1.0 - b1
  one_minus_b2 = 1.0 - b2
  n = p.length
  i = 0
  while i < n
    gi = g[i]
    new_m = b1 * m[i] + one_minus_b1 * gi
    new_v = b2 * v[i] + one_minus_b2 * gi * gi
    m[i] = new_m
    v[i] = new_v
    m_hat = new_m / omc1
    v_hat = new_v / omc2
    p[i] -= lr * m_hat / (Math.sqrt(v_hat) + eps)
    i += 1
  end
end

#apply_causal_mask!(scores, query_offset) ⇒ `Object`

Causal mask: for each row i, set scores[i, j] = -1e30 for j > query_offset + i.

# File 'lib/toy/models/transformer.rb', line 661

def apply_causal_mask!(scores, query_offset)
  t = scores.nrows
  n = scores.ncols
  i = 0
  while i < t
    first_masked = query_offset + i + 1
    j = first_masked
    while j < n
      scores.flat[i * n + j] = NEG_INF_SCORE
      j += 1
    end
    i += 1
  end
end
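
The masking rule can be sketched on a plain array-of-rows matrix. This is an illustration only: the nested-array layout, the name `causal_mask`, and the local constant `MASK_VALUE` are assumptions made for the example, whereas the method above mutates a flat `Mat` in place and writes `NEG_INF_SCORE`.

```ruby
# Hypothetical stand-in for NEG_INF_SCORE; -1e30 is the value named in the doc comment.
MASK_VALUE = -1e30

# Returns a masked copy of `scores` (an array of rows). Row i keeps
# columns 0..(query_offset + i); every later column becomes MASK_VALUE.
def causal_mask(scores, query_offset)
  scores.each_with_index.map do |row, i|
    row.each_with_index.map do |s, j|
      j > query_offset + i ? MASK_VALUE : s
    end
  end
end

scores = [[0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6],
          [0.7, 0.8, 0.9]]
masked = causal_mask(scores, 0)
masked.each { |row| p row }
```

With `query_offset = 0` the first row keeps only its own column and the last row is left untouched.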

#apply_gradients_adam(grads, state, lr, beta1, beta2, eps) ⇒ `Object`

Adam: per-coordinate adaptive learning rate driven by running estimates of the gradient mean (m) and of the mean squared gradient (v).

m  ← β1·m + (1−β1)·g           v  ← β2·v + (1−β2)·g²
m̂ = m / (1 − β1ᵗ)              v̂ = v / (1 − β2ᵗ)
p -= lr · m̂ / (√v̂ + ε)

bc1 / bc2 hold β1ᵗ and β2ᵗ as running products in AdamState (one multiply per step) rather than recomputing β**t (one pow() per step). Because the code multiplies before use, this presumes both start at 1.0.

# File 'lib/toy/models/transformer.rb', line 1188

def apply_gradients_adam(grads, state, lr, beta1, beta2, eps)
  state.bc1 = state.bc1 * beta1
  state.bc2 = state.bc2 * beta2
  omc1 = 1.0 - state.bc1
  omc2 = 1.0 - state.bc2

  self.adam_step_mat(@token_embed, grads.token_embed,
                     state.m.token_embed, state.v.token_embed,
                     lr, beta1, beta2, eps, omc1, omc2)
  self.adam_step_mat(@pos_embed, grads.pos_embed,
                     state.m.pos_embed, state.v.pos_embed,
                     lr, beta1, beta2, eps, omc1, omc2)
  self.adam_step_vec(@norm_final_gamma, grads.norm_final_gamma,
                     state.m.norm_final_gamma, state.v.norm_final_gamma,
                     lr, beta1, beta2, eps, omc1, omc2)

  li = 0
  while li < @n_layers
    self.adam_step_block(@blocks[li], grads.blocks[li],
                         state.m.blocks[li], state.v.blocks[li],
                         lr, beta1, beta2, eps, omc1, omc2)
    li += 1
  end
end
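
A scalar sketch of the update may help when checking the bias-correction bookkeeping. It assumes the running products start at 1.0 (AdamState is not shown here), and `adam_scalar_step` is a hypothetical helper that mirrors the body of `adam_step_vec` for a single coordinate.

```ruby
# One Adam update for a single coordinate; returns [new_p, new_m, new_v].
def adam_scalar_step(p, g, m, v, lr, b1, b2, eps, omc1, omc2)
  m = b1 * m + (1.0 - b1) * g
  v = b2 * v + (1.0 - b2) * g * g
  m_hat = m / omc1
  v_hat = v / omc2
  [p - lr * m_hat / (Math.sqrt(v_hat) + eps), m, v]
end

beta1 = 0.9
beta2 = 0.999

# Part 1: the running product bc1 tracks beta1**t step by step.
bc1 = 1.0 # assumed initial value
max_diff = 0.0
(1..5).each do |t|
  bc1 *= beta1
  diff = ((1.0 - bc1) - (1.0 - beta1**t)).abs
  max_diff = diff if diff > max_diff
end

# Part 2: the first step from zero moments. After bias correction
# m_hat is about g and v_hat about g*g, so the step size is close to lr.
p1, _m1, _v1 = adam_scalar_step(1.0, 0.5, 0.0, 0.0, 0.01,
                                beta1, beta2, 1e-8,
                                1.0 - beta1, 1.0 - beta2)
puts max_diff
puts p1
```

The second part shows why Adam is insensitive to gradient scale early on: the first update moves the parameter by roughly `lr` regardless of the magnitude of `g`.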

#apply_gradients_sgd(grads, lr) ⇒ `Object`

Two optimizers live side by side. Plain SGD (apply_gradients_sgd) is what the train_minimal smoke test uses: a few dozen steps to prove that forward, backward, and update compile and converge, with no extra state. Adam (apply_gradients_adam) is what the TinyStories run uses; it walks the same parameter inventory but keeps parallel m/v moment shadows in AdamState.

# File 'lib/toy/models/transformer.rb', line 1133

def apply_gradients_sgd(grads, lr)
  self.sgd_step_mat(@token_embed, grads.token_embed, lr)
  self.sgd_step_mat(@pos_embed,   grads.pos_embed,   lr)
  self.sgd_step_vec(@norm_final_gamma, grads.norm_final_gamma, lr)

  li = 0
  while li < @n_layers
    self.sgd_step_block(@blocks[li], grads.blocks[li], lr)
    li += 1
  end
end

#backward(input_ids, target_grads) ⇒ `Object`

Full backward pass. Fills `target_grads` with this example’s gradients and the loss. The caller is responsible for calling forward(input_ids) first, because backward reads the intermediates that forward stored in @cache.

# File 'lib/toy/models/transformer.rb', line 1318

def backward(input_ids, target_grads)
  n_pred = input_ids.length - 1
  if n_pred <= 0
    target_grads.loss = 0.0
    return
  end

  loss_res = self.cross_entropy_grad(@cache.logits, input_ids)
  target_grads.loss = loss_res.loss

  # Tied unembed: logits = x_final · token_embedᵀ.
  #   d_token_embed[v,d] += Σ_t dlogits[t,v] · x_final[t,d]
  #     ⇒ dlogits.t_matmul(x_final)   (vocab × d_model)
  #   d_x_final[t,d]      = Σ_v dlogits[t,v] · token_embed[v,d]
  #     ⇒ dlogits.matmul(token_embed) (T × d_model)
  # The unembed gradient is added directly into target_grads.token_embed
  # — embed_backward later adds the input-side row contributions on top.
  d_te_unembed = loss_res.dlogits.t_matmul(@cache.x_final)
  target_grads.token_embed.add!(d_te_unembed)
  dx_final = loss_res.dlogits.matmul(@token_embed)

  # Final RMSNorm. Use `self.` so Spinel's call-site parameter inference
  # picks up the typed args (only fires for explicit-receiver calls).
  dx = self.rms_norm_backward(@cache.x_block_out, @norm_final_gamma,
                              @cache.rms_final, dx_final,
                              target_grads.norm_final_gamma)

  # Each block in reverse.
  li = @n_layers - 1
  while li >= 0
    dx = self.transformer_block_backward(dx, self.x_in_for_layer(li),
                                         @blocks[li], @cache.layers[li],
                                         target_grads.blocks[li])
    li -= 1
  end

  self.embed_backward(input_ids, dx, target_grads)
end

#cross_entropy_grad(logits, token_ids) ⇒ `Object`

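Cross-entropy on next-token prediction: position i is scored against token_ids[i + 1]. The loss is averaged over the (T-1) prediction positions, so dL/dlogits = (softmax(logits) - one_hot(target)) / (T-1) on those rows and zero on the last row. The target probability is clamped to LOG_PROB_FLOOR before the log is taken.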

# File 'lib/toy/models/transformer.rb', line 856

def cross_entropy_grad(logits, token_ids)
  n_pred = token_ids.length - 1
  dlogits = Mat.new(logits.nrows, logits.ncols)
  total_loss = 0.0
  if n_pred <= 0
    return LossResult.new(dlogits, 0.0)
  end
  inv_n = 1.0 / n_pred
  v = logits.ncols

  i = 0
  while i < n_pred
    base = i * v
    mx = logits.flat[base]
    j = 1
    while j < v
      val = logits.flat[base + j]
      if val > mx
        mx = val
      end
      j += 1
    end
    sum = 0.0
    j = 0
    while j < v
      e = Math.exp(logits.flat[base + j] - mx)
      sum += e
      j += 1
    end
    target = token_ids[i + 1]
    target_logit = logits.flat[base + target]
    pt = Math.exp(target_logit - mx) / sum
    if pt < LOG_PROB_FLOOR
      pt = LOG_PROB_FLOOR
    end
    total_loss -= Math.log(pt)

    j = 0
    while j < v
      p = Math.exp(logits.flat[base + j] - mx) / sum
      dlogits.flat[base + j] = p * inv_n
      j += 1
    end
    ti = base + target
    dlogits.flat[ti] = dlogits.flat[ti] - inv_n

    i += 1
  end

  LossResult.new(dlogits, total_loss / n_pred)
end
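
The loss and its gradient can be sketched on nested arrays. This is a simplified illustration, not the method above: `softmax` and `next_token_loss` are hypothetical names, the logits are an array of rows rather than a flat `Mat`, and the `LOG_PROB_FLOOR` clamp is omitted.

```ruby
# Softmax of one logits row, shifted by the row max for stability.
def softmax(row)
  mx = row.max
  exps = row.map { |x| Math.exp(x - mx) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

# logits: T rows of vocab-size floats; token_ids: T token IDs.
# Position i predicts token_ids[i + 1]. Returns [mean_loss, dlogits];
# the last row of dlogits stays zero because it has no target.
def next_token_loss(logits, token_ids)
  n_pred = token_ids.length - 1
  dlogits = logits.map { |row| Array.new(row.length, 0.0) }
  return [0.0, dlogits] if n_pred <= 0

  total = 0.0
  n_pred.times do |i|
    probs = softmax(logits[i])
    target = token_ids[i + 1]
    total -= Math.log(probs[target])
    probs.each_with_index { |prob, j| dlogits[i][j] = prob / n_pred }
    dlogits[i][target] -= 1.0 / n_pred
  end
  [total / n_pred, dlogits]
end

logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
loss, dlogits = next_token_loss(logits, [0, 1, 2])
puts loss
```

Each gradient row sums to zero, since the softmax probabilities sum to one and exactly one entry has 1/(T-1) subtracted.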

#embed(token_ids, start_pos) ⇒ `Object`

x = token_embed[token_ids] + pos_embed[start_pos + i]

# File 'lib/toy/models/transformer.rb', line 578

def embed(token_ids, start_pos)
  t = token_ids.length
  out = Mat.new(t, @d_model)
  i = 0
  while i < t
    tok_id = token_ids[i]
    j = 0
    while j < @d_model
      out.flat[i * @d_model + j] =
        @token_embed.flat[tok_id * @d_model + j] +
        @pos_embed.flat[(start_pos + i) * @d_model + j]
      j += 1
    end
    i += 1
  end
  out
end

#embed_backward(token_ids, dx, target_grads) ⇒ `Object`

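Embedding backward: row i of dx is added to the gradient row of token_ids[i] in token_embed and to row i of pos_embed (the forward pass embeds with start_pos = 0). A token ID that occurs more than once accumulates every one of its rows.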

# File 'lib/toy/models/transformer.rb', line 1295

def embed_backward(token_ids, dx, target_grads)
  t_seq = token_ids.length
  i = 0
  while i < t_seq
    # Spinel-pin: `.to_i` forces Int when this method is dead code
    # (qwen25_kv doesn't call it). Without a caller to constrain
    # `token_ids`, Spinel boxes `tok_id` as RbVal and breaks the
    # int-context use below. The explicit cast is a no-op at
    # runtime when token_ids[i] is already Int.
    tok_id = token_ids[i].to_i
    j = 0
    while j < @d_model
      pi = i * @d_model + j
      target_grads.token_embed.flat[tok_id * @d_model + j] += dx.flat[pi]
      target_grads.pos_embed.flat[pi]                      += dx.flat[pi]
      j += 1
    end
    i += 1
  end
end
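
The accumulation on repeated tokens is easiest to see on a tiny case. The sketch below uses nested arrays and a hypothetical `embed_backward_sketch` that returns fresh gradient tables instead of writing into `target_grads`.

```ruby
# Scatter-add rows of dx into a token-gradient table (by token ID) and a
# position-gradient table (by position index). Returns [d_tok, d_pos].
def embed_backward_sketch(token_ids, dx, vocab_size, d_model)
  d_tok = Array.new(vocab_size) { Array.new(d_model, 0.0) }
  d_pos = Array.new(token_ids.length) { Array.new(d_model, 0.0) }
  token_ids.each_with_index do |tok, i|
    d_model.times do |j|
      d_tok[tok][j] += dx[i][j]
      d_pos[i][j]   += dx[i][j]
    end
  end
  [d_tok, d_pos]
end

# Token 2 appears at positions 0 and 2, so its gradient row sums both.
dx = [[1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
d_tok, d_pos = embed_backward_sketch([2, 0, 2], dx, 3, 2)
p d_tok
p d_pos
```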

#feed_forward(h, block) ⇒ `Object`

FFN: gelu(h · W_ff1) · W_ff2. Returns an FFResult holding the output Mat and an FFCache (the pre-activation and hidden activations). GeLU uses the tanh approximation: 0.5 x (1 + tanh(c (x + 0.044715 x³))), c = √(2/π).

# File 'lib/toy/models/transformer.rb', line 738

def feed_forward(h, block)
  pre = h.matmul(block.w_ff1)
  hidden = Mat.new(pre.nrows, pre.ncols)
  n = pre.nrows * pre.ncols
  i = 0
  while i < n
    x = pre.flat[i]
    u = GELU_C * (x + GELU_K * x * x * x)
    hidden.flat[i] = 0.5 * x * (1.0 + Math.tanh(u))
    i += 1
  end
  out = hidden.matmul(block.w_ff2)
  FFResult.new(out, FFCache.new(pre, hidden))
end
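
For reference, the activation on its own can be written as a scalar function. The constant names below are local to this sketch (the file's `GELU_C` and `GELU_K` are defined elsewhere and not shown here), with values taken from the formula in the description.

```ruby
SKETCH_GELU_C = Math.sqrt(2.0 / Math::PI)
SKETCH_GELU_K = 0.044715

# Tanh approximation of GeLU for a single value.
def gelu_tanh(x)
  u = SKETCH_GELU_C * (x + SKETCH_GELU_K * x * x * x)
  0.5 * x * (1.0 + Math.tanh(u))
end

[-10.0, -1.0, 0.0, 1.0, 10.0].each do |x|
  puts "gelu(#{x}) = #{gelu_tanh(x)}"
end
```

The function behaves like the identity for large positive inputs and goes to zero for large negative ones, with a smooth transition around the origin.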

#feed_forward_backward(d_ff_out, h, ff_cache, block, target_block) ⇒ `Object`

FFN backward. Assigns the w_ff1 and w_ff2 grads on target_block (overwriting rather than accumulating). Returns d_h.

# File 'lib/toy/models/transformer.rb', line 1007

def feed_forward_backward(d_ff_out, h, ff_cache, block, target_block)
  t_seq = h.nrows           # type hint: h is a Mat
  d_w_ff2  = ff_cache.hidden.t_matmul(d_ff_out)
  d_hidden = d_ff_out.matmul_t(block.w_ff2)

  # GeLU' (tanh approximation; see top-of-file GELU_* constants):
  #   gelu(x)  = 0.5 x (1 + t),                t = tanh(u),  u = C (x + K x³)
  #   gelu'(x) = 0.5 (1 + t) + 0.5 x (1 - t²) · C (1 + DK x²)
  # where C = sqrt(2/π), K = 0.044715, DK = 3 K = 0.134145.
  d_pre = Mat.new(d_hidden.nrows, d_hidden.ncols)
  n = d_hidden.nrows * d_hidden.ncols
  i = 0
  while i < n
    x     = ff_cache.pre.flat[i]
    u     = GELU_C * (x + GELU_K * x * x * x)
    t     = Math.tanh(u)
    du_dx = GELU_C * (1.0 + GELU_DK * x * x)
    deriv = 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx
    d_pre.flat[i] = d_hidden.flat[i] * deriv
    i += 1
  end

  d_w_ff1 = h.t_matmul(d_pre)
  d_h     = d_pre.matmul_t(block.w_ff1)

  target_block.w_ff1 = d_w_ff1
  target_block.w_ff2 = d_w_ff2
  d_h
end
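
One way to gain confidence in the GeLU derivative used above is a finite-difference check. The sketch below is a standalone assumption-laden version: the constants are redefined locally with the values from the code comment, and `gelu_prime` copies the `deriv` expression for a single scalar.

```ruby
FD_GELU_C  = Math.sqrt(2.0 / Math::PI)
FD_GELU_K  = 0.044715
FD_GELU_DK = 3.0 * FD_GELU_K

def gelu(x)
  0.5 * x * (1.0 + Math.tanh(FD_GELU_C * (x + FD_GELU_K * x * x * x)))
end

# Analytic derivative, same form as `deriv` in feed_forward_backward.
def gelu_prime(x)
  t = Math.tanh(FD_GELU_C * (x + FD_GELU_K * x * x * x))
  du_dx = FD_GELU_C * (1.0 + FD_GELU_DK * x * x)
  0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx
end

# Central difference approximation of the derivative.
def numeric_derivative(x, h = 1e-5)
  (gelu(x + h) - gelu(x - h)) / (2.0 * h)
end

max_err = [-2.0, -0.5, 0.0, 0.7, 3.0].map do |x|
  (gelu_prime(x) - numeric_derivative(x)).abs
end.max
puts max_err
```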

#feed_forward_ffi(h, block, ffi_cache) ⇒ `Object`

Persistent-session FFI variant of feed_forward. A single ggml session runs the chain `mul_mat(w1_t, h) -> gelu -> mul_mat(w2_t, hidden)` in one dispatch; activations live in ggml memory between matmul1 and matmul2 (no host round-trip for GeLU).

Operand-order trick: with matmul1 = mul_mat(w1_t, h), the result’s ne0 is d_ff, which equals matmul2’s k, so the chain composes without an intermediate transpose. All three result tensors then read back as a straight row-major memcpy.

# File 'lib/toy/models/transformer.rb', line 762

def feed_forward_ffi(h, block, ffi_cache)
  t_seq = h.nrows
  d_model = h.ncols
  d_ff = block.w_ff1.ncols

  if !ffi_cache.realized
    ffi_cache.realize_for(t_seq, d_model, d_ff)
  end

  TinyNN.upload_row_major(ffi_cache.sess, ffi_cache.t_h, h)
  TinyNN.stage_transposed_and_upload(ffi_cache.sess, ffi_cache.t_w1_t, block.w_ff1)
  TinyNN.stage_transposed_and_upload(ffi_cache.sess, ffi_cache.t_w2_t, block.w_ff2)
  TinyNN.tnn_compute(ffi_cache.sess)
  pre    = TinyNN.download_row_major(ffi_cache.sess, ffi_cache.t_pre,    t_seq, d_ff)
  hidden = TinyNN.download_row_major(ffi_cache.sess, ffi_cache.t_hidden, t_seq, d_ff)
  out    = TinyNN.download_row_major(ffi_cache.sess, ffi_cache.t_out,    t_seq, d_model)

  FFResult.new(out, FFCache.new(pre, hidden))
end

#forward(token_ids) ⇒ `Object`

Full forward pass. Writes per-layer intermediates into @layer_caches, which are pre-allocated so their types are unambiguous to Spinel, and stores the remaining intermediates in a ForwardCache assigned to @cache. Returns the logits Mat (T × vocab_size).

# File 'lib/toy/models/transformer.rb', line 796

def forward(token_ids)
  cache = ForwardCache.new
  cache.token_ids = token_ids

  x = embed(token_ids, 0)
  cache.x_embed = x

  x_cur = x
  li = 0
  while li < @n_layers
    transformer_block_into(x_cur, @blocks[li], @layer_caches[li], @ffn_ffi_caches[li])
    x_cur = @layer_caches[li].x_out
    li += 1
  end
  cache.layers = @layer_caches
  cache.x_block_out = x_cur

  nr = rms_norm(x_cur, @norm_final_gamma)
  cache.x_final   = nr.y
  cache.rms_final = nr.rms

  # Tied unembed: logits[t,v] = Σ_d x_final[t,d] · token_embed[v,d]
  cache.logits = nr.y.matmul_t(@token_embed)
  @cache = cache
  cache.logits
end

#generate_from_ids(start_ids, max_tokens, temperature) ⇒ `Object`

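Autoregressive sampling from a starting token-ID list. Returns the starting IDs followed by max_tokens sampled IDs; each step runs forward on at most the last context_length tokens. start_ids is expected to be non-empty, because the copy loop seeds from start_ids[0].

Tokenizing a prompt string would drag the French tokenizer (which uses unicode_normalize and complex regex) into the Spinel-compiled binary; instead, the caller pre-tokenizes and passes IDs.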

# File 'lib/toy/models/transformer.rb', line 1393

def generate_from_ids(start_ids, max_tokens, temperature)
  # Anchor start_ids as an IntArray for Spinel param-type inference.
  n_start = start_ids.length
  # Copy start_ids into a fresh IntArray we'll grow.
  tokens = [start_ids[0]]
  i = 1
  while i < n_start
    tokens.push(start_ids[i])
    i += 1
  end

  step = 0
  while step < max_tokens
    ctx_len = tokens.length
    if ctx_len > @context_length
      ctx_len = @context_length
    end
    # Build the trailing-window context.
    ctx = [tokens[tokens.length - ctx_len]]
    j = 1
    while j < ctx_len
      ctx.push(tokens[tokens.length - ctx_len + j])
      j += 1
    end

    logits = self.forward(ctx)
    next_id = self.sample_logits_row(logits, ctx_len - 1, temperature)
    tokens.push(next_id)
    step += 1
  end
  tokens
end
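
The control flow can be sketched without a model by passing the "next token" rule in as a block. `trailing_window` and `generate` are hypothetical helpers written in ordinary CRuby style; the method above builds the same arrays with seed-then-push loops for Spinel's type inference.

```ruby
# Last `context_length` tokens, or all of them if there are fewer.
def trailing_window(tokens, context_length)
  tokens.last([tokens.length, context_length].min)
end

# Appends max_tokens new IDs, each chosen by the block from the
# current trailing window. Returns prompt plus generated IDs.
def generate(start_ids, max_tokens, context_length)
  tokens = start_ids.dup
  max_tokens.times do
    ctx = trailing_window(tokens, context_length)
    tokens << yield(ctx)
  end
  tokens
end

# Stand-in for forward + sampling: the next ID is the sum of the window.
out = generate([1, 2], 3, 2) { |ctx| ctx.sum }
p out
```

With a window of two, each new ID depends only on the previous two, which is the same truncation the real loop applies once the sequence exceeds context_length.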

#hsplit_heads(d_concat) ⇒ `Object`

Split a (T × d_model) matrix back into n_heads × (T × d_head) heads.

# File 'lib/toy/models/transformer.rb', line 980

def hsplit_heads(d_concat)
  t_seq = d_concat.nrows
  out = [Mat.new(t_seq, @d_head)]
  h = 1
  while h < @n_heads
    out.push(Mat.new(t_seq, @d_head))
    h += 1
  end
  h = 0
  while h < @n_heads
    base = h * @d_head
    m = out[h]
    i = 0
    while i < t_seq
      j = 0
      while j < @d_head
        m.flat[i * @d_head + j] = d_concat.flat[i * @d_model + (base + j)]
        j += 1
      end
      i += 1
    end
    h += 1
  end
  out
end

#hstack_heads(per_head) ⇒ `Object`

Concatenate per-head outputs side by side: n_heads × (T × d_head) → (T × d_model)

# File 'lib/toy/models/transformer.rb', line 677

def hstack_heads(per_head)
  t = per_head[0].head_out.nrows
  out = Mat.new(t, @d_model)
  h = 0
  while h < @n_heads
    head = per_head[h].head_out
    base = h * @d_head
    i = 0
    while i < t
      j = 0
      while j < @d_head
        out.flat[i * @d_model + (base + j)] = head.flat[i * @d_head + j]
        j += 1
      end
      i += 1
    end
    h += 1
  end
  out
end
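
Stacking and splitting are inverses of each other, which a small round trip makes concrete. The sketch uses nested arrays and the hypothetical names `hstack` and `hsplit`; the methods above do the same index arithmetic on flat `Mat` storage.

```ruby
# per_head: n_heads matrices, each T rows of d_head values.
# Returns T rows of n_heads * d_head values.
def hstack(per_head)
  t = per_head[0].length
  Array.new(t) { |i| per_head.flat_map { |head| head[i] } }
end

# Inverse of hstack: slice each row into n_heads chunks of equal width.
def hsplit(concat, n_heads)
  d_head = concat[0].length / n_heads
  Array.new(n_heads) do |h|
    concat.map { |row| row[h * d_head, d_head] }
  end
end

heads = [[[1, 2], [3, 4]],   # head 0: T = 2, d_head = 2
         [[5, 6], [7, 8]]]   # head 1
concat = hstack(heads)
p concat
p hsplit(concat, 2)
```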

#rms_norm(x, gamma) ⇒ `Object`

RMSNorm: y_j = gamma_j * x_j / sqrt(mean(x²) + eps), per row. Returns a NormResult holding the normed Mat and the per-row rms.

# File 'lib/toy/models/transformer.rb', line 598

def rms_norm(x, gamma)
  eps = RMS_EPS_DEFAULT
  d = gamma.length
  t = x.nrows
  rms = Array.new(t, 0.0)
  out = Mat.new(t, d)

  i = 0
  while i < t
    sumsq = 0.0
    j = 0
    while j < d
      v = x.flat[i * d + j]
      sumsq += v * v
      j += 1
    end
    r = Math.sqrt(sumsq / d + eps)
    rms[i] = r
    j = 0
    while j < d
      out.flat[i * d + j] = x.flat[i * d + j] * gamma[j] / r
      j += 1
    end
    i += 1
  end

  NormResult.new(out, rms)
end
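
A single-row version shows what the normalization does. `rms_norm_row` is a hypothetical helper, and the default `eps` of 1e-5 is an assumption, since the value of `RMS_EPS_DEFAULT` is not shown in this section.

```ruby
# Normalizes one row; returns [y, r] where r = sqrt(mean(x^2) + eps).
def rms_norm_row(x, gamma, eps = 1e-5)
  r = Math.sqrt(x.sum { |v| v * v } / x.length + eps)
  y = x.each_with_index.map { |v, j| v * gamma[j] / r }
  [y, r]
end

x = [3.0, 4.0]
gamma = [1.0, 1.0]
y, r = rms_norm_row(x, gamma, 0.0)
mean_sq = y.sum { |v| v * v } / y.length
puts r
puts mean_sq
```

With gamma all ones and eps set to zero, the output row has a mean square of one, which is the property the layer is named for.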

#rms_norm_backward(x, gamma, rms, dy, target_dgamma) ⇒ `Object`

RMSNorm backward.

For y = gamma * x / r,  with r = sqrt(mean(x²) + eps):
  dL/dx_k    = (dy_k * gamma_k - x_k * coef) / r,
      coef = (Σ_j dy_j * gamma_j * x_j) / (d * r²)
  dL/dgamma_j (summed over rows) += dy_j * x_j / r

`rms` is the FloatArray of per-row r values cached from the forward pass, which saves recomputing sumsq.

# File 'lib/toy/models/transformer.rb', line 922

def rms_norm_backward(x, gamma, rms, dy, target_dgamma)
  d = gamma.length
  t_seq = x.nrows
  dx = Mat.new(t_seq, d)

  i = 0
  while i < t_seq
    r = rms[i]

    inner = 0.0
    j = 0
    while j < d
      inner += dy.flat[i * d + j] * gamma[j] * x.flat[i * d + j]
      j += 1
    end
    coef = inner / (d * r * r)

    j = 0
    while j < d
      dx.flat[i * d + j] =
        (dy.flat[i * d + j] * gamma[j] - x.flat[i * d + j] * coef) / r
      target_dgamma[j] = target_dgamma[j] +
                         dy.flat[i * d + j] * x.flat[i * d + j] / r
      j += 1
    end
    i += 1
  end

  dx
end
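
The dx formula above can be checked numerically on one row. In this sketch the loss is taken to be L = Σ_j dy_j · y_j for a fixed dy, and `rms_forward`, `rms_backward`, and the eps value are assumptions introduced for the check.

```ruby
def rms_forward(x, gamma, eps)
  r = Math.sqrt(x.sum { |v| v * v } / x.length + eps)
  [x.each_with_index.map { |v, j| v * gamma[j] / r }, r]
end

# Same formulas as the doc comment: returns [dx, dgamma] for one row.
def rms_backward(x, gamma, r, dy)
  d = x.length
  inner = (0...d).sum { |j| dy[j] * gamma[j] * x[j] }
  coef = inner / (d * r * r)
  dx = Array.new(d) { |k| (dy[k] * gamma[k] - x[k] * coef) / r }
  dgamma = Array.new(d) { |j| dy[j] * x[j] / r }
  [dx, dgamma]
end

x = [0.5, -1.0, 2.0]
gamma = [1.0, 0.5, 2.0]
dy = [0.3, -0.2, 0.1]
eps = 1e-5
_y, r = rms_forward(x, gamma, eps)
dx, _dgamma = rms_backward(x, gamma, r, dy)

# Scalar loss whose gradient with respect to y is exactly dy.
loss = lambda do |xs|
  ys, _r = rms_forward(xs, gamma, eps)
  ys.zip(dy).sum { |v, w| v * w }
end

h = 1e-6
max_err = 0.0
x.each_index do |k|
  plus = x.dup
  plus[k] += h
  minus = x.dup
  minus[k] -= h
  numeric = (loss.call(plus) - loss.call(minus)) / (2.0 * h)
  err = (numeric - dx[k]).abs
  max_err = err if err > max_err
end
puts max_err
```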

#sample_logits_row(logits, row, temperature) ⇒ `Object`

Sample a token ID from row `row` of `logits` (T × vocab_size flat). If temperature <= 0 it returns the argmax; otherwise it applies a temperature-scaled softmax and draws by walking the cumulative sum. `rand(N).to_f / N` gives a uniform [0,1) under both Spinel (where bare `rand` returns C’s int rand) and CRuby.

# File 'lib/toy/models/transformer.rb', line 1430

def sample_logits_row(logits, row, temperature)
  v = logits.ncols
  base = row * v
  if temperature <= 0.0
    best_id  = 0
    best_val = logits.flat[base]
    j = 1
    while j < v
      val = logits.flat[base + j]
      if val > best_val
        best_val = val
        best_id  = j
      end
      j += 1
    end
    return best_id
  end

  inv_t = 1.0 / temperature

  # Stable-softmax: subtract the max before exp.
  mx = logits.flat[base]
  j = 1
  while j < v
    val = logits.flat[base + j]
    if val > mx
      mx = val
    end
    j += 1
  end

  sum = 0.0
  j = 0
  while j < v
    sum = sum + Math.exp((logits.flat[base + j] - mx) * inv_t)
    j += 1
  end

  r   = (rand(1_000_000).to_f / 1_000_000.0) * sum
  cum = 0.0
  j = 0
  while j < v
    cum = cum + Math.exp((logits.flat[base + j] - mx) * inv_t)
    if r < cum
      return j
    end
    j += 1
  end
  v - 1
end
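
The sampling logic is easier to test when the uniform draw is passed in. The sketch below assumes a hypothetical `sample_row` that takes one row of logits as a plain array and a value `u` in [0, 1); it is not the method above, which reads from a flat `Mat` and calls `rand` itself.

```ruby
# temperature <= 0: index of the first maximum.
# Otherwise: inverse-CDF sampling over unnormalized softmax weights.
def sample_row(logits, temperature, u)
  return logits.index(logits.max) if temperature <= 0.0

  mx = logits.max
  weights = logits.map { |v| Math.exp((v - mx) / temperature) }
  r = u * weights.sum
  cum = 0.0
  weights.each_with_index do |w, j|
    cum += w
    return j if r < cum
  end
  logits.length - 1 # guard against rounding at the top of the range
end

greedy = sample_row([1.0, 3.0, 2.0], 0.0, 0.5)
low    = sample_row([0.0, 0.0], 1.0, 0.25)
high   = sample_row([0.0, 0.0], 1.0, 0.75)
p [greedy, low, high]
```

In normal use `u` would come from something like `rand(1_000_000).to_f / 1_000_000.0`, as in the method above. Multiplying `u` by the weight sum avoids normalizing every weight.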

#self_attention(h_in, block) ⇒ `Object`

Multi-head self-attention. Returns AttnResult.

# File 'lib/toy/models/transformer.rb', line 699

def self_attention(h_in, block)
  # Force h_in's type inference via an early Mat-typed access.
  t_seq = h_in.nrows
  inv_sqrt = 1.0 / Math.sqrt(@d_head)

  # Build per-head caches with the seed-then-push pattern.
  head0 = self_attention_head(h_in, block, 0, inv_sqrt)
  per_head = [head0]
  hi = 1
  while hi < @n_heads
    per_head.push(self_attention_head(h_in, block, hi, inv_sqrt))
    hi += 1
  end

  concat = hstack_heads(per_head)
  proj   = concat.matmul(block.w_o)

  AttnResult.new(proj, AttnCache.new(per_head, concat))
end

#self_attention_backward(d_proj, h_in, attn_cache, block, target_block) ⇒ `Object`

Self-attention backward. Writes per-head w_q/k/v + w_o grads into target_block. Returns d_h_in.

# File 'lib/toy/models/transformer.rb', line 1039

def self_attention_backward(d_proj, h_in, attn_cache, block, target_block)
  t_seq = h_in.nrows        # type hint
  inv_sqrt = 1.0 / Math.sqrt(@d_head)

  # proj = concat · w_o
  d_w_o = attn_cache.concat.t_matmul(d_proj)
  d_concat = d_proj.matmul_t(block.w_o)
  target_block.w_o = d_w_o

  d_outs = self.hsplit_heads(d_concat)

  # Build per-head Q/K/V grads (Mat per head). Seed-then-push for typing.
  d_w_q_heads = [Mat.new(@d_model, @d_head)]
  d_w_k_heads = [Mat.new(@d_model, @d_head)]
  d_w_v_heads = [Mat.new(@d_model, @d_head)]
  h = 1
  while h < @n_heads
    d_w_q_heads.push(Mat.new(@d_model, @d_head))
    d_w_k_heads.push(Mat.new(@d_model, @d_head))
    d_w_v_heads.push(Mat.new(@d_model, @d_head))
    h += 1
  end

  d_h_in = Mat.new(t_seq, @d_model)

  h = 0
  while h < @n_heads
    head = attn_cache.per_head[h]
    d_out_h = d_outs[h]

    # out = attn · V
    d_attn = d_out_h.matmul_t(head.v)
    d_v    = head.attn.t_matmul(d_out_h)

    # softmax row-wise (masked entries had attn = 0 so contribute nothing)
    d_scores = self.softmax_rows_backward(head.attn, d_attn)
    d_scores.scale!(inv_sqrt)

    # scores = Q · Kᵀ
    d_q = d_scores.matmul(head.k)
    d_k = d_scores.transpose.matmul(head.q)

    d_w_q_heads[h] = h_in.t_matmul(d_q)
    d_w_k_heads[h] = h_in.t_matmul(d_k)
    d_w_v_heads[h] = h_in.t_matmul(d_v)

    d_h_in.add!(d_q.matmul_t(block.w_q[h]))
    d_h_in.add!(d_k.matmul_t(block.w_k[h]))
    d_h_in.add!(d_v.matmul_t(block.w_v[h]))

    h += 1
  end

  target_block.w_q = d_w_q_heads
  target_block.w_k = d_w_k_heads
  target_block.w_v = d_w_v_heads
  d_h_in
end

#self_attention_head(h_in, block, head_idx, inv_sqrt) ⇒ `Object`

# File 'lib/toy/models/transformer.rb', line 719

def self_attention_head(h_in, block, head_idx, inv_sqrt)
  q = h_in.matmul(block.w_q[head_idx])
  k = h_in.matmul(block.w_k[head_idx])
  v = h_in.matmul(block.w_v[head_idx])

  # scores = (Q · Kᵀ) / sqrt(d_head)
  scores = q.matmul_t(k)
  scores.scale!(inv_sqrt)
  apply_causal_mask!(scores, 0)

  softmax_rows!(scores)
  head_out = scores.matmul(v)

  HeadCache.new(q, k, v, scores, head_out)
end
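
The per-head computation can be sketched end to end on nested arrays, starting from already-projected Q, K, and V. `matmul` and `causal_attention` are hypothetical helpers; masking is done here by taking the softmax over the visible prefix only, which gives the same weights as writing a very negative score into the later columns first.

```ruby
# Naive matrix product for arrays of rows.
def matmul(a, b)
  bt = b.transpose
  a.map { |row| bt.map { |col| row.zip(col).sum { |x, y| x * y } } }
end

# q, k, v: T rows of d_head values. Returns [weights, head_out].
def causal_attention(q, k, v)
  scale = 1.0 / Math.sqrt(q[0].length)
  scores = matmul(q, k.transpose).map { |row| row.map { |s| s * scale } }
  weights = scores.each_with_index.map do |row, i|
    visible = row[0..i]                 # row i may see columns 0..i
    mx = visible.max
    exps = visible.map { |s| Math.exp(s - mx) }
    sum = exps.sum
    exps.map { |e| e / sum } + Array.new(row.length - i - 1, 0.0)
  end
  [weights, matmul(weights, v)]
end

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, head_out = causal_attention(q, k, v)
weights.each { |row| p row }
p head_out[0]
```

The first position can only attend to itself, so its output equals the first row of V.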

#sgd_step_block(p_block, g_block, lr) ⇒ `Object`

# File 'lib/toy/models/transformer.rb', line 1163

def sgd_step_block(p_block, g_block, lr)
  self.sgd_step_vec(p_block.norm1_gamma, g_block.norm1_gamma, lr)
  self.sgd_step_vec(p_block.norm2_gamma, g_block.norm2_gamma, lr)
  self.sgd_step_mat(p_block.w_o,   g_block.w_o,   lr)
  self.sgd_step_mat(p_block.w_ff1, g_block.w_ff1, lr)
  self.sgd_step_mat(p_block.w_ff2, g_block.w_ff2, lr)

  h = 0
  while h < @n_heads
    self.sgd_step_mat(p_block.w_q[h], g_block.w_q[h], lr)
    self.sgd_step_mat(p_block.w_k[h], g_block.w_k[h], lr)
    self.sgd_step_mat(p_block.w_v[h], g_block.w_v[h], lr)
    h += 1
  end
end

#sgd_step_mat(p, g, lr) ⇒ `Object`

# File 'lib/toy/models/transformer.rb', line 1145

def sgd_step_mat(p, g, lr)
  n = p.flat.length
  i = 0
  while i < n
    p.flat[i] -= lr * g.flat[i]
    i += 1
  end
end

#sgd_step_vec(p, g, lr) ⇒ `Object`

# File 'lib/toy/models/transformer.rb', line 1154

def sgd_step_vec(p, g, lr)
  n = p.length
  i = 0
  while i < n
    p[i] -= lr * g[i]
    i += 1
  end
end

#softmax_rows!(m) ⇒ `Object`

Row-wise softmax with numerical-stability max-shift, in place on `m`.

# File 'lib/toy/models/transformer.rb', line 628

def softmax_rows!(m)
  t = m.nrows
  n = m.ncols
  i = 0
  while i < t
    base = i * n
    mx = m.flat[base]
    j = 1
    while j < n
      v = m.flat[base + j]
      if v > mx
        mx = v
      end
      j += 1
    end
    sum = 0.0
    j = 0
    while j < n
      e = Math.exp(m.flat[base + j] - mx)
      m.flat[base + j] = e
      sum += e
      j += 1
    end
    j = 0
    while j < n
      m.flat[base + j] = m.flat[base + j] / sum
      j += 1
    end
    i += 1
  end
end
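
To see why the max-shift matters, here is a small sketch on plain Ruby arrays. `naive_softmax` and `stable_softmax` are hypothetical helpers written for this illustration only; they are not part of TransformerLM.

```ruby
# Hypothetical helpers on plain arrays, not the Mat-based method above.

# Naive softmax: exponentiates the raw scores directly.
def naive_softmax(row)
  exps = row.map { |v| Math.exp(v) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

# Max-shifted softmax: subtracts the row maximum first, so the largest
# exponent is exp(0) = 1 and nothing overflows.
def stable_softmax(row)
  mx = row.max
  exps = row.map { |v| Math.exp(v - mx) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

big_row = [1000.0, 1001.0, 1002.0]

# Math.exp(1000.0) overflows to Infinity, so every entry is Infinity / Infinity.
naive = naive_softmax(big_row)

# Shifting by the max gives the same result as softmax([0.0, 1.0, 2.0]).
stable = stable_softmax(big_row)
```

Softmax is invariant to adding a constant to every entry of a row, so the shift changes only the intermediate values, not the resulting probabilities.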

#softmax_rows_backward(softmax_out, d_softmax) ⇒ `Object`

Row-wise softmax backward (for attention).

d_scores[i,j] = attn[i,j] * (d_attn[i,j] - Σk attn[i,k]·d_attn[i,k])

# File 'lib/toy/models/transformer.rb', line 955

def softmax_rows_backward(softmax_out, d_softmax)
  t_seq = softmax_out.nrows
  n = softmax_out.ncols
  out = Mat.new(t_seq, n)
  i = 0
  while i < t_seq
    base = i * n
    s = 0.0
    j = 0
    while j < n
      s += softmax_out.flat[base + j] * d_softmax.flat[base + j]
      j += 1
    end
    j = 0
    while j < n
      out.flat[base + j] = softmax_out.flat[base + j] *
                            (d_softmax.flat[base + j] - s)
      j += 1
    end
    i += 1
  end
  out
end
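
The formula above can be checked against finite differences. The sketch below works on a single row stored as a plain array; `softmax_row`, `softmax_row_backward`, and `numeric_grad` are hypothetical helpers for this check, and the -1e30 entry stands in for a causally masked position.

```ruby
# Hypothetical stand-ins on plain arrays; the real method operates on Mat.

def softmax_row(row)
  mx = row.max
  exps = row.map { |v| Math.exp(v - mx) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

# d_scores[j] = attn[j] * (d_attn[j] - sum_k attn[k] * d_attn[k])
def softmax_row_backward(attn, d_attn)
  s = attn.zip(d_attn).sum { |a, d| a * d }
  attn.zip(d_attn).map { |a, d| a * (d - s) }
end

# Central finite difference of L(scores) = sum_j w[j] * softmax(scores)[j],
# for which the upstream gradient dL/d_attn is simply w.
def numeric_grad(scores, w, eps = 1e-6)
  scores.each_index.map do |j|
    up = scores.dup
    dn = scores.dup
    up[j] += eps
    dn[j] -= eps
    l_up = softmax_row(up).zip(w).sum { |a, c| a * c }
    l_dn = softmax_row(dn).zip(w).sum { |a, c| a * c }
    (l_up - l_dn) / (2.0 * eps)
  end
end

scores = [0.3, -1.2, 2.0, -1e30] # last entry plays the role of a masked position
w      = [0.5, -0.25, 1.5, 0.75]

attn     = softmax_row(scores)
analytic = softmax_row_backward(attn, w)
numeric  = numeric_grad(scores, w)
```

The masked position has attn = 0, so its analytic gradient is zero regardless of the upstream value. The analytic gradients of a row also sum to zero, since shifting every score by the same constant leaves the softmax unchanged.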

#transformer_block(x, block) ⇒ `Object`

One transformer block (pre-norm). Returns BlockResult. Locals are explicit so Spinel can type-trace argument types into the called methods (passing `nr1.y` straight through does not propagate the type).

# File 'lib/toy/models/transformer.rb', line 1360

def transformer_block(x, block)
  cache = LayerCache.new

  nr1 = rms_norm(x, block.norm1_gamma)
  h1  = nr1.y
  cache.h_norm1 = h1
  cache.rms1    = nr1.rms

  sa = self_attention(h1, block)
  cache.attn_cache = sa.cache
  x_attn = x.plus(sa.proj)
  cache.x_attn = x_attn

  nr2 = rms_norm(x_attn, block.norm2_gamma)
  h2  = nr2.y
  cache.h_norm2 = h2
  cache.rms2    = nr2.rms

  ff = feed_forward(h2, block)
  cache.ff_cache = ff.cache
  x_out = x_attn.plus(ff.out)
  cache.x_out = x_out

  BlockResult.new(x_out, cache)
end

#transformer_block_backward(dx_out, x_in, block, layer_cache, target_block_grads) ⇒ `Object`

Backward through one block. Writes grads into target_block_grads. Returns d_x_in (Mat).

# File 'lib/toy/models/transformer.rb', line 1100

def transformer_block_backward(dx_out, x_in, block, layer_cache, target_block_grads)
  # x_in is only passed as an arg below — never accessed directly. Spinel's
  # body-usage parameter inference needs at least one method call to type
  # the param. `.nrows` is a Mat-only method, so this anchors x_in's type.
  _x_t = x_in.nrows

  # FFN sublayer residual: x_out = x_attn + ff_out → grad flows to both branches.
  d_h_norm2 = self.feed_forward_backward(dx_out, layer_cache.h_norm2,
                                         layer_cache.ff_cache, block, target_block_grads)
  d_x_attn_via_norm = self.rms_norm_backward(layer_cache.x_attn, block.norm2_gamma,
                                             layer_cache.rms2, d_h_norm2,
                                             target_block_grads.norm2_gamma)
  d_x_attn = dx_out.plus(d_x_attn_via_norm)

  # Attention sublayer residual: x_attn = x_in + attn_proj.
  d_h_norm1 = self.self_attention_backward(d_x_attn, layer_cache.h_norm1,
                                           layer_cache.attn_cache, block, target_block_grads)
  d_x_in_via_norm = self.rms_norm_backward(x_in, block.norm1_gamma,
                                           layer_cache.rms1, d_h_norm1,
                                           target_block_grads.norm1_gamma)

  d_x_attn.plus(d_x_in_via_norm)
end
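
The residual bookkeeping in this method (the upstream gradient is added to the gradient that comes back through the norm and sublayer) can be illustrated with scalars. In the sketch below, `NORM`, `F1`, and `F2` are hypothetical smooth stand-ins for rms_norm, attention, and the feed-forward sublayer; they are not the real operations, only functions with known derivatives.

```ruby
# Scalar stand-ins (hypothetical). Each function is paired with its derivative.
NORM   = ->(x) { x / Math.sqrt(x * x + 1.0) }
D_NORM = ->(x) { (x * x + 1.0)**-1.5 }
F1     = ->(h) { Math.tanh(h) }           # plays the role of attention
D_F1   = ->(h) { 1.0 - Math.tanh(h)**2 }
F2     = ->(h) { h * h }                  # plays the role of the FFN
D_F2   = ->(h) { 2.0 * h }

# Pre-norm block: each sublayer sees a normalized input and is added
# back onto the residual stream.
def block_forward(x)
  x_attn = x + F1.call(NORM.call(x))
  x_out  = x_attn + F2.call(NORM.call(x_attn))
  [x_attn, x_out]
end

# Mirrors the structure of the method above: each residual add sends the
# incoming gradient down both the skip path and the norm + sublayer path.
def block_backward(dx_out, x_in, x_attn)
  d_h_norm2         = dx_out * D_F2.call(NORM.call(x_attn))
  d_x_attn_via_norm = d_h_norm2 * D_NORM.call(x_attn)
  d_x_attn          = dx_out + d_x_attn_via_norm

  d_h_norm1       = d_x_attn * D_F1.call(NORM.call(x_in))
  d_x_in_via_norm = d_h_norm1 * D_NORM.call(x_in)
  d_x_attn + d_x_in_via_norm
end

x = 0.7
x_attn, _x_out = block_forward(x)
analytic = block_backward(1.0, x, x_attn)

# Central finite difference of x_out with respect to x.
eps = 1e-6
numeric = (block_forward(x + eps)[1] - block_forward(x - eps)[1]) / (2.0 * eps)
```

With an upstream gradient of 1.0, `analytic` and `numeric` should agree to within finite-difference error, which is the same kind of check one could run against the full Mat-based block.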

#transformer_block_into(x, block, cache, ffi_cache) ⇒ `Object`

Same as transformer_block but writes into a pre-existing LayerCache. ffi_cache is the persistent-session FFNFFICache for this block; used only when USE_FFI_MATMUL is true.

# File 'lib/toy/models/transformer.rb', line 828

def transformer_block_into(x, block, cache, ffi_cache)
  nr1 = rms_norm(x, block.norm1_gamma)
  h1  = nr1.y
  cache.h_norm1 = h1
  cache.rms1    = nr1.rms

  sa = self_attention(h1, block)
  cache.attn_cache = sa.cache
  x_attn = x.plus(sa.proj)
  cache.x_attn = x_attn

  nr2 = rms_norm(x_attn, block.norm2_gamma)
  h2  = nr2.y
  cache.h_norm2 = h2
  cache.rms2    = nr2.rms

  if USE_FFI_MATMUL
    ff = feed_forward_ffi(h2, block, ffi_cache)
  else
    ff = feed_forward(h2, block)
  end
  cache.ff_cache = ff.cache
  x_out = x_attn.plus(ff.out)
  cache.x_out = x_out
end

#x_in_for_layer(li) ⇒ `Object`

No `train_step` here: Spinel compiles every class method whether or not it has callers. With no callers in the current program, its IntArray param defaults to `mrb_int`, and the body's `forward(seq_ids)` call then fails to type-check. Each driver instead inlines the forward / backward / optimizer-step sequence at its top level, which is short and makes the per-step cost obvious.

Block i's input is the previous block's output, or the embedded input for block 0.




# File 'lib/toy/models/transformer.rb', line 1289

def x_in_for_layer(li)
  li == 0 ? @cache.x_embed : @cache.layers[li - 1].x_out
end
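
The input-selection rule can be exercised without a model. In this sketch, `LayerCacheStub`, `ForwardCacheStub`, and `x_in_for` are hypothetical stand-ins that only mimic the `x_embed`, `layers`, and `x_out` fields read by the method above; symbols take the place of Mat values. The reverse iteration order is an assumption about how a backward pass would typically consume these inputs.

```ruby
# Hypothetical stand-ins for the cache objects read by x_in_for_layer.
LayerCacheStub   = Struct.new(:x_out)
ForwardCacheStub = Struct.new(:x_embed, :layers)

# Same selection rule as the method above, with the cache passed explicitly.
def x_in_for(cache, li)
  li == 0 ? cache.x_embed : cache.layers[li - 1].x_out
end

cache = ForwardCacheStub.new(
  :embedded,
  [LayerCacheStub.new(:out0), LayerCacheStub.new(:out1)]
)

# Visit layers last-to-first (assumed backward order) and collect each
# layer's forward input.
inputs = (cache.layers.length - 1).downto(0).map { |li| x_in_for(cache, li) }
```

For this two-layer stub, layer 1 receives layer 0's output and layer 0 receives the embedded input; the last layer's own `x_out` is never used as a block input.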
