Module: Toy::LLM::Primitives::GDN

Defined in: lib/toy/llm/primitives/gdn.rb

Constant Summary

NAME = :gdn

Class Method Summary

.decay_gate(sess, a, dt_bias, a_log) ⇒ Object

Log-decay gate: g = -exp(A_log) * softplus(a + dt_bias).
.gated_out(sess, o, z, gamma, eps) ⇒ Object

Gated output norm: GatedRMSNorm(o, z) = rms_norm(o) * gamma * silu(z).
.l2(sess, x, eps) ⇒ Object

L2-normalise a projected q or k along its head dim (the delta rule replaces softmax normalisation with L2-norm).
.l2_train(sess, x, eps) ⇒ Object

TRAINABLE L2 norm over ne0, composed from ops that each have a ggml backward (mul / sum_rows / scale_bias / sqrt / div), because the fused `tnn_l2_norm` (GGML_OP_L2_NORM) has NO backward.
.recur(sess, q, k, v, g, beta, state) ⇒ Object

The recurrence core.
.recur_unrolled(sess, q, k, v, g, beta, state0, s_v, n_heads, head, n_tokens) ⇒ Object

Path-B TRAINABLE recurrence: the gated delta rule expressed as an UNROLLED graph of ops that EACH have a ggml backward (mul / mul_mat / sub / scale / exp / add / reshape) — so training backward comes free and NO fused-kernel backward is needed (ggml has none for GATED_DELTA_NET).
.update_gate(sess, b) ⇒ Object

Update rate: beta = sigmoid(b).
.update_gate_train(sess, b) ⇒ Object

TRAINABLE update gate — sigmoid(b) composed as exp(b)/(1+exp(b)) from ops that each have a ggml backward, because GGML_UNARY_OP_SIGMOID has none.

Class Method Details

.decay_gate(sess, a, dt_bias, a_log) ⇒ `Object`

Log-decay gate: g = -exp(A_log) * softplus(a + dt_bias). a is the projected decay stream [1,H,T,B]; dt_bias and A_log are the block’s per-v-head weights ([1,H,1,1], broadcast). Returned g is the raw LOG-decay the recurrence kernel exps internally. Op order is fixed for ggml broadcast (the [1,H,T,B] softplus term drives the shape; the [1,H,1,1] -exp(A_log) broadcasts onto it).

# File 'lib/toy/llm/primitives/gdn.rb', line 63

def self.decay_gate(sess, a, dt_bias, a_log)
  a_db   = TinyNN.tnn_add(sess, a, dt_bias)
  sp     = TinyNN.tnn_softplus(sess, a_db)
  ea     = TinyNN.tnn_exp(sess, a_log)
  ea_neg = TinyNN.tnn_neg(sess, ea)
  TinyNN.tnn_mul(sess, sp, ea_neg)
end
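As a rough scalar check of the formula above, the gate can be evaluated in plain Ruby without TinyNN. The helper names `softplus` and `decay_gate_ref` are hypothetical and exist only for this sketch; it works on single Floats rather than the [1,H,T,B] tensors.

```ruby
# Hypothetical plain-Ruby reference; not part of Toy::LLM::Primitives::GDN.
# softplus(x) = log(1 + exp(x)), written in a form that avoids overflow.
def softplus(x)
  [x, 0.0].max + Math.log(1.0 + Math.exp(-(x.abs)))
end

# g = -exp(A_log) * softplus(a + dt_bias), for one scalar per v-head.
def decay_gate_ref(a, dt_bias, a_log)
  -Math.exp(a_log) * softplus(a + dt_bias)
end

g     = decay_gate_ref(0.0, 0.0, 0.0) # -1 * log(2)
decay = Math.exp(g)                   # the factor the recurrence applies to the state
```

Because softplus is positive and -exp(A_log) is negative, g is always negative, so the decay factor exp(g) stays in (0, 1).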

.gated_out(sess, o, z, gamma, eps) ⇒ `Object`

Gated output norm: GatedRMSNorm(o, z) = rms_norm(o) * gamma * silu(z). o is the per-head token output (block-sliced from recur); z the output-gate stream; gamma the block’s norm weight; eps the Float epsilon. tnn_rms_norm already folds the gamma scale, so this is rms_norm(o,gamma) * silu(z). The normed term drives the shape; silu(z) broadcasts/multiplies. Returns the gated output (input to the block’s out projection).

# File 'lib/toy/llm/primitives/gdn.rb', line 180

def self.gated_out(sess, o, z, gamma, eps)
  n  = TinyNN.tnn_rms_norm(sess, o, gamma, eps)
  sz = TinyNN.tnn_silu(sess, z)
  TinyNN.tnn_mul(sess, n, sz)
end
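A minimal plain-Ruby sketch of the same formula for a single head vector may help when checking values by hand. It assumes the usual RMS-norm convention, x / sqrt(mean(x^2) + eps), which is an assumption about `tnn_rms_norm` and not something this page confirms; `silu` and `gated_out_ref` are hypothetical names.

```ruby
# Hypothetical plain-Ruby reference; not part of Toy::LLM::Primitives::GDN.
# silu(z) = z * sigmoid(z)
def silu(z)
  z / (1.0 + Math.exp(-z))
end

# ASSUMPTION: rms_norm(o)[i] = o[i] / sqrt(mean(o^2) + eps).
# o, z, gamma are equal-length Arrays of Float for one head and one token.
def gated_out_ref(o, z, gamma, eps)
  mean_sq = o.sum { |x| x * x } / o.length
  inv_rms = 1.0 / Math.sqrt(mean_sq + eps)
  o.each_index.map { |i| o[i] * inv_rms * gamma[i] * silu(z[i]) }
end

y = gated_out_ref([3.0, 4.0], [0.0, 1.0], [1.0, 1.0], 0.0)
```

Here the first element of `y` is zero because silu(0) = 0: the gate stream z can switch a channel off regardless of the normalised value.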

.l2(sess, x, eps) ⇒ `Object`

L2-normalise a projected q or k along its head dim (the delta rule replaces softmax normalisation with L2-norm). x is the block’s already-projected (and conv’d) q or k tensor; eps the Float epsilon. Returns the normalised handle. Called twice by the block (once for q, once for k).



# File 'lib/toy/llm/primitives/gdn.rb', line 36

def self.l2(sess, x, eps)
  TinyNN.tnn_l2_norm(sess, x, eps)
end

.l2_train(sess, x, eps) ⇒ `Object`

TRAINABLE L2 norm over ne0, composed from ops that each have a ggml backward (mul / sum_rows / scale_bias / sqrt / div), because the fused `tnn_l2_norm` (GGML_OP_L2_NORM) has NO backward. Used by the trainable GDN block; the fused `l2` above stays the inference path.

y = x / sqrt(sum_ne0(x^2) + eps)

# File 'lib/toy/llm/primitives/gdn.rb', line 45

def self.l2_train(sess, x, eps)
  sq     = TinyNN.tnn_mul(sess, x, x)            # x^2
  ss     = TinyNN.tnn_sum_rows(sess, sq)         # sum over ne0 -> [1,...]
  ss_eps = TinyNN.tnn_scale_bias(sess, ss, 1.0, eps)  # + eps
  denom  = TinyNN.tnn_sqrt(sess, ss_eps)         # [1,...]
  # DIV backward does NOT reduce a broadcast src1, so materialise denom to
  # x's full shape first (REPEAT backward sums the grad back correctly);
  # the div is then same-shape.
  denom_full = TinyNN.tnn_repeat(sess, denom, x)
  TinyNN.tnn_div(sess, x, denom_full)
end
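The formula y = x / sqrt(sum_ne0(x^2) + eps) can be checked on one row with plain Ruby. This is only a sketch of the arithmetic the composed graph performs; `l2_ref` is a hypothetical name and a Ruby Array stands in for one ne0 row.

```ruby
# Hypothetical plain-Ruby reference; not part of Toy::LLM::Primitives::GDN.
# x is one row (the ne0 axis) as an Array of Float.
def l2_ref(x, eps)
  denom = Math.sqrt(x.sum { |v| v * v } + eps) # same order as the graph: sum, + eps, sqrt
  x.map { |v| v / denom }
end

y    = l2_ref([3.0, 4.0], 0.0)
norm = Math.sqrt(y.sum { |v| v * v })
```

With eps = 0 the result has unit length; a positive eps pulls the norm slightly below 1.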

.recur(sess, q, k, v, g, beta, state) ⇒ `Object`

The recurrence core. q,k must be L2-normed; beta sigmoid’d; g the raw log-decay; state the [S_v*S_v*H,K,B,1] carry. Returns the packed [S_v*H, T*B + K*S_v*B] output (token outputs then state snapshots). The block slices the leading T*B token columns.



# File 'lib/toy/llm/primitives/gdn.rb', line 92

def self.recur(sess, q, k, v, g, beta, state)
  TinyNN.tnn_gated_delta_net(sess, q, k, v, g, beta, state)
end

.recur_unrolled(sess, q, k, v, g, beta, state0, s_v, n_heads, head, n_tokens) ⇒ `Object`

Path-B TRAINABLE recurrence: the gated delta rule expressed as an UNROLLED graph of ops that EACH have a ggml backward (mul / mul_mat / sub / scale / exp / add / reshape), so training backward comes free and NO fused-kernel backward is needed (ggml has none for GATED_DELTA_NET). The fused `recur` above stays the fast INFERENCE path; this is its train-time twin, gated for numeric parity.

Reproduces the fused kernel’s token outputs for the SCALAR-decay path (g->ne0 == 1, the Dragon/Qwen3-Next per-head gate). Single seq (B=1), single head per call — the block loops heads/seqs around it in Phase 5. Inputs are the packed projection tensors (q,k,v = [S_v,1,T,1]; g,beta = [1,1,T,1]; state0 = [S_v,S_v]); per-token vectors are sliced via views internally (no ptr-array params → no Spinel IntArray-lock landmine). q/k must be pre-L2-normed and beta pre-sigmoid’d by the caller (the kernel contract). Returns [S_v, T] — token outputs concat’d along ne1.

per token t (matching ops.cpp:10731 exactly):
  S = S * exp(g_t)                  decay  (scalar [1,1] broadcast)
  u = matmul(S, k_t)                u[j] = sum_i S[i,j] k[i]
  d = (v_t - u) * beta_t            delta
  S = S + matmul(k_row, d_row)      outer  (k⊗d)[i,j] = k[i] d[j]
  o_t = matmul(S, q_t)              o[j] = sum_i S[i,j] q[i]

The kernel’s 1/√S_v output scale is folded into a SINGLE pre-scale of q (q enters only the output read, never the state, so o = sum_i S (scale·q) is exact). Done once on the contiguous q — NOT per-token on o — because a per-token ggml_scale’s BACKWARD receives a view-shaped grad from the concat and asserts ggml_is_padded_1d (ggml.c:3392). One scale on the whole tensor keeps the backward grad contiguous.

ONE head of the recurrence. q,k,v are the packed [S_v, n_heads, T, 1] projections; g,beta the packed [1, n_heads, T, 1] gates; state0 this head’s [S_v,S_v] carry. `head` selects the head; per-token vectors are strided views into the packed tensors (token stride = S_v·n_heads, head base = S_v·head, i.e. the ggml [S_v,H,T,B] layout). Returns [S_v, T] for this head; the block concats heads along ne0. n_heads=1/head=0 is the plain single-head case (contiguous per-token, the Phase-4 gate shape).
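The byte offsets behind those strided views can be worked out by hand. The sketch below repeats the stride arithmetic for f32 tensors in plain Ruby; `qkv_offset` and `gate_offset` are hypothetical helper names used only for illustration.

```ruby
# Hypothetical helpers; they only reproduce the offset arithmetic described above.
# Byte offset of token t, head `head`, in a packed [S_v, n_heads, T, 1] f32 tensor.
def qkv_offset(s_v, n_heads, head, t)
  fbytes     = 4                      # sizeof(f32)
  tok_stride = s_v * n_heads * fbytes # bytes between consecutive tokens
  head_base  = s_v * head * fbytes    # bytes to this head's first element
  head_base + t * tok_stride
end

# Byte offset of the scalar gate for token t, head `head`, in [1, n_heads, T, 1].
def gate_offset(n_heads, head, t)
  fbytes = 4
  (head + t * n_heads) * fbytes
end

q_off = qkv_offset(4, 2, 1, 3) # 16 + 3 * 32
g_off = gate_offset(2, 1, 3)   # (1 + 3 * 2) * 4
```

For S_v = 4 and two heads, token 3 of head 1 starts 112 bytes into q, k or v (element 28), and its gate scalar sits at byte 28 of g or beta.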

# File 'lib/toy/llm/primitives/gdn.rb', line 133

def self.recur_unrolled(sess, q, k, v, g, beta, state0, s_v, n_heads, head, n_tokens)
  scale = 1.0 / Math.sqrt(s_v.to_f)
  fbytes = 4                          # sizeof(f32)
  tok_stride  = s_v * n_heads * fbytes # bytes between this head's tokens
  head_base   = s_v * head * fbytes    # byte offset to this head's col 0
  gtok_stride = n_heads * fbytes       # g/beta [1,H,T,1]: token stride
  ghead_base  = head * fbytes
  q_s = TinyNN.tnn_scale(sess, q, scale)   # pre-scaled q (contiguous)
  s_mat = state0
  t_out = TinyNN.tnn_null_ptr
  t = 0
  while t < n_tokens
    # Per-token slices: [S_v,1] vectors (S_v contiguous), [1,1] scalars.
    q_t = TinyNN.tnn_view_2d(sess, q_s, s_v, 1, tok_stride, head_base + t * tok_stride)
    k_t = TinyNN.tnn_view_2d(sess, k,   s_v, 1, tok_stride, head_base + t * tok_stride)
    v_t = TinyNN.tnn_view_2d(sess, v,   s_v, 1, tok_stride, head_base + t * tok_stride)
    g_t = TinyNN.tnn_view_2d(sess, g,    1, 1, gtok_stride, ghead_base + t * gtok_stride)
    b_t = TinyNN.tnn_view_2d(sess, beta, 1, 1, gtok_stride, ghead_base + t * gtok_stride)

    eg    = TinyNN.tnn_exp(sess, g_t)              # [1,1]
    s_dec = TinyNN.tnn_mul(sess, s_mat, eg)        # [S_v,S_v] * [1,1] bcast
    u     = TinyNN.tnn_matmul(sess, s_dec, k_t)    # [S_v,1]  u[j]
    diff  = TinyNN.tnn_sub(sess, v_t, u)           # [S_v,1]
    d     = TinyNN.tnn_mul(sess, diff, b_t)        # [S_v,1] * [1,1] bcast
    k_row = TinyNN.tnn_reshape_2d(sess, k_t, 1, s_v)  # [1,S_v]
    d_row = TinyNN.tnn_reshape_2d(sess, d, 1, s_v)    # [1,S_v]
    outer = TinyNN.tnn_matmul(sess, k_row, d_row)  # [S_v,S_v] [i,j]=k[i]d[j]
    s_mat = TinyNN.tnn_add(sess, s_dec, outer)     # state update
    o_t   = TinyNN.tnn_matmul(sess, s_mat, q_t)    # [S_v,1]  o[j] (already scaled)

    if t == 0
      t_out = o_t
    else
      t_out = TinyNN.tnn_concat(sess, t_out, o_t, 1)  # stack along ne1
    end
    t = t + 1
  end
  t_out
end
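For checking the unrolled graph against hand-computed values, the per-token steps listed above can be written over nested Ruby Arrays. This is a sketch of the arithmetic only, not the graph builder: `gdn_recur_ref` is a hypothetical name, `s[i][j]` follows the index convention of the pseudo-code, and the 1/√S_v scale is applied to the read instead of to q (equivalent, since the read is linear in q).

```ruby
# Hypothetical plain-Ruby reference for ONE head; not part of the module.
# q, k, v: Arrays of T vectors (each S_v Floats); g, beta: Arrays of T Floats;
# state0: S_v x S_v nested Array. Returns T output vectors of length S_v.
def gdn_recur_ref(q, k, v, g, beta, state0)
  s_v   = state0.length
  scale = 1.0 / Math.sqrt(s_v)
  s     = state0.map(&:dup)
  q.each_index.map do |t|
    decay = Math.exp(g[t])
    s = s.map { |row| row.map { |x| x * decay } }                            # S = S * exp(g_t)
    u = (0...s_v).map { |j| (0...s_v).sum { |i| s[i][j] * k[t][i] } }        # u[j] = sum_i S[i,j] k[i]
    d = (0...s_v).map { |j| (v[t][j] - u[j]) * beta[t] }                     # delta
    s = (0...s_v).map { |i| (0...s_v).map { |j| s[i][j] + k[t][i] * d[j] } } # S += k (outer) d
    (0...s_v).map { |j| scale * (0...s_v).sum { |i| s[i][j] * q[t][i] } }    # o[j], scaled
  end
end

zero = [[0.0, 0.0], [0.0, 0.0]]
key  = [1.0, 0.0] # unit-norm key, reused as the query
out  = gdn_recur_ref([key, key], [key, key], [[2.0, 3.0], [5.0, 7.0]],
                     [0.0, 0.0], [1.0, 1.0], zero)
```

With no decay (g = 0) and beta = 1, writing a second value under the same unit-norm key replaces the first: the outputs are [2, 3]/√2 and then [5, 7]/√2.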

.update_gate(sess, b) ⇒ `Object`

Update rate: beta = sigmoid(b). b is the projected update stream [1,H,T,B]. The kernel uses beta directly, so the sigmoid lives here. Returns beta.



# File 'lib/toy/llm/primitives/gdn.rb', line 74

def self.update_gate(sess, b)
  TinyNN.tnn_sigmoid(sess, b)
end

.update_gate_train(sess, b) ⇒ `Object`

TRAINABLE update gate: sigmoid(b) composed as exp(b)/(1+exp(b)) from ops that each have a ggml backward, because GGML_UNARY_OP_SIGMOID has none. Same-shape throughout (no broadcast). The fused `update_gate` above (tnn_sigmoid) stays the inference path.

# File 'lib/toy/llm/primitives/gdn.rb', line 82

def self.update_gate_train(sess, b)
  e = TinyNN.tnn_exp(sess, b)                    # exp(b)
  d = TinyNN.tnn_scale_bias(sess, e, 1.0, 1.0)   # 1 + exp(b)
  TinyNN.tnn_div(sess, e, d)
end
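That the composed form agrees with the usual sigmoid can be confirmed on a few scalars in plain Ruby. Both helper names below are hypothetical and the check uses Ruby Floats, not TinyNN tensors.

```ruby
# Hypothetical scalar helpers; not part of Toy::LLM::Primitives::GDN.
def sigmoid_direct(b)
  1.0 / (1.0 + Math.exp(-b))
end

# Same three steps as the graph: exp(b), 1 + exp(b), then the division.
def sigmoid_composed(b)
  e = Math.exp(b)
  e / (1.0 + e)
end

samples  = [-3.0, 0.0, 2.5]
max_diff = samples.map { |b| (sigmoid_direct(b) - sigmoid_composed(b)).abs }.max
```

On these inputs the two forms differ only by floating-point rounding.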
