Module: Toy::LLM::Primitives::GDN

Defined in:
lib/toy/llm/primitives/gdn.rb

Constant Summary

NAME =
:gdn

Class Method Summary

Class Method Details

.decay_gate(sess, a, dt_bias, a_log) ⇒ Object

Log-decay gate: g = -exp(A_log) * softplus(a + dt_bias). a is the projected decay stream [1,H,T,B]; dt_bias and A_log are the block’s per-v-head weights ([1,H,1,1], broadcast). The returned g is the raw log-decay; the recurrence kernel exponentiates it internally. The op order is fixed for ggml broadcasting: the [1,H,T,B] softplus term drives the shape, and the [1,H,1,1] -exp(A_log) term broadcasts onto it.



# File 'lib/toy/llm/primitives/gdn.rb', line 63

def self.decay_gate(sess, a, dt_bias, a_log)
  a_db   = TinyNN.tnn_add(sess, a, dt_bias)
  sp     = TinyNN.tnn_softplus(sess, a_db)
  ea     = TinyNN.tnn_exp(sess, a_log)
  ea_neg = TinyNN.tnn_neg(sess, ea)
  TinyNN.tnn_mul(sess, sp, ea_neg)
end
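As a sanity check on the formula, here is a minimal plain-Ruby sketch that evaluates the gate for a single scalar. It does not touch TinyNN; `softplus` and `decay_gate_ref` are hypothetical helper names introduced only for this illustration.

```ruby
# Plain-Ruby scalar reference (illustrative only, not part of the module).
# softplus(x) = log(1 + exp(x)); the max/abs form avoids overflow for large x.
def softplus(x)
  [x, 0.0].max + Math.log(1.0 + Math.exp(-x.abs))
end

# g = -exp(A_log) * softplus(a + dt_bias)
def decay_gate_ref(a, dt_bias, a_log)
  -Math.exp(a_log) * softplus(a + dt_bias)
end

g     = decay_gate_ref(0.5, -0.5, 0.0) # softplus(0) = log(2), exp(0) = 1
decay = Math.exp(g)                    # per-step multiplier the recurrence applies
puts format("g = %.6f, exp(g) = %.6f", g, decay)
```

Because softplus is positive and -exp(A_log) is negative, g is always negative, so exp(g) stays strictly between 0 and 1.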

.gated_out(sess, o, z, gamma, eps) ⇒ Object

Gated output norm: GatedRMSNorm(o, z) = rms_norm(o) * gamma * silu(z). o is the per-head token output (block-sliced from recur); z the output-gate stream; gamma the block’s norm weight; eps the Float epsilon. tnn_rms_norm already folds the gamma scale, so this is rms_norm(o,gamma) * silu(z). The normed term drives the shape; silu(z) broadcasts/multiplies. Returns the gated output (input to the block’s out projection).



# File 'lib/toy/llm/primitives/gdn.rb', line 180

def self.gated_out(sess, o, z, gamma, eps)
  n  = TinyNN.tnn_rms_norm(sess, o, gamma, eps)
  sz = TinyNN.tnn_silu(sess, z)
  TinyNN.tnn_mul(sess, n, sz)
end
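The gating can be checked numerically with a small plain-Ruby sketch over one head vector. The RMS norm used here, x / sqrt(mean(x^2) + eps) scaled by gamma, is an assumption about what `tnn_rms_norm` computes; this page does not spell it out. All helper names are hypothetical.

```ruby
# Assumed RMS norm: x / sqrt(mean(x^2) + eps), scaled element-wise by gamma.
def rms_norm_ref(x, gamma, eps)
  mean_sq = x.sum { |v| v * v } / x.size
  inv = 1.0 / Math.sqrt(mean_sq + eps)
  x.each_with_index.map { |v, i| v * inv * gamma[i] }
end

# silu(z) = z * sigmoid(z)
def silu(z)
  z / (1.0 + Math.exp(-z))
end

# GatedRMSNorm(o, z) = rms_norm(o, gamma) * silu(z), element-wise.
def gated_out_ref(o, z, gamma, eps)
  n = rms_norm_ref(o, gamma, eps)
  n.each_with_index.map { |v, i| v * silu(z[i]) }
end

closed = gated_out_ref([3.0, 4.0], [0.0, 0.0], [1.0, 1.0], 1e-6) # silu(0) = 0
open   = gated_out_ref([3.0, 4.0], [1.0, 1.0], [1.0, 1.0], 1e-6)
```

With z = 0 the gate closes and the output is zero regardless of o; with equal gate values the output keeps the 3:4 ratio of the input.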

.l2(sess, x, eps) ⇒ Object

L2-normalise a projected q or k along its head dim (the delta rule replaces softmax normalisation with L2-norm). x is the block’s already-projected (and conv’d) q or k tensor; eps the Float epsilon. Returns the normalised handle. Called twice by the block (once for q, once for k).



# File 'lib/toy/llm/primitives/gdn.rb', line 36

def self.l2(sess, x, eps)
  TinyNN.tnn_l2_norm(sess, x, eps)
end

.l2_train(sess, x, eps) ⇒ Object

TRAINABLE L2 norm over ne0, composed from ops that each have a ggml backward (mul / sum_rows / scale_bias / sqrt / div), because the fused `tnn_l2_norm` (GGML_OP_L2_NORM) has NO backward. Used by the trainable GDN block; the fused `l2` above stays the inference path.

y = x / sqrt(sum_ne0(x^2) + eps)


# File 'lib/toy/llm/primitives/gdn.rb', line 45

def self.l2_train(sess, x, eps)
  sq     = TinyNN.tnn_mul(sess, x, x)            # x^2
  ss     = TinyNN.tnn_sum_rows(sess, sq)         # sum over ne0 -> [1,...]
  ss_eps = TinyNN.tnn_scale_bias(sess, ss, 1.0, eps)  # + eps
  denom  = TinyNN.tnn_sqrt(sess, ss_eps)         # [1,...]
  # DIV backward does NOT reduce a broadcast src1, so materialise denom to
  # x's full shape first (REPEAT backward sums the grad back correctly);
  # the div is then same-shape.
  denom_full = TinyNN.tnn_repeat(sess, denom, x)
  TinyNN.tnn_div(sess, x, denom_full)
end
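The same op sequence can be mirrored on a plain Ruby array to see what each step produces. This is only a sketch of the arithmetic y = x / sqrt(sum(x^2) + eps) for a single row; `l2_train_ref` is a hypothetical name and no gradients are involved.

```ruby
# Plain-Ruby mirror of the composed ops for one row (illustrative only).
def l2_train_ref(x, eps)
  sq         = x.map { |v| v * v }             # x^2            (tnn_mul)
  ss         = sq.sum                          # sum over row   (tnn_sum_rows)
  denom      = Math.sqrt(ss + eps)             # sqrt(ss + eps) (scale_bias + sqrt)
  denom_full = Array.new(x.size, denom)        # broadcast made explicit (tnn_repeat)
  x.zip(denom_full).map { |v, d| v / d }       # same-shape divide (tnn_div)
end

y    = l2_train_ref([3.0, 4.0], 0.0)   # norm is 5, so y is approximately [0.6, 0.8]
zero = l2_train_ref([0.0, 0.0], 1e-6)  # eps keeps the zero vector finite
```

The second call shows why eps sits inside the square root: an all-zero row divides by sqrt(eps) instead of by zero.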

.recur(sess, q, k, v, g, beta, state) ⇒ Object

The recurrence core. q,k must be L2-normed; beta sigmoid’d; g the raw log-decay; state the [S_v*S_v*H,K,B,1] carry. Returns the packed [S_v*H, T*B + K*S_v*B] output (token outputs then state snapshots). The block slices the leading T*B token columns.



# File 'lib/toy/llm/primitives/gdn.rb', line 92

def self.recur(sess, q, k, v, g, beta, state)
  TinyNN.tnn_gated_delta_net(sess, q, k, v, g, beta, state)
end

.recur_unrolled(sess, q, k, v, g, beta, state0, s_v, n_heads, head, n_tokens) ⇒ Object

Path-B TRAINABLE recurrence: the gated delta rule expressed as an UNROLLED graph of ops that EACH have a ggml backward (mul / mul_mat / sub / scale / exp / add / reshape), so training backward comes free and NO fused-kernel backward is needed (ggml has none for GATED_DELTA_NET). The fused `recur` above stays the fast INFERENCE path; this is its train-time twin, gated for numeric parity.

Reproduces the fused kernel’s token outputs for the SCALAR-decay path (g->ne0 == 1, the Dragon/Qwen3-Next per-head gate). Single seq (B=1), single head per call — the block loops heads/seqs around it in Phase 5. Inputs are the packed projection tensors (q,k,v = [S_v,1,T,1]; g,beta = [1,1,T,1]; state0 = [S_v,S_v]); per-token vectors are sliced via views internally (no ptr-array params → no Spinel IntArray-lock landmine). q/k must be pre-L2-normed and beta pre-sigmoid’d by the caller (the kernel contract). Returns [S_v, T] — token outputs concat’d along ne1.

per token t (matching ops.cpp:10731 exactly):
  S = S * exp(g_t)                  decay  (scalar [1,1] broadcast)
  u = matmul(S, k_t)                u[j] = sum_i S[i,j] k[i]
  d = (v_t - u) * beta_t            delta
  S = S + matmul(k_row, d_row)      outer  (k⊗d)[i,j] = k[i] d[j]
  o_t = matmul(S, q_t)              o[j] = sum_i S[i,j] q[i]

The kernel’s 1/√S_v output scale is folded into a SINGLE pre-scale of q (q enters only the output read, never the state, so o = sum_i S (scale·q) is exact). Done once on the contiguous q — NOT per-token on o — because a per-token ggml_scale’s BACKWARD receives a view-shaped grad from the concat and asserts ggml_is_padded_1d (ggml.c:3392). One scale on the whole tensor keeps the backward grad contiguous.
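The per-token steps and the scale folding can be traced with a plain-Ruby reference over nested arrays. This is a sketch under the index conventions listed above (state stored as S[i][j]), not the ggml graph; `gdn_recur_ref` and its argument layout are hypothetical.

```ruby
# Plain-Ruby reference of the single-head recurrence (illustrative only).
# q, k, v: arrays of T vectors of length S_v; g, beta: arrays of T scalars;
# state0: S_v x S_v nested array indexed [i][j]; scale multiplies q only.
def gdn_recur_ref(q, k, v, g, beta, state0, scale)
  s_v = state0.size
  s = state0.map(&:dup)                        # leave the caller's state untouched
  q.each_index.map do |t|
    decay = Math.exp(g[t])
    s.each { |row| row.map! { |x| x * decay } }                       # S = S * exp(g_t)
    u = (0...s_v).map { |j| (0...s_v).sum { |i| s[i][j] * k[t][i] } } # u[j]
    d = (0...s_v).map { |j| (v[t][j] - u[j]) * beta[t] }              # delta
    (0...s_v).each { |i| (0...s_v).each { |j| s[i][j] += k[t][i] * d[j] } } # S += k (x) d
    (0...s_v).map { |j| (0...s_v).sum { |i| s[i][j] * scale * q[t][i] } }   # o[j]
  end
end

k      = [[1.0, 0.0], [1.0, 0.0]]   # same unit-norm key for both tokens
q      = [[1.0, 0.0], [1.0, 0.0]]
v      = [[2.0, 3.0], [5.0, 7.0]]
g      = [0.0, 0.0]                 # exp(0) = 1, so no decay
beta   = [1.0, 1.0]                 # full update
state0 = [[0.0, 0.0], [0.0, 0.0]]

out    = gdn_recur_ref(q, k, v, g, beta, state0, 1.0)
scaled = gdn_recur_ref(q, k, v, g, beta, state0, 0.5)
```

With no decay and beta = 1, writing a second value under the same key replaces the first, so `out` is [[2.0, 3.0], [5.0, 7.0]]. `scaled` is exactly half of `out`, which is the property that lets the output scale be applied to q once up front.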

ONE head of the recurrence. q,k,v are the packed [S_v, n_heads, T, 1] projections; g,beta the packed [1, n_heads, T, 1] gates; state0 this head’s [S_v,S_v] carry. `head` selects the head; per-token vectors are strided views into the packed tensors (token stride = S_v·n_heads, head base = S_v·head, following the ggml [S_v,H,T,B] layout). Returns [S_v, T] for this head; the block concats heads along ne0. n_heads=1/head=0 is the plain single-head case (contiguous per-token, the Phase-4 gate shape).
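The view offsets follow directly from that layout. The sketch below reproduces only the byte arithmetic, assuming 4-byte f32 elements; `token_view_offset` and `gate_view_offset` are hypothetical helper names, not part of the module.

```ruby
# Byte offset of token t for one head inside a packed [S_v, H, T, 1] f32 tensor.
def token_view_offset(s_v, n_heads, head, t, fbytes = 4)
  tok_stride = s_v * n_heads * fbytes  # bytes between consecutive tokens
  head_base  = s_v * head * fbytes     # bytes to this head's first element
  head_base + t * tok_stride
end

# Same idea for the [1, H, T, 1] gate tensors (one scalar per head per token).
def gate_view_offset(n_heads, head, t, fbytes = 4)
  head * fbytes + t * n_heads * fbytes
end

off  = token_view_offset(4, 3, 1, 2)  # => 112 bytes, i.e. element 28 = 2*(4*3) + 1*4
goff = gate_view_offset(3, 1, 2)      # => 28 bytes, i.e. element 7 = 2*3 + 1
```

With n_heads = 1 and head = 0 the token offset reduces to t * S_v * 4, which is the contiguous single-head case.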



# File 'lib/toy/llm/primitives/gdn.rb', line 133

def self.recur_unrolled(sess, q, k, v, g, beta, state0, s_v, n_heads, head, n_tokens)
  scale = 1.0 / Math.sqrt(s_v.to_f)
  fbytes = 4                          # sizeof(f32)
  tok_stride  = s_v * n_heads * fbytes # bytes between this head's tokens
  head_base   = s_v * head * fbytes    # byte offset to this head's col 0
  gtok_stride = n_heads * fbytes       # g/beta [1,H,T,1]: token stride
  ghead_base  = head * fbytes
  q_s = TinyNN.tnn_scale(sess, q, scale)   # pre-scaled q (contiguous)
  s_mat = state0
  t_out = TinyNN.tnn_null_ptr
  t = 0
  while t < n_tokens
    # Per-token slices: [S_v,1] vectors (S_v contiguous), [1,1] scalars.
    q_t = TinyNN.tnn_view_2d(sess, q_s, s_v, 1, tok_stride, head_base + t * tok_stride)
    k_t = TinyNN.tnn_view_2d(sess, k,   s_v, 1, tok_stride, head_base + t * tok_stride)
    v_t = TinyNN.tnn_view_2d(sess, v,   s_v, 1, tok_stride, head_base + t * tok_stride)
    g_t = TinyNN.tnn_view_2d(sess, g,    1, 1, gtok_stride, ghead_base + t * gtok_stride)
    b_t = TinyNN.tnn_view_2d(sess, beta, 1, 1, gtok_stride, ghead_base + t * gtok_stride)

    eg    = TinyNN.tnn_exp(sess, g_t)              # [1,1]
    s_dec = TinyNN.tnn_mul(sess, s_mat, eg)        # [S_v,S_v] * [1,1] bcast
    u     = TinyNN.tnn_matmul(sess, s_dec, k_t)    # [S_v,1]  u[j]
    diff  = TinyNN.tnn_sub(sess, v_t, u)           # [S_v,1]
    d     = TinyNN.tnn_mul(sess, diff, b_t)        # [S_v,1] * [1,1] bcast
    k_row = TinyNN.tnn_reshape_2d(sess, k_t, 1, s_v)  # [1,S_v]
    d_row = TinyNN.tnn_reshape_2d(sess, d, 1, s_v)    # [1,S_v]
    outer = TinyNN.tnn_matmul(sess, k_row, d_row)  # [S_v,S_v] [i,j]=k[i]d[j]
    s_mat = TinyNN.tnn_add(sess, s_dec, outer)     # state update
    o_t   = TinyNN.tnn_matmul(sess, s_mat, q_t)    # [S_v,1]  o[j] (already scaled)

    if t == 0
      t_out = o_t
    else
      t_out = TinyNN.tnn_concat(sess, t_out, o_t, 1)  # stack along ne1
    end
    t = t + 1
  end
  t_out
end

.update_gate(sess, b) ⇒ Object

Update rate: beta = sigmoid(b). b is the projected update stream [1,H,T,B]. The kernel uses beta directly, so the sigmoid lives here. Returns beta.



# File 'lib/toy/llm/primitives/gdn.rb', line 74

def self.update_gate(sess, b)
  TinyNN.tnn_sigmoid(sess, b)
end

.update_gate_train(sess, b) ⇒ Object

TRAINABLE update gate: sigmoid(b) composed as exp(b)/(1+exp(b)) from ops that each have a ggml backward, because GGML_UNARY_OP_SIGMOID has none. Same-shape throughout (no broadcast). The fused `update_gate` above (tnn_sigmoid) stays the inference path.



# File 'lib/toy/llm/primitives/gdn.rb', line 82

def self.update_gate_train(sess, b)
  e = TinyNN.tnn_exp(sess, b)                    # exp(b)
  d = TinyNN.tnn_scale_bias(sess, e, 1.0, 1.0)   # 1 + exp(b)
  TinyNN.tnn_div(sess, e, d)
end
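A quick plain-Ruby check that the composed form agrees with the usual sigmoid definition. This sketch uses Ruby Floats (double precision), not the f32 graph, and both helper names are hypothetical.

```ruby
# Composed form used by the trainable gate: exp(b) / (1 + exp(b)).
def sigmoid_composed(b)
  e = Math.exp(b)
  e / (1.0 + e)
end

# Textbook form for comparison: 1 / (1 + exp(-b)).
def sigmoid_direct(b)
  1.0 / (1.0 + Math.exp(-b))
end

diffs = [-5.0, -1.0, 0.0, 2.0, 10.0].map do |b|
  (sigmoid_composed(b) - sigmoid_direct(b)).abs
end
puts diffs.max
```

The two forms agree closely over moderate inputs. One property of the composed form worth keeping in mind is that exp(b) overflows for very large positive b, at which point the ratio becomes Infinity / Infinity and is no longer a number. Whether the projected update stream can reach such values is not stated on this page.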