Module: Toy::LLM::Primitives::DiffAttention
- Defined in:
- lib/toy/llm/primitives/diff_attention.rb
Constant Summary
- NAME = :diff_attention
Class Method Summary
-
.combine(sess, a1, a2, lambda_t) ⇒ Object
Combine the two attention maps: A = A1 - lambda*A2.
-
.lambda_scalar(sess, lq1, lk1, lq2, lk2, lambda_init) ⇒ Object
The per-head differential lambda SCALAR: lambda = exp(sum(lq1*lk1)) - exp(sum(lq2*lk2)) + lambda_init. lq1/lk1/lq2/lk2 are the learned [head_dim] vectors (block-owned); lambda_init is the depth-constant Float.
-
.subln(sess, o, gamma, eps, one_minus_lambda_init) ⇒ Object
Per-head output sub-norm + the fixed (1 - lambda_init) scaling: O = rms_norm(O, gamma) * (1 - lambda_init).
Class Method Details
.combine(sess, a1, a2, lambda_t) ⇒ Object
Combine the two attention maps: A = A1 - lambda*A2. a1/a2 are the block’s two softmax score maps (same shape); lambda_t is the [1] lambda tensor from `lambda_scalar`, which broadcasts across a2. a1 drives the shape under ggml broadcast; the lambda*a2 term is subtracted from it.
# File 'lib/toy/llm/primitives/diff_attention.rb', line 53
def self.combine(sess, a1, a2, lambda_t)
  la2 = TinyNN.tnn_mul(sess, a2, lambda_t)
  TinyNN.tnn_sub(sess, a1, la2)
end
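The combine step can be checked against a plain-Ruby reference. The sketch below is not part of TinyNN: `combine_reference` is a hypothetical helper that works on nested Arrays of Floats, with a bare Float standing in for the [1] lambda tensor, and the sample values are made up for illustration.

```ruby
# Hypothetical plain-Ruby reference for A = A1 - lambda * A2.
# a1, a2: arrays of rows with identical shape (stand-ins for the softmax maps).
# lam:    Float standing in for the [1] lambda tensor that broadcasts over a2.
def combine_reference(a1, a2, lam)
  a1.each_with_index.map do |row, i|
    row.each_with_index.map { |v, j| v - lam * a2[i][j] }
  end
end

# Illustrative inputs: each row sums to 1, as a softmax row would.
a1 = [[0.7, 0.3], [0.4, 0.6]]
a2 = [[0.5, 0.5], [0.2, 0.8]]
combined = combine_reference(a1, a2, 0.5)
combined.each { |row| puts row.inspect }
```

Because every row of a1 and a2 sums to 1, each row of the combined map sums to 1 - lambda, which makes a convenient sanity check when comparing against the graph output.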
.lambda_scalar(sess, lq1, lk1, lq2, lk2, lambda_init) ⇒ Object
The per-head differential lambda SCALAR:
lambda = exp(sum(lq1*lk1)) - exp(sum(lq2*lk2)) + lambda_init
lq1/lk1/lq2/lk2 are the learned [head_dim] vectors (block-owned); lambda_init is the depth-constant Float. The dot products reduce to a [1] tensor via tnn_sum; the result lambda is a [1] tensor that broadcast-multiplies A2 in `combine`. (scale_bias folds the + lambda_init onto the first exp term, so the math is (exp1 + lambda_init) - exp2 = exp1 - exp2 + lambda_init.)
# File 'lib/toy/llm/primitives/diff_attention.rb', line 38
def self.lambda_scalar(sess, lq1, lk1, lq2, lk2, lambda_init)
  d1 = TinyNN.tnn_mul(sess, lq1, lk1)
  s1 = TinyNN.tnn_sum(sess, d1)
  e1 = TinyNN.tnn_exp(sess, s1)
  d2 = TinyNN.tnn_mul(sess, lq2, lk2)
  s2 = TinyNN.tnn_sum(sess, d2)
  e2 = TinyNN.tnn_exp(sess, s2)
  e1b = TinyNN.tnn_scale_bias(sess, e1, 1.0, lambda_init) # exp1 + lambda_init
  TinyNN.tnn_sub(sess, e1b, e2) # (exp1+λ_init) - exp2
end
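As a cross-check on the formula, here is a minimal plain-Ruby sketch of the same arithmetic. It does not use TinyNN: `dot` and `lambda_reference` are hypothetical helpers operating on Arrays of Floats, and the input vectors are arbitrary illustrative values. The grouping mirrors the listing, adding lambda_init to the first exponential before subtracting the second.

```ruby
# Hypothetical helper: dot product of two equal-length Float arrays
# (the elementwise multiply followed by the sum reduction).
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Hypothetical plain-Ruby reference for
#   lambda = exp(sum(lq1*lk1)) - exp(sum(lq2*lk2)) + lambda_init
def lambda_reference(lq1, lk1, lq2, lk2, lambda_init)
  e1 = Math.exp(dot(lq1, lk1))
  e2 = Math.exp(dot(lq2, lk2))
  (e1 + lambda_init) - e2 # same grouping as the scale_bias fold
end

# Illustrative [head_dim] vectors with head_dim = 3.
lq1 = [0.1, -0.2, 0.3]
lk1 = [0.4, 0.5, -0.6]
lq2 = [0.2, 0.1, 0.0]
lk2 = [-0.3, 0.2, 0.1]
lam = lambda_reference(lq1, lk1, lq2, lk2, 0.8)
puts lam
```

With all-zero vectors both exponentials equal 1 and cancel, so lambda reduces to lambda_init, which is a quick way to test an implementation at initialization.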
.subln(sess, o, gamma, eps, one_minus_lambda_init) ⇒ Object
Per-head output sub-norm + the fixed (1 - lambda_init) scaling:
O = rms_norm(O, gamma) * (1 - lambda_init).
o is the per-head attention output (block-sliced); gamma is the subln weight; eps is the Float epsilon; one_minus_lambda_init is the compile-time Float (1 - lambda_init). tnn_rms_norm applies gamma as part of the norm; tnn_scale then applies the depth constant. Returns the normed and scaled head output.
# File 'lib/toy/llm/primitives/diff_attention.rb', line 64
def self.subln(sess, o, gamma, eps, one_minus_lambda_init)
  n = TinyNN.tnn_rms_norm(sess, o, gamma, eps)
  TinyNN.tnn_scale(sess, n, one_minus_lambda_init)
end
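The sub-norm can likewise be sketched in plain Ruby. This is an illustration under stated assumptions, not the TinyNN kernel: `rms_norm_reference` and `subln_reference` are hypothetical helpers, and the sketch assumes the common RMS norm definition x / sqrt(mean(x^2) + eps) * gamma over a single head vector. How tnn_rms_norm places eps or which axis it reduces over is not shown in the source and may differ.

```ruby
# Hypothetical RMS norm over one vector, assuming
#   y_i = x_i / sqrt(mean(x^2) + eps) * gamma_i
def rms_norm_reference(x, gamma, eps)
  mean_sq = x.sum { |v| v * v } / x.length.to_f
  inv = 1.0 / Math.sqrt(mean_sq + eps)
  x.each_with_index.map { |v, i| v * inv * gamma[i] }
end

# Hypothetical reference for O = rms_norm(O, gamma) * (1 - lambda_init).
def subln_reference(o, gamma, eps, one_minus_lambda_init)
  rms_norm_reference(o, gamma, eps).map { |v| v * one_minus_lambda_init }
end

# Illustrative head output with unit gamma and lambda_init = 0.8.
o = [3.0, 4.0]
gamma = [1.0, 1.0]
out = subln_reference(o, gamma, 0.0, 0.2)
puts out.inspect
```

With unit gamma and eps = 0, the normalized vector has RMS 1, so the RMS of the result equals one_minus_lambda_init.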