Class: Ignis::AI::Transformer::RopeGqaAttention

Inherits:

NN::Module

Object
NN::Module
Ignis::AI::Transformer::RopeGqaAttention

Defined in: lib/nnw/ai/transformer/modern.rb

Overview

Attention layer combining rotary position embeddings (RoPE) with grouped-query attention (GQA). The projections carry no bias by default (Llama/Qwen convention); pass bias: true to enable one.
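To make the grouping concrete, here is a minimal sketch of how query heads can share key/value heads under GQA. The helper kv_head_for is hypothetical and not part of this library, and it assumes the common contiguous grouping (consecutive query heads share one KV head); the layout the actual kernels use is not shown on this page.

```ruby
# Hypothetical helper (not part of the library): returns the index of the
# key/value head that a given query head reads from, assuming contiguous groups.
def kv_head_for(query_head, num_heads:, num_kv_heads:)
  unless (num_heads % num_kv_heads).zero?
    raise ArgumentError, "num_heads must be a multiple of num_kv_heads"
  end

  group_size = num_heads / num_kv_heads # query heads per KV head
  query_head / group_size               # integer division picks the group
end

# 8 query heads sharing 2 KV heads: heads 0..3 use KV head 0, heads 4..7 use KV head 1.
mapping = (0...8).map { |h| kv_head_for(h, num_heads: 8, num_kv_heads: 2) }
```

With num_kv_heads equal to num_heads the group size is 1, so every query head gets its own KV head, which is plain multi-head attention.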

Instance Attribute Summary

Attributes inherited from NN::Module

#training

Instance Method Summary

#forward(x, pos_offset: 0) ⇒ Tensor

Returns a tensor of shape [seq, embed_dim].
#initialize(embed_dim, num_heads, num_kv_heads:, head_dim: nil, rope_base: 10000.0, rope_scaling: nil, bias: false, device_id: 0) ⇒ RopeGqaAttention constructor

A new instance of RopeGqaAttention.
#to_s ⇒ String

Methods inherited from NN::Module

#call, #eval!, #load_state_dict, #named_parameters, #num_parameters, #parameters, #state_dict, #to, #train!, #zero_grad!

Constructor Details

#initialize(embed_dim, num_heads, num_kv_heads:, head_dim: nil, rope_base: 10000.0, rope_scaling: nil, bias: false, device_id: 0) ⇒ `RopeGqaAttention`

Returns a new instance of RopeGqaAttention.

Parameters:

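embed_dim (Integer): input and output feature dimension
num_heads (Integer): number of query heads; must be a multiple of num_kv_heads
num_kv_heads (Integer): number of key/value heads (num_kv_heads == num_heads gives plain MHA)
head_dim (Integer, nil) (defaults to: nil): per-head dimension; when nil, embed_dim / num_heads (integer division). Must be even and at most 128
rope_base (Float) (defaults to: 10000.0): RoPE base (theta)
rope_scaling (defaults to: nil): optional scaling applied when the inv_freq table is precomputed
bias (Boolean) (defaults to: false): whether the q/k/v/o projections use a bias term
device_id (Integer) (defaults to: 0): device ID passed to the projection layers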

Raises:

(ArgumentError): if num_heads is not a multiple of num_kv_heads, if head_dim is odd, or if head_dim exceeds 128

# File 'lib/nnw/ai/transformer/modern.rb', line 56

def initialize(embed_dim, num_heads, num_kv_heads:, head_dim: nil,
               rope_base: 10000.0, rope_scaling: nil, bias: false, device_id: 0)
  super()
  raise ArgumentError, "num_heads must be a multiple of num_kv_heads" unless (num_heads % num_kv_heads).zero?
  @embed_dim = embed_dim
  @num_heads = num_heads
  @num_kv_heads = num_kv_heads
  @head_dim = head_dim || (embed_dim / num_heads)
  # Fail early (at construction) rather than silently miscompute later:
  # RoPE needs an even head_dim; the flash kernels cap head_dim at 128.
  raise ArgumentError, "head_dim must be even for RoPE (got #{@head_dim})" unless @head_dim.even?
  raise ArgumentError, "head_dim #{@head_dim} exceeds flash-attention HEAD_DIM_MAX (128)" if @head_dim > 128
  @rope_base = rope_base
  # Precompute the (optionally scaled) inv_freq table once; reused every layer/step.
  @inv_freq = Transformer.compute_inv_freq(@head_dim, rope_base, rope_scaling)
  q_out  = num_heads * @head_dim
  kv_out = num_kv_heads * @head_dim
  @q_proj = register_module("q_proj", NN::Linear.new(embed_dim, q_out, bias: bias, device_id: device_id))
  @k_proj = register_module("k_proj", NN::Linear.new(embed_dim, kv_out, bias: bias, device_id: device_id))
  @v_proj = register_module("v_proj", NN::Linear.new(embed_dim, kv_out, bias: bias, device_id: device_id))
  @o_proj = register_module("o_proj", NN::Linear.new(q_out, embed_dim, bias: bias, device_id: device_id))
end
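The checks and projection sizes in the constructor above can be worked through without the library. The sketch below is a standalone approximation: projection_shapes is a hypothetical helper, not part of the library, that mirrors the validation order and returns [in_features, out_features] for each of the four linear layers.

```ruby
# Hypothetical helper (not part of the library): mirrors the constructor's
# validation and reports the [in, out] size of each projection.
def projection_shapes(embed_dim, num_heads, num_kv_heads:, head_dim: nil)
  unless (num_heads % num_kv_heads).zero?
    raise ArgumentError, "num_heads must be a multiple of num_kv_heads"
  end

  head_dim ||= embed_dim / num_heads # integer division, as in the constructor
  raise ArgumentError, "head_dim must be even for RoPE (got #{head_dim})" unless head_dim.even?
  raise ArgumentError, "head_dim #{head_dim} exceeds 128" if head_dim > 128

  q_out  = num_heads * head_dim
  kv_out = num_kv_heads * head_dim
  {
    q_proj: [embed_dim, q_out],
    k_proj: [embed_dim, kv_out],
    v_proj: [embed_dim, kv_out],
    o_proj: [q_out, embed_dim]
  }
end

# 32 query heads, 8 KV heads: head_dim = 4096 / 32 = 128,
# so q_out = 4096 and kv_out = 8 * 128 = 1024.
shapes = projection_shapes(4096, 32, num_kv_heads: 8)
```

In this configuration the k_proj and v_proj outputs are a quarter the width of q_proj, which is where GQA saves parameters and KV memory relative to plain MHA.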

Instance Method Details

#forward(x, pos_offset: 0) ⇒ `Tensor`

Returns a tensor of shape [seq, embed_dim].
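This page does not document pos_offset. A plausible reading, stated here as an assumption, is that it gives the absolute position of the first token in x, so a later chunk can be rotated consistently with the tokens that came before it. The sketch below illustrates that idea in plain Ruby using the standard unscaled RoPE frequencies; inv_freq and rope_angles are hypothetical stand-ins, and the real Transformer.compute_inv_freq (including its rope_scaling handling) may differ.

```ruby
# Hypothetical stand-in for Transformer.compute_inv_freq without scaling:
# standard RoPE frequencies 1 / base^(2i / head_dim) for i in 0...head_dim/2.
def inv_freq(head_dim, rope_base = 10000.0)
  (0...head_dim / 2).map { |i| 1.0 / (rope_base**(2.0 * i / head_dim)) }
end

# Rotation angles for positions pos_offset, pos_offset + 1, ..., one row per token.
# Assumes pos_offset is the absolute position of the first token in the chunk.
def rope_angles(seq, head_dim, pos_offset: 0, rope_base: 10000.0)
  freqs = inv_freq(head_dim, rope_base)
  (0...seq).map { |t| freqs.map { |f| (pos_offset + t) * f } }
end

full  = rope_angles(5, 4)                 # positions 0..4 in one pass
chunk = rope_angles(2, 4, pos_offset: 3)  # positions 3..4 as a later chunk
```

Under this assumption the angles for the later chunk match the last two rows of the full pass, which is the property incremental decoding relies on.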