Class: Ignis::AI::Tensor
- Inherits: Object
  - Object
  - Ignis::AI::Tensor
- Defined in: lib/nnw/ai/tensor.rb
Overview
Tensor — the user-facing GPU tensor type for AI operations.
Wraps Ignis::Shared::NvArray and adds gradient tracking for autograd. All compute ops record backward functions on the tape when requires_grad is true.
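The tape idea described above can be illustrated with a minimal CPU-only sketch on scalars. `MiniTape` and `Var` are hypothetical names invented for this illustration and are not part of Ignis; the real tape works on NvArray buffers and its internals are not shown on this page.

```ruby
# Hypothetical, CPU-only illustration of tape-based reverse-mode autodiff.
# MiniTape and Var are NOT Ignis classes.
class MiniTape
  Entry = Struct.new(:output, :inputs, :backward)

  def initialize
    @entries = []
  end

  # Remember how to push a gradient from `output` back to `inputs`.
  def record(output, inputs, &backward)
    @entries << Entry.new(output, inputs, backward)
  end

  # Walk the tape in reverse recording order, accumulating gradients.
  def backward!(root)
    root.grad = 1.0
    @entries.reverse_each do |e|
      next if e.output.grad.nil?
      grads = e.backward.call(e.output.grad)
      e.inputs.zip(grads) { |inp, g| inp.grad = (inp.grad || 0.0) + g }
    end
  end
end

class Var
  attr_reader :value
  attr_accessor :grad

  def initialize(value, tape)
    @value = value
    @tape = tape
    @grad = nil
  end

  def +(other)
    out = Var.new(@value + other.value, @tape)
    @tape.record(out, [self, other]) { |g| [g, g] }
    out
  end

  def *(other)
    out = Var.new(@value * other.value, @tape)
    a = @value
    b = other.value
    @tape.record(out, [self, other]) { |g| [g * b, g * a] }
    out
  end
end

tape = MiniTape.new
x = Var.new(3.0, tape)
y = Var.new(4.0, tape)
z = x * y + x
tape.backward!(z)
# dz/dx = y + 1 and dz/dy = x
```

Each operator records a block that maps the output gradient to one gradient per input, which mirrors the shape of the `Tape.record(result, inputs: [...]) do |grad| ... end` calls in the method listings.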
Instance Attribute Summary
- #_tape_id ⇒ Integer?
  Position in current tape.
- #data ⇒ Ignis::Shared::NvArray (readonly)
  Underlying GPU data.
- #grad ⇒ Ignis::Shared::NvArray?
  Gradient (same shape as data).
- #grad_fn ⇒ Proc?
  Backward function recorded by the tape.
- #is_leaf ⇒ Boolean (readonly)
  True if created by user (not computed).
- #requires_grad ⇒ Boolean (readonly)
  Whether this tensor participates in autograd.
Class Method Summary
- .from_host(ruby_array, shape:, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor
  Create a tensor from a Ruby array.
- .from_nv_array(nv_array, requires_grad: false) ⇒ Tensor
  Wrap an existing NvArray.
- .ones(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor
  Create a ones-filled tensor.
- .rand(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor
  Create a tensor with random uniform values in [0, 1).
- .zeros(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor
  Create a zero-filled tensor.
Instance Method Summary
- #*(other) ⇒ Tensor
  Elementwise multiplication (Hadamard): self * other.
- #+(other) ⇒ Tensor
  Elementwise addition: self + other.
- #-(other) ⇒ Tensor
  Elementwise subtraction: self - other.
- #add_bias(bias) ⇒ Tensor
  Row-broadcast bias add: self [rows, cols] + bias [cols] -> [rows, cols].
- #backward!(grad_output = nil) ⇒ void
  Trigger reverse-mode automatic differentiation from this tensor.
- #decode_sdpa(k, v, num_heads:) ⇒ Tensor
  Single-query attention for autoregressive decode with a KV cache.
- #detach ⇒ Tensor
  Create a detached tensor (shares the same GPU memory, no grad tracking).
- #device_id ⇒ Integer
- #dtype ⇒ Symbol
- #gelu ⇒ Tensor
  GELU activation (tanh approximation).
- #initialize(data:, requires_grad: false, grad_fn: nil, is_leaf: true) ⇒ Tensor (constructor)
  A new instance of Tensor.
- #item ⇒ Float, Integer
  Get scalar value (for single-element tensors).
- #layer_norm(weight, bias, eps: 1e-5) ⇒ Tensor
  Layer normalization.
- #matmul(other, transpose_b: false) ⇒ Tensor
  Matrix multiplication: self @ other.
- #mean ⇒ Tensor
  Mean reduction (all elements → scalar).
- #numel ⇒ Integer
- #relu ⇒ Tensor
  ReLU activation.
- #reshape(new_shape) ⇒ Tensor
  Reshape (zero-copy if contiguous).
- #rms_norm(weight, eps: 1e-5) ⇒ Tensor
  RMSNorm: y = gamma * x / sqrt(mean(x^2) + eps) (Llama/Qwen/Mistral style).
- #rope(num_heads:, base: 10000.0, pos_offset: 0, inv_freq: nil) ⇒ Tensor
  Rotary Position Embedding (RoPE), HF/Llama/Qwen "rotate_half" convention.
- #sdpa(k, v, num_heads:, num_kv_heads: nil, causal: true) ⇒ Tensor
  Multi-head / grouped-query scaled dot-product attention (causal optional), batch = 1.
- #shape ⇒ Array<Integer>
- #silu ⇒ Tensor
  SiLU activation: x * sigmoid(x).
- #softmax ⇒ Tensor
  Softmax along last dimension.
- #sum ⇒ Tensor
  Sum reduction (all elements → scalar).
- #to_host ⇒ Array<Numeric>
  Copy GPU data to host as Ruby Array.
- #transpose(dim0 = 0, dim1 = 1) ⇒ Tensor
  Transpose two dimensions (for 2D tensors).
- #zero_grad! ⇒ void
  Zero out gradients (sets them to zeros rather than nil, which avoids an allocation in the training loop).
Constructor Details
#initialize(data:, requires_grad: false, grad_fn: nil, is_leaf: true) ⇒ Tensor
Returns a new instance of Tensor.
# File 'lib/nnw/ai/tensor.rb', line 40

def initialize(data:, requires_grad: false, grad_fn: nil, is_leaf: true)
  @data = data
  @requires_grad = requires_grad
  @grad = nil
  @grad_fn = grad_fn
  @is_leaf = is_leaf
  @_tape_id = nil
end
Instance Attribute Details
#_tape_id ⇒ Integer?
Returns position in current tape.
# File 'lib/nnw/ai/tensor.rb', line 34

def _tape_id
  @_tape_id
end

#data ⇒ Ignis::Shared::NvArray (readonly)
Returns underlying GPU data.
# File 'lib/nnw/ai/tensor.rb', line 19

def data
  @data
end

#grad ⇒ Ignis::Shared::NvArray?
Returns gradient (same shape as data).
# File 'lib/nnw/ai/tensor.rb', line 22

def grad
  @grad
end

#grad_fn ⇒ Proc?
Returns backward function recorded by the tape.
# File 'lib/nnw/ai/tensor.rb', line 28

def grad_fn
  @grad_fn
end

#is_leaf ⇒ Boolean (readonly)
Returns true if created by user (not computed).
# File 'lib/nnw/ai/tensor.rb', line 31

def is_leaf
  @is_leaf
end

#requires_grad ⇒ Boolean (readonly)
Returns whether this tensor participates in autograd.
# File 'lib/nnw/ai/tensor.rb', line 25

def requires_grad
  @requires_grad
end
Class Method Details
.from_host(ruby_array, shape:, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor
Create a tensor from a Ruby array.
# File 'lib/nnw/ai/tensor.rb', line 104

def self.from_host(ruby_array, shape:, dtype: :float32, device_id: 0, requires_grad: false)
  nv = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
  nv.from_host(ruby_array)
  new(data: nv, requires_grad: requires_grad)
end
.from_nv_array(nv_array, requires_grad: false) ⇒ Tensor
Wrap an existing NvArray.
# File 'lib/nnw/ai/tensor.rb', line 57

def self.from_nv_array(nv_array, requires_grad: false)
  new(data: nv_array, requires_grad: requires_grad)
end
.ones(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor
Create a ones-filled tensor.
# File 'lib/nnw/ai/tensor.rb', line 79

def self.ones(shape, dtype: :float32, device_id: 0, requires_grad: false)
  nv = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
  nv.from_host(Array.new(nv.numel, 1.0))
  new(data: nv, requires_grad: requires_grad)
end
.rand(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor
Create a tensor with random uniform values in [0, 1).
# File 'lib/nnw/ai/tensor.rb', line 91

def self.rand(shape, dtype: :float32, device_id: 0, requires_grad: false)
  nv = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
  nv.from_host(Array.new(nv.numel) { Kernel.rand })
  new(data: nv, requires_grad: requires_grad)
end
.zeros(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor
Create a zero-filled tensor.
# File 'lib/nnw/ai/tensor.rb', line 67

def self.zeros(shape, dtype: :float32, device_id: 0, requires_grad: false)
  nv = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
  nv.from_host(Array.new(nv.numel, 0.0))
  new(data: nv, requires_grad: requires_grad)
end
Instance Method Details
#*(other) ⇒ Tensor
Elementwise multiplication (Hadamard): self * other
# File 'lib/nnw/ai/tensor.rb', line 385

def *(other)
  if other.is_a?(Numeric)
    return scalar_mul(other)
  end

  other = ensure_tensor(other)
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Elementwise.mul_forward
  n = numel
  grid = [(n + 255) / 256]
  kernel.launch(grid: grid, block: [256], args: [@data, other.data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: should_track?(other), is_leaf: false)

  if result.requires_grad
    saved_self = @data
    saved_other = other.data
    Tape.record(result, inputs: [self, other]) do |grad|
      grad_a = alloc_like(grad)
      grad_b = alloc_like(grad)
      mk = Ignis::JIT::Kernels::Elementwise.mul_backward
      gn = grad.numel
      g = [(gn + 255) / 256]
      mk.launch(grid: g, block: [256], args: [grad, saved_other, grad_a, gn])
      mk.launch(grid: g, block: [256], args: [grad, saved_self, grad_b, gn])
      [grad_a, grad_b]
    end
  end

  result
end
#+(other) ⇒ Tensor
Elementwise addition: self + other
# File 'lib/nnw/ai/tensor.rb', line 335

def +(other)
  other = ensure_tensor(other)
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Elementwise.add_forward
  n = numel
  grid = [(n + 255) / 256]
  kernel.launch(grid: grid, block: [256], args: [@data, other.data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: should_track?(other), is_leaf: false)

  if result.requires_grad
    Tape.record(result, inputs: [self, other]) do |grad|
      [grad, grad] # d(a+b)/da = 1, d(a+b)/db = 1
    end
  end

  result
end
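The elementwise methods all size their launch with `(n + 255) / 256`, which is integer ceiling division by the block size. A small sketch of that arithmetic follows; `grid_for` is a hypothetical helper name used only for illustration.

```ruby
# Ceiling division: the number of fixed-size blocks needed to cover n elements.
# grid_for is a hypothetical helper, not an Ignis method.
def grid_for(n, block = 256)
  [(n + block - 1) / block]
end

small = grid_for(1)     # one element still needs one block
exact = grid_for(256)   # exactly one full block
spill = grid_for(257)   # one extra element forces a second block
```

Because the last block may be only partly used, a kernel launched this way presumably guards each thread with an index check against `n`, which is why `n` is passed as the final kernel argument.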
#-(other) ⇒ Tensor
Elementwise subtraction: self - other
# File 'lib/nnw/ai/tensor.rb', line 358

def -(other)
  other = ensure_tensor(other)
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Elementwise.sub_forward
  n = numel
  grid = [(n + 255) / 256]
  kernel.launch(grid: grid, block: [256], args: [@data, other.data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: should_track?(other), is_leaf: false)

  if result.requires_grad
    Tape.record(result, inputs: [self, other]) do |grad|
      neg_grad = alloc_like(grad)
      scale_k = Ignis::JIT::Kernels::Elementwise.scale_forward
      gn = grad.numel
      scale_k.launch(grid: [(gn + 255) / 256], block: [256], args: [grad, neg_grad, -1.0, gn])
      [grad, neg_grad]
    end
  end

  result
end
#add_bias(bias) ⇒ Tensor
Row-broadcast bias add: self [rows, cols] + bias [cols] -> [rows, cols]. (Linear layer bias; plain Tensor#+ requires equal element counts.)
# File 'lib/nnw/ai/tensor.rb', line 173

def add_bias(bias)
  cols = shape[-1]
  rows = numel / cols
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Elementwise.add_bias_rows
  kernel.launch(grid: [(numel + 255) / 256], block: [256],
                args: [@data, bias.data, result_nv, rows, cols])

  result = Tensor.new(data: result_nv, requires_grad: should_track?(bias), is_leaf: false)

  if result.requires_grad
    Tape.record(result, inputs: [self, bias]) do |grad|
      # d/d(input) = grad (passthrough); d/d(bias) = sum over rows
      grad_bias = zeros_nv([cols])
      bk = Ignis::JIT::Kernels::Elementwise.add_backward_broadcast
      bk.launch(grid: [(cols + 255) / 256], block: [256], args: [grad, grad_bias, rows, cols])
      [grad, grad_bias]
    end
  end

  result
end
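As a CPU reference for what the two kernels compute, the following sketch performs the row-broadcast add and its backward on plain Ruby arrays. It assumes a row-major flat layout, which the page does not state for the GPU buffers; `add_bias_rows` and `add_bias_backward` here are illustrative functions, not the Ignis kernels of similar name.

```ruby
# CPU reference (assumes row-major flat storage; illustrative only).
def add_bias_rows(x, bias, rows, cols)
  Array.new(rows * cols) { |i| x[i] + bias[i % cols] }
end

# Backward: the input gradient passes through unchanged;
# the bias gradient is the column-wise sum over all rows.
def add_bias_backward(grad, rows, cols)
  grad_bias = Array.new(cols, 0.0)
  rows.times do |r|
    cols.times { |c| grad_bias[c] += grad[r * cols + c] }
  end
  [grad, grad_bias]
end

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] # shape [2, 3]
b = [10.0, 20.0, 30.0]
y = add_bias_rows(x, b, 2, 3)
_grad_x, grad_b = add_bias_backward(Array.new(6, 1.0), 2, 3)
```

The bias gradient has shape [cols] because each bias element contributes to every row, so its gradient collects one term per row.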
#backward!(grad_output = nil) ⇒ void
This method returns an undefined value.
Trigger reverse-mode automatic differentiation from this tensor.
# File 'lib/nnw/ai/tensor.rb', line 775

def backward!(grad_output = nil)
  if grad_output.nil? && numel == 1
    # Scalar loss: start with 1.0
    grad_output = Ignis::Shared::NvArray.new(shape: [1], dtype: dtype, device_id: device_id)
    grad_output.from_host([1.0])
  end

  raise ArgumentError, "backward! requires grad_output for non-scalar tensors" if grad_output.nil?

  Tape.backward!(self, grad_output)
end
#decode_sdpa(k, v, num_heads:) ⇒ Tensor
Single-query attention for autoregressive decode with a KV cache.
self = q [1, embed] (the new token's query); k, v = cached keys/values [past+1, embed] (every position up to and including the current one). The new token is the LAST position, so it attends to ALL cached positions and no causal mask is needed. Returns context [1, embed]. No autograd (decode runs under no_grad). Built from the verified column-major GEMM (the 1/sqrt(d) scale folded into alpha) plus the numerically-stable softmax_forward kernel, mirroring sdpa's per-head column layout so head splitting is identical.
# File 'lib/nnw/ai/tensor.rb', line 305

def decode_sdpa(k, v, num_heads:)
  _, = shape
  tk = k.shape[0]
  head_dim = / num_heads
  scale = (1.0 / Math.sqrt(head_dim)).to_f

  context_nv = zeros_nv([1, ])
  sm = Ignis::JIT::Kernels::Attention.softmax_forward
  num_heads.times do |h|
    off = h * head_dim
    qh = slice_cols_nv(@data, off, head_dim, 1, ) # [1, hd]
    kh = slice_cols_nv(k.data, off, head_dim, tk, ) # [tk, hd]
    vh = slice_cols_nv(v.data, off, head_dim, tk, ) # [tk, hd]
    # scores = scale * (qh @ khᵀ) → [1, tk] (alpha folds in the scale)
    scores = Ignis::LinAlg::Matmul.call(qh, kh, transpose_b: true, alpha: scale)
    # probs = softmax(scores) along the tk axis → [1, tk]
    probs = Ignis::Shared::NvArray.new(shape: [1, tk], dtype: :float32, device_id: device_id).to_device
    sm.launch(grid: [1], block: [1], args: [scores, probs, 1, tk])
    # ctx_h = probs @ vh → [1, hd]
    ctx_h = Ignis::LinAlg::Matmul.call(probs, vh)
    scatter_cols_nv!(ctx_h, context_nv, off, head_dim, 1, )
  end

  Tensor.new(data: context_nv, requires_grad: false, is_leaf: false)
end
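The per-head computation described above (scaled scores, softmax over cached positions, weighted sum of values) can be written as a CPU reference on plain arrays. This is a sketch for checking the math, not the Ignis implementation; `decode_attention` and `stable_softmax` are hypothetical names, and `k`/`v` are assumed to be arrays of rows.

```ruby
# Numerically stable softmax over one row.
def stable_softmax(xs)
  m = xs.max
  exps = xs.map { |x| Math.exp(x - m) }
  total = exps.sum
  exps.map { |e| e / total }
end

# CPU reference for single-query attention (illustrative, not Ignis code).
# q: Array of length embed; k, v: tk rows, each an Array of length embed.
def decode_attention(q, k, v, num_heads)
  embed = q.length
  head_dim = embed / num_heads
  scale = 1.0 / Math.sqrt(head_dim)
  context = Array.new(embed, 0.0)
  num_heads.times do |h|
    off = h * head_dim
    qh = q[off, head_dim]
    # One score per cached position: scale * (qh . kh[t])
    scores = k.map { |row| scale * qh.zip(row[off, head_dim]).sum { |a, b| a * b } }
    probs = stable_softmax(scores)
    head_dim.times do |d|
      context[off + d] = probs.each_with_index.sum { |p, t| p * v[t][off + d] }
    end
  end
  context
end

q = [0.0, 0.0, 0.0, 0.0]
k = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
v = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
context = decode_attention(q, k, v, 2)
# A zero query scores every position equally, so context is the mean of the v rows.
```

No mask appears anywhere in the loop: every cached row participates, which matches the note that the new token is the last position.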
#detach ⇒ Tensor
Create a detached tensor that shares the same GPU memory (no data is copied) and does not track gradients.
# File 'lib/nnw/ai/tensor.rb', line 802

def detach
  Tensor.new(data: @data, requires_grad: false, is_leaf: true)
end
#device_id ⇒ Integer
# File 'lib/nnw/ai/tensor.rb', line 130

def device_id
  @data.device_id
end
#dtype ⇒ Symbol
# File 'lib/nnw/ai/tensor.rb', line 120

def dtype
  @data.dtype
end
#gelu ⇒ Tensor
GELU activation (tanh approximation)
# File 'lib/nnw/ai/tensor.rb', line 444

def gelu
  result_nv = alloc_like(@data)
  kernel = Ignis::JIT::Kernels::Activations.gelu_forward
  n = numel
  kernel.launch(grid: [(n + 255) / 256], block: [256], args: [@data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    saved_input = @data
    Tape.record(result, inputs: [self]) do |grad|
      grad_in = alloc_like(grad)
      bk = Ignis::JIT::Kernels::Activations.gelu_backward
      gn = grad.numel
      bk.launch(grid: [(gn + 255) / 256], block: [256], args: [grad, saved_input, grad_in, gn])
      [grad_in]
    end
  end

  result
end
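For reference, the commonly used tanh approximation of GELU and its derivative are sketched below on scalars. The page does not show the kernel source, so treating `gelu_forward` and `gelu_backward` as this exact formula is an assumption; `gelu_tanh` and `gelu_tanh_grad` are illustrative names.

```ruby
# Standard tanh approximation of GELU (assumed to match the kernel; not verified here).
def gelu_tanh(x)
  c = Math.sqrt(2.0 / Math::PI)
  0.5 * x * (1.0 + Math.tanh(c * (x + 0.044715 * x**3)))
end

# Derivative of the approximation, obtained by the product and chain rules.
def gelu_tanh_grad(x)
  c = Math.sqrt(2.0 / Math::PI)
  u = c * (x + 0.044715 * x**3)
  t = Math.tanh(u)
  du = c * (1.0 + 3.0 * 0.044715 * x**2)
  0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du
end

value_at_one = gelu_tanh(1.0)
slope_at_zero = gelu_tanh_grad(0.0)
```

The backward needs the original input rather than the output, which is consistent with the method saving `@data` as `saved_input` for the tape.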
#item ⇒ Float, Integer
Get scalar value (for single-element tensors).
# File 'lib/nnw/ai/tensor.rb', line 814

def item
  raise "item() requires a single-element tensor, got shape #{shape}" unless numel == 1
  to_host[0]
end
#layer_norm(weight, bias, eps: 1e-5) ⇒ Tensor
Layer normalization
# File 'lib/nnw/ai/tensor.rb', line 522

def layer_norm(weight, bias, eps: 1e-5)
  norm_size = shape[-1]
  outer_size = numel / norm_size

  result_nv = alloc_like(@data)

  # Allocate mean and rstd storage for backward pass
  mean_nv = Ignis::Shared::NvArray.new(shape: [outer_size], dtype: dtype, device_id: device_id)
  mean_nv.from_host(Array.new(outer_size, 0.0))
  rstd_nv = Ignis::Shared::NvArray.new(shape: [outer_size], dtype: dtype, device_id: device_id)
  rstd_nv.from_host(Array.new(outer_size, 0.0))

  kernel = Ignis::JIT::Kernels::Normalization.layer_norm_forward
  kernel.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [@data, weight.data, bias.data, result_nv, mean_nv, rstd_nv,
                       outer_size, norm_size, eps])

  result = Tensor.new(data: result_nv,
                      requires_grad: @requires_grad || weight.requires_grad || bias.requires_grad,
                      is_leaf: false)

  if result.requires_grad
    saved_input = @data
    saved_gamma = weight.data
    Tape.record(result, inputs: [self, weight, bias]) do |grad|
      grad_input = alloc_like(grad)
      grad_gamma = Ignis::Shared::NvArray.new(shape: [norm_size], dtype: dtype, device_id: device_id)
      grad_gamma.from_host(Array.new(norm_size, 0.0))
      grad_beta = Ignis::Shared::NvArray.new(shape: [norm_size], dtype: dtype, device_id: device_id)
      grad_beta.from_host(Array.new(norm_size, 0.0))
      bk = Ignis::JIT::Kernels::Normalization.layer_norm_backward
      bk.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [grad, saved_input, saved_gamma, mean_nv, rstd_nv,
                       grad_input, grad_gamma, grad_beta, outer_size, norm_size])
      [grad_input, grad_gamma, grad_beta]
    end
  end

  result
end
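A CPU reference for the forward pass is sketched below. It uses the textbook definition (per-row mean and biased variance over the last dimension, then scale and shift); that the kernel follows exactly this definition is an assumption, and `layer_norm_rows` is an illustrative name.

```ruby
# Textbook layer normalization over rows of length norm_size (illustrative only).
# x is a flat Array of Floats; weight and bias have length norm_size.
def layer_norm_rows(x, weight, bias, norm_size, eps: 1e-5)
  x.each_slice(norm_size).flat_map do |row|
    mean = row.sum / norm_size
    var = row.sum { |v| (v - mean)**2 } / norm_size
    rstd = 1.0 / Math.sqrt(var + eps) # the per-row value a backward pass would reuse
    row.each_with_index.map { |v, i| (v - mean) * rstd * weight[i] + bias[i] }
  end
end

normalized = layer_norm_rows([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], [0.0, 0.0], 2)
# Each row of two distinct values normalizes to roughly [-1, 1].
```

The per-row `mean` and `rstd` correspond to the `mean_nv` and `rstd_nv` buffers of shape [outer_size] that the method allocates for the backward pass.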
#matmul(other, transpose_b: false) ⇒ Tensor
Matrix multiplication: self @ other
# File 'lib/nnw/ai/tensor.rb', line 145

def matmul(other, transpose_b: false)
  result_data = Ignis::LinAlg::Matmul.call(@data, other.data, transpose_b: transpose_b)
  result = Tensor.new(data: result_data, requires_grad: should_track?(other), is_leaf: false)

  if result.requires_grad
    saved_self = @data
    saved_other = other.data
    Tape.record(result, inputs: [self, other]) do |grad|
      if transpose_b
        # y = A @ Bᵀ ⇒ dA = grad @ B, dB = gradᵀ @ A
        grad_a = Ignis::LinAlg::Matmul.call(grad, saved_other)
        grad_b = Ignis::LinAlg::Matmul.call(grad, saved_self, transpose_a: true)
      else
        # dA = grad @ Bᵀ, dB = Aᵀ @ grad
        grad_a = Ignis::LinAlg::Matmul.call(grad, saved_other, transpose_b: true)
        grad_b = Ignis::LinAlg::Matmul.call(saved_self, grad, transpose_a: true)
      end
      [grad_a, grad_b]
    end
  end

  result
end
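The gradient formulas in the comments for the non-transposed case (dA = grad @ Bᵀ, dB = Aᵀ @ grad) can be checked on nested Ruby arrays. `mm` and `matmul_backward` below are illustrative CPU helpers, not Ignis functions.

```ruby
# Plain nested-array matrix product (illustrative CPU helper).
def mm(a, b)
  bt = b.transpose
  a.map { |row| bt.map { |col| row.zip(col).sum { |x, y| x * y } } }
end

# For y = A @ B with upstream gradient g (same shape as y):
#   dA = g @ Bᵀ   and   dB = Aᵀ @ g
def matmul_backward(a, b, g)
  [mm(g, b.transpose), mm(a.transpose, g)]
end

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]             # shape [3, 2]
b = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]     # shape [2, 4]
g = Array.new(3) { Array.new(4, 1.0) }               # shape [3, 4]
da, db = matmul_backward(a, b, g)
# da has the shape of a ([3, 2]); db has the shape of b ([2, 4]).
```

A quick sanity check on any backward rule is that each gradient has the same shape as the input it belongs to, which is what forces the transposes into those positions.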
#mean ⇒ Tensor
Mean reduction (all elements → scalar)
# File 'lib/nnw/ai/tensor.rb', line 743

def mean
  n = numel
  sum_result = self.sum

  # Scale by 1/n
  result_nv = Ignis::Shared::NvArray.new(shape: [1], dtype: dtype, device_id: device_id)
  result_nv.from_host([0.0])

  kernel = Ignis::JIT::Kernels::Elementwise.scale_forward
  kernel.launch(grid: [1], block: [1], args: [sum_result.data, result_nv, 1.0 / n, 1])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    orig_shape = shape
    Tape.record(result, inputs: [self]) do |grad|
      grad_input = Ignis::Shared::NvArray.new(shape: orig_shape, dtype: dtype, device_id: device_id)
      grad_input.from_host(Array.new(n, 0.0))
      bk = Ignis::JIT::Kernels::Elementwise.broadcast_grad
      bk.launch(grid: [(n + 255) / 256], block: [256], args: [grad, grad_input, 1.0 / n, n])
      [grad_input]
    end
  end

  result
end
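The backward of a full reduction is a broadcast: every input element receives the scalar upstream gradient times a constant factor, 1/n for mean and 1 for sum. The sketch below shows this on plain arrays; `broadcast_grad_ref` is a hypothetical stand-in for what the `broadcast_grad` kernel appears to do with its scale argument.

```ruby
# Broadcast a scalar gradient to n elements with a scale factor (illustrative only).
def broadcast_grad_ref(grad, scale, n)
  Array.new(n, grad * scale)
end

n = 4
mean_grad = broadcast_grad_ref(2.0, 1.0 / n, n) # d(mean)/dx_i = 1/n
sum_grad = broadcast_grad_ref(2.0, 1.0, n)      # d(sum)/dx_i  = 1
```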
#numel ⇒ Integer
# File 'lib/nnw/ai/tensor.rb', line 125

def numel
  @data.numel
end
#relu ⇒ Tensor
ReLU activation
# File 'lib/nnw/ai/tensor.rb', line 420

def relu
  result_nv = alloc_like(@data)
  kernel = Ignis::JIT::Kernels::Activations.relu_forward(numel)
  n = numel
  kernel.launch(grid: [(n + 255) / 256], block: [256], args: [@data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    saved_input = @data
    Tape.record(result, inputs: [self]) do |grad|
      grad_in = alloc_like(grad)
      bk = Ignis::JIT::Kernels::Activations.relu_backward
      gn = grad.numel
      bk.launch(grid: [(gn + 255) / 256], block: [256], args: [grad, saved_input, grad_in, gn])
      [grad_in]
    end
  end

  result
end
#reshape(new_shape) ⇒ Tensor
Reshape (zero-copy if contiguous)
# File 'lib/nnw/ai/tensor.rb', line 691

def reshape(new_shape)
  new_numel = new_shape.reduce(1, :*)
  raise ArgumentError, "Cannot reshape #{shape} to #{new_shape}" unless new_numel == numel

  # View over @data's buffer: non-owning, retains parent so it isn't freed
  # while the view is alive (and never double-frees the shared allocation).
  result_nv = Ignis::Shared::NvArray.new(shape: new_shape, dtype: dtype, device_id: device_id,
                                         ptr: @data.ptr, parent: @data)
  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    original_shape = shape
    Tape.record(result, inputs: [self]) do |grad|
      # Backward: reshape grad back to original shape (view over grad)
      grad_reshaped = Ignis::Shared::NvArray.new(shape: original_shape, dtype: dtype, device_id: device_id,
                                                 ptr: grad.ptr, parent: grad)
      [grad_reshaped]
    end
  end

  result
end
#rms_norm(weight, eps: 1e-5) ⇒ Tensor
RMSNorm: y = gamma * x / sqrt(mean(x^2) + eps) (Llama/Qwen/Mistral style). No mean-subtraction and no bias (vs LayerNorm). Normalizes the last dim.
# File 'lib/nnw/ai/tensor.rb', line 568

def rms_norm(weight, eps: 1e-5)
  norm_size = shape[-1]
  outer_size = numel / norm_size

  result_nv = alloc_like(@data)
  # rstd per row, saved for backward
  rstd_nv = Ignis::Shared::NvArray.new(shape: [outer_size], dtype: dtype, device_id: device_id)
  rstd_nv.zero!

  fwd = Ignis::JIT::Kernels::Normalization.rms_norm_forward
  fwd.launch(grid: [(outer_size + 255) / 256], block: [256],
             args: [@data, weight.data, result_nv, rstd_nv, outer_size, norm_size, eps.to_f])

  result = Tensor.new(data: result_nv,
                      requires_grad: @requires_grad || weight.requires_grad,
                      is_leaf: false)

  if result.requires_grad
    saved_input = @data
    saved_gamma = weight.data
    Tape.record(result, inputs: [self, weight]) do |grad|
      grad_input = alloc_like(grad)
      grad_gamma = zeros_nv([norm_size])
      bk = Ignis::JIT::Kernels::Normalization.rms_norm_backward
      bk.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [grad, saved_input, saved_gamma, rstd_nv,
                       grad_input, grad_gamma, outer_size, norm_size])
      [grad_input, grad_gamma]
    end
  end

  result
end
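The formula in the description, y = gamma * x / sqrt(mean(x^2) + eps), translates directly into a CPU reference. `rms_norm_rows` is an illustrative name and operates on a flat Ruby array split into rows of `norm_size`.

```ruby
# CPU reference for RMSNorm over rows of length norm_size (illustrative only).
def rms_norm_rows(x, gamma, norm_size, eps: 1e-5)
  x.each_slice(norm_size).flat_map do |row|
    # rstd = 1 / sqrt(mean(x^2) + eps), one value per row
    rstd = 1.0 / Math.sqrt(row.sum { |v| v * v } / norm_size + eps)
    row.each_with_index.map { |v, i| gamma[i] * v * rstd }
  end
end

y = rms_norm_rows([3.0, 4.0], [1.0, 1.0], 2)
# With gamma = 1 the output row has a root-mean-square of about 1.
```

Unlike layer normalization there is no mean to subtract, so only the per-row `rstd` needs to be kept for the backward pass.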
#rope(num_heads:, base: 10000.0, pos_offset: 0, inv_freq: nil) ⇒ Tensor
Rotary Position Embedding (RoPE), HF/Llama/Qwen “rotate_half” convention. self is [seq, num_heads*head_dim]; rotates each head’s dims by its absolute position. No learned parameters — the backward is the same rotation with the sin sign flipped (orthogonal rotation ⇒ R^T = R(-θ)). Applied to Q and K.
# File 'lib/nnw/ai/tensor.rb', line 616

def rope(num_heads:, base: 10000.0, pos_offset: 0, inv_freq: nil)
  seq, = shape
  head_dim = / num_heads
  # rotate_half RoPE pairs dim i with i+head_dim/2, so it is only well-defined
  # for EVEN head_dim. With an odd head_dim the pairing collides (one dim is
  # used twice, another never), giving a non-orthogonal map whose forward AND
  # gradient are silently wrong. No real architecture uses odd head_dim, so fail
  # loud rather than miscompute.
  raise ArgumentError, "RoPE requires an even head_dim (got #{head_dim} = #{}/#{num_heads}); " \
                       "rotate_half is only defined for paired dimensions" unless head_dim.even?
  half = head_dim / 2
  invf_nv = case inv_freq
            when Ignis::Shared::NvArray then inv_freq
            when Array then nv_from_floats(inv_freq)
            else nv_from_floats((0...half).map { |i| base.to_f**(-2.0 * i / head_dim) })
            end

  out_nv = alloc_like(@data)
  total = seq * 
  k = Ignis::JIT::Kernels::Attention.rope_apply
  k.launch(grid: [(total + 255) / 256], block: [256],
           args: [@data, out_nv, seq, num_heads, head_dim, pos_offset, invf_nv, 1.0])

  result = Tensor.new(data: out_nv, requires_grad: @requires_grad, is_leaf: false)

  if result.requires_grad
    Tape.record(result, inputs: [self]) do |grad|
      gin = alloc_like(grad)
      # backward = forward rotation with negated sin (transpose of an orthogonal rotation)
      k.launch(grid: [(total + 255) / 256], block: [256],
               args: [grad, gin, seq, num_heads, head_dim, pos_offset, invf_nv, -1.0])
      [gin]
    end
  end

  result
end
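The rotate_half pairing and the "negated sin" backward can be checked on a single head vector in plain Ruby. This sketch follows the convention described above (dimension i paired with i + head_dim/2, frequency base^(-2i/head_dim)); `rope_head` and its `sin_sign` parameter are illustrative names modeled on the trailing 1.0 / -1.0 kernel argument.

```ruby
# CPU reference: rotate one head vector at absolute position pos (illustrative only).
# sin_sign = 1.0 is the forward rotation; -1.0 is its inverse (and transpose).
def rope_head(x, pos, base: 10000.0, sin_sign: 1.0)
  head_dim = x.length
  raise ArgumentError, "head_dim must be even" unless head_dim.even?

  half = head_dim / 2
  out = Array.new(head_dim)
  half.times do |i|
    theta = pos * base**(-2.0 * i / head_dim)
    c = Math.cos(theta)
    s = sin_sign * Math.sin(theta)
    out[i] = x[i] * c - x[i + half] * s
    out[i + half] = x[i + half] * c + x[i] * s
  end
  out
end

x = [1.0, 2.0, 3.0, 4.0]
rotated = rope_head(x, 5)
restored = rope_head(rotated, 5, sin_sign: -1.0)
# restored matches x up to floating-point rounding, and the vector norm is unchanged.
```

Because each pair undergoes a plain 2D rotation, applying the same angles with the sine negated undoes it, which is why the backward can reuse the forward kernel.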
#sdpa(k, v, num_heads:, num_kv_heads: nil, causal: true) ⇒ Tensor
Multi-head / grouped-query scaled dot-product attention (causal optional), batch = 1. self = Q [seq, num_heads*head_dim]; k, v = [seq, num_kv_heads*head_dim]. Returns context [seq, num_heads*head_dim].
With num_kv_heads == num_heads this is standard multi-head attention. With num_kv_heads < num_heads it is Grouped-Query Attention (Llama-2-70B, Llama-3, Qwen2/3, SmolLM3): each KV head is shared by group_size = num_heads/num_kv_heads query heads. Each query head runs the Flash-Attention-2 kernel against its group’s KV head. In the backward, the group_size query heads that share a KV head ACCUMULATE into that head’s dK/dV (scatter-add); dQ heads are disjoint.
# File 'lib/nnw/ai/tensor.rb', line 213

def sdpa(k, v, num_heads:, num_kv_heads: nil, causal: true)
  num_kv_heads ||= num_heads
  raise ArgumentError, "num_heads (#{num_heads}) must be a multiple of num_kv_heads (#{num_kv_heads})" \
    unless (num_heads % num_kv_heads).zero?

  seq, = shape # embed = num_heads * head_dim
  head_dim = / num_heads
  # The flash-attention kernels store per-head rows in fixed [HEAD_DIM_MAX=128]
  # register arrays and clamp every dim loop to d < 128. For head_dim > 128
  # they would silently drop dims 128.. from scores/output/gradients with no
  # error. Targets (Qwen3/Llama/SmolLM/Phi) use head_dim ≤ 128; fail loud above
  # that rather than miscompute. (decode_sdpa uses cuBLAS+softmax and has no cap.)
  raise ArgumentError, "head_dim #{head_dim} exceeds flash-attention HEAD_DIM_MAX (128); " \
                       "larger heads are not yet supported by the flash kernels" if head_dim > 128
   = num_kv_heads * head_dim
  group_size = num_heads / num_kv_heads
  scale = (1.0 / Math.sqrt(head_dim)).to_f
  cmask = causal ? 1 : 0

  context_nv = zeros_nv([seq, ])
  fwd = Ignis::JIT::Kernels::Attention.flash_attention_forward
  q_tiles = (seq + 63) / 64
  num_heads.times do |h|
    qoff = h * head_dim
    koff = (h / group_size) * head_dim # the KV head this query head attends to
    qh = slice_cols_nv(@data, qoff, head_dim, seq, )
    kh = slice_cols_nv(k.data, koff, head_dim, seq, )
    vh = slice_cols_nv(v.data, koff, head_dim, seq, )
    oh = zeros_nv([seq, head_dim])
    fwd.launch(grid: [q_tiles], block: [64],
               args: [qh, kh, vh, oh, seq, head_dim, scale, cmask])
    scatter_cols_nv!(oh, context_nv, qoff, head_dim, seq, )
  end

  result = Tensor.new(data: context_nv,
                      requires_grad: @requires_grad || should_track?(k) || should_track?(v),
                      is_leaf: false)

  if result.requires_grad
    sq = @data
    sk = k.data
    sv = v.data
    so = context_nv
    Tape.record(result, inputs: [self, k, v]) do |grad|
      d_q = zeros_nv([seq, ])
      d_k = zeros_nv([seq, ])
      d_v = zeros_nv([seq, ])
      bwd = Ignis::JIT::Kernels::Attention.flash_attention_backward
      blk = (seq + 255) / 256
      num_heads.times do |h|
        qoff = h * head_dim
        koff = (h / group_size) * head_dim
        qh = slice_cols_nv(sq, qoff, head_dim, seq, )
        kh = slice_cols_nv(sk, koff, head_dim, seq, )
        vh = slice_cols_nv(sv, koff, head_dim, seq, )
        oh = slice_cols_nv(so, qoff, head_dim, seq, )
        doh = slice_cols_nv(grad, qoff, head_dim, seq, )
        dqh = zeros_nv([seq, head_dim])
        dkh = zeros_nv([seq, head_dim])
        dvh = zeros_nv([seq, head_dim])
        bwd.launch(grid: [blk], block: [256],
                   args: [qh, kh, vh, oh, doh, dqh, dkh, dvh, seq, head_dim, scale, cmask])
        # dQ heads are disjoint → overwrite. dK/dV heads are SHARED across the
        # group → accumulate (add into a zero-initialized buffer). For MHA
        # (group_size==1) the KV columns are disjoint too, so add-into-zero is
        # numerically identical to the previous overwrite; no regression.
        scatter_cols_nv!(dqh, d_q, qoff, head_dim, seq, )
        scatter_cols_add_nv!(dkh, d_k, koff, head_dim, seq, )
        scatter_cols_add_nv!(dvh, d_v, koff, head_dim, seq, )
      end
      [d_q, d_k, d_v]
    end
  end

  result
end
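The grouped-query mapping reduces to integer division: query head h reads KV head h / group_size, so its key/value column offset is (h / group_size) * head_dim. The sketch below reproduces only that index arithmetic; `kv_head_for` is a hypothetical helper name.

```ruby
# Which KV head does query head h attend to under GQA? (illustrative helper)
def kv_head_for(h, num_heads, num_kv_heads)
  unless (num_heads % num_kv_heads).zero?
    raise ArgumentError, "num_heads (#{num_heads}) must be a multiple of num_kv_heads (#{num_kv_heads})"
  end

  group_size = num_heads / num_kv_heads
  h / group_size
end

head_dim = 64
mapping = (0...8).map { |h| kv_head_for(h, 8, 2) }
kv_offsets = mapping.map { |kvh| kvh * head_dim }
# mapping    => [0, 0, 0, 0, 1, 1, 1, 1]
# kv_offsets => [0, 0, 0, 0, 64, 64, 64, 64]
```

Several query heads landing on the same offset is exactly why the backward pass must add into dK/dV rather than overwrite them.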
#shape ⇒ Array<Integer>
# File 'lib/nnw/ai/tensor.rb', line 115

def shape
  @data.shape
end
#silu ⇒ Tensor
SiLU activation: x * sigmoid(x)
# File 'lib/nnw/ai/tensor.rb', line 468

def silu
  result_nv = alloc_like(@data)
  kernel = Ignis::JIT::Kernels::Activations.silu_forward
  n = numel
  kernel.launch(grid: [(n + 255) / 256], block: [256], args: [@data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    saved_input = @data
    Tape.record(result, inputs: [self]) do |grad|
      grad_in = alloc_like(grad)
      bk = Ignis::JIT::Kernels::Activations.silu_backward
      gn = grad.numel
      bk.launch(grid: [(gn + 255) / 256], block: [256], args: [grad, saved_input, grad_in, gn])
      [grad_in]
    end
  end

  result
end
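On scalars, SiLU and its derivative can be written as follows. The derivative follows from the product rule applied to x * sigmoid(x); that `silu_backward` computes exactly this expression is an assumption, and the function names are illustrative.

```ruby
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

# SiLU as stated in the description: x * sigmoid(x).
def silu_ref(x)
  x * sigmoid(x)
end

# d/dx [x * s(x)] = s(x) + x * s(x) * (1 - s(x)) = s(x) * (1 + x * (1 - s(x)))
def silu_grad_ref(x)
  s = sigmoid(x)
  s * (1.0 + x * (1.0 - s))
end

slope_at_zero = silu_grad_ref(0.0)
```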
#softmax ⇒ Tensor
Softmax along last dimension
# File 'lib/nnw/ai/tensor.rb', line 492

def softmax
  last_dim = shape[-1]
  outer_size = numel / last_dim

  result_nv = alloc_like(@data)
  kernel = Ignis::JIT::Kernels::Attention.softmax_forward
  kernel.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [@data, result_nv, outer_size, last_dim])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    saved_output = result_nv
    Tape.record(result, inputs: [self]) do |grad|
      grad_in = alloc_like(grad)
      bk = Ignis::JIT::Kernels::Attention.softmax_backward
      bk.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [grad, saved_output, grad_in, outer_size, last_dim])
      [grad_in]
    end
  end

  result
end
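Note that the method saves the output, not the input, for the backward pass. The standard softmax gradient only needs the output y: dx_i = y_i * (g_i - Σ_j g_j y_j). A CPU sketch for one row follows; whether the kernel uses this exact form is an assumption, and the function names are illustrative.

```ruby
# Numerically stable softmax for one row (subtract the max before exponentiating).
def softmax_row(xs)
  m = xs.max
  exps = xs.map { |x| Math.exp(x - m) }
  total = exps.sum
  exps.map { |e| e / total }
end

# Backward from the saved output y: dx_i = y_i * (g_i - sum_j g_j * y_j)
def softmax_backward_row(grad, y)
  dot = grad.zip(y).sum { |g, p| g * p }
  grad.zip(y).map { |g, p| p * (g - dot) }
end

logits = [1.0, 2.0, 3.0]
probs = softmax_row(logits)
upstream = [0.1, -0.2, 0.3]
grad_logits = softmax_backward_row(upstream, probs)
# The input gradients of a softmax row sum to zero (up to rounding).
```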
#sum ⇒ Tensor
Sum reduction (all elements → scalar)
# File 'lib/nnw/ai/tensor.rb', line 716

def sum
  n = numel
  result_nv = Ignis::Shared::NvArray.new(shape: [1], dtype: dtype, device_id: device_id)
  result_nv.from_host([0.0])

  kernel = Ignis::JIT::Kernels::Elementwise.sum_reduce
  kernel.launch(grid: [1], block: [1], args: [@data, result_nv, 1, n])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    orig_shape = shape
    Tape.record(result, inputs: [self]) do |grad|
      # Gradient of sum is broadcast of 1.0 to original shape
      grad_input = Ignis::Shared::NvArray.new(shape: orig_shape, dtype: dtype, device_id: device_id)
      grad_input.from_host(Array.new(n, 0.0))
      bk = Ignis::JIT::Kernels::Elementwise.broadcast_grad
      bk.launch(grid: [(n + 255) / 256], block: [256], args: [grad, grad_input, 1.0, n])
      [grad_input]
    end
  end

  result
end
#to_host ⇒ Array<Numeric>
Copy GPU data to host as Ruby Array.
# File 'lib/nnw/ai/tensor.rb', line 808

def to_host
  @data.to_host
end
#transpose(dim0 = 0, dim1 = 1) ⇒ Tensor
Transpose a 2D tensor (swap rows and columns). The dim0 and dim1 arguments are accepted, but the implementation shown here does not read them: it always swaps the two axes and raises ArgumentError for non-2D input.
# File 'lib/nnw/ai/tensor.rb', line 660

def transpose(dim0 = 0, dim1 = 1)
  raise ArgumentError, "transpose requires 2D tensor" unless shape.length == 2

  rows = shape[0]
  cols = shape[1]
  result_nv = Ignis::Shared::NvArray.new(shape: [cols, rows], dtype: dtype, device_id: device_id)
  result_nv.to_device # transpose_2d writes every element, so alloc only, no host zeroing

  kernel = Ignis::JIT::Kernels::Elementwise.transpose_2d
  grid_x = (cols + 31) / 32
  grid_y = (rows + 31) / 32
  kernel.launch(grid: [grid_x, grid_y], block: [32, 8], args: [@data, result_nv, rows, cols])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    Tape.record(result, inputs: [self]) do |grad|
      # Backward of transpose is transpose
      grad_t = alloc_like(@data)
      kernel_t = Ignis::JIT::Kernels::Elementwise.transpose_2d
      kernel_t.launch(grid: [grid_y, grid_x], block: [32, 8], args: [grad, grad_t, cols, rows])
      [grad_t]
    end
  end

  result
end
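The "backward of transpose is transpose" comment can be verified on a flat array. The sketch assumes row-major storage, which the page does not state for NvArray; `transpose_2d_ref` is an illustrative CPU function, not the kernel.

```ruby
# CPU reference: transpose a rows x cols matrix stored as a flat row-major array.
def transpose_2d_ref(x, rows, cols)
  out = Array.new(rows * cols)
  rows.times do |r|
    cols.times { |c| out[c * rows + r] = x[r * cols + c] }
  end
  out
end

x = [1, 2, 3, 4, 5, 6]                  # shape [2, 3]
xt = transpose_2d_ref(x, 2, 3)          # shape [3, 2]
back = transpose_2d_ref(xt, 3, 2)       # swapping the dims again restores x
```

The second call passes (cols, rows), mirroring how the backward launch swaps both the grid dimensions and the rows/cols arguments.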
#zero_grad! ⇒ void
This method returns an undefined value.
Zero out gradients. An existing gradient buffer is filled with zeros in place rather than being reset to nil, which avoids a reallocation on every training step; a buffer is allocated only if none exists yet.
# File 'lib/nnw/ai/tensor.rb', line 789

def zero_grad!
  if @grad
    n = @grad.numel
    fill_k = Ignis::JIT::Kernels::Elementwise.fill
    fill_k.launch(grid: [(n + 255) / 256], block: [256], args: [@grad, 0.0, n])
  else
    @grad = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
    @grad.from_host(Array.new(numel, 0.0))
  end
end