Class: Ignis::AI::Tensor

Inherits:
Object
Defined in:
lib/nnw/ai/tensor.rb

Overview

Tensor — the user-facing GPU tensor type for AI operations.

Wraps Ignis::Shared::NvArray and adds gradient tracking for autograd. All compute ops record backward functions on the tape when requires_grad is true.

Examples:

Forward + backward

a = Ignis::AI::Tensor.from_host([1.0, 2.0, 3.0], shape: [3], requires_grad: true)
b = a * a   # b = a^2
b.sum.backward!
a.grad  # => [2.0, 4.0, 6.0]
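
To make the tape mechanics behind this example concrete, here is a CPU-only sketch in plain Ruby. TinyTensor and TinyTape are hypothetical stand-ins invented for illustration; they are not Ignis classes, and the real Tape may be organized differently. The sketch only assumes what the overview states: each op records a backward function, and backward! replays them in reverse.

```ruby
# Hypothetical stand-ins for illustration only (not part of Ignis).
class TinyTensor
  attr_reader :values
  attr_accessor :grad

  def initialize(values)
    @values = values
    @grad = nil
  end
end

class TinyTape
  def initialize
    @entries = [] # [output, inputs, backward_proc], in forward order
  end

  def record(output, inputs, &backward)
    @entries << [output, inputs, backward]
  end

  # Walk the tape in reverse and accumulate gradients into each input.
  def backward!(root)
    root.grad = [1.0] # scalar loss: seed with 1.0
    @entries.reverse_each do |output, inputs, backward|
      next if output.grad.nil?

      backward.call(output.grad).each_with_index do |g, i|
        t = inputs[i]
        t.grad = t.grad ? t.grad.zip(g).map { |x, y| x + y } : g
      end
    end
  end
end

tape = TinyTape.new
a = TinyTensor.new([1.0, 2.0, 3.0])

# b = a * a: each input receives grad * (the other operand)
b = TinyTensor.new(a.values.map { |x| x * x })
tape.record(b, [a, a]) do |grad|
  g = grad.zip(a.values).map { |gi, ai| gi * ai }
  [g, g]
end

# s = sum(b): the gradient is broadcast back to b's shape
s = TinyTensor.new([b.values.sum])
tape.record(s, [b]) { |grad| [Array.new(b.values.size, grad[0])] }

tape.backward!(s)
a_grad = a.grad # => [2.0, 4.0, 6.0]
```

Because a appears twice as an input of the multiplication, its two partial gradients are summed, which is where the factor of 2 comes from.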

Constructor Details

#initialize(data:, requires_grad: false, grad_fn: nil, is_leaf: true) ⇒ Tensor

Returns a new instance of Tensor.

Parameters:

  • data (Ignis::Shared::NvArray)
  • requires_grad (Boolean) (defaults to: false)
  • grad_fn (Proc, nil) (defaults to: nil)
  • is_leaf (Boolean) (defaults to: true)


# File 'lib/nnw/ai/tensor.rb', line 40

def initialize(data:, requires_grad: false, grad_fn: nil, is_leaf: true)
  @data = data
  @requires_grad = requires_grad
  @grad = nil
  @grad_fn = grad_fn
  @is_leaf = is_leaf
  @_tape_id = nil
end

Instance Attribute Details

#_tape_id ⇒ Integer?

Returns position in current tape.

Returns:

  • (Integer, nil)

    position in current tape

# File 'lib/nnw/ai/tensor.rb', line 34

def _tape_id
  @_tape_id
end

#data ⇒ Ignis::Shared::NvArray (readonly)

Returns underlying GPU data.

Returns:

  • (Ignis::Shared::NvArray)

    underlying GPU data

# File 'lib/nnw/ai/tensor.rb', line 19

def data
  @data
end

#grad ⇒ Ignis::Shared::NvArray?

Returns gradient (same shape as data).

Returns:

  • (Ignis::Shared::NvArray, nil)

    gradient (same shape as data)

# File 'lib/nnw/ai/tensor.rb', line 22

def grad
  @grad
end

#grad_fn ⇒ Proc?

Returns backward function recorded by the tape.

Returns:

  • (Proc, nil)

    backward function recorded by the tape

# File 'lib/nnw/ai/tensor.rb', line 28

def grad_fn
  @grad_fn
end

#is_leaf ⇒ Boolean (readonly)

Returns true if created by user (not computed).

Returns:

  • (Boolean)

    true if created by user (not computed)

# File 'lib/nnw/ai/tensor.rb', line 31

def is_leaf
  @is_leaf
end

#requires_grad ⇒ Boolean (readonly)

Returns whether this tensor participates in autograd.

Returns:

  • (Boolean)

    whether this tensor participates in autograd

# File 'lib/nnw/ai/tensor.rb', line 25

def requires_grad
  @requires_grad
end

Class Method Details

.from_host(ruby_array, shape:, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor

Create a tensor from a Ruby array.

Parameters:

  • ruby_array (Array<Numeric>)
  • shape (Array<Integer>)
  • dtype (Symbol) (defaults to: :float32)
  • device_id (Integer) (defaults to: 0)
  • requires_grad (Boolean) (defaults to: false)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 104

def self.from_host(ruby_array, shape:, dtype: :float32, device_id: 0, requires_grad: false)
  nv = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
  nv.from_host(ruby_array)
  new(data: nv, requires_grad: requires_grad)
end

.from_nv_array(nv_array, requires_grad: false) ⇒ Tensor

Wrap an existing NvArray.

Parameters:

  • nv_array (Ignis::Shared::NvArray)
  • requires_grad (Boolean) (defaults to: false)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 57

def self.from_nv_array(nv_array, requires_grad: false)
  new(data: nv_array, requires_grad: requires_grad)
end

.ones(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor

Create a ones-filled tensor.

Parameters:

  • shape (Array<Integer>)
  • dtype (Symbol) (defaults to: :float32)
  • device_id (Integer) (defaults to: 0)
  • requires_grad (Boolean) (defaults to: false)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 79

def self.ones(shape, dtype: :float32, device_id: 0, requires_grad: false)
  nv = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
  nv.from_host(Array.new(nv.numel, 1.0))
  new(data: nv, requires_grad: requires_grad)
end

.rand(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor

Create a tensor with random uniform values in [0, 1).

Parameters:

  • shape (Array<Integer>)
  • dtype (Symbol) (defaults to: :float32)
  • device_id (Integer) (defaults to: 0)
  • requires_grad (Boolean) (defaults to: false)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 91

def self.rand(shape, dtype: :float32, device_id: 0, requires_grad: false)
  nv = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
  nv.from_host(Array.new(nv.numel) { Kernel.rand })
  new(data: nv, requires_grad: requires_grad)
end

.zeros(shape, dtype: :float32, device_id: 0, requires_grad: false) ⇒ Tensor

Create a zero-filled tensor.

Parameters:

  • shape (Array<Integer>)
  • dtype (Symbol) (defaults to: :float32)
  • device_id (Integer) (defaults to: 0)
  • requires_grad (Boolean) (defaults to: false)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 67

def self.zeros(shape, dtype: :float32, device_id: 0, requires_grad: false)
  nv = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
  nv.from_host(Array.new(nv.numel, 0.0))
  new(data: nv, requires_grad: requires_grad)
end

Instance Method Details

#*(other) ⇒ Tensor

Elementwise multiplication (Hadamard): self * other

Parameters:

  • other

    right-hand operand; a Numeric is routed to scalar_mul, anything else is normalized with ensure_tensor

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 385

def *(other)
  if other.is_a?(Numeric)
    return scalar_mul(other)
  end

  other = ensure_tensor(other)
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Elementwise.mul_forward
  n = numel
  grid = [(n + 255) / 256]
  kernel.launch(grid: grid, block: [256], args: [@data, other.data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: should_track?(other), is_leaf: false)

  if result.requires_grad
    saved_self = @data
    saved_other = other.data
    Tape.record(result, inputs: [self, other]) do |grad|
      grad_a = alloc_like(grad)
      grad_b = alloc_like(grad)
      mk = Ignis::JIT::Kernels::Elementwise.mul_backward
      gn = grad.numel
      g = [(gn + 255) / 256]
      mk.launch(grid: g, block: [256], args: [grad, saved_other, grad_a, gn])
      mk.launch(grid: g, block: [256], args: [grad, saved_self, grad_b, gn])
      [grad_a, grad_b]
    end
  end

  result
end
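
As a rough illustration of the launch arithmetic and the product rule used in the source above, the following CPU-only sketch may help. blocks_for and mul_backward_reference are hypothetical names. The sketch assumes mul_backward multiplies grad by the saved operand elementwise, which the argument order suggests but which is not confirmed by this page.

```ruby
# Ceil-division for a 1-D launch grid: enough 256-thread blocks to cover n
# elements. This mirrors the (n + 255) / 256 expression in the source above.
def blocks_for(n, block_size = 256)
  (n + block_size - 1) / block_size
end

# CPU reference for the product rule of c = a * b (elementwise).
# Assumption: the GPU mul_backward kernel computes grad[i] * saved[i].
def mul_backward_reference(grad, a, b)
  grad_a = grad.zip(b).map { |g, y| g * y } # dc/da = b
  grad_b = grad.zip(a).map { |g, x| g * x } # dc/db = a
  [grad_a, grad_b]
end

grid_for_257 = blocks_for(257) # one element past a full block needs 2 blocks
grads = mul_backward_reference([1.0, 1.0], [2.0, 3.0], [4.0, 5.0])
```

With an upstream gradient of ones, each operand's gradient is simply the other operand.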

#+(other) ⇒ Tensor

Elementwise addition: self + other

Parameters:

  • other

    right-hand operand, normalized with ensure_tensor

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 335

def +(other)
  other = ensure_tensor(other)
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Elementwise.add_forward
  n = numel
  grid = [(n + 255) / 256]
  kernel.launch(grid: grid, block: [256], args: [@data, other.data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: should_track?(other), is_leaf: false)

  if result.requires_grad
    Tape.record(result, inputs: [self, other]) do |grad|
      [grad, grad]  # d(a+b)/da = 1, d(a+b)/db = 1
    end
  end

  result
end

#-(other) ⇒ Tensor

Elementwise subtraction: self - other

Parameters:

  • other

    right-hand operand, normalized with ensure_tensor

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 358

def -(other)
  other = ensure_tensor(other)
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Elementwise.sub_forward
  n = numel
  grid = [(n + 255) / 256]
  kernel.launch(grid: grid, block: [256], args: [@data, other.data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: should_track?(other), is_leaf: false)

  if result.requires_grad
    Tape.record(result, inputs: [self, other]) do |grad|
      neg_grad = alloc_like(grad)
      scale_k = Ignis::JIT::Kernels::Elementwise.scale_forward
      gn = grad.numel
      scale_k.launch(grid: [(gn + 255) / 256], block: [256], args: [grad, neg_grad, -1.0, gn])
      [grad, neg_grad]
    end
  end

  result
end

#add_bias(bias) ⇒ Tensor

Row-broadcast bias add: self [rows, cols] + bias [cols] -> [rows, cols]. (Linear layer bias; plain Tensor#+ requires equal element counts.)

Parameters:

  • bias (Tensor)

    bias of shape [cols]

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 173

def add_bias(bias)
  cols = shape[-1]
  rows = numel / cols
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Elementwise.add_bias_rows
  kernel.launch(grid: [(numel + 255) / 256], block: [256],
                args: [@data, bias.data, result_nv, rows, cols])

  result = Tensor.new(data: result_nv, requires_grad: should_track?(bias), is_leaf: false)

  if result.requires_grad
    Tape.record(result, inputs: [self, bias]) do |grad|
      # d/d(input) = grad (passthrough); d/d(bias) = sum over rows
      grad_bias = zeros_nv([cols])
      bk = Ignis::JIT::Kernels::Elementwise.add_backward_broadcast
      bk.launch(grid: [(cols + 255) / 256], block: [256], args: [grad, grad_bias, rows, cols])
      [grad, grad_bias]
    end
  end

  result
end
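
The row broadcast and its gradient can be sketched on the CPU as follows. This is a minimal reference, assuming a flat row-major [rows, cols] layout (an assumption; the device layout is not stated on this page). add_bias_reference and bias_grad_reference are hypothetical names.

```ruby
# Forward: every row of x gets the same bias vector added.
# Assumption: x is stored flat in row-major order, so column = index % cols.
def add_bias_reference(x, bias, rows, cols)
  Array.new(rows * cols) { |i| x[i] + bias[i % cols] }
end

# Backward for the bias: sum the upstream gradient over rows, per column.
# (The gradient for the input is the upstream gradient unchanged.)
def bias_grad_reference(grad, rows, cols)
  Array.new(cols) { |c| (0...rows).sum { |r| grad[r * cols + c] } }
end

x = [1, 2, 3, 4, 5, 6] # [2, 3]
y = add_bias_reference(x, [10, 20, 30], 2, 3)
db = bias_grad_reference([1, 2, 3, 4, 5, 6], 2, 3)
```

The bias gradient has shape [cols] because the same bias element contributes to every row.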

#backward!(grad_output = nil) ⇒ void

This method returns an undefined value.

Trigger reverse-mode automatic differentiation from this tensor.

Parameters:

  • grad_output (Ignis::Shared::NvArray, nil) (defaults to: nil)

    initial gradient

Raises:

  • (ArgumentError)

    if grad_output is nil and the tensor is not a single-element tensor

# File 'lib/nnw/ai/tensor.rb', line 775

def backward!(grad_output = nil)
  if grad_output.nil? && numel == 1
    # Scalar loss: start with 1.0
    grad_output = Ignis::Shared::NvArray.new(shape: [1], dtype: dtype, device_id: device_id)
    grad_output.from_host([1.0])
  end

  raise ArgumentError, "backward! requires grad_output for non-scalar tensors" if grad_output.nil?

  Tape.backward!(self, grad_output)
end

#decode_sdpa(k, v, num_heads:) ⇒ Tensor

Single-query attention for autoregressive decode with a KV cache.

self = q [1, embed] (the new token’s query); k, v = cached keys/values [past+1, embed] (every position up to and including the current one). The new token is the LAST position, so it attends to ALL cached positions and no causal mask is needed. Returns context [1, embed]. No autograd (decode runs under no_grad). Built from the verified column-major GEMM (the 1/sqrt(d) scale folded into alpha) plus the numerically stable softmax_forward kernel, mirroring sdpa’s per-head column layout so head splitting is identical.

Parameters:

  • k (Tensor)

    cached keys [past+1, embed]

  • v (Tensor)

    cached values [past+1, embed]

  • num_heads (Integer)

Returns:

  • (Tensor)

    context [1, embed]



# File 'lib/nnw/ai/tensor.rb', line 305

def decode_sdpa(k, v, num_heads:)
  _, embed = shape
  tk = k.shape[0]
  head_dim = embed / num_heads
  scale = (1.0 / Math.sqrt(head_dim)).to_f
  context_nv = zeros_nv([1, embed])

  sm = Ignis::JIT::Kernels::Attention.softmax_forward
  num_heads.times do |h|
    off = h * head_dim
    qh = slice_cols_nv(@data, off, head_dim, 1, embed)   # [1, hd]
    kh = slice_cols_nv(k.data, off, head_dim, tk, embed) # [tk, hd]
    vh = slice_cols_nv(v.data, off, head_dim, tk, embed) # [tk, hd]

    # scores = scale * (qh @ khᵀ) → [1, tk]  (alpha folds in the scale)
    scores = Ignis::LinAlg::Matmul.call(qh, kh, transpose_b: true, alpha: scale)
    # probs = softmax(scores) along the tk axis → [1, tk]
    probs = Ignis::Shared::NvArray.new(shape: [1, tk], dtype: :float32, device_id: device_id).to_device
    sm.launch(grid: [1], block: [1], args: [scores, probs, 1, tk])
    # ctx_h = probs @ vh → [1, hd]
    ctx_h = Ignis::LinAlg::Matmul.call(probs, vh)
    scatter_cols_nv!(ctx_h, context_nv, off, head_dim, 1, embed)
  end

  Tensor.new(data: context_nv, requires_grad: false, is_leaf: false)
end
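
For readers who want to check the per-head math without a GPU, here is a CPU reference sketch of single-query attention. stable_softmax and decode_attention_reference are hypothetical names, and the nested-array layout (k and v as arrays of rows) is chosen for readability rather than taken from the library.

```ruby
# Numerically stable softmax: subtract the max before exponentiating.
def stable_softmax(xs)
  m = xs.max
  exps = xs.map { |x| Math.exp(x - m) }
  z = exps.sum
  exps.map { |e| e / z }
end

# q: Array of embed floats (the single query row).
# k, v: Arrays of cached rows, each with embed floats.
def decode_attention_reference(q, k, v, num_heads)
  embed = q.size
  head_dim = embed / num_heads
  scale = 1.0 / Math.sqrt(head_dim)
  context = Array.new(embed, 0.0)

  num_heads.times do |h|
    cols = (h * head_dim)...((h + 1) * head_dim) # this head's column slice
    # scores[t] = scale * (q_h . k_h[t]) for every cached position t
    scores = k.map { |row| scale * cols.sum { |c| q[c] * row[c] } }
    probs = stable_softmax(scores)
    # context_h = probs @ v_h
    cols.each do |c|
      context[c] = probs.each_with_index.sum { |pr, t| pr * v[t][c] }
    end
  end
  context
end

# Two cached positions with identical keys get equal weight, so the
# context is the average of the two value rows.
ctx = decode_attention_reference([1.0, 0.0],
                                 [[1.0, 0.0], [1.0, 0.0]],
                                 [[2.0, 4.0], [4.0, 8.0]], 1)
```

No mask appears anywhere in the sketch: every cached position is visible to the single query, matching the description above.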

#detach ⇒ Tensor

Create a detached tensor that shares the same GPU memory but does not track gradients.

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 802

def detach
  Tensor.new(data: @data, requires_grad: false, is_leaf: true)
end

#device_id ⇒ Integer

Returns:

  • (Integer)

# File 'lib/nnw/ai/tensor.rb', line 130

def device_id
  @data.device_id
end

#dtype ⇒ Symbol

Returns:

  • (Symbol)

# File 'lib/nnw/ai/tensor.rb', line 120

def dtype
  @data.dtype
end

#gelu ⇒ Tensor

GELU activation (tanh approximation)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 444

def gelu
  result_nv = alloc_like(@data)
  kernel = Ignis::JIT::Kernels::Activations.gelu_forward
  n = numel
  kernel.launch(grid: [(n + 255) / 256], block: [256], args: [@data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    saved_input = @data
    Tape.record(result, inputs: [self]) do |grad|
      grad_in = alloc_like(grad)
      bk = Ignis::JIT::Kernels::Activations.gelu_backward
      gn = grad.numel
      bk.launch(grid: [(gn + 255) / 256], block: [256], args: [grad, saved_input, grad_in, gn])
      [grad_in]
    end
  end

  result
end
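
The commonly used tanh approximation of GELU can be written as a short CPU reference. This is a sketch under the assumption that gelu_forward uses the standard constants (sqrt(2/pi) and 0.044715); the kernel source is not shown on this page. gelu_tanh is a hypothetical name.

```ruby
# GELU, tanh approximation:
#   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
# Assumption: these are the constants used by the GPU kernel.
def gelu_tanh(x)
  c = Math.sqrt(2.0 / Math::PI)
  0.5 * x * (1.0 + Math.tanh(c * (x + 0.044715 * x**3)))
end

samples = [-10.0, 0.0, 1.0, 10.0].map { |x| gelu_tanh(x) }
```

For large positive inputs the function approaches the identity, and for large negative inputs it approaches zero.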

#item ⇒ Float, Integer

Get scalar value (for single-element tensors).

Returns:

  • (Float, Integer)

Raises:

  • (RuntimeError)

    if the tensor does not have exactly one element

# File 'lib/nnw/ai/tensor.rb', line 814

def item
  raise "item() requires a single-element tensor, got shape #{shape}" unless numel == 1
  to_host[0]
end

#layer_norm(weight, bias, eps: 1e-5) ⇒ Tensor

Layer normalization

Parameters:

  • weight (Tensor)

    gamma parameter

  • bias (Tensor)

    beta parameter

  • eps (Float) (defaults to: 1e-5)

    epsilon for numerical stability

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 522

def layer_norm(weight, bias, eps: 1e-5)
  norm_size = shape[-1]
  outer_size = numel / norm_size
  result_nv = alloc_like(@data)

  # Allocate mean and rstd storage for backward pass
  mean_nv = Ignis::Shared::NvArray.new(shape: [outer_size], dtype: dtype, device_id: device_id)
  mean_nv.from_host(Array.new(outer_size, 0.0))
  rstd_nv = Ignis::Shared::NvArray.new(shape: [outer_size], dtype: dtype, device_id: device_id)
  rstd_nv.from_host(Array.new(outer_size, 0.0))

  kernel = Ignis::JIT::Kernels::Normalization.layer_norm_forward
  kernel.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [@data, weight.data, bias.data, result_nv, mean_nv, rstd_nv,
                       outer_size, norm_size, eps])

  result = Tensor.new(data: result_nv,
                      requires_grad: @requires_grad || weight.requires_grad || bias.requires_grad,
                      is_leaf: false)

  if result.requires_grad
    saved_input = @data
    saved_gamma = weight.data
    Tape.record(result, inputs: [self, weight, bias]) do |grad|
      grad_input = alloc_like(grad)
      grad_gamma = Ignis::Shared::NvArray.new(shape: [norm_size], dtype: dtype, device_id: device_id)
      grad_gamma.from_host(Array.new(norm_size, 0.0))
      grad_beta = Ignis::Shared::NvArray.new(shape: [norm_size], dtype: dtype, device_id: device_id)
      grad_beta.from_host(Array.new(norm_size, 0.0))

      bk = Ignis::JIT::Kernels::Normalization.layer_norm_backward
      bk.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [grad, saved_input, saved_gamma, mean_nv, rstd_nv,
                       grad_input, grad_gamma, grad_beta, outer_size, norm_size])
      [grad_input, grad_gamma, grad_beta]
    end
  end

  result
end
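
A per-row CPU reference can be useful for checking the forward output. The sketch below assumes the conventional definition (biased variance, rstd = 1/sqrt(var + eps)); whether layer_norm_forward matches it exactly is not confirmed here. layer_norm_reference is a hypothetical name.

```ruby
# Layer norm over one row (the last dimension):
#   y[i] = (x[i] - mean) * rstd * gamma[i] + beta[i]
# Assumption: variance is the biased estimate (divide by n).
def layer_norm_reference(row, gamma, beta, eps = 1e-5)
  n = row.size.to_f
  mean = row.sum / n
  var = row.sum { |x| (x - mean)**2 } / n
  rstd = 1.0 / Math.sqrt(var + eps)
  row.each_with_index.map { |x, i| (x - mean) * rstd * gamma[i] + beta[i] }
end

normed = layer_norm_reference([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4)
```

The mean and rstd computed here correspond to the per-row values that the source above stores in mean_nv and rstd_nv for the backward pass.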

#matmul(other, transpose_b: false) ⇒ Tensor

Matrix multiplication: self @ other

Parameters:

  • other (Tensor)
  • transpose_b (Boolean) (defaults to: false)

    compute self @ other^T. cuBLAS transposes inside the GEMM, which avoids materializing other^T (for the LM head that was a 765ms/forward transpose of a 38M-element weight). Used by Linear.

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 145

def matmul(other, transpose_b: false)
  result_data = Ignis::LinAlg::Matmul.call(@data, other.data, transpose_b: transpose_b)
  result = Tensor.new(data: result_data, requires_grad: should_track?(other), is_leaf: false)

  if result.requires_grad
    saved_self = @data
    saved_other = other.data
    Tape.record(result, inputs: [self, other]) do |grad|
      if transpose_b
        # y = A @ Bᵀ  ⇒  dA = grad @ B,  dB = gradᵀ @ A
        grad_a = Ignis::LinAlg::Matmul.call(grad, saved_other)
        grad_b = Ignis::LinAlg::Matmul.call(grad, saved_self, transpose_a: true)
      else
        # dA = grad @ Bᵀ,  dB = Aᵀ @ grad
        grad_a = Ignis::LinAlg::Matmul.call(grad, saved_other, transpose_b: true)
        grad_b = Ignis::LinAlg::Matmul.call(saved_self, grad, transpose_a: true)
      end
      [grad_a, grad_b]
    end
  end

  result
end
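
The gradient formulas for the transpose_b case can be checked on small nested arrays. The sketch below is a CPU reference only; mm and matmul_tb_backward are hypothetical helper names, and the nested-array representation is not how NvArray stores data.

```ruby
# Plain matrix product on nested arrays: [m, k] @ [k, n] -> [m, n].
def mm(a, b)
  bt = b.transpose
  a.map { |row| bt.map { |col| row.zip(col).sum { |x, y| x * y } } }
end

# Gradients for y = A @ B^T, given the upstream grad (same shape as y).
# A is [m, k], B is [n, k], grad is [m, n].
def matmul_tb_backward(grad, a, b)
  grad_a = mm(grad, b)           # [m, n] @ [n, k] -> [m, k]
  grad_b = mm(grad.transpose, a) # [n, m] @ [m, k] -> [n, k]
  [grad_a, grad_b]
end

a = [[1.0, 2.0]]
b = [[3.0, 4.0], [5.0, 6.0]]
y = mm(a, b.transpose) # forward: A @ B^T
grad_a, grad_b = matmul_tb_backward([[1.0, 1.0]], a, b)
```

Each gradient has the same shape as the operand it belongs to, which is a quick sanity check for any matmul backward.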

#mean ⇒ Tensor

Mean reduction (all elements → scalar)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 743

def mean
  n = numel
  sum_result = self.sum
  # Scale by 1/n
  result_nv = Ignis::Shared::NvArray.new(shape: [1], dtype: dtype, device_id: device_id)
  result_nv.from_host([0.0])
  kernel = Ignis::JIT::Kernels::Elementwise.scale_forward
  kernel.launch(grid: [1], block: [1], args: [sum_result.data, result_nv, 1.0 / n, 1])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    orig_shape = shape
    Tape.record(result, inputs: [self]) do |grad|
      grad_input = Ignis::Shared::NvArray.new(shape: orig_shape, dtype: dtype, device_id: device_id)
      grad_input.from_host(Array.new(n, 0.0))
      bk = Ignis::JIT::Kernels::Elementwise.broadcast_grad
      bk.launch(grid: [(n + 255) / 256], block: [256], args: [grad, grad_input, 1.0 / n, n])
      [grad_input]
    end
  end

  result
end

#numel ⇒ Integer

Returns:

  • (Integer)

# File 'lib/nnw/ai/tensor.rb', line 125

def numel
  @data.numel
end

#relu ⇒ Tensor

ReLU activation

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 420

def relu
  result_nv = alloc_like(@data)
  kernel = Ignis::JIT::Kernels::Activations.relu_forward(numel)
  n = numel
  kernel.launch(grid: [(n + 255) / 256], block: [256], args: [@data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    saved_input = @data
    Tape.record(result, inputs: [self]) do |grad|
      grad_in = alloc_like(grad)
      bk = Ignis::JIT::Kernels::Activations.relu_backward
      gn = grad.numel
      bk.launch(grid: [(gn + 255) / 256], block: [256], args: [grad, saved_input, grad_in, gn])
      [grad_in]
    end
  end

  result
end

#reshape(new_shape) ⇒ Tensor

Reshape (zero-copy if contiguous)

Parameters:

  • new_shape (Array<Integer>)

Returns:

  • (Tensor)

Raises:

  • (ArgumentError)

    if new_shape does not have the same number of elements as the tensor

# File 'lib/nnw/ai/tensor.rb', line 691

def reshape(new_shape)
  new_numel = new_shape.reduce(1, :*)
  raise ArgumentError, "Cannot reshape #{shape} to #{new_shape}" unless new_numel == numel

  # View over @data's buffer: non-owning, retains parent so it isn't freed
  # while the view is alive (and never double-frees the shared allocation).
  result_nv = Ignis::Shared::NvArray.new(shape: new_shape, dtype: dtype, device_id: device_id,
                                       ptr: @data.ptr, parent: @data)
  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    original_shape = shape
    Tape.record(result, inputs: [self]) do |grad|
      # Backward: reshape grad back to original shape (view over grad)
      grad_reshaped = Ignis::Shared::NvArray.new(shape: original_shape, dtype: dtype,
                                                 device_id: device_id, ptr: grad.ptr, parent: grad)
      [grad_reshaped]
    end
  end

  result
end

#rms_norm(weight, eps: 1e-5) ⇒ Tensor

RMSNorm: y = gamma * x / sqrt(mean(x^2) + eps) (Llama/Qwen/Mistral style). No mean-subtraction and no bias (vs LayerNorm). Normalizes the last dim.

Parameters:

  • weight (Tensor)

    gamma scale [norm_size]

  • eps (Float) (defaults to: 1e-5)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 568

def rms_norm(weight, eps: 1e-5)
  norm_size = shape[-1]
  outer_size = numel / norm_size
  result_nv = alloc_like(@data)

  # rstd per row, saved for backward
  rstd_nv = Ignis::Shared::NvArray.new(shape: [outer_size], dtype: dtype, device_id: device_id)
  rstd_nv.zero!

  fwd = Ignis::JIT::Kernels::Normalization.rms_norm_forward
  fwd.launch(grid: [(outer_size + 255) / 256], block: [256],
             args: [@data, weight.data, result_nv, rstd_nv, outer_size, norm_size, eps.to_f])

  result = Tensor.new(data: result_nv,
                      requires_grad: @requires_grad || weight.requires_grad,
                      is_leaf: false)

  if result.requires_grad
    saved_input = @data
    saved_gamma = weight.data
    Tape.record(result, inputs: [self, weight]) do |grad|
      grad_input = alloc_like(grad)
      grad_gamma = zeros_nv([norm_size])
      bk = Ignis::JIT::Kernels::Normalization.rms_norm_backward
      bk.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [grad, saved_input, saved_gamma, rstd_nv,
                       grad_input, grad_gamma, outer_size, norm_size])
      [grad_input, grad_gamma]
    end
  end

  result
end
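
The formula in the description translates directly into a per-row CPU reference. This is a sketch for checking values by hand; rms_norm_reference is a hypothetical name, and exact agreement with rms_norm_forward is assumed rather than verified.

```ruby
# RMSNorm over one row: y[i] = gamma[i] * x[i] / sqrt(mean(x^2) + eps)
# No mean subtraction and no bias, unlike layer norm.
def rms_norm_reference(row, gamma, eps = 1e-5)
  mean_sq = row.sum { |x| x * x } / row.size.to_f
  rstd = 1.0 / Math.sqrt(mean_sq + eps) # the per-row value saved for backward
  row.each_with_index.map { |x, i| gamma[i] * x * rstd }
end

rms_out = rms_norm_reference([3.0, 4.0], [1.0, 1.0])
```

With gamma set to ones, the output row has a mean square very close to 1, up to the eps term.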

#rope(num_heads:, base: 10000.0, pos_offset: 0, inv_freq: nil) ⇒ Tensor

Rotary Position Embedding (RoPE), HF/Llama/Qwen “rotate_half” convention. self is [seq, num_heads*head_dim]; rotates each head’s dims by its absolute position. No learned parameters — the backward is the same rotation with the sin sign flipped (orthogonal rotation ⇒ R^T = R(-θ)). Applied to Q and K.

Parameters:

  • num_heads (Integer)
  • base (Float) (defaults to: 10000.0)

    rotary base θ (Llama/Qwen use 10000; long-context models use larger values). Used only when inv_freq is nil.

  • pos_offset (Integer) (defaults to: 0)

    absolute position of row 0 (for KV-cache decode)

  • inv_freq (Ignis::Shared::NvArray, Array<Float>, nil) (defaults to: nil)

    precomputed [head_dim/2] inverse frequencies. nil ⇒ standard base^(-2i/head_dim). Pass a remapped table for RoPE scaling (llama3/NTK/YaRN).

Returns:

  • (Tensor)

Raises:

  • (ArgumentError)

    if head_dim (embed / num_heads) is odd

# File 'lib/nnw/ai/tensor.rb', line 616

def rope(num_heads:, base: 10000.0, pos_offset: 0, inv_freq: nil)
  seq, embed = shape
  head_dim = embed / num_heads
  # rotate_half RoPE pairs dim i with i+head_dim/2, so it is only well-defined
  # for EVEN head_dim. With an odd head_dim the pairing collides (one dim is
  # used twice, another never), giving a non-orthogonal map whose forward AND
  # gradient are silently wrong. No real architecture uses odd head_dim — fail
  # loud rather than miscompute.
  raise ArgumentError,
        "RoPE requires an even head_dim (got #{head_dim} = #{embed}/#{num_heads}); " \
        "rotate_half is only defined for paired dimensions" unless head_dim.even?

  half = head_dim / 2
  invf_nv = case inv_freq
            when Ignis::Shared::NvArray then inv_freq
            when Array then nv_from_floats(inv_freq)
            else nv_from_floats((0...half).map { |i| base.to_f**(-2.0 * i / head_dim) })
            end

  out_nv = alloc_like(@data)
  total = seq * embed
  k = Ignis::JIT::Kernels::Attention.rope_apply
  k.launch(grid: [(total + 255) / 256], block: [256],
           args: [@data, out_nv, seq, num_heads, head_dim, pos_offset, invf_nv, 1.0])

  result = Tensor.new(data: out_nv, requires_grad: @requires_grad, is_leaf: false)

  if result.requires_grad
    Tape.record(result, inputs: [self]) do |grad|
      gin = alloc_like(grad)
      # backward = forward rotation with negated sin (transpose of an orthogonal rotation)
      k.launch(grid: [(total + 255) / 256], block: [256],
               args: [grad, gin, seq, num_heads, head_dim, pos_offset, invf_nv, -1.0])
      [gin]
    end
  end

  result
end
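
The rotate_half pairing and the sign-flip backward can be illustrated for a single head on the CPU. This is a minimal sketch: rope_head_reference is a hypothetical name, and the sin_sign keyword reflects one reading of the final kernel argument (1.0 forward, -1.0 backward) rather than a documented parameter.

```ruby
# Rotate one head's vector x (length head_dim) at absolute position pos.
# rotate_half pairs dimension i with i + head_dim / 2.
# sin_sign = 1.0 is the forward rotation; -1.0 is its transpose (inverse).
def rope_head_reference(x, pos, base: 10000.0, sin_sign: 1.0)
  head_dim = x.size
  raise ArgumentError, "head_dim must be even" unless head_dim.even?

  half = head_dim / 2
  out = Array.new(head_dim, 0.0)
  half.times do |i|
    theta = pos * base**(-2.0 * i / head_dim)
    c = Math.cos(theta)
    s = sin_sign * Math.sin(theta)
    out[i] = x[i] * c - x[i + half] * s
    out[i + half] = x[i + half] * c + x[i] * s
  end
  out
end

x = [1.0, 2.0, 3.0, 4.0]
rotated = rope_head_reference(x, 5)
restored = rope_head_reference(rotated, 5, sin_sign: -1.0)
```

Applying the negated-sin rotation to the rotated vector recovers the input up to rounding, which is the orthogonality property the backward pass relies on.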

#sdpa(k, v, num_heads:, num_kv_heads: nil, causal: true) ⇒ Tensor

Multi-head / grouped-query scaled dot-product attention (causal optional), batch = 1. self = Q [seq, num_heads*head_dim]; k, v = [seq, num_kv_heads*head_dim]. Returns context [seq, num_heads*head_dim].

With num_kv_heads == num_heads this is standard multi-head attention. With num_kv_heads < num_heads it is Grouped-Query Attention (Llama-2-70B, Llama-3, Qwen2/3, SmolLM3): each KV head is shared by group_size = num_heads/num_kv_heads query heads. Each query head runs the Flash-Attention-2 kernel against its group’s KV head. In the backward, the group_size query heads that share a KV head ACCUMULATE into that head’s dK/dV (scatter-add); dQ heads are disjoint.

Parameters:

  • k (Tensor)
  • v (Tensor)
  • num_heads (Integer)

    number of query heads

  • num_kv_heads (Integer, nil) (defaults to: nil)

    number of K/V heads (nil ⇒ num_heads = MHA)

  • causal (Boolean) (defaults to: true)

Returns:

  • (Tensor)

    context [seq, num_heads*head_dim]

Raises:

  • (ArgumentError)

    if num_heads is not a multiple of num_kv_heads, or if head_dim exceeds 128

# File 'lib/nnw/ai/tensor.rb', line 213

def sdpa(k, v, num_heads:, num_kv_heads: nil, causal: true)
  num_kv_heads ||= num_heads
  raise ArgumentError, "num_heads (#{num_heads}) must be a multiple of num_kv_heads (#{num_kv_heads})" \
    unless (num_heads % num_kv_heads).zero?

  seq, embed = shape                 # embed = num_heads * head_dim
  head_dim = embed / num_heads
  # The flash-attention kernels store per-head rows in fixed [HEAD_DIM_MAX=128]
  # register arrays and clamp every dim loop to d < 128. For head_dim > 128
  # they would silently drop dims 128.. from scores/output/gradients with no
  # error. Targets (Qwen3/Llama/SmolLM/Phi) use head_dim ≤ 128; fail loud above
  # that rather than miscompute. (decode_sdpa uses cuBLAS+softmax and has no cap.)
  raise ArgumentError,
        "head_dim #{head_dim} exceeds flash-attention HEAD_DIM_MAX (128); " \
        "larger heads are not yet supported by the flash kernels" if head_dim > 128
  embed_kv = num_kv_heads * head_dim
  group_size = num_heads / num_kv_heads
  scale = (1.0 / Math.sqrt(head_dim)).to_f
  cmask = causal ? 1 : 0
  context_nv = zeros_nv([seq, embed])

  fwd = Ignis::JIT::Kernels::Attention.flash_attention_forward
  q_tiles = (seq + 63) / 64
  num_heads.times do |h|
    qoff = h * head_dim
    koff = (h / group_size) * head_dim  # the KV head this query head attends to
    qh = slice_cols_nv(@data, qoff, head_dim, seq, embed)
    kh = slice_cols_nv(k.data, koff, head_dim, seq, embed_kv)
    vh = slice_cols_nv(v.data, koff, head_dim, seq, embed_kv)
    oh = zeros_nv([seq, head_dim])
    fwd.launch(grid: [q_tiles], block: [64],
               args: [qh, kh, vh, oh, seq, head_dim, scale, cmask])
    scatter_cols_nv!(oh, context_nv, qoff, head_dim, seq, embed)
  end

  result = Tensor.new(data: context_nv,
                      requires_grad: @requires_grad || should_track?(k) || should_track?(v),
                      is_leaf: false)

  if result.requires_grad
    sq = @data
    sk = k.data
    sv = v.data
    so = context_nv
    Tape.record(result, inputs: [self, k, v]) do |grad|
      d_q = zeros_nv([seq, embed])
      d_k = zeros_nv([seq, embed_kv])
      d_v = zeros_nv([seq, embed_kv])
      bwd = Ignis::JIT::Kernels::Attention.flash_attention_backward
      blk = (seq + 255) / 256
      num_heads.times do |h|
        qoff = h * head_dim
        koff = (h / group_size) * head_dim
        qh = slice_cols_nv(sq, qoff, head_dim, seq, embed)
        kh = slice_cols_nv(sk, koff, head_dim, seq, embed_kv)
        vh = slice_cols_nv(sv, koff, head_dim, seq, embed_kv)
        oh = slice_cols_nv(so, qoff, head_dim, seq, embed)
        doh = slice_cols_nv(grad, qoff, head_dim, seq, embed)
        dqh = zeros_nv([seq, head_dim])
        dkh = zeros_nv([seq, head_dim])
        dvh = zeros_nv([seq, head_dim])
        bwd.launch(grid: [blk], block: [256],
                   args: [qh, kh, vh, oh, doh, dqh, dkh, dvh, seq, head_dim, scale, cmask])
        # dQ heads are disjoint → overwrite. dK/dV heads are SHARED across the
        # group → accumulate (add into a zero-initialized buffer). For MHA
        # (group_size==1) the KV columns are disjoint too, so add-into-zero is
        # numerically identical to the previous overwrite — no regression.
        scatter_cols_nv!(dqh, d_q, qoff, head_dim, seq, embed)
        scatter_cols_add_nv!(dkh, d_k, koff, head_dim, seq, embed_kv)
        scatter_cols_add_nv!(dvh, d_v, koff, head_dim, seq, embed_kv)
      end
      [d_q, d_k, d_v]
    end
  end

  result
end
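
The query-head to KV-head mapping used for grouped-query attention is plain integer arithmetic. The sketch below restates it in isolation; kv_head_for is a hypothetical helper name, not a library method.

```ruby
# Which KV head a given query head attends to.
# group_size query heads share each KV head; integer division does the mapping.
def kv_head_for(query_head, num_heads, num_kv_heads)
  unless (num_heads % num_kv_heads).zero?
    raise ArgumentError, "num_heads must be a multiple of num_kv_heads"
  end

  group_size = num_heads / num_kv_heads
  query_head / group_size
end

gqa_map = (0...8).map { |h| kv_head_for(h, 8, 2) } # 8 query heads, 2 KV heads
mha_map = (0...4).map { |h| kv_head_for(h, 4, 4) } # standard multi-head
```

Multiplying the returned index by head_dim gives the column offset into K and V (koff in the source above). Several query heads mapping to the same index is why the backward pass accumulates into dK and dV instead of overwriting them.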

#shape ⇒ Array<Integer>

Returns:

  • (Array<Integer>)

# File 'lib/nnw/ai/tensor.rb', line 115

def shape
  @data.shape
end

#silu ⇒ Tensor

SiLU activation: x * sigmoid(x)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 468

def silu
  result_nv = alloc_like(@data)
  kernel = Ignis::JIT::Kernels::Activations.silu_forward
  n = numel
  kernel.launch(grid: [(n + 255) / 256], block: [256], args: [@data, result_nv, n])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    saved_input = @data
    Tape.record(result, inputs: [self]) do |grad|
      grad_in = alloc_like(grad)
      bk = Ignis::JIT::Kernels::Activations.silu_backward
      gn = grad.numel
      bk.launch(grid: [(gn + 255) / 256], block: [256], args: [grad, saved_input, grad_in, gn])
      [grad_in]
    end
  end

  result
end
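
A scalar CPU reference for SiLU and its derivative follows. It is a sketch for sanity-checking values; sigmoid, silu_reference, and silu_grad_reference are hypothetical names, and the derivative shown is the standard analytic one, assumed (not confirmed) to be what silu_backward computes.

```ruby
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

# SiLU: x * sigmoid(x)
def silu_reference(x)
  x * sigmoid(x)
end

# d/dx [x * s(x)] = s(x) * (1 + x * (1 - s(x)))
def silu_grad_reference(x)
  s = sigmoid(x)
  s * (1.0 + x * (1.0 - s))
end

# Central finite difference as an independent check of the derivative.
h = 1e-5
numeric = (silu_reference(0.7 + h) - silu_reference(0.7 - h)) / (2 * h)
analytic = silu_grad_reference(0.7)
```

Comparing the analytic derivative against a finite difference is a cheap way to validate any activation's backward kernel.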

#softmax ⇒ Tensor

Softmax along last dimension

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 492

def softmax
  last_dim = shape[-1]
  outer_size = numel / last_dim
  result_nv = alloc_like(@data)

  kernel = Ignis::JIT::Kernels::Attention.softmax_forward
  kernel.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [@data, result_nv, outer_size, last_dim])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    saved_output = result_nv
    Tape.record(result, inputs: [self]) do |grad|
      grad_in = alloc_like(grad)
      bk = Ignis::JIT::Kernels::Attention.softmax_backward
      bk.launch(grid: [(outer_size + 255) / 256], block: [256],
                args: [grad, saved_output, grad_in, outer_size, last_dim])
      [grad_in]
    end
  end

  result
end
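
The source above saves the softmax output, not the input, for the backward pass. A per-row CPU sketch shows why the output is sufficient. softmax_row and softmax_backward_row are hypothetical names, and the backward formula is the standard one, assumed to match softmax_backward.

```ruby
# Numerically stable softmax over one row: subtracting the max avoids overflow.
def softmax_row(xs)
  m = xs.max
  exps = xs.map { |x| Math.exp(x - m) }
  z = exps.sum
  exps.map { |e| e / z }
end

# Backward from the saved output y:
#   grad_in[i] = y[i] * (grad[i] - sum_j grad[j] * y[j])
def softmax_backward_row(grad, y)
  dot = grad.zip(y).sum { |g, yi| g * yi }
  y.zip(grad).map { |yi, g| yi * (g - dot) }
end

big = softmax_row([1000.0, 1000.0]) # would overflow without the max shift
flat = softmax_backward_row([1.0, 1.0], big)
```

A constant upstream gradient yields a zero input gradient, because adding a constant to every logit leaves the softmax output unchanged.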

#sum ⇒ Tensor

Sum reduction (all elements → scalar)

Returns:

  • (Tensor)

# File 'lib/nnw/ai/tensor.rb', line 716

def sum
  n = numel
  result_nv = Ignis::Shared::NvArray.new(shape: [1], dtype: dtype, device_id: device_id)
  result_nv.from_host([0.0])

  kernel = Ignis::JIT::Kernels::Elementwise.sum_reduce
  kernel.launch(grid: [1], block: [1], args: [@data, result_nv, 1, n])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    orig_shape = shape
    Tape.record(result, inputs: [self]) do |grad|
      # Gradient of sum is broadcast of 1.0 to original shape
      grad_input = Ignis::Shared::NvArray.new(shape: orig_shape, dtype: dtype, device_id: device_id)
      grad_input.from_host(Array.new(n, 0.0))
      bk = Ignis::JIT::Kernels::Elementwise.broadcast_grad
      bk.launch(grid: [(n + 255) / 256], block: [256], args: [grad, grad_input, 1.0, n])
      [grad_input]
    end
  end

  result
end

#to_host ⇒ Array<Numeric>

Copy GPU data to host as Ruby Array.

Returns:

  • (Array<Numeric>)

# File 'lib/nnw/ai/tensor.rb', line 808

def to_host
  @data.to_host
end

#transpose(dim0 = 0, dim1 = 1) ⇒ Tensor

Transpose a 2D tensor: [rows, cols] → [cols, rows]. In the source shown here, dim0 and dim1 are accepted but never read.

Parameters:

  • dim0 (Integer) (defaults to: 0)
  • dim1 (Integer) (defaults to: 1)

Returns:

  • (Tensor)

Raises:

  • (ArgumentError)

    if the tensor is not 2D

# File 'lib/nnw/ai/tensor.rb', line 660

def transpose(dim0 = 0, dim1 = 1)
  raise ArgumentError, "transpose requires 2D tensor" unless shape.length == 2

  rows = shape[0]
  cols = shape[1]
  result_nv = Ignis::Shared::NvArray.new(shape: [cols, rows], dtype: dtype, device_id: device_id)
  result_nv.to_device # transpose_2d writes every element — alloc only, no host zeroing

  kernel = Ignis::JIT::Kernels::Elementwise.transpose_2d
  grid_x = (cols + 31) / 32
  grid_y = (rows + 31) / 32
  kernel.launch(grid: [grid_x, grid_y], block: [32, 8], args: [@data, result_nv, rows, cols])

  result = Tensor.new(data: result_nv, requires_grad: @requires_grad, is_leaf: false)

  if @requires_grad
    Tape.record(result, inputs: [self]) do |grad|
      # Backward of transpose is transpose
      grad_t = alloc_like(@data)
      kernel_t = Ignis::JIT::Kernels::Elementwise.transpose_2d
      kernel_t.launch(grid: [grid_y, grid_x], block: [32, 8], args: [grad, grad_t, cols, rows])
      [grad_t]
    end
  end

  result
end

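#zero_grad! ⇒ void

This method returns an undefined value.

Zero the gradient. An existing gradient buffer is filled with zeros in place instead of being reset to nil, which avoids reallocating it in the training loop; if no gradient exists yet, a zero-filled one is allocated.
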
# File 'lib/nnw/ai/tensor.rb', line 789

def zero_grad!
  if @grad
    n = @grad.numel
    fill_k = Ignis::JIT::Kernels::Elementwise.fill
    fill_k.launch(grid: [(n + 255) / 256], block: [256], args: [@grad, 0.0, n])
  else
    @grad = Ignis::Shared::NvArray.new(shape: shape, dtype: dtype, device_id: device_id)
    @grad.from_host(Array.new(numel, 0.0))
  end
end