Module: Ignis::JIT::Kernels::Elementwise

Defined in: lib/nvruby/jit/kernels/elementwise.rb

Overview

Elementwise CUDA kernels for AI tensor operations. Includes arithmetic ops, initialization, and embedding ops.

Class Method Summary

.accumulate ⇒ Ignis::JIT::Kernel

Accumulate gradients: dst += src (for gradient accumulation).
.add_backward_broadcast ⇒ Ignis::JIT::Kernel

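Broadcast-add backward: reduces grad_output over the batch dimension, grad_bias[f] = sum over b of grad_output[b, f]. (A plain elementwise add passes the gradient through unchanged and needs no kernel; this one covers the operand that was broadcast across the batch.)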
.add_bias_rows ⇒ Ignis::JIT::Kernel

Row-broadcast bias add: out[r, c] = a[r, c] + bias[c] (a is [rows, cols], bias is [cols]).
.add_forward ⇒ Ignis::JIT::Kernel

Elementwise addition forward: c = a + b.
.affine_forward ⇒ Ignis::JIT::Kernel

Affine transform: output = input * scale + shift (fp32).
.bf16_to_f32 ⇒ Ignis::JIT::Kernel

Dequantize bfloat16 → float32 on-device.
.broadcast_grad ⇒ Ignis::JIT::Kernel

Broadcast scalar gradient back to original shape.
.fill ⇒ Ignis::JIT::Kernel

Fill tensor with a constant value.
.gather_rows ⇒ Ignis::JIT::Kernel

Gather rows for Embedding forward: output = weight[indices].
.kaiming_uniform_init ⇒ Ignis::JIT::Kernel

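Kaiming uniform initialization: U(-bound, bound). Hashes seed + index with a stateless 64-bit mixer (the constants are those of the MurmurHash3 64-bit finalizer, not Philox), so output is reproducible for a given seed.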
.max_forward ⇒ Ignis::JIT::Kernel

Elementwise maximum: c = max(a, b) (used by collective reductions).
.min_forward ⇒ Ignis::JIT::Kernel

Elementwise minimum: c = min(a, b) (used by collective reductions).
.mul_backward ⇒ Ignis::JIT::Kernel

Elementwise multiply backward for first operand: grad_a = grad * b.
.mul_forward ⇒ Ignis::JIT::Kernel

Elementwise multiplication forward: c = a * b (Hadamard product).
.scale_forward ⇒ Ignis::JIT::Kernel

Scalar multiplication: output = input * scalar.
.scatter_add ⇒ Ignis::JIT::Kernel

Scatter add for Embedding backward: weight_grad[indices] += grad. Uses atomicAdd for thread safety.
.scatter_cols ⇒ Ignis::JIT::Kernel

Inverse of slice_cols: dst[r, col_off + c] = src[r, c].
.scatter_cols_add ⇒ Ignis::JIT::Kernel

Accumulating scatter: dst[r, col_off + c] += src[r, c].
.slice_cols ⇒ Ignis::JIT::Kernel

Copy a contiguous column range [col_off, col_off+len) from each row.
.sub_forward ⇒ Ignis::JIT::Kernel

Elementwise subtraction forward: c = a - b.
.sum_reduce ⇒ Ignis::JIT::Kernel

Sum reduction along the last dimension.
.transpose_2d ⇒ Ignis::JIT::Kernel

Transpose 2D matrix: output = input^T. Tiled for coalesced memory access.

Class Method Details

.accumulate ⇒ `Ignis::JIT::Kernel`

Accumulate gradients: dst += src (for gradient accumulation)

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 251

def accumulate
  source = <<~CUDA
    extern "C" __global__
    void accumulate(float* __restrict__ dst,
                    const float* __restrict__ src,
                    const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        dst[idx] += src[idx];
      }
    }
  CUDA
  compile_cached(source, "accumulate")
end
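
Most kernels in this module share the same 1D pattern: each thread computes idx = blockIdx.x * blockDim.x + threadIdx.x and does nothing when idx >= n. The following is a minimal CPU sketch of that pattern, not the library's launch API. The names `grid_size` and `accumulate_reference` and the block size of 256 are assumptions made for illustration.

```ruby
# Hypothetical helper: smallest block count with blocks * block_size >= n.
def grid_size(n, block_size = 256)
  (n + block_size - 1) / block_size
end

# CPU emulation of the accumulate kernel's indexing. Every (block, thread)
# pair computes idx and skips the update when idx >= n, mirroring the
# kernel's `if (idx < n)` guard.
def accumulate_reference(dst, src, block_size = 256)
  n = dst.length
  grid_size(n, block_size).times do |block_idx|
    block_size.times do |thread_idx|
      idx = block_idx * block_size + thread_idx
      dst[idx] += src[idx] if idx < n
    end
  end
  dst
end

# 5 elements with 4 threads per block need 2 blocks; 3 threads stay idle.
accumulate_reference([1.0, 2.0, 3.0, 4.0, 5.0], [0.5] * 5, 4)
# => [1.5, 2.5, 3.5, 4.5, 5.5]
```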

.add_backward_broadcast ⇒ `Ignis::JIT::Kernel`

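Broadcast-add backward: reduces grad_output over the batch dimension, grad_bias[f] = sum over b of grad_output[b, f]. (A plain elementwise add passes the gradient through unchanged and needs no kernel; this one covers the operand that was broadcast across the batch.)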

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 31

def add_backward_broadcast
  source = <<~CUDA
    extern "C" __global__
    void add_backward_broadcast(const float* __restrict__ grad_output,
                                float* __restrict__ grad_bias,
                                const int batch_size,
                                const int features) {
      int f = blockIdx.x * blockDim.x + threadIdx.x;
      if (f < features) {
        float sum = 0.0f;
        for (int b = 0; b < batch_size; b++) {
          sum += grad_output[b * features + f];
        }
        grad_bias[f] = sum;
      }
    }
  CUDA
  compile_cached(source, "add_backward_broadcast")
end
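
As a CPU cross-check of the reduction in the kernel body, the sketch below sums a row-major [batch_size, features] gradient over the batch axis. `add_backward_broadcast_reference` is a hypothetical name used only for this illustration.

```ruby
# CPU reference for the kernel body: one output per feature column,
# accumulated over every batch row of a row-major [batch_size, features] array.
def add_backward_broadcast_reference(grad_output, batch_size, features)
  Array.new(features) do |f|
    sum = 0.0
    batch_size.times { |b| sum += grad_output[b * features + f] }
    sum
  end
end

grad_output = [1.0, 2.0, 3.0,   # batch row 0
               4.0, 5.0, 6.0]   # batch row 1
add_backward_broadcast_reference(grad_output, 2, 3)
# => [5.0, 7.0, 9.0]
```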

.add_bias_rows ⇒ `Ignis::JIT::Kernel`

Row-broadcast bias add: out[r, c] = a[r, c] + bias[c] (a is [rows, cols], bias is [cols]). Used for the Linear layer bias.

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 458

def add_bias_rows
  source = <<~CUDA
    extern "C" __global__
    void add_bias_rows(const float* __restrict__ a,
                       const float* __restrict__ bias,
                       float* __restrict__ out,
                       const int rows, const int cols) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      int total = rows * cols;
      if (idx < total) {
        out[idx] = a[idx] + bias[idx % cols];
      }
    }
  CUDA
  compile_cached(source, "add_bias_rows")
end
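
The kernel recovers the column from the flat index with idx % cols, which is what makes the bias repeat on every row. A minimal CPU sketch of the same arithmetic follows; `add_bias_rows_reference` is a hypothetical name.

```ruby
# CPU reference: for a row-major [rows, cols] array, idx % cols is the
# column, so the same bias vector is added to every row.
def add_bias_rows_reference(a, bias, rows, cols)
  Array.new(rows * cols) { |idx| a[idx] + bias[idx % cols] }
end

a    = [1.0, 2.0, 3.0,
        4.0, 5.0, 6.0]
bias = [10.0, 20.0, 30.0]
add_bias_rows_reference(a, bias, 2, 3)
# => [11.0, 22.0, 33.0, 14.0, 25.0, 36.0]
```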

.add_forward ⇒ `Ignis::JIT::Kernel`

Elementwise addition forward: c = a + b

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 12

def add_forward
  source = <<~CUDA
    extern "C" __global__
    void add_forward(const float* __restrict__ a,
                     const float* __restrict__ b,
                     float* __restrict__ c,
                     const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        c[idx] = a[idx] + b[idx];
      }
    }
  CUDA
  compile_cached(source, "add_forward")
end

.affine_forward ⇒ `Ignis::JIT::Kernel`

Affine transform: output = input * scale + shift (fp32). Used e.g. to map cuRAND uniform samples, which lie in (0, 1], onto (low, high] with scale = high - low and shift = low.

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 439

def affine_forward
  source = <<~CUDA
    extern "C" __global__
    void affine_forward(const float* __restrict__ input,
                        float* __restrict__ output,
                        const float scale, const float shift,
                        const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        output[idx] = input[idx] * scale + shift;
      }
    }
  CUDA
  compile_cached(source, "affine_forward")
end
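
To rescale unit-interval samples onto a range between low and high, the parameters work out to scale = high - low and shift = low. The sketch below checks that arithmetic on the CPU. `affine_params` and `affine_forward_reference` are hypothetical names, and which endpoints are included depends on the generator that produced the input.

```ruby
# Hypothetical helper: scale/shift that map the unit interval onto low..high.
def affine_params(low, high)
  [high - low, low]
end

# CPU reference for the kernel body: output = input * scale + shift.
def affine_forward_reference(input, scale, shift)
  input.map { |x| x * scale + shift }
end

scale, shift = affine_params(-2.0, 2.0)   # scale = 4.0, shift = -2.0
affine_forward_reference([0.0, 0.25, 0.5, 0.75], scale, shift)
# => [-2.0, -1.0, 0.0, 1.0]
```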

.bf16_to_f32 ⇒ `Ignis::JIT::Kernel`

Dequantize bfloat16 → float32 on-device. bf16 is exactly the top 16 bits of an fp32 value (same sign/exponent, truncated mantissa), so widening is lossless: float32_bits = uint16(bf16) << 16. Lets us load bf16 checkpoints (e.g. Llama) into fp32 weights without materializing a giant host array.

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 420

def bf16_to_f32
  source = <<~CUDA
    extern "C" __global__
    void bf16_to_f32(const unsigned short* __restrict__ src,
                     float* __restrict__ dst,
                     const int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
        unsigned int bits = ((unsigned int)src[i]) << 16;
        dst[i] = __uint_as_float(bits);
      }
    }
  CUDA
  compile_cached(source, "bf16_to_f32")
end
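
The bit manipulation can be reproduced on the host with Ruby's pack/unpack, which is handy for spot-checking a few checkpoint values. This is a sketch under the assumption that the bf16 values arrive as raw 16-bit integers. Both function names are hypothetical, and the reverse direction shown here truncates rather than rounds.

```ruby
# Widen one bf16 bit pattern (given as an Integer 0..0xFFFF) to a Float by
# placing it in the top 16 bits of a 32-bit word, as the kernel does.
def bf16_bits_to_f32(bf16)
  bits = (bf16 & 0xFFFF) << 16
  [bits].pack("L<").unpack1("e")   # reinterpret little-endian u32 as float32
end

# Reverse direction by truncation (drops the low 16 mantissa bits).
def f32_to_bf16_bits_truncate(value)
  [value].pack("e").unpack1("L<") >> 16
end

bf16_bits_to_f32(0x3F80)   # => 1.0
bf16_bits_to_f32(0xC000)   # => -2.0
bf16_bits_to_f32(0x4049)   # => 3.140625 (pi truncated to bf16)
```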

.broadcast_grad ⇒ `Ignis::JIT::Kernel`

Broadcast scalar gradient back to original shape

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 290

def broadcast_grad
  source = <<~CUDA
    extern "C" __global__
    void broadcast_grad(const float* __restrict__ grad_output,
                        float* __restrict__ grad_input,
                        const float scale,
                        const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        grad_input[idx] = grad_output[0] * scale;
      }
    }
  CUDA
  compile_cached(source, "broadcast_grad")
end

.fill ⇒ `Ignis::JIT::Kernel`

Fill tensor with a constant value

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 161

def fill
  source = <<~CUDA
    extern "C" __global__
    void fill(float* __restrict__ output,
              const float value,
              const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        output[idx] = value;
      }
    }
  CUDA
  compile_cached(source, "fill")
end

.gather_rows ⇒ `Ignis::JIT::Kernel`

Gather rows for Embedding forward: output = weight[indices]

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 204

def gather_rows
  source = <<~CUDA
    extern "C" __global__
    void gather_rows(const float* __restrict__ weight,
                     const int* __restrict__ indices,
                     float* __restrict__ output,
                     const int num_indices,
                     const int embed_dim) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      int total = num_indices * embed_dim;
      if (idx < total) {
        int row = idx / embed_dim;
        int col = idx % embed_dim;
        int src_row = indices[row];
        output[idx] = weight[src_row * embed_dim + col];
      }
    }
  CUDA
  compile_cached(source, "gather_rows")
end
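
A CPU sketch of the same lookup can serve as a reference when testing the kernel. It assumes row-major storage, as the kernel's index math does. `gather_rows_reference` is a hypothetical name.

```ruby
# CPU reference: output row i is a copy of weight row indices[i].
# weight is a flat row-major [vocab, embed_dim] array.
def gather_rows_reference(weight, indices, embed_dim)
  Array.new(indices.length * embed_dim) do |idx|
    row = idx / embed_dim
    col = idx % embed_dim
    weight[indices[row] * embed_dim + col]
  end
end

weight = [0.0, 0.1,   # row 0
          1.0, 1.1,   # row 1
          2.0, 2.1]   # row 2
gather_rows_reference(weight, [2, 0, 2], 2)
# => [2.0, 2.1, 0.0, 0.1, 2.0, 2.1]
```

Like the kernel, this sketch does no bounds checking on the indices; an out-of-range index reads outside the table on the device.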

.kaiming_uniform_init ⇒ `Ignis::JIT::Kernel`

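Kaiming uniform initialization: U(-bound, bound). Hashes seed + index with a stateless 64-bit mixer (the constants are those of the MurmurHash3 64-bit finalizer, not Philox), so output is reproducible for a given seed.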

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 179

def kaiming_uniform_init
  source = <<~CUDA
    extern "C" __global__
    void kaiming_uniform_init(float* __restrict__ output,
                              const float bound,
                              const unsigned long long seed,
                              const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        unsigned long long state = seed + (unsigned long long)idx;
        state ^= state >> 33;
        state *= 0xff51afd7ed558ccdULL;
        state ^= state >> 33;
        state *= 0xc4ceb9fe1a85ec53ULL;
        state ^= state >> 33;
        float u = (float)(state & 0xFFFFFFFF) / 4294967296.0f;
        output[idx] = (2.0f * u - 1.0f) * bound;
      }
    }
  CUDA
  compile_cached(source, "kaiming_uniform_init")
end
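
The per-element hash can be ported to Ruby to see why the result depends only on (seed, index). The sketch below masks to 64 bits after each multiply because Ruby integers do not wrap. It computes in double precision, so values may differ from the device's float32 results in the low bits. `mix64` and `kaiming_uniform_reference` are hypothetical names.

```ruby
MASK64 = 0xFFFF_FFFF_FFFF_FFFF

# Same xor-shift/multiply sequence as the kernel, with explicit 64-bit wrap.
def mix64(state)
  state &= MASK64
  state ^= state >> 33
  state = (state * 0xff51afd7ed558ccd) & MASK64
  state ^= state >> 33
  state = (state * 0xc4ceb9fe1a85ec53) & MASK64
  state ^ (state >> 33)
end

# Low 32 bits of the hash become u in [0, 1), then u is mapped to
# (2u - 1) * bound.
def kaiming_uniform_reference(n, bound, seed)
  Array.new(n) do |idx|
    u = (mix64(seed + idx) & 0xFFFF_FFFF) / 4294967296.0
    (2.0 * u - 1.0) * bound
  end
end

a = kaiming_uniform_reference(4, 0.5, 42)
b = kaiming_uniform_reference(4, 0.5, 42)
a == b   # => true: same seed and indices give the same values
```

Because element idx hashes seed + idx, two launches whose seeds differ by k produce the same sequence shifted by k elements, so seeds for different tensors are best spaced further apart than the tensor sizes.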

.max_forward ⇒ `Ignis::JIT::Kernel`

Elementwise maximum: c = max(a, b) (used by collective reductions)

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 125

def max_forward
  source = <<~CUDA
    extern "C" __global__
    void max_forward(const float* __restrict__ a,
                     const float* __restrict__ b,
                     float* __restrict__ c,
                     const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        c[idx] = fmaxf(a[idx], b[idx]);
      }
    }
  CUDA
  compile_cached(source, "max_forward")
end

.min_forward ⇒ `Ignis::JIT::Kernel`

Elementwise minimum: c = min(a, b) (used by collective reductions)

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 107

def min_forward
  source = <<~CUDA
    extern "C" __global__
    void min_forward(const float* __restrict__ a,
                     const float* __restrict__ b,
                     float* __restrict__ c,
                     const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        c[idx] = fminf(a[idx], b[idx]);
      }
    }
  CUDA
  compile_cached(source, "min_forward")
end

.mul_backward ⇒ `Ignis::JIT::Kernel`

Elementwise multiply backward for first operand: grad_a = grad * b

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 89

def mul_backward
  source = <<~CUDA
    extern "C" __global__
    void mul_backward(const float* __restrict__ grad_output,
                      const float* __restrict__ other,
                      float* __restrict__ grad_input,
                      const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        grad_input[idx] = grad_output[idx] * other[idx];
      }
    }
  CUDA
  compile_cached(source, "mul_backward")
end

.mul_forward ⇒ `Ignis::JIT::Kernel`

Elementwise multiplication forward: c = a * b (Hadamard product)

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 71

def mul_forward
  source = <<~CUDA
    extern "C" __global__
    void mul_forward(const float* __restrict__ a,
                     const float* __restrict__ b,
                     float* __restrict__ c,
                     const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        c[idx] = a[idx] * b[idx];
      }
    }
  CUDA
  compile_cached(source, "mul_forward")
end

.scale_forward ⇒ `Ignis::JIT::Kernel`

Scalar multiplication: output = input * scalar

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 143

def scale_forward
  source = <<~CUDA
    extern "C" __global__
    void scale_forward(const float* __restrict__ input,
                       float* __restrict__ output,
                       const float scalar,
                       const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        output[idx] = input[idx] * scalar;
      }
    }
  CUDA
  compile_cached(source, "scale_forward")
end

.scatter_add ⇒ `Ignis::JIT::Kernel`

Scatter add for Embedding backward: weight_grad[indices] += grad. Uses atomicAdd for thread safety.

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 228

def scatter_add
  source = <<~CUDA
    extern "C" __global__
    void scatter_add(const float* __restrict__ grad_output,
                     const int* __restrict__ indices,
                     float* __restrict__ grad_weight,
                     const int num_indices,
                     const int embed_dim) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      int total = num_indices * embed_dim;
      if (idx < total) {
        int row = idx / embed_dim;
        int col = idx % embed_dim;
        int dst_row = indices[row];
        atomicAdd(&grad_weight[dst_row * embed_dim + col], grad_output[idx]);
      }
    }
  CUDA
  compile_cached(source, "scatter_add")
end
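
The atomics matter when the same index appears more than once in a batch: several threads then target the same grad_weight row. A sequential CPU sketch makes the expected sums easy to see. `scatter_add_reference` is a hypothetical name, and the caller is assumed to have zeroed grad_weight beforehand.

```ruby
# CPU reference: each row of grad_output is added into the grad_weight row
# selected by indices. Repeated indices accumulate, which is the case the
# kernel's atomicAdd protects.
def scatter_add_reference(grad_output, indices, grad_weight, embed_dim)
  (indices.length * embed_dim).times do |idx|
    row = idx / embed_dim
    col = idx % embed_dim
    grad_weight[indices[row] * embed_dim + col] += grad_output[idx]
  end
  grad_weight
end

grad_output = [1.0, 2.0,   # goes to row 1
               3.0, 4.0,   # goes to row 0
               5.0, 6.0]   # goes to row 1 again
scatter_add_reference(grad_output, [1, 0, 1], Array.new(6, 0.0), 2)
# => [3.0, 4.0, 6.0, 8.0, 0.0, 0.0]
```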

.scatter_cols ⇒ `Ignis::JIT::Kernel`

Inverse of slice_cols: dst[r, col_off + c] = src[r, c]. Used to scatter per-head [seq, head_dim] results back into [seq, embed].

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 370

def scatter_cols
  source = <<~CUDA
    extern "C" __global__
    void scatter_cols(const float* __restrict__ src,
                      float* __restrict__ dst,
                      const int rows, const int total_cols,
                      const int col_off, const int len) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      int total = rows * len;
      if (idx < total) {
        int r = idx / len;
        int c = idx % len;
        dst[r * total_cols + col_off + c] = src[idx];
      }
    }
  CUDA
  compile_cached(source, "scatter_cols")
end

.scatter_cols_add ⇒ `Ignis::JIT::Kernel`

Accumulating scatter: dst[r, col_off + c] += src[r, c]. Used for GQA backward, where the group_size query heads sharing one KV head each contribute to the same dK/dV columns, so their gradients must sum rather than overwrite. No atomics are needed: within one launch each (r, col_off + c) is written by exactly one thread, and accumulation across heads happens via separate launches into the same buffer.

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 396

def scatter_cols_add
  source = <<~CUDA
    extern "C" __global__
    void scatter_cols_add(const float* __restrict__ src,
                          float* __restrict__ dst,
                          const int rows, const int total_cols,
                          const int col_off, const int len) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      int total = rows * len;
      if (idx < total) {
        int r = idx / len;
        int c = idx % len;
        dst[r * total_cols + col_off + c] += src[idx];
      }
    }
  CUDA
  compile_cached(source, "scatter_cols_add")
end
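
The accumulation across heads can be sketched on the CPU as two sequential calls into the same buffer, one per query head in the group. `scatter_cols_add_reference` is a hypothetical name. On the device the sketch corresponds to launches that run one after another, which is an assumption about how the caller schedules them.

```ruby
# CPU reference: add a [rows, len] block into columns
# col_off...(col_off + len) of a row-major [rows, total_cols] buffer.
def scatter_cols_add_reference(src, dst, rows, total_cols, col_off, len)
  (rows * len).times do |idx|
    r = idx / len
    c = idx % len
    dst[r * total_cols + col_off + c] += src[idx]
  end
  dst
end

dkv    = Array.new(2 * 4, 0.0)          # [2, 4] gradient buffer, zeroed
head_a = [1.0, 2.0, 3.0, 4.0]           # [2, 2] gradient from one query head
head_b = [10.0, 20.0, 30.0, 40.0]       # [2, 2] gradient from another head
scatter_cols_add_reference(head_a, dkv, 2, 4, 2, 2)
scatter_cols_add_reference(head_b, dkv, 2, 4, 2, 2)
# dkv => [0.0, 0.0, 11.0, 22.0, 0.0, 0.0, 33.0, 44.0]
```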

.slice_cols ⇒ `Ignis::JIT::Kernel`

Copy a contiguous column range [col_off, col_off+len) from each row. dst[r, c] = src[r, col_off + c] (dst is [rows, len], src is [rows, total_cols]). Used to split [seq, embed] projections into per-head [seq, head_dim] slices.

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 348

def slice_cols
  source = <<~CUDA
    extern "C" __global__
    void slice_cols(const float* __restrict__ src,
                    float* __restrict__ dst,
                    const int rows, const int total_cols,
                    const int col_off, const int len) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      int total = rows * len;
      if (idx < total) {
        int r = idx / len;
        int c = idx % len;
        dst[idx] = src[r * total_cols + col_off + c];
      }
    }
  CUDA
  compile_cached(source, "slice_cols")
end
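
A CPU sketch of the slice, together with the plain (overwriting) scatter that inverts it, shows how a head's columns are pulled out of and written back into a [rows, total_cols] buffer. Both function names are hypothetical.

```ruby
# CPU reference for slice_cols: dst[r, c] = src[r, col_off + c].
def slice_cols_reference(src, rows, total_cols, col_off, len)
  Array.new(rows * len) do |idx|
    r = idx / len
    c = idx % len
    src[r * total_cols + col_off + c]
  end
end

# CPU reference for the inverse: dst[r, col_off + c] = src[r, c].
def scatter_cols_reference(src, dst, rows, total_cols, col_off, len)
  (rows * len).times do |idx|
    dst[(idx / len) * total_cols + col_off + (idx % len)] = src[idx]
  end
  dst
end

full  = [0.0, 1.0, 2.0, 3.0,
         4.0, 5.0, 6.0, 7.0]                      # [2, 4]
slice = slice_cols_reference(full, 2, 4, 1, 2)    # => [1.0, 2.0, 5.0, 6.0]
scatter_cols_reference(slice, Array.new(8, 0.0), 2, 4, 1, 2)
# => [0.0, 1.0, 2.0, 0.0, 0.0, 5.0, 6.0, 0.0]
```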

.sub_forward ⇒ `Ignis::JIT::Kernel`

Elementwise subtraction forward: c = a - b

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 53

def sub_forward
  source = <<~CUDA
    extern "C" __global__
    void sub_forward(const float* __restrict__ a,
                     const float* __restrict__ b,
                     float* __restrict__ c,
                     const int n) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
        c[idx] = a[idx] - b[idx];
      }
    }
  CUDA
  compile_cached(source, "sub_forward")
end

.sum_reduce ⇒ `Ignis::JIT::Kernel`

Sum reduction along the last dimension

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 268

def sum_reduce
  source = <<~CUDA
    extern "C" __global__
    void sum_reduce(const float* __restrict__ input,
                    float* __restrict__ output,
                    const int outer_size,
                    const int reduce_size) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < outer_size) {
        float sum = 0.0f;
        for (int j = 0; j < reduce_size; j++) {
          sum += input[idx * reduce_size + j];
        }
        output[idx] = sum;
      }
    }
  CUDA
  compile_cached(source, "sum_reduce")
end
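
The kernel assigns one thread per output element and loops over the reduced axis serially. A CPU sketch of the same indexing follows, with the input viewed as a row-major [outer_size, reduce_size] array. `sum_reduce_reference` is a hypothetical name.

```ruby
# CPU reference: output[i] is the sum of the i-th contiguous run of
# reduce_size elements, i.e. a sum over the last dimension.
def sum_reduce_reference(input, outer_size, reduce_size)
  Array.new(outer_size) do |idx|
    sum = 0.0
    reduce_size.times { |j| sum += input[idx * reduce_size + j] }
    sum
  end
end

sum_reduce_reference([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3)
# => [6.0, 15.0]
```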

.transpose_2d ⇒ `Ignis::JIT::Kernel`

Transpose 2D matrix: output = input^T. Tiled through shared memory for coalesced memory access; the tile is padded by one column (TILE_DIM + 1).

Returns:

(Ignis::JIT::Kernel)

# File 'lib/nvruby/jit/kernels/elementwise.rb', line 309

def transpose_2d
  source = <<~CUDA
    #define TILE_DIM 32
    #define BLOCK_ROWS 8

    extern "C" __global__
    void transpose_2d(const float* __restrict__ input,
                      float* __restrict__ output,
                      const int rows,
                      const int cols) {
      __shared__ float tile[TILE_DIM][TILE_DIM + 1];

      int x = blockIdx.x * TILE_DIM + threadIdx.x;
      int y = blockIdx.y * TILE_DIM + threadIdx.y;

      for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < cols && (y + j) < rows) {
          tile[threadIdx.y + j][threadIdx.x] = input[(y + j) * cols + x];
        }
      }
      __syncthreads();

      x = blockIdx.y * TILE_DIM + threadIdx.x;
      y = blockIdx.x * TILE_DIM + threadIdx.y;

      for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < rows && (y + j) < cols) {
          output[(y + j) * rows + x] = tile[threadIdx.x][threadIdx.y + j];
        }
      }
    }
  CUDA
  compile_cached(source, "transpose_2d")
end
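
The two-phase tile logic can be emulated on the CPU to check the index math, including partial tiles at the edges. The sketch assumes the kernel is launched with a block of (TILE_DIM, BLOCK_ROWS) threads and a grid of (ceil(cols / TILE_DIM), ceil(rows / TILE_DIM)) blocks, which is inferred from the kernel body rather than stated in the source. `transpose_2d_emulated` is a hypothetical name, and tile_dim must be a multiple of block_rows.

```ruby
# CPU emulation of the tiled transpose. Each "block" loads a tile, then
# (after what would be __syncthreads) writes it back with x/y swapped.
def transpose_2d_emulated(input, rows, cols, tile_dim: 32, block_rows: 8)
  output = Array.new(rows * cols, 0.0)
  grid_x = (cols + tile_dim - 1) / tile_dim
  grid_y = (rows + tile_dim - 1) / tile_dim
  grid_y.times do |by|
    grid_x.times do |bx|
      tile = Array.new(tile_dim) { Array.new(tile_dim + 1, 0.0) }
      # Phase 1: every thread (tx, ty) copies its strided rows into the tile.
      block_rows.times do |ty|
        tile_dim.times do |tx|
          x = bx * tile_dim + tx
          y = by * tile_dim + ty
          (0...tile_dim).step(block_rows) do |j|
            tile[ty + j][tx] = input[(y + j) * cols + x] if x < cols && y + j < rows
          end
        end
      end
      # Phase 2: block coordinates swap roles and the tile is read transposed.
      block_rows.times do |ty|
        tile_dim.times do |tx|
          x = by * tile_dim + tx
          y = bx * tile_dim + ty
          (0...tile_dim).step(block_rows) do |j|
            output[(y + j) * rows + x] = tile[tx][ty + j] if x < rows && y + j < cols
          end
        end
      end
    end
  end
  output
end

# [2, 3] input, small tiles so that two blocks and a partial tile are used.
transpose_2d_emulated([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3, tile_dim: 2, block_rows: 1)
# => [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
```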
