Module: Ignis::Epilogues

Defined in:
lib/nvruby/epilogues.rb

Overview

Advanced fused epilogues for GPU operations. Provides GELU, ReLU, SiLU, and bias addition as fused CUDA kernels.

Examples:

Apply GELU activation

output = Ignis::Epilogues.gelu(input)

Fused GEMM + GELU + Bias

output = Ignis::Epilogues.gemm_epilogue(a, b, epilogue: :gelu, bias: bias)

Defined Under Namespace

Modules: Kernels

Constant Summary

GELU_COEF_A = 0.7978845608028654

    GELU approximation constant

GELU_COEF_B = 0.044715
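
The two constants match the coefficients of the widely used tanh-based GELU approximation, where 0.7978845608028654 is sqrt(2/pi). The kernel source is not shown on this page, so the following is only a CPU sketch of that formula under the assumption that the kernel uses it; `GeluReference` and `gelu_tanh` are hypothetical names chosen for illustration.

```ruby
# CPU reference for the tanh-based GELU approximation.
# Assumption: the CUDA kernel uses the same formula; this page does not show it.
module GeluReference
  COEF_A = 0.7978845608028654 # sqrt(2 / pi)
  COEF_B = 0.044715

  module_function

  # gelu(x) ~= 0.5 * x * (1 + tanh(A * (x + B * x^3)))
  def gelu_tanh(x)
    inner = COEF_A * (x + COEF_B * x**3)
    0.5 * x * (1.0 + Math.tanh(inner))
  end
end

puts GeluReference.gelu_tanh(0.0)
puts GeluReference.gelu_tanh(1.0)
```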

Class Method Details

.bias_add(input, bias, out: nil) ⇒ NvArray

Add a bias vector to each row of the input tensor. A 1-D input is treated as a single row.

Parameters:

  • input (NvArray)

    Input tensor (rows x cols)

  • bias (NvArray)

    Bias vector (cols)

  • out (NvArray, nil) (defaults to: nil)

    Output tensor (optional; allocated with the input's shape and dtype when nil)

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 250

def bias_add(input, bias, out: nil)
  CUDA::RuntimeAPI.ensure_loaded!

  shape = input.shape
  rows = shape.size == 1 ? 1 : shape[0]
  cols = shape.size == 1 ? shape[0] : shape[1]
  n = rows * cols
  device = input.respond_to?(:device_index) ? input.device_index : 0

  out ||= Ignis::NvArray.zeros(input.shape, dtype: input.dtype, device: device)

  kernel = get_kernel(:bias_add, Kernels::BIAS_ADD_KERNEL, "bias_add")

  block_size = 256
  grid_size = (n + block_size - 1) / block_size

  kernel.launch(
    grid: [grid_size, 1, 1],
    block: [block_size, 1, 1],
    args: [input.device_ffi_ptr, bias.device_ffi_ptr, out.device_ffi_ptr, rows, cols]
  )

  CUDA::RuntimeAPI.cudaDeviceSynchronize
  out
end
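
To make the launch arithmetic and the broadcast concrete, here is a CPU sketch. It assumes the buffer is row-major, so the element at flat index i belongs to column i % cols; the kernel source is not shown on this page, so that indexing is an assumption. `grid_size_for` and `bias_add_reference` are hypothetical helper names.

```ruby
# Ceiling division, as used for grid_size in bias_add:
# enough blocks of block_size threads to cover n elements.
def grid_size_for(n, block_size = 256)
  (n + block_size - 1) / block_size
end

# CPU model of a bias add over a flat buffer.
# Assumption: row-major layout, so flat index i maps to column i % cols.
def bias_add_reference(flat_input, bias, rows, cols)
  raise ArgumentError, "input must hold rows * cols elements" unless flat_input.size == rows * cols
  raise ArgumentError, "bias must hold cols elements" unless bias.size == cols

  flat_input.each_with_index.map { |value, i| value + bias[i % cols] }
end

input = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0] # 2 x 3, row-major
bias = [10.0, 20.0, 30.0]

p bias_add_reference(input, bias, 2, 3) # => [11.0, 22.0, 33.0, 14.0, 25.0, 36.0]
p grid_size_for(6)                      # => 1
p grid_size_for(257)                    # => 2
```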

.gelu(input, out: nil) ⇒ NvArray

Apply GELU activation (approximation)

Parameters:

  • input (NvArray)

    Input tensor

  • out (NvArray, nil) (defaults to: nil)

    Output tensor (optional)

Returns:

  • (NvArray)

    Output with GELU applied



# File 'lib/nvruby/epilogues.rb', line 185

def gelu(input, out: nil)
  apply_unary(input, out, :gelu, Kernels::GELU_KERNEL, "gelu_forward")
end

.gelu_bias(input, bias, out: nil) ⇒ NvArray

Fused GELU + Bias

Parameters:

  • input (NvArray)

    Input tensor

  • bias (NvArray)

    Bias vector

  • out (NvArray, nil) (defaults to: nil)

    Output tensor

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 282

def gelu_bias(input, bias, out: nil)
  apply_fused_bias(input, bias, out, :gelu_bias, Kernels::GELU_BIAS_KERNEL, "gelu_bias_forward")
end

.gelu_exact(input, out: nil) ⇒ NvArray

Apply exact GELU activation

Parameters:

  • input (NvArray)

    Input tensor

  • out (NvArray, nil) (defaults to: nil)

    Output tensor

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 194

def gelu_exact(input, out: nil)
  apply_unary(input, out, :gelu_exact, Kernels::GELU_EXACT_KERNEL, "gelu_exact_forward")
end
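
For comparison with the approximate variant, exact GELU is usually defined through the Gaussian error function, and the tanh form is the common approximation. The sketch below computes both on the CPU and reports the largest gap over a small sample range. It assumes these standard definitions; the kernels themselves are not shown on this page, and `GeluCompare` is a hypothetical name.

```ruby
# CPU comparison of exact (erf-based) GELU and its tanh approximation.
# Assumption: these are the standard definitions, not the kernel source.
module GeluCompare
  SQRT_2_OVER_PI = Math.sqrt(2.0 / Math::PI)

  module_function

  # gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
  def exact(x)
    0.5 * x * (1.0 + Math.erf(x / Math.sqrt(2.0)))
  end

  # gelu(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
  def tanh_approx(x)
    0.5 * x * (1.0 + Math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x**3)))
  end

  # Largest absolute difference between the two forms over the given inputs.
  def max_abs_difference(xs)
    xs.map { |x| (exact(x) - tanh_approx(x)).abs }.max
  end
end

samples = (-16..16).map { |i| i * 0.25 } # -4.0 to 4.0 in steps of 0.25
puts GeluCompare.max_abs_difference(samples)
```

On this sample range the two forms agree to within about 1e-3, which is why the approximation is often acceptable when the cheaper tanh path is preferred.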

.gemm_epilogue(a, b, epilogue:, bias: nil) ⇒ NvArray

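GEMM followed by an epilogue. The matrix product is computed with Ignis::LinAlg.matmul, then the selected activation (and optional bias) is applied as a separate step.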

Parameters:

  • a (NvArray)

    First matrix

  • b (NvArray)

    Second matrix

  • epilogue (Symbol)

    Activation to apply: :gelu, :relu, or :silu. Any other value applies only the bias, if given.

  • bias (NvArray, nil) (defaults to: nil)

    Optional bias

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 359

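def gemm_epilogue(a, b, epilogue:, bias: nil)
  # Compute the matrix product first; the epilogue runs as a separate step.
  c = Ignis::LinAlg.matmul(a, b)

  # Apply the requested epilogue; the case expression is the return value.
  case epilogue
  when :gelu
    bias ? gelu_bias(c, bias) : gelu(c)
  when :relu
    # ReLU has no fused bias variant here: add the bias, then apply ReLU.
    temp = bias ? bias_add(c, bias) : c
    relu(temp)
  when :silu
    bias ? silu_bias(c, bias) : silu(c)
  else
    # Unrecognized epilogue: apply only the bias, if given.
    bias ? bias_add(c, bias) : c
  end
end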
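
The dispatch above can be modeled on the CPU with nested arrays. This sketch covers only the :relu branch and the fallback branch, since those are fully determined by the listing; the fused :gelu and :silu bias kernels are not shown on this page and are left out. `EpilogueModel` and its methods are hypothetical stand-ins, not part of Ignis.

```ruby
# CPU model of the :relu and fallback branches of gemm_epilogue.
# All names here are illustrative stand-ins for the GPU routines.
module EpilogueModel
  module_function

  # Naive matrix product on nested arrays.
  def matmul(a, b)
    b_columns = b.transpose
    a.map { |row| b_columns.map { |col| row.zip(col).sum { |x, y| x * y } } }
  end

  # Add the bias vector to every row.
  def bias_add(c, bias)
    c.map { |row| row.zip(bias).map { |value, b| value + b } }
  end

  def relu(c)
    c.map { |row| row.map { |value| [value, 0.0].max } }
  end

  def gemm_epilogue(a, b, epilogue:, bias: nil)
    c = matmul(a, b)
    case epilogue
    when :relu
      relu(bias ? bias_add(c, bias) : c) # bias first, then ReLU
    else
      bias ? bias_add(c, bias) : c       # unknown epilogue: bias only
    end
  end
end

a = [[1.0, -2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, 0.5]

p EpilogueModel.gemm_epilogue(a, identity, epilogue: :relu, bias: bias) # => [[1.5, 0.0], [3.5, 4.5]]
p EpilogueModel.gemm_epilogue(a, identity, epilogue: :none, bias: bias) # => [[1.5, -1.5], [3.5, 4.5]]
```

Note that an unrecognized epilogue symbol does not raise; it silently falls through to the bias-only path, so a typo such as :rleu would skip the activation.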

.leaky_relu(input, negative_slope: 0.01, out: nil) ⇒ NvArray

Apply Leaky ReLU activation

Parameters:

  • input (NvArray)

    Input tensor

  • negative_slope (Float) (defaults to: 0.01)

    Slope for negative values

  • out (NvArray, nil) (defaults to: nil)

    Output tensor

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 222

def leaky_relu(input, negative_slope: 0.01, out: nil)
  CUDA::RuntimeAPI.ensure_loaded!

  n = input.size
  device = input.respond_to?(:device_index) ? input.device_index : 0
  out ||= Ignis::NvArray.zeros(input.shape, dtype: input.dtype, device: device)

  kernel = get_kernel(:leaky_relu, Kernels::LEAKY_RELU_KERNEL, "leaky_relu_forward")

  block_size = 256
  grid_size = (n + block_size - 1) / block_size

  kernel.launch(
    grid: [grid_size, 1, 1],
    block: [block_size, 1, 1],
    args: [input.device_ffi_ptr, out.device_ffi_ptr, n, negative_slope]
  )

  CUDA::RuntimeAPI.cudaDeviceSynchronize
  out
end
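
As a point of reference, Leaky ReLU is conventionally defined as x for non-negative inputs and negative_slope * x otherwise. The following CPU sketch assumes the kernel follows that definition (its source is not shown on this page); `leaky_relu_reference` is a hypothetical helper name.

```ruby
# CPU reference for Leaky ReLU.
# Assumption: the CUDA kernel implements the conventional definition.
def leaky_relu_reference(values, negative_slope: 0.01)
  values.map { |x| x >= 0 ? x : x * negative_slope }
end

p leaky_relu_reference([-2.0, 0.0, 3.0], negative_slope: 0.1) # => [-0.2, 0.0, 3.0]
```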

.relu(input, out: nil) ⇒ NvArray

Apply ReLU activation

Parameters:

  • input (NvArray)

    Input tensor

  • out (NvArray, nil) (defaults to: nil)

    Output tensor

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 212

def relu(input, out: nil)
  apply_unary(input, out, :relu, Kernels::RELU_KERNEL, "relu_forward")
end

.residual_add(input, residual, out: nil) ⇒ NvArray

Residual addition

Parameters:

  • input (NvArray)

    Input tensor

  • residual (NvArray)

    Residual tensor

  • out (NvArray, nil) (defaults to: nil)

    Output tensor

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 302

def residual_add(input, residual, out: nil)
  CUDA::RuntimeAPI.ensure_loaded!

  n = input.size
  device = input.respond_to?(:device_index) ? input.device_index : 0
  out ||= Ignis::NvArray.zeros(input.shape, dtype: input.dtype, device: device)

  kernel = get_kernel(:residual_add, Kernels::RESIDUAL_ADD_KERNEL, "residual_add")

  block_size = 256
  grid_size = (n + block_size - 1) / block_size

  kernel.launch(
    grid: [grid_size, 1, 1],
    block: [block_size, 1, 1],
    args: [input.device_ffi_ptr, residual.device_ffi_ptr, out.device_ffi_ptr, n]
  )

  CUDA::RuntimeAPI.cudaDeviceSynchronize
  out
end

.scale(input, factor, out: nil) ⇒ NvArray

Scale tensor by factor

Parameters:

  • input (NvArray)

    Input tensor

  • factor (Float)

    Scale factor

  • out (NvArray, nil) (defaults to: nil)

    Output tensor

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 330

def scale(input, factor, out: nil)
  CUDA::RuntimeAPI.ensure_loaded!

  n = input.size
  device = input.respond_to?(:device_index) ? input.device_index : 0
  out ||= Ignis::NvArray.zeros(input.shape, dtype: input.dtype, device: device)

  kernel = get_kernel(:scale, Kernels::SCALE_KERNEL, "scale")

  block_size = 256
  grid_size = (n + block_size - 1) / block_size

  kernel.launch(
    grid: [grid_size, 1, 1],
    block: [block_size, 1, 1],
    args: [input.device_ffi_ptr, out.device_ffi_ptr, factor, n]
  )

  CUDA::RuntimeAPI.cudaDeviceSynchronize
  out
end

.silu(input, out: nil) ⇒ NvArray

Apply SiLU (Swish) activation

Parameters:

  • input (NvArray)

    Input tensor

  • out (NvArray, nil) (defaults to: nil)

    Output tensor

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 203

def silu(input, out: nil)
  apply_unary(input, out, :silu, Kernels::SILU_KERNEL, "silu_forward")
end
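
SiLU (Swish) is conventionally defined as x * sigmoid(x), which can be written as x / (1 + exp(-x)). The CPU sketch below assumes the kernel follows that definition (its source is not shown on this page); `silu_reference` is a hypothetical helper name.

```ruby
# CPU reference for SiLU (Swish): x * sigmoid(x) == x / (1 + exp(-x)).
# Assumption: the CUDA kernel implements the conventional definition.
def silu_reference(x)
  x / (1.0 + Math.exp(-x))
end

puts silu_reference(0.0)
puts silu_reference(1.0)
```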

.silu_bias(input, bias, out: nil) ⇒ NvArray

Fused SiLU + Bias

Parameters:

  • input (NvArray)

    Input tensor

  • bias (NvArray)

    Bias vector

  • out (NvArray, nil) (defaults to: nil)

    Output tensor

Returns:

  • (NvArray)


# File 'lib/nvruby/epilogues.rb', line 292

def silu_bias(input, bias, out: nil)
  apply_fused_bias(input, bias, out, :silu_bias, Kernels::SILU_BIAS_KERNEL, "silu_bias_forward")
end