Module: Ignis::Collective::Algorithms::ReductionOps

Defined in:
lib/nvruby/collective/algorithms/reduction_ops.rb

Overview

Reduction operations for collective primitives. These operations combine tensor elements element-wise during reduce/allreduce.

Constant Summary

OPS =

Valid reduction operations.

%i[sum prod min max avg].freeze

Class Method Summary

Class Method Details

.avg(a, b, result, count, dtype, stream = nil, _n_participants = nil) ⇒ Object

Average step. NOTE: averaging means "sum across all ranks, then divide by the participant count ONCE at the end". The per-pair reduction step is therefore a plain sum; the caller (Communicator) performs the final divide-by-N. The _n_participants argument is accepted but not used by this step. (Previously this silently returned a sum with no divide.)



# File 'lib/nvruby/collective/algorithms/reduction_ops.rb', line 38

def self.avg(a, b, result, count, dtype, stream = nil, _n_participants = nil)
  execute(:sum, a, b, result, count, dtype, stream)
end
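
The sum-then-divide-once scheme described above can be illustrated on plain Ruby arrays. This is a host-side sketch only: `pairwise_sum`, `allreduce_avg`, and `naive_pairwise_avg` are hypothetical helpers, standing in for the per-pair step and for the final divide that the note attributes to the Communicator. They are not part of this module and do not touch device memory.

```ruby
# Hypothetical stand-in for the per-pair reduction step: a plain element-wise sum.
def pairwise_sum(a, b)
  a.zip(b).map { |x, y| x + y }
end

# Hypothetical stand-in for the caller: fold every rank's buffer with the
# sum step, then divide by the participant count exactly once.
def allreduce_avg(rank_buffers)
  n = rank_buffers.length
  total = rank_buffers.reduce { |acc, buf| pairwise_sum(acc, buf) }
  total.map { |x| x / n.to_f }
end

# For contrast: averaging at every pairwise step weights later ranks more heavily.
def naive_pairwise_avg(rank_buffers)
  rank_buffers.reduce { |acc, buf| acc.zip(buf).map { |x, y| (x + y) / 2.0 } }
end

ranks = [[1.0, 2.0], [2.0, 4.0], [6.0, 12.0]]
correct = allreduce_avg(ranks)       # => [3.0, 6.0]
skewed  = naive_pairwise_avg(ranks)  # => [3.75, 7.5]
```

With three ranks the two results differ, which is why the per-pair step stays a sum and the division happens once at the end.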

.execute(op, a, b, result, count, dtype, stream = nil) ⇒ void

This method returns an undefined value.

Execute reduction operation by name: result = op(a, b), elementwise.
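
The semantics can be modelled on the host with ordinary Ruby arrays. The sketch below is an assumption-laden reference model, not the module's implementation: `REFERENCE_OPS` and `reference_execute` are hypothetical names, and arrays take the place of device pointers. It mirrors three documented behaviours: `:avg` is treated as `:sum` per step, an unknown op raises `ArgumentError`, and `result` may alias `a`.

```ruby
# Host-side reference model (hypothetical; Ruby arrays instead of device pointers).
REFERENCE_OPS = {
  sum:  ->(x, y) { x + y },
  prod: ->(x, y) { x * y },
  min:  ->(x, y) { x < y ? x : y },
  max:  ->(x, y) { x > y ? x : y }
}.freeze

def reference_execute(op, a, b, result, count)
  reduce = (op == :avg ? :sum : op) # :avg is a plain sum per step
  fn = REFERENCE_OPS.fetch(reduce) do
    raise ArgumentError, "Unknown reduction operation: #{op}"
  end
  # Each slot is read before it is written, so result may alias a.
  count.times { |i| result[i] = fn.call(a[i], b[i]) }
  nil
end

a = [1.0, 5.0, 3.0]
b = [4.0, 2.0, 3.0]
out = Array.new(3)
reference_execute(:max, a, b, out, 3) # out is now [4.0, 5.0, 3.0]
reference_execute(:avg, a, b, a, 3)   # in place: a is now [5.0, 7.0, 6.0]
```

A count of zero leaves `result` untouched, matching the early return in the real method.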

Parameters:

  • op (Symbol)

    :sum, :prod, :min, :max, or :avg (avg == sum per step)

  • a (FFI::Pointer)

    First operand (device pointer)

  • b (FFI::Pointer)

    Second operand (device pointer)

  • result (FFI::Pointer)

    Result buffer (may alias a for in-place)

  • count (Integer)

    Element count

  • dtype (Symbol)

    Element data type; :float32 takes the GPU path, any other dtype takes the host fallback

  • stream (FFI::Pointer, nil) (defaults to: nil)

    CUDA stream

Raises:

  • (ArgumentError)

    if op is not one of :sum, :prod, :min, :max, or :avg


# File 'lib/nvruby/collective/algorithms/reduction_ops.rb', line 51

def self.execute(op, a, b, result, count, dtype, stream = nil)
  reduce = (op == :avg ? :sum : op)
  raise ArgumentError, "Unknown reduction operation: #{op}" unless %i[sum prod min max].include?(reduce)
  return if count.zero?

  if dtype == :float32
    gpu_elementwise(reduce, a, b, result, count)
  else
    # Non-fp32 dtypes use the (correct, slower) host path: the fused JIT
    # kernels are typed `float`, so reinterpreting fp16/fp64/int buffers
    # through them would be wrong.
    host_elementwise_fallback(host_op(reduce), a, b, result, count, dtype)
  end
end
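
The reason given in the comment for routing non-fp32 dtypes away from the `float` kernels can be reproduced on the host. The sketch below is illustrative only and does not call into this module: it uses core Ruby's `Array#pack` and `String#unpack1` to reinterpret int32 bytes as float32, add them as floats, and read the result back as int32, which is roughly what a float-typed kernel would do to an integer buffer. The operand values are arbitrary.

```ruby
# Two int32 operands, stored as little-endian 32-bit signed integers.
x = 1_000_000_000
y = 1_000_000_000
x_bytes = [x].pack("l<")
y_bytes = [y].pack("l<")

# Reinterpret the same bytes as little-endian float32, as a float-typed kernel would.
x_as_float = x_bytes.unpack1("e")
y_as_float = y_bytes.unpack1("e")

# Add as floats, store as float32, then read those bytes back as int32.
wrong = [x_as_float + y_as_float].pack("e").unpack1("l<")
right = x + y

puts "integer sum: #{right}, sum through float reinterpretation: #{wrong}"
```

The two values disagree because float addition operates on the exponent and mantissa encoded in the bits, not on the integer those bits represent.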

.max(a, b, result, count, dtype, stream = nil) ⇒ Object

Element-wise maximum



# File 'lib/nvruby/collective/algorithms/reduction_ops.rb', line 30

def self.max(a, b, result, count, dtype, stream = nil)
  execute(:max, a, b, result, count, dtype, stream)
end

.min(a, b, result, count, dtype, stream = nil) ⇒ Object

Element-wise minimum



# File 'lib/nvruby/collective/algorithms/reduction_ops.rb', line 25

def self.min(a, b, result, count, dtype, stream = nil)
  execute(:min, a, b, result, count, dtype, stream)
end

.prod(a, b, result, count, dtype, stream = nil) ⇒ Object

Element-wise product (a * b)



# File 'lib/nvruby/collective/algorithms/reduction_ops.rb', line 20

def self.prod(a, b, result, count, dtype, stream = nil)
  execute(:prod, a, b, result, count, dtype, stream)
end

.sum(a, b, result, count, dtype, stream = nil) ⇒ Object

Element-wise sum (a + b)



# File 'lib/nvruby/collective/algorithms/reduction_ops.rb', line 15

def self.sum(a, b, result, count, dtype, stream = nil)
  execute(:sum, a, b, result, count, dtype, stream)
end