Class: Ignis::Collective::Communicator

Inherits: Object

Defined in: lib/nvruby/collective/communicator.rb

Overview

Primary user-facing abstraction for collective operations. Provides AllReduce, Broadcast, Reduce, and other collective primitives.

Constant Summary

REDUCTION_OPS = [:sum, :prod, :min, :max, :avg].freeze

Reduction operations.
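
As a rough illustration of what each reduction symbol means element-wise, the following CPU-only sketch maps the symbols to plain Ruby lambdas. The `REDUCERS` table and the `reduce_elementwise` helper are hypothetical names introduced here, not part of the library, and `:avg` is assumed to mean the sum divided by the number of inputs.

```ruby
# Hypothetical CPU-only illustration; not the library's GPU kernels.
REDUCTION_OPS = [:sum, :prod, :min, :max, :avg].freeze

# Assumed semantics for each symbol (illustrative only).
REDUCERS = {
  sum:  ->(values) { values.sum },
  prod: ->(values) { values.reduce(1) { |acc, v| acc * v } },
  min:  ->(values) { values.min },
  max:  ->(values) { values.max },
  avg:  ->(values) { values.sum.to_f / values.size }
}.freeze

# Reduce several equal-length arrays position by position.
def reduce_elementwise(arrays, op)
  raise ArgumentError, "Unknown reduction op: #{op.inspect}" unless REDUCTION_OPS.include?(op)

  arrays.transpose.map { |column| REDUCERS.fetch(op).call(column) }
end

p reduce_elementwise([[1, 2], [3, 4]], :sum) # => [4, 6]
p reduce_elementwise([[1, 2], [3, 4]], :max) # => [3, 4]
```

Each inner array stands in for one GPU's tensor; the result has the same length as the inputs.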

Instance Attribute Summary

#device_manager ⇒ DeviceManager readonly

Device manager.
#gpu_ids ⇒ Array<Integer> readonly

GPU device IDs in this communicator.
#rank ⇒ Integer readonly

Rank of this communicator (for multi-process).
#transport_selector ⇒ TransportSelector readonly

Transport selector.
#world_size ⇒ Integer readonly

Total number of ranks.

Instance Method Summary

#all_gather(tensors, stream: nil) ⇒ Array<Array<NvArray>>

AllGather - gather tensors from all GPUs to all GPUs.
#all_reduce(tensors, op: :sum, stream: nil) ⇒ Array<NvArray>

Perform AllReduce operation - reduce and distribute result to all GPUs.
#all_reduce_async(tensors, op: :sum, stream:) ⇒ Array<NvArray>

Async AllReduce - requires explicit synchronization.
#all_to_all(send_buffers, recv_buffers, chunk_size:, stream: nil) ⇒ void

AllToAll - full exchange between all GPUs. Each GPU sends N chunks (one to each GPU) and receives N chunks (one from each GPU).
#barrier ⇒ void

Barrier synchronization across all GPUs.
#broadcast(tensor, root: 0, stream: nil) ⇒ Array<NvArray>

Broadcast tensor from root GPU to all GPUs.
#destroy! ⇒ void

Clean up all resources.
#initialize(gpu_ids:, rank: 0, world_size: 1) ⇒ Communicator constructor

Create a new communicator for the specified GPUs.
#initialize! ⇒ self

Initialize the communicator (detect topology, enable P2P, etc.).
#inspect ⇒ String

Detailed inspection.
#performance_summary ⇒ Hash

Get performance summary.
#ready? ⇒ Boolean

Check if communicator is ready.
#recv(buffer, src_rank:, size:, stream: nil) ⇒ void

Point-to-point receive (no data transfer; the actual copy is issued by send_recv).
#reduce(tensors, root: 0, op: :sum, stream: nil) ⇒ NvArray

Reduce tensors to root GPU.
#reduce_scatter(tensors, op: :sum, stream: nil) ⇒ Array<FFI::Pointer>

ReduceScatter - reduce and scatter result.
#send(tensor, dest_rank:, size: nil, stream: nil) ⇒ void

Point-to-point send to a destination rank (the sender is hard-coded to rank 0).
#send_recv(buffer, src_rank:, dst_buffer:, dst_rank:, size:, stream: nil) ⇒ void

Point-to-point copy from a specific source rank to a destination rank.
#to_s ⇒ String

Human-readable description.
#topology ⇒ Topology::Matrix

Get the topology matrix.

Constructor Details

#initialize(gpu_ids:, rank: 0, world_size: 1) ⇒ `Communicator`

Create a new communicator for the specified GPUs

Parameters:

gpu_ids (Array<Integer>) —

GPU device IDs to include
rank (Integer) (defaults to: 0) —

Rank of this process (default 0 for single-process)
world_size (Integer) (defaults to: 1) —

Total ranks (default 1 for single-process)

# File 'lib/nvruby/collective/communicator.rb', line 37

def initialize(gpu_ids:, rank: 0, world_size: 1)
  @gpu_ids = gpu_ids.dup.freeze
  @rank = rank
  @world_size = world_size

  validate_gpu_ids!

  @device_manager = DeviceManager.new(device_ids: @gpu_ids)
  @transport_selector = TransportSelector.new(@gpu_ids)
  @ring_order = nil
  @initialized = false
end
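
The constructor stores `gpu_ids.dup.freeze` rather than the caller's array. A minimal sketch of what that idiom buys, using a hypothetical stand-in class (`FrozenIdsHolder` is not part of the library):

```ruby
# Hypothetical stand-in; only the dup.freeze idiom is taken from the listing.
class FrozenIdsHolder
  attr_reader :gpu_ids

  def initialize(gpu_ids:)
    # Copy first so later changes to the caller's array are not seen here,
    # then freeze so the stored copy cannot be mutated through the reader.
    @gpu_ids = gpu_ids.dup.freeze
  end
end

caller_ids = [0, 1]
holder = FrozenIdsHolder.new(gpu_ids: caller_ids)
caller_ids << 2 # the caller's array changes; the holder's copy does not

p holder.gpu_ids         # => [0, 1]
p holder.gpu_ids.frozen? # => true

begin
  holder.gpu_ids << 3
rescue FrozenError => e
  puts "mutation rejected: #{e.class}"
end
```

The caller's original array stays unfrozen and usable, which is why the copy is taken before freezing.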

Instance Attribute Details

#device_manager ⇒ `DeviceManager` (readonly)

Returns the device manager.

Returns:

(DeviceManager) —

Device manager



# File 'lib/nvruby/collective/communicator.rb', line 22

def device_manager
  @device_manager
end

#gpu_ids ⇒ `Array<Integer>` (readonly)

Returns the GPU device IDs in this communicator.

Returns:

(Array<Integer>) —

GPU device IDs in this communicator



# File 'lib/nvruby/collective/communicator.rb', line 19

def gpu_ids
  @gpu_ids
end

#rank ⇒ `Integer` (readonly)

Returns the rank of this communicator (for multi-process use).

Returns:

(Integer) —

Rank of this communicator (for multi-process)



# File 'lib/nvruby/collective/communicator.rb', line 28

def rank
  @rank
end

#transport_selector ⇒ `TransportSelector` (readonly)

Returns the transport selector.

Returns:

(TransportSelector) —

Transport selector



# File 'lib/nvruby/collective/communicator.rb', line 25

def transport_selector
  @transport_selector
end

#world_size ⇒ `Integer` (readonly)

Returns the total number of ranks.

Returns:

(Integer) —

Total number of ranks



# File 'lib/nvruby/collective/communicator.rb', line 31

def world_size
  @world_size
end

Instance Method Details

#all_gather(tensors, stream: nil) ⇒ `Array<Array<NvArray>>`

AllGather - gather tensors from all GPUs to all GPUs

Parameters:

tensors (Array<NvArray>) —

One tensor per GPU (each may be a different size)
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

Returns:

(Array<Array<NvArray>>) —

Gathered tensors on each GPU

# File 'lib/nvruby/collective/communicator.rb', line 130

def all_gather(tensors, stream: nil)
  validate_tensors!(tensors)
  ensure_initialized!

  return [tensors] if @gpu_ids.size == 1

  # TODO: Implement ring all-gather
  simple_all_gather(tensors, stream)
end
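
To make the nested return shape concrete, here is a CPU-only sketch in which plain arrays stand in for tensors. `simulate_all_gather` is a hypothetical helper; the body of `simple_all_gather` is not shown in the listing, so this only illustrates the documented result shape (outer index is the GPU, inner array holds every GPU's tensor).

```ruby
# Hypothetical CPU-only illustration of the AllGather result shape.
# per_gpu[i] stands in for the tensor that lives on GPU i.
def simulate_all_gather(per_gpu)
  # Every GPU ends up with its own copy of every GPU's tensor.
  per_gpu.map { per_gpu.map(&:dup) }
end

gathered = simulate_all_gather([[1], [2, 3]])
p gathered # => [[[1], [2, 3]], [[1], [2, 3]]]
```

Inputs of different sizes are fine here because nothing is concatenated; each gathered entry keeps its own length.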

#all_reduce(tensors, op: :sum, stream: nil) ⇒ `Array<NvArray>`

Perform AllReduce operation - reduce and distribute result to all GPUs

Parameters:

tensors (Array<NvArray>) —

One tensor per GPU
op (Symbol) (defaults to: :sum) —

Reduction operation (:sum, :prod, :min, :max, :avg; see REDUCTION_OPS)
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

Returns:

(Array<NvArray>) —

Reduced tensors (same references as input)

# File 'lib/nvruby/collective/communicator.rb', line 69

def all_reduce(tensors, op: :sum, stream: nil)
  validate_operation!(op)
  validate_tensors!(tensors)
  ensure_initialized!

  # Single GPU case - no-op
  return tensors if @gpu_ids.size == 1

  # Use Ring AllReduce for multi-GPU
  ring_all_reduce(tensors, op, stream)
end
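
The listing delegates to `ring_all_reduce`, whose body is not shown here. As a sketch of the general ring AllReduce technique (a reduce-scatter phase followed by an all-gather phase), the following CPU-only simulation uses plain arrays in place of GPU buffers. `simulate_ring_all_reduce` is a hypothetical name, only `:sum` is handled, and the buffer length is assumed to be divisible by the number of ranks; the library's actual chunking and ordering may differ.

```ruby
# Hypothetical CPU-only simulation of ring AllReduce (sum only).
# buffers[r] stands in for the buffer on rank r; buffers are updated in place.
def simulate_ring_all_reduce(buffers)
  n = buffers.size
  return buffers if n == 1
  unless (buffers[0].size % n).zero?
    raise ArgumentError, "buffer length must be divisible by rank count"
  end

  chunk_len = buffers[0].size / n
  chunk_range = ->(c) { (c * chunk_len)...((c + 1) * chunk_len) }

  # Phase 1 (reduce-scatter): after n - 1 steps, rank r holds the fully
  # reduced chunk (r + 1) % n.
  (n - 1).times do |step|
    # Snapshot outgoing chunks first so all ranks "send" simultaneously.
    outgoing = Array.new(n) do |rank|
      chunk = (rank - step) % n
      [chunk, buffers[rank][chunk_range.call(chunk)]]
    end
    outgoing.each_with_index do |(chunk, data), rank|
      dest = (rank + 1) % n
      range = chunk_range.call(chunk)
      buffers[dest][range] = buffers[dest][range].zip(data).map { |a, b| a + b }
    end
  end

  # Phase 2 (all-gather): circulate the reduced chunks around the ring.
  (n - 1).times do |step|
    outgoing = Array.new(n) do |rank|
      chunk = (rank + 1 - step) % n
      [chunk, buffers[rank][chunk_range.call(chunk)]]
    end
    outgoing.each_with_index do |(chunk, data), rank|
      buffers[(rank + 1) % n][chunk_range.call(chunk)] = data
    end
  end

  buffers
end

p simulate_ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
# => [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

Like the documented method, the sketch returns the same array objects it was given, now holding the reduced values.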

#all_reduce_async(tensors, op: :sum, stream:) ⇒ `Array<NvArray>`

Async AllReduce - requires explicit synchronization

Parameters:

tensors (Array<NvArray>) —

One tensor per GPU
op (Symbol) (defaults to: :sum) —

Reduction operation
stream (CUDA::Stream) —

CUDA stream for async execution

Returns:

(Array<NvArray>) —

Tensors (result available after sync)

Raises:

(ArgumentError)

# File 'lib/nvruby/collective/communicator.rb', line 86

def all_reduce_async(tensors, op: :sum, stream:)
  raise ArgumentError, "Stream required for async operation" unless stream

  all_reduce(tensors, op: op, stream: stream)
end
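
The signature declares `stream:` as a required keyword, so Ruby itself rejects a call that omits it; the explicit guard in the body covers the remaining case where the caller passes `stream: nil`. A small sketch with a hypothetical stand-in method (`async_op` is not part of the library) shows both paths:

```ruby
# Hypothetical stand-in with the same keyword signature as the listing.
def async_op(tensors, op: :sum, stream:)
  raise ArgumentError, "Stream required for async operation" unless stream

  [tensors, op, stream]
end

# Case 1: keyword omitted. Ruby raises before the method body runs.
begin
  async_op([1, 2])
rescue ArgumentError => e
  puts "omitted: #{e.message}"
end

# Case 2: keyword present but nil. The guard in the body raises.
begin
  async_op([1, 2], stream: nil)
rescue ArgumentError => e
  puts "nil: #{e.message}"
end

p async_op([1, 2], stream: :stream0) # => [[1, 2], :sum, :stream0]
```

Note that the listed method simply forwards to `all_reduce` with the given stream, so callers are responsible for synchronizing that stream before reading the result.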

#all_to_all(send_buffers, recv_buffers, chunk_size:, stream: nil) ⇒ `void`

This method returns an undefined value.

AllToAll - full exchange between all GPUs. Each GPU sends N chunks (one to each GPU) and receives N chunks (one from each GPU).

Parameters:

send_buffers (Array<Array<FFI::Pointer>>) —

N×N array: send_buffers[src][dst] is the chunk that rank src sends to rank dst
recv_buffers (Array<Array<FFI::Pointer>>) —

N×N array: recv_buffers[dst][src] is where rank dst receives the chunk from rank src
chunk_size (Integer) —

Size of each chunk in bytes
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

# File 'lib/nvruby/collective/communicator.rb', line 196

def all_to_all(send_buffers, recv_buffers, chunk_size:, stream: nil)
  ensure_initialized!

  n = @gpu_ids.size
  return if n == 1

  streams = stream ? [stream] * n : create_null_streams(n)

  # Phase 1: Copy local data (GPU[i] → GPU[i])
  n.times do |rank|
    gpu_id = @gpu_ids[rank]
    CUDA::RuntimeAPI.cudaSetDevice(gpu_id)
    stream_ptr = get_stream_ptr(streams[rank])

    CUDA::RuntimeAPI.cudaMemcpyAsync(
      recv_buffers[rank][rank],
      send_buffers[rank][rank],
      chunk_size,
      CUDA::RuntimeAPI::MEMCPY_DEVICE_TO_DEVICE,
      stream_ptr
    )
  end

  # Phase 2: N-1 rounds of pairwise exchange
  (n - 1).times do |round|
    n.times do |rank|
      gpu_id = @gpu_ids[rank]
      
      # Calculate partner for this round (rotation pattern)
      partner = (rank + round + 1) % n
      partner_gpu = @gpu_ids[partner]

      stream_ptr = get_stream_ptr(streams[rank])

      # Send to partner
      transport = @transport_selector.select_transport(gpu_id, partner_gpu)
      
      if transport.is_a?(Transport::P2PTransport)
        transport.copy_async(
          recv_buffers[partner][rank],  # Partner receives from me
          send_buffers[rank][partner],  # I send to partner
          chunk_size,
          stream_ptr
        )
      end
    end

    # Synchronize after each round
    synchronize_all_streams!(streams)
  end
end
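
The buffer indexing and the rotation schedule in the listing above can be checked with a CPU-only sketch in which strings stand in for device chunks and plain assignment stands in for `cudaMemcpyAsync` and `copy_async`. `simulate_all_to_all` is a hypothetical helper, not part of the library.

```ruby
# Hypothetical CPU-only simulation of the exchange pattern in the listing.
# send[src][dst] is the chunk src sends to dst; recv[dst][src] is where it lands.
def simulate_all_to_all(send)
  n = send.size
  recv = Array.new(n) { Array.new(n) }

  # Phase 1: each rank keeps its own chunk.
  n.times { |rank| recv[rank][rank] = send[rank][rank] }

  # Phase 2: in round k, rank r sends to partner (r + k + 1) % n, so over
  # n - 1 rounds every rank sends to every other rank exactly once.
  (n - 1).times do |round|
    n.times do |rank|
      partner = (rank + round + 1) % n
      recv[partner][rank] = send[rank][partner]
    end
  end

  recv
end

send = [%w[a0 a1 a2], %w[b0 b1 b2], %w[c0 c1 c2]]
p simulate_all_to_all(send)
# => [["a0", "b0", "c0"], ["a1", "b1", "c1"], ["a2", "b2", "c2"]]
```

The result is the transpose of the send matrix, which is the defining property of an AllToAll exchange.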

#barrier ⇒ `void`

This method returns an undefined value.

Barrier synchronization across all GPUs

# File 'lib/nvruby/collective/communicator.rb', line 337

def barrier
  ensure_initialized!
  @device_manager.synchronize_all!
end

#broadcast(tensor, root: 0, stream: nil) ⇒ `Array<NvArray>`

Broadcast tensor from root GPU to all GPUs

Parameters:

tensor (NvArray) —

Source tensor on root GPU
root (Integer) (defaults to: 0) —

Root GPU index (default 0)
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

Returns:

(Array<NvArray>) —

Tensors on all GPUs with broadcasted data

# File 'lib/nvruby/collective/communicator.rb', line 97

def broadcast(tensor, root: 0, stream: nil)
  ensure_initialized!
  validate_gpu_index!(root)

  return [tensor] if @gpu_ids.size == 1

  # TODO: Implement tree broadcast algorithm
  # For now, use simple fan-out from root
  simple_broadcast(tensor, root, stream)
end

#destroy! ⇒ `void`

This method returns an undefined value.

Clean up all resources

# File 'lib/nvruby/collective/communicator.rb', line 364

def destroy!
  @transport_selector.destroy!
  @device_manager.destroy!
  @initialized = false
end

#initialize! ⇒ `self`

Initialize the communicator (detect topology, enable P2P, etc.)

Returns:

(self)

# File 'lib/nvruby/collective/communicator.rb', line 52

def initialize!
  return self if @initialized

  @device_manager.initialize!
  @device_manager.enable_all_p2p_access!
  @transport_selector.initialize!
  @ring_order = @transport_selector.optimal_ring_order

  @initialized = true
  self
end
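
Because `initialize!` returns early when the communicator is already initialized and otherwise returns `self`, it is safe to call repeatedly and can be chained onto the constructor. A minimal sketch of that guard pattern with a hypothetical stand-in class (`LazyResource` and `setup_calls` are illustrative names only):

```ruby
# Hypothetical stand-in showing the idempotent initialize! guard.
class LazyResource
  attr_reader :setup_calls

  def initialize
    @initialized = false
    @setup_calls = 0
  end

  def initialize!
    return self if @initialized # second and later calls do nothing

    @setup_calls += 1           # stands in for the expensive setup work
    @initialized = true
    self                        # returning self allows chaining
  end

  def ready?
    @initialized
  end
end

resource = LazyResource.new.initialize!.initialize!
p resource.ready?      # => true
p resource.setup_calls # => 1
```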

#inspect ⇒ `String`

Returns a detailed inspection string.

Returns:

(String) —

Detailed inspection

# File 'lib/nvruby/collective/communicator.rb', line 377

def inspect
  "#<Ignis::Collective::Communicator " \
    "gpu_ids=#{@gpu_ids} " \
    "rank=#{@rank}/#{@world_size} " \
    "initialized=#{@initialized}>"
end

#performance_summary ⇒ `Hash`

Get performance summary

Returns:

(Hash) —

Performance statistics



# File 'lib/nvruby/collective/communicator.rb', line 358

def performance_summary
  @transport_selector.performance_summary
end

#ready? ⇒ `Boolean`

Check if communicator is ready

Returns:

(Boolean) —

True if the communicator is initialized and both the device manager and the transport selector report ready

# File 'lib/nvruby/collective/communicator.rb', line 344

def ready?
  @initialized &&
    @device_manager.ready? &&
    @transport_selector.ready?
end

#recv(buffer, src_rank:, size:, stream: nil) ⇒ `void`

This method returns an undefined value.

Point-to-point receive. Performs no data transfer itself: it validates src_rank and then calls barrier; the actual copy is issued by send_recv.

Parameters:

buffer (FFI::Pointer) —

Buffer to receive into
src_rank (Integer) —

Source rank
size (Integer) —

Expected size in bytes
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

# File 'lib/nvruby/collective/communicator.rb', line 327

def recv(buffer, src_rank:, size:, stream: nil)
  ensure_initialized!
  validate_gpu_index!(src_rank)
  # Actual data transfer happens via send_recv from sender side
  # This just marks the receive buffer as ready
  barrier
end

#reduce(tensors, root: 0, op: :sum, stream: nil) ⇒ `NvArray`

Reduce tensors to root GPU

Parameters:

tensors (Array<NvArray>) —

One tensor per GPU
root (Integer) (defaults to: 0) —

Root GPU index
op (Symbol) (defaults to: :sum) —

Reduction operation
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

Returns:

(NvArray) —

Reduced tensor on root GPU

# File 'lib/nvruby/collective/communicator.rb', line 114

def reduce(tensors, root: 0, op: :sum, stream: nil)
  validate_operation!(op)
  validate_tensors!(tensors)
  ensure_initialized!
  validate_gpu_index!(root)

  return tensors[0] if @gpu_ids.size == 1

  # TODO: Implement tree reduce algorithm
  simple_reduce(tensors, root, op, stream)
end

#reduce_scatter(tensors, op: :sum, stream: nil) ⇒ `Array<FFI::Pointer>`

ReduceScatter - reduce and scatter result

Parameters:

tensors (Array<NvArray>) —

One tensor per GPU
op (Symbol) (defaults to: :sum) —

Reduction operation
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

Returns:

(Array<FFI::Pointer>) —

Scattered reduced chunks, one per GPU (chunk size = total_size / N). With a single GPU, the input tensors are returned unchanged.

# File 'lib/nvruby/collective/communicator.rb', line 145

def reduce_scatter(tensors, op: :sum, stream: nil)
  validate_operation!(op)
  validate_tensors!(tensors)
  ensure_initialized!

  return tensors if @gpu_ids.size == 1

  ring = Algorithms::Ring.new(
    ring_order: @ring_order,
    transport_selector: @transport_selector
  )

  buffers = tensors.map { |t| device_buffer(t) }
  sizes = tensors.map { |t| byte_size_of(t) }

  dtype = if tensors[0].respond_to?(:dtype)
            tensors[0].dtype
          else
            :float32
          end

  # Calculate chunk size
  total_size = sizes[0]
  chunk_size = ring.calculate_chunk_size(total_size)

  # Allocate result buffers
  result_buffers = @gpu_ids.map do |gpu_id|
    allocate_buffer_on_device(gpu_id, chunk_size)
  end

  streams = stream ? [stream] * @gpu_ids.size : create_null_streams(@gpu_ids.size)

  ring.reduce_scatter(
    buffers: buffers,
    result_buffers: result_buffers,
    sizes: sizes,
    dtype: dtype,
    op: op,
    streams: streams
  )

  result_buffers
end
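
As a sketch of what ReduceScatter produces, the following CPU-only example reduces plain arrays element-wise and hands one chunk of the result to each rank. `simulate_reduce_scatter` is a hypothetical helper; it handles only `:sum`, assumes the buffer length is divisible by the number of ranks, and assumes rank i receives chunk i. The chunk assignment inside `Algorithms::Ring` is not shown in the listing and may differ.

```ruby
# Hypothetical CPU-only illustration of ReduceScatter (sum only).
# buffers[r] stands in for the full-size buffer on rank r.
def simulate_reduce_scatter(buffers)
  n = buffers.size
  unless (buffers[0].size % n).zero?
    raise ArgumentError, "buffer length must be divisible by rank count"
  end

  chunk_len = buffers[0].size / n
  reduced = buffers.transpose.map(&:sum) # element-wise sum across ranks
  reduced.each_slice(chunk_len).to_a     # chunk i is the result for rank i
end

p simulate_reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]])
# => [[11, 22], [33, 44]]
```

Each rank ends up with 1/N of the reduced data, which is why the result buffers in the listing are allocated with `chunk_size` rather than the full size.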

#send(tensor, dest_rank:, size: nil, stream: nil) ⇒ `void`

This method returns an undefined value.

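Point-to-point send to a destination rank. In the listed implementation the sender is hard-coded to rank 0, and the method raises ArgumentError when a P2P transport is selected because no receive buffer is supplied; send_recv takes an explicit destination buffer.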

Parameters:

tensor (NvArray, FFI::Pointer) —

Data to send
dest_rank (Integer) —

Destination rank (index in gpu_ids)
size (Integer, nil) (defaults to: nil) —

Size in bytes (inferred from tensor if nil)
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

# File 'lib/nvruby/collective/communicator.rb', line 254

def send(tensor, dest_rank:, size: nil, stream: nil)
  ensure_initialized!
  validate_gpu_index!(dest_rank)

  src_rank = 0  # Default sender is rank 0
  src_gpu = @gpu_ids[src_rank]
  dst_gpu = @gpu_ids[dest_rank]

  return if src_rank == dest_rank

  buffer = device_buffer(tensor)
  byte_size = size || byte_size_of(tensor)

  transport = @transport_selector.select_transport(src_gpu, dst_gpu)
  stream_ptr = stream ? get_stream_ptr(stream) : FFI::Pointer::NULL

  if transport.is_a?(Transport::P2PTransport)
    # P2P copy requires destination buffer
    # Assumes tensor has been pre-allocated on dest
    raise ArgumentError, "P2P send requires pre-allocated recv buffer on dest"
  end
end
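
One Ruby-level consequence of naming this method `send`: it replaces the inherited `Object#send` on instances of the class, so dynamic dispatch through `send(:method_name)` no longer works on them. `__send__` and `public_send` are unaffected. A minimal sketch with a hypothetical class (`Channel` is not part of the library):

```ruby
# Hypothetical class illustrating the effect of defining #send.
class Channel
  # This definition takes the place of Object#send for Channel instances.
  def send(payload, dest_rank:)
    "sent #{payload} to #{dest_rank}"
  end

  def ping
    :pong
  end
end

ch = Channel.new
p ch.send("x", dest_rank: 1) # => "sent x to 1"

# Dynamic dispatch still works through the aliases Ruby provides.
p ch.__send__(:ping)   # => :pong
p ch.public_send(:ping) # => :pong

# The usual send(:name) form now hits Channel#send and fails.
begin
  ch.send(:ping)
rescue ArgumentError => e
  puts "send(:ping) failed: #{e.class}"
end
```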

#send_recv(buffer, src_rank:, dst_buffer:, dst_rank:, size:, stream: nil) ⇒ `void`

This method returns an undefined value.

Point-to-point copy from a specific source rank to a destination rank

Parameters:

buffer (FFI::Pointer) —

Source buffer on src_rank GPU
src_rank (Integer) —

Source rank
dst_buffer (FFI::Pointer) —

Destination buffer on dst_rank GPU
dst_rank (Integer) —

Destination rank
size (Integer) —

Size in bytes
stream (CUDA::Stream, nil) (defaults to: nil) —

Optional CUDA stream

# File 'lib/nvruby/collective/communicator.rb', line 285

def send_recv(buffer, src_rank:, dst_buffer:, dst_rank:, size:, stream: nil)
  ensure_initialized!
  validate_gpu_index!(src_rank)
  validate_gpu_index!(dst_rank)

  return if src_rank == dst_rank

  src_gpu = @gpu_ids[src_rank]
  dst_gpu = @gpu_ids[dst_rank]

  transport = @transport_selector.select_transport(src_gpu, dst_gpu)
  stream_ptr = stream ? get_stream_ptr(stream) : FFI::Pointer::NULL

  if transport.is_a?(Transport::P2PTransport)
    # Set source device context
    CUDA::RuntimeAPI.cudaSetDevice(src_gpu)
    transport.copy_async(dst_buffer, buffer, size, stream_ptr)
  elsif transport.is_a?(Transport::IPCTransport)
    # For IPC, export/import handles
    handle = transport.export_handle(buffer)
    CUDA::RuntimeAPI.cudaSetDevice(dst_gpu)
    mapped = transport.import_handle(handle)
    
    # Copy from mapped to destination
    CUDA::RuntimeAPI.cudaMemcpyAsync(
      dst_buffer,
      mapped,
      size,
      CUDA::RuntimeAPI::MEMCPY_DEVICE_TO_DEVICE,
      stream_ptr
    )
    
    transport.close_imported_handle(mapped)
  end
end

#to_s ⇒ `String`

Returns a human-readable description.

Returns:

(String) —

Human-readable description

# File 'lib/nvruby/collective/communicator.rb', line 371

def to_s
  status = @initialized ? "ready" : "uninitialized"
  "Communicator[#{@gpu_ids.size} GPUs, #{status}]"
end

#topology ⇒ `Topology::Matrix`

Get the topology matrix

Returns:

(Topology::Matrix) —

Topology information



# File 'lib/nvruby/collective/communicator.rb', line 352

def topology
  @device_manager.topology&.matrix
end
