Module: Ignis::Collective

Defined in:
lib/nvruby/collective.rb,
lib/nvruby/collective/topology.rb,
lib/nvruby/collective/array_ops.rb,
lib/nvruby/collective/communicator.rb,
lib/nvruby/collective/p2p_bindings.rb,
lib/nvruby/collective/vmm_bindings.rb,
lib/nvruby/collective/device_manager.rb,
lib/nvruby/collective/health_monitor.rb,
lib/nvruby/collective/net/nd_adapter.rb,
lib/nvruby/collective/transport/base.rb,
lib/nvruby/collective/algorithms/ring.rb,
lib/nvruby/collective/algorithms/tree.rb,
lib/nvruby/collective/net/nd_bindings.rb,
lib/nvruby/collective/nvarray_adapter.rb,
lib/nvruby/collective/dynamic_optimizer.rb,
lib/nvruby/collective/net/rdma_transport.rb,
lib/nvruby/collective/transport_selector.rb,
lib/nvruby/collective/communicator_healer.rb,
lib/nvruby/collective/resilient_transport.rb,
lib/nvruby/collective/algorithms/pipeliner.rb,
lib/nvruby/collective/transport/ipc_transport.rb,
lib/nvruby/collective/transport/p2p_transport.rb,
lib/nvruby/collective/transport/rio_transport.rb,
lib/nvruby/collective/transport/tcp_transport.rb,
lib/nvruby/collective/algorithms/reduction_ops.rb,
lib/nvruby/collective/transport/rdma_transports.rb,
lib/nvruby/collective/transport/vmm_ipc_structs.rb,
lib/nvruby/collective/algorithms/topology_router.rb,
lib/nvruby/collective/transport/vmm_ipc_transport.rb,
lib/nvruby/collective/algorithms/double_binary_tree.rb,
lib/nvruby/collective/transport/host_staged_transport.rb

Defined Under Namespace

Modules: Algorithms, ArrayOps, NetworkDirect, NvArrayAdapter, P2PBindings, Topology, Transport, VMMBindings

Classes: Communicator, CommunicatorError, CommunicatorHealer, DeviceManager, DynamicOptimizer, Error, HealingError, HealthMonitor, ResilientTransport, TopologyError, TransportError, TransportSelector

Constant Summary

VERSION =

NvCCL (Ignis Collective Communications Library) version. Note: NvCCL is NOT NCCL; it is an original design.

'0.1.0'

Class Method Details

.available_devices ⇒ Array<CUDA::Device>

Get list of available GPUs.

Returns:

  • (Array<CUDA::Device>)


# File 'lib/nvruby/collective.rb', line 83

def available_devices
  CUDA::Device.list
rescue StandardError
  []
end
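
Because of the rescue clause, callers never see an exception from device enumeration: a failed lookup looks the same as a machine with no GPUs. The sketch below reproduces that pattern with a stand-in `CUDA::Device` whose `list` always raises. Both the stub and the `CollectiveSketch` module are hypothetical names used only for illustration; they are not part of the library.

```ruby
# Hypothetical stand-in for the CUDA binding: simulates a host where
# device enumeration fails at runtime.
module CUDA
  class Device
    def self.list
      raise RuntimeError, 'CUDA driver not available'
    end
  end
end

# Hypothetical module mirroring the fallback shown in the source above.
module CollectiveSketch
  def self.available_devices
    CUDA::Device.list
  rescue StandardError
    # Any StandardError is reported as "no devices".
    []
  end
end

devices = CollectiveSketch.available_devices
puts "usable GPUs: #{devices.size}"
```

Note that `rescue StandardError` does not cover `LoadError` or other `ScriptError` subclasses, so if the real binding raises one of those (for example when a native library is missing), it would still propagate to the caller.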

.boot! ⇒ void

This method returns an undefined value.

Boot the NvCCL collective layer.

Enumerates CUDA devices, detects topology, registers RecoveryProtocol callbacks, subscribes to EventBus events, and starts the HealthMonitor.



# File 'lib/nvruby/collective.rb', line 49

def boot!
  return if @booted

  register_recovery_callbacks!
  subscribe_event_bus!
  start_health_monitor!

  @booted = true
  $stderr.puts "[NvCCL] Booted — #{device_count} GPU(s) detected"
end
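
The `return if @booted` guard makes `boot!` idempotent: a second call does nothing. A minimal sketch of that guard follows, using a hypothetical `BootSketch` module with a counter in place of the real recovery, event-bus, and health-monitor hooks (none of which are reproduced here).

```ruby
# Hypothetical module illustrating the idempotent boot guard.
module BootSketch
  class << self
    attr_reader :boot_runs

    def boot!
      return if @booted # already booted: skip all setup work

      # Stand-in for the real setup steps; counts how often setup actually ran.
      @boot_runs = (@boot_runs || 0) + 1
      @booted = true
    end

    def booted?
      # Coerce to a strict Boolean; @booted is nil before the first boot!.
      !!@booted
    end
  end
end

BootSketch.boot!
BootSketch.boot! # no-op
puts "setup ran #{BootSketch.boot_runs} time(s), booted: #{BootSketch.booted?}"
```

The guard is not synchronized, so in this sketch two threads calling `boot!` at the same moment could both run the setup; whether the real module needs a mutex depends on how it is loaded.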

.booted? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/nvruby/collective.rb', line 61

def booted?
  @booted
end

.create_communicator(gpu_ids:, rank: 0, world_size: 1) ⇒ Communicator

Create a new communicator for the specified GPUs.

Parameters:

  • gpu_ids (Array<Integer>)

    GPU device IDs to include

  • rank (Integer) (defaults to: 0)

    Process rank (for multi-process, default 0)

  • world_size (Integer) (defaults to: 1)

    Total processes (default 1)

Returns:

  • (Communicator)


# File 'lib/nvruby/collective.rb', line 71

def create_communicator(gpu_ids:, rank: 0, world_size: 1)
  Communicator.new(gpu_ids: gpu_ids, rank: rank, world_size: world_size)
end
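
To show the call shape, here is a sketch of the keyword-argument interface with its defaults. `CommSketch` and its `Struct`-based `Communicator` are hypothetical stand-ins so the example runs without a GPU; the real `Communicator` class does far more than hold three fields.

```ruby
# Hypothetical stand-in: a plain value object instead of the real Communicator.
module CommSketch
  Communicator = Struct.new(:gpu_ids, :rank, :world_size, keyword_init: true)

  # Same signature as the documented factory: gpu_ids is required,
  # rank and world_size default to a single-process setup.
  def self.create_communicator(gpu_ids:, rank: 0, world_size: 1)
    Communicator.new(gpu_ids: gpu_ids, rank: rank, world_size: world_size)
  end
end

# Single process driving two GPUs: defaults apply.
single = CommSketch.create_communicator(gpu_ids: [0, 1])

# Second of two processes, each owning one GPU.
multi = CommSketch.create_communicator(gpu_ids: [0], rank: 1, world_size: 2)

puts single.to_h.inspect
puts multi.to_h.inspect
```

Omitting `gpu_ids:` raises `ArgumentError`, since it is a required keyword.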

.detect_topology ⇒ Topology::Detector

Detect GPU topology for all available devices.

Returns:

  • (Topology::Detector)


# File 'lib/nvruby/collective.rb', line 77

def detect_topology
  Topology::Detector.new
end

.device_count ⇒ Integer

Returns:

  • (Integer)


# File 'lib/nvruby/collective.rb', line 90

def device_count
  CUDA::Device.count
rescue StandardError
  0
end

.multi_gpu_available? ⇒ Boolean

Check if multi-GPU is available.

Returns:

  • (Boolean)


# File 'lib/nvruby/collective.rb', line 98

def multi_gpu_available?
  device_count > 1
end
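
One way a caller might use this predicate is to decide which GPU IDs to hand to a communicator. The sketch below is an illustration only: `DeviceProbe` and `gpu_ids_for` are hypothetical names, the device count is injected rather than read from CUDA, and the 0-based ID numbering is an assumption.

```ruby
# Hypothetical stand-in exposing the same two predicates; the device count is
# injected so the example runs without a GPU.
class DeviceProbe
  def initialize(count)
    @count = count
  end

  def device_count
    @count
  end

  def multi_gpu_available?
    device_count > 1
  end
end

# Assumption: device IDs are numbered 0...device_count.
def gpu_ids_for(probe)
  return [] if probe.device_count.zero?     # no GPU (or enumeration failed)
  return [0] unless probe.multi_gpu_available?

  (0...probe.device_count).to_a             # use every visible device
end

[0, 1, 4].each do |n|
  puts "#{n} device(s) -> #{gpu_ids_for(DeviceProbe.new(n)).inspect}"
end
```

Since `device_count` falls back to 0 on error, the empty-list branch also covers hosts where the count could not be read.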