Module: Ignis::Collective
Defined in:
lib/nvruby/collective.rb,
lib/nvruby/collective/topology.rb,
lib/nvruby/collective/array_ops.rb,
lib/nvruby/collective/communicator.rb,
lib/nvruby/collective/p2p_bindings.rb,
lib/nvruby/collective/vmm_bindings.rb,
lib/nvruby/collective/device_manager.rb,
lib/nvruby/collective/health_monitor.rb,
lib/nvruby/collective/net/nd_adapter.rb,
lib/nvruby/collective/transport/base.rb,
lib/nvruby/collective/algorithms/ring.rb,
lib/nvruby/collective/algorithms/tree.rb,
lib/nvruby/collective/net/nd_bindings.rb,
lib/nvruby/collective/nvarray_adapter.rb,
lib/nvruby/collective/dynamic_optimizer.rb,
lib/nvruby/collective/net/rdma_transport.rb,
lib/nvruby/collective/transport_selector.rb,
lib/nvruby/collective/communicator_healer.rb,
lib/nvruby/collective/resilient_transport.rb,
lib/nvruby/collective/algorithms/pipeliner.rb,
lib/nvruby/collective/transport/ipc_transport.rb,
lib/nvruby/collective/transport/p2p_transport.rb,
lib/nvruby/collective/transport/rio_transport.rb,
lib/nvruby/collective/transport/tcp_transport.rb,
lib/nvruby/collective/algorithms/reduction_ops.rb,
lib/nvruby/collective/transport/rdma_transports.rb,
lib/nvruby/collective/transport/vmm_ipc_structs.rb,
lib/nvruby/collective/algorithms/topology_router.rb,
lib/nvruby/collective/transport/vmm_ipc_transport.rb,
lib/nvruby/collective/algorithms/double_binary_tree.rb,
lib/nvruby/collective/transport/host_staged_transport.rb
Defined Under Namespace
Modules: Algorithms, ArrayOps, NetworkDirect, NvArrayAdapter, P2PBindings, Topology, Transport, VMMBindings
Classes: Communicator, CommunicatorError, CommunicatorHealer, DeviceManager, DynamicOptimizer, Error, HealingError, HealthMonitor, ResilientTransport, TopologyError, TransportError, TransportSelector
Constant Summary
- VERSION = '0.1.0'
  NvCCL (Ignis Collective Communications Library) version. Note: NvCCL is NOT NCCL; this is an original design.
Class Method Summary
- .available_devices ⇒ Array<CUDA::Device>
  Get list of available GPUs.
- .boot! ⇒ void
  Boot the NvCCL collective layer.
- .booted? ⇒ Boolean
- .create_communicator(gpu_ids:, rank: 0, world_size: 1) ⇒ Communicator
  Create a new communicator for the specified GPUs.
- .detect_topology ⇒ Topology::Detector
  Detect GPU topology for all available devices.
- .device_count ⇒ Integer
- .multi_gpu_available? ⇒ Boolean
  Check if multi-GPU is available.
Class Method Details
.available_devices ⇒ Array<CUDA::Device>
Get list of available GPUs.
    # File 'lib/nvruby/collective.rb', line 83
    def available_devices
      CUDA::Device.list
    rescue StandardError
      []
    end
.boot! ⇒ void
This method returns an undefined value.
Boot the NvCCL collective layer.
Enumerates CUDA devices, detects topology, registers RecoveryProtocol callbacks, subscribes to EventBus events, and starts the HealthMonitor.
    # File 'lib/nvruby/collective.rb', line 49
    def boot!
      return if @booted

      register_recovery_callbacks!
      subscribe_event_bus!
      start_health_monitor!

      @booted = true
      $stderr.puts "[NvCCL] Booted — #{device_count} GPU(s) detected"
    end
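The guard at the top of boot! makes the call idempotent: once @booted is set, later calls return before any setup hook runs. The following is a minimal sketch of that pattern only. BootSketch and its hook_calls counter are hypothetical stand-ins invented for illustration; they are not part of Ignis::Collective, and a single counter replaces the three real setup hooks.

```ruby
# Hypothetical stand-in for the boot guard; NOT the library's own code.
module BootSketch
  class << self
    # Counts how many times the (simulated) setup hooks actually ran.
    attr_reader :hook_calls

    def boot!
      return if @booted # second and later calls are no-ops

      # Stands in for register_recovery_callbacks!, subscribe_event_bus!
      # and start_health_monitor! in the documented method.
      @hook_calls = (@hook_calls || 0) + 1
      @booted = true
    end

    def booted?
      # Normalise nil (never booted) to false so callers get a strict Boolean.
      @booted ? true : false
    end
  end
end

BootSketch.boot!
BootSketch.boot! # ignored: already booted
puts BootSketch.booted?    # true
puts BootSketch.hook_calls # 1
```

Note that the documented booted? returns @booted as is. If the variable is not initialised elsewhere in the module, that value would be nil rather than false before the first boot!, which is still falsy in a conditional.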
.booted? ⇒ Boolean
    # File 'lib/nvruby/collective.rb', line 61
    def booted?
      @booted
    end
.create_communicator(gpu_ids:, rank: 0, world_size: 1) ⇒ Communicator
Create a new communicator for the specified GPUs.
    # File 'lib/nvruby/collective.rb', line 71
    def create_communicator(gpu_ids:, rank: 0, world_size: 1)
      Communicator.new(gpu_ids: gpu_ids, rank: rank, world_size: world_size)
    end
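To illustrate the call shape, the sketch below mirrors the keyword signature with a Struct in place of the real class. CommunicatorSketch and CollectiveSketch are hypothetical names, and nothing here reflects what Communicator does internally. Reading rank and world_size as "index of this process" and "total number of processes" is an assumption based on common collective-library usage, not something this page states.

```ruby
# Hypothetical stand-in: CommunicatorSketch is NOT Ignis::Collective::Communicator.
CommunicatorSketch = Struct.new(:gpu_ids, :rank, :world_size, keyword_init: true)

module CollectiveSketch
  # Same keyword signature and defaults as the documented factory method.
  def self.create_communicator(gpu_ids:, rank: 0, world_size: 1)
    CommunicatorSketch.new(gpu_ids: gpu_ids, rank: rank, world_size: world_size)
  end
end

# Only gpu_ids is required; the defaults describe a single participant.
single = CollectiveSketch.create_communicator(gpu_ids: [0, 1])
puts single.rank       # 0
puts single.world_size # 1

# Overriding the defaults (assumed meaning: second of two participants).
second = CollectiveSketch.create_communicator(gpu_ids: [2, 3], rank: 1, world_size: 2)
puts second.rank       # 1
```

Because gpu_ids: has no default, omitting it raises ArgumentError at the call site.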
.detect_topology ⇒ Topology::Detector
Detect GPU topology for all available devices.
    # File 'lib/nvruby/collective.rb', line 77
    def detect_topology
      Topology::Detector.new
    end
.device_count ⇒ Integer
    # File 'lib/nvruby/collective.rb', line 90
    def device_count
      CUDA::Device.count
    rescue StandardError
      0
    end
.multi_gpu_available? ⇒ Boolean
Check if multi-GPU is available.
    # File 'lib/nvruby/collective.rb', line 98
    def multi_gpu_available?
      device_count > 1
    end
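The three probe methods share one fallback pattern: any StandardError raised while querying CUDA is swallowed and replaced with an empty result (0 or []), so multi_gpu_available? returns false on a machine without a working driver instead of raising. The sketch below reproduces that behaviour with injectable backends. ProbeSketch, FailingBackend and TwoGpuBackend are hypothetical names used only for illustration; the real methods call CUDA::Device directly.

```ruby
# Hypothetical backend that behaves like a missing or broken CUDA driver.
module FailingBackend
  def self.count
    raise "CUDA driver not loaded" # RuntimeError, a StandardError subclass
  end

  def self.list
    raise "CUDA driver not loaded"
  end
end

# Hypothetical backend reporting two devices.
TwoGpuBackend = Struct.new(:count, :list).new(2, %w[gpu0 gpu1])

# Sketch of the rescue-to-default pattern used by the documented methods.
module ProbeSketch
  def self.device_count(backend)
    backend.count
  rescue StandardError
    0
  end

  def self.available_devices(backend)
    backend.list
  rescue StandardError
    []
  end

  def self.multi_gpu_available?(backend)
    device_count(backend) > 1
  end
end

puts ProbeSketch.device_count(FailingBackend)             # 0
puts ProbeSketch.available_devices(FailingBackend).empty? # true
puts ProbeSketch.multi_gpu_available?(FailingBackend)     # false
puts ProbeSketch.multi_gpu_available?(TwoGpuBackend)      # true
```

One consequence of this design is that a failed query and a machine with no GPUs are indistinguishable to the caller, since both yield 0 or an empty array.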