Class: Ignis::Collective::TransportSelector
- Inherits: Object
  - Object
    - Ignis::Collective::TransportSelector
- Defined in: lib/nvruby/collective/transport_selector.rb
Overview
Automatic, NCCL-style transport selector. Detects the topology at init time and selects the optimal transport for each GPU pair.
Constant Summary
- TRANSPORT_PRIORITY =
Transport types ranked by performance (highest first)
[
  :nvlink,       # NVLink - 900 GB/s
  :pcie_p2p,     # PCIe P2P - 32 GB/s
  :cuda_vmm_ipc, # cuMem VMM IPC - 25 GB/s
  :cuda_ipc,     # Legacy CUDA IPC - 20 GB/s
  :host_staged,  # Host staging - 12 GB/s
  :rio_network,  # Windows RIO - 100 Gbps
  :tcp,          # TCP fallback - variable
].freeze
- TRANSPORT_CLASSES =
Map interconnect types to transport classes
{
  nvlink: Transport::P2PTransport,
  pcie_p2p: Transport::P2PTransport,
  cuda_ipc: Transport::IPCTransport,
  cuda_vmm_ipc: Transport::IPCTransport,
  host_staged: nil, # TODO: Implement SHMTransport
  rio_network: nil, # TODO: Implement RIOTransport
  tcp: nil, # TODO: Implement TCPTransport
}.freeze
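As a minimal sketch of how a priority list like TRANSPORT_PRIORITY could drive selection, the snippet below picks the first transport type, in priority order, that a GPU pair supports. The `PRIORITY` constant, the `pick_transport` helper, and the `available` argument are hypothetical and not part of TransportSelector; the actual selection logic is not shown on this page.

```ruby
# Hypothetical illustration; the order mirrors the TRANSPORT_PRIORITY constant.
PRIORITY = %i[nvlink pcie_p2p cuda_vmm_ipc cuda_ipc host_staged rio_network tcp].freeze

# Returns the highest-priority transport type present in `available`,
# or nil when none of the known types is available.
def pick_transport(available)
  PRIORITY.find { |type| available.include?(type) }
end

pick_transport(%i[tcp cuda_ipc pcie_p2p]) # => :pcie_p2p
pick_transport([])                        # => nil
```

Note that a type selected this way may still map to nil in TRANSPORT_CLASSES, since several transport classes are marked as TODO.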
Instance Attribute Summary
- #device_ids ⇒ Array<Integer> (readonly)
  GPU device IDs in this communicator.
- #topology ⇒ Topology::Detector (readonly)
  Topology detector.
- #transport_matrix ⇒ Hash<Array<Integer>, Transport::Base> (readonly)
  Transport matrix.
Instance Method Summary
- #destroy! ⇒ void
  Clean up all transports.
- #initialize(device_ids) ⇒ TransportSelector (constructor)
  Create transport selector for given GPUs.
- #initialize! ⇒ void
  Initialize all transports.
- #optimal_ring_order ⇒ Array<Integer>
  Get optimal ring order based on topology.
- #performance_summary ⇒ Hash
  Get performance summary for logging.
- #ready? ⇒ Boolean
  Check if all transports are ready.
- #select_transport(src, dst) ⇒ Transport::Base?
  Get optimal transport for a GPU pair.
- #to_s ⇒ String
  Human-readable summary.
- #transport_type(src, dst) ⇒ Symbol?
  Get transport type for a GPU pair.
Constructor Details
#initialize(device_ids) ⇒ TransportSelector
Create transport selector for given GPUs
# File 'lib/nvruby/collective/transport_selector.rb', line 46

def initialize(device_ids)
  @device_ids = device_ids.dup.freeze
  @topology = Topology::Detector.new(device_ids: @device_ids)
  @transport_matrix = {}
  @initialized = false
end
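The constructor stores `device_ids.dup.freeze`, so later changes to the caller's array do not affect the selector. The sketch below shows that defensive-copy pattern in isolation; the `IdHolder` class is a hypothetical stand-in, not part of the library.

```ruby
# Hypothetical stand-in that copies and freezes its input the same way.
class IdHolder
  attr_reader :device_ids

  def initialize(device_ids)
    @device_ids = device_ids.dup.freeze
  end
end

ids = [0, 1]
holder = IdHolder.new(ids)
ids << 2                   # mutating the caller's array...
holder.device_ids          # => [0, 1]  ...leaves the stored copy untouched
holder.device_ids.frozen?  # => true
```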
Instance Attribute Details
#device_ids ⇒ Array<Integer> (readonly)
Returns GPU device IDs in this communicator.
# File 'lib/nvruby/collective/transport_selector.rb', line 36

def device_ids
  @device_ids
end
#topology ⇒ Topology::Detector (readonly)
Returns the topology detector.
# File 'lib/nvruby/collective/transport_selector.rb', line 39

def topology
  @topology
end
#transport_matrix ⇒ Hash<Array<Integer>, Transport::Base> (readonly)
Returns the transport matrix, keyed by [src, dst] device ID pairs.
# File 'lib/nvruby/collective/transport_selector.rb', line 42

def transport_matrix
  @transport_matrix
end
Instance Method Details
#destroy! ⇒ void
This method returns an undefined value.
Clean up all transports
# File 'lib/nvruby/collective/transport_selector.rb', line 132

def destroy!
  @transport_matrix.each_value(&:destroy!)
  @transport_matrix.clear
  @initialized = false
end
#initialize! ⇒ void
This method returns an undefined value.
Initialize all transports
# File 'lib/nvruby/collective/transport_selector.rb', line 55

def initialize!
  return if @initialized

  build_transport_matrix!
  initialize_transports!
  @initialized = true
end
#optimal_ring_order ⇒ Array<Integer>
Get optimal ring order based on topology
# File 'lib/nvruby/collective/transport_selector.rb', line 84

def optimal_ring_order
  @topology.optimal_ring_order
end
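A ring order is typically consumed as a list of neighbor pairs, where each GPU sends to the next one and the last wraps around to the first. The sketch below derives those pairs from a sample order; the `ring_pairs` helper and the sample order are hypothetical, and the real order comes from Topology::Detector#optimal_ring_order.

```ruby
# Hypothetical helper: turns a ring order into [src, dst] neighbor pairs,
# wrapping the last device back to the first.
def ring_pairs(order)
  return [] if order.size < 2

  order.each_with_index.map { |src, i| [src, order[(i + 1) % order.size]] }
end

ring_pairs([0, 2, 3, 1]) # => [[0, 2], [2, 3], [3, 1], [1, 0]]
```

Each resulting pair could then be passed to select_transport(src, dst) to look up the transport for that hop.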
#performance_summary ⇒ Hash
Get performance summary for logging
# File 'lib/nvruby/collective/transport_selector.rb', line 96

def performance_summary
  nvlink_count = 0
  p2p_count = 0
  ipc_count = 0
  staged_count = 0

  @transport_matrix.each_value do |transport|
    case transport
    when Transport::P2PTransport
      if transport.interconnect_type == :nvlink
        nvlink_count += 1
      else
        p2p_count += 1
      end
    when Transport::IPCTransport
      ipc_count += 1
    else
      staged_count += 1
    end
  end

  total = @device_ids.size * (@device_ids.size - 1)
  avg_bandwidth = @transport_matrix.values.sum(&:estimated_bandwidth) / [@transport_matrix.size, 1].max

  {
    total_paths: total,
    nvlink_paths: nvlink_count,
    p2p_paths: p2p_count,
    ipc_paths: ipc_count,
    staged_paths: staged_count,
    avg_bandwidth_gbps: avg_bandwidth.round(1),
  }
end
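To make the arithmetic concrete: with N devices there are N * (N - 1) directed paths, and the average divides the summed per-path bandwidth by the number of matrix entries, clamped to at least 1 so an empty matrix does not divide by zero. The sketch below reproduces both calculations with made-up figures. The helper names are hypothetical, and it assumes estimated_bandwidth returns numeric GB/s values, which this page does not confirm.

```ruby
# Number of directed (src, dst) paths among n devices, excluding src == dst.
def total_paths(n)
  n * (n - 1)
end

# Mean of the given bandwidths; the [size, 1].max guard avoids dividing by zero.
def average_bandwidth(bandwidths)
  bandwidths.sum / [bandwidths.size, 1].max
end

total_paths(4)                                           # => 12
average_bandwidth([900.0, 900.0, 32.0, 32.0]).round(1)   # => 466.0
average_bandwidth([])                                    # => 0
average_bandwidth([900, 25])                             # => 462 (Integer division)
```

The last call shows a detail worth checking in practice: if every bandwidth value is an Integer, Ruby's `/` truncates, so the rounded average would lose its fractional part.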
#ready? ⇒ Boolean
Check if all transports are ready
# File 'lib/nvruby/collective/transport_selector.rb', line 90

def ready?
  @initialized && @transport_matrix.values.all?(&:ready?)
end
#select_transport(src, dst) ⇒ Transport::Base?
Get optimal transport for a GPU pair
# File 'lib/nvruby/collective/transport_selector.rb', line 67

def select_transport(src, dst)
  return nil if src == dst

  @transport_matrix[[src, dst]]
end
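Because the matrix is keyed by `[src, dst]` arrays, lookups are directional: `[0, 1]` and `[1, 0]` are separate entries. The sketch below builds such a Hash for every ordered pair using Array#permutation, with symbols standing in for transport objects. The `build_matrix` and `lookup` names are hypothetical, and how the real matrix is populated is not shown on this page.

```ruby
# Hypothetical stand-in for the transport matrix: one entry per ordered pair.
def build_matrix(device_ids)
  device_ids.permutation(2).to_h do |src, dst|
    [[src, dst], :"transport_#{src}_to_#{dst}"]
  end
end

# Mirrors select_transport: no transport is needed from a device to itself.
def lookup(matrix, src, dst)
  return nil if src == dst

  matrix[[src, dst]]
end

matrix = build_matrix([0, 1, 2])
matrix.size          # => 6
lookup(matrix, 0, 1) # => :transport_0_to_1
lookup(matrix, 1, 0) # => :transport_1_to_0
lookup(matrix, 1, 1) # => nil
```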
#to_s ⇒ String
Returns a human-readable summary.
# File 'lib/nvruby/collective/transport_selector.rb', line 139

def to_s
  stats = performance_summary
  "TransportSelector[#{@device_ids.size} GPUs]: " \
    "#{stats[:nvlink_paths]} NVLink, #{stats[:p2p_paths]} P2P, " \
    "#{stats[:ipc_paths]} IPC (avg #{stats[:avg_bandwidth_gbps]} GB/s)"
end
#transport_type(src, dst) ⇒ Symbol?
Get transport type for a GPU pair
# File 'lib/nvruby/collective/transport_selector.rb', line 77

def transport_type(src, dst)
  transport = select_transport(src, dst)
  transport&.class&.transport_type
end