Class: Ignis::Collective::TransportSelector

Inherits:
Object
Defined in:
lib/nvruby/collective/transport_selector.rb

Overview

NCCL-style automatic transport selector. Detects topology at init time and selects the optimal transport for each GPU pair.

Constant Summary

TRANSPORT_PRIORITY =

Transport types ranked by performance (highest first)

[
  :nvlink,          # NVLink - 900 GB/s
  :pcie_p2p,        # PCIe P2P - 32 GB/s
  :cuda_vmm_ipc,    # cuMem VMM IPC - 25 GB/s
  :cuda_ipc,        # Legacy CUDA IPC - 20 GB/s
  :host_staged,     # Host staging - 12 GB/s
  :rio_network,     # Windows RIO - 100 Gbps
  :tcp,             # TCP fallback - variable
].freeze
TRANSPORT_CLASSES =

Map interconnect types to transport classes

{
  nvlink: Transport::P2PTransport,
  pcie_p2p: Transport::P2PTransport,
  cuda_ipc: Transport::IPCTransport,
  cuda_vmm_ipc: Transport::IPCTransport,
  host_staged: nil,  # TODO: Implement SHMTransport
  rio_network: nil,  # TODO: Implement RIOTransport
  tcp: nil,          # TODO: Implement TCPTransport
}.freeze
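
The two constants can be read together: walk the priority list from the top and take the first type that is both available for the pair and mapped to an implemented class. A minimal sketch of that idea, assuming stand-in classes in place of Transport::P2PTransport and Transport::IPCTransport. The helper `pick_transport_class` and its `available` argument are hypothetical and not part of the documented API.

```ruby
# Stand-ins for Transport::P2PTransport and Transport::IPCTransport (assumed for illustration).
SketchP2P = Class.new
SketchIPC = Class.new

# Same ordering as TRANSPORT_PRIORITY above.
SKETCH_PRIORITY = %i[nvlink pcie_p2p cuda_vmm_ipc cuda_ipc host_staged rio_network tcp].freeze

# Same shape as TRANSPORT_CLASSES above; nil marks a type with no class yet.
SKETCH_CLASSES = {
  nvlink: SketchP2P,
  pcie_p2p: SketchP2P,
  cuda_ipc: SketchIPC,
  cuda_vmm_ipc: SketchIPC,
  host_staged: nil,
  rio_network: nil,
  tcp: nil
}.freeze

# Hypothetical helper: returns [type, klass] for the highest-priority
# available type that has a class, or nil if none qualifies.
def pick_transport_class(available)
  type = SKETCH_PRIORITY.find { |t| available.include?(t) && SKETCH_CLASSES[t] }
  type && [type, SKETCH_CLASSES[type]]
end

choice = pick_transport_class(%i[tcp cuda_ipc pcie_p2p])
fallback = pick_transport_class(%i[tcp host_staged])
```

In this sketch a pair that only has unimplemented types available yields nil, which is one way a caller could notice that no usable transport exists.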


Constructor Details

#initialize(device_ids) ⇒ TransportSelector

Creates a transport selector for the given GPUs. The ID list is copied and frozen.

Parameters:

  • device_ids (Array<Integer>)

    GPU device IDs



# File 'lib/nvruby/collective/transport_selector.rb', line 46

def initialize(device_ids)
  @device_ids = device_ids.dup.freeze
  @topology = Topology::Detector.new(device_ids: @device_ids)
  @transport_matrix = {}
  @initialized = false
end

Instance Attribute Details

#device_ids ⇒ Array<Integer> (readonly)

Returns the GPU device IDs in this communicator.

Returns:

  • (Array<Integer>)

    GPU device IDs in this communicator



# File 'lib/nvruby/collective/transport_selector.rb', line 36

def device_ids
  @device_ids
end

#topology ⇒ Topology::Detector (readonly)

Returns the topology detector.

Returns:

  • (Topology::Detector)

    Topology detector



# File 'lib/nvruby/collective/transport_selector.rb', line 39

def topology
  @topology
end

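#transport_matrix ⇒ Hash<Array<Integer>, Transport::Base> (readonly)

Returns the transport matrix, keyed by [src, dst] GPU pair.

Returns:

  • (Hash<Array<Integer>, Transport::Base>)

    Transport matrix, keyed by [src, dst] GPU pair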



# File 'lib/nvruby/collective/transport_selector.rb', line 42

def transport_matrix
  @transport_matrix
end

Instance Method Details

#destroy! ⇒ void

This method returns an undefined value.

Destroys every transport, clears the transport matrix, and marks the selector as uninitialized.



# File 'lib/nvruby/collective/transport_selector.rb', line 132

def destroy!
  @transport_matrix.each_value(&:destroy!)
  @transport_matrix.clear
  @initialized = false
end

#initialize! ⇒ void

This method returns an undefined value.

Builds the transport matrix and initializes all transports. Does nothing if the selector is already initialized.



# File 'lib/nvruby/collective/transport_selector.rb', line 55

def initialize!
  return if @initialized

  build_transport_matrix!
  initialize_transports!
  @initialized = true
end
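
The `@initialized` guard makes repeated calls safe, and #destroy! resets the flag so the selector can be initialized again. A minimal sketch of that lifecycle, using a hypothetical stand-in class that mirrors only the guard logic; the counter replaces the real matrix-building work, which is not shown on this page.

```ruby
# Hypothetical stand-in that mirrors only the guard logic of
# #initialize! and #destroy!; it does not touch any GPU state.
class LifecycleSketch
  attr_reader :build_count

  def initialize
    @initialized = false
    @build_count = 0
  end

  def initialize!
    return if @initialized

    @build_count += 1 # stands in for build_transport_matrix! and initialize_transports!
    @initialized = true
  end

  def destroy!
    @initialized = false
  end
end

sketch = LifecycleSketch.new
sketch.initialize!
sketch.initialize! # no-op: already initialized
builds_before_destroy = sketch.build_count

sketch.destroy!
sketch.initialize! # runs again because destroy! reset the flag
builds_after_destroy = sketch.build_count
```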

#optimal_ring_order ⇒ Array<Integer>

Returns the optimal ring order, as computed by the topology detector.

Returns:

  • (Array<Integer>)

    Ordered GPU IDs for ring algorithm



# File 'lib/nvruby/collective/transport_selector.rb', line 84

def optimal_ring_order
  @topology.optimal_ring_order
end
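
A ring algorithm typically sends from each GPU to its successor in the order, with the last GPU wrapping around to the first. A minimal sketch of deriving those hops, assuming a made-up order; `ring_hops` is a hypothetical helper and the order shown is not taken from any real topology.

```ruby
# Hypothetical ring order; in practice this would come from #optimal_ring_order.
ring = [0, 2, 3, 1]

# Pairs each GPU with its successor. Array#rotate shifts the list by one,
# so the last GPU is paired with the first.
def ring_hops(order)
  order.zip(order.rotate)
end

hops = ring_hops(ring)
hops.each { |src, dst| puts "GPU #{src} -> GPU #{dst}" }
```

Each `[src, dst]` pair has the same shape as the arguments to #select_transport.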

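#performance_summary ⇒ Hash

Returns a performance summary suitable for logging.

Returns:

  • (Hash)

    Path counts (:total_paths, :nvlink_paths, :p2p_paths, :ipc_paths, :staged_paths) and the mean estimated bandwidth (:avg_bandwidth_gbps)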



# File 'lib/nvruby/collective/transport_selector.rb', line 96

def performance_summary
  nvlink_count = 0
  p2p_count = 0
  ipc_count = 0
  staged_count = 0

  @transport_matrix.each_value do |transport|
    case transport
    when Transport::P2PTransport
      if transport.interconnect_type == :nvlink
        nvlink_count += 1
      else
        p2p_count += 1
      end
    when Transport::IPCTransport
      ipc_count += 1
    else
      staged_count += 1
    end
  end

  total = @device_ids.size * (@device_ids.size - 1)
  # to_f keeps the mean fractional even if estimated_bandwidth returns Integers
  avg_bandwidth = @transport_matrix.values.sum(&:estimated_bandwidth).to_f / [@transport_matrix.size, 1].max

  {
    total_paths: total,
    nvlink_paths: nvlink_count,
    p2p_paths: p2p_count,
    ipc_paths: ipc_count,
    staged_paths: staged_count,
    avg_bandwidth_gbps: avg_bandwidth.round(1),
  }
end
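
The `total` figure counts directed paths: every ordered (src, dst) pair with src != dst, which is n * (n - 1) for n GPUs. A short worked sketch of that count and of the mean calculation follows. The bandwidth values are assumptions borrowed from the TRANSPORT_PRIORITY comments, and the `.to_f` is an assumption that keeps the division fractional when the estimates are Integers.

```ruby
device_ids = [0, 1, 2, 3]

# Every ordered (src, dst) pair with src != dst is one directed path.
pairs = device_ids.permutation(2).to_a
total = device_ids.size * (device_ids.size - 1)

# Assumed per-path estimates in GB/s, for illustration only.
bandwidths = [900, 900, 32, 32, 20]

# The [size, 1].max guard avoids dividing by zero for an empty matrix.
avg = bandwidths.sum.to_f / [bandwidths.size, 1].max
empty_avg = [].sum.to_f / [0, 1].max

puts "#{total} directed paths, mean #{avg.round(1)} GB/s"
```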

#ready? ⇒ Boolean

Checks whether all transports are ready.

Returns:

  • (Boolean)

    True if the selector is initialized and every transport in the matrix reports ready



# File 'lib/nvruby/collective/transport_selector.rb', line 90

def ready?
  @initialized && @transport_matrix.values.all?(&:ready?)
end

#select_transport(src, dst) ⇒ Transport::Base?

Returns the transport selected for a GPU pair.

Parameters:

  • src (Integer)

    Source GPU

  • dst (Integer)

    Destination GPU

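Returns:

  • (Transport::Base, nil)

    Transport for the pair, or nil if src and dst are the same GPU or no transport is registered for the pair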



# File 'lib/nvruby/collective/transport_selector.rb', line 67

def select_transport(src, dst)
  return nil if src == dst

  @transport_matrix[[src, dst]]
end
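
Because the matrix is keyed by `[src, dst]` arrays, the lookup is directional: `[0, 2]` and `[2, 0]` are separate entries. A minimal sketch of the same lookup against a hand-built Hash, with symbols standing in for transport objects. The `lookup` helper and the matrix contents are hypothetical.

```ruby
# Hypothetical matrix; symbols stand in for Transport::Base instances.
matrix = {
  [0, 1] => :nvlink,
  [1, 0] => :nvlink,
  [0, 2] => :pcie_p2p
}

# Mirrors the logic of #select_transport: no transport to self,
# otherwise a plain Hash lookup on the ordered pair.
def lookup(matrix, src, dst)
  return nil if src == dst

  matrix[[src, dst]]
end

forward = lookup(matrix, 0, 2)  # entry exists
backward = lookup(matrix, 2, 0) # no entry for the reverse direction
same_gpu = lookup(matrix, 0, 0) # a GPU has no transport to itself
```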

#to_s ⇒ String

Returns a human-readable summary.

Returns:

  • (String)

    Human-readable summary



# File 'lib/nvruby/collective/transport_selector.rb', line 139

def to_s
  stats = performance_summary
  "TransportSelector[#{@device_ids.size} GPUs]: " \
    "#{stats[:nvlink_paths]} NVLink, #{stats[:p2p_paths]} P2P, " \
    "#{stats[:ipc_paths]} IPC (avg #{stats[:avg_bandwidth_gbps]} GB/s)"
end

#transport_type(src, dst) ⇒ Symbol?

Get transport type for a GPU pair

Parameters:

  • src (Integer)

    Source GPU

  • dst (Integer)

    Destination GPU

Returns:

  • (Symbol, nil)

    Transport type, or nil if no transport exists for the pair



# File 'lib/nvruby/collective/transport_selector.rb', line 77

def transport_type(src, dst)
  transport = select_transport(src, dst)
  transport&.class&.transport_type
end
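
The safe-navigation operator (`&.`) is what lets this method return nil instead of raising when no transport exists for the pair. A minimal sketch, assuming a hypothetical transport class that exposes `transport_type` at the class level as the call above implies; `SketchTypedTransport` and `type_of` are illustrative names only.

```ruby
# Hypothetical transport class with a class-level transport_type.
class SketchTypedTransport
  def self.transport_type
    :cuda_ipc
  end
end

# Same chain as in #transport_type: a nil transport short-circuits to nil.
def type_of(transport)
  transport&.class&.transport_type
end

present = type_of(SketchTypedTransport.new)
missing = type_of(nil)
```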