Class: Ignis::Collective::DeviceManager
- Inherits: Object
- Defined in: lib/nvruby/collective/device_manager.rb
Overview
Multi-GPU device manager. Handles device enumeration, context management, and peer access configuration.
Instance Attribute Summary
- #device_ids ⇒ Array<Integer> (readonly): Managed GPU device IDs.
- #devices ⇒ Hash<Integer, CUDA::Device> (readonly): Device objects by ID.
- #p2p_access_enabled ⇒ Hash<Array<Integer>, Boolean> (readonly): P2P access status.
- #topology ⇒ Topology::Detector (readonly): Topology detector.
Instance Method Summary
- #destroy! ⇒ void: Clean up resources.
- #detect_topology! ⇒ Topology::Detector: Detect GPU topology.
- #device(device_id) ⇒ CUDA::Device?: Get device by ID.
- #disable_all_p2p_access! ⇒ void: Disable all P2P access.
- #enable_all_p2p_access! ⇒ Hash<Array<Integer>, Boolean>: Enable P2P access between all GPU pairs where available.
- #initialize(device_ids: nil) ⇒ DeviceManager (constructor): Create device manager for specified GPUs.
- #initialize! ⇒ void: Initialize device manager and detect topology.
- #optimal_ring_order ⇒ Array<Integer>: Get optimal ring order for collective operations.
- #p2p_summary ⇒ Hash: Get P2P capability summary.
- #ready? ⇒ Boolean: Check if fully initialized.
- #set_device!(device_id) ⇒ void: Set current CUDA device.
- #size ⇒ Integer: Get number of managed GPUs.
- #synchronize!(device_id) ⇒ void: Synchronize a device.
- #synchronize_all! ⇒ void: Synchronize all managed devices.
- #to_s ⇒ String: Human-readable summary.
Constructor Details
#initialize(device_ids: nil) ⇒ DeviceManager
Create a device manager for the specified GPUs. When device_ids is nil, the IDs returned by the private all_device_ids helper are used. The ID list is copied and frozen, then validated, and a device object is created for each ID.

    # File 'lib/nvruby/collective/device_manager.rb', line 25
    def initialize(device_ids: nil)
      @device_ids = (device_ids || all_device_ids).dup.freeze
      @devices = {}
      @topology = nil
      @p2p_access_enabled = {}
      @initialized = false

      validate_devices!
      create_device_objects!
    end
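Because the constructor stores `(device_ids || all_device_ids).dup.freeze`, later changes to the caller's array should not reach the manager. The following is a minimal sketch of that idiom in plain Ruby, not the library's code: `normalize_ids` and `FALLBACK_IDS` are hypothetical names, and the hard-coded fallback merely stands in for whatever the private `all_device_ids` helper returns.

```ruby
# Hypothetical stand-in for the private all_device_ids helper.
FALLBACK_IDS = [0, 1].freeze

# Mirrors the (device_ids || all_device_ids).dup.freeze idiom:
# fall back when nil, copy so the caller's array is not shared,
# and freeze so the stored list cannot be mutated.
def normalize_ids(device_ids = nil)
  (device_ids || FALLBACK_IDS).dup.freeze
end

caller_ids = [2, 3]
managed = normalize_ids(caller_ids)
caller_ids << 4            # mutate the caller's array afterwards
puts managed.inspect       # the stored copy is unaffected
puts managed.frozen?
```

The copy protects against aliasing; the freeze turns accidental in-place mutation of the stored list into an error.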
Instance Attribute Details
#device_ids ⇒ Array<Integer> (readonly)
Returns the managed GPU device IDs.

    # File 'lib/nvruby/collective/device_manager.rb', line 12
    def device_ids
      @device_ids
    end
#devices ⇒ Hash<Integer, CUDA::Device> (readonly)
Returns the device objects, keyed by ID.

    # File 'lib/nvruby/collective/device_manager.rb', line 15
    def devices
      @devices
    end
#p2p_access_enabled ⇒ Hash<Array<Integer>, Boolean> (readonly)
Returns the P2P access status, keyed by [src, dst] pair.

    # File 'lib/nvruby/collective/device_manager.rb', line 21
    def p2p_access_enabled
      @p2p_access_enabled
    end
#topology ⇒ Topology::Detector (readonly)
Returns the topology detector.

    # File 'lib/nvruby/collective/device_manager.rb', line 18
    def topology
      @topology
    end
Instance Method Details
#destroy! ⇒ void
This method returns an undefined value.
Clean up resources.

    # File 'lib/nvruby/collective/device_manager.rb', line 163
    def destroy!
      disable_all_p2p_access!
      @devices.clear
      @topology = nil
      @initialized = false
    end
#detect_topology! ⇒ Topology::Detector
Detect GPU topology.

    # File 'lib/nvruby/collective/device_manager.rb', line 47
    def detect_topology!
      @topology = Topology::Detector.new(device_ids: @device_ids)
    end
#device(device_id) ⇒ CUDA::Device?
Get device by ID. Returns nil when the ID is not managed.

    # File 'lib/nvruby/collective/device_manager.rb', line 108
    def device(device_id)
      @devices[device_id]
    end
#disable_all_p2p_access! ⇒ void
This method returns an undefined value.
Disable all P2P access. Every recorded pair is attempted, errors are ignored, and the status hash is cleared afterwards.

    # File 'lib/nvruby/collective/device_manager.rb', line 88
    def disable_all_p2p_access!
      @p2p_access_enabled.each_key do |(src, dst)|
        CUDA::RuntimeAPI.cudaSetDevice(src)
        P2PBindings.cudaDeviceDisablePeerAccess(dst)
      rescue StandardError
        # Ignore errors during cleanup
      end
      @p2p_access_enabled.clear
    end
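The teardown loop is best-effort: a failure on one pair does not stop the remaining pairs from being processed. The sketch below reproduces only that control flow in plain Ruby. `disable_pairs` and its `disable:` callable are hypothetical stand-ins for the `cudaSetDevice` and `cudaDeviceDisablePeerAccess` calls, so no GPU is needed to run it.

```ruby
# Best-effort teardown over a { [src, dst] => Boolean } hash.
# `disable` is a hypothetical callable taking (src, dst).
def disable_pairs(p2p_access_enabled, disable:)
  p2p_access_enabled.each_key do |(src, dst)|
    disable.call(src, dst)
  rescue StandardError
    # Ignore errors during cleanup and move on to the next pair.
  end
  p2p_access_enabled.clear
end

attempted = []
state = { [0, 1] => true, [1, 0] => true }
disable_pairs(state, disable: lambda { |src, dst|
  attempted << [src, dst]
  raise "simulated driver error" if src == 0
})
puts attempted.inspect   # both pairs were attempted despite the error
puts state.empty?
```

Note that a `rescue` clause directly inside a `do ... end` block requires Ruby 2.5 or later.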
#enable_all_p2p_access! ⇒ Hash<Array<Integer>, Boolean>
Enable P2P access between all GPU pairs where available. Each ordered (src, dst) pair is probed once; a non-empty status hash from an earlier call is returned as-is without probing again.

    # File 'lib/nvruby/collective/device_manager.rb', line 53
    def enable_all_p2p_access!
      return @p2p_access_enabled unless @p2p_access_enabled.empty?

      detect_topology! unless @topology
      P2PBindings.ensure_loaded!
      CUDA::RuntimeAPI.ensure_loaded!

      @device_ids.each do |src|
        @device_ids.each do |dst|
          next if src == dst

          # Check if P2P is possible
          unless @topology.p2p_available?(src, dst)
            @p2p_access_enabled[[src, dst]] = false
            next
          end

          # Set source device context
          status = CUDA::RuntimeAPI.cudaSetDevice(src)
          CUDA::RuntimeAPI.check_status!(status, "Set device #{src}")

          # Enable peer access
          status = P2PBindings.cudaDeviceEnablePeerAccess(dst, 0)

          # 0 = success, 704 = already enabled
          @p2p_access_enabled[[src, dst]] = status.zero? || status == 704
        end
      end

      @p2p_access_enabled
    end
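The pair enumeration and status mapping can be exercised without a GPU. The sketch below is an illustration under stated assumptions rather than the library's implementation: `enable_pairs`, `available:` and `enable:` are hypothetical names, with the two callables standing in for `Topology::Detector#p2p_available?` and `cudaDeviceEnablePeerAccess`. It assumes the ID list has no duplicates, in which case `permutation(2)` visits the same ordered pairs as the nested loop.

```ruby
CUDA_SUCCESS = 0
# CUDA runtime code for cudaErrorPeerAccessAlreadyEnabled.
CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED = 704
OK_STATUSES = [CUDA_SUCCESS, CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED].freeze

# Builds { [src, dst] => Boolean } for every ordered pair of distinct IDs.
# available: hypothetical callable (src, dst) -> true/false
# enable:    hypothetical callable (src, dst) -> Integer status code
def enable_pairs(device_ids, available:, enable:)
  device_ids.permutation(2).each_with_object({}) do |(src, dst), result|
    unless available.call(src, dst)
      result[[src, dst]] = false   # no P2P path: record and skip
      next
    end
    result[[src, dst]] = OK_STATUSES.include?(enable.call(src, dst))
  end
end

result = enable_pairs(
  [0, 1, 2],
  # Pretend devices 0 and 2 have no P2P path in either direction.
  available: ->(src, dst) { [src, dst].sort != [0, 2] },
  # Pretend 1 -> 0 was already enabled and 2 -> 1 fails with status 1.
  enable: ->(src, dst) { { [1, 0] => 704, [2, 1] => 1 }.fetch([src, dst], 0) }
)
result.each { |pair, ok| puts "#{pair.inspect} => #{ok}" }
```

Access is directional, so `[0, 1]` and `[1, 0]` are separate entries and can end up with different values.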
#initialize! ⇒ void
This method returns an undefined value.
Initialize device manager and detect topology. Calling it again after a successful run is a no-op.

    # File 'lib/nvruby/collective/device_manager.rb', line 38
    def initialize!
      return if @initialized

      detect_topology!
      @initialized = true
    end
#optimal_ring_order ⇒ Array<Integer>
Get optimal ring order for collective operations. Topology is detected on demand if it has not been detected yet.

    # File 'lib/nvruby/collective/device_manager.rb', line 100
    def optimal_ring_order
      detect_topology! unless @topology
      @topology.optimal_ring_order
    end
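This page does not show how the returned ordering is consumed. One common pattern in ring-based collectives is for each device to send to its successor and receive from its predecessor, wrapping around at the ends. The sketch below illustrates that pattern; `ring_neighbors` is a hypothetical helper and not part of the documented API.

```ruby
# Maps each device ID to its ring neighbours for a given ordering.
# Returns { id => { send_to: next_id, recv_from: previous_id } }.
def ring_neighbors(order)
  n = order.size
  order.each_with_index.map do |id, i|
    [id, { send_to: order[(i + 1) % n], recv_from: order[(i - 1) % n] }]
  end.to_h
end

# Example ordering, such as one that #optimal_ring_order might return.
ring_neighbors([0, 2, 3, 1]).each do |id, peers|
  puts "GPU #{id}: send to #{peers[:send_to]}, receive from #{peers[:recv_from]}"
end
```

Ruby's `%` returns a non-negative result for a positive divisor, so `(i - 1) % n` wraps the first position around to the last without a special case.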
#p2p_summary ⇒ Hash
Get P2P capability summary. Returns an empty hash if topology has not been detected. total_paths counts ordered pairs, so N GPUs give N * (N - 1) paths.

    # File 'lib/nvruby/collective/device_manager.rb', line 148
    def p2p_summary
      return {} unless @topology

      matrix = @topology.matrix
      {
        gpu_count: @device_ids.size,
        total_paths: @device_ids.size * (@device_ids.size - 1),
        p2p_enabled: @p2p_access_enabled.count { |_, v| v },
        nvlink_paths: matrix.nvlink_paths.size,
        full_mesh: matrix.full_p2p_mesh?,
      }
    end
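The counting fields of the summary depend only on the ID list and the status hash, so they can be checked in isolation. The sketch below uses a hypothetical `summarize` helper and leaves out `nvlink_paths` and `full_mesh`, which come from the topology matrix and cannot be reproduced without it.

```ruby
# Computes the topology-independent fields of the summary.
# p2p_access_enabled is a { [src, dst] => Boolean } hash.
def summarize(device_ids, p2p_access_enabled)
  n = device_ids.size
  {
    gpu_count: n,
    total_paths: n * (n - 1),   # ordered pairs of distinct devices
    p2p_enabled: p2p_access_enabled.count { |_, enabled| enabled }
  }
end

status = { [0, 1] => true, [1, 0] => true, [0, 2] => false }
puts summarize([0, 1, 2, 3], status).inspect
```

With four GPUs there are 4 * 3 = 12 directed paths, and only entries whose value is truthy count towards `p2p_enabled`.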
#ready? ⇒ Boolean
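Check if fully initialized. The body evaluates @initialized && @topology, so the result is truthy or falsy (the detector object, false, or nil) rather than strictly true or false.

    # File 'lib/nvruby/collective/device_manager.rb', line 142
    def ready?
      @initialized && @topology
    end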
#set_device!(device_id) ⇒ void
This method returns an undefined value.
Set current CUDA device. The ID is validated first.

    # File 'lib/nvruby/collective/device_manager.rb', line 115
    def set_device!(device_id)
      validate_device_id!(device_id)
      @devices[device_id].set_current!
    end
#size ⇒ Integer
Get number of managed GPUs.

    # File 'lib/nvruby/collective/device_manager.rb', line 136
    def size
      @device_ids.size
    end
#synchronize!(device_id) ⇒ void
This method returns an undefined value.
Synchronize a device. The ID is validated first.

    # File 'lib/nvruby/collective/device_manager.rb', line 123
    def synchronize!(device_id)
      validate_device_id!(device_id)
      @devices[device_id].synchronize
    end
#synchronize_all! ⇒ void
This method returns an undefined value.
Synchronize all managed devices, one after another, in device_ids order.

    # File 'lib/nvruby/collective/device_manager.rb', line 130
    def synchronize_all!
      @device_ids.each { |id| synchronize!(id) }
    end
#to_s ⇒ String
Returns a human-readable summary. Each device is shown as index:name, with the name truncated to its first 16 characters.

    # File 'lib/nvruby/collective/device_manager.rb', line 171
    def to_s
      names = @devices.values.map { |d| "#{d.index}:#{d.name[0..15]}" }
      "DeviceManager[#{names.join(', ')}]"
    end
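The range `name[0..15]` is inclusive, so it keeps at most 16 characters. The sketch below reproduces the formatting with a hypothetical `FakeDevice` struct in place of `CUDA::Device`; the device names are made-up sample values.

```ruby
# Hypothetical stand-in exposing the two readers that to_s uses.
FakeDevice = Struct.new(:index, :name)

# Same formatting as DeviceManager#to_s, applied to a plain array.
def summary_string(devices)
  names = devices.map { |d| "#{d.index}:#{d.name[0..15]}" }
  "DeviceManager[#{names.join(', ')}]"
end

devices = [
  FakeDevice.new(0, "NVIDIA GeForce RTX 4090"),
  FakeDevice.new(1, "Tesla T4")
]
puts summary_string(devices)
# → DeviceManager[0:NVIDIA GeForce R, 1:Tesla T4]
```

Names shorter than 16 characters pass through unchanged, since slicing past the end of a string returns what is available.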