Class: Ignis::Collective::DeviceManager

Inherits:
Object
Defined in:
lib/nvruby/collective/device_manager.rb

Overview

Multi-GPU device manager. Handles device enumeration, context management, and peer-to-peer (P2P) access configuration.

Constructor Details

#initialize(device_ids: nil) ⇒ DeviceManager

Create device manager for specified GPUs

Parameters:

  • device_ids (Array<Integer>, nil) (defaults to: nil)

    GPUs to manage (nil = all)



# File 'lib/nvruby/collective/device_manager.rb', line 25

def initialize(device_ids: nil)
  @device_ids = (device_ids || all_device_ids).dup.freeze
  @devices = {}
  @topology = nil
  @p2p_access_enabled = {}
  @initialized = false

  validate_devices!
  create_device_objects!
end
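
The constructor stores `(device_ids || all_device_ids).dup.freeze`, so the manager keeps a private, immutable copy of the ID list. A minimal plain-Ruby sketch of that idiom follows; the `requested` and `managed` names are illustrative only and are not part of the library.

```ruby
# Illustrative only: mirrors the `.dup.freeze` idiom used in the constructor.
requested = [0, 1, 2]             # caller-owned array
managed   = requested.dup.freeze  # private, immutable copy

requested << 3                    # the caller can still mutate its own array

mutation_blocked =
  begin
    managed << 4                  # attempt to modify the frozen copy
    false
  rescue FrozenError              # raised on modification of a frozen object (Ruby 2.5+)
    true
  end

puts managed.inspect
puts mutation_blocked
```

Copying before freezing means the caller's array stays mutable while later changes to it cannot alter the set of managed devices.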

Instance Attribute Details

#device_ids ⇒ Array<Integer> (readonly)

Returns the managed GPU device IDs.

Returns:

  • (Array<Integer>)

    Managed GPU device IDs



# File 'lib/nvruby/collective/device_manager.rb', line 12

def device_ids
  @device_ids
end

#devices ⇒ Hash<Integer, CUDA::Device> (readonly)

Returns the device objects, keyed by ID.

Returns:

  • (Hash<Integer, CUDA::Device>)

    Device objects by ID



# File 'lib/nvruby/collective/device_manager.rb', line 15

def devices
  @devices
end

#p2p_access_enabled ⇒ Hash<Array<Integer>, Boolean> (readonly)

Returns the P2P access status for each [src, dst] pair.

Returns:

  • (Hash<Array<Integer>, Boolean>)

    P2P access status



# File 'lib/nvruby/collective/device_manager.rb', line 21

def p2p_access_enabled
  @p2p_access_enabled
end

#topology ⇒ Topology::Detector (readonly)

Returns the topology detector.

Returns:

  • (Topology::Detector)

    Topology detector (nil until detect_topology! has run)



# File 'lib/nvruby/collective/device_manager.rb', line 18

def topology
  @topology
end

Instance Method Details

#destroy! ⇒ void

This method returns an undefined value.

Clean up resources



# File 'lib/nvruby/collective/device_manager.rb', line 163

def destroy!
  disable_all_p2p_access!
  @devices.clear
  @topology = nil
  @initialized = false
end

#detect_topology! ⇒ Topology::Detector

Detect GPU topology

Returns:

  • (Topology::Detector)

    Newly created topology detector for the managed devices



# File 'lib/nvruby/collective/device_manager.rb', line 47

def detect_topology!
  @topology = Topology::Detector.new(device_ids: @device_ids)
end

#device(device_id) ⇒ CUDA::Device?

Get device by ID

Parameters:

  • device_id (Integer)

    GPU device ID

Returns:

  • (CUDA::Device, nil)

    Device object, or nil if the ID is not managed



# File 'lib/nvruby/collective/device_manager.rb', line 108

def device(device_id)
  @devices[device_id]
end

#disable_all_p2p_access! ⇒ void

This method returns an undefined value.

Disable all P2P access



# File 'lib/nvruby/collective/device_manager.rb', line 88

def disable_all_p2p_access!
  @p2p_access_enabled.each_key do |(src, dst)|
    CUDA::RuntimeAPI.cudaSetDevice(src)
    P2PBindings.cudaDeviceDisablePeerAccess(dst)
  rescue StandardError
    # Ignore errors during cleanup
  end
  @p2p_access_enabled.clear
end
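
The `rescue` clause inside the `do ... end` block applies to each iteration separately (block-level `rescue` requires Ruby 2.5 or later), so one failing pair does not stop the cleanup of the others. The sketch below reproduces that control flow in plain Ruby; `disable_peer` is a hypothetical stand-in for the two CUDA calls and is not part of the library.

```ruby
# Pairs as they might look after enable_all_p2p_access! (values are illustrative).
pairs = { [0, 1] => true, [1, 0] => true, [0, 2] => false }
disabled = []

# Hypothetical stand-in for cudaSetDevice + cudaDeviceDisablePeerAccess;
# it raises for one pair to simulate a failure during cleanup.
disable_peer = lambda do |src, dst|
  raise "simulated driver error" if [src, dst] == [1, 0]
  disabled << [src, dst]
end

pairs.each_key do |(src, dst)|
  disable_peer.call(src, dst)
rescue StandardError
  # Swallowed per iteration: the loop moves on to the next pair.
end
pairs.clear

puts disabled.inspect
```

As in the listing above, every recorded key is visited, including pairs whose stored status is `false`.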

#enable_all_p2p_access! ⇒ Hash<Array<Integer>, Boolean>

Enable P2P access between all GPU pairs where available

Returns:

  • (Hash<Array<Integer>, Boolean>)

    Map of [src, dst] device ID pairs to success status



# File 'lib/nvruby/collective/device_manager.rb', line 53

def enable_all_p2p_access!
  return @p2p_access_enabled unless @p2p_access_enabled.empty?

  detect_topology! unless @topology

  P2PBindings.ensure_loaded!
  CUDA::RuntimeAPI.ensure_loaded!

  @device_ids.each do |src|
    @device_ids.each do |dst|
      next if src == dst

      # Check if P2P is possible
      unless @topology.p2p_available?(src, dst)
        @p2p_access_enabled[[src, dst]] = false
        next
      end

      # Set source device context
      status = CUDA::RuntimeAPI.cudaSetDevice(src)
      CUDA::RuntimeAPI.check_status!(status, "Set device #{src}")

      # Enable peer access
      status = P2PBindings.cudaDeviceEnablePeerAccess(dst, 0)

      # 0 = success, 704 = already enabled
      @p2p_access_enabled[[src, dst]] = status.zero? || status == 704
    end
  end

  @p2p_access_enabled
end
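
The nested loops visit every ordered (src, dst) pair with src != dst, which matters because peer access is enabled per direction. A minimal sketch of the same enumeration using `Array#permutation` is shown below; the `p2p_available` lambda is a hypothetical stand-in for `@topology.p2p_available?`, and its pairing rule is invented purely for illustration.

```ruby
device_ids = [0, 1, 2, 3]

# Hypothetical stand-in for the topology check: pretend that only
# GPUs 0<->1 and 2<->3 can reach each other directly.
p2p_available = ->(src, dst) { src / 2 == dst / 2 }

status_by_pair = {}
device_ids.permutation(2) do |src, dst|
  # Every ordered pair gets an entry, true or false, as in the method above.
  status_by_pair[[src, dst]] = p2p_available.call(src, dst)
end

enabled_pairs = status_by_pair.select { |_, ok| ok }.keys

puts status_by_pair.size
puts enabled_pairs.sort.inspect
```

For n devices this produces n * (n - 1) entries, the same figure that `p2p_summary` reports as `total_paths`.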

#initialize! ⇒ void

This method returns an undefined value.

Initialize device manager and detect topology



# File 'lib/nvruby/collective/device_manager.rb', line 38

def initialize!
  return if @initialized

  detect_topology!
  @initialized = true
end

#optimal_ring_order ⇒ Array<Integer>

Get optimal ring order for collective operations

Returns:

  • (Array<Integer>)

    Ordered device IDs



# File 'lib/nvruby/collective/device_manager.rb', line 100

def optimal_ring_order
  detect_topology! unless @topology
  @topology.optimal_ring_order
end
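
This page does not show how the returned order is consumed. As a rough sketch under that caveat, a ring-style collective would typically have each device exchange data with the device before and after it in the list, wrapping around at the ends. The `ring_order` value below is a made-up example, not actual detector output.

```ruby
# Hypothetical ring order; real values would come from the topology detector.
ring_order = [0, 2, 3, 1]
n = ring_order.size

# Map each device ID to its previous and next neighbor in the ring.
neighbors = ring_order.each_with_index.map do |id, i|
  [id, { prev: ring_order[(i - 1) % n], next: ring_order[(i + 1) % n] }]
end.to_h

neighbors.each { |id, nb| puts "GPU #{id}: prev=#{nb[:prev]} next=#{nb[:next]}" }
```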

#p2p_summary ⇒ Hash

Get P2P capability summary

Returns:

  • (Hash)

    P2P statistics, or an empty hash if no topology has been detected



# File 'lib/nvruby/collective/device_manager.rb', line 148

def p2p_summary
  return {} unless @topology

  matrix = @topology.matrix
  {
    gpu_count: @device_ids.size,
    total_paths: @device_ids.size * (@device_ids.size - 1),
    p2p_enabled: @p2p_access_enabled.count { |_, v| v },
    nvlink_paths: matrix.nvlink_paths.size,
    full_mesh: matrix.full_p2p_mesh?,
  }
end
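
Two of the summary fields depend only on the device list and the status hash, so they can be illustrated without a GPU. The sketch below uses invented sample data to show how `total_paths` and `p2p_enabled` are derived; the `nvlink_paths` and `full_mesh` fields come from the topology matrix and are not reproduced here.

```ruby
# Sample data, invented for illustration.
device_ids = [0, 1, 2]
p2p_access_enabled = {
  [0, 1] => true,  [1, 0] => true,
  [0, 2] => false, [2, 0] => false,
  [1, 2] => true,  [2, 1] => false
}

total_paths = device_ids.size * (device_ids.size - 1)  # ordered pairs: n * (n - 1)
p2p_enabled = p2p_access_enabled.count { |_, ok| ok }   # entries with a truthy status

puts "#{p2p_enabled} of #{total_paths} directed paths enabled"
```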

#ready? ⇒ Boolean

Check if fully initialized

Returns:

  • (Boolean)

    Truthy if initialize! has run and a topology is present



# File 'lib/nvruby/collective/device_manager.rb', line 142

def ready?
  @initialized && @topology
end
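
Because `&&` returns its last evaluated operand, the expression `@initialized && @topology` yields the detector object itself when both are set, `false` when not initialized, and `nil` when initialized without a topology. That is fine in conditionals; the sketch below shows the difference from a strict Boolean, using a hypothetical `detector_class` struct in place of `Topology::Detector`.

```ruby
# Hypothetical stand-in for Topology::Detector.
detector_class = Struct.new(:device_ids)
topology = detector_class.new([0, 1])

raw_ready       = true && topology       # evaluates to the detector instance
strict_ready    = !!(true && topology)   # coerced to strict true
not_initialized = false && topology      # short-circuits to false
no_topology     = true && nil            # evaluates to nil

puts raw_ready.class
puts strict_ready
```

Callers that compare the result with `== true` or serialize it would need the `!!` coercion; callers that only use it in `if` or `unless` are unaffected.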

#set_device!(device_id) ⇒ void

This method returns an undefined value.

Set current CUDA device

Parameters:

  • device_id (Integer)

    GPU to activate



# File 'lib/nvruby/collective/device_manager.rb', line 115

def set_device!(device_id)
  validate_device_id!(device_id)
  @devices[device_id].set_current!
end

#size ⇒ Integer

Get number of managed GPUs

Returns:

  • (Integer)

    GPU count



# File 'lib/nvruby/collective/device_manager.rb', line 136

def size
  @device_ids.size
end

#synchronize!(device_id) ⇒ void

This method returns an undefined value.

Synchronize a device

Parameters:

  • device_id (Integer)

    GPU to synchronize



# File 'lib/nvruby/collective/device_manager.rb', line 123

def synchronize!(device_id)
  validate_device_id!(device_id)
  @devices[device_id].synchronize
end

#synchronize_all! ⇒ void

This method returns an undefined value.

Synchronize all managed devices



# File 'lib/nvruby/collective/device_manager.rb', line 130

def synchronize_all!
  @device_ids.each { |id| synchronize!(id) }
end

#to_s ⇒ String

Returns a human-readable summary.

Returns:

  • (String)

    Human-readable summary



# File 'lib/nvruby/collective/device_manager.rb', line 171

def to_s
  names = @devices.values.map { |d| "#{d.index}:#{d.name[0..15]}" }
  "DeviceManager[#{names.join(', ')}]"
end
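
The range `[0..15]` keeps at most the first 16 characters of each device name. A small sketch of the resulting format is shown below, with a hypothetical `fake_device` struct standing in for `CUDA::Device` and made-up device names.

```ruby
# Hypothetical stand-in for CUDA::Device, exposing only index and name.
fake_device = Struct.new(:index, :name)

devices = {
  0 => fake_device.new(0, "NVIDIA GeForce RTX 4090"),
  1 => fake_device.new(1, "Tesla T4")
}

# Same formatting logic as the method above.
names   = devices.values.map { |d| "#{d.index}:#{d.name[0..15]}" }
summary = "DeviceManager[#{names.join(', ')}]"

puts summary
```

Names shorter than 16 characters pass through unchanged, since an out-of-bounds range end is clamped to the string length.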