Class: Ignis::Collective::DynamicOptimizer

Inherits:
Object
Defined in:
lib/nvruby/collective/dynamic_optimizer.rb

Overview

Dynamic GPU optimizer. Consolidates all dynamic optimization strategies for NvCCL.

Features:

  • Thermal (heatmap) optimization

  • Load balancing

  • Power management

  • Memory management

  • GPU availability/selection

  • Scheduling optimization

  • Synchronization optimization

Examples:

Create optimizer and get optimal ring order

optimizer = DynamicOptimizer.new(device_ids: [0, 1, 2, 3])
ring_order = optimizer.optimal_ring_order(strategy: :thermal)

Constant Summary

Optimization strategies

STRATEGIES = %i[
  thermal
  load
  power
  memory
  availability
  scheduling
  synchronization
  balanced
].freeze
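
Because #optimal_ring_order returns the unmodified device list for any strategy it does not recognize, a caller may want to validate the symbol up front. The following is a minimal sketch, assuming a hypothetical helper named validate_strategy! that is not part of the documented API. The constant is copied into a stand-in module so the snippet runs on its own.

```ruby
# Stand-in namespace so this sketch is self-contained; the real constant
# lives on DynamicOptimizer.
module StrategyCheck
  STRATEGIES = %i[
    thermal load power memory availability scheduling synchronization balanced
  ].freeze

  # Hypothetical helper (not part of DynamicOptimizer): returns the strategy
  # unchanged when it is known, raises ArgumentError otherwise.
  def self.validate_strategy!(strategy)
    return strategy if STRATEGIES.include?(strategy)

    raise ArgumentError,
          "unknown strategy #{strategy.inspect}; expected one of #{STRATEGIES.inspect}"
  end
end

StrategyCheck.validate_strategy!(:thermal) # returns :thermal
```

Failing loudly here is a design choice for the caller; the optimizer itself is documented only as falling back silently.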

Constructor Details

#initialize(device_ids:, health_monitor: nil) ⇒ DynamicOptimizer

Returns a new instance of DynamicOptimizer.

Parameters:

  • device_ids (Array<Integer>)

    GPU device IDs

  • health_monitor (HealthMonitor, nil) (defaults to: nil)

    Optional health monitor



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 48

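def initialize(device_ids:, health_monitor: nil)
  # Frozen copy, so later changes to the caller's array do not affect the optimizer
  @device_ids = device_ids.dup.freeze
  # Build a default monitor when none is injected
  @health_monitor = health_monitor || HealthMonitor.new(device_ids: device_ids)
  @topology = Topology::Matrix.new(device_ids)
  @metrics_cache = {}
  @cache_ttl_seconds = 1.0
  # Backdated by 100 s, so the cache starts out older than the 1 s TTL
  @last_cache_time = Time.now - 100
end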
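
The constructor sets up a metrics cache with a one-second TTL, but the private refresh_metrics_if_needed! method that uses it is not shown on this page. The sketch below illustrates how a TTL-guarded refresh of this kind typically works. The class name TtlMetricsCache, the injected clock, and the fetcher block are all assumptions made for illustration and are not the library's implementation.

```ruby
# Hypothetical stand-in for the optimizer's metrics cache.
class TtlMetricsCache
  attr_reader :refresh_count

  # clock:   callable returning the current Time (injectable for testing)
  # fetcher: block returning { device_id => metrics_hash } (assumed shape)
  def initialize(ttl_seconds: 1.0, clock: -> { Time.now }, &fetcher)
    @ttl_seconds = ttl_seconds
    @clock = clock
    @fetcher = fetcher
    @cache = {}
    # Backdated so the first read is guaranteed to refresh.
    @last_refresh = @clock.call - 100
    @refresh_count = 0
  end

  # Returns a shallow copy of the cached metrics, refreshing first if stale.
  def metrics
    refresh_if_needed!
    @cache.dup
  end

  private

  def refresh_if_needed!
    now = @clock.call
    return if now - @last_refresh < @ttl_seconds

    @cache = @fetcher.call
    @last_refresh = now
    @refresh_count += 1
  end
end

cache = TtlMetricsCache.new { { 0 => { temperature: 60 } } }
cache.metrics # first call fetches; calls within the next second reuse the cache
```

Injecting the clock keeps the staleness check testable without sleeping.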

Instance Attribute Details

#device_ids ⇒ Array<Integer> (readonly)

Returns Device IDs.

Returns:

  • (Array<Integer>)

    Device IDs



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 38

def device_ids
  @device_ids
end

#health_monitor ⇒ HealthMonitor (readonly)

Returns Health monitor instance.

Returns:

  • (HealthMonitor)

    Health monitor instance



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 41

def health_monitor
  @health_monitor
end

#topology ⇒ Topology::Matrix (readonly)

Returns Topology matrix.

Returns:

  • (Topology::Matrix)

    Topology matrix



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 44

def topology
  @topology
end

Instance Method Details

#available_devices ⇒ Array<Integer>

Get list of available GPUs

Returns:

  • (Array<Integer>)

    Available device IDs



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 117

def available_devices
  refresh_metrics_if_needed!
  @device_ids.reject { |id| should_exclude?(id) }
end

#current_metrics ⇒ Hash<Integer, Hash>

Get current GPU metrics for all devices

Returns:

  • (Hash<Integer, Hash>)

    Metrics per device



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 89

def current_metrics
  refresh_metrics_if_needed!
  @metrics_cache.dup
end
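
Note that Hash#dup makes a shallow copy: the outer hash returned by #current_metrics is independent of the cache, but the per-device metric hashes are the same objects. The short sketch below uses made-up metric values to show the difference, along with one way a caller could take a deeper copy if isolation matters.

```ruby
cache = { 0 => { temperature: 61 }, 1 => { temperature: 74 } }

# Equivalent to what current_metrics returns: a shallow copy.
snapshot = cache.dup

# Adding a key to the snapshot does not touch the cache...
snapshot[2] = { temperature: 50 }

# ...but mutating a nested hash does, because both share the inner objects.
snapshot[0][:temperature] = 99

# One option for callers that need isolation: copy each inner hash as well.
deep_snapshot = cache.transform_values(&:dup)
deep_snapshot[1][:temperature] = 0 # cache[1] is unaffected
```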

#optimal_ring_order(strategy: :balanced) ⇒ Array<Integer>

Get optimal ring order based on strategy

Parameters:

  • strategy (Symbol) (defaults to: :balanced)

    Optimization strategy

Returns:

  • (Array<Integer>)

    Ordered device IDs



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 61

def optimal_ring_order(strategy: :balanced)
  refresh_metrics_if_needed!

  case strategy
  when :thermal
    thermal_optimized_order
  when :load
    load_balanced_order
  when :power
    power_optimized_order
  when :memory
    memory_optimized_order
  when :availability
    availability_optimized_order
  when :scheduling
    scheduling_optimized_order
  when :synchronization
    synchronization_optimized_order
  when :balanced
    balanced_optimized_order
  else
    @device_ids.dup
  end
end
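
The per-strategy helpers (thermal_optimized_order and the others) are private and not shown on this page. Purely as an illustration, one plausible thermal ordering sorts devices coolest first. The function name coolest_first_order, the metrics shape, and the rule for devices without a reading are all assumptions, not the library's actual behavior.

```ruby
# Illustrative only: order device IDs by reported temperature, coolest first.
# Devices with no temperature reading are placed last; ties break on device ID.
def coolest_first_order(device_ids, metrics)
  device_ids.sort_by do |id|
    temp = metrics.dig(id, :temperature)
    [temp ? 0 : 1, temp || 0, id]
  end
end

metrics = {
  0 => { temperature: 78 },
  1 => { temperature: 55 },
  2 => {},                  # no reading
  3 => { temperature: 63 }
}

order = coolest_first_order([0, 1, 2, 3], metrics)
# order is [1, 3, 0, 2]
```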

#power_recommendation ⇒ Hash

Get power-aware throttling recommendation

Returns:

  • (Hash)

    Power recommendations



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 145

def power_recommendation
  refresh_metrics_if_needed!

  high_power_devices = @metrics_cache.select do |_id, m|
    m[:power_usage] && m[:power_limit] && 
      (m[:power_usage].to_f / m[:power_limit]) > 0.9
  end.keys

  {
    throttle_devices: high_power_devices,
    should_throttle: high_power_devices.any?,
    recommendation: high_power_devices.any? ? :reduce_batch_size : :continue
  }
end
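
To make the 90% rule concrete, the sketch below restates the selection logic as a standalone function and runs it on made-up readings. The function name high_power_devices and the sample values are illustrative assumptions. The comparison is strict, so a device at exactly 90% of its limit is not flagged, and a device missing either reading is skipped.

```ruby
# Standalone restatement of the selection rule from power_recommendation.
def high_power_devices(metrics, threshold: 0.9)
  metrics.select do |_id, m|
    m[:power_usage] && m[:power_limit] &&
      (m[:power_usage].to_f / m[:power_limit]) > threshold
  end.keys
end

# Made-up readings in watts.
sample = {
  0 => { power_usage: 285, power_limit: 300 }, # 0.95 of limit: flagged
  1 => { power_usage: 270, power_limit: 300 }, # exactly 0.90: not flagged (strict >)
  2 => { power_usage: 120, power_limit: 300 }, # 0.40: not flagged
  3 => { power_usage: 250 }                    # no limit reported: skipped
}

flagged = high_power_devices(sample)
# flagged is [0]
```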

#should_exclude?(device_id) ⇒ Boolean

Check if a GPU should be excluded from collective

Parameters:

  • device_id (Integer)

    Device ID

Returns:

  • (Boolean)

    True if device should be excluded



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 98

def should_exclude?(device_id)
  metrics = @metrics_cache[device_id]
  return true unless metrics

  # Exclude if thermal throttling
  return true if metrics[:temperature] && metrics[:temperature] > 90

  # Exclude if out of memory
  return true if metrics[:memory_used_percent] && metrics[:memory_used_percent] > 95

  # Exclude if marked unhealthy
  return true if metrics[:healthy] == false

  false
end
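
The checks above can be exercised in isolation. The sketch below copies the same rules into a standalone function (the name exclude? and the sample readings are illustrative) and shows two edge cases: a device with an empty metrics hash is kept, while a device with no metrics entry at all is excluded.

```ruby
# Standalone restatement of the exclusion checks in should_exclude?.
def exclude?(metrics)
  return true unless metrics
  return true if metrics[:temperature] && metrics[:temperature] > 90
  return true if metrics[:memory_used_percent] && metrics[:memory_used_percent] > 95
  return true if metrics[:healthy] == false

  false
end

samples = {
  0 => { temperature: 72, memory_used_percent: 40, healthy: true }, # kept
  1 => { temperature: 93 },          # above 90: excluded
  2 => { memory_used_percent: 97 },  # above 95: excluded
  3 => { healthy: false },           # marked unhealthy: excluded
  4 => {},                           # no readings at all: kept
  5 => nil                           # no metrics entry: excluded
}

available = samples.reject { |_id, m| exclude?(m) }.keys
# available is [0, 4]
```

Both thresholds are strict comparisons, so a device at exactly 90 degrees or 95% memory use remains available.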

#suggest_chunk_size(total_size) ⇒ Integer

Suggest optimal chunk size based on memory

Parameters:

  • total_size (Integer)

    Total data size in bytes

Returns:

  • (Integer)

    Recommended chunk size



# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 126

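def suggest_chunk_size(total_size)
  refresh_metrics_if_needed!

  # Collect free memory from the devices that report it
  free_values = @metrics_cache.values.map { |m| m[:memory_free] }.compact

  # With no memory data there is nothing to cap against; calling
  # Float::INFINITY.to_i would raise FloatDomainError, so use total_size
  max_chunk = free_values.empty? ? total_size : (free_values.min * 0.25).to_i

  # Cap at the payload size, then align down to 256 bytes
  # (the result is 0 when the capped size is under 256 bytes)
  chunk = [max_chunk, total_size].min
  (chunk / 256) * 256
end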
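
A worked example of the sizing rule may help. The sketch below restates it as a standalone function, where the name chunk_size_for and its parameters are illustrative and not part of the API. It takes 25% of the smallest reported free memory, caps that at the payload size, and rounds down to a multiple of 256 bytes. Devices that report no free-memory figure are ignored here, which is an assumption of this sketch.

```ruby
# Standalone restatement of the sizing rule. free_bytes may contain nil for
# devices that report no memory figure; those entries are ignored.
def chunk_size_for(total_size, free_bytes, alignment: 256)
  known = free_bytes.compact
  cap = known.empty? ? total_size : (known.min * 0.25).to_i
  chunk = [cap, total_size].min
  (chunk / alignment) * alignment
end

gib = 1024**3

chunk_size_for(1_000_000, [8 * gib, 1 * gib]) # payload is the limit: 999_936
chunk_size_for(4 * gib, [8 * gib, 1 * gib])   # 25% of 1 GiB is the limit: 268_435_456
chunk_size_for(100, [8 * gib])                # under one alignment unit: 0
```

The last case shows that rounding down can yield 0 for very small payloads, so a caller should handle that result explicitly.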

#to_s ⇒ String

Returns:

  • (String)


# File 'lib/nvruby/collective/dynamic_optimizer.rb', line 161

def to_s
  "DynamicOptimizer[#{@device_ids.size} GPUs, strategy=balanced]"
end