Class: Ignis::Collective::DynamicOptimizer
- Inherits: Object
- Hierarchy: Object > Ignis::Collective::DynamicOptimizer
- Defined in: lib/nvruby/collective/dynamic_optimizer.rb
Overview
Dynamic GPU Optimizer. Consolidates all dynamic optimization strategies for NvCCL.
Features:
- Thermal (heatmap) optimization
- Load balancing
- Power management
- Memory management
- GPU availability/selection
- Scheduling optimization
- Synchronization optimization
Constant Summary
- STRATEGIES = %i[thermal load power memory availability scheduling synchronization balanced].freeze
  Optimization strategies.
Instance Attribute Summary
- #device_ids ⇒ Array<Integer> (readonly)
  Device IDs.
- #health_monitor ⇒ HealthMonitor (readonly)
  Health monitor instance.
- #topology ⇒ Topology::Matrix (readonly)
  Topology matrix.
Instance Method Summary
- #available_devices ⇒ Array<Integer>
  Get list of available GPUs.
- #current_metrics ⇒ Hash<Integer, Hash>
  Get current GPU metrics for all devices.
- #initialize(device_ids:, health_monitor: nil) ⇒ DynamicOptimizer (constructor)
  A new instance of DynamicOptimizer.
- #optimal_ring_order(strategy: :balanced) ⇒ Array<Integer>
  Get optimal ring order based on strategy.
- #power_recommendation ⇒ Hash
  Get power-aware throttling recommendation.
- #should_exclude?(device_id) ⇒ Boolean
  Check if a GPU should be excluded from collective.
- #suggest_chunk_size(total_size) ⇒ Integer
  Suggest optimal chunk size based on memory.
- #to_s ⇒ String
Constructor Details
#initialize(device_ids:, health_monitor: nil) ⇒ DynamicOptimizer
Returns a new instance of DynamicOptimizer.
    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 48
    def initialize(device_ids:, health_monitor: nil)
      @device_ids = device_ids.dup.freeze
      @health_monitor = health_monitor || HealthMonitor.new(device_ids: device_ids)
      @topology = Topology::Matrix.new(device_ids)
      @metrics_cache = {}
      @cache_ttl_seconds = 1.0
      @last_cache_time = Time.now - 100
    end
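The constructor sets a one-second cache TTL and backdates @last_cache_time by 100 seconds, which suggests that the first metrics read is meant to trigger a refresh. The private refresh_metrics_if_needed! method is not shown on this page, so the following is only a sketch of how such a TTL check might work. The TtlCache class, its fetch method, and the injectable clock are hypothetical names introduced for illustration.

```ruby
# Hypothetical stand-in for the optimizer's metrics cache; not the library's code.
class TtlCache
  attr_reader :refresh_count

  def initialize(ttl_seconds: 1.0, clock: -> { Time.now })
    @ttl_seconds = ttl_seconds
    @clock = clock
    @data = {}
    # Backdate the timestamp so the very first read is treated as stale.
    @last_refresh = @clock.call - 100
    @refresh_count = 0
  end

  # Returns the cached data, reloading it from the block once the TTL has expired.
  def fetch
    if @clock.call - @last_refresh > @ttl_seconds
      @data = yield
      @last_refresh = @clock.call
      @refresh_count += 1
    end
    @data
  end
end
```

Passing the clock in as a lambda keeps the expiry logic testable without sleeping; a caller that does not care can rely on the Time.now default.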
Instance Attribute Details
#device_ids ⇒ Array<Integer> (readonly)
Returns the device IDs.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 38
    def device_ids
      @device_ids
    end
#health_monitor ⇒ HealthMonitor (readonly)
Returns the health monitor instance.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 41
    def health_monitor
      @health_monitor
    end
#topology ⇒ Topology::Matrix (readonly)
Returns the topology matrix.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 44
    def topology
      @topology
    end
Instance Method Details
#available_devices ⇒ Array<Integer>
Get the list of available GPUs: every device ID for which #should_exclude? returns false.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 117
    def available_devices
      refresh_metrics_if_needed!
      @device_ids.reject { |id| should_exclude?(id) }
    end
#current_metrics ⇒ Hash<Integer, Hash>
Get current GPU metrics for all devices, as a copy of the internal cache keyed by device ID.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 89
    def current_metrics
      refresh_metrics_if_needed!
      @metrics_cache.dup
    end
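Because Hash#dup makes a shallow copy, the hash returned by #current_metrics is independent of the cache at the top level, but the per-device hashes inside it are the same objects. The sketch below illustrates the difference with sample data. The deep_snapshot helper is hypothetical and not part of the class.

```ruby
# Hypothetical helper: copies the outer hash and each per-device hash.
def deep_snapshot(metrics)
  metrics.transform_values(&:dup)
end

# Sample cache contents; the values are made up for illustration.
cache = { 0 => { temperature: 65 }, 1 => { temperature: 72 } }

shallow = cache.dup
shallow.delete(1)               # only the copy loses device 1
shallow[0][:temperature] = 0    # also changes cache[0][:temperature]

deep = deep_snapshot(cache)
deep[1][:temperature] = 0       # cache[1][:temperature] is unchanged
```

Callers that only read the returned metrics are unaffected; the distinction matters only if the result is mutated.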
#optimal_ring_order(strategy: :balanced) ⇒ Array<Integer>
Get the optimal ring order for the given strategy. Unrecognized strategies return the device IDs in their original order.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 61
    def optimal_ring_order(strategy: :balanced)
      refresh_metrics_if_needed!

      case strategy
      when :thermal
        thermal_optimized_order
      when :load
        load_balanced_order
      when :power
        power_optimized_order
      when :memory
        memory_optimized_order
      when :availability
        availability_optimized_order
      when :scheduling
        scheduling_optimized_order
      when :synchronization
        synchronization_optimized_order
      when :balanced
        balanced_optimized_order
      else
        @device_ids.dup
      end
    end
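The listing dispatches each strategy to a private ordering method that this page does not show, and falls back to the original device order for any other symbol. The sketch below reproduces only that dispatch-with-fallback shape. The coolest_first ordering is an assumption made for illustration; the library's thermal_optimized_order may work differently, and ring_order is a hypothetical standalone function.

```ruby
# Hypothetical ordering: coolest device first, devices without a reading last.
# The library's private thermal_optimized_order is not shown and may differ.
def coolest_first(device_ids, metrics)
  device_ids.sort_by { |id| metrics.dig(id, :temperature) || Float::INFINITY }
end

# Only :thermal is sketched here; every other symbol takes the fallback branch,
# which mirrors the documented behavior for unrecognized strategies.
def ring_order(device_ids, metrics, strategy: :balanced)
  case strategy
  when :thermal then coolest_first(device_ids, metrics)
  else device_ids.dup
  end
end

metrics = { 0 => { temperature: 80 }, 1 => { temperature: 55 }, 2 => {} }
ring_order([0, 1, 2], metrics, strategy: :thermal)   # device 1, then 0, then 2
ring_order([0, 1, 2], metrics, strategy: :unknown)   # original order
```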
#power_recommendation ⇒ Hash
Get a power-aware throttling recommendation. Devices drawing more than 90% of their power limit are listed under :throttle_devices.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 145
    def power_recommendation
      refresh_metrics_if_needed!

      high_power_devices = @metrics_cache.select do |_id, m|
        m[:power_usage] && m[:power_limit] &&
          (m[:power_usage].to_f / m[:power_limit]) > 0.9
      end.keys

      {
        throttle_devices: high_power_devices,
        should_throttle: high_power_devices.any?,
        recommendation: high_power_devices.any? ? :reduce_batch_size : :continue
      }
    end
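To make the shape of the returned hash concrete, here is a standalone restatement of the listing's logic over a plain metrics hash. The function name power_recommendation_for and the threshold: keyword are assumptions added for illustration, and the sample power figures are made up.

```ruby
# Standalone restatement of the listing's logic; not the library's method.
def power_recommendation_for(metrics, threshold: 0.9)
  hot = metrics.select do |_id, m|
    # Devices missing either figure are skipped rather than flagged.
    m[:power_usage] && m[:power_limit] &&
      (m[:power_usage].to_f / m[:power_limit]) > threshold
  end.keys

  {
    throttle_devices: hot,
    should_throttle: hot.any?,
    recommendation: hot.any? ? :reduce_batch_size : :continue
  }
end

sample = {
  0 => { power_usage: 280, power_limit: 300 }, # about 93% of the limit: flagged
  1 => { power_usage: 200, power_limit: 300 }, # about 67% of the limit: fine
  2 => { power_usage: 250 }                    # no limit reported: skipped
}
power_recommendation_for(sample)
# => { throttle_devices: [0], should_throttle: true, recommendation: :reduce_batch_size }
```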
#should_exclude?(device_id) ⇒ Boolean
Check whether a GPU should be excluded from the collective.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 98
    def should_exclude?(device_id)
      metrics = @metrics_cache[device_id]
      return true unless metrics

      # Exclude if thermal throttling
      return true if metrics[:temperature] && metrics[:temperature] > 90

      # Exclude if out of memory
      return true if metrics[:memory_used_percent] && metrics[:memory_used_percent] > 95

      # Exclude if marked unhealthy
      return true if metrics[:healthy] == false

      false
    end
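The checks in the listing run in a fixed order, use strict comparisons, and treat a missing metrics entry as grounds for exclusion. The sketch below restates them as a standalone function that returns the reason instead of a Boolean, which makes the boundary cases easier to see. The function name, the reason symbols, and the keyword parameters are hypothetical; the thresholds 90 and 95 are taken from the listing, whose source does not state the temperature unit.

```ruby
# Returns a symbol naming why a device would be excluded, or nil if it passes.
# Hypothetical helper; the library's method returns only true or false.
def exclusion_reason(metrics, max_temperature: 90, max_memory_used_percent: 95)
  return :no_metrics if metrics.nil?
  return :too_hot if metrics[:temperature] && metrics[:temperature] > max_temperature
  if metrics[:memory_used_percent] && metrics[:memory_used_percent] > max_memory_used_percent
    return :memory_pressure
  end
  # Only an explicit false excludes; a missing :healthy key passes.
  return :unhealthy if metrics[:healthy] == false

  nil
end

exclusion_reason(nil)                        # => :no_metrics
exclusion_reason({ temperature: 90 })        # => nil (the comparison is strict)
exclusion_reason({ temperature: 91 })        # => :too_hot
exclusion_reason({ memory_used_percent: 96 }) # => :memory_pressure
exclusion_reason({ healthy: false })         # => :unhealthy
```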
#suggest_chunk_size(total_size) ⇒ Integer
Suggest a chunk size based on free memory: at most 25% of the smallest :memory_free value across devices, capped at total_size and rounded down to a multiple of 256 bytes.

    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 126
    def suggest_chunk_size(total_size)
      refresh_metrics_if_needed!

      # Find minimum available memory across all devices
      min_available = @metrics_cache.values.map do |m|
        m[:memory_free] || Float::INFINITY
      end.min

      # Use at most 25% of minimum available memory
      max_chunk = (min_available * 0.25).to_i

      # Cap at the total size, then align down to 256 bytes
      chunk = [max_chunk, total_size].min
      (chunk / 256) * 256
    end
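A worked example helps with the arithmetic. With a minimum of 1,000,000,000 free bytes, the cap is 250,000,000; a 1,000,000-byte transfer is therefore limited by its own size and rounds down to 999,936 (3906 × 256). Two edge cases are worth noting in the listing as written: if no device reports :memory_free, the minimum is Float::INFINITY, and calling to_i on an infinite Float raises FloatDomainError; if the cache were empty, min would return nil. The sketch below restates the calculation with guards for both. The function name, the keyword parameters, and the fallback to total_size are assumptions, not the library's behavior.

```ruby
# Standalone restatement of the listing's arithmetic, with an added guard for
# the case where no device reports :memory_free (the guard is an assumption).
def chunk_size_for(total_size, metrics, fraction: 0.25, alignment: 256)
  free = metrics.values.map { |m| m[:memory_free] }.compact
  # Without any memory figures, fall back to the total size as the cap.
  max_chunk = free.empty? ? total_size : (free.min * fraction).to_i

  chunk = [max_chunk, total_size].min
  (chunk / alignment) * alignment # round down to a multiple of the alignment
end

metrics = { 0 => { memory_free: 1_000_000_000 }, 1 => { memory_free: 2_000_000_000 } }
chunk_size_for(1_000_000, metrics)      # => 999_936
chunk_size_for(1_000_000_000, metrics)  # => 249_999_872
chunk_size_for(100, metrics)            # => 0 (smaller than one aligned unit)
```

Because the result is rounded down, it can be zero for transfers under 256 bytes, so callers may want to handle that case explicitly.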
#to_s ⇒ String
    # File 'lib/nvruby/collective/dynamic_optimizer.rb', line 161
    def to_s
      "DynamicOptimizer[#{@device_ids.size} GPUs, strategy=balanced]"
    end