Class: Ignis::AI::Trainer
- Inherits: Object (Object > Ignis::AI::Trainer)
- Defined in: lib/nnw/ai/trainer.rb
Overview
Trainer: a complete training loop with gradient accumulation, gradient clipping, checkpointing, and optional multi-GPU gradient sync via NvCCL.
Instance Attribute Summary
- #metrics ⇒ Hash readonly
  Training metrics.
- #model ⇒ Transformer::Model readonly
- #optimizer ⇒ Optim::Base readonly
Instance Method Summary
- #initialize(model:, optimizer:, scheduler: nil, grad_accumulation_steps: 1, max_grad_norm: 1.0, use_nvccl: false, checkpoint_dir: nil) ⇒ Trainer constructor
  A new instance of Trainer.
- #load_checkpoint!(path) ⇒ void
  Load from checkpoint.
- #save_checkpoint! ⇒ String
  Save model checkpoint.
- #train(data_loader, steps:, log_interval: 100, checkpoint_interval: 1000, eval_fn: nil) ⇒ Hash
  Train for a specified number of steps.
Constructor Details
#initialize(model:, optimizer:, scheduler: nil, grad_accumulation_steps: 1, max_grad_norm: 1.0, use_nvccl: false, checkpoint_dir: nil) ⇒ Trainer
Returns a new instance of Trainer.
    # File 'lib/nnw/ai/trainer.rb', line 24

    def initialize(model:, optimizer:, scheduler: nil, grad_accumulation_steps: 1,
                   max_grad_norm: 1.0, use_nvccl: false, checkpoint_dir: nil)
      @model = model
      @optimizer = optimizer
      @scheduler = scheduler
      @grad_accumulation_steps = grad_accumulation_steps
      @max_grad_norm = max_grad_norm
      @use_nvccl = use_nvccl
      @checkpoint_dir = checkpoint_dir
      @metrics = { steps: 0, total_loss: 0.0, best_loss: Float::INFINITY }

      @model.train!
    end
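The max_grad_norm argument is passed to the optimizer's clip_grad_norm! during training. The listing does not show how that method works, so the following is only a minimal sketch of conventional global-norm clipping on plain Ruby arrays. The helper name clip_by_global_norm! is hypothetical, and returning the pre-clip norm is an assumption about the convention, not a statement about Ignis.

```ruby
# Hypothetical helper, not part of Ignis: rescales a list of gradient
# arrays in place so that their combined L2 norm is at most max_norm.
def clip_by_global_norm!(grads, max_norm)
  total_norm = Math.sqrt(grads.sum { |g| g.sum { |x| x * x } })
  if total_norm > max_norm
    scale = max_norm / total_norm
    grads.each { |g| g.map! { |x| x * scale } }
  end
  total_norm # assumption: report the norm measured before clipping
end

grads = [[3.0, 0.0], [0.0, 4.0]]
norm_before = clip_by_global_norm!(grads, 1.0)
norm_after = Math.sqrt(grads.sum { |g| g.sum { |x| x * x } })
puts "before=#{norm_before} after=#{norm_after.round(6)}"
```

With the default max_grad_norm of 1.0, gradients whose combined norm is already at or below 1.0 would pass through unchanged under this scheme.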
Instance Attribute Details
#metrics ⇒ Hash (readonly)
Returns training metrics.
    # File 'lib/nnw/ai/trainer.rb', line 15

    def metrics
      @metrics
    end
#model ⇒ Transformer::Model (readonly)
    # File 'lib/nnw/ai/trainer.rb', line 9

    def model
      @model
    end
#optimizer ⇒ Optim::Base (readonly)
    # File 'lib/nnw/ai/trainer.rb', line 12

    def optimizer
      @optimizer
    end
Instance Method Details
#load_checkpoint!(path) ⇒ void
This method returns an undefined value.
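Loads model weights from the safetensors checkpoint at path, passing strict: false to the loader. Only the model parameters are restored; optimizer state and metrics are left untouched.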
    # File 'lib/nnw/ai/trainer.rb', line 155

    def load_checkpoint!(path)
      Safetensors.load_model(@model, path, strict: false)
      Ignis.logger.info("Checkpoint loaded: #{path}")
    end
#save_checkpoint! ⇒ String
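Saves the model parameters as a safetensors checkpoint in checkpoint_dir and returns the file path. Returns nil without writing anything when no checkpoint_dir was given. The directory is created if missing, using Dir.mkdir, so its parent directory must already exist.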
    # File 'lib/nnw/ai/trainer.rb', line 131

    def save_checkpoint!
      return unless @checkpoint_dir

      Dir.mkdir(@checkpoint_dir) unless Dir.exist?(@checkpoint_dir)
      path = File.join(@checkpoint_dir, "checkpoint_step_#{@metrics[:steps]}.safetensors")

      tensors = {}
      @model.named_parameters.each do |name, param|
        tensors[name] = param
      end

      Safetensors.save(tensors, path, metadata: {
        "step" => @metrics[:steps].to_s,
        "loss" => (@metrics[:total_loss] / [@metrics[:steps], 1].max).to_s,
        "framework" => "nnw"
      })

      Ignis.logger.info("Checkpoint saved: #{path}")
      path
    end
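To make the file naming and metadata concrete, here is a small sketch that reproduces only the path and metadata computation from the listing, without touching the file system or the Safetensors module. The helper name checkpoint_info and the sample values are hypothetical.

```ruby
# Hypothetical helper mirroring the path and metadata logic of
# save_checkpoint!; it performs no I/O.
def checkpoint_info(checkpoint_dir, metrics)
  steps = metrics[:steps]
  {
    path: File.join(checkpoint_dir, "checkpoint_step_#{steps}.safetensors"),
    metadata: {
      "step" => steps.to_s,
      # [steps, 1].max avoids dividing by zero before the first update
      "loss" => (metrics[:total_loss] / [steps, 1].max).to_s,
      "framework" => "nnw"
    }
  }
end

info = checkpoint_info("ckpts", { steps: 4, total_loss: 10.0 })
puts info[:path]
puts info[:metadata]["loss"]
```

The "loss" entry is the running average over all optimizer updates so far, not the loss of the most recent update, and every metadata value is stored as a string.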
#train(data_loader, steps:, log_interval: 100, checkpoint_interval: 1000, eval_fn: nil) ⇒ Hash
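Trains for steps loop iterations (micro-batches). An optimizer update runs every grad_accumulation_steps iterations; log_interval and checkpoint_interval are counted in optimizer updates. When given, eval_fn is called with the model and the current update count at each log interval. Returns the metrics hash.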
    # File 'lib/nnw/ai/trainer.rb', line 45

    def train(data_loader, steps:, log_interval: 100, checkpoint_interval: 1000, eval_fn: nil)
      @model.train!
      accumulated_loss = 0.0

      steps.times do |step|
        # Get batch
        batch = data_loader.next_batch
        input_ids = batch[:input_ids]
        targets = batch[:targets]

        # Forward pass
        logits = @model.call(input_ids)
        loss = Loss.cross_entropy(logits, targets)

        # Scale loss for gradient accumulation
        scaled_loss = loss * (1.0 / @grad_accumulation_steps)

        # Backward pass
        scaled_loss.backward!
        accumulated_loss += loss.item

        # Optimizer step (every grad_accumulation_steps)
        if (step + 1) % @grad_accumulation_steps == 0
          # Gradient clipping
          grad_norm = @optimizer.clip_grad_norm!(@max_grad_norm)

          # Multi-GPU gradient sync
          if @use_nvccl
            sync_gradients_nvccl!
          end

          # Optimizer step
          @optimizer.step
          @optimizer.zero_grad!
          @scheduler&.step

          @metrics[:steps] += 1
          @metrics[:total_loss] += accumulated_loss / @grad_accumulation_steps

          # Logging
          if @metrics[:steps] % log_interval == 0
            avg_loss = @metrics[:total_loss] / @metrics[:steps]
            lr = @optimizer.lr
            Ignis.logger.info(
              "Step #{@metrics[:steps]} | Loss: #{'%.4f' % (accumulated_loss / @grad_accumulation_steps)} | " \
              "Avg Loss: #{'%.4f' % avg_loss} | LR: #{'%.2e' % lr} | Grad Norm: #{'%.2f' % grad_norm}"
            )

            # EventBus publish
            if defined?(Ignis::Shared::EventBus)
              Ignis::Shared::EventBus.publish(:training_step, {
                step: @metrics[:steps],
                loss: accumulated_loss / @grad_accumulation_steps,
                avg_loss: avg_loss,
                lr: lr,
                grad_norm: grad_norm
              })
            end

            # Eval
            if eval_fn
              @model.eval!
              eval_fn.call(@model, @metrics[:steps])
              @model.train!
            end
          end

          # Checkpointing
          if @checkpoint_dir && @metrics[:steps] % checkpoint_interval == 0
            save_checkpoint!
          end

          accumulated_loss = 0.0
        end

        # Clear tape each iteration
        Tape.clear!
      end

      @metrics
    end
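The relationship between loop iterations and optimizer updates in the listing can be easy to miss, so the sketch below replays only the bookkeeping (the modulo check and the metrics arithmetic) on a plain array of loss values. The function name simulate and the sample losses are hypothetical; no model, optimizer, or Tape is involved.

```ruby
# Hypothetical simulation of the accumulation bookkeeping in #train:
# each element of losses stands for one loop iteration (micro-batch).
def simulate(losses, grad_accumulation_steps)
  metrics = { steps: 0, total_loss: 0.0 }
  accumulated = 0.0
  losses.each_with_index do |loss, i|
    accumulated += loss
    # An update happens only when a full accumulation group is complete.
    next unless (i + 1) % grad_accumulation_steps == 0

    metrics[:steps] += 1
    metrics[:total_loss] += accumulated / grad_accumulation_steps
    accumulated = 0.0
  end
  metrics
end

result = simulate([4.0, 2.0, 3.0, 1.0, 5.0], 2)
puts "updates=#{result[:steps]} total_loss=#{result[:total_loss]}"
```

Five iterations with an accumulation factor of 2 yield two updates, and the fifth micro-batch never completes a group. In the listing this means metrics[:steps] counts optimizer updates rather than iterations, so choosing steps as a multiple of grad_accumulation_steps avoids a trailing group whose gradients are computed but not applied within that call.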