trainers-rb
Fine-tune transformer models in Ruby.
trainers-rb provides a training loop, LoRA (Low-Rank Adaptation), learning rate scheduling, and model serialization for Hugging Face transformer models loaded via transformers-rb. It builds on torch-rb for autograd, optimizers, and tensor operations.
All the heavy lifting happens in LibTorch C++ kernels. Ruby is the conductor.
Installation
Add to your Gemfile:
gem "trainers-rb"
Or install directly:
gem install trainers-rb
Prerequisites
trainers-rb depends on torch-rb, which requires LibTorch:
# macOS arm64
curl -L -o /tmp/libtorch.zip https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.4.0.zip
unzip /tmp/libtorch.zip -d ~/libtorch
bundle config set build.torch-rb --with-torch-dir=$HOME/libtorch/libtorch
Quick Start
require "trainers-rb"
# Load a pre-trained model and tokenizer
model, tokenizer = Trainers.from_pretrained(
"distilbert-base-uncased",
task: :sequence_classification,
num_labels: 2
)
# Prepare your dataset (texts is assumed to be an array of strings, labels an array of integer class ids)
train_data = texts.map.with_index do |text, i|
encoded = tokenizer.(text, truncation: true, max_length: 128)
{
input_ids: encoded["input_ids"],
attention_mask: encoded["attention_mask"],
labels: labels[i]
}
end
train_dataset = Trainers::Dataset.new(train_data)
# Build val_dataset the same way from your held-out examples
# Configure and train
args = Trainers::TrainingArguments.new(
output_dir: "./output",
num_train_epochs: 3,
learning_rate: 2e-5,
eval_strategy: :epoch
)
trainer = Trainers::Trainer.new(
model: model,
args: args,
train_dataset: train_dataset,
eval_dataset: val_dataset,
tokenizer: tokenizer,
data_collator: Trainers::DataCollatorWithPadding.new(tokenizer: tokenizer),
compute_metrics: ->(eval_pred) {
preds = eval_pred.predictions.argmax(1)
correct = preds.eq(eval_pred.label_ids).sum.item
{ accuracy: correct.to_f / eval_pred.label_ids.size(0) }
}
)
trainer.train
trainer.save_model("./my-model")
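The compute_metrics lambda above amounts to an argmax over each row of logits followed by a comparison with the labels. As a rough plain-Ruby illustration with nested arrays instead of torch-rb tensors (accuracy_from_logits is a hypothetical helper, not part of trainers-rb):

```ruby
# Hypothetical helper, not part of trainers-rb: computes accuracy from
# nested arrays of logits and an array of integer labels.
def accuracy_from_logits(logits, labels)
  raise ArgumentError, "size mismatch" unless logits.size == labels.size
  return 0.0 if logits.empty?

  # argmax per row: index of the largest logit
  preds = logits.map { |row| row.each_with_index.max_by { |value, _| value }.last }
  correct = preds.zip(labels).count { |pred, label| pred == label }
  correct.to_f / labels.size
end

logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
labels = [1, 0, 0, 0]
puts accuracy_from_logits(logits, labels) # three of four predictions match
```

Working through the logic on small arrays like this can help when a metric lambda returns unexpected values during evaluation.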
LoRA (Parameter-Efficient Fine-Tuning)
Freeze over 99% of parameters and train only small low-rank adapter matrices:
# Apply LoRA to specific layers
config = Trainers::LoraConfig.new(
r: 8, # rank
lora_alpha: 16, # scaling factor
lora_dropout: 0.1,
target_modules: ["query", "value"], # which Linear layers to adapt
bias: :none # :none, :all, or :lora_only
)
Trainers::LoraModel.apply(model, config)
# => LoRA applied to 12 modules: ...
# => trainable params: 294,912 || all params: 66,955,010 || trainable%: 0.4404%
# Train as usual
trainer.train
# Save just the adapters (tiny files)
Trainers::LoraModel.save_adapters(model, "./lora-adapters")
# Or merge into base model for inference
Trainers::LoraModel.merge(model)
trainer.save_model("./merged-model")
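To make the r and lora_alpha settings above more concrete, here is a minimal plain-Ruby sketch of the standard LoRA formulation, where a frozen weight W is adapted as W + (lora_alpha / r) * B * A. The helpers matmul, lora_merge, and lora_param_count are hypothetical names for illustration, and it is an assumption that trainers-rb merges adapters in exactly this form:

```ruby
# Plain-Ruby sketch of the LoRA update; hypothetical helpers, not trainers-rb APIs.

# Multiply two matrices stored as nested arrays (rows x cols).
def matmul(a, b)
  b_cols = b.transpose
  a.map { |row| b_cols.map { |col| row.zip(col).sum { |x, y| x * y } } }
end

# Merged weight: W + (alpha / r) * B * A
def lora_merge(w, a, b, alpha:, r:)
  scale = alpha.to_f / r
  delta = matmul(b, a)
  w.each_with_index.map do |row, i|
    row.each_with_index.map { |value, j| value + scale * delta[i][j] }
  end
end

# Trainable parameters added per adapted Linear layer: r * (in + out)
def lora_param_count(in_features, out_features, r)
  r * (in_features + out_features)
end

w = [[1.0, 0.0], [0.0, 1.0]] # frozen 2x2 base weight
a = [[0.5, -0.5]]            # A: r x in, with r = 1
b = [[2.0], [4.0]]           # B: out x r
p lora_merge(w, a, b, alpha: 2, r: 1)
p lora_param_count(768, 768, 8)
```

Because the adapter only stores A and B, the saved files stay small, and merging folds the scaled product back into W so inference needs no extra matrix multiplications.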
Loading saved LoRA adapters
model, tokenizer = Trainers.from_pretrained("distilbert-base-uncased", num_labels: 2)
Trainers::LoraModel.apply(model, config)
Trainers::LoraModel.load_adapters(model, "./lora-adapters")
Training Arguments
| Argument | Default | Description |
|---|---|---|
| output_dir | "./output" | Directory for checkpoints and saved models |
| num_train_epochs | 3 | Number of training epochs |
| per_device_train_batch_size | 8 | Training batch size |
| per_device_eval_batch_size | 8 | Evaluation batch size |
| learning_rate | 5e-5 | Peak learning rate for AdamW |
| weight_decay | 0.0 | Weight decay (applied to non-bias, non-norm params) |
| max_grad_norm | 1.0 | Max gradient norm for clipping |
| gradient_accumulation_steps | 1 | Accumulate gradients over N steps |
| warmup_steps | 0 | Linear warmup steps |
| warmup_ratio | 0.0 | Warmup as fraction of total steps (alternative to warmup_steps) |
| lr_scheduler_type | :linear | :linear, :cosine, or :constant |
| eval_strategy | :no | When to evaluate: :no, :epoch, or :steps |
| eval_steps | nil | Evaluate every N steps (when eval_strategy: :steps) |
| save_strategy | :epoch | When to save: :no, :epoch, or :steps |
| save_total_limit | nil | Keep only the last N checkpoints |
| logging_steps | 500 | Log every N steps |
| seed | 42 | Random seed |
| no_mps | false | Force CPU even if MPS is available |
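Several of these arguments interact: batch size and gradient accumulation determine the effective batch size and the number of optimizer steps, and warmup_ratio is resolved against that total. The sketch below shows the arithmetic in plain Ruby. training_plan is a hypothetical helper, and the rounding choices (ceil throughout) are assumptions rather than the gem's documented behavior:

```ruby
# Hypothetical helper: derives step counts from a handful of training arguments.
# Rounding with ceil is an assumption; the gem may round differently.
def training_plan(num_examples:, batch_size: 8, grad_accum: 1, epochs: 3, warmup_ratio: 0.0)
  batches_per_epoch = (num_examples.to_f / batch_size).ceil
  steps_per_epoch = (batches_per_epoch.to_f / grad_accum).ceil
  total_steps = steps_per_epoch * epochs
  {
    effective_batch_size: batch_size * grad_accum,
    steps_per_epoch: steps_per_epoch,
    total_steps: total_steps,
    warmup_steps: (total_steps * warmup_ratio).ceil
  }
end

plan = training_plan(num_examples: 1000, batch_size: 8, grad_accum: 4, epochs: 3, warmup_ratio: 0.1)
p plan
```

Running a calculation like this before training is a quick way to sanity-check values such as eval_steps and logging_steps against the total number of optimizer steps.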
Callbacks
Built-in callbacks:
# Early stopping
early_stop = Trainers::EarlyStoppingCallback.new(
patience: 3,
threshold: 0.01,
metric_name: "eval_loss"
)
trainer = Trainers::Trainer.new(
model: model,
args: args,
callbacks: [early_stop],
# ...
)
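The patience and threshold settings can be read as follows: an evaluation only counts as an improvement if the metric beats the best value so far by more than threshold, and training stops after patience consecutive evaluations without improvement. A minimal sketch of that logic, assuming a lower-is-better metric such as eval_loss (PatienceTracker is a hypothetical class for illustration, not the gem's EarlyStoppingCallback):

```ruby
# Hypothetical re-implementation for illustration; not the gem's EarlyStoppingCallback.
class PatienceTracker
  attr_reader :best, :bad_evals

  def initialize(patience:, threshold: 0.0)
    @patience = patience
    @threshold = threshold
    @best = nil
    @bad_evals = 0
  end

  # Returns true when training should stop. Lower metric values are better.
  def update(value)
    if @best.nil? || value < @best - @threshold
      @best = value
      @bad_evals = 0
    else
      @bad_evals += 1
    end
    @bad_evals >= @patience
  end
end

tracker = PatienceTracker.new(patience: 3, threshold: 0.01)
losses = [0.90, 0.70, 0.695, 0.70, 0.71, 0.60]
stopped_at = losses.index { |loss| tracker.update(loss) }
puts "stop after evaluation index #{stopped_at}"
```

Note that in this sketch the drop from 0.70 to 0.695 does not reset the counter, because it is smaller than the threshold.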
Custom callbacks:
class WandbCallback < Trainers::TrainerCallback
def on_log(args, state, control, logs: nil, **)
# send logs to Weights & Biases, MLflow, etc.
end
def on_evaluate(args, state, control, metrics: nil, **)
# log evaluation metrics
end
end
Callback hooks
| Hook | When it fires |
|---|---|
| on_train_begin | Before the first step |
| on_train_end | After the last step |
| on_epoch_begin | Start of each epoch |
| on_epoch_end | End of each epoch |
| on_step_begin | Before each training step |
| on_step_end | After each training step |
| on_log | When metrics are logged |
| on_evaluate | After evaluation |
| on_save | After saving a checkpoint |
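The nesting of the train, epoch, and step hooks can be pictured with a toy loop. This is only an illustration of the ordering described in the table, not the gem's Trainer. hook_trace is a hypothetical helper, and on_log, on_evaluate, and on_save are left out because their exact position depends on logging_steps, eval_strategy, and save_strategy:

```ruby
# Toy loop (not the gem's Trainer) that records when each hook would fire.
def hook_trace(epochs:, steps_per_epoch:)
  trace = [:on_train_begin]
  epochs.times do
    trace << :on_epoch_begin
    steps_per_epoch.times do
      trace << :on_step_begin
      # forward pass, backward pass, and optimizer step would run here
      trace << :on_step_end
    end
    trace << :on_epoch_end
  end
  trace << :on_train_end
end

puts hook_trace(epochs: 1, steps_per_epoch: 2).join(" -> ")
```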
Learning Rate Schedulers
Three schedules are available, all with optional linear warmup:
# Linear warmup then linear decay to 0 (default)
args = Trainers::TrainingArguments.new(lr_scheduler_type: :linear, warmup_steps: 100)
# Linear warmup then cosine decay to 0
args = Trainers::TrainingArguments.new(lr_scheduler_type: :cosine, warmup_steps: 100)
# Linear warmup then constant
args = Trainers::TrainingArguments.new(lr_scheduler_type: :constant, warmup_steps: 100)
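For intuition, each schedule can be expressed as a multiplier between 0 and 1 applied to the peak learning_rate. The sketch below uses the common formulas for linear warmup followed by linear, cosine, or constant behavior. lr_multiplier is a hypothetical helper, and the gem's exact implementation may differ in edge cases:

```ruby
# Hypothetical helper: learning-rate multiplier in [0, 1] at a given optimizer step.
def lr_multiplier(step, total_steps:, warmup_steps: 0, type: :linear)
  # Linear warmup from 0 up to the peak learning rate
  return step.to_f / [1, warmup_steps].max if step < warmup_steps
  return 1.0 if type == :constant

  decay_steps = [1, total_steps - warmup_steps].max
  progress = [(step - warmup_steps).to_f / decay_steps, 1.0].min
  case type
  when :linear then 1.0 - progress
  when :cosine then 0.5 * (1.0 + Math.cos(Math::PI * progress))
  else raise ArgumentError, "unknown scheduler type: #{type}"
  end
end

peak_lr = 2e-5
[0, 50, 100, 550, 1000].each do |step|
  lr = peak_lr * lr_multiplier(step, total_steps: 1000, warmup_steps: 100, type: :cosine)
  puts format("step %4d  lr %.3e", step, lr)
end
```

Linear and cosine decay both reach zero at the final step. Cosine stays closer to the peak early in the decay and drops faster toward the end.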
Data Utilities
Dataset
Wrap an array of hashes:
data = [
{ input_ids: [101, 2023, 2003], attention_mask: [1, 1, 1], labels: 1 },
{ input_ids: [101, 2919, 2143], attention_mask: [1, 1, 1], labels: 0 },
]
dataset = Trainers::Dataset.new(data)
Data Collators
Dynamic padding collator (pads each batch to the longest sequence in that batch):
collator = Trainers::DataCollatorWithPadding.new(tokenizer: tokenizer)
Default collator (no padding, expects uniform-length inputs):
collator = Trainers::DefaultDataCollator.new
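Dynamic padding means each batch is padded only as far as its own longest sequence, rather than to a global maximum length. A minimal plain-Ruby sketch of the idea follows. pad_batch is a hypothetical helper rather than the gem's collator, and the pad token id of 0 is an assumption; in practice it comes from the tokenizer:

```ruby
# Hypothetical helper, not the gem's collator: pads input_ids and
# attention_mask to the length of the longest input_ids in the batch.
def pad_batch(examples, pad_token_id: 0)
  max_len = examples.map { |ex| ex[:input_ids].size }.max
  examples.map do |ex|
    gap = max_len - ex[:input_ids].size
    ex.merge(
      input_ids: ex[:input_ids] + [pad_token_id] * gap,
      attention_mask: ex[:attention_mask] + [0] * gap # padded positions are masked out
    )
  end
end

batch = [
  { input_ids: [101, 2023, 2003, 102], attention_mask: [1, 1, 1, 1], labels: 1 },
  { input_ids: [101, 102], attention_mask: [1, 1], labels: 0 }
]
pad_batch(batch).each { |ex| p ex }
```

Padding per batch avoids wasting computation on padding tokens when most sequences are much shorter than max_length.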
Supported Tasks
trainers-rb works with any Torch::NN::Module. The Trainers.from_pretrained convenience method supports these transformers-rb model classes:
| Task | Model class |
|---|---|
| :sequence_classification | AutoModelForSequenceClassification |
| :token_classification | AutoModelForTokenClassification |
| :question_answering | AutoModelForQuestionAnswering |
You can also use any custom model:
trainer = Trainers::Trainer.new(model: my_custom_model, args: args, ...)
Device Support
trainers-rb auto-detects the best available device:
- CPU — always available
- MPS — Apple Silicon GPU, used automatically when available
# Force CPU
args = Trainers::TrainingArguments.new(no_mps: true)
# Or set explicitly
args = Trainers::TrainingArguments.new(device: Torch.device("mps"))
Architecture
trainers-rb
-> transformers-rb (model loading, tokenizers, HF Hub)
-> torch-rb (autograd, nn modules, optimizers)
-> tokenizers (HuggingFace Rust tokenizers via FFI)
-> safetensors (weight file I/O)
trainers-rb adds the training layer that transformers-rb intentionally omits. Both gems call into the same LibTorch C++ kernels for the actual computation.
Roadmap
- [ ] More model architectures in transformers-rb (GPT-2, Llama for text generation)
- [ ] Mixed precision training (fp16/bf16)
- [ ] Gradient checkpointing for memory efficiency
- [ ] Dataset streaming for large datasets
- [ ] Distributed training
- [ ] Integration with ONNX export for deployment
- [ ] QLoRA (quantized base model + LoRA)
Contributing
Bug reports and pull requests are welcome on GitHub.
License
MIT