trainers-rb

Fine-tune transformer models in Ruby.

trainers-rb provides a training loop, LoRA (Low-Rank Adaptation), learning rate scheduling, and model serialization for HuggingFace transformer models loaded via transformers-rb. It builds on torch-rb for autograd, optimizers, and tensor operations.

All the heavy lifting happens in LibTorch C++ kernels. Ruby is the conductor.

Installation

Add to your Gemfile:

gem "trainers-rb"

Or install directly:

gem install trainers-rb

Prerequisites

trainers-rb depends on torch-rb, which requires LibTorch:

# macOS arm64
curl -L -o /tmp/libtorch.zip https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.4.0.zip
unzip /tmp/libtorch.zip -d ~/libtorch   # the archive unpacks into ~/libtorch/libtorch
bundle config set build.torch-rb --with-torch-dir=$HOME/libtorch/libtorch

Quick Start

require "trainers-rb"

# Load a pre-trained model and tokenizer
model, tokenizer = Trainers.from_pretrained(
  "distilbert-base-uncased",
  task: :sequence_classification,
  num_labels: 2
)

# Prepare your dataset (`texts` is an array of strings, `labels` an array of integer class ids)
train_data = texts.map.with_index do |text, i|
  encoded = tokenizer.(text, truncation: true, max_length: 128)
  {
    input_ids:      encoded["input_ids"],
    attention_mask: encoded["attention_mask"],
    labels:         labels[i]
  }
end
train_dataset = Trainers::Dataset.new(train_data)

# Configure and train
args = Trainers::TrainingArguments.new(
  output_dir:        "./output",
  num_train_epochs:  3,
  learning_rate:     2e-5,
  eval_strategy:     :epoch
)

trainer = Trainers::Trainer.new(
  model:         model,
  args:          args,
  train_dataset: train_dataset,
  eval_dataset:  val_dataset,   # a Trainers::Dataset built the same way from held-out examples
  tokenizer:     tokenizer,
  data_collator: Trainers::DataCollatorWithPadding.new(tokenizer: tokenizer),
  compute_metrics: ->(eval_pred) {
    preds   = eval_pred.predictions.argmax(1)
    correct = preds.eq(eval_pred.label_ids).sum.item
    { accuracy: correct.to_f / eval_pred.label_ids.size(0) }
  }
)

trainer.train
trainer.save_model("./my-model")
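
To make the compute_metrics lambda above concrete, here is a plain-Ruby sketch of the same accuracy calculation on nested arrays instead of tensors. The helper name accuracy and the sample logits are made up for illustration; the real callback receives torch-rb tensors.

```ruby
# Illustration only: argmax over each row of logits, then compare with labels.
def accuracy(logits, label_ids)
  # Index of the highest score in each row is the predicted class.
  preds = logits.map { |row| row.each_with_index.max_by { |score, _| score }.last }
  correct = preds.zip(label_ids).count { |pred, label| pred == label }
  correct.to_f / label_ids.length
end

sample_logits = [[2.1, -0.3], [0.2, 1.7], [0.9, 0.4], [-1.0, 0.5]]
sample_labels = [0, 1, 1, 1]

acc = accuracy(sample_logits, sample_labels)
puts "accuracy: #{acc}"
```

Three of the four predictions match their labels here, so the sketch reports 0.75.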

LoRA (Parameter-Efficient Fine-Tuning)

Freeze the pre-trained weights and train only small low-rank adapter matrices, typically well under 1% of the model's parameters (0.44% in the example below):

# Apply LoRA to specific layers
config = Trainers::LoraConfig.new(
  r:              8,         # rank
  lora_alpha:     16,        # scaling factor
  lora_dropout:   0.1,
  target_modules: ["query", "value"],  # which Linear layers to adapt
  bias:           :none      # :none, :all, or :lora_only
)

Trainers::LoraModel.apply(model, config)
# => LoRA applied to 12 modules: ...
# => trainable params: 294,912 || all params: 66,955,010 || trainable%: 0.4404%

# Train as usual
trainer.train

# Save just the adapters (tiny files)
Trainers::LoraModel.save_adapters(model, "./lora-adapters")

# Or merge into base model for inference
Trainers::LoraModel.merge(model)
trainer.save_model("./merged-model")
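
For intuition about what merging does, the sketch below applies the standard LoRA update W' = W + (lora_alpha / r) * B * A to a tiny weight matrix in plain Ruby. It assumes the usual LoRA formulation and is not taken from the trainers-rb source; matmul and lora_merge are hypothetical helpers.

```ruby
# Illustration only: plain arrays stand in for tensors.

# Multiply two matrices stored as arrays of rows.
def matmul(a, b)
  b_cols = b.transpose
  a.map { |row| b_cols.map { |col| row.zip(col).sum { |x, y| x * y } } }
end

# Fold a low-rank update into a frozen weight: W' = W + (alpha / r) * B * A
def lora_merge(weight, lora_a, lora_b, alpha:)
  r = lora_a.length                # rank = number of rows in A
  scale = alpha.to_f / r
  delta = matmul(lora_b, lora_a)   # (out x r) * (r x in) = out x in
  weight.each_with_index.map do |row, i|
    row.each_with_index.map { |w, j| w + scale * delta[i][j] }
  end
end

weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]         # out = 2, in = 3
lora_a = [[1.0, 2.0, 3.0]]         # r x in, with r = 1
lora_b = [[0.5], [0.0]]            # out x r

merged = lora_merge(weight, lora_a, lora_b, alpha: 2)
p merged
```

With the config shown earlier (r: 8, lora_alpha: 16) the same scale factor of 2.0 would apply. After merging, the adapter matrices are no longer needed at inference time because their effect lives in the base weight.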

Loading saved LoRA adapters

model, tokenizer = Trainers.from_pretrained("distilbert-base-uncased", num_labels: 2)
Trainers::LoraModel.apply(model, config)  # config: the same LoraConfig used during training
Trainers::LoraModel.load_adapters(model, "./lora-adapters")

Training Arguments

Argument	Default	Description
`output_dir`	`"./output"`	Directory for checkpoints and saved models
`num_train_epochs`	`3`	Number of training epochs
`per_device_train_batch_size`	`8`	Training batch size
`per_device_eval_batch_size`	`8`	Evaluation batch size
`learning_rate`	`5e-5`	Peak learning rate for AdamW
`weight_decay`	`0.0`	Weight decay (applied to non-bias, non-norm params)
`max_grad_norm`	`1.0`	Max gradient norm for clipping
`gradient_accumulation_steps`	`1`	Accumulate gradients over N steps
`warmup_steps`	`0`	Linear warmup steps
`warmup_ratio`	`0.0`	Warmup as fraction of total steps (alternative to warmup_steps)
`lr_scheduler_type`	`:linear`	`:linear`, `:cosine`, or `:constant`
`eval_strategy`	`:no`	When to evaluate: `:no`, `:epoch`, or `:steps`
`eval_steps`	`nil`	Evaluate every N steps (when `eval_strategy: :steps`)
`save_strategy`	`:epoch`	When to save: `:no`, `:epoch`, or `:steps`
`save_total_limit`	`nil`	Keep only the last N checkpoints
`logging_steps`	`500`	Log every N steps
`seed`	`42`	Random seed
`no_mps`	`false`	Force CPU even if MPS is available
`device`	(auto-detected)	Explicit device override, e.g. `Torch.device("mps")`
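
Several of these arguments interact: batch size and gradient accumulation set the effective batch size, which in turn determines how many optimizer steps a run takes and how many steps warmup_ratio translates to. The sketch below works through that arithmetic. The rounding rules (dropping a trailing partial accumulation window, rounding warmup up) are assumptions borrowed from common practice, and training_plan is a hypothetical helper, not part of trainers-rb.

```ruby
# Illustration only: step arithmetic under assumed rounding rules.
def training_plan(num_examples:, batch_size: 8, gradient_accumulation_steps: 1,
                  num_train_epochs: 3, warmup_steps: 0, warmup_ratio: 0.0)
  batches_per_epoch = (num_examples.to_f / batch_size).ceil
  # One optimizer update per accumulation window (assumed: partial window dropped).
  updates_per_epoch = [batches_per_epoch / gradient_accumulation_steps, 1].max
  total_steps = updates_per_epoch * num_train_epochs
  # An explicit warmup_steps wins; otherwise derive it from warmup_ratio.
  warmup = warmup_steps.positive? ? warmup_steps : (total_steps * warmup_ratio).ceil
  {
    effective_batch_size: batch_size * gradient_accumulation_steps,
    total_steps: total_steps,
    warmup_steps: warmup
  }
end

plan = training_plan(num_examples: 1000, batch_size: 8,
                     gradient_accumulation_steps: 4, warmup_ratio: 0.1)
p plan
```

For 1,000 examples this gives an effective batch size of 32, 93 optimizer steps over 3 epochs, and 10 warmup steps.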

Callbacks

Built-in callbacks:

# Early stopping
early_stop = Trainers::EarlyStoppingCallback.new(
  patience:    3,
  threshold:   0.01,
  metric_name: "eval_loss"
)

trainer = Trainers::Trainer.new(
  model: model,
  args: args,
  callbacks: [early_stop],
  # ...
)
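
The patience and threshold settings can be read as follows: an evaluation counts as an improvement only if it beats the best value so far by more than threshold, and training stops after patience evaluations in a row without one. The sketch below is a guess at that logic for a lower-is-better metric such as eval_loss. PatienceTracker is a hypothetical class; the real callback may differ in detail.

```ruby
# Illustration only: patience/threshold bookkeeping for a lower-is-better metric.
class PatienceTracker
  attr_reader :best, :bad_evals

  def initialize(patience: 3, threshold: 0.01)
    @patience = patience
    @threshold = threshold
    @best = nil
    @bad_evals = 0
  end

  # Record one evaluation result; returns true when training should stop.
  def stop_after?(value)
    if @best.nil? || value < @best - @threshold
      @best = value        # improved by more than the threshold
      @bad_evals = 0
    else
      @bad_evals += 1      # no meaningful improvement this time
    end
    @bad_evals >= @patience
  end
end

tracker = PatienceTracker.new(patience: 3, threshold: 0.01)
losses = [0.90, 0.70, 0.695, 0.70, 0.71, 0.60]
stopped_at = losses.index { |loss| tracker.stop_after?(loss) }
puts "stopped at evaluation index #{stopped_at}, best loss #{tracker.best}"
```

Note that 0.695 does not reset the counter: it beats 0.70 by less than the 0.01 threshold, so training stops at the fifth evaluation and never sees the final 0.60.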

Custom callbacks:

class WandbCallback < Trainers::TrainerCallback
  def on_log(args, state, control, logs: nil, **)
    # send logs to Weights & Biases, MLflow, etc.
  end

  def on_evaluate(args, state, control, metrics: nil, **)
    # log evaluation metrics
  end
end

Callback hooks

Hook	When it fires
`on_train_begin`	Before the first step
`on_train_end`	After the last step
`on_epoch_begin`	Start of each epoch
`on_epoch_end`	End of each epoch
`on_step_begin`	Before each training step
`on_step_end`	After each training step
`on_log`	When metrics are logged
`on_evaluate`	After evaluation
`on_save`	After saving a checkpoint
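
The nesting of the train, epoch, and step hooks can be pictured with a toy loop. This is a sketch of the ordering the table implies, not the actual trainer loop; RecordingCallback and toy_train_loop are hypothetical. The on_log, on_evaluate, and on_save hooks are left out because where they fire depends on logging_steps, eval_strategy, and save_strategy.

```ruby
# Illustration only: records the order in which hooks are called.
class RecordingCallback
  HOOKS = %i[on_train_begin on_train_end on_epoch_begin on_epoch_end
             on_step_begin on_step_end].freeze

  attr_reader :events

  def initialize
    @events = []
  end

  # Define every hook to append its own name to the event list.
  HOOKS.each do |hook|
    define_method(hook) { |*_args, **_opts| @events << hook }
  end
end

def toy_train_loop(callback, epochs:, steps_per_epoch:)
  callback.on_train_begin
  epochs.times do
    callback.on_epoch_begin
    steps_per_epoch.times do
      callback.on_step_begin
      # forward pass, backward pass, and optimizer step would run here
      callback.on_step_end
    end
    callback.on_epoch_end
  end
  callback.on_train_end
end

recorder = RecordingCallback.new
toy_train_loop(recorder, epochs: 1, steps_per_epoch: 2)
puts recorder.events.join(" -> ")
```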

Learning Rate Schedulers

Three schedules are available, all with optional linear warmup:

# Linear warmup then linear decay to 0 (default)
args = Trainers::TrainingArguments.new(lr_scheduler_type: :linear, warmup_steps: 100)

# Linear warmup then cosine decay to 0
args = Trainers::TrainingArguments.new(lr_scheduler_type: :cosine, warmup_steps: 100)

# Linear warmup then constant
args = Trainers::TrainingArguments.new(lr_scheduler_type: :constant, warmup_steps: 100)
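
The three schedules can be written as a multiplier on the peak learning rate. The sketch below uses the conventional formulas for linear warmup followed by linear, cosine, or constant behaviour. It is an approximation of what the schedulers compute, not the library's code, and lr_multiplier is a hypothetical helper.

```ruby
# Illustration only: multiplier applied to the peak learning rate at a given step.
def lr_multiplier(step, total_steps:, warmup_steps: 0, type: :linear)
  # Linear warmup from 0 to 1 over warmup_steps.
  return step.to_f / [1, warmup_steps].max if step < warmup_steps

  # Fraction of the post-warmup phase completed, clamped to [0, 1].
  progress = (step - warmup_steps).to_f / [1, total_steps - warmup_steps].max
  progress = progress.clamp(0.0, 1.0)

  case type
  when :linear   then 1.0 - progress
  when :cosine   then 0.5 * (1.0 + Math.cos(Math::PI * progress))
  when :constant then 1.0
  else raise ArgumentError, "unknown schedule: #{type.inspect}"
  end
end

peak_lr = 2e-5
[0, 50, 100, 550, 1000].each do |step|
  lr = peak_lr * lr_multiplier(step, total_steps: 1000, warmup_steps: 100, type: :cosine)
  puts format("step %4d  lr %.3e", step, lr)
end
```

Linear and cosine both end at zero; they differ in shape, with cosine staying near the peak longer early on and flattening out near the end.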

Data Utilities

Dataset

Wrap an array of hashes:

data = [
  { input_ids: [101, 2023, 2003], attention_mask: [1, 1, 1], labels: 1 },
  { input_ids: [101, 2919, 2143], attention_mask: [1, 1, 1], labels: 0 },
]
dataset = Trainers::Dataset.new(data)
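
The Quick Start passes a separate eval_dataset. One simple way to get one is to shuffle the array of hashes with a fixed seed and hold out a fraction before wrapping each part in a Dataset. The helper below is a plain-Ruby sketch; train_val_split is hypothetical and not provided by the gem.

```ruby
# Illustration only: seeded shuffle, then hold out a fraction for validation.
def train_val_split(records, val_fraction: 0.2, seed: 42)
  shuffled = records.shuffle(random: Random.new(seed))
  val_size = (records.length * val_fraction).round
  [shuffled.drop(val_size), shuffled.take(val_size)]
end

records = Array.new(10) do |i|
  { input_ids: [101, i], attention_mask: [1, 1], labels: i % 2 }
end

train_records, val_records = train_val_split(records)
puts "train: #{train_records.length}, validation: #{val_records.length}"
```

Each half can then be passed to Trainers::Dataset.new as shown above.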

Data Collators

Dynamic padding collator (pads each batch to the longest sequence in that batch):

collator = Trainers::DataCollatorWithPadding.new(tokenizer: tokenizer)
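
Conceptually, dynamic padding looks like the sketch below: find the longest sequence in the batch, then right-pad input_ids with the pad token and attention_mask with zeros. This version works on nested arrays and assumes a pad token id of 0 and right-side padding; the real collator takes these from the tokenizer and returns tensors. pad_batch is a hypothetical helper.

```ruby
# Illustration only: pad every example to the longest sequence in the batch.
def pad_batch(examples, pad_token_id: 0)
  max_len = examples.map { |ex| ex[:input_ids].length }.max
  {
    input_ids: examples.map do |ex|
      ex[:input_ids] + [pad_token_id] * (max_len - ex[:input_ids].length)
    end,
    attention_mask: examples.map do |ex|
      # Padded positions get mask 0 so the model ignores them.
      ex[:attention_mask] + [0] * (max_len - ex[:attention_mask].length)
    end,
    labels: examples.map { |ex| ex[:labels] }
  }
end

batch = pad_batch([
  { input_ids: [101, 2023, 102], attention_mask: [1, 1, 1], labels: 1 },
  { input_ids: [101, 2919, 2143, 2001, 102], attention_mask: [1, 1, 1, 1, 1], labels: 0 }
])
p batch
```

Padding per batch rather than to a global maximum keeps short batches small, which saves compute when sequence lengths vary widely.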

Default collator (no padding, expects uniform-length inputs):

collator = Trainers::DefaultDataCollator.new

Supported Tasks

trainers-rb works with any Torch::NN::Module. The Trainers.from_pretrained convenience method supports these transformers-rb model classes:

Task	Model class
`:sequence_classification`	`AutoModelForSequenceClassification`
`:token_classification`	`AutoModelForTokenClassification`
`:question_answering`	`AutoModelForQuestionAnswering`

You can also use any custom model:

trainer = Trainers::Trainer.new(
  model: my_custom_model,
  args: args,
  # ...
)

Device Support

trainers-rb auto-detects the best available device:

CPU — always available
MPS — Apple Silicon GPU, used automatically when available

# Force CPU
args = Trainers::TrainingArguments.new(no_mps: true)

# Or set explicitly
args = Trainers::TrainingArguments.new(device: Torch.device("mps"))

Architecture

trainers-rb
  -> transformers-rb    (model loading, tokenizers, HF Hub)
    -> torch-rb         (autograd, nn modules, optimizers)
    -> tokenizers       (HuggingFace Rust tokenizers via FFI)
    -> safetensors      (weight file I/O)

trainers-rb adds the training layer that transformers-rb intentionally omits. Both gems call into the same LibTorch C++ kernels for the actual computation.

Roadmap

[ ] More model architectures in transformers-rb (GPT-2, Llama for text generation)
[ ] Mixed precision training (fp16/bf16)
[ ] Gradient checkpointing for memory efficiency
[ ] Dataset streaming for large datasets
[ ] Distributed training
[ ] Integration with ONNX export for deployment
[ ] QLoRA (quantized base model + LoRA)

Contributing

Bug reports and pull requests are welcome on GitHub.

License

MIT