toy

v0.8.0 · first published gem · pre-1.0, not API-stable · CHANGELOG · docs · framework guide

Readable machine learning. toy is a transformer LM framework in Ruby, Spinel-compiled to native binaries. The whole forward pass fits on one screen, every shape is annotated, and the building blocks are named after the math — and it still runs real HuggingFace models (SmolLM2, Llama 3, Qwen 2.5/3, Mistral, Gemma 2, OLMoE) with output matched against PyTorch, on CPU, CUDA, and Metal.

# One transformer block (GPT-2 family — Llama swaps LN→RMSNorm and
# adds RoPE inside self_attention; same one-screen shape).
def transformer_block(x, block)
  h  = layer_norm(x, block.ln1_gamma, block.ln1_beta)
  x.add!(self_attention(h, block))
  h2 = layer_norm(x, block.ln2_gamma, block.ln2_beta)
  x.add!(feed_forward(h2, block))
  x
end

This is not pseudocode next to the real implementation — it is the implementation. Every model also carries an algorithm_card emitting Phuong–Hutter-style pseudocode (arXiv:2207.09238) with shape annotations, and the round-trip closes: toy describe <model> renders the card from a GGUF's metadata, and the card parses back into the Ruby that constructs the model.
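To make the pre-norm residual shape of that block concrete, here is a minimal plain-Ruby sketch on ordinary arrays. It is an illustration only, not toy's implementation: `layer_norm_vec`, `residual`, and `EPS` are hypothetical names introduced here, and the sublayer is a stand-in block rather than real attention or a feed-forward network.

```ruby
# Plain-Ruby illustration of "x + sublayer(layer_norm(x))".
# All names below are hypothetical; toy's own primitives are not shown.
EPS = 1e-5

# Normalize one vector to zero mean and unit variance, then scale and shift.
def layer_norm_vec(x, gamma, beta)
  mean = x.sum / x.size.to_f
  var  = x.sum { |v| (v - mean)**2 } / x.size.to_f
  inv  = 1.0 / Math.sqrt(var + EPS)
  x.each_with_index.map { |v, i| (v - mean) * inv * gamma[i] + beta[i] }
end

# Pre-norm residual: normalize, run the sublayer, add the result back onto x.
def residual(x, gamma, beta)
  h = layer_norm_vec(x, gamma, beta)
  y = yield(h)
  x.zip(y).map { |a, b| a + b }
end

x     = [1.0, 2.0, 3.0, 4.0]
ones  = [1.0] * 4
zeros = [0.0] * 4

normed = layer_norm_vec(x, ones, zeros)
# A sublayer that outputs zeros leaves x unchanged: the residual path is the identity.
out = residual(x, ones, zeros) { |h| h.map { |v| 0.0 * v } }
```

Because each sublayer only adds to `x`, a block whose sublayers output zero is exactly the identity, which is part of why the pre-norm layout is easy to reason about.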

Inference (KV-cache decode, F32/Q8, zero-copy mmap), training (from-scratch, warm-start, LoRA), eval (per-token logprobs), and an OpenAI-compatible server — each an end-to-end single native binary.
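As a rough picture of what per-token logprobs and greedy decoding mean, the sketch below turns one row of logits into log-probabilities and picks the best tokens. It is plain Ruby on a made-up four-token vocabulary; `log_softmax` and `top_k` are hypothetical helper names, not toy's API.

```ruby
# Numerically stable log-softmax: subtract the max before exponentiating.
def log_softmax(logits)
  max     = logits.max
  log_sum = Math.log(logits.sum { |z| Math.exp(z - max) })
  logits.map { |z| z - max - log_sum }
end

# The k most likely tokens as [token_id, logprob] pairs, most likely first.
def top_k(logprobs, k)
  logprobs.each_with_index.max_by(k) { |lp, _| lp }.map { |lp, id| [id, lp] }
end

logits   = [2.0, 1.0, 0.1, -1.0]   # made-up scores for a 4-token vocabulary
logprobs = log_softmax(logits)

# Greedy decoding takes the single most likely token at each step.
greedy = logprobs.each_with_index.max_by { |lp, _| lp }.last
best   = top_k(logprobs, 2)
```

Summing the logprob of each actual next token over a sequence gives the sequence log-likelihood, which is the quantity an eval over per-token logprobs is built on.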

Five minutes of play

Requires Ruby, Spinel, and a C compiler. The CLI is plain MRI Ruby; the native compute runners are built on demand.

toy install                                  # build/verify the CPU backend
toy fetch ggml-org/models \
    tinyllamas/stories15M-q4_0.gguf          # grab a tiny model into ./data
toy infer data/stories15M-q4_0.gguf \
    --prompt "Once upon a time"              # greedy decode

Then train a small transformer from scratch, and evaluate, serve, list, and describe models (the SmolLM2 commands assume that GGUF is already in ./data):

toy train from-scratch --steps 20 --seed 0   # runs/<id>/: weights, events, loss curve
toy eval data/SmolLM2-135M-Instruct-Q8_0.gguf --top-k 5
toy serve data/SmolLM2-135M-Instruct-Q8_0.gguf --port 4567 --name smol
toy list                                     # finds GGUFs in HF / Ollama / LM Studio caches
toy describe data/SmolLM2-135M-Instruct-Q8_0.gguf   # the algorithm card, from metadata

toy --help shows all 9 commands (new, install, list, describe, fetch, infer, train, eval, serve); docs/cli.md has flags, exit codes, and the machine-readable --manifest contract. If toy list shows nothing, any of toy fetch, huggingface-cli download, ollama pull, or LM Studio will populate a cache that toy list searches.
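Once toy serve is running, any OpenAI-style client should be able to talk to it. The sketch below builds such a request with Ruby's standard library. The endpoint path, the request fields, and the idea that --name sets the model name are all assumptions taken from the general OpenAI chat-completions convention, not from toy's documentation; docs/cli.md is the place to confirm what the server actually exposes.

```ruby
require "net/http"
require "json"
require "uri"

# Build (but do not send) a chat-completion request.
# ASSUMPTION: the path and body follow the OpenAI chat-completions convention,
# and the model name matches the value passed to `toy serve --name`.
def build_chat_request(base_url, model, prompt)
  uri = URI.join(base_url, "/v1/chat/completions")
  req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  req.body = JSON.generate(
    model: model,
    messages: [{ role: "user", content: prompt }],
    max_tokens: 32
  )
  [uri, req]
end

uri, req = build_chat_request("http://localhost:4567", "smol", "Once upon a time")

# To actually send it (requires a running server on that port):
# res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
# puts JSON.parse(res.body)
```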

Using toy as a framework

The CLI is the front door; the framework is the house. Everything is a layered stack — primitives → blocks → archs → engines → recipes — each layer plain Ruby, each gated bit-identical against a reference, all of it loaded with one require:

require "toy/compute"

cfg  = Toy::SmolLM2Config.tiny
opts = Toy::LLM::RecipeOptions.new
opts.t_seq = 32
opts.seed  = 42

recipe = Toy::LLM::Recipes::FromScratch.new
recipe.realize!(cfg, opts)
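steps = 20                         # number of training steps
steps.times do |step|
  # `batch` is the current training batch, supplied by your own data pipeline (not shown here).
  loss = recipe.step!(batch.seq_ids, batch.positions, batch.labels,
                      batch.hp, step == 0)   # last argument is true only on the first step
  puts "step #{step}: loss #{loss}"
end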

realize! builds the entire forward + loss + backward + AdamW graph natively; step! drives one training step; every knob is a named setter. That's the whole training contract — the same one toy's own gates use. Start your own project with toy new mylab (an experiment tree with ENV-driven hyperparameters — one compile, many runs) or toy new mylib --lib (a library consuming toy as a gem, native vendoring and a multi-arch build.sh included; devices are chosen at compile time).
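For readers who want to see what the AdamW part of that graph computes, here is a minimal plain-Ruby sketch of one update on a flat parameter vector. It follows the standard AdamW rule with decoupled weight decay; `AdamState`, `adamw_step!`, and the default hyperparameters are hypothetical choices made for this illustration and say nothing about how toy lays out or tunes its native optimizer.

```ruby
# One AdamW update in plain Ruby (illustration only, not toy's native graph).
AdamState = Struct.new(:m, :v, :t)

def adamw_step!(w, grad, state, lr: 1e-3, beta1: 0.9, beta2: 0.999,
                eps: 1e-8, weight_decay: 0.01)
  state.t += 1
  bc1 = 1.0 - beta1**state.t   # bias correction for the first moment
  bc2 = 1.0 - beta2**state.t   # bias correction for the second moment
  w.each_index do |i|
    state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * grad[i]
    state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * grad[i]**2
    m_hat = state.m[i] / bc1
    v_hat = state.v[i] / bc2
    # Decoupled weight decay: applied to the weight itself, not folded into the gradient.
    w[i] -= lr * (m_hat / (Math.sqrt(v_hat) + eps) + weight_decay * w[i])
  end
  w
end

# Minimize f(w) = sum of w_i^2, whose gradient is 2w.
w          = [1.0, -2.0]
state      = AdamState.new([0.0, 0.0], [0.0, 0.0], 0)
start_loss = w.sum { |x| x * x }
100.times { adamw_step!(w, w.map { |x| 2.0 * x }, state, lr: 0.05) }
end_loss = w.sum { |x| x * x }
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly `lr` regardless of gradient scale.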

The framework guide is the tour; docs/authoring.md shows how to add your own primitive, block, arch, or recipe; docs/consuming-toy.md is the full dependency story.

Models and backends

Seventeen checkpoints run today — across GPT-2, SmolLM2, TinyLlama, Llama 3.2, Mistral, Qwen 2.5/3, Gemma 2, and OLMoE (MoE) — in F32 and Q8_0, with three tokenizer flavors and RoPE scaling auto-detected from the GGUF. CPU is the gated reference backend; CUDA and Metal mirror it, held bit-identical by make verify-mirrors.

The honest per-model/per-backend matrix (including the footnotes — what's validated vs. expected-to-work vs. not wired) lives in docs/models.md; per-op coverage vs PyTorch in docs/coverage.md.
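As background on the Q8_0 format mentioned above, the sketch below quantizes one block the way ggml's Q8_0 layout is usually described: 32 values share one scale, and each value is stored as a signed 8-bit integer. It is a plain-Ruby illustration with hypothetical helper names, it keeps the scale as an ordinary Ruby Float where the on-disk format uses a 16-bit float, and it makes no claim about how toy's kernels are written.

```ruby
Q8_BLOCK = 32 # values per block in the Q8_0 layout

# Quantize one block into a shared scale plus 32 integers in -127..127.
def q8_0_quantize(block)
  raise ArgumentError, "expected #{Q8_BLOCK} values" unless block.size == Q8_BLOCK
  amax   = block.map(&:abs).max
  scale  = amax / 127.0
  inv    = scale.zero? ? 0.0 : 1.0 / scale
  quants = block.map { |x| (x * inv).round.clamp(-127, 127) }
  [scale, quants]
end

# Recover approximate floats: each value is its integer times the block scale.
def q8_0_dequantize(scale, quants)
  quants.map { |q| q * scale }
end

block         = Array.new(Q8_BLOCK) { |i| Math.sin(i * 0.3) }
scale, quants = q8_0_quantize(block)
restored      = q8_0_dequantize(scale, quants)
max_err       = block.zip(restored).map { |a, b| (a - b).abs }.max
```

The round-trip error per value is at most half the block scale, which is why a per-block scale copes with outliers better than a single scale for the whole tensor would.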

Documentation

docs/framework.md — start here to build with toy: the stack, recipes, toy/compute, project scaffolds.
docs/architecture.md — the five-layer algorithm stack and how the CLI shells to compute runners.
docs/cli.md — the 9 commands, flags, exit codes, and the manifest contract.
docs/authoring.md — adding a primitive, block, arch, or recipe; the algorithm-card round-trip.
docs/models.md — the supported-models matrix, tokenizers, RoPE, opt-in performance knobs.
docs/consuming-toy.md — toy as a gem dependency: vendoring, native builds, CUDA/Metal opt-ins.
docs/gating.md — the bit-identical reproducibility gates that hold all of the above together.
docs/events.md — the toy/v1 event schema.
docs/roadmap.md — deferred work and live research directions.
examples/ — the narrated teaching set: seven single-file examples (train, warm-start, LoRA, generate, logprobs, run-log compare, ViT), each one make target away.

Acknowledgments

A heartfelt thank-you to Ninoslav Milenović for graciously handing over the toy gem name on RubyGems. A good name is a gift, and giving one up for someone else's project is the kind of quiet generosity the Ruby community runs on. We don't take it lightly — thank you, Ninoslav. 🙏