Ollama::Client

A production-safe Ollama client for Rails & agent systems.

Not a chatbot UI. Not a 1:1 API wrapper. A failure-aware, contract-driven client that covers all 12 Ollama API endpoints with production guarantees.

Correctness. Determinism. Failure-aware design. Nothing else.

Why This Gem Exists

Other Ollama clients give you raw HTTP access. This one gives you production guarantees:

What goes wrong	What other gems do	What `ollama-client` does
Model isn't downloaded	Raise error	Auto-pull → retry
Ollama server is down	Hang for 60s	Fast-fail instantly
LLM returns broken JSON	Crash your parser	Repair prompt → retry
Request times out	Raise immediately	Exponential backoff
Schema violation	You find out in prod	`SchemaViolationError` before it reaches your code

Installation

gem "ollama-client"

Quick Start

Works out of the box — all defaults are production-safe:

require "ollama_client"

client = Ollama::Client.new
# model: "llama3.2:3b", timeout: 30, retries: 2, strict_json: true

Chat (Multi-turn Conversations)

The primary endpoint for agentic usage:

response = client.chat(
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is Ruby?" }
  ]
)

response.message.content  # => "Ruby is a dynamic, open source..."
response.message.role     # => "assistant"
response.done?            # => true
response.done_reason      # => "stop"
response.total_duration   # => 1234567 (nanoseconds)

Tool Calling

messages = [{ role: "user", content: "What is the weather in London?" }]

tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"]
      }
    }
  }
]

response = client.chat(messages: messages, tools: tools)
response.message.tool_calls.first.name       # => "get_weather"
response.message.tool_calls.first.arguments  # => { "city" => "London" }

Structured Output (JSON Schema)

messages = [{ role: "user", content: "What is the capital of France? Answer in JSON." }]
schema = { type: "object", properties: { answer: { type: "string" } } }

response = client.chat(messages: messages, format: schema)
JSON.parse(response.message.content)  # => { "answer" => "Paris" }

Thinking Mode

Note: Requires a thinking-capable model (e.g. deepseek-coder:6.7b, qwen3:0.6b).

messages = [{ role: "user", content: "What is the square root of 144?" }]

response = client.chat(messages: messages, model: "qwen3:0.6b", think: true)
response.message.thinking  # => "Let me reason through this..."
response.message.content   # => "The answer is 12."

Chat Options

messages = [{ role: "user", content: "Hello" }]

client.chat(
  messages: messages,
  model: "qwen2.5-coder:7b",             # Override default model
  options: { temperature: 0.8 }, # Runtime options
  keep_alive: "10m",           # Keep model loaded
  logprobs: true,              # Return log probabilities
  top_logprobs: 5
)

Generate (Prompt → Completion)

client.generate(prompt: "Explain Ruby blocks in one sentence.")
# => "Ruby blocks are anonymous closures passed to methods..."

Structured JSON (Agents / Planners)

schema = {
  "type" => "object",
  "required" => ["action", "confidence"],
  "properties" => {
    "action" => { "type" => "string", "enum" => ["search", "calculate", "finish"] },
    "confidence" => { "type" => "number" }
  }
}

result = client.generate(prompt: "User wants weather in Paris.", schema: schema)
result["action"]     # => "search"
result["confidence"] # => 0.95

If the LLM returns invalid JSON, the client automatically retries with a repair prompt. You get valid output or a typed exception — never a silent failure.

Structured Thinking (Zero-Magic CoT extraction)

You can ask reasoning models to output their thoughts separately from the final answer. ollama-client enforces this via strict JSON schema prompting.

Note: Requires a thinking model. Supported defaults: /deepseek/i, /qwen/i, /r1/i.

schema = {
  "type" => "object",
  "required" => ["decision"],
  "properties" => {
    "decision" => { "type" => "string" }
  }
}

result = client.generate(
  model: "deepseek-r1",
  prompt: "Should we BUY or WAIT?",
  schema: schema,
  think: true,
  return_reasoning: true
)

result["reasoning"]          # => "...step by step analysis..."
result["final"]["decision"]  # => "WAIT"

Generate Options

client.generate(
  prompt: "Write a poem",
  model: "qwen3:0.6b",               # Explicitly use a thinking model
  system: "You are a poet",          # System prompt
  think: true,                       # Thinking output
  keep_alive: "5m",                  # Keep model loaded
  options: { temperature: 0.8 }      # Runtime options
)

Streaming (Observer Hooks)

No raw SSE. No state corruption risk. Works with both chat and generate:

# Stream generate tokens
client.generate(
  prompt: "Write a haiku about code.",
  hooks: {
    on_token:    ->(token) { print token },
    on_error:    ->(err)   { warn err.message },
    on_complete: ->        { puts "\nDone" }
  }
)

# Stream chat tokens with log probabilities
client.chat(
  messages: [{ role: "user", content: "Tell me a story" }],
  logprobs: true,
  hooks: {
    # If your block takes 2 args, it receives the logprobs array for that token
    on_token: ->(token, logprobs) {
      print token
      # logprobs is an Array of Hashes, e.g. [{"token"=>"Once", "logprob"=>-0.12}, ...]
    },
    on_complete: -> { puts }
  }
)

Embeddings (RAG)

client.embeddings.embed(model: "nomic-embed-text:latest", input: "What is Ruby?")
# => [0.12, -0.05, 0.88, ...]

# Batch embeddings
client.embeddings.embed(model: "nomic-embed-text:latest", input: ["text1", "text2"])

# With options
client.embeddings.embed(
  model: "nomic-embed-text:latest",
  input: "text",
  truncate: true,        # Truncate long inputs
  dimensions: 256,       # Embedding dimensions
  keep_alive: "5m"       # Keep model loaded
)

Model Management

client.list_models              # Returns models with details & automatic capabilities map
# => [{ "name" => "llama3.1", "capabilities" => { "tools" => true, "thinking" => false, ... }, ... }]
client.list_model_names         # Just names: ["qwen2.5-coder:7b", "llama3.2:3b", ...]
client.list_running             # Currently loaded models (aliased as `ps`)
client.show_model(model: "qwen2.5-coder:7b")           # Model details, capabilities
client.show_model(model: "qwen2.5-coder:7b", verbose: true)  # Include model_info
client.pull("llama3.2:3b")                      # Download a model
client.delete_model(model: "old-model")      # Remove a model
client.copy_model(source: "qwen2.5-coder:7b", destination: "qwen2.5-coder:7b-backup")
client.create_model(model: "my-model", from: "qwen2.5-coder:7b", system: "You are Alpaca")
client.push_model(model: "user/my-model")    # Push to registry
client.version                               # => "0.12.6"

Runtime Options

Pass via options: on chat or generate:

messages = [{ role: "user", content: "Tell me a joke" }]

options = Ollama::Options.new(
  temperature: 0.7,
  num_predict: 256,
  stop: ["END"],
  presence_penalty: 0.5,
  frequency_penalty: -0.3
)

client.chat(messages: messages, options: options.to_h)

All supported options

| Option | Type | Description | |---|---|---| | `temperature` | Float (0–2) | Sampling temperature | | `top_p` | Float (0–1) | Nucleus sampling | | `top_k` | Integer | Top-K sampling | | `num_ctx` | Integer | Context window size | | `num_predict` | Integer | Max tokens to generate | | `repeat_penalty` | Float (0–2) | Repeat penalty | | `seed` | Integer | Random seed | | `stop` | Array | Stop sequences | | `tfs_z` | Float | Tail-free sampling | | `mirostat` | 0/1/2 | Mirostat sampling mode | | `mirostat_tau` | Float | Mirostat target entropy | | `mirostat_eta` | Float | Mirostat learning rate | | `typical_p` | Float (0–1) | Typical-p sampling | | `presence_penalty` | Float (-2–2) | Presence penalty | | `frequency_penalty` | Float (-2–2) | Frequency penalty | | `num_gpu` | Integer | GPU layers | | `num_thread` | Integer | CPU threads | | `num_keep` | Integer | Tokens to keep for context |

CLI

A strict, JSON-first CLI ships with the gem:

# Generate text
ollama-client generate --prompt "Explain Ruby blocks"

# Structured output with schema
echo '{"type":"object","properties":{"category":{"type":"string"}}}' > schema.json
ollama-client generate --prompt "Classify this" --schema schema.json --json

# Stream tokens
ollama-client generate --prompt "Write a poem" --stream

# Embeddings
ollama-client embed --input "What is Ruby?" --model nomic-embed-text:latest

# List models
ollama-client models

# Pull a model
ollama-client pull llama3.2:3b

All errors output as structured JSON to stderr. No hidden behavior.

Console (Debug Mode)

bin/console

verbose!  # Enable HTTP request/response logging
quiet!    # Disable it

client = Ollama::Client.new
client.version  # Prints full HTTP request/response to STDERR

Failure Behaviors

Scenario	What happens
Model missing (404)	Auto-pull → retry your request
Server unreachable	Instant `Ollama::Error` — no waiting
Timeout	Exponential backoff (`2^attempt` seconds)
Invalid JSON	Repair prompt → retry → `InvalidJSONError` if exhausted
Schema violation	Repair prompt → retry → `SchemaViolationError` if exhausted
Streaming error	`StreamError` raised with Ollama's error message

v1.0 Stability Contract

The public API is locked. See API_CONTRACT.md for the full specification.

All method signatures are stable until v2.0
Error class hierarchy is stable until v2.0
Recovery behaviors (auto-pull, backoff, repair) are guaranteed
No silent coercion of malformed JSON — ever
Typed errors over generic exceptions — always

Testing

# Unit + lint
bundle exec rake

# Integration (requires running Ollama)
OLLAMA_INTEGRATION=1 bundle exec rspec spec/integration/

License

MIT. See LICENSE.txt.