Kreuzberg for Ruby

Extract text, tables, images, metadata, and code intelligence from 96 file formats (including PDF, Office documents, images, and audio/video transcripts where native transcription is available) and 306 programming languages. These Ruby bindings provide an idiomatic Ruby API with native performance.

What This Package Provides

  • Ruby-native extraction — idiomatic Ruby objects over the shared Rust document engine.
  • Structured results — text, tables, images, metadata, language detection, chunks, and warnings.
  • OCR support — Tesseract and PaddleOCR through the same configuration model as other bindings.
  • Cross-binding parity — output matches the Python, Node.js, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.

Installation

Add to your Gemfile:

gem 'kreuzberg'

Then execute:

bundle install

Or install it directly:

gem install kreuzberg

Quick Start

Basic Usage

require 'kreuzberg'

# Simple synchronous extraction
result = Kreuzberg.extract_file("document.pdf")
puts result.content

Async Extraction

require 'kreuzberg'

# Run the async variant inside a Fiber (this gem requires Ruby 3.2+)
Fiber.new do
  result = Kreuzberg.extract_file_async("document.pdf")
  puts result.content
end.resume

Batch Processing

require 'kreuzberg'

files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]

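# Sequential extraction; Kreuzberg.batch_extract_files(files) is also available (see API Reference)
results = files.map { |file| Kreuzberg.extract_file(file) }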

results.each do |result|
  puts "Content length: #{result.content.length}"
end
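
If extraction raises for an unreadable or unsupported file, a bare `map` stops at the first failure. The following is a minimal sketch of collecting per-file outcomes instead. `BatchOutcome` and `extract_all` are hypothetical helpers, not part of Kreuzberg, and the extractor is passed in as a callable so the pattern can be tried without the gem; with the gem loaded you would presumably pass `Kreuzberg.method(:extract_file)`.

```ruby
# Hypothetical helper types, not part of the Kreuzberg API.
BatchOutcome = Struct.new(:path, :result, :error) do
  def ok?
    error.nil?
  end
end

# Calls `extractor` once per path and records either the result or the error.
# Assumption: the extractor signals failure by raising a StandardError.
def extract_all(paths, extractor)
  paths.map do |path|
    BatchOutcome.new(path, extractor.call(path), nil)
  rescue StandardError => e
    BatchOutcome.new(path, nil, e)
  end
end

# Stand-in extractor for demonstration only; it accepts .pdf paths and rejects the rest.
stub_extractor = lambda do |path|
  raise ArgumentError, "unsupported: #{path}" unless path.end_with?(".pdf")

  "text of #{path}"
end

outcomes = extract_all(["doc1.pdf", "notes.xyz"], stub_extractor)
outcomes.each do |outcome|
  status = outcome.ok? ? "ok" : "failed (#{outcome.error.message})"
  puts "#{outcome.path}: #{status}"
end
```

Keeping the error object next to the path makes it straightforward to retry or report the failed files after the batch finishes.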

Configuration

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  use_cache: true,
  enable_quality_processing: true,
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    language: 'eng'
  )
)

result = Kreuzberg.extract_file("document.pdf", config: config)
puts result.content

OCR Support

Tesseract Configuration

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    language: 'eng',
    tesseract_config: Kreuzberg::TesseractConfig.new(
      psm: 6,
      enable_table_detection: true
    )
  )
)

result = Kreuzberg.extract_file("scanned.pdf", config: config)
puts result.content

Table Extraction

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    tesseract_config: Kreuzberg::TesseractConfig.new(
      enable_table_detection: true
    )
  )
)

result = Kreuzberg.extract_file("invoice.pdf", config: config)

result.tables.each_with_index do |table, index|
  puts "Table #{index}:"
  puts table.markdown
end
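
To work with table cells rather than printed Markdown, the string can be parsed into rows. This is a rough sketch that assumes `table.markdown` returns a pipe-delimited table with a header row followed by a `---` separator row; that output shape is an assumption, and `markdown_table_to_rows` is a hypothetical helper, not a Kreuzberg method. Escaped pipes inside cells are not handled.

```ruby
# Hypothetical helper: converts a pipe-delimited Markdown table into an
# array of hashes keyed by the header cells.
def markdown_table_to_rows(markdown)
  lines = markdown.lines.map(&:strip).reject(&:empty?)

  # Split one row into trimmed cells, dropping the outer pipes.
  split_row = lambda do |line|
    line.sub(/\A\|/, "").sub(/\|\z/, "").split("|", -1).map(&:strip)
  end

  header, *body = lines.map(&split_row)

  # Drop the separator row, whose cells look like --- or :---:.
  body = body.reject { |cells| cells.all? { |cell| cell.match?(/\A:?-+:?\z/) } }

  body.map { |cells| header.zip(cells).to_h }
end

sample = <<~MARKDOWN
  | Item   | Qty |
  | ------ | --- |
  | Widget | 2   |
  | Gadget | 5   |
MARKDOWN

markdown_table_to_rows(sample).each do |row|
  puts "#{row['Item']}: #{row['Qty']}"
end
```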

Metadata Extraction

require 'kreuzberg'

result = Kreuzberg.extract_file("document.pdf")

# PDF metadata
if result.metadata[:pdf]
  pdf_meta = result.metadata[:pdf]
  puts "Title: #{pdf_meta[:title]}"
  puts "Author: #{pdf_meta[:author]}"
  puts "Pages: #{pdf_meta[:page_count]}"
end

# Detected languages
puts "Languages: #{result.detected_languages}"

# Images
if result.images
  puts "Images found: #{result.images.count}"
end

Text Chunking

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  chunking: Kreuzberg::ChunkingConfig.new(
    max_chars: 1000,
    max_overlap: 200
  )
)

result = Kreuzberg.extract_file("long_document.pdf", config: config)

result.chunks.each_with_index do |chunk, index|
  puts "Chunk #{index}: #{chunk.length} characters"
end
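
To build intuition for how `max_chars` and `max_overlap` interact, here is a plain-Ruby sliding-window sketch. It is not Kreuzberg's chunking algorithm, which may split on different boundaries; `sliding_chunks` is a hypothetical helper that only illustrates the two parameters.

```ruby
# Illustration only: fixed-size character windows where each chunk repeats
# the last `max_overlap` characters of the previous one.
def sliding_chunks(text, max_chars:, max_overlap:)
  unless (0...max_chars).cover?(max_overlap)
    raise ArgumentError, "max_overlap must be >= 0 and smaller than max_chars"
  end

  step = max_chars - max_overlap # how far the window advances each time
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, max_chars]
    break if start + max_chars >= text.length # this window reached the end

    start += step
  end
  chunks
end

alphabet = ("a".."z").to_a.join
sliding_chunks(alphabet, max_chars: 10, max_overlap: 4).each_with_index do |chunk, index|
  puts "Chunk #{index}: #{chunk}"
end
```

With `max_chars: 1000` and `max_overlap: 200`, each window in this sketch would advance by 800 characters, so consecutive chunks share 200 characters of context.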

Password-Protected PDFs

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    passwords: ["password1", "password2"]
  )
)

result = Kreuzberg.extract_file("protected.pdf", config: config)
puts result.content

Language Detection

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  language_detection: Kreuzberg::LanguageDetectionConfig.new(
    enabled: true
  )
)

result = Kreuzberg.extract_file("multilingual.pdf", config: config)
puts "Detected languages: #{result.detected_languages}"

API Reference

Main Methods

  • Kreuzberg.extract_file(path, config: nil) – Extract from file
  • Kreuzberg.extract_file_async(path, config: nil) – Async extraction
  • Kreuzberg.extract_bytes(data, mime_type, config: nil) – Extract from bytes
  • Kreuzberg.batch_extract_files(paths, config: nil) – Batch processing

Configuration Classes

  • ExtractionConfig – Main configuration
  • OcrConfig – OCR settings
  • TesseractConfig – Tesseract-specific options
  • ChunkingConfig – Text chunking settings
  • PdfConfig – PDF-specific options
  • LanguageDetectionConfig – Language detection settings

Result Object

  • content – Extracted text
  • metadata – File metadata as Hash
  • tables – Array of ExtractedTable objects
  • detected_languages – Array of language codes
  • chunks – Array of text chunks
  • images – Array of extracted images (if enabled)

System Requirements

Ruby Version

  • Ruby 3.2.0 or higher (including Ruby 4.x)
  • Ruby 4.0+ is fully supported with no code changes required
  • Magnus bindings compile successfully on all supported Ruby versions

Required

  • Rust toolchain (for native extension compilation)

Optional

# Tesseract OCR
brew install tesseract          # macOS
sudo apt-get install tesseract-ocr  # Ubuntu/Debian

Ruby 4.0 Compatibility

Kreuzberg is fully compatible with Ruby 4.0 (released December 25, 2025) and later. Key Ruby 4.0 features that work seamlessly:

  • Ruby Box - Experimental, opt-in isolation of class and module definitions
  • ZJIT Compiler - Enhanced JIT compilation for faster execution
  • Ractor Improvements - Better multi-threaded document processing
  • Set Promoted to Core - No changes needed for Kreuzberg

All tests pass on Ruby 4.0.1, and the gem compiles without any code changes.

Development

Clone and set up:

git clone https://github.com/kreuzberg-dev/kreuzberg.git
cd kreuzberg
bundle install

Run tests:

rake test

Troubleshooting

Native extension compilation error

Ensure build tools are installed:

# macOS
xcode-select --install

# Ubuntu/Debian
sudo apt-get install build-essential ruby-dev

# Windows (via RubyInstaller)
ridk install

"Could not find Kreuzberg"

Reinstall the gem:

gem uninstall kreuzberg
gem install kreuzberg --no-document

OCR not working

Verify Tesseract is installed:

tesseract --version

Examples

Process Directory of PDFs

require 'kreuzberg'

Dir.glob("documents/*.pdf").each do |file|
  puts "Processing: #{file}"
  result = Kreuzberg.extract_file(file)
  puts "  Content length: #{result.content.length}"
  puts "  Languages: #{result.detected_languages}"
end

Extract and Parse Structured Data

require 'kreuzberg'
require 'json'

result = Kreuzberg.extract_file("data.pdf")

# Parse content as JSON (if applicable)
begin
  data = JSON.parse(result.content)
  puts "Parsed data: #{data}"
rescue JSON::ParserError
  puts "Content is not JSON"
end

Save Extracted Images

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  images: Kreuzberg::ImageExtractionConfig.new(
    extract_images: true
  )
)

result = Kreuzberg.extract_file("document.pdf", config: config)

result.images&.each_with_index do |image, index|
  # binwrite avoids newline and encoding conversion of binary image data
  File.binwrite("image_#{index}.png", image.data)
end
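
The example above names every file `.png`, which may not match the actual image format. One possible approach, assuming `image.data` is a binary String (not confirmed here), is to choose the extension from the leading signature bytes. `image_extension` is a hypothetical helper, not part of Kreuzberg, and it recognizes only three common formats.

```ruby
# Hypothetical helper: guesses a file extension from well-known magic bytes.
def image_extension(bytes)
  data = bytes.b # compare as raw bytes regardless of the string's encoding
  return "png" if data.start_with?("\x89PNG\r\n\x1A\n".b)
  return "jpg" if data.start_with?("\xFF\xD8\xFF".b)
  return "gif" if data.start_with?("GIF87a".b, "GIF89a".b)

  "bin" # unknown format; keep the bytes but do not claim a type
end

png_header = "\x89PNG\r\n\x1A\n".b
puts image_extension(png_header)
puts image_extension("not an image")
```

In the loop above this would become something like `File.binwrite("image_#{index}.#{image_extension(image.data)}", image.data)`.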

Documentation

For comprehensive documentation, visit https://kreuzberg.dev

Part of Kreuzberg.dev

  • Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
  • kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
  • html-to-markdown — fast, lossless HTML→Markdown engine.
  • liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
  • tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
  • alef — the polyglot binding generator that produces this README and all per-language bindings.
  • Discord — community, roadmap, announcements.

License

Elastic-2.0 License - see LICENSE for details.