Ruby
Extract text, tables, images, and metadata from 91+ file formats, including PDF, Office documents, and images, and analyze source code in 248 programming languages. Ruby bindings with an idiomatic Ruby API and native performance.
Installation
Package Installation
Install via one of the supported package managers:
gem:
gem install kreuzberg
Bundler:
gem 'kreuzberg'
System Requirements
- Ruby 3.2.0 or higher required (including Ruby 4.x)
- Ruby 4.0+ is fully supported with no code changes required
- Optional: ONNX Runtime version 1.22.x for embeddings support
- Optional: Tesseract OCR for OCR functionality
Ruby 4.0 Compatibility: Kreuzberg is fully compatible with Ruby 4.0 (released December 25, 2025) and all Ruby 4.x versions. All tests pass with 100% compatibility. The gem compiles without any breaking changes. Key Ruby 4.0 features like Ruby Box, ZJIT compiler, and Ractor improvements work seamlessly with Kreuzberg.
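The requirements above can be checked before installing. The following is a minimal sketch, not part of Kreuzberg: `ruby_supported?` and `executable_on_path?` are hypothetical helper names, and the check assumes the Tesseract command-line binary is called `tesseract` (on Windows it would typically be `tesseract.exe`).

```ruby
require 'rubygems' # Gem::Version ships with RubyGems

# Minimum Ruby version stated in the requirements list.
MINIMUM_RUBY = Gem::Version.new('3.2.0')

# Hypothetical helper: true when the given Ruby version meets the minimum.
def ruby_supported?(version = RUBY_VERSION)
  Gem::Version.new(version) >= MINIMUM_RUBY
end

# Hypothetical helper: true when an executable with this name is on PATH.
def executable_on_path?(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

puts "Ruby #{RUBY_VERSION} supported: #{ruby_supported?}"
# Assumption: the optional Tesseract CLI is named "tesseract".
puts "Tesseract on PATH: #{executable_on_path?('tesseract')}"
```

Tesseract is optional, so a `false` on the second line only matters if OCR is needed.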
Quick Start
Basic Extraction
Extract text, metadata, and structure from any supported document format:
require 'kreuzberg'
result = Kreuzberg.extract_file_sync('document.pdf')
puts "Content:"
puts result.content
puts "\nMetadata:"
# Safe navigation guards against missing metadata
puts "Title: #{result.metadata&.dig('title')}"
puts "Author: #{result.metadata&.dig('author')}"
puts "\nTables found: #{result.tables.length}"
puts "Images found: #{result.images.length}"
Common Use Cases
Extract with Custom Configuration
Most use cases benefit from configuration to control extraction behavior:
With OCR (for scanned documents):
require 'kreuzberg'
ocr_config = Kreuzberg::Config::OCR.new(
backend: 'tesseract',
language: 'eng'
)
config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts "Extracted text from scanned document:"
puts result.content
puts "Used OCR backend: tesseract"
Table Extraction
See Table Extraction Guide for detailed examples.
Processing Multiple Files
require 'kreuzberg'
# Smoke test: confirm the gem loads, then extract a single sample file
puts "Kreuzberg version: #{Kreuzberg::VERSION}"
puts "FFI bindings loaded successfully"
result = Kreuzberg.extract_file_sync('sample.pdf')
puts "Installation verified! Extracted #{result.content.length} characters"
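The snippet above extracts a single file. For several files, one option is to fan the same call out over a small pool of Ruby threads. This is a sketch under assumptions: `extract_many` is a hypothetical helper, not part of the Kreuzberg API, and it takes the extraction call as a block so it runs here with a stand-in extractor. Whether the native extension releases Ruby's global VM lock during extraction is not stated in this README, so the actual speedup is not guaranteed.

```ruby
# Hypothetical helper, not part of the Kreuzberg API: runs the given block
# over `paths` with a bounded number of worker threads and returns a Hash
# of path => result (or the raised error), in input order.
def extract_many(paths, workers: 4, &extractor)
  raise ArgumentError, 'block required' unless extractor

  queue = Queue.new
  paths.each_with_index { |path, index| queue << [path, index] }
  queue.close # pop returns nil once the closed queue is drained

  results = Array.new(paths.length)
  threads = Array.new([workers, paths.length].min) do
    Thread.new do
      while (job = queue.pop)
        path, index = job
        results[index] = begin
          extractor.call(path)
        rescue StandardError => e
          e # keep the error so one bad file does not stop the batch
        end
      end
    end
  end
  threads.each(&:join)
  paths.zip(results).to_h
end

# Stand-in extractor so the sketch runs without the gem installed.
demo = extract_many(%w[a.pdf b.pdf]) { |path| "text of #{path}" }
demo.each { |path, text| puts "#{path}: #{text}" }
```

With the gem loaded, the block would wrap the call shown earlier, for example `extract_many(Dir.glob('docs/*.pdf')) { |path| Kreuzberg.extract_file_sync(path) }`.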
Async Processing
For non-blocking document processing:
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
use_cache: true,
enable_quality_processing: true
)
# extract_file_sync blocks until extraction completes
result = Kreuzberg.extract_file_sync('contract.pdf', config: config)
puts "Extracted #{result.content.length} characters"
puts "Quality score: #{result.quality_score}"
puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
Next Steps
- Installation Guide - Platform-specific setup
- API Documentation - Complete API reference
- Examples & Guides - Full code examples and usage guides
- Configuration Guide - Advanced configuration options
Features
Supported File Formats (91+)
91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
Office Documents
| Category | Formats | Capabilities |
|---|---|---|
| Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .ppt | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
| Database | .dbf | Table data extraction, field type support |
| Hangul | .hwp, .hwpx | Korean document format, text extraction |
Images (OCR-Enabled)
| Category | Formats | Features |
|---|---|---|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |
Web & Data
| Category | Formats | Features |
|---|---|---|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .rst, .org, .rtf | CommonMark, GFM, Djot, reStructuredText, Org Mode |
Email & Archives
| Category | Formats | Features |
|---|---|---|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |
Academic & Scientific
| Category | Formats | Features |
|---|---|---|
| Citations | .bib, .biblatex, .ris, .nbib, .enw, .csl | Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |
Code Intelligence (248 Languages)
| Feature | Description |
|---|---|
| Structure Extraction | Functions, classes, methods, structs, interfaces, enums |
| Import/Export Analysis | Module dependencies, re-exports, wildcard imports |
| Symbol Extraction | Variables, constants, type aliases, properties |
| Docstring Parsing | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| Diagnostics | Parse errors with line/column positions |
| Syntax-Aware Chunking | Split code by semantic boundaries, not arbitrary byte offsets |
Powered by tree-sitter-language-pack (see its documentation).
Key Capabilities
- Text Extraction - Extract all text content with position and formatting information
- Metadata Extraction - Retrieve document properties, creation date, author, etc.
- Table Extraction - Parse tables with structure and cell content preservation
- Image Extraction - Extract embedded images and render page previews
- OCR Support - Integrate multiple OCR backends for scanned documents
- Async/Await - Non-blocking document processing with concurrent operations
- Plugin System - Extensible post-processing for custom text transformation
- Embeddings - Generate vector embeddings using ONNX Runtime models
- Batch Processing - Efficiently process multiple documents in parallel
- Memory Efficient - Stream large files without loading entirely into memory
- Language Detection - Detect and support multiple languages in documents
- Code Intelligence - Extract structure, imports, exports, symbols, and docstrings from 248 programming languages via tree-sitter
- Configuration - Fine-grained control over extraction behavior
Performance Characteristics
| Format | Speed | Memory | Notes |
|---|---|---|---|
| PDF (text) | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
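The throughput ranges above translate into rough time estimates by dividing file size by speed. The sketch below does that arithmetic; `estimated_seconds` and the format keys are hypothetical names introduced for illustration, and the figures are only as accurate as the table's ranges.

```ruby
# Throughput ranges in MB/s, copied from the table above.
THROUGHPUT_MB_PER_S = {
  pdf_text: 10..100,
  office:   20..200,
  ocr:      1..5,
  archive:  5..50,
  web:      50..200
}.freeze

# Hypothetical helper: returns [best_case, worst_case] in seconds for a
# file of `size_mb` megabytes, assuming throughput stays within the range.
def estimated_seconds(format, size_mb)
  range = THROUGHPUT_MB_PER_S.fetch(format)
  [size_mb.to_f / range.max, size_mb.to_f / range.min]
end

best, worst = estimated_seconds(:pdf_text, 50)
puts "50 MB text PDF: #{best}s to #{worst}s" # prints "50 MB text PDF: 0.5s to 5.0s"
```

The same 50 MB run through OCR at 1-5 MB/s would take roughly 10 to 50 seconds, which is why OCR should be enabled only for documents that need it.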
OCR Support
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
- Tesseract
- PaddleOCR
OCR Configuration Example
require 'kreuzberg'
ocr_config = Kreuzberg::Config::OCR.new(
backend: 'tesseract',
language: 'eng'
)
config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts "Extracted text from scanned document:"
puts result.content
puts "Used OCR backend: tesseract"
Async Support
This binding provides full async/await support for non-blocking document processing:
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
use_cache: true,
enable_quality_processing: true
)
# extract_file_sync blocks until extraction completes
result = Kreuzberg.extract_file_sync('contract.pdf', config: config)
puts "Extracted #{result.content.length} characters"
puts "Quality score: #{result.quality_score}"
puts "Processing time: #{result.metadata&.dig('processing_time')}ms"
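The example above calls the synchronous entry point, which blocks the caller. One way to keep the caller free in the meantime, using only plain Ruby, is to run the blocking call in a background thread. This is a sketch, not the binding's own async API: `extract_in_background` is a hypothetical helper, and a stand-in lambda replaces the real extraction call so the example runs without the gem.

```ruby
# Hypothetical helper: runs a blocking extraction block in a background
# thread and returns the Thread; call #value later to wait for the result.
def extract_in_background(path, &extractor)
  Thread.new(path) { |p| extractor.call(p) }
end

# Stand-in for Kreuzberg.extract_file_sync so the sketch runs without the gem.
fake_extract = lambda do |path|
  sleep 0.05 # simulate extraction work
  "text of #{path}"
end

task = extract_in_background('contract.pdf', &fake_extract)
puts 'Doing other work while extraction runs...'
# Thread#value blocks until the thread finishes and re-raises any error.
puts task.value
```

With the gem loaded, the block would be `{ |path| Kreuzberg.extract_file_sync(path, config: config) }`.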
Plugin System
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
For detailed plugin documentation, visit Plugin System Guide.
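To illustrate what a post-processing step does, here is a plain-Ruby stand-in that chains text transformations over extracted content. It is not the Kreuzberg plugin API (see the Plugin System Guide for that); `post_process` and the two step lambdas are hypothetical names used only for this sketch.

```ruby
# Plain-Ruby stand-in, NOT the Kreuzberg plugin API: each step is a
# callable that takes a String and returns a String.
collapse_whitespace = ->(text) { text.gsub(/[ \t]+/, ' ') }
drop_blank_lines    = ->(text) { text.lines.reject { |line| line.strip.empty? }.join }

# Hypothetical helper: applies each step in order to the text.
def post_process(text, steps)
  steps.reduce(text) { |acc, step| step.call(acc) }
end

# Sample text standing in for result.content.
raw = "Invoice   2024\n\n\nTotal:\t42\n"
puts post_process(raw, [collapse_whitespace, drop_blank_lines])
```

Keeping each step a small callable makes the order explicit and lets individual steps be tested on their own.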
Embeddings Support
Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
Batch Processing
Process multiple documents efficiently:
require 'kreuzberg'
# Smoke test: confirm the gem loads, then extract a single sample file
puts "Kreuzberg version: #{Kreuzberg::VERSION}"
puts "FFI bindings loaded successfully"
result = Kreuzberg.extract_file_sync('sample.pdf')
puts "Installation verified! Extracted #{result.content.length} characters"
Configuration
For advanced configuration options, including language detection, table extraction, and OCR settings, see the Configuration Guide.
Documentation
Contributing
Contributions are welcome! See Contributing Guide.
License
MIT License - see LICENSE file for details.
Support
- Discord Community: Join our Discord
- GitHub Issues: Report bugs
- Discussions: Ask questions