kreuzcrawl
Ruby bindings for kreuzcrawl — a high-performance Rust web crawling engine. Powered by Magnus with native Ruby objects, full metadata extraction, and Markdown conversion.
What This Package Provides
- Same crawler as every binding — one Rust engine behind Python, Node.js, Ruby, Go, Java, .NET, PHP, Elixir, Dart, Kotlin Android, Swift, Zig, WASM, and C FFI.
- Structured scrape output — HTML, Markdown, metadata, links, assets, response headers, and extraction warnings with consistent field names.
- Crawl controls — depth, page limits, concurrency, URL filters, robots/sitemap handling, rate limits, and partial failure reporting.
- Rendering path — optional browser rendering for JavaScript-heavy pages; direct HTTP path for fast static pages.
- Ruby package — Magnus-backed native extension with Ruby objects for crawl results.
Installation
gem install kreuzcrawl
Quick Start
```ruby title="Ruby" require "kreuzcrawl"
Simplest case: scrape a single page with default settings.
engine = Kreuzcrawl.create_engine result = Kreuzcrawl.scrape(engine, "https://example.com/") puts "Title: #resultresult.metadataresult.metadata.title" puts "Status: #resultresult.status_code" puts "Links found: #resultresult.linksresult.links.length"
Crawl from a seed URL, limited to one hop and a handful of pages.
config = Kreuzcrawl::CrawlConfig.new(max_depth: 1, max_pages: 5) crawl_engine = Kreuzcrawl.create_engine(config) crawl_result = Kreuzcrawl.crawl(crawl_engine, "https://en.wikipedia.org/wiki/Web_scraping") puts "Pages crawled: #crawl_resultcrawl_result.pagescrawl_result.pages.length"
## API Reference
Full API documentation is available at [docs.kreuzcrawl.kreuzberg.dev](https://docs.kreuzcrawl.kreuzberg.dev).
Key functions:
- `create_engine(config?)` — Create a crawl engine with optional configuration
- `scrape(engine, url)` — Scrape a single URL
- `crawl(engine, url)` — Crawl a website following links
- `map_urls(engine, url)` — Discover all pages on a site
- `batch_scrape(engine, urls)` — Scrape multiple URLs concurrently
- `batch_crawl(engine, urls)` — Crawl multiple seed URLs concurrently
## Contributing
Contributions are welcome! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/CONTRIBUTING.md) for details.
## Part of Kreuzberg.dev
- [Kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
## License
This project is licensed under [Elastic License 2.0](https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/LICENSE).
## Links
- [Documentation](https://docs.kreuzcrawl.kreuzberg.dev)
- [GitHub Repository](https://github.com/kreuzberg-dev/kreuzcrawl)
- [Issue Tracker](https://github.com/kreuzberg-dev/kreuzcrawl/issues)
- [Issues](https://github.com/kreuzberg-dev/kreuzcrawl/issues)