kreuzcrawl

Bindings Rust Python Node.js WASM Java Go C# PHP Ruby Elixir Dart Kotlin Swift Zig C FFI Docker License Documentation
Kreuzcrawl
Join Discord

Ruby bindings for kreuzcrawl — a high-performance Rust web crawling engine. Powered by Magnus with native Ruby objects, full metadata extraction, and Markdown conversion.

What This Package Provides

  • Same crawler as every binding — one Rust engine behind Python, Node.js, Ruby, Go, Java, .NET, PHP, Elixir, Dart, Kotlin Android, Swift, Zig, WASM, and C FFI.
  • Structured scrape output — HTML, Markdown, metadata, links, assets, response headers, and extraction warnings with consistent field names.
  • Crawl controls — depth, page limits, concurrency, URL filters, robots/sitemap handling, rate limits, and partial failure reporting.
  • Rendering path — optional browser rendering for JavaScript-heavy pages; direct HTTP path for fast static pages.
  • Ruby package — Magnus-backed native extension with Ruby objects for crawl results.

Installation

gem install kreuzcrawl

Quick Start

```ruby title="Ruby" require "kreuzcrawl"

Simplest case: scrape a single page with default settings.

engine = Kreuzcrawl.create_engine result = Kreuzcrawl.scrape(engine, "https://example.com/") puts "Title: #resultresult.metadataresult.metadata.title" puts "Status: #resultresult.status_code" puts "Links found: #resultresult.linksresult.links.length"

Crawl from a seed URL, limited to one hop and a handful of pages.

config = Kreuzcrawl::CrawlConfig.new(max_depth: 1, max_pages: 5) crawl_engine = Kreuzcrawl.create_engine(config) crawl_result = Kreuzcrawl.crawl(crawl_engine, "https://en.wikipedia.org/wiki/Web_scraping") puts "Pages crawled: #crawl_resultcrawl_result.pagescrawl_result.pages.length"


## API Reference

Full API documentation is available at [docs.kreuzcrawl.kreuzberg.dev](https://docs.kreuzcrawl.kreuzberg.dev).

Key functions:

- `create_engine(config?)` — Create a crawl engine with optional configuration
- `scrape(engine, url)` — Scrape a single URL
- `crawl(engine, url)` — Crawl a website following links
- `map_urls(engine, url)` — Discover all pages on a site
- `batch_scrape(engine, urls)` — Scrape multiple URLs concurrently
- `batch_crawl(engine, urls)` — Crawl multiple seed URLs concurrently

## Contributing

Contributions are welcome! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/CONTRIBUTING.md) for details.

## Part of Kreuzberg.dev

- [Kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.

## License

This project is licensed under [Elastic License 2.0](https://github.com/kreuzberg-dev/kreuzcrawl/blob/main/LICENSE).

## Links

- [Documentation](https://docs.kreuzcrawl.kreuzberg.dev)
- [GitHub Repository](https://github.com/kreuzberg-dev/kreuzcrawl)
- [Issue Tracker](https://github.com/kreuzberg-dev/kreuzcrawl/issues)
- [Issues](https://github.com/kreuzberg-dev/kreuzcrawl/issues)