html-to-markdown
Blazing-fast HTML to Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, WebAssembly, and PHP packages. Ship identical Markdown across every runtime while enjoying native extension performance with Magnus bindings.
Installation
gem install html-to-markdown
Requires Ruby 3.2+ with Magnus native extension bindings. Published for Linux and macOS.
Performance Snapshot
Apple M4 • Real Wikipedia documents • convert() (Ruby)
| Document | Size | Latency | Throughput |
|---|---|---|---|
| Lists (Timeline) | 129KB | 0.71ms | 182 MB/s |
| Tables (Countries) | 360KB | 2.15ms | 167 MB/s |
| Mixed (Python wiki) | 656KB | 4.89ms | 134 MB/s |
See Performance Guide for detailed benchmarks.
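The throughput column follows from the other two: size divided by latency gives KB per millisecond, which is numerically the same as MB per second in decimal units. The sketch below reproduces the table's figures; the helper name `throughput_mb_per_s` is made up for illustration, and decimal units are an assumption.

```ruby
# Hypothetical helper, not part of the gem: derive throughput from the
# size and latency columns. Assumes decimal units (1 MB = 1000 KB), so
# KB per millisecond is numerically the same as MB per second.
def throughput_mb_per_s(size_kb, latency_ms)
  (size_kb / latency_ms.to_f).round
end

benchmarks = {
  "Lists (Timeline)"    => [129, 0.71],
  "Tables (Countries)"  => [360, 2.15],
  "Mixed (Python wiki)" => [656, 4.89]
}

benchmarks.each do |name, (size_kb, latency_ms)|
  puts "#{name}: #{throughput_mb_per_s(size_kb, latency_ms)} MB/s"
end
```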
Quick Start
Basic conversion:
require 'html_to_markdown'
html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
markdown = HtmlToMarkdown.convert(html)
With conversion options:
require 'html_to_markdown'
html = "<h1>Hello</h1><p>This is <strong>fast</strong>!</p>"
markdown = HtmlToMarkdown.convert(html, heading_style: :atx, code_block_style: :fenced)
API Reference
Core Functions
convert(html, options: nil) -> String
Basic HTML-to-Markdown conversion. Fast and simple.
convert_with_metadata(html, options: nil, config: nil) -> [String, Hash]
Extract Markdown plus metadata (headers, links, images, structured data) in a single pass. See Metadata Extraction Guide.
convert_with_visitor(html, visitor:, options: nil) -> String
Customize conversion with visitor callbacks for element interception. See Visitor Pattern Guide.
convert_with_inline_images(html, config: nil) -> [String, Array, Array]
Extract base64-encoded inline images with metadata.
Options
ConversionOptions – Key configuration fields:
- heading_style: Heading format ("underlined"|"atx"|"atx_closed"); default: "underlined"
- list_indent_width: Spaces per indent level; default: 2
- bullets: Bullet characters cycle; default: "*+-"
- wrap: Enable text wrapping; default: false
- wrap_width: Wrap at column; default: 80
- code_language: Default fenced code block language; default: none
- extract_metadata: Embed metadata as YAML frontmatter; default: false
- output_format: Output markup format ("markdown"|"djot"); default: "markdown"
MetadataConfig – Selective metadata extraction:
- extract_headers: h1-h6 elements; default: true
- extract_links: Hyperlinks; default: true
- extract_images: Image elements; default: true
- extract_structured_data: JSON-LD, Microdata, RDFa; default: true
- max_structured_data_size: Size limit in bytes; default: 100KB
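Because a mistyped option value is easy to miss, a small pre-flight check can be handy before calling the converter. The sketch below is a hypothetical helper, not part of the gem: `ALLOWED_OPTION_VALUES` and `option_problems` are invented names, and only the allowed values themselves are taken from the option lists above.

```ruby
# Hypothetical pre-flight check, not part of the gem. The allowed values
# are copied from the documented options; everything else is illustrative.
ALLOWED_OPTION_VALUES = {
  heading_style: %w[underlined atx atx_closed],
  output_format: %w[markdown djot]
}.freeze

# Returns a list of human-readable problems; an empty list means every
# enumerated option that was supplied holds a documented value.
def option_problems(options)
  ALLOWED_OPTION_VALUES.each_with_object([]) do |(key, allowed), problems|
    next unless options.key?(key)

    value = options[key].to_s # accept both :atx and "atx"
    next if allowed.include?(value)

    problems << "#{key}: #{value.inspect} is not one of #{allowed.join(', ')}"
  end
end

puts option_problems(heading_style: :atx, output_format: "djot").inspect
puts option_problems(heading_style: :setext).inspect
```

Options that are not enumerations, such as wrap_width, are ignored by this check.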
Djot Output Format
The library supports converting HTML to Djot, a lightweight markup language similar to Markdown but with a different syntax for some elements. Set output_format to "djot" to use this format.
Syntax Differences
| Element | Markdown | Djot |
|---|---|---|
| Strong | **text** | *text* |
| Emphasis | *text* | _text_ |
| Strikethrough | ~~text~~ | {-text-} |
| Inserted/Added | N/A | {+text+} |
| Highlighted | N/A | {=text=} |
| Subscript | N/A | ~text~ |
| Superscript | N/A | ^text^ |
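The same mapping can be written down as data, which is convenient when post-processing or testing converted output. This is an illustrative lookup built from the table, not a description of how the Rust engine renders; `INLINE_DELIMITERS` and `wrap_inline` are invented names, and returning the text unchanged for N/A entries is an assumption.

```ruby
# Illustrative lookup built from the syntax table; not the engine's code.
INLINE_DELIMITERS = {
  strong:        { markdown: ["**", "**"], djot: ["*", "*"] },
  emphasis:      { markdown: ["*", "*"],   djot: ["_", "_"] },
  strikethrough: { markdown: ["~~", "~~"], djot: ["{-", "-}"] },
  inserted:      { markdown: nil,          djot: ["{+", "+}"] },
  highlighted:   { markdown: nil,          djot: ["{=", "=}"] },
  subscript:     { markdown: nil,          djot: ["~", "~"] },
  superscript:   { markdown: nil,          djot: ["^", "^"] }
}.freeze

# Wraps text in the delimiters for the given element and format.
# Assumption: when the format has no syntax (N/A), return the text as-is.
def wrap_inline(element, text, format:)
  opener, closer = INLINE_DELIMITERS.fetch(element)[format]
  opener ? "#{opener}#{text}#{closer}" : text
end

puts wrap_inline(:strong, "bold", format: :markdown)
puts wrap_inline(:strong, "bold", format: :djot)
puts wrap_inline(:inserted, "new", format: :djot)
```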
Example Usage
require 'html_to_markdown'
html = "<p>This is <strong>bold</strong> and <em>italic</em> text.</p>"
# Default Markdown output
markdown = HtmlToMarkdown.convert(html)
# Result: "This is **bold** and *italic* text."
# Djot output
djot = HtmlToMarkdown.convert(html, output_format: 'djot')
# Result: "This is *bold* and _italic_ text."
Djot's extended syntax allows you to express more semantic meaning in lightweight text, making it useful for documents that require strikethrough, insertion tracking, or mathematical notation.
Metadata Extraction
The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass.
Use Cases:
- SEO analysis – Extract title, description, Open Graph tags, Twitter cards
- Table of contents generation – Build structured outlines from heading hierarchy
- Content migration – Document all external links and resources
- Accessibility audits – Check for images without alt text, empty links, invalid heading hierarchy
- Link validation – Classify and validate anchor, internal, external, email, and phone links
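To make the link categories in the last use case concrete, here is a minimal sketch of one way to sort an href into anchor, internal, external, email, and phone. It is a hypothetical classifier based on simple prefix checks, not the library's own rules; `classify_link` and its `site_host:` parameter are invented for illustration.

```ruby
# Hypothetical classifier, not the library's own rules: sorts an href into
# the five link categories using simple prefix checks.
def classify_link(href, site_host:)
  case href
  when /\A#/ then :anchor
  when /\Amailto:/i then :email
  when /\Atel:/i then :phone
  when %r{\Ahttps?://([^/:?#]+)}i
    # Absolute URL: compare its host with the site's own host.
    Regexp.last_match(1).casecmp?(site_host) ? :internal : :external
  else
    :internal # relative paths such as "/docs" or "page.html"
  end
end

%w[#top mailto:hi@example.com tel:+15550100 https://example.com/a https://other.org/ /docs].each do |href|
  puts "#{href} => #{classify_link(href, site_host: 'example.com')}"
end
```

Protocol-relative URLs and other schemes are not handled here and would need their own branches.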
Zero Overhead When Disabled: Metadata extraction adds negligible overhead and happens during the HTML parsing pass. Disable unused metadata types in MetadataConfig to optimize further.
Example: Quick Start
require 'html_to_markdown'
html = '<h1>Article</h1><img src="test.jpg" alt="test">'
markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)
puts metadata[:document][:title] # Document title
puts metadata[:headers] # All h1-h6 elements
puts metadata[:links] # All hyperlinks
puts metadata[:images] # All images with alt text
puts metadata[:structured_data] # JSON-LD, Microdata, RDFa
For detailed examples including SEO extraction, table-of-contents generation, link validation, and accessibility audits, see the Metadata Extraction Guide.
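As a rough illustration of the table-of-contents use case, the sketch below turns a list of headers into a nested outline. It runs on hand-written sample data and assumes each header is a Hash with `:level` and `:text` keys; the actual shape of the headers metadata may differ, so treat the key names and the `table_of_contents` helper as hypothetical.

```ruby
# Assumed shape: each header is a Hash with :level (1-6) and :text.
# The real metadata may use different keys; adjust accordingly.
def table_of_contents(headers)
  return "" if headers.empty?

  top = headers.map { |h| h[:level] }.min # shallowest level gets no indent
  headers.map do |h|
    indent = "  " * (h[:level] - top)
    "#{indent}- #{h[:text]}"
  end.join("\n")
end

# Hand-written sample data standing in for extracted headers.
sample_headers = [
  { level: 1, text: "Article" },
  { level: 2, text: "Background" },
  { level: 3, text: "Prior work" },
  { level: 2, text: "Results" }
]

puts table_of_contents(sample_headers)
```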
Visitor Pattern
The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Use visitors to transform content, filter elements, validate structure, or collect analytics.
Use Cases:
- Custom Markdown dialects – Convert to Obsidian, Notion, or other flavors
- Content filtering – Remove tracking pixels, ads, or unwanted elements
- URL rewriting – Rewrite CDN URLs, add query parameters, validate links
- Accessibility validation – Check alt text, heading hierarchy, link text
- Analytics – Track element usage, link destinations, image sources
Supported Visitor Methods: 40+ callbacks for text, inline elements, links, images, headings, lists, blocks, and tables.
Example: Quick Start
require 'html_to_markdown'
class MyVisitor
  def visit_link(ctx, href, text, title = nil)
    # Rewrite CDN URLs
    if href.start_with?('https://old-cdn.com')
      href = href.sub('https://old-cdn.com', 'https://new-cdn.com')
    end
    { type: :custom, output: "[#{text}](#{href})" }
  end

  def visit_image(ctx, src, alt = nil, title = nil)
    # Skip tracking pixels
    src.include?('tracking') ? { type: :skip } : { type: :continue }
  end
end
html = '<a href="https://old-cdn.com/file.pdf">Download</a>'
markdown = HtmlToMarkdown.convert_with_visitor(html, visitor: MyVisitor.new)
For comprehensive examples including content filtering, link footnotes, accessibility validation, and asynchronous URL validation, see the Visitor Pattern Guide.
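A visitor is a plain Ruby object, so its callbacks can be exercised directly in a unit test without running a conversion. The sketch below covers the analytics use case by counting link hosts. The `visit_link` signature and the `{ type: :continue }` result mirror the example above; the `LinkCollector` class is illustrative, and passing `nil` as the context is an assumption that only makes sense for this kind of direct call.

```ruby
# Illustrative visitor: counts link destinations by host and leaves the
# default rendering untouched.
class LinkCollector
  attr_reader :hosts

  def initialize
    @hosts = Hash.new(0)
  end

  def visit_link(_ctx, href, _text, _title = nil)
    # Pull the host out of absolute http(s) URLs; other hrefs are ignored.
    host = href[%r{\Ahttps?://([^/:?#]+)}i, 1]
    @hosts[host.downcase] += 1 if host
    { type: :continue }
  end
end

# Direct calls with a nil context, as a unit test might do.
collector = LinkCollector.new
collector.visit_link(nil, "https://old-cdn.com/file.pdf", "Download")
collector.visit_link(nil, "https://old-cdn.com/logo.png", "Logo")
collector.visit_link(nil, "#top", "Back to top")
p collector.hosts
```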
Links
RubyGems: rubygems.org/gems/html-to-markdown
Kreuzberg Ecosystem: kreuzberg.dev
Discord: discord.gg/pXxagNK2zN
Contributing
We welcome contributions! Please see our Contributing Guide for details on:
- Setting up the development environment
- Running tests locally
- Submitting pull requests
- Reporting issues
All contributions must follow our code quality standards (enforced via pre-commit hooks):
- Proper test coverage (Rust 95%+, language bindings 80%+)
- Formatting and linting checks
- Documentation for public APIs
License
MIT License – see LICENSE.
Support
If you find this library useful, consider sponsoring the project.
Have questions or run into issues? We're here to help:
- GitHub Issues: github.com/kreuzberg-dev/html-to-markdown/issues
- Discussions: github.com/kreuzberg-dev/html-to-markdown/discussions
- Discord Community: discord.gg/pXxagNK2zN