Module: Rubino::Documents::Html
- Defined in:
- lib/rubino/documents/html.rb
Overview
The ONE HTML->Markdown core (the equivalent of markitdown's `HtmlConverter` / `markdownify`). Every converter that can shape its content as HTML (the html converter itself, and docx via a paragraphs->HTML step) feeds this. Built on kramdown, which is ALREADY a rubino dependency, so no new library is added.
kramdown parses HTML and emits Markdown, but it defaults to reference-style links ([text][1] plus a [1]: url footer). LLMs read inline links more naturally, so we post-process the reference definitions back into inline form. We also strip non-content elements (script/style) before conversion.
Class Method Summary
- .inline_reference_links(markdown) ⇒ Object
  Rewrites kramdown’s reference-style links/images back to inline form: [text] …
- .strip_noise(html) ⇒ Object
  Removes script/style/head blocks (their text is not document content) and the html/body document-wrapper tags, which kramdown otherwise leaves as literal `<html>…</html>` lines around the converted body.
- .to_markdown(html) ⇒ Object
  Converts an HTML string to Markdown.
Class Method Details
.inline_reference_links(markdown) ⇒ Object
Rewrites kramdown’s reference-style links/images back to inline form:
[text][1] ... [1]: http://x -> [text](http://x)
Leaves the body untouched when there are no reference definitions.
    # File 'lib/rubino/documents/html.rb', line 51

    def inline_reference_links(markdown)
      defs = {}
      markdown.each_line do |line|
        m = line.match(/^\s*\[([^\]]+)\]:\s+(\S+)(?:\s+"[^"]*")?\s*$/)
        defs[m[1]] = m[2] if m
      end
      return markdown if defs.empty?

      body = markdown.gsub(/(!?)\[([^\]]*)\]\[([^\]]+)\]/) do
        bang = Regexp.last_match(1)
        text = Regexp.last_match(2)
        ref = Regexp.last_match(3)
        url = defs[ref.empty? ? text : ref]
        url ? "#{bang}[#{text}](#{url})" : Regexp.last_match(0)
      end
      # Drop the now-inlined reference-definition lines.
      body.each_line.grep_v(/^\s*\[[^\]]+\]:\s+\S+/).join
    end
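The rewrite can be tried in isolation. What follows is a minimal sketch, not the library's own entry point: `InlineLinksSketch`, `DEFINITION`, and the sample text are hypothetical names introduced here, and the method body is a standalone copy of the listing so it runs without Rubino or kramdown. The lookup is written as `defs[ref]` because the link pattern requires at least one character between the second pair of brackets, so `ref` is never empty.

```ruby
# Hypothetical wrapper module; the method body mirrors the listing above.
module InlineLinksSketch
  # Matches a reference definition line such as: [1]: http://x "optional title"
  DEFINITION = /^\s*\[([^\]]+)\]:\s+(\S+)(?:\s+"[^"]*")?\s*$/

  def self.inline_reference_links(markdown)
    # Collect "[id]: url" definitions into an id => url table.
    defs = {}
    markdown.each_line do |line|
      m = line.match(DEFINITION)
      defs[m[1]] = m[2] if m
    end
    return markdown if defs.empty?

    # Rewrite each [text][id] (or ![alt][id]) whose id has a definition;
    # references with an unknown id are left exactly as they were.
    body = markdown.gsub(/(!?)\[([^\]]*)\]\[([^\]]+)\]/) do
      bang, text, ref = Regexp.last_match.captures
      url = defs[ref]
      url ? "#{bang}[#{text}](#{url})" : Regexp.last_match(0)
    end

    # Drop the definition lines that have now been inlined.
    body.each_line.grep_v(/^\s*\[[^\]]+\]:\s+\S+/).join
  end
end

# Assumed sample input: two defined references and one undefined one.
sample = <<~MD
  See [the docs][1] and ![logo][2], but not [this][missing].

  [1]: https://example.com/docs
  [2]: https://example.com/logo.png "Logo"
MD

result = InlineLinksSketch.inline_reference_links(sample)
puts result
```

In this sample the link and the image are inlined, the optional title on the second definition is discarded, and `[this][missing]` is left alone because no definition exists for it. The trailing blank line that remains is what the `.strip` in `to_markdown` removes.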
.strip_noise(html) ⇒ Object
Removes script/style/head blocks (their text is not document content) and the html/body document-wrapper tags, which kramdown otherwise leaves as literal `<html>…</html>` lines around the converted body. What’s left is the inner content kramdown shapes into Markdown.
    # File 'lib/rubino/documents/html.rb', line 39

    def strip_noise(html)
      html
        .gsub(%r{<script\b[^>]*>.*?</script>}mi, "")
        .gsub(%r{<style\b[^>]*>.*?</style>}mi, "")
        .gsub(%r{<head\b[^>]*>.*?</head>}mi, "")
        .gsub(/<!--.*?-->/m, "")
        .gsub(%r{</?(?:html|body|!doctype)\b[^>]*>}mi, "")
    end
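To see what survives the stripping pass, the same chain of substitutions can be run on a small page. This is a sketch under stated assumptions: `StripNoiseSketch` and the sample markup are hypothetical, and the method body is a standalone copy of the listing so it runs without Rubino.

```ruby
# Hypothetical wrapper module; the gsub chain mirrors the listing above.
module StripNoiseSketch
  def self.strip_noise(html)
    html
      .gsub(%r{<script\b[^>]*>.*?</script>}mi, "")          # script blocks
      .gsub(%r{<style\b[^>]*>.*?</style>}mi, "")            # style blocks
      .gsub(%r{<head\b[^>]*>.*?</head>}mi, "")              # the whole head
      .gsub(/<!--.*?-->/m, "")                              # HTML comments
      .gsub(%r{</?(?:html|body|!doctype)\b[^>]*>}mi, "")    # wrapper tags only
  end
end

# Assumed sample page with a doctype, head, comment, and inline script.
page = '<!DOCTYPE html><html lang="en"><head><title>T</title></head>' \
       '<body><!-- nav --><h1>Hi</h1><script>track();</script><p>Text</p></body></html>'

cleaned = StripNoiseSketch.strip_noise(page)
puts cleaned
```

Note that the last substitution removes only the wrapper tags themselves, not what they enclose, which is why the heading and paragraph inside `<body>` remain.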
.to_markdown(html) ⇒ Object
Converts an HTML string to Markdown. Returns "" on failure rather than raising; the caller (to_markdown) treats an empty string as nil.
    # File 'lib/rubino/documents/html.rb', line 21

    def to_markdown(html)
      return "" if html.nil? || html.to_s.strip.empty?

      cleaned = strip_noise(html.to_s)
      md = Kramdown::Document.new(
        cleaned,
        input: "html",
        html_to_native: true
      ).to_kramdown
      inline_reference_links(md).strip
    rescue StandardError
      ""
    end
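The empty-string contract can be illustrated without kramdown. The sketch below is an assumption about how a caller might consume it, not the actual calling code: `convert_or_empty` and `markdown_or_nil` are hypothetical helpers, and the lambdas are toy stand-ins for the real conversion step.

```ruby
# Hypothetical stand-in following the same contract as the listing above:
# blank input or any StandardError yields "" instead of an exception.
def convert_or_empty(html)
  return "" if html.nil? || html.to_s.strip.empty?

  yield(html.to_s).strip
rescue StandardError
  ""
end

# Hypothetical caller-side helper: map the empty string to nil.
def markdown_or_nil(html, &converter)
  md = convert_or_empty(html, &converter)
  md.empty? ? nil : md
end

upcase = ->(s) { s.upcase }                       # toy converter that succeeds
broken = ->(_s) { raise ArgumentError, "boom" }   # toy converter that fails

ok      = markdown_or_nil(" <p>hi</p> ", &upcase)
blank   = markdown_or_nil("   ", &upcase)
missing = markdown_or_nil(nil, &upcase)
failed  = markdown_or_nil("<p>hi</p>", &broken)

p [ok, blank, missing, failed]
```

One consequence of this design is that a caller cannot tell a blank document from a failed conversion, since both collapse to the same value.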