Module: HtmlToMarkdown

Defined in:
lib/html_to_markdown.rb,
lib/html_to_markdown/cli.rb,
lib/html_to_markdown/version.rb,
lib/html_to_markdown/cli_proxy.rb

Defined Under Namespace

Modules: CLI, CLIProxy

Classes: Options

Constant Summary

VERSION = '2.25.0'

Class Method Summary

Class Method Details

.convert(html, options = nil, visitor = nil) ⇒ Object



# File 'lib/html_to_markdown.rb', line 25

def convert(html, options = nil, visitor = nil)
  if visitor
    native_convert_with_visitor(html.to_s, options, visitor)
  else
    native_convert(html.to_s, options)
  end
end
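
The dispatch above can be tried in isolation. The following is a minimal sketch, not the library itself: `ConvertDispatchSketch` and its two `native_*` stubs are hypothetical stand-ins for the native extension, written only to show which branch runs and that `html.to_s` coerces `nil` and non-String input before the native call.

```ruby
# Stand-in module that copies only the dispatch logic of HtmlToMarkdown.convert.
# The native_* methods below are hypothetical stubs, not the real extension.
module ConvertDispatchSketch
  module_function

  # Stub: reports that the plain path was taken and what it received.
  def native_convert(html, options)
    [:plain, html, options]
  end

  # Stub: reports that the visitor path was taken and what it received.
  def native_convert_with_visitor(html, options, _visitor)
    [:visitor, html, options]
  end

  # Same branching as the method shown above.
  def convert(html, options = nil, visitor = nil)
    if visitor
      native_convert_with_visitor(html.to_s, options, visitor)
    else
      native_convert(html.to_s, options)
    end
  end
end

# nil is coerced to an empty string by to_s before reaching the stub.
plain = ConvertDispatchSketch.convert(nil)

# Any truthy third argument selects the visitor path.
with_visitor = ConvertDispatchSketch.convert(:"<p>hi</p>", { wrap: true }, Object.new)

p plain
p with_visitor
```

What a real visitor object must implement is not described in this listing, so the sketch passes a bare `Object.new` purely to trigger the branch.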

.convert_with_inline_images(html, options = nil, image_config = nil, _visitor = nil) ⇒ Object



# File 'lib/html_to_markdown.rb', line 37

def convert_with_inline_images(html, options = nil, image_config = nil, _visitor = nil)
  # NOTE: visitor parameter is accepted for API compatibility but not used in inline images mode
  # The visitor pattern is only supported in the standard convert() method
  native_convert_with_inline_images(html.to_s, options, image_config)
end

.convert_with_inline_images_handle(html, options_handle, image_config = nil) ⇒ Object



# File 'lib/html_to_markdown.rb', line 43

def convert_with_inline_images_handle(html, options_handle, image_config = nil)
  native_convert_with_inline_images_handle(html.to_s, options_handle, image_config)
end

.convert_with_metadata(html, options = nil, metadata_config = nil, _visitor = nil) ⇒ Array<String, Hash>

Convert HTML to Markdown with comprehensive metadata extraction.

Performs HTML-to-Markdown conversion while extracting document metadata, headers, links, images, and structured data in a single pass. Ideal for content analysis, SEO workflows, and document indexing.

Examples:

Basic usage

html = <<~HTML
  <html lang="en">
    <head>
      <title>My Article</title>
      <meta name="description" content="A great read">
    </head>
    <body>
      <h1 id="intro">Introduction</h1>
      <p>Visit <a href="https://example.com">our site</a></p>
      <img src="photo.jpg" alt="Beautiful landscape">
    </body>
  </html>
HTML

markdown, metadata = HtmlToMarkdown.convert_with_metadata(html)

puts metadata[:document][:title]  # => "My Article"
puts metadata[:document][:language]  # => "en"
puts metadata[:headers].length  # => 1
puts metadata[:headers][0][:text]  # => "Introduction"
puts metadata[:links].length  # => 1
puts metadata[:images].length  # => 1

With selective metadata extraction

config = {
  extract_headers: true,
  extract_links: true,
  extract_images: false,      # Skip images
  extract_structured_data: false  # Skip structured data
}

markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, nil, config)
puts metadata[:images].empty?  # => true (not extracted)

With conversion options

options = {
  heading_style: "atx",     # Use # H1, ## H2 style
  wrap: true,
  wrap_width: 80
}

config = { extract_headers: true }

markdown, metadata = HtmlToMarkdown.convert_with_metadata(html, options, config)
# Markdown uses ATX-style headings and wraps at 80 characters

Parameters:

  • html (String)

    HTML string to convert. Line endings are normalized (CRLF -> LF).

  • options (ConversionOptions, Hash, nil) (defaults to: nil)

    Optional conversion configuration. When a Hash, keys should match ConversionOptions field names (as symbols or strings). Common options:

    • :heading_style [String] “atx”, “atx_closed”, or “underlined” (default: “underlined”)

    • :list_indent_type [String] “spaces” or “tabs” (default: “spaces”)

    • :list_indent_width [Integer] Spaces per indent level (default: 4)

    • :wrap [true, false] Enable text wrapping (default: false)

    • :wrap_width [Integer] Wrap at this column width (default: 80)

    See ConversionOptions documentation for complete list.

  • metadata_config (Hash, nil) (defaults to: nil)

    Optional metadata extraction configuration. Keys should be symbols or strings. Supported keys:

    • :extract_headers [true, false] Extract h1-h6 heading elements (default: true)

    • :extract_links [true, false] Extract hyperlinks with type classification (default: true)

    • :extract_images [true, false] Extract image elements (default: true)

    • :extract_structured_data [true, false] Extract JSON-LD/Microdata/RDFa (default: true)

    • :max_structured_data_size [Integer] Size limit for structured data in bytes (default: 1_000_000)
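
A caller who wants an explicit, validated `metadata_config` could start from the documented defaults. The sketch below is plain Ruby and does not call the library; `MetadataConfigSketch` and its `build` method are hypothetical names, and the default values are copied from the list above rather than read from the gem.

```ruby
# Hypothetical helper, not part of HtmlToMarkdown.
module MetadataConfigSketch
  # Defaults as listed in the parameter documentation above.
  DEFAULTS = {
    extract_headers: true,
    extract_links: true,
    extract_images: true,
    extract_structured_data: true,
    max_structured_data_size: 1_000_000
  }.freeze

  # Accepts symbol or string keys, rejects keys that are not documented,
  # and fills every missing key with its documented default.
  def self.build(overrides = {})
    normalized = overrides.transform_keys(&:to_sym)
    unknown = normalized.keys - DEFAULTS.keys
    raise ArgumentError, "unknown metadata_config keys: #{unknown.inspect}" unless unknown.empty?

    DEFAULTS.merge(normalized)
  end
end

config = MetadataConfigSketch.build({ "extract_images" => false, extract_structured_data: false })
p config
```

The resulting hash could then be passed as the third argument to `convert_with_metadata`. Rejecting unknown keys up front is a design choice of this sketch; how the library itself treats unrecognized keys is not stated here.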

Returns:

  • (Array<String, Hash>)

    Tuple of [markdown_string, metadata_hash].

    markdown_string: String - The converted Markdown output

    metadata_hash: Hash with keys:

    • :document [Hash] Document-level metadata:

      • :title [String, nil] From <title> tag

      • :description [String, nil] From <meta name="description">

      • :keywords [Array<String>] From <meta name="keywords">

      • :author [String, nil] From <meta name="author">

      • :language [String, nil] From lang attribute (e.g., "en")

      • :text_direction [String, nil] "ltr", "rtl", or "auto"

      • :canonical_url [String, nil] From <link rel="canonical">

      • :base_href [String, nil] From <base href="">

      • :open_graph [Hash<String, String>] Open Graph properties (og:* meta tags)

      • :twitter_card [Hash<String, String>] Twitter Card properties (twitter:* meta tags)

      • :meta_tags [Hash<String, String>] Other meta tags

    • :headers [Array<Hash>] Heading elements:

      • :level [Integer] 1-6

      • :text [String] Header text content

      • :id [String, nil] HTML id attribute

      • :depth [Integer] Tree nesting depth

      • :html_offset [Integer] Byte offset in original HTML

    • :links [Array<Hash>] Hyperlinks:

      • :href [String] Link URL

      • :text [String] Link text content

      • :title [String, nil] Title attribute

      • :link_type [String] “anchor”, “internal”, “external”, “email”, “phone”, or “other”

      • :rel [Array<String>] Rel attribute values

      • :attributes [Hash<String, String>] Additional HTML attributes

    • :images [Array<Hash>] Image elements:

      • :src [String] Image source URL or data URI

      • :alt [String, nil] Alt text for accessibility

      • :title [String, nil] Title attribute

      • :dimensions [Array<Integer>, nil] [width, height] if available

      • :image_type [String] “data_uri”, “external”, “relative”, or “inline_svg”

      • :attributes [Hash<String, String>] Additional HTML attributes

    • :structured_data [Array<Hash>] Structured data blocks:

      • :data_type [String] “json_ld”, “microdata”, or “rdfa”

      • :raw_json [String] Raw JSON content

      • :schema_type [String, nil] Schema type (e.g., “Article”, “Event”)
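
The hash layout described above lends itself to ordinary Enumerable processing. The following sketch uses a hand-written sample hash shaped like the documented return value (the offsets and the second heading are invented for illustration); real output comes from `convert_with_metadata` and may differ in detail.

```ruby
# Hand-written sample mirroring the documented structure; values are illustrative.
metadata = {
  document: { title: "My Article", language: "en" },
  headers: [
    { level: 1, text: "Introduction", id: "intro", depth: 0, html_offset: 120 },
    { level: 2, text: "Background", id: nil, depth: 1, html_offset: 310 }
  ],
  links: [
    { href: "https://example.com", text: "our site", title: nil,
      link_type: "external", rel: [], attributes: {} },
    { href: "#intro", text: "top", title: nil,
      link_type: "anchor", rel: [], attributes: {} }
  ],
  images: [],
  structured_data: []
}

# Build an indented outline from the heading levels, appending the id when present.
outline = metadata[:headers].map do |header|
  indent = "  " * (header[:level] - 1)
  anchor = header[:id] ? " (##{header[:id]})" : ""
  "#{indent}- #{header[:text]}#{anchor}"
end

# Collect the URLs of links classified as external.
external_links = metadata[:links]
  .select { |link| link[:link_type] == "external" }
  .map { |link| link[:href] }

puts outline
puts external_links
```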

Raises:

  • (StandardError)

    If conversion fails or the configuration is invalid

See Also:

  • Simple conversion without metadata
  • Extract inline images during conversion
  • Detailed conversion configuration


# File 'lib/html_to_markdown.rb', line 173

def convert_with_metadata(html, options = nil, metadata_config = nil, _visitor = nil)
  # NOTE: visitor parameter is accepted for API compatibility but not used in metadata extraction mode
  # The visitor pattern is only supported in the standard convert() method
  native_convert_with_metadata(html.to_s, options, metadata_config)
end

.convert_with_metadata_handle(html, options_handle, metadata_config = nil) ⇒ Object



# File 'lib/html_to_markdown.rb', line 179

def convert_with_metadata_handle(html, options_handle, metadata_config = nil)
  native_convert_with_metadata_handle(html.to_s, options_handle, metadata_config)
end

.convert_with_options(html, options_handle) ⇒ Object



# File 'lib/html_to_markdown.rb', line 33

def convert_with_options(html, options_handle)
  native_convert_with_options(html.to_s, options_handle)
end

.native_convert ⇒ Object

# File 'lib/html_to_markdown.rb', line 13

alias native_convert convert

.native_convert_with_inline_images ⇒ Object

# File 'lib/html_to_markdown.rb', line 14

alias native_convert_with_inline_images convert_with_inline_images

.native_convert_with_inline_images_handle ⇒ Object

# File 'lib/html_to_markdown.rb', line 15

alias native_convert_with_inline_images_handle convert_with_inline_images_handle

.native_convert_with_metadata ⇒ Object

# File 'lib/html_to_markdown.rb', line 18

alias native_convert_with_metadata convert_with_metadata

.native_convert_with_metadata_handle ⇒ Object

# File 'lib/html_to_markdown.rb', line 19

alias native_convert_with_metadata_handle convert_with_metadata_handle

.native_convert_with_options ⇒ Object

# File 'lib/html_to_markdown.rb', line 17

alias native_convert_with_options convert_with_options

.native_options ⇒ Object

# File 'lib/html_to_markdown.rb', line 16

alias native_options options

.options(options_hash = nil) ⇒ Object



# File 'lib/html_to_markdown.rb', line 47

def options(options_hash = nil)
  native_options(options_hash)
end
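
The method names suggest that `options` builds a handle once and that `convert_with_options` accepts it on each call. Under that assumption, which this listing does not confirm, a batch helper might look like the sketch below. `convert_batch` and `StubConverter` are hypothetical; the stub exists only so the example runs without the native extension.

```ruby
# Hypothetical helper: builds the options handle once and reuses it per document.
# `converter` must respond to .options and .convert_with_options,
# as HtmlToMarkdown does in the listing above.
def convert_batch(converter, documents, options_hash)
  handle = converter.options(options_hash)
  documents.map { |html| converter.convert_with_options(html, handle) }
end

# Stub converter standing in for HtmlToMarkdown; it performs no real conversion.
module StubConverter
  @options_calls = 0

  class << self
    attr_reader :options_calls

    # Counts calls so the reuse of a single handle is observable.
    def options(options_hash = nil)
      @options_calls += 1
      (options_hash || {}).dup.freeze
    end

    # Returns a description of its input instead of Markdown.
    def convert_with_options(html, options_handle)
      "converted(#{html.to_s.length} chars, #{options_handle.size} options)"
    end
  end
end

results = convert_batch(StubConverter, ["<p>a</p>", "<h1>b</h1>"], { heading_style: "atx" })
puts results
puts StubConverter.options_calls
```

With the gem loaded, the same helper would be called as `convert_batch(HtmlToMarkdown, documents, { heading_style: "atx" })`, again assuming the value returned by `options` is what `convert_with_options` expects.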