Markdownator

Convert files into clean, LLM-friendly Markdown. Point Markdownator at a PDF, Office document, web page, archive, or image and get Markdown back.

It uses a pluggable converter-registry architecture and loads heavy format libraries lazily, so you only install the gems for the formats you actually use.
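To make the registry idea concrete, here is a minimal sketch of how an extension-keyed converter registry could work. It does not show Markdownator's internals: the RegistrySketch module, its Registry class, the UnsupportedFormatError class, and the naive CSV converter are all hypothetical names invented for illustration.

```ruby
require "stringio"

# Hypothetical sketch of an extension-keyed converter registry.
# None of these names come from Markdownator itself.
module RegistrySketch
  class UnsupportedFormatError < StandardError; end

  class Registry
    def initialize
      @converters = {} # normalized extension => callable taking an IO
    end

    # Register one callable for one or more extensions.
    def register(*extensions, &converter)
      extensions.each { |ext| @converters[normalize(ext)] = converter }
      self
    end

    # Look up the converter for the extension and run it on the stream.
    def convert(io, extension:)
      converter = @converters.fetch(normalize(extension)) do
        raise UnsupportedFormatError, "no converter registered for #{extension.inspect}"
      end
      converter.call(io)
    end

    private

    # ".CSV", "csv" and :csv all map to "csv".
    def normalize(ext)
      ext.to_s.downcase.delete_prefix(".")
    end
  end

  # Build a registry with two toy converters.
  def self.default
    registry = Registry.new
    registry.register("txt", "md") { |io| io.read }
    registry.register("csv") do |io|
      # Naive split on commas; a real converter would use a CSV parser.
      header, *body = io.read.lines(chomp: true).map { |line| line.split(",") }
      rows = [header, header.map { "---" }, *body]
      rows.map { |cells| "| #{cells.join(' | ')} |" }.join("\n")
    end
    registry
  end
end

puts RegistrySketch.default.convert(StringIO.new("a,b\n1,2\n"), extension: ".csv")
```

Keying converters by normalized extension keeps each format independent, so adding a format is a single register call in this sketch.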

Supported formats

| Format                | Extensions           | Extra gem required                  |
|-----------------------|----------------------|-------------------------------------|
| Plain text / Markdown | .txt, .md            | none (built in)                     |
| CSV                   | .csv                 | none (built in)                     |
| JSON                  | .json                | none (built in)                     |
| HTML                  | .html, .htm          | reverse_markdown (+ nokogiri)       |
| XML                   | .xml                 | nokogiri                            |
| Word                  | .docx                | rubyzip, nokogiri                   |
| Excel                 | .xlsx                | roo                                 |
| PowerPoint            | .pptx                | rubyzip, nokogiri                   |
| PDF                   | .pdf                 | pdf-reader                          |
| EPUB                  | .epub                | rubyzip, nokogiri, reverse_markdown |
| ZIP (recurses)        | .zip                 | rubyzip                             |
| Images (metadata)     | .jpg, .png, .tiff, … | exifr (for EXIF)                    |

If a required gem is missing, the converter raises Markdownator::MissingDependencyError telling you exactly what to add to your Gemfile.
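The lazy-loading behavior described above can be sketched as a small helper that requires a library only when it is needed and turns LoadError into an actionable message. This is an assumption about the general pattern, not Markdownator's actual code: the LazyDependencySketch module, its require_dependency method, and the message wording are hypothetical, and the error class here is a local stand-in for Markdownator::MissingDependencyError.

```ruby
# Hypothetical sketch of lazy dependency loading; not Markdownator's real code.
module LazyDependencySketch
  # Local stand-in for the gem's MissingDependencyError.
  class MissingDependencyError < StandardError; end

  # Require `library` on demand. If it cannot be loaded, raise an error
  # that names the gem to add to the Gemfile.
  def self.require_dependency(library, gem_name: library, feature: nil)
    require library
    true
  rescue LoadError
    purpose = feature ? " to convert #{feature}" : ""
    raise MissingDependencyError,
          %(The "#{gem_name}" gem is required#{purpose}. Add `gem "#{gem_name}"` to your Gemfile.)
  end
end

# A library that is present loads silently.
LazyDependencySketch.require_dependency("json")

# A library that is absent produces the actionable error.
begin
  LazyDependencySketch.require_dependency("not_a_real_library_xyz", feature: "PDF")
rescue LazyDependencySketch::MissingDependencyError => e
  puts e.message
end
```

Because the require call runs inside the converter rather than at load time, a project that never touches a format never needs that format's gem.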

Installation

gem "markdownator"

Then add the gems for the formats you need, e.g.:

gem "pdf-reader"      # PDF
gem "roo"             # XLSX
gem "rubyzip"         # DOCX, PPTX, EPUB, ZIP
gem "nokogiri"        # HTML, XML, DOCX, PPTX, EPUB
gem "reverse_markdown" # HTML, EPUB
gem "exifr"           # image EXIF

Usage

require "markdownator"

# From a local path — format is detected from the extension.
result = Markdownator.convert("report.pdf")
puts result.markdown
puts result.title     # when the format exposes one (HTML, EPUB)
puts result.  # e.g. { page_count: 12 } for PDF

# From a URL.
Markdownator.convert("https://example.com").markdown

# From an open stream — pass hints via StreamInfo.
File.open("data.csv", "rb") do |io|
  info = Markdownator::StreamInfo.new(extension: "csv")
  Markdownator.convert_stream(io, info).markdown
end

Result#to_s and Result#text_content both return the Markdown, so a result is convenient to print or interpolate directly.
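As a rough illustration of why that is convenient, the sketch below defines a stand-in value object whose to_s and text_content both return the Markdown. ResultSketch::Result is a hypothetical class written for this example; it only mirrors the behavior described above and is not the gem's Result implementation.

```ruby
# Hypothetical stand-in for a conversion result; not the gem's Result class.
module ResultSketch
  Result = Struct.new(:markdown, :title, keyword_init: true) do
    # Alias-style accessor that returns the Markdown body.
    def text_content
      markdown
    end

    # String interpolation and puts call to_s, so both yield the Markdown.
    def to_s
      markdown
    end
  end
end

result = ResultSketch::Result.new(markdown: "# Report\n\nHello.", title: "Report")
puts result                      # prints the Markdown body
summary = "Converted: #{result}" # interpolation also uses to_s
puts summary
```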

Image captioning (optional)

Image conversion emits EXIF metadata by default. To add a natural-language description, pass any object that responds to #caption(io, stream_info) and returns a String:

class ClaudeCaptioner
  def caption(io, stream_info)
    # Send io.read to your vision model (e.g. Claude) and return its description.
  end
end

Markdownator.convert("photo.jpg", captioner: ClaudeCaptioner.new).markdown

No LLM gem is bundled; the hook is off unless you provide a captioner.
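For testing the hook without a vision model, an offline captioner can satisfy the same #caption(io, stream_info) contract. The sketch below is a hypothetical example: ByteCountCaptioner is an invented name, and it assumes, without confirmation from the gem, that stream_info may respond to #extension, falling back gracefully when it does not.

```ruby
require "stringio"

# Hypothetical offline captioner: returns a placeholder description
# instead of calling a vision model.
class ByteCountCaptioner
  def caption(io, stream_info)
    bytes = io.read.to_s.bytesize
    # Assumption: stream_info may expose #extension; tolerate objects that do not.
    ext = stream_info.respond_to?(:extension) ? stream_info.extension : nil
    label = ext ? "#{ext.to_s.upcase} image" : "Image"
    "#{label}, #{bytes} bytes (placeholder caption)"
  end
end

# Exercise the captioner directly with a stand-in for stream info.
FakeInfo = Struct.new(:extension)
puts ByteCountCaptioner.new.caption(StringIO.new("abcd"), FakeInfo.new("jpg"))
```

Since the hook is duck-typed, an instance of this class could be passed as the captioner: argument in the same way as ClaudeCaptioner above.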

Development

After checking out the repo, run bin/setup to install dependencies. Then run rake spec to run the tests, or bin/console for an interactive prompt.

To install this gem onto your local machine, run bundle exec rake install.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/alexrupom/markdownator.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the Markdownator project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.