Class: Rubino::Documents::Converters::Docx
- Inherits: Object
  - Object
  - Rubino::Documents::Converters::Docx
- Defined in: lib/rubino/documents/converters/docx.rb
Overview
Converts DOCX to Markdown via the `docx` gem (MIT-licensed, optional dependency). markitdown gets this conversion “for free” by going DOCX -> HTML (via mammoth) and then through its HTML core. The Ruby `docx` gem instead exposes paragraphs (each with a style name) and tables, so the converter maps the structure directly:
"Heading 1".."Heading 6" -> "#".."######"
"Title" -> "#"
list paragraphs -> "- " / "1. "
bold/italic runs -> "**"/"*"
tables -> GFM table via the shared Table emitter
Known limitations (documented in specs): embedded images are dropped, nested tables are flattened, and run-level formatting beyond bold/italic is not preserved.
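The mapping above can be sketched in plain Ruby without the gem. This is an illustration only: `style_prefix`, `run_markdown`, and `gfm_table` are hypothetical names, not the converter's private helpers or the shared Table emitter (neither is shown on this page), and list handling is left out.

```ruby
# Hypothetical helpers illustrating the style mapping; not the converter's real code.
HEADING_STYLE = /\AHeading ([1-6])\z/

# Map a paragraph style name to a Markdown prefix ("" for body text).
def style_prefix(style_name)
  name = style_name.to_s
  return "# " if name == "Title"
  if (m = HEADING_STYLE.match(name))
    return "#" * m[1].to_i + " "
  end
  ""
end

# Wrap a run's text in emphasis markers; whitespace-only runs are left alone.
def run_markdown(text, bold: false, italic: false)
  return text if text.strip.empty?
  marker = "#{'**' if bold}#{'*' if italic}"
  "#{marker}#{text}#{marker}"
end

# Render rows (first row = header) as a GFM table, escaping pipes in cells.
def gfm_table(rows)
  return "" if rows.empty?
  cells = rows.map { |row| row.map { |c| c.to_s.gsub("|") { "\\|" } } }
  header, *body = cells
  lines = ["| #{header.join(' | ')} |", "| #{header.map { '---' }.join(' | ')} |"]
  body.each { |row| lines << "| #{row.join(' | ')} |" }
  lines.join("\n")
end
```

A style name outside "Title" and "Heading 1".."Heading 6" yields an empty prefix in this sketch, so the paragraph would be emitted as body text.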
Constant Summary
- MIMES = %w[ application/vnd.openxmlformats-officedocument.wordprocessingml.document ].freeze
Instance Method Summary
- #accepts?(mime, path) ⇒ Boolean
- #available? ⇒ Boolean
- #convert(path) ⇒ Object
Instance Method Details
#accepts?(mime, path) ⇒ Boolean
    # File 'lib/rubino/documents/converters/docx.rb', line 30
    def accepts?(mime, path)
      return true if MIMES.include?(mime.to_s)

      File.extname(path.to_s).downcase == ".docx"
    end
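To see how the two checks interact, here is a standalone copy of the same logic that runs outside the class. The names `DOCX_MIMES` and `docx_candidate?` are hypothetical and exist only for this sketch.

```ruby
# Standalone copy of the acceptance check, for experimentation only.
DOCX_MIMES = %w[
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
].freeze

def docx_candidate?(mime, path)
  # A matching MIME type is sufficient on its own.
  return true if DOCX_MIMES.include?(mime.to_s)

  # Otherwise fall back to a case-insensitive extension check.
  File.extname(path.to_s).downcase == ".docx"
end

p docx_candidate?(DOCX_MIMES.first, "upload.bin")            # MIME match, extension ignored
p docx_candidate?("application/octet-stream", "Report.DOCX") # extension fallback
p docx_candidate?(nil, "notes.txt")                          # neither check matches
```

Because both arguments go through `to_s`, a `nil` MIME type or path does not raise; it simply fails the corresponding check.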
#available? ⇒ Boolean
    # File 'lib/rubino/documents/converters/docx.rb', line 23
    def available?
      require "docx"
      true
    rescue LoadError
      false
    end
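The same optional-dependency guard can be written for any library name. A minimal sketch, assuming a hypothetical helper `library_available?` that is not part of Rubino:

```ruby
# Generic form of the guard: try to load a library and report the result
# as a Boolean instead of letting LoadError propagate.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

p library_available?("time")                        # standard library, expected to load
p library_available?("no_such_library_for_sketch")  # deliberately missing name
```

A caller could use a check like this to skip a converter whose gem is not installed rather than failing at conversion time.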
#convert(path) ⇒ Object
    # File 'lib/rubino/documents/converters/docx.rb', line 36
    def convert(path)
      require "docx"
      doc = ::Docx::Document.open(path)
      blocks = []
      # Iterate document order when the gem exposes it; otherwise paragraphs
      # then tables (best-effort -- the gem version dictates what's available).
      if doc.respond_to?(:each_paragraph)
        doc.each_paragraph { |p| blocks << paragraph_markdown(p) }
      else
        doc.paragraphs.each { |p| blocks << paragraph_markdown(p) }
      end
      doc.tables.each { |t| blocks << table_markdown(t) } if doc.respond_to?(:tables)
      blocks.compact.reject(&:empty?).join("\n\n")
    end
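The block-collection flow in `convert` can be exercised without the gem. In this sketch `FakeDoc` and `join_blocks` are hypothetical stand-ins, and plain strings take the place of the output of `paragraph_markdown` and `table_markdown`, which are not shown on this page.

```ruby
# Stand-in for the gem's document object; exposes only #paragraphs and #tables.
FakeDoc = Struct.new(:paragraphs, :tables)

def join_blocks(doc)
  blocks = []
  # Prefer each_paragraph when the object responds to it, else fall back.
  if doc.respond_to?(:each_paragraph)
    doc.each_paragraph { |p| blocks << p }
  else
    doc.paragraphs.each { |p| blocks << p }
  end
  # Tables are appended after all paragraphs, only if the object exposes them.
  doc.tables.each { |t| blocks << t } if doc.respond_to?(:tables)
  # Drop nil and empty blocks, then separate the rest with blank lines.
  blocks.compact.reject(&:empty?).join("\n\n")
end

doc = FakeDoc.new(["# Title", "", nil, "Body text"], ["| a |\n| --- |\n| 1 |"])
puts join_blocks(doc)
```

Since tables are appended after every paragraph, a table that sits mid-document in the DOCX ends up at the end of the Markdown in this flow.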