Class: Rubino::Documents::Converters::Docx

Inherits:
Object
Defined in:
lib/rubino/documents/converters/docx.rb

Overview

Converts DOCX to Markdown via the `docx` gem (MIT, optional dependency). markitdown gets this “for free” by converting docx -> HTML (mammoth) and then running the result through its HTML core. The Ruby `docx` gem instead hands us paragraphs (each with a style name) and tables, so we map the structure directly:

"Heading 1".."Heading 6"  -> "#".."######"
"Title"                   -> "#"
list paragraphs           -> "- " / "1. "
bold/italic runs          -> "**"/"*"
tables                    -> GFM table via the shared Table emitter

Known limitations (documented in specs): embedded images are dropped, nested tables are flattened, and run-level formatting beyond bold/italic is not preserved.
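
The heading rows of the mapping above can be sketched as a small lookup on the paragraph's style name. This is a minimal illustration only: `markdown_prefix` and `heading_line` are hypothetical names, not methods of this class, and the sketch assumes style names arrive as plain strings such as "Heading 2".

```ruby
# Hypothetical helper (not part of docx.rb): maps a Word paragraph style
# name to the Markdown prefix listed in the mapping table.
def markdown_prefix(style_name)
  name = style_name.to_s.strip
  return "# " if name.casecmp?("Title")

  # "Heading 1".."Heading 6" -> "#".."######"
  match = /\AHeading ([1-6])\z/i.match(name)
  return "#{'#' * match[1].to_i} " if match

  "" # any other style is treated as body text
end

# Hypothetical helper: prefixes paragraph text according to its style.
def heading_line(style_name, text)
  "#{markdown_prefix(style_name)}#{text}"
end

heading_line("Heading 2", "Install") # => "## Install"
heading_line("Title", "User Guide")  # => "# User Guide"
heading_line("Normal", "Body text")  # => "Body text"
```

Anchoring the pattern with `\A` and `\z` keeps custom styles such as "Heading 1 Char" from being promoted to headings by accident.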

Constant Summary

MIMES =
%w[
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
].freeze

Instance Method Summary

Instance Method Details

#accepts?(mime, path) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rubino/documents/converters/docx.rb', line 30

def accepts?(mime, path)
  return true if MIMES.include?(mime.to_s)

  File.extname(path.to_s).downcase == ".docx"
end
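
The check above accepts a file when either the MIME type or the extension matches. The following standalone sketch restates that logic outside the class so its edge cases are easy to try; `docx_candidate?` is a hypothetical name used only for this illustration.

```ruby
# Standalone restatement of the accepts? logic, for illustration only.
def docx_candidate?(mime, path)
  docx_mime = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  return true if mime.to_s == docx_mime

  # Fall back to the file extension, compared case-insensitively.
  File.extname(path.to_s).downcase == ".docx"
end

docx_candidate?("application/octet-stream", "Report.DOCX") # => true (extension match)
docx_candidate?(nil, "notes.txt")                          # => false
docx_candidate?(nil, "archive.docx.bak")                   # => false (extname is ".bak")
```

Because both arguments go through `to_s`, a `nil` MIME type or a `Pathname` path is handled without raising.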

#available? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rubino/documents/converters/docx.rb', line 23

def available?
  require "docx"
  true
rescue LoadError
  false
end
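
The method above is the usual probe for an optional dependency: attempt the `require` and treat `LoadError` as "not installed". A minimal sketch of the same pattern, plus one way a caller might use it, follows. Both `library_available?` and `convert_if_possible` are hypothetical names, and the sketch assumes the caller holds any object that responds to `available?` and `convert`.

```ruby
# Hypothetical generalisation of the probe: true when the named library loads.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# Hypothetical caller-side guard: skip conversion when the optional
# dependency is missing instead of letting LoadError escape from convert.
def convert_if_possible(converter, path)
  return nil unless converter.available?

  converter.convert(path)
end

library_available?("json") # => true (ships with Ruby)
```

Returning `nil` here is one possible design choice; a caller could equally raise a descriptive error that tells the user which gem to install.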

#convert(path) ⇒ Object



# File 'lib/rubino/documents/converters/docx.rb', line 36

def convert(path)
  require "docx"
  doc = ::Docx::Document.open(path)
  blocks = []
  # Paragraphs first (via each_paragraph when the gem exposes it), then all
  # tables appended afterwards (best-effort -- the gem version dictates what's available).
  if doc.respond_to?(:each_paragraph)
    doc.each_paragraph { |p| blocks << paragraph_markdown(p) }
  else
    doc.paragraphs.each { |p| blocks << paragraph_markdown(p) }
  end
  doc.tables.each { |t| blocks << table_markdown(t) } if doc.respond_to?(:tables)
  blocks.compact.reject(&:empty?).join("\n\n")
end
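
The last two steps of `convert` turn tables into GFM and join the non-empty blocks with blank lines. The shared Table emitter and `table_markdown` are not shown on this page, so the sketch below is only one plausible shape for them: `gfm_table` and `join_blocks` are hypothetical names, and the sketch assumes each table has already been extracted as an array of rows, each row an array of cell strings, with the first row used as the header.

```ruby
# Hypothetical GFM table emitter (not the project's shared Table emitter).
def gfm_table(rows)
  return "" if rows.empty?

  width = rows.map(&:length).max
  cells = rows.map do |row|
    # Pad ragged rows so every line has the same number of columns.
    padded = row + [""] * (width - row.length)
    # Escape pipes and collapse newlines so cell text cannot break the table.
    padded.map { |c| c.to_s.gsub("|") { "\\|" }.gsub(/\s*\n\s*/, " ").strip }
  end

  header, *body = cells
  lines = ["| #{header.join(' | ')} |", "| #{(['---'] * width).join(' | ')} |"]
  body.each { |r| lines << "| #{r.join(' | ')} |" }
  lines.join("\n")
end

# Same final step as convert: drop nil and empty blocks, separate with blank lines.
def join_blocks(blocks)
  blocks.compact.reject(&:empty?).join("\n\n")
end

table = gfm_table([["Name", "Qty"], ["a|b", "2"]])
puts join_blocks(["# Inventory", nil, "", table])
```

The block form of `gsub` is used for the pipe escape so the backslash in the replacement is taken literally rather than interpreted as a back-reference.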