Class: Rubino::Documents::Converters::Docx

Inherits:
Object
Defined in:
lib/rubino/documents/converters/docx.rb

Overview

DOCX -> Markdown via the `docx` gem (MIT, OPTIONAL). markitdown gets this “for free” by going docx->HTML (mammoth) then through its HTML core; the Ruby `docx` gem instead hands us paragraphs (with a style name) and tables, so we map the structure directly:

"Heading 1".."Heading 6"  -> "#".."######"
"Title"                   -> "#"
list paragraphs           -> "- " / "1. "
bold/italic runs          -> "**"/"*"
tables                    -> GFM table via the shared Table emitter

Known limitations (documented in specs): embedded images are dropped, nested tables are flattened, and run-level formatting beyond bold/italic is not preserved.
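The heading and run mappings listed above can be sketched in a few lines. This is a minimal illustration, not the converter's actual code: `heading_prefix` and `emphasize` are hypothetical helper names, and the sketch assumes the style name arrives as a plain string such as "Heading 2".

```ruby
# Hypothetical helper (not part of the documented class): map a Word
# paragraph style name to a Markdown heading prefix, or nil when the
# style is not a heading.
def heading_prefix(style_name)
  return "#" if style_name == "Title"

  match = /\AHeading ([1-6])\z/.match(style_name.to_s)
  match ? "#" * match[1].to_i : nil
end

# Hypothetical helper: wrap run text in Markdown emphasis markers.
# Italic is applied first so bold+italic nests as ***text***.
def emphasize(text, bold: false, italic: false)
  text = "*#{text}*" if italic
  text = "**#{text}**" if bold
  text
end
```

Anchoring the pattern with `\A` and `\z` keeps styles such as "Heading 7" or "Heading 1 Char" from being treated as headings; whether the real converter is that strict is not stated in the source.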

Constant Summary

MIMES =
%w[
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
].freeze

Instance Method Summary

Instance Method Details

#accepts?(mime, path) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rubino/documents/converters/docx.rb', line 30

def accepts?(mime, path)
  return true if MIMES.include?(mime.to_s)

  File.extname(path.to_s).downcase == ".docx"
end
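To make the two checks concrete, here is a standalone sketch that copies the logic of `accepts?` into a free function so it can be run without the Rubino classes. `DOCX_MIMES` and `docx_like?` are stand-in names used only for this illustration.

```ruby
# Stand-in constant mirroring MIMES from the class above.
DOCX_MIMES = %w[
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
].freeze

# Stand-in for accepts?: a MIME match wins outright; otherwise fall back
# to a case-insensitive check of the file extension.
def docx_like?(mime, path)
  return true if DOCX_MIMES.include?(mime.to_s)

  File.extname(path.to_s).downcase == ".docx"
end

docx_like?(nil, "Report.DOCX")                      # extension fallback
docx_like?(DOCX_MIMES.first, "upload-without-ext")  # MIME match
docx_like?("application/msword", "legacy.doc")      # neither check passes
```

Because both arguments go through `to_s`, a `nil` MIME type or path falls through to `false` rather than raising.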

#available? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rubino/documents/converters/docx.rb', line 23

def available?
  require "docx"
  true
rescue LoadError
  false
end
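The same optional-dependency probe can be written generically. The sketch below is an illustration only: `feature_available?` is a hypothetical name, and the source does not show how callers use `available?`.

```ruby
# Hypothetical generic form of the probe above: returns true when the
# named feature can be required, false when it is not installed.
def feature_available?(feature)
  require feature
  true
rescue LoadError
  false
end

feature_available?("json")                        # stdlib, normally present
feature_available?("no_such_gem_rubino_example")  # made-up name, not installed
```

Rescuing only `LoadError` means a gem that is installed but fails for another reason while loading still raises, instead of being silently reported as unavailable.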

#convert(path, budget = Limits.null_budget) ⇒ Object



# File 'lib/rubino/documents/converters/docx.rb', line 36

def convert(path, budget = Limits.null_budget)
  require "docx"
  # PRE-OPEN guard: Docx::Document.open reads the whole (decompressed)
  # word/document*.xml and builds the full Nokogiri DOM before yielding a
  # paragraph, so a zip-expand bomb's RSS is paid at open(). Sum the
  # uncompressed sizes of every entry under word/ from the central
  # directory first and bail to the shell-hint before the gem inflates
  # anything. `word/**` matches across `/` (guard_zip! globs without
  # FNM_PATHNAME) so a nested bomb is summed too (#337).
  Limits.guard_zip!(path, budget, ["word/**"])
  doc = ::Docx::Document.open(path)
  blocks = []
  # Iterate document order when the gem exposes it; otherwise paragraphs
  # then tables (best-effort -- the gem version dictates what's available).
  # budget.tick per paragraph bails a paragraph bomb (1M <w:p>) DURING
  # iteration -- before the 34 MB of XML is fully materialised to text.
  if doc.respond_to?(:each_paragraph)
    doc.each_paragraph { |p| blocks << emit_paragraph(p, budget) }
  else
    doc.paragraphs.each { |p| blocks << emit_paragraph(p, budget) }
  end
  if doc.respond_to?(:tables)
    doc.tables.each do |t|
      budget.tick
      blocks << table_markdown(t, budget)
    end
  end
  blocks.compact.reject(&:empty?).join("\n\n")
end
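The pre-open guard described in the comments can be sketched without a zip library. This is not the implementation of `Limits.guard_zip!`, whose internals are not shown in the source. `Entry`, `ZipTooLarge`, and `guard_entries!` are hypothetical names, and the entry list is built by hand where a real guard would read names and declared sizes from the archive's central directory.

```ruby
# Hypothetical stand-in for a central-directory entry: the entry name
# and its declared uncompressed size in bytes.
Entry = Struct.new(:name, :size)

# Hypothetical error raised when the declared size exceeds the limit.
class ZipTooLarge < StandardError; end

# Sum the declared uncompressed sizes of all entries matching any glob
# and raise if the total exceeds max_bytes. File.fnmatch is called
# without File::FNM_PATHNAME, so "*" also matches "/" and "word/**"
# covers nested paths.
def guard_entries!(entries, globs, max_bytes)
  total = entries
          .select { |e| globs.any? { |g| File.fnmatch(g, e.name) } }
          .sum(&:size)
  raise ZipTooLarge, "#{total} bytes exceeds #{max_bytes}" if total > max_bytes

  total
end

entries = [
  Entry.new("word/document.xml", 100),
  Entry.new("word/embeddings/deep/bomb.xml", 900),
  Entry.new("docProps/app.xml", 5_000)
]
guard_entries!(entries, ["word/**"], 2_000) # sums only the two word/ entries
```

Only declared sizes are summed, so nothing is inflated before the decision is made. The per-paragraph `budget.tick` in `convert` then covers the separate case where the XML fits the size limit but contains an excessive number of paragraphs.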