Class: Rubino::Documents::Converters::Docx
- Inherits:
- Object
- Defined in:
- lib/rubino/documents/converters/docx.rb
Overview
DOCX -> Markdown via the `docx` gem (MIT, OPTIONAL). markitdown gets this “for free” by going docx->HTML (mammoth) then through its HTML core; the Ruby `docx` gem instead hands us paragraphs (with a style name) and tables, so we map the structure directly:
"Heading 1".."Heading 6" -> "#".."######"
"Title" -> "#"
list paragraphs -> "- " / "1. "
bold/italic runs -> "**"/"*"
tables -> GFM table via the shared Table emitter
Known limitations (documented in specs): embedded images are dropped, nested tables are flattened, and run-level formatting beyond bold/italic is not preserved.
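The style mapping listed above can be sketched roughly as follows. This is an illustration only, not the converter's actual code: `markdown_prefix` and `emphasize` are hypothetical helper names, and detecting list paragraphs by a style name starting with "List" (with the caller supplying an `ordered:` flag) is an assumption about how lists might be recognised.

```ruby
# Hypothetical helpers illustrating the style -> Markdown mapping.
# None of these names come from the Rubino source.
HEADING_STYLE = /\AHeading ([1-6])\z/

# Returns the Markdown prefix for a paragraph with the given style name.
def markdown_prefix(style_name, ordered: false)
  case style_name.to_s
  when "Title"        then "# "
  when HEADING_STYLE  then "#{'#' * Regexp.last_match(1).to_i} "
  when /\AList/       then ordered ? "1. " : "- " # assumed list detection
  else ""
  end
end

# Wraps a run's text in Markdown emphasis markers.
def emphasize(text, bold: false, italic: false)
  text = "*#{text}*" if italic
  text = "**#{text}**" if bold
  text
end

puts markdown_prefix("Heading 2") + emphasize("Scope", bold: true)
```

Any other run-level formatting falls through untouched in this sketch, which matches the limitation noted above that only bold and italic are preserved.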
Constant Summary
- MIMES =
%w[ application/vnd.openxmlformats-officedocument.wordprocessingml.document ].freeze
Instance Method Summary
- #accepts?(mime, path) ⇒ Boolean
- #available? ⇒ Boolean
- #convert(path, budget = Limits.null_budget) ⇒ Object
Instance Method Details
#accepts?(mime, path) ⇒ Boolean
# File 'lib/rubino/documents/converters/docx.rb', line 30

def accepts?(mime, path)
  return true if MIMES.include?(mime.to_s)
  File.extname(path.to_s).downcase == ".docx"
end
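To show how the two checks interact, here is a standalone replica of the logic above. `DOCX_MIMES` and `docx_accepts?` are hypothetical names used so the snippet runs without loading Rubino; the body mirrors the listed source.

```ruby
# Standalone replica of the accepts? logic (names are illustrative).
DOCX_MIMES = %w[
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
].freeze

def docx_accepts?(mime, path)
  # A matching MIME type wins regardless of the file name.
  return true if DOCX_MIMES.include?(mime.to_s)
  # Otherwise fall back to a case-insensitive extension check.
  File.extname(path.to_s).downcase == ".docx"
end

p docx_accepts?(DOCX_MIMES.first, "upload.bin")               # MIME match
p docx_accepts?("application/octet-stream", "Report.DOCX")    # extension fallback
p docx_accepts?(nil, "notes.txt")                             # neither matches
```

Because both arguments go through `to_s`, a `nil` MIME type or path is tolerated and never raises.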
#available? ⇒ Boolean
# File 'lib/rubino/documents/converters/docx.rb', line 23

def available?
  require "docx"
  true
rescue LoadError
  false
end
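The method above is the usual optional-dependency probe: attempt the `require` and treat `LoadError` as "not installed". A generic sketch of the same pattern follows; `gem_available?` is a hypothetical helper, and the library names are placeholders chosen so the snippet runs on a stock Ruby.

```ruby
# Generic form of the optional-dependency probe (hypothetical helper).
def gem_available?(name)
  require name
  true
rescue LoadError
  # The library is not installed; report that instead of raising.
  false
end

# "set" ships with Ruby; the second name is deliberately bogus.
p gem_available?("set")
p gem_available?("no_such_gem_rubino_example")
```

A caller can check `available?` before dispatching to the converter and fall back to another strategy when the `docx` gem is missing.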
#convert(path, budget = Limits.null_budget) ⇒ Object
# File 'lib/rubino/documents/converters/docx.rb', line 36

def convert(path, budget = Limits.null_budget)
  require "docx"
  # PRE-OPEN guard: Docx::Document.open reads the whole (decompressed)
  # word/document*.xml and builds the full Nokogiri DOM before yielding a
  # paragraph, so a zip-expand bomb's RSS is paid at open(). Sum the
  # uncompressed sizes of every entry under word/ from the central
  # directory first and bail to the shell-hint before the gem inflates
  # anything. `word/**` matches across `/` (guard_zip! globs without
  # FNM_PATHNAME) so a nested bomb is summed too (#337).
  Limits.guard_zip!(path, budget, ["word/**"])

  doc = ::Docx::Document.open(path)
  blocks = []

  # Iterate document order when the gem exposes it; otherwise paragraphs
  # then tables (best-effort -- the gem version dictates what's available).
  # budget.tick per paragraph bails a paragraph bomb (1M <w:p>) DURING
  # iteration -- before the 34 MB of XML is fully materialised to text.
  if doc.respond_to?(:each_paragraph)
    doc.each_paragraph { |p| blocks << emit_paragraph(p, budget) }
  else
    doc.paragraphs.each { |p| blocks << emit_paragraph(p, budget) }
  end

  if doc.respond_to?(:tables)
    doc.tables.each do |t|
      budget.tick
      blocks << table_markdown(t, budget)
    end
  end

  blocks.compact.reject(&:empty?).join("\n\n")
end
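The pre-open guard described in the comments can be illustrated with a minimal sketch. It is not the implementation of `Limits.guard_zip!`, whose internals are not shown here: `ZipEntryInfo`, `ExpansionTooLarge`, and `guarded_uncompressed_bytes` are hypothetical names, and the entry list is plain data standing in for what a zip central directory would report. The sketch only demonstrates summing uncompressed sizes for entries matched by a glob without `FNM_PATHNAME`.

```ruby
# Hypothetical stand-in for one zip central-directory record.
ZipEntryInfo = Struct.new(:name, :uncompressed_size)

# Hypothetical error raised when the archive would expand too far.
class ExpansionTooLarge < StandardError; end

# Sums the declared uncompressed sizes of entries matching any glob and
# raises before anything would be inflated if the total exceeds max_bytes.
def guarded_uncompressed_bytes(entries, globs, max_bytes)
  total = entries
    .select { |e| globs.any? { |g| File.fnmatch(g, e.name) } } # no FNM_PATHNAME
    .sum(&:uncompressed_size)
  if total > max_bytes
    raise ExpansionTooLarge, "#{total} bytes exceeds limit of #{max_bytes}"
  end
  total
end

entries = [
  ZipEntryInfo.new("word/document.xml", 2_000),
  ZipEntryInfo.new("word/media/deep/blob.bin", 5_000), # nested under word/
  ZipEntryInfo.new("docProps/core.xml", 300)           # outside word/
]

p guarded_uncompressed_bytes(entries, ["word/**"], 10_000)
```

Without `File::FNM_PATHNAME`, `*` is allowed to match `/`, so the nested `word/media/deep/blob.bin` entry is counted; with that flag set, the same glob would stop at the first path separator and the nested entry would slip past the sum.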