Class: Uniword::Mhtml::Document
Inherits: Lutaml::Model::Serializable
- Object
- Lutaml::Model::Serializable
- Uniword::Mhtml::Document
Defined in: lib/uniword/mhtml/document.rb
Overview
MHTML Document — top-level model for .mht/.mhtml/.doc files.
This is COMPLETELY SEPARATE from OOXML Wordprocessingml::DocumentRoot. MHTML uses MIME multipart format with HTML content, not ZIP + OOXML XML.
Structure:
Mhtml::Document
├── html_part (HtmlPart) — main document HTML
├── parts[] (MimePart) — all MIME parts (images, XML, theme, etc.)
├── document_properties (Metadata::DocumentProperties)
├── word_document_settings (Metadata::WordDocumentSettings)
└── filelist_xml (String)
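To make the "MIME multipart, not ZIP" distinction concrete, here is a minimal sketch that splits a tiny hand-written multipart message into parts using plain Ruby. The boundary string, the sample input, and the `split_parts` helper are all assumptions made for illustration; this is not Uniword's parser, and real Word output carries more headers and encodings.

```ruby
# Hypothetical, hand-written MHTML-like input (illustration only).
boundary = "demo_boundary"

raw = [
  "MIME-Version: 1.0",
  "Content-Type: multipart/related; boundary=\"#{boundary}\"",
  "",
  "--#{boundary}",
  "Content-Type: text/html",
  "Content-Location: file:///C:/demo/doc.htm",
  "",
  "<html><body><p>Hello</p></body></html>",
  "--#{boundary}",
  "Content-Type: text/xml",
  "Content-Location: file:///C:/demo/doc_files/filelist.xml",
  "",
  "<xml/>",
  "--#{boundary}--",
  ""
].join("\r\n")

# Split on the boundary delimiter, drop the preamble (top-level headers)
# and the closing "--" marker, then separate each part's headers from its body.
def split_parts(raw, boundary)
  chunks = raw.split("--#{boundary}")
  chunks[1..].reject { |c| c.start_with?("--") }.map do |chunk|
    headers, body = chunk.strip.split("\r\n\r\n", 2)
    content_type = headers[/^Content-Type:\s*([^;\r\n]+)/i, 1]
    { content_type: content_type, body: body.to_s }
  end
end

parts = split_parts(raw, boundary)
parts.each { |part| puts "#{part[:content_type]}: #{part[:body].length} bytes" }
```

In the model above, the first `text/html` part would correspond to `html_part`, and every part (including that one) would appear in `parts[]`.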
Instance Method Summary
- #add_part(part) ⇒ Object
  Add a MIME part.
- #body_html ⇒ Object
  Body inner HTML.
- #color_scheme_mapping_part ⇒ XmlPart?
  Color scheme mapping part.
- #color_scheme_mapping_xml ⇒ String?
  Color scheme mapping XML.
- #css_styles ⇒ Object
  CSS styles from HTML head.
- #filelist_part ⇒ XmlPart?
  Filelist XML part.
- #filelist_xml ⇒ String?
  Filelist XML content.
- #footer_html ⇒ String?
  Footer HTML (placeholder).
- #header_footer_parts ⇒ Array<HeaderFooterPart>
  Header/footer HTML parts.
- #header_html ⇒ String?
  Header HTML.
- #html ⇒ HtmlPart
  The main HTML part.
- #image_parts ⇒ Array<ImagePart>
  All image parts.
- #images ⇒ Hash
  Images as filename => decoded data.
- #inspect ⇒ Object
  Build a summary of the document structure.
- #placeholder_html ⇒ String?
  Placeholder header HTML.
- #raw_html ⇒ Object
  Raw HTML string of the main HTML part.
- #raw_html=(value) ⇒ Object
- #text ⇒ Object
  Text content (stripped of HTML tags).
- #theme_part ⇒ ThemePart?
  Theme data part.
- #xml_parts ⇒ Array<XmlPart>
  All XML parts.
Instance Method Details
#add_part(part) ⇒ Object
Add a MIME part. Returns self, so calls can be chained.

# File 'lib/uniword/mhtml/document.rb', line 154
def add_part(part)
  parts << part
  self
end
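Because `add_part` returns the receiver, calls can be chained. The sketch below shows that pattern with a stand-in class; `DemoDocument` and the symbol "parts" are hypothetical and only mirror the shape of the documented method.

```ruby
# Stand-in for illustration only; not the real Uniword::Mhtml::Document.
class DemoDocument
  attr_reader :parts

  def initialize
    @parts = []
  end

  # Same shape as the documented method: append, then return the receiver.
  def add_part(part)
    parts << part
    self
  end
end

doc = DemoDocument.new
doc.add_part(:html).add_part(:image).add_part(:theme)
puts doc.parts.inspect
```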
#body_html ⇒ Object
Body inner HTML, or nil when there is no HTML part.

# File 'lib/uniword/mhtml/document.rb', line 56
def body_html
  html_part&.body_html
end
#color_scheme_mapping_part ⇒ XmlPart?
Returns the color scheme mapping part: the first XmlPart whose filename contains "colorschememapping", or nil if there is none.

# File 'lib/uniword/mhtml/document.rb', line 109
def color_scheme_mapping_part
  parts.find do |p|
    p.is_a?(XmlPart) && p.filename&.include?("colorschememapping")
  end
end
#color_scheme_mapping_xml ⇒ String?
Returns the color scheme mapping XML, or nil if there is no such part.

# File 'lib/uniword/mhtml/document.rb', line 116
def color_scheme_mapping_xml
  color_scheme_mapping_part&.decoded_content
end
#css_styles ⇒ Object
CSS styles from the HTML head, or nil when there is no HTML part.

# File 'lib/uniword/mhtml/document.rb', line 61
def css_styles
  html_part&.css_styles
end
#filelist_part ⇒ XmlPart?
Returns the filelist XML part: the first XmlPart whose filename is exactly "filelist.xml", or nil if there is none.

# File 'lib/uniword/mhtml/document.rb', line 99
def filelist_part
  parts.find { |p| p.is_a?(XmlPart) && p.filename == "filelist.xml" }
end
#filelist_xml ⇒ String?
Returns the filelist XML content, or nil if there is no filelist part.

# File 'lib/uniword/mhtml/document.rb', line 104
def filelist_xml
  filelist_part&.decoded_content
end
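The part lookups above share one pattern: `find` by class and filename, then safe navigation (`&.`) so a missing part or a nil filename yields nil instead of raising. The following sketch reproduces that pattern with stand-in structs; `DemoXmlPart` and `DemoOtherPart` are hypothetical names, not Uniword classes.

```ruby
# Hypothetical stand-ins; the real part classes live in Uniword::Mhtml.
DemoXmlPart   = Struct.new(:filename, :decoded_content)
DemoOtherPart = Struct.new(:filename, :decoded_content)

parts = [
  DemoOtherPart.new("filelist.xml", "wrong class, ignored"),
  DemoXmlPart.new(nil, "<anonymous/>"), # a part with no filename
  DemoXmlPart.new("colorschememapping.xml", "<a:clrMap/>")
]

# Exact filename match, as in filelist_part.
filelist = parts.find { |p| p.is_a?(DemoXmlPart) && p.filename == "filelist.xml" }

# Substring match, as in color_scheme_mapping_part; `&.` skips nil filenames.
mapping = parts.find do |p|
  p.is_a?(DemoXmlPart) && p.filename&.include?("colorschememapping")
end

filelist_xml = filelist&.decoded_content # nil: no matching XML part
mapping_xml  = mapping&.decoded_content

puts filelist_xml.inspect
puts mapping_xml.inspect
```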
#footer_html ⇒ String?
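Returns the footer HTML (placeholder): the decoded content of the first header/footer part whose filename contains "footer", or nil if there is none.

# File 'lib/uniword/mhtml/document.rb', line 133
def footer_html
  header_footer_parts.find do |p|
    p.filename&.include?("footer")
  end&.decoded_content
end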
#header_footer_parts ⇒ Array<HeaderFooterPart>
Returns the header/footer HTML parts.

# File 'lib/uniword/mhtml/document.rb', line 121
def header_footer_parts
  parts.grep(HeaderFooterPart)
end
#header_html ⇒ String?
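Returns the header HTML: the decoded content of the first header/footer part whose filename contains "header", or nil if there is none.

# File 'lib/uniword/mhtml/document.rb', line 126
def header_html
  header_footer_parts.find do |p|
    p.filename&.include?("header")
  end&.decoded_content
end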
#html ⇒ HtmlPart
Returns the main HTML part.

# File 'lib/uniword/mhtml/document.rb', line 39
def html
  @html_part
end
#image_parts ⇒ Array<ImagePart>
Returns all image parts.

# File 'lib/uniword/mhtml/document.rb', line 89
def image_parts
  parts.grep(ImagePart)
end
#images ⇒ Hash
Returns images as a Hash of filename => decoded data. Image parts without a filename are skipped.

# File 'lib/uniword/mhtml/document.rb', line 147
def images
  image_parts.each_with_object({}) do |part, hash|
    hash[part.filename] = part.decoded_content if part.filename
  end
end
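`image_parts` and `images` combine two standard Enumerable idioms: `grep` with a class (which selects by `Class#===`, that is, by type) and `each_with_object` to accumulate a Hash. A minimal sketch with stand-in structs follows; `DemoImagePart` and `DemoTextPart` are hypothetical and carry only the two attributes the documented code reads.

```ruby
# Hypothetical stand-ins for illustration; not Uniword classes.
DemoImagePart = Struct.new(:filename, :decoded_content)
DemoTextPart  = Struct.new(:filename, :decoded_content)

parts = [
  DemoImagePart.new("image001.png", "png-bytes"),
  DemoTextPart.new("header.htm", "<p>hdr</p>"),
  DemoImagePart.new(nil, "orphan-bytes"), # no filename: cannot be a Hash key here
  DemoImagePart.new("image002.jpg", "jpg-bytes")
]

# grep(Class) keeps only the elements that are instances of that class.
image_parts = parts.grep(DemoImagePart)

# Build filename => data, skipping parts without a filename.
images = image_parts.each_with_object({}) do |part, hash|
  hash[part.filename] = part.decoded_content if part.filename
end

puts images.keys.inspect
```

Note that the nameless image still counts as an image part; it is only left out of the Hash.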
#inspect ⇒ Object
Build a summary of the document structure.

# File 'lib/uniword/mhtml/document.rb', line 160
def inspect
  "#<#{self.class} parts=#{parts.length} images=#{image_parts.length} " \
    "xml=#{xml_parts.length} theme=#{theme_part ? 'yes' : 'no'}>"
end
#placeholder_html ⇒ String?
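Returns the placeholder header HTML: the decoded content of the first header/footer part whose filename contains "plchdr", or nil if there is none.

# File 'lib/uniword/mhtml/document.rb', line 140
def placeholder_html
  header_footer_parts.find do |p|
    p.filename&.include?("plchdr")
  end&.decoded_content
end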
#raw_html ⇒ Object
Raw HTML string of the main HTML part, or nil when there is no HTML part.

# File 'lib/uniword/mhtml/document.rb', line 44
def raw_html
  html_part&.decoded_content
end
#raw_html=(value) ⇒ Object
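Set the raw HTML of the main HTML part. Creates the part if it does not exist yet and marks it as text/html with quoted-printable transfer encoding.

# File 'lib/uniword/mhtml/document.rb', line 48
def raw_html=(value)
  self.html_part ||= HtmlPart.new
  html_part.content_type = "text/html"
  html_part.content_transfer_encoding = "quoted-printable"
  html_part.raw_content = value
end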
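The setter labels the part as quoted-printable. For readers unfamiliar with that transfer encoding, here is a small sketch of what it does to HTML text, using Ruby's built-in `"M"` pack directive. Whether and where Uniword itself performs this encoding is not shown in this page, so treat the snippet purely as an illustration of the encoding, with a made-up sample string.

```ruby
# Sample input chosen for illustration: one non-ASCII character and one "=".
html = "<p>café = coffee</p>"

# "M" is Ruby's pack directive for quoted-printable (MIME, RFC 2045).
encoded = [html].pack("M")

# unpack1("M") returns a binary string, so restore the UTF-8 label afterwards.
decoded = encoded.unpack1("M").force_encoding(Encoding::UTF_8)

puts encoded
puts decoded
```

Non-ASCII bytes and the literal `=` are escaped as `=XX` hex pairs, which is why raw MHTML source often looks noisy compared with the decoded HTML.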
#text ⇒ Object
Text content (stripped of HTML tags). Tags are replaced with spaces, common HTML entities are decoded, and runs of whitespace are collapsed. Returns an empty string when there is no raw HTML.

# File 'lib/uniword/mhtml/document.rb', line 66
def text
  return "" unless raw_html

  raw_html
    .gsub(/<[^>]+>/, " ")
    .gsub("&lt;", "<")
    .gsub("&gt;", ">")
    .gsub("&amp;", "&")
    .gsub("&quot;", '"')
    .gsub("&#39;", "'")
    .gsub("&nbsp;", " ")
    .gsub(/\s+/, " ")
    .strip
end
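The same strip-and-decode steps can be tried outside the class. The sketch below is a standalone approximation: `strip_html` is a hypothetical helper, it handles only four entities, and it deliberately decodes `&amp;` last (unlike the listing above) so that doubly escaped input such as `&amp;quot;` is decoded only once.

```ruby
# Hypothetical helper for illustration; not part of Uniword.
def strip_html(html)
  return "" unless html

  html
    .gsub(/<[^>]+>/, " ") # replace each tag with a space
    .gsub("&lt;", "<")
    .gsub("&gt;", ">")
    .gsub("&quot;", '"')
    .gsub("&amp;", "&")   # last, so "&amp;quot;" becomes "&quot;", not '"'
    .gsub(/\s+/, " ")     # collapse whitespace left behind by removed tags
    .strip
end

plain = strip_html("<p>Tom &amp; Jerry</p>\n<p>1 &lt; 2</p>")
puts plain
```

A regex-based approach like this is a pragmatic choice for previews and search indexing; it does not handle comments, script or style contents, or the full set of named entities.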