Class: Uniword::Mhtml::HtmlPart

Inherits: MimePart

Inheritance chain: Uniword::Mhtml::HtmlPart < MimePart < Lutaml::Model::Serializable < Object

Defined in: lib/uniword/mhtml/html_part.rb

Overview

HTML MIME part — the main document content in an MHTML file.

Contains the Word HTML document with embedded XML metadata (DocumentProperties, WordDocument settings, LatentStyles).

Instance Method Summary

#body_html ⇒ Object
Extract the <body> element inner HTML.

#body_inner_html ⇒ Object
Get the body inner HTML.

#css_styles ⇒ Object
Extract inline CSS styles from <style> tags.

#document_properties_xml ⇒ Object
Extract DocumentProperties XML from HTML head comments.

#head_html ⇒ Object
Extract the <head> element as string.

#html_document ⇒ Object
Parse the decoded HTML with Nokogiri.

#latent_styles_xml ⇒ Object
Extract LatentStyles XML from HTML head comments.

#office_document_settings_xml ⇒ Object
Extract OfficeDocumentSettings XML from HTML head comments.

#to_html ⇒ Object
Get the full HTML string.

#word_document_xml ⇒ Object
Extract WordDocument XML from HTML head comments.

#xml_blocks ⇒ Object
Extract all <xml> blocks from head.

Methods inherited from MimePart

#decoded_content, #decoded_content=, #filename, #html_content?, #image_content?, #text_content?, #theme_content?, #xml_content?

Instance Method Details

#body_html ⇒ `Object`

Extract the inner HTML of the <body> element. Returns an empty string if no <body> element is found.

# File 'lib/uniword/mhtml/html_part.rb', line 22

def body_html
  node = html_document.at_css("body")
  node ? node.inner_html : ""
end

#body_inner_html ⇒ `Object`

Get the body inner HTML. Delegates to #body_html.



# File 'lib/uniword/mhtml/html_part.rb', line 73

def body_inner_html
  body_html
end

#css_styles ⇒ `Object`

Extract the CSS text of every <style> element in the document, joined with newlines.



# File 'lib/uniword/mhtml/html_part.rb', line 28

def css_styles
  html_document.css("style").map(&:content).join("\n")
end

#document_properties_xml ⇒ `Object`

Extract DocumentProperties XML from HTML head comments.

Returns the <o:DocumentProperties> element as a string with namespace declarations for lutaml-model parsing.

# File 'lib/uniword/mhtml/html_part.rb', line 36

def document_properties_xml
  extract_office_xml("DocumentProperties",
                     "urn:schemas-microsoft-com:office:office", "o")
end
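
The private helper extract_office_xml is not shown on this page, so its behavior is unknown here. As a rough illustration of what "an element as a string with namespace declarations" could look like, here is a minimal sketch using only the Ruby standard library. The method name extract_with_namespace and the sample input are hypothetical, and the sketch ignores self-closing elements and elements that already declare the prefix.

```ruby
# Hypothetical helper, NOT the library's extract_office_xml.
# Finds <prefix:element>...</prefix:element> in an XML block and adds an
# xmlns declaration to its opening tag.
def extract_with_namespace(block, element, namespace_uri, prefix)
  tag = Regexp.escape("#{prefix}:#{element}")
  match = block.match(%r{<#{tag}(\s[^>]*)?>.*?</#{tag}>}m)
  return nil unless match

  # Declare the prefix on the root tag so a namespace-aware parser accepts it.
  qualified = "#{prefix}:#{element}"
  match[0].sub("<#{qualified}", %(<#{qualified} xmlns:#{prefix}="#{namespace_uri}"))
end

# Sample input invented for illustration.
sample = "<o:DocumentProperties><o:Author>Jane</o:Author></o:DocumentProperties>"
puts extract_with_namespace(sample, "DocumentProperties",
                            "urn:schemas-microsoft-com:office:office", "o")
```

Declaring the namespace on the extracted root matters because the fragment is lifted out of an HTML comment, where the o: prefix would otherwise be unbound for an XML parser.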

#head_html ⇒ `Object`

Extract the <head> element as a string. Returns an empty string if no <head> element is found.

# File 'lib/uniword/mhtml/html_part.rb', line 16

def head_html
  node = html_document.at_css("head")
  node ? node.to_s : ""
end

#html_document ⇒ `Object`

Parse the decoded HTML with Nokogiri. The parsed document is memoized, so parsing happens only on the first call.



# File 'lib/uniword/mhtml/html_part.rb', line 11

def html_document
  @html_document ||= Nokogiri::HTML(decoded_content)
end
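
The `||=` in html_document means the parse result is cached on the instance. The sketch below shows one consequence of that pattern with a stand-in class: the cached value is reused even after the source content changes. MemoizedPart and its parsed method are hypothetical, and parsing is simulated with String#upcase so that no gems are needed. Whether the real decoded_content= setter resets the cache is not confirmed by this page.

```ruby
# Hypothetical stand-in for HtmlPart, for illustration only.
class MemoizedPart
  attr_accessor :decoded_content
  attr_reader :parse_count

  def initialize(decoded_content)
    @decoded_content = decoded_content
    @parse_count = 0
  end

  # Same ||= pattern as html_document: compute once, then reuse the result.
  def parsed
    @parsed ||= begin
      @parse_count += 1
      decoded_content.upcase # stands in for Nokogiri::HTML(decoded_content)
    end
  end
end

part = MemoizedPart.new("<p>first</p>")
part.parsed
part.parsed
puts part.parse_count # prints 1

part.decoded_content = "<p>second</p>"
puts part.parsed # prints <P>FIRST</P>, the cached result
```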

#latent_styles_xml ⇒ `Object`

Extract LatentStyles XML from HTML head comments.

# File 'lib/uniword/mhtml/html_part.rb', line 54

def latent_styles_xml
  extract_office_xml("LatentStyles",
                     "urn:schemas-microsoft-com:office:word", "w")
end

#office_document_settings_xml ⇒ `Object`

Extract OfficeDocumentSettings XML from HTML head comments.

# File 'lib/uniword/mhtml/html_part.rb', line 42

def office_document_settings_xml
  extract_office_xml("OfficeDocumentSettings",
                     "urn:schemas-microsoft-com:office:office", "o")
end

#to_html ⇒ `Object`

Get the full HTML string, serialized from the parsed Nokogiri document.



# File 'lib/uniword/mhtml/html_part.rb', line 68

def to_html
  html_document.to_s
end

#word_document_xml ⇒ `Object`

Extract WordDocument XML from HTML head comments.

# File 'lib/uniword/mhtml/html_part.rb', line 48

def word_document_xml
  extract_office_xml("WordDocument",
                     "urn:schemas-microsoft-com:office:word", "w")
end

#xml_blocks ⇒ `Object`

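Extract the contents of <xml> blocks from comments directly under <head>. Returns an array of strings, one for the first <xml>...</xml> match in each comment, or an empty array if there is no <head>.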

# File 'lib/uniword/mhtml/html_part.rb', line 60

def xml_blocks
  html_document.at_css("head")&.xpath("comment()")&.filter_map do |comment|
    text = comment.text
    ::Regexp.last_match(1).strip if text =~ %r{<xml>(.*?)</xml>}m
  end || []
end
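
To make the matching behavior concrete, here is a minimal sketch that applies the same regular expression to plain strings instead of Nokogiri comment nodes. The helper name xml_payloads and the sample comment texts are hypothetical. It uses String#match rather than `=~` with Regexp.last_match, which should be equivalent for this purpose.

```ruby
# Hypothetical helper mirroring the regex in xml_blocks. It takes comment
# texts directly so that it runs without Nokogiri.
def xml_payloads(comment_texts)
  comment_texts.filter_map do |text|
    # The m flag lets . span newlines; .*? stops at the first </xml>.
    match = text.match(%r{<xml>(.*?)</xml>}m)
    match[1].strip if match
  end
end

# Sample comment texts invented for illustration, shaped like Word's
# conditional comments.
comments = [
  "[if gte mso 9]><xml>\n <w:WordDocument><w:View>Print</w:View></w:WordDocument>\n</xml><![endif]",
  "[if !mso]> no xml here <![endif]"
]

payloads = xml_payloads(comments)
puts payloads.length # prints 1
puts payloads.first  # prints <w:WordDocument><w:View>Print</w:View></w:WordDocument>
```

Comments without an <xml> block yield nil and are dropped by filter_map, which is why the result can be shorter than the list of comments.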
