Class: Uniword::Transformation::HtmlElementBuilder
- Inherits: Object
- Defined in: lib/uniword/transformation/html_element_builder.rb
Overview
Builds OOXML element objects from parsed HTML (Nokogiri nodes).
Pure functions — no state, no side effects. Used by HtmlToOoxmlConverter to construct OOXML paragraphs, tables, runs.
Delegates formatting extraction to HtmlFormattingMapper.
Class Method Summary
-
.apply_css_style(paragraph, element) ⇒ Object
Apply CSS class → OOXML style mapping.
-
.apply_heading_style(paragraph, element) ⇒ Object
Apply heading level style from tag name (h1-h6).
-
.build_cell(html_cell) ⇒ Uniword::Wordprocessingml::TableCell?
Build an OOXML TableCell from an HTML td/th element.
-
.build_children(paragraph, element) ⇒ Object
Build child nodes (text nodes and element nodes) into runs/SDTs/hyperlinks.
-
.build_paragraph(element) ⇒ Uniword::Wordprocessingml::Paragraph?
Build an OOXML Paragraph from an HTML element.
-
.build_row(html_row) ⇒ Uniword::Wordprocessingml::TableRow?
Build an OOXML TableRow from an HTML tr element.
-
.build_table(html_table) ⇒ Uniword::Wordprocessingml::Table?
Build an OOXML Table from an HTML table element.
-
.create_break_run(element) ⇒ Uniword::Wordprocessingml::Run
Create a Run with Break from an HTML <br> element.
-
.create_endnote_reference_run(element) ⇒ Uniword::Wordprocessingml::Run
Create a Run with EndnoteReference from an HTML endnote reference span.
-
.create_footnote_reference_run(element) ⇒ Uniword::Wordprocessingml::Run
Create a Run with FootnoteReference from an HTML footnote reference span.
-
.create_hyperlink(element) ⇒ Uniword::Wordprocessingml::Hyperlink?
Create a Hyperlink from an HTML <a href> element.
-
.create_run(text) ⇒ Uniword::Wordprocessingml::Run
Create a simple OOXML Run from text.
-
.create_run_from_element(element) ⇒ Uniword::Wordprocessingml::Run?
Create an OOXML Run from an HTML element with inline formatting.
-
.create_sdt_from_element(element) ⇒ Uniword::Wordprocessingml::StructuredDocumentTag?
Create an OOXML SDT from an HTML w:sdt element.
-
.create_vmerge_continuation_cell ⇒ Uniword::Wordprocessingml::TableCell
Create a vMerge continuation cell (empty cell with vMerge, no val).
-
.endnote_reference_span?(element) ⇒ Boolean
Check if element is an endnote reference span.
-
.ensure_empty_cell(cell) ⇒ Object
Ensure cell has at least one empty paragraph to preserve structure.
-
.extract_note_id(element) ⇒ String?
Extract footnote/endnote ID from reference span element.
-
.footnote_reference_span?(element) ⇒ Boolean
Check if element is a footnote reference span.
-
.parse_sdt_attributes(element) ⇒ Object
Parse SDT attributes from HTML element.
Class Method Details
.apply_css_style(paragraph, element) ⇒ Object
Apply CSS class → OOXML style mapping.
# File 'lib/uniword/transformation/html_element_builder.rb', line 215

def self.apply_css_style(paragraph, element)
  css_class = element.attr("class")
  return unless css_class && !css_class.empty?

  mapped_style = HtmlFormattingMapper.map_css_class_to_style(css_class)
  return unless mapped_style

  paragraph.properties ||= Uniword::Wordprocessingml::ParagraphProperties.new
  paragraph.properties.style = mapped_style
end
.apply_heading_style(paragraph, element) ⇒ Object
Apply heading level style from tag name (h1-h6).
# File 'lib/uniword/transformation/html_element_builder.rb', line 205

def self.apply_heading_style(paragraph, element)
  return unless element.name.match?(/^h[1-6]$/)

  paragraph.properties ||= Uniword::Wordprocessingml::ParagraphProperties.new
  heading_num = element.name[1]
  paragraph.properties.style = "Heading#{heading_num}"
end
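The tag-to-style mapping can be tried in isolation. The following is a minimal sketch, not the library's code: `heading_style_for` is a hypothetical helper that takes a plain tag-name string instead of a Nokogiri node, and it uses `\A`/`\z` anchors where the listing uses `^`/`$`.

```ruby
# Hypothetical stand-in for the tag-name check in apply_heading_style.
# Returns the style name for "h1".."h6", or nil for any other tag name.
def heading_style_for(tag_name)
  match = tag_name.match(/\Ah([1-6])\z/)
  match && "Heading#{match[1]}"
end

p heading_style_for("h2") # => "Heading2"
p heading_style_for("p")  # => nil
p heading_style_for("h7") # => nil
```

As in the listing, the match is case-sensitive, so an upper-case tag name such as "H2" would not receive a heading style in this sketch.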
.build_cell(html_cell) ⇒ Uniword::Wordprocessingml::TableCell?
Build an OOXML TableCell from an HTML td/th element.
Handles colspan → gridSpan, rowspan → vMerge restart, <th> → header.
# File 'lib/uniword/transformation/html_element_builder.rb', line 114

def self.build_cell(html_cell)
  cell = Uniword::Wordprocessingml::TableCell.new

  # Detect <th> → header cell
  cell.header = html_cell.name.downcase == "th"

  # Handle colspan → gridSpan, rowspan → vMerge restart
  colspan = html_cell.attr("colspan")
  rowspan = html_cell.attr("rowspan")

  if (colspan && colspan.to_i > 1) || (rowspan && rowspan.to_i > 1)
    cell.properties ||= Uniword::Wordprocessingml::TableCellProperties.new

    if colspan && colspan.to_i > 1
      cell.properties.grid_span =
        Uniword::Wordprocessingml::ValInt.new(value: colspan.to_i)
    end

    if rowspan && rowspan.to_i > 1
      cell.properties.v_merge =
        Uniword::Wordprocessingml::ValInt.new(value: 1) # restart
    end
  end

  # Convert cell content to paragraphs
  html_cell.css("p, div, h1, h2, h3, h4, h5, h6").each do |para_element|
    paragraph = build_paragraph(para_element)
    cell.paragraphs << paragraph if paragraph
  end

  # If no paragraphs found, create one from text content
  if cell.paragraphs.empty?
    text = html_cell.text.strip
    if text && !text.empty?
      para = Uniword::Wordprocessingml::Paragraph.new
      para.runs << create_run(text)
      cell.paragraphs << para
    end
  end

  # Always return a cell, even if empty (preserves table structure)
  cell.paragraphs.empty? ? ensure_empty_cell(cell) : cell
end
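The span handling above reduces to two independent checks on the raw attribute strings. A minimal sketch of that decision follows. `span_properties` is a hypothetical helper, a plain Hash stands in for TableCellProperties, and the `:restart` symbol stands in for the ValInt used in the listing.

```ruby
# Hypothetical helper mirroring the colspan/rowspan checks in build_cell.
# Arguments are raw attribute values (String or nil), as returned by attr.
def span_properties(colspan, rowspan)
  props = {}
  props[:grid_span] = colspan.to_i if colspan && colspan.to_i > 1
  props[:v_merge] = :restart if rowspan && rowspan.to_i > 1
  props.empty? ? nil : props
end

p span_properties("3", nil)   # => {:grid_span=>3} (shown as {grid_span: 3} on Ruby 3.4+)
p span_properties(nil, nil)   # => nil
p span_properties("abc", "0") # => nil
```

Because `String#to_i` returns 0 for non-numeric input, malformed span values fall through as if the attribute were absent.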
.build_children(paragraph, element) ⇒ Object
Build child nodes (text nodes and element nodes) into runs/SDTs/hyperlinks.
Handles: <br> → Break run, <a href> → Hyperlink, <span class="MsoFootnoteReference"> → FootnoteReference run, <span class="MsoEndnoteReference"> → EndnoteReference run.
# File 'lib/uniword/transformation/html_element_builder.rb', line 231

def self.build_children(paragraph, element)
  element.children.each do |child|
    case child.type
    when Nokogiri::XML::Node::TEXT_NODE
      text = child.text.strip
      next if text.empty?

      paragraph.runs << create_run(text)
    when Nokogiri::XML::Node::ELEMENT_NODE
      case child.name.downcase
      when "w:sdt", "sdt"
        sdt = create_sdt_from_element(child)
        paragraph.sdts << sdt if sdt
      when "br"
        paragraph.runs << create_break_run(child)
      when "a"
        hyperlink = create_hyperlink(child)
        paragraph.hyperlinks << hyperlink if hyperlink
      when "img"
        # Images require binary data not available from HTML parsing; skip
      when "span"
        if footnote_reference_span?(child)
          paragraph.runs << create_footnote_reference_run(child)
        elsif endnote_reference_span?(child)
          paragraph.runs << create_endnote_reference_run(child)
        else
          run = create_run_from_element(child)
          paragraph.runs << run if run
        end
      else
        run = create_run_from_element(child)
        paragraph.runs << run if run
      end
    end
  end
end
.build_paragraph(element) ⇒ Uniword::Wordprocessingml::Paragraph?
Build an OOXML Paragraph from an HTML element.
# File 'lib/uniword/transformation/html_element_builder.rb', line 16

def self.build_paragraph(element)
  paragraph = Uniword::Wordprocessingml::Paragraph.new

  apply_heading_style(paragraph, element)
  apply_css_style(paragraph, element)
  build_children(paragraph, element)

  has_content = paragraph.runs.any? ||
                paragraph.hyperlinks.any? ||
                paragraph.sdts.any?
  has_content ? paragraph : nil
end
.build_row(html_row) ⇒ Uniword::Wordprocessingml::TableRow?
Build an OOXML TableRow from an HTML tr element.
# File 'lib/uniword/transformation/html_element_builder.rb', line 93

def self.build_row(html_row)
  row = Uniword::Wordprocessingml::TableRow.new

  cells = []
  html_row.css("td, th").each do |html_cell|
    cell = build_cell(html_cell)
    cells << cell if cell
  end

  return nil if cells.empty?

  cells.each { |c| row.cells << c }
  row
end
.build_table(html_table) ⇒ Uniword::Wordprocessingml::Table?
Build an OOXML Table from an HTML table element.
Handles colspan → gridSpan and rowspan → vMerge with continuation cell insertion. The grid layout is computed row-by-row to correctly place vMerge continuation cells where HTML rowspan omits cells.
# File 'lib/uniword/transformation/html_element_builder.rb', line 37

def self.build_table(html_table)
  table = Uniword::Wordprocessingml::Table.new

  html_rows = html_table.css("tr")
  return nil if html_rows.empty?

  # Track columns occupied by rowspan: col_idx → remaining continuation rows
  occupied = {}

  rows = []
  html_rows.each do |html_row|
    row = Uniword::Wordprocessingml::TableRow.new
    col_idx = 0

    html_row.css("td, th").each do |html_cell|
      # Insert vMerge continuation cells for columns occupied by rowspan
      while occupied.key?(col_idx)
        row.cells << create_vmerge_continuation_cell
        occupied[col_idx] -= 1
        occupied.delete(col_idx) if occupied[col_idx] <= 0
        col_idx += 1
      end

      cell = build_cell(html_cell)
      row.cells << cell

      # Track rowspan for continuation cell insertion in subsequent rows
      rowspan = html_cell.attr("rowspan")&.to_i
      if rowspan && rowspan > 1
        colspan = html_cell.attr("colspan")&.to_i || 1
        colspan.times { |c| occupied[col_idx + c] = rowspan - 1 }
      end

      col_idx += html_cell.attr("colspan")&.to_i || 1
    end

    # Handle remaining occupied columns at end of row
    while occupied.key?(col_idx)
      row.cells << create_vmerge_continuation_cell
      occupied[col_idx] -= 1
      occupied.delete(col_idx) if occupied[col_idx] <= 0
      col_idx += 1
    end

    rows << row unless row.cells.empty?
  end

  return nil if rows.empty?

  rows.each { |r| table.rows << r }
  table
end
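The row-by-row grid computation is easier to follow without the OOXML classes. The sketch below reproduces only the `occupied` bookkeeping from the listing. `layout_rows` is a hypothetical helper, each cell is a plain Hash with optional `:rowspan` and `:colspan` keys (assumed stand-ins for the HTML attributes), and the output marks each position as either a real cell or a vMerge continuation.

```ruby
# Hypothetical sketch of the occupancy tracking used by build_table.
# rows: Array of rows, each an Array of Hashes with optional :rowspan/:colspan.
# Returns one Array per row containing :cell or :continuation markers.
def layout_rows(rows)
  occupied = {} # column index => continuation rows still owed
  rows.map do |cells|
    out = []
    col = 0
    # Emit continuation markers for every occupied column at the cursor.
    fill = lambda do
      while occupied.key?(col)
        out << :continuation
        occupied[col] -= 1
        occupied.delete(col) if occupied[col] <= 0
        col += 1
      end
    end
    cells.each do |cell|
      fill.call
      out << :cell
      rowspan = cell.fetch(:rowspan, 1)
      colspan = cell.fetch(:colspan, 1)
      # A spanning cell reserves each of its columns in the following rows.
      colspan.times { |c| occupied[col + c] = rowspan - 1 } if rowspan > 1
      col += colspan
    end
    fill.call # trailing occupied columns at the end of the row
    out
  end
end

p layout_rows([[{ rowspan: 2 }, {}], [{}]])
# => [[:cell, :cell], [:continuation, :cell]]
```

In this sketch, as in the listing, a cell that has both a colspan and a rowspan reserves every column it covers, so the following row receives one continuation marker per covered column.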
.create_break_run(element) ⇒ Uniword::Wordprocessingml::Run
Create a Run with Break from an HTML <br> element.
# File 'lib/uniword/transformation/html_element_builder.rb', line 325

def self.create_break_run(element)
  run = Uniword::Wordprocessingml::Run.new
  brk = Uniword::Wordprocessingml::Break.new
  style = element.attr("style").to_s
  clear = element.attr("clear").to_s
  if style.include?("page-break-before") || style.include?("page-break-after") || clear == "all"
    brk.type = "page"
  end
  run.break = brk
  run
end
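The page-break detection is a substring test on the style attribute plus an exact match on clear. A minimal sketch of that rule, using a hypothetical `break_type` helper that takes the two attribute values directly:

```ruby
# Hypothetical helper mirroring the condition in create_break_run.
# Returns "page" when the attributes indicate a page break, otherwise nil
# (the listing leaves Break#type unset in that case).
def break_type(style, clear)
  style = style.to_s
  page = style.include?("page-break-before") ||
         style.include?("page-break-after") ||
         clear.to_s == "all"
  page ? "page" : nil
end

p break_type("page-break-before: always", nil) # => "page"
p break_type(nil, "all")                       # => "page"
p break_type("color: red", "left")             # => nil
```

Since the test only looks for the property name, a declaration such as "page-break-before: avoid" would also be classified as a page break under this rule.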
.create_endnote_reference_run(element) ⇒ Uniword::Wordprocessingml::Run
Create a Run with EndnoteReference from an HTML endnote reference span.
# File 'lib/uniword/transformation/html_element_builder.rb', line 373

def self.create_endnote_reference_run(element)
  run = Uniword::Wordprocessingml::Run.new
  id = extract_note_id(element) || "1"
  run.endnote_reference =
    Uniword::Wordprocessingml::EndnoteReference.new(id: id)
  run
end
.create_footnote_reference_run(element) ⇒ Uniword::Wordprocessingml::Run
Create a Run with FootnoteReference from an HTML footnote reference span.
# File 'lib/uniword/transformation/html_element_builder.rb', line 361

def self.create_footnote_reference_run(element)
  run = Uniword::Wordprocessingml::Run.new
  id = extract_note_id(element) || "1"
  run.footnote_reference =
    Uniword::Wordprocessingml::FootnoteReference.new(id: id)
  run
end
.create_hyperlink(element) ⇒ Uniword::Wordprocessingml::Hyperlink?
Create a Hyperlink from an HTML <a href> element.
# File 'lib/uniword/transformation/html_element_builder.rb', line 410

def self.create_hyperlink(element)
  href = element.attr("href")
  return nil unless href

  hyperlink = Uniword::Wordprocessingml::Hyperlink.new
  hyperlink.target = href

  element.children.each do |child|
    case child.type
    when Nokogiri::XML::Node::TEXT_NODE
      text = child.text.strip
      next if text.empty?

      hyperlink.runs << create_run(text)
    when Nokogiri::XML::Node::ELEMENT_NODE
      run = create_run_from_element(child)
      hyperlink.runs << run if run
    end
  end

  hyperlink.runs.any? ? hyperlink : nil
end
.create_run(text) ⇒ Uniword::Wordprocessingml::Run
Create a simple OOXML Run from text.
# File 'lib/uniword/transformation/html_element_builder.rb', line 160

def self.create_run(text)
  run = Uniword::Wordprocessingml::Run.new
  run.text = HtmlFormattingMapper.decode_entities(text)
  run
end
.create_run_from_element(element) ⇒ Uniword::Wordprocessingml::Run?
Create an OOXML Run from an HTML element with inline formatting.
# File 'lib/uniword/transformation/html_element_builder.rb', line 170

def self.create_run_from_element(element)
  text = element.text.strip
  return nil if text.empty?

  decoded_text = HtmlFormattingMapper.decode_entities(text)
  props = HtmlFormattingMapper.collect_formatting(element)

  run = Uniword::Wordprocessingml::Run.new
  run.text = decoded_text
  run.properties = props if HtmlFormattingMapper.has_formatting?(props)
  run
end
.create_sdt_from_element(element) ⇒ Uniword::Wordprocessingml::StructuredDocumentTag?
Create an OOXML SDT from an HTML w:sdt element.
# File 'lib/uniword/transformation/html_element_builder.rb', line 187

def self.create_sdt_from_element(element)
  text = element.text.strip
  return nil if text.empty?

  sdt = Uniword::Wordprocessingml::StructuredDocumentTag.new
  sdt_props = parse_sdt_attributes(element)
  sdt.properties = sdt_props if sdt_props

  content = Uniword::Wordprocessingml::StructuredDocumentTag::Content.new
  run = Uniword::Wordprocessingml::Run.new
  run.text = HtmlFormattingMapper.decode_entities(text)
  content.runs = [run]
  sdt.content = content

  sdt
end
.create_vmerge_continuation_cell ⇒ Uniword::Wordprocessingml::TableCell
Create a vMerge continuation cell (empty cell with vMerge, no val).
# File 'lib/uniword/transformation/html_element_builder.rb', line 436

def self.create_vmerge_continuation_cell
  cell = Uniword::Wordprocessingml::TableCell.new
  cell.properties = Uniword::Wordprocessingml::TableCellProperties.new
  cell.properties.v_merge = Uniword::Wordprocessingml::ValInt.new
  cell.paragraphs << Uniword::Wordprocessingml::Paragraph.new
  cell
end
.endnote_reference_span?(element) ⇒ Boolean
Check if element is an endnote reference span.
# File 'lib/uniword/transformation/html_element_builder.rb', line 351

def self.endnote_reference_span?(element)
  return false unless element.name.downcase == "span"

  element.attr("class").to_s.split.include?("MsoEndnoteReference")
end
.ensure_empty_cell(cell) ⇒ Object
Ensure cell has at least one empty paragraph to preserve structure.
# File 'lib/uniword/transformation/html_element_builder.rb', line 269

def self.ensure_empty_cell(cell)
  cell.paragraphs << Uniword::Wordprocessingml::Paragraph.new
  cell
end
.extract_note_id(element) ⇒ String?
Extract footnote/endnote ID from reference span element.
Tries: nested <a href="#_ftnN">, then text content digits.
# File 'lib/uniword/transformation/html_element_builder.rb', line 387

def self.extract_note_id(element)
  # Try nested <a href="#_ftnN"> or <a href="#_ednN">
  anchor = element.at_css("a[href^='#']")
  if anchor
    href = anchor.attr("href").sub(/^#/, "")
    if href =~ /(\d+)\s*$/
      return Regexp.last_match(1)
    end
  end

  # Fall back to digit in text content
  text = element.text.strip
  if text =~ /(\d+)/
    return Regexp.last_match(1)
  end

  nil
end
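The two-step lookup can be exercised on plain strings. This is a sketch under stated assumptions: `note_id_from` is a hypothetical helper that receives the nested anchor's href (or nil when there is no anchor) and the span's text, rather than a Nokogiri element.

```ruby
# Hypothetical helper mirroring the fallback order in extract_note_id.
# href: value of a nested <a href="#..."> (or nil); text: the span's text.
def note_id_from(href, text)
  if href && href.start_with?("#")
    fragment = href.sub(/^#/, "")
    # Trailing digits of the fragment, e.g. "_ftn12" -> "12"
    return Regexp.last_match(1) if fragment =~ /(\d+)\s*$/
  end
  # Otherwise take the first run of digits in the visible text
  return Regexp.last_match(1) if text.to_s.strip =~ /(\d+)/

  nil
end

p note_id_from("#_ftn12", "[12]") # => "12"
p note_id_from(nil, "[3]")        # => "3"
p note_id_from("#_ftnref", "*")   # => nil
```

The result is a String, and a nil result is what lets the reference-run builders fall back to "1" as shown in their listings.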
.footnote_reference_span?(element) ⇒ Boolean
Check if element is a footnote reference span.
# File 'lib/uniword/transformation/html_element_builder.rb', line 341

def self.footnote_reference_span?(element)
  return false unless element.name.downcase == "span"

  element.attr("class").to_s.split.include?("MsoFootnoteReference")
end
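Both reference-span predicates test the class attribute token by token rather than by substring. A minimal sketch of that check on plain strings, using a hypothetical `class_token?` helper:

```ruby
# Hypothetical helper: true when token appears as a whole whitespace-separated
# entry in the class attribute value (nil is treated as an empty attribute).
def class_token?(class_attr, token)
  class_attr.to_s.split.include?(token)
end

p class_token?("MsoFootnoteReference small", "MsoFootnoteReference") # => true
p class_token?("MsoFootnoteReferenceText", "MsoFootnoteReference")   # => false
p class_token?(nil, "MsoFootnoteReference")                          # => false
```

Splitting first means a longer class name that merely begins with the token does not count as a match, and `to_s` keeps a missing class attribute from raising.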
.parse_sdt_attributes(element) ⇒ Object
Parse SDT attributes from HTML element.
# File 'lib/uniword/transformation/html_element_builder.rb', line 275

def self.parse_sdt_attributes(element)
  attrs = element.attributes
  return nil if attrs.empty?

  sdt_props = Uniword::Wordprocessingml::StructuredDocumentTagProperties.new

  if attrs["showingplchdr"] || attrs["ShowingPlcHdr"]
    sdt_props.showing_placeholder_header =
      Uniword::Wordprocessingml::StructuredDocumentTag::ShowingPlaceholderHeader.new
  end

  if attrs["temporary"] || attrs["Temporary"]
    sdt_props.temporary =
      Uniword::Wordprocessingml::StructuredDocumentTag::Temporary.new
  end

  doc_part = attrs["docpart"] || attrs["DocPart"]
  if doc_part
    placeholder =
      Uniword::Wordprocessingml::StructuredDocumentTag::Placeholder.new
    doc_part_ref =
      Uniword::Wordprocessingml::StructuredDocumentTag::DocPartReference.new(value: doc_part.value)
    placeholder.doc_part = doc_part_ref
    sdt_props.placeholder = placeholder
  end

  if attrs["text"] || attrs["Text"]
    sdt_props.text =
      Uniword::Wordprocessingml::StructuredDocumentTag::Text.new(value: "whole")
  end

  id_attr = attrs["id"] || attrs["ID"]
  if id_attr
    sdt_props.id =
      Uniword::Wordprocessingml::StructuredDocumentTag::Id.new(value: id_attr.value.to_i)
  end

  if attrs["bibliography"] || attrs["Bibliography"]
    sdt_props.bibliography =
      Uniword::Wordprocessingml::StructuredDocumentTag::Bibliography.new
  end

  sdt_props
end