Class: Coradoc::Docx::Transform::ToCoreModel

Inherits:
Object
Defined in:
lib/coradoc/docx/transform/to_core_model.rb

Overview

Orchestrator for OOXML → CoreModel transformation.

Walks a Uniword::Wordprocessingml::DocumentRoot tree and dispatches to registered transform rules. Handles:

  • Style-based heading detection (via StyleResolver)

  • List grouping (consecutive numPr paragraphs → single ListBlock)

  • Footnote content collection

  • Image reference tracking

  • Bookmark ID propagation
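
The list-grouping step above can be illustrated with a minimal sketch. It assumes stand-in `Paragraph` and `ListBlock` structs and a hypothetical `group_lists` helper, none of which are taken from the library. It only shows the idea of folding a run of consecutive numPr paragraphs into one block.

```ruby
# Hypothetical stand-ins; the real paragraph and ListBlock classes are not shown here.
Paragraph = Struct.new(:text, :num_pr)   # num_pr is nil when the paragraph is not a list item
ListBlock = Struct.new(:items)

# Fold each run of consecutive list paragraphs into a single ListBlock,
# leaving non-list paragraphs untouched and in their original order.
def group_lists(paragraphs)
  paragraphs
    .chunk_while { |a, b| a.num_pr && b.num_pr }
    .flat_map do |run|
      run.first.num_pr ? [ListBlock.new(run.map(&:text))] : run
    end
end

paragraphs = [
  Paragraph.new("Intro", nil),
  Paragraph.new("First", true),
  Paragraph.new("Second", true),
  Paragraph.new("Outro", nil)
]
grouped = group_lists(paragraphs)
```

Here `grouped` holds three entries: the intro paragraph, one `ListBlock` with both items, and the outro paragraph. Whether the actual orchestrator also splits runs by numbering ID is not stated in this page.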

Dispatch strategy:

  • HeadingRule and ListItemRule are dispatched directly by the orchestrator (they need context for style resolution).

  • All other element types are dispatched via RuleRegistry.
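
The two-tier dispatch described above might look roughly like the following sketch. `SketchRegistry`, `Para`, `Table`, `transform_element`, and the hash-based context are all hypothetical names chosen for illustration. They are not the actual `RuleRegistry` or `Context` API.

```ruby
# Hypothetical registry: maps an element class to a callable rule.
class SketchRegistry
  def initialize
    @rules = {}
  end

  def register(klass, &rule)
    @rules[klass] = rule
  end

  # Returns nil when no rule is registered for the element's class.
  def dispatch(element, context)
    rule = @rules[element.class]
    rule&.call(element, context)
  end
end

Para  = Struct.new(:text, :style)
Table = Struct.new(:rows)

def transform_element(element, context, registry)
  # Headings are handled directly: detection needs the style map held in the context.
  if element.is_a?(Para) && context[:heading_styles].key?(element.style)
    level = context[:heading_styles][element.style]
    return { type: :heading, level: level, text: element.text }
  end

  # Everything else goes through the registry.
  registry.dispatch(element, context)
end

registry = SketchRegistry.new
registry.register(Para)  { |el, _ctx| { type: :paragraph, text: el.text } }
registry.register(Table) { |el, _ctx| { type: :table, rows: el.rows.size } }

context = { heading_styles: { "Heading1" => 1 } }
heading = transform_element(Para.new("Scope", "Heading1"), context, registry)
plain   = transform_element(Para.new("Body text", "Normal"), context, registry)
table   = transform_element(Table.new([[1, 2], [3, 4]]), context, registry)
```

Keeping the context-dependent checks in the orchestrator lets the registered rules stay simple functions of a single element.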

Examples:

Transform a DOCX document

doc = Uniword::DocumentFactory.from_file("input.docx")
core = Coradoc::Docx::Transform::ToCoreModel.transform(doc)
# => Coradoc::CoreModel::StructuralElement


Class Method Details

.transform(document) ⇒ Object



# File 'lib/coradoc/docx/transform/to_core_model.rb', line 28

def transform(document)
  new.transform(document)
end

Instance Method Details

#transform(document) ⇒ Object



# File 'lib/coradoc/docx/transform/to_core_model.rb', line 33

def transform(document)
  registry = build_registry

  context = Context.new(
    styles_configuration: document.styles_configuration,
    numbering_configuration: document.numbering_configuration,
    footnotes: collect_footnotes(document),
    registry: registry
  )

  @heading_rule = Rules::HeadingRule.new
  @list_item_rule = Rules::ListItemRule.new

  body = document.body
  doc_title = extract_document_title(document, context)
  children = transform_elements(body, context)

  # If the first child is a level-1 section whose title matches the
  # document title, drop it: the document title already captures it.
  if doc_title && children.first.is_a?(Coradoc::CoreModel::StructuralElement) &&
     children.first.section? &&
     children.first.title == doc_title &&
     children.first.level == 1
    children.shift
  end

  doc = Coradoc::CoreModel::StructuralElement.new(
    element_type: 'document',
    title: doc_title,
    children: children
  )

  # Extract semantic content from headers/footers
  (document, doc)

  doc
end
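
The duplicate-title check in the method above can be exercised in isolation. The following is a minimal sketch, assuming a stand-in `Section` struct and a hypothetical `drop_duplicate_title` helper. Unlike the listing, it returns a new array instead of calling `shift` on the original.

```ruby
# Hypothetical stand-in for a section-like StructuralElement.
Section = Struct.new(:title, :level)

# Drop the first child when it is a level-1 section repeating the document title.
def drop_duplicate_title(doc_title, children)
  first = children.first
  duplicate = doc_title &&
              first.is_a?(Section) &&
              first.level == 1 &&
              first.title == doc_title
  duplicate ? children.drop(1) : children
end

children = [Section.new("Annual Report", 1), Section.new("Summary", 2)]

deduped   = drop_duplicate_title("Annual Report", children)
unchanged = drop_duplicate_title("Other Title", children)
```

With these inputs, `deduped` keeps only the "Summary" section, while `unchanged` keeps both children because the titles differ.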