Class: Coradoc::Docx::Transform::ToCoreModel

Inherits:

Object

Defined in: lib/coradoc/docx/transform/to_core_model.rb

Overview

Orchestrator for OOXML → CoreModel transformation.

Walks a Uniword::Wordprocessingml::DocumentRoot tree and dispatches to registered transform rules. Handles:

Style-based heading detection (via StyleResolver)
List grouping (consecutive numPr paragraphs → single ListBlock)
Footnote content collection
Image reference tracking
Bookmark ID propagation
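
The list-grouping step can be illustrated with a minimal sketch. The `Paragraph` and `ListBlock` structs and the `group_lists` helper below are hypothetical stand-ins, not the Uniword or CoreModel classes, and the sketch assumes that any run of consecutive paragraphs carrying numbering properties becomes one block, regardless of numbering ID.

```ruby
# Hypothetical stand-ins; the real Uniword and CoreModel classes are not used here.
Paragraph = Struct.new(:text, :num_id) # num_id is nil for non-list paragraphs
ListBlock = Struct.new(:items)

# Collapse each run of consecutive list paragraphs into a single ListBlock,
# leaving all other paragraphs in their original order.
def group_lists(paragraphs)
  paragraphs
    .chunk_while { |a, b| a.num_id && b.num_id } # keep adjacent list paragraphs together
    .flat_map { |run| run.first.num_id ? [ListBlock.new(run)] : run }
end

paragraphs = [
  Paragraph.new("Intro", nil),
  Paragraph.new("First item", 1),
  Paragraph.new("Second item", 1),
  Paragraph.new("Outro", nil)
]
grouped = group_lists(paragraphs)
puts grouped.map { |e| e.class.name }.join(", ")
```

The actual orchestrator may also split runs by numbering ID or nesting level; this sketch deliberately ignores both.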

Dispatch strategy:

HeadingRule and ListItemRule are dispatched directly by the orchestrator (they need context for style resolution).
All other element types are dispatched via RuleRegistry.
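
The two dispatch paths above can be sketched as follows. `SketchRegistry`, `SketchOrchestrator`, `Heading`, and `Table` are hypothetical names invented for illustration; they do not reflect the actual RuleRegistry or rule interfaces, which are not shown on this page.

```ruby
# Hypothetical element types standing in for OOXML nodes.
Heading = Struct.new(:text)
Table = Struct.new(:rows)

# Hypothetical registry: maps an element class to a callable rule.
class SketchRegistry
  def initialize
    @rules = {}
  end

  def register(klass, rule)
    @rules[klass] = rule
  end

  def rule_for(element)
    @rules[element.class]
  end
end

class SketchOrchestrator
  def initialize(registry)
    @registry = registry
  end

  def transform_element(element)
    # Direct dispatch: the orchestrator handles headings itself,
    # since it holds the context needed for style resolution.
    return { type: :heading, title: element.text } if element.is_a?(Heading)

    # Everything else goes through the registry; unknown elements yield nil.
    rule = @registry.rule_for(element)
    rule&.call(element)
  end
end

registry = SketchRegistry.new
registry.register(Table, ->(table) { { type: :table, rows: table.rows.size } })
orchestrator = SketchOrchestrator.new(registry)
p orchestrator.transform_element(Heading.new("Scope"))
p orchestrator.transform_element(Table.new([[1], [2]]))
```

Keeping context-dependent rules outside the registry means registry rules can stay simple callables that only see the element they transform.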

Examples:

Transform a DOCX document

doc = Uniword::DocumentFactory.from_file("input.docx")
core = ToCoreModel.transform(doc)
# => Coradoc::CoreModel::StructuralElement

Class Method Summary

.transform(document) ⇒ Object

Instance Method Summary

#transform(document) ⇒ Object

Class Method Details

.transform(document) ⇒ Object

# File 'lib/coradoc/docx/transform/to_core_model.rb', line 28

def transform(document)
  new.transform(document)
end

Instance Method Details

#transform(document) ⇒ Object

# File 'lib/coradoc/docx/transform/to_core_model.rb', line 33

def transform(document)
  registry = build_registry

  context = Context.new(
    styles_configuration: document.styles_configuration,
    numbering_configuration: document.numbering_configuration,
    footnotes: collect_footnotes(document),
    registry: registry
  )

  @heading_rule = Rules::HeadingRule.new
  @list_item_rule = Rules::ListItemRule.new

  body = document.body
  doc_title = extract_document_title(document, context)
  children = transform_elements(body, context)

  # If the first child is an H1 matching the doc title, skip the
  # duplicate — the document title already captures it
  if doc_title && children.first.is_a?(Coradoc::CoreModel::StructuralElement) &&
     children.first.section? &&
     children.first.title == doc_title &&
     children.first.level == 1
    children.shift
  end

  doc = Coradoc::CoreModel::StructuralElement.new(
    element_type: 'document',
    title: doc_title,
    children: children
  )

  # Extract semantic content from headers/footers
  extract_header_footer_metadata(document, doc)

  doc
end