Class: Canon::Xml::SaxBuilder

Inherits: Object

Defined in: lib/canon/xml/sax_builder.rb

Overview

Builds a Canon::Xml::Node tree using the Nokogiri SAX parser.

This is MUCH faster than DOM parsing + conversion because:

No intermediate Nokogiri DOM tree (saves ~60ms)
No tree traversal to build Canon (saves ~1200ms)
No memory overhead of two complete DOM trees

Current (SLOW): XML String → Nokogiri DOM (~60ms) → Canon DOM (~1200ms) = ~1260ms
Optimized (FAST): XML String → Nokogiri SAX → Canon DOM (~200ms) = ~200ms

Usage:

root = SaxBuilder.parse(xml_string, preserve_whitespace: false)
# root is a Canon::Xml::Nodes::RootNode

For C14N, use strip_doctype: true to avoid DTD default attribute expansion:

root = SaxBuilder.parse(xml_string, strip_doctype: true)

Class Method Summary

.parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ Nodes::RootNode

Parse XML string and return Canon::Xml::Node tree.

.strip_doctype_declaration(xml) ⇒ String

Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability from patterns like \s+ and [^>]*.

Instance Method Summary

#characters(string) ⇒ Object

Called for text content.

#comment(string) ⇒ Object

Called for comments.

#end_element(_name) ⇒ Object

Called when an element ends.

#error(string) ⇒ Object

SAX callbacks for libxml errors and warnings.

#initialize(preserve_whitespace: false) ⇒ SaxBuilder constructor

Initialize the SAX builder.

#processing_instruction(name, content) ⇒ Object

Called for processing instructions.

#reorder_children(root) ⇒ Object

Reorder root children so document element comes first followed by PIs and comments (outside document element).

#result ⇒ Nodes::RootNode

Return the built tree.

#start_element(name, attrs = []) ⇒ Object

Called when an element starts.

#warning(string) ⇒ Object

Constructor Details

#initialize(preserve_whitespace: false) ⇒ `SaxBuilder`

Initialize the SAX builder

Parameters:

preserve_whitespace (Boolean) (defaults to: false) —

Whether to preserve whitespace-only text nodes

# File 'lib/canon/xml/sax_builder.rb', line 81

def initialize(preserve_whitespace: false)
  super()
  @preserve_whitespace = preserve_whitespace
  @root = Nodes::RootNode.new
  @stack = [@root]
  # Track in-scope namespaces at each level
  # Each entry is a hash of prefix => uri
  @namespace_stack = [build_initial_namespaces]
  # Captured libxml errors during SAX parsing.  Surfaced on the
  # resulting RootNode so the diff report can warn the user
  # when a FATAL parse error has caused content loss
  # (see lutaml/canon#130).
  @parse_errors = []
end

Class Method Details

.parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ `Nodes::RootNode`

Parse XML string and return Canon::Xml::Node tree

Parameters:

xml_string (String) —

XML content to parse
preserve_whitespace (Boolean) (defaults to: false) —

Whether to preserve whitespace-only text nodes
strip_doctype (Boolean) (defaults to: false) —

Strip DOCTYPE before parsing (for C14N to avoid DTD default attrs)

Returns:

(Nodes::RootNode) —

Root of the data model tree

# File 'lib/canon/xml/sax_builder.rb', line 31

def self.parse(xml_string, preserve_whitespace: false,
               strip_doctype: false)
  # Strip DOCTYPE to prevent Nokogiri SAX from expanding DTD default attributes
  # This is needed for C14N which should NOT include default attributes from DTD
  # Use string methods instead of complex regex to avoid ReDoS vulnerability
  if strip_doctype
    xml_string = strip_doctype_declaration(xml_string)
  end

  builder = new(preserve_whitespace: preserve_whitespace)
  parser = Nokogiri::XML::SAX::Parser.new(builder)
  parser.parse(xml_string)
  builder.result
end

.strip_doctype_declaration(xml) ⇒ `String`

Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability from patterns like \s+ and [^>]*.

Parameters:

xml (String) —

XML string potentially containing DOCTYPE

Returns:

(String) —

XML string with DOCTYPE removed

# File 'lib/canon/xml/sax_builder.rb', line 51

def self.strip_doctype_declaration(xml)
  # Find DOCTYPE start (case-insensitive)
  doctype_start = xml.upcase.index("<!DOCTYPE")
  return xml unless doctype_start

  # Find the end of DOCTYPE - it ends with >
  # Handle both simple DOCTYPE and those with internal subset [...]
  pos = doctype_start + 9 # length of "<!DOCTYPE"
  in_bracket = false

  while pos < xml.length
    char = xml[pos]
    if char == "[" && !in_bracket
      in_bracket = true
    elsif char == "]" && in_bracket
      in_bracket = false
    elsif char == ">" && !in_bracket
      # Found the end of DOCTYPE
      return xml[0...doctype_start] + xml[(pos + 1)..]
    end
    pos += 1
  end

  # If we didn't find a proper end, just return original
  xml
end
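
To see what the bracket-tracking scan above does with an internal subset, the sketch below copies the same loop into a standalone method and applies it to a small input. The name `strip_doctype_demo` is hypothetical and exists only so the example runs without loading Canon; it is meant as an illustration of the listing, not a replacement for it.

```ruby
# Standalone copy of the scanning logic shown above.
# `strip_doctype_demo` is a hypothetical name used for this sketch only.
def strip_doctype_demo(xml)
  doctype_start = xml.upcase.index("<!DOCTYPE")
  return xml unless doctype_start

  pos = doctype_start + "<!DOCTYPE".length
  in_bracket = false

  while pos < xml.length
    case xml[pos]
    when "[" then in_bracket = true
    when "]" then in_bracket = false
    when ">"
      # A ">" inside [...] closes a markup declaration, not the DOCTYPE
      return xml[0...doctype_start] + xml[(pos + 1)..] unless in_bracket
    end
    pos += 1
  end

  xml # unterminated DOCTYPE: leave the input untouched
end

with_subset = %(<!DOCTYPE doc [<!ATTLIST doc lang CDATA "en">]><doc/>)
puts strip_doctype_demo(with_subset)              # prints <doc/>
puts strip_doctype_demo("<!DOCTYPE doc [<doc/>")  # prints the input unchanged
```

The `>` that ends the ATTLIST declaration is skipped because the scan is still inside the brackets; only the `>` after `]` ends the DOCTYPE.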

Instance Method Details

#characters(string) ⇒ `Object`

Called for text content

Parameters:

string (String) —

Text content

# File 'lib/canon/xml/sax_builder.rb', line 168

def characters(string)
  return if string.nil?

  parent = @stack.last

  # Capture raw text BEFORE entity resolution for accurate serialization
  raw_string = string

  # Decode numeric character references
  decoded_string = decode_character_references(string)

  # Combine with previous text node if adjacent (SAX can split text content)
  # This MUST happen before whitespace check, because SAX may split "foo "
  # into "foo" and " " callbacks - we need to combine them before deciding
  # whether to skip whitespace
  last_child = parent.children.last
  if last_child&.node_type == :text
    # Combine both raw and decoded forms
    last_child.value = last_child.value + decoded_string
    last_child.original = (last_child.original || "") + raw_string
    return
  end

  # Skip whitespace-only text nodes unless:
  # 1. preserve_whitespace is true, OR
  # 2. The content contains CR (from &#xD; entities) which must be preserved for C14N, OR
  # 3. The content contains non-ASCII whitespace (NBSP U+00A0, ideographic
  #    space U+3000, etc.) — those are semantically meaningful content,
  #    not pretty-print indentation, and must survive parsing so the
  #    comparator can detect Unicode whitespace-type differences.
  #
  # Strip only when the node is pure ASCII whitespace (space, tab, CR, LF).
  # This lets pretty-printed fixtures work (indent nodes stripped) while
  # preserving NBSP-only text nodes.
  if !@preserve_whitespace &&
     decoded_string.gsub(/[ \t\r\n]/, "").empty? &&
     parent.node_type == :element &&
     !decoded_string.include?("\r")
    # Only skip if parent is an element (not root)
    return
  end

  text = Nodes::TextNode.new(value: decoded_string, original: raw_string)
  parent.add_child(text)
end
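
The skip condition above combines four checks, so it may help to see it isolated as a predicate. The sketch below is a hypothetical restatement (`skip_text_node?` and its `parent_is_element` argument are not part of Canon; the argument stands in for `parent.node_type == :element`), assuming the decoded text is passed in directly.

```ruby
# Hypothetical predicate mirroring the whitespace skip condition above.
# Returns true when a text node would be dropped.
def skip_text_node?(text, preserve_whitespace: false, parent_is_element: true)
  return false if preserve_whitespace       # caller asked to keep everything
  return false unless parent_is_element     # text directly under the root is kept
  return false if text.include?("\r")       # CR (e.g. from &#xD;) is kept

  # Only pure ASCII whitespace is dropped; NBSP and friends survive
  text.gsub(/[ \t\r\n]/, "").empty?
end

p skip_text_node?("\n    ")                           # indentation
p skip_text_node?("\u00A0")                           # NBSP only
p skip_text_node?(" \r\n")                            # contains CR
p skip_text_node?("\n  ", preserve_whitespace: true)  # preservation requested
```

With these inputs only the first call returns true: indentation is dropped, while NBSP-only text, text containing CR, and anything parsed with `preserve_whitespace: true` is kept.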

#comment(string) ⇒ `Object`

Called for comments

Parameters:

string (String) —

Comment content

# File 'lib/canon/xml/sax_builder.rb', line 215

def comment(string)
  parent = @stack.last
  comment_node = Nodes::CommentNode.new(value: string)
  parent.add_child(comment_node)
end

#end_element(_name) ⇒ `Object`

Called when an element ends

Parameters:

_name (String) —

Element name (unused)

# File 'lib/canon/xml/sax_builder.rb', line 160

def end_element(_name)
  @stack.pop
  @namespace_stack.pop
end
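
The pairing of `start_element` (push) and `end_element` (pop) is what turns a flat stream of SAX events into a tree. The sketch below is a deliberately reduced, hypothetical model of that technique: `DemoNode`, `TinyStackBuilder`, and `demo_outline` are illustration names, not Canon classes, and namespaces, attributes, and text are left out.

```ruby
# Hypothetical miniature of the stack technique; not Canon's node classes.
DemoNode = Struct.new(:type, :name, :children)

class TinyStackBuilder
  attr_reader :root

  def initialize
    @root = DemoNode.new(:root, nil, [])
    @stack = [@root] # the last entry is always the currently open parent
  end

  def start_element(name)
    node = DemoNode.new(:element, name, [])
    @stack.last.children << node
    @stack.push(node) # nodes that follow attach to this element
  end

  def end_element(_name)
    @stack.pop # return to the enclosing element
  end

  # Number of elements currently open
  def depth
    @stack.length - 1
  end
end

# Renders the tree as indented element names, one per line
def demo_outline(node, indent = 0)
  lines = node.name ? ["#{'  ' * indent}#{node.name}"] : []
  next_indent = node.name ? indent + 1 : indent
  node.children.each { |child| lines.concat(demo_outline(child, next_indent)) }
  lines
end

# Events in the order a SAX parser would emit them for <a><b/><c/></a>
builder = TinyStackBuilder.new
builder.start_element("a")
builder.start_element("b")
builder.end_element("b")
builder.start_element("c")
builder.end_element("c")
builder.end_element("a")
puts demo_outline(builder.root)
```

Because every `end_element` pops exactly one entry, `b` and `c` end up as siblings under `a` rather than nested inside each other.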

#error(string) ⇒ `Object`

SAX callbacks for libxml errors and warnings. Without these overrides the default handlers swallow the events; with them, libxml’s “Attribute xml:lang redefined” and similar messages land in @parse_errors and ride through to ComparisonResult.



# File 'lib/canon/xml/sax_builder.rb', line 100

def error(string)
  @parse_errors << string.to_s.strip
end

#processing_instruction(name, content) ⇒ `Object`

Called for processing instructions

Parameters:

name (String) —

PI target
content (String) —

PI content

# File 'lib/canon/xml/sax_builder.rb', line 225

def processing_instruction(name, content)
  parent = @stack.last
  pi = Nodes::ProcessingInstructionNode.new(target: name,
                                            data: content || "")
  parent.add_child(pi)
end

#reorder_children(root) ⇒ `Object`

Reorder root children so document element comes first followed by PIs and comments (outside document element)

# File 'lib/canon/xml/sax_builder.rb', line 246

def reorder_children(root)
  doc_element = root.children.find { |c| c.node_type == :element }
  return unless doc_element

  other_children = root.children.reject { |c| c.node_type == :element }
  root.children = [doc_element] + other_children
end
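
As a quick illustration of the reordering, the sketch below runs the same find/reject logic on stand-in objects. `DemoChild`, `DemoRoot`, `demo_root`, and `reorder_children_demo` are hypothetical names; the only assumption is that nodes respond to `node_type` and the root exposes a writable `children` array, as in the listing above.

```ruby
# Hypothetical stand-ins for the root node and its children
DemoChild = Struct.new(:node_type, :label)
DemoRoot = Struct.new(:children)

# Same logic as the listing above, applied to the stand-in objects
def reorder_children_demo(root)
  doc_element = root.children.find { |c| c.node_type == :element }
  return unless doc_element

  others = root.children.reject { |c| c.node_type == :element }
  root.children = [doc_element] + others
end

# A document whose prolog holds a comment and a PI before the root element
def demo_root
  DemoRoot.new([
    DemoChild.new(:comment, "before"),
    DemoChild.new(:processing_instruction, "pi"),
    DemoChild.new(:element, "doc"),
    DemoChild.new(:comment, "after"),
  ])
end

root = demo_root
reorder_children_demo(root)
p root.children.map(&:label) # => ["doc", "before", "pi", "after"]
```

The non-element children keep their relative order; only the document element moves to the front.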

#result ⇒ `Nodes::RootNode`

Return the built tree

Returns:

(Nodes::RootNode) —

Root of the tree

# File 'lib/canon/xml/sax_builder.rb', line 235

def result
  # Reorder children so that the document element comes first,
  # followed by PIs and comments outside the document element
  # (C14N requires this ordering)
  reorder_children(@root)
  @root.parse_errors = @parse_errors if @parse_errors.any?
  @root
end

#start_element(name, attrs = []) ⇒ `Object`

Called when an element starts

Parameters:

name (String) —

Element name (may include prefix like “ns:element”)
attrs (Array) (defaults to: []) —

Array of [name, value] pairs

# File 'lib/canon/xml/sax_builder.rb', line 112

def start_element(name, attrs = [])
  parent = @stack.last

  # Parse namespace from name (prefix:localname or just localname)
  prefix, local_name = parse_qname(name)

  # Separate namespace declarations from regular attributes
  ns_decls, regular_attrs = separate_namespaces(attrs)

  # Build a prefix => uri hash from the declarations, then reject
  # any relative namespace URIs
  ns_hash = build_ns_hash(ns_decls)
  ns_hash.each_value do |uri|
    next if uri.nil? || uri.empty?

    if relative_uri?(uri)
      raise Canon::Error,
            "Relative namespace URI not allowed: #{uri}"
    end
  end

  # Push new namespace scope with declarations
  new_scope = @namespace_stack.last.merge(ns_hash)
  @namespace_stack.push(new_scope)

  # Find namespace URI from current scope
  ns_uri = new_scope[prefix.to_s]

  # Create element node
  element = Nodes::ElementNode.new(
    name: local_name,
    namespace_uri: ns_uri,
    prefix: prefix,
  )

  # Add namespace nodes from current scope
  add_namespace_nodes(element, new_scope)

  # Build and add attribute nodes (excluding xmlns declarations)
  add_attribute_nodes(element, regular_attrs)

  parent.add_child(element)
  @stack.push(element)
end
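
The namespace handling above relies on a stack of prefix => uri hashes, where each element's scope is its parent's scope merged with its own declarations. The sketch below models only that part. `DemoScopes`, `ns_declarations`, and `split_qname` are hypothetical names standing in for the private helpers the listing calls, and the initial scope containing only the `xml` prefix is an assumption about what `build_initial_namespaces` returns.

```ruby
# Hypothetical helper: collects xmlns declarations from SAX-style [name, value] pairs
def ns_declarations(attrs)
  attrs.each_with_object({}) do |(name, value), decls|
    if name == "xmlns"
      decls[""] = value # default namespace is stored under the empty prefix
    elsif name.start_with?("xmlns:")
      decls[name.delete_prefix("xmlns:")] = value
    end
  end
end

# Hypothetical helper: "a:item" => ["a", "item"], "item" => [nil, "item"]
def split_qname(name)
  name.include?(":") ? name.split(":", 2) : [nil, name]
end

class DemoScopes
  XML_NS = "http://www.w3.org/XML/1998/namespace"

  def initialize
    @stack = [{ "xml" => XML_NS }] # assumed initial scope
  end

  # Opens an element and returns [prefix, local_name, namespace_uri]
  def open(qname, attrs = [])
    scope = @stack.last.merge(ns_declarations(attrs))
    @stack.push(scope)
    prefix, local = split_qname(qname)
    [prefix, local, scope[prefix.to_s]] # nil.to_s is "", the default namespace key
  end

  # Closes the current element, restoring the parent's scope
  def close
    @stack.pop
  end
end

scopes = DemoScopes.new
p scopes.open("root", [["xmlns", "urn:default"], ["xmlns:a", "urn:a"]])
p scopes.open("a:item", [["xmlns:a", "urn:inner"]]) # inner declaration shadows urn:a
scopes.close
p scopes.open("a:item")                             # outer binding is back in effect
```

Because `merge` returns a new hash, an inner redeclaration never mutates the parent's scope, so popping on `end_element` is enough to restore the outer bindings.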

#warning(string) ⇒ `Object`

SAX callback for libxml warnings. Like #error, it appends the stripped message to @parse_errors.

# File 'lib/canon/xml/sax_builder.rb', line 104

def warning(string)
  @parse_errors << string.to_s.strip
end
