Class: Canon::Comparison::MarkupComparator

Inherits: Object

Defined in: lib/canon/comparison/markup_comparator.rb

Overview

Base class for markup document comparison (XML, HTML)

Provides shared comparison functionality for markup documents, including node type checking, text extraction, filtering, and difference creation.

Format-specific comparators (XmlComparator, HtmlComparator) inherit from this class and add format-specific behavior.

Direct Known Subclasses

HtmlComparator, XmlComparator

Class Method Summary

.add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object

Add a difference to the differences array.
.build_attribute_difference_reason(attrs1, attrs2) ⇒ String

Build a clear reason message for attribute presence differences. Shows which attributes are only in node1, only in node2, or have different values.
.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ String

Build a human-readable reason for a difference.
.build_path_for_node(node) ⇒ String?

Build canonical path for a node.
.build_text_difference_reason(text1, text2) ⇒ String

Build a clear reason message for text content differences. Shows the actual text content (truncated if too long).
.comment_node?(node) ⇒ Boolean

Check if a node is a comment node.
.determine_node_dimension(node) ⇒ Symbol

Determine the appropriate dimension for a node type.
.enrich_diff_metadata(node1, node2) ⇒ Hash

Enrich DiffNode with canonical path, serialized content, and attributes. This extracts presentation-ready metadata from nodes for Stage 4 rendering.
.extract_attributes(node) ⇒ Hash?

Extract attributes from a node.
.extract_text_content_from_node(node) ⇒ String?

Extract text content from a node for diff reason.
.filter_children(children, opts) ⇒ Array

Filter children based on options.
.node_excluded?(node, opts) ⇒ Boolean

Check if node should be excluded from comparison.
.node_text(node) ⇒ String

Get text content from a node.
.same_node_type?(node1, node2) ⇒ Boolean

Check if two nodes are the same type.
.serialize_element_node(node) ⇒ String

Serialize an element node to string.
.serialize_node(node) ⇒ String?

Serialize a node to string for display.
.text_node?(node) ⇒ Boolean

Check if a node is a text node.
.truncate_text(text, max_length = 40) ⇒ String

Truncate text for display in reason messages.
.whitespace_only_difference?(text1, text2) ⇒ Boolean

Check if difference between two texts is only whitespace.

Class Method Details

.add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ `Object`

Add a difference to the differences array

Creates a DiffNode with enriched metadata including path, serialized content, and attributes for Stage 4 rendering.

Parameters:

node1 (Object, nil) —

First node
node2 (Object, nil) —

Second node
diff1 (Symbol) —

Difference type for node1
diff2 (Symbol) —

Difference type for node2
dimension (Symbol) —

The match dimension causing this difference
_opts (Hash) —

Options (unused but kept for interface compatibility)
differences (Array) —

Array to append difference to

# File 'lib/canon/comparison/markup_comparator.rb', line 32

def add_difference(node1, node2, diff1, diff2, dimension, _opts,
                   differences)
  # All differences must be DiffNode objects (OO architecture)
  if dimension.nil?
    raise ArgumentError,
          "dimension required for DiffNode"
  end

  # Build informative reason message
  reason = build_difference_reason(node1, node2, diff1, diff2,
                                   dimension)

  # Enrich with path, serialized content, and attributes for Stage 4 rendering
  metadata = enrich_diff_metadata(node1, node2)

  diff_node = Canon::Diff::DiffNode.new(
    node1: node1,
    node2: node2,
    dimension: dimension,
    reason: reason,
    **metadata,
  )
  differences << diff_node
end
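
The guard and the append-to-array behaviour can be tried in isolation. The following is a minimal sketch, not the library's code: `FakeDiffNode` is a hypothetical stand-in for `Canon::Diff::DiffNode`, and a fixed placeholder string replaces the call to `build_difference_reason`.

```ruby
# FakeDiffNode is a hypothetical stand-in for Canon::Diff::DiffNode.
FakeDiffNode = Struct.new(:node1, :node2, :dimension, :reason,
                          keyword_init: true)

# Simplified mirror of add_difference: validate the dimension, build a
# diff object, and append it to the caller-owned array.
def add_difference_sketch(node1, node2, dimension, differences)
  raise ArgumentError, "dimension required for DiffNode" if dimension.nil?

  differences << FakeDiffNode.new(
    node1: node1,
    node2: node2,
    dimension: dimension,
    reason: "placeholder reason", # the real method calls build_difference_reason
  )
end

differences = []
add_difference_sketch("<a/>", "<b/>", :element_structure, differences)
puts differences.size            # prints 1
puts differences.first.dimension # prints element_structure

begin
  add_difference_sketch("<a/>", nil, nil, differences)
rescue ArgumentError => e
  puts e.message # prints: dimension required for DiffNode
end
```

The caller owns the `differences` array, so one array can be threaded through a whole recursive comparison and inspected at the end.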

.build_attribute_difference_reason(attrs1, attrs2) ⇒ `String`

Build a clear reason message for attribute presence differences. Shows which attributes are only in node1, only in node2, or have different values.

Parameters:

attrs1 (Hash, nil) —

First node’s attributes
attrs2 (Hash, nil) —

Second node’s attributes

Returns:

(String) —

Clear explanation of the attribute difference

# File 'lib/canon/comparison/markup_comparator.rb', line 316

def build_attribute_difference_reason(attrs1, attrs2)
  return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2

  require "set"
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set

  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  common = keys1 & keys2

  # Check if values differ for common keys
  different_values = common.reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

  if parts.empty?
    "#{keys1.size} vs #{keys2.size} attributes (same names)"
  else
    parts.join("; ")
  end
end
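
To see what the resulting messages look like, the logic above can be run without Canon. This sketch copies the listing into a standalone method (the name `attribute_reason_sketch` is an assumption made for illustration) and feeds it plain hashes.

```ruby
require "set"

# Standalone copy of the documented logic so the example runs without Canon.
def attribute_reason_sketch(attrs1, attrs2)
  unless attrs1 && attrs2
    return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes"
  end

  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set
  common = keys1 & keys2
  different_values = common.reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  only1 = keys1 - keys2
  only2 = keys2 - keys1
  parts << "only in first: #{only1.to_a.sort.join(', ')}" if only1.any?
  parts << "only in second: #{only2.to_a.sort.join(', ')}" if only2.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

  parts.empty? ? "#{keys1.size} vs #{keys2.size} attributes (same names)" : parts.join("; ")
end

puts attribute_reason_sketch({ "id" => "a", "class" => "x" },
                             { "id" => "b", "lang" => "en" })
# prints: only in first: class; only in second: lang; different values: id

puts attribute_reason_sketch(nil, { "id" => "a" })
# prints: 0 vs 1 attributes

puts attribute_reason_sketch({ "id" => "a" }, { "id" => "a" })
# prints: 1 vs 1 attributes (same names)
```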

.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ `String`

Build a human-readable reason for a difference

Parameters:

node1 (Object, nil) —

First node
node2 (Object, nil) —

Second node
diff1 (Symbol) —

Difference type for node1
diff2 (Symbol) —

Difference type for node2
dimension (Symbol) —

The dimension of the difference

Returns:

(String) —

Human-readable reason

# File 'lib/canon/comparison/markup_comparator.rb', line 287

def build_difference_reason(node1, node2, diff1, diff2, dimension)
  # For attribute presence differences, show what attributes differ
  if dimension == :attribute_presence
    attrs1 = extract_attributes(node1)
    attrs2 = extract_attributes(node2)
    return build_attribute_difference_reason(attrs1, attrs2)
  end

  # For text content differences, show the actual text (truncated if needed)
  if dimension == :text_content
    text1 = extract_text_content_from_node(node1)
    text2 = extract_text_content_from_node(node2)
    return build_text_difference_reason(text1, text2)
  end

  # Default reason - can be overridden in subclasses
  if diff1 == Canon::Comparison::MISSING_NODE && diff2 == Canon::Comparison::MISSING_NODE
    "element structure mismatch (children differ)"
  else
    Canon::Comparison.code_pair_label(diff1, diff2)
  end
end

.build_path_for_node(node) ⇒ `String`?

Build canonical path for a node

Parameters:

node (Object) —

Node to build path for

Returns:

(String, nil) —

Canonical path with ordinal indices

# File 'lib/canon/comparison/markup_comparator.rb', line 77

def build_path_for_node(node)
  return nil if node.nil?

  Canon::Diff::PathBuilder.build(node, format: :document)
end

.build_text_difference_reason(text1, text2) ⇒ `String`

Build a clear reason message for text content differences. Shows the actual text content (truncated if too long).

Parameters:

text1 (String, nil) —

First text content
text2 (String, nil) —

Second text content

Returns:

(String) —

Clear explanation of the text difference

# File 'lib/canon/comparison/markup_comparator.rb', line 371

def build_text_difference_reason(text1, text2)
  # Handle nil cases
  return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
  return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
  return "both missing" if text1.nil? && text2.nil?

  # Both have content - show truncated versions
  "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
end
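
The four message shapes are easiest to compare side by side. This is a standalone sketch under the assumption that truncation behaves as in the `truncate_text` listing on this page; the `_sketch` method names are illustrative only.

```ruby
# Local copies of the documented helpers so the example runs without Canon.
def truncate_sketch(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  text.length <= max_length ? text : "#{text[0...max_length]}..."
end

def text_reason_sketch(text1, text2)
  return "missing vs '#{truncate_sketch(text2)}'" if text1.nil? && text2
  return "'#{truncate_sketch(text1)}' vs missing" if text1 && text2.nil?
  return "both missing" if text1.nil? && text2.nil?

  "'#{truncate_sketch(text1)}' vs '#{truncate_sketch(text2)}'"
end

puts text_reason_sketch(nil, "new")   # prints: missing vs 'new'
puts text_reason_sketch("old", nil)   # prints: 'old' vs missing
puts text_reason_sketch(nil, nil)     # prints: both missing
puts text_reason_sketch("old", "new") # prints: 'old' vs 'new'

# Text longer than 40 characters is cut and suffixed with "...".
puts text_reason_sketch("x" * 45, "short")
```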

.comment_node?(node) ⇒ `Boolean`

Check if a node is a comment node

For XML/XHTML, this checks the node’s comment? method or node_type. For HTML, this also checks TEXT nodes that contain HTML-style comments (Nokogiri parses HTML comments as TEXT nodes with content like “<!-- comment -->” or escaped like “<\!-- comment -->” in full HTML documents).

Parameters:

node (Object) —

Node to check

Returns:

(Boolean) —

true if node is a comment



# File 'lib/canon/comparison/markup_comparator.rb', line 245

def comment_node?(node)
  NodeInspector.comment_node?(node)
end

.determine_node_dimension(node) ⇒ `Symbol`

Determine the appropriate dimension for a node type

Used by ChildComparison to tag per-child orphan diffs with a dimension that matches what the node is, so the formatter renders correctly. An element orphan tagged :text_content would otherwise route through PR #126’s one-sided text formatter and render as text “” instead of as the actual element (see lutaml/canon#125 follow-up).

Parameters:

node (Object) —

The node to check

Returns:

(Symbol) —

The dimension symbol

# File 'lib/canon/comparison/markup_comparator.rb', line 423

def determine_node_dimension(node)
  case node
  when Canon::Xml::Node
    case node.node_type
    when :element then :element_structure
    when :comment then :comments
    when :text, :cdata then :text_content
    when :processing_instruction then :processing_instructions
    else :text_content
    end
  when Nokogiri::XML::Node
    if node.comment?
      :comments
    elsif node.text? || node.cdata?
      :text_content
    elsif node.processing_instruction?
      :processing_instructions
    elsif node.element?
      :element_structure
    else
      :text_content
    end
  else
    :text_content
  end
end

.enrich_diff_metadata(node1, node2) ⇒ `Hash`

Enrich DiffNode with canonical path, serialized content, and attributes. This extracts presentation-ready metadata from nodes for Stage 4 rendering.

Parameters:

node1 (Object, nil) —

First node
node2 (Object, nil) —

Second node

Returns:

(Hash) —

Enriched metadata hash

# File 'lib/canon/comparison/markup_comparator.rb', line 63

def enrich_diff_metadata(node1, node2)
  {
    path: build_path_for_node(node1 || node2),
    serialized_before: serialize_node(node1),
    serialized_after: serialize_node(node2),
    attributes_before: extract_attributes(node1),
    attributes_after: extract_attributes(node2),
  }
end

.extract_attributes(node) ⇒ `Hash`?

Extract attributes from a node

Parameters:

node (Object, nil) —

Node to extract attributes from

Returns:

(Hash, nil) —

Hash of attribute name => value pairs

# File 'lib/canon/comparison/markup_comparator.rb', line 114

def extract_attributes(node)
  return nil if node.nil?

  # Canon::Xml::Node ElementNode
  if node.is_a?(Canon::Xml::Nodes::ElementNode)
    node.attribute_nodes.to_h do |attr|
      [attr.name, attr.value]
    end
  # Nokogiri elements
  elsif node.is_a?(Nokogiri::XML::Element)
    node.attributes.to_h do |_, attr|
      [attr.name, attr.value]
    end
  else
    {}
  end
end

.extract_text_content_from_node(node) ⇒ `String`?

Extract text content from a node for diff reason

Parameters:

node (Object, nil) —

Node to extract text from

Returns:

(String, nil) —

Text content or nil

# File 'lib/canon/comparison/markup_comparator.rb', line 346

def extract_text_content_from_node(node)
  return nil if node.nil?

  case node
  when Canon::Xml::Nodes::TextNode
    node.value
  when Canon::Xml::Node
    node.text_content
  when Nokogiri::XML::Node
    node.content.to_s
  when String
    node
  else
    node.to_s
  end
rescue StandardError
  nil
end

.filter_children(children, opts) ⇒ `Array`

Filter children based on options

Removes nodes that should be excluded from comparison based on options like :ignore_nodes, :ignore_comments, etc.

Parameters:

children (Array) —

Array of child nodes
opts (Hash) —

Comparison options

Returns:

(Array) —

Filtered array of children

# File 'lib/canon/comparison/markup_comparator.rb', line 140

def filter_children(children, opts)
  children.reject do |child|
    node_excluded?(child, opts)
  end
end
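
The effect of the simple boolean options can be illustrated with stand-in nodes. This sketch covers only the `:ignore_nodes`, `:ignore_comments`, and `:ignore_text_nodes` checks; the `match_opts` and whitespace handling in `node_excluded?` is deliberately left out. `FakeNode` and the `_sketch` methods are hypothetical names, not part of Canon.

```ruby
# FakeNode is a hypothetical stand-in for a parsed markup node.
FakeNode = Struct.new(:kind, :text)

# Reduced version of node_excluded?: only the three direct option checks.
def node_excluded_sketch?(node, opts)
  return false if node.nil?
  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && node.kind == :comment
  return true if opts[:ignore_text_nodes] && node.kind == :text

  false
end

# Same shape as filter_children: reject whatever the predicate excludes.
def filter_children_sketch(children, opts)
  children.reject { |child| node_excluded_sketch?(child, opts) }
end

children = [
  FakeNode.new(:element, "p"),
  FakeNode.new(:comment, "todo"),
  FakeNode.new(:text, "hi"),
]

p filter_children_sketch(children, { ignore_comments: true }).map(&:kind)
# prints [:element, :text]

p filter_children_sketch(
  children, { ignore_nodes: [children[0]], ignore_text_nodes: true }
).map(&:kind)
# prints [:comment]
```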

.node_excluded?(node, opts) ⇒ `Boolean`

Check if node should be excluded from comparison

Parameters:

node (Object) —

Node to check
opts (Hash) —

Comparison options

Returns:

(Boolean) —

true if node should be excluded

# File 'lib/canon/comparison/markup_comparator.rb', line 151

def node_excluded?(node, opts)
  return false if node.nil?

  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && comment_node?(node)
  return true if opts[:ignore_text_nodes] && text_node?(node)

  # Check match options
  match_opts = opts[:match_opts]
  return false unless match_opts

  # Filter comments based on match options and format
  # HTML: Filter comments to avoid spurious differences from zip pairing
  #       BUT only when not in verbose mode (verbose needs differences recorded)
  # XML: Don't filter comments (allow informative differences to be recorded)
  if match_opts[:comments] == :ignore && comment_node?(node)
    # In verbose mode, don't filter comments - we want to record the differences
    return false if opts[:verbose]

    # Only filter comments for HTML, not XML (when not verbose)
    format = opts[:format] || match_opts[:format]
    if %i[html html4 html5].include?(format)
      return true
    end
  end

  # Strip whitespace-only text nodes based on parent element configuration.
  # Use preserve_whitespace_elements / strip_whitespace_elements to control.
  # Blacklist (strip) > preserve > collapse > format defaults.
  return false unless text_node?(node) && node.parent
  return false unless MatchOptions.normalize_text(node_text(node)).empty?

  # NBSP (U+00A0) is never insignificant whitespace —
  # it always renders as a visible non-breaking space.
  # For HTML: always preserve NBSP nodes.
  # For XML with whitespace_type: :strict: preserve NBSP nodes so
  # different Unicode whitespace types remain distinguishable.
  format = opts[:format] || match_opts[:format]
  whitespace_type = match_opts[:whitespace_type] || :strict
  if (%i[html html4 html5].include?(format) || whitespace_type == :strict) &&
     WhitespaceSensitivity.contains_nbsp?(node_text(node))
    return false
  end

  if %i[html html4 html5].include?(format) &&
     WhitespaceSensitivity.inline_whitespace_significant?(node)
    # Whitespace between inline element siblings is semantically
    # significant (renders as a visible gap) and must not be stripped.
    return false
  end

  return true unless WhitespaceSensitivity.whitespace_preserved?(
    node.parent, match_opts
  )

  # When the pretty-print-side flag is active (set by opts_for_side in
  # ChildComparison.compare), drop whitespace-only text nodes that start
  # with "\n" inside :collapse elements — they are structural indentation
  # from the pretty-printer, not content.  Space-only nodes (no initial "\n") are
  # real inline content and are kept for normalised comparison.
  # :preserve elements are always left unchanged.
  if match_opts[:_pretty_print_side_active]
    ws_class = WhitespaceSensitivity.classify_text_node(node, opts)
    return true if ws_class == :collapse && node_text(node).start_with?("\n")
  end

  false
end

.node_text(node) ⇒ `String`

Get text content from a node

Parameters:

node (Object) —

Node to get text from

Returns:

(String) —

Text content



# File 'lib/canon/comparison/markup_comparator.rb', line 261

def node_text(node)
  NodeInspector.text_content(node)
end

.same_node_type?(node1, node2) ⇒ `Boolean`

Check if two nodes are the same type

Parameters:

node1 (Object) —

First node
node2 (Object) —

Second node

Returns:

(Boolean) —

true if nodes are same type

# File 'lib/canon/comparison/markup_comparator.rb', line 225

def same_node_type?(node1, node2)
  return false if node1.class != node2.class

  case node1
  when Canon::Xml::Node, Nokogiri::XML::Node
    node1.node_type == node2.node_type
  else
    true
  end
end

.serialize_element_node(node) ⇒ `String`

Serialize an element node to string

Parameters:

node (Canon::Xml::Nodes::ElementNode) —

Element node

Returns:

(String) —

Serialized element

# File 'lib/canon/comparison/markup_comparator.rb', line 399

def serialize_element_node(node)
  attrs = node.attribute_nodes.map do |a|
    " #{a.name}=\"#{a.value}\""
  end.join
  children_xml = node.children.map { |c| serialize_node(c) }.join

  if children_xml.empty?
    "<#{node.name}#{attrs}/>"
  else
    "<#{node.name}#{attrs}>#{children_xml}</#{node.name}>"
  end
end
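
The output format can be checked with stand-in structs. In this sketch `FakeAttr` and `FakeElement` are hypothetical replacements for Canon's attribute and element node classes, and plain strings stand in for text nodes. Note that, as in the listing above, attribute values are interpolated as-is with no escaping.

```ruby
# Hypothetical stand-ins for Canon's attribute and element nodes.
FakeAttr = Struct.new(:name, :value)
FakeElement = Struct.new(:name, :attribute_nodes, :children)

# Mirrors serialize_element_node; plain strings act as text nodes.
def serialize_sketch(node)
  return node.to_s unless node.is_a?(FakeElement)

  attrs = node.attribute_nodes.map { |a| " #{a.name}=\"#{a.value}\"" }.join
  inner = node.children.map { |c| serialize_sketch(c) }.join

  if inner.empty?
    "<#{node.name}#{attrs}/>" # no serialized children: self-closing form
  else
    "<#{node.name}#{attrs}>#{inner}</#{node.name}>"
  end
end

br = FakeElement.new("br", [], [])
para = FakeElement.new("p", [FakeAttr.new("class", "note")], ["Hi", br])

puts serialize_sketch(br)   # prints: <br/>
puts serialize_sketch(para) # prints: <p class="note">Hi<br/></p>
```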

.serialize_node(node) ⇒ `String`?

Serialize a node to string for display

Parameters:

node (Object, nil) —

Node to serialize

Returns:

(String, nil) —

Serialized content

# File 'lib/canon/comparison/markup_comparator.rb', line 87

def serialize_node(node)
  return nil if node.nil?

  # Canon::Xml::Node types
  case node
  when Canon::Xml::Nodes::RootNode
    # Serialize all children of root
    node.children.map { |child| serialize_node(child) }.join
  when Canon::Xml::Nodes::ElementNode
    serialize_element_node(node)
  when Canon::Xml::Nodes::TextNode
    # Use original text (with entity references) if available,
    # otherwise fall back to value (decoded text)
    node.original || node.value
  when Canon::Xml::Nodes::CommentNode
    "<!--#{node.value}-->"
  when Canon::Xml::Nodes::ProcessingInstructionNode
    "<?#{node.target} #{node.data}?>"
  else
    node.to_s
  end
end

.text_node?(node) ⇒ `Boolean`

Check if a node is a text node

Parameters:

node (Object) —

Node to check

Returns:

(Boolean) —

true if node is a text node



# File 'lib/canon/comparison/markup_comparator.rb', line 253

def text_node?(node)
  NodeInspector.text_node?(node)
end

.truncate_text(text, max_length = 40) ⇒ `String`

Truncate text for display in reason messages

Parameters:

text (String) —

Text to truncate
max_length (Integer) (defaults to: 40) —

Maximum length

Returns:

(String) —

Truncated text

# File 'lib/canon/comparison/markup_comparator.rb', line 386

def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end
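
A few boundary cases make the length behaviour concrete. The sketch below is a standalone copy of the listing (named `truncate_text_sketch` for illustration). It shows that `max_length` limits the kept prefix, so a truncated result is three characters longer than `max_length` because of the appended "...".

```ruby
# Standalone copy of the documented method so the example runs without Canon.
def truncate_text_sketch(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end

puts truncate_text_sketch(nil).inspect        # prints ""
puts truncate_text_sketch("a" * 40).length    # prints 40 (unchanged)
puts truncate_text_sketch("a" * 41).length    # prints 43 (40 kept + "...")
puts truncate_text_sketch("hello world", 5)   # prints hello...
puts truncate_text_sketch(12_345, 3)          # prints 123... (non-strings go through to_s)
```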

.whitespace_only_difference?(text1, text2) ⇒ `Boolean`

Check if difference between two texts is only whitespace

Parameters:

text1 (String) —

First text
text2 (String) —

Second text

Returns:

(Boolean) —

true if difference is only in whitespace

# File 'lib/canon/comparison/markup_comparator.rb', line 270

def whitespace_only_difference?(text1, text2)
  # Normalize both texts (collapse/trim whitespace)
  norm1 = MatchOptions.normalize_text(text1)
  norm2 = MatchOptions.normalize_text(text2)

  # If normalized texts are the same, the difference was only whitespace
  norm1 == norm2
end
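
The comparison idea can be sketched without Canon. The exact rules of `MatchOptions.normalize_text` are not shown on this page, so `normalize_sketch` below is an assumption based on the "collapse/trim whitespace" comment: it collapses runs of whitespace to a single space and strips the ends.

```ruby
# Assumed stand-in for MatchOptions.normalize_text: collapse whitespace
# runs to one space and trim both ends. The real rules may differ.
def normalize_sketch(text)
  text.to_s.gsub(/\s+/, " ").strip
end

# Two texts differ only in whitespace when their normalized forms match.
def whitespace_only_difference_sketch?(text1, text2)
  normalize_sketch(text1) == normalize_sketch(text2)
end

puts whitespace_only_difference_sketch?("Hello   world\n", " Hello world") # prints true
puts whitespace_only_difference_sketch?("Hello world", "Hello, world")     # prints false
```

Because the check is plain equality of normalized strings, it also returns true for two identical inputs; callers presumably only reach it after a raw mismatch has been found.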
