Class: Canon::Comparison::XmlComparator

Inherits: MarkupComparator

Inheritance chain: Object, MarkupComparator, Canon::Comparison::XmlComparator

Defined in: lib/canon/comparison/xml_comparator.rb

Overview

XML comparison class. Compares XML nodes for equivalence, with options that control structural filtering, output, match behavior, and diff display.

Inherits shared comparison functionality from MarkupComparator.

Constant Summary

DEFAULT_OPTS

Default comparison options for XML:

{
  # Structural filtering options
  ignore_children: false,
  ignore_text_nodes: false,
  ignore_attr_content: [],
  ignore_attrs: [],
  ignore_attrs_by_name: [],
  ignore_nodes: [],

  # Output options
  verbose: false,
  diff_children: false,

  # Match system options
  match_profile: nil,
  match: nil,
  preprocessing: nil,
  global_profile: nil,
  global_options: nil,

  # Diff display options
  diff: nil,
}.freeze
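
The `.equivalent?` listing further down layers caller options over these defaults with `DEFAULT_OPTS.merge(opts)`. The following is a minimal sketch of that merge, assuming a trimmed stand-in hash; `DEFAULTS_SKETCH` and `caller_opts` are hypothetical names used only for illustration and are not part of Canon.

```ruby
# Trimmed stand-in for DEFAULT_OPTS; only a few of the keys are reproduced.
DEFAULTS_SKETCH = {
  ignore_children: false,
  ignore_attrs: [],
  verbose: false,
  match_profile: nil,
}.freeze

# Caller-supplied options override the defaults key by key.
caller_opts = { verbose: true, ignore_attrs: ["id"] }
merged = DEFAULTS_SKETCH.merge(caller_opts)

puts merged.inspect
# Hash#merge returns a new, unfrozen hash, so the frozen defaults stay untouched.
puts DEFAULTS_SKETCH[:verbose].inspect
```

Because the merged hash is a fresh object, later code can add keys to it (the listing stores the resolved match options under `:match_opts`) without touching the frozen constant.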

Class Method Summary

.build_attribute_diff_reason(attrs1, attrs2) ⇒ String

Build a clear reason message for attribute presence differences.
.build_attribute_value_diff_reason(attrs1, attrs2) ⇒ String

Build a clear reason message for attribute value differences.
.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ String

Build a human-readable reason for a difference.
.build_text_diff_reason(text1, text2) ⇒ String

Build a clear reason message for text content differences.
.build_whitespace_adjacency_reason(node1, node2) ⇒ Object

Build a Reason line for a :whitespace_adjacency diff (#137).
.character_visualization_map ⇒ Hash

Get the character visualization map (lazy-loaded to avoid circular dependency).
.comment_vs_non_comment_comparison?(node1, node2) ⇒ Boolean

Check if this is a comment vs non-comment comparison.
.compare_attribute_sets(n1, n2, opts, differences) ⇒ Object

Compare attribute sets. Delegates to XmlComparatorHelpers::AttributeComparator.
.compare_children(n1, n2, opts, child_opts, diff_children, differences) ⇒ Object

Compare children of two nodes using semantic matching.
.compare_comment_nodes(n1, n2, opts, differences) ⇒ Object

Compare comment nodes.
.compare_document_nodes(n1, n2, opts, child_opts, diff_children, differences) ⇒ Object

Compare document nodes.
.compare_element_nodes(n1, n2, opts, child_opts, diff_children, differences) ⇒ Object

Compare two element nodes.
.compare_namespace_declarations(n1, n2, opts, differences) ⇒ Object

Compare namespace declarations (xmlns and xmlns:* attributes). Delegates to XmlComparatorHelpers::NamespaceComparator.
.compare_nodes(n1, n2, opts, child_opts, diff_children, differences) ⇒ Object

Main comparison dispatcher.
.compare_processing_instruction_nodes(n1, n2, opts, differences) ⇒ Object

Compare processing instruction nodes.
.compare_text_nodes(n1, n2, opts, differences) ⇒ Object

Compare text nodes.
.describe_whitespace(text) ⇒ String

Describe whitespace content in a readable way.
.equivalent?(n1, n2, opts = {}, child_opts = {}) ⇒ Boolean, Array

Compare two XML nodes for equivalence.
.extract_attributes(node) ⇒ Hash, nil

Extract attributes from a node as a normalized hash.
.extract_element_path(node) ⇒ Array<String>

Extract element path for context (best effort).
.extract_text_from_node(node) ⇒ String, nil

Extract text from a node for diff reason.
.in_preserve_element?(node, preserve_list) ⇒ Boolean

Check if a node is inside a whitespace-preserving element.
.non_ws_sibling_exists?(siblings, idx, direction) ⇒ Boolean
.serialize_node(node) ⇒ String, nil

Serialize a node to string for display.
.should_preserve_whitespace_strictly?(n1, n2, opts) ⇒ Boolean

Check if whitespace should be preserved strictly for these text nodes. This applies to HTML elements like pre, code, textarea, script, and style, and to elements with xml:space="preserve" or in the user-configured preserve list.
.truncate_text(text, max_length = 40) ⇒ String

Truncate text for display in reason messages.
.visualize_whitespace(text) ⇒ String

Make whitespace visible in text content. Uses the existing character visualization map from DiffFormatter (single source of truth).
.whitespace_only?(text) ⇒ Boolean

Check if text is only whitespace.
.whitespace_partner_direction(ws_node) ⇒ Object

Direction of the partner content relative to the whitespace node, phrased from the partner’s point of view: “before” when the whitespace immediately precedes its next non-whitespace sibling (the alignment partner on the other side), “after” when the whitespace trails the previous non-whitespace sibling, or “adjacent to” as a degenerate fallback when neither neighbour exists.

Methods inherited from MarkupComparator

add_difference, build_attribute_difference_reason, build_path_for_node, build_text_difference_reason, comment_node?, determine_node_dimension, enrich_diff_metadata, extract_text_content_from_node, filter_children, node_excluded?, node_text, same_node_type?, serialize_element_node, text_node?, whitespace_only_difference?

Class Method Details

.build_attribute_diff_reason(attrs1, attrs2) ⇒ `String`

Build a clear reason message for attribute presence differences

Parameters:

attrs1 (Hash, nil) —

First node’s attributes
attrs2 (Hash, nil) —

Second node’s attributes

Returns:

(String) —

Clear explanation of the attribute difference

# File 'lib/canon/comparison/xml_comparator.rb', line 765

def build_attribute_diff_reason(attrs1, attrs2)
  return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2

  require "set"
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set

  only_in_first = keys1 - keys2
  only_in_second = keys2 - keys1
  common = keys1 & keys2

  # Check if values differ for common keys
  different_values = common.reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_first.to_a.sort.join(', ')}" if only_in_first.any?
  parts << "only in second: #{only_in_second.to_a.sort.join(', ')}" if only_in_second.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

  if parts.empty?
    "#{keys1.size} vs #{keys2.size} attributes (same names)"
  else
    parts.join("; ")
  end
end
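
To make the message format concrete, here is a small worked example. It is a sketch that copies the set arithmetic from the listing above into a standalone method; `attribute_diff_reason_sketch` and the sample hashes are hypothetical and only illustrate the output shape.

```ruby
require "set"

# Standalone copy of the set arithmetic from the listing, for illustration only.
def attribute_diff_reason_sketch(attrs1, attrs2)
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set
  only_in_first = keys1 - keys2
  only_in_second = keys2 - keys1
  different_values = (keys1 & keys2).reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_first.to_a.sort.join(', ')}" if only_in_first.any?
  parts << "only in second: #{only_in_second.to_a.sort.join(', ')}" if only_in_second.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?
  parts.empty? ? "#{keys1.size} vs #{keys2.size} attributes (same names)" : parts.join("; ")
end

first = { "id" => "a1", "class" => "note", "lang" => "en" }
second = { "id" => "a2", "class" => "note", "title" => "Intro" }

reason = attribute_diff_reason_sketch(first, second)
same = attribute_diff_reason_sketch(first, first)
puts reason # only in first: lang; only in second: title; different values: id
puts same   # 3 vs 3 attributes (same names)
```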

.build_attribute_value_diff_reason(attrs1, attrs2) ⇒ `String`

Build a clear reason message for attribute value differences

Parameters:

attrs1 (Hash, nil) —

First node’s attributes
attrs2 (Hash, nil) —

Second node’s attributes

Returns:

(String) —

Clear explanation of the attribute value difference

# File 'lib/canon/comparison/xml_comparator.rb', line 741

def build_attribute_value_diff_reason(attrs1, attrs2)
  return "missing vs present attributes" unless attrs1 && attrs2

  require "set"
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set

  common = keys1 & keys2
  different_values = common.reject { |k| attrs1[k] == attrs2[k] }

  return "all attribute values match" if different_values.empty?

  parts = different_values.map do |k|
    "#{k}: #{attrs1[k].inspect} vs #{attrs2[k].inspect}"
  end

  parts.join("; ")
end
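
One consequence of the listing above is that only keys present on both sides are reported; an attribute that exists on one side only is left to the presence reason. A minimal sketch of that behavior follows, using plain array intersection in place of the `Set` version and hypothetical sample data.

```ruby
attrs1 = { "id" => "a1", "class" => "note" }
attrs2 = { "id" => "a2", "class" => "note", "lang" => "en" }

# Array intersection stands in for the Set intersection used in the listing.
common = attrs1.keys & attrs2.keys
changed = common.reject { |k| attrs1[k] == attrs2[k] }

# Values are rendered with #inspect, so strings appear quoted in the message.
reason = changed.map { |k| "#{k}: #{attrs1[k].inspect} vs #{attrs2[k].inspect}" }.join("; ")
puts reason # id: "a1" vs "a2"
```

The `lang` attribute does not appear in the message because it has no counterpart in `attrs1`.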

.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ `String`

Build a human-readable reason for a difference

Parameters:

node1 (Object) —

First node
node2 (Object) —

Second node
diff1 (String) —

Difference type for node1
diff2 (String) —

Difference type for node2
dimension (Symbol) —

The dimension of the difference

Returns:

(String) —

Human-readable reason

# File 'lib/canon/comparison/xml_comparator.rb', line 664

def build_difference_reason(node1, node2, diff1, diff2, dimension)
  # For deleted/inserted nodes, include namespace information if available
  if dimension == :text_content && (node1.nil? || node2.nil?)
    node = node1 || node2
    if node.respond_to?(:name) && node.respond_to?(:namespace_uri)
      ns = node.namespace_uri
      ns_info = if ns.nil? || ns.empty?
                  ""
                else
                  " (namespace: #{ns})"
                end
      label = Canon::Comparison.code_pair_label(diff1, diff2)
      return "element '#{node.name}'#{ns_info}: #{label}"
    elsif node.respond_to?(:name) && !node.respond_to?(:namespace_uri)
      # TextNode and other nodes without namespace_uri
      display = if node.respond_to?(:value) && node.node_type == :text
                  "\"#{truncate_text(node.value)}\""
                else
                  node.name.to_s
                end
      return "element missing: #{display}"
    end
  end

  # For attribute presence differences, show what attributes differ
  if dimension == :attribute_presence
    attrs1 = extract_attributes(node1)
    attrs2 = extract_attributes(node2)
    return build_attribute_diff_reason(attrs1, attrs2)
  end

  # For text content differences, show the actual text (truncated if needed)
  if dimension == :text_content
    text1 = extract_text_from_node(node1)
    text2 = extract_text_from_node(node2)
    return build_text_diff_reason(text1, text2)
  end

  if dimension == :whitespace_adjacency
    return build_whitespace_adjacency_reason(node1, node2)
  end

  # For attribute values differences, show the actual values
  if dimension == :attribute_values
    attrs1 = extract_attributes(node1)
    attrs2 = extract_attributes(node2)
    return build_attribute_value_diff_reason(attrs1, attrs2)
  end

  # For attribute order differences, show the actual attribute names
  if dimension == :attribute_order
    attrs1 = extract_attributes(node1)&.keys || []
    attrs2 = extract_attributes(node2)&.keys || []
    return "Attribute order changed: [#{attrs1.join(', ')}] → [#{attrs2.join(', ')}]"
  end

  if diff1 == Canon::Comparison::MISSING_NODE && diff2 == Canon::Comparison::MISSING_NODE
    "element structure mismatch (children differ)"
  elsif dimension == :element_structure &&
      diff1 == Canon::Comparison::UNEQUAL_ELEMENTS &&
      diff2 == Canon::Comparison::UNEQUAL_ELEMENTS &&
      (node1.is_a?(Canon::Xml::Node) || node1.is_a?(Nokogiri::XML::Node)) &&
      (node2.is_a?(Canon::Xml::Node) || node2.is_a?(Nokogiri::XML::Node)) &&
      node1.name && node2.name && node1.name != node2.name
    # Most common case: differing element names.  Surface the
    # actual names rather than a generic "elements differ".
    "different element name (<#{node1.name}> vs <#{node2.name}>)"
  else
    Canon::Comparison.code_pair_label(diff1, diff2)
  end
end

.build_text_diff_reason(text1, text2) ⇒ `String`

Build a clear reason message for text content differences

Parameters:

text1 (String, nil) —

First text content
text2 (String, nil) —

Second text content

Returns:

(String) —

Clear explanation of the text difference

# File 'lib/canon/comparison/xml_comparator.rb', line 827

def build_text_diff_reason(text1, text2)
  # Handle nil cases
  return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
  return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
  return "both missing" if text1.nil? && text2.nil?

  # Check if both are whitespace-only
  if whitespace_only?(text1) && whitespace_only?(text2)
    return "whitespace: #{describe_whitespace(text1)} vs #{describe_whitespace(text2)}"
  end

  # Show text with visible whitespace markers
  # Use escaped representations for clarity: \n for newline, \t for tab, · for spaces
  vis1 = visualize_whitespace(text1)
  vis2 = visualize_whitespace(text2)

  "Text: \"#{vis1}\" vs \"#{vis2}\""
end

.build_whitespace_adjacency_reason(node1, node2) ⇒ `Object`

Build a Reason line for a :whitespace_adjacency diff (#137). Names which side carries the whitespace, the adjacency position relative to content neighbours, and surfaces the whitespace with visible markers.

# File 'lib/canon/comparison/xml_comparator.rb', line 850

def build_whitespace_adjacency_reason(node1, node2)
  text1 = extract_text_from_node(node1)
  text2 = extract_text_from_node(node2)

  ni = NodeInspector
  ws_on_first = ni.whitespace_only_text?(node1) &&
    !ni.whitespace_only_text?(node2)
  ws_on_second = ni.whitespace_only_text?(node2) &&
    !ni.whitespace_only_text?(node1)

  if ws_on_first
    ws_text = text1
    content_text = text2
    present_side = "EXPECTED"
    absent_side = "ACTUAL"
    ws_node = node1
  elsif ws_on_second
    ws_text = text2
    content_text = text1
    present_side = "ACTUAL"
    absent_side = "EXPECTED"
    ws_node = node2
  else
    return build_text_diff_reason(text1, text2)
  end

  direction = whitespace_partner_direction(ws_node)
  ws_vis = visualize_whitespace(ws_text)
  content_vis = content_text ? visualize_whitespace(truncate_text(content_text)) : "(none)"

  "Whitespace #{direction} \"#{content_vis}\": " \
    "present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
end

.character_visualization_map ⇒ `Hash`

Get the character visualization map (lazy-loaded to avoid circular dependency)

Returns:

(Hash) —

Character to visualization symbol mapping

# File 'lib/canon/comparison/xml_comparator.rb', line 949

def character_visualization_map
  @character_visualization_map ||= begin
    # Load the YAML file directly to avoid circular dependency
    require "yaml"
    lib_root = File.expand_path("../..", __dir__)
    yaml_path = File.join(lib_root,
                          "canon/diff_formatter/character_map.yml")
    data = YAML.load_file(yaml_path)

    # Build visualization map from the YAML data
    visualization_map = {}
    data["characters"].each do |char_data|
      # Get the character from either unicode code point or character field
      char = if char_data["unicode"]
               # Convert hex string to character
               [char_data["unicode"].to_i(16)].pack("U")
             else
               # Use character field directly (handles \n, \t, etc.)
               char_data["character"]
             end

      vis = char_data["visualization"]
      visualization_map[char] = vis
    end

    visualization_map
  end
end
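
The two ways an entry can name its character are easiest to see with sample data. The sketch below assumes entries shaped like the ones the listing reads (`unicode`, `character`, `visualization` keys); the specific entries and visualization symbols are made up for illustration and are not taken from `character_map.yml`.

```ruby
# Hypothetical entries shaped like the ones the listing iterates over.
entries = [
  { "unicode" => "00A0", "visualization" => "░" },
  { "character" => "\t", "visualization" => "⇥" },
]

map = entries.each_with_object({}) do |entry, acc|
  char = if entry["unicode"]
           # Hex string -> integer code point -> one-character UTF-8 string.
           [entry["unicode"].to_i(16)].pack("U")
         else
           entry["character"]
         end
  acc[char] = entry["visualization"]
end

puts map.inspect
```

`Array#pack("U")` encodes the code point as UTF-8, so the key for `"00A0"` is the non-breaking space character itself.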

.comment_vs_non_comment_comparison?(node1, node2) ⇒ `Boolean`

Check if this is a comment vs non-comment comparison

Parameters:

node1 (Object) —

First node
node2 (Object) —

Second node

Returns:

(Boolean) —

true if exactly one node is a comment

# File 'lib/canon/comparison/xml_comparator.rb', line 369

def comment_vs_non_comment_comparison?(node1, node2)
  require_relative "xml_node_comparison"

  node1_comment = XmlNodeComparison
    .comment_node?(node1, check_children: true)
  node2_comment = XmlNodeComparison
    .comment_node?(node2, check_children: true)

  # XOR: exactly one is a comment
  node1_comment ^ node2_comment
end
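
The final `^` is Ruby's boolean exclusive-or. As a quick illustration, independent of Canon, the truth table below shows that the result is true only when exactly one operand is true.

```ruby
pairs = [[true, true], [true, false], [false, true], [false, false]]

# TrueClass#^ and FalseClass#^ implement exclusive-or on booleans.
results = pairs.map { |a, b| a ^ b }

pairs.zip(results).each do |(a, b), result|
  puts "#{a} ^ #{b} => #{result}"
end
```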

.compare_attribute_sets(n1, n2, opts, differences) ⇒ `Object`

Compare attribute sets. Delegates to XmlComparatorHelpers::AttributeComparator.

# File 'lib/canon/comparison/xml_comparator.rb', line 429

def compare_attribute_sets(n1, n2, opts, differences)
  XmlComparatorHelpers::AttributeComparator.compare(n1, n2, opts,
                                                    differences)
end

.compare_children(n1, n2, opts, child_opts, diff_children, differences) ⇒ `Object`

Compare children of two nodes using semantic matching

Delegates to ChildComparison module which handles both ElementMatcher (semantic matching) and simple positional comparison.

# File 'lib/canon/comparison/xml_comparator.rb', line 604

def compare_children(n1, n2, opts, child_opts, diff_children,
                     differences)
  XmlComparatorHelpers::ChildComparison.compare(
    n1, n2, self, opts, child_opts, diff_children, differences
  )
end

.compare_comment_nodes(n1, n2, opts, differences) ⇒ `Object`

Compare comment nodes

# File 'lib/canon/comparison/xml_comparator.rb', line 534

def compare_comment_nodes(n1, n2, opts, differences)
  match_opts = opts[:match_opts]
  behavior = match_opts[:comments]

  # Canon::Xml::Node CommentNode uses .value, Nokogiri uses .content
  content1 = node_text(n1)
  content2 = node_text(n2)

  # Check if content differs
  contents_differ = content1 != content2

  # Create DiffNode in verbose mode when content differs
  # This ensures informative diffs are created even for :ignore behavior
  if contents_differ && opts[:verbose]
    add_difference(n1, n2, Comparison::UNEQUAL_COMMENTS,
                   Comparison::UNEQUAL_COMMENTS, :comments, opts,
                   differences)
  end

  # Return based on behavior and whether content matches
  if behavior == :ignore || !contents_differ
    Comparison::EQUIVALENT
  else
    Comparison::UNEQUAL_COMMENTS
  end
end
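
The listing separates two decisions: whether a difference is recorded (verbose mode only) and which result code is returned (depends on the `:comments` behavior). The sketch below restates that logic with plain symbols; `comment_outcome_sketch` is a hypothetical helper, and `:strict` stands in for any behavior other than `:ignore`.

```ruby
# Returns what the listing would record and return, using symbols as stand-ins
# for the Comparison constants.
def comment_outcome_sketch(behavior, content1, content2, verbose: false)
  differ = content1 != content2
  {
    recorded: differ && verbose,
    result: behavior == :ignore || !differ ? :equivalent : :unequal_comments,
  }
end

ignored = comment_outcome_sketch(:ignore, "draft", "final", verbose: true)
strict = comment_outcome_sketch(:strict, "draft", "final")
puts ignored.inspect
puts strict.inspect
```

With `:ignore`, the nodes are reported as equivalent even though a difference is still recorded for display in verbose mode.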

.compare_document_nodes(n1, n2, opts, child_opts, diff_children, differences) ⇒ `Object`

Compare document nodes

# File 'lib/canon/comparison/xml_comparator.rb', line 584

def compare_document_nodes(n1, n2, opts, child_opts, diff_children,
                           differences)
  # Compare root elements
  root1 = n1.root
  root2 = n2.root

  if root1.nil? || root2.nil?
    add_difference(n1, n2, Comparison::MISSING_NODE,
                   Comparison::MISSING_NODE, :text_content, opts, differences)
    return Comparison::MISSING_NODE
  end

  compare_nodes(root1, root2, opts, child_opts, diff_children,
                differences)
end

.compare_element_nodes(n1, n2, opts, child_opts, diff_children, differences) ⇒ `Object`

Compare two element nodes

# File 'lib/canon/comparison/xml_comparator.rb', line 384

def compare_element_nodes(n1, n2, opts, child_opts, diff_children,
                          differences)
  # Compare element names
  unless n1.name == n2.name
    add_difference(n1, n2, Comparison::UNEQUAL_ELEMENTS,
                   Comparison::UNEQUAL_ELEMENTS, :element_structure, opts,
                   differences)
    return Comparison::UNEQUAL_ELEMENTS
  end

  # Compare namespace URIs - elements with different namespaces are different elements
  ns1 = n1.respond_to?(:namespace_uri) ? n1.namespace_uri : nil
  ns2 = n2.respond_to?(:namespace_uri) ? n2.namespace_uri : nil

  unless ns1 == ns2
    # Create descriptive reason showing the actual namespace URIs
    ns1_display = ns1.nil? || ns1.empty? ? "(no namespace)" : ns1
    ns2_display = ns2.nil? || ns2.empty? ? "(no namespace)" : ns2

    diff_node = Canon::Diff::DiffNode.new(
      node1: n1,
      node2: n2,
      dimension: :namespace_uri,
      reason: "namespace '#{ns1_display}' vs '#{ns2_display}' on element '#{n1.name}'",
    )
    differences << diff_node if opts[:verbose]
    return Comparison::UNEQUAL_ELEMENTS
  end

  # Compare namespace declarations (xmlns and xmlns:* attributes)
  ns_result = compare_namespace_declarations(n1, n2, opts, differences)
  return ns_result unless ns_result == Comparison::EQUIVALENT

  # Compare attributes
  attr_result = compare_attribute_sets(n1, n2, opts, differences)
  return attr_result unless attr_result == Comparison::EQUIVALENT

  # Compare children if not ignored
  return Comparison::EQUIVALENT if opts[:ignore_children]

  compare_children(n1, n2, opts, child_opts, diff_children, differences)
end
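
The namespace display expressions rely on `||` binding more tightly than the ternary operator, so nil and the empty string are both treated as "no namespace". A minimal sketch of that expression in isolation, with `ns_display_sketch` as a hypothetical helper name:

```ruby
# Parsed as (ns.nil? || ns.empty?) ? "(no namespace)" : ns
def ns_display_sketch(ns)
  ns.nil? || ns.empty? ? "(no namespace)" : ns
end

samples = [nil, "", "http://example.com/ns"]
displays = samples.map { |ns| ns_display_sketch(ns) }
puts displays.inspect
```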

.compare_namespace_declarations(n1, n2, opts, differences) ⇒ `Object`

Compare namespace declarations (xmlns and xmlns:* attributes). Delegates to XmlComparatorHelpers::NamespaceComparator.

# File 'lib/canon/comparison/xml_comparator.rb', line 1015

def compare_namespace_declarations(n1, n2, opts, differences)
  XmlComparatorHelpers::NamespaceComparator.compare(n1, n2, opts,
                                                    differences)
end

.compare_nodes(n1, n2, opts, child_opts, diff_children, differences) ⇒ `Object`

Main comparison dispatcher

# File 'lib/canon/comparison/xml_comparator.rb', line 293

def compare_nodes(n1, n2, opts, child_opts, diff_children, differences)
  # FAST PATH: Object identity - same object is always equivalent
  return Comparison::EQUIVALENT if n1.equal?(n2)

  # Handle DocumentFragment nodes - compare their children instead
  if n1.is_a?(Nokogiri::XML::DocumentFragment) &&
      n2.is_a?(Nokogiri::XML::DocumentFragment)
    children1 = n1.children.to_a
    children2 = n2.children.to_a

    if children1.length != children2.length
      add_difference(n1, n2, Comparison::UNEQUAL_ELEMENTS,
                     Comparison::UNEQUAL_ELEMENTS, :text_content, opts,
                     differences)
      return Comparison::UNEQUAL_ELEMENTS
    elsif children1.empty?
      return Comparison::EQUIVALENT
    else
      # Compare each pair of children
      result = Comparison::EQUIVALENT
      children1.zip(children2).each do |child1, child2|
        child_result = compare_nodes(child1, child2, opts, child_opts,
                                     diff_children, differences)
        result = child_result unless child_result == Comparison::EQUIVALENT
      end
      return result
    end
  end

  # Check if nodes should be excluded
  return Comparison::EQUIVALENT if node_excluded?(n1, opts) &&
    node_excluded?(n2, opts)

  if node_excluded?(n1, opts) || node_excluded?(n2, opts)
    add_difference(n1, n2, Comparison::MISSING_NODE,
                   Comparison::MISSING_NODE, :text_content, opts, differences)
    return Comparison::MISSING_NODE
  end

  # Handle comment vs non-comment comparisons specially
  # Create :comments dimension differences instead of UNEQUAL_NODES_TYPES
  if comment_vs_non_comment_comparison?(n1, n2)
    match_opts = opts[:match_opts]
    comment_behavior = match_opts ? match_opts[:comments] : nil

    # Create a :comments dimension difference
    # The difference will be marked as normative or not based on the profile
    add_difference(n1, n2, Comparison::MISSING_NODE,
                   Comparison::MISSING_NODE, :comments, opts,
                   differences)

    # Return EQUIVALENT if comments are ignored, otherwise return UNEQUAL
    if comment_behavior == :ignore
      Comparison::EQUIVALENT
    else
      Comparison::UNEQUAL_COMMENTS
    end
  elsif !same_node_type?(n1, n2)
    # Check node types match for non-comment comparisons
    add_difference(n1, n2, Comparison::UNEQUAL_NODES_TYPES,
                   Comparison::UNEQUAL_NODES_TYPES, :text_content, opts,
                   differences)
    Comparison::UNEQUAL_NODES_TYPES
  else
    # Dispatch based on node type using NodeTypeComparator strategy
    XmlComparatorHelpers::NodeTypeComparator.compare(
      n1, n2, self, opts, child_opts, diff_children, differences
    )
  end
end
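
In the DocumentFragment branch, every child pair is compared and the loop keeps the most recent non-equivalent code, so the fragment's result is the last failure rather than the first. The sketch below isolates that accumulation, using symbols as stand-ins for the Comparison constants and a made-up list of per-child results.

```ruby
# Hypothetical per-child results, in document order.
child_results = [:unequal_text_contents, :equivalent, :unequal_elements, :equivalent]

overall = :equivalent
child_results.each do |child_result|
  # An equivalent child never overwrites an earlier failure.
  overall = child_result unless child_result == :equivalent
end

puts overall
```

Because the loop does not return early, all child pairs are visited and can each contribute entries to `differences`.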

.compare_processing_instruction_nodes(n1, n2, opts, differences) ⇒ `Object`

Compare processing instruction nodes

# File 'lib/canon/comparison/xml_comparator.rb', line 562

def compare_processing_instruction_nodes(n1, n2, opts, differences)
  unless n1.target == n2.target
    add_difference(n1, n2, Comparison::UNEQUAL_NODES_TYPES,
                   Comparison::UNEQUAL_NODES_TYPES, :text_content, opts,
                   differences)
    return Comparison::UNEQUAL_NODES_TYPES
  end

  content1 = n1.respond_to?(:content) ? n1.content.to_s.strip : ""
  content2 = n2.respond_to?(:content) ? n2.content.to_s.strip : ""

  if content1 == content2
    Comparison::EQUIVALENT
  else
    add_difference(n1, n2, Comparison::UNEQUAL_TEXT_CONTENTS,
                   Comparison::UNEQUAL_TEXT_CONTENTS, :text_content,
                   opts, differences)
    Comparison::UNEQUAL_TEXT_CONTENTS
  end
end
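
The comparison above has two stages: targets must match exactly, and content is compared after stripping surrounding whitespace. The following sketch restates this with a hypothetical `PiSketch` struct and symbols in place of the Comparison constants.

```ruby
PiSketch = Struct.new(:target, :content)

def pi_outcome_sketch(a, b)
  return :unequal_nodes_types unless a.target == b.target

  # Leading and trailing whitespace in the content is not significant.
  a.content.to_s.strip == b.content.to_s.strip ? :equivalent : :unequal_text_contents
end

padded = pi_outcome_sketch(PiSketch.new("xml-stylesheet", " href='a.xsl' "),
                           PiSketch.new("xml-stylesheet", "href='a.xsl'"))
other_target = pi_outcome_sketch(PiSketch.new("xml-stylesheet", "href='a.xsl'"),
                                 PiSketch.new("php", "href='a.xsl'"))
puts padded
puts other_target
```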

.compare_text_nodes(n1, n2, opts, differences) ⇒ `Object`

Compare text nodes

# File 'lib/canon/comparison/xml_comparator.rb', line 435

def compare_text_nodes(n1, n2, opts, differences)
  return Comparison::EQUIVALENT if opts[:ignore_text_nodes]

  text1 = node_text(n1)
  text2 = node_text(n2)

  # Use match options
  match_opts = opts[:match_opts]
  behavior = match_opts[:text_content]

  # For HTML, check if text node is inside whitespace-preserving element
  # If so, always use strict comparison regardless of text_content setting
  sensitive_element = should_preserve_whitespace_strictly?(n1, n2, opts)
  if sensitive_element
    behavior = :strict
  end

  # Check if raw content differs
  raw_differs = text1 != text2

  # Check if matches according to behavior
  whitespace_type = match_opts[:whitespace_type] || :strict
  matches_per_behavior = MatchOptions.match_text?(text1, text2,
                                                  behavior,
                                                  whitespace_type: whitespace_type)

  # Determine the correct dimension for this difference
  # - If text_content is :strict, ALL differences use :text_content dimension
  # - If text_content is :normalize, whitespace-only diffs could use :structural_whitespace
  #   but we keep :text_content to ensure correct classification behavior
  # - Otherwise use :text_content
  # However, if element is whitespace-sensitive (like <pre> in HTML),
  # always use :text_content dimension regardless of behavior
  #
  # NOTE: We keep the dimension as :text_content even for whitespace-only diffs
  # when text_content: :normalize. This ensures that the classification uses
  # the text_content behavior (:normalize) instead of structural_whitespace
  # behavior (:strict for XML), which would incorrectly mark the diff as normative.
  # Whitespace-sensitive elements and ordinary text differences alike are
  # classified under the :text_content dimension.
  dimension = :text_content

  # Create DiffNode in verbose mode when raw content differs
  # This ensures informative diffs are created even for :ignore/:normalize
  if raw_differs && opts[:verbose]
    add_difference(n1, n2, Comparison::UNEQUAL_TEXT_CONTENTS,
                   Comparison::UNEQUAL_TEXT_CONTENTS, dimension,
                   opts, differences)
  end

  # Return based on whether behavior makes difference acceptable
  matches_per_behavior ? Comparison::EQUIVALENT : Comparison::UNEQUAL_TEXT_CONTENTS
end

.describe_whitespace(text) ⇒ `String`

Describe whitespace content in a readable way

Parameters:

text (String) —

Whitespace text

Returns:

(String) —

Description like “4 chars (2 newlines, 2 spaces)”

# File 'lib/canon/comparison/xml_comparator.rb', line 982

def describe_whitespace(text)
  return "0 chars" if text.nil? || text.empty?

  char_count = text.length
  newline_count = text.count("\n")
  space_count = text.count(" ")
  tab_count = text.count("\t")

  parts = []
  parts << "#{newline_count} newlines" if newline_count.positive?
  parts << "#{space_count} spaces" if space_count.positive?
  parts << "#{tab_count} tabs" if tab_count.positive?

  description = parts.join(", ")
  "#{char_count} chars (#{description})"
end
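
A short worked example helps show what the description string looks like. The sketch below copies the counting logic from the listing into a standalone method (`describe_whitespace_sketch` is a hypothetical name) and runs it on two inputs.

```ruby
# Standalone copy of the counting logic from the listing, for illustration only.
def describe_whitespace_sketch(text)
  return "0 chars" if text.nil? || text.empty?

  parts = []
  parts << "#{text.count("\n")} newlines" if text.count("\n").positive?
  parts << "#{text.count(' ')} spaces" if text.count(" ").positive?
  parts << "#{text.count("\t")} tabs" if text.count("\t").positive?
  "#{text.length} chars (#{parts.join(', ')})"
end

indent = describe_whitespace_sketch("\n\n  ")
crlf = describe_whitespace_sketch("\r\n")
puts indent # 4 chars (2 newlines, 2 spaces)
puts crlf   # 2 chars (1 newlines)
```

Only newlines, spaces, and tabs are itemized, so a carriage return counts toward the character total but does not appear in the breakdown.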

.equivalent?(n1, n2, opts = {}, child_opts = {}) ⇒ `Boolean`, `Array`

Compare two XML nodes for equivalence

Parameters:

n1 (String, Moxml::Node) —

First node
n2 (String, Moxml::Node) —

Second node
opts (Hash) (defaults to: {}) —

Comparison options
child_opts (Hash) (defaults to: {}) —

Options for child comparison

Returns:

(Boolean, Array) —

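true or false in non-verbose mode. In verbose mode the listing below returns a ComparisonResult wrapping the collected differences rather than a bare array.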

# File 'lib/canon/comparison/xml_comparator.rb', line 65

def equivalent?(n1, n2, opts = {}, child_opts = {})
  # FAST PATH: Object identity - same object is always equivalent
  # Skip when semantic_diff is requested (caller needs tree diff metadata)
  if n1.equal?(n2) && !opts.dig(:match, :semantic_diff)
    return build_trivial_equivalent_result(n1, n2, opts)
  end

  # FAST PATH: String content equality - identical strings are equivalent
  # Skip in verbose mode since caller may need full metadata (e.g. tree_diff statistics)
  if !opts[:verbose] && n1.is_a?(String) && n2.is_a?(String) && n1 == n2
    return true
  end

  opts = DEFAULT_OPTS.merge(opts)

  # Resolve match options with format-specific defaults
  match_opts_hash = MatchOptions::Xml.resolve(
    format: :xml,
    match_profile: opts[:match_profile],
    match: opts[:match],
    preprocessing: opts[:preprocessing],
    global_profile: opts[:global_profile],
    global_options: opts[:global_options],
  )

  # Wrap in ResolvedMatchOptions for DiffClassifier
  match_opts = Canon::Comparison::ResolvedMatchOptions.new(
    match_opts_hash,
    format: :xml,
  )

  # Store resolved match options hash for use in comparison logic
  opts[:match_opts] = match_opts_hash

  # Use tree diff if semantic_diff option is enabled
  if match_opts.semantic_diff?
    return perform_semantic_tree_diff(n1, n2, opts, match_opts_hash)
  end

  # Create child_opts with resolved options
  child_opts = opts.merge(child_opts)

  # Determine if we should preserve whitespace during parsing.
  # Only structural_whitespace: :strict forces whitespace-only text
  # nodes to survive parsing.  whitespace_type is about distinguishing
  # Unicode whitespace *types* in surviving text-node content, and
  # does NOT require indent text nodes to be kept — libxml's NOBLANKS
  # only strips pure-ASCII whitespace-only nodes, so NBSP-only nodes
  # survive regardless.  Coupling whitespace_type: :strict to
  # parsing-time preservation made pretty-printed fixtures produce
  # spurious element-position diffs (issue #112).
  preserve_whitespace = match_opts_hash[:structural_whitespace] == :strict

  # Parse nodes if they are strings, applying preprocessing if needed
  node1 = parse_node(n1, match_opts_hash[:preprocessing],
                     preserve_whitespace: preserve_whitespace)
  node2 = parse_node(n2, match_opts_hash[:preprocessing],
                     preserve_whitespace: preserve_whitespace)

  # Store original strings for line diff display (before preprocessing)
  original1 = if n1.is_a?(String)
                n1
              else
                (n1.respond_to?(:to_xml) ? n1.to_xml : n1.to_s)
              end
  original2 = if n2.is_a?(String)
                n2
              else
                (n2.respond_to?(:to_xml) ? n2.to_xml : n2.to_s)
              end

  differences = []
  diff_children = opts[:diff_children] || false

  result = compare_nodes(node1, node2, opts, child_opts,
                         diff_children, differences)

  # Classify DiffNodes as normative/informative if we have verbose output
  if opts[:verbose] && !differences.empty?
    classifier = Canon::Diff::DiffClassifier.new(match_opts)
    classifier.classify_all(differences.grep(Canon::Diff::DiffNode))
  end

  if opts[:verbose]
    # Serialize parsed nodes for consistent formatting
    # This ensures both sides formatted identically, showing only real differences
    preprocessed = [
      serialize_node(node1).gsub("><", ">\n<"),
      serialize_node(node2).gsub("><", ">\n<"),
    ]

    ComparisonResult.new(
      differences: differences,
      preprocessed_strings: preprocessed,
      original_strings: [original1, original2],
      format: :xml,
      match_options: match_opts_hash,
      algorithm: :dom,
      parse_errors_expected: Comparison.parse_errors_for(node1),
      parse_errors_received: Comparison.parse_errors_for(node2),
    )
  elsif result != Comparison::EQUIVALENT && !differences.empty?
    # Non-verbose mode: check equivalence
    # If comparison found differences, classify them to determine if normative
    classifier = Canon::Diff::DiffClassifier.new(match_opts)
    classifier.classify_all(differences.grep(Canon::Diff::DiffNode))
    # Equivalent if no normative differences (matches semantic algorithm)
    differences.none?(&:normative?)
  else
    # Either equivalent or no differences tracked
    result == Comparison::EQUIVALENT
  end
end
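
In non-verbose mode, the final answer depends on how the recorded differences are classified, not only on the raw result code: the nodes still count as equivalent when none of the differences is normative. The sketch below illustrates that last step with a hypothetical `DiffSketch` struct standing in for `Canon::Diff::DiffNode`; the normative flags are set by hand here, whereas the listing relies on `DiffClassifier` to set them.

```ruby
# Hypothetical stand-in for a classified diff node.
DiffSketch = Struct.new(:dimension, :normative) do
  def normative?
    normative
  end
end

informative_only = [
  DiffSketch.new(:comments, false),
  DiffSketch.new(:text_content, false),
]
with_normative = informative_only + [DiffSketch.new(:attribute_values, true)]

# Mirrors `differences.none?(&:normative?)` from the listing.
puts informative_only.none?(&:normative?)
puts with_normative.none?(&:normative?)
```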

.extract_attributes(node) ⇒ `Hash`, `nil`

Extract attributes from a node as a normalized hash

Parameters:

node (Object, nil) —

Node to extract attributes from

Returns:

(Hash, nil) —

Normalized attributes hash

# File 'lib/canon/comparison/xml_comparator.rb', line 651

def extract_attributes(node)
  return nil if node.nil?

  Canon::Diff::NodeSerializer.extract_attributes(node)
end

.extract_element_path(node) ⇒ `Array<String>`

Extract element path for context (best effort)

Parameters:

node (Object) —

Node to extract path from

Returns:

(Array<String>) —

Path components

# File 'lib/canon/comparison/xml_comparator.rb', line 614

def extract_element_path(node)
  path = []
  current = node
  max_depth = 20
  depth = 0

  while current && depth < max_depth
    if current.respond_to?(:name) && current.name
      path.unshift(current.name)
    end

    break unless current.respond_to?(:parent)

    current = current.parent
    depth += 1

    # Stop at document root
    break if current.respond_to?(:root)
  end

  path
end
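
To make the traversal concrete, here is a standalone sketch of the same walk, assuming hypothetical `Node` and `Document` structs in place of real parser nodes. It repeats the logic shown above rather than calling the library, so `element_path` and both structs are illustrative names only.

```ruby
# Hypothetical stand-ins: an element with a name and parent, and a
# document that responds to :root but has no name or parent.
Node = Struct.new(:name, :parent)
Document = Struct.new(:root)

def element_path(node, max_depth = 20)
  path = []
  current = node
  depth = 0

  while current && depth < max_depth
    path.unshift(current.name) if current.respond_to?(:name) && current.name
    break unless current.respond_to?(:parent)

    current = current.parent
    depth += 1
    break if current.respond_to?(:root) # reached the document
  end

  path
end

doc     = Document.new(nil)
book    = Node.new("book", doc)
chapter = Node.new("chapter", book)
title   = Node.new("title", chapter)

p element_path(title) # => ["book", "chapter", "title"]
```

Because of the depth cap, a node nested more than 20 levels deep yields only its 20 nearest names, which is why the path is described as best effort.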

.extract_text_from_node(node) ⇒ `String`?

Extract text from a node for diff reason

Parameters:

node (Object, nil) —

Node to extract text from

Returns:

(String, nil) —

Text content or nil

# File 'lib/canon/comparison/xml_comparator.rb', line 795

def extract_text_from_node(node)
  return nil if node.nil?

  # For Canon::Xml::Nodes::TextNode
  return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)

  # For XML/HTML nodes with text_content method
  return node.text_content if node.respond_to?(:text_content)

  # For nodes with text method
  return node.text if node.respond_to?(:text)

  # For nodes with content method (Moxml::Text)
  return node.content if node.respond_to?(:content)

  # For nodes with value method (other types)
  return node.value if node.respond_to?(:value)

  # For simple text nodes or strings
  return node.to_s if node.is_a?(String)

  # For other node types, try to_s
  node.to_s
rescue StandardError
  nil
end
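
The method tries a sequence of accessors and uses the first one the node responds to. The sketch below shows that precedence in simplified form, assuming hypothetical structs as nodes; it omits the `Canon::Xml::Nodes::TextNode` check at the top of the real method, and `text_of` is an illustrative name.

```ruby
# Hypothetical stand-in nodes, each exposing a different text accessor.
WithTextContent = Struct.new(:text_content, :text)
WithContent     = Struct.new(:content)

def text_of(node)
  return nil if node.nil?

  # Same order as the listing above: text_content, text, content, value.
  %i[text_content text content value].each do |accessor|
    return node.public_send(accessor) if node.respond_to?(accessor)
  end

  node.to_s
rescue StandardError
  nil
end

p text_of(WithTextContent.new("from text_content", "from text")) # => "from text_content"
p text_of(WithContent.new("from content"))                       # => "from content"
p text_of("plain string")                                        # => "plain string"
p text_of(nil)                                                   # => nil
```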

.in_preserve_element?(node, preserve_list) ⇒ `Boolean`

Check if a node is inside a whitespace-preserving element

Returns:

(Boolean)

# File 'lib/canon/comparison/xml_comparator.rb', line 517

def in_preserve_element?(node, preserve_list)
  current = node.parent
  while current.respond_to?(:name)
    return true if preserve_list.include?(current.name.downcase)

    # Stop at document root
    break if current.is_a?(Nokogiri::XML::Document) ||
      current.is_a?(Nokogiri::HTML4::Document) ||
      current.is_a?(Nokogiri::HTML5::Document)

    current = current.parent if current.respond_to?(:parent)
    break unless current
  end
  false
end

.non_ws_sibling_exists?(siblings, idx, direction) ⇒ `Boolean`

Check whether any sibling that is not a whitespace-only text node exists when scanning from idx in the given direction (1 for forward, -1 for backward)

Returns:

(Boolean)

# File 'lib/canon/comparison/xml_comparator.rb', line 908

def non_ws_sibling_exists?(siblings, idx, direction)
  i = idx + direction
  while i >= 0 && i < siblings.length
    s = siblings[i]
    is_ws_text = NodeInspector.text_node?(s) &&
      NodeInspector.text_content(s).strip.empty?
    return true unless is_ws_text

    i += direction
  end
  false
end

.serialize_node(node) ⇒ `String`?

Serialize a node to string for display

Parameters:

node (Object, nil) —

Node to serialize

Returns:

(String, nil) —

Serialized content

# File 'lib/canon/comparison/xml_comparator.rb', line 641

def serialize_node(node)
  return nil if node.nil?

  Canon::Diff::NodeSerializer.serialize(node)
end

.should_preserve_whitespace_strictly?(n1, n2, opts) ⇒ `Boolean`

Check if whitespace should be preserved strictly for these text nodes. This applies to HTML elements like pre, code, textarea, script, and style, and to elements with xml:space=“preserve” or in the user-configured preserve list.

IMPORTANT: This returns true ONLY for :preserve classification. For :collapse classification, whitespace differences ARE acceptable (they are detected as formatting-only by DiffClassifier).

Returns:

(Boolean)

# File 'lib/canon/comparison/xml_comparator.rb', line 500

def should_preserve_whitespace_strictly?(n1, n2, opts)
  # Check both n1 and n2 - if either is in a preserve whitespace element, preserve strictly
  [n1, n2].each do |node|
    next unless node.respond_to?(:parent)

    parent = node.parent
    next unless parent

    classification = WhitespaceSensitivity.classify_element(parent,
                                                            opts[:match_opts])
    return true if classification == :preserve
  end

  false
end

.truncate_text(text, max_length = 40) ⇒ `String`

Truncate text for display in reason messages

Parameters:

text (String) —

Text to truncate
max_length (Integer) (defaults to: 40) —

Maximum length

Returns:

(String) —

Truncated text

# File 'lib/canon/comparison/xml_comparator.rb', line 1004

def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end
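
A standalone copy of the method body shows the edge cases. Note that the ellipsis is appended after the cut, so a truncated result is `max_length + 3` characters long.

```ruby
def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end

p truncate_text("short")          # => "short"
p truncate_text("abcdefghij", 4)  # => "abcd..."
p truncate_text(nil)              # => ""
p truncate_text("x" * 50).length  # => 43
```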

.visualize_whitespace(text) ⇒ `String`

Make whitespace visible in text content. Uses the existing character visualization map from DiffFormatter (single source of truth).

Parameters:

text (String) —

Text to visualize

Returns:

(String) —

Text with visible whitespace markers

# File 'lib/canon/comparison/xml_comparator.rb', line 936

def visualize_whitespace(text)
  return "" if text.nil?

  # Use the character map loader as the single source of truth
  viz_map = character_visualization_map

  # Replace each character with its visualization
  text.chars.map { |char| viz_map[char] || char }.join
end
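
The substitution itself is a per-character lookup with a fallback to the original character. The sketch below assumes a hypothetical map with ASCII markers; the real symbols come from `character_visualization_map` and are not reproduced here.

```ruby
# Hypothetical map for illustration only; the library loads its own.
VIZ_MAP = { " " => "<SP>", "\t" => "<TAB>", "\n" => "<NL>" }.freeze

def visualize(text, viz_map = VIZ_MAP)
  return "" if text.nil?

  # Characters without an entry in the map pass through unchanged.
  text.chars.map { |char| viz_map[char] || char }.join
end

p visualize("a b\tc\n") # => "a<SP>b<TAB>c<NL>"
p visualize(nil)        # => ""
```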

.whitespace_only?(text) ⇒ `Boolean`

Check if text is only whitespace

Parameters:

text (String) —

Text to check

Returns:

(Boolean) —

true if whitespace-only

# File 'lib/canon/comparison/xml_comparator.rb', line 925

def whitespace_only?(text)
  return false if text.nil?

  text.to_s.strip.empty?
end
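
A standalone copy of the method illustrates two boundary cases: an empty string counts as whitespace-only, while `nil` does not. Because the check relies on `String#strip`, a non-breaking space (U+00A0) is not treated as whitespace.

```ruby
def whitespace_only?(text)
  return false if text.nil?

  text.to_s.strip.empty?
end

p whitespace_only?(" \n\t ") # => true
p whitespace_only?("")       # => true
p whitespace_only?(nil)      # => false
p whitespace_only?(" a ")    # => false
p whitespace_only?("\u00A0") # => false
```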

.whitespace_partner_direction(ws_node) ⇒ `Object`

Direction of the partner content relative to the whitespace node, phrased from the partner’s point of view: “before” when the whitespace immediately precedes its next non-whitespace sibling (the alignment partner on the other side), “after” when the whitespace trails the previous non-whitespace sibling, or “adjacent to” as a degenerate fallback when neither neighbour exists.

# File 'lib/canon/comparison/xml_comparator.rb', line 891

def whitespace_partner_direction(ws_node)
  return "adjacent to" unless ws_node.is_a?(Canon::Xml::Node) ||
    ws_node.is_a?(Nokogiri::XML::Node)

  parent = ws_node.parent
  return "adjacent to" if parent.nil?

  siblings = parent.children
  idx = siblings.index(ws_node)
  return "adjacent to" unless idx

  if non_ws_sibling_exists?(siblings, idx, 1) then "before"
  elsif non_ws_sibling_exists?(siblings, idx, -1) then "after"
  else "adjacent to"
  end
end
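
The three outcomes can be seen in a small sketch that models siblings as plain strings, which is an assumption made for illustration; the library inspects real nodes through `NodeInspector`. The names `non_ws_sibling?` and `partner_direction` are hypothetical.

```ruby
# A sibling counts as whitespace when it is empty after strip.
def non_ws_sibling?(siblings, idx, direction)
  i = idx + direction
  while i >= 0 && i < siblings.length
    return true unless siblings[i].strip.empty?

    i += direction
  end
  false
end

# The forward scan is tried first, so "before" wins when
# non-whitespace content exists on both sides.
def partner_direction(siblings, idx)
  if non_ws_sibling?(siblings, idx, 1) then "before"
  elsif non_ws_sibling?(siblings, idx, -1) then "after"
  else "adjacent to"
  end
end

p partner_direction(["<a/>", "\n  ", "<b/>"], 1) # => "before"
p partner_direction(["<a/>", "\n  "], 1)         # => "after"
p partner_direction(["  "], 0)                   # => "adjacent to"
```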
