Class: Canon::Comparison::MarkupComparator

Inherits: Object

Defined in: lib/canon/comparison/markup_comparator.rb

Overview

Base class for markup document comparison (XML, HTML)

Provides shared comparison functionality for markup documents, including node type checking, text extraction, filtering, and difference creation.

Format-specific comparators (XmlComparator, HtmlComparator) inherit from this class and add format-specific behavior.

Direct Known Subclasses

HtmlComparator, XmlComparator

Class Method Summary

.add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object

Add a difference to the differences array.
.build_attribute_difference_reason(attrs1, attrs2) ⇒ String

Build a clear reason message for attribute presence differences. Shows which attributes are only in node1, only in node2, or have different values.
.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ String

Build a human-readable reason for a difference.
.build_path_for_node(node) ⇒ String?

Build canonical path for a node.
.build_text_difference_reason(text1, text2) ⇒ String

Build a clear reason message for text content differences. Shows the actual text content (truncated if too long).
.comment_node?(node) ⇒ Boolean

Check if a node is a comment node.
.determine_node_dimension(node) ⇒ Symbol

Determine the appropriate dimension for a node type.
.enrich_diff_metadata(node1, node2) ⇒ Hash

Enrich DiffNode with canonical path, serialized content, and attributes. This extracts presentation-ready metadata from nodes for Stage 4 rendering.
.extract_attributes(node) ⇒ Hash?

Extract attributes from a node.
.extract_text_content_from_node(node) ⇒ String?

Extract text content from a node for diff reason.
.filter_children(children, opts) ⇒ Array

Filter children based on options.
.node_excluded?(node, opts) ⇒ Boolean

Check if node should be excluded from comparison.
.node_text(node) ⇒ String

Get text content from a node.
.same_node_type?(node1, node2) ⇒ Boolean

Check if two nodes are the same type.
.serialize_element_node(node) ⇒ String

Serialize an element node to string.
.serialize_node(node) ⇒ String?

Serialize a node to string for display.
.text_node?(node) ⇒ Boolean

Check if a node is a text node.
.truncate_text(text, max_length = 40) ⇒ String

Truncate text for display in reason messages.
.whitespace_only_difference?(text1, text2) ⇒ Boolean

Check if difference between two texts is only whitespace.

Class Method Details

.add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ `Object`

Add a difference to the differences array

Creates a DiffNode with enriched metadata including path, serialized content, and attributes for Stage 4 rendering.

Parameters:

node1 (Object, nil) —

First node
node2 (Object, nil) —

Second node
diff1 (Symbol) —

Difference type for node1
diff2 (Symbol) —

Difference type for node2
dimension (Symbol) —

The match dimension causing this difference
_opts (Hash) —

Options (unused but kept for interface compatibility)
differences (Array) —

Array to append difference to

# File 'lib/canon/comparison/markup_comparator.rb', line 31

def add_difference(node1, node2, diff1, diff2, dimension, _opts,
                   differences)
  # All differences must be DiffNode objects (OO architecture)
  if dimension.nil?
    raise ArgumentError,
          "dimension required for DiffNode"
  end

  # Build informative reason message
  reason = build_difference_reason(node1, node2, diff1, diff2,
                                   dimension)

  # Enrich with path, serialized content, and attributes for Stage 4 rendering
  metadata = enrich_diff_metadata(node1, node2)

  diff_node = Canon::Diff::DiffNode.new(
    node1: node1,
    node2: node2,
    dimension: dimension,
    reason: reason,
    **metadata,
  )
  differences << diff_node
end
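
The listing above depends on `Canon::Diff::DiffNode` and several helpers. As a rough, standalone sketch of the same flow (validate the dimension, build a reason, append to the caller's array), the following uses a hypothetical `DemoDiffNode` struct and a placeholder path in place of the real metadata. The `:removed` and `:added` symbols are illustrative values, not a documented list of difference types.

```ruby
# Hypothetical stand-in for Canon::Diff::DiffNode; the real class is not shown here.
DemoDiffNode = Struct.new(:node1, :node2, :dimension, :reason, :path,
                          keyword_init: true)

# Simplified sketch of add_difference: validate, build a reason, append.
def add_difference_sketch(node1, node2, diff1, diff2, dimension, differences)
  raise ArgumentError, "dimension required for DiffNode" if dimension.nil?

  # The real method calls enrich_diff_metadata; this value is a placeholder.
  metadata = { path: "placeholder-path" }
  differences << DemoDiffNode.new(
    node1: node1, node2: node2, dimension: dimension,
    reason: "#{diff1} vs #{diff2}", **metadata
  )
end

differences = []
add_difference_sketch("<a/>", "<b/>", :removed, :added, :comments, differences)
puts differences.first.reason # prints: removed vs added

begin
  add_difference_sketch("<a/>", "<b/>", :removed, :added, nil, differences)
rescue ArgumentError => e
  puts e.message # prints: dimension required for DiffNode
end
```

Note that `diff1` and `diff2` only feed the reason string; they are not stored on the resulting object.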

.build_attribute_difference_reason(attrs1, attrs2) ⇒ `String`

Build a clear reason message for attribute presence differences. Shows which attributes are only in node1, only in node2, or have different values.

Parameters:

attrs1 (Hash, nil) —

First node’s attributes
attrs2 (Hash, nil) —

Second node’s attributes

Returns:

(String) —

Clear explanation of the attribute difference

# File 'lib/canon/comparison/markup_comparator.rb', line 318

def build_attribute_difference_reason(attrs1, attrs2)
  return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2

  require "set"
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set

  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  common = keys1 & keys2

  # Check if values differ for common keys
  different_values = common.reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

  if parts.empty?
    "#{keys1.size} vs #{keys2.size} attributes (same names)"
  else
    parts.join("; ")
  end
end
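
To make the message format concrete, here is a standalone copy of the logic above under the assumed name `attribute_difference_reason`, so it can be run without the canon gem. The attribute hashes are made-up sample data.

```ruby
require "set"

# Standalone copy of the listing's logic (renamed; not part of the library).
def attribute_difference_reason(attrs1, attrs2)
  unless attrs1 && attrs2
    return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes"
  end

  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set
  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  # Names present on both sides whose values differ
  different_values = (keys1 & keys2).reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

  parts.empty? ? "#{keys1.size} vs #{keys2.size} attributes (same names)" : parts.join("; ")
end

before = { "id" => "a1", "class" => "note", "lang" => "en" }
after  = { "id" => "a2", "class" => "note", "title" => "Hint" }

puts attribute_difference_reason(before, after)
# prints: only in first: lang; only in second: title; different values: id
puts attribute_difference_reason(nil, after)
# prints: 0 vs 3 attributes
```

The message lists attribute names only; the differing values themselves are not included.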

.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ `String`

Build a human-readable reason for a difference

Parameters:

node1 (Object, nil) —

First node
node2 (Object, nil) —

Second node
diff1 (Symbol) —

Difference type for node1
diff2 (Symbol) —

Difference type for node2
dimension (Symbol) —

The dimension of the difference

Returns:

(String) —

Human-readable reason

# File 'lib/canon/comparison/markup_comparator.rb', line 293

def build_difference_reason(node1, node2, diff1, diff2, dimension)
  # For attribute presence differences, show what attributes differ
  if dimension == :attribute_presence
    attrs1 = extract_attributes(node1)
    attrs2 = extract_attributes(node2)
    return build_attribute_difference_reason(attrs1, attrs2)
  end

  # For text content differences, show the actual text (truncated if needed)
  if dimension == :text_content
    text1 = extract_text_content_from_node(node1)
    text2 = extract_text_content_from_node(node2)
    return build_text_difference_reason(text1, text2)
  end

  # Default reason - can be overridden in subclasses
  "#{diff1} vs #{diff2}"
end

.build_path_for_node(node) ⇒ `String`?

Build canonical path for a node

Parameters:

node (Object) —

Node to build path for

Returns:

(String, nil) —

Canonical path with ordinal indices

# File 'lib/canon/comparison/markup_comparator.rb', line 76

def build_path_for_node(node)
  return nil if node.nil?

  Canon::Diff::PathBuilder.build(node, format: :document)
end

.build_text_difference_reason(text1, text2) ⇒ `String`

Build a clear reason message for text content differences. Shows the actual text content (truncated if too long).

Parameters:

text1 (String, nil) —

First text content
text2 (String, nil) —

Second text content

Returns:

(String) —

Clear explanation of the text difference

# File 'lib/canon/comparison/markup_comparator.rb', line 381

def build_text_difference_reason(text1, text2)
  # Handle nil cases
  return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
  return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
  return "both missing" if text1.nil? && text2.nil?

  # Both have content - show truncated versions
  "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
end
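
The four branches above can be exercised with a standalone copy, shown here under the assumed name `text_difference_reason` with a local copy of the truncation helper. This is a sketch for illustration, not the library's entry point.

```ruby
# Local copy of the truncation helper documented later on this page.
def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  text.length <= max_length ? text : "#{text[0...max_length]}..."
end

# Standalone copy of the listing's logic (renamed; not part of the library).
def text_difference_reason(text1, text2)
  return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
  return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
  return "both missing" if text1.nil? && text2.nil?

  "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
end

puts text_difference_reason("Hello", "Hello world") # prints: 'Hello' vs 'Hello world'
puts text_difference_reason(nil, "added")           # prints: missing vs 'added'
puts text_difference_reason("removed", nil)         # prints: 'removed' vs missing
puts text_difference_reason(nil, nil)               # prints: both missing
puts text_difference_reason("", nil)                # prints: '' vs missing
```

An empty string is truthy in Ruby, so it is reported as `''` rather than as `missing`.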

.comment_node?(node) ⇒ `Boolean`

Check if a node is a comment node

For XML/XHTML, this checks the node’s comment? method or node_type. For HTML, this also checks TEXT nodes that contain HTML-style comments (Nokogiri parses HTML comments as TEXT nodes with content like "<!-- comment -->", or escaped as "<\!-- comment -->" in full HTML documents).

Parameters:

node (Object) —

Node to check

Returns:

(Boolean) —

true if node is a comment

# File 'lib/canon/comparison/markup_comparator.rb', line 228

def comment_node?(node)
  return true if node.respond_to?(:comment?) && node.comment?
  return true if node.respond_to?(:node_type) && node.node_type == :comment

  # HTML comments are parsed as TEXT nodes by Nokogiri
  # Check if this is a text node with HTML comment content
  if text_node?(node)
    text = node_text(node)
    # Strip whitespace and backslashes for comparison
    # Nokogiri escapes HTML comments as "<\\!-- comment -->" in full documents
    text_stripped = text.to_s.strip.gsub("\\", "")
    return true if text_stripped.start_with?("<!--") && text_stripped.end_with?("-->")
  end

  false
end
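
The text-based fallback in the last branch can be tried in isolation. The helper name `html_comment_text?` below is hypothetical; it only reproduces the strip, backslash removal, and prefix/suffix test from the listing.

```ruby
# Sketch of the text heuristic only: trim, drop backslashes, then check
# for comment delimiters at both ends.
def html_comment_text?(text)
  stripped = text.to_s.strip.gsub("\\", "")
  stripped.start_with?("<!--") && stripped.end_with?("-->")
end

puts html_comment_text?("  <!-- note -->\n")       # prints: true
puts html_comment_text?("<\\!-- note -->")         # prints: true
puts html_comment_text?("plain <!-- x --> text")   # prints: false
puts html_comment_text?(nil)                       # prints: false
```

Because every backslash is removed before the test, text that merely contains backslashes inside the delimiters is also treated as a comment.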

.determine_node_dimension(node) ⇒ `Symbol`

Determine the appropriate dimension for a node type

Parameters:

node (Object) —

The node to check

Returns:

(Symbol) —

The dimension symbol

# File 'lib/canon/comparison/markup_comparator.rb', line 426

def determine_node_dimension(node)
  # Canon::Xml::Node types
  if node.respond_to?(:node_type) && node.node_type.is_a?(Symbol)
    case node.node_type
    when :comment then :comments
    when :text, :cdata then :text_content
    when :processing_instruction then :processing_instructions
    else :text_content
    end
  # Moxml/Nokogiri types
  elsif node.respond_to?(:comment?) && node.comment?
    :comments
  elsif node.respond_to?(:text?) && node.text?
    :text_content
  elsif node.respond_to?(:cdata?) && node.cdata?
    :text_content
  elsif node.respond_to?(:processing_instruction?) && node.processing_instruction?
    :processing_instructions
  else
    :text_content
  end
end

.enrich_diff_metadata(node1, node2) ⇒ `Hash`

Enrich DiffNode with canonical path, serialized content, and attributes. This extracts presentation-ready metadata from nodes for Stage 4 rendering.

Parameters:

node1 (Object, nil) —

First node
node2 (Object, nil) —

Second node

Returns:

(Hash) —

Enriched metadata hash

# File 'lib/canon/comparison/markup_comparator.rb', line 62

def enrich_diff_metadata(node1, node2)
  {
    path: build_path_for_node(node1 || node2),
    serialized_before: serialize_node(node1),
    serialized_after: serialize_node(node2),
    attributes_before: extract_attributes(node1),
    attributes_after: extract_attributes(node2),
  }
end

.extract_attributes(node) ⇒ `Hash`?

Extract attributes from a node

Parameters:

node (Object, nil) —

Node to extract attributes from

Returns:

(Hash, nil) —

Hash of attribute name => value pairs

# File 'lib/canon/comparison/markup_comparator.rb', line 116

def extract_attributes(node)
  return nil if node.nil?

  # Canon::Xml::Node ElementNode
  if node.is_a?(Canon::Xml::Nodes::ElementNode)
    node.attribute_nodes.to_h do |attr|
      [attr.name, attr.value]
    end
  # Nokogiri nodes
  elsif node.respond_to?(:attributes)
    node.attributes.to_h do |_, attr|
      [attr.name, attr.value]
    end
  else
    {}
  end
end
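
The second branch relies on `Hash#to_h` with a block to flatten a name-to-attribute-object hash into name-to-string pairs. A minimal sketch with a hypothetical `DemoAttribute` struct standing in for the parser's attribute objects:

```ruby
# Hypothetical stand-in for an attribute object exposing name and value.
DemoAttribute = Struct.new(:name, :value)

# Same conversion as the second branch of extract_attributes.
def plain_attributes(raw)
  raw.to_h { |_, attr| [attr.name, attr.value] }
end

raw = {
  "id" => DemoAttribute.new("id", "intro"),
  "class" => DemoAttribute.new("class", "lead"),
}

plain = plain_attributes(raw)
puts plain["id"]    # prints: intro
puts plain["class"] # prints: lead
```

The hash key from the source is ignored; the resulting key comes from `attr.name`.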

.extract_text_content_from_node(node) ⇒ `String`?

Extract text content from a node for diff reason

Parameters:

node (Object, nil) —

Node to extract text from

Returns:

(String, nil) —

Text content or nil

# File 'lib/canon/comparison/markup_comparator.rb', line 348

def extract_text_content_from_node(node)
  return nil if node.nil?

  # For Canon::Xml::Nodes::TextNode
  return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)

  # For XML/HTML nodes with text_content method
  return node.text_content if node.respond_to?(:text_content)

  # For nodes with text method
  return node.text if node.respond_to?(:text)

  # For nodes with content method (Moxml::Text)
  return node.content if node.respond_to?(:content)

  # For nodes with value method (other types)
  return node.value if node.respond_to?(:value)

  # For simple text nodes or strings
  return node.to_s if node.is_a?(String)

  # For other node types, try to_s
  node.to_s
rescue StandardError
  nil
end

.filter_children(children, opts) ⇒ `Array`

Filter children based on options

Removes nodes that should be excluded from comparison based on options like :ignore_nodes, :ignore_comments, etc.

Parameters:

children (Array) —

Array of child nodes
opts (Hash) —

Comparison options

Returns:

(Array) —

Filtered array of children

# File 'lib/canon/comparison/markup_comparator.rb', line 142

def filter_children(children, opts)
  children.reject do |child|
    node_excluded?(child, opts)
  end
end
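
As an illustration of how the option flags drive filtering, the sketch below pairs `filter_children` with a reduced exclusion check that covers only `:ignore_nodes`, `:ignore_comments`, and `:ignore_text_nodes`. `DemoChild` and `simple_excluded?` are hypothetical; the real `node_excluded?` also applies match options and whitespace rules.

```ruby
# Hypothetical stand-in node with a symbolic type and a value.
DemoChild = Struct.new(:node_type, :value)

# Reduced exclusion check covering only the three top-level flags.
def simple_excluded?(node, opts)
  return false if node.nil?
  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && node.node_type == :comment
  return true if opts[:ignore_text_nodes] && node.node_type == :text

  false
end

def filter_children(children, opts)
  children.reject { |child| simple_excluded?(child, opts) }
end

children = [
  DemoChild.new(:element, "p"),
  DemoChild.new(:comment, " draft "),
  DemoChild.new(:text, "\n  "),
]

p filter_children(children, { ignore_comments: true }).map(&:node_type)
# prints: [:element, :text]
p filter_children(children, { ignore_nodes: [children.first] }).map(&:node_type)
# prints: [:comment, :text]
```

The input array is not modified; `reject` returns a new array.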

.node_excluded?(node, opts) ⇒ `Boolean`

Check if node should be excluded from comparison

Parameters:

node (Object) —

Node to check
opts (Hash) —

Comparison options

Returns:

(Boolean) —

true if node should be excluded

# File 'lib/canon/comparison/markup_comparator.rb', line 153

def node_excluded?(node, opts)
  return false if node.nil?

  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && comment_node?(node)
  return true if opts[:ignore_text_nodes] && text_node?(node)

  # Check match options
  match_opts = opts[:match_opts]
  return false unless match_opts

  # Filter comments based on match options and format
  # HTML: Filter comments to avoid spurious differences from zip pairing
  #       BUT only when not in verbose mode (verbose needs differences recorded)
  # XML: Don't filter comments (allow informative differences to be recorded)
  if match_opts[:comments] == :ignore && comment_node?(node)
    # In verbose mode, don't filter comments - we want to record the differences
    return false if opts[:verbose]

    # Only filter comments for HTML, not XML (when not verbose)
    format = opts[:format] || match_opts[:format]
    if %i[html html4 html5].include?(format)
      return true
    end
  end

  # Strip whitespace-only text nodes based on parent element configuration.
  # Use preserve_whitespace_elements / strip_whitespace_elements to control.
  # Blacklist (strip) > preserve > collapse > format defaults.
  return false unless text_node?(node) && node.parent
  return false unless MatchOptions.normalize_text(node_text(node)).empty?

  return true unless WhitespaceSensitivity.whitespace_preserved?(
    node.parent, match_opts
  )

  # When the pretty-print-side flag is active (set by opts_for_side in
  # ChildComparison.compare), drop whitespace-only text nodes that start
  # with "\n" inside :collapse elements — they are structural indentation
  # from the pretty-printer, not content.  Space-only nodes (no initial "\n") are
  # real inline content and are kept for normalised comparison.
  # :preserve elements are always left unchanged.
  if match_opts[:_pretty_print_side_active]
    ws_class = WhitespaceSensitivity.classify_text_node(node, opts)
    return true if ws_class == :collapse && node_text(node).start_with?("\n")
  end

  false
end
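
The comment-filtering branch combines three inputs: the `:comments` match option, the `:verbose` flag, and the format. The hypothetical helper `comment_filtered?` below isolates just that decision for a node already known to be a comment; it is a sketch of the branch, not a replacement for `node_excluded?`.

```ruby
# Returns true when a comment node would be dropped by the match-options branch.
def comment_filtered?(opts)
  match_opts = opts[:match_opts]
  return false unless match_opts && match_opts[:comments] == :ignore
  return false if opts[:verbose] # verbose mode keeps comments so differences are recorded

  format = opts[:format] || match_opts[:format]
  %i[html html4 html5].include?(format)
end

puts comment_filtered?({ format: :html, match_opts: { comments: :ignore } })
# prints: true
puts comment_filtered?({ format: :xml, match_opts: { comments: :ignore } })
# prints: false
puts comment_filtered?({ format: :html, verbose: true, match_opts: { comments: :ignore } })
# prints: false
puts comment_filtered?({ match_opts: { comments: :ignore, format: :html5 } })
# prints: true
```

The last call shows the fallback: when `opts[:format]` is absent, the format is read from the match options.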

.node_text(node) ⇒ `String`

Get text content from a node

Parameters:

node (Object) —

Node to get text from

Returns:

(String) —

Text content

# File 'lib/canon/comparison/markup_comparator.rb', line 259

def node_text(node)
  # Canon::Xml::Node TextNode uses .value
  if node.respond_to?(:value)
    node.value.to_s
  # Nokogiri nodes use .content
  elsif node.respond_to?(:content)
    node.content.to_s
  else
    node.to_s
  end
end

.same_node_type?(node1, node2) ⇒ `Boolean`

Check if two nodes are the same type

Parameters:

node1 (Object) —

First node
node2 (Object) —

Second node

Returns:

(Boolean) —

true if nodes are same type

# File 'lib/canon/comparison/markup_comparator.rb', line 208

def same_node_type?(node1, node2)
  return false if node1.class != node2.class

  # For Nokogiri/Canon::Xml nodes, check node type
  if node1.respond_to?(:node_type) && node2.respond_to?(:node_type)
    node1.node_type == node2.node_type
  else
    true
  end
end
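
The class check runs before the `node_type` check, which has a consequence worth seeing in a runnable form. The two structs below are hypothetical stand-ins for node classes from different libraries; the method body is copied from the listing.

```ruby
# Two hypothetical node classes that both expose node_type.
DemoNodeA = Struct.new(:node_type)
DemoNodeB = Struct.new(:node_type)

def same_node_type?(node1, node2)
  return false if node1.class != node2.class

  if node1.respond_to?(:node_type) && node2.respond_to?(:node_type)
    node1.node_type == node2.node_type
  else
    true
  end
end

puts same_node_type?(DemoNodeA.new(:text), DemoNodeA.new(:text))    # prints: true
puts same_node_type?(DemoNodeA.new(:text), DemoNodeA.new(:comment)) # prints: false
puts same_node_type?(DemoNodeA.new(:text), DemoNodeB.new(:text))    # prints: false
puts same_node_type?("a", "b")                                      # prints: true
```

Two nodes of different Ruby classes never match, even when both report the same `node_type`.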

.serialize_element_node(node) ⇒ `String`

Serialize an element node to string

Parameters:

node (Canon::Xml::Nodes::ElementNode) —

Element node

Returns:

(String) —

Serialized element

# File 'lib/canon/comparison/markup_comparator.rb', line 409

def serialize_element_node(node)
  attrs = node.attribute_nodes.map do |a|
    " #{a.name}=\"#{a.value}\""
  end.join
  children_xml = node.children.map { |c| serialize_node(c) }.join

  if children_xml.empty?
    "<#{node.name}#{attrs}/>"
  else
    "<#{node.name}#{attrs}>#{children_xml}</#{node.name}>"
  end
end
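
To show the output shape, the sketch below reproduces the element serialization with hypothetical structs (`DemoAttr`, `DemoElement`, `DemoText`) in place of the `Canon::Xml::Nodes` classes, and a reduced `demo_serialize` dispatcher that handles only elements and text.

```ruby
# Hypothetical stand-ins for the Canon::Xml::Nodes classes.
DemoAttr    = Struct.new(:name, :value)
DemoElement = Struct.new(:name, :attribute_nodes, :children)
DemoText    = Struct.new(:value)

# Reduced dispatcher: elements and text only.
def demo_serialize(node)
  case node
  when DemoElement then demo_serialize_element(node)
  when DemoText then node.value
  else node.to_s
  end
end

# Same string-building steps as serialize_element_node.
def demo_serialize_element(node)
  attrs = node.attribute_nodes.map { |a| " #{a.name}=\"#{a.value}\"" }.join
  children_xml = node.children.map { |c| demo_serialize(c) }.join

  if children_xml.empty?
    "<#{node.name}#{attrs}/>"
  else
    "<#{node.name}#{attrs}>#{children_xml}</#{node.name}>"
  end
end

br   = DemoElement.new("br", [], [])
para = DemoElement.new("p", [DemoAttr.new("class", "note")], [DemoText.new("Hi"), br])

puts demo_serialize(para) # prints: <p class="note">Hi<br/></p>
puts demo_serialize(DemoElement.new("p", [], [DemoText.new("")])) # prints: <p/>
```

Two details follow from the string building: attribute values are interpolated without escaping, and an element whose children serialize to an empty string is written in self-closing form.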

.serialize_node(node) ⇒ `String`?

Serialize a node to string for display

Parameters:

node (Object, nil) —

Node to serialize

Returns:

(String, nil) —

Serialized content

# File 'lib/canon/comparison/markup_comparator.rb', line 86

def serialize_node(node)
  return nil if node.nil?

  # Canon::Xml::Node types
  if node.is_a?(Canon::Xml::Nodes::RootNode)
    # Serialize all children of root
    node.children.map { |child| serialize_node(child) }.join
  elsif node.is_a?(Canon::Xml::Nodes::ElementNode)
    serialize_element_node(node)
  elsif node.is_a?(Canon::Xml::Nodes::TextNode)
    # Use original text (with entity references) if available,
    # otherwise fall back to value (decoded text)
    node.original || node.value
  elsif node.is_a?(Canon::Xml::Nodes::CommentNode)
    "<!--#{node.value}-->"
  elsif node.is_a?(Canon::Xml::Nodes::ProcessingInstructionNode)
    "<?#{node.target} #{node.data}?>"
  elsif node.respond_to?(:to_xml)
    node.to_xml
  elsif node.respond_to?(:to_html)
    node.to_html
  else
    node.to_s
  end
end

.text_node?(node) ⇒ `Boolean`

Check if a node is a text node

Parameters:

node (Object) —

Node to check

Returns:

(Boolean) —

true if node is a text node

# File 'lib/canon/comparison/markup_comparator.rb', line 249

def text_node?(node)
  (node.respond_to?(:text?) && node.text? &&
    !node.respond_to?(:element?)) ||
    (node.respond_to?(:node_type) && node.node_type == :text)
end

.truncate_text(text, max_length = 40) ⇒ `String`

Truncate text for display in reason messages

Parameters:

text (String) —

Text to truncate
max_length (Integer) (defaults to: 40) —

Maximum length

Returns:

(String) —

Truncated text

# File 'lib/canon/comparison/markup_comparator.rb', line 396

def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end
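
A quick standalone run of the method above (body copied from the listing) makes the length behavior explicit:

```ruby
def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end

puts truncate_text("x" * 45).length  # prints: 43
puts truncate_text("short")          # prints: short
puts truncate_text("abcdefgh", 5)    # prints: abcde...
puts truncate_text(12_345_678, 4)    # prints: 1234...
puts truncate_text(nil).inspect      # prints: ""
```

The ellipsis is appended after the cut, so a truncated result is `max_length + 3` characters long. Non-string input is converted with `to_s` first.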

.whitespace_only_difference?(text1, text2) ⇒ `Boolean`

Check if difference between two texts is only whitespace

Parameters:

text1 (String) —

First text
text2 (String) —

Second text

Returns:

(Boolean) —

true if difference is only in whitespace

# File 'lib/canon/comparison/markup_comparator.rb', line 276

def whitespace_only_difference?(text1, text2)
  # Normalize both texts (collapse/trim whitespace)
  norm1 = MatchOptions.normalize_text(text1)
  norm2 = MatchOptions.normalize_text(text2)

  # If normalized texts are the same, the difference was only whitespace
  norm1 == norm2
end
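
`MatchOptions.normalize_text` is not shown on this page. Assuming it collapses runs of whitespace to a single space and trims both ends, as the inline comment suggests, the check behaves like this sketch (the local `normalize_text` is that assumption, not the library's implementation):

```ruby
# Assumed stand-in for MatchOptions.normalize_text:
# collapse whitespace runs to one space, then trim.
def normalize_text(text)
  text.to_s.gsub(/\s+/, " ").strip
end

def whitespace_only_difference?(text1, text2)
  normalize_text(text1) == normalize_text(text2)
end

puts whitespace_only_difference?("Hello   world\n", " Hello world") # prints: true
puts whitespace_only_difference?("Hello world", "Helloworld")       # prints: false
puts whitespace_only_difference?("same", "same")                    # prints: true
```

The last call shows that identical inputs also return true, so the method is meaningful only for texts already known to differ.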
