Class: Canon::Comparison::MarkupComparator

Inherits:
Object
Defined in:
lib/canon/comparison/markup_comparator.rb

Overview

Base class for markup document comparison (XML, HTML).

Provides shared comparison functionality for markup documents, including node type checking, text extraction, filtering, and difference creation.

Format-specific comparators (XmlComparator, HtmlComparator) inherit from this class and add format-specific behavior.

Direct Known Subclasses

HtmlComparator, XmlComparator

Class Method Details

.add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object

Add a difference to the differences array

Creates a DiffNode with enriched metadata including path, serialized content, and attributes for Stage 4 rendering.

Parameters:

  • node1 (Object, nil)

    First node

  • node2 (Object, nil)

    Second node

  • diff1 (Symbol)

    Difference type for node1

  • diff2 (Symbol)

    Difference type for node2

  • dimension (Symbol)

    The match dimension causing this difference

  • _opts (Hash)

    Options (unused but kept for interface compatibility)

  • differences (Array)

    Array to append difference to



# File 'lib/canon/comparison/markup_comparator.rb', line 31

def add_difference(node1, node2, diff1, diff2, dimension, _opts,
                   differences)
  # All differences must be DiffNode objects (OO architecture)
  if dimension.nil?
    raise ArgumentError,
          "dimension required for DiffNode"
  end

  # Build informative reason message
  reason = build_difference_reason(node1, node2, diff1, diff2,
                                   dimension)

  # Enrich with path, serialized content, and attributes for Stage 4 rendering
  metadata = enrich_diff_metadata(node1, node2)

  diff_node = Canon::Diff::DiffNode.new(
    node1: node1,
    node2: node2,
    dimension: dimension,
    reason: reason,
    **metadata,
  )
  differences << diff_node
end

.build_attribute_difference_reason(attrs1, attrs2) ⇒ String

Build a clear reason message for attribute presence differences.

Shows which attributes are only in node1, only in node2, or present in both with different values.

Parameters:

  • attrs1 (Hash, nil)

    First node’s attributes

  • attrs2 (Hash, nil)

    Second node’s attributes

Returns:

  • (String)

    Clear explanation of the attribute difference



# File 'lib/canon/comparison/markup_comparator.rb', line 318

def build_attribute_difference_reason(attrs1, attrs2)
  return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2

  require "set"
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set

  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  common = keys1 & keys2

  # Check if values differ for common keys
  different_values = common.reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

  if parts.empty?
    "#{keys1.size} vs #{keys2.size} attributes (same names)"
  else
    parts.join("; ")
  end
end
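
The message formats are easier to see on concrete input. The following is a minimal standalone sketch that mirrors the logic above on plain hashes; the name `attribute_reason` is hypothetical and not part of Canon.

```ruby
require "set"

# Hypothetical standalone helper mirroring build_attribute_difference_reason.
# It is an illustration only, not Canon's API.
def attribute_reason(attrs1, attrs2)
  return "#{attrs1&.size || 0} vs #{attrs2&.size || 0} attributes" unless attrs1 && attrs2

  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set
  only1 = keys1 - keys2
  only2 = keys2 - keys1
  # Keys present on both sides whose values differ
  changed = (keys1 & keys2).reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only1.sort.join(', ')}" if only1.any?
  parts << "only in second: #{only2.sort.join(', ')}" if only2.any?
  parts << "different values: #{changed.sort.join(', ')}" if changed.any?
  parts.empty? ? "#{keys1.size} vs #{keys2.size} attributes (same names)" : parts.join("; ")
end

puts attribute_reason({ "id" => "a", "class" => "x" }, { "id" => "b", "lang" => "en" })
# only in first: class; only in second: lang; different values: id
puts attribute_reason(nil, { "id" => "a" })
# 0 vs 1 attributes
```

Attribute names are sorted within each part, so the message is stable regardless of the order in which attributes appear in the source documents.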

.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ String

Build a human-readable reason for a difference

Parameters:

  • node1 (Object, nil)

    First node

  • node2 (Object, nil)

    Second node

  • diff1 (Symbol)

    Difference type for node1

  • diff2 (Symbol)

    Difference type for node2

  • dimension (Symbol)

    The dimension of the difference

Returns:

  • (String)

    Human-readable reason



# File 'lib/canon/comparison/markup_comparator.rb', line 293

def build_difference_reason(node1, node2, diff1, diff2, dimension)
  # For attribute presence differences, show what attributes differ
  if dimension == :attribute_presence
    attrs1 = extract_attributes(node1)
    attrs2 = extract_attributes(node2)
    return build_attribute_difference_reason(attrs1, attrs2)
  end

  # For text content differences, show the actual text (truncated if needed)
  if dimension == :text_content
    text1 = extract_text_content_from_node(node1)
    text2 = extract_text_content_from_node(node2)
    return build_text_difference_reason(text1, text2)
  end

  # Default reason - can be overridden in subclasses
  "#{diff1} vs #{diff2}"
end

.build_path_for_node(node) ⇒ String?

Build canonical path for a node

Parameters:

  • node (Object)

    Node to build path for

Returns:

  • (String, nil)

    Canonical path with ordinal indices



# File 'lib/canon/comparison/markup_comparator.rb', line 76

def build_path_for_node(node)
  return nil if node.nil?

  Canon::Diff::PathBuilder.build(node, format: :document)
end

.build_text_difference_reason(text1, text2) ⇒ String

Build a clear reason message for text content differences.

Shows the actual text content (truncated if too long).

Parameters:

  • text1 (String, nil)

    First text content

  • text2 (String, nil)

    Second text content

Returns:

  • (String)

    Clear explanation of the text difference



# File 'lib/canon/comparison/markup_comparator.rb', line 381

def build_text_difference_reason(text1, text2)
  # Handle nil cases
  return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
  return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
  return "both missing" if text1.nil? && text2.nil?

  # Both have content - show truncated versions
  "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
end
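
As an illustration of the resulting messages, here is a small standalone sketch. `truncate` and `text_reason` are hypothetical stand-ins for `truncate_text` and `build_text_difference_reason`, written so that the example runs without Canon.

```ruby
# Hypothetical stand-in for truncate_text (default limit of 40 characters).
def truncate(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  text.length <= max_length ? text : "#{text[0...max_length]}..."
end

# Hypothetical stand-in for build_text_difference_reason.
def text_reason(text1, text2)
  return "both missing" if text1.nil? && text2.nil?
  return "missing vs '#{truncate(text2)}'" if text1.nil?
  return "'#{truncate(text1)}' vs missing" if text2.nil?

  "'#{truncate(text1)}' vs '#{truncate(text2)}'"
end

puts text_reason("Hello", nil)       # 'Hello' vs missing
puts text_reason(nil, nil)           # both missing
puts text_reason("a" * 50, "short")  # first side is cut to 40 characters plus "..."
```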

.comment_node?(node) ⇒ Boolean

Check if a node is a comment node

For XML/XHTML, this checks the node’s comment? method or node_type. For HTML, this also checks TEXT nodes that contain HTML-style comments (Nokogiri parses HTML comments as TEXT nodes with content like "<!-- comment -->", or escaped as "<\!-- comment -->", in full HTML documents).

Parameters:

  • node (Object)

    Node to check

Returns:

  • (Boolean)

    true if node is a comment



# File 'lib/canon/comparison/markup_comparator.rb', line 228

def comment_node?(node)
  return true if node.respond_to?(:comment?) && node.comment?
  return true if node.respond_to?(:node_type) && node.node_type == :comment

  # HTML comments are parsed as TEXT nodes by Nokogiri
  # Check if this is a text node with HTML comment content
  if text_node?(node)
    text = node_text(node)
    # Strip whitespace and backslashes for comparison
    # Nokogiri escapes HTML comments as "<\\!-- comment -->" in full documents
    text_stripped = text.to_s.strip.gsub("\\", "")
    return true if text_stripped.start_with?("<!--") && text_stripped.end_with?("-->")
  end

  false
end
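
The text-based fallback can be tried in isolation. This sketch applies the same strip-and-match steps to plain strings; `comment_like_text?` is a hypothetical helper name, not a Canon method.

```ruby
# Hypothetical helper: the text fallback from comment_node?, applied to a string.
def comment_like_text?(text)
  # Remove surrounding whitespace and any backslashes before matching
  stripped = text.to_s.strip.gsub("\\", "")
  stripped.start_with?("<!--") && stripped.end_with?("-->")
end

puts comment_like_text?("  <!-- note -->\n")  # true
puts comment_like_text?("<\\!-- note -->")    # true (backslash removed first)
puts comment_like_text?("plain text")         # false
```

Because every backslash is removed before matching, the check is deliberately loose: any text node that merely looks like a comment after stripping is treated as one.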

.determine_node_dimension(node) ⇒ Symbol

Determine the appropriate dimension for a node type

Parameters:

  • node (Object)

    The node to check

Returns:

  • (Symbol)

    The dimension symbol: :comments, :text_content, or :processing_instructions



# File 'lib/canon/comparison/markup_comparator.rb', line 426

def determine_node_dimension(node)
  # Canon::Xml::Node types
  if node.respond_to?(:node_type) && node.node_type.is_a?(Symbol)
    case node.node_type
    when :comment then :comments
    when :text, :cdata then :text_content
    when :processing_instruction then :processing_instructions
    else :text_content
    end
  # Moxml/Nokogiri types
  elsif node.respond_to?(:comment?) && node.comment?
    :comments
  elsif node.respond_to?(:text?) && node.text?
    :text_content
  elsif node.respond_to?(:cdata?) && node.cdata?
    :text_content
  elsif node.respond_to?(:processing_instruction?) && node.processing_instruction?
    :processing_instructions
  else
    :text_content
  end
end
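
A minimal sketch of the Symbol-typed branch, assuming a node object that exposes `node_type` as a Symbol. `SymbolNode` and `dimension_for` are hypothetical names used only for this illustration.

```ruby
# Hypothetical node stand-in with a Symbol node_type; not a real Canon class.
SymbolNode = Struct.new(:node_type)

# Mirrors the first branch of determine_node_dimension.
def dimension_for(node)
  case node.node_type
  when :comment then :comments
  when :processing_instruction then :processing_instructions
  else :text_content # :text, :cdata, and any unrecognised type
  end
end

p dimension_for(SymbolNode.new(:comment))                 # :comments
p dimension_for(SymbolNode.new(:cdata))                   # :text_content
p dimension_for(SymbolNode.new(:processing_instruction))  # :processing_instructions
p dimension_for(SymbolNode.new(:element))                 # :text_content (fallback)
```

Note that the fallback is :text_content, so an unexpected node type is reported under the text dimension rather than raising an error.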

.enrich_diff_metadata(node1, node2) ⇒ Hash

Enrich DiffNode with canonical path, serialized content, and attributes.

This extracts presentation-ready metadata from nodes for Stage 4 rendering.

Parameters:

  • node1 (Object, nil)

    First node

  • node2 (Object, nil)

    Second node

Returns:

  • (Hash)

    Enriched metadata hash



# File 'lib/canon/comparison/markup_comparator.rb', line 62

def enrich_diff_metadata(node1, node2)
  {
    path: build_path_for_node(node1 || node2),
    serialized_before: serialize_node(node1),
    serialized_after: serialize_node(node2),
    attributes_before: extract_attributes(node1),
    attributes_after: extract_attributes(node2),
  }
end

.extract_attributes(node) ⇒ Hash?

Extract attributes from a node

Parameters:

  • node (Object, nil)

    Node to extract attributes from

Returns:

  • (Hash, nil)

    Hash of attribute name => value pairs (empty for nodes without attributes), or nil if node is nil



# File 'lib/canon/comparison/markup_comparator.rb', line 116

def extract_attributes(node)
  return nil if node.nil?

  # Canon::Xml::Node ElementNode
  if node.is_a?(Canon::Xml::Nodes::ElementNode)
    node.attribute_nodes.to_h do |attr|
      [attr.name, attr.value]
    end
  # Nokogiri nodes
  elsif node.respond_to?(:attributes)
    node.attributes.to_h do |_, attr|
      [attr.name, attr.value]
    end
  else
    {}
  end
end

.extract_text_content_from_node(node) ⇒ String?

Extract text content from a node for diff reason

Parameters:

  • node (Object, nil)

    Node to extract text from

Returns:

  • (String, nil)

    Text content or nil



# File 'lib/canon/comparison/markup_comparator.rb', line 348

def extract_text_content_from_node(node)
  return nil if node.nil?

  # For Canon::Xml::Nodes::TextNode
  return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)

  # For XML/HTML nodes with text_content method
  return node.text_content if node.respond_to?(:text_content)

  # For nodes with text method
  return node.text if node.respond_to?(:text)

  # For nodes with content method (Moxml::Text)
  return node.content if node.respond_to?(:content)

  # For nodes with value method (other types)
  return node.value if node.respond_to?(:value)

  # For simple text nodes or strings
  return node.to_s if node.is_a?(String)

  # For other node types, try to_s
  node.to_s
rescue StandardError
  nil
end

.filter_children(children, opts) ⇒ Array

Filter children based on options

Removes nodes that should be excluded from comparison based on options like :ignore_nodes, :ignore_comments, etc.

Parameters:

  • children (Array)

    Array of child nodes

  • opts (Hash)

    Comparison options

Returns:

  • (Array)

    Filtered array of children



# File 'lib/canon/comparison/markup_comparator.rb', line 142

def filter_children(children, opts)
  children.reject do |child|
    node_excluded?(child, opts)
  end
end
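
The filtering step can be sketched with stand-in nodes. This is a simplified illustration: only the three direct options (:ignore_nodes, :ignore_comments, :ignore_text_nodes) are modelled, and `SketchNode`, `excluded?`, and `keep_included` are hypothetical names rather than Canon's API.

```ruby
# Hypothetical node stand-in: kind is :element, :comment, or :text.
SketchNode = Struct.new(:kind, :text)

# Simplified exclusion rule; match_opts and whitespace handling are omitted.
def excluded?(node, opts)
  return false if node.nil?
  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && node.kind == :comment
  return true if opts[:ignore_text_nodes] && node.kind == :text

  false
end

# Same shape as filter_children: reject every excluded child.
def keep_included(children, opts)
  children.reject { |child| excluded?(child, opts) }
end

children = [
  SketchNode.new(:element, "p"),
  SketchNode.new(:comment, "todo"),
  SketchNode.new(:text, "hi"),
]
p keep_included(children, { ignore_comments: true }).map(&:kind)
# [:element, :text]
```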

.node_excluded?(node, opts) ⇒ Boolean

Check if node should be excluded from comparison

Parameters:

  • node (Object)

    Node to check

  • opts (Hash)

    Comparison options

Returns:

  • (Boolean)

    true if node should be excluded



# File 'lib/canon/comparison/markup_comparator.rb', line 153

def node_excluded?(node, opts)
  return false if node.nil?

  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && comment_node?(node)
  return true if opts[:ignore_text_nodes] && text_node?(node)

  # Check match options
  match_opts = opts[:match_opts]
  return false unless match_opts

  # Filter comments based on match options and format
  # HTML: Filter comments to avoid spurious differences from zip pairing
  #       BUT only when not in verbose mode (verbose needs differences recorded)
  # XML: Don't filter comments (allow informative differences to be recorded)
  if match_opts[:comments] == :ignore && comment_node?(node)
    # In verbose mode, don't filter comments - we want to record the differences
    return false if opts[:verbose]

    # Only filter comments for HTML, not XML (when not verbose)
    format = opts[:format] || match_opts[:format]
    if %i[html html4 html5].include?(format)
      return true
    end
  end

  # Strip whitespace-only text nodes based on parent element configuration.
  # Use preserve_whitespace_elements / strip_whitespace_elements to control.
  # Blacklist (strip) > preserve > collapse > format defaults.
  return false unless text_node?(node) && node.parent
  return false unless MatchOptions.normalize_text(node_text(node)).empty?

  return true unless WhitespaceSensitivity.whitespace_preserved?(
    node.parent, match_opts
  )

  # When the pretty-print-side flag is active (set by opts_for_side in
  # ChildComparison.compare), drop whitespace-only text nodes that start
  # with "\n" inside :collapse elements — they are structural indentation
  # from the pretty-printer, not content.  Space-only nodes (no initial "\n") are
  # real inline content and are kept for normalised comparison.
  # :preserve elements are always left unchanged.
  if match_opts[:_pretty_print_side_active]
    ws_class = WhitespaceSensitivity.classify_text_node(node, opts)
    return true if ws_class == :collapse && node_text(node).start_with?("\n")
  end

  false
end
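
The comment rule driven by match options is the least obvious part of this method, so here is a sketch that isolates it. `drop_comment?` is a hypothetical helper, and its `is_comment` argument stands in for the result of `comment_node?(node)`; the :ignore_comments shortcut and the whitespace rules are left out.

```ruby
HTML_FORMATS = %i[html html4 html5].freeze

# Hypothetical helper isolating the match_opts comment rule from node_excluded?.
def drop_comment?(is_comment, opts)
  match_opts = opts[:match_opts]
  return false unless match_opts && is_comment
  return false unless match_opts[:comments] == :ignore
  return false if opts[:verbose] # verbose mode keeps comments so differences are recorded

  format = opts[:format] || match_opts[:format]
  HTML_FORMATS.include?(format)
end

ignore = { comments: :ignore }
p drop_comment?(true, { format: :html, match_opts: ignore })                 # true
p drop_comment?(true, { format: :html, match_opts: ignore, verbose: true })  # false
p drop_comment?(true, { format: :xml, match_opts: ignore })                  # false
```

In this sketch, a comment is dropped only when all three conditions hold: comments are set to :ignore, verbose mode is off, and the format is one of the HTML variants.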

.node_text(node) ⇒ String

Get text content from a node

Parameters:

  • node (Object)

    Node to get text from

Returns:

  • (String)

    Text content



# File 'lib/canon/comparison/markup_comparator.rb', line 259

def node_text(node)
  # Canon::Xml::Node TextNode uses .value
  if node.respond_to?(:value)
    node.value.to_s
  # Nokogiri nodes use .content
  elsif node.respond_to?(:content)
    node.content.to_s
  else
    node.to_s
  end
end

.same_node_type?(node1, node2) ⇒ Boolean

Check if two nodes are the same type

Parameters:

  • node1 (Object)

    First node

  • node2 (Object)

    Second node

Returns:

  • (Boolean)

    true if nodes are same type



# File 'lib/canon/comparison/markup_comparator.rb', line 208

def same_node_type?(node1, node2)
  return false if node1.class != node2.class

  # For Nokogiri/Canon::Xml nodes, check node type
  if node1.respond_to?(:node_type) && node2.respond_to?(:node_type)
    node1.node_type == node2.node_type
  else
    true
  end
end

.serialize_element_node(node) ⇒ String

Serialize an element node to string

Parameters:

  • node (Canon::Xml::Nodes::ElementNode)

    Element node to serialize

Returns:
  • (String)

    Serialized element



# File 'lib/canon/comparison/markup_comparator.rb', line 409

def serialize_element_node(node)
  attrs = node.attribute_nodes.map do |a|
    " #{a.name}=\"#{a.value}\""
  end.join
  children_xml = node.children.map { |c| serialize_node(c) }.join

  if children_xml.empty?
    "<#{node.name}#{attrs}/>"
  else
    "<#{node.name}#{attrs}>#{children_xml}</#{node.name}>"
  end
end
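
To show the output shape, the sketch below serializes a tiny tree built from stand-in objects. `SketchAttr`, `SketchElem`, and `serialize` are hypothetical names; text children are represented as plain strings for brevity.

```ruby
# Hypothetical stand-ins for attribute and element nodes.
SketchAttr = Struct.new(:name, :value)
SketchElem = Struct.new(:name, :attribute_nodes, :children)

def serialize(node)
  return node if node.is_a?(String) # text child, emitted as-is

  attrs = node.attribute_nodes.map { |a| " #{a.name}=\"#{a.value}\"" }.join
  inner = node.children.map { |c| serialize(c) }.join
  # Childless elements use the self-closing form
  inner.empty? ? "<#{node.name}#{attrs}/>" : "<#{node.name}#{attrs}>#{inner}</#{node.name}>"
end

br = SketchElem.new("br", [], [])
link = SketchElem.new("a", [SketchAttr.new("href", "/x")], ["go", br])
puts serialize(link)  # <a href="/x">go<br/></a>
```

Attribute values are interpolated without escaping, so a value containing a double quote would yield a string that is not well-formed markup. That appears acceptable here because the result is used for display in diff output rather than for re-parsing.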

.serialize_node(node) ⇒ String?

Serialize a node to string for display

Parameters:

  • node (Object, nil)

    Node to serialize

Returns:

  • (String, nil)

    Serialized content



# File 'lib/canon/comparison/markup_comparator.rb', line 86

def serialize_node(node)
  return nil if node.nil?

  # Canon::Xml::Node types
  if node.is_a?(Canon::Xml::Nodes::RootNode)
    # Serialize all children of root
    node.children.map { |child| serialize_node(child) }.join
  elsif node.is_a?(Canon::Xml::Nodes::ElementNode)
    serialize_element_node(node)
  elsif node.is_a?(Canon::Xml::Nodes::TextNode)
    # Use original text (with entity references) if available,
    # otherwise fall back to value (decoded text)
    node.original || node.value
  elsif node.is_a?(Canon::Xml::Nodes::CommentNode)
    "<!--#{node.value}-->"
  elsif node.is_a?(Canon::Xml::Nodes::ProcessingInstructionNode)
    "<?#{node.target} #{node.data}?>"
  elsif node.respond_to?(:to_xml)
    node.to_xml
  elsif node.respond_to?(:to_html)
    node.to_html
  else
    node.to_s
  end
end

.text_node?(node) ⇒ Boolean

Check if a node is a text node

Parameters:

  • node (Object)

    Node to check

Returns:

  • (Boolean)

    true if node is a text node



# File 'lib/canon/comparison/markup_comparator.rb', line 249

def text_node?(node)
  (node.respond_to?(:text?) && node.text? &&
    !node.respond_to?(:element?)) ||
    (node.respond_to?(:node_type) && node.node_type == :text)
end

.truncate_text(text, max_length = 40) ⇒ String

Truncate text for display in reason messages

Parameters:

  • text (String)

    Text to truncate

  • max_length (Integer) (defaults to: 40)

    Maximum length

Returns:

  • (String)

    Truncated text



# File 'lib/canon/comparison/markup_comparator.rb', line 396

def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end

.whitespace_only_difference?(text1, text2) ⇒ Boolean

Check if difference between two texts is only whitespace

Parameters:

  • text1 (String)

    First text

  • text2 (String)

    Second text

Returns:

  • (Boolean)

    true if difference is only in whitespace



# File 'lib/canon/comparison/markup_comparator.rb', line 276

def whitespace_only_difference?(text1, text2)
  # Normalize both texts (collapse/trim whitespace)
  norm1 = MatchOptions.normalize_text(text1)
  norm2 = MatchOptions.normalize_text(text2)

  # If normalized texts are the same, the difference was only whitespace
  norm1 == norm2
end
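
A minimal sketch of the check, assuming that `MatchOptions.normalize_text` collapses runs of whitespace to a single space and trims both ends (as the comment in the source suggests). The `normalize` function below is a hypothetical stand-in for it, not Canon's implementation.

```ruby
# Assumed behaviour of MatchOptions.normalize_text: collapse and trim whitespace.
def normalize(text)
  text.to_s.gsub(/\s+/, " ").strip
end

# Two texts differ only in whitespace when their normalized forms are equal.
def whitespace_only?(text1, text2)
  normalize(text1) == normalize(text2)
end

p whitespace_only?("Hello   world\n", " Hello world")  # true
p whitespace_only?("Hello world", "Hello, world")      # false
```

Identical inputs also return true, so callers would typically use this only after establishing that the two texts are not equal.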