Class: Canon::Comparison::MarkupComparator
Inherits: Object
- Object
- Canon::Comparison::MarkupComparator
Defined in: lib/canon/comparison/markup_comparator.rb
Overview
Base class for markup document comparison (XML, HTML)
Provides shared comparison functionality for markup documents, including node type checking, text extraction, filtering, and difference creation.
Format-specific comparators (XmlComparator, HtmlComparator) inherit from this class and add format-specific behavior.
Direct Known Subclasses
HtmlComparator, XmlComparator
Class Method Summary
- .add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object
  Add a difference to the differences array.
- .build_attribute_difference_reason(attrs1, attrs2) ⇒ String
  Build a clear reason message for attribute presence differences. Shows which attributes are only in node1, only in node2, or have different values.
- .build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ String
  Build a human-readable reason for a difference.
- .build_path_for_node(node) ⇒ String?
  Build canonical path for a node.
- .build_text_difference_reason(text1, text2) ⇒ String
  Build a clear reason message for text content differences. Shows the actual text content (truncated if too long).
- .comment_node?(node) ⇒ Boolean
  Check if a node is a comment node.
- .determine_node_dimension(node) ⇒ Symbol
  Determine the appropriate dimension for a node type.
- .enrich_diff_metadata(node1, node2) ⇒ Hash
  Enrich DiffNode with canonical path, serialized content, and attributes. This extracts presentation-ready metadata from nodes for Stage 4 rendering.
- .extract_attributes(node) ⇒ Hash?
  Extract attributes from a node.
- .extract_text_content_from_node(node) ⇒ String?
  Extract text content from a node for diff reason.
- .filter_children(children, opts) ⇒ Array
  Filter children based on options.
- .node_excluded?(node, opts) ⇒ Boolean
  Check if node should be excluded from comparison.
- .node_text(node) ⇒ String
  Get text content from a node.
- .same_node_type?(node1, node2) ⇒ Boolean
  Check if two nodes are the same type.
- .serialize_element_node(node) ⇒ String
  Serialize an element node to string.
- .serialize_node(node) ⇒ String?
  Serialize a node to string for display.
- .text_node?(node) ⇒ Boolean
  Check if a node is a text node.
- .truncate_text(text, max_length = 40) ⇒ String
  Truncate text for display in reason messages.
- .whitespace_only_difference?(text1, text2) ⇒ Boolean
  Check if difference between two texts is only whitespace.
Class Method Details
.add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object
Add a difference to the differences array
Creates a DiffNode with enriched metadata including path, serialized content, and attributes for Stage 4 rendering.
# File 'lib/canon/comparison/markup_comparator.rb', line 32
def add_difference(node1, node2, diff1, diff2, dimension, _opts, differences)
  # All differences must be DiffNode objects (OO architecture)
  if dimension.nil?
    raise ArgumentError, "dimension required for DiffNode"
  end

  # Build informative reason message
  reason = build_difference_reason(node1, node2, diff1, diff2, dimension)

  # Enrich with path, serialized content, and attributes for Stage 4 rendering
  metadata = enrich_diff_metadata(node1, node2)

  diff_node = Canon::Diff::DiffNode.new(
    node1: node1,
    node2: node2,
    dimension: dimension,
    reason: reason,
    **metadata,
  )
  differences << diff_node
end
.build_attribute_difference_reason(attrs1, attrs2) ⇒ String
Build a clear reason message for attribute presence differences. Shows which attributes are only in node1, only in node2, or have different values.
# File 'lib/canon/comparison/markup_comparator.rb', line 316
def build_attribute_difference_reason(attrs1, attrs2)
  return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2

  require "set"
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set

  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  common = keys1 & keys2

  # Check if values differ for common keys
  different_values = common.reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

  if parts.empty?
    "#{keys1.size} vs #{keys2.size} attributes (same names)"
  else
    parts.join("; ")
  end
end
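To see what these reason strings look like without loading Canon, the logic above can be copied into a standalone method. This is only a sketch: `attribute_reason_sketch` is a hypothetical name, and the attribute hashes are made-up sample data.

```ruby
require "set"

# Standalone copy of the reason-building logic, for experimentation only.
# `attribute_reason_sketch` is a hypothetical name, not part of Canon.
def attribute_reason_sketch(attrs1, attrs2)
  # If either side has no attribute hash, report counts only
  return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2

  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set
  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  # Shared names whose values differ
  different = (keys1 & keys2).reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{different.sort.join(', ')}" if different.any?

  parts.empty? ? "#{keys1.size} vs #{keys2.size} attributes (same names)" : parts.join("; ")
end

puts attribute_reason_sketch(
  { "id" => "a", "class" => "x" },
  { "id" => "b", "lang" => "en" },
)
# prints: only in first: class; only in second: lang; different values: id
```

Names are sorted before joining, so the message should not depend on the order in which attributes appear in the source documents.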
.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ String
Build a human-readable reason for a difference
# File 'lib/canon/comparison/markup_comparator.rb', line 287
def build_difference_reason(node1, node2, diff1, diff2, dimension)
  # For attribute presence differences, show what attributes differ
  if dimension == :attribute_presence
    attrs1 = extract_attributes(node1)
    attrs2 = extract_attributes(node2)
    return build_attribute_difference_reason(attrs1, attrs2)
  end

  # For text content differences, show the actual text (truncated if needed)
  if dimension == :text_content
    text1 = extract_text_content_from_node(node1)
    text2 = extract_text_content_from_node(node2)
    return build_text_difference_reason(text1, text2)
  end

  # Default reason - can be overridden in subclasses
  if diff1 == Canon::Comparison::MISSING_NODE && diff2 == Canon::Comparison::MISSING_NODE
    "element structure mismatch (children differ)"
  else
    Canon::Comparison.code_pair_label(diff1, diff2)
  end
end
.build_path_for_node(node) ⇒ String?
Build canonical path for a node
# File 'lib/canon/comparison/markup_comparator.rb', line 77
def build_path_for_node(node)
  return nil if node.nil?

  Canon::Diff::PathBuilder.build(node, format: :document)
end
.build_text_difference_reason(text1, text2) ⇒ String
Build a clear reason message for text content differences. Shows the actual text content (truncated if too long).
# File 'lib/canon/comparison/markup_comparator.rb', line 371
def build_text_difference_reason(text1, text2)
  # Handle nil cases
  return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
  return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
  return "both missing" if text1.nil? && text2.nil?

  # Both have content - show truncated versions
  "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
end
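The four cases handled above (one side missing, the other side missing, both missing, both present) can be exercised with a standalone copy. The method names `truncate_sketch` and `text_reason_sketch` are hypothetical stand-ins for `truncate_text` and `build_text_difference_reason`, copied here so the example runs without Canon.

```ruby
# Stand-in for truncate_text: keep at most max_length characters, then "..."
def truncate_sketch(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end

# Stand-in for build_text_difference_reason
def text_reason_sketch(text1, text2)
  return "missing vs '#{truncate_sketch(text2)}'" if text1.nil? && text2
  return "'#{truncate_sketch(text1)}' vs missing" if text1 && text2.nil?
  return "both missing" if text1.nil? && text2.nil?

  "'#{truncate_sketch(text1)}' vs '#{truncate_sketch(text2)}'"
end

puts text_reason_sketch(nil, "abc")      # prints: missing vs 'abc'
puts text_reason_sketch("abc", nil)      # prints: 'abc' vs missing
puts text_reason_sketch(nil, nil)        # prints: both missing
puts text_reason_sketch("a" * 45, "b")   # first side is cut to 40 chars plus "..."
```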
.comment_node?(node) ⇒ Boolean
Check if a node is a comment node
For XML/XHTML, this checks the node’s comment? method or node_type. For HTML, this also checks TEXT nodes that contain HTML-style comments (Nokogiri parses HTML comments as TEXT nodes with content like "<!-- comment -->" or escaped like "<\!-- comment -->" in full HTML documents).
# File 'lib/canon/comparison/markup_comparator.rb', line 245
def comment_node?(node)
  NodeInspector.comment_node?(node)
end
.determine_node_dimension(node) ⇒ Symbol
Determine the appropriate dimension for a node type
Used by ChildComparison to tag per-child orphan diffs with a dimension that matches what the node is, so the formatter renders correctly. An element orphan tagged :text_content would otherwise route through PR #126’s one-sided text formatter and render as text “” instead of as the actual element (see lutaml/canon#125 follow-up).
# File 'lib/canon/comparison/markup_comparator.rb', line 423
def determine_node_dimension(node)
  case node
  when Canon::Xml::Node
    case node.node_type
    when :element then :element_structure
    when :comment then :comments
    when :text, :cdata then :text_content
    when :processing_instruction then :processing_instructions
    else :text_content
    end
  when Nokogiri::XML::Node
    if node.comment?
      :comments
    elsif node.text? || node.cdata?
      :text_content
    elsif node.processing_instruction?
      :processing_instructions
    elsif node.element?
      :element_structure
    else
      :text_content
    end
  else
    :text_content
  end
end
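The Canon::Xml::Node branch above amounts to a lookup from node type to dimension with :text_content as the fallback. A minimal sketch of that mapping as a table follows; `DIMENSION_BY_NODE_TYPE` and `dimension_for` are hypothetical names used only for illustration, and the sketch takes a node type symbol rather than a real node.

```ruby
# Node type => dimension, mirroring the Canon::Xml::Node branch above.
# Both names here are illustrative and not part of Canon.
DIMENSION_BY_NODE_TYPE = {
  element: :element_structure,
  comment: :comments,
  text: :text_content,
  cdata: :text_content,
  processing_instruction: :processing_instructions,
}.freeze

def dimension_for(node_type)
  # Unknown node types fall back to :text_content, as in the method above
  DIMENSION_BY_NODE_TYPE.fetch(node_type, :text_content)
end

p dimension_for(:element) # prints :element_structure
p dimension_for(:cdata)   # prints :text_content
p dimension_for(:doctype) # prints :text_content (fallback)
```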
.enrich_diff_metadata(node1, node2) ⇒ Hash
Enrich DiffNode with canonical path, serialized content, and attributes. This extracts presentation-ready metadata from nodes for Stage 4 rendering.
# File 'lib/canon/comparison/markup_comparator.rb', line 63
def enrich_diff_metadata(node1, node2)
  {
    path: build_path_for_node(node1 || node2),
    serialized_before: serialize_node(node1),
    serialized_after: serialize_node(node2),
    attributes_before: extract_attributes(node1),
    attributes_after: extract_attributes(node2),
  }
end
.extract_attributes(node) ⇒ Hash?
Extract attributes from a node
# File 'lib/canon/comparison/markup_comparator.rb', line 114
def extract_attributes(node)
  return nil if node.nil?

  # Canon::Xml::Node ElementNode
  if node.is_a?(Canon::Xml::Nodes::ElementNode)
    node.attribute_nodes.to_h do |attr|
      [attr.name, attr.value]
    end
  # Nokogiri elements
  elsif node.is_a?(Nokogiri::XML::Element)
    node.attributes.to_h do |_, attr|
      [attr.name, attr.value]
    end
  else
    {}
  end
end
.extract_text_content_from_node(node) ⇒ String?
Extract text content from a node for diff reason
# File 'lib/canon/comparison/markup_comparator.rb', line 346
def extract_text_content_from_node(node)
  return nil if node.nil?

  case node
  when Canon::Xml::Nodes::TextNode
    node.value
  when Canon::Xml::Node
    node.text_content
  when Nokogiri::XML::Node
    node.content.to_s
  when String
    node
  else
    node.to_s
  end
rescue StandardError
  nil
end
.filter_children(children, opts) ⇒ Array
Filter children based on options
Removes nodes that should be excluded from comparison based on options like :ignore_nodes, :ignore_comments, etc.
# File 'lib/canon/comparison/markup_comparator.rb', line 140
def filter_children(children, opts)
  children.reject do |child|
    node_excluded?(child, opts)
  end
end
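The reject-by-predicate pattern above can be tried with stand-in nodes. In this sketch `SketchNode`, `sketch_excluded?`, and `sketch_filter_children` are hypothetical; the predicate models only the three simple options (:ignore_nodes, :ignore_comments, :ignore_text_nodes) and leaves out the match-option and whitespace rules of the real node_excluded?.

```ruby
# Stand-in node with a kind (:element, :comment, :text) and some text.
SketchNode = Struct.new(:kind, :text)

# Simplified exclusion predicate covering only the three simple options.
def sketch_excluded?(node, opts)
  return false if node.nil?
  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && node.kind == :comment
  return true if opts[:ignore_text_nodes] && node.kind == :text

  false
end

def sketch_filter_children(children, opts)
  children.reject { |child| sketch_excluded?(child, opts) }
end

children = [
  SketchNode.new(:element, "p"),
  SketchNode.new(:comment, " note "),
  SketchNode.new(:text, "hello"),
]

p sketch_filter_children(children, ignore_comments: true).map(&:kind)
# prints [:element, :text]
```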
.node_excluded?(node, opts) ⇒ Boolean
Check if node should be excluded from comparison
# File 'lib/canon/comparison/markup_comparator.rb', line 151
def node_excluded?(node, opts)
  return false if node.nil?
  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && comment_node?(node)
  return true if opts[:ignore_text_nodes] && text_node?(node)

  # Check match options
  match_opts = opts[:match_opts]
  return false unless match_opts

  # Filter comments based on match options and format
  # HTML: Filter comments to avoid spurious differences from zip pairing
  # BUT only when not in verbose mode (verbose needs differences recorded)
  # XML: Don't filter comments (allow informative differences to be recorded)
  if match_opts[:comments] == :ignore && comment_node?(node)
    # In verbose mode, don't filter comments - we want to record the differences
    return false if opts[:verbose]

    # Only filter comments for HTML, not XML (when not verbose)
    format = opts[:format] || match_opts[:format]
    if %i[html html4 html5].include?(format)
      return true
    end
  end

  # Strip whitespace-only text nodes based on parent element configuration.
  # Use preserve_whitespace_elements / strip_whitespace_elements to control.
  # Blacklist (strip) > preserve > collapse > format defaults.
  return false unless text_node?(node) && node.parent
  return false unless MatchOptions.normalize_text(node_text(node)).empty?

  # NBSP (U+00A0) is never insignificant whitespace —
  # it always renders as a visible non-breaking space.
  # For HTML: always preserve NBSP nodes.
  # For XML with whitespace_type: :strict: preserve NBSP nodes so
  # different Unicode whitespace types remain distinguishable.
  format = opts[:format] || match_opts[:format]
  whitespace_type = match_opts[:whitespace_type] || :strict
  if (%i[html html4 html5].include?(format) || whitespace_type == :strict) &&
      WhitespaceSensitivity.contains_nbsp?(node_text(node))
    return false
  end

  if %i[html html4 html5].include?(format) &&
      WhitespaceSensitivity.inline_whitespace_significant?(node)
    # Whitespace between inline element siblings is semantically
    # significant (renders as a visible gap) and must not be stripped.
    return false
  end

  return true unless WhitespaceSensitivity.whitespace_preserved?(
    node.parent, match_opts
  )

  # When the pretty-print-side flag is active (set by opts_for_side in
  # ChildComparison.compare), drop whitespace-only text nodes that start
  # with "\n" inside :collapse elements — they are structural indentation
  # from the pretty-printer, not content. Space-only nodes (no initial "\n") are
  # real inline content and are kept for normalised comparison.
  # :preserve elements are always left unchanged.
  if match_opts[:_pretty_print_side_active]
    ws_class = WhitespaceSensitivity.classify_text_node(node, opts)
    return true if ws_class == :collapse && node_text(node).start_with?("\n")
  end

  false
end
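The comment-handling branch of node_excluded? can be isolated as a small decision function. The sketch below assumes the node is already known to be a comment and that opts[:ignore_comments] was not set (that case returns true earlier in the real method); `HTML_FORMATS` and `drop_ignored_comment?` are hypothetical names.

```ruby
HTML_FORMATS = %i[html html4 html5].freeze

# Decide whether a comment node would be filtered by the match-options branch.
# Assumes the node is a comment; covers only that branch of node_excluded?.
def drop_ignored_comment?(opts)
  match_opts = opts[:match_opts]
  return false unless match_opts && match_opts[:comments] == :ignore

  # Verbose mode keeps comments so their differences are recorded
  return false if opts[:verbose]

  # Top-level :format wins over the one inside match_opts
  format = opts[:format] || match_opts[:format]
  HTML_FORMATS.include?(format)
end

p drop_ignored_comment?(format: :html, match_opts: { comments: :ignore })                # prints true
p drop_ignored_comment?(format: :xml, match_opts: { comments: :ignore })                 # prints false
p drop_ignored_comment?(format: :html, verbose: true, match_opts: { comments: :ignore }) # prints false
```

Under these assumptions, an ignored comment in an XML document is not dropped at this stage, which matches the inline note that XML comments are kept so informative differences can still be recorded.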
.node_text(node) ⇒ String
Get text content from a node
# File 'lib/canon/comparison/markup_comparator.rb', line 261
def node_text(node)
  NodeInspector.text_content(node)
end
.same_node_type?(node1, node2) ⇒ Boolean
Check if two nodes are the same type
# File 'lib/canon/comparison/markup_comparator.rb', line 225
def same_node_type?(node1, node2)
  return false if node1.class != node2.class

  case node1
  when Canon::Xml::Node, Nokogiri::XML::Node
    node1.node_type == node2.node_type
  else
    true
  end
end
.serialize_element_node(node) ⇒ String
Serialize an element node to string
# File 'lib/canon/comparison/markup_comparator.rb', line 399
def serialize_element_node(node)
  attrs = node.attribute_nodes.map do |a|
    " #{a.name}=\"#{a.value}\""
  end.join

  children_xml = node.children.map { |c| serialize_node(c) }.join

  if children_xml.empty?
    "<#{node.name}#{attrs}/>"
  else
    "<#{node.name}#{attrs}>#{children_xml}</#{node.name}>"
  end
end
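The recursive serialization above can be illustrated with plain Struct stand-ins instead of Canon node classes. `SketchAttr`, `SketchText`, `SketchElement`, and `serialize_sketch` are hypothetical names; the sketch handles only elements and text, not comments or processing instructions.

```ruby
# Minimal stand-ins for attribute, text, and element nodes.
SketchAttr = Struct.new(:name, :value)
SketchText = Struct.new(:value)
SketchElement = Struct.new(:name, :attribute_nodes, :children)

def serialize_sketch(node)
  case node
  when SketchElement
    attrs = node.attribute_nodes.map { |a| " #{a.name}=\"#{a.value}\"" }.join
    inner = node.children.map { |c| serialize_sketch(c) }.join
    # No serialized children: emit a self-closing tag
    inner.empty? ? "<#{node.name}#{attrs}/>" : "<#{node.name}#{attrs}>#{inner}</#{node.name}>"
  when SketchText
    node.value
  else
    node.to_s
  end
end

br = SketchElement.new("br", [], [])
para = SketchElement.new("p", [SketchAttr.new("class", "note")], [SketchText.new("Hi"), br])

puts serialize_sketch(para)
# prints: <p class="note">Hi<br/></p>
```

As in the method above, attribute values are interpolated as-is, and an element whose children all serialize to an empty string is rendered self-closing, so the output is intended for display rather than for round-tripping.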
.serialize_node(node) ⇒ String?
Serialize a node to string for display
# File 'lib/canon/comparison/markup_comparator.rb', line 87
def serialize_node(node)
  return nil if node.nil?

  # Canon::Xml::Node types
  case node
  when Canon::Xml::Nodes::RootNode
    # Serialize all children of root
    node.children.map { |child| serialize_node(child) }.join
  when Canon::Xml::Nodes::ElementNode
    serialize_element_node(node)
  when Canon::Xml::Nodes::TextNode
    # Use original text (with entity references) if available,
    # otherwise fall back to value (decoded text)
    node.original || node.value
  when Canon::Xml::Nodes::CommentNode
    "<!--#{node.value}-->"
  when Canon::Xml::Nodes::ProcessingInstructionNode
    "<?#{node.target} #{node.data}?>"
  else
    node.to_s
  end
end
.text_node?(node) ⇒ Boolean
Check if a node is a text node
# File 'lib/canon/comparison/markup_comparator.rb', line 253
def text_node?(node)
  NodeInspector.text_node?(node)
end
.truncate_text(text, max_length = 40) ⇒ String
Truncate text for display in reason messages
# File 'lib/canon/comparison/markup_comparator.rb', line 386
def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end
.whitespace_only_difference?(text1, text2) ⇒ Boolean
Check if difference between two texts is only whitespace
# File 'lib/canon/comparison/markup_comparator.rb', line 270
def whitespace_only_difference?(text1, text2)
  # Normalize both texts (collapse/trim whitespace)
  norm1 = MatchOptions.normalize_text(text1)
  norm2 = MatchOptions.normalize_text(text2)

  # If normalized texts are the same, the difference was only whitespace
  norm1 == norm2
end
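The check above depends on MatchOptions.normalize_text, whose implementation is not shown on this page. The sketch below assumes it collapses runs of whitespace to a single space and trims both ends, as the inline comment suggests; `normalize_text_sketch` and `whitespace_only_difference_sketch?` are hypothetical stand-ins, and the real normalizer may treat some characters differently.

```ruby
# Assumed normalization: collapse whitespace runs, then trim.
# This is a guess at the behavior of MatchOptions.normalize_text.
def normalize_text_sketch(text)
  text.to_s.gsub(/\s+/, " ").strip
end

def whitespace_only_difference_sketch?(text1, text2)
  normalize_text_sketch(text1) == normalize_text_sketch(text2)
end

p whitespace_only_difference_sketch?("Hello   world\n", "Hello world") # prints true
p whitespace_only_difference_sketch?("Hello world", "Hello, world")    # prints false
```

Because the method only compares normalized forms, it also returns true for two identical strings, so it is meaningful only once the caller has already established that the texts differ.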