Class: Canon::Comparison::MarkupComparator
- Inherits: Object
  - Object
  - Canon::Comparison::MarkupComparator
- Defined in: lib/canon/comparison/markup_comparator.rb
Overview
Base class for markup document comparison (XML, HTML)
Provides shared comparison functionality for markup documents, including node type checking, text extraction, filtering, and difference creation.
Format-specific comparators (XmlComparator, HtmlComparator) inherit from this class and add format-specific behavior.
Direct Known Subclasses
Class Method Summary
- .add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object
  Add a difference to the differences array.
- .build_attribute_difference_reason(attrs1, attrs2) ⇒ String
  Build a clear reason message for attribute presence differences. Shows which attributes are only in node1, only in node2, or have different values.
- .build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ String
  Build a human-readable reason for a difference.
- .build_path_for_node(node) ⇒ String?
  Build canonical path for a node.
- .build_text_difference_reason(text1, text2) ⇒ String
  Build a clear reason message for text content differences. Shows the actual text content (truncated if too long).
- .comment_node?(node) ⇒ Boolean
  Check if a node is a comment node.
- .determine_node_dimension(node) ⇒ Symbol
  Determine the appropriate dimension for a node type.
- .enrich_diff_metadata(node1, node2) ⇒ Hash
  Enrich DiffNode with canonical path, serialized content, and attributes. This extracts presentation-ready metadata from nodes for Stage 4 rendering.
- .extract_attributes(node) ⇒ Hash?
  Extract attributes from a node.
- .extract_text_content_from_node(node) ⇒ String?
  Extract text content from a node for diff reason.
- .filter_children(children, opts) ⇒ Array
  Filter children based on options.
- .node_excluded?(node, opts) ⇒ Boolean
  Check if node should be excluded from comparison.
- .node_text(node) ⇒ String
  Get text content from a node.
- .same_node_type?(node1, node2) ⇒ Boolean
  Check if two nodes are the same type.
- .serialize_element_node(node) ⇒ String
  Serialize an element node to string.
- .serialize_node(node) ⇒ String?
  Serialize a node to string for display.
- .text_node?(node) ⇒ Boolean
  Check if a node is a text node.
- .truncate_text(text, max_length = 40) ⇒ String
  Truncate text for display in reason messages.
- .whitespace_only_difference?(text1, text2) ⇒ Boolean
  Check if difference between two texts is only whitespace.
Class Method Details
.add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object
Add a difference to the differences array
Creates a DiffNode with enriched metadata including path, serialized content, and attributes for Stage 4 rendering.
    # File 'lib/canon/comparison/markup_comparator.rb', line 31

    def add_difference(node1, node2, diff1, diff2, dimension, _opts, differences)
      # All differences must be DiffNode objects (OO architecture)
      if dimension.nil?
        raise ArgumentError,
              "dimension required for DiffNode"
      end

      # Build informative reason message
      reason = build_difference_reason(node1, node2, diff1, diff2, dimension)

      # Enrich with path, serialized content, and attributes for Stage 4 rendering
      metadata = enrich_diff_metadata(node1, node2)

      diff_node = Canon::Diff::DiffNode.new(
        node1: node1,
        node2: node2,
        dimension: dimension,
        reason: reason,
        **metadata,
      )
      differences << diff_node
    end
.build_attribute_difference_reason(attrs1, attrs2) ⇒ String
Build a clear reason message for attribute presence differences. Shows which attributes are only in node1, only in node2, or have different values.
    # File 'lib/canon/comparison/markup_comparator.rb', line 318

    def build_attribute_difference_reason(attrs1, attrs2)
      return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2

      require "set"
      keys1 = attrs1.keys.to_set
      keys2 = attrs2.keys.to_set

      only_in_1 = keys1 - keys2
      only_in_2 = keys2 - keys1
      common = keys1 & keys2

      # Check if values differ for common keys
      different_values = common.reject { |k| attrs1[k] == attrs2[k] }

      parts = []
      parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
      parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
      parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

      if parts.empty?
        "#{keys1.size} vs #{keys2.size} attributes (same names)"
      else
        parts.join("; ")
      end
    end
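To make the three categories concrete, the following is a minimal standalone sketch that mirrors the logic above on plain Hashes. The name `attribute_reason` and the sample attribute hashes are hypothetical and used only for illustration; this is not a call into Canon itself.

```ruby
require "set"

# Hypothetical standalone helper mirroring the documented logic;
# it is not part of Canon's API.
def attribute_reason(attrs1, attrs2)
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set

  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  # Keys present on both sides whose values differ
  changed = (keys1 & keys2).reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{changed.sort.join(', ')}" if changed.any?

  if parts.empty?
    "#{keys1.size} vs #{keys2.size} attributes (same names)"
  else
    parts.join("; ")
  end
end

before = { "id" => "a", "class" => "note" }
after  = { "id" => "b", "lang" => "en" }

puts attribute_reason(before, after)
# only in first: class; only in second: lang; different values: id
puts attribute_reason(before, before)
# 2 vs 2 attributes (same names)
```

Sorting the key names keeps the message stable regardless of attribute order in the source documents.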
.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ String
Build a human-readable reason for a difference
    # File 'lib/canon/comparison/markup_comparator.rb', line 293

    def build_difference_reason(node1, node2, diff1, diff2, dimension)
      # For attribute presence differences, show what attributes differ
      if dimension == :attribute_presence
        attrs1 = extract_attributes(node1)
        attrs2 = extract_attributes(node2)
        return build_attribute_difference_reason(attrs1, attrs2)
      end

      # For text content differences, show the actual text (truncated if needed)
      if dimension == :text_content
        text1 = extract_text_content_from_node(node1)
        text2 = extract_text_content_from_node(node2)
        return build_text_difference_reason(text1, text2)
      end

      # Default reason - can be overridden in subclasses
      "#{diff1} vs #{diff2}"
    end
.build_path_for_node(node) ⇒ String?
Build canonical path for a node
    # File 'lib/canon/comparison/markup_comparator.rb', line 76

    def build_path_for_node(node)
      return nil if node.nil?

      Canon::Diff::PathBuilder.build(node, format: :document)
    end
.build_text_difference_reason(text1, text2) ⇒ String
Build a clear reason message for text content differences. Shows the actual text content (truncated if too long).
    # File 'lib/canon/comparison/markup_comparator.rb', line 381

    def build_text_difference_reason(text1, text2)
      # Handle nil cases
      return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
      return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
      return "both missing" if text1.nil? && text2.nil?

      # Both have content - show truncated versions
      "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
    end
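The nil handling and truncation can be tried in isolation. The sketch below re-implements both helpers as top-level methods under the assumption that inputs are Strings or nil; `text_reason` is a hypothetical name, and the code does not call into Canon.

```ruby
# Hypothetical standalone copies of the two helpers, for illustration only.
def truncate_text(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  text.length <= max_length ? text : "#{text[0...max_length]}..."
end

def text_reason(text1, text2)
  return "both missing" if text1.nil? && text2.nil?
  return "missing vs '#{truncate_text(text2)}'" if text1.nil?
  return "'#{truncate_text(text1)}' vs missing" if text2.nil?

  "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
end

puts text_reason("Hello", "Hello world") # 'Hello' vs 'Hello world'
puts text_reason(nil, "added")           # missing vs 'added'
puts text_reason("a" * 50, nil)          # first 40 characters, then "..."
```

Text longer than `max_length` is cut to exactly `max_length` characters before the trailing `...` is appended, so the displayed string can be up to three characters longer than the limit.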
.comment_node?(node) ⇒ Boolean
Check if a node is a comment node
For XML/XHTML, this checks the node’s comment? method or node_type. For HTML, this also checks TEXT nodes that contain HTML-style comments (Nokogiri parses HTML comments as TEXT nodes with content like "<!-- comment -->" or escaped like "<\!-- comment -->" in full HTML documents).
    # File 'lib/canon/comparison/markup_comparator.rb', line 228

    def comment_node?(node)
      return true if node.respond_to?(:comment?) && node.comment?
      return true if node.respond_to?(:node_type) && node.node_type == :comment

      # HTML comments are parsed as TEXT nodes by Nokogiri
      # Check if this is a text node with HTML comment content
      if text_node?(node)
        text = node_text(node)
        # Strip whitespace and backslashes for comparison
        # Nokogiri escapes HTML comments as "<\\!-- comment -->" in full documents
        text_stripped = text.to_s.strip.gsub("\\", "")
        return true if text_stripped.start_with?("<!--") && text_stripped.end_with?("-->")
      end

      false
    end
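The string test at the end of the method can be sketched on its own. `html_comment_text?` is a hypothetical helper name introduced here to isolate that check; it takes the raw text rather than a node.

```ruby
# Hypothetical helper isolating the string test used for HTML text nodes.
def html_comment_text?(text)
  # Drop surrounding whitespace and any backslashes before matching
  stripped = text.to_s.strip.gsub("\\", "")
  stripped.start_with?("<!--") && stripped.end_with?("-->")
end

puts html_comment_text?("  <!-- note -->\n") # true
puts html_comment_text?("<\\!-- note -->")   # true (escaped form)
puts html_comment_text?("plain text")        # false
```

Because every backslash is removed before matching, the escaped and unescaped forms are treated the same way.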
.determine_node_dimension(node) ⇒ Symbol
Determine the appropriate dimension for a node type
    # File 'lib/canon/comparison/markup_comparator.rb', line 426

    def determine_node_dimension(node)
      # Canon::Xml::Node types
      if node.respond_to?(:node_type) && node.node_type.is_a?(Symbol)
        case node.node_type
        when :comment then :comments
        when :text, :cdata then :text_content
        when :processing_instruction then :processing_instructions
        else :text_content
        end
      # Moxml/Nokogiri types
      elsif node.respond_to?(:comment?) && node.comment?
        :comments
      elsif node.respond_to?(:text?) && node.text?
        :text_content
      elsif node.respond_to?(:cdata?) && node.cdata?
        :text_content
      elsif node.respond_to?(:processing_instruction?) && node.processing_instruction?
        :processing_instructions
      else
        :text_content
      end
    end
.enrich_diff_metadata(node1, node2) ⇒ Hash
Enrich DiffNode with canonical path, serialized content, and attributes. This extracts presentation-ready metadata from nodes for Stage 4 rendering.
    # File 'lib/canon/comparison/markup_comparator.rb', line 62

    def enrich_diff_metadata(node1, node2)
      {
        path: build_path_for_node(node1 || node2),
        serialized_before: serialize_node(node1),
        serialized_after: serialize_node(node2),
        attributes_before: extract_attributes(node1),
        attributes_after: extract_attributes(node2),
      }
    end
.extract_attributes(node) ⇒ Hash?
Extract attributes from a node
    # File 'lib/canon/comparison/markup_comparator.rb', line 116

    def extract_attributes(node)
      return nil if node.nil?

      # Canon::Xml::Node ElementNode
      if node.is_a?(Canon::Xml::Nodes::ElementNode)
        node.attribute_nodes.to_h do |attr|
          [attr.name, attr.value]
        end
      # Nokogiri nodes
      elsif node.respond_to?(:attributes)
        node.attributes.to_h do |_, attr|
          [attr.name, attr.value]
        end
      else
        {}
      end
    end
.extract_text_content_from_node(node) ⇒ String?
Extract text content from a node for diff reason
    # File 'lib/canon/comparison/markup_comparator.rb', line 348

    def extract_text_content_from_node(node)
      return nil if node.nil?

      # For Canon::Xml::Nodes::TextNode
      return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)

      # For XML/HTML nodes with text_content method
      return node.text_content if node.respond_to?(:text_content)

      # For nodes with text method
      return node.text if node.respond_to?(:text)

      # For nodes with content method (Moxml::Text)
      return node.content if node.respond_to?(:content)

      # For nodes with value method (other types)
      return node.value if node.respond_to?(:value)

      # For simple text nodes or strings
      return node.to_s if node.is_a?(String)

      # For other node types, try to_s
      node.to_s
    rescue StandardError
      nil
    end
.filter_children(children, opts) ⇒ Array
Filter children based on options
Removes nodes that should be excluded from comparison based on options like :ignore_nodes, :ignore_comments, etc.
    # File 'lib/canon/comparison/markup_comparator.rb', line 142

    def filter_children(children, opts)
      children.reject do |child|
        node_excluded?(child, opts)
      end
    end
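As a rough illustration of how the top-level options drive filtering, the sketch below uses a stand-in `FakeNode` struct instead of real Canon or Nokogiri nodes, and a simplified `excluded?` check that covers only `:ignore_nodes`, `:ignore_comments`, and `:ignore_text_nodes`. Both names are hypothetical, and the `:match_opts` handling of the real `node_excluded?` is deliberately left out.

```ruby
# Stand-in node type; real callers pass Canon::Xml or Nokogiri nodes.
FakeNode = Struct.new(:node_type, :value)

# Simplified exclusion check covering only the three top-level options.
def excluded?(node, opts)
  return false if node.nil?
  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && node.node_type == :comment
  return true if opts[:ignore_text_nodes] && node.node_type == :text

  false
end

def filter_children(children, opts)
  children.reject { |child| excluded?(child, opts) }
end

children = [
  FakeNode.new(:element, "p"),
  FakeNode.new(:comment, " draft "),
  FakeNode.new(:text, "hello"),
]

kept = filter_children(children, { ignore_comments: true })
puts kept.map(&:node_type).inspect # [:element, :text]
```

The input array is not modified; `reject` returns a new array containing only the nodes that take part in the comparison.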
.node_excluded?(node, opts) ⇒ Boolean
Check if node should be excluded from comparison
    # File 'lib/canon/comparison/markup_comparator.rb', line 153

    def node_excluded?(node, opts)
      return false if node.nil?
      return true if opts[:ignore_nodes]&.include?(node)
      return true if opts[:ignore_comments] && comment_node?(node)
      return true if opts[:ignore_text_nodes] && text_node?(node)

      # Check match options
      match_opts = opts[:match_opts]
      return false unless match_opts

      # Filter comments based on match options and format
      # HTML: Filter comments to avoid spurious differences from zip pairing
      #       BUT only when not in verbose mode (verbose needs differences recorded)
      # XML: Don't filter comments (allow informative differences to be recorded)
      if match_opts[:comments] == :ignore && comment_node?(node)
        # In verbose mode, don't filter comments - we want to record the differences
        return false if opts[:verbose]

        # Only filter comments for HTML, not XML (when not verbose)
        format = opts[:format] || match_opts[:format]
        if %i[html html4 html5].include?(format)
          return true
        end
      end

      # Strip whitespace-only text nodes based on parent element configuration.
      # Use preserve_whitespace_elements / strip_whitespace_elements to control.
      # Blacklist (strip) > preserve > collapse > format defaults.
      return false unless text_node?(node) && node.parent
      return false unless MatchOptions.normalize_text(node_text(node)).empty?
      return true unless WhitespaceSensitivity.whitespace_preserved?(
        node.parent, match_opts
      )

      # When the pretty-print-side flag is active (set by opts_for_side in
      # ChildComparison.compare), drop whitespace-only text nodes that start
      # with "\n" inside :collapse elements — they are structural indentation
      # from the pretty-printer, not content. Space-only nodes (no initial "\n") are
      # real inline content and are kept for normalised comparison.
      # :preserve elements are always left unchanged.
      if match_opts[:_pretty_print_side_active]
        ws_class = WhitespaceSensitivity.classify_text_node(node, opts)
        return true if ws_class == :collapse && node_text(node).start_with?("\n")
      end

      false
    end
.node_text(node) ⇒ String
Get text content from a node
    # File 'lib/canon/comparison/markup_comparator.rb', line 259

    def node_text(node)
      # Canon::Xml::Node TextNode uses .value
      if node.respond_to?(:value)
        node.value.to_s
      # Nokogiri nodes use .content
      elsif node.respond_to?(:content)
        node.content.to_s
      else
        node.to_s
      end
    end
.same_node_type?(node1, node2) ⇒ Boolean
Check if two nodes are the same type
    # File 'lib/canon/comparison/markup_comparator.rb', line 208

    def same_node_type?(node1, node2)
      return false if node1.class != node2.class

      # For Nokogiri/Canon::Xml nodes, check node type
      if node1.respond_to?(:node_type) && node2.respond_to?(:node_type)
        node1.node_type == node2.node_type
      else
        true
      end
    end
.serialize_element_node(node) ⇒ String
Serialize an element node to string
    # File 'lib/canon/comparison/markup_comparator.rb', line 409

    def serialize_element_node(node)
      attrs = node.attribute_nodes.map do |a|
        " #{a.name}=\"#{a.value}\""
      end.join

      children_xml = node.children.map { |c| serialize_node(c) }.join

      if children_xml.empty?
        "<#{node.name}#{attrs}/>"
      else
        "<#{node.name}#{attrs}>#{children_xml}</#{node.name}>"
      end
    end
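The recursive shape of the serialization is easier to see with small stand-in objects. The sketch below assumes three hypothetical structs (`SketchAttr`, `SketchText`, `SketchElement`) in place of the `Canon::Xml::Nodes` classes and handles only elements and text, so it is an illustration of the pattern rather than the library's serializer.

```ruby
# Stand-ins for Canon::Xml node classes (hypothetical, for illustration).
SketchAttr    = Struct.new(:name, :value)
SketchText    = Struct.new(:value)
SketchElement = Struct.new(:name, :attribute_nodes, :children)

def serialize(node)
  case node
  when SketchElement
    attrs = node.attribute_nodes.map { |a| " #{a.name}=\"#{a.value}\"" }.join
    inner = node.children.map { |c| serialize(c) }.join
    # Elements with no serialized children use the self-closing form
    if inner.empty?
      "<#{node.name}#{attrs}/>"
    else
      "<#{node.name}#{attrs}>#{inner}</#{node.name}>"
    end
  when SketchText
    node.value
  end
end

br = SketchElement.new("br", [], [])
para = SketchElement.new(
  "p",
  [SketchAttr.new("class", "lead")],
  [SketchText.new("Hi"), br],
)

puts serialize(br)   # <br/>
puts serialize(para) # <p class="lead">Hi<br/></p>
```

An element whose children all serialize to empty strings also takes the self-closing form, since the check is made on the joined child output rather than on the child count.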
.serialize_node(node) ⇒ String?
Serialize a node to string for display
    # File 'lib/canon/comparison/markup_comparator.rb', line 86

    def serialize_node(node)
      return nil if node.nil?

      # Canon::Xml::Node types
      if node.is_a?(Canon::Xml::Nodes::RootNode)
        # Serialize all children of root
        node.children.map { |child| serialize_node(child) }.join
      elsif node.is_a?(Canon::Xml::Nodes::ElementNode)
        serialize_element_node(node)
      elsif node.is_a?(Canon::Xml::Nodes::TextNode)
        # Use original text (with entity references) if available,
        # otherwise fall back to value (decoded text)
        node.original || node.value
      elsif node.is_a?(Canon::Xml::Nodes::CommentNode)
        "<!--#{node.value}-->"
      elsif node.is_a?(Canon::Xml::Nodes::ProcessingInstructionNode)
        "<?#{node.target} #{node.data}?>"
      elsif node.respond_to?(:to_xml)
        node.to_xml
      elsif node.respond_to?(:to_html)
        node.to_html
      else
        node.to_s
      end
    end
.text_node?(node) ⇒ Boolean
Check if a node is a text node
    # File 'lib/canon/comparison/markup_comparator.rb', line 249

    def text_node?(node)
      (node.respond_to?(:text?) && node.text? &&
        !node.respond_to?(:element?)) ||
        (node.respond_to?(:node_type) && node.node_type == :text)
    end
.truncate_text(text, max_length = 40) ⇒ String
Truncate text for display in reason messages
    # File 'lib/canon/comparison/markup_comparator.rb', line 396

    def truncate_text(text, max_length = 40)
      return "" if text.nil?

      text = text.to_s
      return text if text.length <= max_length

      "#{text[0...max_length]}..."
    end
.whitespace_only_difference?(text1, text2) ⇒ Boolean
Check if difference between two texts is only whitespace
    # File 'lib/canon/comparison/markup_comparator.rb', line 276

    def whitespace_only_difference?(text1, text2)
      # Normalize both texts (collapse/trim whitespace)
      norm1 = MatchOptions.normalize_text(text1)
      norm2 = MatchOptions.normalize_text(text2)

      # If normalized texts are the same, the difference was only whitespace
      norm1 == norm2
    end
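The behaviour depends on what `MatchOptions.normalize_text` does, which this page does not show. The sketch below assumes, based only on the "collapse/trim whitespace" comment, that normalization replaces each run of whitespace with a single space and strips the ends; the local `normalize_text` is a hypothetical stand-in and may differ from Canon's implementation.

```ruby
# Assumed normalization: collapse whitespace runs to one space and trim.
# MatchOptions.normalize_text may differ in detail.
def normalize_text(text)
  text.to_s.gsub(/\s+/, " ").strip
end

def whitespace_only_difference?(text1, text2)
  normalize_text(text1) == normalize_text(text2)
end

puts whitespace_only_difference?("Hello   world\n", " Hello world") # true
puts whitespace_only_difference?("Hello world", "Hello, world")     # false
```

Under this assumption, two identical strings also return true, so a caller would normally use the check only after establishing that the raw texts differ.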