Class: Canon::Comparison::MarkupComparator
- Inherits: Object
  - Object
  - Canon::Comparison::MarkupComparator
- Defined in: lib/canon/comparison/markup_comparator.rb
Overview
Base class for markup document comparison (XML, HTML)
Provides shared comparison functionality for markup documents, including node type checking, text extraction, filtering, and difference creation.
Format-specific comparators (XmlComparator, HtmlComparator) inherit from this class and add format-specific behavior.
Direct Known Subclasses
Class Method Summary
- .add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object
  Add a difference to the differences array.
- .build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ Object
  Build a human-readable reason for a difference.
- .comment_node?(node) ⇒ Boolean
  Check if a node is a comment node.
- .determine_node_dimension(node) ⇒ Symbol
  Determine the appropriate dimension for a node type.
- .extract_attributes(node) ⇒ Hash?
  Extract attributes from a node.
- .extract_text_content_from_node(node) ⇒ String?
  Extract text content from a node for diff reason.
- .filter_children(children, opts) ⇒ Array
  Filter children based on options.
- .node_excluded?(node, opts) ⇒ Boolean
  Check if node should be excluded from comparison.
- .node_text(node) ⇒ String
  Get text content from a node.
- .same_node_type?(node1, node2) ⇒ Boolean
  Check if two nodes are the same type.
- .serialize_node(node) ⇒ String?
  Serialize a node to string for display.
- .text_node?(node) ⇒ Boolean
  Check if a node is a text node.
- .truncate_text(text, max_length = 40) ⇒ Object
  Truncate text for display in reason messages.
- .whitespace_only_difference?(text1, text2) ⇒ Boolean
  Check if difference between two texts is only whitespace.
Class Method Details
.add_difference(node1, node2, diff1, diff2, dimension, _opts, differences) ⇒ Object
Add a difference to the differences array.
Delegates to DiffNodeBuilder, the single DiffNode factory for the DOM comparison path.
    # File 'lib/canon/comparison/markup_comparator.rb', line 19

    def add_difference(node1, node2, diff1, diff2, dimension, _opts, differences)
      differences << Canon::Comparison::DiffNodeBuilder.build(
        node1: node1, node2: node2,
        diff1: diff1, diff2: diff2,
        dimension: dimension
      )
    end
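The method mutates the array the caller passes in rather than returning a new one. A minimal sketch of that accumulate-by-delegation pattern, assuming a hypothetical SketchDiffBuilder that returns a plain Hash (the real DiffNodeBuilder.build return type is not shown on this page, and the diff1/diff2 symbols below are placeholders):

```ruby
# Hypothetical stand-in for Canon::Comparison::DiffNodeBuilder.
# It returns a plain Hash; the real builder's return type is not shown here.
module SketchDiffBuilder
  def self.build(node1:, node2:, diff1:, diff2:, dimension:)
    { node1: node1, node2: node2, diff1: diff1, diff2: diff2,
      dimension: dimension }
  end
end

# Same shape as add_difference: build one record and append it to the
# caller-owned array. The _opts argument is accepted but unused.
def sketch_add_difference(node1, node2, diff1, diff2, dimension, _opts,
                          differences)
  differences << SketchDiffBuilder.build(
    node1: node1, node2: node2,
    diff1: diff1, diff2: diff2,
    dimension: dimension
  )
end

differences = []
# :removed and :added are placeholder values chosen for illustration.
sketch_add_difference("<a/>", "<b/>", :removed, :added,
                      :element_structure, {}, differences)
puts differences.size
```

Because the array is shared, a caller can thread one differences array through many comparison calls and inspect it once at the end.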
.build_difference_reason(node1, node2, diff1, diff2, dimension) ⇒ Object
Build a human-readable reason for a difference
Delegates to DiffNodeBuilder for consistency.
    # File 'lib/canon/comparison/markup_comparator.rb', line 197

    def build_difference_reason(node1, node2, diff1, diff2, dimension)
      Canon::Comparison::DiffNodeBuilder.build_reason(
        node1, node2, diff1, diff2, dimension
      )
    end
.comment_node?(node) ⇒ Boolean
Check if a node is a comment node
For XML/XHTML, this checks the node’s comment? method or node_type. For HTML, it also checks TEXT nodes that contain HTML-style comments: in full HTML documents, Nokogiri parses HTML comments as TEXT nodes with content like “<!-- comment -->” or, when escaped, “<\!-- comment -->”.

    # File 'lib/canon/comparison/markup_comparator.rb', line 160

    def comment_node?(node)
      NodeInspector.comment_node?(node)
    end
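The HTML case described above amounts to recognising comment syntax inside text content. A rough sketch of such a check, assuming a hypothetical helper and regular expression (NodeInspector's actual implementation is not shown on this page and may differ):

```ruby
# Hypothetical pattern, not taken from Canon: text that consists solely of
# an HTML-style comment, with an optional backslash after "<" to cover the
# escaped form.
SKETCH_HTML_COMMENT = /\A\s*<\\?!--.*-->\s*\z/m

# Returns true when the given text content looks like an HTML comment.
def sketch_comment_text?(text)
  text.to_s.match?(SKETCH_HTML_COMMENT)
end

puts sketch_comment_text?("<!-- comment -->")
puts sketch_comment_text?("<\\!-- comment -->")
puts sketch_comment_text?("plain text")
```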
.determine_node_dimension(node) ⇒ Symbol
Determine the appropriate dimension for a node type
Used by ChildComparison to tag per-child orphan diffs with a dimension that matches what the node is, so the formatter renders correctly. An element orphan tagged :text_content would otherwise route through PR #126’s one-sided text formatter and render as text “” instead of as the actual element (see lutaml/canon#125 follow-up).
    # File 'lib/canon/comparison/markup_comparator.rb', line 227

    def determine_node_dimension(node)
      case node
      when Canon::Xml::Node
        case node.node_type
        when :element then :element_structure
        when :comment then :comments
        when :text, :cdata then :text_content
        when :processing_instruction then :processing_instructions
        else :text_content
        end
      when Nokogiri::XML::Node
        if node.comment?
          :comments
        elsif node.text? || node.cdata?
          :text_content
        elsif node.processing_instruction?
          :processing_instructions
        elsif node.element?
          :element_structure
        else
          :text_content
        end
      else
        :text_content
      end
    end
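The Canon::Xml::Node branch of the listing is effectively a lookup table with :text_content as the fallback. A minimal sketch of that mapping, assuming a hypothetical SketchTypedNode struct in place of Canon::Xml::Node (only node_type is modelled, and the Nokogiri branch is omitted):

```ruby
# Hypothetical stand-in for Canon::Xml::Node; only node_type is modelled.
SketchTypedNode = Struct.new(:node_type)

# Same node_type => dimension pairs as the Canon::Xml::Node branch.
SKETCH_DIMENSIONS = {
  element: :element_structure,
  comment: :comments,
  text: :text_content,
  cdata: :text_content,
  processing_instruction: :processing_instructions
}.freeze

# Unknown node types and unknown objects both fall back to :text_content.
def sketch_dimension(node)
  return :text_content unless node.respond_to?(:node_type)

  SKETCH_DIMENSIONS.fetch(node.node_type, :text_content)
end

puts sketch_dimension(SketchTypedNode.new(:element))
puts sketch_dimension(SketchTypedNode.new(:doctype))
```

Tagging an element orphan as :element_structure rather than the fallback is what keeps it away from the one-sided text formatter mentioned in the description.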
.extract_attributes(node) ⇒ Hash?
Extract attributes from a node
    # File 'lib/canon/comparison/markup_comparator.rb', line 41

    def extract_attributes(node)
      return nil if node.nil?

      Canon::Diff::NodeSerializer.extract_attributes(node)
    end
.extract_text_content_from_node(node) ⇒ String?
Extract text content from a node for diff reason
    # File 'lib/canon/comparison/markup_comparator.rb', line 207

    def extract_text_content_from_node(node)
      Canon::Comparison::DiffNodeBuilder.extract_text_content(node)
    end
.filter_children(children, opts) ⇒ Array
Filter children based on options
Removes nodes that should be excluded from comparison based on options like :ignore_nodes, :ignore_comments, etc.
    # File 'lib/canon/comparison/markup_comparator.rb', line 55

    def filter_children(children, opts)
      children.reject do |child|
        node_excluded?(child, opts)
      end
    end
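To illustrate how the filter behaves, here is a small sketch that models only the three simplest exclusion options (:ignore_nodes, :ignore_comments, :ignore_text_nodes). SketchChild and its kind field are hypothetical; the real predicate works on parsed nodes and applies further match-option rules:

```ruby
# Hypothetical node stand-in: kind is :element, :comment or :text.
SketchChild = Struct.new(:kind, :value)

# Simplified exclusion predicate covering only the first three checks
# of node_excluded?; match_opts handling is left out.
def sketch_excluded?(node, opts)
  return false if node.nil?
  return true if opts[:ignore_nodes]&.include?(node)
  return true if opts[:ignore_comments] && node.kind == :comment
  return true if opts[:ignore_text_nodes] && node.kind == :text

  false
end

# Same structure as filter_children: reject whatever the predicate excludes.
def sketch_filter_children(children, opts)
  children.reject { |child| sketch_excluded?(child, opts) }
end

children = [
  SketchChild.new(:element, "p"),
  SketchChild.new(:comment, "note"),
  SketchChild.new(:text, "hello")
]
kept = sketch_filter_children(children, { ignore_comments: true })
puts kept.map(&:kind).inspect
```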
.node_excluded?(node, opts) ⇒ Boolean
Check if node should be excluded from comparison
    # File 'lib/canon/comparison/markup_comparator.rb', line 66

    def node_excluded?(node, opts)
      return false if node.nil?
      return true if opts[:ignore_nodes]&.include?(node)
      return true if opts[:ignore_comments] && comment_node?(node)
      return true if opts[:ignore_text_nodes] && text_node?(node)

      # Check match options
      match_opts = opts[:match_opts]
      return false unless match_opts

      # Filter comments based on match options and format
      # HTML: Filter comments to avoid spurious differences from zip pairing
      # BUT only when not in verbose mode (verbose needs differences recorded)
      # XML: Don't filter comments (allow informative differences to be recorded)
      if match_opts[:comments] == :ignore && comment_node?(node)
        # In verbose mode, don't filter comments - we want to record the differences
        return false if opts[:verbose]

        # Only filter comments for HTML, not XML (when not verbose)
        format = opts[:format] || match_opts[:format]
        if %i[html html4 html5].include?(format)
          return true
        end
      end

      # Strip whitespace-only text nodes based on parent element configuration.
      # Use preserve_whitespace_elements / strip_whitespace_elements to control.
      # Blacklist (strip) > preserve > collapse > format defaults.
      return false unless text_node?(node) && node.parent
      return false unless MatchOptions.normalize_text(node_text(node)).empty?

      # NBSP (U+00A0) is never insignificant whitespace;
      # it always renders as a visible non-breaking space.
      # For HTML: always preserve NBSP nodes.
      # For XML with whitespace_type: :strict: preserve NBSP nodes so
      # different Unicode whitespace types remain distinguishable.
      format = opts[:format] || match_opts[:format]
      whitespace_type = match_opts[:whitespace_type] || :strict
      if (%i[html html4 html5].include?(format) || whitespace_type == :strict) &&
          WhitespaceSensitivity.contains_nbsp?(node_text(node))
        return false
      end

      if %i[html html4 html5].include?(format) &&
          WhitespaceSensitivity.inline_whitespace_significant?(node)
        # Whitespace between inline element siblings is semantically
        # significant (renders as a visible gap) and must not be stripped.
        return false
      end

      return true unless WhitespaceSensitivity.whitespace_preserved?(
        node.parent, match_opts
      )

      # When the pretty-print-side flag is active (set by opts_for_side in
      # ChildComparison.compare), drop whitespace-only text nodes that start
      # with "\n" inside :collapse elements. They are structural indentation
      # from the pretty-printer, not content. Space-only nodes (no initial "\n")
      # are real inline content and are kept for normalised comparison.
      # :preserve elements are always left unchanged.
      if match_opts[:_pretty_print_side_active]
        ws_class = WhitespaceSensitivity.classify_text_node(node, opts)
        return true if ws_class == :collapse && node_text(node).start_with?("\n")
      end

      false
    end
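The last rule in the listing separates pretty-printer indentation from real inline spacing. A minimal sketch of just that rule, assuming a hypothetical helper that receives the text, the whitespace class and the flag directly (the real method derives these from the node and opts, and runs several earlier checks first):

```ruby
# Hypothetical helper, not part of Canon. Models only the final
# pretty-print rule: whitespace-only text that starts with a newline
# inside a :collapse element is treated as indentation and dropped.
def sketch_drop_as_indentation?(text, ws_class:, pretty_print_side:)
  return false unless pretty_print_side       # flag off: rule inactive
  return false unless text.match?(/\A\s+\z/)  # must be whitespace-only

  ws_class == :collapse && text.start_with?("\n")
end

# Newline plus indent in a :collapse element: dropped.
puts sketch_drop_as_indentation?("\n    ", ws_class: :collapse,
                                           pretty_print_side: true)
# A single inline space: kept for normalised comparison.
puts sketch_drop_as_indentation?(" ", ws_class: :collapse,
                                      pretty_print_side: true)
```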
.node_text(node) ⇒ String
Get text content from a node
    # File 'lib/canon/comparison/markup_comparator.rb', line 176

    def node_text(node)
      NodeInspector.text_content(node)
    end
.same_node_type?(node1, node2) ⇒ Boolean
Check if two nodes are the same type
    # File 'lib/canon/comparison/markup_comparator.rb', line 140

    def same_node_type?(node1, node2)
      return false if node1.class != node2.class

      case node1
      when Canon::Xml::Node, Nokogiri::XML::Node
        node1.node_type == node2.node_type
      else
        true
      end
    end
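The check is two-staged: the Ruby classes must match first, and node_type is compared only for known node classes. A small sketch under the assumption that a hypothetical SketchXmlNode struct stands in for Canon::Xml::Node and Nokogiri::XML::Node:

```ruby
# Hypothetical stand-ins: SketchXmlNode plays the role of a known node
# class, SketchOtherNode of an unrelated class with the same field.
SketchXmlNode = Struct.new(:node_type)
SketchOtherNode = Struct.new(:node_type)

def sketch_same_node_type?(node1, node2)
  # Stage 1: different Ruby classes are never the same type.
  return false if node1.class != node2.class

  # Stage 2: known node classes compare node_type; anything else with
  # matching classes is treated as the same type.
  case node1
  when SketchXmlNode
    node1.node_type == node2.node_type
  else
    true
  end
end

puts sketch_same_node_type?(SketchXmlNode.new(:text), SketchXmlNode.new(:text))
puts sketch_same_node_type?(SketchXmlNode.new(:text), SketchXmlNode.new(:comment))
puts sketch_same_node_type?(SketchXmlNode.new(:text), SketchOtherNode.new(:text))
```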
.serialize_node(node) ⇒ String?
Serialize a node to string for display
    # File 'lib/canon/comparison/markup_comparator.rb', line 31

    def serialize_node(node)
      return nil if node.nil?

      Canon::Diff::NodeSerializer.serialize(node)
    end
.text_node?(node) ⇒ Boolean
Check if a node is a text node
    # File 'lib/canon/comparison/markup_comparator.rb', line 168

    def text_node?(node)
      NodeInspector.text_node?(node)
    end
.truncate_text(text, max_length = 40) ⇒ Object
Truncate text for display in reason messages
    # File 'lib/canon/comparison/markup_comparator.rb', line 212

    def truncate_text(text, max_length = 40)
      Canon::Comparison::DiffNodeBuilder.truncate(text, max_length)
    end
.whitespace_only_difference?(text1, text2) ⇒ Boolean
Check if difference between two texts is only whitespace
    # File 'lib/canon/comparison/markup_comparator.rb', line 185

    def whitespace_only_difference?(text1, text2)
      # Normalize both texts (collapse/trim whitespace)
      norm1 = MatchOptions.normalize_text(text1)
      norm2 = MatchOptions.normalize_text(text2)

      # If normalized texts are the same, the difference was only whitespace
      norm1 == norm2
    end
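A brief sketch of the comparison, assuming that normalization means collapsing whitespace runs to a single space and trimming the ends, as the code comment suggests. The sketch_normalize helper is hypothetical and MatchOptions.normalize_text may behave differently in detail:

```ruby
# Assumed normalization: collapse whitespace runs to one space, then trim.
# This is a guess at what MatchOptions.normalize_text does.
def sketch_normalize(text)
  text.to_s.gsub(/\s+/, " ").strip
end

# Two texts differ only in whitespace when their normalized forms match.
def sketch_whitespace_only_difference?(text1, text2)
  sketch_normalize(text1) == sketch_normalize(text2)
end

puts sketch_whitespace_only_difference?("Hello   world\n", " Hello world")
puts sketch_whitespace_only_difference?("Helloworld", "Hello world")
```

As in the original method, identical inputs also return true, so callers would typically use this only after a raw comparison has already found a difference.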