Module: Canon::Comparison
Defined in:
lib/canon/comparison.rb,
lib/canon/comparison/dimensions.rb,
lib/canon/comparison/xml_parser.rb,
lib/canon/comparison/html_parser.rb,
lib/canon/comparison/json_parser.rb,
lib/canon/comparison/match_options.rb,
lib/canon/comparison/node_inspector.rb,
lib/canon/comparison/xml_comparator.rb,
lib/canon/comparison/base_comparator.rb,
lib/canon/comparison/compare_profile.rb,
lib/canon/comparison/format_detector.rb,
lib/canon/comparison/html_comparator.rb,
lib/canon/comparison/json_comparator.rb,
lib/canon/comparison/yaml_comparator.rb,
lib/canon/comparison/child_realignment.rb,
lib/canon/comparison/comparison_result.rb,
lib/canon/comparison/markup_comparator.rb,
lib/canon/comparison/profile_definition.rb,
lib/canon/comparison/dimensions/registry.rb,
lib/canon/comparison/xml_node_comparison.rb,
lib/canon/comparison/html_compare_profile.rb,
lib/canon/comparison/ruby_object_comparator.rb,
lib/canon/comparison/whitespace_sensitivity.rb,
lib/canon/comparison/dimensions/base_dimension.rb,
lib/canon/comparison/match_options/xml_resolver.rb,
lib/canon/comparison/xml_comparator/node_parser.rb,
lib/canon/comparison/match_options/base_resolver.rb,
lib/canon/comparison/match_options/json_resolver.rb,
lib/canon/comparison/match_options/yaml_resolver.rb,
lib/canon/comparison/dimensions/comments_dimension.rb,
lib/canon/comparison/strategies/base_match_strategy.rb,
lib/canon/comparison/xml_comparator/attribute_filter.rb,
lib/canon/comparison/xml_comparator/child_comparison.rb,
lib/canon/comparison/xml_comparator/diff_node_builder.rb,
lib/canon/comparison/dimensions/text_content_dimension.rb,
lib/canon/comparison/strategies/match_strategy_factory.rb,
lib/canon/comparison/xml_comparator/attribute_comparator.rb,
lib/canon/comparison/xml_comparator/namespace_comparator.rb,
lib/canon/comparison/xml_comparator/node_type_comparator.rb,
lib/canon/comparison/dimensions/attribute_order_dimension.rb,
lib/canon/comparison/dimensions/attribute_values_dimension.rb,
lib/canon/comparison/dimensions/element_position_dimension.rb,
lib/canon/comparison/dimensions/attribute_presence_dimension.rb,
lib/canon/comparison/strategies/semantic_tree_match_strategy.rb,
lib/canon/comparison/dimensions/structural_whitespace_dimension.rb
Overview
Comparison module for XML, HTML, JSON, and YAML documents
This module provides a unified comparison API for multiple serialization formats. It auto-detects the format and delegates to specialized comparators while maintaining a CompareXML-compatible API.
Supported Formats
- XML: Uses Moxml for parsing, supports namespaces
- HTML: Uses Nokogiri, handles HTML4/HTML5 differences
- JSON: Direct Ruby object comparison with deep equality
- YAML: Parses to Ruby objects, compares semantically
Format Detection
The module automatically detects format from:
- Object type (Moxml::Node, Nokogiri::HTML::Document, Hash, Array)
- String content (DOCTYPE, opening tags, YAML/JSON syntax)
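As a rough illustration of the two detection routes listed above, here is a minimal sketch. It is not Canon's FormatDetector: the method names, the specific string heuristics, and the :unknown fallback are all assumptions made for this example, and the real detector may order or refine its checks differently.

```ruby
# Hypothetical sketch of type-based, then content-based, format detection.
# Nothing here is Canon's actual implementation.
module FormatDetectionSketch
  module_function

  # Route by object type first, then fall back to string inspection.
  def detect(obj)
    case obj
    when Hash, Array
      :ruby_object
    when String
      detect_string(obj)
    else
      :unknown
    end
  end

  # Inspect the leading characters of a string (assumed heuristics).
  def detect_string(str)
    head = str.lstrip
    return :html if head.match?(/\A<!DOCTYPE\s+html/i) || head.match?(/\A<html[\s>]/i)
    return :xml  if head.start_with?("<")
    return :json if head.start_with?("{", "[")

    :yaml
  end
end

puts FormatDetectionSketch.detect('<?xml version="1.0"?><root/>')
puts FormatDetectionSketch.detect('{"a": 1}')
```

A heuristic like this misclassifies some inputs (for example YAML flow sequences that start with "["), which is one reason the documented API also accepts an explicit format option.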
Comparison Options
Common options across all formats:
- profile: Comparison profile (Symbol for preset, Hash for custom)
  - Presets: :strict, :rendered, :html4, :html5, :spec_friendly, :content_only
  - Custom: { text_content: :normalize, comments: :ignore, … }
- diff_algorithm: Algorithm to use (:dom or :semantic, default: :dom)
- verbose: Return detailed diff array (default: false)
Usage Examples
# XML comparison with default profile
Canon::Comparison.equivalent?(xml1, xml2)
# XML comparison with preset profile
Canon::Comparison.equivalent?(xml1, xml2, profile: :strict)
Canon::Comparison.equivalent?(xml1, xml2, profile: :spec_friendly)
# HTML comparison with custom inline profile
Canon::Comparison.equivalent?(html1, html2,
  profile: { text_content: :normalize, comments: :ignore })
# Define and use a custom profile
Canon::Comparison.define_profile(:my_custom) do
  text_content :normalize
  comments :ignore
  preprocessing :rendered
end
Canon::Comparison.equivalent?(doc1, doc2, profile: :my_custom)
# JSON comparison with semantic tree diff
Canon::Comparison.equivalent?(json1, json2,
  diff_algorithm: :semantic, profile: :spec_friendly)
# With detailed output
diffs = Canon::Comparison.equivalent?(doc1, doc2, verbose: true)
diffs.each { |diff| puts diff.inspect }
XML Declaration Handling
XML declarations (<?xml version="1.0" encoding="UTF-8"?>) are stripped during preprocessing for semantic comparison. This means:
- Documents with and without declarations are considered equivalent
- Declaration encoding differences are ignored
- Entity declarations within DTD are resolved before comparison
This behavior ensures documents are compared by their content, not their serialization format.
Return Values
- When verbose: false (default) → Boolean (true if equivalent)
- When verbose: true → Array of difference hashes with details
Difference Hash Format
Each difference contains:
- node1, node2: The nodes being compared (XML/HTML)
- diff1, diff2: Comparison result codes
Or, for JSON/YAML:
- path: String path to the difference (e.g., “user.address.city”)
- value1, value2: The differing values
- diff_code: Type of difference
Defined Under Namespace
Modules: BaseComparator, ChildRealignment, Dimensions, MatchOptions, NodeInspector, RubyObjectComparator, Strategies, WhitespaceSensitivity, XmlComparatorHelpers, XmlNodeComparison
Classes: CompareProfile, ComparisonResult, DiffNodeBuilder, FormatDetector, HtmlComparator, HtmlCompareProfile, HtmlParser, JsonComparator, JsonParser, MarkupComparator, ProfileDefinition, ProfileError, ResolvedMatchOptions, XmlComparator, XmlParser, YamlComparator
Constant Summary
Comparison result constants:
- EQUIVALENT = 1
- MISSING_ATTRIBUTE = 2
- MISSING_NODE = 3
- UNEQUAL_ATTRIBUTES = 4
- UNEQUAL_COMMENTS = 5
- UNEQUAL_DOCUMENTS = 6
- UNEQUAL_ELEMENTS = 7
- UNEQUAL_NODES_TYPES = 8
- UNEQUAL_TEXT_CONTENTS = 9
- MISSING_HASH_KEY = 10
- UNEQUAL_HASH_VALUES = 11
- UNEQUAL_HASH_KEY_ORDER = 12
- UNEQUAL_ARRAY_LENGTHS = 13
- UNEQUAL_ARRAY_ELEMENTS = 14
- UNEQUAL_TYPES = 15
- UNEQUAL_PRIMITIVES = 16
- CODE_LABELS =
  Human-readable labels for the integer comparison-result constants above. Used by the diff reason builders so user-facing reason text never leaks raw numeric codes (e.g. “7 vs 7”; see lutaml/canon#127). String diff codes (e.g. “position 3” emitted by ChildComparison) pass through code_label unchanged.
  {
    EQUIVALENT => "equivalent",
    MISSING_ATTRIBUTE => "missing attribute",
    MISSING_NODE => "missing",
    UNEQUAL_ATTRIBUTES => "attributes differ",
    UNEQUAL_COMMENTS => "comments differ",
    UNEQUAL_DOCUMENTS => "documents differ",
    UNEQUAL_ELEMENTS => "elements differ",
    UNEQUAL_NODES_TYPES => "node types differ",
    UNEQUAL_TEXT_CONTENTS => "text content differs",
    MISSING_HASH_KEY => "missing hash key",
    UNEQUAL_HASH_VALUES => "hash values differ",
    UNEQUAL_HASH_KEY_ORDER => "hash key order differs",
    UNEQUAL_ARRAY_LENGTHS => "array lengths differ",
    UNEQUAL_ARRAY_ELEMENTS => "array elements differ",
    UNEQUAL_TYPES => "types differ",
    UNEQUAL_PRIMITIVES => "primitives differ",
  }.freeze
Class Method Summary
- .available_profiles ⇒ Array<Symbol>
  List all available profiles (custom + presets).
- .code_label(code) ⇒ String
  Translate a comparison result code (Integer constant or String label like “position 3”) into a human-readable reason fragment.
- .code_pair_label(diff1, diff2) ⇒ String
  Build a “diff1 [vs diff2]” reason fragment that never leaks raw integer constants.
- .decode_html_entities(str) ⇒ String
  Decode HTML named entities (&nbsp; etc.) to their numeric character reference equivalents so that Nokogiri::XML.fragment (which only understands the five XML entities) preserves them as text nodes instead of silently dropping them.
- .define_profile(name) {|ProfileDefinition| ... } ⇒ Symbol
  Define a custom comparison profile with DSL syntax.
- .detect_format(obj) ⇒ Symbol
  Detect the format of an object (delegates to FormatDetector).
- .detect_string_format(str) ⇒ Symbol
  Detect the format of a string (delegates to FormatDetector).
- .dom_diff(obj1, obj2, opts = {}) ⇒ Object
  Perform DOM-based comparison (original behavior).
- .equivalent?(obj1, obj2, opts = {}) ⇒ Boolean, Array
  Auto-detect format and compare two objects.
- .extract_original_string(obj, _format = nil) ⇒ String
  Extract original string from various input types. This preserves the original formatting without minification.
- .format_from_opts(opts) ⇒ Symbol
  Helper to extract format from opts for validation.
- .load_profile(name) ⇒ Hash
  Load a profile (custom or preset).
- .normalize_format_for_tree_diff(format) ⇒ Symbol
  Normalize format for TreeDiff (html4/html5 -> html).
- .parse_errors_for(node) ⇒ Array<String>
  Extract parse-time errors from a parsed-tree or Nokogiri fragment.
- .parse_html(content, format) ⇒ Object
  Parse HTML string into Nokogiri document (delegates to HtmlParser).
- .parse_with_cache(doc, format, preprocessing) { ... } ⇒ Object
  Parse a document with caching.
- .parse_with_comparator(obj1, obj2, format, match_opts_hash) ⇒ Array<Object, Object>
  Parse documents using comparator’s parse logic (reuses preprocessing).
- .process_profile_parameter(opts) ⇒ Hash
  Process unified profile parameter.
- .resolve_match_options(format, opts) ⇒ Hash
  Resolve match options for a format.
- .semantic_diff(obj1, obj2, opts = {}) ⇒ Object
  Perform semantic tree diff comparison.
- .serialize_document(doc, format) ⇒ Object
  Serialize document back to string.
- .strip_xml_preamble(str) ⇒ Object
  Strip XML declarations and DOCTYPE preambles from an HTML string so it can be safely parsed with Nokogiri::XML.fragment without generating processing-instruction nodes.
- .summarize(obj1, obj2, opts = {}) ⇒ String
  Summarize the first difference between two documents.
- .valid_dimensions_for_format(format) ⇒ Array<Symbol>
  Get valid dimensions for a format.
- .validate_custom_profile!(profile, format) ⇒ Object
  Validate custom profile hash.
Class Method Details
.available_profiles ⇒ Array<Symbol>
List all available profiles (custom + presets)
# File 'lib/canon/comparison.rb', line 281
def available_profiles
  custom = @custom_profiles&.keys || []
  presets = MatchOptions::Xml::MATCH_PROFILES.keys
  (custom + presets).sort.uniq
end
.code_label(code) ⇒ String
Translate a comparison result code (Integer constant or String label like “position 3”) into a human-readable reason fragment. Unknown values pass through via to_s as a defensive fallback.
# File 'lib/canon/comparison.rb', line 157
def self.code_label(code)
  return code if code.is_a?(String)

  CODE_LABELS[code] || code.to_s
end
.code_pair_label(diff1, diff2) ⇒ String
Build a “diff1 [vs diff2]” reason fragment that never leaks raw integer constants. When both codes are equal, returns the single label (e.g. “elements differ”) rather than “elements differ vs elements differ”. See lutaml/canon#127.
# File 'lib/canon/comparison.rb', line 171
def self.code_pair_label(diff1, diff2)
  return code_label(diff1) if diff1 == diff2

  "#{code_label(diff1)} vs #{code_label(diff2)}"
end
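To make the behavior of code_label and code_pair_label concrete, the sketch below reproduces their logic in a standalone module so it runs without the gem. The module name and the trimmed-down label table are assumptions for illustration; the two integer values are taken from the Constant Summary.

```ruby
# Standalone sketch of the label helpers; LabelSketch and its reduced
# LABELS table are hypothetical stand-ins for Canon::Comparison.
module LabelSketch
  UNEQUAL_ELEMENTS = 7
  UNEQUAL_NODES_TYPES = 8

  LABELS = {
    UNEQUAL_ELEMENTS => "elements differ",
    UNEQUAL_NODES_TYPES => "node types differ",
  }.freeze

  module_function

  # Strings pass through; unknown codes fall back to to_s.
  def code_label(code)
    return code if code.is_a?(String)

    LABELS[code] || code.to_s
  end

  # Collapse identical codes into a single label.
  def code_pair_label(diff1, diff2)
    return code_label(diff1) if diff1 == diff2

    "#{code_label(diff1)} vs #{code_label(diff2)}"
  end
end

puts LabelSketch.code_pair_label(7, 7)            # elements differ
puts LabelSketch.code_pair_label(7, 8)            # elements differ vs node types differ
puts LabelSketch.code_pair_label(7, "position 3") # elements differ vs position 3
```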
.decode_html_entities(str) ⇒ String
Decode HTML named entities (&nbsp; etc.) to their numeric character reference equivalents so that Nokogiri::XML.fragment (which only understands the five XML entities) preserves them as text nodes instead of silently dropping them.
Uses Nokogiri’s HTML4 parser to resolve the entities — the text is extracted from a fragment so no structural tags are added.
# File 'lib/canon/comparison.rb', line 810
def decode_html_entities(str)
  # Fast path: skip if no ampersands present
  return str unless str.include?("&")

  # Parse as HTML fragment to resolve named entities, then
  # re-serialize as text. This converts &nbsp; → U+00A0, etc.
  doc = Nokogiri::HTML4.fragment(str)
  # Serialize back, preserving the resolved characters.
  # to_html re-encodes characters, so use inner_html which
  # keeps the character form.
  doc.inner_html
  # If the serialization re-encoded characters as entities,
  # that's fine — the XML parser understands numeric refs like &#160;
end
.define_profile(name) {|ProfileDefinition| ... } ⇒ Symbol
Define a custom comparison profile with DSL syntax
# File 'lib/canon/comparison.rb', line 250
def define_profile(name, &block)
  definition = ProfileDefinition.define(name, &block)

  @custom_profiles ||= {}
  @custom_profiles[name] = definition

  name
end
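The following is a minimal sketch of how a block-based profile DSL of this shape can be built. It is not the library's ProfileDefinition: the class and module names, the list of supported dimensions, and the use of instance_eval are assumptions chosen to mirror the usage example in the overview.

```ruby
# Hypothetical DSL builder; SketchProfileDefinition and SketchProfiles
# are illustrative names, not part of Canon.
class SketchProfileDefinition
  DIMENSIONS = %i[text_content comments preprocessing].freeze

  # Evaluate the block against a fresh builder and return its settings.
  def self.define(&block)
    builder = new
    builder.instance_eval(&block)
    builder.to_h
  end

  def initialize
    @settings = {}
  end

  # One setter-style DSL method per dimension, e.g. `comments :ignore`.
  DIMENSIONS.each do |dimension|
    define_method(dimension) { |behavior| @settings[dimension] = behavior }
  end

  def to_h
    @settings.dup
  end
end

module SketchProfiles
  @custom = {}

  # Store the built profile under its name and return the name.
  def self.define_profile(name, &block)
    @custom[name] = SketchProfileDefinition.define(&block)
    name
  end

  def self.load_profile(name)
    @custom.fetch(name).dup
  end
end

SketchProfiles.define_profile(:my_custom) do
  text_content :normalize
  comments :ignore
end
p SketchProfiles.load_profile(:my_custom)
```

Returning the profile name, as define_profile does, lets the result be passed straight to a profile: option.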
.detect_format(obj) ⇒ Symbol
Detect the format of an object (delegates to FormatDetector)
# File 'lib/canon/comparison.rb', line 831
def detect_format(obj)
  FormatDetector.detect(obj)
end
.detect_string_format(str) ⇒ Symbol
Detect the format of a string (delegates to FormatDetector)
# File 'lib/canon/comparison.rb', line 839
def detect_string_format(str)
  FormatDetector.detect_string(str)
end
.dom_diff(obj1, obj2, opts = {}) ⇒ Object
Perform DOM-based comparison (original behavior)
# File 'lib/canon/comparison.rb', line 685
def dom_diff(obj1, obj2, opts = {})
  # Use format hint if provided
  if opts[:format]
    format1 = format2 = opts[:format]
    # Parse HTML strings if format is html/html4/html5
    if %i[html html4 html5].include?(opts[:format])
      # Preserve original strings for display (HTML fragment
      # parsers can mutate the DOM).
      opts[:_original_str1] = obj1.dup if obj1.is_a?(String)
      opts[:_original_str2] = obj2.dup if obj2.is_a?(String)
      # Parse all HTML formats (:html, :html4, :html5) with
      # Nokogiri::HTML5 so that html4 and html5 share HTML's
      # whitespace-sensitivity semantics (issue #118).
      #
      # The previous html/html4 branch used Nokogiri::XML.fragment
      # to dodge Nokogiri::HTML4.fragment's destructive DOM
      # mutations. That avoided one problem but introduced a
      # bigger one: XML whitespace rules were being applied to
      # HTML content. HTML's content model — identical between
      # HTML4 and HTML5 — treats whitespace-only text between
      # block-level children as insignificant; XML treats every
      # whitespace text node as significant. Routing html4 input
      # through an XML parser therefore made
      # be_html4_equivalent_to reject inputs that
      # be_html5_equivalent_to (correctly) accepts.
      # Nokogiri::HTML5.fragment is non-destructive (the original
      # HTML4.fragment concern does not apply to it) and applies
      # HTML's content model uniformly.
      obj1 = HtmlParser.parse(obj1, :html5) if obj1.is_a?(String)
      obj2 = HtmlParser.parse(obj2, :html5) if obj2.is_a?(String)
    end
  else
    format1 = FormatDetector.detect(obj1)
    format2 = FormatDetector.detect(obj2)
  end

  # Handle string format (plain text comparison)
  if format1 == :string
    if opts[:verbose]
      return obj1.to_s == obj2.to_s ? [] : [:different]
    else
      return obj1.to_s == obj2.to_s
    end
  end

  # Allow comparing json/yaml strings with ruby objects
  # since they parse to the same structure
  formats_compatible = format1 == format2 ||
    (%i[json ruby_object].include?(format1) &&
     %i[json ruby_object].include?(format2)) ||
    (%i[yaml ruby_object].include?(format1) &&
     %i[yaml ruby_object].include?(format2))

  unless formats_compatible
    raise Canon::CompareFormatMismatchError.new(format1, format2)
  end

  # Normalize format for comparison
  comparison_format = case format1
                      when :ruby_object
                        # If comparing ruby_object with json/yaml, use that format
                        %i[json yaml].include?(format2) ? format2 : :json
                      else
                        format1
                      end

  # get match_profile if it is not defined in options
  # but defined in config
  if %i[xml html json yaml string].include?(comparison_format)
    format_config = Canon::Config.instance.public_send(comparison_format)
    if opts[:global_profile].nil? && format_config.match.profile
      # Config-sourced profile has *global* priority (applied before
      # global_options), so that YAML profile_options like
      # whitespace_type: :normalize can override the built-in profile
      # (e.g. :spec_friendly)'s whitespace_type: :strict. Writing to
      # :match_profile here gave the config profile per-call priority,
      # which incorrectly overrode the YAML's own overrides.
      opts[:global_profile] = format_config.match.profile
    end

    # Pass YAML profile's extra match options (e.g., preserve_whitespace_elements)
    # that are stored in MatchConfig's resolver but not exposed via the
    # built-in MATCH_PROFILES system. These supplement the built-in profile.
    profile_opts = format_config.match.
    if profile_opts.any? && opts[:global_options].nil?
      opts[:global_options] = profile_opts
    elsif profile_opts.any?
      # Merge: global_options already set (e.g., per-call) takes precedence
      opts[:global_options] = opts[:global_options].merge(profile_opts)
    end
  end

  case comparison_format
  when :xml
    XmlComparator.equivalent?(obj1, obj2, opts)
  when :html, :html4, :html5
    HtmlComparator.equivalent?(obj1, obj2, opts)
  when :json
    JsonComparator.equivalent?(obj1, obj2, opts)
  when :yaml
    YamlComparator.equivalent?(obj1, obj2, opts)
  end
end
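The format-compatibility rule and the comparison-format normalization inside dom_diff are easy to misread in the flattened listing, so the sketch below isolates just those two steps. The module and method names are hypothetical; the boolean logic is copied from the source shown above.

```ruby
# Isolated sketch of dom_diff's compatibility check; FormatCompatSketch
# is an illustrative name, not a Canon module.
module FormatCompatSketch
  module_function

  # Same format, or a Ruby object paired with JSON or with YAML.
  def compatible?(format1, format2)
    format1 == format2 ||
      (%i[json ruby_object].include?(format1) &&
       %i[json ruby_object].include?(format2)) ||
      (%i[yaml ruby_object].include?(format1) &&
       %i[yaml ruby_object].include?(format2))
  end

  # A Ruby object on the left adopts the right-hand format, else :json.
  def comparison_format(format1, format2)
    return format1 unless format1 == :ruby_object

    %i[json yaml].include?(format2) ? format2 : :json
  end
end

p FormatCompatSketch.compatible?(:json, :ruby_object)        # true
p FormatCompatSketch.compatible?(:json, :yaml)               # false
p FormatCompatSketch.comparison_format(:ruby_object, :yaml)  # :yaml
```

Note that JSON and YAML strings are not treated as compatible with each other, only with plain Ruby objects.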
.equivalent?(obj1, obj2, opts = {}) ⇒ Boolean, Array
Auto-detect format and compare two objects
# File 'lib/canon/comparison.rb', line 197
def equivalent?(obj1, obj2, opts = {})
  # Check if semantic tree diff is requested
  # Support both :semantic and :semantic_tree for backward compatibility
  if %i[semantic semantic_tree].include?(opts[:diff_algorithm])
    return semantic_diff(obj1, obj2, opts)
  end

  # Otherwise use DOM-based comparison (default)
  dom_diff(obj1, obj2, opts)
end
.extract_original_string(obj, _format = nil) ⇒ String
Extract original string from various input types. This preserves the original formatting without minification.
# File 'lib/canon/comparison.rb', line 646
def extract_original_string(obj, _format = nil)
  case obj
  when String
    obj
  when Nokogiri::XML::Document, Nokogiri::HTML::Document,
       Nokogiri::XML::DocumentFragment, Nokogiri::HTML::DocumentFragment
    obj.to_html
  else
    if Canon::XmlParsing.xml_node?(obj) || obj.is_a?(Canon::Xml::Node)
      Canon::XmlParsing.serialize(obj)
    else
      obj.to_s
    end
  end
end
.format_from_opts(opts) ⇒ Symbol
Helper to extract format from opts for validation
# File 'lib/canon/comparison.rb', line 553
def format_from_opts(opts)
  opts[:format] || :xml
end
.load_profile(name) ⇒ Hash
Load a profile (custom or preset)
# File 'lib/canon/comparison.rb', line 263
def load_profile(name)
  # Check custom profiles first
  if @custom_profiles&.key?(name)
    return @custom_profiles[name].dup
  end

  # Fall back to presets - try Xml first (most common)
  begin
    MatchOptions::Xml.(name)
  rescue Error
    # Try other formats
    MatchOptions::Json.(name)
  end
end
.normalize_format_for_tree_diff(format) ⇒ Symbol
Normalize format for TreeDiff (html4/html5 -> html)
# File 'lib/canon/comparison.rb', line 631
def normalize_format_for_tree_diff(format)
  case format
  when :html4, :html5
    :html
  else
    format
  end
end
.parse_errors_for(node) ⇒ Array<String>
Extract parse-time errors from a parsed-tree or Nokogiri fragment. Delegates to NodeInspector for cross-backend type dispatch.
# File 'lib/canon/comparison.rb', line 182
def self.parse_errors_for(node)
  NodeInspector.parse_errors(node)
end
.parse_html(content, format) ⇒ Object
Parse HTML string into Nokogiri document (delegates to HtmlParser)
# File 'lib/canon/comparison.rb', line 848
def parse_html(content, format)
  HtmlParser.parse(content, format)
end
.parse_with_cache(doc, format, preprocessing) { ... } ⇒ Object
Parse a document with caching
# File 'lib/canon/comparison.rb', line 616
def parse_with_cache(doc, format, preprocessing)
  # If already a parsed node, return as-is
  return doc unless doc.is_a?(String)

  # Use cache for string documents
  Cache.fetch(:document_parse,
              Cache.key_for_document(doc, format, preprocessing)) do # rubocop:disable Lint/UselessDefaultValueArgument
    yield doc
  end
end
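The caching pattern used by parse_with_cache can be sketched with a plain Hash. This is only an approximation: Canon's Cache module is not shown in this page, so the in-memory store, the digest-based key, and the module name below are all assumptions.

```ruby
require "digest/sha2"

# Hypothetical stand-in for Canon's Cache; keys combine format,
# preprocessing mode, and a digest of the document text.
module ParseCacheSketch
  @store = {}

  def self.fetch(doc, format, preprocessing)
    # Already-parsed inputs bypass the cache entirely.
    return doc unless doc.is_a?(String)

    key = [format, preprocessing, Digest::SHA256.hexdigest(doc)]
    # The block (the real parse step) runs only on a cache miss.
    @store.fetch(key) { @store[key] = yield(doc) }
  end
end

parses = 0
2.times do
  ParseCacheSketch.fetch("<root/>", :xml, :none) do |text|
    parses += 1
    text.upcase # placeholder for an actual parser call
  end
end
puts parses # the block ran once; the second call was served from the cache
```

Including the preprocessing mode in the key matters because the same string can parse to different trees under different preprocessing settings.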
.parse_with_comparator(obj1, obj2, format, match_opts_hash) ⇒ Array<Object, Object>
Parse documents using comparator’s parse logic (reuses preprocessing)
# File 'lib/canon/comparison.rb', line 564
def parse_with_comparator(obj1, obj2, format, match_opts_hash)
  preprocessing = match_opts_hash[:preprocessing] || :none

  case format
  when :xml
    # Delegate to XmlComparator's parse - returns Canon::Xml::Node
    doc1 = parse_with_cache(obj1, format, preprocessing) do |doc|
      XmlComparator.parse(doc, preprocessing)
    end
    doc2 = parse_with_cache(obj2, format, preprocessing) do |doc|
      XmlComparator.parse(doc, preprocessing)
    end
    [doc1, doc2]
  when :html, :html4, :html5
    [
      parse_with_cache(obj1, format, preprocessing) do |doc|
        HtmlComparator.parse(doc, preprocessing)
      end,
      parse_with_cache(obj2, format, preprocessing) do |doc|
        HtmlComparator.parse(doc, preprocessing)
      end,
    ]
  when :json
    [
      parse_with_cache(obj1, format, :none) do |doc|
        JsonComparator.parse(doc)
      end,
      parse_with_cache(obj2, format, :none) do |doc|
        JsonComparator.parse(doc)
      end,
    ]
  when :yaml
    [
      parse_with_cache(obj1, format, :none) do |doc|
        YamlComparator.parse(doc)
      end,
      parse_with_cache(obj2, format, :none) do |doc|
        YamlComparator.parse(doc)
      end,
    ]
  else
    [obj1, obj2]
  end
end
.process_profile_parameter(opts) ⇒ Hash
Process unified profile parameter
Converts the new :profile parameter into the legacy format expected by MatchOptions resolvers. Handles:
- Symbol → preset profile (uses :match_profile)
- Hash → custom profile (validates and uses :match)
# File 'lib/canon/comparison.rb', line 467
def process_profile_parameter(opts)
  processed = opts.dup

  # Handle unified :profile parameter
  if opts.key?(:profile)
    profile = opts[:profile]

    case profile
    when Symbol
      # Preset profile name
      processed[:match_profile] = profile
    when Hash
      # Inline custom profile - validate and use as :match
      validate_custom_profile!(profile, format_from_opts(opts))
      processed[:match] = profile
    else
      raise Canon::Error,
            "Invalid profile type: #{profile.class}. " \
            "Expected Symbol (preset name) or Hash (custom profile)."
    end
  end

  processed
end
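The Symbol/Hash split described above can be exercised in isolation with the sketch below. It mirrors the branching of process_profile_parameter but omits the validation call and raises ArgumentError instead of Canon::Error, so that it runs without the gem; the module and method names are hypothetical.

```ruby
# Simplified sketch of the :profile translation; validation is omitted
# and ArgumentError stands in for Canon::Error.
module ProfileParamSketch
  module_function

  def process(opts)
    processed = opts.dup
    return processed unless opts.key?(:profile)

    profile = opts[:profile]
    case profile
    when Symbol
      processed[:match_profile] = profile # preset name
    when Hash
      processed[:match] = profile         # inline custom profile
    else
      raise ArgumentError, "Invalid profile type: #{profile.class}"
    end
    processed
  end
end

p ProfileParamSketch.process(profile: :strict)
p ProfileParamSketch.process(profile: { comments: :ignore })
```

As in the original, the input hash is duplicated rather than mutated, and the :profile key itself is left in place alongside the legacy key.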
.resolve_match_options(format, opts) ⇒ Hash
Resolve match options for a format
# File 'lib/canon/comparison.rb', line 421
def resolve_match_options(format, opts)
  # Process unified profile parameter first
  processed_opts = process_profile_parameter(opts)

  case format
  when :xml, :html, :html4, :html5
    MatchOptions::Xml.resolve(
      format: format,
      match_profile: processed_opts[:match_profile],
      match: processed_opts[:match],
      preprocessing: processed_opts[:preprocessing],
      global_profile: processed_opts[:global_profile],
      global_options: processed_opts[:global_options],
    )
  when :json
    MatchOptions::Json.resolve(
      format: format,
      match_profile: processed_opts[:match_profile],
      match: processed_opts[:match],
      preprocessing: processed_opts[:preprocessing],
      global_profile: processed_opts[:global_profile],
      global_options: processed_opts[:global_options],
    )
  when :yaml
    MatchOptions::Yaml.resolve(
      format: format,
      match_profile: processed_opts[:match_profile],
      match: processed_opts[:match],
      preprocessing: processed_opts[:preprocessing],
      global_profile: processed_opts[:global_profile],
      global_options: processed_opts[:global_options],
    )
  else
    processed_opts[:match] || {}
  end
end
.semantic_diff(obj1, obj2, opts = {}) ⇒ Object
Perform semantic tree diff comparison
# File 'lib/canon/comparison.rb', line 290
def semantic_diff(obj1, obj2, opts = {})
  require_relative "tree_diff"

  # Capture original strings BEFORE any parsing/transformation
  # These are used for display to preserve original formatting
  format_hint = opts[:format]
  original_str1 = extract_original_string(obj1, format_hint)
  original_str2 = extract_original_string(obj2, format_hint)

  # Detect format for both objects
  format1 = opts[:format] || FormatDetector.detect(obj1)
  format2 = opts[:format] || FormatDetector.detect(obj2)

  # Handle string format (plain text comparison) - semantic tree doesn't support it
  if format1 == :string
    if opts[:verbose]
      return obj1.to_s == obj2.to_s ? [] : [:different]
    else
      return obj1.to_s == obj2.to_s
    end
  end

  # Ensure formats match
  unless format1 == format2
    raise Canon::CompareFormatMismatchError.new(format1, format2)
  end

  # Get global config options if not defined in opts
  # This is needed because semantic_diff doesn't go through dom_diff's config handling
  if !(opts[:match_profile] || opts[:global_options]) &&
      %i[xml html json yaml string].include?(format1)
    format_config = Canon::Config.instance.public_send(format1)
    if format_config.match.profile
      opts[:match_profile] = format_config.match.profile
    end
    if format_config.match. && !format_config.match..empty?
      opts[:global_options] = format_config.match.
    end
  end

  # Resolve match options for the format
  match_opts_hash = (format1, opts)

  # Also read diff options from config (e.g., max_node_count for large documents)
  # This is independent of match options and needs to be passed to TreeDiffIntegrator
  if !match_opts_hash[:max_node_count] &&
      %i[xml html json yaml string].include?(format1)
    diff_max_node = Canon::Config.instance.public_send(format1).diff.max_node_count
    if diff_max_node > 10_000
      match_opts_hash[:max_node_count] = diff_max_node
    end
  end

  # Delegate parsing to comparators (reuses existing preprocessing logic)
  doc1, doc2 = parse_with_comparator(obj1, obj2, format1, match_opts_hash)

  # Normalize format for TreeDiff (html4/html5 -> html)
  tree_diff_format = normalize_format_for_tree_diff(format1)

  # Create TreeDiff integrator for the format
  # CRITICAL: Use match_opts_hash (resolved options with profile) not opts[:match]
  integrator = Canon::TreeDiff::TreeDiffIntegrator.new(
    format: tree_diff_format,
    options: match_opts_hash,
  )

  # Perform diff
  tree_diff_result = integrator.diff(doc1, doc2)

  # Extract only match-related keys for OperationConverter and SemanticTreeMatchStrategy
  # These components expect match options, not diff options like max_node_count
  match_only_keys = %i[
    match_profile match preprocessing
    text_content structural_whitespace
    attribute_presence attribute_order attribute_values
    element_position comments format
    similarity_threshold hash_matching similarity_matching propagation
    preserve_whitespace_elements collapse_whitespace_elements
    strip_whitespace_elements respect_xml_space
  ]
   = match_opts_hash.slice(*match_only_keys)

  # Convert operations to DiffNodes for unified pipeline
  # CRITICAL: Use match_opts_hash (resolved options with profile) not opts[:match]
  converter = Canon::TreeDiff::OperationConverter.new(
    format: format1,
    match_options: ,
  )
  diff_nodes = converter.convert(tree_diff_result[:operations])

  # CRITICAL: Use strategy's preprocess_for_display to ensure proper line-breaking
  # This matches DOM diff preprocessing pattern (xml_comparator.rb:106-109)
  require_relative "comparison/strategies/semantic_tree_match_strategy"
  strategy = Comparison::Strategies::SemanticTreeMatchStrategy.new(
    format: format1,
    match_options: ,
  )
  str1, str2 = strategy.preprocess_for_display(doc1, doc2)

  # Store tree diff data in match_options for access via result
   = match_opts_hash.merge(
    tree_diff_operations: tree_diff_result[:operations],
    tree_diff_statistics: tree_diff_result[:statistics],
    tree_diff_matching: tree_diff_result[:matching],
  )

  # Create ComparisonResult for unified handling
  result = Canon::Comparison::ComparisonResult.new(
    differences: diff_nodes,
    preprocessed_strings: [str1, str2],
    original_strings: [original_str1, original_str2],
    format: format1,
    html_version: %i[html4 html5].include?(format1) ? format1 : nil,
    match_options: ,
    algorithm: :semantic,
  )

  # Return boolean or ComparisonResult based on verbose flag
  if opts[:verbose]
    result
  else
    result.equivalent?
  end
end
.serialize_document(doc, format) ⇒ Object
Serialize document back to string
# File 'lib/canon/comparison.rb', line 663
def serialize_document(doc, format)
  case format
  when :xml, :html, :html4, :html5
    if Canon::XmlParsing.xml_node?(doc) || doc.is_a?(Canon::Xml::Node)
      Canon::XmlParsing.serialize(doc)
    else
      doc.to_s
    end
  when :json
    require "json"
    JSON.pretty_generate(doc)
  when :yaml
    require "yaml"
    doc.to_yaml
  else
    doc.to_s
  end
rescue StandardError
  doc.to_s
end
.strip_xml_preamble(str) ⇒ Object
Strip XML declarations and DOCTYPE preambles from an HTML string so it can be safely parsed with Nokogiri::XML.fragment without generating processing-instruction nodes.
# File 'lib/canon/comparison.rb', line 791
def strip_xml_preamble(str)
  str = str.sub(/\A\s*<\?xml[^?]*\?>\s*/m, "")
  if (i = str.index(/<!DOCTYPE/i))
    j = str.index(">", i)
    str = (str[0...i] + str[(j + 1)..]).strip if j
  end
  str
end
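To show what the preamble stripping does to real input, the snippet below copies the method body into a standalone module (the module name is hypothetical) and runs it on a small sample. It assumes Ruby 2.6 or later for the endless range.

```ruby
# Standalone copy of the stripping logic for experimentation;
# PreambleSketch is an illustrative name.
module PreambleSketch
  module_function

  def strip_xml_preamble(str)
    # Drop a leading XML declaration, if any.
    str = str.sub(/\A\s*<\?xml[^?]*\?>\s*/m, "")
    # Cut out the first DOCTYPE declaration, up to its closing ">".
    if (i = str.index(/<!DOCTYPE/i))
      j = str.index(">", i)
      str = (str[0...i] + str[(j + 1)..]).strip if j
    end
    str
  end
end

sample = %(<?xml version="1.0"?><!DOCTYPE html><p>x</p>)
puts PreambleSketch.strip_xml_preamble(sample) # <p>x</p>
```

Input without a declaration or DOCTYPE is returned unchanged, and an unterminated DOCTYPE (no closing ">") is left in place.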
.summarize(obj1, obj2, opts = {}) ⇒ String
Summarize the first difference between two documents.
Returns a human-readable string describing the first difference when documents differ, or “Equivalent” when they match. This is a lightweight alternative to equivalent? with verbose: true.
# File 'lib/canon/comparison.rb', line 225
def summarize(obj1, obj2, opts = {})
  result = equivalent?(obj1, obj2, opts.merge(verbose: true))

  if result.is_a?(ComparisonResult)
    result.summary
  elsif result == true
    "Equivalent"
  else
    "Not equivalent"
  end
end
.valid_dimensions_for_format(format) ⇒ Array<Symbol>
Get valid dimensions for a format
# File 'lib/canon/comparison.rb', line 536
def valid_dimensions_for_format(format)
  case format
  when :xml, :html, :html4, :html5
    MatchOptions::Xml::MATCH_DIMENSIONS
  when :json
    MatchOptions::Json::MATCH_DIMENSIONS
  when :yaml
    MatchOptions::Yaml::MATCH_DIMENSIONS
  else
    []
  end
end
.validate_custom_profile!(profile, format) ⇒ Object
Validate custom profile hash
Ensures all dimensions and behaviors in a custom profile are valid. Uses ProfileDefinition validation logic.
# File 'lib/canon/comparison.rb', line 500
def validate_custom_profile!(profile, format)
  profile.each do |dimension, behavior|
    # Skip preprocessing and special options
    next if dimension == :preprocessing
    next if dimension == :semantic_diff
    next if dimension == :similarity_threshold

    # Validate dimension is known
    valid_dimensions = valid_dimensions_for_format(format)
    unless valid_dimensions.include?(dimension)
      raise Canon::Error,
            "Unknown dimension: #{dimension}. " \
            "Valid dimensions for #{format}: #{valid_dimensions.join(', ')}"
    end

    # Validate behavior is allowed for this dimension
    valid_behaviors = ProfileDefinition::DIMENSION_BEHAVIORS[dimension]
    if valid_behaviors && !valid_behaviors.include?(behavior)
      raise Canon::Error,
            "Invalid behavior '#{behavior}' for dimension '#{dimension}'. " \
            "Valid behaviors: #{valid_behaviors.join(', ')}"
    end

    # Validate behavior is in general MATCH_BEHAVIORS
    unless MatchOptions::MATCH_BEHAVIORS.include?(behavior)
      raise Canon::Error,
            "Unknown match behavior: #{behavior}. " \
            "Valid behaviors: #{MatchOptions::MATCH_BEHAVIORS.join(', ')}"
    end
  end
end
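The three-stage check above (known dimension, behavior allowed for that dimension, behavior known at all) can be tried out with the sketch below. The lookup tables are invented for the example and do not reflect Canon's actual MATCH_DIMENSIONS, DIMENSION_BEHAVIORS, or MATCH_BEHAVIORS contents; ArgumentError stands in for Canon::Error.

```ruby
# Sketch of the validation flow with hypothetical lookup tables.
module ProfileValidationSketch
  VALID_DIMENSIONS = %i[text_content comments].freeze
  DIMENSION_BEHAVIORS = { comments: %i[strict ignore] }.freeze
  MATCH_BEHAVIORS = %i[strict normalize ignore].freeze
  SKIPPED_KEYS = %i[preprocessing semantic_diff similarity_threshold].freeze

  module_function

  def validate!(profile)
    profile.each do |dimension, behavior|
      # Special options are not dimensions and are never validated here.
      next if SKIPPED_KEYS.include?(dimension)

      unless VALID_DIMENSIONS.include?(dimension)
        raise ArgumentError, "Unknown dimension: #{dimension}"
      end

      allowed = DIMENSION_BEHAVIORS[dimension]
      if allowed && !allowed.include?(behavior)
        raise ArgumentError,
              "Invalid behavior '#{behavior}' for dimension '#{dimension}'"
      end

      unless MATCH_BEHAVIORS.include?(behavior)
        raise ArgumentError, "Unknown match behavior: #{behavior}"
      end
    end
    true
  end
end

p ProfileValidationSketch.validate!(text_content: :normalize, comments: :ignore)
```

Unlike the original, this sketch returns true on success so the outcome is visible when run; the documented method signals problems only by raising.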