Module: Canon::Comparison::Pipeline

Defined in:
lib/canon/comparison/pipeline.rb

Overview

Shared comparison pipeline helpers used by both algorithms.

Both `dom_diff` and `semantic_diff` need to:

  • detect document format from inputs (with optional hint)

  • validate that the two formats are comparable

  • merge global config-sourced profile / options into the opts hash

  • capture original-string snapshots before parsing mutates inputs

  • parse both inputs through the format-specific comparator

These steps are pure pipeline mechanics and have nothing to do with the comparison algorithm itself. Keeping them here ensures the two algorithm entrypoints cannot drift out of sync (see “Two Comparison Algorithms — Distinct by Design” in the lutaml/canon CLAUDE.md: the algorithm cores stay separate; only shared infrastructure is consolidated).
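The order of the steps listed above can be sketched as a single helper. This is only an illustration, not the code of either entrypoint: `prepare_comparison` is a hypothetical name, the `:format` and `:match` keys of `opts` are assumptions made for the sketch, and parsing with the first detected format is a simplification. `pipeline` can be any object that responds to the class methods documented on this page.

```ruby
# Hypothetical helper showing the call order of the shared pipeline steps.
# The opts keys :format and :match are assumptions, not documented API.
def prepare_comparison(pipeline, obj1, obj2, opts = {}, strict: false)
  # 1. Detect formats, honouring an optional hint.
  format1, format2 = pipeline.detect_formats(obj1, obj2, opts[:format])

  # 2. Fail early if the two formats cannot be compared.
  pipeline.validate_compatible!(format1, format2, strict: strict)

  # 3. Merge global config into a copy of the caller's opts.
  resolved = pipeline.resolve_config(format1, opts)

  # 4. Snapshot the originals before any parsing can mutate the inputs.
  originals = pipeline.capture_originals(obj1, obj2)

  # 5. Parse both inputs (simplification: the first format is used for both).
  parsed = pipeline.parse_pair(obj1, obj2, format1, resolved.fetch(:match, {}))

  { formats: [format1, format2], opts: resolved, originals: originals, parsed: parsed }
end
```

The one ordering constraint the documentation states explicitly is that originals are captured before parsing; the sketch follows it in steps 4 and 5.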

Constant Summary

CONFIG_BACKED_FORMATS =

Formats whose Canon::Config exposes a match profile / options.

%i[xml html json yaml string].freeze

COMPATIBLE_FORMAT_GROUPS =

Cross-format compatibility groups. DOM comparison accepts these pairings because both sides parse to the same Ruby structure. Semantic comparison does not — it requires exact format match.

[
  %i[json ruby_object].freeze,
  %i[yaml ruby_object].freeze,
].freeze

Class Method Details

.capture_originals(obj1, obj2) ⇒ Array<String, String>

Capture pre-parse string snapshots for diff display.

Parsing (especially HTML) can mutate inputs, so originals must be captured before any parsing happens. Strings pass through unchanged; parsed nodes are serialized via NodeSerializer.

Parameters:

  • obj1 (Object)
  • obj2 (Object)

Returns:

  • (Array<String, String>)

    Captured original strings



# File 'lib/canon/comparison/pipeline.rb', line 122

def capture_originals(obj1, obj2)
  [extract_original_string(obj1), extract_original_string(obj2)]
end
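The private helper `extract_original_string` is not shown on this page. The following is a minimal sketch of the behaviour described above, assuming strings pass through and anything else goes through a serializer. `SketchNode` and the `serializer:` parameter are hypothetical stand-ins for Canon's parsed nodes and `NodeSerializer`.

```ruby
# Hypothetical stand-in for a parsed node; the real node classes and
# NodeSerializer are not shown on this page.
SketchNode = Struct.new(:name) do
  def to_s
    "<#{name}/>"
  end
end

# Assumed shape of the helper: strings pass through, other objects are serialized.
def extract_original_string(obj, serializer: ->(node) { node.to_s })
  obj.is_a?(String) ? obj : serializer.call(obj)
end

def capture_originals(obj1, obj2)
  [extract_original_string(obj1), extract_original_string(obj2)]
end

capture_originals("<a/>", SketchNode.new("b")) # => ["<a/>", "<b/>"]
```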

.detect_formats(obj1, obj2, format_hint) ⇒ Array<Symbol, Symbol>

Detect formats for both inputs, honouring an explicit hint.

Parameters:

  • obj1 (Object)

    First input

  • obj2 (Object)

    Second input

  • format_hint (Symbol, nil)

    Explicit format override

Returns:

  • (Array<Symbol, Symbol>)

    Detected or hinted formats



# File 'lib/canon/comparison/pipeline.rb', line 39

def detect_formats(obj1, obj2, format_hint)
  return [format_hint, format_hint] if format_hint

  [FormatDetector.detect(obj1), FormatDetector.detect(obj2)]
end
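A short sketch of how the hint short-circuits detection. `FakeFormatDetector` is a hypothetical stand-in with made-up rules, since the real `FormatDetector` logic is not shown here, and the `detector:` keyword exists only so the sketch runs on its own.

```ruby
# Hypothetical stand-in for FormatDetector; these rules are invented for the sketch.
module FakeFormatDetector
  def self.detect(obj)
    return :ruby_object unless obj.is_a?(String)

    obj.lstrip.start_with?("<") ? :xml : :json
  end
end

# Same control flow as the documented method, with the detector injectable.
def detect_formats(obj1, obj2, format_hint, detector: FakeFormatDetector)
  return [format_hint, format_hint] if format_hint

  [detector.detect(obj1), detector.detect(obj2)]
end

detect_formats("<a/>", '{"a":1}', nil)   # => [:xml, :json]
detect_formats("<a/>", '{"a":1}', :html) # => [:html, :html]
```

Note that a hint applies to both inputs and skips detection entirely, so a wrong hint is not corrected at this stage.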

.formats_compatible?(format1, format2, strict: false) ⇒ Boolean

True when the two formats can be compared by the DOM algorithm.

DOM allows `ruby_object` to be compared against `json` or `yaml` because both sides parse to the same Ruby structure. Semantic comparison does not allow this; it requires an exact format match.

Parameters:

  • format1 (Symbol)
  • format2 (Symbol)
  • strict (Boolean) (defaults to: false)

    When true, require exact match (semantic)

Returns:

  • (Boolean)


# File 'lib/canon/comparison/pipeline.rb', line 55

def formats_compatible?(format1, format2, strict: false)
  return true if format1 == format2
  return false if strict

  COMPATIBLE_FORMAT_GROUPS.any? do |group|
    group.include?(format1) && group.include?(format2)
  end
end
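To make the pairing rules concrete, here is a standalone copy of the constant and method above, defined at top level rather than inside the module so it runs without Canon loaded, followed by a few calls.

```ruby
# Standalone copy of the documented constant and method, for illustration only.
COMPATIBLE_FORMAT_GROUPS = [
  %i[json ruby_object].freeze,
  %i[yaml ruby_object].freeze,
].freeze

def formats_compatible?(format1, format2, strict: false)
  return true if format1 == format2
  return false if strict

  COMPATIBLE_FORMAT_GROUPS.any? do |group|
    group.include?(format1) && group.include?(format2)
  end
end

formats_compatible?(:json, :ruby_object)               # => true
formats_compatible?(:json, :ruby_object, strict: true) # => false
formats_compatible?(:json, :yaml)                      # => false
formats_compatible?(:xml, :xml, strict: true)          # => true
```

Compatibility is not transitive: `json` and `yaml` each pair with `ruby_object`, but they share no group, so they do not pair with each other.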

.html_string?(obj) ⇒ Boolean

True when the input is a String. As written, the check does not inspect the content, so any String is treated as HTML.

Parameters:

  • obj (Object)

Returns:

  • (Boolean)


# File 'lib/canon/comparison/pipeline.rb', line 206

def html_string?(obj)
  obj.is_a?(String)
end

.parse_pair(obj1, obj2, format, match_opts_hash) ⇒ Array<Object, Object>

Parse both inputs through the format-specific comparator.

Delegates to `XmlComparator`, `HtmlComparator`, `JsonComparator`, or `YamlComparator` based on format. Uses `Cache` so the same string is not re-parsed across runs.

Parameters:

  • obj1 (Object)
  • obj2 (Object)
  • format (Symbol)
  • match_opts_hash (Hash)

    Resolved match options

Returns:

  • (Array<Object, Object>)

    Parsed documents



# File 'lib/canon/comparison/pipeline.rb', line 137

def parse_pair(obj1, obj2, format, match_opts_hash)
  preprocessing = match_opts_hash[:preprocessing] || :none

  case format
  when :xml
    [
      parse_with_cache(obj1, format, preprocessing) do |doc|
        XmlComparator.parse(doc, preprocessing)
      end,
      parse_with_cache(obj2, format, preprocessing) do |doc|
        XmlComparator.parse(doc, preprocessing)
      end,
    ]
  when :html, :html4, :html5
    [
      parse_with_cache(obj1, format, preprocessing) do |doc|
        HtmlComparator.parse(doc, preprocessing)
      end,
      parse_with_cache(obj2, format, preprocessing) do |doc|
        HtmlComparator.parse(doc, preprocessing)
      end,
    ]
  when :json
    [
      parse_with_cache(obj1, format, :none) do |doc|
        JsonComparator.parse(doc)
      end,
      parse_with_cache(obj2, format, :none) do |doc|
        JsonComparator.parse(doc)
      end,
    ]
  when :yaml
    [
      parse_with_cache(obj1, format, :none) do |doc|
        YamlComparator.parse(doc)
      end,
      parse_with_cache(obj2, format, :none) do |doc|
        YamlComparator.parse(doc)
      end,
    ]
  else
    [obj1, obj2]
  end
end
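The private helper `parse_with_cache` and the `Cache` class are not shown on this page. The sketch below illustrates the general idea of memoizing a parse by `[format, preprocessing, input]`. The `PARSE_CACHE` hash, the key layout, and the rule that only strings are cached are all assumptions, not Canon's implementation.

```ruby
require "json"

# Hypothetical in-memory cache; the real Cache class is not shown on this page.
PARSE_CACHE = {}

# Assumed behaviour: cache string inputs by (format, preprocessing, content),
# and parse anything else directly.
def parse_with_cache(input, format, preprocessing)
  return yield(input) unless input.is_a?(String)

  key = [format, preprocessing, input]
  PARSE_CACHE.fetch(key) { PARSE_CACHE[key] = yield(input) }
end

parse_count = 0
2.times do
  parse_with_cache('{"a":1}', :json, :none) do |doc|
    parse_count += 1
    JSON.parse(doc)
  end
end
parse_count # => 1
```

Including the preprocessing mode in the key matters for XML and HTML, where the same string can parse differently under different preprocessing; for JSON and YAML the documented code always passes `:none`.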

.preparse_html_pair(obj1, obj2) ⇒ Array<Object, Object>

Pre-parse HTML strings through `HtmlParser.parse(_, :html5)`.

The DOM comparator needs HTML4 and HTML5 inputs to share HTML’s whitespace-sensitivity semantics, which means routing both through Nokogiri::HTML5.fragment up front (issue #118). The semantic comparator does not need this — it uses Canon’s own HTML data model downstream — so this helper is opt-in.

Returns the inputs unchanged if they are not strings.

Parameters:

  • obj1 (Object)
  • obj2 (Object)

Returns:

  • (Array<Object, Object>)

    Potentially pre-parsed HTML inputs



# File 'lib/canon/comparison/pipeline.rb', line 195

def preparse_html_pair(obj1, obj2)
  [
    html_string?(obj1) ? HtmlParser.parse(obj1, :html5) : obj1,
    html_string?(obj2) ? HtmlParser.parse(obj2, :html5) : obj2,
  ]
end

.resolve_config(format, opts) ⇒ Hash

Merge global config-sourced profile and options into `opts`.

Reads `Canon::Config.instance.<format>.match` for a global `profile` and `profile_options`, and merges them into a copy of the supplied opts hash. Caller-supplied values always win: config-derived `profile_options` extend rather than replace caller-supplied `global_options`.

Returns the original opts hash unchanged when the format is not config-backed (e.g. `:ruby_object`).

Parameters:

  • format (Symbol)
  • opts (Hash)

    Caller opts (will not be mutated)

Returns:

  • (Hash)

    New opts hash with config globals merged in



# File 'lib/canon/comparison/pipeline.rb', line 91

def resolve_config(format, opts)
  return opts unless CONFIG_BACKED_FORMATS.include?(format)

  format_config = Canon::Config.instance.public_send(format)
  match_config = format_config.match
  profile = match_config.profile
  profile_opts = match_config.profile_options

  resolved = opts.dup
  if resolved[:global_profile].nil? && profile
    resolved[:global_profile] = profile
  end

  if profile_opts.any?
    resolved[:global_options] = merge_profile_options(
      resolved[:global_options], profile_opts
    )
  end

  resolved
end
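The precedence rule can be illustrated without loading Canon. In this sketch `MatchConfig` stands in for the object returned by `Canon::Config.instance.<format>.match`, `resolve_with` is a hypothetical name, and `merge_profile_options` is an assumed implementation based only on the statement that caller-supplied values win. The profile name and option keys are made up for the example.

```ruby
# Hypothetical stand-in for the format's match config object.
MatchConfig = Struct.new(:profile, :profile_options)

# Assumed merge rule: config options fill gaps, caller options override.
def merge_profile_options(caller_opts, profile_opts)
  profile_opts.merge(caller_opts || {})
end

# Same steps as the documented method, with the config passed in directly.
def resolve_with(match_config, opts)
  resolved = opts.dup
  if resolved[:global_profile].nil? && match_config.profile
    resolved[:global_profile] = match_config.profile
  end

  if match_config.profile_options.any?
    resolved[:global_options] = merge_profile_options(
      resolved[:global_options], match_config.profile_options
    )
  end

  resolved
end

# Illustrative values only.
config = MatchConfig.new(:example_profile, { comments: :ignore, whitespace: :normalize })
opts = { global_options: { comments: :strict } }

resolved = resolve_with(config, opts)
resolved[:global_profile] # => :example_profile
resolved[:global_options] # => {comments: :strict, whitespace: :normalize}
opts.key?(:global_profile) # => false
```

Because the method works on `opts.dup` and assigns a freshly merged hash to `:global_options`, the caller's hash is left untouched.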

.validate_compatible!(format1, format2, strict: false) ⇒ void

This method returns an undefined value.

Raise a helpful error if formats are incompatible.

Parameters:

  • format1 (Symbol)
  • format2 (Symbol)
  • strict (Boolean) (defaults to: false)

Raises:

  • (Canon::CompareFormatMismatchError)

    When the two formats are not compatible


# File 'lib/canon/comparison/pipeline.rb', line 71

def validate_compatible!(format1, format2, strict: false)
  return if formats_compatible?(format1, format2, strict: strict)

  raise Canon::CompareFormatMismatchError.new(format1, format2)
end
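A caller-side sketch of the guard in strict mode. `FormatMismatch` is a hypothetical stand-in for `Canon::CompareFormatMismatchError` and its message format is an assumption. The compatibility check is inlined so the example runs on its own.

```ruby
# Hypothetical stand-in for Canon::CompareFormatMismatchError.
class FormatMismatch < StandardError
  def initialize(format1, format2)
    super("cannot compare #{format1} with #{format2}") # assumed message
  end
end

GROUPS = [%i[json ruby_object], %i[yaml ruby_object]].freeze

# Same contract as the documented method: return nil or raise.
def validate_compatible!(format1, format2, strict: false)
  return if format1 == format2

  unless strict
    return if GROUPS.any? { |g| g.include?(format1) && g.include?(format2) }
  end

  raise FormatMismatch.new(format1, format2)
end

validate_compatible!(:yaml, :ruby_object) # passes for the DOM algorithm

begin
  # Strict mode (semantic comparison) rejects the same pairing.
  validate_compatible!(:yaml, :ruby_object, strict: true)
rescue FormatMismatch => e
  warn e.message
end
```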