Module: Canon::Comparison::Pipeline
- Defined in: lib/canon/comparison/pipeline.rb
Overview
Shared comparison pipeline helpers used by both algorithms.
Both `dom_diff` and `semantic_diff` need to:
- detect document format from inputs (with optional hint)
- validate that the two formats are comparable
- merge global config-sourced profile / options into the opts hash
- capture original-string snapshots before parsing mutates inputs
- parse both inputs through the format-specific comparator
These steps are pure pipeline mechanics; they have nothing to do with the comparison algorithm itself. Keeping them here ensures the two algorithm entrypoints cannot drift out of sync (see lutaml/canon “Two Comparison Algorithms — Distinct by Design” in CLAUDE.md: the algorithm cores stay separate; only shared infrastructure is consolidated).
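The order of the five steps can be sketched as follows. This is an illustration only: `RecordingPipeline` is a hypothetical fake that merely records which helper was called, `prepare` is an invented entrypoint name, and reading the hint from `opts[:format]` is an assumption, not something this page confirms.

```ruby
# Hypothetical fake standing in for Canon::Comparison::Pipeline.
# Each helper only records its name; none of this is Canon's real logic.
class RecordingPipeline
  attr_reader :calls

  def initialize
    @calls = []
  end

  def detect_formats(_obj1, _obj2, format_hint)
    @calls << :detect_formats
    [format_hint || :xml, format_hint || :xml]
  end

  def validate_compatible!(_format1, _format2, strict: false)
    @calls << :validate_compatible!
    nil
  end

  def resolve_config(_format, opts)
    @calls << :resolve_config
    opts.dup
  end

  def capture_originals(obj1, obj2)
    @calls << :capture_originals
    [obj1.to_s, obj2.to_s]
  end

  def parse_pair(obj1, obj2, _format, _opts)
    @calls << :parse_pair
    [obj1, obj2]
  end
end

# Hypothetical entrypoint: originals are captured before parse_pair runs,
# because parsing may mutate the inputs.
def prepare(pipeline, obj1, obj2, opts = {})
  format1, format2 = pipeline.detect_formats(obj1, obj2, opts[:format])
  pipeline.validate_compatible!(format1, format2, strict: opts.fetch(:strict, false))
  resolved  = pipeline.resolve_config(format1, opts)
  originals = pipeline.capture_originals(obj1, obj2)
  parsed    = pipeline.parse_pair(obj1, obj2, format1, resolved)
  { format: format1, opts: resolved, originals: originals, parsed: parsed }
end

pipeline = RecordingPipeline.new
result = prepare(pipeline, "<a/>", "<a/>", { format: :xml })
```

Both algorithm entrypoints could share a preamble of this shape and then diverge only in what they do with `result[:parsed]`.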
Constant Summary
- CONFIG_BACKED_FORMATS =
  Formats whose Canon::Config exposes a match profile / options.
  %i[xml html json yaml string].freeze
- COMPATIBLE_FORMAT_GROUPS =
  Cross-format compatibility groups. DOM comparison accepts these pairings because both sides parse to the same Ruby structure. Semantic comparison does not; it requires an exact format match.
  [
    %i[json ruby_object].freeze,
    %i[yaml ruby_object].freeze,
  ].freeze
Class Method Summary
- .capture_originals(obj1, obj2) ⇒ Array<String, String>
  Capture pre-parse string snapshots for diff display.
- .detect_formats(obj1, obj2, format_hint) ⇒ Array<Symbol, Symbol>
  Detect formats for both inputs, honouring an explicit hint.
- .formats_compatible?(format1, format2, strict: false) ⇒ Boolean
  True when the two formats can be compared by the DOM algorithm.
- .html_string?(obj) ⇒ Boolean
  True when the input is a String AND should be treated as HTML.
- .parse_pair(obj1, obj2, format, match_opts_hash) ⇒ Array<Object, Object>
  Parse both inputs through the format-specific comparator.
- .preparse_html_pair(obj1, obj2) ⇒ Array<Object, Object>
  Pre-parse HTML strings through `HtmlParser.parse(_, :html5)`.
- .resolve_config(format, opts) ⇒ Hash
  Merge global config-sourced profile and options into `opts`.
- .validate_compatible!(format1, format2, strict: false) ⇒ void
  Raise a helpful error if formats are incompatible.
Class Method Details
.capture_originals(obj1, obj2) ⇒ Array<String, String>
Capture pre-parse string snapshots for diff display.
Parsing (especially HTML) can mutate inputs, so originals must be captured before any parsing happens. Strings pass through unchanged; parsed nodes are serialized via NodeSerializer.
# File 'lib/canon/comparison/pipeline.rb', line 122
def capture_originals(obj1, obj2)
  [extract_original_string(obj1), extract_original_string(obj2)]
end
.detect_formats(obj1, obj2, format_hint) ⇒ Array<Symbol, Symbol>
Detect formats for both inputs, honouring an explicit hint.
# File 'lib/canon/comparison/pipeline.rb', line 39
def detect_formats(obj1, obj2, format_hint)
  return [format_hint, format_hint] if format_hint

  [FormatDetector.detect(obj1), FormatDetector.detect(obj2)]
end
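A minimal sketch of the hint short-circuit, assuming a toy detector. `StubDetector` and its sniffing rule are hypothetical stand-ins; the real `FormatDetector` logic is not shown on this page.

```ruby
# Hypothetical stand-in for FormatDetector; the sniffing rule is invented
# purely so the example can run on its own.
module StubDetector
  def self.detect(obj)
    return :ruby_object unless obj.is_a?(String)

    obj.lstrip.start_with?("{", "[") ? :json : :xml
  end
end

# Same shape as the listing above, with the detector injected for the demo.
def detect_formats(obj1, obj2, format_hint, detector: StubDetector)
  return [format_hint, format_hint] if format_hint

  [detector.detect(obj1), detector.detect(obj2)]
end

hinted   = detect_formats("<a/>", '{"a":1}', :xml) # hint is applied to both sides
detected = detect_formats("<a/>", '{"a":1}', nil)  # each side is detected separately
```

With a hint, the inputs are never inspected, so a hint that contradicts the content is still applied to both sides. Without one, the two sides can come back with different formats, which is what the compatibility check exists to catch.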
.formats_compatible?(format1, format2, strict: false) ⇒ Boolean
True when the two formats can be compared by the DOM algorithm.
DOM allows `ruby_object` to be compared against `json` or `yaml` because both sides parse to the same Ruby structure. Semantic comparison does not allow this; it requires an exact format match.
# File 'lib/canon/comparison/pipeline.rb', line 55
def formats_compatible?(format1, format2, strict: false)
  return true if format1 == format2
  return false if strict

  COMPATIBLE_FORMAT_GROUPS.any? do |group|
    group.include?(format1) && group.include?(format2)
  end
end
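The snippet below copies the constant and method from this page into a standalone script to show how the groups and the `strict:` flag interact. The `results` hash is only for illustration.

```ruby
# Standalone copy of the constant and predicate documented above.
COMPATIBLE_FORMAT_GROUPS = [
  %i[json ruby_object].freeze,
  %i[yaml ruby_object].freeze,
].freeze

def formats_compatible?(format1, format2, strict: false)
  return true if format1 == format2
  return false if strict

  COMPATIBLE_FORMAT_GROUPS.any? do |group|
    group.include?(format1) && group.include?(format2)
  end
end

results = {
  same_format_strict: formats_compatible?(:xml, :xml, strict: true),          # identical formats always pass
  json_vs_ruby:       formats_compatible?(:json, :ruby_object),               # shares a group
  json_vs_yaml:       formats_compatible?(:json, :yaml),                      # no group holds both
  json_vs_ruby_strict: formats_compatible?(:json, :ruby_object, strict: true) # strict ignores the groups
}
```

Compatibility is not transitive: `json` and `yaml` each pair with `ruby_object`, but no single group contains both, so they do not pair with each other.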
.html_string?(obj) ⇒ Boolean
True when the input is a String AND should be treated as HTML.
# File 'lib/canon/comparison/pipeline.rb', line 206
def html_string?(obj)
  obj.is_a?(String)
end
.parse_pair(obj1, obj2, format, match_opts_hash) ⇒ Array<Object, Object>
Parse both inputs through the format-specific comparator.
Delegates to `XmlComparator`, `HtmlComparator`, `JsonComparator`, or `YamlComparator` based on format. Uses `Cache` so the same string is not re-parsed across runs. Any other format returns the inputs unparsed.
# File 'lib/canon/comparison/pipeline.rb', line 137
def parse_pair(obj1, obj2, format, match_opts_hash)
  preprocessing = match_opts_hash[:preprocessing] || :none

  case format
  when :xml
    [
      parse_with_cache(obj1, format, preprocessing) do |doc|
        XmlComparator.parse(doc, preprocessing)
      end,
      parse_with_cache(obj2, format, preprocessing) do |doc|
        XmlComparator.parse(doc, preprocessing)
      end,
    ]
  when :html, :html4, :html5
    [
      parse_with_cache(obj1, format, preprocessing) do |doc|
        HtmlComparator.parse(doc, preprocessing)
      end,
      parse_with_cache(obj2, format, preprocessing) do |doc|
        HtmlComparator.parse(doc, preprocessing)
      end,
    ]
  when :json
    [
      parse_with_cache(obj1, format, :none) do |doc|
        JsonComparator.parse(doc)
      end,
      parse_with_cache(obj2, format, :none) do |doc|
        JsonComparator.parse(doc)
      end,
    ]
  when :yaml
    [
      parse_with_cache(obj1, format, :none) do |doc|
        YamlComparator.parse(doc)
      end,
      parse_with_cache(obj2, format, :none) do |doc|
        YamlComparator.parse(doc)
      end,
    ]
  else
    [obj1, obj2]
  end
end
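The body of `parse_with_cache` is not shown on this page. The sketch below is one way a block-based parse cache of that shape could behave, assuming it is keyed on format, preprocessing mode, and the input string. `PARSE_CACHE`, the key layout, the `:normalize` mode, and the toy parser are all hypothetical.

```ruby
# Hypothetical in-memory cache; Canon's real Cache is not documented here.
PARSE_CACHE = {}

# Assumption: only String inputs are cached; anything else goes straight
# to the block. The key combines format, preprocessing mode, and content.
def parse_with_cache(obj, format, preprocessing)
  return yield(obj) unless obj.is_a?(String)

  key = [format, preprocessing, obj]
  PARSE_CACHE.fetch(key) { PARSE_CACHE[key] = yield(obj) }
end

parse_count = 0
toy_parser = lambda do |doc|
  parse_count += 1 # count how often real parsing work happens
  doc.upcase       # stand-in for building a document tree
end

first  = parse_with_cache("<a/>", :xml, :none) { |doc| toy_parser.call(doc) }
second = parse_with_cache("<a/>", :xml, :none) { |doc| toy_parser.call(doc) }      # cache hit
third  = parse_with_cache("<a/>", :xml, :normalize) { |doc| toy_parser.call(doc) } # different mode, parsed again
```

Under this assumption the preprocessing mode has to be part of the key, since the same string can parse differently per mode. In the listing above, the JSON and YAML branches always pass `:none` in that slot because their parsers take no preprocessing argument.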
.preparse_html_pair(obj1, obj2) ⇒ Array<Object, Object>
Pre-parse HTML strings through `HtmlParser.parse(_, :html5)`.
The DOM comparator needs HTML4 and HTML5 inputs to share HTML’s whitespace-sensitivity semantics, which means routing both through Nokogiri::HTML5.fragment up front (issue #118). The semantic comparator does not need this — it uses Canon’s own HTML data model downstream — so this helper is opt-in.
Returns the inputs unchanged if they are not strings.
# File 'lib/canon/comparison/pipeline.rb', line 195
def preparse_html_pair(obj1, obj2)
  [
    html_string?(obj1) ? HtmlParser.parse(obj1, :html5) : obj1,
    html_string?(obj2) ? HtmlParser.parse(obj2, :html5) : obj2,
  ]
end
.resolve_config(format, opts) ⇒ Hash
Merge global config-sourced profile and options into `opts`.
Reads `Canon::Config.instance.<format>.match` for a global `profile` and `profile_options`, and merges them into a copy of the supplied opts hash. Caller-supplied values always win: config-derived `profile_options` extend rather than replace caller-supplied `global_options`.
Returns the original opts hash unchanged when the format is not config-backed (e.g. `:ruby_object`).
# File 'lib/canon/comparison/pipeline.rb', line 91
def resolve_config(format, opts)
  return opts unless CONFIG_BACKED_FORMATS.include?(format)

  format_config = Canon::Config.instance.public_send(format)
  match_config = format_config.match
  profile = match_config.profile
  profile_opts = match_config.

  resolved = opts.dup
  if resolved[:global_profile].nil? && profile
    resolved[:global_profile] = profile
  end

  if profile_opts.any?
    resolved[:global_options] = (
      resolved[:global_options], profile_opts
    )
  end

  resolved
end
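The "caller wins, config extends" rule can be sketched as below. The merge helper used in the listing is elided on this page, so the plain `Hash#merge` here is an assumption chosen to match the documented precedence. `MatchConfig`, `resolve_with`, and the option keys are hypothetical.

```ruby
# Hypothetical stand-in for the object returned by
# Canon::Config.instance.<format>.match.
MatchConfig = Struct.new(:profile, :profile_options)

# Sketch of the documented precedence, not Canon's actual implementation.
def resolve_with(match_config, opts)
  resolved = opts.dup

  # The config profile fills in only when the caller gave none.
  if resolved[:global_profile].nil? && match_config.profile
    resolved[:global_profile] = match_config.profile
  end

  if match_config.profile_options.any?
    # Assumption: config options are the base and caller options are merged
    # on top, so the caller wins on any conflicting key.
    resolved[:global_options] =
      match_config.profile_options.merge(resolved[:global_options] || {})
  end

  resolved
end

config      = MatchConfig.new(:strict, { comments: :ignore, whitespace: :normalize })
caller_opts = { global_options: { whitespace: :strict } }
resolved    = resolve_with(config, caller_opts)
```

Here `resolved[:global_options]` keeps the caller's `whitespace: :strict` and gains `comments: :ignore` from the config, while `caller_opts` itself is left untouched because the merge result is assigned onto the copy.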
.validate_compatible!(format1, format2, strict: false) ⇒ void
This method returns an undefined value.
Raise a helpful error if formats are incompatible.
# File 'lib/canon/comparison/pipeline.rb', line 71
def validate_compatible!(format1, format2, strict: false)
  return if formats_compatible?(format1, format2, strict: strict)

  raise Canon::CompareFormatMismatchError.new(format1, format2)
end