CompareXML
CompareXML is a fast, lightweight and feature-rich tool that will solve your XML/HTML comparison or diffing needs. Its purpose is to compare two instances of Nokogiri::XML::Node or Nokogiri::XML::NodeSet for equality or equivalency.
Features
- Fast, lightweight and highly customizable
- Compares XML/HTML documents and document fragments
- Can either report detailed discrepancies or run silently, returning only a boolean
- Has the ability to exclude specific nodes or attributes from all comparisons
Installation
Add this line to your application's Gemfile:
gem 'compare-xml'
And then execute:
bundle
Or install it yourself as:
gem install compare-xml
Usage
Using CompareXML is as simple as
CompareXML.equivalent?(doc1, doc2)
where doc1 and doc2 are instances of Nokogiri::XML::Node or Nokogiri::XML::NodeSet.
Example
Suppose you have two files 1.html and 2.html that you would like to compare. You could do it as follows:
doc1 = Nokogiri::HTML(File.read('1.html'))
doc2 = Nokogiri::HTML(File.read('2.html'))
puts CompareXML.equivalent?(doc1, doc2)
The above code will print true or false depending on the result of the comparison.
If you are using CompareXML in a script, then you need to require it manually with:
require 'compare-xml'
Options at a Glance
CompareXML has a variety of options that can be passed as an optional hash argument, e.g.:
CompareXML.equivalent?(doc1, doc2, {collapse_whitespace: false, verbose: true})
- collapse_whitespace: {true|false} default: true
  - when true, trims and collapses whitespace
- ignore_attr_order: {true|false} default: true
  - when true, ignores attribute order within tags
- ignore_attr_content: [string1, string2, ...] default: []
  - when provided, ignores all attributes that contain substrings string1, string2, etc.
- ignore_attrs: [css_selector1, css_selector2, ...] default: []
  - when provided, ignores specific attributes using CSS selectors
- ignore_attrs_by_name: [matcher1, matcher2, ...] default: []
  - when provided, ignores attributes whose name matches a String (exact) or Regexp (pattern)
- ignore_comments: {true|false} default: true
  - when true, ignores comments, such as <!-- comment -->
- ignore_nodes: [css_selector1, css_selector2, ...] default: []
  - when provided, ignores specific nodes using CSS selectors
- ignore_text_nodes: {true|false} default: false
  - when true, ignores all text content within a document
- verbose: {true|false} default: false
  - when true, instead of a boolean, CompareXML.equivalent? returns an array of discrepancies
- ignore_children: {true|false} default: false
  - when true, the subnodes of a node in the XML are ignored
- force_children: {true|false} default: false
  - when true, the subnodes of a node are checked independently of the status of the parent node
- align_children: {true|false} default: false
  - when true (in verbose mode), aligns child nodes so insertions and removals are reported as additions/removals instead of changes
Options in Depth
- collapse_whitespace: {true|false} default: true

    When `true`, all text content within the document is trimmed (i.e. whitespace is removed from the left and right) and whitespace is collapsed (i.e. tabs, new lines and runs of multiple whitespace characters are replaced by a single space).

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {collapse_whitespace: true})`

    **Example:** When `true` the following HTML strings are considered equal:

        <a href="/admin">   SOME   TEXT   CONTENT   </a>
        <a href="/admin">SOME TEXT CONTENT</a>

    **Example:** When `true` the following HTML strings are considered equal:

        <html> <title> This is my title </title> </html>
        <html><title>This is my title</title></html>
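The trimming and collapsing described above can be pictured in plain Ruby. This is only a minimal sketch of the idea, not CompareXML's actual implementation; the helper name `collapse` is hypothetical.

```ruby
# Hypothetical helper (not part of CompareXML): trims both ends and replaces
# every run of whitespace (spaces, tabs, newlines) with a single space.
def collapse(text)
  text.strip.gsub(/\s+/, ' ')
end

a = "  SOME \t TEXT\n   CONTENT  "
b = 'SOME TEXT CONTENT'

# Both strings normalize to the same text, so they would compare as equal.
same = collapse(a) == collapse(b)
puts same
```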
- ignore_attr_order: {true|false} default: true

    When `true`, all attributes are sorted before comparison and only attributes of the same name are compared.

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_attr_order: true})`

    **Example:** When `true` the following HTML strings are considered equal:

        <a href="/admin" class="button" target="_blank">Link</a>
        <a class="button" target="_blank" href="/admin">Link</a>

    **Example:** When `false` the above HTML strings are compared as follows:

        href="/admin" != class="button"

    The comparison of the `<a>` element will stop at this point, since a discrepancy is found.

    **Example:** When `true` the following HTML strings are compared as follows:

        <a href="/admin" class="button" target="_blank">Link</a>
        <a class="button" target="_blank" href="/admin" rel="nofollow">Link</a>

        class="button"  == class="button"
        href="/admin"   == href="/admin"
                        != rel="nofollow"
        target="_blank" == target="_blank"
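The difference between sorted and positional attribute comparison can be sketched without Nokogiri. The `[name, value]` pair representation and the helper `same_attrs?` are assumptions made for illustration; the gem itself works on Nokogiri attributes.

```ruby
# Attributes modelled as [name, value] pairs in document order
# (an illustrative assumption, not the gem's internal representation).
left  = [['href', '/admin'], ['class', 'button'], ['target', '_blank']]
right = [['class', 'button'], ['target', '_blank'], ['href', '/admin']]

# Hypothetical helper: compares the lists pairwise, optionally sorting
# both by attribute name first so that order no longer matters.
def same_attrs?(a, b, ignore_order:)
  a, b = a.sort, b.sort if ignore_order
  a == b
end

sorted     = same_attrs?(left, right, ignore_order: true)
positional = same_attrs?(left, right, ignore_order: false)
puts sorted      # order ignored
puts positional  # first pair already differs: href vs class
```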
- ignore_attr_content: [string1, string2, ...] default: []

    When provided, ignores all **attributes** that contain any of the given substrings.

    **Note:** types of attributes still have to match (i.e. `<p>` = `<p>`, `<div>` = `<div>`, etc).

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_attr_content: ['button']})`

    **Example:** With `ignore_attr_content: ['button']` the following HTML strings are considered equal:

        <a href="/admin" id="button_1" class="blue button">Link</a>
        <a href="/admin" id="button_2" class="info button">Link</a>

    **Example:** With `ignore_attr_content: ['menu']` the following HTML strings are considered equal:

        <a class="menu left" data-scope="abrth$menu" role="side-menu">Link</a>
        <a class="main menu" data-scope="ergeh$menu" role="main-menu">Link</a>
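As a rough picture of substring-based filtering, the sketch below drops every attribute whose value contains one of the given substrings before comparing. It assumes the match is made against the attribute value, and the helper `strip_by_content` is hypothetical rather than part of the gem.

```ruby
# Hypothetical helper: removes attributes whose value contains any of the
# given substrings (assumption: the value, not the name, is inspected).
def strip_by_content(attrs, substrings)
  attrs.reject { |_name, value| substrings.any? { |s| value.include?(s) } }
end

left  = { 'href' => '/admin', 'id' => 'button_1', 'class' => 'blue button' }
right = { 'href' => '/admin', 'id' => 'button_2', 'class' => 'info button' }

left_kept  = strip_by_content(left, ['button'])
right_kept = strip_by_content(right, ['button'])

# Only href survives on both sides, so the remaining attributes match.
puts left_kept == right_kept
```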
- ignore_attrs: [css_selector1, css_selector2, ...] default: []

    When provided, ignores all **attributes** that satisfy a particular rule using [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_attrs: ['a[rel="nofollow"]', 'input[type="hidden"]']})`

    **Example:** With `ignore_attrs: ['a[rel="nofollow"]', 'a[target]']` the following HTML strings are considered equal:

        <a href="/admin" class="button" target="_blank">Link</a>
        <a href="/admin" class="button" target="_self" rel="nofollow">Link</a>

    **Example:** With `ignore_attrs: ['a[href^="http"]', 'a[class*="button"]']` the following HTML strings are considered equal:

        <a href="http://google.ca" class="primary button">Link</a>
        <a href="https://google.com" class="primary button rounded">Link</a>
- ignore_attrs_by_name: [matcher1, matcher2, ...] default: []

    When provided, ignores all **attributes** whose name matches one of the given matchers. A `String` matches the attribute name exactly, while a `Regexp` matches it as a pattern.

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_attrs_by_name: ['target', /^data-/]})`

    **Example:** With `ignore_attrs_by_name: ['target', 'rel']` the following HTML strings are considered equal:

        <a href="/admin" class="button" target="_blank">Link</a>
        <a href="/admin" class="button" target="_self" rel="nofollow">Link</a>

    **Example:** With `ignore_attrs_by_name: [/^data-/]` the following HTML strings are considered equal:

        <div data-id="1" data-role="row">Link</div>
        <div data-id="2" data-role="cell">Link</div>

    An ignored attribute does not need to be present on both elements. With `ignore_attrs_by_name: ['class']` the following HTML strings are considered equal:

        <div class="foo"></div>
        <div></div>
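The String-versus-Regexp rule can be illustrated in a few lines of plain Ruby. The helper `ignored_name?` is hypothetical and only mirrors the documented behaviour; it is not taken from the gem's source.

```ruby
# Hypothetical helper: true when the attribute name matches any matcher.
# A String must equal the name exactly; a Regexp is tested as a pattern.
def ignored_name?(name, matchers)
  matchers.any? do |m|
    m.is_a?(Regexp) ? m.match?(name) : m == name
  end
end

matchers = ['target', /^data-/]
attrs = { 'href' => '/admin', 'target' => '_blank', 'data-id' => '1' }

# Keep only the attributes that would still take part in the comparison.
kept = attrs.reject { |name, _value| ignored_name?(name, matchers) }
p kept
```

Note that a String matcher such as `'target'` would not match an attribute named `targets`, whereas `/^data-/` matches any name starting with `data-`.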
- ignore_comments: {true|false} default: true

    When `true`, ignores comments, such as `<!-- This is a comment -->`.

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_comments: true})`

    **Example:** When `true` the following HTML strings are considered equal:

        <!-- This is a comment -->
        <!-- This is another comment -->

    **Example:** When `true` the following HTML strings are considered equal:

        <a href="/admin"><!-- This is a comment -->Link</a>
        <a href="/admin">Link</a>
- ignore_nodes: [css_selector1, css_selector2, ...] default: []

    When provided, ignores all **nodes** that satisfy a particular rule using [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_nodes: ['script', 'object']})`

    **Example:** With `ignore_nodes: ['a[rel="nofollow"]', 'a[target]']` the following HTML strings are considered equal:

        <a href="/admin" class="icon" target="_blank">Link 1</a>
        <a href="/index" class="button" target="_self" rel="nofollow">Link 2</a>

    **Example:** With `ignore_nodes: ['b', 'i']` the following HTML strings are considered equal:

        <a href="/admin"><i class="icon bulb"></i><b>Warning:</b> Link</a>
        <a href="/admin"><i class="icon info"></i><b>Message:</b> Link</a>
- ignore_text_nodes: {true|false} default: false

    When `true`, ignores all text content. Text content is anything that is included between an opening and a closing tag, e.g. `<tag>THIS IS TEXT CONTENT</tag>`.

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_text_nodes: true})`

    **Example:** When `true` the following HTML strings are considered equal:

        <a href="/admin">SOME TEXT CONTENT</a>
        <a href="/admin">DIFFERENT TEXT CONTENT</a>

    **Example:** When `true` the following HTML strings are considered equal:

        <i class="icon"></i> <b>Warning:</b>
        <i class="icon"> </i> <b>Message:</b>
- verbose: {true|false} default: false

    When `true`, instead of returning a boolean value, `CompareXML.equivalent?` returns an array of all discrepancies encountered while performing the comparison.

    > **Warning:** When `true`, the comparison takes longer! More processing is required to produce meaningful differences, and in this mode the comparison does **NOT** stop at the first difference, because the goal is to capture as many differences as possible.

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {verbose: true})`

    **Example:** When `true`, for a pair of HTML documents containing the nodes listed under `node1` and `node2`, `CompareXML.equivalent?(doc1, doc2, {verbose: true})` will produce the array shown below.

    ```ruby
    [
      {
        node1: '<title>TITLE</title>',
        node2: '<title>ANOTHER TITLE</title>',
        diff1: 'TITLE',
        diff2: 'ANOTHER TITLE'
      },
      {
        node1: '<h1>SOME HEADING</h1>',
        node2: '<h1 id="main">SOME HEADING</h1>',
        diff1: nil,
        diff2: 'id="main"'
      },
      {
        node1: '<a href="/admin" rel="icon">Link</a>',
        node2: '<a rel="button" href="/admin">Link</a>',
        diff1: 'rel="icon"',
        diff2: 'rel="button"'
      },
      {
        node1: '<cite>Author Name</cite>',
        node2: nil,
        diff1: '<cite>Author Name</cite>',
        diff2: nil
      },
      {
        node1: '<p class="footer">FOOTER</p>',
        node2: '<div class="footer">FOOTER</div>',
        diff1: 'p',
        diff2: 'div'
      }
    ]
    ```

    The structure of each hash inside the array is:

        node1: [Nokogiri::XML::Node] left node that contains the difference
        node2: [Nokogiri::XML::Node] right node that contains the difference
        diff1: [Nokogiri::XML::Node|String] left difference
        diff2: [Nokogiri::XML::Node|String] right difference
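Once the array of discrepancies is available, it can be turned into a readable report. The sketch below uses sample data in the documented `node1`/`node2`/`diff1`/`diff2` shape, with plain strings standing in for Nokogiri nodes so that it runs on its own; the helper `describe` is hypothetical.

```ruby
# Sample data in the documented shape. With the real gem, node1/node2 would
# be Nokogiri nodes; strings are used here so the sketch is self-contained.
discrepancies = [
  { node1: '<title>TITLE</title>', node2: '<title>ANOTHER TITLE</title>',
    diff1: 'TITLE', diff2: 'ANOTHER TITLE' },
  { node1: '<cite>Author Name</cite>', node2: nil,
    diff1: '<cite>Author Name</cite>', diff2: nil }
]

# Hypothetical helper: turns one discrepancy hash into a one-line summary.
def describe(d)
  if d[:node2].nil?
    "only in left: #{d[:diff1]}"
  elsif d[:node1].nil?
    "only in right: #{d[:diff2]}"
  else
    "changed: #{d[:diff1].inspect} -> #{d[:diff2].inspect}"
  end
end

report = discrepancies.map { |d| describe(d) }
puts report
```

In real use, the array returned by `CompareXML.equivalent?(doc1, doc2, {verbose: true})` would take the place of the sample data.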
- ignore_children: {true|false} default: false

    When `true`, ignores all **subnodes** of any node.

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_children: true})`

    **Example:** With `ignore_children: true` the following HTML strings are considered equal:

        <body><a href="/admin" class="icon" target="_blank">Link 1</a></body>
        <body><a href="/index" class="button" target="_self" rel="nofollow">Link 2</a></body>
- force_children: {true|false} default: false

    When `true`, compares the **subnodes** of every node, independently of the status of the parent node.

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {force_children: true})`
- align_children: {true|false} default: false

    Only takes effect in `verbose` mode. When `true`, sibling nodes are aligned before comparison so that an inserted or removed node is reported as an addition or removal (with one side `nil`) rather than causing the remaining siblings to be reported as a series of changes. Identical subtrees are matched as anchors, and the unmatched nodes between them are compared in order.

    **Usage Example:** `CompareXML.equivalent?(doc1, doc2, {verbose: true, align_children: true})`

    **Example:** Given a left document with three children and a right document with only the third:

        <properties><a/><b/><c/></properties>
        <properties><c/></properties>

    With `align_children: true`, `<a>` and `<b>` are reported as removals (`node2` is `nil`); without it, they collapse into a single positional change.
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
Credits
This gem was inspired by Michael B. Klein's gem equivalent-xml - another excellent tool for XML comparison.
License
The gem is available as open source under the terms of the MIT License.
