CompareXML

CompareXML is a fast, lightweight and feature-rich tool for comparing and diffing XML/HTML. It compares two instances of Nokogiri::XML::Node or Nokogiri::XML::NodeSet for equality or equivalence.

Features

Fast, lightweight and highly customizable
Compares XML/HTML documents and document fragments
Can either produce a detailed list of discrepancies or execute silently
Has the ability to exclude specific nodes or attributes from all comparisons

Installation

Add this line to your application's Gemfile:

gem 'compare-xml'

And then execute:

bundle

Or install it yourself as:

gem install compare-xml

Usage

Using CompareXML is as simple as:

CompareXML.equivalent?(doc1, doc2)

where doc1 and doc2 are instances of Nokogiri::XML::Node or Nokogiri::XML::NodeSet.

Example

Suppose you have two files 1.html and 2.html that you would like to compare. You could do it as follows:

doc1 = Nokogiri::HTML(File.read('1.html'))
doc2 = Nokogiri::HTML(File.read('2.html'))
puts CompareXML.equivalent?(doc1, doc2)

The above code will print true or false depending on the result of the comparison.

If you are using CompareXML in a script, then you need to require it manually with:

require 'compare-xml'

Options at a Glance

CompareXML accepts a variety of options, passed as an optional third argument, e.g.:

CompareXML.equivalent?(doc1, doc2, {collapse_whitespace: false, verbose: true})

collapse_whitespace: {true|false} default: true
- when true, trims and collapses whitespace
ignore_attr_order: {true|false} default: true
- when true, ignores attribute order within tags
ignore_attr_content: [string1, string2, ...] default: []
- when provided, ignores all attributes that contain substrings string1, string2, etc.
ignore_attrs: [css_selector1, css_selector2, ...] default: []
- when provided, ignores specific attributes using CSS selectors
ignore_attrs_by_name: [matcher1, matcher2, ...] default: []
- when provided, ignores attributes whose name matches a String (exact) or Regexp (pattern)
ignore_comments: {true|false} default: true
- when true, ignores comments, such as `<!-- This is a comment -->`
ignore_nodes: [css_selector1, css_selector2, ...] default: []
- when provided, ignores specific nodes using CSS selectors
ignore_text_nodes: {true|false} default: false
- when true, ignores all text content within a document
verbose: {true|false} default: false
- when true, instead of a boolean, CompareXML.equivalent? returns an array of discrepancies
ignore_children: {true|false} default: false
- when true, the subnodes of a node in the XML are ignored
force_children: {true|false} default: false
- when true, the subnodes of a node are checked independently of the status of the parent node
align_children: {true|false} default: false
- when true (in verbose mode), aligns child nodes so insertions and removals are reported as additions/removals instead of changes
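
The defaults listed above can be gathered into a plain hash when several comparisons share the same settings. The following is only a sketch: `DOCUMENTED_DEFAULTS` and `build_options` are hypothetical names that are not part of the gem, and the values are copied from this list rather than read from the library.

```ruby
# Defaults as documented in the option list above (not read from the gem itself).
DOCUMENTED_DEFAULTS = {
  collapse_whitespace: true,
  ignore_attr_order: true,
  ignore_attr_content: [],
  ignore_attrs: [],
  ignore_attrs_by_name: [],
  ignore_comments: true,
  ignore_nodes: [],
  ignore_text_nodes: false,
  verbose: false,
  ignore_children: false,
  force_children: false,
  align_children: false
}.freeze

# Hypothetical helper: merges caller overrides into the documented defaults
# and rejects misspelled option names early.
def build_options(overrides = {})
  unknown = overrides.keys - DOCUMENTED_DEFAULTS.keys
  raise ArgumentError, "unknown option(s): #{unknown.join(', ')}" unless unknown.empty?

  DOCUMENTED_DEFAULTS.merge(overrides)
end

opts = build_options({ verbose: true, collapse_whitespace: false })
# opts can then be passed as the third argument:
#   CompareXML.equivalent?(doc1, doc2, opts)
```

Checking the keys up front is a design choice of this sketch, meant to catch typos such as `verbos:` before a comparison runs.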

Options in Depth

collapse_whitespace: {true|false} default: true

When `true`, all text content within the document is trimmed (i.e. whitespace is removed from both ends) and collapsed (i.e. tabs, new lines and runs of whitespace characters are each replaced by a single space).

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {collapse_whitespace: true})`

**Example:** When `true` the following HTML strings are considered equal:

  <a href="/admin">   SOME TEXT CONTENT   </a>
  <a href="/admin"> SOME    TEXT    CONTENT </a>

**Example:** When `true` the following HTML strings are considered equal:

  <html>
      <title>
          This is my title
      </title>
  </html>

  <html><title>This is my title</title></html>
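
The trimming and collapsing described above can be pictured on plain strings. This is a minimal sketch of the idea, not the gem's implementation; the helper name `collapse_whitespace` is chosen here only for illustration.

```ruby
# Illustrative only: mimics the documented behaviour on plain strings.
def collapse_whitespace(text)
  # strip removes leading/trailing whitespace; gsub squeezes every run of
  # whitespace (spaces, tabs, new lines) into a single space.
  text.strip.gsub(/\s+/, ' ')
end

a = collapse_whitespace('   SOME TEXT CONTENT   ')
b = collapse_whitespace(' SOME    TEXT    CONTENT ')
puts a == b # prints true
```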

ignore_attr_order: {true|false} default: true

When `true`, all attributes are sorted before comparison and only attributes with the same name are compared with each other.

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_attr_order: true})`

**Example:** When `true` the following HTML strings are considered equal:

  <a href="/admin" class="button" target="_blank">Link</a>
  <a class="button" target="_blank" href="/admin">Link</a>

**Example:** When `false` the above HTML strings are compared as follows:

  href="/admin" != class="button"

The comparison of the `<a>` element will stop at this point, since a discrepancy is found.

**Example:** When `true` the following HTML strings are compared as follows:

  <a href="/admin" class="button" target="_blank">Link</a>
  <a class="button" target="_blank" href="/admin" rel="nofollow">Link</a>

  class="button"  == class="button"
  href="/admin"   == href="/admin"
                  != rel="nofollow"
  target="_blank" == target="_blank"
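
The difference between the two modes can be sketched with attributes modelled as `[name, value]` pairs. The helper names below are hypothetical and the code only illustrates the sorting idea described above; it is not how the gem is written.

```ruby
# Attributes modelled as [name, value] pairs in document order (illustration only).
left  = [['href', '/admin'], ['class', 'button'], ['target', '_blank']]
right = [['class', 'button'], ['target', '_blank'], ['href', '/admin']]

# Order-sensitive comparison: pairs are compared position by position.
def same_attrs_ordered?(a, b)
  a == b
end

# Order-insensitive comparison: sort both sides by name (then value) first.
def same_attrs_unordered?(a, b)
  a.sort == b.sort
end

puts same_attrs_ordered?(left, right)   # prints false
puts same_attrs_unordered?(left, right) # prints true
```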

ignore_attr_content: [string1, string2, ...] default: []

When provided, ignores all **attributes** that contain any of the given substrings. **Note:** types of attributes still have to match (e.g. `<p>` = `<p>`, `<div>` = `<div>`, etc.).

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_attr_content: ['button']})`

**Example:** With `ignore_attr_content: ['button']` the following HTML strings are considered equal:

  <a href="/admin" id="button_1" class="blue button">Link</a>
  <a href="/admin" id="button_2" class="info button">Link</a>

**Example:** With `ignore_attr_content: ['menu']` the following HTML strings are considered equal:

  <a class="menu left" data-scope="abrth$menu" role="side-menu">Link</a>
  <a class="main menu" data-scope="ergeh$menu" role="main-menu">Link</a>
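
A rough way to picture this option is as a filter applied to each element's attributes before they are compared. The sketch below assumes, as the examples above suggest, that substrings are matched against attribute values; `reject_by_content` is a hypothetical helper, not a gem method.

```ruby
# Hypothetical helper: drops every attribute whose value contains one of the substrings.
def reject_by_content(attrs, substrings)
  attrs.reject { |_name, value| substrings.any? { |s| value.include?(s) } }
end

left  = { 'href' => '/admin', 'id' => 'button_1', 'class' => 'blue button' }
right = { 'href' => '/admin', 'id' => 'button_2', 'class' => 'info button' }

# Only href survives the filter on both sides, so the remainders match.
puts reject_by_content(left, ['button']) == reject_by_content(right, ['button']) # prints true
```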

ignore_attrs: [css_selector1, css_selector2, ...] default: []

When provided, ignores all **attributes** that satisfy a particular rule using [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_attrs: ['a[rel="nofollow"]', 'input[type="hidden"]']})`

**Example:** With `ignore_attrs: ['a[rel="nofollow"]', 'a[target]']` the following HTML strings are considered equal:

  <a href="/admin" class="button" target="_blank">Link</a>
  <a href="/admin" class="button" target="_self" rel="nofollow">Link</a>

**Example:** With `ignore_attrs: ['a[href^="http"]', 'a[class*="button"]']` the following HTML strings are considered equal:

  <a href="http://google.ca" class="primary button">Link</a>
  <a href="https://google.com" class="primary button rounded">Link</a>

ignore_attrs_by_name: [matcher1, matcher2, ...] default: []

When provided, ignores all **attributes** whose name matches one of the given matchers. A `String` matches the attribute name exactly, while a `Regexp` matches it as a pattern.

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_attrs_by_name: ['target', /^data-/]})`

**Example:** With `ignore_attrs_by_name: ['target', 'rel']` the following HTML strings are considered equal:

  <a href="/admin" class="button" target="_blank">Link</a>
  <a href="/admin" class="button" target="_self" rel="nofollow">Link</a>

**Example:** With `ignore_attrs_by_name: [/^data-/]` the following HTML strings are considered equal:

  <div data-id="1" data-role="row">Link</div>
  <div data-id="2" data-role="cell">Link</div>

An ignored attribute does not need to be present on both elements. With `ignore_attrs_by_name: ['class']` the following HTML strings are considered equal:

  <div class="foo"></div>
  <div></div>
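
The String-versus-Regexp rule can be expressed in a few lines of Ruby. This is a sketch of the matching rule as described above; `ignored_name?` is a hypothetical helper and not necessarily how the gem implements it.

```ruby
# Hypothetical helper: true when the attribute name matches any matcher.
# Strings must match exactly; Regexps match as patterns.
def ignored_name?(name, matchers)
  matchers.any? do |matcher|
    matcher.is_a?(Regexp) ? matcher.match?(name) : matcher == name
  end
end

matchers = ['target', /^data-/]
puts ignored_name?('target', matchers)  # prints true  (exact String match)
puts ignored_name?('data-id', matchers) # prints true  (Regexp match)
puts ignored_name?('targets', matchers) # prints false (Strings are not prefixes)
```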

ignore_comments: {true|false} default: true

When `true`, ignores comments, such as `<!-- This is a comment -->`.

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_comments: true})`

**Example:** When `true` the following HTML strings are considered equal:

  <!-- This is a comment -->
  <!-- This is another comment -->

**Example:** When `true` the following HTML strings are considered equal:

  <a href="/admin"><!-- This is a comment -->Link</a>
  <a href="/admin">Link</a>

ignore_nodes: [css_selector1, css_selector2, ...] default: []

When provided, ignores all **nodes** that satisfy a particular rule using [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_nodes: ['script', 'object']})`

**Example:** With `ignore_nodes: ['a[rel="nofollow"]', 'a[target]']` the following HTML strings are considered equal:

  <a href="/admin" class="icon" target="_blank">Link 1</a>
  <a href="/index" class="button" target="_self" rel="nofollow">Link 2</a>

**Example:** With `ignore_nodes: ['b', 'i']` the following HTML strings are considered equal:

  <a href="/admin"><i class="icon bulb"></i><b>Warning:</b> Link</a>
  <a href="/admin"><i class="icon info"></i><b>Message:</b> Link</a>

ignore_text_nodes: {true|false} default: false

When `true`, ignores all text content. Text content is anything that is included between an opening and a closing tag, e.g. `<tag>THIS IS TEXT CONTENT</tag>`.

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_text_nodes: true})`

**Example:** When `true` the following HTML strings are considered equal:

  <a href="/admin">SOME TEXT CONTENT</a>
  <a href="/admin">DIFFERENT TEXT CONTENT</a>

**Example:** When `true` the following HTML strings are considered equal:

  <i class="icon"></i>  <b>Warning:</b>
  <i class="icon">  </i>    <b>Message:</b>

verbose: {true|false} default: false

When `true`, instead of returning a boolean value, `CompareXML.equivalent?` returns an array of all discrepancies encountered during the comparison.

> **Warning:** When `true`, the comparison takes longer, both because more processing is required to produce meaningful differences and because the comparison does **NOT** stop at the first difference: the goal is to capture as many differences as possible.

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {verbose: true})`

**Example:** When `true` given the following HTML strings:

![diffing](https://github.com/vkononov/compare-xml/raw/main/img/diffing.png)

`CompareXML.equivalent?(doc1, doc2, {verbose: true})` will produce an array shown below.

```ruby
[
{
  node1: '<title>TITLE</title>',
  node2: '<title>ANOTHER TITLE</title>',
  diff1: 'TITLE',
  diff2: 'ANOTHER TITLE'
},
{
  node1: '<h1>SOME HEADING</h1>',
  node2: '<h1 id="main">SOME HEADING</h1>',
  diff1: nil,
  diff2: 'id="main"'
},
{
  node1: '<a href="/admin" rel="icon">Link</a>',
  node2: '<a rel="button" href="/admin">Link</a>',
  diff1: 'rel="icon"',
  diff2: 'rel="button"'
},
{
  node1: '<cite>Author Name</cite>',
  node2: nil,
  diff1: '<cite>Author Name</cite>',
  diff2: nil
},
{
  node1: '<p class="footer">FOOTER</p>',
  node2: '<div class="footer">FOOTER</div>',
  diff1: 'p',
  diff2: 'div'
}
]
```

The structure of each hash inside the array is:

  node1: [Nokogiri::XML::Node] left node that contains the difference
  node2: [Nokogiri::XML::Node] right node that contains the difference
  diff1: [Nokogiri::XML::Node|String] left difference
  diff2: [Nokogiri::XML::Node|String] right difference
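
Given that structure, the returned array can be turned into a short report. The sketch below uses plain strings in place of Nokogiri nodes so that it runs without the gem; `describe` is a hypothetical helper, and real entries may hold nodes, which is why `to_s` is called.

```ruby
# Sample entries shaped like the verbose output above, with strings standing in
# for Nokogiri nodes.
discrepancies = [
  { node1: '<title>TITLE</title>', node2: '<title>ANOTHER TITLE</title>',
    diff1: 'TITLE', diff2: 'ANOTHER TITLE' },
  { node1: '<cite>Author Name</cite>', node2: nil,
    diff1: '<cite>Author Name</cite>', diff2: nil }
]

# Hypothetical helper: one human-readable line per discrepancy.
# A nil side means nothing was found on that side.
def describe(entry)
  left  = entry[:diff1].nil? ? '(nothing)' : entry[:diff1].to_s
  right = entry[:diff2].nil? ? '(nothing)' : entry[:diff2].to_s
  "#{left} != #{right}"
end

discrepancies.each { |entry| puts describe(entry) }
# prints:
#   TITLE != ANOTHER TITLE
#   <cite>Author Name</cite> != (nothing)
```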

ignore_children: {true|false} default: false

When `true`, ignores all **subnodes** of any node.

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {ignore_children: true})`

**Example:** With `ignore_children: true` the following HTML strings are considered equal:

  <body><a href="/admin" class="icon" target="_blank">Link 1</a></body>
  <body><a href="/index" class="button" target="_self" rel="nofollow">Link 2</a></body>

force_children: {true|false} default: false

When `true`, compares all **subnodes** of any node, regardless of the status of the parent node.

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {force_children: true})`

align_children: {true|false} default: false

Only takes effect in `verbose` mode. When `true`, sibling nodes are aligned before comparison so that an inserted or removed node is reported as an addition or removal (with one side `nil`) rather than causing the remaining siblings to be reported as a series of changes. Identical subtrees are matched as anchors, and the unmatched nodes between them are compared in order.

**Usage Example:** `CompareXML.equivalent?(doc1, doc2, {verbose: true, align_children: true})`

**Example:** Given a left document with three children and a right document with only the third:

  <properties><a/><b/><c/></properties>
  <properties><c/></properties>

With `align_children: true`, `<a>` and `<b>` are reported as removals (`node2` is `nil`); without it, they collapse into a single positional change.
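
The anchoring idea can be sketched with a longest-common-subsequence alignment over simple arrays. This is an assumption-laden illustration, not the gem's algorithm: `align` is a hypothetical helper, strings stand in for subtrees, and every unmatched item is reported as an addition or removal rather than being paired positionally between anchors.

```ruby
# Longest-common-subsequence alignment over arrays of comparable items.
# Returns [left, right] pairs; nil marks a missing side.
def align(left, right)
  # lcs[i][j] = length of the LCS of left[i..] and right[j..]
  lcs = Array.new(left.size + 1) { Array.new(right.size + 1, 0) }
  (left.size - 1).downto(0) do |i|
    (right.size - 1).downto(0) do |j|
      lcs[i][j] = if left[i] == right[j]
                    lcs[i + 1][j + 1] + 1
                  else
                    [lcs[i + 1][j], lcs[i][j + 1]].max
                  end
    end
  end

  pairs = []
  i = j = 0
  while i < left.size && j < right.size
    if left[i] == right[j]
      pairs << [left[i], right[j]] # identical items act as anchors
      i += 1
      j += 1
    elsif lcs[i + 1][j] >= lcs[i][j + 1]
      pairs << [left[i], nil]      # present only on the left: removal
      i += 1
    else
      pairs << [nil, right[j]]     # present only on the right: addition
      j += 1
    end
  end
  # Whatever remains on either side has no counterpart.
  pairs.concat(left[i..].map { |item| [item, nil] })
  pairs.concat(right[j..].map { |item| [nil, item] })
end

p align(%w[a b c], %w[c])
# prints [["a", nil], ["b", nil], ["c", "c"]]
```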

Contributing

Fork it
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request

Credits

This gem was inspired by Michael B. Klein's gem equivalent-xml - another excellent tool for XML comparison.

License

The gem is available as open source under the terms of the MIT License.