ruby-duplicates

A small duplicate-code metric for Ruby.

ruby-duplicates parses Ruby with the standard library Ripper, normalizes syntax trees so names and literal values do not dominate the comparison, fingerprints method subtrees, and reports methods with high Jaccard similarity.

It is inspired by Uncle Bob's dry4clj, which applies the same broad idea to Clojure code: compare normalized structure instead of doing plain text clone detection.

This is a metric tool, not a refactoring engine. It points at suspiciously similar methods so a human or coding agent can decide whether the duplication is accidental, intentional symmetry, or data-shaped boilerplate.

Install

From this repo:

bundle install
exe/ruby-duplicates app lib test

As a gem from a local checkout:

gem build ruby-duplicates.gemspec
gem install ruby-duplicates-*.gem
ruby-duplicates app lib test

To use it from another project before a RubyGems release, point that project's Gemfile at the GitHub repo:

gem "ruby-duplicates", git: "https://github.com/barturba/ruby-duplicates.git"

Usage

ruby-duplicates [options] [file-or-directory ...]

Examples:

ruby-duplicates app lib test
ruby-duplicates --threshold 0.9 --min-lines 5 --min-nodes 30 app
ruby-duplicates --json app/models app/controllers

Options:

--threshold N    Minimum similarity score, default 0.82
--min-lines N    Minimum method source lines, default 4
--min-nodes N    Minimum normalized syntax nodes, default 20
--max-results N  Maximum matches to print, default 50
--format F       text or json, default text
--json           Same as --format json
--ignore-dir N   Directory basename or path to skip; may be repeated

Example output:

ruby_duplicates candidates=3 matches=1 threshold=0.82

DUPLICATE score=1.00 shared=21
  examples/duplicate_sample.rb:1-4 alpha nodes=64
  examples/duplicate_sample.rb:7-10 beta nodes=64

How It Works

For each Ruby method, the scanner:

  1. Parses the file with Ripper.sexp.
  2. Extracts def and defs method nodes.
  3. Normalizes identifiers, constants, instance variables, globals, labels, strings, and numbers into token classes.
  4. Normalizes most non-head symbols so tiny operator/name differences do not hide repeated shape.
  5. Fingerprints every normalized subtree with SHA1.
  6. Compares method fingerprint sets with Jaccard similarity.
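
Steps 1 through 5 can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual implementation: the helper names (method_nodes, normalize, fingerprints) are hypothetical, the normalizer collapses every scanner token to its token class and every non-head symbol to a single placeholder, and the tool's real rules are likely more selective.

```ruby
require "ripper"
require "digest/sha1"
require "set"

# Hypothetical helper: collect every :def / :defs node from a Ripper s-expression.
def method_nodes(node, found = [])
  return found unless node.is_a?(Array)
  found << node if node.first == :def || node.first == :defs
  node.each { |child| method_nodes(child, found) }
  found
end

# Hypothetical helper: scanner tokens look like [:@ident, "name", [line, col]].
# Keep only the token class so names, literals, and positions do not matter.
# Symbols that are not in head position are collapsed to :sym (an assumption
# standing in for the tool's "most non-head symbols" rule).
def normalize(node)
  return node unless node.is_a?(Array)
  return [node.first] if node.first.is_a?(Symbol) && node.first.to_s.start_with?("@")
  node.each_with_index.map do |child, index|
    child.is_a?(Symbol) && index > 0 ? :sym : normalize(child)
  end
end

# Hypothetical helper: hash every subtree of the normalized tree into a set.
def fingerprints(node, acc = Set.new)
  return acc unless node.is_a?(Array)
  acc << Digest::SHA1.hexdigest(node.inspect)
  node.each { |child| fingerprints(child, acc) }
  acc
end

source = <<~RUBY
  def alpha(items)
    items.map { |item| item + 1 }
  end

  def beta(rows)
    rows.map { |row| row + 2 }
  end
RUBY

alpha, beta = method_nodes(Ripper.sexp(source)).map { |m| fingerprints(normalize(m)) }
puts alpha == beta
```

The two methods differ in every name and literal, yet their normalized subtrees hash to the same set, which is the property the comparison step relies on.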

The defaults intentionally favor high-signal matches. Lower --threshold, --min-lines, or --min-nodes when exploring.
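
For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A small sketch of how a score relates to the threshold, using made-up fingerprint labels and assuming a match is reported when the score is at least the threshold (the tool's exact comparison is not shown here):

```ruby
require "set"

# Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets.
def jaccard(a, b)
  union = a | b
  return 0.0 if union.empty?
  (a & b).size.to_f / union.size
end

THRESHOLD = 0.82 # the documented default for --threshold

# Placeholder fingerprints; real ones would be SHA1 digests of subtrees.
left  = Set["f1", "f2", "f3", "f4"]
right = Set["f2", "f3", "f4", "f5"]

score = jaccard(left, right) # 3 shared out of 5 distinct
puts format("score=%.2f report=%s", score, score >= THRESHOLD)
# prints "score=0.60 report=false"
```

With a threshold of 0.5 the same pair would be reported, which is why lowering --threshold surfaces more, and noisier, candidates.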

Limits

  • It only scans Ruby methods, not arbitrary repeated blocks.
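  • It is structural, not semantic: methods with the same shape but different behavior can score high, and equivalent logic written in a different shape can score low.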
  • Metaprogrammed code can look sparse because the useful behavior is hidden in data.
  • Rails controllers and tests can produce intentional symmetry. Treat those as review candidates, not automatic refactors.
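
The first limit is visible directly in Ripper output: code repeated outside any method produces no def or defs node, so a method-level scanner has nothing to compare. A small sketch, where count_method_nodes is a hypothetical helper and not part of the tool:

```ruby
require "ripper"

# Hypothetical helper: count :def and :defs nodes in a Ripper s-expression.
def count_method_nodes(node)
  return 0 unless node.is_a?(Array)
  own = (node.first == :def || node.first == :defs) ? 1 : 0
  own + node.sum { |child| count_method_nodes(child) }
end

# The same block repeated at the top level of a script.
script = <<~RUBY
  [1, 2, 3].each { |n| puts n * 2 }
  [4, 5, 6].each { |n| puts n * 2 }
RUBY

# The same blocks wrapped in an instance method and a singleton method.
in_methods = <<~RUBY
  def first_batch
    [1, 2, 3].each { |n| puts n * 2 }
  end

  def self.second_batch
    [4, 5, 6].each { |n| puts n * 2 }
  end
RUBY

top_level_count = count_method_nodes(Ripper.sexp(script))
wrapped_count   = count_method_nodes(Ripper.sexp(in_methods))
puts top_level_count # 0: nothing for a method-level scan to extract
puts wrapped_count   # 2: one def and one defs
```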

Development

ruby -Ilib test/ruby_duplicates_test.rb
gem build ruby-duplicates.gemspec

Inspiration