Sitediff
SiteDiff makes it easy to see differences between two versions of a website.
SiteDiff CLI
Warning: SiteDiff 1.2.0 requires at least Ruby 3.1.2.
Warning: SiteDiff 1.0.0 introduces some backwards incompatible changes.
Table of contents
- Introduction
- Installation
- Demo
- Usage
- Command Line Options
- Configuration
- Tips and Tricks
- Acknowledgements
Introduction
SiteDiff makes it easy to see how a website changes. It can compare two similar sites or it can show how a single site changed over time. It helps identify undesirable changes to the site's HTML and it's a useful tool for conducting QA on re-deployments, site upgrades, and more!
When you run SiteDiff, it produces an HTML report showing whether pages on your site have changed or not. For pages that have changed, you can see a colorized diff showing exactly what changed, or compare the visual differences side-by-side in a browser.
SiteDiff supports a range of normalization / sanitization rules. These allow you to eliminate spurious differences, narrowing down differences to the ones that materially affect the site.
Installation
SiteDiff is fairly easy to install. Please refer to the installation docs.
Demo
After installing all dependencies, including version 2 of the Bundler gem, you can
quickly see what SiteDiff can do. Simply use the following commands:
git clone https://github.com/evolvingweb/sitediff
cd sitediff
bundle install
bundle exec thor fixture:serve
Then visit http://localhost:13080/ to view the report.
SiteDiff shows you an overview of all the pages and clearly indicates which
pages have changed and which have not.

When you click on a changed page, you see a colorized diff of the page's markup
showing exactly what changed on the page.

Usage
Here are some instructions on getting started with SiteDiff. To see a list of commands that SiteDiff offers, you can run:
sitediff help
To get help for a particular command, say, diff, you can run:
sitediff help diff
Getting started
To use SiteDiff on your site, create a configuration for your site:
sitediff init http://mysite.example.com
SiteDiff will generate a configuration file named sitediff.yaml by default.
You can open the configuration file sitediff/sitediff.yaml to see the
default configuration generated by SiteDiff.
The configuration reference section explains the contents
of this file and helps you customize it to your requirements.
Then get SiteDiff to crawl your site by using:
sitediff crawl
SiteDiff will then crawl your site, finding pages and caching their
contents. A list of discovered paths will be saved to a paths.txt file.
Now, you can make alterations to your site. For example, change a word on your site's front page. After you're done, you can check what actually changed:
sitediff diff
For each page, SiteDiff will report whether it did or did not change. For pages that changed, it will display a diff. You can also see an HTML version of the report using the following command:
sitediff serve
SiteDiff will start an internal web server and open a report page in your browser. For each page, you can see the diff and a side-by-side view of the old and new versions.
You can now see if the changes were as you expected, or if some things didn't quite work out as you hoped. If you noticed unexpected changes, congratulations: SiteDiff just helped you find an issue you would have otherwise missed!
As you fix any issues, you can continue to alter your site and run
sitediff diff to check the changes against the old version. Once you're
satisfied with the state of your site, you can inform SiteDiff that it should
re-cache your site:
sitediff store
This takes a snapshot of your website and the next time you run
sitediff diff, it will use this new version as the reference for
comparison.
Happy diffing!
Comparing 2 sites
Sometimes you have two sites that you want to compare, for example a production site hosted on a public server and a development site hosted on your computer. SiteDiff can handle this situation, too! Just inform SiteDiff that there are two sites to compare:
sitediff init http://mysite.example.com http://localhost/mysite
Then when you run sitediff diff, it will compare the cached version of the
first site with the current version of the second site.
If both the first and second sites may be changing, you should tell SiteDiff not to cache either site:
sitediff diff --cached=none
Spurious diffs
Sometimes sites have spurious differences that you don't want to show up in a comparison. For example, many sites protect against Cross-Site Request Forgery using a semi-random token. Since this token changes on each HTTP GET, you probably don't care about such a change.
To help with issues such as this, SiteDiff allows you to normalize the HTML it
fetches as it compares pages. In the sitediff.yaml configuration file,
you can add "sanitization rules", which specify either DOM transformations or
regular expression substitutions.
Here's an example of a rule you might add to remove CSRF-protection tokens generated by Django:
dom_transform:
  - title: Remove CSRF tokens
    type: remove
    selector: input[name=csrfmiddlewaretoken]
You can use one of the presets to apply framework-specific sanitization. Currently, SiteDiff only comes with Drupal-specific presets.
See the preset section for more details.
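To see why a regular expression substitution removes a spurious diff, here is a minimal shell sketch of the idea. It does not use SiteDiff or its rule syntax: the sample markup, the `__csrf__` placeholder, and the `normalize` helper are all assumptions made for illustration, and `sed` stands in for whatever substitution engine applies the rule.

```shell
# Hypothetical helper: replace the changing token value with a fixed placeholder.
normalize() {
  sed -E 's/(name="csrfmiddlewaretoken" value=")[^"]*"/\1__csrf__"/'
}

# Two hypothetical fetches of the same form, differing only in the token.
old='<input type="hidden" name="csrfmiddlewaretoken" value="abc123">'
new='<input type="hidden" name="csrfmiddlewaretoken" value="xyz789">'

# Compare the normalized versions instead of the raw markup.
if [ "$(printf '%s\n' "$old" | normalize)" = "$(printf '%s\n' "$new" | normalize)" ]; then
  echo "identical after normalization"
else
  echo "still different"
fi
```

Because both versions are rewritten to the same placeholder before comparison, only differences outside the token remain visible.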
Command Line Options
Finding configuration files
By default SiteDiff will put everything in the sitediff folder. You can use
the --directory flag (short form -C) to specify a different directory.
sitediff init -C my_project_folder https://example.com
sitediff diff -C my_project_folder
sitediff serve -C my_project_folder
Specifying paths
When you run sitediff diff, you can specify which pages to look at in
2 ways:
- The option --paths /foo /bar ... takes one or more paths on the command line. If you're trying to fix one page in particular, specifying just that one path will make sitediff diff run quickly!
- The option --paths-file FILE takes a newline-delimited text file of paths.
This is particularly useful when you're trying to eliminate all diffs.
SiteDiff creates a file sitediff/failures.txt containing all paths
which had differences, so as you try to fix differences, you can run:
sitediff diff --paths-file sitediff/failures.txt
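A paths file can also be written by hand. The sketch below assumes a hypothetical file name (my-paths.txt) and hypothetical paths; the only SiteDiff feature it relies on is the --paths-file option described above. The comparison is skipped when no sitediff executable is found.

```shell
# Hypothetical file name and paths, for illustration only.
paths_file="my-paths.txt"

# Write one path per line, since --paths-file expects a newline-delimited file.
printf '%s\n' /foo /bar /about-us > "$paths_file"

# Run the comparison only when the sitediff executable is available.
if command -v sitediff >/dev/null 2>&1; then
  sitediff diff --paths-file "$paths_file"
fi
```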
Debugging rules
When a sanitization rule isn't working quite right for you, you might run
sitediff diff many times over. If fetching all the pages is taking too long,
try adding the option --cached=all. This tells SiteDiff not to re-fetch
the content, but just compare previously cached versions — it's a lot faster!
Including and Excluding URLs
By default SiteDiff crawls pages that are linked with an HTML anchor using
the <a href="..."> syntax. Most linked pages will be HTML pages, but some links
will point to binaries such as PDF documents and images.
Using the option --exclude='.*\.pdf' ensures the crawler skips links
to documents with a .pdf extension. Note that the regular expression is
applied to the path of the URL, not the base of the URL.
For example --include='.*\.com' will not match http://www.google.com/,
because the path of that URL is / while the base is www.google.com.
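The path-versus-base distinction can be checked with a small shell sketch. This is not how SiteDiff implements matching: the url_path and path_matches helpers are hypothetical, example.com is a placeholder host, and unanchored matching with grep -E is an assumption. It only illustrates which part of the URL the pattern sees.

```shell
# Hypothetical helper: strip the scheme and host, leaving only the path.
url_path() {
  printf '%s\n' "$1" | sed -E 's#^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*##'
}

# Hypothetical helper: succeed when the pattern ($1) matches the path of the URL ($2).
path_matches() {
  url_path "$2" | grep -Eq -- "$1"
}

# The path is /files/report.pdf, so the PDF pattern matches.
path_matches '.*\.pdf' 'http://example.com/files/report.pdf' && echo "pdf pattern matches the path"

# The path is just /, so a pattern aimed at the host name does not match.
path_matches '.*\.com' 'http://www.google.com/' || echo "com pattern does not match the path"
```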
paths / paths-file
SiteDiff allows you to specify a list of paths that you want it to work with. Alternatively, it can crawl the entire site and detect all paths.
- Running sitediff init configures SiteDiff for crawling and seeing differences.
- Running
