FPDiff
This is the artifact accompanying the ISSTA'20 paper "Discovering Discrepancies in Numerical Libraries".

FPDiff is a tool for automated, end-to-end differential
testing that, given only library source code as input,
extracts numerical function signatures, synthesizes drivers, creates
equivalence classes of functions that are synonymous, and executes
differential tests over these classes to detect meaningful numerical
discrepancies between implementations. FPDiff's current scope covers
special functions across numerical libraries written in different
programming languages. This artifact in particular includes the
following libraries: the C library GSL (The GNU Scientific Library,
version 2.6), the Python libraries SciPy (version 1.3.1) and
mpmath (version 1.1.0), and the JavaScript library jmat (commit
21d15fc3eb5a924beca612e337f5cb00605c03f3).
In following this README, we will run the tool end-to-end to generate
results and then evaluate those results. This evaluation involves
comparing the generated logs against a set of expected results,
reconstructing all the examples used in the paper, and reconstructing
the information in Table 4, which represents the final results of
FPDiff. (Note that Table 1 simply describes our categorization of
discrepancies and Tables 2, 3, and 5 are the results of manual
inspection; supporting data can be found in the
resources/spreadsheets directory, with a discussion in [3.2].)
[0] Requirements
[1] Running FPDiff
    [1.1] Running FPDiff on all discoverable functions
    [1.2] Running FPDiff on a representative subset of functions
[2] Evaluating Results: Consistency with the Paper
    [2.1] Running Automated Checks
    [2.2] Reconstructing Examples
    [2.3] Reconstructing Tables
[3] Supplemental Material
    [3.1] Descriptions of Intermediate Results
    [3.2] Spreadsheets
    [3.3] Adding New Libraries
[0] Requirements
All requirements for running FPDiff are packaged in a 1.3 GB Docker
image, so the host machine must have Docker installed. If you
receive a permissions error when running the commands in this README,
check that your user is in the docker group.
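The group check mentioned above can be sketched in Python as follows. This is only an illustrative helper (the name in_group is hypothetical and not part of the artifact); it assumes a Unix host and uses only the standard library.

```python
import grp
import os


def in_group(group_name: str) -> bool:
    """Return True if the current process belongs to the named Unix group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this machine.
        return False
    # os.getgroups() lists supplementary groups; the effective gid is
    # checked separately in case it is not included in that list.
    return gid in os.getgroups() or gid == os.getegid()


if __name__ == "__main__":
    print("in docker group:", in_group("docker"))
```

Note that a newly added group membership only takes effect in a new login session, so the check may report False until you log out and back in.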
Note that the execution times listed below for running FPDiff do not include the time required to download the docker image. When executing the pipeline for the first time, the image will be downloaded automatically if not found locally. Optionally, one can also run the command below to acquire the docker image manually.
$ docker pull ucdavisplse/sp-diff-testing:artifact
A Dockerfile and the build.sh script called by that Dockerfile are
included so that those interested may view the requirements packaged
in the docker image and/or build the image locally.
The artifact was tested on a workstation with a 3.60 GHz Intel i7-4790 and 32 GB of RAM running Ubuntu 16.04.
[1] Running FPDiff
This artifact provides two ways to run FPDiff: (1) an execution on
all discoverable functions, which takes longer but reproduces the
full results of the paper, and (2) an execution on a subset of
functions, which is much quicker but provides only a representative
subset of the results. The subset execution need not be undertaken
if the full run is performed; it is provided as a convenience.
Hyperparameters for FPDiff executions may be adjusted by editing header.py.
[1.1] Running FPDiff on all discoverable functions (approx. 1 hour)
$ nohup ./run.sh
From the same directory containing this README, execute the above
command. This will automatically download the docker image with all
the dependencies if it is not found locally, spin up a container,
mount the workspace directory in that container, and execute
workspace/runExperiment.sh to conduct a full pipeline execution.
FPDiff will parse the source code of the libraries, extract function
signatures, generate drivers for those functions, place them in
equivalence classes, and finally perform differential testing to
discover numerical discrepancies.
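To make the last step of the pipeline concrete, here is a minimal sketch of differential testing over one equivalence class. It is not FPDiff's implementation: the two functions (math.expm1 and a hypothetical naive_expm1) merely stand in for drivers from different libraries, and the relative-difference threshold is an assumed value chosen for illustration.

```python
import math
from itertools import combinations


def naive_expm1(x: float) -> float:
    """Hypothetical second implementation of exp(x) - 1."""
    return math.exp(x) - 1.0


# One "equivalence class": functions expected to compute the same quantity.
EQUIVALENCE_CLASS = {"math.expm1": math.expm1, "naive_expm1": naive_expm1}


def run_driver(fn, x):
    """Call one implementation, turning exceptions into a comparable value."""
    try:
        return fn(x)
    except (OverflowError, ValueError) as exc:
        return f"EXCEPTION: {type(exc).__name__}"


def disagree(a, b, threshold):
    """Decide whether two driver outputs differ meaningfully."""
    if isinstance(a, str) or isinstance(b, str):
        return a != b  # exception on one side only, or different exceptions
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) != math.isnan(b)
    if a == b:
        return False
    if math.isinf(a) or math.isinf(b):
        return True
    return abs(a - b) / max(abs(a), abs(b)) > threshold


def differential_test(functions, inputs, threshold=1e-9):
    """Return (input, name1, out1, name2, out2) for every disagreeing pair."""
    discrepancies = []
    for x in inputs:
        outputs = {name: run_driver(fn, x) for name, fn in functions.items()}
        for (n1, o1), (n2, o2) in combinations(outputs.items(), 2):
            if disagree(o1, o2, threshold):
                discrepancies.append((x, n1, o1, n2, o2))
    return discrepancies


if __name__ == "__main__":
    for record in differential_test(EQUIVALENCE_CLASS, [1e-10, 1.0, 1000.0]):
        print(record)
```

In this sketch only the tiny input is flagged, because the naive formula loses precision to cancellation near zero; at the largest input both implementations raise the same exception, which is treated as agreement.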
[1.2] Running FPDiff on a subset of representative functions (approx. 8 minutes)
$ nohup ./run.sh subset
The subset execution need not be undertaken if the full run is performed, though it may be opted for as a convenience. Please note that because the above command generates only a subset of the results, the table reconstruction described in [2.3] cannot be completed from a subset run.
From the directory containing this README, execute the above command. This conducts an FPDiff execution similar to the one described in [1.1], but on a subset of all discoverable functions in the libraries. This subset encompasses ~13% of the equivalence classes and ~25% of the discrepancies discovered by the full run. It also includes all of the examples used in the paper.
[2] Evaluating Results: Consistency with the Paper
This evaluation involves comparing the generated logs against a set of
expected logs, reconstructing all the examples used in the paper, and
reconstructing the information in Table 4, which represents the final
results of FPDiff. (Note that Table 1 simply describes our
categorization of discrepancies and Tables 2 and 3 are the results of
manual inspection; the supporting data can be found in the
resources/spreadsheets directory.) FPDiff will generate results that can be
found in workspace/logs. Depending on the choice made with respect
to the scope of the FPDiff execution, expected results can be found in
either frozenState/full_logs or frozenState/subset_logs.
Contents of the logs directory: equivalenceClasses.csv and
reducedDiffTestingResults.csv contain main results. statistics.txt
contains stats generated from each component of FPDiff, giving an
overview of the pipeline execution. These files are used to
demonstrate consistency with the paper (see [2.2] and [2.3]). Files with leading underscores
contain information on intermediate results from the different
pipeline components. See [3.1] for a breakdown of these files.
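For readers who want to inspect reducedDiffTestingResults.csv programmatically, the following sketch parses a single row of the kind shown in the excerpt in [2.1]. The column meanings used here (inputs, library, driver, function, output) are assumptions inferred from that excerpt, not a documented schema, and parse_row is a hypothetical helper.

```python
import ast
import csv
import io

# One row copied from the excerpt in [2.1].
ROW = (
    '8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,'
    'gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",'
    'gsl,gsl_sf_hyperg_U_DRIVER,gsl_sf_hyperg_U,203411319101444.4,,,3.0'
)


def parse_row(line: str) -> dict:
    """Split one CSV row and name the fields we think we can identify."""
    # csv.reader keeps the quoted input list together as a single field.
    fields = next(csv.reader(io.StringIO(line)))
    return {
        "inputs": ast.literal_eval(fields[4]),  # assumed: argument list
        "library": fields[5],                   # assumed: library name
        "driver": fields[6],                    # assumed: driver name
        "function": fields[7],                  # assumed: function name
        "output": float(fields[8]),             # assumed: result ("nan" parses too)
    }


if __name__ == "__main__":
    print(parse_row(ROW))
```

Using the csv module rather than splitting on commas matters here, because the input list itself contains commas.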
[2.1] Running Automated Checks
$ ./check.sh
To perform an automated series of checks, execute the above command.
This will perform a diff between the generated statistics.txt and
the expected results kept in the frozenState directory (thus
verifying information which is used to populate the tables, described
in [2.3]) as well as conduct a search of the generated logs for the
examples used in the paper (described in [2.2]).
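The kind of comparison described above can be sketched with Python's difflib. This is not the code in check.sh; the function name check_statistics and the sample lines are hypothetical, and the real layout of statistics.txt may differ.

```python
import difflib


def check_statistics(expected_lines, generated_lines):
    """Return a tag ([PASS] or [!]) plus the differing lines, if any."""
    diff = list(
        difflib.unified_diff(
            expected_lines,
            generated_lines,
            fromfile="expected",
            tofile="generated",
            lineterm="",
            n=0,  # no context lines: report only what changed
        )
    )
    return ("[PASS]" if not diff else "[!]"), diff


# Hypothetical file contents, used only to exercise the function.
expected = ["equivalence classes: 10", "timeouts: 3"]
generated = ["equivalence classes: 10", "timeouts: 4"]

tag, diff = check_statistics(expected, generated)
print(tag)
for line in diff:
    print(line)
```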
Expected Output. The expected output is a series of [PASS] tags.
However, off-by-one differences might be observed in statistics.txt,
denoted by a [!] tag and printouts of the offending lines. These
differences are sometimes caused by the flakiness of SciPy's
implementation of Tricomi's confluent hypergeometric function or by
non-determinism in the detection of timeouts, which can be affected
by other processes running on the machine or by the hardware itself.
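One way to decide whether a flagged pair of lines is merely an off-by-one difference is sketched below. The helper off_by_one and the example lines are hypothetical; the sketch assumes that corresponding lines share the same text and differ only in their integer counts.

```python
import re


def off_by_one(expected_line: str, generated_line: str) -> bool:
    """True if the lines match except for integers that differ by at most 1."""
    # Replace every integer with a placeholder to compare the surrounding text.
    same_text = re.sub(r"\d+", "#", expected_line) == re.sub(
        r"\d+", "#", generated_line
    )
    if not same_text:
        return False
    exp_nums = [int(n) for n in re.findall(r"\d+", expected_line)]
    gen_nums = [int(n) for n in re.findall(r"\d+", generated_line)]
    return all(abs(e - g) <= 1 for e, g in zip(exp_nums, gen_nums))


if __name__ == "__main__":
    print(off_by_one("timeouts: 12", "timeouts: 13"))
    print(off_by_one("timeouts: 12", "timeouts: 20"))
```

Lines that fail this test differ by more than the expected flakiness and deserve a closer look with the verbose check.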
Further Investigating Any Differences. If faced with off-by-one
differences in statistics.txt and further investigation is desired,
running the command below will re-perform the above checks with an
additional line-by-line diff of all logs, printing the EXPECTED
lines and the corresponding GENERATED lines (if any).
$ ./check.sh verbose
Confirm that any offending lines pertain to discrepancies that include
EXCEPTION: TIMEOUT or discrepancies involving the equivalence class
of the Tricomi confluent hypergeometric function described above. See
below for an excerpt of the verbose check script's output that exhibits
the latter of the two for one particular FPDiff execution.
In reducedDiffTestingResults.csv:
EXPECTED:
< 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",mpmath,mpmath_hyperu_arg3_DRIVER0,hyperu,203411319101444.38,,,3.0
< 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",mpmath,mpmath_fp_hyperu_arg3_DRIVER0,hyperu,203411299428896.34,,,3.0
< 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",scipy,scipy_special_hyperu_arg3_DRIVER0,hyperu,nan,,,3.0
< 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",gsl,gsl_sf_hyperg_U_DRIVER,gsl_sf_hyperg_U,203411319101444.4,,,3.0
GENERATED:
> 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num026,"[-10.5, 1.1, 50.0]",mpmath,mpmath_hyperu_arg3_DRIVER0,hyperu,3.661978424114089e+16,,,3.0
> 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num026,"[-10.5, 1.1, 50.0]",mpmath,mpmath_fp_hyperu_arg3_DRIVER0,hyperu,3.661978424114079e+16,,,3.0
> 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num026,"[-10.5, 1.1, 50.0]",scipy,scipy_special_hyperu_arg3_DRIVER0,hyperu,nan,,,3.0
> 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num026,"[-10.5, 1.1, 50.0]",gsl,gsl_sf_hyperg_U_DRIVER,gsl_sf_hyperg_U,3.661978424114087e+16,,,3.0
EXPECTED:
< caba0b6725c770243091e62e26223111,277.9439994014186,3,gsl_sf_hyperg_U_DRIVER~input_num040,"[-10.5, 100.4, 200.0]",mpmath,mpmath_hyperu_arg3_DRIVER0,hyperu,1.4308157061056496e+20,"{'mpmath_hyperu_arg3_DRIVER0/mpmath_fp_hyperu_arg3_DRIVER0': 2602074112.0, 'mpmath_hyperu_
