FPDiff

This is the artifact accompanying the ISSTA'20 paper "Discovering Discrepancies in Numerical Libraries".

The FPDiff Pipeline

FPDiff is a tool for automated, end-to-end differential testing that, given only library source code as input, extracts numerical function signatures, synthesizes drivers, creates equivalence classes of functions that are synonymous, and executes differential tests over these classes to detect meaningful numerical discrepancies between implementations. FPDiff's current scope covers special functions across numerical libraries written in different programming languages. This artifact in particular includes the following libraries: the C library GSL (The GNU Scientific Library, version 2.6), the Python libraries SciPy (version 1.3.1) and mpmath (version 1.1.0), and the JavaScript library jmat (commit 21d15fc3eb5a924beca612e337f5cb00605c03f3).
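
As an illustration of the final step only, the sketch below runs a toy differential test in Python. It is not FPDiff's implementation: the two-member equivalence class (math.gamma versus exp(lgamma)), the helper names, and the tolerance are all assumptions chosen for this example, and only the Python standard library is used.

```python
import math


def run(func, x):
    """Call func(x); return the result, or a string tag if it raises."""
    try:
        return func(x)
    except (ValueError, OverflowError) as exc:
        return f"EXCEPTION: {type(exc).__name__}"


def relative_difference(a, b):
    """Relative difference between two finite floats (0.0 if both are zero)."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale


def diff_test(equivalence_class, inputs, tolerance=1e-9):
    """Return (input, outputs) pairs on which the implementations disagree."""
    discrepancies = []
    for x in inputs:
        outputs = {name: run(f, x) for name, f in equivalence_class.items()}
        values = list(outputs.values())
        all_finite = all(isinstance(v, float) and math.isfinite(v) for v in values)
        if all_finite:
            # Compare every output against the first one.
            disagree = any(
                relative_difference(values[0], v) > tolerance for v in values[1:]
            )
        else:
            # Exceptions, inf, or nan present: flag unless every output is the same.
            disagree = len({repr(v) for v in values}) > 1
        if disagree:
            discrepancies.append((x, outputs))
    return discrepancies


# Hypothetical equivalence class: two ways of computing the gamma function.
GAMMA_CLASS = {
    "gamma": math.gamma,
    "exp_lgamma": lambda x: math.exp(math.lgamma(x)),
}

if __name__ == "__main__":
    for x, outputs in diff_test(GAMMA_CLASS, [0.5, 5.0, -0.5, 0.0]):
        print(x, outputs)
```

On these inputs the two implementations disagree only at -0.5, where lgamma returns the logarithm of the absolute value and the sign of the gamma function is lost. At 0.0 both raise the same exception, so no discrepancy is reported.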

Following this README, we will run the tool end-to-end to generate results and then evaluate those results. This evaluation involves comparing the generated logs against a set of expected results, reconstructing all the examples used in the paper, and reconstructing the information in Table 4, which represents the final results of FPDiff. (Note that Table 1 simply describes our categorization of discrepancies and Tables 2, 3, and 5 are the results of manual inspection; supporting data can be found in the resources/spreadsheets directory, with a discussion in [3.2].)

[0] Requirements

[1] Running FPDiff

------------[1.1] Running FPDiff on all discoverable functions

------------[1.2] Running FPDiff on a representative subset of functions

[2] Evaluating Results: Consistency with the Paper

------------[2.1] Running Automated Checks

------------[2.2] Reconstructing Examples

------------[2.3] Reconstructing Tables

[3] Supplemental Material

------------[3.1] Descriptions of Intermediate Results

------------[3.2] Spreadsheets

------------[3.3] Adding New Libraries

[0] Requirements

All requirements for running FPDiff are packaged in a 1.3GB docker image. As such, the host machine must have docker installed. If you receive a permissions error when running the commands in this README, check to make sure that your user is in the docker group.
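
The two host-side prerequisites above can be checked before starting a run. The following is a minimal sketch, assuming a Unix host; the function names are made up for this example and are not part of the artifact.

```python
import getpass
import grp
import shutil


def docker_on_path():
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


def user_in_group(user, group_name="docker"):
    """Return True if `user` is listed as a member of `group_name`.

    Only the group's explicit member list is consulted, so a user whose
    primary group is `group_name` is not detected by this check.
    """
    try:
        return user in grp.getgrnam(group_name).gr_mem
    except KeyError:
        # The group does not exist on this machine.
        return False


if __name__ == "__main__":
    user = getpass.getuser()
    print("docker on PATH:", docker_on_path())
    print(f"{user} in docker group:", user_in_group(user))
```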

Note that the execution times listed below for running FPDiff do not include the time required to download the docker image. When executing the pipeline for the first time, the image will be downloaded automatically if not found locally. Optionally, one can also run the command below to acquire the docker image manually.

$ docker pull ucdavisplse/sp-diff-testing/artifact

A Dockerfile and the build.sh script called by that Dockerfile are included so that those interested may view the requirements packaged in the docker image and/or build the image locally.

The artifact was tested on a workstation with a 3.60 GHz Intel i7-4790 and 32 GB of RAM running Ubuntu 16.04.

[1] Running FPDiff

This artifact provides two ways to run FPDiff: (1) an execution on all discoverable functions, which takes longer but reproduces the full results of the paper, and (2) an execution on a subset of functions, which is much quicker but provides only a representative subset of the results. The subset execution is provided as a convenience and need not be undertaken if the full run is performed. Hyperparameters for FPDiff executions may be adjusted by editing header.py.

[1.1] Running FPDiff on all discoverable functions (approx. 1 hour)

$ nohup ./run.sh

From the same directory containing this README, execute the above command. This will automatically download the docker image with all the dependencies if it is not found locally, spin up a container, mount the workspace directory in that container, and execute workspace/runExperiment.sh to conduct a full pipeline execution. FPDiff will parse the source code of the libraries, extract function signatures, generate drivers for those functions, place them in equivalence classes, and finally perform differential testing to discover numerical discrepancies.

[1.2] Running FPDiff on a subset of representative functions (approx. 8 minutes)

$ nohup ./run.sh subset

The subset execution need not be undertaken if the full run is performed, though it may be chosen as a convenience. Please note that because the above command generates only a subset of the results, the reconstruction of tables described in [2.3] cannot be completed.

From the same directory containing this README, execute the above command. This will conduct an FPDiff execution in a manner similar to that described above but on a subset of all discoverable functions in the libraries. This subset encompasses ~13% of the equivalence classes and ~25% of the discrepancies discovered by the full run. It also includes all of the examples used in the paper.

[2] Evaluating Results: Consistency with the Paper

This evaluation involves comparing the generated logs against a set of expected logs, reconstructing all the examples used in the paper, and reconstructing the information in Table 4, which represents the final results of FPDiff. (Note that Table 1 simply describes our categorization of discrepancies and Tables 2 and 3 are the results of manual inspection; the supporting data can be found in the resources/spreadsheets directory.) FPDiff will generate results that can be found in workspace/logs. Depending on the choice made with respect to the scope of the FPDiff execution, expected results can be found in either frozenState/full_logs or frozenState/subset_logs.

Contents of the logs directory: equivalenceClasses.csv and reducedDiffTestingResults.csv contain the main results. statistics.txt contains statistics generated by each component of FPDiff, giving an overview of the pipeline execution. These three files are used to demonstrate consistency with the paper (see [2.2] and [2.3]). Files with leading underscores contain intermediate results from the different pipeline components. See [3.1] for a breakdown of these files.
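
The naming convention above can be applied mechanically when browsing the logs. The sketch below separates main result files from intermediate ones by the leading underscore; the function name and the intermediate file name in the usage lines are hypothetical.

```python
from pathlib import Path


def classify_log_files(names):
    """Split file names into (main, intermediate) by the leading underscore."""
    names = sorted(names)
    main = [n for n in names if not n.startswith("_")]
    intermediate = [n for n in names if n.startswith("_")]
    return main, intermediate


if __name__ == "__main__":
    log_dir = Path("workspace/logs")
    if log_dir.is_dir():
        main, intermediate = classify_log_files(
            p.name for p in log_dir.iterdir() if p.is_file()
        )
        print("main results:", main)
        print("intermediate results:", intermediate)
```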

[2.1] Running Automated Checks

$ ./check.sh

To perform an automated series of checks, execute the above command. This will perform a diff between the generated statistics.txt and the expected results kept in the frozenState directory (thus verifying the information used to populate the tables, described in [2.3]) as well as conduct a search of the generated logs for the examples used in the paper (described in [2.2]).

Expected Output. The expected output is a series of [PASS] tags. However, off-by-one differences might be observed in statistics.txt, denoted by a [!] tag and printouts of the offending lines. These differences are sometimes caused by the flakiness of SciPy's implementation of Tricomi's confluent hypergeometric function, or by non-determinism in the detection of timeouts, which can be affected by other processes running on the machine or by the hardware itself.
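
To make the notion of an off-by-one difference concrete, the sketch below compares one expected line with one generated line and tolerates integers that differ by at most one. It is not how check.sh works internally, and the format of statistics.txt is not assumed beyond "text containing integers"; the function name, labels, and sample lines are hypothetical.

```python
import re

_INT = re.compile(r"\d+")


def compare_line(expected, generated, slack=1):
    """Classify a pair of lines as 'PASS', 'OFF_BY_ONE', or 'FAIL'."""
    if expected == generated:
        return "PASS"
    # The text surrounding the integers must match exactly.
    if _INT.sub("#", expected) != _INT.sub("#", generated):
        return "FAIL"
    pairs = zip(_INT.findall(expected), _INT.findall(generated))
    if all(abs(int(a) - int(b)) <= slack for a, b in pairs):
        return "OFF_BY_ONE"
    return "FAIL"


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real statistics.txt.
    print(compare_line("discrepancies: 125", "discrepancies: 126"))
```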

Further Investigating Any Differences. If faced with off-by-one differences in statistics.txt and further investigation is desired, running the command below will re-perform the above checks with an additional line-by-line diff of all logs, printing the EXPECTED lines and the corresponding GENERATED lines (if any).

$ ./check.sh verbose

Confirm that any offending lines pertain to discrepancies that include EXCEPTION: TIMEOUT or discrepancies involving the equivalence class of the Tricomi confluent hypergeometric function described above. See below for an excerpt of the verbose check script's output that exhibits the latter of the two for one particular FPDiff execution.

   In reducedDiffTestingResults.csv:

	EXPECTED: 
		< 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",mpmath,mpmath_hyperu_arg3_DRIVER0,hyperu,203411319101444.38,,,3.0
		< 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",mpmath,mpmath_fp_hyperu_arg3_DRIVER0,hyperu,203411299428896.34,,,3.0
		< 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",scipy,scipy_special_hyperu_arg3_DRIVER0,hyperu,nan,,,3.0
		< 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num038,"[-10.5, 100.4, 80.0]",gsl,gsl_sf_hyperg_U_DRIVER,gsl_sf_hyperg_U,203411319101444.4,,,3.0
	GENERATED:
		> 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num026,"[-10.5, 1.1, 50.0]",mpmath,mpmath_hyperu_arg3_DRIVER0,hyperu,3.661978424114089e+16,,,3.0
		> 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num026,"[-10.5, 1.1, 50.0]",mpmath,mpmath_fp_hyperu_arg3_DRIVER0,hyperu,3.661978424114079e+16,,,3.0
		> 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num026,"[-10.5, 1.1, 50.0]",scipy,scipy_special_hyperu_arg3_DRIVER0,hyperu,nan,,,3.0
		> 8327fff0b9f85b3e5ce4a6c6e345fd34,277.9439994014186,2,gsl_sf_hyperg_U_DRIVER~input_num026,"[-10.5, 1.1, 50.0]",gsl,gsl_sf_hyperg_U_DRIVER,gsl_sf_hyperg_U,3.661978424114087e+16,,,3.0
	EXPECTED: 
		< caba0b6725c770243091e62e26223111,277.9439994014186,3,gsl_sf_hyperg_U_DRIVER~input_num040,"[-10.5, 100.4, 200.0]",mpmath,mpmath_hyperu_arg3_DRIVER0,hyperu,1.4308157061056496e+20,"{'mpmath_hyperu_arg3_DRIVER0/mpmath_fp_hyperu_arg3_DRIVER0': 2602074112.0, 'mpmath_hyperu_
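
The confirmation step described above can be partly automated. The sketch below scans lines of the verbose output and keeps only the "<" and ">" lines that mention neither a timeout nor the Tricomi function. The marker strings are taken from the text and excerpt above; the function names are hypothetical, and matching on substrings is an assumption rather than part of check.sh.

```python
def is_known_flaky(line):
    """True if the line mentions a timeout or the Tricomi function drivers."""
    markers = ("EXCEPTION: TIMEOUT", "hyperu", "hyperg_U")
    return any(marker in line for marker in markers)


def unexplained_lines(diff_lines):
    """Return the '<' and '>' diff lines that match no known-flaky marker."""
    remaining = []
    for raw in diff_lines:
        line = raw.strip()
        if line.startswith(("<", ">")) and not is_known_flaky(line):
            remaining.append(line)
    return remaining


if __name__ == "__main__":
    import sys

    # Usage: ./check.sh verbose | python3 this_script.py
    for line in unexplained_lines(sys.stdin):
        print(line)
```

An empty result suggests that every reported difference falls into one of the two expected categories; any line that is printed deserves a closer look.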
