mytpcds

Run TPC-DS against different databases including Hive, Spark SQL and IBM BigSQL

https://github.com/stanislawbartkowski/mytpcds/wiki

Test results: https://github.com/stanislawbartkowski/mytpcds/wiki/TPC-DS-BigData-results

TPC/DS version tested:

v2.13.0rc1
DSGen-software-code-3.2.0rc1

Inspiration

http://www.tpc.org/tpcds/

TPC-DS is an objective tool to measure and compare different databases systems. The same set of data and non trivial queries can be loaded and executed and give an insight how databases respond to the workload. Also, having expected result based on experience, the tool can be also used for testing and tunning the newly installed or upgraded database. But using TPC-DS is not easy out of the box. Requires some manual tasks to perform. So I decided to prepare an automated tool to do the task. Just download the TPC-DS files, configure and run it.

TPC-DS preparation

Unpack the compressed zip file in the directory. The following directory structure is created.

v2.10.1rc3/v2.11.0rc2/v2.13.0rc1
- answer_sets
- EULA.txt
- query_templates
- query_variants
- specification
- tests
- tools

For the purpose of the test the additional directories should be created.

v2.10.1rc3/v2.11.0rc2
- work
  - data (will contained input data)
  - db2queries (queries and results related to appropriate database)
  - mysqlqueries
  - (etc)

cd v2.10.1rc3(v2.11.0rc2)/tools make

Executable files are created. To create an input data set run the command dsdgen. The parameter -sc describes the size of the data.

./dsdgen -dir ../work/data -sc 100

Tool description.

The following database servers are supported: PosgreSQL, MySQL (MariaDB), Oracle, DB2, Netezza, Hive, SparkSQL and IBM BigSQL. The tool consists of several simple bash and awk script files. The tool does not require any dependencies.

File | Description | Wiki ------------ | ------------- | ------ db/db2proc.sh | Implementation for DB2 and BigSQL | https://github.com/stanislawbartkowski/mytpcds/wiki/DB2 https://github.com/stanislawbartkowski/mytpcds/wiki/IBM-BigSQL db/hiveproc.sh | Implementation for Hive and SparkSQL Thrive | https://github.com/stanislawbartkowski/mytpcds/wiki/SparkSQL-Thrive db/netezzaproc.sh | Implementation for Netezza | https://github.com/stanislawbartkowski/mytpcds/wiki/Netezza db/phoenixproc.sh | Implementation for HBase Phoenix (not working) db/mysqlproc.sh | Implementation for MySQL/MariaDB | https://github.com/stanislawbartkowski/mytpcds/wiki/MySQL db/oracleproc.sh | Implementation for Oracle | https://github.com/stanislawbartkowski/mytpcds/wiki/Oracle db/psqlproc.sh | Implementation for PostreSQL | https://github.com/stanislawbartkowski/mytpcds/wiki/PostgreSQL db/sparksqlproc.sh | Implementation for SparkSQL | https://github.com/stanislawbartkowski/mytpcds/wiki/SparkSQL-Thrive db/jsqshproc.sh | Alternative solution for BigSQL, jsqsh | https://github.com/stanislawbartkowski/mytpcds/tree/master/RunQueries | proc | Several supporting bash and awk scripts | https://github.com/stanislawbartkowski/mytpcds/tree/master/proc | RunQueries | Java tool to run queries using JDBC connection | https://github.com/stanislawbartkowski/mytpcds/tree/master/proc ptest.sh | Starter for Throughput Test res | Expected result sets for Qualify Test run.sh | Launching script file tpc.sh | Main tpc-ds test runner env | Configuration files for databases transf.awk | AWK script file used to transform the results env.templates | Templates for configuration files sh.templates | Templates for lauch scripts res | Refeence data set for Qualify Test qualification | Queries parameters used to run Qualify Test

Java Query Runner

The queries are executed by an appropriate command line tool. But to pass Qualify Test, the output should match the reference answer result set. Because every tool comes with its own output format, it is not easy to find a common denominator for all databases. So I develeped a simple Java QueryRunner giving the same output format regardless of the database.

Build Java Query Runner

cd RunQueries mvn package

The following directory structure should be created.

ll target/
  archive-tmp
  classes
  generated-sources
  lib
  maven-archiver
  maven-status
  RunQueries-1.0-SNAPSHOT.jar
  RunQueries-1.0-SNAPSHOT-jar-with-dependencies.jar

In order to use QueryRunner in the test, DBURL and JAVADRIVER configuration parameters should be specified.

Tool configuarion

Copy bash scripts from sh.templates to the directory.

cd mytpcds cp sh.templates/* .

Create directory for database connection specification.

mkdir env

Copy relevent properties file from env.templates to env and configure according to yout evironment. For instance: to run Hive TPC/DS test.

cp env.templates/hive.rc env

The common configuration parameters

| Parameter | Value | Example | ------- | ------- | ----- | | DBNAME | Database name | PERFDB | DBUSER | Database user, should have full privileges in the database | perf | DBPASSWORD | Database password | secret | DBHOST | Database host name | netezza.fyre.ibm.com | DTYPE | Database identifier, should correspond to db/${DTYPE}proc.sh | netezza, points to db/netezzaproc.sh | DBURL | JDBC URL to connect to database | jdbc:netezza://$DBHOST:5480/$DBNAME | JAVADRIVER | JDBC Java driver jar file | /usr/local/nz/lib/nzjdbc3.jar

DBURL and JAVADRIVER should be specified only if QueryRunner is going to be used. If only database client software is used, these parameters could be ignored.

Troubleshooting

Logs are created in the directory pointed by TEMPDIR variable.

export TEMPDIR=/tmp/tpcds

Example:

cat /tmp/tpcds/db2log/mytcpds.log

SQL30082N  Security processing failed with reason "24" ("USERNAME AND/OR 
PASSWORD INVALID").  SQLSTATE=08001
Cannot connect to DB2
Exit immediately

The queries

Not all queries are ready to execute out of the box. The TPC-DS specification allows small alteration of the query to make them runnable (4.2.3.1).

It is recognized that implementations require specific adjustments for their operating environment and the syntactic variations of its dialect of the SQL language

To avoid keeping a different version of queries for every database, I decided to make amendments on the fly. Most changes are related to date arithmetics like adding or subtracting a number of days or table aliases. Changes are limited to basic text substitution without touching the query logic.

The changes are implemented in https://github.com/stanislawbartkowski/mytpcds/blob/master/tpc.sh script file, runsinglequery bash function.

After that, I ended up with the following queries coverage.

Database | Coverage ------------ | ------------- DB2 | 100% Oracle | 100% MySQL/MariaDB | 87% PostgreSQL | 97% Hive 2.1 | 49% SparkSQL 2.3 | 94% Netezza/NPS | 95% IBM BigSQL | 100%

Global settings

Variable name | Description | Default/sample value ------------ | ------------- | ------------- TEMPDIR | Temporary directory for log file and temp files | /tmp/mytpcds TCPROOT | Root directory for upacked TCP-DS payload | /home/sbartkowski/work/v2.10.0rc2 ENV | Resource file for a database under test | env/bigsql TESTDATA | TCP-DS table used for test loading phase | call_center TESTQUERY | Number of query used to execute a query test | 04 DONOTVERIFY | If empty, run Test Validation (QUALIFY). If not empty, ignore Test Validation | X (not empty) QUERYTIMEOUT | Query time execution thereshold. Parameter for timout command | 5s (limit is 5 seconds)

Configure database properties

Prepare the server, the client and the connection. https://github.com/stanislawbartkowski/mytpcds/wiki contains a bunch of useful informations.

In conn.rc file uncomment the property file appropriate for a particular database and modify the file according to the environment.

https://github.com/stanislawbartkowski/mytpcds/blob/master/run.sh

For instance, for Oracle

#export ENV=env/db2
#export ENV=env/bigsql
export ENV=env/oracle
#export ENV=env/hive
#export ENV=env/postgresql
#export ENV=env/sparksql
#export ENV=env/thrive
#export ENV=env/phoenix
#export ENV=env/mysql
#export ENV=env/netezza

Qualify Test

Qualify dataset

Qualify database is 1GB size.

./dsdgen -dir ../work/data -sc 1

Qualify queries

Qualify queries are used to validate SQL SELECT statements. Qualify queries contains hardcoded parameter values and executed against qualify dataset should yield the same result set. The run.sh queryqualification task creates a serie of qualify queries. Every test templates has corresponding parameter definition in qualification directory. For instance, query template query_templates/query4.tpl has corresponding parameters set qualification/4.par

YEAR=2001
SELECTONE=t_s_secyear.customer_preferred_cust_flag

To prepare qualification queries:

Configure conn.rc and env/{db}proc.sh file
run.sh file, uncomment ./tpc.sh queryqualification line and comment out all other ./tpc.sh lines.
./run.sh

qgen2 Query Generator (Version 2.11.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2019
Warning: This scale factor is valid for QUALIFICATION ONLY
Parsed 99 templates
PASSED

Step 1, test the connection

Uncomment #./tpc.sh test line in run.sh file

./tpc.sh test
#./tpc.sh removedata
#./tpc.sh createtables
#./tpc.sh loadtest
#./tpc.sh testverify
#./tpc.sh testquery

#./tpc.sh loaddata
#./tpc.sh verifyload
#./tpc.sh runqueries

Execute ./run.sh. If connection is working, the output should be:

./run.sh 
PASSED

If the connection is not configured properly then look at output log file. The log directory is specified as LOGDIR in the resource file.

tail -f /tmp/mytpcds/oraclelog/mytcp.log

Mytpcds

Install / Use

README