SkewedDataGenerator
Skewed Data Generator for TPC-H
Skewed Data Generator for TPC-H:
This is a simple modification of the TPC-H data generator that produces skewed data. Microsoft Research has generously made its skewed version of the data generator available for download, but it supports only Windows. I modified their code so that it also builds and runs in Linux environments.
Disclaimer: I do not own any part of this generator. This is just an attempt to share a simple modification of the original code for easy portability to Linux environments. If you have any issues with this being public, please let me know.
Readme
0. What is this document?
This is the general README file for DBGEN and QGEN, the database population and executable query text generation programs used in the TPC-D benchmark. It covers the proper use of DBGEN and QGEN. For information on porting the utility to your particular platform see Porting.Notes.
1. What is DBGEN?
DBGEN is a database population program for use with the TPC-D benchmark. It is written in ANSI 'C' for portability, and has been successfully ported to over a dozen different systems. While the TPC-D specification allows an implementor to use any utility to populate the benchmark database, the resultant population must exactly match the output of DBGEN. The source code has been provided to make the process of building a compliant database population as simple as possible.
2. What will DBGEN create?
Without any command line options, DBGEN will generate 8 separate ASCII files. Each file will contain pipe-delimited load data for one of the tables defined in the TPC-D database schema. The default tables will contain the load data required for a scale factor 1 database. By default each file will be created in the current directory and be named <table>.tbl. As an example, customer.tbl will contain the load data for the customer table.
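As a rough illustration of what "pipe-delimited" means for downstream tooling, the sketch below extracts one column from such a file with awk. The helper name `tbl_column` is hypothetical, and the sample rows are invented for the example; they are not actual DBGEN output.

```shell
# tbl_column N
# Reads pipe-delimited rows on stdin and prints field N of each row.
# (Hypothetical helper; not part of DBGEN.)
tbl_column() {
    awk -F'|' -v col="$1" '{ print $col }'
}

# Invented sample rows, only to show the field splitting.
printf '1|alpha|10.50\n2|beta|7.25\n' | tbl_column 2
```

Against a real load file the same helper would be used as `tbl_column 2 < customer.tbl`, assuming the file has been generated in the current directory.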
When invoked with the '-U' flag, DBGEN will create the data sets to be used in the update functions and the SQL syntax required to delete the data sets. The update files will be created in the same directory as the load data files and will be named "u_<table>.set". The delete syntax will be written to "delete.set". For instance, the data set to be used in the third query set to update the lineitem table will be named "u_lineitem.tbl.3", and the SQL to remove those rows will be found in "delete.3". The size of the update files can be controlled with the '-r' flag.
3. How is DBGEN built?
Create an appropriate makefile, using makefile.suite as a basis, and type make. Refer to Porting.Notes for more details and for suggested compile time options.
4. Command Line Options for DBGEN
DBGEN's output is controlled by a combination of command line options and environment variables. Command line options are assumed to be single letter flags preceded by a minus sign. They may be followed by an optional argument.
| option | argument | default | action |
|--------|:---------:|:--------:|:----------------------------------------------------------------------|
| -v | none | | Verbose. Progress messages are displayed as data is generated. |
| -f | none | | Force. Existing data files will be overwritten. |
| -F | none | yes | Flat file output. |
| -D | none | | Direct database load. ld_XXXX() routine must be defined in load_stub.c |
| -s | [scale] | 1 | Scale of the database population. Scale 1.0 represents ~1 GB of data |
| -T | [table] | | Generate the data for a particular table ONLY. Arguments: <br> p -- part/partsupp <br> c -- customer <br> s -- supplier <br> o -- order/lineitem <br> t -- time <br> n -- nation <br> r -- region <br> l -- code (same as n and r) <br> O -- order <br> L -- lineitem <br> P -- part <br> S -- partsupp |
| -O | d | | Generate SQL for delete function instead of key ranges |
| -O | f | | Allow over-ride of default output file names |
| -O | h | | Generate headers in flat ASCII files. hd_XXX routines must be defined in load_stub.c |
| -O | m | | Flat files generate fixed length records |
| -O | r | | Generate key ranges for the UF2 update function |
| -O | s | | Generate the state files for the random number generator. |
| -O | t | | Generate the optional time table and its associated join fields |
| -h | | | Display a usage summary |
| -U | [updates] | | Create a specified number of data sets in flat files for the update/delete functions |
| -r | [percentage] | 10 | Scale each update file to the given percentage (expressed in basis points) of the data set |
| -n | [name] | | Use database [name] for in-line load |
| -C | [children] | | Use [children] separate processes to generate data |
| -S | [n] | | Generate the [n]th part of a multi-part load |
5. Building Large Data Sets with DBGEN
DBGEN relies on its own random number generator to ensure that identical data sets can be generated on different platforms. In order to build large data sets using either parallel or multi-stage loads, it is important that the random number generator be started, or "seeded", correctly for each step in the load. DBGEN includes an option to create correct seed files, which are ASCII data files used to seed the random number generator.
Each line in a seed file represents the state of one part of the random
number generator after a stage of the data generation has been completed.
Seed files are named based on the size of the data set being constructed, the
number of children or steps involved in the build, and which step a particular
seed file represents. The naming convention for seed files is mncccsss, where
SF=m * 10^n, ccc is the number of children in the load (in hex) and sss is the
number of the current stage (in hex). For example, the 10th seed file in a 30
process load of 300 GB would be 3201E00A. Since there can be a large number of
seed files, DBGEN allows you to segregate them in the directory named in the
environment variable DSS_SEED.
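The naming convention above can be checked with a few lines of shell. This is a minimal sketch, not code taken from DBGEN: the function name `seed_name` is hypothetical, and it assumes m and n are each a single decimal digit, as the mncccsss pattern implies.

```shell
# seed_name M N CHILDREN STAGE
# Prints the seed file name mncccsss for scale factor M * 10^N,
# with the child count and the stage number as 3-digit upper-case hex.
# (Hypothetical helper; assumes M and N are single decimal digits.)
seed_name() {
    printf '%d%d%03X%03X\n' "$1" "$2" "$3" "$4"
}

# 300 GB = 3 * 10^2, 30 children, 10th stage
seed_name 3 2 30 10
```

For the example in the text (the 10th seed file of a 30-process load at 300 GB) this prints 3201E00A, since 30 is 01E and 10 is 00A in hex.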
6. DBGEN limitations and compliant usage
DBGEN is meant to be a robust population generator for use with the TPC-D benchmark. It is hoped that DBGEN will make it easier to experiment with and become proficient in the execution of TPC-D. As a result, it includes a number of command line options which are not, strictly speaking, necessary to generate a compliant data set for a TPC-D run. In addition, some command line options will accept arguments which result in the generation of NON-COMPLIANT data sets. Options which should be used with care include:
- -s -- scale factor. TPC-D runs are only compliant when run against SFs of 1, 10, 30, 100, 300, 1000 ....
- -r -- refresh percentage. TPC-D runs are only compliant when run with -r 10, the default.
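A wrapper script could guard against accidentally non-compliant scale factors before calling DBGEN. The sketch below is one way to do that; `is_listed_scale_factor` is a hypothetical helper, and it accepts only the values spelled out above. Because the list in the text is open-ended, larger compliant values may exist and are deliberately not guessed here.

```shell
# is_listed_scale_factor SF
# Succeeds only for the scale factors named explicitly in this README.
# The list there ends in "....", so this check is intentionally conservative.
is_listed_scale_factor() {
    case "$1" in
        1|10|30|100|300|1000) return 0 ;;
        *) return 1 ;;
    esac
}

for sf in 1 30 50; do
    if is_listed_scale_factor "$sf"; then
        echo "$sf: listed"
    else
        echo "$sf: not listed"
    fi
done
```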
7. Sample DBGEN executions
DBGEN has been built to allow as much flexibility as possible, but is fundamentally intended to generate two things: a database population against which the queries in TPC-D can be run, and the updates that are used during the update functions in TPC-D. Here are some sample uses of DBGEN.
- To generate the database population for the qualification database:

      dbgen -s 0.1

- To generate the lineitem table only, for a scale factor 10 database, and over-write any existing flat files:

      dbgen -s 10 -f -T L

- To build the seed files necessary to load a 30GB data set in 10 steps, and include some progress reports:

      dbgen -v -O s -s 30 -C 10

- To generate a 100GB data set in 1GB pieces, generate only the part and partsupplier tables, and include some progress reports along the way:

      dbgen -s 100 -S 1 -C 100 -T p -v    (to generate the first 1GB file)
      dbgen -s 100 -S 2 -C 100 -T p -v    (to generate the second 1GB file)
      (and so on, incrementing the argument to -S each time)

- To generate the update files needed for a 4 stream run of the throughput test at 100 GB, using an existing set of seed files from an 8 process load:

      dbgen -s 100 -U 4 -C 8

  Note: the state of the seed files for a given scale factor is the same at the end of the load regardless of the number of children used in the load (the same data has been generated, resulting in the same modifications to the RNG). The -C argument is therefore arbitrary when generating updates. Use whatever seed files are available.
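The multi-part example above repeats the same command with an incrementing -S argument, which is easy to script. The sketch below only prints the command lines rather than running them, so the list can be reviewed first; the function name `print_chunk_commands` is hypothetical, and the flags are exactly the ones used in the 100GB example.

```shell
# print_chunk_commands SCALE CHUNKS TABLE
# Prints one dbgen command line per chunk (a dry run; nothing is executed).
# (Hypothetical helper; not part of DBGEN.)
print_chunk_commands() {
    scale=$1
    chunks=$2
    table=$3
    step=1
    while [ "$step" -le "$chunks" ]; do
        echo "dbgen -s $scale -S $step -C $chunks -T $table -v"
        step=$((step + 1))
    done
}

# Show the first two of the 100 commands for the part/partsupp tables.
print_chunk_commands 100 100 p | sed -n '1,2p'
```

Once the printed list looks right, it could be piped to sh to run the steps one after another, assuming dbgen is on the PATH.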
8. What is QGEN?
QGEN is a query generation program for use with the TPC-D benchmark. It is written in ANSI 'C' for portability, and has been successfully ported to over a dozen different systems. While the TPC-D specification allows an implementor to use any utility to create the benchmark query sets, QGEN has been provided to make the process of building a benchmark implementation as simple as possible.
9. What will QGEN create?
QGEN is a filter, triggered by :'s. It does line-at-a-time reads of its input (more on that later), scanning for :foo, where foo determines the substitution that occurs. Including:
- :[int] -- replace with the appropriate value for parameter [int]
- :b -- replace with START_TRAN (from tpcd.h)
- :c -- replace with SET_DBASE (from tpcd.h)
- :n[int] -- replace with SET_ROWCOUNT([int]) (from tpcd.h)
- :o -- replace with SET_OUTPUT (from tpcd.h)
- :q -- replace with query number
- :s -- replace with stream number
- :x -- replace with GEN_QUERY_PLAN (from tpcd.h)
Qgen takes an assortment of command line options, controlling which of these options should be active during the translation from template to EQT, and a list of query "names". It then translates the template fou
