Syngen

Open-source version of the TDspora synthetic data generation algorithm.

EPAM Syngen

EPAM Syngen is an unsupervised tabular data generation tool. It is useful for generating test data from a given table used as a template. Most datatypes are supported, including floats, integers, datetime, text, categorical, and binary. Linked tables, i.e., tables sharing a key, can also be generated using a simple statistical approach. Source data may be in CSV, Avro, or Excel format; it must be stored locally and encoded in UTF-8.

The tool is based on a variational autoencoder (VAE) model. A Bayesian Gaussian mixture model is used to further disentangle the latent space.

Prerequisites

Python 3.10 or 3.11 is required to run the library. The library is tested on Linux and Windows. You can download Python from the official website and install it manually, or install it from your terminal. After installing Python, check that pip is installed as well.
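
For example, a quick check from the terminal (this sketch assumes the interpreter is on your PATH as `python3`):

```shell
# Verify the interpreter version and that pip is available for it.
python3 --version          # should print Python 3.10.x or 3.11.x
python3 -m pip --version   # confirms pip is installed for this interpreter
```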

Getting started

Before installing the library, set up a virtual environment.

To install the CLI-only version, run:

pip install syngen

If you want the UI version with Streamlit instead, run:

pip install syngen[ui]
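
The setup steps above can be sketched as one shell session (a minimal sketch for POSIX shells; the environment name `.venv` is an arbitrary choice):

```shell
# Create and activate an isolated environment before installing syngen.
python3 -m venv .venv      # create the environment in ./.venv
. .venv/bin/activate       # activate it
python -m pip --version    # pip inside the venv is what "pip install syngen" will use
```

Note that in some shells (e.g. zsh) the extras syntax needs quoting: `pip install "syngen[ui]"`.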

Note: see details of the UI usage in the corresponding section

The training and inference processes are split into two CLI entry points. The training entry point receives a path to the original table (or a metadata file), the table name, and the hyperparameters to use.

To start training with default parameters, run:

train --source PATH_TO_ORIGINAL_CSV \
    --table_name TABLE_NAME

This will train a model and save the model artifacts to disk.

To generate data with default parameters, simply call:

infer --table_name TABLE_NAME

<i>Please note that the table name must match the one used during training.</i><br> This will create a CSV file with the synthetic table in <i>./model_artifacts/tmp_store/TABLE_NAME/merged_infer_TABLE_NAME.csv</i>.<br>

Here is a quick example:

train --source ./examples/example-data/housing.csv --table_name Housing
infer --table_name Housing

As an example, you can use the <i>"Housing"</i> dataset in examples/example-data/housing.csv. Here the real-world data is <a href="https://www.kaggle.com/datasets/camnugent/california-housing-prices" target="_blank">"Housing"</a> from Kaggle.

Features

Training

You can add flexibility to the training and inference processes using additional hyperparameters.<br> To train a single table, call:

train --source PATH_TO_ORIGINAL_CSV \
    --table_name TABLE_NAME \
    --epochs INT \
    --row_limit INT \
    --drop_null BOOL \
    --reports STR \
    --batch_size INT \
    --log_level STR \
    --fernet_key STR

Note: To specify multiple options for the --reports parameter, you need to provide the --reports parameter multiple times. For example:

train --source PATH_TO_ORIGINAL_CSV \
    --table_name TABLE_NAME \
    --reports accuracy \
    --reports sample

The accepted values for the parameter <i>"reports"</i>:

  • <i>"none"</i> (default) - no reports will be generated
  • <i>"accuracy"</i> - generates an accuracy report to measure the quality of synthetic data relative to the original dataset. This report is produced after the completion of the training process, during which a model learns to generate new data. The synthetic data generated for this report is of the same size as the original dataset to reach more accurate comparison.
  • <i>"sample"</i> - generates a sample report (if original data is sampled, the comparison of distributions of original data and sampled data is provided in the report)
  • <i>"metrics_only"</i> - outputs the metrics information only to standard output without generation of an accuracy report
  • <i>"all"</i> - generates both accuracy and sample reports

To train one or more tables using a metadata file, you can use the following command:

train --metadata_path PATH_TO_METADATA_YAML
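
The exact metadata schema is documented in the project repository; the sketch below only illustrates the idea, grouping the parameters from this section per table. The top-level table keys, the `train_settings`/`infer_settings` grouping, and the file paths are assumptions for illustration, not verified schema:

```yaml
# Hypothetical metadata file for two tables; field grouping is an assumption.
customers:
  train_settings:
    source: ./data/customers.csv
    epochs: 20
    drop_null: false
    reports:
      - accuracy
      - sample
  infer_settings:
    size: 1000
    random_seed: 42
orders:
  train_settings:
    source: ./data/orders.csv
    row_limit: 50000
```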

Parameters that you can set up for training process:

  • <i>source</i> – required parameter for training of a single table; a path to the file that you want to use as a reference
  • <i>table_name</i> – required parameter for training of a single table; an arbitrary string used to name the directories
  • <i>epochs</i> – the number of training epochs. Since an early-stopping mechanism is implemented, larger values are generally better
  • <i>row_limit</i> – a number of rows to train over. A number less than the original table length will randomly subset the specified number of rows
  • <i>drop_null</i> – whether to drop rows with at least one missing value
  • <i>batch_size</i> – if specified, the training is split into batches. This can reduce RAM usage
  • <i>reports</i> - controls the generation of quality reports; might require significant time for big tables (>10000 rows)
  • <i>metadata_path</i> – a path to the metadata file
  • <i>column_types</i> - might include the section <i>categorical</i>, which contains columns explicitly defined as categorical by the user
  • <i>log_level</i> - logging level for the process
  • <i>fernet_key</i> - the name of the environment variable that holds the Fernet key used to encrypt the sample of the original data. If the key is not set, the sample is stored in '.pkl' format; if it is set, the sample is encrypted and stored in '.dat' format. The same key must be used for both training and inference so that the data can be decrypted correctly.

Requirements for parameters of training process:

  • <i>source</i> - data type - string
  • <i>table_name</i> - data type - string
  • <i>epochs</i> - data type - integer, must be equal to or more than 1, default value is 10
  • <i>row_limit</i> - data type - integer
  • <i>drop_null</i> - data type - boolean, default value - False
  • <i>batch_size</i> - data type - integer, must be equal to or more than 1, default value - 32
  • <i>reports</i> - data type - string when passed through the CLI; string or list when set in the metadata file. Accepted values: <i>"none"</i> (default) - no reports, <i>"all"</i> - both accuracy and sample reports, <i>"accuracy"</i> - an accuracy report, <i>"sample"</i> - a sample report, <i>"metrics_only"</i> - metrics printed to standard output without a report. In the metadata file, multiple values can be specified as a list of the available options (<i>"accuracy"</i>, <i>"sample"</i>, <i>"metrics_only"</i>) to generate several report types at once, e.g. [<i>"metrics_only"</i>, <i>"sample"</i>]
  • <i>metadata_path</i> - data type - string
  • <i>column_types</i> - data type - dictionary with the key <i>categorical</i> - the list of columns (data type - string)
  • <i>log_level</i> - data type - string, must be one of the following values: "TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"; default value is "INFO"
  • <i>fernet_key</i> - data type - string, the name of the environment variable that holds the Fernet key. The key must be a 44-character URL-safe base64-encoded string; default value is None
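
A key in that format can be produced with the Python standard library alone. In this sketch the variable name `FERNET_KEY` is an arbitrary choice; what you pass to `--fernet_key` is the variable's name, not the key itself:

```shell
# Generate a 44-character URL-safe base64 key (32 random bytes) and export it.
export FERNET_KEY=$(python3 -c "import base64, os; print(base64.urlsafe_b64encode(os.urandom(32)).decode())")
echo "${#FERNET_KEY}"   # a well-formed key is exactly 44 characters long
```

You would then run, e.g., `train --source ... --table_name ... --fernet_key FERNET_KEY`, and pass the same variable name at inference time.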

Inference (generation)

You can customize the inference process for a single table by calling:

infer --size INT \
    --table_name STR \
    --run_parallel BOOL \
    --batch_size INT \
    --random_seed INT \
    --reports STR \
    --log_level STR \
    --fernet_key STR

Note: To specify multiple options for the --reports parameter, you need to provide the --reports parameter multiple times. For example:

infer --table_name TABLE_NAME \
    --reports accuracy \
    --reports metrics_only

The accepted values for the parameter <i>"reports"</i>:

  • <i>"none"</i> (default) - no reports will be generated
  • <i>"accuracy"</i> - generates an accuracy report that compares original and synthetic data patterns to verify the quality of the generated data
  • <i>"metrics_only"</i> - outputs the metrics information only to standard output without generation of an accuracy report
  • <i>"all"</i> - generates an accuracy report<br> Default value is <i>"none"</i>.

To generate one or more tables using a metadata file, you can use the following command:

infer --metadata_path PATH_TO_METADATA

Parameters you can set for the generation process:

  • <i>size</i> - the desired number of rows to generate
  • <i>table_name</i> – required parameter for inference of a single table; the name of the table, same as in training
  • <i>run_parallel</i> – whether to use multiprocessing (feasible for tables > 50000 rows)
  • <i>batch_size</i> – if specified, the generation is split into batches. This can reduce RAM usage
  • <i>random_seed</i> – if specified, generates a reproducible result
  • <i>reports</i> - controls the generation of quality reports; might require significant time for big generated tables (>10000 rows)
  • <i>metadata_path</i> – a path to the metadata file
  • <i>log_level</i> - logging level for the process
  • <i>fernet_key</i> - the name of the environment variable that holds the Fernet key used to encrypt the sample of the original data. If the key is not set, the sample is stored in '.pkl' format; if it is set, the sample is encrypted and stored in '.dat' format. The same key must be used for both training and inference.