Gonymizer
Gonymizer: A Tool to Anonymize Sensitive PostgreSQL Data Tables for Use in QA and Testing

Weird name, what does it do?
The Gonymizer project (Go + Anonymizer) was built at SmithRx in the hope of simplifying the QA process. Gonymizer is written in Golang and is meant to help database administrators and infrastructure folks easily anonymize production database dumps before loading the data into a QA environment.
We have built-in support, and examples, for:
- Kubernetes CRONJOB scheduling
- AWS-S3 Storage processing and loading
We plan to add built-in support for:
- CRONJOB BASH scripts to use local disk as storage (see tasks, we need help!)
- AWS-Lambda Job scheduling (see tasks, we need help!)
Our API is an easy one to follow and we encourage others to join in by trying Gonymizer with their own development and staging environments either directly using the CLI or using the API. We include in our documentation: example configurations, best practices, Kubernetes CRONJOB examples, examples for AWS-Lambda, and other infrastructure tools. Please see the docs directory in this application to see a full how-to guide and where to get started.
Supported RDBMS
Currently Gonymizer only supports PostgreSQL 9.x-11.x. We have not tested Gonymizer on versions 12+, but plan to in the near future. If you would like to help by adding support for other database management systems or new processors, or if you have general questions, please see the CONTRIBUTING.md file in this repository.
Abbreviations and Definitions
- HIPAA: Health Insurance Portability and Accountability Act of 1996
- PCI DSS: Payment Card Industry Data Security Standard
- PHI: Protected Health Information
- PII: Personally identifiable information
In this document/codebase, we use PHI and PII interchangeably.
Getting Started
If you are a seasoned Go veteran or already have an environment that contains Go >= 1.11, you can skip to the next section.
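To check whether an existing installation already meets the Go >= 1.11 requirement, a small shell sketch along these lines may help. The helper name version_ge is hypothetical and not part of Gonymizer; it compares only the major and minor numbers and assumes `go version` prints the version as its third field (for example `go1.11.2`).

```shell
#!/bin/sh
# Hypothetical helper (not part of Gonymizer): succeeds when version $1 is
# greater than or equal to version $2, comparing major.minor only.
version_ge() {
  have_major="${1%%.*}"
  want_major="${2%%.*}"
  # Second dot-separated field; empty when the version has no dot.
  have_minor="$(printf '%s' "$1" | cut -s -d. -f2)"
  want_minor="$(printf '%s' "$2" | cut -s -d. -f2)"
  # Drop any non-numeric suffix such as "rc1".
  have_minor="${have_minor%%[!0-9]*}"
  want_minor="${want_minor%%[!0-9]*}"
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
    return
  fi
  [ "${have_minor:-0}" -ge "${want_minor:-0}" ]
}

if command -v go >/dev/null 2>&1; then
  # "go version" prints something like: go version go1.11.2 darwin/amd64
  installed="$(go version | awk '{print $3}')"
  installed="${installed#go}"
  if version_ge "$installed" 1.11; then
    echo "Go $installed meets the >= 1.11 requirement"
  else
    echo "Go $installed is older than 1.11; please upgrade"
  fi
else
  echo "Go is not installed"
fi
```

If the script reports a suitable version, the platform-specific installation steps can be skipped.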
OSX
Gonymizer requires a complete installation of Go >= 1.11. To install Go on OSX you can run the following:
brew install go
Once this is complete we will need to make sure our Go paths are set correctly in our BASH profile. NOTE: You may need to change the directories below to match your setup.
echo "
export GOPATH=~/go
export GOROOT=/usr/local/Cellar/go/1.11.2/libexec
export GO111MODULE=on
" >> ~/.profile
It is recommended to put all Go source code under ~/go. Once this is complete we can attempt to build the application:
cd ~/go/src/github.com/smithoss/gonymizer/scripts
./build.sh
The build script will build two binaries: one for macOS on the amd64 architecture and one for Linux on amd64. These binaries are stored under the Gonymizer/bin directory. Now that we have a built binary we can attempt to run the dump command using our JSON configuration:
./gonymizer-darwin -c ~/conf/gonymizer-config-file.json dump
Debian 9.x / Ubuntu 18.04
Use the following steps to get up and going. Commands should be similar for Debian 9.x and Ubuntu 18.04.
- Install Golang and Git
sudo apt-get install golang git
- Add go path to profile
echo "
export GOPATH=~/go
export GO111MODULE=on
" >> ~/.bashrc
- Git checkout
mkdir -p ~/go/src/github.com/smithoss/
cd ~/go/src/github.com/smithoss/
git clone https://github.com/smithoss/Gonymizer.git gonymizer
- Build the project
cd gonymizer/cmd/
go build -o ../bin/gonymizer .
- Run the binary
cd ../bin
./gonymizer --help
Configuration
Gonymizer has many different configuration settings that can be enabled or disabled using the command line options.
It is recommended that one run gonymizer --help or gonymizer CMD --help where CMD is one of the commands to see
which options are available at any given time.
Below we give examples of both the CLI configuration and how to create your map file.
CLI Configuration
Gonymizer was built using the Cobra and Viper Golang libraries, which allow it to be configured in whichever way you prefer. We recommend using a JSON, YAML, or TOML file to configure Gonymizer. Below we will go over an example configuration for running Gonymizer.
For an example of how to set up a CLI configuration, see our Dell Store 2 example in docs/demo/dellstore2/gonymizer_config.json:
{
  "comment": "This example is viewable under docs/demo/dellstore2",
  "num-workers": 2,
  "dump": {
    "database": "store",
    "disable-ssl": true,
    "dump-file": "phi_dump.sql",
    "exclude-schema": [
      "pg*",
      "information_schema"
    ],
    "host": "localhost",
    "port": 5432,
    "schema": ["public"],
    "row-count-file": "row-counts.csv",
    "username": "levi"
  }
}
comment: is used to leave comments for the reader and is not used by the application.
log-level: is the level the application uses to decide what should be displayed on the screen. Choices are: FATAL,
ERROR, WARN, INFO, DEBUG. We use the Logrus Golang library for logging, so please read the Logrus documentation
for more information.
database: is the master database, containing PHI and PII, from which the SQL dump file is created.
host: is the hostname of the master database with PHI and PII.
port: is the host port that will be used to connect to the master database with PHI and PII.
username: is the username that will be used to connect to the master database with PHI and PII.
password: is the password that will be used to connect to the master database with PHI and PII.
disable-ssl: disables SSL for the connection to the master database with PHI and PII.
dump-file: is where Gonymizer will store the SQL statements from the dump command.
map-file: is the file that Gonymizer uses to map out which columns need to be anonymized and how. When using the
map command in conjunction with --map-file, or with the configuration above, a file is created with a name similar
to the map-file, but with skeleton in the name instead. More on this below in the map section.
exclude-table: is a list of tables that are not to be included during the pg_dump step of the extraction process.
This allows us to focus only on tables that are needed for our base application to work. Using this option minimizes
the size of our dump file and in turn decreases the amount of time needed for dumping, processing, and
reloading. This option operates in the same fashion as pg_dump's --exclude-table option.
exclude-table-data: is a list of tables that should be included in the pg_dump process but without any of their
data (table schema only). The usage and advantages are the same as for the exclude-table feature explained above,
and the option is identical to pg_dump's --exclude-table-data option.
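As a rough illustration of the two options above, the sketch below writes a small configuration file that uses both keys. It assumes that exclude-table and exclude-table-data sit in the dump section and take a list, in the same way exclude-schema does in the example configuration; the table names public.audit_log and public.sessions are hypothetical.

```shell
#!/bin/sh
# Write an illustrative config to a temporary file.
# Assumption: both keys accept a list inside the "dump" section.
# The table names are made up for this example.
config_file="$(mktemp)"
cat > "$config_file" <<'EOF'
{
  "dump": {
    "database": "store",
    "exclude-table": ["public.audit_log"],
    "exclude-table-data": ["public.sessions"]
  }
}
EOF

# For comparison, the equivalent pg_dump options would be:
#   pg_dump --exclude-table='public.audit_log' \
#           --exclude-table-data='public.sessions' store
echo "Wrote example configuration to $config_file"
```

With a configuration like this, the first table would be left out of the dump entirely, while the second would keep its table definition but none of its rows.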
schema: is a list of schemas that Gonymizer should dump from the master database. This option must be in the form
of a list if you are using the configuration methods mentioned above.
exclude-schema: is a list of system level schemas that Gonymizer should ignore when adding CREATE SCHEMA statements
to the dump file. These schemas may still be included in the --schema option, for example the public schema.
schema-prefix: is the prefix used in an environment where several schemas share a common prefix. This
is the same as a sharded architecture design, which is outside the scope of this article; it is recommended to read
up on it if you are unfamiliar with this design paradigm.
For example: [company_1, company_2, company_..., company_n-1, company_n] would be
--schema-prefix=company_ --schemas=company
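To make the idea of a shared prefix concrete, the following sketch filters a list of schema names down to those that start with a given prefix. It only illustrates the grouping; it is not how Gonymizer itself resolves schemas, and the helper name schemas_with_prefix and the schema names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of Gonymizer): print every schema name
# from the remaining arguments that starts with the prefix given in $1.
schemas_with_prefix() {
  prefix="$1"
  shift
  for name in "$@"; do
    case "$name" in
      # The quoted prefix is matched literally; * matches the rest.
      "$prefix"*) printf '%s\n' "$name" ;;
    esac
  done
}

# Only the sharded company_ schemas are printed; public and audit are skipped.
schemas_with_prefix company_ public company_1 company_2 audit company_3
```

In a sharded layout, all schemas selected this way would share the same table structure, which is why a single prefix can stand in for the whole group.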
--oids: allows you to provide the --oids option for older versions of pg_dump (prior to version 12)
NOTE: Some arguments are not included here. It is recommen
