Gonymizer

Weird name, what does it do?

Gonymizer (Go + Anonymizer) was built at SmithRx in the hope of simplifying the QA process. It is written in Golang and is meant to help database administrators and infrastructure folks easily anonymize production database dumps before loading the data into a QA environment.

We have built-in support, and examples, for:

Kubernetes CRONJOB scheduling
AWS-S3 Storage processing and loading

We plan to have built-in:

CRONJOB BASH scripts to use local disk as storage (see tasks, we need help!)
AWS-Lambda Job scheduling (see tasks, we need help!)

Our API is easy to follow, and we encourage others to try Gonymizer in their own development and staging environments, either directly through the CLI or through the API. Our documentation includes example configurations, best practices, Kubernetes CRONJOB examples, examples for AWS-Lambda, and other infrastructure tools. Please see the docs directory in this application for a full how-to guide and where to get started.

Supported RDBMS

Currently Gonymizer only supports PostgreSQL 9.x-13.x. Note that we have not yet tested Gonymizer on versions 12+, but plan to in the near future. If you would like to help by adding support for other database management systems or new processors, or if you have general questions, please start with the CONTRIBUTING.md file in this repository.

Abbreviations and Definitions

HIPAA: Health Insurance Portability and Accountability Act of 1996
PCI DSS: Payment Card Industry Data Security Standard
PHI: Protected Health Information
PII: Personally identifiable information

In this document/codebase, we use PHI and PII interchangeably.

Getting Started

If you are a seasoned Go veteran or already have an environment that contains Go >= 1.11, you can skip to the next section.

OSX

Gonymizer requires a complete installation of Go >= 1.11. To install Go on OSX you can run the following:

brew install go

Once this is complete we will need to make sure our Go paths are set correctly in our BASH profile. NOTE: You may need to change the directories below to match your setup.

echo "
export GOPATH=~/go
export GOROOT=/usr/local/Cellar/go/1.11.2/libexec
export GO111MODULE=on
" >> ~/.profile

It is recommended to put all Go source code under ~/go. Once this is complete we can attempt to build the application:

cd ~/go/src/github.com/smithoss/gonymizer/scripts
./build.sh

The build script builds two binaries: one for MacOS on the amd64 architecture and one for Linux on amd64. These binaries are stored under the Gonymizer/bin directory. Now that we have a built binary, we can try it out by running the dump command with our JSON configuration:

./gonymizer-darwin -c ~/conf/gonymizer-config-file.json dump

Debian 9.x / Ubuntu 18.04

Use the following steps to get up and going. Commands should be similar for Debian 9.x and Ubuntu 18.04.

Install Golang and Git

sudo apt-get install golang-go git

Add go path to profile

echo "
export GOPATH=~/go
export GO111MODULE=on
" >> ~/.bashrc

Git checkout

mkdir -p ~/go/src/github.com/smithoss/
cd ~/go/src/github.com/smithoss/
git clone https://github.com/smithoss/Gonymizer.git gonymizer

Build the project

cd gonymizer/cmd/
go build -o ../bin/gonymizer .

Run the binary

cd ../bin
./gonymizer --help

Configuration

Gonymizer has many different configuration settings that can be enabled or disabled using the command line options. It is recommended that one run gonymizer --help or gonymizer CMD --help where CMD is one of the commands to see which options are available at any given time.

Below we give examples of both the CLI configuration as well as examples on how to create your map file.

CLI Configuration

Gonymizer was built using the Cobra + Viper Golang libraries to allow for easy configuration however you like it. We recommend using a JSON, YAML, or TOML file to configure Gonymizer. Below we will go over an example configuration for running Gonymizer.

For an example of how to set up a CLI configuration, check our Dell Store 2 example in docs/demo/dellstore2/gonymizer_config.json:

{
    "comment": "This example is viewable under docs/demo/dellstore2",
    "num-workers": 2,
    "dump": {
        "database": "store",
        "disable-ssl": true,
        "dump-file": "phi_dump.sql",
        "exclude-schema": [
            "pg*",
            "information_schema"
        ],
        "host": "localhost",
        "port": 5432,
        "schema": ["public"],
        "row-count-file": "row-counts.csv",
        "username": "levi"
    }
}
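Before handing a configuration file to Gonymizer, it can be useful to confirm that the file is well-formed JSON. The sketch below is one way to do that with Go's standard encoding/json package. It is not how Gonymizer itself loads its configuration (that goes through Viper), and the Config and DumpConfig types and the parseConfig function are hypothetical names introduced for illustration; only the JSON keys are taken from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DumpConfig mirrors the "dump" object of the example configuration.
// The Go type and field names are hypothetical; only the JSON keys
// come from the example.
type DumpConfig struct {
	Database      string   `json:"database"`
	DisableSSL    bool     `json:"disable-ssl"`
	DumpFile      string   `json:"dump-file"`
	ExcludeSchema []string `json:"exclude-schema"`
	Host          string   `json:"host"`
	Port          int      `json:"port"`
	Schema        []string `json:"schema"`
	RowCountFile  string   `json:"row-count-file"`
	Username      string   `json:"username"`
}

// Config mirrors the top level of the example configuration.
type Config struct {
	Comment    string     `json:"comment"`
	NumWorkers int        `json:"num-workers"`
	Dump       DumpConfig `json:"dump"`
}

// parseConfig decodes a whole configuration document. json.Unmarshal
// validates the entire input, so malformed JSON (for example an
// unbalanced brace) is reported as an error. Keys that the structs do
// not know about are ignored.
func parseConfig(raw string) (Config, error) {
	var cfg Config
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return Config{}, err
	}
	return cfg, nil
}

const sample = `{
    "comment": "This example is viewable under docs/demo/dellstore2",
    "num-workers": 2,
    "dump": {
        "database": "store",
        "disable-ssl": true,
        "dump-file": "phi_dump.sql",
        "exclude-schema": ["pg*", "information_schema"],
        "host": "localhost",
        "port": 5432,
        "schema": ["public"],
        "row-count-file": "row-counts.csv",
        "username": "levi"
    }
}`

func main() {
	cfg, err := parseConfig(sample)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Printf("dump %s from %s:%d as %s into %s\n",
		cfg.Dump.Database, cfg.Dump.Host, cfg.Dump.Port,
		cfg.Dump.Username, cfg.Dump.DumpFile)
}
```

A check like this catches syntax problems early, before a long-running dump is started.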

comment: is used to leave comments for the reader and is not used by the application.

log-level: sets which messages the application displays on screen. Choices are: FATAL, ERROR, WARN, INFO, DEBUG. We use the Logrus Golang library for logging, so please read the Logrus documentation for more information.

database: is the master database, containing PHI and PII, from which the SQL dump file is created.

host: is the hostname of the master database, containing PHI and PII, from which the SQL dump file is created.

port: is the host port that will be used to connect to the master database with PHI and PII.

username: is the username that will be used to connect to the master database with PHI and PII.

password: is the password that will be used to connect to the master database with PHI and PII.

disable-ssl: disables SSL for the connection to the master database with PHI and PII.
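To illustrate how the connection settings fit together, here is a minimal sketch that assembles them into a PostgreSQL connection URI using only the Go standard library. It assumes that disable-ssl corresponds to sslmode=disable in the URI; Gonymizer's own connection code is not shown in this document and may differ. The connectionURL function is a hypothetical helper, not part of Gonymizer's API.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// connectionURL builds a postgres:// URI from the dump settings.
// Hypothetical helper: the mapping of disable-ssl to sslmode=disable
// is an assumption made for this sketch.
func connectionURL(host string, port int, database, username, password string, disableSSL bool) string {
	u := url.URL{
		Scheme: "postgres",
		Host:   net.JoinHostPort(host, strconv.Itoa(port)),
		Path:   "/" + database,
	}
	if password != "" {
		// UserPassword percent-encodes reserved characters in the password.
		u.User = url.UserPassword(username, password)
	} else {
		u.User = url.User(username)
	}
	if disableSSL {
		q := url.Values{}
		q.Set("sslmode", "disable")
		u.RawQuery = q.Encode()
	}
	// When disableSSL is false no sslmode is set, which leaves the
	// client library's default in effect.
	return u.String()
}

func main() {
	// Values taken from the Dell Store 2 example configuration.
	fmt.Println(connectionURL("localhost", 5432, "store", "levi", "", true))
}
```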

dump-file: is where Gonymizer will store the SQL statements from the dump command.

map-file: is the file that Gonymizer uses to map out which columns need to be anonymized and how. When using the map command in conjunction with --map-file, or with map-file set in the configuration file, a file is created with a name similar to the map-file's, but with skeleton in the name. More on this below in the map section.

exclude-table: is a list of tables that are not to be included during the pg_dump step of the extraction process. This allows us to focus only on the tables that our base application needs in order to work. Using this option minimizes the size of our dump file and in turn decreases the amount of time needed for dumping, processing, and reloading. This option operates in the same fashion as pg_dump's --exclude-table option.

exclude-table-data: is a list of tables to include in the pg_dump process without any of their data (table schema only). The usage and advantages are the same as for the exclude-table feature explained above, and the option is identical to pg_dump's --exclude-table-data option.

schema: is a list of schemas that Gonymizer should dump from the master database. This option must be in the form of a list if you are using the configuration methods mentioned above.

exclude-schema: is a list of system level schemas that Gonymizer should ignore when adding CREATE SCHEMA statements to the dump file. These schemas may still be included in the --schema option, for example the public schema.

schema-prefix: is the prefix used in an environment where several schemas share a common prefix. This is the same as a sharded architecture design, which is outside the scope of this article; it is recommended to read up on it if you are unfamiliar with this design paradigm. For example: [company_1, company_2, company_..., company_n-1, company_n] would be --schema-prefix=company_ --schemas=company
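The idea behind a shared prefix can be sketched as a simple filter over schema names. This is only an illustration of prefix matching under the assumption that a schema belongs to the group when its name starts with the prefix; it is not Gonymizer's implementation, and schemasWithPrefix is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// schemasWithPrefix returns the schemas whose names start with prefix,
// preserving the input order. Hypothetical helper for illustration.
func schemasWithPrefix(all []string, prefix string) []string {
	var matched []string
	for _, name := range all {
		if strings.HasPrefix(name, prefix) {
			matched = append(matched, name)
		}
	}
	return matched
}

func main() {
	all := []string{"public", "company_1", "company_2", "company_3", "pg_catalog"}
	fmt.Println(schemasWithPrefix(all, "company_")) // prints [company_1 company_2 company_3]
}
```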

--oids: allows you to provide the --oids option for older versions of pg_dump (prior to version 12)

NOTE: Some arguments are not included here. It is recommen
