[Data]nymizer
<img align="right" alt="datanymizer" src="https://raw.githubusercontent.com/datanymizer/datanymizer/master/logo.png">
Powerful database anonymizer with flexible rules. Written in Rust.
Datanymizer is created & supported by Evrone. See what else we develop with Rust.
You can find more information in these articles in English and Russian.
How it works
Database -> Dumper (+Faker) -> Dump.sql
You can import or process your dump with a supported database without any 3rd-party importers:
Datanymizer generates a database-native dump.
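For illustration, a minimal configuration replaces a single sensitive column with fake values (the users table and its email column are hypothetical here; the email rule is listed in "Available rules" below):

```yaml
# config.yml
tables:
  - name: users
    rules:
      email:
        email: {}   # replace every value in users.email with a fake email
```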
Installation
There are several ways to install pg_datanymizer; choose the option that is most convenient for you.
Pre-compiled binary
# Linux / macOS / Windows (MINGW, etc.). Installs into ./bin/ by default
$ curl -sSfL https://raw.githubusercontent.com/datanymizer/datanymizer/main/cli/pg_datanymizer/install.sh | sh -s
# Or a shorter way
$ curl -sSfL https://git.io/pg_datanymizer | sh -s
# Specify installation directory and version
$ curl -sSfL https://git.io/pg_datanymizer | sudo sh -s -- -b /usr/local/bin v0.2.0
# Alpine Linux (wget)
$ wget -q -O - https://git.io/pg_datanymizer | sh -s
Homebrew / Linuxbrew
# Installs the latest stable release
$ brew install datanymizer/tap/pg_datanymizer
# Builds the latest version from the repository
$ brew install --HEAD datanymizer/tap/pg_datanymizer
Docker
Docker images are available for different PostgreSQL versions. Use the tag format <version>-pg<pg_version>:
# Latest version with PostgreSQL latest
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:latest
# Specific version (e.g., 0.7.4) with different PostgreSQL versions
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4 # PostgreSQL latest
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4-pg17 # PostgreSQL 17
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4-pg16 # PostgreSQL 16
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4-pg15 # PostgreSQL 15
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4-pg14 # PostgreSQL 14
Available PostgreSQL versions: 14, 15, 16, 17, and latest (no suffix).
Getting started with CLI dumper
First, inspect your database schema, choose fields with sensitive data, and create a config file based on it.
# config.yml
tables:
  - name: markets
    rules:
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 1
                max: 2
            - words:
                min: 1
                max: 2
  - name: franchisees
    rules:
      operator_mail:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
      operator_name:
        first_name: {}
      operator_phone:
        phone:
          format: +###########
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 2
                max: 3
            - words:
                min: 2
                max: 3
  - name: users
    rules:
      first_name:
        first_name: {}
      last_name:
        last_name: {}
  - name: customers
    rules:
      email:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
                uniq:
                  required: true
                  try_count: 5
      phone:
        phone:
          format: +7##########
          uniq: true
      city:
        city: {}
      age:
        random_num:
          min: 10
          max: 99
      first_name:
        first_name: {}
      last_name:
        last_name: {}
      birth_date:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
Then create a dump from your database instance:
pg_datanymizer -f /tmp/dump.sql -c ./config.yml postgres://postgres:postgres@localhost/test_database
This creates a new dump file /tmp/dump.sql containing a native SQL dump for a PostgreSQL database.
You can import the anonymized data from this dump into a new PostgreSQL database with:
psql -U postgres -d new_database < /tmp/dump.sql
The dumper can also stream the dump to STDOUT (like pg_dump), so you can use it in other pipelines:
pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database > /tmp/dump.sql
Additional options
Tables filter
You can specify which tables to dump and which to ignore.
To dump data only from public.markets and public.users:
# config.yml
#...
filter:
  only:
    - public.markets
    - public.users
To ignore these tables and dump data from all others:
# config.yml
#...
filter:
  except:
    - public.markets
    - public.users
You can also specify data and schema filters separately.
This is equivalent to the previous example.
# config.yml
#...
filter:
  data:
    except:
      - public.markets
      - public.users
To skip both schema and data of all other tables:
# config.yml
#...
filter:
  schema:
    only:
      - public.markets
      - public.users
To skip the schema of the markets table and dump data only from the users table:
# config.yml
#...
filter:
  data:
    only:
      - public.users
  schema:
    except:
      - public.markets
You can use wildcards in the filter section:
? matches exactly one occurrence of any character; * matches any number of occurrences (including zero) of any character.
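For instance, assuming a hypothetical schema with per-year log tables and scratch tables, wildcards can exclude them all at once:

```yaml
# config.yml
filter:
  except:
    - public.logs_20??   # matches logs_2023, logs_2024, etc.
    - public.tmp_*       # matches any table whose name starts with tmp_
```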
Dump conditions and limit
You can specify conditions (SQL WHERE statement) and limit for dumped data per table:
# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # select maximum 100 rows
      limit: 100
Transform conditions and limit
As an additional option, you can specify SQL conditions that define which rows will be transformed (anonymized):
# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # preserve original values for some rows
      transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"
      # select maximum 100 rows
      limit: 100
You can use the dump_condition, transform_condition and limit options in any combination (only
transform_condition; transform_condition and limit; etc).
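For example, a config using only transform_condition dumps every row but anonymizes just a subset (the table and column names here are hypothetical):

```yaml
# config.yml
tables:
  - name: people
    query:
      # anonymize real customers only; keep seeded test fixtures as-is
      transform_condition: "email NOT LIKE '%@example.com'"
```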
SQL assertions
You can validate source data before dump generation with SQL-based assertions. Assertions may be defined globally or per table, and both forms use the same schema.
For a complete ready-to-copy example, see docs/examples/asserts.yml.
asserts:
  - name: no_duplicate_emails
    sql: |
      select email
      from users
      group by email
      having count(*) > 1
    expect: no_rows

tables:
  - name: users
    rules: {}
    asserts:
      - name: users_count
        sql: select count(*) from users
        expect:
          eq: 100
expect: no_rows requires the query to return zero rows. expect.eq requires a scalar query that
returns exactly one value. Optional severity: warn logs a warning instead of stopping the dump.
expect: rows_exist requires the query to return at least one row. Scalar comparisons also support
not_eq, gt, gte, lt, and lte, and several scalar operators may be combined with AND
semantics.
asserts:
  - name: pending_jobs_in_range
    sql: select count(*) from jobs where state = 'pending'
    expect:
      gt: 0
      lte: 100
Assertions are executed before the schema/data dump starts. severity: error stops the dump,
while severity: warn reports the failure and continues.
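A sketch of a non-fatal assertion (the orders table and its columns are hypothetical):

```yaml
asserts:
  - name: recent_orders_present
    sql: select count(*) from orders where created_at > now() - interval '30 days'
    expect:
      gt: 0
    # report the failure but still produce the dump
    severity: warn
```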
Global variables
You can specify global variables available from any template rule.
# config.yml
tables:
  users:
    bio:
      template:
        format: "User bio is {{var_a}}"
    age:
      template:
        format: "{{_0 | float * global_multiplicator}}"
#...
globals:
  var_a: Global variable 1
  global_multiplicator: 6
Available rules
| Rule | Description |
|--------------------------------|------------------------------------------------------------------------------|
| email | Emails with different options |
| ip | IP addresses. Supports IPv4 and IPv6 |
| words                          | Lorem words of variable length                                               |
| first_name                     | First name generator                                                         |
