[Data]nymizer
<img align="right" alt="datanymizer" src="https://raw.githubusercontent.com/datanymizer/datanymizer/master/logo.png">
Powerful database anonymizer with flexible rules. Written in Rust.
Datanymizer is created & supported by Evrone. See what else we develop with Rust.
You can find more information in these articles in English and Russian.
How it works
Database -> Dumper (+Faker) -> Dump.sql
You can import or process your dump with a supported database without any 3rd-party importers:
Datanymizer generates a database-native dump.
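For illustration, a minimal configuration replaces a single sensitive column with fake values (the users table and its email column are hypothetical here; the email rule is listed in "Available rules" below):

```yaml
# config.yml
tables:
  - name: users
    rules:
      email:
        email: {}   # replace every value in users.email with a fake email
```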
Installation
There are several ways to install pg_datanymizer; choose the option that is most convenient for you.
Pre-compiled binary
# Linux / macOS / Windows (MINGW, etc.). Installs into ./bin/ by default
$ curl -sSfL https://raw.githubusercontent.com/datanymizer/datanymizer/main/cli/pg_datanymizer/install.sh | sh -s
# Or a shorter way
$ curl -sSfL https://git.io/pg_datanymizer | sh -s
# Specify installation directory and version
$ curl -sSfL https://git.io/pg_datanymizer | sudo sh -s -- -b /usr/local/bin v0.2.0
# Alpine Linux (wget)
$ wget -q -O - https://git.io/pg_datanymizer | sh -s
Homebrew / Linuxbrew
# Installs the latest stable release
$ brew install datanymizer/tap/pg_datanymizer
# Builds the latest version from the repository
$ brew install --HEAD datanymizer/tap/pg_datanymizer
Docker
Docker images are available for different PostgreSQL versions. Use the tag format <version>-pg<pg_version>:
# Latest version with PostgreSQL latest
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:latest
# Specific version (e.g., 0.7.4) with different PostgreSQL versions
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4 # PostgreSQL latest
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4-pg17 # PostgreSQL 17
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4-pg16 # PostgreSQL 16
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4-pg15 # PostgreSQL 15
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer:0.7.4-pg14 # PostgreSQL 14
Available PostgreSQL versions: 14, 15, 16, 17, and latest (no suffix).
Getting started with CLI dumper
First, inspect your database schema, choose fields with sensitive data, and create a config file based on it.
# config.yml
tables:
  - name: markets
    rules:
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 1
                max: 2
            - words:
                min: 1
                max: 2
  - name: franchisees
    rules:
      operator_mail:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
      operator_name:
        first_name: {}
      operator_phone:
        phone:
          format: +###########
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 2
                max: 3
            - words:
                min: 2
                max: 3
  - name: users
    rules:
      first_name:
        first_name: {}
      last_name:
        last_name: {}
  - name: customers
    rules:
      email:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
                uniq:
                  required: true
                  try_count: 5
      phone:
        phone:
          format: +7##########
          uniq: true
      city:
        city: {}
      age:
        random_num:
          min: 10
          max: 99
      first_name:
        first_name: {}
      last_name:
        last_name: {}
      birth_date:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
Then create a dump from your database instance:
pg_datanymizer -f /tmp/dump.sql -c ./config.yml postgres://postgres:postgres@localhost/test_database
This creates a new dump file /tmp/dump.sql containing a native SQL dump for a PostgreSQL database.
You can import the anonymized data from this dump into a new PostgreSQL database with:
psql -U postgres -d new_database < /tmp/dump.sql
The dumper can also stream the dump to STDOUT (like pg_dump), so you can use it in other pipelines:
pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database > /tmp/dump.sql
Additional options
Tables filter
You can specify which tables to dump and which to ignore.
To dump data only from public.markets and public.users:
# config.yml
#...
filter:
  only:
    - public.markets
    - public.users
To ignore these tables and dump data from all others:
# config.yml
#...
filter:
  except:
    - public.markets
    - public.users
You can also specify data and schema filters separately.
This is equivalent to the previous example.
# config.yml
#...
filter:
  data:
    except:
      - public.markets
      - public.users
To skip both schema and data of all other tables:
# config.yml
#...
filter:
  schema:
    only:
      - public.markets
      - public.users
To skip the schema of the markets table and dump data only from the users table:
# config.yml
#...
filter:
  data:
    only:
      - public.users
  schema:
    except:
      - public.markets
You can use wildcards in the filter section:
? matches exactly one occurrence of any character; * matches any number of occurrences (including zero) of any character.
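For instance, assuming a hypothetical schema with per-year log tables and scratch tables, wildcards can exclude them all at once:

```yaml
# config.yml
filter:
  except:
    - public.logs_20??   # matches logs_2023, logs_2024, etc.
    - public.tmp_*       # matches any table whose name starts with tmp_
```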
Dump conditions and limit
You can specify conditions (SQL WHERE statement) and limit for dumped data per table:
# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # select maximum 100 rows
      limit: 100
Transform conditions and limit
As an additional option, you can specify SQL conditions that define which rows will be transformed (anonymized):
# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # preserve original values for some rows
      transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"
      # select maximum 100 rows
      limit: 100
You can use the dump_condition, transform_condition and limit options in any combination (only
transform_condition; transform_condition and limit; etc).
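For example, a config using only transform_condition dumps every row but anonymizes just a subset (the table and column names here are hypothetical):

```yaml
# config.yml
tables:
  - name: people
    query:
      # anonymize real customers only; keep seeded test fixtures as-is
      transform_condition: "email NOT LIKE '%@example.com'"
```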
SQL assertions
You can validate source data before dump generation with SQL-based assertions. Assertions may be defined globally or per table, and both forms use the same schema.
For a complete ready-to-copy example, see docs/examples/asserts.yml.
asserts:
  - name: no_duplicate_emails
    sql: |
      select email
      from users
      group by email
      having count(*) > 1
    expect: no_rows

tables:
  - name: users
    rules: {}
    asserts:
      - name: users_count
        sql: select count(*) from users
        expect:
          eq: 100
expect: no_rows requires the query to return zero rows. expect.eq requires a scalar query that
returns exactly one value. Optional severity: warn logs a warning instead of stopping the dump.
expect: rows_exist requires the query to return at least one row. Scalar comparisons also support
not_eq, gt, gte, lt, and lte, and several scalar operators may be combined with AND
semantics.
asserts:
  - name: pending_jobs_in_range
    sql: select count(*) from jobs where state = 'pending'
    expect:
      gt: 0
      lte: 100
Assertions are executed before the schema/data dump starts. severity: error stops the dump,
while severity: warn reports the failure and continues.
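A sketch of a non-fatal assertion (the orders table and its columns are hypothetical):

```yaml
asserts:
  - name: recent_orders_present
    sql: select count(*) from orders where created_at > now() - interval '30 days'
    expect:
      gt: 0
    # report the failure but still produce the dump
    severity: warn
```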
Global variables
You can specify global variables available from any template rule.
# config.yml
tables:
  users:
    bio:
      template:
        format: "User bio is {{var_a}}"
    age:
      template:
        format: "{{_0 | float * global_multiplicator}}"
#...
globals:
  var_a: Global variable 1
  global_multiplicator: 6
Available rules
| Rule | Description |
|--------------------------------|------------------------------------------------------------------------------|
| email | Emails with different options |
| ip | IP addresses. Supports IPv4 and IPv6 |
| words                          | Lorem words of variable length                                               |
| first_name                     | First name generator                                                         |
