Data Discovery and Anonymization toolkit

Table of contents

Purpose
Features
Prerequisites
Build from source
Including JDBC Drivers
Extensions
Contributing
How to run
Using argument files
File Discovery
Column Discovery
Data Discovery
Data Extractor
Anonymizer
Requirement Tester
Logging (and database logging)
Upgrading to 2.0
Features and issues
Code quality

Purpose

While performing application development, testing, or maintenance, it is important to operate in an environment that is as close to the production environment as possible when it comes to the amount of data and close-to-real content. At the same time it is important to ensure that data privacy policies are not violated.

Database, column, and file discovery identify and analyze data risks and report on potentially identifiable and personal information that is stored. The database anonymization process anonymizes sensitive data so that information can be transferred between organizations while reducing the risk of unintended disclosure.

The complete source code is available, so you can inspect it and perform security audits if necessary.

This implementation of the Data Discovery program uses Apache OpenNLP.

Features

Identifies sensitive personal data.
Creates a plan (XML document) that defines which columns should be anonymized and how.
Anonymizes the data.
Platform-independent.
Supports Oracle, MariaDB/MySQL, MS SQL Server, and PostgreSQL. Work in progress for DB2.
This tool can help you be GDPR-compliant.

Prerequisites

JDK 11+
Maven 3+

Build from source

Download the ZIP file and unzip it in a directory of your choice, or clone the repository
cd {dir}/DataDefender/
mvn package
DataDefender.jar will be located in the "target" directory: {dir}/DataDefender/target/

Including JDBC Drivers

JDBC drivers are optional dependencies, provided through Maven profiles that can be activated at build time. Valid profiles are:

mariadb
mysql
sqlserver
postgresql
oracle

For convenience, a property that activates all drivers is also available:

jdbc-drivers-all

Example builds:

mvn package -P mariadb,mysql
mvn package -Djdbc-drivers-all
mvn package -P oracle

Alternatively, the JDBC drivers can be included as jar files in a 'lib' folder under your project folder (where the jar and scripts are copied to).

Note: sqlite-jdbc is always included for file discovery.

Extensions

Additional jar files/classes can be added under an 'extensions' directory in the current working directory. The default 'datadefender' scripts copied to the target directory add classes/jar files under 'extensions' to the classpath. The 'extensions' directory is meant to house extensions for a project, for example additional anonymization or discovery routines. Additional libraries that these extensions require are more appropriately placed in a 'lib' directory.

See sample_projects/anonymizer/ for an example.
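
As a rough sketch of the layout described above, the following shell commands create both directories next to the jar and scripts. It assumes that the directory holding the jar and scripts is also the current working directory. The JDBC_JAR and EXT_JAR variables are hypothetical placeholders for jar files you supply yourself; nothing is copied unless they are set.

```shell
#!/bin/sh
# Run from the directory that holds DataDefender.jar and the datadefender scripts
# (assumed here to also be the current working directory).
mkdir -p lib extensions

# JDBC_JAR and EXT_JAR are hypothetical placeholders, not names used by DataDefender.
if [ -n "${JDBC_JAR:-}" ]; then
  cp "$JDBC_JAR" lib/          # JDBC drivers and other required libraries
fi
if [ -n "${EXT_JAR:-}" ]; then
  cp "$EXT_JAR" extensions/    # project-specific anonymization or discovery routines
fi
```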

Contributing

We encourage you to contribute to DataDefender! Please check out the Contribution guidelines for this project.

How to run

The toolkit is implemented as a command line program. To run it, first build the application as above (mvn package). This generates an executable jar file in the "target" directory. For your convenience, executable 'sh' and 'bat' files are created as well. You may need to adjust permissions for the shell script (chmod +x datadefender). Once this has been done, you can get help by running 'datadefender' or 'datadefender.bat' in your shell/command prompt:

datadefender help

Usage: datadefender [-hvV] [--debug] COMMAND
Data detection and anonymization tool
      --debug     Enable debug logging in log file
  -h, --help      Show this help message and exit.
  -v, --verbose   Enable more verbose console output, specify two -v
                            for console debug logging
  -V, --version   Print version information and exit.
Commands:
  help       Displays help information about the specified command
  anonymize  Run anonymization utility
  extract    Run data extraction utility -- generates files out of table
               columns with the name 'table_columnName.txt' for each column
               requested.
  discover   Run data discovery utility
  test-requirement  Loads the requirement file without attempting to anonymize
                      or process anything to check for syntax issues

The toolkit can be run in anonymizer mode, data extraction mode (extract), and three different discovery modes (file, column, and database discovery).

Using argument files

DataDefender uses picocli as its framework for processing command-line input. The framework allows using argument files to set argument values when running the tool. An argument file contains a list of arguments to pass, and more than one argument file can be used. When invoking DataDefender, specify an argument file by prefixing its name with "@". For example:

File: database.config

--url=jdbc:mariadb://localhost:3306/database?zeroDateTimeBehavior=convertToNull
--password
--user=root

Running with database.config:

datadefender @database.config
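
Since more than one argument file can be used, connection settings and credentials could be kept in separate files. The sketch below assumes that split. The file names connection.config and credentials.config are made up for illustration, the option values are the ones from database.config, and the chosen subcommand may need further options that are not shown here.

```shell
#!/bin/sh
# Hypothetical file names; the options are those shown in database.config.
cat > connection.config <<'EOF'
--url=jdbc:mariadb://localhost:3306/database?zeroDateTimeBehavior=convertToNull
EOF

cat > credentials.config <<'EOF'
--password
--user=root
EOF

# Each argument file is passed with its own "@" prefix.
if [ -x ./datadefender ]; then
  ./datadefender discover columns @connection.config @credentials.config
else
  echo "datadefender script not found in the current directory"
fi
```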

File Discovery

datadefender discover files

Usage: datadefender discover files ([-l=<limit>] [-e=<extensions>]
                                   [-e=<extensions>]...
                                   [--model-file=<fileModels>]
                                   [--model-file=<fileModels>]...
                                   [--token-model=<tokenModel>]
                                   [--probability-threshold=<probabilityThreshold>]
                                   [--[no-]score-calculation]
                                   [--threshold-count=<thresholdCount>]
                                   [--threshold-high=<thresholdHighRisk>]
                                   [-m=<models>] [-m=<models>]...) [-hvV]
                                   [--debug] -d=<directories>
                                   [-d=<directories>]... -x=<excludeExtensions>
                                   [-x=<excludeExtensions>]...
Run file discovery utility
  -d, --directory=<directories>
                         Adds a directory to list of directories to be scanned
      --debug            Enable debug logging in log file
  -h, --help             Show this help message and exit.
  -v, --verbose          Enable more verbose console output, specify two -v
                            for console debug logging
  -V, --version          Print version information and exit.
  -x, --exclude-extension=<excludeExtensions>
                         Adds an extension to exclude from data discovery
Model discovery settings
  -e, --extension=<extensions>
                         Adds a call to an extension method (e.g. com.strider.
                           datadefender.specialcase.SinDetector.detectSin)
  -l, --limit=<limit>    Limit discovery to a set number of rows in a table
  -m, --model=<models>   Adds a built-in configured opennlp TokenizerME model
                           for data discovery. Available models are: date,
                           location, money, organization, person, time
      --model-file=<fileModels>
                         Adds a custom made opennlp TokenizerME file for data
                           discovery.
      --[no-]score-calculation
                         If set, includes a column score
      --probability-threshold=<probabilityThreshold>
                         Minimum NLP match score to return results for
      --threshold-count=<thresholdCount>
                         Reports if number of rows found are greater than the
                           defined threshold
      --threshold-high=<thresholdHighRisk>
                         Reports if number of high risk columns found are
                           greater than the defined threshold
      --token-model=<tokenModel>
                         Override the default built-in token model (English
                           tokens, en-token.bin) with a custom token file for
                           use by opennlp's TokenizerModel

File discovery will attempt to find sensitive personal information in binary and text files located on the file system.

A sample project can be found here: sample_projects/file_discovery
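
A minimal invocation might look like the following sketch. It only uses options listed in the help output above. The ./documents directory and the excluded zip extension are placeholder values, and the command is printed rather than run when the datadefender script is not present in the current directory.

```shell
#!/bin/sh
# Directory to scan; ./documents is a hypothetical path.
SCAN_DIR="${SCAN_DIR:-./documents}"

# Assemble the arguments from options listed in the help output.
set -- discover files \
  --directory="$SCAN_DIR" \
  --exclude-extension=zip \
  --model=person \
  --model=location

if [ -x ./datadefender ]; then
  ./datadefender "$@"
else
  # Print the command instead of failing when the script has not been built yet.
  echo "datadefender $*"
fi
```

The --model option is repeated once per built-in model, matching the repeated [-m=<models>]... entry in the usage text.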

Column Discovery

datadefender discover columns

Usage: datadefender discover columns [[-u=<username>] [-p[=<password>]]
                                     [--schema=<schema>]
                                     [--[no-]skip-empty-tables-metadata]
                                     [--include-table-pattern-metadata=<includeTablePatterns>]
