DataDefender
Sensitive Data Management: Data Discovery and Anonymization toolkit
Data Discovery and Anonymization toolkit
Table of contents
- Purpose
- Features
- Prerequisites
- Build from source
- Including JDBC Drivers
- Extensions
- Contributing
- How to run
- Using argument files
- File Discovery
- Column Discovery
- Data Discovery
- Data Extractor
- Anonymizer
- Requirement Tester
- Logging (and database logging)
- Upgrading to 2.0
- Features and issues
- Code quality
Purpose
While performing application development, testing, or maintenance, it is important to operate in an environment that is as close to the production environment as possible when it comes to the amount of data and close-to-real content. At the same time it is important to ensure that data privacy policies are not violated.
Database, column, and file discovery identify and analyze data risks and report on potentially identifiable and personal information that is stored. The database anonymization process anonymizes sensitive data so that information can be transferred between organizations with a reduced risk of unintended disclosure.
The complete source code is available, so you can inspect it and perform security audits if necessary.
This implementation of the Data Discovery program uses Apache OpenNLP.
Features
- Identifies sensitive personal data.
- Creates a plan (an XML document) that defines which columns should be anonymized and how.
- Anonymizes the data.
- Platform-independent.
- Supports Oracle, MariaDB/MySQL, MS SQL Server, and PostgreSQL. Work in progress for DB2.
- This tool can help you be GDPR-compliant.
Prerequisites
- JDK 11+
- Maven 3+
Build from source
- Download the ZIP file and unzip it in a directory of your choice, or clone the repository
- cd {dir}/DataDefender/
- mvn package
- DataDefender.jar will be located in the "target" directory: {dir}/DataDefender/target/
Including JDBC Drivers
JDBC drivers are optional dependencies defined in Maven profiles that can be activated at build time. Valid profiles are:
- mariadb
- mysql
- sqlserver
- postgresql
- oracle
For convenience, a property that activates all drivers is also available:
- jdbc-drivers-all
Example builds:
mvn package -P mariadb,mysql
mvn package -Djdbc-drivers-all
mvn package -P oracle
Alternatively, the JDBC drivers can be included as jar files in a 'lib' folder under your project folder (where the jar and scripts are copied to).
Note: sqlite-jdbc is always included, because it is used for file discovery.
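The 'lib' folder alternative can be sketched as follows. This is only an illustration: the project folder and the driver jar path are hypothetical placeholders, and the only thing taken from the text above is that driver jars go in a 'lib' folder next to the DataDefender jar and scripts.

```shell
# Hypothetical locations; replace both with your own paths.
PROJECT_DIR="./my-project"                 # folder holding DataDefender.jar and the scripts (assumption)
DRIVER_JAR="./downloads/jdbc-driver.jar"   # a JDBC driver jar you downloaded yourself (assumption)

# Create the 'lib' folder under the project folder.
mkdir -p "$PROJECT_DIR/lib"

# Copy the driver only if it is actually present.
if [ -f "$DRIVER_JAR" ]; then
    cp "$DRIVER_JAR" "$PROJECT_DIR/lib/"
else
    echo "Driver jar not found at $DRIVER_JAR; nothing copied."
fi
```

With this layout no Maven profile is needed for that driver, so a plain mvn package build should be enough.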
Extensions
Additional jar files and classes can be added under an 'extensions' directory in the current working directory. The default 'datadefender' scripts copied to the target directory add the classes and jar files under 'extensions' to the classpath. The 'extensions' directory is meant to house extensions for a project, for example additional anonymization or discovery routines. Any additional libraries required are more appropriately placed in a 'lib' directory.
See sample_projects/anonymizer/ for an example.
Contributing
We encourage you to contribute to DataDefender! Please check out the Contribution guidelines for this project.
How to run
The toolkit is implemented as a command-line program. To run it, first build the application as described above (mvn package). This generates an executable jar file in the "target" directory. For convenience, executable 'sh' and 'bat' files are created as well. You may need to adjust permissions for the executable shell script (chmod +x datadefender). Once this is done, you can get help by running 'datadefender' or 'datadefender.bat' in your shell or command prompt:
datadefender help
Usage: datadefender [-hvV] [--debug] COMMAND
Data detection and anonymization tool
      --debug     Enable debug logging in log file
  -h, --help      Show this help message and exit.
  -v, --verbose   Enable more verbose console output, specify two -v
                    for console debug logging
  -V, --version   Print version information and exit.
Commands:
  help              Displays help information about the specified command
  anonymize         Run anonymization utility
  extract           Run data extraction utility -- generates files out of table
                      columns with the name 'table_columnName.txt' for each column
                      requested.
  discover          Run data discovery utility
  test-requirement  Loads the requirement file without attempting to anonymize
                      or process anything to check for syntax issues
The toolkit can be run in anonymizer mode, data extraction mode (extract), and three different discovery modes (file, column, and database discovery).
Using argument files
DataDefender uses picocli as its framework for processing command-line input. The framework allows argument files to supply argument values when running the tool. An argument file contains a list of arguments to pass (more than one can be used), and it is specified with an "@" prefix when invoking DataDefender. For example:
File: database.config
--url=jdbc:mariadb://localhost:3306/database?zeroDateTimeBehavior=convertToNull
--password
--user=root
Running with database.config:
datadefender @database.config
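The example above can be reproduced from a shell roughly as follows. This is a minimal sketch: the file contents and the invocation are copied from the example, while the chmod step and the check for the launcher on the PATH are additions of this sketch rather than anything DataDefender requires.

```shell
# Write the argument file from the example, one option per line.
cat > database.config <<'EOF'
--url=jdbc:mariadb://localhost:3306/database?zeroDateTimeBehavior=convertToNull
--password
--user=root
EOF

# Optional precaution (not from the README): restrict access,
# since the file holds connection details.
chmod 600 database.config

# Invoke the tool only if the launcher script is on the PATH.
if command -v datadefender >/dev/null 2>&1; then
    datadefender @database.config
else
    echo "datadefender not found on PATH; argument file written only."
fi
```

Keeping connection settings in an argument file keeps them out of the shell history and lets the same file be reused across invocations.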
File Discovery
datadefender discover files
Usage: datadefender discover files ([-l=<limit>] [-e=<extensions>]
                                   [-e=<extensions>]...
                                   [--model-file=<fileModels>]
                                   [--model-file=<fileModels>]...
                                   [--token-model=<tokenModel>]
                                   [--probability-threshold=<probabilityThreshold>]
                                   [--[no-]score-calculation]
                                   [--threshold-count=<thresholdCount>]
                                   [--threshold-high=<thresholdHighRisk>]
                                   [-m=<models>] [-m=<models>]...) [-hvV]
                                   [--debug] -d=<directories>
                                   [-d=<directories>]... -x=<excludeExtensions>
                                   [-x=<excludeExtensions>]...
Run file discovery utility
  -d, --directory=<directories>
                         Adds a directory to list of directories to be scanned
      --debug            Enable debug logging in log file
  -h, --help             Show this help message and exit.
  -v, --verbose          Enable more verbose console output, specify two -v
                           for console debug logging
  -V, --version          Print version information and exit.
  -x, --exclude-extension=<excludeExtensions>
                         Adds an extension to exclude from data discovery
Model discovery settings
  -e, --extension=<extensions>
                         Adds a call to an extension method (e.g. com.strider.
                           datadefender.specialcase.SinDetector.detectSin)
  -l, --limit=<limit>    Limit discovery to a set number of rows in a table
  -m, --model=<models>   Adds a built-in configured opennlp TokenizerME model
                           for data discovery. Available models are: date,
                           location, money, organization, person, time
      --model-file=<fileModels>
                         Adds a custom made opennlp TokenizerME file for data
                           discovery.
      --[no-]score-calculation
                         If set, includes a column score
      --probability-threshold=<probabilityThreshold>
                         Minimum NLP match score to return results for
      --threshold-count=<thresholdCount>
                         Reports if number of rows found are greater than the
                           defined threshold
      --threshold-high=<thresholdHighRisk>
                         Reports if number of high risk columns found are
                           greater than the defined threshold
      --token-model=<tokenModel>
                         Override the default built-in token model (English
                           tokens, en-token.bin) with a custom token file for
                           use by opennlp's TokenizerModel
File discovery will attempt to find sensitive personal information in binary and text files located on the file system.
A sample project can be found here: sample_projects/file_discovery
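As a rough illustration of how the options listed in the usage text combine, the sketch below assembles one file discovery command. The directory, the excluded extension, and the threshold value are hypothetical choices made for this example. In particular, it is an assumption that the extension is written without a leading dot and that 0.7 is a sensible probability threshold.

```shell
# Hypothetical inputs; adjust to your environment.
SCAN_DIR="./documents"   # directory to scan (assumption)
SKIP_EXT="log"           # extension to exclude (assumption: no leading dot)

# Build the argument list from options shown in the usage text:
# -d and -x appear as required there, -m selects built-in models.
set -- discover files \
    -d "$SCAN_DIR" \
    -x "$SKIP_EXT" \
    -m person -m location \
    --probability-threshold=0.7

# Show the command that would be run.
CMD_LINE="datadefender $*"
echo "$CMD_LINE"

# Run it only if the launcher script is on the PATH.
if command -v datadefender >/dev/null 2>&1; then
    datadefender "$@"
fi
```

Repeating -m, as the synopsis allows, adds more than one built-in model to the same scan.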
Column Discovery
datadefender discover columns
Usage: datadefender discover columns [[-u=<username>] [-p[=<password>]]
[--schema=<schema>]
[--[no-]skip-empty-tables-metadata]
[--include-table-pattern-metadata=<includeTablePatterns>]


