WhoDat
Pivotable Reverse WhoIs / PDNS Fusion with Registrant Tracking & Alerting plus API for automated queries (JSON/CSV/TXT)
WhoDat Project
NOTE: During development of PyDat 5, internal operations shifted direction, leading to the retirement of the PyDat project. Although much of the work to finalize PyDat 5's capabilities has been done, some capabilities remain not fully tested.
The WhoDat project is an interface for whoisxmlapi data, or any whois data living in ElasticSearch. It integrates whois data, current IP resolutions, and passive DNS. In addition to providing an interactive, pivotable application for analysts to perform research, it also exposes an API that returns output in JSON format.
WhoDat was originally written by Chris Clark. The original implementation is in PHP and available in this repository under the legacy_whodat directory. The code was rewritten from scratch in Python by Wesley Shields and Murad Khan, and is available under the pydat directory.
The PHP version is left for those who want to run it, but it is not as full-featured or extensible as the Python implementation, and is not supported.
For more information on the PHP implementation please see the readme. For more information on the Python implementation keep reading...
PyDat
pyDat is a Python implementation of Chris Clark's WhoDat code. It is designed to be more extensible and has more features than the PHP implementation.
PreReqs
pyDat is a Python 3.6+ application that requires the following to run:
- ElasticSearch installed somewhere (version 7.x is supported)
- python packages (installed via setup script)
Data Population
To aid in properly populating the database, a program called pydat-populator is provided to auto-populate the data.
Note that the data coming from whoisxmlapi is not always consistent, so some care should be taken when ingesting it.
More testing needs to be done to ensure all data is ingested properly.
Anyone setting up their database should read the available flags for the script before running it to ensure they've tweaked it for their setup.
The following is the output from pydat-populator -h:
usage: pydat-populator [-h] [-c CONFIG] [--debug] [--debug-level DEBUG_LEVEL]
[-x EXCLUDE [EXCLUDE ...]] [-n INCLUDE [INCLUDE ...]]
[--ignore-field-prefixes [IGNORE_FIELD_PREFIXES [IGNORE_FIELD_PREFIXES ...]]]
[-e EXTENSION] [-v] [-s] [--pipelines PIPELINES]
[--shipper-threads SHIPPER_THREADS]
[--fetcher-threads FETCHER_THREADS]
[--bulk-ship-size BULK_SHIP_SIZE]
[--bulk-fetch-size BULK_FETCH_SIZE]
[-u [ES_URI [ES_URI ...]]] [--es-user ES_USER]
[--es-pass ES_PASSWORD] [--cacert ES_CA_CERT]
[--es-disable-sniffing] [-p ES_INDEX_PREFIX]
[--rollover-size ES_ROLLOVER_DOCS] [--ask-pass]
[-r | --config-template-only | --clear-interrupted-flag]
[-f INGEST_FILE | -d INGEST_DIRECTORY] [-D INGEST_DAY]
[-o COMMENT]
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
location of configuration file for environment
parameter configuration (example yaml file in
/backend)
--debug Enables debug logging
--debug-level DEBUG_LEVEL
Debug logging level [0-3] (default: 1)
-x EXCLUDE [EXCLUDE ...], --exclude EXCLUDE [EXCLUDE ...]
list of keys to exclude if updating entry
-n INCLUDE [INCLUDE ...], --include INCLUDE [INCLUDE ...]
list of keys to include if updating entry (mutually
exclusive to -x)
--ignore-field-prefixes [IGNORE_FIELD_PREFIXES [IGNORE_FIELD_PREFIXES ...]]
list of fields (in whois data) to ignore when
extracting and inserting into ElasticSearch
-e EXTENSION, --extension EXTENSION
When scanning for CSV files only parse files with
given extension (default: csv)
-v, --verbose Be verbose
-s, --stats Print out Stats after running
-r, --redo Attempt to re-import a failed import or import more
data, uses stored metadata from previous run
--config-template-only
Configure the ElasticSearch template and then exit
--clear-interrupted-flag
Clear the interrupted flag, forcefully (NOT
RECOMMENDED)
-f INGEST_FILE, --file INGEST_FILE
Input CSV file
-d INGEST_DIRECTORY, --directory INGEST_DIRECTORY
Directory to recursively search for CSV files --
mutually exclusive to '-f' option
-D INGEST_DAY, --ingest-day INGEST_DAY
Day to use for metadata, in the format 'YYYY-MM-dd',
e.g., '2021-01-01'. Defaults to today's date, use
'YYYY-MM-00' to indicate a quarterly ingest, e.g.,
2021-04-00
-o COMMENT, --comment COMMENT
Comment to store with metadata
Performance Options:
--pipelines PIPELINES
Number of pipelines (default: 2)
--shipper-threads SHIPPER_THREADS
How many threads per pipeline to spawn to send bulk ES
messages. The larger your cluster, the more you can
increase this, defaults to 1
--fetcher-threads FETCHER_THREADS
How many threads to spawn to search ES. The larger
your cluster, the more you can increase this, defaults
to 2
--bulk-ship-size BULK_SHIP_SIZE
Size of Bulk Elasticsearch Requests (default: 10)
--bulk-fetch-size BULK_FETCH_SIZE
Number of documents to search for at a time (default:
50), note that this will be multiplied by the number
of indices you have, e.g., if you have 10
pydat-<number> indices it results in a request for 500
documents
Elasticsearch Options:
-u [ES_URI [ES_URI ...]], --es-uri [ES_URI [ES_URI ...]]
Location(s) of ElasticSearch Server (e.g.,
foo.server.com:9200) Can take multiple endpoints
--es-user ES_USER Username for ElasticSearch when Basic Auth is enabled
--es-pass ES_PASSWORD
Password for ElasticSearch when Basic Auth is enabled
--cacert ES_CA_CERT Path to a CA Certificate bundle to enable https support
--es-disable-sniffing
Disable ES sniffing, useful when ssl hostname
verification is not working properly
-p ES_INDEX_PREFIX, --index-prefix ES_INDEX_PREFIX
Index prefix to use in ElasticSearch (default: pydat)
--rollover-size ES_ROLLOVER_DOCS
Set the number of documents after which point a new
index should be created, defaults to 50 million, note
that this is fuzzy since the index count isn't
continuously updated, so should be reasonably below 2
billion per ES shard and should take your ES
configuration into consideration
--ask-pass Prompt for ElasticSearch password
Note that when adding a new version of data to the database, you should use either the -x flag to exclude fields that are not important to track between versions, or the -n flag to include only the specific fields that are subject to scrutiny. This will significantly decrease the amount of data stored between versions. You can use either -x or -n, but not both at the same time; choose whichever is best for your environment. For example, with daily updates you might decide you only care whether contactEmail changes, while for a quarterly ingest you might instead exclude only the few fields you don't find important.
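As an illustration only (this is a conceptual sketch, not the populator's actual code), the effect of -x versus -n on change tracking between record versions can be pictured like this:

```python
def changed_fields(old, new, exclude=None, include=None):
    """Return the keys whose values differ between two record versions.

    ``exclude`` and ``include`` are mutually exclusive, mirroring the
    populator's -x and -n flags.
    """
    if exclude and include:
        raise ValueError("use either exclude or include, not both")
    keys = set(old) | set(new)
    if include:
        keys &= set(include)       # -n: compare only these fields
    elif exclude:
        keys -= set(exclude)       # -x: ignore these fields
    return {k for k in keys if old.get(k) != new.get(k)}


old = {"contactEmail": "a@example.com", "updatedDate": "2021-01-01"}
new = {"contactEmail": "b@example.com", "updatedDate": "2021-01-02"}

# Daily run: only track contactEmail changes (like -n contactEmail)
print(changed_fields(old, new, include=["contactEmail"]))

# Quarterly run: ignore a noisy date field (like -x updatedDate)
print(changed_fields(old, new, exclude=["updatedDate"]))
```

Both calls report only contactEmail as changed: the first because it is the sole included field, the second because the noisy updatedDate field has been excluded.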
Config File
To save time on repetitive flag usage, pydat-populator takes a configuration file.
Please look at the example config for an example of how to create a configuration file.
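A minimal configuration file might look like the following. The key names here are illustrative guesses that mirror the CLI flags; consult the example config shipped in /backend for the authoritative structure:

```yaml
# Illustrative only -- these key names are assumptions that mirror the
# CLI flags; see the example yaml file in /backend for the real names.
es:
  uri:
    - localhost:9200
  index_prefix: pydat
pipelines: 2
shipper_threads: 1
fetcher_threads: 2
exclude:
  - updatedDate
  - expiresDate
```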
Running pyDat
pyDat does not provide any data on its own. You must provide your own whois data in an ElasticSearch data store.
Populating ElasticSearch with whoisxmlapi data (Ubuntu 20.04 LTS)
- Install ElasticSearch. Using Docker is the easiest mechanism
- Download the latest trimmed (smallest possible) whoisxmlapi quarterly DB dump.
- Extract the csv files.
- Use the included program once the package is installed:
pydat-populator -u localhost:9200 -f ~/whois/data/1.csv -v -s -x Audit_auditUpdatedDate,updatedDate,standardRegUpdatedDate,expiresDate,standardRegExpiresDate
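For the first step, a single-node development instance of ElasticSearch 7.x can be started with Docker. The version tag and settings below are assumptions for local testing, not production guidance:

```shell
# Local, single-node ElasticSearch 7.x for testing only
docker run -d --name pydat-es \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.0

# Confirm the node is up before running pydat-populator
curl http://localhost:9200
```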
Installation
PyDat 5 is a split backend/frontend application that uses Python Flask to provide a REST API and ReactJS to provide an interactive web UI. The easiest way to use the app is to build a Docker image.
cd pydat/
docker build -t mitrecnd/pydat:5 .
The created image compiles the frontend components and installs them into the backend.
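Once built, the image can be run like any other container. The exposed port and the mechanism for pointing the backend at your ElasticSearch instance below are assumptions; check the backend configuration for the actual settings:

```shell
# Hypothetical invocation -- the port and the way ElasticSearch
# connection settings are supplied may differ in your build.
docker run -d --name pydat -p 5000:5000 mitrecnd/pydat:5
```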
