DTK (data toolkit) is a suite of tools for parsing, analyzing, and graphing logs and other datasets.

In short, DTK converts data into knowledge.

Note: In the examples in this document, a prefix of $ indicates that the line is a command at a bash-like shell prompt, and a token like <count> is a placeholder for some actual value named "count".

Introduction

Are your logs just a huge pile of data you lug around because "you might need them some day"? Do you make decisions without the knowledge locked away in those files because it's cumbersome to get from an Apache log to user behavior statistics? Do you waste hours digging through logs to track down the root cause of an incident? You're in the right place. Have a seat.

The modules in DTK follow the Unix philosophy - they do one thing and do it well, they work together, and they operate on text streams.

DTK provides many tools called "modules"; much like Git, they are all accessible through the launcher dtk - dtk help, dtk filter, dtk parse, etc. Each module focuses on solving one kind of problem while accepting and producing data in a reusable, common format.

Because of this, a DTK workflow often involves a pipe chain consisting of both DTK modules and other programs. These can be simple or complex:

# flags and other arguments omitted for brevity:

$ zcat | dtk parse | dtk hist1

$ zcat | grep | cut | cut | dtk filter | dtk uc | dtk filter | dtk hist2_bycol

$ cat <(zcat | dtk parse | dtk filter) <(zcat | dtk parse | dtk filter) | dtk delta1

Some DTK pipelines work like map-reduce jobs, and, in fact, the DTK suite is also quite effective when run in true big data map-reduce environments such as Hadoop Streaming.

DTK prioritizes usefulness and efficiency over shiny interfaces and PowerPoint presentations. These aren't your boss's tools (unless your boss also likes this sort of thing, of course).

`dtk`

The main dtk launcher, which is normally at /usr/bin/dtk, is responsible for finding and invoking the individual modules, which are normally in /usr/libexec/dtk-modules. To invoke a module, use dtk <module>, much as you would with git. If you would like to load modules from a different directory (such as during dtk development or in a homedir-local install), set the DTK_MODPATH environment variable to a colon-delimited list of module directories, much like the PATH environment variable. An empty string as one of the parts of DTK_MODPATH will be replaced with the default (/usr/libexec/dtk-modules), making it easy to provide overlay directories by using something like :~/dtk-modules (which would expand to /usr/libexec/dtk-modules and ~/dtk-modules).
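The DTK_MODPATH expansion rule described above can be sketched in plain POSIX shell. This is only an illustration of the documented behavior, not the launcher's actual code; the function name expand_modpath and the /home/me path are made up for the example.

```shell
# Hypothetical helper (not part of DTK): print one module directory per line
# for a DTK_MODPATH-style value. Empty components stand for the default.
expand_modpath() {
    default_dir=/usr/libexec/dtk-modules
    # Append a sentinel ':' so the last component is handled like the others.
    rest="$1:"
    while [ -n "$rest" ]; do
        part=${rest%%:*}   # text before the first ':'
        rest=${rest#*:}    # text after the first ':'
        if [ -z "$part" ]; then
            echo "$default_dir"
        else
            echo "$part"
        fi
    done
}

# The leading empty component expands to the default directory.
expand_modpath ":/home/me/dtk-modules"
# prints:
#   /usr/libexec/dtk-modules
#   /home/me/dtk-modules
```

With the real launcher you would simply export the variable, for example DTK_MODPATH=:~/dtk-modules, and let dtk do the lookup.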

To get details on a specific module, use dtk help <module>, which simply invokes dtk <module> --help for you.

Some modules (like hist2 and plot) work better when more colors are available. For best results, get a terminal with 256-color support, and then set your TERM environment variable to xterm-256color. If you are within screen, you may have to do this explicitly, like cat stuff | TERM=xterm-256color dtk hist2.

Modules

Most of the tools produce or operate on newline-terminated records containing tab-delimited fields, often called TSV ("tab-separated values"). Unless otherwise specified, tools in DTK take input on STDIN and produce output on STDOUT.
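Because the record format is plain TSV, standard Unix tools can read and write it directly. The following sketch uses only printf, cut, and awk (no DTK modules) with made-up sample data to show what such records look like:

```shell
# Build two newline-terminated records, each with two tab-delimited fields.
records=$(printf '%s\t%s\n' alice 3 bob 5)

# cut uses a tab as its default delimiter, so -f2 selects the second field.
printf '%s\n' "$records" | cut -f2

# awk can split on tabs too; here it sums the second field.
total=$(printf '%s\n' "$records" | awk -F'\t' '{ s += $2 } END { print s }')
echo "$total"   # prints 8
```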

`help`

Merely invoking dtk help produces a list of all available modules.

To get full documentation on a specific module, use dtk help <module> (for example, dtk help uc), which just invokes dtk <module> --help for you.

`filter`

The filter module is similar to perl -ne - it runs its first argument as Perl code wrapped in some convenience logic. The given code is run for each record after filling @v with each field; the final value of @v after the code is run is used as the output record. Use next to drop a record or last to skip all remaining records.

For example, suppose you have a dataset containing pairs of numbers:

$ perl -e 'print int(rand()*100)."\t".int(rand()*100)."\n" for 1..100' > data
$ head -n 5 data
50      34
13      15
37      5
62      2
7       88

You could add a column which contains the sum of the first two:

$ cat data | dtk filter '$v[2] = $v[0] + $v[1]' | head -n 5
50      34      84
13      15      28
37      5       42
62      2       64
7       88      95

Or perhaps you only want records where the first field is less than 20 and the second field is more than 80:

$ cat data | dtk filter 'next unless $v[0] < 20 && $v[1] > 80'
7       88
10      86
8       82
2       82
12      87
0       91
14      95

Because DTK is commonly used to analyze log files, helper functions are also available for fast parsing of data often found in logs:

<dl> <dt><code>parse_uri</code></dt> <dd>Takes a URI (or part of one, like <code>/path/to/thing?a=1&b=2</code> or <code>file:///path/to/thing.doc</code>) and returns an empty list on failure or otherwise a hash containing <code>schema</code>, <code>auth</code>, <code>host</code>, <code>port</code>, <code>path</code>, <code>query</code>, and <code>fragment</code>.</dd> <dt><code>parse_query</code></dt> <dd>Takes a query string (key/value pairs in the <code>application/x-www-form-urlencoded</code> format) and returns a hash containing the key/value pairs. Keys with no <code>=value</code> part will receive the value <code>1</code>, such that <code>a&b=2</code> will return <code>(a=>1, b=>2)</code>.</dd> <dt><code>parse_cookies</code></dt> <dd>Takes a cookie string (key/value pairs like <code>a=1; b=2; c=3</code>) and returns a hash containing the key/value pairs.</dd> <dt><code>decode_uri</code></dt> <dd>Takes a URI-encoded string (containing codes like <code>%3d</code> for <code>=</code>) and returns the decoded string.</dd> </dl>

Here are some example URIs and the result of parse_uri:

$ cat uris
http://www.google.com/path/to/page.cgi?a=1&b=2&c#jump
http://www.google.com/
/path/to/thing?a=1&b=2
file:///path/to/thing.doc

$ cat uris | dtk filter 'my %u = parse_uri($v[0]); @v = ($u{host}, $v[0])'
www.google.com  http://www.google.com/path/to/page.cgi?a=1&b=2&c#jump
www.google.com  http://www.google.com/
                /path/to/thing?a=1&b=2
                file:///path/to/thing.doc

$ cat uris | dtk filter 'my %u = parse_uri($v[0]); @v = ($u{query}, $v[0])'
a=1&b=2&c       http://www.google.com/path/to/page.cgi?a=1&b=2&c#jump
                http://www.google.com/
a=1&b=2         /path/to/thing?a=1&b=2
                file:///path/to/thing.doc

Here is a filter which shows the query string parameters b and c when parameter a has a truthy value:

$ cat uris | dtk filter 'my %u = parse_uri($v[0]); my %q = parse_query($u{query}); next unless $q{a}; @v = ($q{b}, $q{c}, $v[0])'
2       1       http://www.google.com/path/to/page.cgi?a=1&b=2&c#jump
2               /path/to/thing?a=1&b=2
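The c column in the first record is 1 because parse_query gives bare keys the value 1. That rule can be sketched outside DTK with tr and awk. The function query_pairs below is a hypothetical stand-in written for this illustration; it is not how parse_query is implemented, and it does not do any percent-decoding.

```shell
# Hypothetical stand-in for parse_query: print one "key<TAB>value" line per
# pair, giving keys without "=value" the value 1.
query_pairs() {
    printf '%s\n' "$1" | tr '&' '\n' | awk -F'=' '
        $0 == "" { next }                  # skip empty pairs
        {
            pos = index($0, "=")           # position of the first "=", or 0
            value = (pos > 0) ? substr($0, pos + 1) : 1
            print $1 "\t" value
        }'
}

query_pairs 'a=1&b=2&c'
# prints:
#   a	1
#   b	2
#   c	1
```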

`parse`

The parse module parses specific file formats and produces the fields of each record in DTK-friendly tab-delimited output. To get a list of known formats, use dtk parse --help (or dtk help parse). To see details on a specific format, use dtk parse --help <format> (or dtk help parse <format>).

Parse formats are executable (chmod +x) Perl scripts discovered in the parse-formats subdirectory of any path given in DTK_MODPATH, which by default causes a search only in /usr/libexec/dtk-modules/parse-formats/. The following parse formats are packaged with DTK:

<dl> <dt><code>apache_access</code></dt> <dd>For parsing the default Apache access logs in the built-in <code>common</code> or <code>combined</code> log formats.</dd> </dl>
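To see which formats a given module directory would contribute, you can look for executable files in its parse-formats subdirectory yourself. The sketch below mirrors the discovery rule described above; the function name list_parse_formats is made up, and it assumes that a format's name is simply its file name, which this document does not state.

```shell
# Hypothetical helper (not part of DTK): list candidate parse formats under
# each module directory given as an argument.
list_parse_formats() {
    for dir in "$@"; do
        for f in "$dir"/parse-formats/*; do
            # Only regular files with the executable bit set are considered.
            if [ -f "$f" ] && [ -x "$f" ]; then
                basename "$f"
            fi
        done
    done
}

# Prints nothing if the directory does not exist on this machine.
list_parse_formats /usr/libexec/dtk-modules
```

In normal use, dtk parse --help remains the authoritative way to list known formats.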

To parse a format, pipe it to dtk parse <format> to get all fields, or dtk parse <format> <field>,<field>,... to get specific fields. For example:

$ cat access_log | dtk parse apache_access datetime,bytes
09/May/2012:16:00:00 -0400      -
09/May/2012:16:00:00 -0400      -
09/May/2012:16:00:00 -0400      43
09/May/2012:16:00:00 -0400      -
09/May/2012:15:59:59 -0400      -
09/May/2012:16:00:00 -0400      -
09/May/2012:15:59:59 -0400      931
09/May/2012:16:00:00 -0400      43
09/May/2012:15:59:59 -0400      179090
09/May/2012:16:00:00 -0400      515

Some fields have extra post-filters which can be applied by specifying the field as <field>:<filter> - these are shown in the format details in parentheses after each field. For example, apache_access's datetime field can be converted to epoch time, bytes can be forced to a number (so - becomes 0), and useragent can be categorized into general classes for easy bucketing:

$ cat access_log | dtk parse apache_access datetime:epoch,bytes:numeric,useragent:class
1336593600      0       ie8
1336593600      0       ie8
1336593600      43      ie9
1336593600      0       ie8
1336593599      0       ie9
1336593600      0       ie9
1336593599      931     firefox
1336593600      43      ie9
1336593599      179090  ie7
1336593600      515     ie9
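To make the effect of a post-filter concrete, here is what bytes:numeric amounts to when done by hand with awk on the unfiltered datetime,bytes output. This is only a sketch of the documented effect (a - becomes 0) using two sample records, not the filter's implementation.

```shell
# Two sample records shaped like the datetime,bytes output shown earlier.
rows=$(printf '%s\t%s\n' \
    '09/May/2012:16:00:00 -0400' - \
    '09/May/2012:15:59:59 -0400' 931)

# Replace a "-" in the second field with 0 and keep tab-delimited output.
numeric=$(printf '%s\n' "$rows" |
    awk -F'\t' -v OFS='\t' '$2 == "-" { $2 = 0 } { print }')
echo "$numeric"
```

Letting dtk parse apply the filter saves this extra pipeline stage and keeps the conversion next to the field it belongs to.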

Parse format details

Each parse format is merely an executable (chmod +x) Perl script which returns a data structure representing instructions on what fields are available, how to build a regular expression which extracts those fields, and any filters that can be applied to the values in those fields before returning them. These instructions are represented as an
