data - package manager for datasets
Imagine installing datasets like this:
data get jbenet/norb
It's about time we used all we've learned making package managers to fix the awful data management problem. Read the design doc and the roadmap.
Install
Two ways to install:
- from pre-built binary distributions (the easy way)
- from source (the hard way)
Usage
Please see the command reference.
Downloading datasets
Downloading datasets is trivial:
> data get jbenet/mnist
Installed jbenet/mnist@1.0 at datasets/jbenet/mnist@1.0
Adding datasets to projects
Or, if you want to add datasets to a project, create a Datafile like this one:
> cat Datafile
dependencies:
- jbenet/mnist@1.0
- jbenet/cifar-10
- jbenet/cifar-100
Then, run data get to install the dependencies (it works like npm install):
> data get
Installed jbenet/mnist@1.0 at datasets/jbenet/mnist@1.0
Installed jbenet/cifar-10@1.0 at datasets/jbenet/cifar-10@1.0
Installed jbenet/cifar-100@1.0 at datasets/jbenet/cifar-100@1.0
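The dependency step above boils down to reading the `dependencies:` list out of the Datafile and installing each handle. The sketch below is a naive stand-in for illustration only (data itself is a Go tool with a real YAML parser; see datafile.go), and assumes exactly the simple list form shown here:

```python
# Naively extract dependency handles from a Datafile's
# "dependencies:" list. Purely illustrative; the real tool
# parses full YAML.
def datafile_dependencies(text: str) -> list:
    deps, in_deps = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "dependencies:":
            in_deps = True
        elif in_deps and stripped.startswith("- "):
            deps.append(stripped[2:])
        elif stripped and not stripped.startswith("- "):
            in_deps = False  # left the dependencies list
    return deps

datafile = """dependencies:
- jbenet/mnist@1.0
- jbenet/cifar-10
- jbenet/cifar-100
"""
print(datafile_dependencies(datafile))
# ['jbenet/mnist@1.0', 'jbenet/cifar-10', 'jbenet/cifar-100']
```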
You can even commit the Datafile to version control, so your collaborators or users can easily get the data:
> git clone github.com/jbenet/ml-vision-comparisons
> cd ml-vision-comparisons
> data get
Installed jbenet/mnist@1.0 at datasets/jbenet/mnist@1.0
Installed jbenet/cifar-10@1.0 at datasets/jbenet/cifar-10@1.0
Installed jbenet/cifar-100@1.0 at datasets/jbenet/cifar-100@1.0
Publishing datasets
Publishing datasets is simple:
- make a directory with all the files you want to publish.
- cd into it, and run data publish within the directory. data will guide you through creating a Datafile.
- Then, data will upload and publish the package.
> data publish
<lots of output>
Published jbenet/mnist@1.0 (b5f84c2).
Note that uploading can take a long while, as we'll upload all the files to S3, ensuring others can always get them.
Datafile
data tracks the definition of dataset packages and their dependencies in a
Datafile (in the style of Makefile, Vagrantfile, Procfile, and
friends). Both published dataset packages and regular projects use it.
In a way, your project defines a dataset made up of other datasets, like
package.json in npm.
# Datafile format
# A YAML (inc json) doc with the following keys:
# required
handle: <author>/<name>[.<format>][@<tag>]
title: Dataset Title
# optional functionality
dependencies: [<other dataset handles>]
formats: {<format> : <format url>}
# optional information
description: Text describing dataset.
repository: <repo url>
website: <dataset url>
license: <license url>
contributors: ["Author Name [<email>] [(url)]", ...]
sources: [<source urls>]
May be outdated. See datafile.go.
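The handle grammar `<author>/<name>[.<format>][@<tag>]` can be unpacked as in the sketch below, under the assumption that format and tag never themselves contain `.` or `@`. This is illustrative only; the authoritative rules live in datafile.go:

```python
import re

# Parse "<author>/<name>[.<format>][@<tag>]" into its parts.
# format and tag come back as None when absent.
HANDLE_RE = re.compile(
    r"^(?P<author>[^/]+)/(?P<name>[^.@]+)"
    r"(?:\.(?P<format>[^@]+))?(?:@(?P<tag>.+))?$"
)

def parse_handle(handle: str):
    m = HANDLE_RE.match(handle)
    if m is None:
        raise ValueError("invalid handle: " + handle)
    return m.group("author"), m.group("name"), m.group("format"), m.group("tag")

print(parse_handle("jbenet/mnist@1.0"))         # ('jbenet', 'mnist', None, '1.0')
print(parse_handle("jbenet/cifar-10.csv@1.0"))  # ('jbenet', 'cifar-10', 'csv', '1.0')
```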
why yaml?
YAML is much more readable than json. One of data's design goals
is an Intuitive UX. Since the target users are scientists in various domains,
any extra syntax, parse errors, and other annoyances could undermine the
ease of use data aims for. I've always found this
dataset: feynman/spinning-plate-measurements
title: Measurements of Plate Rotation
contributors:
- Richard Feynman <feynman@caltech.edu>
website: http://caltech.edu/~feynman/not-girls/plate-stuff/trial3
much more friendly and approachable than this
{
"dataset": "feynman/spinning-plate-measurements",
"title": "Measurements of Plate Rotation",
"contributors": [
"Richard Feynman <feynman@caltech.edu>"
],
"website": "http://caltech.edu/~feynman/not-girls/plate-stuff/trial3"
}
It's already hard enough to get anyone to do anything. Don't add more hoops to jump through than necessary. Each step will cause significant dropoff in conversion funnels. (Remember, Apple pays Amazon for 1-click buy...)
And, since YAML is a superset of json, you can do whatever you want.
Development
Setup:
- install go
- Run
git clone https://github.com/jbenet/data
cd data
make deps
make install
You'll want to run datadex too.
About
This project started because data management is a massive problem in science.
It should be trivial to (a) find, (b) download, (c) track, (d) manage,
(e) re-format, (f) publish, (g) cite, and (h) collaborate on datasets. Data
management is a problem in other domains (engineering, civics, etc), and data
seeks to be general enough to be used with any kind of dataset, but the target
use case is saving scientists' time.
Many people agree we direly need the "GitHub for Science"; scientific collaboration problems are large and numerous. It is not entirely clear how, and in which order, to tackle these challenges, or even how to drive adoption of solutions across fields. I think simple and powerful tools can solve large problems neatly. Perhaps the best way to tackle scientific collaboration is by decoupling interconnected problems, and building simple tools to solve them. Over time, reliable infrastructure can be built with these. git, github, and arxiv are great examples to follow.
data is an attempt to solve the fairly self-contained issue of downloading,
publishing, and managing datasets. Let's take what computer scientists have
learned about version control and distributed collaboration on source code,
and apply it to the data management problem. Let's build new data tools and
infrastructure with the software engineering and systems design principles
that made git, apt, npm, and github successful.
Acknowledgements
data is released under the MIT License.
Authored by @jbenet. Feel free to contact me at juan@benet.ai, but please post issues on github first.
Special thanks to @colah (original idea and data.py), @damodei, and @davidad, who provided valuable thoughts + discussion on this problem.
Examples
data - dataset package manager
Basic commands:
get Download and install dataset.
list List installed datasets.
info Show dataset information.
publish Guided dataset publishing.
Tool commands:
version Show data version information.
config Manage data configuration.
user Manage users and credentials.
commands List all available commands.
Advanced Commands:
blob Manage blobs in the blobstore.
manifest Generate and manipulate dataset manifest.
pack Dataset packaging, upload, and download.
Use "data help <command>" for more information about a command.
data get
# author/dataset
> data get jbenet/foo
Downloading jbenet/foo from datadex.
get blob b53ce99 Manifest
get blob 2183ea8 Datafile
get blob 63443e4 data.csv
copy blob 63443e4 data.txt
copy blob 63443e4 data.xsl
get blob b53ce99 Manifest
Installed jbenet/foo@1.0 at datasets/jbenet/foo
data list
> data list
jbenet/bar@1.0
data info
> data info jbenet/foo
dataset: jbenet/foo@1.0
title: Foo Dataset
description: The first dataset to use data.
license: MIT
# shows the Datafile
> cat datasets/jbenet/bar/Datafile
dataset: foo/bar@1.1
data publish
> data publish
==> Guided Data Package Publishing.
==> Step 1/3: Creating the package.
Verifying Datafile fields...
Generating manifest...
data manifest: added Datafile
data manifest: added data.csv
data manifest: added data.txt
data manifest: added data.xsl
data manifest: hashed 2183ea8 Datafile
data manifest: hashed 63443e4 data.csv
data manifest: hashed 63443e4 data.txt
data manifest: hashed 63443e4 data.xsl
==> Step 2/3: Uploading the package contents.
put blob 2183ea8 Datafile - uploading
put blob 63443e4 data.csv - exists
put blob b53ce99 Manifest - uploading
==> Step 3/3: Publishing the package to the index.
data pack: published jbenet/foo@1.0 (b53ce99).
Et voila! You can now use data get jbenet/foo to retrieve it!
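Notice in the transcript above that data.csv, data.txt, and data.xsl all hash to the same blob (63443e4), so only one copy is uploaded ("exists"). That is content addressing: blobs are named by a hash of their bytes, so identical files dedupe for free. A sketch of the idea; the abbreviated sha1-style IDs here are an assumption for illustration, not data's documented hashing scheme:

```python
import hashlib

# Name a blob by a hash of its contents, so identical files
# dedupe to a single stored blob. The 7-char abbreviation mirrors
# the IDs shown in the transcripts; the real scheme may differ.
def blob_id(contents: bytes) -> str:
    return hashlib.sha1(contents).hexdigest()[:7]

files = {
    "data.csv": b"1,2,3\n",
    "data.txt": b"1,2,3\n",  # same bytes as data.csv
    "data.xsl": b"1,2,3\n",  # same bytes again
}
blobs = {name: blob_id(data) for name, data in files.items()}
unique = set(blobs.values())
print(len(unique))  # 1 -- three filenames, one stored blob
```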
data config
> data config index.datadex.url http://localhost:8080
> data config index.datadex.url
http://localhost:8080
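Config keys like index.datadex.url are dotted paths into a nested configuration. One way to model that behavior (a hypothetical sketch; data's actual config storage is its own):

```python
# Get/set dotted config keys ("index.datadex.url") in a nested dict,
# mirroring the CLI behavior shown above.
def config_set(cfg: dict, key: str, value: str) -> None:
    *path, last = key.split(".")
    for part in path:
        cfg = cfg.setdefault(part, {})
    cfg[last] = value

def config_get(cfg: dict, key: str) -> str:
    for part in key.split("."):
        cfg = cfg[part]
    return cfg

cfg = {}
config_set(cfg, "index.datadex.url", "http://localhost:8080")
print(config_get(cfg, "index.datadex.url"))  # http://localhost:8080
```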
data user
> data user
data user - Manage users and credentials.
Commands:
add Register new user with index.
auth Authenticate user account.
pass Change user password.
info Show (or edit) public user information.
url Output user profile url.
Use "user help <command>" for more information about a command.
> data user add
Username: juan
Password (6 char min):
Email (for security): juan@benet.ai
juan registered.
> data user auth
Username: juan
Password:
Authenticated as juan.
> data user info
name: ""
email: juan@benet.ai
> data user info jbenet
name: Juan
email: juan@benet.ai
github: jbenet
twitter: '@jbenet'
website: benet.ai
> data user info --edit
Editing user profile. [Current value].
Full Name: [] Juan Batiz-Benet
Website Url: []
Github username: []
Twitter username: []
Profile saved.
> data user info
name: Juan Batiz-Benet
email: juan@benet.ai
> data user pass
Username: juan
Current Password:
New Password (6 char min):
Password changed. You will receive an email notification.
> data user url
http://datadex.io:8080/juan
data manifest (plumbing)
> data manifest add filename
data manifest: added filename
> data manifest hash filename
data manifest: hashed 61a66fd
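The plumbing above maintains a Manifest mapping filenames to blob hashes, in two steps: add registers a file, hash records its current content hash. A rough sketch of that state (hypothetical; the on-disk Manifest format and hash scheme are the tool's own):

```python
import hashlib

# A toy Manifest: filename -> blob hash (None until hashed),
# mirroring `data manifest add` / `data manifest hash`.
class Manifest:
    def __init__(self):
        self.files = {}  # filename -> 7-char blob hash, or None before hashing

    def add(self, filename: str) -> None:
        self.files.setdefault(filename, None)

    def hash(self, filename: str, contents: bytes) -> str:
        h = hashlib.sha1(contents).hexdigest()[:7]
        self.files[filename] = h
        return h

m = Manifest()
m.add("Datafile")
print(m.hash("Datafile", b"handle: jbenet/foo@1.0\n"))
```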
