data - package manager for datasets
Imagine installing datasets like this:
data get jbenet/norb
It's about time we used all we've learned making package managers to fix the awful data management problem. Read the design doc and the roadmap.
Install
Two ways to install:
- from pre-built binary distributions (the easy way)
- from source (the hard way)
Usage
Please see the command reference.
Downloading datasets
Downloading datasets is trivial:
> data get jbenet/mnist
Installed jbenet/mnist@1.0 at datasets/jbenet/mnist@1.0
Adding datasets to projects
Or, if you want to add datasets to a project, create a Datafile like this one:
> cat Datafile
dependencies:
- jbenet/mnist@1.0
- jbenet/cifar-10
- jbenet/cifar-100
Then, run data get to install the dependencies (it works like npm install):
> data get
Installed jbenet/mnist@1.0 at datasets/jbenet/mnist@1.0
Installed jbenet/cifar-10@1.0 at datasets/jbenet/cifar-10@1.0
Installed jbenet/cifar-100@1.0 at datasets/jbenet/cifar-100@1.0
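The dependency step above boils down to reading the `dependencies:` list out of the Datafile and installing each handle. The sketch below is a naive stand-in for illustration only (data itself is a Go tool with a real YAML parser; see datafile.go), and assumes exactly the simple list form shown here:

```python
# Naively extract dependency handles from a Datafile's
# "dependencies:" list. Purely illustrative; the real tool
# parses full YAML.
def datafile_dependencies(text: str) -> list:
    deps, in_deps = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "dependencies:":
            in_deps = True
        elif in_deps and stripped.startswith("- "):
            deps.append(stripped[2:])
        elif stripped and not stripped.startswith("- "):
            in_deps = False  # left the dependencies list
    return deps

datafile = """dependencies:
- jbenet/mnist@1.0
- jbenet/cifar-10
- jbenet/cifar-100
"""
print(datafile_dependencies(datafile))
# ['jbenet/mnist@1.0', 'jbenet/cifar-10', 'jbenet/cifar-100']
```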
You can even commit the Datafile to version control, so your collaborators or users can easily get the data:
> git clone github.com/jbenet/ml-vision-comparisons
> cd ml-vision-comparisons
> data get
Installed jbenet/mnist@1.0 at datasets/jbenet/mnist@1.0
Installed jbenet/cifar-10@1.0 at datasets/jbenet/cifar-10@1.0
Installed jbenet/cifar-100@1.0 at datasets/jbenet/cifar-100@1.0
Publishing datasets
Publishing datasets is simple:
- make a directory with all the files you want to publish.
- cd into it, and run data publish within the directory. data will guide you through creating a Datafile.
- Then, data will upload and publish the package.
> data publish
<lots of output>
Published jbenet/mnist@1.0 (b5f84c2).
Note that uploading can take a long while, as we'll upload all the files to S3, ensuring others can always get them.
Datafile
data tracks the definition of dataset packages and their dependencies in a
Datafile (in the style of Makefile, Vagrantfile, Procfile, and
friends). Both published dataset packages and regular projects use it.
In a way, your project defines a dataset made up of other datasets, like
package.json in npm.
# Datafile format
# A YAML (inc json) doc with the following keys:
# required
handle: <author>/<name>[.<format>][@<tag>]
title: Dataset Title
# optional functionality
dependencies: [<other dataset handles>]
formats: {<format> : <format url>}
# optional information
description: Text describing dataset.
repository: <repo url>
website: <dataset url>
license: <license url>
contributors: ["Author Name [<email>] [(url)]", ...]
sources: [<source urls>]
May be outdated. See datafile.go.
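The handle grammar `<author>/<name>[.<format>][@<tag>]` can be unpacked as in the sketch below, under the assumption that format and tag never themselves contain `.` or `@`. This is illustrative only; the authoritative rules live in datafile.go:

```python
import re

# Parse "<author>/<name>[.<format>][@<tag>]" into its parts.
# format and tag come back as None when absent.
HANDLE_RE = re.compile(
    r"^(?P<author>[^/]+)/(?P<name>[^.@]+)"
    r"(?:\.(?P<format>[^@]+))?(?:@(?P<tag>.+))?$"
)

def parse_handle(handle: str):
    m = HANDLE_RE.match(handle)
    if m is None:
        raise ValueError("invalid handle: " + handle)
    return m.group("author"), m.group("name"), m.group("format"), m.group("tag")

print(parse_handle("jbenet/mnist@1.0"))         # ('jbenet', 'mnist', None, '1.0')
print(parse_handle("jbenet/cifar-10.csv@1.0"))  # ('jbenet', 'cifar-10', 'csv', '1.0')
```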
why yaml?
YAML is much more readable than json. One of data's design goals
is an Intuitive UX. Since the target users are scientists in various domains,
any extra syntax, parse errors, and other annoyances could undermine the
ease of use data aims for. I've always found this
dataset: feynman/spinning-plate-measurements
title: Measurements of Plate Rotation
contributors:
- Richard Feynman <feynman@caltech.edu>
website: http://caltech.edu/~feynman/not-girls/plate-stuff/trial3
much more friendly and approachable than this
{
"dataset": "feynman/spinning-plate-measurements",
"title": "Measurements of Plate Rotation",
"contributors": [
"Richard Feynman <feynman@caltech.edu>"
],
"website": "http://caltech.edu/~feynman/not-girls/plate-stuff/trial3"
}
It's already hard enough to get anyone to do anything. Don't add more hoops to jump through than necessary. Each step will cause significant dropoff in conversion funnels. (Remember, Apple pays Amazon for 1-click buy...)
And, since YAML is a superset of json, you can do whatever you want.
Development
Setup:
- install go
- Run
git clone https://github.com/jbenet/data
cd data
make deps
make install
You'll want to run datadex too.
About
This project started because data management is a massive problem in science.
It should be trivial to (a) find, (b) download, (c) track, (d) manage,
(e) re-format, (f) publish, (g) cite, and (h) collaborate on datasets. Data
management is a problem in other domains (engineering, civics, etc), and data
seeks to be general enough to be used with any kind of dataset, but the target
use case is saving scientists' time.
Many people agree we direly need the "GitHub for Science"; scientific collaboration problems are large and numerous. It is not entirely clear how, and in which order, to tackle these challenges, or even how to drive adoption of solutions across fields. I think simple and powerful tools can solve large problems neatly. Perhaps the best way to tackle scientific collaboration is by decoupling interconnected problems, and building simple tools to solve them. Over time, reliable infrastructure can be built with these. git, github, and arxiv are great examples to follow.
data is an attempt to solve the fairly self-contained issue of downloading,
publishing, and managing datasets. Let's take what computer scientists have
learned about version control and distributed collaboration on source code,
and apply it to the data management problem. Let's build new data tools and
infrastructure with the software engineering and systems design principles
that made git, apt, npm, and github successful.
Acknowledgements
data is released under the MIT License.
Authored by @jbenet. Feel free to contact me at juan@benet.ai, but please post issues on github first.
Special thanks to @colah (original idea and data.py), @damodei, and @davidad, who provided valuable thoughts + discussion on this problem.
Examples
data - dataset package manager
Basic commands:
get Download and install dataset.
list List installed datasets.
info Show dataset information.
publish Guided dataset publishing.
Tool commands:
version Show data version information.
config Manage data configuration.
user Manage users and credentials.
commands List all available commands.
Advanced Commands:
blob Manage blobs in the blobstore.
manifest Generate and manipulate dataset manifest.
pack Dataset packaging, upload, and download.
Use "data help <command>" for more information about a command.
data get
# author/dataset
> data get jbenet/foo
Downloading jbenet/foo from datadex.
get blob b53ce99 Manifest
get blob 2183ea8 Datafile
get blob 63443e4 data.csv
copy blob 63443e4 data.txt
copy blob 63443e4 data.xsl
get blob b53ce99 Manifest
Installed jbenet/foo@1.0 at datasets/jbenet/foo
data list
> data list
jbenet/bar@1.0
data info
> data info jbenet/foo
dataset: jbenet/foo@1.0
title: Foo Dataset
description: The first dataset to use data.
license: MIT
# shows the Datafile
> cat datasets/jbenet/bar/Datafile
dataset: foo/bar@1.1
data publish
> data publish
==> Guided Data Package Publishing.
==> Step 1/3: Creating the package.
Verifying Datafile fields...
Generating manifest...
data manifest: added Datafile
data manifest: added data.csv
data manifest: added data.txt
data manifest: added data.xsl
data manifest: hashed 2183ea8 Datafile
data manifest: hashed 63443e4 data.csv
data manifest: hashed 63443e4 data.txt
data manifest: hashed 63443e4 data.xsl
==> Step 2/3: Uploading the package contents.
put blob 2183ea8 Datafile - uploading
put blob 63443e4 data.csv - exists
put blob b53ce99 Manifest - uploading
==> Step 3/3: Publishing the package to the index.
data pack: published jbenet/foo@1.0 (b53ce99).
Et voila! You can now use data get jbenet/foo to retrieve it!
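Notice in the transcript above that data.csv, data.txt, and data.xsl all hash to the same blob (63443e4), so only one copy is uploaded ("exists"). That is content addressing: blobs are named by a hash of their bytes, so identical files dedupe for free. A sketch of the idea; the abbreviated sha1-style IDs here are an assumption for illustration, not data's documented hashing scheme:

```python
import hashlib

# Name a blob by a hash of its contents, so identical files
# dedupe to a single stored blob. The 7-char abbreviation mirrors
# the IDs shown in the transcripts; the real scheme may differ.
def blob_id(contents: bytes) -> str:
    return hashlib.sha1(contents).hexdigest()[:7]

files = {
    "data.csv": b"1,2,3\n",
    "data.txt": b"1,2,3\n",  # same bytes as data.csv
    "data.xsl": b"1,2,3\n",  # same bytes again
}
blobs = {name: blob_id(data) for name, data in files.items()}
unique = set(blobs.values())
print(len(unique))  # 1 -- three filenames, one stored blob
```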
data config
> data config index.datadex.url http://localhost:8080
> data config index.datadex.url
http://localhost:8080
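Config keys like index.datadex.url are dotted paths into a nested configuration. One way to model that behavior (a hypothetical sketch; data's actual config storage is its own):

```python
# Get/set dotted config keys ("index.datadex.url") in a nested dict,
# mirroring the CLI behavior shown above.
def config_set(cfg: dict, key: str, value: str) -> None:
    *path, last = key.split(".")
    for part in path:
        cfg = cfg.setdefault(part, {})
    cfg[last] = value

def config_get(cfg: dict, key: str) -> str:
    for part in key.split("."):
        cfg = cfg[part]
    return cfg

cfg = {}
config_set(cfg, "index.datadex.url", "http://localhost:8080")
print(config_get(cfg, "index.datadex.url"))  # http://localhost:8080
```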
data user
> data user
data user - Manage users and credentials.
Commands:
add Register new user with index.
auth Authenticate user account.
pass Change user password.
info Show (or edit) public user information.
url Output user profile url.
Use "user help <command>" for more information about a command.
> data user add
Username: juan
Password (6 char min):
Email (for security): juan@benet.ai
juan registered.
> data user auth
Username: juan
Password:
Authenticated as juan.
> data user info
name: ""
email: juan@benet.ai
> data user info jbenet
name: Juan
email: juan@benet.ai
github: jbenet
twitter: '@jbenet'
website: benet.ai
> data user info --edit
Editing user profile. [Current value].
Full Name: [] Juan Batiz-Benet
Website Url: []
Github username: []
Twitter username: []
Profile saved.
> data user info
name: Juan Batiz-Benet
email: juan@benet.ai
> data user pass
Username: juan
Current Password:
New Password (6 char min):
Password changed. You will receive an email notification.
> data user url
http://datadex.io:8080/juan
data manifest (plumbing)
> data manifest add filename
data manifest: added filename
> data manifest hash filename
data manifest: hashed 61a66fd
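The plumbing above maintains a Manifest mapping filenames to blob hashes, in two steps: add registers a file, hash records its current content hash. A rough sketch of that state (hypothetical; the on-disk Manifest format and hash scheme are the tool's own):

```python
import hashlib

# A toy Manifest: filename -> blob hash (None until hashed),
# mirroring `data manifest add` / `data manifest hash`.
class Manifest:
    def __init__(self):
        self.files = {}  # filename -> 7-char blob hash, or None before hashing

    def add(self, filename: str) -> None:
        self.files.setdefault(filename, None)

    def hash(self, filename: str, contents: bytes) -> str:
        h = hashlib.sha1(contents).hexdigest()[:7]
        self.files[filename] = h
        return h

m = Manifest()
m.add("Datafile")
print(m.hash("Datafile", b"handle: jbenet/foo@1.0\n"))
```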
