in action
late 2014
Max Ogden @maxogden
dat is an open source tool for sharing and collaborating on data
started august '13
we are grant funded and 100% open source
reproducible science
analogy time:
let's talk about source control
life before git
i want to fix a bug in cool-project
  1. somehow get a zip of cool-project
  2. unpack and edit a file
  3. email the file back
  4. ????
maintainer creates new zip of cool-project that might contain my fix
all in all a mess
git to the rescue
  1. git clone git://github.com/cool-project
  2. edit a file
  3. git add file
  4. git commit -m "fixed issue"
  5. git push
getting the latest changes is as simple as

git pull
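the clone / commit / push / pull loop above can be tried without a network. a minimal sketch, assuming git is installed: a local bare repository stands in for the hosted cool-project, and every path, name and file content below is made up for illustration.

```shell
set -eu
# throwaway sandbox; all paths and identities here are invented for the sketch
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# a local bare repository stands in for the hosted cool-project
git init --quiet --bare "$work/cool-project.git"

# maintainer publishes a first version that contains the bug
git clone --quiet "$work/cool-project.git" "$work/maintainer" 2>/dev/null
echo "buggy" > "$work/maintainer/file"
git -C "$work/maintainer" add file
git -C "$work/maintainer" commit --quiet -m "first version"
git -C "$work/maintainer" push --quiet -u origin HEAD

# contributor: the five steps from the slide
git clone --quiet "$work/cool-project.git" "$work/me"
echo "fixed" > "$work/me/file"
git -C "$work/me" add file
git -C "$work/me" commit --quiet -m "fixed issue"
git -C "$work/me" push --quiet

# maintainer: getting the latest changes
git -C "$work/maintainer" pull --quiet
cat "$work/maintainer/file"   # prints "fixed"
```

nobody emails a file and nobody rebuilds a zip: the fix travels as a commit.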
claim: currently data sharing is a mess
email csv files
database dumps in git
we want to do for data what git did for source code
npm install -g dat
  1. dat init
  2. collect-data | dat import
  3. dat listen
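collect-data in step 2 is a placeholder for whatever produces your rows. a rough sketch of what such a producer could look like, assuming line-delimited JSON on stdout is an acceptable input for the import step; the function name, field names and sample files are all hypothetical, and the exact import options depend on the dat version.

```shell
set -eu
# collect_data: hypothetical stand-in for the `collect-data` command on the slide.
# emits one JSON object per line, one per regular file in directory $1.
# file names containing quotes or backslashes would need escaping; skipped here.
collect_data() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    bytes=$(wc -c < "$f" | tr -d ' ')
    printf '{"name":"%s","bytes":%s}\n' "$(basename "$f")" "$bytes"
  done
}

# sample input invented for the sketch
dir=$(mktemp -d)
printf 'hello' > "$dir/a.txt"
printf 'hi' > "$dir/b.txt"

rows=$(collect_data "$dir")
printf '%s\n' "$rows"
# prints:
# {"name":"a.txt","bytes":5}
# {"name":"b.txt","bytes":2}

# with dat installed, step 2 from the slide would then be:
#   collect_data "$dir" | dat import
```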
max, import your genome into dat
dat clone your-data-set.com
getting the latest changes is as simple as

dat pull
to keep syncing as new changes arrive

dat pull --live
attach binary blobs to data

dat blobs put my-key file.ext
data is stored locally in leveldb
blobs are stored in blob-stores
https://github.com/maxogden/abstract-blob-store
choose the blob store that fits your use case: s3, local-fs, bittorrent, ftp, etc
- auto schema generation
- free REST API
- *all* APIs are streaming
a data set we can all relate to: npm
dat clone npm.dathub.org
dat clone npm.dathub.org --skim
calculate how big npm is using dat
dat cat | transform
dat cat | docker run -i transform
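transform here is any program that reads rows on stdin and writes a result on stdout. a minimal sketch of one that adds up sizes, assuming the rows arrive as one JSON object per line and that each row carries a numeric "size" field; the field name and the sample rows are hypothetical, and the regex is a crude stand-in for a real JSON parser.

```shell
set -eu
# transform: hypothetical stand-in for the `transform` step on the slide.
# streams stdin line by line and sums the numeric "size" field where present.
transform() {
  awk '
    match($0, /"size":[ ]*[0-9]+/) {
      field = substr($0, RSTART, RLENGTH)   # e.g. "size":100
      sub(/^"size":[ ]*/, "", field)        # keep only the digits
      total += field
    }
    END { printf "%.0f\n", total }
  '
}

# sample rows invented for the sketch; the third has no size and is skipped
total=$(printf '%s\n' \
  '{"name":"a","size":100}' \
  '{"name":"b","size":250}' \
  '{"name":"c"}' | transform)
echo "$total"   # prints 350

# with dat installed, the slide's pipeline would be: dat cat | transform
```

because the transform only sees one line at a time, it never has to hold the whole data set in memory, which is the point of streaming APIs.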
transform the npm data using bulk-markdown-to-png
use case: bionode bioinformatics tools on npm
data pipelines, dependency management, data streaming
gasket is a cross platform pipeline manager
datscript is an experimental pipeline config language
the future
branches, dat checkout 3b2d98v3, multi master replication, sync to databases, registry
get involved in #dat on freenode and maxogden/dat