
Re: How to package human, mouse and viral genomes?



On Fri, 04 Sep 2020, Steffen Möller wrote:

>  * sharing data between colleagues - can you have two different versions
> at the same time?

sure, similar to git... well -- it is git ;)  So multiple versions
across collaborators, multiple versions on your own box, etc. -- all
possible.  Once you really get into it, you might even like to start
using BTRFS as your filesystem -- it provides an awesome CoW
(copy-on-write) feature, so you could clone your huge datasets without
wasting too much space.

Re versions: especially mind-blowing is the ability to quickly switch
between versions -- "large" files are just symlinks.  The only remaining
gotcha -- switching between versions of a dataset that has subdatasets
is not yet convenient, but it is possible to keep multiple clones of the
dataset hierarchy, each at a different version.
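
To illustrate why switching is quick when "large" files are just symlinks, here is a toy sketch in Python. It is not git-annex's actual on-disk layout: the `store`/`work` directories, the function names and the `genome.fa` file are all made up for illustration. Each piece of content is stored once under its checksum, and a "version" is only a mapping from file name to stored content, so a checkout merely repoints symlinks.

```python
import hashlib
import os
import tempfile


def add_to_store(store, data):
    """Write data once under its SHA-256 digest and return the stored path."""
    key = hashlib.sha256(data).hexdigest()
    path = os.path.join(store, key)
    if not os.path.exists(path):
        with open(path, "wb") as fh:
            fh.write(data)
    return path


def checkout(worktree, version):
    """Repoint every file name in the work tree at this version's content."""
    for name, target in version.items():
        link = os.path.join(worktree, name)
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(target, link)


def read(path):
    with open(path, "rb") as fh:
        return fh.read()


def demo():
    """Switch one hypothetical file between two versions and back."""
    root = tempfile.mkdtemp()
    store = os.path.join(root, "store")
    work = os.path.join(root, "work")
    os.makedirs(store)
    os.makedirs(work)

    # Two "versions" of the same file name; no content is ever copied.
    v1 = {"genome.fa": add_to_store(store, b">chr1\nACGT\n")}
    v2 = {"genome.fa": add_to_store(store, b">chr1\nACGTTT\n")}

    genome = os.path.join(work, "genome.fa")
    checkout(work, v1)
    first = read(genome)
    checkout(work, v2)
    second = read(genome)
    checkout(work, v1)
    third = read(genome)
    return first, second, third, os.path.islink(genome)
```

The cost of a switch here depends on the number of files, not on their size, which is the property described above.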

>  * I see this mostly orthogonal to the question how we organize our data
> relative to whatever "dataRoot" we define

well -- you could have disjoint datasets; it is not required to bring
them all up into a superdataset, although that could have benefits.

>  * we still have a community-effort to collect the data from somewhere
> (which likely is not a git repository) and post-process it (like some
> indexing for a variety of tools) and to finally prepare the data somewhere

for "processing" check out "datalad run" and the datalad-container
extension, which provides "datalad containers-run".  Then you could have
your preprocessing entirely reproducible, with simple provenance recorded
within git commits.

handbook on that:
http://handbook.datalad.org/en/latest/basics/basics-run.html
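
As a rough sketch of what such a recorded step could look like, the Python snippet below only assembles a `datalad run` command line. The `-m`, `--input` and `--output` options are part of `datalad run`; the helper name, the `genomes/human/...` paths and the `samtools faidx` indexing step are assumptions picked for illustration, not something agreed on in this thread.

```python
import shlex
import subprocess  # only needed if the commented-out call below is enabled


def datalad_run_argv(message, command, inputs=(), outputs=()):
    """Assemble the argument list for a `datalad run` invocation."""
    argv = ["datalad", "run", "-m", message]
    for path in inputs:
        argv += ["--input", path]    # fetched before the command runs
    for path in outputs:
        argv += ["--output", path]   # unlocked so the command may write it
    argv.append(command)
    return argv


# Hypothetical example: index a FASTA file kept in a dataset.
argv = datalad_run_argv(
    message="Index human genome for samtools",
    command="samtools faidx genomes/human/genome.fa",
    inputs=["genomes/human/genome.fa"],
    outputs=["genomes/human/genome.fa.fai"],
)
print(shlex.join(argv))

# Inside a DataLad dataset with samtools installed, one could execute it:
# subprocess.run(argv, check=True)
```

The command, its declared inputs and outputs, and the commit message would then all end up in the git history of the dataset.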

And Michael is ATM actively looking into making snakemake
tolerate datalad (well, git-annex), so you might like to define your
snakemake workflows.

>  * with some agreement between us on how to formulate the metadata in a
> machine-readable manner so we know what tool needs to check out what
> files for which workflows

unfortunately I cannot recommend anything specific ATM since I am not
familiar with bioinformatics metadata and its use within workflows.

> I should now read a bit in your handbook. And think a bit more about it
> over the weekend.

I hope you like it.  Adina and Michael did (well -- are still doing) an
awesome job with it.

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik        

