Re: How to package human, mouse and viral genomes?
On Fri, 04 Sep 2020, Steffen Möller wrote:
> * sharing data between colleagues - can you have two different versions
> at the same time?
sure, similar to git... well -- it is git ;) so multiple versions
across collaborators, multiple versions on your own box etc -- all
possible. Once you really get into it, you might even like to start
using BTRFS as your filesystem -- it provides an awesome CoW
(copy-on-write) feature, so you could clone your huge datasets many
times without wasting too much space.
Re versions: especially mind-blowing is the ability to quickly switch
between versions -- "large" files are just symlinks. The only remaining
gotcha -- switching between versions of a dataset that has subdatasets
is not yet made convenient, but it is possible to have multiple clones
of the dataset hierarchy, each at a different version.
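To illustrate the "just symlinks" point, here is a minimal sketch, assuming the default (locked, symlink-based) git-annex mode where an annexed file is a symlink pointing under .git/annex/objects and dangles until its content is fetched. The helper name and the labels it returns are made up for illustration; they are not part of datalad or git-annex.

```python
import os


def annex_file_state(path):
    """Classify a path in a git-annex working tree (symlink mode assumed)."""
    if not os.path.islink(path):
        return "regular file" if os.path.exists(path) else "missing"
    target = os.readlink(path)
    # Annexed files point somewhere under .git/annex/objects.
    if ".git/annex/objects" not in target.replace(os.sep, "/"):
        return "other symlink"
    # os.path.exists() follows the symlink, so False means the link dangles,
    # i.e. the content has not been fetched into this clone.
    if os.path.exists(path):
        return "annexed, content present"
    return "annexed, content not present"
```

Since a version switch only has to rewrite such links and not the data behind them, it stays quick even for huge files.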
> * I see this mostly orthogonal to the question how we organize our data
> relative to whatever "dataRoot" we define
well -- you could have disjoint datasets; it is not required to bring
them all together under a superdataset, although that could have benefits.
> * we still have a community-effort to collect the data from somewhere
> (which likely is not a git repository) and post-process it (like some
> indexing for a variety of tools) and to finally prepare the data somewhere
for "processing", check out "datalad run" and the datalad-container
extension, which provides "datalad containers-run". Then you could have
your preprocessing entirely reproducible, with simple provenance recorded
within git commits.
handbook on that:
http://handbook.datalad.org/en/latest/basics/basics-run.html
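A rough sketch of how such an invocation could be put together, assuming the --message, --input, --output and --container-name options of "datalad run" / "datalad containers-run". The helper function, the samtools command and the genome paths are hypothetical, picked only as an example of an indexing step; the script just prints the command line and does not run it.

```python
import shlex


def datalad_run_argv(command, inputs=(), outputs=(), message=None, container=None):
    """Build the argv for `datalad run`, or `datalad containers-run`
    when a container name is given."""
    argv = ["datalad", "containers-run" if container else "run"]
    if container:
        argv += ["--container-name", container]
    if message:
        argv += ["--message", message]
    for path in inputs:
        argv += ["--input", path]
    for path in outputs:
        argv += ["--output", path]
    # The command to record goes last, as a single string.
    argv.append(command)
    return argv


# Hypothetical example: index a reference genome and record the provenance.
argv = datalad_run_argv(
    "samtools faidx genomes/GRCh38.fa",
    inputs=["genomes/GRCh38.fa"],
    outputs=["genomes/GRCh38.fa.fai"],
    message="Index the human reference genome",
)
print(shlex.join(argv))
```

Passing the printed line to a shell inside a dataset (or the list to subprocess.run) would execute the command and commit the outputs along with a record of how they were produced.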
And Michael ATM is actively looking into making snakemake
tolerate datalad (well, git-annex), so you might like to define your
snakemake workflows.
> * with some agreement between us on how to formulate the metadata in a
> machine-readable manner so we know what tool needs to check out what
> files for which workflows
unfortunately cannot recommend anything specific ATM since not familiar
with bioinformatics metadata and its use within workflows.
> I should now read a bit in your handbook. And think a bit more about it
> over the weekend.
I hope you like it. Adina and Michael did (well -- are still doing) an
awesome job with it.
--
Yaroslav O. Halchenko
Center for Open Neuroscience http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW: http://www.linkedin.com/in/yarik