Re: How to package human, mouse and viral genomes?

To: debian-med@lists.debian.org
Subject: Re: How to package human, mouse and viral genomes?
From: Yaroslav Halchenko <debian@onerussian.com>
Date: Thu, 3 Sep 2020 22:05:31 -0400
Message-id: <20200904020531.GF6250@lena.dartmouth.edu>
In-reply-to: <ecd09a66-8875-e89e-0a15-b6dc3090bfee@gmx.de>
References: <dc8bf264-6648-a02e-7b67-f6acf86fc6e5@gmx.de> <20200903160918.GQ6250@lena.dartmouth.edu> <89341f8d-43f7-763f-b281-766ad075f49b@gmx.de> <20200903195324.GY6250@lena.dartmouth.edu> <bb53256e-5ff8-2316-509c-17f39ecfcefd@gmx.de> <20200903212238.GB6250@lena.dartmouth.edu> <ecd09a66-8875-e89e-0a15-b6dc3090bfee@gmx.de>

On Fri, 04 Sep 2020, Steffen Möller wrote:

>  * sharing data between colleagues - can you have two different versions
> at the same time?

sure, similar to git... well -- it is git ;)  So multiple versions
across collaborators, multiple versions on your own box, etc. -- all
possible.  Once you really get into it, you might even like to start
using BTRFS as your filesystem -- its CoW (copy-on-write) feature lets
you breed your huge datasets without wasting too much space.

Re versions: especially mind-blowing is the ability to quickly switch
between versions -- "large" files are just symlinks.  The only remaining
gotcha -- switching between versions of a dataset that has subdatasets
is not yet made convenient, but it is possible to keep multiple clones
of the dataset hierarchy at different versions.
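
A toy sketch of why switching is cheap when the "large" file in the
working tree is only a symlink: changing version means repointing the
link, not copying content.  This is not git-annex's actual code; the
"objects" directory, the file names and the helper functions are made
up for illustration, and it assumes a POSIX filesystem.

```python
import os
import tempfile


def switch_version(link_path, target_path):
    """Repoint link_path at target_path by replacing the symlink."""
    tmp_link = link_path + ".tmp"
    os.symlink(target_path, tmp_link)
    # Renaming over the old link swaps the symlink itself, not its target.
    os.replace(tmp_link, link_path)


def demo():
    root = tempfile.mkdtemp()
    # Hypothetical stand-in for a content store holding every version.
    store = os.path.join(root, "objects")
    os.makedirs(store)
    for name, text in (("v1", "genome build 1\n"), ("v2", "genome build 2\n")):
        with open(os.path.join(store, name), "w") as fh:
            fh.write(text)

    # The file users see is just a link into the store.
    link = os.path.join(root, "genome.fa")
    switch_version(link, os.path.join(store, "v1"))
    with open(link) as fh:
        first = fh.read()

    # "Checking out" the other version only rewrites the link.
    switch_version(link, os.path.join(store, "v2"))
    with open(link) as fh:
        second = fh.read()
    return first, second
```

Both versions stay on disk once; only the tiny link changes, however
big the content is.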

>  * I see this mostly orthogonal to the question how we organize our data
> relative to whatever "dataRoot" we define

well -- you could have disjoint datasets, it is not required to bring
them all up into a superdataset, although that could have benefits.

>  * we still have a community-effort to collect the data from somewhere
> (which likely is not a git repository) and post-process it (like some
> indexing for a variety of tools) and to finally prepare the data somewhere

for "processing" check out "datalad run" and the datalad-container
extension, which provides "datalad containers-run".  Then you could have
your preprocessing entirely reproducible, with simple provenance
recorded within git commits.

handbook on that:
http://handbook.datalad.org/en/latest/basics/basics-run.html
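
A minimal sketch of wrapping an indexing step in "datalad run" from
Python, assuming datalad is installed and the current directory is a
dataset.  The helper names, the file names and the samtools command are
only an example, not something agreed on in this thread; the handbook
page above is the authoritative description of the options.

```python
import subprocess


def datalad_run_argv(command, message, inputs=(), outputs=()):
    """Build the argument list for a `datalad run` invocation."""
    argv = ["datalad", "run", "-m", message]
    for path in inputs:
        # Declared inputs get fetched before the command runs.
        argv += ["--input", path]
    for path in outputs:
        # Declared outputs get unlocked so the command may write them.
        argv += ["--output", path]
    argv.append(command)
    return argv


def run_in_dataset(argv, dataset_path="."):
    """Execute the call; needs datalad and an existing dataset."""
    subprocess.run(argv, cwd=dataset_path, check=True)


if __name__ == "__main__":
    # Hypothetical example: index a FASTA file and record how it was made.
    argv = datalad_run_argv(
        "samtools faidx genome.fa",
        "Index genome with samtools faidx",
        inputs=["genome.fa"],
        outputs=["genome.fa.fai"],
    )
    print(" ".join(argv))
```

Calling run_in_dataset(argv) would then execute the command and commit
the result together with the record of how it was produced.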

And Michael is ATM actively looking into making snakemake
tolerate datalad (well, git-annex), so you might like to define your
snakemake workflows.

>  * with some agreement between us on how to formulate the metadata in a
> machine-readable manner so we know what tool needs to check out what
> files for which workflows

unfortunately I cannot recommend anything specific ATM since I am not
familiar with bioinformatics metadata and its use within workflows.

> I should now read a bit in your handbook. And think a bit more about it
> over the weekend.

I hope you like it.  Adina and Michael did (well -- are still doing) an
awesome job with it.

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik
