
Re: How to package human, mouse and viral genomes?



You might like to listen to the DebConf20 talk on DataLad ;-)

At some point I even started to put together some kind of dh-datalad
helper, so that a .deb package would contain a datalad dataset
(git/git-annex repo) and would just `get` the data files upon
installation...  So -- yes, such packages would not be "self contained",
but that is infeasible for any sizeable data package in Debian anyway.
But they could be versioned, point to a specific git state of the
corresponding datasets, provide lightweight and efficient upgrades (only
changed/new files would need to be fetched), etc.  They could be
partitioned into smaller subdatasets, or custom views could be provided,
like we have

https://github.com/datalad-datasets/hcp-structural-preprocessed
which is a selection from a larger
https://github.com/datalad-datasets/human-connectome-project-openaccess
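
For illustration only, here is a minimal sketch of what such an
install-time step might boil down to.  It is not the unfinished
dh-datalad helper; the function names, the dataset URL, the target
directory, the commit id and the file path are all hypothetical.  It
only assumes the standard `datalad clone` and `datalad get -d` commands
plus a plain `git checkout` to pin the dataset state, and it defaults to
a dry run that prints the commands instead of executing them.

```python
import os
import shlex
import subprocess


def plan_install(url, dest, commit, paths):
    """Return the commands (as argv lists) for an install-time data fetch.

    All arguments are illustrative; a real helper would take them from
    the package metadata.
    """
    # Address files by their location inside the cloned dataset.
    targets = [os.path.join(dest, p) for p in paths]
    return [
        # Clone only the lightweight git/git-annex repository, no file content.
        ["datalad", "clone", url, dest],
        # Pin the dataset to the git state this package version refers to.
        ["git", "-C", dest, "checkout", "--detach", commit],
        # Fetch content for the requested files only.
        ["datalad", "get", "-d", dest, *targets],
    ]


def run_install(commands, dry_run=True):
    """Print the commands (dry run) or execute them one by one."""
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical values, for demonstration only.
    run_install(
        plan_install(
            "https://example.org/genomes.git",
            "/var/lib/example-genomes",
            "0123abc",
            ["reference/genome.fa"],
        )
    )
```

Since content is fetched per file, an upgrade along these lines would
only need to check out the new state and `get` again; files already
present would not have to be downloaded a second time.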

Never finished that helper though -- we just (develop and) use datalad
directly, and have had no Debian packages which would need a strict
dependency on the datasets.  More sample datasets can be found on
https://datasets.datalad.org/ -- the data primarily comes from the
original repositories, and now covers > 200TB

We had started to collect resources relevant to bioinformatics that
someone might like to datalad'ify:
https://github.com/datalad/datalad/milestone/14?closed=1
but since we are not in the bioinformatics field, we never actually
addressed them.

I also know that https://github.com/notestaff is actively using
git-annex (not sure about datalad -- but he did submit some issues, so
he might) for bioinformatics.  Might be worth checking with him
if it is decided to use git-annex/datalad.

On Thu, 03 Sep 2020, Steffen Möller wrote:

> Hello,

> We are closing in on the workflows. What is kind of missing are the
> mostly invariant inputs like the genomes of pathogens and very much so
> the reference genomes of the human, mouse, rat, worm, fly, .... you name
> them.

> Other than a few years ago, hard drives are now big enough to
> accommodate the one or other genome and derivative indexes. Just - I
> don't think we want to organize in our regular Debian infrastructure
> something as variant as public genome (yes, they are still regularly
> updated, very much so) and that is so very security-irrelevant (just
> some data). Also, different sites will vary a lot in where this data
> shall be organized and all those scripts should likely be
> executed/initiated as/by non-root. There are public sites for this from
> where this data can be downloaded. Any redundancy to these sites imho
> mostly hurts us. The other side is that to just get something up quickly
> and for reproducibility tests, our infrastructure is difficult to beat.

> Please kindly throw your ideas at me how you would like whole genomes to
> be presented by Debian to the average user and to professionals. Just
> reply to this thread and/or send me "+1"s via PM and I will summarize
> this in a document which I suggest we then talk about in a jitsi meeting.

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW:   http://www.linkedin.com/in/yarik        

