
Re: How to package human, mouse and viral genomes?



I looked at datasets.datalad.org. I can well imagine using your
technology for other (larger) databases like Pfam, UniProt or PDB. For
cute little genomes, though, my initial reaction was to feel overwhelmed.
Your pointer will certainly help us define what we want. Many thanks!
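
For reference, a minimal sketch of the install-time fetch described in the
quoted mail below: clone a dataset, then `get` only the files that are
needed. It only assembles the `datalad clone` and `datalad get` command
lines and does not run them. The helper name, the target directory and the
file path are hypothetical, and this is not the unfinished dh-datalad helper.

```python
import posixpath
import shlex


def datalad_fetch_commands(source_url, target_dir, paths):
    """Return argv lists for cloning a dataset and fetching selected files.

    Hypothetical helper for illustration only; nothing is executed here.
    """
    commands = [["datalad", "clone", source_url, target_dir]]
    if paths:
        # Use full paths so the command does not depend on the current directory.
        full_paths = [posixpath.join(target_dir, p) for p in paths]
        # `datalad get -d <dataset> <path>...` retrieves file content on demand.
        commands.append(["datalad", "get", "-d", target_dir, *full_paths])
    return commands


if __name__ == "__main__":
    # Dataset URL taken from the quoted mail; target dir and path are made up.
    for argv in datalad_fetch_commands(
        "https://github.com/datalad-datasets/hcp-structural-preprocessed",
        "/srv/data/hcp-structural",
        ["README.md"],
    ):
        print(shlex.join(argv))
```

A package maintainer script could run such commands as a non-root user, so
that only the requested files are downloaded.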

On 03.09.20 18:09, Yaroslav Halchenko wrote:
> You might like to listen to debconf20 talk on DataLad ;-)
>
> At some point I even started to establish some kind of dh-datalad
> helper so that a .deb package would contain a datalad dataset
> (git/git-annex repo), and would just `get` the data files upon
> installation...  So -- yes, they would not be "self contained", but that
> is infeasible for any sizeable data package in Debian anyway.  But they could
> be versioned, point to specific git state of corresponding datasets,
> provide lightweight and efficient upgrades (only changed/new files would
> need to be fetched), etc.  They could be partitioned into smaller
> subdatasets, or custom views could be provided, like we have
>
> https://github.com/datalad-datasets/hcp-structural-preprocessed
> which is a selection from a larger
> https://github.com/datalad-datasets/human-connectome-project-openaccess
>
> Never finished that helper though -- we just (develop and) use datalad
> directly and had no Debian packages which would need a strict dependency
> on the datasets.  More sample datasets can be found on
> https://datasets.datalad.org/ -- the data primarily comes from the original
> repositories, and now covers > 200TB
>
> We had started to collect resources relevant to bioinformatics that
> someone might like to datalad'ify:
> https://github.com/datalad/datalad/milestone/14?closed=1
> but since we are not in the bioinformatics field, we never actually
> addressed them.
>
> I also know that https://github.com/notestaff is actively using
> git-annex (not sure if datalad -- but he did submit some issues, so he
> might) for bioinformatics.  It might be worth checking with him
> if you decide to use git-annex/datalad.
>
> On Thu, 03 Sep 2020, Steffen Möller wrote:
>
>> Hello,
>> We are closing in on the workflows. What is kind of missing are the
>> mostly invariant inputs like the genomes of pathogens and very much so
>> the reference genomes of the human, mouse, rat, worm, fly, .... you name
>> them.
>> Unlike a few years ago, hard drives are now big enough to
>> accommodate one genome or another and its derivative indexes. Just - I
>> don't think we want to organize in our regular Debian infrastructure
>> something as variable as public genomes (yes, they are still regularly
>> updated, very much so) and that is so very security-irrelevant (just
>> some data). Also, different sites will vary a lot in where this data
>> shall be organized and all those scripts should likely be
>> executed/initiated as/by non-root. There are public sites for this from
>> where this data can be downloaded. Any redundancy with these sites imho
>> mostly hurts us. The other side is that to just get something up quickly
>> and for reproducibility tests, our infrastructure is difficult to beat.
>> Please kindly throw your ideas at me on how you would like whole genomes to
>> be presented by Debian to the average user and to professionals. Just
>> reply to this thread and/or send me "+1"s in a PM, and I will summarize this
>> in a document which I suggest we then talk about in a jitsi meeting.

