
Re: How to package human, mouse and viral genomes?



Hi AndreiR,

On 03.09.20 19:47, Andrei Rozanski wrote:
> I am new to Debian Med :)
> Maybe it would be easier to keep a registry of links, md5sum, taxonId,
> database, version, etc. Then, when needed, one can fetch the genomes on
> the fly and check the md5 using scripts that parse the registry. The
> genomes are not huge anyway (unless somebody wants to work with
> Axolotl :) ) so the download is quite fast (especially if using 2bit
> from UCSC; however, 2bit will require twoBitToFasta).
> As time passes, the number of genomes and versions grows, so it
> could be difficult to keep a copy of all genomes needed.
>
> Depending on the database, one could automate the version check and
> find new genomes as they are released.

I tend to think we have two technologies associated with us that kind of
address that. One is BioMaj (https://tracker.debian.org/pkg/biomaj3) and
the other that comes to mind is getData (https://wiki.debian.org/getData).

Neither of the two efforts ever started to shine in our community. In my
reading this is because we did not care. Ideally we come up with something
that per se is independent of either of the two (and future) attempts.
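
To make the registry idea quoted above a bit more concrete, here is a
minimal sketch in Python that depends on neither BioMaj nor getData.
Everything in it is an assumption for illustration only: the
tab-separated layout, the column names (name, taxon_id, database,
version, url, md5) and the helper names are hypothetical and do not
describe an existing Debian Med tool.

```python
import csv
import hashlib
import io
import shutil
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeEntry:
    """One row of the (hypothetical) genome registry."""
    name: str
    taxon_id: int
    database: str
    version: str
    url: str
    md5: str


def parse_registry(text):
    """Parse a tab-separated registry with a header row into entries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        GenomeEntry(
            name=row["name"],
            taxon_id=int(row["taxon_id"]),
            database=row["database"],
            version=row["version"],
            url=row["url"],
            md5=row["md5"].lower(),  # normalise case for comparison
        )
        for row in reader
    ]


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(entry, path):
    """Return True if the file at path matches the registry checksum."""
    return md5_of_file(path) == entry.md5


def fetch(entry, dest):
    """Download entry.url to dest, then fail loudly on a checksum mismatch."""
    with urllib.request.urlopen(entry.url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    if not verify(entry, dest):
        raise ValueError(f"md5 mismatch for {entry.name} {entry.version}")
    return dest
```

Since this only needs a user-writable destination path, it could run
as non-root, and the registry itself would be a small text file that
is easy to review and update when upstream releases a new version.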

Steffen

>
> On September 3, 2020 17:16:32 Steffen Möller <steffen_moeller@gmx.de>
> wrote:
>
>> Hello,
>>
>> We are closing in on the workflows. What is kind of missing are the
>> mostly invariant inputs like the genomes of pathogens and very much so
>> the reference genomes of the human, mouse, rat, worm, fly, .... you name
>> them.
>>
>> Unlike a few years ago, hard drives are now big enough to
>> accommodate one or the other genome and derivative indexes. Just - I
>> don't think we want to organize in our regular Debian infrastructure
>> something as variable as public genomes (yes, they are still regularly
>> updated, very much so) and that is so very security-irrelevant (just
>> some data). Also, different sites will vary a lot in where this data
>> shall be organized and all those scripts should likely be
>> executed/initiated as/by non-root. There are public sites for this from
>> where this data can be downloaded. Any redundancy to these sites imho
>> mostly hurts us. The other side is that to just get something up quickly
>> and for reproducibility tests, our infrastructure is difficult to beat.
>>
>> Please kindly throw your ideas at me on how you would like whole genomes to
>> be presented by Debian to the average user and to professionals. Just
>> reply to this thread and/or send me "+1"s via PM and I will summarize this
>> in a document which I suggest we then talk about in a jitsi meeting.
>>
>> Best,
>>
>> Steffen
>

