Large static datasets like genomes (Re: Reasonable maximum package size ?)

To: Roger Leigh <rleigh@whinlatter.ukfsn.org>
Cc: debian-devel@lists.debian.org
Subject: Large static datasets like genomes (Re: Reasonable maximum package size ?)
From: Tim Cutts <timc@chiark.greenend.org.uk>
Date: Wed, 6 Jun 2007 11:28:02 +0100
Message-id: <3C0C43F7-5522-44FB-A095-7227005F73B0@chiark.greenend.org.uk>
In-reply-to: <874plm163a.fsf@hardknott.home>
References: <20070605080907.GA3416@gloin> <20070605092853.GF19396@kunpuu.plessy.org> <20070605155833.GC10266@azure.humbug.org.au> <874plm163a.fsf@hardknott.home>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On 5 Jun 2007, at 9:47 pm, Roger Leigh wrote:

> Anthony Towns <aj@azure.humbug.org.au> writes:
>
>> On Tue, Jun 05, 2007 at 06:28:53PM +0900, Charles Plessy wrote:
>>> On Tue, Jun 05, 2007 at 10:09:07AM +0200, Michael Hanke wrote:
>>>> My question is now: Is it reasonable to provide this rather huge
>>>> amount of data in a package in the archive?
>>> many thanks for bringing this crucial question on -devel. In my
>>> field, I wish that it would be possible to apt-get install the
>>> human genome for instance.
>> Are either of you going to debconf, or able to point out some example
>> large (free?) data sets that should be packaged like this as a test
>> case for playing with over debconf?
>
> The NCBI non-redundant database (nr).  Having this packaged and
> frequently updated (maybe in volatile) would be fantastic.  There are
> also quite a number of other significant (popular) databases used for
> bioinformatics, genomics, proteomics and other biological fields which
> would be really nice to have in Debian.  Here's a selection:
>
> ftp://ftp.ncbi.nih.gov/blast/db/
> ftp://ftp.ncbi.nih.gov/refseq/
> ftp://ftp.ncbi.nih.gov/repository/
> ftp://ftp.ncbi.nih.gov/pub/taxonomy/
>
> Because these are all in standard formats, it might even be possible
> to have updated packages generated and uploaded semi-automatically.
> These would be really useful in conjunction with much of the
> bioinformatics software already available in Debian, which could make
> good use of them if they were put in standardised locations.

Obviously such things can't be put into Debian directly; Debian
doesn't update nearly fast enough. There's nothing to stop you
creating a repository of data .deb files though, separate from Debian
itself, which people can track. Especially if the BitTorrent
distribution method becomes a reality.
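
A data-only .deb of the kind described above needs little more than a
control stanza and the files themselves. The following is a minimal
Python sketch of how such a stanza might be generated semi-automatically
from an upstream snapshot date. The "-data" name suffix, the date-based
version scheme and the description text are assumptions made for
illustration, not anything proposed in this thread.

```python
import re
from datetime import date


def debian_version(release_date: date, revision: int = 1) -> str:
    """Turn an upstream snapshot date into a sortable version string.

    Assumption: upstream releases are identified by date, so YYYYMMDD
    sorts correctly as the upstream part of the version.
    """
    return f"{release_date:%Y%m%d}-{revision}"


def control_stanza(dataset: str, release_date: date, maintainer: str) -> str:
    """Build a minimal binary control stanza for a data-only package."""
    # Package names may only contain lowercase letters, digits, '+', '-'
    # and '.', so anything else is replaced with '-'.
    name = re.sub(r"[^a-z0-9+.-]", "-", dataset.lower())
    fields = [
        ("Package", f"{name}-data"),          # hypothetical naming scheme
        ("Version", debian_version(release_date)),
        ("Architecture", "all"),              # data is architecture-independent
        ("Maintainer", maintainer),
        ("Description",
         f"{dataset} static data set (snapshot {release_date.isoformat()})"),
    ]
    return "\n".join(f"{key}: {value}" for key, value in fields) + "\n"


if __name__ == "__main__":
    # Hypothetical maintainer address, used only for this demonstration.
    print(control_stanza("nr", date(2007, 6, 6),
                         "Example Maintainer <maintainer@example.org>"))
```

Because the version is derived from the snapshot date, a newer download
always sorts above an older one, which is what lets people track such a
repository with the usual tools.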

Here at the Sanger Institute, we use Debian heavily, but we
specifically *don't* use Debian tools for either bioinformatics
software or the large data sets. Reasons for this:

1) Currently, package management requires root access, and we want
users to be able to update their own data sets (aside: I'd love it
if we could have some sort of "user package" system which could allow
non-root users to install software packages in areas they have access
to, and yet have full dependency checking on the main system packages)

2) Having large data sets in packages installed on lots of machines
wastes a lot of disk space, especially when modern cluster
filesystems can get you just as good performance with only a single
copy of the data.

3) The frequency with which updates are required is much too high to
make the effort of the packaging worth it. Sanger has instead a
single centrally maintained set of data sets, with a MySQL database
attached which stores things like (a) the upstream version of the
data (b) when it was downloaded (c) which local machines are known to
have copies. There are then scripts which can be run to do things
like ask what version of a database I can see locally, and whether
there is a newer version available centrally. If so, which files do
I need to rsync to obtain a new copy?
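
The bookkeeping in point 3 can be sketched roughly as follows. This is
not the Sanger code: the table and column names are hypothetical, SQLite
stands in for MySQL so that the example is self-contained, and comparing
per-file checksums is only one assumed way of deciding which files to
hand to rsync.

```python
import sqlite3

# Hypothetical schema covering the three things the catalogue records.
SCHEMA = """
CREATE TABLE dataset (
    name             TEXT PRIMARY KEY,
    upstream_version TEXT NOT NULL,  -- (a) upstream version of the data
    downloaded_at    TEXT NOT NULL   -- (b) when it was downloaded
);
CREATE TABLE dataset_file (
    dataset  TEXT NOT NULL,
    path     TEXT NOT NULL,
    checksum TEXT NOT NULL
);
CREATE TABLE local_copy (            -- (c) which machines hold a copy
    dataset TEXT NOT NULL,
    host    TEXT NOT NULL,
    version TEXT NOT NULL,
    PRIMARY KEY (dataset, host)
);
"""


def central_version(db, name):
    """Return the version held centrally, or None if the data set is unknown."""
    row = db.execute(
        "SELECT upstream_version FROM dataset WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def local_version(db, name, host):
    """Return the version a given host is known to hold, or None."""
    row = db.execute(
        "SELECT version FROM local_copy WHERE dataset = ? AND host = ?",
        (name, host),
    ).fetchone()
    return row[0] if row else None


def update_available(db, name, host):
    """True if the central copy differs from (or is missing on) the host."""
    central = central_version(db, name)
    return central is not None and local_version(db, name, host) != central


def files_to_sync(db, name, local_checksums):
    """List central files that are missing locally or whose checksum differs.

    local_checksums maps path -> checksum for the files present on the host.
    """
    rows = db.execute(
        "SELECT path, checksum FROM dataset_file WHERE dataset = ? ORDER BY path",
        (name,),
    )
    return [path for path, checksum in rows
            if local_checksums.get(path) != checksum]


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    # Made-up example rows, for demonstration only.
    db.execute("INSERT INTO dataset VALUES ('nr', 'v2', '2007-06-06')")
    db.executemany("INSERT INTO dataset_file VALUES ('nr', ?, ?)",
                   [("nr.00", "aaa"), ("nr.01", "bbb")])
    db.execute("INSERT INTO local_copy VALUES ('nr', 'node1', 'v1')")
    if update_available(db, "nr", "node1"):
        print(files_to_sync(db, "nr", {"nr.00": "aaa"}))
```

A host with no recorded copy is treated as needing an update, which seems
the safer default for this kind of catalogue.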

> As has been mentioned previously, a separate archive section so that
> mirrors could skip them would be nice.  Together, all these databases
> are eye-wateringly huge.  Especially when uncompressed.

If you archive them, it really is going to get eye-watering. As you
know, these databases are growing exponentially, and if you want to
save historical versions as well... argh! Even Ensembl has
abandoned the idea of keeping the data forever.

As an aside, several people in this list are interested in large
scale biological data, and you may or may not be aware that I'm
giving a presentation at debconf about what we're doing at the Sanger
Institute, and how Debian fits in. I imagine some of you would like
to come to that talk, and if you wish to contact me off list to
suggest things you particularly want me to talk about, then please
do. Some of the topics I could cover:

1) Management of a thousand node cluster (choice of hardware,
   automated installation, configuration management, monitoring)
2) Parallel filesystems (Lustre, GPFS, PVFS etc)
3) Scalability issues in genomic analysis (especially in the
   software which builds and then presents http://www.ensembl.org)
4) Maintaining our own package repository
5) Migration from Tru64 to Debian
6) Multipath SAN access, failover and so on
7) Approaches to job scheduling on large clusters
8) Problems with MySQL at this scale

Feel free to suggest to me things that you'd find interesting to talk
about.


Tim
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iQEVAwUBRmaMOxypeFo2odvPAQKhEAgArX2wh5j/2mBBYSWCWyFzezsYcgR6P6RE
Ha2xD+z47hlomwrAQ6vvCNcCvdyro0OfneiQSJsuWPGJQW4wkN3UUtjrLkOybkHD
EFNeAZO2qTC0AVzXmk7VpEWYbduPXSZJClGw7nueWFuLjBmnah7Q7nRoqbZFTMu+
RqpQOeDV80oOOVZyAEJ+Kdbw+xr1BMhjWmioxtdyeKmnx6Xfq4r25KhjHSc5kf4J
sBwZ/Fpisw8J9UUHN2AtWF0nTS6TfhsdEWaQzcu5G2WMXD+3l5Ec0y6kQh1VheMi
rU92Sne4Y0hoOauNmALme6tmnMK2bg8z10PEonJHFN9NdAm6ldnDLA==
=AWBI
-----END PGP SIGNATURE-----


--

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
