
Large static datasets like genomes (Re: Reasonable maximum package size ?)


On 5 Jun 2007, at 9:47 pm, Roger Leigh wrote:

> Anthony Towns <aj@azure.humbug.org.au> writes:
>
>> On Tue, Jun 05, 2007 at 06:28:53PM +0900, Charles Plessy wrote:
>>> On Tue, Jun 05, 2007 at 10:09:07AM +0200, Michael Hanke wrote:
>>>> My question is now: is it reasonable to provide this rather huge
>>>> amount of data in a package in the archive?
>>>
>>> Many thanks for bringing this crucial question to -devel. In my
>>> field, I wish it were possible to apt-get install the human genome
>>> for
>>
>> Are either of you going to debconf, or able to point out some example
>> large (free?) data sets that should be packaged like this as a test
>> case for playing with over debconf?
>
> The NCBI non-redundant database (nr).  Having this packaged and
> frequently updated (maybe in volatile) would be fantastic.  There are
> also quite a number of other significant (popular) databases used for
> bioinformatics, genomics, proteomics and other biological fields which
> would be really nice to have in Debian.  Here's a selection:
>
> [...]
>
> Because these are all in standard formats, it might even be possible
> to have updated packages generated and uploaded semi-automatically.
> These would be really useful in conjunction with much of the
> bioinformatics software already available in Debian, which could make
> good use of them if they were put in standardised locations.

Obviously such things can't be put into Debian directly; the Debian archive doesn't update nearly fast enough. There's nothing to stop you creating a repository of data .debs, separate from Debian itself, which people can track, especially if the BitTorrent distribution method becomes a reality.

Here at the Sanger Institute, we use Debian heavily, but we specifically *don't* use Debian tools for either bioinformatics software or the large data sets. Reasons for this:

1) Currently, package management requires root access, and we want users to be able to update their own data sets. (Aside: I'd love to see some sort of "user package" system that let non-root users install packages into areas they have access to, while still getting full dependency checking against the main system packages.)
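To make the "user package" idea above concrete, here is a toy sketch. No such Debian tool exists as far as I know, so everything here is hypothetical: it parses dpkg-status-style stanzas to check that a dependency is already installed on the main system before a package would be unpacked into the user's home directory (e.g. with dpkg-deb -x). The sample stanzas are inlined for illustration; a real tool would read /var/lib/dpkg/status and use proper version comparison.

```python
# Toy sketch of "user package" dependency checking against the main
# system -- a hypothetical tool, not anything that exists in Debian.
# A real implementation would read /var/lib/dpkg/status; a sample in
# the same stanza format is inlined here for illustration.

SYSTEM_STATUS = """\
Package: ncbi-blast+
Status: install ok installed
Version: 2.2.15-1

Package: perl
Status: install ok installed
Version: 5.8.8-7
"""

def installed_packages(status_text):
    """Map package name -> version for stanzas marked installed."""
    pkgs = {}
    for stanza in status_text.split("\n\n"):
        fields = dict(
            line.split(": ", 1) for line in stanza.splitlines() if ": " in line
        )
        if fields.get("Status", "").endswith("installed"):
            pkgs[fields["Package"]] = fields["Version"]
    return pkgs

def deps_satisfied(depends, status_text):
    """Check a simple comma-separated Depends list (names only,
    no version constraints -- a real checker would need those too)."""
    pkgs = installed_packages(status_text)
    return all(dep.strip() in pkgs for dep in depends.split(","))

# A user-level data package that depends on the system BLAST install:
print(deps_satisfied("ncbi-blast+, perl", SYSTEM_STATUS))  # True
print(deps_satisfied("ensembl-core", SYSTEM_STATUS))       # False
```

The point is only that the dependency check needs read access to the system package database, not root; the unpacking itself could then go into any directory the user owns.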

2) Having large data sets in packages installed on lots of machines wastes a lot of disk space, especially when modern cluster filesystems can get you just as good performance with only a single copy of the data.

3) The frequency with which updates are required is much too high to make the effort of packaging worth it. Sanger instead has a single centrally maintained collection of data sets, with a MySQL database attached which stores things like (a) the upstream version of the data, (b) when it was downloaded, and (c) which local machines are known to have copies. Scripts can then be run to answer questions like: what version of a database can I see locally, and is there a newer version available centrally? If so, which files do I need to rsync to obtain a new copy?
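The book-keeping just described can be sketched in a few lines. This is an illustration only, with sqlite standing in for the MySQL database and the schema, table name, and sample rows all invented: the catalogue records each data set's upstream version, download date, and the hosts known to hold a copy, and a query tells a node whether its local copy is stale (at which point it would rsync the needed files from the central store).

```python
import sqlite3

# Sketch of the central data-set catalogue described above.  sqlite
# stands in for MySQL, and the schema is invented for illustration:
# each row records (a) upstream version, (b) download date, and
# (c) a host known to hold that copy.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE datasets (
        name TEXT, version TEXT, downloaded TEXT, host TEXT
    )
""")
db.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    [
        ("ncbi-nr", "2007-05-28", "2007-05-29", "central"),
        ("ncbi-nr", "2007-05-28", "2007-05-30", "node042"),
        ("ncbi-nr", "2007-06-04", "2007-06-05", "central"),
    ],
)

def latest_version(name):
    """Newest version of a data set held on the central store."""
    row = db.execute(
        "SELECT MAX(version) FROM datasets WHERE name=? AND host='central'",
        (name,),
    ).fetchone()
    return row[0]

def local_is_stale(name, host):
    """Does this host need to rsync a fresh copy from central?"""
    row = db.execute(
        "SELECT MAX(version) FROM datasets WHERE name=? AND host=?",
        (name, host),
    ).fetchone()
    return row[0] is None or row[0] < latest_version(name)

print(latest_version("ncbi-nr"))             # 2007-06-04
print(local_is_stale("ncbi-nr", "node042"))  # True: node042 has 2007-05-28
```

Using date-stamped version strings keeps the comparison a simple string ordering; the actual transfer would be a plain rsync of whichever files the catalogue lists for the newer version.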

> As has been mentioned previously, a separate archive section so that
> mirrors could skip them would be nice.  Together, all these databases
> are eye-wateringly huge.  Especially when uncompressed.

If you archive them, it really is going to get eye-watering. As you know, these databases are growing exponentially, and if you want to save historical versions as well... argh! Even Ensembl has abandoned the idea of keeping the data forever.

As an aside: several people on this list are interested in large-scale biological data, and you may or may not be aware that I'm giving a presentation at debconf about what we're doing at the Sanger Institute and how Debian fits in. I imagine some of you would like to come to that talk; if you wish to contact me off list to suggest things you particularly want me to talk about, please do. Some of the topics I could cover:

1) Management of a thousand-node cluster (choice of hardware, automated installation, configuration management, monitoring)
2) Parallel filesystems (Lustre, GPFS, PVFS, etc.)
3) Scalability issues in genomic analysis (especially in the software which builds and then presents http://www.ensembl.org)
4) Maintaining our own package repository
5) Migration from Tru64 to Debian
6) Multipath SAN access, failover and so on
7) Approaches to job scheduling on large clusters
8) Problems with MySQL at this scale

Feel free to suggest to me things that you'd find interesting to talk about.



The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
