Large static datasets like genomes (Re: Reasonable maximum package size ?)
On 5 Jun 2007, at 9:47 pm, Roger Leigh wrote:
Anthony Towns <firstname.lastname@example.org> writes:
On Tue, Jun 05, 2007 at 06:28:53PM +0900, Charles Plessy wrote:
On Tue, Jun 05, 2007 at 10:09:07AM +0200, Michael Hanke wrote:
My question is now: Is it reasonable to provide this rather huge
amount of data in a package in the archive?
Many thanks for bringing this crucial question on -devel. In my
dreams, I wish that it would be possible to apt-get install the human
genome.
Are either of you going to debconf, or able to point out some example
large (free?) data sets that should be packaged like this, for
playing with over debconf?
The NCBI non-redundant database (nr). Having this packaged and
frequently updated (maybe in volatile) would be fantastic. There are
also quite a number of other significant (popular) databases used for
bioinformatics, genomics, proteomics and other biological fields which
would be really nice to have in Debian.
Because these are all in standard formats, it might even be possible
to have updated packages generated and uploaded semi-automatically.
These would be really useful in conjunction with much of the
bioinformatics software already available in Debian, which could make
good use of them if they were put in standardised locations.
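Because the release numbering of these databases tends to be regular, the semi-automatic update step could start from a check like the one below. This is only a sketch under the assumption of simple dotted version numbers; the database names, versions and function names are invented for illustration, not an existing tool.

```python
from typing import Optional

def needs_rebuild(upstream_version: str, packaged_version: Optional[str]) -> bool:
    """True when upstream is newer than the last packaged version.

    Assumes simple dotted numeric releases (e.g. '91.0'); a real pipeline
    would need per-database version-detection logic.
    """
    if packaged_version is None:
        return True  # never packaged before
    parse = lambda v: tuple(int(x) for x in v.split("."))
    return parse(upstream_version) > parse(packaged_version)

def plan_uploads(upstream, packaged):
    """Names of the databases whose data packages should be regenerated."""
    return sorted(db for db, ver in upstream.items()
                  if needs_rebuild(ver, packaged.get(db)))

# Hypothetical current state: only ncbi-nr has a newer upstream release.
print(plan_uploads({"ncbi-nr": "2007.6", "embl": "91.0"},
                   {"ncbi-nr": "2007.5", "embl": "91.0"}))  # ['ncbi-nr']
```

A cron job wrapping something like this could then fetch the new data, rebuild the .deb and queue it for upload, leaving only the signing step to a human.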
Obviously such things can't be put into Debian directly; Debian
doesn't update nearly fast enough. There's nothing to stop you
creating a repository of data .deb files though, separate from Debian
itself, which people can track. Especially if the BitTorrent
distribution method becomes a reality.
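On the client side, tracking such a separate archive would only take one extra sources.list line, along the lines of the following (the host name, suite and component here are entirely made up):

```
deb http://data.example.org/debian-data etch main
```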
Here at the Sanger Institute, we use Debian heavily, but we
specifically *don't* use Debian tools for either bioinformatics
software or the large data sets. Reasons for this:
1) Currently, package management requires root access, and we want
users to be able to update their own data sets (aside: I'd love it
if we could have some sort of "user package" system which could allow
non-root users to install software packages in areas they have access
to, and yet have full dependency checking on the main system packages)
2) Having large data sets in packages installed on lots of machines
wastes a lot of disk space, especially when modern cluster
filesystems can get you just as good performance with only a single
copy of the data.
3) The frequency with which updates are required is much too high to
make the effort of packaging worth it. Instead, Sanger maintains a
single central collection of data sets, with a MySQL database
attached which stores things like (a) the upstream version of the
data (b) when it was downloaded (c) which local machines are known to
have copies. There are then scripts which can be run to do things
like ask what version of a database I can see locally, and whether
there is a newer version available centrally. If so, which files do
I need to rsync to obtain a new copy?
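The bookkeeping in point 3 can be sketched roughly as follows, using Python's sqlite3 in place of MySQL so the example is self-contained; the table layout, column names and sample rows are all hypothetical, not Sanger's actual schema.

```python
import sqlite3

# In-memory stand-in for the central MySQL database described above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE datasets (
    name TEXT,              -- e.g. 'ncbi-nr'
    upstream_version TEXT,  -- (a) upstream version of the data
    downloaded TEXT,        -- (b) when it was downloaded
    host TEXT               -- (c) which machine holds this copy
)""")
db.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)", [
    ("ncbi-nr", "2007-06-01", "2007-06-02", "central"),
    ("ncbi-nr", "2007-05-01", "2007-05-02", "node042"),
])

def newer_centrally(name, host):
    """True if the central copy of `name` is newer than the copy on `host`."""
    q = "SELECT upstream_version FROM datasets WHERE name=? AND host=?"
    central = db.execute(q, (name, "central")).fetchone()
    local = db.execute(q, (name, host)).fetchone()
    # Dated version strings compare correctly as text; a real script
    # would follow a True result with an rsync of the changed files.
    return bool(central and (local is None or central[0] > local[0]))

print(newer_centrally("ncbi-nr", "node042"))  # node042 is a month behind
```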
As has been mentioned previously, a separate archive section so that
mirrors could skip them would be nice. Together, all these databases
are eye-wateringly huge. Especially when uncompressed.
If you archive them, it really is going to get eye-watering. As you
know, these databases are growing exponentially, and if you want to
save historical versions as well... argh! Even Ensembl has
abandoned the idea of keeping the data forever.
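To put rough (entirely hypothetical) numbers on why archiving every release becomes untenable: under compound growth the monthly snapshots form a geometric series, so the archive quickly dwarfs the newest copy.

```python
def archive_total_gb(initial_gb, monthly_growth, months):
    """Total size of one snapshot per month under compound growth."""
    return sum(initial_gb * (1 + monthly_growth) ** m for m in range(months))

# Hypothetical: a 100 GB database growing 5% a month, archived for 5 years.
total = archive_total_gb(100, 0.05, 60)
latest = 100 * 1.05 ** 59
print(f"archive {total/1024:.1f} TB vs newest snapshot {latest/1024:.1f} TB")
# The archive ends up roughly twenty times the size of the newest
# snapshot alone, and the ratio only gets worse the longer you keep it up.
```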
As an aside, several people on this list are interested in large
scale biological data, and you may or may not be aware that I'm
giving a presentation at debconf about what we're doing at the Sanger
Institute, and how Debian fits in. I imagine some of you would like
to come to that talk, and if you wish to contact me off list to
suggest things you particularly want me to talk about, then please
do. Some of the topics I could cover:
1) Management of a thousand node cluster (choice of hardware,
automated installation, configuration management, monitoring)
2) Parallel filesystems (Lustre, GPFS, PVFS etc)
3) Scalability issues in genomic analysis (especially in the
software which builds and then presents http://www.ensembl.org)
4) Maintaining our own package repository
5) Migration from Tru64 to Debian
6) Multipath SAN access, failover and so on
7) Approaches to job scheduling on large clusters
8) Problems with MySQL at this scale
Feel free to suggest to me things that you'd find interesting to talk
about.
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.