
Re: Large static datasets like genomes (Re: Reasonable maximum package size ?)

On Wednesday 06 June 2007 13:00:19 Andreas Tille wrote:
> On Wed, 6 Jun 2007, Tim Cutts wrote:
>     0. Find a solution for large data sets in general
>     1. Find a solution for static biological data (I couldn't believe
>        that all biological data are really changing that frequently).
>     2. Find a solution that might make the kind of handling of
>        dynamical data as you described more user friendly (bittorrent).

Not all data is updated at Ensembl's bimonthly pace, or is as big as Ensembl.
But the most interesting data is :o)

> > software which builds and then presents http://www.ensembl.org)
> > 4)  Maintaining our own package repository
> > 5)  Migration from Tru64 to Debian
> >
> > Feel free to suggest to me things that you'd find interesting to talk
> I personally would be mostly interested in top 4 (Maintaining our own
> package repository).

It would be lovely if we could agree on a set of databases to support in
Debian and to have a permanent location in the file system for them. For the
reasons that Tim has already outlined, I do not see us distributing the larger
databases as Debian packages. Once a (computational) biologist starts a new
project, (s)he wants the latest data no matter what, and anything older than
three months (or sometimes a week) is unlikely to be acceptable. I do not
see any packaging effort working for that, particularly not in the way we
think of the stable distribution.

What may be stable, though, is an application that installs the latest
databases for the user. Maybe that application would even know how to make
use of the diffs to the respective latest release that many databases like
EMBL offer, in order to reduce download times (we are talking about many
gigabytes for these big players). I could well imagine that an application
that maintains the most important databases of, say, the January issue of
Nucleic Acids Research would be publishable, and it may be a nice project
for a summer student to start off. Any volunteers on this list by any chance?
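
To make the idea a bit more concrete, here is a rough sketch of the decision
such an application might take for one database: do nothing, fetch only the
diffs, or fetch a full release. Everything in it is an assumption made for
illustration: the function name plan_update, the idea that releases are
numbered with plain integers, and the three answers it prints. It does not
describe how any real database publishes its releases or diffs.

```sh
#!/bin/sh
# Hypothetical sketch: decide how to bring a local database copy up to date.
# Assumption: release identifiers are plain integers.

# plan_update LOCAL_RELEASE LATEST_RELEASE HAS_DIFFS
#   LOCAL_RELEASE   release currently installed, empty if none
#   LATEST_RELEASE  newest release offered upstream
#   HAS_DIFFS       "yes" if upstream offers diffs between releases
# Prints "up-to-date", "diffs FROM TO" or "full TO".
plan_update() {
    local_rel=$1
    latest_rel=$2
    has_diffs=$3

    if [ -z "$local_rel" ]; then
        # Nothing installed yet: only a full download helps.
        echo "full $latest_rel"
    elif [ "$local_rel" -ge "$latest_rel" ]; then
        echo "up-to-date"
    elif [ "$has_diffs" = "yes" ]; then
        # Diffs keep the download small for multi-gigabyte databases.
        echo "diffs $local_rel $latest_rel"
    else
        echo "full $latest_rel"
    fi
}
```

Called as `plan_update 89 91 yes`, the function would answer with the diff
route; with an empty first argument it falls back to a full download.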

I am not certain how to reference such an auto-maintained database from
other packages. Maybe there could be something like virtual packages that
depend on the auto-biodb-maintenance tool and call it in their postinst
scripts as
$ auto-biodb-maintenance --make-sure-it-is-maintained dbname
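
A postinst along those lines might look roughly like the sketch below. The
maintenance tool does not exist yet, so its name is passed in as a parameter
here, and the helper name handle_postinst is made up for illustration. The
only parts taken from real practice are that dpkg calls a postinst with
"configure" as its first argument after unpacking, and that the option is the
one written above.

```sh
#!/bin/sh
# Hypothetical postinst helper for a package that needs a maintained database.
# A real maintainer script would normally also start with "set -e".

# handle_postinst ACTION TOOL DBNAME
#   ACTION  first argument dpkg passed to the postinst
#   TOOL    name of the (not yet existing) maintenance command
#   DBNAME  database this package wants kept up to date
handle_postinst() {
    action=$1
    tool=$2
    db=$3

    case "$action" in
        configure)
            if command -v "$tool" >/dev/null 2>&1; then
                "$tool" --make-sure-it-is-maintained "$db"
            else
                echo "postinst: $tool not found, cannot register $db" >&2
                return 1
            fi
            ;;
        *)
            # abort-upgrade, abort-remove and the like: nothing to register.
            ;;
    esac
}

# In a real postinst this would be followed by something like:
#   handle_postinst "$1" <maintenance-tool> <dbname>
```

Registering on "configure" only keeps the database request out of the error
unwinding paths, where the package may not end up installed at all.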

Many greetings

