Re: Large static datasets like genomes (Re: Reasonable maximum package size ?)
On Sat, 9 Jun 2007, Steffen Moeller wrote:
It would be lovely if we could agree on a set of databases to support in
Debian and to have a permanent location in the file system for them. For the
reasons that Tim has already outlined, I do not see us distributing the larger
databases as Debian packages. Once a (computational) biologist starts a new
project, (s)he wants the latest data no matter what and anything older than
three months (or sometimes a week) is unlikely to be acceptable. I do not
see any packaging effort working for that, and particularly not in the way we
think of the stable distribution.
Well, but some kind of
that is parsed by some downloader via a cron job, which then puts the data into
the location you mentioned above, comes to mind. I am not asking for
the impossible, just for support for users. As a user, I would be bored by
doing things my computer could do itself.
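Purely as a rough sketch of what I mean by a cron-driven downloader: the
list file format (one "<dbname> <url>" pair per line), the paths
/etc/biodb/sources.list and /var/lib/biodb, and the DRY_RUN switch are all
my own assumptions for illustration, not an existing tool or an agreed
location. Only wget's -N (timestamping) and -P (directory prefix) options
are real.

```shell
#!/bin/sh
# Sketch only: list format, paths and variable names are hypothetical.
set -e

# Hypothetical defaults; override via the environment.
BIODB_LIST="${BIODB_LIST:-/etc/biodb/sources.list}"
BIODB_DIR="${BIODB_DIR:-/var/lib/biodb}"
# Set DRY_RUN=1 to print the download commands instead of running them.
DRY_RUN="${DRY_RUN:-0}"

update_databases() {
    # Each non-empty, non-comment line of the list is: <dbname> <url>
    while read -r name url; do
        case "$name" in
            ''|'#'*) continue ;;
        esac
        dest="$BIODB_DIR/$name"
        if [ "$DRY_RUN" = 1 ]; then
            echo "wget -N -P $dest $url"
        else
            mkdir -p "$dest"
            # -N fetches only if the remote file is newer than the local copy.
            wget -N -P "$dest" "$url" </dev/null
        fi
    done < "$BIODB_LIST"
}

# Do nothing on hosts where no list has been configured.
if [ -r "$BIODB_LIST" ]; then
    update_databases
fi
```

Dropped into /etc/cron.weekly (or called from a crontab entry), something
like this would keep the chosen databases fresh without the user having to
think about it. It does not handle the incremental diffs some databases
offer.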
What may be stable, though, is an application that installs the latest databases
for the user. And maybe that application would even know how to make use of
the diffs to the respective latest release that many databases like EMBL
offer in order to reduce download times (we are talking about many Gigs for
these big players). I could well imagine that an application that maintains
the most important databases of, say, the Nucleic Acids Research January
issue could be publishable and might be a nice project for a summer
student to start on. Any volunteers on this list by any chance?
Ah yes - I see we share quite the same idea.
I am not certain how to reference such an auto-maintained
database from other packages. Maybe there could be something like virtual
packages that depend on the auto-biodb-maintaintenance tool and call it in
their postinst scripts as
$ auto-biodb-maintaintenance --make-sure-it-is-maintained dbname
Something like that.
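
To make that a bit more concrete, a postinst along the following lines
might do. This is only a sketch: the tool and its
--make-sure-it-is-maintained option are the hypothetical ones from above
and do not exist, the database name "embl" is just an example, and the
BIODB_TOOL variable is an assumption added so the fragment can be tried
without the tool installed.

```shell
#!/bin/sh
# Sketch of a postinst fragment; the maintenance tool is hypothetical.
set -e

# Name of the (hypothetical) maintenance tool; overridable for testing.
BIODB_TOOL="${BIODB_TOOL:-auto-biodb-maintaintenance}"

ensure_db_maintained() {
    dbname="$1"
    if command -v "$BIODB_TOOL" >/dev/null 2>&1; then
        # Ask the tool to keep this database up to date from now on.
        "$BIODB_TOOL" --make-sure-it-is-maintained "$dbname"
    else
        # Warn instead of failing, so a missing tool does not break
        # the package installation.
        echo "warning: $BIODB_TOOL not found; $dbname not registered" >&2
    fi
}

# dpkg calls postinst with "configure" as its first argument on install
# and upgrade.
case "${1:-}" in
    configure)
        ensure_db_maintained embl
        ;;
esac
```

Whether a missing tool should only warn or abort the installation is
a policy question; with a proper Depends on the tool, the fallback branch
would never be reached anyway.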