Re: Large static datasets like genomes (Re: Reasonable maximum package size ?)

To: debian-devel@lists.debian.org
Subject: Re: Large static datasets like genomes (Re: Reasonable maximum package size ?)
From: Steffen Moeller <steffen_moeller@gmx.de>
Date: Sat, 9 Jun 2007 12:27:57 +0200
Message-id: <200706091228.29410.steffen_moeller@gmx.de>
In-reply-to: <Pine.LNX.4.62.0706061249390.1391@wr-linux02>
References: <20070605080907.GA3416@gloin> <3C0C43F7-5522-44FB-A095-7227005F73B0@chiark.greenend.org.uk> <Pine.LNX.4.62.0706061249390.1391@wr-linux02>

On Wednesday 06 June 2007 13:00:19 Andreas Tille wrote:
> On Wed, 6 Jun 2007, Tim Cutts wrote:
>     0. Find a solution for large data sets in general
>     1. Find a solution for static biological data (I couldn't believe
>        that all biological data are really changing that frequently).
>     2. Find a solution that might make the kind of handling of
>        dynamical data as you described more user friendly (bittorrent).

Not all data is updated at the bimonthly Ensembl pace, nor is all of it as
big as Ensembl. But the most interesting data is :o)

...
> > software which builds and then presents http://www.ensembl.org)
> > 4)  Maintaining our own package repository
> > 5)  Migration from Tru64 to Debian
...
> >
> > Feel free to suggest to me things that you'd find interesting to talk
>
> I personally would be mostly interested in top 4 (Maintaining our own
> package repository).

It would be lovely if we could agree on a set of databases to support in
Debian and on a permanent location in the file system for them. For the
reasons that Tim has already outlined, I do not see a way to distribute the
larger databases as Debian packages. Once a (computational) biologist starts
a new project, (s)he wants the latest data no matter what, and anything older
than three months (or sometimes a week) is unlikely to be acceptable. I do
not see any packaging effort working for that, and particularly not in the
way we think of the stable distribution.

What may be stable, though, is an application that installs the latest
databases for the user. Maybe that application would even know how to make
use of the diffs against the respective latest release that many databases
like EMBL offer, in order to reduce download times (we are talking about many
gigabytes for these big players). I could well imagine that an application
that maintains the most important databases of, say, the January database
issue of Nucleic Acids Research would be publishable and might be a nice
project for a summer student to start off with. Any volunteers on this list
by any chance?
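
To make the diff idea a bit more concrete, here is a minimal sketch of how
such an application might decide between a chain of incremental diffs and a
full download. Everything in it is an assumption for illustration only: the
names Diff and plan_update, the integer release numbers, and the rule "use
diffs only if they are smaller in total than the full release" are not taken
from any existing tool or from what EMBL actually publishes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Diff:
    """Hypothetical incremental update from one release to a later one."""
    from_release: int
    to_release: int
    size_bytes: int


def plan_update(installed: Optional[int], latest: int,
                full_size_bytes: int, diffs: list[Diff]) -> list[str]:
    """Return the download steps needed to reach the latest release.

    Falls back to a full download when nothing is installed, when the
    diff chain is broken, or when the diffs are not smaller in total.
    """
    full = [f"full:{latest}"]
    if installed is None:
        return full
    if installed >= latest:
        return []  # already up to date

    by_start = {d.from_release: d for d in diffs}
    steps: list[str] = []
    total = 0
    current = installed
    while current < latest:
        d = by_start.get(current)
        # A missing, backwards, or overshooting diff breaks the chain.
        if d is None or d.to_release <= current or d.to_release > latest:
            return full
        steps.append(f"diff:{d.from_release}->{d.to_release}")
        total += d.size_bytes
        current = d.to_release

    return steps if total < full_size_bytes else full
```

A caller would fetch the release list and diff sizes from the database
provider first, then hand them to plan_update and download whatever steps
come back.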

I am not certain how to reference such an auto-maintained database from
other packages. Maybe there could be something like virtual packages that
depend on the auto-biodb-maintenance tool and call it in their postinst
scripts as
$ auto-biodb-maintenance --make-sure-it-is-maintained dbname
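
As a rough sketch of what that call could do, assuming the tool keeps a
plain-text registry of maintained database names: the registry path, the
--registry option and the function names below are all hypothetical. The one
property that matters for a postinst script is that the call is idempotent,
so several packages can ask for the same database without harm.

```python
import argparse
from pathlib import Path

# Hypothetical default location; no such path is defined anywhere yet.
DEFAULT_REGISTRY = Path("/var/lib/auto-biodb/maintained.list")


def ensure_maintained(dbname: str, registry: Path) -> bool:
    """Register dbname for maintenance; return True only if newly added."""
    existing = registry.read_text().split() if registry.exists() else []
    if dbname in existing:
        return False  # already maintained, nothing to do
    registry.parent.mkdir(parents=True, exist_ok=True)
    with registry.open("a") as fh:
        fh.write(dbname + "\n")
    return True


def main(argv=None) -> int:
    """Parse the command line and register the requested database."""
    parser = argparse.ArgumentParser(
        description="Sketch of the proposed database maintenance tool.")
    parser.add_argument("--make-sure-it-is-maintained", dest="dbname",
                        metavar="DBNAME", required=True)
    parser.add_argument("--registry", type=Path, default=DEFAULT_REGISTRY)
    args = parser.parse_args(argv)
    ensure_maintained(args.dbname, args.registry)
    return 0  # success whether or not the name was already registered
```

The actual downloading would then be done by a separate periodic job that
reads the registry, so the postinst script itself stays fast.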

Many greetings

Steffen

