Re: Reasonable maximum package size ?
On Tue, Jun 05, 2007 at 09:47:37PM +0100, Roger Leigh wrote:
> Anthony Towns <firstname.lastname@example.org> writes:
> > Are either of you going to debconf, or able to point out some example
> > large (free?) data sets that should be packaged like this as a test case
> > for playing with over debconf?
> The NCBI non-redundant database (nr). Having this packaged and
> frequently updated (maybe in volatile) would be fantastic. There are
> also quite a number of other significant (popular) databases used for
> bioinformatics, genomics, proteomics and other biological fields which
> would be really nice to have in Debian. Here's a selection:
Thanks to Roger, I do not need to give more examples of big datasets. I
recently tried to explore the issues of packaging biological sequence
databases with a small one, miRbase:
This example shows why using a packaging system is useful to process the
data, and why it is space-hungry:
Let us examine the contents of miRbase:
ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/ (no tarball
Type       Name                    Size  Date        Time
File       README                  5 KB  25.05.2007  15:43:00
File       THIS_IS_RELEASE_9_2     1 KB  25.05.2007  15:47:00
Directory  database_files                25.05.2007  15:41:00
Directory  genomes                       25.05.2007  15:40:00
File       hairpin.fa.gz         171 KB  25.05.2007  15:40:00
File       mature.fa.gz           62 KB  25.05.2007  15:40:00
File       miFam.dat.gz           24 KB  25.05.2007  15:40:00
File       miRNA.dat.gz          507 KB  25.05.2007  15:40:00
File       miRNA.dead.gz           3 KB  25.05.2007  15:40:00
File       miRNA.diff.gz           3 KB  25.05.2007  15:40:00
File       miRNA.str.gz          362 KB  25.05.2007  15:40:00
File       miRNA.xls            1193 KB  25.05.2007  15:40:00
The core data is miRNA.dat.gz. Its compression ratio is almost 90%: DNA
sequences are mostly A, C, T, G, N and headers. To use the database, we can
search it either by sequence name (researchers give names to known
sequences), or by similarity ("find anything more than 85% similar to
AACTGAATTCGAT"). Debian has tools for this, but they cannot work directly
on miRNA.dat.
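As a rough way to check the compression figure above, here is a minimal
Python sketch. It assumes a local copy of a gzip-compressed file such as
miRNA.dat.gz; the helper name compression_ratio is only an illustration
and is not part of any Debian tool.

```python
import gzip
import os


def compression_ratio(gz_path):
    """Return the fraction of space saved by gzip for the file at gz_path."""
    compressed = os.path.getsize(gz_path)
    uncompressed = 0
    # Read in 1 MiB chunks so that large databases need not fit in memory.
    with gzip.open(gz_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            uncompressed += len(chunk)
    if uncompressed == 0:
        return 0.0
    return 1.0 - compressed / uncompressed
```

A result close to 0.9 would correspond to the "almost 90%" mentioned above,
and it also gives the size that the unpacked flat file takes on disk.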
Here is a longer summary of the many ways to search miRbase (you may skip
it if you are busy):
* Finding by name: using the (experimental) package emboss, the build
rules of a mirbase binary package should create an index in EMBOSS
format. The latest format produces indexes which are bigger than the
database itself. And it was written specifically to index files
bigger than 2 GB!
* Finding by similarity: again, emboss is needed, this time to convert
miRNA.dat to an intermediate format, which can be used to create a
database in the NCBI blast (package blast2) format. EMBOSS can also
index these databases by name, but it sacrifices some information.
NCBI blast users may nevertheless opt for these indexes, because they
save a lot of space (the blast databases are binary, not flat files).
* Finding through a warehouse. Most databases are interconnectable. A
gene in the "nr" database (see above) can signal that it is the target
of a miRNA of miRbase, which lists other targets, which code for
proteins, which have domains, which have structure, which bind drugs,
which cure diseases, which are caused by mutations in genes, which are
the target of miRNA... The most famous warehouse is SRS, but it is
proprietary. Luckily, there is an alternative being developed, MRS
* Finding by SQL: in the particular case of miRbase, which is rare,
some MySQL dumps are provided in the directory database_files, so that
people can set up an SQL database that indexes all fields in detail.
* Finding at the office: did you notice the extension of the biggest
file? Ironically, it is the only one that is not compressed. miRbase is
small enough to fit in a spreadsheet (which OpenOffice can open).
* Finding by chance: the 'genomes' directory contains the coordinates of
the entries in the different genomes. When displaying a portion of the
sequence of a human chromosome, for instance, the provided files can
be used to flag the places from which the sequences of miRbase originate
(à la Google Maps).
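To make the "finding by name" case more concrete, here is a minimal Python
sketch of a name index. It is not the EMBOSS index format. It assumes an
EMBL-style flat file in which each entry starts with an "ID" line whose
second field is the entry name and ends with a "//" line; the function
names are hypothetical.

```python
def build_name_index(flat_file_path):
    """Map each entry name to the byte offset of its ID line."""
    index = {}
    with open(flat_file_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            # Assumption: an entry starts with a line like "ID   name ...".
            if line.startswith(b"ID "):
                fields = line.split()
                if len(fields) >= 2:
                    index[fields[1].decode("ascii")] = offset
    return index


def fetch_entry(flat_file_path, index, name):
    """Return the text of one entry, from its ID line to the closing "//"."""
    lines = []
    with open(flat_file_path, "rb") as handle:
        handle.seek(index[name])
        for line in handle:
            lines.append(line.decode("ascii", errors="replace"))
            if line.startswith(b"//"):
                break
    return "".join(lines)
```

The index only stores offsets into the uncompressed flat file, which is one
reason why the data has to be unpacked and processed at build or install
time rather than shipped as the upstream .gz alone.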
Consequences from the packaging point of view:
In order to provide data packages which take advantage of the dependency
relationships with binary packages, we need some build mechanisms, mostly
to reformat the data and create the indexes in a format compatible with
the current versions of the packages in Debian such as emboss and blast2.
This could be done:
- in buildds,
- on the users' computers,
- in "data buildds".
Once processed, the data is sort of duplicated. In the example of
mirbase, we would have:
- The source
- mirbase-embl, the original database indexed for emboss.
- mirbase-blast, the database reformatted for blast, maybe indexed for
- mirbase-sql, the original database injected in a SQL server.
- mirbase-common, with the excel file, the genome goodies, and the
accessory files which summarise changes from previous versions.
Obviously, this strongly increases the space that would be taken on the
mirrors. Also, in the (mid-term) future, Debian may have many more
mainstream tools, and I am quite sure that they will not all use the same
format. So there is a risk of package proliferation in addition to
the inflation of disk space.
Maybe a solution to this would be to rely on dpkg triggers (when
implemented) so that adding new databases would install only requested
things according to the available tools, and adding new tools would
trigger the reformatting of databases if necessary?
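The decision such a trigger would have to make can be sketched in a few
lines of Python. This is only an illustration of the idea and does not use
any dpkg interface; the mapping FORMAT_FOR_TOOL, the format names and the
function builds_needed are all hypothetical.

```python
# Hypothetical mapping from a tool package to the derived format it reads.
FORMAT_FOR_TOOL = {
    "emboss": "embl-index",
    "blast2": "blast",
}


def builds_needed(installed_tools, installed_databases, already_built):
    """Return the (database, format) pairs that still have to be generated."""
    needed = set()
    for tool in installed_tools:
        fmt = FORMAT_FOR_TOOL.get(tool)
        if fmt is None:
            continue  # This tool does not consume sequence databases.
        for database in installed_databases:
            if (database, fmt) not in already_built:
                needed.add((database, fmt))
    return needed
```

For example, if mirbase is already indexed for emboss and blast2 is then
installed, builds_needed({"emboss", "blast2"}, {"mirbase"},
{("mirbase", "embl-index")}) returns only the missing blast conversion, so
nothing is generated for tools that are not present.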
Have a nice day,
Wako, Saitama, Japan