
Re: Reasonable maximum package size ?



On Tue, Jun 05, 2007 at 09:47:37PM +0100, Roger Leigh wrote:
> Anthony Towns <aj@azure.humbug.org.au> writes:
> >
> > Are either of you going to debconf, or able to point out some example
> > large (free?) data sets that should be packaged like this as a test case
> > for playing with over debconf?
> 
> The NCBI non-redundant database (nr).  Having this packaged and
> frequently updated (maybe in volatile) would be fantastic.  There are
> also quite a number of other significant (popular) databases used for
> bioinformatics, genomics, proteomics and other biological fields which
> would be really nice to have in Debian.  Here's a selection:
> 
> ftp://ftp.ncbi.nih.gov/blast/db/
> ftp://ftp.ncbi.nih.gov/refseq/
> ftp://ftp.ncbi.nih.gov/repository/
> ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Hi all,

Thanks to Roger, I do not need to give more examples of big datasets. I
recently tried to explore the issues of packaging biological sequence
databases with a small one, miRbase:

ITP: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=420938

This example shows why using a packaging system is useful to process the
data, and why it is space-hungry:

Let us examine the contents of miRbase:

ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/ (no tarball
available)

File: README                5 KB     25.05.2007  15:43:00
File: THIS_IS_RELEASE_9_2   1 KB     25.05.2007  15:47:00
Directory: database_files            25.05.2007  15:41:00
Directory: genomes                   25.05.2007  15:40:00
File: hairpin.fa.gz         171 KB   25.05.2007  15:40:00
File: mature.fa.gz          62 KB    25.05.2007  15:40:00
File: miFam.dat.gz          24 KB    25.05.2007  15:40:00
File: miRNA.dat.gz          507 KB   25.05.2007  15:40:00
File: miRNA.dead.gz         3 KB     25.05.2007  15:40:00
File: miRNA.diff.gz         3 KB     25.05.2007  15:40:00
File: miRNA.str.gz          362 KB   25.05.2007  15:40:00
File: miRNA.xls             1193 KB  25.05.2007  15:40:00

The core data is miRNA.dat.gz. Its compression ratio is almost 90%,
because the sequences are mostly made of the letters A, C, T, G and N,
plus headers. To use the database, we can search it either by sequence
name (researchers give names to known sequences) or by similarity ("find
anything more than 85% similar to AACTGAATTCGAT"). Debian has tools for
this, but they cannot work directly on miRNA.dat.
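
To give a feeling for why such data compresses well, here is a minimal
sketch that measures the space saved by gzip on a synthetic sequence. The
random sequence and the helper name are assumptions made for illustration;
they are not taken from miRbase. The real file probably does better than
this toy input because its records also repeat the same headers and field
labels.

```python
import gzip
import random


def compression_saving(data: bytes) -> float:
    """Return the fraction of space saved by gzip (0.9 means 90% smaller)."""
    return 1 - len(gzip.compress(data)) / len(data)


# Synthetic stand-in for sequence data: random letters from a 5-letter
# alphabet. This is NOT real miRbase content.
random.seed(0)
sequence = "".join(random.choice("ACGTN") for _ in range(100_000)).encode("ascii")

saving = compression_saving(sequence)
print(f"space saved by gzip: {saving:.0%}")
```

Even with no repetition at all, a five-letter alphabet stored one byte per
letter leaves most of each byte unused, which is what gzip recovers here.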

Here is a longer summary on many ways to search miRbase (you may skip it
if you are busy):

* Finding by name: using the (experimental) package emboss, the build
  rules of a mirbase binary package should create an index in EMBOSS
  format. The latest format produces indexes which are bigger than the
  database itself. And it was written specifically to index files
  bigger than 2 GB!

* Finding by similarity: again, emboss is needed, this time to convert
  miRNA.dat to an intermediate format, which can be used to create a
  database in the NCBI blast (package blast2) format. EMBOSS can also
  index these databases by name, but it sacrifices some information.
  NCBI blast users may nevertheless opt for these indexes, because they
  save a lot of space (the blast databases are binary, not flat files).

* Finding through a warehouse. Most databases are interconnectable. A
  gene in the "nr" database (see above) can signal that it is the target
  of a miRNA of miRbase, which lists other targets, which code for
  proteins, which have domains, which have structure, which bind drugs,
  which cure diseases, which are caused by mutations in genes, which are
  the target of miRNA... The most famous warehouse is SRS, but it is
  proprietary. Luckily, there is an alternative being developed, MRS
  (http://mrs.cmbi.ru.nl/mrs-3/status.do).

* Finding by SQL: in the particular case of miRbase, which is rare,
  some MySQL dumps are provided in the directory database_files, so that
  people can set up an SQL database indexing all fields in detail.

* Finding at the office: did you notice the extension of the biggest
  file? Ironically, it is the only one that is not compressed. miRbase
  is small enough to fit in a spreadsheet (which OpenOffice can open).

* Finding by chance: the 'genomes' directory contains the coordinates of
  the entries in the different genomes. When displaying a portion of the
  sequence of a human chromosome, for instance, the provided files can
  be used to flag the places from which the sequences of miRbase
  originate (à la Google Maps).
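
To illustrate what "finding by name" requires, here is a minimal sketch
of a name index over a flat file. It assumes an EMBL-style layout in which
each entry starts with an "ID" line and ends with a "//" line. The sample
records and function names are made up for the example, and this is not
the index format that EMBOSS actually writes.

```python
import io


def build_name_index(handle):
    """Map each entry name to the byte offset of its 'ID' line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"ID "):
            # The entry name is the second whitespace-separated field.
            index[line.split()[1].decode("ascii")] = offset
    return index


def fetch_entry(handle, index, name):
    """Return one entry, from its 'ID' line to the '//' terminator."""
    handle.seek(index[name])
    lines = []
    for line in handle:
        lines.append(line.decode("ascii"))
        if line.startswith(b"//"):
            break
    return "".join(lines)


# Made-up two-entry file; the names and sequences are hypothetical.
SAMPLE = (
    b"ID   example-1   standard; RNA; 13 BP.\n"
    b"SQ   Sequence 13 BP;\n"
    b"     aacugaauuc gau\n"
    b"//\n"
    b"ID   example-2   standard; RNA; 10 BP.\n"
    b"SQ   Sequence 10 BP;\n"
    b"     ggcuaagcuu\n"
    b"//\n"
)

handle = io.BytesIO(SAMPLE)
index = build_name_index(handle)
entry = fetch_entry(handle, index, "example-2")
print(entry)
```

The index lets a tool jump straight to an entry instead of scanning the
whole file, at the cost of storing extra data next to the database.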


Consequences from the packaging point of view:

In order to provide data packages which take advantage of the dependency
relationships with binary packages, we need some build mechanisms, mostly
to reformat the data and create the indexes in a format compatible with
the current versions of the packages in Debian such as emboss and blast2.
This could be done:

 - in buildds,
 - on the users' computers,
 - in "data buildds".

Once processed, the data is more or less duplicated. In the example of
mirbase, we would have:

 - The source.
 - mirbase-embl, the original database indexed for emboss.
 - mirbase-blast, the database reformatted for blast, maybe indexed for
   emboss.
 - mirbase-sql, the original database loaded into an SQL server.
 - mirbase-common, with the Excel file, the genome goodies, and the
   accessory files which summarise changes from previous versions.

Obviously, this strongly increases the size that would be taken on the
mirrors. Also, in the (mid-term) future, Debian can have many more
mainstream tools, and I am quite sure that they do not all use the same
format. So there is the risk of a package proliferation in addition to
the inflation of disk space.

Maybe a solution to this would be to rely on dpkg triggers (when
implemented) so that adding new databases would install only requested
things according to the available tools, and adding new tools would
trigger the reformatting of databases if necessary?
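
As a rough sketch of that idea, the logic could look like the following.
Everything here is hypothetical: the format names, the mapping and the
function are invented for illustration, and this is not the dpkg trigger
interface, which does not exist yet.

```python
# Hypothetical mapping from a tool package to the derived data format it
# needs. The package names come from this mail; the format names are
# invented for the example.
FORMATS_BY_TOOL = {
    "emboss": "emboss-index",
    "blast2": "blast-db",
}


def pending_builds(installed_tools, installed_databases, already_built):
    """Return the (database, format) pairs that still need to be generated."""
    wanted = {
        (database, FORMATS_BY_TOOL[tool])
        for database in installed_databases
        for tool in installed_tools
        if tool in FORMATS_BY_TOOL
    }
    return sorted(wanted - set(already_built))


# Installing a database while only emboss is present:
first = pending_builds({"emboss"}, {"mirbase"}, set())

# Later, installing blast2 only triggers the missing format:
second = pending_builds({"emboss", "blast2"}, {"mirbase"}, first)
print(first, second)
```

With this approach the mirrors would carry only the source data, and the
derived formats would be generated on the users' computers.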


Have a nice day,

-- 
Charles Plessy
http://charles.plessy.org
Wako, Saitama, Japan


