Please raise your opinion to package size and the given options to restrict it (Was: Bug#833388: ITP: metaphlan2 -- Metagenomic Phylogenetic Analysis)



Hi,

my remarks about the size of this package and the explicit request to
discuss it here might have remained unseen at the end of this ITP, so
I'm hereby making some more noise.

Please raise your concerns about the size and your opinion about the
alternatives I suggested below.  In parallel I'm discussing options with
upstream (https://lists.debian.org/debian-med/2016/08/msg00036.html).

Kind regards

      Andreas.

On Wed, Aug 03, 2016 at 09:00:05PM +0200, Andreas Tille wrote:
> Package: wnpp
> Severity: wishlist
> Owner: Andreas Tille <tille@debian.org>
> 
> * Package name    : metaphlan2
>   Version         : 2.5.0
>   Upstream Author : Nicola Segata <nicola.segata@unitn.it>
> * URL             : https://bitbucket.org/nsegata/metaphlan2/wiki/Home
> * License         : MIT
>   Programming Lang: Python
>   Description     : Metagenomic Phylogenetic Analysis
>  MetaPhlAn is a computational tool for profiling the composition of
>  microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from
>  metagenomic shotgun sequencing data with species level resolution. From
>  version 2.0, MetaPhlAn is also able to identify specific strains (in the
>  not-so-frequent cases in which the sample contains a previously
>  sequenced strain) and to track strains across samples for all species.
>  .
>  MetaPhlAn 2.0 relies on ~1M unique clade-specific marker genes (the
>  marker information file can be found at src/utils/markers_info.txt.bz2)
>  identified from ~17,000 reference genomes (~13,500 bacterial
>  and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
>  .
>   * unambiguous taxonomic assignments;
>   * accurate estimation of organismal relative abundance;
>   * species-level resolution for bacteria, archaea, eukaryotes and
>     viruses;
>   * strain identification and tracking;
>   * orders of magnitude speedups compared to existing methods;
>   * metagenomic strain-level population genomics.
> 
> 
> Remark: The package is a target for Debian Med in itself and will be
> used by metaBIT.  It will be maintained by the Debian Med team and the
> packaging is currently available at
>    svn://anonscm.debian.org/debian-med/trunk/packages/metaphlan2/trunk/
> 
> 
> ******* I'd like to discuss the following issue on debian-devel list *******
> 
> While Debian Med is injecting several packages with low
> popularity-contest scores, this one has an extraordinarily large set
> of data and thus I want to discuss the following options:
> 
>   1) The original orig.tar.gz is 1GB and contains 1.2GB of
>      uncompressed binary data.  License-wise this should not be a
>      problem since there is a recipe for converting the data to text
>      form and back[1].
> 
>      We would have: source package 1GB + binary package 1GB
> 
>   2) When unpacking the orig.tar.gz, translating the binary data to
>      text format and recompressing with xz, the tarball is "only" 265MB.
>      The transformation process takes about 30min on my laptop, which
>      is no longer than a larger project might need to build, but the
>      resulting binary package would again be close to 1GB.
> 
>      This enables the options:
> 
>      2a) Source tarball 256MB + binary package 1GB
> 
>      2b) Do the format conversion in postinst at the expense of the
>          user's time.  This is acceptable since the package is usually
>          unpacked on high-performance machines and there are not many
>          installations, which means bandwidth and disk space should
>          be saved on the Debian mirrors rather than on users' machines.
> 
>          Source tarball 256MB + binary package ~250MB (estimated)
> 
>   3) Strip all data from the source package and download the data in
>      postinst from the upstream Git repository.  This makes the
>      package size uncritical from a Debian point of view but might be
>      problematic in user setups that have trouble with larger data
>      downloads (possibly upstream can be convinced to provide a
>      *.bz2 tarball for maximum compression).
> 
>      3a) Use postinst
> 
>      3b) Inform the user to call a download script manually so as
>          not to block apt for a longer time while dealing with
>          potential download problems.
> 
> Which strategy do you think should be chosen to be kind to Debian
> (and mirror) resources?
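
To put rough numbers on the options quoted above, here is a minimal
sketch that adds up the archive space each one would need.  It assumes
1GB = 1000MB, a single binary package, and the 256MB source figure used
in the 2a/2b lines; the function names are made up for illustration.
Option 3 is left out because the message gives no size for a
data-stripped package.

```python
# Approximate sizes in MB, copied from the options quoted above.
# The 250 MB binary size for option 2b is the estimate from the message.
OPTIONS = {
    "1": {"source": 1000, "binary": 1000},
    "2a": {"source": 256, "binary": 1000},
    "2b": {"source": 256, "binary": 250},
}


def mirror_footprint(sizes):
    """Return the space needed for the source plus one binary package."""
    return sizes["source"] + sizes["binary"]


def summarize(options=OPTIONS):
    """Map each option name to its combined footprint in MB."""
    return {name: mirror_footprint(sizes) for name, sizes in options.items()}


if __name__ == "__main__":
    for name, total in summarize().items():
        print(f"option {name}: ~{total} MB per mirror")
```

With these figures option 2b needs roughly a quarter of the space of
option 1, at the price of the conversion time on the user's machine.
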
> 
> Kind regards
> 
>         Andreas.
> 
> [1] https://bitbucket.org/biobakery/metaphlan2#markdown-header-customizing-the-database
> 
> 
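
For option 3b, a manually called download script could look roughly
like the sketch below.  This is only an illustration under stated
assumptions: the URL, the target path and the checksum are placeholders
to be supplied by the caller (nothing like this is published upstream
as far as this thread says), and fetch_database / sha256_of are
hypothetical names.  It downloads one file and refuses to accept it if
the SHA-256 checksum does not match.

```python
import hashlib
import urllib.request


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_database(url, destination, expected_sha256):
    """Download url to destination and verify its checksum.

    All three arguments are assumptions supplied by the caller; no real
    upstream location or checksum is hard-coded here.
    """
    urllib.request.urlretrieve(url, destination)
    actual = sha256_of(destination)
    if actual != expected_sha256:
        raise RuntimeError(
            f"checksum mismatch for {destination}: "
            f"expected {expected_sha256}, got {actual}"
        )
    return destination
```

Running this outside of postinst keeps apt from blocking on a slow or
failing download, and the checksum catches truncated transfers.
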

-- 
http://fam-tille.de

