Bug#833388: Separation of code and data in MetaPhLan2 (Was: Description for lefse tools)

To: Nicola Segata <nicola.segata@unitn.it>
Cc: Duy Tin Truong <duytin.truong@gmail.com>, Debian Med Project List <debian-med@lists.debian.org>, Curtis Huttenhower <chuttenh@hsph.harvard.edu>, 833388@bugs.debian.org
Subject: Bug#833388: Separation of code and data in MetaPhLan2 (Was: Description for lefse tools)
From: Andreas Tille <andreas@an3as.eu>
Date: Thu, 4 Aug 2016 13:54:42 +0200
Message-id: <20160804115442.GL30880@an3as.eu>
Reply-to: Andreas Tille <andreas@an3as.eu>, 833388@bugs.debian.org
In-reply-to: <CAAbkqH8EO_8kjZ0n40VVy=p3k4W11=BPqKpx=4abY+bZGj2USg@mail.gmail.com>
References: <20160730204704.GC18451@an3as.eu> <CAAbkqH-TQg_vnQipP+7scuwsgJv1EfSXM2Yey3vurm6ZWv9ifQ@mail.gmail.com> <20160803133853.GA21010@an3as.eu> <CAHRzRd_Jh0QtaTMSP-0y+KpK+UnO-Gdv738CaAeca79EwfVX3w@mail.gmail.com> <20160803160817.GD21010@an3as.eu> <CAAbkqH8ZFg2fFhBQR5uVC0G41SsPLFrwBWY1vMhV-3chW7if1Q@mail.gmail.com> <20160804061841.GE30880@an3as.eu> <CAAbkqH-6UdFan=k8q51rOMqEAH_cGcPEXVXOVhgrdzGxd0mWnA@mail.gmail.com> <20160804104650.GJ30880@an3as.eu> <CAAbkqH8EO_8kjZ0n40VVy=p3k4W11=BPqKpx=4abY+bZGj2USg@mail.gmail.com>

Hi Nicola,

thanks for the clarification.  What about the following, which might be
in the interest of all users and not only of the Debian packaging:  You
provide separate download archives for code and data.  If users want to
run the new code but already have the data installed, there is no real
point in a heavy download of the same files.  If you do so, it is probably
easy to provide the data in two different archives: one with the
processed index data as it is provided currently and another archive with
the fasta database.

If you do so, my plan for the Debian packaging would be simple to
realise:  I provide separate packages for code and data.  The data will
be shipped as fasta and converted on the user's machine, leaving the
option either to convert right at installation time or to ask the user to
call a script doing the conversion at a later point in time.
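
A minimal sketch of what such a conversion script could look like, assuming
the bowtie2-build call quoted further down in this thread.  The function
names are purely illustrative and not part of MetaPhlAn2 or of any Debian
package; the six file suffixes are those bowtie2-build writes for a regular
(non-large) index.

```python
import subprocess
from pathlib import Path

# Files bowtie2-build writes for a regular (non-"large") index.
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2",
                ".rev.1.bt2", ".rev.2.bt2")


def missing_index_files(index_prefix):
    """Return the expected index files that do not exist yet."""
    candidates = [Path(str(index_prefix) + suffix) for suffix in BT2_SUFFIXES]
    return [path for path in candidates if not path.exists()]


def build_command(fasta, index_prefix):
    """Command line for rebuilding the index from the fasta file."""
    return ["bowtie2-build", str(fasta), str(index_prefix)]


def ensure_index(fasta, index_prefix, run=subprocess.check_call):
    """Rebuild the index only if at least one index file is missing.

    Returns True if a build was started, False if nothing had to be done.
    The `run` parameter is a hypothetical hook so the logic can be tried
    without bowtie2 installed.
    """
    if not missing_index_files(index_prefix):
        return False
    run(build_command(fasta, index_prefix))
    return True
```

Such a function could be called either from the postinst or later by the
user, for instance as ensure_index("markers.fasta", "db_v20/mpa_v20_m200")
(both paths are placeholders).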

Do you think separating code and data on your side is
possible?

Kind regards

     Andreas.

On Thu, Aug 04, 2016 at 11:28:34AM +0000, Nicola Segata wrote:
> Hi Andreas,
>  yes, it is likely that the code will be frequently updated, but the big
> database file will change only rarely (for sure no more frequently than
> once a year).
> thanks
> Nicola
> 
> On Thu, Aug 4, 2016 at 12:46 PM Andreas Tille <andreas@an3as.eu> wrote:
> 
> > Hi again,
> >
> > On Thu, Aug 04, 2016 at 08:10:29AM +0000, Nicola Segata wrote:
> > > Makes sense to me!
> >
> >    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=833388#15
> >
> > If you read the discussion it seems that my suggestion to ship the fasta
> > file inside the Debian package and let the postinst do the
> > transformation step found some agreement - provided that the package
> > does not change so frequently that several uploads per month would
> > happen.
> >
> > I'm now wondering what your estimated change rate for the metaphlan2
> > data files might be.  Do these change frequently?  Is there any chance
> > that the code changes frequently but the data files stay unchanged?
> >
> > Kind regards
> >
> >       Andreas.
> >
> > > On Thu, Aug 4, 2016 at 8:18 AM Andreas Tille <andreas@an3as.eu> wrote:
> > >
> > > > Hi Nicola,
> > > >
> > > > On Wed, Aug 03, 2016 at 08:51:33PM +0000, Nicola Segata wrote:
> > > > > Great, thanks Andreas. We provide the "*.bt2" files so that the user can
> > > > > run BowTie2 internally to MetaPhlAn directly without first building the
> > > > > indexes (it will take quite a bit of time).
> > > >
> > > > Fully agreed here.
> > > >
> > > > > Also, the indexes are smaller
> > > > > in size than the sequence file...
> > > >
> > > > Hmmm, all *.bt2 files sum up to 1,124,449kB while the fasta file has
> > > > only 753,081kB.  Considering the better compression performance of pure
> > > > text files, a compressed archive containing the fasta is drastically
> > > > smaller than one with the *.bt2 files.  Yesterday I tried to start a
> > > > discussion how to deal with the size of the data inside Debian[1] (no
> > > > answer so far) and my experiment to create a source tarball just
> > > > containing the fasta resulted in a 270MB *xz* compressed file (well, xz
> > > > is better than gz, but let's say the compressed tarball with the fasta is
> > > > about 30% of the size of your current download of 1,017MB).
> > > >
> > > > The situation for Debian is different from that of your users:  A user who
> > > > downloads from your website intends to run metaphlan2.  Amongst the
> > > > millions of Debian users only very few are interested in metaphlan2 and
> > > > we need to weigh up how many resources we can spend.  It's not only
> > > > Debian that provides resources.  There is a large mirroring network that
> > > > spends lots of bandwidth and disk space for a very small usage.  So in
> > > > this case it makes sense to put the effort on the user's side to
> > > > regenerate the indexes (or even download the data separately via a
> > > > script we could provide inside the package).  So I could imagine
> > > > packaging only the metaphlan2 code and providing a script that downloads
> > > > the data and puts it into the expected place.
> > > >
> > > > Kind regards
> > > >
> > > >          Andreas.
> > > >
> > > > [1] https://lists.alioth.debian.org/pipermail/debian-med-packaging/2016-August/044984.html
> > > >
> > > > > cheers
> > > > > Nicola
> > > > >
> > > > > On Wed, Aug 3, 2016 at 6:08 PM Andreas Tille <andreas@an3as.eu> wrote:
> > > > >
> > > > > > Hi Tin,
> > > > > >
> > > > > > On Wed, Aug 03, 2016 at 02:01:01PM +0000, Duy Tin Truong wrote:
> > > > > > > > - Tin can also provide more info about the binary data in db_v20. The files
> > > > > > > > ending with "bt2" are created using a script in the Bowtie2 package
> > > > > > > > (bowtie2-build) using a sequence file Tin can provide (it can also be
> > > > > > > > recovered from the bt2 files with bowtie2-inspect if I remember well).
> > > > > > > As Nicola said, those files in db_v20 are created with bowtie2-build
> > > > > > > using a sequence file and you can recover the sequence file by:
> > > > > > >
> > > > > > > bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
> > > > > > >
> > > > > > > If you want to rebuild them, the command is:
> > > > > > >
> > > > > > > bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
> > > > > >
> > > > > > I can confirm that I can reproduce the files byte identical from
> > > > > > markers.fasta.  Is there any reason to ship the binary form instead of
> > > > > > the fasta text file?  Moreover, what is the source of the markers.fasta?
> > > > > > Is there any related publication or so?
> > > > > >
> > > > > > > > For the mpa_v20_m200.pkl Tin can also provide the uncompressed python
> > > > > > > > object (or he can provide a couple of lines of code to uncompress it?)
> > > > > > > It is a Python dictionary and can be read as:
> > > > > > >
> > > > > > > import cPickle as pickle
> > > > > > > import bz2
> > > > > > > db = pickle.load(bz2.BZ2File('db_v20/mpa_v20_m200.pkl', 'r'))
> > > > > > >
> > > > > > > You can have more information about them at:
> > > > > > >
> > > > > > > https://bitbucket.org/biobakery/metaphlan2#markdown-header-customizing-the-database
> > > > > >
> > > > > > OK, that page clarifies the method.  Just a personal remark from the
> > > > > > point of view of an outsider of bioinformatics:  I'd regard the creation
> > > > > > process of the mpa_v20_m200.pkl file a bit cumbersome.  I'd personally
> > > > > > prefer dropping some text record somewhere and calling a script processing
> > > > > > this record rather than writing a script of my own.
> > > > > >
> > > > > > > In addition, the names of some files were changed:
> > > > > > >    - metaphlan2_strainer.py -> strainphlan.py
> > > > > > >    - strainer_src -> strainphlan_src
> > > > > > >    - strainer_tutorial -> strainphlan_tutorial
> > > > > > >
> > > > > > > Some source files were updated as well.
> > > > > > > Please let me know if you need other information.
> > > > > >
> > > > > > Just drop me a note once you release a new version containing these
> > > > > > changes.  I think I'll try to release the current version as is since at
> > > > > > least the origin of the files is clarified now.  I'm not yet sure whether
> > > > > > the size of the data is acceptable or might exceed some limit.  Regarding
> > > > > > this I'm wondering whether I should create a source tarball including
> > > > > > markers.fasta instead and create the bt2 files in the build process.
> > > > > >
> > > > > > Kind regards
> > > > > >
> > > > > >        Andreas.
> > > > > >
> > > > > > --
> > > > > > http://fam-tille.de
> > > > > >
> > > >
> > > > --
> > > > http://fam-tille.de
> > > >
> >
> > --
> > http://fam-tille.de
> >

-- 
http://fam-tille.de
