
Re: Description for lefse tools (Was: Origin of data files in MetaPhLan2)



Hi Andreas,

If you can use the latest version with the name changes as I mentioned, it would fit better with the updated tutorial on the metaphlan2 repository.

Thanks,
Tin

On Thu, Aug 4, 2016 at 1:28 PM Nicola Segata <nicola.segata@unitn.it> wrote:
Hi Andreas,
 yes, it is likely that the code will be frequently updated, but the big database file will change only rarely (for sure no more frequently than once a year).
thanks
Nicola

On Thu, Aug 4, 2016 at 12:46 PM Andreas Tille <andreas@an3as.eu> wrote:
Hi again,

On Thu, Aug 04, 2016 at 08:10:29AM +0000, Nicola Segata wrote:
> Makes sense to me!

   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=833388#15

If you read the discussion, it seems that my suggestion to ship the fasta
file inside the Debian package and let the postinst do the
transformation step found some agreement - provided that the package
does not change frequently, i.e. that there will not be several uploads
per month.

I'm now wondering what your estimated change rate for the metaphlan2
data files might be.  Do these change frequently?  Is there any chance
that the code changes frequently but the data files stay unchanged?

Kind regards

      Andreas.

> On Thu, Aug 4, 2016 at 8:18 AM Andreas Tille <andreas@an3as.eu> wrote:
>
> > Hi Nicola,
> >
> > On Wed, Aug 03, 2016 at 08:51:33PM +0000, Nicola Segata wrote:
> > > Great, thanks Andreas. We provide the "*.bt2" files so that the user can
> > > run BowTie2 internally to MetaPhlAn directly without first building the
> > > indexes (it will take quite a bit of time).
> >
> > Fully agreed here.
> >
> > > Also, the indexes are smaller
> > > in size than the sequence file...
> >
> > Hmmm, all *.bt2 files sum up to 1,124,449kB while the fasta file has
> > only 753,081kB.  Considering the better compression performance of pure
> > text files, a compressed archive containing the fasta is drastically
> > smaller than one with the *.bt2 files.  Yesterday I tried to start a
> > discussion on how to deal with the size of the data inside Debian[1] (no
> > answer so far), and my experiment to create a source tarball just
> > containing the fasta resulted in a 270MB *xz* compressed file.  Well, xz
> > is better than gz, but let's say the compressed tarball with the fasta is
> > about 30% of the size of your current download of 1,017MB.
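
The size comparison above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the figures are taken as quoted in the mail, and reading the download size as 1,017 MB (roughly 1 GB) is an assumption, since the mail writes it as "1.017MB".

```python
# Sizes as quoted in the mail (kB for the raw files, MB for the archives).
BT2_TOTAL_KB = 1_124_449      # sum of all *.bt2 index files
FASTA_KB = 753_081            # markers.fasta
XZ_TARBALL_MB = 270           # xz-compressed tarball containing only the fasta
CURRENT_DOWNLOAD_MB = 1_017   # assumption: "1.017MB" read as 1,017 MB


def ratio(part, whole):
    """Return part/whole as a fraction."""
    return part / whole


fasta_vs_bt2 = ratio(FASTA_KB, BT2_TOTAL_KB)
tarball_vs_download = ratio(XZ_TARBALL_MB, CURRENT_DOWNLOAD_MB)

print(f"fasta is {fasta_vs_bt2:.0%} of the bt2 files (uncompressed)")
print(f"xz tarball is {tarball_vs_download:.0%} of the current download")
```

With these inputs the second ratio comes out a little under the "about 30%" given in the mail, which fits a rough estimate.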
> >
> > The situation for Debian is different from that of your users:  A user who
> > downloads from your website intends to run metaphlan2.  Amongst the
> > millions of Debian users only very few are interested in metaphlan2 and
> > we need to weigh up how many resources we can spend.  It's not
> > only Debian that provides resources.  There is a large mirroring network that
> > spends lots of bandwidth and disk space for a very small usage.  So in
> > this case it makes sense to put the effort on the user's side to
> > regenerate the indexes (or even download the data separately via a
> > script we could provide inside the package).  So I could imagine
> > packaging only the metaphlan2 code and providing a script that downloads the
> > data and puts them into the expected place.
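
A download helper of the kind suggested above might look roughly like the following. This is a minimal sketch, not an actual Debian or MetaPhlAn script: the function name `fetch_data` is hypothetical, and both the URL and the destination directory are left to the caller because the thread names neither.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_data(url, dest_dir):
    """Download url into dest_dir and return the path of the stored file."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    target = dest_dir / Path(urlparse(url).path).name
    # Stream the response to disk instead of holding it in memory,
    # since the data files discussed here are several hundred MB.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

A real packaging script would additionally want to verify a checksum of the downloaded file before using it.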
> >
> > Kind regards
> >
> >          Andreas.
> >
> > [1]
> > https://lists.alioth.debian.org/pipermail/debian-med-packaging/2016-August/044984.html
> >
> > > cheers
> > > Nicola
> > >
> > > On Wed, Aug 3, 2016 at 6:08 PM Andreas Tille <andreas@an3as.eu> wrote:
> > >
> > > > Hi Tin,
> > > >
> > > > On Wed, Aug 03, 2016 at 02:01:01PM +0000, Duy Tin Truong wrote:
> > > > > > - Tin can also provide more info about the binary data in db_v20. The
> > > > > > files ending with "bt2" are created using a script in the Bowtie2 package
> > > > > > (bowtie2-build) using a sequence file Tin can provide (it can also be
> > > > > > recovered from the bt2 files with bowtie2-inspect if I remember well).
> > > > > As Nicola said, those files in db_v20 are created with bowtie2-build
> > > > > using a sequence file and you can recover the sequence file by:
> > > > >
> > > > > bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
> > > > >
> > > > > If you want to rebuild them, the command is:
> > > > >
> > > > > bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
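
The two commands above could be wrapped in a small Python helper, for example for use in a package build or postinst step. This is a sketch under assumptions: the function names are hypothetical, and it only relies on `bowtie2-inspect` printing FASTA to standard output and `bowtie2-build` taking a FASTA file followed by an index base name, as in the commands quoted above.

```python
import shutil
import subprocess
from pathlib import Path


def inspect_cmd(index_base):
    """Command that dumps the sequences stored in a Bowtie2 index as FASTA."""
    return ["bowtie2-inspect", str(index_base)]


def build_cmd(fasta, index_base):
    """Command that builds the *.bt2 index files from a FASTA file."""
    return ["bowtie2-build", str(fasta), str(index_base)]


def roundtrip(index_base, fasta, new_index_base):
    """Recover the FASTA from an index, then rebuild an index from it."""
    for tool in ("bowtie2-inspect", "bowtie2-build"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    # bowtie2-inspect writes FASTA to stdout, so redirect it into the file.
    with open(fasta, "w") as out:
        subprocess.run(inspect_cmd(index_base), stdout=out, check=True)
    # Make sure the output directory for the new index exists.
    Path(new_index_base).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_cmd(fasta, new_index_base), check=True)
```

Calling `roundtrip("metaphlan2/db_v20/mpa_v20_m200", "metaphlan2/markers.fasta", "metaphlan2/db_v21/mpa_v21_m200")` would mirror the two shell commands quoted above.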
> > > >
> > > > I can confirm that I can reproduce the files byte-identical from
> > > > markers.fasta.  Is there any reason to ship the binary form instead of
> > > > the fasta text file?  Moreover, what is the source of the markers.fasta?
> > > > Is there any related publication or so?
> > > >
> > > > > > For the mpa_v20_m200.pkl Tin can also provide the uncompressed python
> > > > > > object (or he can provide a couple of lines of code to uncompress it?)
> > > > > It is a Python dictionary and can be read as:
> > > > >
> > > > > import cPickle as pickle
> > > > > import bz2
> > > > > db = pickle.load(bz2.BZ2File('db_v20/mpa_v20_m200.pkl', 'r'))
> > > > >
> > > > > You can have more information about them at:
> > > > > https://bitbucket.org/biobakery/metaphlan2#markdown-header-customizing-the-database
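
The pickle-reading snippet quoted above uses `cPickle`, which exists only in Python 2. A Python 3 equivalent might look like the sketch below. The helper names are hypothetical, and `encoding="latin1"` is an assumption that is often needed for pickles written by Python 2; whether the real mpa_v20_m200.pkl needs it has not been checked here. The demo uses a small stand-in dictionary, not the real database.

```python
import bz2
import os
import pickle
import tempfile


def load_bz2_pickle(path, encoding="latin1"):
    """Load a bz2-compressed pickle file and return the stored object.

    encoding="latin1" is an assumption for pickles written by Python 2.
    """
    with bz2.open(path, "rb") as handle:
        return pickle.load(handle, encoding=encoding)


def save_bz2_pickle(obj, path):
    """Write obj as a bz2-compressed pickle (protocol 2, readable by Python 2)."""
    with bz2.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=2)


def demo_roundtrip():
    """Write a stand-in dictionary to a temporary file and read it back."""
    stand_in = {"example_key": {"value": 1}}  # hypothetical content
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.pkl")
        save_bz2_pickle(stand_in, path)
        return load_bz2_pickle(path) == stand_in
```

For the real file the call would be `db = load_bz2_pickle("db_v20/mpa_v20_m200.pkl")`.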
> > > >
> > > > OK, that page clarifies the method.  Just a personal remark from the
> > > > point of view of an outsider to bioinformatics:  I'd regard the creation
> > > > process of the mpa_v20_m200.pkl file as a bit cumbersome.  I'd personally
> > > > prefer dropping some text record somewhere and calling a script that
> > > > processes this record rather than writing my own script.
> > > >
> > > > > In addition, some files were renamed:
> > > > >    - metaphlan2_strainer.py -> strainphlan.py
> > > > >    - strainer_src -> strainphlan_src
> > > > >    - strainer_tutorial -> strainphlan_tutorial
> > > > >
> > > > > Some source files were updated as well.
> > > > > Please let me know if you need other information.
> > > >
> > > > Just drop me a note once you release a new version containing these
> > > > changes.  I think I'll try to release the current version as is since at
> > > > least the origin of the files is clarified now.  I'm not yet sure whether
> > > > the size of the data is acceptable or might exceed some limit.  Regarding
> > > > this I'm wondering whether I should create a source tarball that includes
> > > > markers.fasta instead and create the bt2 files in the build process.
> > > >
> > > > Kind regards
> > > >
> > > >        Andreas.
> > > >
> > > > --
> > > > http://fam-tille.de
> > > >
> >
> > --
> > http://fam-tille.de
> >

--
http://fam-tille.de
