Re: Description for lefse tools (Was: Origin of data files in MetaPhLan2)



Hi Nicola,

On Wed, Aug 03, 2016 at 08:51:33PM +0000, Nicola Segata wrote:
> Great, thanks Andreas. We provide the "*.bt2" files so that the user can
> run BowTie2 internally to MetaPhlAn directly without first building the
> indexes (it will take quite a bit of time).

Fully agreed here.

> Also, the indexes are smaller
> in size than the sequence file...

Hmmm, all *.bt2 files sum up to 1,124,449kB while the fasta file has
only 753,081kB.  Since plain text compresses much better, a compressed
archive containing the fasta is drastically smaller than one with the
*.bt2 files.  Yesterday I tried to start a discussion on how to deal with
the size of the data inside Debian[1] (no answer so far), and my
experiment to create a source tarball containing just the fasta resulted
in a 270MB *xz* compressed file.  Well, xz is better than gz, but let's
say the compressed tarball with the fasta is about 30% of the size of
your current download of 1,017MB.

The situation for Debian is different from that of your users:  A user who
downloads from your website intends to run metaphlan2.  Amongst the
millions of Debian users only very few are interested in metaphlan2, and
we need to weigh how many resources we can spend.  It's not only Debian
that provides resources:  there is a large mirroring network that
spends lots of bandwidth and disk space for very little usage.  So in
this case it makes sense to put the effort on the user's side to
regenerate the indexes (or even to download the data separately via a
script we could provide inside the package).  So I could imagine
packaging only the metaphlan2 code and providing a script that downloads
the data and puts it into the expected place.
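
To make that idea a bit more concrete, here is a minimal sketch of what
such a helper could look like.  The URL and the target directory are
placeholders I made up, not real upstream or Debian locations; only the
bowtie2-build call follows the command quoted further down in this
thread.

```python
#!/usr/bin/env python3
"""Sketch: fetch the marker data and regenerate the Bowtie2 index locally."""
import shutil
import subprocess
import urllib.request
from pathlib import Path

# Hypothetical values: neither the URL nor the directory is confirmed anywhere.
DATA_URL = "https://example.org/metaphlan2/markers.fasta"
TARGET_DIR = Path("/var/lib/metaphlan2/db_v20")


def fetch(url, target_dir):
    """Download `url` into `target_dir` and return the path of the local copy."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last path component of the URL.
    destination = target_dir / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


def build_index(fasta, index_prefix):
    """Recreate the *.bt2 files; bowtie2-build has to be available on PATH."""
    subprocess.run(["bowtie2-build", str(fasta), str(index_prefix)], check=True)


# Intended use (not executed here, since the URL above is only a placeholder):
#   fasta = fetch(DATA_URL, TARGET_DIR)
#   build_index(fasta, TARGET_DIR / "mpa_v20_m200")
```

Whether the script should fetch the fasta and rebuild the index, or
simply fetch the ready-made *.bt2 files, is a trade-off between download
size and build time on the user's machine.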

Kind regards

         Andreas.

[1] https://lists.alioth.debian.org/pipermail/debian-med-packaging/2016-August/044984.html

> cheers
> Nicola
> 
> On Wed, Aug 3, 2016 at 6:08 PM Andreas Tille <andreas@an3as.eu> wrote:
> 
> > Hi Tin,
> >
> > On Wed, Aug 03, 2016 at 02:01:01PM +0000, Duy Tin Truong wrote:
> > > > - Tin can also provide more info about the binary data in db_v20. The
> > > > files ending with "bt2" are created using a script in the Bowtie2 package
> > > > (bowtie2-build) using a sequence file Tin can provide (it can also be
> > > > recovered from the bt2 files with bowtie2-inspect if I remember well).
> > > As Nicola said, those files in db_v20 are created with bowtie2-build
> > > using a sequence file and you can recover the sequence file by:
> > >
> > > bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
> > >
> > > If you want to rebuild them, the command is:
> > >
> > > bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
> >
> > I can confirm that I can reproduce the files byte identical from
> > markers.fasta.  Is there any reason to ship the binary form instead of
> > the fasta text file?  Moreover, what is the source of the markers.fasta?
> > Is there any related publication or so?
> >
> > > > For the mpa_v20_m200.pkl Tin can also provide the uncompressed python
> > > > object (or he can provide a couple of lines of code to uncompress it?)
> > > It is python dictionary and can be read as:
> > >
> > > import cPickle as pickle
> > > import bz2
> > > db = pickle.load(bz2.BZ2File('db_v20/mpa_v20_m200.pkl', 'r'))
> > >
> > > You can have more information about them at:
> > >
> > > https://bitbucket.org/biobakery/metaphlan2#markdown-header-customizing-the-database
> >
> > OK, that page clarifies the method.  Just a personal remark from the
> > point of view of an outsider to bioinformatics:  I find the creation
> > process of the mpa_v20_m200.pkl file a bit cumbersome.  I'd personally
> > prefer dropping a text record somewhere and calling a script that
> > processes this record rather than writing a script of my own.
> >
> > > In addition, some files were renamed:
> > >    - metaphlan2_strainer.py -> strainphlan.py
> > >    - strainer_src -> strainphlan_src
> > >    - strainer_tutorial -> strainphlan_tutorial
> > >
> > > Some source files were updated as well.
> > > Please let me know if you need other information.
> >
> > Just drop me a note once you release a new version containing these
> > changes.  I think I'll try to release the current version as is since at
> > least the origin of the files is clarified now.  I'm not yet sure whether
> > the size of the data is acceptable or might exceed some limit.  Regarding
> > this I'm wondering whether I should create a source tarball that includes
> > markers.fasta instead and create the bt2 files in the build process.
> >
> > Kind regards
> >
> >        Andreas.
> >
> > --
> > http://fam-tille.de
> >

-- 
http://fam-tille.de

