
Re: Description for lefse tools (Was: Origin of data files in MetaPhLan2)



Hi Andreas,

I meant that the latest version of the repository matches the tutorial in the repository. If you use the older version (with the old names), I am afraid users will run into problems when following the tutorial.
Regarding the separation of code and data, I will discuss it with Nicola next Monday and let you know.

Thanks,
Tin

On Fri, Aug 5, 2016 at 10:21 PM Andreas Tille <andreas@an3as.eu> wrote:
Hi Tin,

I must admit that I cannot parse the information you gave in your mail.

It is also not really connected to my next mail (which is archived here
   https://lists.debian.org/debian-med/2016/08/msg00040.html ) about the
separation of code and data.

Kind regards

      Andreas.

On Thu, Aug 04, 2016 at 11:48:08AM +0000, Duy Tin Truong wrote:
> Hi Andreas,
>
> If you can use the latest version with the name changes as I mentioned, it
> would fit better with the updated tutorial on the metaphlan2 repository.
>
> Thanks,
> Tin
>
> On Thu, Aug 4, 2016 at 1:28 PM Nicola Segata <nicola.segata@unitn.it> wrote:
>
> > Hi Andreas,
> >  yes, it is likely that the code will be frequently updated, but the big
> > database file will change only rarely (for sure no more frequently than
> > once a year).
> > thanks
> > Nicola
> >
> > On Thu, Aug 4, 2016 at 12:46 PM Andreas Tille <andreas@an3as.eu> wrote:
> >
> >> Hi again,
> >>
> >> On Thu, Aug 04, 2016 at 08:10:29AM +0000, Nicola Segata wrote:
> >> > Makes sense to me!
> >>
> >>    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=833388#15
> >>
> >> If you read the discussion, it seems that my suggestion to ship the fasta
> >> file inside the Debian package and let the postinst do the
> >> transformation step found some agreement, provided that the package
> >> does not change frequently, i.e. that there will not be several uploads
> >> per month.
> >>
> >> I'm now wondering what your estimated change rate for the metaphlan2
> >> data files might be.  Do these change frequently?  Is there any chance
> >> that the code changes frequently but the data files stay unchanged?
> >>
> >> Kind regards
> >>
> >>       Andreas.
> >>
> >> > On Thu, Aug 4, 2016 at 8:18 AM Andreas Tille <andreas@an3as.eu> wrote:
> >> >
> >> > > Hi Nicola,
> >> > >
> >> > > On Wed, Aug 03, 2016 at 08:51:33PM +0000, Nicola Segata wrote:
> >> > > > Great, thanks Andreas. We provide the "*.bt2" files so that the user can
> >> > > > run BowTie2 internally to MetaPhlAn directly without first building the
> >> > > > indexes (it will take quite a bit of time).
> >> > >
> >> > > Fully agreed here.
> >> > >
> >> > > > Also, the indexes are smaller
> >> > > > in size than the sequence file...
> >> > >
> >> > > Hmmm, all *.bt2 files sum up to 1,124,449kB while the fasta file has
> >> > > only 753,081kB.  Considering the better compression performance of pure
> >> > > text files, a compressed archive containing the fasta is drastically
> >> > > smaller than one with the *.bt2 files.  Yesterday I tried to start a
> >> > > discussion on how to deal with the size of the data inside Debian[1] (no
> >> > > answer so far), and my experiment to create a source tarball just
> >> > > containing the fasta resulted in a 270MB *xz* compressed file.  Well, xz
> >> > > is better than gz, but let's say the compressed tarball with the fasta is
> >> > > about 30% of the size of your current download of 1,017MB.
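
The size comparison above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it uses the figures quoted in the mail and assumes that the download size written as "1.017MB" means 1,017 MB, which is the only reading consistent with the "about 30%" estimate. The function name is made up for illustration.

```python
def size_ratio(part, whole):
    """Return part/whole as a fraction (both sizes in the same unit)."""
    return part / whole


# Figures quoted in the mail (kB for the raw files, MB for the archives).
BT2_TOTAL_KB = 1_124_449   # all *.bt2 index files together
FASTA_KB = 753_081         # markers fasta file
XZ_TARBALL_MB = 270        # xz-compressed tarball containing only the fasta
DOWNLOAD_MB = 1_017        # current upstream download (assumed to be 1,017 MB)

# How much larger the indexes are than the sequence file, uncompressed.
index_vs_fasta = size_ratio(BT2_TOTAL_KB, FASTA_KB)

# Share of the current download that the fasta-only tarball would need.
tarball_vs_download = size_ratio(XZ_TARBALL_MB, DOWNLOAD_MB)

print(f"indexes / fasta:      {index_vs_fasta:.2f}x")
print(f"xz tarball / download: {tarball_vs_download:.1%}")
```

With these numbers the indexes are roughly one and a half times the size of the fasta file, and the xz tarball is a bit more than a quarter of the current download.
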
> >> > >
> >> > > The situation for Debian is different from that of your users:  A user who
> >> > > downloads from your website intends to run metaphlan2.  Amongst the
> >> > > millions of Debian users only very few are interested in metaphlan2, and
> >> > > we need to weigh how many resources we can spend.  It's not only
> >> > > Debian that provides resources:  There is a large mirroring network that
> >> > > spends lots of bandwidth and disk space for very little usage.  So in
> >> > > this case it makes sense to put the effort on the user's side to
> >> > > regenerate the indexes (or even to download the data separately via a
> >> > > script we could provide inside the package).  So I could imagine
> >> > > packaging only the metaphlan2 code and providing a script that downloads the
> >> > > data and puts it into the expected place.
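
A download helper of the kind proposed above could look roughly like the following. It is a minimal sketch, not the script that was eventually packaged: the function name, the target directory and the URL in the usage note are all hypothetical, and a real helper would also want checksum verification.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_data(url, dest_dir):
    """Download `url` into `dest_dir` unless the file is already there.

    Returns the path of the local file. Both arguments are supplied by
    the caller; nothing here is specific to metaphlan2.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last component of the URL path.
    target = dest_dir / Path(urlparse(url).path).name

    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target
```

It would be called as `fetch_data("https://example.org/markers.fasta.xz", "/var/lib/metaphlan2")`, where both the URL and the directory are placeholders. Skipping the download when the file already exists keeps repeated runs cheap for the mirrors and the user.
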
> >> > >
> >> > > Kind regards
> >> > >
> >> > >          Andreas.
> >> > >
> >> > > [1] https://lists.alioth.debian.org/pipermail/debian-med-packaging/2016-August/044984.html
> >> > >
> >> > > > cheers
> >> > > > Nicola
> >> > > >
> >> > > > On Wed, Aug 3, 2016 at 6:08 PM Andreas Tille <andreas@an3as.eu> wrote:
> >> > > >
> >> > > > > Hi Tin,
> >> > > > >
> >> > > > > On Wed, Aug 03, 2016 at 02:01:01PM +0000, Duy Tin Truong wrote:
> >> > > > > > > - Tin can also provide more info about the binary data in db_v20. The files
> >> > > > > > > ending with "bt2" are created using a script in the Bowtie2 package
> >> > > > > > > (bowtie2-build) using a sequence file Tin can provide (it can also be
> >> > > > > > > recovered from the bt2 files with bowtie2-inspect if I remember well).
> >> > > > > > As Nicola said, those files in db_v20 are created with bowtie2-build
> >> > > > > > using a sequence file and you can recover the sequence file by:
> >> > > > > >
> >> > > > > > bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
> >> > > > > >
> >> > > > > > If you want to rebuild them, the command is:
> >> > > > > >
> >> > > > > > bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
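
For a build or postinst step, the two commands above could be driven from Python along these lines. This is a sketch under assumptions: it relies only on `bowtie2-inspect <index prefix>` writing FASTA to standard output and `bowtie2-build <fasta> <index prefix>` writing the index files, as in the commands quoted above. The function names are invented for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def inspect_command(index_prefix):
    """Command that dumps the sequences stored in a Bowtie2 index as FASTA."""
    return ["bowtie2-inspect", str(index_prefix)]


def build_command(fasta, index_prefix):
    """Command that builds the *.bt2 index files from a FASTA file."""
    return ["bowtie2-build", str(fasta), str(index_prefix)]


def recover_and_rebuild(old_prefix, fasta, new_prefix):
    """Recover the FASTA from an existing index, then rebuild the index."""
    for tool in ("bowtie2-inspect", "bowtie2-build"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    # bowtie2-build does not create the output directory itself.
    Path(new_prefix).parent.mkdir(parents=True, exist_ok=True)

    # bowtie2-inspect prints the FASTA on stdout, so redirect it to a file.
    with open(fasta, "w") as out:
        subprocess.run(inspect_command(old_prefix), stdout=out, check=True)

    subprocess.run(build_command(fasta, new_prefix), check=True)
```

Mirroring the commands in the mail, a call would be `recover_and_rebuild("metaphlan2/db_v20/mpa_v20_m200", "metaphlan2/markers.fasta", "metaphlan2/db_v21/mpa_v21_m200")`.
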
> >> > > > >
> >> > > > > I can confirm that I can reproduce the files byte-identically from
> >> > > > > markers.fasta.  Is there any reason to ship the binary form instead of
> >> > > > > the fasta text file?  Moreover, what is the source of the markers.fasta?
> >> > > > > Is there any related publication or so?
> >> > > > >
> >> > > > > > > For the mpa_v20_m200.pkl Tin can also provide the uncompressed python
> >> > > > > > > object (or he can provide a couple of lines of code to uncompress it?)
> >> > > > > > It is a Python dictionary and can be read as:
> >> > > > > >
> >> > > > > > import cPickle as pickle
> >> > > > > > import bz2
> >> > > > > > db = pickle.load(bz2.BZ2File('db_v20/mpa_v20_m200.pkl', 'r'))
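
The snippet above is Python 2 (`cPickle`). The same read can be sketched in Python 3 as below, with a toy dictionary written first so that the example is self-contained. The helper names and the dictionary keys are placeholders and say nothing about the real layout of mpa_v20_m200.pkl. A pickle written by Python 2 may additionally need an `encoding` argument to `pickle.load`, depending on its contents.

```python
import bz2
import os
import pickle
import tempfile


def save_db(db, path):
    """Write a dictionary as a bz2-compressed pickle."""
    with bz2.BZ2File(path, "wb") as handle:
        pickle.dump(db, handle, protocol=2)


def load_db(path):
    """Read a bz2-compressed pickle back into a Python object."""
    with bz2.BZ2File(path, "rb") as handle:
        return pickle.load(handle)


# Round trip with a toy dictionary; the keys are purely illustrative.
toy_db = {"example_marker": {"length": 4}, "example_list": [1, 2, 3]}
toy_path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
save_db(toy_db, toy_path)
restored = load_db(toy_path)
print(sorted(restored.keys()))
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.
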
> >> > > > > >
> >> > > > > > You can find more information about them at:
> >> > > > > >
> >> > > > > > https://bitbucket.org/biobakery/metaphlan2#markdown-header-customizing-the-database
> >> > > > >
> >> > > > > OK, that page clarifies the method.  Just a personal remark from the
> >> > > > > point of view of an outsider to bioinformatics:  I'd regard the creation
> >> > > > > process of the mpa_v20_m200.pkl file as a bit cumbersome.  I'd personally
> >> > > > > prefer dropping some text record somewhere and calling a script that
> >> > > > > processes this record rather than writing my own script.
> >> > > > >
> >> > > > > > In addition, the names of some files were changed:
> >> > > > > >    - metaphlan2_strainer.py -> strainphlan.py
> >> > > > > >    - strainer_src -> strainphlan_src
> >> > > > > >    - strainer_tutorial -> strainphlan_tutorial
> >> > > > > >
> >> > > > > > Some source files were updated as well.
> >> > > > > > Please let me know if you need other information.
> >> > > > >
> >> > > > > Just drop me a note once you release a new version containing these
> >> > > > > changes.  I think I'll try to release the current version as is, since at
> >> > > > > least the origin of the files is clarified now.  I'm not yet sure whether
> >> > > > > the size of the data is acceptable or might exceed some limit.  Regarding
> >> > > > > this, I'm wondering whether I should rather create a source tarball
> >> > > > > including markers.fasta and create the bt2 files in the build process.
> >> > > > >
> >> > > > > Kind regards
> >> > > > >
> >> > > > >        Andreas.
> >> > > > >
> >> > > > > --
> >> > > > > http://fam-tille.de
> >> > > > >
> >> > >
> >> > > --
> >> > > http://fam-tille.de
> >> > >
> >>
> >> --
> >> http://fam-tille.de
> >>
> >

--
http://fam-tille.de
