Re: A common test-data package for genome assemblers



Hi all,

> I've had a couple packages that indicate the availability of data
> outside of the source distribution that can be used to try out the
> software (and make sure that it actually runs). I didn't think it was a
> good idea to bundle the data in with the actual package since it doesn't
> change between releases and would take up too much space on the archive
> if it was bundled with every upstream tarball.

+1. It would also be nice to know that there is actually something to
assemble, and that one is not hitting corner cases with simulated reads
(e.g. generating too little coverage while trying to keep the size
down). For my autopkgtests, that always required some fiddling.
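
To illustrate the coverage trade-off, here is a minimal sketch of the
usual estimate (coverage = number of reads * read length / genome
size). The function names and the numbers in the usage note are
hypothetical and not taken from any of the packages mentioned.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_for_coverage(target: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target * genome_size / read_length)
```

For instance, expected_coverage(1000, 100, 50000) gives 2.0, which is
probably too thin for most assemblers, while
reads_for_coverage(30, 100, 50000) says 15000 such reads would be
needed for 30x.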

> For example, at <http://canu.readthedocs.io/en/stable/quick-start.html>,
> there are a few reduced datasets that can be used to run assemblies for
> PacBio and Nanopore sequencing data. Those files can also be used for
> tests of the sprai package, and possibly also for other long-read genome
> assemblers. There's also the option of packaging the Assemblathon data
> for this purpose, or using simulators to generate datasets for testing.
> 
> Does anyone have suggestions or thoughts on this?

I've previously used pbsim, wgsim, GenomeTools' simreads (packaged) or
Flux Simulator (not packaged) to generate very small read sets for
reasonably sized reference chromosomes. The sets were small enough to
distribute with the source tarballs of the packages to be tested;
alternatively, they can be generated on the fly as part of the test
run. In my case, the packages tested were miniasm, rna-star and
snpomatic.
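
As a rough sketch of the on-the-fly idea, the following generates a
reproducible toy read set from a random reference. It is not how pbsim,
wgsim or simreads work: it samples error-free substrings only, and all
names and defaults here are assumptions made for illustration.

```python
import random


def random_reference(length: int, seed: int = 42) -> str:
    """Build a random reference sequence; a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def simulate_reads(reference: str, num_reads: int, read_length: int, seed: int = 42):
    """Sample error-free reads uniformly from the forward strand."""
    rng = random.Random(seed)
    last_start = len(reference) - read_length
    reads = []
    for i in range(num_reads):
        start = rng.randint(0, last_start)
        reads.append((f"read{i}", reference[start:start + read_length]))
    return reads


def to_fastq(reads) -> str:
    """Format (name, sequence) pairs as FASTQ with a constant dummy quality."""
    return "".join(
        f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n" for name, seq in reads
    )
```

Because the seed is fixed, each test run sees the same reads, which
makes a failing test easier to reproduce. A real simulator would
additionally model sequencing errors, both strands and
platform-specific read length distributions.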

I'd welcome a package containing known-good test sets that would avoid
duplication. One could also generate separate sets for genomic reads,
RNA-seq reads, different sequencing platforms, etc. What do you think?
For someone like me, with little practical NGS experience in picking
'good' test sets, this would be a huge help for adding tests where
upstream doesn't provide test data.

Cheers
Sascha


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

