
Re: A common test-data package for genome assemblers



Hi, Sascha, and apologies for the delay.

On Thursday, 14 July 2016 08:57, Sascha Steinbiss wrote:
> 
>> I've had a couple packages that indicate the availability of data
>> outside of the source distribution that can be used to try out the
>> software (and make sure that it actually runs). I didn't think it was a
>> good idea to bundle the data in with the actual package since it doesn't
>> change between releases and would take up too much space on the archive
>> if it was bundled with every upstream tarball.
> 
> +1, and also it would be nice to know that there is actually something
> to assemble and one is not hitting corner cases with simulated reads
> (i.e. generating not enough coverage trying to keep the size down).
> For my autopkgtests that always required some fiddling.
> 
>> For example, at <http://canu.readthedocs.io/en/stable/quick-start.html>,
>> there are a few reduced datasets that can be used to run assemblies for
>> PacBio and Nanopore sequencing data. Those files can also be used for
>> tests of the sprai package, and possibly also for other long-read genome
>> assemblers. There's also the option of packaging the Assemblathon data
>> for this purpose, or using simulators to generate datasets for testing.
>>
>> Does anyone have suggestions or thoughts on this?
> 
> I've previously used pbsim, wgsim, GenomeTools' simreads (packaged) or
> Flux Simulator (not packaged) to generate very small read sets for
> reasonably sized reference chromosomes, small enough to distribute with
> the source tarballs of the packages to be tested or alternatively
> generating them on-the-fly as part of the test run. For me these were
> miniasm, rna-star and snpomatic.

That seems to be the best option given the current situation. I've just
been manually running tests before uploading to avoid having to
duplicate the data with every new upstream release, but that obviously
has many downsides.
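
As a rough illustration of the on-the-fly route, here is a minimal Python
sketch. It is not any of the simulators named above: it only samples
error-free, forward-strand reads from a reference, and the function names
(reads_needed, simulate_reads, to_fastq) are made up for this example. The
point is just that the number of reads can be derived from a target
coverage, so a small dataset does not end up under-covered.

```python
import math
import random


def reads_needed(genome_len, read_len, coverage):
    """Smallest read count n with n * read_len / genome_len >= coverage."""
    return math.ceil(coverage * genome_len / read_len)


def simulate_reads(reference, read_len, coverage, seed=0):
    """Sample error-free, forward-strand reads uniformly from `reference`.

    Hypothetical helper for illustration only; real simulators also model
    sequencing errors, both strands and platform-specific length profiles.
    """
    if read_len > len(reference):
        raise ValueError("read_len must not exceed the reference length")
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    n = reads_needed(len(reference), read_len, coverage)
    last_start = len(reference) - read_len
    reads = []
    for _ in range(n):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(reference[start:start + read_len])
    return reads


def to_fastq(reads, qual_char="I"):
    """Format reads as FASTQ text with a constant placeholder quality."""
    records = []
    for i, seq in enumerate(reads):
        records.append(f"@read{i}\n{seq}\n+\n{qual_char * len(seq)}\n")
    return "".join(records)


if __name__ == "__main__":
    # Assumed toy reference: 5 kb of random sequence, not real genome data.
    ref_rng = random.Random(42)
    reference = "".join(ref_rng.choice("ACGT") for _ in range(5000))
    reads = simulate_reads(reference, read_len=100, coverage=20)
    print(len(reads), "reads generated")
```

Since the seed is fixed, the same reads are produced on every run, so a
test could regenerate them rather than ship them in the source package.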

> 
> I'd welcome a package containing known good test sets that would avoid
> duplication. One could also generate separate sets for genomic reads,
> RNA-seq reads, different sequencing platforms etc. What do you think?

Yes, I think this would be good. Pacific Biosciences has created a
package along these lines [1] to reduce duplication within their own
suite of software, but it's of course only for their own sequencing
platform.

The ones that Canu's developers suggest are genomic reads for PacBio and
Oxford Nanopore bacterial genome assembly, but they take significant
resources to run (so probably not a good idea for debci). I don't think
it would be too hard to make such a package for phage assembly, for
example, but I think this would just test the program to see if it
works. Getting a minimal dataset that would test assemblers in corner
cases would take some more work. I think we should reach out to the
upstream developers of these tools to try to pool efforts for such a
resource.

> For someone with little practical NGS experience to pick 'good' test
> sets (like me) this will be a huge help for adding tests where upstream
> doesn't provide test data.
> 

Avoiding duplication and sharing corner-case tests among programs that
serve similar purposes are also very worthy reasons.

Thanks and regards
Afif

1. https://github.com/PacificBiosciences/PacBioTestData

-- 
Afif Elghraoui | عفيف الغراوي
http://afif.ghraoui.name

