Bug#970192: ITP: clark -- accurate and versatile classification of biological sequences
Package: wnpp
Severity: wishlist
Subject: ITP: clark -- accurate and versatile classification of biological sequences
Owner: Steffen Moeller <moeller@debian.org>
* Package name : clark
Version : 1.2.6.1
Upstream Author : Rachid Ounit <rouni001@cs.ucr.edu>
* URL : http://clark.cs.ucr.edu/
* License : GPL-3.0+
Programming Lang: (C, C++, C#, Perl, Python, etc.)
Description : accurate and versatile classification of biological sequences
The problem of DNA sequence classification is central to several
application domains in molecular biology, genomics, metagenomics and
genetics. Although several software tools have been developed for this
problem, it is still computationally challenging due to the size of
datasets generated by modern sequencing instruments and the growing
size of reference sequence databases.
.
CLARK is based on supervised sequence classification using
discriminative k-mers. On two distinct classification problems
(see the article for details), namely (1) the taxonomic
classification of metagenomic reads to known bacterial genomes,
and (2) the assignment of BAC clones and transcripts to chromosome
arms/centromeres (in the absence of a finished assembly for the reference
genome), CLARK aims to outperform the best state-of-the-art methods in
classification speed and precision.
.
Three classifiers from the CLARK framework are provided:
.
* CLARK (default): created for powerful workstations, it can require
a significant amount of RAM to run with a large database (e.g., all
bacterial genomes from NCBI/RefSeq). This classifier is the standard
in the CLARK tool series. It builds discriminative k-mers from all
k-mers in the targets, queries k-mers with exact matching, and, in
its fastest mode, classifies 1 million short reads in a few seconds;
* CLARK-l: created for workstations with limited memory (i.e., "l"
for light), this software tool provides precise classification on
small metagenomes. For metagenomics analysis, CLARK-l works
with a sparse or "light" database (up to 4 GB of RAM) while still
delivering accurate results quickly. This classifier builds
discriminative k-mers from non-overlapping and distant k-mers in the
targets and queries k-mers with exact matching;
* CLARK-S: created for powerful workstations and exploiting spaced
k-mers (i.e., "S" for spaced), this classifier requires more RAM
than CLARK or CLARK-l, but it offers a higher sensitivity
than CLARK at the species level (see the peer-reviewed publication in
Bioinformatics). CLARK-S completes the series of classifiers from the
CLARK framework.
.
Other applications of CLARK include, for example, the detection of
contaminants and the identification of chimerism and vector contamination
in sequenced BACs (cf. the "Overview" tab on the upstream website).
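
For readers unfamiliar with the approach, the idea of discriminative
k-mers with exact matching described above can be illustrated with a
small sketch. This is not CLARK's code or its actual data structures;
the function names, the toy sequences and the tie-handling rule are
assumptions made for illustration only. It treats a k-mer as
discriminative when it occurs in exactly one target, and assigns a read
to the target that collects the most exact k-mer hits.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_discriminative_index(targets, k):
    """Map each k-mer found in exactly one target to that target's label.

    targets: dict of label -> reference sequence (hypothetical toy input).
    """
    owners = defaultdict(set)
    for label, seq in targets.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(label)
    # k-mers shared by several targets carry no signal and are dropped.
    return {
        kmer: next(iter(labels))
        for kmer, labels in owners.items()
        if len(labels) == 1
    }


def classify(read, index, k):
    """Return the label with the most exact k-mer hits, or None.

    None is returned when the read has no discriminative k-mer or when
    the two best targets tie (an assumption of this sketch).
    """
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not hits:
        return None
    ranked = hits.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Toy references; real targets would be whole genomes or chromosome arms.
TARGETS = {"A": "ACGTACGTTTGACC", "B": "GGCATCGATTTGACA"}
K = 5
INDEX = build_discriminative_index(TARGETS, K)

if __name__ == "__main__":
    for read in ("GTACGTT", "CATCGAT", "TTTGAC"):
        print(read, classify(read, INDEX, K))
```

In this toy example the k-mers TTTGA and TTGAC occur in both targets, so
they are left out of the index and a read made only of them stays
unclassified. A sparse index in the spirit of CLARK-l could be sketched
by sampling non-overlapping k-mers in build_discriminative_index instead
of all of them, again as an assumption rather than a description of the
real implementation.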
Remark: This package is maintained by Steffen Moeller at
https://salsa.debian.org/med-team/clark