
Bug#944785: ITP: pufferfish -- An efficient index for the colored, compacted, de Bruijn graph



Package: wnpp
Severity: wishlist

Subject: ITP: pufferfish -- An efficient index for the colored, compacted, de Bruijn graph 
Owner: Michael R. Crusoe <michael.crusoe@gmail.com>

* Package name    : pufferfish
  Version         : 1.0.0
  Upstream Author : 2016 Rob Patro, Avi Srivastava, Hirak Sarkar
* URL             : https://github.com/COMBINE-lab/pufferfish
* License         : GPL-3+
  Programming Lang: C++
  Description     : An efficient index for the colored, compacted, de Bruijn graph 
 Pufferfish is a new time- and memory-efficient data structure for indexing a
 compacted, colored de Bruijn graph (ccdBG).
 .
 Though the de Bruijn graph (dBG) has enjoyed tremendous popularity as an
 assembly and sequence comparison data structure, it has only relatively
 recently begun to see use as an index of reference sequences (e.g. deBGA,
 kallisto). In particular, these tools index the compacted dBG (cdBG), in
 which all non-branching paths are collapsed into individual nodes and labeled
 with the string they spell out. This data structure is particularly
 well-suited for representing repetitive reference sequences, since a single
 contig in the cdBG represents all occurrences of the repeated sequence. The
 original positions in the reference can be recovered with the help of an
 auxiliary "contig table" that maps each contig to the reference sequence,
 position, and orientation where it appears as a substring. The deBGA paper
 has a nice description of how this kind of index looks (they call it a
 unipath index, because the contigs we index are unitigs in the cdBG), and how
 all the pieces fit together to resolve the queries we care about.
 .
 Moreover, the cdBG can be built on multiple reference sequences
 (transcripts, chromosomes, genomes), where each reference is given a distinct
 color (or colour, if you're of the British persuasion). The resulting
 structure, which also encodes the relationships between the cdBGs of the
 underlying reference sequences, is called the compacted, colored de Bruijn
 graph (ccdBG).
 .
 This is not, of course, the only variant of the dBG that has proven useful
 from an indexing perspective. The (pruned) dBG has also proven useful as a
 graph upon which to build a path index of arbitrary variation / sequence
 graphs, which has enabled very interesting and clever indexing schemes like
 that adopted in GCSA2. Also, thinking about sequence search in terms of the
 dBG has led to interesting representations for variation-aware sequence
 search backed by indexes like the vBWT (implemented in the excellent
 gramtools package).
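
As a rough illustration of the "contig table" idea described above, here is a
minimal Python sketch. It is not taken from pufferfish: the function names
(revcomp, build_contig_table, build_kmer_index, locate), the toy references
t1..t3, the value K = 5 and the single contig are all assumptions made for
this example. The contig is supplied by hand rather than produced by real
graph compaction, and a plain dict stands in for whatever k-mer lookup
structure a real index would use.

```python
from collections import defaultdict

K = 5  # toy k-mer length (assumption; real indexes use a much larger k)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq[::-1].translate(_COMPLEMENT)


def build_contig_table(contigs, references):
    """Map each contig id to all (reference, position, forward) occurrences."""
    table = defaultdict(list)
    for cid, seq in contigs.items():
        orientations = [(seq, True)]
        if revcomp(seq) != seq:  # do not count a palindromic contig twice
            orientations.append((revcomp(seq), False))
        for ref_name, ref_seq in references.items():
            for oriented, forward in orientations:
                start = ref_seq.find(oriented)
                while start != -1:
                    table[cid].append((ref_name, start, forward))
                    start = ref_seq.find(oriented, start + 1)
    return dict(table)


def build_kmer_index(contigs, k=K):
    """Map each k-mer to (contig id, offset of the k-mer inside the contig)."""
    index = {}
    for cid, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = (cid, i)
    return index


def locate(kmer, contigs, index, table):
    """Project a k-mer hit on a contig back to reference coordinates.

    Each result is (reference, position, forward). When forward is False,
    the reference holds the reverse complement of the k-mer at that position.
    """
    cid, offset = index[kmer]
    contig_len = len(contigs[cid])
    hits = []
    for ref_name, pos, forward in table.get(cid, []):
        if forward:
            hits.append((ref_name, pos + offset, True))
        else:
            # The contig is flipped, so offsets count from its other end.
            hits.append((ref_name, pos + contig_len - len(kmer) - offset, False))
    return sorted(hits)


# Hypothetical toy data: one repeated sequence shared by three references,
# the third of which contains it on the reverse strand.
references = {
    "t1": "CCGATTACAGG",
    "t2": "TGATTACAC",
    "t3": "AATGTAATCAA",
}
contigs = {0: "GATTACA"}

table = build_contig_table(contigs, references)
index = build_kmer_index(contigs)
hits = locate("TTACA", contigs, index, table)
```

The point of the sketch is that the repeated sequence is stored once, as a
single contig, while the table carries one small entry per occurrence; a
k-mer query first finds its contig and offset and only then fans out to the
reference positions.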

Remark: This package will be team-maintained by the Debian Med Packaging Team at
   https://salsa.debian.org/med-team/pufferfish

