
Bug#798873: ITP: dazzdb -- database library for the dazzler assembler



Package: wnpp
Severity: wishlist
Owner: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>

* Package name    : dazzdb
  Version         : 1.0
  Upstream Author : Eugene W. Myers, Jr. <gene.myers@gmail.com>
* URL             : https://github.com/thegenemyers/DAZZ_DB
* License         : BSD
  Programming Lang: C
  Description     : database library for the dazzler assembler

 To facilitate the multiple phases of the dazzler assembler, all the read
 data are organized into what is effectively a "database" of the reads and
 their meta-information. The design goals for this database are as follows:
 (1) The database stores the source PacBio read information in such a way that
     it can recreate the original input data, thus permitting a user to remove
     the (effectively redundant) source files. This avoids duplicating the
     same data, once in the source file and once in the database.
 (2) The database can be built up incrementally; that is, new sequence data
     can be added to the database over time.
 (3) The database flexibly allows one to store any meta-data desired for reads.
     This is accomplished with the concept of *tracks* that implementors can
     add as they need them.
 (4) The data is held in a compressed form equivalent to the .dexta and .dexqv
     files of the data extraction module. Both the .fasta and .quiva
     information for each read is held in the database and can be recreated
     from it. The .quiva information can be added separately and later on if
     desired.
 (5) To facilitate job-parallel cluster operation of the phases of dazzler,
     the database has a concept of a *current partitioning* in which all the
     reads that are over a given length, and optionally unique to a well, are
     divided up into *blocks* containing roughly a given number of bases,
     except possibly the last block, which may have a short count. Programs
     can often be run on blocks or pairs of blocks, and each such job is
     reasonably well balanced as the blocks are all roughly the same size.
     One must be careful about changing the partition during an assembly, as
     doing so can void the structural validity of any interim block-based
     results.
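
The block partitioning described in (5) can be illustrated with a minimal
sketch in C. This is not code from DAZZ_DB: the function name
partition_reads, its parameters, and the greedy fill rule are assumptions made
for illustration only. The sketch treats "over a given length" as "at least
cutoff" and leaves out the optional unique-to-a-well filter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the DAZZ_DB implementation.
 * Assigns every read of length >= cutoff to a block so that each block
 * holds roughly target_bases bases; only the last block may fall short.
 * block_of[i] receives the 0-based block index of read i, or -1 if the
 * read is shorter than cutoff and is therefore left out of the partition.
 * Returns the number of blocks that received at least one read. */
int partition_reads(const long *read_len, size_t nreads, long cutoff,
                    long target_bases, int *block_of)
{
    int block = 0;    /* index of the block currently being filled */
    long filled = 0;  /* bases already placed in that block */
    int used = 0;     /* number of blocks holding at least one read */

    assert(target_bases > 0);

    for (size_t i = 0; i < nreads; i++) {
        if (read_len[i] < cutoff) {
            block_of[i] = -1;          /* too short: excluded */
            continue;
        }
        if (filled >= target_bases) {  /* current block is full */
            block++;
            filled = 0;
        }
        block_of[i] = block;
        filled += read_len[i];
        used = block + 1;
    }
    return used;
}
```

Because a block is closed only once it has reached the target, every block
except possibly the last carries about the same number of bases, which is what
keeps per-block jobs balanced. Calling the function again with a different
cutoff or target changes the block boundaries, which is why block-based
interim results do not survive a repartition.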

