
Bug#1016046: ITP: genomicsdb -- sparse array storage library for genomics



Package: wnpp
Severity: wishlist
Owner: Debian-med team <debian-med@lists.debian.org>
X-Debbugs-Cc: debian-devel@lists.debian.org, debian-med@lists.debian.org

* Package name    : genomicsdb
  Version         : 1.4.3
  Upstream Author : Intel Health and Lifesciences
* URL             : https://www.genomicsdb.org/
* License         : Expat
  Programming Lang: C++, Java
  Description     : sparse array storage library for genomics

GenomicsDB is built on top of an htslib fork and an internal array storage
system for importing, querying and transforming variant data. Variant data is
sparse by nature (sparse relative to the whole genome), so a sparse array
data store is a natural fit for storing it.

GenomicsDB stores variant data in a 2D array where:
 - Each column corresponds to a genomic position (chromosome + position);
 - Each row corresponds to a sample in a VCF (or CallSet in the GA4GH
   terminology);
 - Each cell contains data for a given sample/CallSet at a given position;
   data is stored in the form of cell attributes;
 - Cells are stored in column-major order, which makes accessing cells with
   the same column index (i.e. data for a given genomic position over all
   samples) fast;
 - Variant interval/gVCF interval data is stored in a cell at the start of the
   interval. The END is stored as a cell attribute. For variant intervals
   (such as deletions and gVCF REF blocks), an additional cell is stored at
   the END value of the variant interval. When queried for a given genomic
   position, the query library performs an efficient sweep to determine all
   intervals that intersect with the queried position.
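
The layout above can be illustrated with a small toy model. This is only a
sketch under stated assumptions, not the GenomicsDB API: the class
SparseVariantStore, its methods, and the is_end_marker and START attributes
are hypothetical names invented for this example. It assumes integer row
and column indices and at most one non-overlapping interval per row at any
position. Cells are kept sorted by (column, row), i.e. in column-major
order, and a query sweeps leftwards from the queried column until every row
has been decided.

```python
from bisect import bisect_right, insort


class SparseVariantStore:
    """Toy model: cells keyed by (column, row), kept in column-major order."""

    def __init__(self):
        self._keys = []     # sorted list of (column, row) tuples
        self._cells = {}    # (column, row) -> dict of cell attributes
        self._rows = set()  # all rows (samples) seen so far

    def _put(self, column, row, attrs):
        key = (column, row)
        if key not in self._cells:
            insort(self._keys, key)
        self._cells[key] = attrs
        self._rows.add(row)

    def add_interval(self, row, start, end, **attrs):
        # Cell at the start of the interval; END is stored as an attribute.
        self._put(start, row, {"END": end, "is_end_marker": False, **attrs})
        if end > start:
            # Additional cell at the END value of the interval.
            self._put(end, row, {"START": start, "is_end_marker": True})

    def query(self, position):
        """Return {row: start-cell attributes} for intervals covering position."""
        hits, decided = {}, set()
        # Index of the last cell whose column is <= position.
        i = bisect_right(self._keys, (position, float("inf"))) - 1
        # Sweep left; the nearest cell of each row decides that row.
        while i >= 0 and len(decided) < len(self._rows):
            column, row = self._keys[i]
            i -= 1
            if row in decided:
                continue
            decided.add(row)
            cell = self._cells[(column, row)]
            if cell["is_end_marker"]:
                if column != position:
                    continue  # this row's interval ended before position
                # The interval ends exactly here: use its start cell.
                cell = self._cells[(cell["START"], row)]
            if cell["END"] >= position:
                hits[row] = cell
        return hits


store = SparseVariantStore()
store.add_interval(0, 100, 200, GT="0/0")  # gVCF REF block, sample 0
store.add_interval(1, 150, 150, GT="0/1")  # single-position variant, sample 1
store.add_interval(2, 120, 130, GT="1/1")  # deletion, sample 2
print(sorted(store.query(150)))  # rows with data at position 150
```

In this toy model the extra cell at END lets the sweep stop for a row as
soon as it sees that the row's most recent interval has already finished,
instead of scanning further left for a possible longer interval.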

Upstream provides both a C++ library and a Java library; we plan to ship both.

This library is needed as a dependency of gatk, which is a packaging target of
the Debian-med team.

