
Bug#835493: ITP: indexed-gzip -- fast random access of gzip files in Python



Package: wnpp
Severity: wishlist
Owner: Michael Hanke <mih@debian.org>

* Package name    : indexed-gzip
  Version         : 0.1
  Upstream Author : Paul D McCarthy
* URL             : https://github.com/pauldmccarthy/indexed_gzip
* License         : BSDish https://opensource.org/licenses/zlib-license.php
  Programming Lang: Python
  Description     : fast random access of gzip files in Python

[from the README]

The indexed_gzip project is a Python extension which aims to provide a
drop-in replacement, the IndexedGzipFile class, for the built-in Python
gzip.GzipFile class.

The standard gzip.GzipFile class exposes a random access-like interface
(via its seek and read methods), but every time you seek to a new point
in the uncompressed data stream, the GzipFile instance has to start
decompressing from the beginning of the file, until it reaches the
requested location.
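
As a small illustration of the interface described above, the following
sketch uses only the standard library to seek and read inside a gzip
stream. The helper name `read_at` and the sample payload are made up for
this example; they are not part of the README.

```python
import gzip
import io


def read_at(compressed: bytes, offset: int, size: int) -> bytes:
    """Read `size` uncompressed bytes starting at uncompressed `offset`.

    `read_at` is a hypothetical helper for illustration only. GzipFile
    accepts the seek, but it has to decompress the stream up to `offset`
    before it can return any data.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
        f.seek(offset)
        return f.read(size)


# 1 MiB of sample data, compressed in memory.
payload = bytes(range(256)) * 4096
compressed = gzip.compress(payload)

# Looks like random access, but costs a decompression pass up to the offset.
chunk = read_at(compressed, 700_000, 16)
```

The call returns the right bytes, so the interface is usable; the concern
raised in the README is only about how much work each seek costs.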

An IndexedGzipFile instance gets around this performance limitation by
building an index of seek points, which are mappings between
corresponding locations in the compressed and uncompressed data streams.
Each seek point is accompanied by a chunk (32KB) of uncompressed data
which is used to initialise the decompression algorithm, allowing us to
start reading from any seek point. If the index is built with a seek
point spacing of 1MB, we only have to decompress (on average) 512KB of
data to read from any location in the file.
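
The 512KB figure can be checked with a toy model of the index. This is
only a sketch under simplifying assumptions, not the indexed_gzip
implementation: it tracks uncompressed offsets only, ignores the 32KB
window stored with each seek point, and all names (`SPACING`,
`nearest_seek_point`, `bytes_to_decompress`) are hypothetical.

```python
import bisect

SPACING = 1024 * 1024  # assumed seek point spacing of 1MB, as in the text


def nearest_seek_point(seek_points, offset):
    """Return the last seek point at or before `offset`.

    `seek_points` is a sorted list of uncompressed offsets that starts at 0.
    """
    i = bisect.bisect_right(seek_points, offset) - 1
    return seek_points[i]


def bytes_to_decompress(seek_points, offset):
    """Bytes that must be decompressed and discarded to reach `offset`."""
    return offset - nearest_seek_point(seek_points, offset)


# Toy index covering an 8MB uncompressed stream.
seek_points = [i * SPACING for i in range(8)]

# A read 1000 bytes past the fourth seek point only costs 1000 bytes.
cost = bytes_to_decompress(seek_points, 3 * SPACING + 1000)

# Averaged over evenly spread offsets, the cost is close to SPACING / 2.
offsets = range(0, 8 * SPACING, 4096)
average = sum(bytes_to_decompress(seek_points, o) for o in offsets) / len(offsets)
```

Without an index, the same read would cost the full distance from the
start of the file, which is where the difference in seek time comes from.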

Performance comparison:
https://github.com/pauldmccarthy/indexed_gzip#performance

This needs to be packaged because it is a dependency of the replacement
for the `fslview` (https://packages.debian.org/sid/fslview) package.
fslview itself will not make it into the next release due to its
dependencies on obsolete code (Qt4, ...).

This will be maintained by the NeuroDebian team.

