Bug#291235: RFP: libcgrep -- Library for Fast and Flexible Word Searching on Compressed Text
Package: wnpp
Severity: wishlist
* Package name : libcgrep
Version : x.y.z
Upstream Author : Paolo Ferragina <ferragina@di.unipi.it>
* URL : http://roquefort.di.unipi.it/~ferrax/CompressedSearch/
* License : GPL
Description : Search through compressed files (fast)
Summary:
- In summary, CGrep is about four times faster than ZGrep, and it
additionally supports fast, advanced IR searches that are
completely absent from the grep-family tools.
Preconditions:
- doxygen
- agrep
The following programs are included:
- huffw: (de)compresses a textual file according to the Huffword paradigm
- cgrep: searches for complex patterns over Huffword-compressed files
The following libraries are included:
Huffword Library: This library offers algorithms and data
structures for the efficient (de)compression of textual files. The
compression paradigm is a variant of the Huffman algorithm,
adapted here to be byte-aligned, tagged, and word-oriented. More
precisely, the tokens are alphanumeric words, and the codewords
are byte sequences with tag bits that guarantee
synchronization. The tagging is obtained as follows: the Huffman
tree has fan-out 128, and the 7-bit arc labels are stored in
the least significant bits of a byte; the codeword is then
completed by setting the first bit of the first byte to 1, and
the first bit of every other byte to 0. This compression paradigm
can be improved by removing each single space that occurs between
two alphanumeric tokens, a variant hereafter called the Spaceless Model.
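The tagging scheme above can be sketched in a few lines of Python. This is a toy illustration, not the library's actual code: it assumes the tag occupies the most significant bit of each byte, and it works on arbitrary lists of 7-bit arc labels rather than a real Huffman tree.

```python
# Toy sketch of the tagged, byte-aligned codeword scheme: each arc label
# is 7 bits; the first byte of a codeword carries a tag bit of 1 (here
# assumed to be the most significant bit), every other byte a tag of 0.

def encode_codeword(arc_labels):
    """Pack a list of 7-bit arc labels (0..127) into tagged bytes."""
    out = bytearray()
    for i, label in enumerate(arc_labels):
        tag = 0x80 if i == 0 else 0x00  # 1 marks the start of a codeword
        out.append(tag | (label & 0x7F))
    return bytes(out)

def split_codewords(data):
    """Resynchronize: split a byte stream back into whole codewords."""
    words, current = [], bytearray()
    for b in data:
        if b & 0x80 and current:  # a new codeword starts at this byte
            words.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        words.append(bytes(current))
    return words

stream = encode_codeword([5, 17]) + encode_codeword([99]) + encode_codeword([1, 2, 3])
print(split_codewords(stream))  # -> [b'\x85\x11', b'\xe3', b'\x81\x02\x03']
```

Because the tag bit marks exactly the first byte of each codeword, a scanner can resynchronize at any position in the compressed stream, which is what makes direct pattern matching on the compressed file possible.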
CGrep Library: This library offers all the pattern-matching
functionality needed to search over Huffword-compressed
files. The search consists of three steps: first, the
complex pattern query (possibly involving errors, regular expressions, etc.)
is resolved against the dictionary of tokens built at compression
time by the Huffword algorithm. The tokens that satisfy the query
are then encoded, and their codewords are searched for directly in the
compressed file (note that this is an exact multiple-pattern
search). This approach yields a twofold speedup: the search cost
is proportional to the length of the compressed file, since no
decompression at all is required; and the search cost for
complicated patterns (regular expressions, errors, etc.) depends on the
dictionary size, which is small compared to the entire file. Features
implemented beyond the original paper are: proximity queries,
an arbitrary number of query patterns each with its own search
options, and snippet extraction.
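The three-step search can be sketched as follows. This is a hypothetical illustration of the idea only: plain tokens stand in for their Huffword codewords, and the dictionary and stream are made-up examples, not the library's API.

```python
# Toy sketch of the dictionary-first search: run the expensive pattern
# query only over the small token dictionary, then do an exact
# multiple-pattern scan over the (here simulated) compressed stream.
import re

def cgrep_search(pattern, dictionary, compressed_tokens):
    """Return positions in the stream whose token matches the pattern.

    Step 1: resolve the complex pattern against the token dictionary.
    Step 2: collect the codewords of the matching tokens.
    Step 3: scan the compressed stream once, testing exact membership.
    """
    regex = re.compile(pattern)
    matching = {tok for tok in dictionary if regex.fullmatch(tok)}
    return [i for i, cw in enumerate(compressed_tokens) if cw in matching]

dictionary = ["color", "colour", "grep", "search"]
stream = ["search", "colour", "grep", "color", "search"]
print(cgrep_search(r"colou?r", dictionary, stream))  # -> [1, 3]
```

Note how the regular expression is evaluated only once per dictionary entry, while the pass over the stream is a cheap exact-match test, which is where the speedup described above comes from.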
-- System Information:
Debian Release: 3.1
APT prefers unstable
APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.9-1-686
Locale: LANG=C, LC_CTYPE=C (charmap=ISO-8859-1) (ignored: LC_ALL set to en_US)