
Bug#291235: RFP: libcgrep -- Library for Fast and Flexible Word Searching on Compressed Text



Package: wnpp
Severity: wishlist


* Package name    : libcgrep
  Version         : x.y.z
  Upstream Author : Paolo Ferragina <ferragina@di.unipi.it>
* URL             : http://roquefort.di.unipi.it/~ferrax/CompressedSearch/
* License         : GPL
  Description     : Search through compressed files (fast)

Summary:

- In summary, CGrep is about four times faster than ZGrep, and it
  additionally supports fast, advanced IR searches that are
  completely absent from the grep family of tools.

Preconditions:

- doxygen
- agrep

The following programs are included:

- huffw: (de)compresses a textual file according to the Huffword paradigm
- cgrep: searches for complex patterns over Huffword-compressed files

The following libraries are included:

Huffword Library: This library offers algorithms and data
    structures for the efficient (de)compression of textual files.
    The compression paradigm is a variant of the Huffman algorithm,
    adapted here to be byte-aligned, tagged, and word-oriented. More
    precisely, the tokens are alphanumeric words, and the codewords
    are byte sequences carrying tag bits that guarantee
    synchronization. The tagging works as follows: the Huffman tree
    has fan-out 128, and the 7-bit arc labels are stored in the
    least significant bits of a byte; the codeword is then obtained
    by setting the first bit of its first byte to 1 and the first
    bit of every other byte to 0. This compression paradigm can be
    improved by removing each single space that occurs between two
    alphanumeric tokens, a variant hereafter called the Spaceless
    model.
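
    To make the tag mechanism concrete, here is a minimal C sketch
    of the byte layout described above. The helper names and the
    sample arc labels are invented for illustration; they are not
    part of libcgrep's actual API.

        #include <stdint.h>
        #include <stdio.h>
        #include <stddef.h>

        /* Pack the 7-bit arc labels of one codeword into tagged
         * bytes: the first byte carries a 1 in its most significant
         * bit, every continuation byte a 0. */
        static size_t tag_codeword(const uint8_t *labels, size_t n,
                                   uint8_t *out)
        {
            for (size_t i = 0; i < n; i++)
                out[i] = (uint8_t)((i == 0 ? 0x80 : 0x00)
                                   | (labels[i] & 0x7F));
            return n;
        }

        /* Resynchronize: from an arbitrary offset, the next codeword
         * boundary is the next byte whose tag bit is set. */
        static size_t next_codeword(const uint8_t *buf, size_t len,
                                    size_t pos)
        {
            while (pos < len && (buf[pos] & 0x80) == 0)
                pos++;
            return pos;
        }

        int main(void)
        {
            /* Two invented codewords: arc labels {0x05, 0x21} and
             * {0x7F}, which become the bytes 0x85 0x21 and 0xFF. */
            uint8_t stream[8];
            size_t n = 0;
            n += tag_codeword((const uint8_t[]){0x05, 0x21}, 2, stream + n);
            n += tag_codeword((const uint8_t[]){0x7F}, 1, stream + n);

            /* Starting inside the first codeword, resync lands on
             * the second one (offset 2). */
            printf("boundary after offset 1: %zu\n",
                   next_codeword(stream, n, 1));
            return 0;
        }

    Because only codeword-initial bytes carry the tag bit, a
    byte-level match for a codeword can never start in the middle of
    another codeword, which is what makes exact searching on the
    compressed bytes safe.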

CGrep Library: This library offers the pattern-matching
    functionality needed to search over Huffword-compressed files.
    The search consists of three steps: first, the complex pattern
    query (possibly involving errors, regular expressions, ...) is
    resolved against the dictionary of tokens built at compression
    time by the Huffword algorithm; then the tokens that satisfy the
    query are encoded; finally, their codewords are searched for
    directly in the compressed file (note that this is an exact
    multiple-pattern search). This approach yields a twofold
    speedup: the search cost is proportional to the length of the
    compressed file, since no decompression at all is required; and
    the search cost for complicated patterns (regular expressions,
    errors, ...) depends only on the dictionary size, which is small
    compared to the entire file. The features implemented in
    addition to the original paper are: proximity queries, an
    arbitrary number of query patterns, each with its own search
    options, and snippet extraction. A sketch of the three-step
    pipeline follows.
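
    The following C sketch shows the three-step pipeline under the
    assumption that the token dictionary and the codewords are
    already in memory. All names (token_t, report_hits, the sample
    dictionary) are illustrative, not libcgrep's API; memmem() is
    the GNU libc extension.

        #define _GNU_SOURCE        /* for memmem() */
        #include <regex.h>
        #include <string.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <stddef.h>

        typedef struct {
            const char    *word;    /* token as it appears in text */
            const uint8_t *code;    /* its tagged Huffword codeword */
            size_t         codelen;
        } token_t;

        /* Steps 2 and 3: given one matching token, scan the
         * compressed data for its codeword.  Only codeword-initial
         * bytes have the tag bit set, so every hit is aligned on a
         * codeword boundary. */
        static void report_hits(const token_t *t,
                                const uint8_t *zdata, size_t zlen)
        {
            const uint8_t *p = zdata;
            size_t left = zlen;
            const uint8_t *hit;
            while ((hit = memmem(p, left, t->code, t->codelen)) != NULL) {
                printf("token \"%s\" at compressed offset %td\n",
                       t->word, hit - zdata);
                left -= (size_t)(hit - p) + 1;  /* continue past hit */
                p = hit + 1;
            }
        }

        int main(void)
        {
            /* Invented two-entry dictionary and compressed stream. */
            static const uint8_t code_foo[] = {0x85, 0x21};
            static const uint8_t code_bar[] = {0xFF};
            const token_t dict[] = {
                { "foo", code_foo, sizeof code_foo },
                { "bar", code_bar, sizeof code_bar },
            };
            const uint8_t zdata[] = {0x85, 0x21, 0xFF, 0x85, 0x21};

            /* Step 1: resolve a regular expression against the
             * small dictionary only, never against the full file. */
            regex_t re;
            regcomp(&re, "^f.o$", REG_EXTENDED | REG_NOSUB);
            for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
                if (regexec(&re, dict[i].word, 0, NULL, 0) == 0)
                    report_hits(&dict[i], zdata, sizeof zdata);
            regfree(&re);
            return 0;
        }

    The expensive pattern machinery (regular expressions,
    approximate matching) runs only over the dictionary, while the
    pass over the compressed file is a plain exact multi-pattern
    byte search, which is where the claimed speedup comes from.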

-- System Information:
Debian Release: 3.1
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.9-1-686
Locale: LANG=C, LC_CTYPE=C (charmap=ISO-8859-1) (ignored: LC_ALL set to en_US)


