
PPMd (PPMII), bzip2 -9 misconception, order of files in archives



On 19 Feb 2003 17:16:48 +1100, Glenn McGrath <bug1@optushome.com.au> wrote:
>I have looked at PPMd; I've read that it's good, but it took me a while to
>work out how to use it to get maximum compression. Here is a comparison:
>
>8921088 Packages
>2375680 Packages.gz
>1814528 Packages.bz2
>1335296 Packages.pmd
>
>That was done with "PPMd e -o16 -m256 -r2 Packages". It surprised me
>that the -m option (amount of memory to use) affects the compression
>ratio.
>
>That's a pretty big improvement over bz2.

I would guess that memory usage improving compression is a concept that is
widely unknown, so I filed a bug against ppmd:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=181665
I'm CCing the bug since I talk about the -m option in the next paragraph.

There is a good paper on the PPMII algorithm (based on PPM), which ppmd
uses, at: http://DataCompression.info/Miscellaneous/PPMII_DCC02.pdf
Basically, as I understand it, the more memory you allow a PPM algorithm
to use, the more context history it can keep, so it can look for similar
contexts across more of the file and come up with better prediction
values.
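
Out of curiosity, the effect is easy to measure with a quick sweep over
-m values. A minimal sketch, assuming the flags and the FILE -> FILE.pmd
output naming from Glenn's invocation above carry over unchanged (I
haven't checked whether ppmd can write to a separately named output,
hence the copies):

    # Compare ppmd output sizes at several memory limits (in MB).
    # -o16/-r2 and the ".pmd" suffix are taken from the example above;
    # treat everything else as a guess rather than a tested recipe.
    for m in 16 64 128 256; do
        cp Packages Packages-m$m
        ppmd e -o16 -m$m -r2 Packages-m$m
    done
    ls -l Packages-m*.pmd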



Some interesting stuff about bzip2 from:
ftp://sources.redhat.com/pub/bzip2/docs/manual_1.html#SEC1
-----
bzip2 compresses files using the Burrows-Wheeler block-sorting text
compression algorithm, and Huffman coding. Compression is generally
considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.
-----


Also, when using bzip2, note that -9 does not mean the best compression
(best as in smallest size); it specifies the largest block size, which
*usually* gives the smallest compressed file. Sometimes a different block
size can actually beat -9. I'd have to test it to be sure, but the case
where I saw this was compressing several megabytes of IRC logs, which are
ASCII text with many repeating words and a defined structure.

The reference fwiw:
ftp://sources.redhat.com/pub/bzip2/docs/manual_2.html#SEC6
-----
-1 (or --fast) to -9 (or --best)
Set the block size to 100 k, 200 k .. 900 k when compressing.
-----
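
Testing this for a given file is easy to script; the following just keeps
each level's output side by side for comparison (only standard bzip2
flags: -1 through -9, and -c to write to stdout so the original file is
left untouched):

    # Compress with every block size and compare the resulting sizes.
    for n in 1 2 3 4 5 6 7 8 9; do
        bzip2 -$n -c Packages > Packages.$n.bz2
    done
    ls -l Packages.*.bz2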

Perhaps I should file a wishlist bug against bzip2 about this option's
documentation: --best could just as well be read as "best guess at block
size", i.e. the 900 k block size, and people using bzip2 may need to be
told this explicitly. Perhaps all bzip2'd files in Debian should be
checked to see which block size is best, with bugs filed against the few
that don't have an optimal block size chosen. I might post a mass bug
filing warning to debian-devel if I decide I want to test all the
relevant .bz2 files.

It should also be noted that bzip2 may not be an optimal implementation of
the Burrows-Wheeler block-sorting algorithm. In
ftp://sources.redhat.com/pub/bzip2/docs/manual_4.html#SEC44 Julian Seward
says: "It would be fair to say that the bzip2 format was frozen before I
properly and fully understood the performance consequences of doing so."



The Archive Comparison Test (ACT) (usually available at
http://compression.ca ) has a comparison of archivers. You can find the
algorithm that each archiver uses in its Index of archivers. I found that
the best compressors for size use the Prediction by Partial Matching
(PPM, which is based on Markov chains) and Context Tree Weighting (CTW)
algorithms.



Another thing to note when creating compressed *archives* is that the
order of files may be important. It has been documented (I don't remember
where) that grouping files of similar types together can yield better
compression, and going through all the different ways of ordering files
might yield the optimum. I suspect this is how 7-zip (
http://www.7-zip.org ) outperforms other zip implementations while still
maintaining compatibility with all other zip archivers.

A program or script to put files into different orders for archiving (in
a list, or in a list passed to something like tar) is something that I
have been meaning to write. Going through all possible orderings would be
nice, but it may be faster to use something like file to group similar
files together. file has quite a few bugs filed against it though (
http://bugs.debian.org/file ), so it may be far from an optimal choice,
but it would certainly be a good place for me to start. It'd be nice if
someone wrote such a script before me though. ;-)
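
For what it's worth, here is the kind of rough sketch I have in mind:
sort on file(1)'s one-line type description so similar files end up
adjacent, then hand the ordered list to GNU tar with -T. The choice of
sort key (and the whole thing, really) is untested guesswork:

    #!/bin/sh
    # Prefix each file name with file(1)'s brief type description,
    # sort on that prefix, strip it, and archive in the sorted order.
    find . -type f | while read f; do
        printf '%s\t%s\n' "$(file -b "$f")" "$f"
    done | sort | cut -f2- > ordered-list
    tar -cf archive.tar -T ordered-list
    bzip2 -9 archive.tar

Exhaustively permuting the order would still be the only way to be sure
of the optimum; sorting on the type string merely approximates the
grouping effect.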

     Drew Daniels


