
PPMd (PPMII) memory, file order in archives, special tar/debs/udebs



Sorry about the cross post, but the topics here are relevant to the bug,
and discussed in both lists.

On Thu, 20 Feb 2003, Magnus Ekdahl wrote:

> > There is a good paper on the PPMII algorithm (based on PPM) which ppmd
> > uses at: http://DataCompression.info/Miscellaneous/PPMII_DCC02.pdf
> > Basically, as I understand it, the more memory that you allow a PPM
> > algorithm to use, the more of the file it can check for similar contexts
> > or come up with better prediction values by checking more of, and more
> > within, the file.
>
> True, the PPMII can use a lot of memory and if that memory ain't available
> the compression model has to be crushed somehow, with decreased
> compression ratio as a result. But if PPMII only needs 10Mb for its model,
> another 100Mb won't make the compression any better. (As far as I understand.)
>
It's possible. It depends on the maximum context size. If you have a very
large chunk in the same context (and it's allowed), then the statistics
table generator should imho be checking all predictors and predicted
blocks. I didn't check to see if it actually allows large contexts and
actually builds tables using all the predictors and predicted blocks in
larger contexts. I'm also not sure how the different models would affect
maximum context sizes, but I wouldn't be surprised if it was expected that
binaries have small contexts.
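
To get a feel for whether more memory could help on a particular file,
something like this rough Python sketch (nothing to do with ppmd's actual
data structures) counts how many distinct contexts of each order actually
occur. If the count levels off at a size the model already holds, extra
memory is wasted; if it keeps growing with the order, a larger model could
still help:

#!/usr/bin/env python
# Rough sketch, not ppmd code: count the distinct contexts of a given order
# that actually occur in a file.
import sys

def distinct_contexts(data, order):
    contexts = set()
    for i in range(order, len(data)):
        contexts.add(data[i - order:i])   # the `order' bytes preceding position i
    return len(contexts)

if __name__ == '__main__':
    data = open(sys.argv[1], 'rb').read()
    for order in range(1, 9):
        print("order %d: %d distinct contexts" % (order, distinct_contexts(data, order)))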

Another possible reason for allowing a PPM algorithm to use more memory is
to allow it to be sloppy and quick by using full ints (or equivalent) in
its in-memory table generation. There are memory usage optimizations that
can be put in place because the maximum number of possible occurrences of
a block is known ahead of time. If you take the amount of space left in
the file and divide it by the block size, you get the maximum possible
number of occurrences (a rough description). If that maximum number of
occurrences is small enough, there is extra space in the int (or whatever
kind of counter is used) that could be shared with another table entry
(counter/statistic). I doubt ppmd would be using such an optimization
though, and I'm not sure if its counters are slightly lossy (i.e. floats),
although lossy counters can be optimized too.
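
As a toy illustration of the counter-sharing idea (my own sketch, not how
ppmd actually stores its statistics): the bytes left in the file divided by
the block size bound how many more times a block can occur, and if that
bound fits in 16 bits, two counters can share one 32-bit word:

# Toy sketch of the counter-sharing idea; not ppmd's actual statistics layout.
def max_occurrences(bytes_left, block_size):
    # Rough upper bound on how many more times a block of this size can occur.
    return bytes_left // block_size

def pack_pair(a, b):
    # Two counters share one 32-bit word; each must fit in 16 bits.
    assert a < (1 << 16) and b < (1 << 16)
    return (a << 16) | b

def unpack_pair(word):
    return word >> 16, word & 0xFFFF

bytes_left = 500 * 1024                    # hypothetical: 500KB of file left
bound = max_occurrences(bytes_left, 16)    # 16-byte blocks
print("at most %d more occurrences; fits in 16 bits: %s" % (bound, bound < (1 << 16)))
word = pack_pair(1234, 42)
print(unpack_pair(word))                   # -> (1234, 42)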

> > Another thing to note when creating compressed *archives* is that the
> > order of files may be important. It has been documented (I don't remember
> > where) that grouping files of similar types together can yield better
> > compression. It might yield optimum compression to go through all the
> > different ways of ordering files. I suspect this is how the 7-zip (
> > http://www.7-zip.org ) algorithm outperforms other zip algorithms and
> > still maintains compatibility with all other zip archivers.
> >
> > A program or script to put files into different orders for archiving (in a
> > list or in a list passed to something like tar) is something that I have
> > been meaning to do. Going through all possible combinations would be nice,
> > but it may be faster to use something like file to group like files
> > together. File has quite a few bugs against it though (
> > http://bugs.debian.org/file ) and it may be far from an optimal choice.
> > Certainly using file would be a good place for me to start. It'd be nice if
> > someone wrote such a script before me though. ;-)
>
> Hmm... this might be more of a tar issue than a PPMd issue. Tar has
> additional problems too, since it adds extra things in the archive.
> Christian Fasshauer <mseacf@gmx.net> has done some work with tar's
> problems, and his solution was supposed to decrease the size of the Debian
> packages. Perhaps he has more suggestions of what can be done.
>
I like the extra information that tar includes, more so in source files,
and especially in "original" source files. When looking through source
code, seeing the date that a file was changed/modified/created and its
attributes can tell a lot about the history of the file. For binary
packages, the value of the information is debatable. I'm on the fence as
to whether things like the number of user/group entries available, and
the other things suggested in
http://lists.debian.org/debian-dpkg/2003/debian-dpkg-200301/msg00049.html
should be removed, but I'm leaning towards their inclusion in standard
debs, and their removal in udebs.

Glenn, did you ever get around to making a busybox dar applet? If so, what
were your results? Does it implement the suggestions of Christian
Fasshauer? If so, maybe Russell Coker could get his comparisons done.

http://bugs.debian.org/cgi-bin/bugreport.cgi?archive=no&bug=174308 is also
an interesting discussion about gtar, star... I really don't know enough
about any tar to have an opinion as to what kind of tar is best.

To save even more space, a special compressor with a dictionary of Debian
blocks and/or statistics could be used. If, for example, only a few
users/groups etc. were used, then such a dictionary would pick up on this
and be used. As far as compression goes, if there's a common dictionary
for all the files, then the change in space based on what kind of tar/tar
features are used would be minimal. The trouble is that gzip, bzip2, PPMd
and the like won't work from common dictionaries. Dictionary importing,
choosing, etc. is something that I am going to be writing into my
compressor.
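
To show what a common dictionary could buy, here is a rough sketch using
zlib's preset-dictionary support (the zlib library accepts one even though
the gzip tool and file format don't expose it). The Debian-ish dictionary
contents below are made up purely for the example:

import zlib

# Hypothetical shared dictionary: byte strings that recur across many debs
# (control-file field names, common paths, typical owner strings, ...).
DEBIAN_DICT = (b"Package: Version: Architecture: Maintainer: Depends: "
               b"root/root ./usr/share/doc/ ./usr/bin/ changelog.Debian.gz")

def compress_with_dict(data, zdict):
    c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                         zlib.Z_DEFAULT_STRATEGY, zdict)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob, zdict):
    d = zlib.decompressobj(zlib.MAX_WBITS, zdict)
    return d.decompress(blob) + d.flush()

sample = b"Package: hello\nVersion: 1.0\nArchitecture: i386\nMaintainer: nobody\n"
with_dict = compress_with_dict(sample, DEBIAN_DICT)
plain = zlib.compress(sample, 9)
print("%d bytes with the dictionary, %d bytes without" % (len(with_dict), len(plain)))
assert decompress_with_dict(with_dict, DEBIAN_DICT) == sample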

I'm suggesting a lossless change to the way tar is used, not a change to
tar, and not a change in what is stored in tar archives. I'm suggesting
that the files be passed to tar in a different order (or several different
orders) so that they may be stored in different orders. Although
technically the order in which files are stored is a piece of information
(so technically this idea is lossy), it's extremely unimportant, and
playing with it can dramatically change the resulting archive. The input
to and output from the tar files would be the same (lossless); the only
difference would be the order in which the files were stored.
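
Something along these lines is what I had in mind (a minimal sketch;
sorting by extension is just a cheap stand-in for grouping by `file'
output, and GNU tar's -T/--files-from takes the member order from the
list):

#!/usr/bin/env python
# Minimal sketch of the reordering idea: walk a tree, group the files by a
# cheap proxy for their type (the extension here), and hand tar the sorted
# list so similar files end up adjacent in the archive.  The archive
# contents are unchanged; only their order differs.
import os
import subprocess
import sys

def ordered_file_list(top):
    paths = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    # Sort by (extension, path) so files of the same type sit next to each other.
    return sorted(paths, key=lambda p: (os.path.splitext(p)[1], p))

if __name__ == '__main__':
    top, archive = sys.argv[1], sys.argv[2]
    with open("filelist.txt", "w") as f:
        f.write("\n".join(ordered_file_list(top)) + "\n")
    # GNU tar reads the member order from the list; --no-recursion keeps it exact.
    subprocess.call(["tar", "--no-recursion", "-cf", archive, "-T", "filelist.txt"])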

     Drew Daniels


