Re: Possible abuse of dpkg-deb -z9 for xz compressed binary packages



On 2014-09-25 Henrique de Moraes Holschuh wrote:
> On Thu, 25 Sep 2014, Riku Voipio wrote:
> > On Wed, Sep 24, 2014 at 03:18:02PM -0300, Henrique de Moraes
> > Holschuh wrote:
> > > OTOH, using -z9 on datasets smaller than the -z8 dictionary size
> > > *is* a waste of memory (I don't know about cpu time, and xz(1)
> > > doesn't say anything on that matter).  The same goes for -z8 and
> > > datasets smaller than -z7 dictionary size, and so on.
> >  
> > > It is rather annoying that xz is not smart enough to detect this
> > > and downgrade the -z level when it is given a seekable file
> > > (which it can stat() and know the full size of beforehand).
> > 
> > This wouldn't seem too hard to implement in xz - have you asked
> > upstream about it?
> 
> No, I haven't.  Feel free to do it!

This is a known issue. It's not too hard to fix if it's acceptable
that the *same* xz binary creates different output with the same
compression options depending on whether the input size is known or
unknown. Most of the time this doesn't matter, but sometimes it can
be at least mildly annoying.
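
As a rough sketch of what such downgrading could look like (this is
only my illustration using liblzma's public API, not actual xz code,
and init_options() is a made-up helper), the encoder could clamp the
preset's dictionary size to the input size when the input is a
regular file:

    #include <stdint.h>
    #include <sys/stat.h>
    #include <lzma.h>

    /* Pick LZMA2 options for a preset, shrinking the dictionary when
     * the whole input is known to be smaller than it. */
    static int init_options(lzma_options_lzma *opt, uint32_t preset,
                            int fd)
    {
        if (lzma_lzma_preset(opt, preset))
            return -1;  /* unsupported preset */

        struct stat st;
        if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode)
                && (uint64_t)st.st_size < opt->dict_size) {
            /* LZMA2 requires at least LZMA_DICT_SIZE_MIN (4 KiB). */
            uint32_t d = (uint32_t)st.st_size;
            if (d < LZMA_DICT_SIZE_MIN)
                d = LZMA_DICT_SIZE_MIN;
            opt->dict_size = d;
        }

        return 0;
    }

Since the dictionary size is stored in the .xz headers, this is
exactly where the "different output" would come from.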

If the input size is unknown but the output is seekable, one could
even go back and rewrite the header after compression. The problem of
different output from the same xz version remains, though.

Maybe there could be an option to enable this or an option to turn it
off, depending on which behavior is the default. I don't promise
anything now.

LZMA Utils created different output depending on whether the input
size was known, but that was for a different reason.

XZ Utils <= 4.999.9beta created different, equally valid output on
little and big endian systems. People did complain about that, so it
was changed.

When compressing or decompressing multiple files with the same encoder
or decoder instance, using the same dictionary size for all files
avoids reallocating memory in liblzma, which can be good for
performance with tiny files. It's not the most typical use case,
though.
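
A sketch of what I mean, assuming liblzma (compress_one_file() is a
placeholder for the usual lzma_code() loop and isn't shown here):

    #include <stddef.h>
    #include <lzma.h>

    /* Placeholder: runs the normal read/lzma_code()/write loop. */
    void compress_one_file(lzma_stream *strm, const char *path);

    static void compress_many(const char *const *paths, size_t n)
    {
        lzma_stream strm = LZMA_STREAM_INIT;

        for (size_t i = 0; i < n; ++i) {
            /* Re-initializing the same lzma_stream with identical
             * settings lets liblzma reuse the earlier allocations
             * instead of freeing and reallocating the dictionary
             * for every file. */
            if (lzma_easy_encoder(&strm, 6, LZMA_CHECK_CRC64)
                    != LZMA_OK)
                break;

            compress_one_file(&strm, paths[i]);
        }

        lzma_end(&strm);  /* free all coder memory once at the end */
    }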

Using a uselessly high compression level wastes some encoder memory
and makes the decoder allocate unneeded memory. However, as long as
resource limits allow the allocations to succeed, the actual
decompressor memory usage won't differ, and the difference on the
encoder side isn't huge either. This is because kernels don't
physically back a large allocation all at once: the pages are
allocated in steps as they actually get used, not the whole buffer up
front. You can see this in "top" right after launching xz: VIRT
doesn't change while RES keeps growing.
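
Here is a tiny standalone demo of the same effect (my illustration,
nothing to do with xz itself); run it and watch the process in
top(1):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t size = 512 << 20;        /* reserve 512 MiB */
        char *buf = malloc(size);
        if (buf == NULL)
            return 1;

        sleep(10);                      /* VIRT is up, RES is not */

        for (size_t i = 0; i < size; i += 64 << 20) {
            memset(buf + i, 1, 64 << 20);   /* touch 64 MiB... */
            sleep(2);                       /* ...RES steps up */
        }

        free(buf);
        return 0;
    }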

A uselessly high compression level doesn't affect encoding speed with
the preset levels 6-9 (compressing tiny files may be an exception,
but then it only matters when compressing very many files). There's
no effect on decoder speed.

When compressing official Debian packages, I think one should first
decide what to put in the "RAM (minimal)" column of the Debian system
requirements, then choose the xz compression level based on
decompressor memory usage, and use that level for all packages.
(Maybe some big packages that won't run on a low-end system anyway
could use a higher compression level if it improves compression.) For
example, if 64 MiB of RAM is the minimum, then xz -8 (32 MiB
dictionary) is the highest possibly acceptable level, but xz -7
(16 MiB) or even xz -6 (8 MiB) would be safer.
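
liblzma can report those numbers directly. A small standalone example
(these are real liblzma functions; build with -llzma):

    #include <stdio.h>
    #include <stdint.h>
    #include <lzma.h>

    int main(void)
    {
        /* Encoder and decoder memory requirements of presets 6-9 */
        for (uint32_t p = 6; p <= 9; ++p)
            printf("xz -%u: encoder %6.1f MiB, decoder %5.1f MiB\n",
                   (unsigned)p,
                   lzma_easy_encoder_memusage(p) / (1024.0 * 1024.0),
                   lzma_easy_decoder_memusage(p) / (1024.0 * 1024.0));

        return 0;
    }

The decoder number is roughly the dictionary size plus a small
constant overhead.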

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

