Re: Possible abuse of dpkg-deb -z9 for xz compressed binary packages



On 2014-09-25 Henrique de Moraes Holschuh wrote:
> On Thu, 25 Sep 2014, Riku Voipio wrote:
> > On Wed, Sep 24, 2014 at 03:18:02PM -0300, Henrique de Moraes
> > Holschuh wrote:
> > > OTOH, using -z9 on datasets smaller than the -z8 dictionary size
> > > *is* a waste of memory (I don't know about cpu time, and xz(1)
> > > doesn't say anything on that matter).  The same goes for -z8 and
> > > datasets smaller than -z7 dictionary size, and so on.
> >  
> > > It is rather annoying that xz is not smart enough to detect this
> > > and downgrade the -z level when it is given a seekable file
> > > (which it can stat() and know the full size of beforehand).
> > 
> > This wouldn't seem too hard to implement in xz - have you asked
> > upstream about it?
> 
> No, I haven't.  Feel free to do it!

This is a known issue. It's not too hard to fix if it's acceptable
that the *same* xz binary creates different output with the same
compression options depending on whether the input size is known or
unknown. Most of the time this doesn't matter, but sometimes it can
be at least mildly annoying.
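
As a rough sketch of what such downgrading could look like (this is
only my illustration using liblzma's public API, not actual xz code,
and init_options() is a made-up helper), the encoder could clamp the
preset's dictionary size to the input size when the input is a
regular file:

    #include <stdint.h>
    #include <sys/stat.h>
    #include <lzma.h>

    /* Pick LZMA2 options for a preset, shrinking the dictionary when
     * the whole input is known to be smaller than it. */
    static int init_options(lzma_options_lzma *opt, uint32_t preset,
                            int fd)
    {
        if (lzma_lzma_preset(opt, preset))
            return -1;  /* unsupported preset */

        struct stat st;
        if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode)
                && (uint64_t)st.st_size < opt->dict_size) {
            /* LZMA2 requires at least LZMA_DICT_SIZE_MIN (4 KiB). */
            uint32_t d = (uint32_t)st.st_size;
            if (d < LZMA_DICT_SIZE_MIN)
                d = LZMA_DICT_SIZE_MIN;
            opt->dict_size = d;
        }

        return 0;
    }

Since the dictionary size is stored in the .xz headers, this is
exactly where the "different output" would come from.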

If the input size is unknown but the output is seekable, one could
even go back and rewrite the header after compression. The problem of
different output from the same xz version remains, though.

Maybe there could be an option to enable this or an option to turn it
off, depending on which behavior is the default. I don't promise
anything now.

LZMA Utils created different output depending on whether the input
size was known, but that was for a different reason.

XZ Utils <= 4.999.9beta created different, equally valid output on
little and big endian systems. People did complain about that, so it
was changed.

When compressing or decompressing multiple files with the same encoder
or decoder instance, using the same dictionary size for all files
avoids reallocating memory in liblzma, which can be good for
performance with tiny files. It's not the most typical use case,
though.
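
A sketch of what I mean, assuming liblzma (compress_one_file() is a
placeholder for the usual lzma_code() loop and isn't shown here):

    #include <stddef.h>
    #include <lzma.h>

    /* Placeholder: runs the normal read/lzma_code()/write loop. */
    void compress_one_file(lzma_stream *strm, const char *path);

    static void compress_many(const char *const *paths, size_t n)
    {
        lzma_stream strm = LZMA_STREAM_INIT;

        for (size_t i = 0; i < n; ++i) {
            /* Re-initializing the same lzma_stream with identical
             * settings lets liblzma reuse the earlier allocations
             * instead of freeing and reallocating the dictionary
             * for every file. */
            if (lzma_easy_encoder(&strm, 6, LZMA_CHECK_CRC64)
                    != LZMA_OK)
                break;

            compress_one_file(&strm, paths[i]);
        }

        lzma_end(&strm);  /* free all coder memory once at the end */
    }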

Using a uselessly high compression level wastes some encoder memory
and makes the decoder allocate unneeded memory. However, as long as
resource limits allow the allocations to succeed, the actual
decompressor memory usage won't differ, and the difference on the
encoder side isn't huge either. This is because kernels don't
physically back a large allocation all at once: the pages are
allocated in steps as they actually get used, not the whole buffer up
front. You can see this in "top" right after launching xz: VIRT
doesn't change while RES keeps growing.
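
Here is a tiny standalone demo of the same effect (my illustration,
nothing to do with xz itself); run it and watch the process in
top(1):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t size = 512 << 20;        /* reserve 512 MiB */
        char *buf = malloc(size);
        if (buf == NULL)
            return 1;

        sleep(10);                      /* VIRT is up, RES is not */

        for (size_t i = 0; i < size; i += 64 << 20) {
            memset(buf + i, 1, 64 << 20);   /* touch 64 MiB... */
            sleep(2);                       /* ...RES steps up */
        }

        free(buf);
        return 0;
    }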

A uselessly high compression level doesn't affect encoding speed with
the preset levels 6-9 (compressing tiny files may be an exception,
but then it only matters when compressing very many files). There's
no effect on decoder speed.

When compressing official Debian packages, I think one should first
decide what to put in the "RAM (minimal)" column of the Debian system
requirements, then choose the xz compression level based on
decompressor memory usage, and use that level for all packages.
(Maybe some big packages that won't run on a low-end system anyway
could use a higher compression level if it improves compression.) For
example, if 64 MiB of RAM is the minimum, then xz -8 (32 MiB
dictionary) is the highest possibly acceptable level, but xz -7
(16 MiB) or even xz -6 (8 MiB) would be safer.
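
liblzma can report those numbers directly. A small standalone example
(these are real liblzma functions; build with -llzma):

    #include <stdio.h>
    #include <stdint.h>
    #include <lzma.h>

    int main(void)
    {
        /* Encoder and decoder memory requirements of presets 6-9 */
        for (uint32_t p = 6; p <= 9; ++p)
            printf("xz -%u: encoder %6.1f MiB, decoder %5.1f MiB\n",
                   (unsigned)p,
                   lzma_easy_encoder_memusage(p) / (1024.0 * 1024.0),
                   lzma_easy_decoder_memusage(p) / (1024.0 * 1024.0));

        return 0;
    }

The decoder number is roughly the dictionary size plus a small
constant overhead.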

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

