
Re: Adding support for LZIP to dpkg, using that instead of xz, archive wide



Hi!

On Sun, 2015-06-14 at 16:48:21 +0200, Vincent Lefevre wrote:
> On 2015-06-14 05:46:00 +0200, Guillem Jover wrote:
> > On Sun, 2015-06-14 at 01:08:29 +0200, Thomas Goirand wrote:
> > > On 06/13/2015 10:55 AM, Paul Wise wrote:
> > > > On Sat, Jun 13, 2015 at 4:23 PM, Thomas Goirand wrote:
> > > >> As a friend puts it:
> > > >> 
> > > >> "This is a fundamental problem/defect with xz. This (and a lot of
> > > >> other such defects, e.g. non-robustness of xz archives that easily
> > > >> lead to file corruption etc) are the reason that there is lzip (and
> > > >> which is why gnu.org has, on a technical basis, decided that lzip is
> > > >> official gzip-successor for gnu software releases when they come in
> > > >> tarballs).
> > 
> > TBH this smells like FUD. For example I've never heard of corruption in
> > .xz files due to non-robustness, I'd expect that corruption to come from
> > external forces, and that integrity would help or not detect it.
> 
> xz-utils (4.999.9beta-1) experimental; urgency=low
> 
>   [ Jonathan Nieder ]
>   * New upstream release.
>      - Fix a data corruption in the compression code. (Closes: #544872)
> [...]
> 
> But of course, this is old,

Yes, that was even before dpkg started to use xz-utils to handle .xz
files.

> and any compression software can have the
> same kind of bug (possibly unless proved formally).

And in any case I don't see how this is a "fundamental problem" with
the format; it is simply a bug in a beta version, although an
unfortunate one.

> However lzip compresses better, sometimes much better:
> 
> -rw-r----- 1 vinc17 vinc17   822474 2015-04-26 00:45:51 mail.log.lz
> -rw-r----- 1 vinc17 vinc17   915544 2015-04-26 00:45:51 mail.log.xz

Oh, interesting, this didn't use to be the case when we added .xz
support to dpkg.

> (this example is a postfix mail log) and uses much less memory for
> compression:
> 
> $ sh -c 'ulimit -v 200000; lzip -9 < mail.log > /dev/null'
> $ sh -c 'ulimit -v 800000; xz -9 < mail.log > /dev/null'
> xz: (stdin): Cannot allocate memory
> $ sh -c 'ulimit -v 800000; xz -9 < /dev/null > /dev/null'
> xz: (stdin): Cannot allocate memory
> 
> Note: see the 200000 for lzip and 800000 for xz.

The preset levels do not match between lzip and xz. For example, at -9 xz
uses a dictionary size of 64 MiB, while lzip uses 32 MiB. Other parameters
are probably quite different too. In addition, lzip seems to be
substantially slower than xz, at least when compressing at the same
preset level. With a small PDF file it took more than twice as long:

,---
$ cat Posix_1003.1e-990310.pdf >/dev/null
$ ls -la Posix_1003.1e-990310.pdf
-rw-r----- 1 guillem guillem 3486116 Feb 20 16:43 Posix_1003.1e-990310.pdf
$ /usr/bin/time xz -9k Posix_1003.1e-990310.pdf
1.24user 0.07system 0:01.31elapsed 99%CPU (0avgtext+0avgdata 98748maxresident)k
0inputs+3520outputs (0major+24291minor)pagefaults 0swaps
$ rm -f Posix_1003.1e-990310.pdf.xz
$ /usr/bin/time xz -9k Posix_1003.1e-990310.pdf
1.25user 0.06system 0:01.31elapsed 99%CPU (0avgtext+0avgdata 98952maxresident)k
0inputs+3520outputs (0major+24295minor)pagefaults 0swaps
$ ls -la Posix_1003.1e-990310.pdf.xz
-rw-r----- 1 guillem guillem 1801372 Feb 20 16:43 Posix_1003.1e-990310.pdf.xz
$ rm -f Posix_1003.1e-990310.pdf.xz
#
$ /usr/bin/time lzip -9k Posix_1003.1e-990310.pdf
2.93user 0.02system 0:02.96elapsed 99%CPU (0avgtext+0avgdata 37628maxresident)k
0inputs+3520outputs (0major+8957minor)pagefaults 0swaps
$ rm -f Posix_1003.1e-990310.pdf.lz
$ /usr/bin/time lzip -9k Posix_1003.1e-990310.pdf
2.94user 0.03system 0:02.98elapsed 99%CPU (0avgtext+0avgdata 37576maxresident)k
0inputs+3520outputs (0major+8955minor)pagefaults 0swaps
$ ls -la Posix_1003.1e-990310.pdf.lz
-rw-r----- 1 guillem guillem 1798338 Feb 20 16:43 Posix_1003.1e-990310.pdf.lz
$ rm -f Posix_1003.1e-990310.pdf.lz
`---

With the linux sources:

,---
$ cat linux_4.0.4.orig.tar >/dev/null
$ ls -la linux_4.0.4.orig.tar
-rw-r--r-- 1 guillem guillem 582932480 May 26 20:15 linux_4.0.4.orig.tar
$ /usr/bin/time lzip -k9 linux_4.0.4.orig.tar
619.52user 1.27system 10:21.95elapsed 99%CPU (0avgtext+0avgdata 363168maxresident)k
24inputs+156680outputs (0major+90387minor)pagefaults 0swaps
$ ls -la linux_4.0.4.orig.tar.lz
-rw-r--r-- 1 guillem guillem 80218126 May 26 20:15 linux_4.0.4.orig.tar.lz
$ rm -f linux_4.0.4.orig.tar.lz
$ /usr/bin/time lzip -k9 linux_4.0.4.orig.tar
618.94user 1.10system 10:21.02elapsed 99%CPU (0avgtext+0avgdata 363180maxresident)k
8inputs+156680outputs (0major+90389minor)pagefaults 0swaps
$ rm -f linux_4.0.4.orig.tar.lz
#
$ /usr/bin/time xz -k9 linux_4.0.4.orig.tar
514.76user 1.53system 8:37.22elapsed 99%CPU (0avgtext+0avgdata 691428maxresident)k
176inputs+156656outputs (1major+172417minor)pagefaults 0swaps
$ ls -la linux_4.0.4.orig.tar.xz 
-rw-r--r-- 1 guillem guillem 80205900 May 26 20:15 linux_4.0.4.orig.tar.xz
$ rm -f linux_4.0.4.orig.tar.xz
$ /usr/bin/time xz -k9 linux_4.0.4.orig.tar
515.96user 1.62system 8:38.60elapsed 99%CPU (0avgtext+0avgdata 691328maxresident)k
56inputs+156656outputs (0major+172413minor)pagefaults 0swaps
$ rm -f linux_4.0.4.orig.tar.xz
`---
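
To put the two runs side by side, here is a rough sketch of the arithmetic,
using the first run of each tool from the transcripts above. The ratio
helper is a made-up name for illustration; elapsed times are converted to
seconds by hand, and the maxresident figures are taken to be KiB as
reported by GNU time:

```shell
#!/bin/sh
# ratio A B: print A/B rounded to two decimals (hypothetical helper).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# PDF run: elapsed seconds, then peak resident memory.
echo "pdf   time lzip/xz: $(ratio 2.96 1.31)"
echo "pdf   RSS  xz/lzip: $(ratio 98748 37628)"

# Linux tarball run: 10:21.95 and 8:37.22 converted to seconds.
echo "linux time lzip/xz: $(ratio 621.95 517.22)"
echo "linux RSS  xz/lzip: $(ratio 691428 363168)"

# Compressed sizes in bytes, lzip relative to xz.
echo "pdf   size lzip/xz: $(ratio 1798338 1801372)"
echo "linux size lzip/xz: $(ratio 80218126 80205900)"
```

These are single-machine, single-file numbers, so they only hint at the
general picture.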

So the comparison does not seem entirely fair. It seems to me to be a
matter of tradeoffs: in these runs lzip used less memory but took longer,
for nearly identical output sizes.
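
A fairer rerun would probably pin both tools to the same dictionary size.
A minimal sketch, assuming that xz's --lzma2=preset=9,dict=32MiB keeps the
rest of the -9 preset and only shrinks the dictionary, and that lzip -9
uses 32 MiB as mentioned above. compare_matched is a made-up helper name,
and the remaining parameters may still differ between the two tools:

```shell
#!/bin/sh
# compare_matched FILE: compress FILE with xz and lzip using a 32 MiB
# dictionary for both, and print the compressed sizes in bytes.
# Hypothetical helper; nothing is written next to the input file.
compare_matched() {
    f=$1
    for tool in xz lzip; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: not installed" >&2
            return 1
        fi
    done
    # xz: start from preset 9, then override only the dictionary size.
    printf 'xz   (dict 32 MiB): %s bytes\n' \
        "$(xz -c --lzma2=preset=9,dict=32MiB "$f" | wc -c)"
    # lzip: -9 is assumed to already use a 32 MiB dictionary.
    printf 'lzip (dict 32 MiB): %s bytes\n' \
        "$(lzip -c -9 "$f" | wc -c)"
}
```

Something like "compare_matched linux_4.0.4.orig.tar" would then print one
line per tool; wrapping each compressor in /usr/bin/time as above would
add the timing and memory side.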

Thanks,
Guillem

