
Compression ratios -- gz, bzip2 and xz



Summary:  in this test, "xz -9e" compressed a big source best,
to 7.7% of the original size against 19.4% for plain gzip.

This perhaps isn't news to many on the present list, but it
seems worth posting anyway, if only for illustration.  Data and
trivia follow.

On Hideki Yamane's and Henrique Holschuh's advice, to save
space in Debian's archive, I have been aggressively compressing
a big *.orig.tar source.  Hideki and Henrique seem to be right:
the choice of compression technique matters.

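My results (sizes in bytes; FRACT. is the compressed size as a
percentage of the original, RATIO is original to compressed):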

FRACT. RATIO  SIZE      METHOD   FILE
100.0%  1.0:1 425287680          *.orig.tar
 19.4%  5.2:1  82320980 gzip     *.orig.tar.gz
 19.2%  5.2:1  81542973 gzip -9  *.orig.tar.gz
 14.5%  6.9:1  61704587 bzip2    *.orig.tar.bz2
 13.9%  7.2:1  58917455 bzip2 -9 *.orig.tar.bz2
  8.8% 11.4:1  37249420 xz       *.orig.tar.xz
  8.4% 11.9:1  35849616 xz -7    *.orig.tar.xz
  8.3% 12.0:1  35473920 xz -8    *.orig.tar.xz
  8.1% 12.3:1  34626532 xz -8e   *.orig.tar.xz
  7.9% 12.7:1  33571868 xz -9    *.orig.tar.xz
  7.7% 13.0:1  32685680 xz -9e   *.orig.tar.xz
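
For anyone who wants to check the arithmetic, here is a minimal
sketch of how the FRACT. and RATIO columns follow from the SIZE
column.  The function name summarize() is only illustrative, and
just three of the rows above are repeated.

```python
def summarize(original_size, compressed_size):
    """Return (percent-of-original, ratio) strings for one table row."""
    fraction = compressed_size / original_size
    ratio = original_size / compressed_size
    return f"{100 * fraction:.1f}%", f"{ratio:.1f}:1"


ORIGINAL = 425287680  # bytes, the uncompressed *.orig.tar from the table

# Three of the rows above, sizes in bytes.
SIZES = {"gzip": 82320980, "bzip2 -9": 58917455, "xz -9e": 32685680}

for method, size in SIZES.items():
    fract, ratio = summarize(ORIGINAL, size)
    print(f"{fract:>6} {ratio:>6} {size:>9} {method}")
```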

This 0.4-GiB *.orig.tar source happens to consist of W3C web
standards documents in HTML format.  It is marked-up text with
some PNG and SVG graphics.  (Its filename on my laptop is
w3-recs_20161202.orig.tar, if you want to know; but this exact
file exists only on my laptop at the moment, so don't go
looking for it in the archive.)  As far as compressibility
goes, such an *.orig.tar might be fairly typical for Debian.

According to the xz(1) man page, "xz -9e" is useful only on
files larger than 32 MiB, so I would not advocate using
the -9e option by default.  Indeed, I am not advocating
anything at all, except that the above results might interest
some people.  In this test, on a big source, "xz -9e" wins.

But isn't "xz -9e" too slow?  Answer:  well, it was indeed
slowest of the several methods tried, but still took less
than five *minutes* on my laptop, compared against 10 *seconds*
for plain old "gzip".  Yet, even if "xz -9e" had taken
five *hours* (it didn't), it would probably still have been
worth doing to save the archive space.  Decompression, of
course, is quick, at less than three seconds (though
decompression of the *.orig.tar.gz, two seconds, is admittedly
even quicker).
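
If you want to try a similar comparison on your own data, the
following is a rough sketch using Python's standard gzip, bz2 and
lzma modules.  It is not what I ran for the table above (those
figures came from the command-line tools), so sizes and timings
will differ somewhat.  The function name compare() and its
xz_preset parameter are made up for illustration; the default
preset, 9 | lzma.PRESET_EXTREME, is meant to correspond to
"xz -9e".

```python
import bz2
import gzip
import lzma
import time


def compare(data, xz_preset=9 | lzma.PRESET_EXTREME):
    """Compress `data` (bytes) three ways.

    Returns {method name: (compressed size in bytes, seconds taken)}.
    """
    methods = {
        "gzip -9": lambda d: gzip.compress(d, compresslevel=9),
        "bzip2 -9": lambda d: bz2.compress(d, compresslevel=9),
        "xz": lambda d: lzma.compress(d, preset=xz_preset),
    }
    results = {}
    for name, compress in methods.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(compressed), elapsed)
    return results
```

To use it, read a tarball into memory as bytes and pass it to
compare().  Note that this holds the whole file in memory and that
the highest xz presets need a good deal of memory on top of that,
so a lower xz_preset may be wiser on a small machine.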

So, if that is interesting, there it is.  I really don't know
anything else about this, so if there are questions, Hideki,
Henrique or others may be better placed to answer.  Still, the
results seemed worth a post.

(I am not subscribed to this list, so feel free to Cc me.)

References: Hideki [1]; Henrique [2].

    1: https://lists.debian.org/debian-dpkg/2012/08/msg00027.html
    2: https://lists.debian.org/debian-devel/2016/10/msg00748.html
