Re: Adding support for LZIP to dpkg, using that instead of xz, archive wide

To: Guillem Jover <guillem@debian.org>
Cc: debian-devel@lists.debian.org
Subject: Re: Adding support for LZIP to dpkg, using that instead of xz, archive wide
From: Antonio Diaz Diaz <antonio@gnu.org>
Date: Sun, 26 Jul 2015 14:10:10 +0200
Message-id: <55B4CE22.9050304@gnu.org>
In-reply-to: <20150614034559.GA10559@gaara.hadrons.org>
References: <20150614034559.GA10559@gaara.hadrons.org>

Hello Guillem,

Guillem Jover wrote:

> TBH this smells like FUD. For example I've never heard of corruption in
> .xz files due to non-robustness, I'd expect that corruption to come from
> external forces, and that integrity would help or not detect it.

Sure, it comes from external forces, but xz does something that no other compressor does: even if the corruption does not affect the data and xz is able to produce perfectly correct output, it will report "Compressed data is corrupt" and will exit with non-zero status anyway. Just take any xz file and append a null character to it. Bzip2, gzip and lzip simply ignore the extra byte.

But not only that. Xz is the only format (of the four mentioned) whose parts need to be aligned to a multiple of four bytes. The size of an xz file must also be a multiple of four bytes. To achieve this, xz includes padding everywhere: after headers, blocks, the index, and the whole stream. The bad news is that if the (useless) padding is altered in any way, "the decoder MUST indicate an error" according to the xz format specification.
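As a small check of the alignment rule described above, the following Python sketch compresses a few short inputs and verifies that every resulting .xz stream is a multiple of four bytes long. It assumes the standard-library lzma module (which wraps liblzma rather than running the xz tool); the function name xz_alignment_report is made up for this illustration.

```python
import lzma

XZ_HEADER_MAGIC = b"\xfd7zXZ\x00"  # first 6 bytes of every .xz stream
XZ_FOOTER_MAGIC = b"YZ"            # last 2 bytes of every .xz stream


def xz_alignment_report(payload: bytes) -> dict:
    """Compress payload as a single .xz stream and report its alignment.

    Hypothetical helper: it only inspects the total size and the magic
    bytes, it does not parse the individual padding fields.
    """
    stream = lzma.compress(payload, format=lzma.FORMAT_XZ)
    return {
        "size": len(stream),
        "multiple_of_four": len(stream) % 4 == 0,
        "header_magic_ok": stream[:6] == XZ_HEADER_MAGIC,
        "footer_magic_ok": stream[-2:] == XZ_FOOTER_MAGIC,
    }


if __name__ == "__main__":
    # Inputs of 1..8 bytes, so the unpadded sizes differ from one another.
    for n in range(1, 9):
        print(n, xz_alignment_report(b"x" * n))
```

Whatever the input length, the stream size stays a multiple of four, which means padding bytes have been added somewhere in the stream.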

This is especially bad when xz is used with tar, because it makes the whole command fail and the whole archive be discarded as corrupt.

And this fragility is one of the perverse effects of the unbelievably stupid design of xz: "It is possible that there is a new field present which the decoder is not aware of, and can thus parse the Block Header incorrectly[1]".


[1] http://tukaani.org/xz/xz-file-format.txt (see 3.1.6. Header Padding)

So yes, the xz format is objectively more fragile than the other three.

> In any case .xz supports CRC32, CRC64 and SHA-256 for integrity
> checks, .lz only supports CRC32.

To begin with, the assertion that lzip "only supports CRC32" is false. Lzip provides 4-factor integrity checking: CRC32, uncompressed size, compressed size, and the value remaining in the range decoder after the decoding of the end-of-stream marker.

Do you know of any case where bzip2, gzip or lzip silently produced invalid output because of any weakness in their integrity checking?

Have you considered that maybe lzip provides optimal integrity checking, while xz is just throwing buzzwords at the naive, just as it did with LZMA2[2]? Bigger does not always mean better.


[2] http://www.nongnu.org/lzip/lzip_benchmark.html (see Lzip vs xz)

Lzip is very good at detecting errors. You may have noticed that in case of corruption, instead of the unhelpful "Compressed data is corrupt" reported by xz, lzip says something like "Decoder error at pos 1234". This leaves very little work for the CRC32 in the detection of errors.

Also, lzip reports mismatches in the four factors separately. This way, if one factor fails but the other three are OK, the corruption most probably affects the file trailer and you can consider the data to be intact. Lzip's corruption detection is robust.
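To make the "reported separately" idea concrete, here is a minimal Python sketch that reads the trailer of a single lzip member and lists each mismatching factor on its own. It assumes the version 1 lzip layout (a 6-byte header, and a 20-byte trailer holding CRC32, data size and member size, all little-endian). It does not decode the LZMA stream, so the decoded data is assumed to come from a real decoder and the range decoder check is not shown. All function names are hypothetical.

```python
import struct
import zlib

LZIP_MAGIC = b"LZIP"
HEADER_SIZE = 6    # magic (4) + version (1) + coded dictionary size (1)
TRAILER_SIZE = 20  # CRC32 (4) + data size (8) + member size (8)


def decode_dict_size(coded: int) -> int:
    """Bits 4-0: log2 of the base size; bits 7-5: sixteenths to subtract."""
    base = 1 << (coded & 0x1F)
    return base - (coded >> 5) * (base // 16)


def read_member_fields(member: bytes) -> dict:
    """Extract the header and trailer fields of one lzip member."""
    if len(member) < HEADER_SIZE + TRAILER_SIZE or member[:4] != LZIP_MAGIC:
        raise ValueError("not a lzip member (bad magic or too short)")
    crc, data_size, member_size = struct.unpack("<IQQ", member[-TRAILER_SIZE:])
    return {
        "version": member[4],
        "dict_size": decode_dict_size(member[5]),
        "crc32": crc,
        "data_size": data_size,
        "member_size": member_size,
    }


def trailer_mismatches(member: bytes, decoded: bytes) -> list:
    """Return the names of the trailer factors that disagree, one by one."""
    fields = read_member_fields(member)
    bad = []
    if fields["crc32"] != zlib.crc32(decoded):
        bad.append("crc32")
    if fields["data_size"] != len(decoded):
        bad.append("data_size")
    if fields["member_size"] != len(member):
        bad.append("member_size")
    return bad
```

If exactly one name comes back, the damage is probably in that trailer field rather than in the data, which is the reasoning given above.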

Let's run some tests. I have taken a small file (the COPYING file distributed with the lzip source), compressed it, and then tried the effect of all possible single-bit flips on the decompression. (I used the 'unzcrash' tool distributed with lziprecover.)


-rw-r--r-- 1 18025 Jun 16  2014 COPYING
-rw-r--r-- 1  6150 Jun 16  2014 COPYING.bz2
-rw-r--r-- 1  6839 Jun 16  2014 COPYING.gz
-rw-r--r-- 1  6507 Jun 16  2014 COPYING.lz
-rw-r--r-- 1  6544 Jun 16  2014 COPYING.xz
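The shape of such an exhaustive bit-flip test can be sketched in a few lines of Python. This is not unzcrash and it does not run the command-line tools; it assumes Python's standard-library gzip module as the decoder, so the counts it produces will not match the figures quoted in this message. The function names and outcome labels are invented for the sketch.

```python
import gzip
from collections import Counter


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def bitflip_survey(original: bytes) -> Counter:
    """Flip every bit of the compressed file in turn and classify results."""
    compressed = gzip.compress(original)
    outcomes = Counter()
    for bit in range(len(compressed) * 8):
        damaged = flip_bit(compressed, bit)
        try:
            result = gzip.decompress(damaged)
        except Exception as exc:
            # Any exception counts as "the decoder noticed the damage".
            outcomes["error: " + type(exc).__name__] += 1
        else:
            if result == original:
                outcomes["correct output"] += 1
            else:
                outcomes["wrong output, no error"] += 1
    return outcomes


if __name__ == "__main__":
    sample = b"This is free software; see the source for copying conditions.\n" * 20
    for label, count in sorted(bitflip_survey(sample).items()):
        print(f"{count:6d}  {label}")
```

The interesting categories are the same as in the results that follow: flips that are detected, flips that are harmless (for example in header fields the decoder does not use), and flips that silently yield wrong output.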

Bzip2 seems to depend entirely on the CRCs for the detection of errors, but it provides 2 levels of CRCs: one for each block and one for the whole stream. Of the 49200 bit-flips in COPYING.bz2, 29 were rejected (bad magic), 49163 were caught by the CRC, and 8 produced correct output with status 0.

Gzip has 2-factor integrity checking and some ability to detect format violations. Of the 54712 bit-flips in COPYING.gz, 16 were rejected (bad magic), 5 failed because of bad flags, 53952 were caught by the CRC, 26207 were caught by the uncompressed length, 540 were caught as format violations, 44 failed because of unexpected EOF, 8 failed because of unknown compression method, and 116 produced correct output with status 0.

Lzip has 4-factor integrity checking and an excellent ability to detect data errors. Of the 52056 bit-flips in COPYING.lz, 32 were rejected (bad magic), 51171 were caught by the decoder, 19 were caught by the value remaining in the range decoder, 652 failed because of unexpected EOF, 7 failed because of unsupported format version, 3 failed because of invalid dictionary size, 32 reported bad CRC, 64 reported bad uncompressed size, 64 reported bad compressed size, and 12 produced correct output with status 0.

Note that the bad CRCs and sizes reported by lzip correspond to bit-flips in the CRCs and sizes themselves. In this particular file all the bit-flips in the compressed data were caught by the decoder. The integrity information was not even needed.

Xz error messages are particularly unhelpful about how the corruption was detected. Of the 52352 bit-flips in COPYING.xz, 48 were rejected (bad magic), 52299 reported "Compressed data is corrupt", 5 failed because of unexpected EOF, and not even one produced correct output with status 0.

It is not at all clear that xz could detect errors better than lzip, but xz has a point of failure not present in the other formats. If the corruption affects the stream flags, xz won't be able to know the size of the checksum and won't be able to decode the stream.

> More over lzip was created to overcome limitations in the .lzma format,
> .xz came later and fixed the limitations of the .lzma format too.

Lzip certainly overcame the limitations in the .lzma format, but in this respect xz seems to have just changed the false negatives of lzma into false positives: from not reporting the corruption to reporting it "just in case" even if the decoding went well.

> (And I could probably switch dpkg-deb's .xz integrity check to CRC64,
> given that's the xz-utils command-line tool default.)

Have you verified whether this really helps or makes things worse? Doubling the size of the checksum also doubles the probability of false positives produced by corruption of the checksum itself.
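The arithmetic behind this can be written down in a few lines of Python. It is a back-of-the-envelope sketch under two assumptions that are not from the tests above: a single bit-flip is equally likely anywhere in the file, and the file size (6544 bytes, borrowed from COPYING.xz) stays the same for both check types, although in practice it would change by a few bytes. The function name is hypothetical.

```python
def checksum_hit_fraction(check_bytes: int, file_bytes: int) -> float:
    """Fraction of uniformly random single-bit flips that hit the check field."""
    return (check_bytes * 8) / (file_bytes * 8)


FILE_BYTES = 6544  # illustrative size, assumed equal for both variants

crc32_share = checksum_hit_fraction(4, FILE_BYTES)  # 4-byte CRC32 field
crc64_share = checksum_hit_fraction(8, FILE_BYTES)  # 8-byte CRC64 field

if __name__ == "__main__":
    print(f"CRC32 field share: {crc32_share:.6f}")
    print(f"CRC64 field share: {crc64_share:.6f}")
    print(f"ratio: {crc64_share / crc32_share:.1f}")
```

Every flip that lands in the check field damages the checksum while leaving the data intact, so under these assumptions the share of such false alarms grows in proportion to the size of the field.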

> replacing xz with lzip on .deb or .dsc packages does not make any sense.


Why? Are .deb or .dsc packages using the filters of xz or something?

> Whenever considering to add a new compressor, all surrounding tools need
> to be modified to support it as well:

The future is long. You can save a lot of work in the long term by adding lzip and deprecating the rest.

> Compressor formats are subject to network-effects like many other
> file formats. In this case I think .xz "won" both because it was the
> "official" successor from .lzma, and because it is superior to .lz.

You can repeat that .xz is superior to .lz as much as you want, but this won't make it true. The xz format is so bad that it manages to be as bad for long-term archiving as lzma-alone. Meanwhile lziprecover achieves the unprecedented feat of repairing single-byte errors without the help of any extra redundancy. Could you tell us in what respect .xz is superior to .lz?

I am not here to "win", but to help people keep their data safe. Remember that I am also the author of GNU ddrescue (whose data recovery capabilities nicely complement those of lziprecover). Given that this is Debian, I have the hope that you may think of the public interest and eventually replace xz with lzip.

It would be especially fitting for Debian to be the first distro to deprecate such a bad format, given that xz is also not copylefted. Just a few days ago we were discussing in GNU how easily non-copylefted software can be rendered non-free by proprietary licenses like the Android SDK anti-fork provision.



Best regards,
Antonio.
