
Re: Adding support for LZIP to dpkg, using that instead of xz, archive wide



Hello Guillem,

Guillem Jover wrote:
TBH this smells like FUD. For example I've never heard of corruption in
.xz files due to non-robustness, I'd expect that corruption to come from
external forces, and that integrity would help or not detect it.

Sure it comes from external forces, but xz does something that no other compressor does: even if the corruption does not affect the data and xz is able to produce perfectly correct output, it will report "Compressed data is corrupt" and will exit with non-zero status anyway. Just take any xz file and append a null character to it. Bzip2, gzip and lzip simply ignore the extra byte.

But not only that. Xz is the only format (of the four mentioned) whose parts need to be aligned to a multiple of four bytes. The size of an xz file must also be a multiple of four bytes. To achieve this, xz includes padding everywhere: after headers, blocks, the index, and the whole stream. The bad news is that if the (useless) padding is altered in any way, "the decoder MUST indicate an error" according to the xz format specification.
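The alignment rule is easy to observe from outside. The following is a minimal sketch, assuming Python's standard lzma, gzip and bz2 modules as stand-ins for the command-line tools (lzip has no standard-library binding, so it is left out). The helper name compressed_sizes is only illustrative.

```python
import bz2
import gzip
import lzma


def compressed_sizes(payload: bytes) -> dict:
    """Return the compressed size in bytes of payload for each codec."""
    return {
        "xz": len(lzma.compress(payload, format=lzma.FORMAT_XZ)),
        "gzip": len(gzip.compress(payload)),
        "bzip2": len(bz2.compress(payload)),
    }


if __name__ == "__main__":
    # Inputs of 1..8 bytes: the .xz size stays a multiple of four,
    # while the other two formats have no such constraint.
    for n in range(1, 9):
        sizes = compressed_sizes(b"a" * n)
        print(n, sizes, "xz size % 4 =", sizes["xz"] % 4)
```

Every padding byte needed to reach that alignment is a byte whose alteration the decoder must treat as an error.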

This is especially bad when xz is used with tar, because it makes the whole command fail and the whole archive be discarded as corrupt.

And this fragility is one of the perverse effects of the unbelievably stupid design of xz; "It is possible that there is a new field present which the decoder is not aware of, and can thus parse the Block Header incorrectly[1]".

[1] http://tukaani.org/xz/xz-file-format.txt (see 3.1.6. Header Padding)

So yes, the xz format is objectively more fragile than the other three.


In any case .xz supports CRC32, CRC64 and SHA-256 for integrity
checks, .lz only supports CRC32.

To begin with, the assertion that lzip "only supports CRC32" is false. Lzip provides 4 factor integrity checking: CRC32, uncompressed size, compressed size, and the value remaining in the range decoder after the decoding of the end-of-stream marker.

Do you know of any case where bzip2, gzip or lzip silently produced invalid output because of any weakness in their integrity checking?

Have you considered that maybe lzip provides optimal integrity checking, while xz is just throwing buzzwords at the naive just like it did with LZMA2[2]? Bigger does not always mean better.

[2] http://www.nongnu.org/lzip/lzip_benchmark.html (see Lzip vs xz)

Lzip is very good at detecting errors. You may have noticed that in case of corruption, instead of the unhelpful "Compressed data is corrupt" reported by xz, lzip says something like "Decoder error at pos 1234". This leaves very little work for the CRC32 in the detection of errors.

Also, lzip reports mismatches in the four factors separately. This way if one factor fails but the other three are ok, most probably the corruption affects the file trailer and you can consider the data to be intact. Lzip corruption detection is robust.
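Reporting the factors separately can be sketched as follows. This is only an illustration, not lzip's code. It assumes the member trailer layout described in the lzip format documentation (CRC32 in 4 bytes, then data size and member size in 8 bytes each, all little endian), which is not stated in this message. It covers three of the four factors, because the range decoder value is only available inside a real LZMA decoder. The names check_trailer and fake_member are hypothetical, and fake_member wraps a placeholder payload rather than a real LZMA stream.

```python
import struct
import zlib

# Assumed trailer layout: CRC32, data size, member size (little endian).
TRAILER = struct.Struct("<IQQ")


def check_trailer(member: bytes, decoded: bytes) -> dict:
    """Compare each trailer field with the decoded data, one by one."""
    crc, data_size, member_size = TRAILER.unpack(member[-TRAILER.size:])
    return {
        "crc32": crc == zlib.crc32(decoded),
        "data_size": data_size == len(decoded),
        "member_size": member_size == len(member),
    }


def fake_member(decoded: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: header + placeholder payload + trailer."""
    header = b"LZIP\x01\x0c"  # magic, version 1, coded dictionary size
    total = len(header) + len(payload) + TRAILER.size
    return header + payload + TRAILER.pack(zlib.crc32(decoded), len(decoded), total)


if __name__ == "__main__":
    data = b"hello"
    member = bytearray(fake_member(data, b"\x00" * 10))
    print(check_trailer(bytes(member), data))
    member[-TRAILER.size] ^= 0x01  # flip one bit in the stored CRC32
    print(check_trailer(bytes(member), data))
```

In the second call only the CRC32 comparison fails while both sizes still match, which is the situation where the damage most probably lies in the trailer and not in the data.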

Let's run some tests. I have taken a small file (the COPYING file distributed with the lzip source), compressed it, and then tested the effect of every possible single bit-flip on the decompression. (I used the 'unzcrash' tool distributed with lziprecover).

-rw-r--r-- 1 18025 Jun 16  2014 COPYING
-rw-r--r-- 1  6150 Jun 16  2014 COPYING.bz2
-rw-r--r-- 1  6839 Jun 16  2014 COPYING.gz
-rw-r--r-- 1  6507 Jun 16  2014 COPYING.lz
-rw-r--r-- 1  6544 Jun 16  2014 COPYING.xz
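The procedure can be approximated in a few lines. The following is a rough sketch, not unzcrash itself: it assumes Python's standard gzip and bz2 modules in place of the command-line tools, treats any exception as "error detected", and does not break the detections down by cause. The function name bit_flip_survey and the sample text are only illustrative.

```python
import bz2
import gzip


def bit_flip_survey(compressed: bytes, decompress, original: bytes) -> dict:
    """Flip every bit of compressed, one at a time, and classify the result."""
    counts = {"detected": 0, "correct_output": 0, "wrong_output": 0}
    for byte_index in range(len(compressed)):
        for bit in range(8):
            damaged = bytearray(compressed)
            damaged[byte_index] ^= 1 << bit
            try:
                output = decompress(bytes(damaged))
            except Exception:
                # The decoder refused the damaged file.
                counts["detected"] += 1
                continue
            if output == original:
                counts["correct_output"] += 1
            else:
                # Silent corruption: success reported, wrong data produced.
                counts["wrong_output"] += 1
    return counts


if __name__ == "__main__":
    text = b"All possible single-bit flips are tried on this sample.\n" * 20
    for name, codec in (("gzip", gzip), ("bzip2", bz2)):
        packed = codec.compress(text)
        print(name, len(packed) * 8, bit_flip_survey(packed, codec.decompress, text))
```

The number of trials is always eight times the compressed size in bytes, which is where totals such as 49200 for the 6150-byte COPYING.bz2 come from.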

Bzip2 seems to depend entirely on the CRCs for the detection of errors, but it provides 2 levels of CRCs: one for each block and one for the whole stream. Of the 49200 bit-flips in COPYING.bz2, 29 were rejected (bad magic), 49163 were caught by the CRC, and 8 produced correct output with status 0.

Gzip has 2 factor integrity checking and some ability to detect format violations. Of the 54712 bit-flips in COPYING.gz, 16 were rejected (bad magic), 5 failed because of bad flags, 53952 were caught by the CRC, 26207 were caught by the uncompressed length, 540 were caught as format violations, 44 failed because of unexpected EOF, 8 failed because of unknown compression method, and 116 produced correct output with status 0.

Lzip has 4 factor integrity checking and an excellent ability to detect data errors. Of the 52056 bit-flips in COPYING.lz, 32 were rejected (bad magic), 51171 were caught by the decoder, 19 were caught by the value remaining in the range decoder, 652 failed because of unexpected EOF, 7 failed because of unsupported format version, 3 failed because of invalid dictionary size, 32 reported bad CRC, 64 reported bad uncompressed size, 64 reported bad compressed size, and 12 produced correct output with status 0.

Note that the bad CRCs and sizes reported by lzip correspond to bit-flips in the CRCs and sizes themselves. In this particular file all the bit-flips in the compressed data were caught by the decoder. The integrity information was not even needed.

Xz error messages are particularly unhelpful about how the corruption was detected. Of the 52352 bit-flips in COPYING.xz, 48 were rejected (bad magic), 52299 reported "Compressed data is corrupt", 5 failed because of unexpected EOF, and not even one produced correct output with status 0.

It is not at all clear that xz could detect errors better than lzip, but xz has a point of failure not present in the other formats. If the corruption affects the stream flags, xz won't be able to know the size of the checksum and won't be able to decode the stream.


More over lzip was created to overcome limitations in the .lzma format,
.xz came later and fixed the limitations of the .lzma format too.

Lzip certainly overcame the limitations in the .lzma format, but in this respect xz seems to just have changed the false negatives of lzma into false positives: from not reporting the corruption to reporting it "just in case" even if the decoding went well.


(And I could probably switch dpkg-deb's .xz integrity check to CRC64,
given that's the xz-utils command-line tool default.)

Have you verified whether this really helps or makes things worse? Doubling the size of the checksum also doubles the probability of false positives produced by the corruption of the checksum.
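The arithmetic behind that statement, as a small sketch under a simplifying assumption that is not part of the tests above: a single bit-flip lands uniformly at random somewhere in the file, so the chance of hitting the check field is its share of the file. The 6500-byte figure for the rest of the file is hypothetical, and the function name is only illustrative.

```python
def checksum_hit_probability(other_bytes: int, check_bytes: int) -> float:
    """Chance that one uniformly placed bit-flip lands in the check field."""
    return check_bytes / (other_bytes + check_bytes)


if __name__ == "__main__":
    other = 6500  # hypothetical size of everything except the check field
    p32 = checksum_hit_probability(other, 4)  # CRC32 is 4 bytes
    p64 = checksum_hit_probability(other, 8)  # CRC64 is 8 bytes
    print(f"CRC32: {p32:.6f}  CRC64: {p64:.6f}  ratio: {p64 / p32:.3f}")
```

The ratio is slightly below 2 because the larger check also makes the file slightly longer.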


replacing xz with lzip on .deb or .dsc packages does not make any sense.

Why? Are .deb or .dsc packages using the filters of xz or something?


Whenever considering to add a new compressor, all surrounding tools need
to be modified to support it as well:

The future is long. You can save a lot of work in the long term by adding lzip and deprecating the rest.


Compressor formats are subject to network-effects like many other
file formats. In this case I think .xz "won" both because it was the
"official" successor from .lzma, and because it is superior to .lz.

You can repeat that .xz is superior to .lz as much as you want, but this won't make it true. The xz format is so bad that it manages to be as bad for long-term archiving as lzma-alone. Meanwhile lziprecover achieves the unprecedented feat of repairing single-byte errors without the help of any extra redundancy. Could you tell us in what respect .xz is superior to .lz?

I am not here to "win", but to help people keep their data safe. Remember that I am also the author of GNU ddrescue (whose data recovery capabilities nicely complement those of lziprecover). Given that this is Debian, I have the hope that you may think of the public interest and eventually replace xz with lzip.

It would be especially fitting for Debian to be the first distro to deprecate such a bad format, given that xz is also not copylefted. Just a few days ago we were discussing in GNU how easily non-copylefted software can be rendered non-free by proprietary licenses like the Android SDK anti-fork provision.


Best regards,
Antonio.

