Optionally exit with nonzero status if trailing garbage



Jakub Wilk wrote:
"Lzip will correctly decompress a file which is the concatenation
of two or more compressed files. The result is the concatenation of
the corresponding uncompressed files. Integrity testing of
concatenated compressed files is also supported."

Whatever follows a file that is not a valid header is classified as
"trailing garbage" and ignored.

Sounds like a serious design flaw that could lead to data loss.

You are about right. IMHO, this is a (not so) serious design flaw of gzip, improved somewhat by bzip2 and lzip, worsened by xz, but never properly addressed. Except in the case of xz (see below), this "flaw" is not related to any format, but just to what the decompressor should do in a situation that may involve either a corrupt header or just trailing garbage.


The attached files differ only by one bit. The output for the
corrupted file is truncated, yet there is no error or warning:

Just use "lzip -vvvv" to see the warning:

     When decompressing or testing, further -v's (up to 4) increase the
     verbosity level, showing status, compression ratio, dictionary
     size, trailer contents (CRC, data size, member size), and up to 6
     bytes of trailing garbage (if any).

BTW, lzip is the only one that shows the "trailing garbage", allowing you to determine if it is really garbage or not. In this case the "garbage" is awfully similar to a lzip signature (4C 5A 49 50):

$ lzip -tvvvv corrupted.lz
corrupted.lz: dictionary size 4 KiB. 0.100:1, 80.000 bits/byte, -900.00% saved. data CRC 7E3265A8, data size 4, member size 40. ok
  corrupted.lz: first bytes of trailing garbage found = 4D 5A 49 50 01 0C
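
The judgement call above (is this garbage or a damaged header?) can be approximated by counting how many bits separate the first trailing bytes from the lzip magic. The following is only a sketch in Python; the function names and the one-bit threshold are my own assumptions and are not part of lzip:

```python
LZIP_MAGIC = b"LZIP"  # 4C 5A 49 50


def bit_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two byte strings of equal length."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def looks_like_damaged_header(trailing: bytes, max_flipped_bits: int = 1) -> bool:
    """Hypothetical heuristic: True if the trailing data begins with bytes
    that are at most max_flipped_bits bit flips away from the lzip magic.
    An exact match also returns True; a real decoder would have tried to
    parse that as a header before calling the data trailing."""
    head = trailing[:len(LZIP_MAGIC)]
    if len(head) < len(LZIP_MAGIC):
        return False  # too short to compare against the magic
    return bit_distance(head, LZIP_MAGIC) <= max_flipped_bits


# The six bytes reported by "lzip -tvvvv corrupted.lz" above.
reported = bytes.fromhex("4D5A4950010C")
print(looks_like_damaged_header(reported))    # 4D vs 4C differs in one bit
print(looks_like_damaged_header(b"garbage"))  # plain text, far from the magic
```

Such a heuristic can only hint at a damaged header; it cannot replace looking at the bytes.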


I see that the bit-flip in corrupted.lz affects one of the magic bytes in the second member of the file.

The probability of corruption happening in the magic bytes of the second or successive members/streams is (except in the case of xz) about 4 times smaller than the probability of getting a false positive caused by corruption of the integrity information itself. It can be considered to be below the noise level. This, along with the fact that human judgement is needed to tell garbage from a corrupt header, is probably why, AFAIK, nobody has ever cared enough about it to file a feature request in bug-gzip or lzip-bug.


> Xz has broken with this tradition

Glad to hear that.

Don't be so glad about xz breaking the tradition. Xz did it because its probability of truncating the output is the highest of all, both because of its longer magic string and because of possible corruption in stream padding. (The stream padding of xz is optional, but its size has no limit).

bzip2/gzip/lzip
+========+========+========+
| member | member | member |
+========+========+========+

xz
+========+=========+========+=========+========+=========+
| stream | padding | stream | padding | stream | padding |
+========+=========+========+=========+========+=========+

Bzip2 and lzip behave optimally in the most frequent case of files with just one member/stream, where trailing garbage can't make the decoder produce incorrect output, and there is no risk in ignoring it by default. (This is, I think, the case for Debian packages.) In the four examples below, tar extracted the files correctly, but only bzip2 and lzip exited with status 0 (the string "garbage" was appended to each tarball):

$ tar -xf garbage_added.tar.bz2 ; echo $?
bzip2: (stdin): trailing garbage after EOF ignored
0

$ tar -xf garbage_added.tar.gz ; echo $?
gzip: stdin: decompression OK, trailing garbage ignored
tar: Child returned status 2
tar: Error is not recoverable: exiting now
2

$ tar -xf garbage_added.tar.lz ; echo $?
0

$ tar -xf garbage_added.tar.xz ; echo $?
xz: (stdin): Unexpected end of input
tar: Child returned status 1
tar: Error is not recoverable: exiting now
2

Note the contradictory messages in the gzip example: "decompression OK" vs "Error is not recoverable". Xz missed the point entirely.

For more advanced (but less frequent) uses like multimember or concatenated files I propose the following change:

1) Ignore trailing garbage by default, as bzip2 and lzip do now.

2) Add an option (say --trailing-error) that forces the decompressor to exit with nonzero status if any remaining input is detected after the last member.

The proposed option would catch the improbable case of corruption in the magic bytes of the second or successive members, but there is nothing the decompressor can do to catch the similarly improbable case of file truncation just after the last byte of a member/stream.
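
To make the proposed behaviour concrete, here is a rough sketch in Python. The Python standard library has no lzip support, so the sketch uses concatenated gzip members (via zlib) as a stand-in; the function name, the trailing_error parameter and the exit status 2 are assumptions chosen for illustration, not the behaviour of any existing tool:

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def decompress_members(data: bytes, trailing_error: bool = False):
    """Decompress concatenated gzip members and return (output, exit_status).

    Anything after the last member that does not start with the magic bytes
    is treated as trailing data. It is ignored by default and reported with
    a nonzero status only when trailing_error is set.
    """
    if not data.startswith(GZIP_MAGIC):
        return b"", 2  # the first header must always be valid
    out = bytearray()
    rest = data
    while rest.startswith(GZIP_MAGIC):
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip container
        try:
            out += member.decompress(rest)
        except zlib.error:
            return bytes(out), 2  # corrupt member
        if not member.eof:
            return bytes(out), 2  # member truncated in the middle
        rest = member.unused_data  # whatever follows this member
    if rest and trailing_error:
        return bytes(out), 2  # remaining input after the last member
    return bytes(out), 0


m1, m2 = gzip.compress(b"foo"), gzip.compress(b"bar")
damaged = m1 + bytes([m2[0] ^ 0x01]) + m2[1:]  # one bit flipped in the magic

print(decompress_members(m1 + m2 + b"garbage"))        # garbage ignored
print(decompress_members(damaged))                     # silently truncated
print(decompress_members(damaged, trailing_error=True))  # nonzero status
print(decompress_members(m1, trailing_error=True))     # clean truncation
```

The last call shows the limit mentioned above: a file cut exactly at a member boundary is indistinguishable from a complete file, so it yields status 0 with or without the option.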

I suggest that any replies to this message be sent to lzip-bug. I guess that discussing the behaviour of decompressors in corner cases like this is off-topic in debian-devel.


Best regards,
Antonio.

