Re: pristine tarball generator

Anthony Towns wrote:
> On Tue, Oct 02, 2007 at 04:15:44PM -0400, Joey Hess wrote:
>> BTW, the next release of pristine-tar will support generating pristine
>> gz files too, so will fully support pristine .orig.tar.gz. Regenerating
>> pristine gz files from small deltas is quite a lot trickier, and
>> currently works for about 99% of the .orig.tar.gz files in the Debian
>> archive. Many thanks to paravoid for making it happen..
> Oh wow, that's cool. Any chance of a post/blog on how that was achieved?
There are mostly two kinds of gzip files, both compatible with each other:
a) GNU gzip files, which are relatively easy; they can have:
   * the name of the original file (optional)
   * the timestamp of creation
   * a compression level ("normal", --fast, --best)
One can easily figure these out from the gzip headers and recreate them by
passing the corresponding gzip options (-n and the undocumented -m and -M).
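
As a rough illustration of reading those header fields, here is a minimal
Python sketch. It is not pristine-gz's code; the function name
read_gzip_header and the returned dictionary are made up for this example.
The field layout (flag bits, the XFL values 2 and 4, OS value 3 for Unix)
follows RFC 1952.

```python
import struct

# Flag bits and field values as defined by RFC 1952.
FEXTRA, FNAME = 0x04, 0x08
XFL_NAMES = {0: "normal", 2: "--best", 4: "--fast"}
OS_UNIX = 3


def read_gzip_header(data: bytes) -> dict:
    """Extract the header fields that matter for recreating a gzip file.

    Hypothetical helper for illustration; not part of pristine-gz.
    """
    if len(data) < 10:
        raise ValueError("too short to be a gzip stream")
    # magic (2 bytes), method, flags, mtime (4 bytes), XFL, OS
    magic, method, flags, mtime, xfl, os_flag = struct.unpack("<HBBIBB", data[:10])
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a deflate-compressed gzip stream")
    pos = 10
    if flags & FEXTRA:  # skip the optional extra field
        (xlen,) = struct.unpack("<H", data[pos:pos + 2])
        pos += 2 + xlen
    name = None
    if flags & FNAME:  # zero-terminated original file name
        end = data.index(b"\x00", pos)
        name = data[pos:end].decode("latin-1")
    return {
        "name": name,
        "mtime": mtime,
        "level": XFL_NAMES.get(xfl, xfl),  # unknown XFL values are passed through
        "os": os_flag,
    }
```

A missing name would correspond to gzip's -n, and the level entry tells
you whether to try --fast, --best or neither.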

There's also --rsyncable, which appears mostly (if not only) on Debian
and unfortunately can't be detected from the headers.

GNU gzip accounts for the vast majority of the archive.

b) zlib's gzip; the BSDs use a CLI-compatible gzip based on zlib and
most of the files in this category come from there.
zlib obviously produces different output at every compression level
because of a different algorithm.
Apart from that, since it's a library that many can easily use, there
are some really strange gzips out there; many have full or relative
paths in the original name field while others have a --best compression
level without indicating so in the headers (zlib doesn't write the
headers for you, unfortunately).
Some implementations also have a modified Operating System flag in the
gzip headers.

For this, I ported NetBSD's gzip and heavily modified it so that it can
take "expert" arguments so that you can set e.g. the OS flag or various

Unfortunately, it's not easy to tell the two implementations or their
quirks apart, so pristine-gz tries each of them in turn until one succeeds.
It tries to be smart (e.g. by not using GNU gzip if the osflag is not
Unix or if the original name contains slashes) but recognizing a gzip
may take some time.
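
The try-until-it-matches idea can be sketched in Python as below. This is
only an approximation under stated assumptions: it uses Python's zlib
binding rather than GNU gzip or the modified NetBSD gzip, so it covers only
the zlib side, and the names build_gzip, find_recipe and candidate_families
are invented for the example.

```python
import struct
import zlib

OS_UNIX = 3  # OS field value for Unix in RFC 1952


def candidate_families(name, os_flag):
    """Skip GNU gzip when the header rules it out (non-Unix OS flag or slashes in the name)."""
    families = ["zlib"]
    if os_flag == OS_UNIX and (name is None or "/" not in name):
        families.insert(0, "gnu")
    return families


def build_gzip(payload, level, name=None, mtime=0, xfl=0, os_flag=OS_UNIX):
    """Assemble a gzip member by hand so every header byte is under our control."""
    flags = 0x08 if name is not None else 0  # FNAME bit
    header = struct.pack("<HBBIBB", 0x8B1F, 8, flags, mtime, xfl, os_flag)
    if name is not None:
        header += name.encode("latin-1") + b"\x00"
    # Negative wbits asks zlib for a raw deflate stream without its own wrapper.
    deflater = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = deflater.compress(payload) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def find_recipe(original, payload, name, mtime, xfl, os_flag):
    """Return the first zlib level that reproduces `original` byte for byte, or None."""
    for level in range(1, 10):
        if build_gzip(payload, level, name, mtime, xfl, os_flag) == original:
            return level
    return None
```

Once a matching recipe is found, only the recipe needs to be stored in
order to regenerate the file later. The header values passed to
find_recipe are copied from the original file, so only the compressed
body has to be guessed.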

Something that doesn't work at the moment (and I'd be grateful for any
help) is the majority of MS-DOS and Win32/NTFS implementations.
Multipart gzips would also not work, even though I haven't found any yet.

On the first run of the tool on the whole archive, pristine-gz succeeded
in recognizing 21869 of 22566 orig.tar.gz (almost 97% of the archive).
It explicitly failed on 206 of them (0.91%) while something weird,
probably a bug, happened on the remaining 491 (2.18%).
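
For anyone who wants to recheck those percentages, a few lines of Python
are enough. The counts are the ones given above; the variable names are
only for this example.

```python
total = 22566
outcomes = {"recognized": 21869, "explicit failure": 206, "unexplained": 491}

# The three groups should account for every orig.tar.gz in the run.
assert sum(outcomes.values()) == total

# Share of each outcome as a percentage, rounded to two decimals.
shares = {label: round(100 * count / total, 2) for label, count in outcomes.items()}
for label, share in shares.items():
    print(f"{label}: {share}%")
```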

joeyh is doing another run on the archive with updated versions of both
pristine-tar and pristine-gz, so we'll have more of these nice statistics soon.

