
Re: Packages file missing from unstable archive

On Tue, Nov 01, 2005 at 09:54:09AM -0500, Michael Vogt wrote:
> My next test was to use only the data.tar.gz of the two
> archives. Zsync will extract the gzip file then and use the tar as the
> base. With that I got:
> --------------------8<------------------------
> Read data.tar.gz. Target 34.1% complete.
> used 1056768 local, fetched 938415
> --------------------8<------------------------
> The size of the data.tar.gz is 1210514. 

Fetching 938kB instead of 1210kB is a 22.5% saving. zsync reported the
target as 34.1% complete, so about 12% of the desired data was apparently
already present locally, but redownloaded anyway.
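
To make that arithmetic explicit, here is a quick sketch using only the
figures quoted above (the variable names are mine, nothing zsync reports
under those names):

```python
# Figures quoted from the zsync run above.
fetched = 938415          # bytes zsync downloaded
full_size = 1210514       # size of data.tar.gz
target_complete = 0.341   # "Target 34.1% complete" before fetching

# Fraction of the download actually avoided.
saving = 1 - fetched / full_size

# Data zsync considered present locally, but which was fetched anyway.
redownloaded = target_complete - saving

print(f"saving: {saving:.1%}")
print(f"present but refetched: {redownloaded:.1%}")
```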

> A problem is that zsync needs to be taught to deal with deb files (that
> is, it needs to unpack the data.tar and use that for the syncs).

That seems kinda awkward -- you'd need to start by downloading the
ar header, working out where in the file the data.tar.gz starts, then
redownloading from there. I guess you could include that info in the
.zsync file though. OTOH, there should be savings in the control.tar.gz
too, surely -- it'd change less than data.tar.gz most of the time, no?
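
For illustration, a minimal sketch of the "work out where data.tar.gz
starts" step, assuming the common ar layout (8-byte global magic, then a
60-byte header per member with the size in bytes 48-57, and member data
padded to an even length). The function name ar_members is made up, and
this reads from a local seekable stream rather than doing HTTP range
requests:

```python
import io

AR_MAGIC = b"!<arch>\n"

def ar_members(stream):
    """Yield (name, data_offset, size) for each member of an ar archive."""
    if stream.read(8) != AR_MAGIC:
        raise ValueError("not an ar archive")
    offset = 8
    while True:
        header = stream.read(60)
        if len(header) < 60:
            return  # end of archive
        if header[58:60] != b"`\n":
            raise ValueError("bad ar member header")
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        data_offset = offset + 60
        # Member data is padded to an even length; skip to the next header.
        offset = data_offset + size + (size % 2)
        stream.seek(offset)
        yield name, data_offset, size
```

Only the headers get read, so in principle a client would need the first
60-odd bytes of each member, not the members themselves, before it knows
which byte range to ask for.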

How much zsync data is required for that 22.5% saving over 1MB? I guess
it'd be about 16 bytes per 4k of uncompressed data; assuming 33%
compression, that's roughly 16 bytes per 3kB of .deb, or about 0.5%
overhead. For 100GB of debs in the archive, that's about an extra half
gig of space used.
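
The same estimate as a sketch; every input here is a guess from the
paragraph above, not something measured from a real .zsync file:

```python
checksum_bytes_per_block = 16     # guessed size of zsync data per block
compressed_per_block = 3 * 1024   # rough "3kB" of .gz per 4k uncompressed
archive_size = 100 * 10**9        # 100GB of debs

# Overhead of the .zsync data relative to the compressed file.
overhead = checksum_bytes_per_block / compressed_per_block

# Extra space needed across the whole archive.
extra_bytes = overhead * archive_size

print(f"overhead: {overhead:.2%}")
print(f"extra space: {extra_bytes / 10**9:.2f} GB")
```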

Hrm, thinking about it, I guess zsync probably works by storing the
state of the gzip table at certain points in the file and doing a
rolling hash of the contents and recompressing each chunk of the file;
that'd result in the size of the .gz not necessarily being the same, let
alone the md5sum.
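
For reference, this is the kind of rolling hash I have in mind: a sketch
of the rsync-style weak checksum, which can slide along a file one byte
at a time without rehashing the whole block. Whether zsync uses exactly
this form is an assumption on my part; the function names are made up.

```python
M = 1 << 16  # modulus used by the rsync-style weak checksum

def weak_checksum(block: bytes):
    """Compute the (a, b) pair for a block from scratch."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b
```

Rolling costs a constant amount of work per byte, which is what makes it
feasible to look for a known block at every offset of the local file.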

Feh, trying to verify this with ~512kB of random data, gzipped, I just
keep getting "Aborting, download available in zsyncnew.gz.part". That's
not terribly reassuring. And trying it with gzipped text data, I get
stuck on 99.0%, with zsync repeatedly requesting around 700 bytes.

Anyway, if it's recompressing like I think, there's no way to get the
same compressed md5sum -- even if the information could be transferred,
there's no guarantee the local gzip _can_ produce the same output as
the remote gzip -- imagine, e.g., that it had used gzip -9 and your local
gzip only supports -1 through -5.
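
A small sketch of the underlying problem, using Python's gzip module
rather than the gzip binary: the same input compressed at two levels
gives different bytes (and so different md5sums), even though both
decompress to identical contents. The sample data is made up.

```python
import gzip
import hashlib
import io

def gzip_bytes(data: bytes, level: int) -> bytes:
    """Gzip data at the given level, with mtime pinned so runs are repeatable."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=level, mtime=0) as f:
        f.write(data)
    return buf.getvalue()

# Made-up sample data standing in for a tarball.
data = b"".join(b"file-%d: some fairly repetitive contents\n" % i
                for i in range(2000))

fast = gzip_bytes(data, 1)
best = gzip_bytes(data, 9)

# Compare the md5sums of the compressed files, then the uncompressed contents.
print(hashlib.md5(fast).hexdigest() == hashlib.md5(best).hexdigest())
print(gzip.decompress(fast) == gzip.decompress(best) == data)
```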

Hrm, it probably also means that mirrors can't use zsync -- that is,
if you zsync fooA to fooB you probably can't use fooA.zsync to zsync
from fooB to fooC.

Anyway, just because you get a different file, that doesn't mean it'll
act differently; so we could just use an "authentication" mechanism
that reflects that. That might involve providing sizes and sha1s of the
uncompressed contents of the ar in the packages file, instead of the
md5sum of the ar. Except the previous note probably means that you'd
still need to use the md5sum of the .deb to verify mirrors; which means
mirrors and users would have different ways of verifying their
downloads, which is probably fairly undesirable.

Relatedly, mirrors (and apt-proxy users, etc) need to provide Packages.gz
of a particular md5sum/size, so they can't use Packages.diff to speed
up their updates. It might be worth considering changing the Release file
definition to just authenticate the uncompressed files and expect tools
like apt and debootstrap to authenticate only after uncompressing. A
"Compression-Methods: gz, bz2" header might suffice to help tools work
out whether to try downloading Packages.gz, Packages.bz2 or just plain
Packages first. Possibly "Packages-Compress:" and "Sources-Compress:"
might be better.
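
A rough sketch of what "authenticate only after uncompressing" could
look like on the client side. The method names "gz" and "bz2" follow the
hypothetical header suggested above; verify_index and its parameters are
invented for illustration and are not how apt or debootstrap work today:

```python
import bz2
import gzip
import hashlib

# Hypothetical mapping from a "Compression-Methods" value to a decompressor.
DECOMPRESSORS = {"gz": gzip.decompress, "bz2": bz2.decompress}

def verify_index(payload: bytes, method, expected_sha1: str,
                 expected_size: int) -> bytes:
    """Decompress with `method` (None for plain), then check the result."""
    data = DECOMPRESSORS[method](payload) if method else payload
    if (len(data) != expected_size
            or hashlib.sha1(data).hexdigest() != expected_sha1):
        raise ValueError("uncompressed contents do not match Release entry")
    return data
```

With a check like this, it no longer matters which gzip level or which
compressor the mirror used, only that the uncompressed Packages file is
the one the Release file describes.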

