Re: package pool and big Packages.gz file

To: Jason Gunthorpe <jgg@debian.org>
Cc: Goswin Brederlow <goswin.brederlow@student.uni-tuebingen.de>, Debian Developers <debian-devel@lists.debian.org>
Subject: Re: package pool and big Packages.gz file
From: Goswin Brederlow <goswin.brederlow@student.uni-tuebingen.de>
Date: 08 Jan 2001 13:18:39 +0100
Message-id: <873deunso0.fsf@mose.informatik.uni-tuebingen.de>
In-reply-to: Jason Gunthorpe's message of "Sun, 7 Jan 2001 20:43:07 -0700 (MST)"
References: <Pine.LNX.3.96.1010107203006.21865R-100000@wakko.deltatee.com>

>>>>> " " == Jason Gunthorpe <jgg@debian.org> writes:

     > On 8 Jan 2001, Goswin Brederlow wrote:

    >> I don't need to get a filelisting, apt-get tells me the
    >> name. :)

     > You have missed the point, the presence of the ability to do
     > file listings prevents the adoption of rsync servers with high
     > connection limits.

Then that feature should be limited to non-recursive listings or
turned off. Or .listing files should be created that are just served.

    >> > Reversed checksums (with a detached checksum file) is
    >> > something someone should implement for debian-cd. You could
    >> > even quite reasonably do that totally using HTTP and not run
    >> > the risk of rsync load at all.
    >> 
    >> At the moment the client calculates one rolling checksum and
    >> md5sum per block.

     > I know how rsync works, and it uses MD4.

Oops, then s/5/4/g.

    >> Given a 650MB file, I don't want to know the hit/miss ratios
    >> for the rolling checksum and the md5sum. Must be really bad.

     > The ratio is supposed to only scale with block size, so it
     > should be the same for big files and small files (ignoring the
     > increase in block size with file size).  The amount of time
     > expended doing this calculation is not trivial however.

Hmm, the technical paper says that it creates a 16-bit external hash,
each entry a linked list of items containing the full 32-bit rolling
checksum (or the other 16 bits) and the md4sum.

So when you have more blocks, the hash will fill up. You then get more
hits on the first level and need to search a linked list. With a block
size of 1K a CD image has about 10 items per hash entry; it's 1000%
full. The time spent just checking the rolling checksums must be huge.
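
As a rough check of that estimate, here is a minimal Python sketch.
It assumes round numbers that are not measured from rsync itself: a
650 MB image counted in decimal megabytes, "1K" taken as 1000 bytes,
and blocks spread evenly over the 2^16 first-level entries. The
function name is made up for illustration.

```python
# Sketch of the load estimate above; all sizes are assumed round numbers.
IMAGE_BYTES = 650 * 1000 * 1000   # assumed: 650 MB CD image, decimal MB
BLOCK_BYTES = 1000                # assumed: "1K" block taken as 1000 bytes
BUCKETS = 2 ** 16                 # entries in the 16-bit first-level hash


def avg_chain_length(image_bytes, block_bytes, buckets=BUCKETS):
    """Average number of blocks per hash entry, assuming an even spread."""
    blocks = image_bytes // block_bytes
    return blocks / buckets


chain = avg_chain_length(IMAGE_BYTES, BLOCK_BYTES)
print(f"{chain:.1f} items per hash entry, about {chain * 100:.0f}% full")
```

Doubling the block size halves the chain length in this model, which
is the trade-off between block count and per-block cost.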

And with 650000 rolling checksums for the image, there's a ~10/65536
chance of hitting the same checksum with a different md4sum, so
that's about 100 times per CD, just by pure chance.
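
The same back-of-the-envelope figure can be reproduced with a short
Python sketch. It follows the simplified model in this mail, not
rsync's real behaviour: one lookup per block, about 650000/65536 items
in the hash entry that is hit, and each item agreeing on the remaining
16 bits with probability 1/65536. The function and constant names are
hypothetical.

```python
# Sketch of the chance-collision estimate; the model is the simplified
# one from this mail, with one lookup per block.
BLOCKS = 650_000      # assumed: 650 MB image cut into 1K blocks
BUCKETS = 2 ** 16     # entries in the 16-bit first-level hash
REST = 2 ** 16        # possible values of the remaining 16 checksum bits


def expected_false_hits(blocks, buckets=BUCKETS, rest=REST):
    """Expected rolling-checksum matches whose md4sum then differs."""
    chain = blocks / buckets        # items stored per hash entry
    per_lookup = chain / rest       # chance that some item matches by luck
    return blocks * per_lookup


hits = expected_false_hits(BLOCKS)
print(f"roughly {hits:.0f} chance matches per image")
```

The count grows with the square of the number of blocks in this
model, so halving the block size gives about four times as many
chance matches.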

If the images match, then it's 650000 times.

So the better the match and the more blocks you have, the more CPU it
takes. Of course larger blocks take more time to compute an md4sum,
but you will have fewer blocks then.

     > For CD images the concern is of course available disk
     > bandwidth, reversed checksums eliminate that bottleneck.

That anyway. And RAM.

MfG
        Goswin
