
Re: Rsync on servers (was Re: RFC: Checking for updates)



On Sat, Nov 03, 2001 at 04:16:34PM -0500, Matt Zimmerman wrote:
> On Sat, Nov 03, 2001 at 11:10:20PM +1100, Martijn van Oosterhout wrote:
> > Last time I heard this idea, it was pointed out that the checksum data is 4
> > or 8 times larger than the file it is checksumming.
> > 
> > I don't think archive maintainers would like that...
> 
> If that were true, rsync wouldn't save bandwidth by transferring and
> verifying the checksums, no?  Even so, we're only talking about the
> Packages files, which are relatively small compared to the archive as a
> whole.

The algorithm works like this: the client divides the file it has into
fixed-size blocks, calculates a checksum for each, and sends those to the
server. The server scans through its own copy of the file, computing a
rolling checksum at every byte position with the same block size, and sends
back a list of tokens, each representing either literal data or a block the
client already has.
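
To make the exchange concrete, here is a minimal sketch of it in Python. It
is an illustration only, not rsync's actual code: the function names are
made up, MD5 stands in for the strong checksum, the weak checksum is a
simple Adler-style rolling sum, and a trailing partial block on the client
is simply ignored.

```python
import hashlib

M = 1 << 16  # modulus for the weak rolling checksum (assumption)

def weak_checksum(block):
    # a = plain sum of bytes, b = position-weighted sum of bytes
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, size):
    # Slide the window one byte to the right in constant time.
    a = (a - out_byte + in_byte) % M
    b = (b - size * out_byte + a) % M
    return a, b

def block_signatures(old, size):
    # Client side: one (weak, strong) checksum pair per full block it holds.
    sigs = {}
    for index in range(len(old) // size):
        block = old[index * size:(index + 1) * size]
        strong = hashlib.md5(block).digest()
        sigs.setdefault(weak_checksum(block), []).append((strong, index))
    return sigs

def compute_delta(new, sigs, size):
    # Server side: test every byte position against the client's checksums.
    tokens, literal = [], bytearray()
    pos, weak = 0, None
    while pos + size <= len(new):
        if weak is None:
            weak = weak_checksum(new[pos:pos + size])
        match = None
        for strong, index in sigs.get(weak, ()):
            if hashlib.md5(new[pos:pos + size]).digest() == strong:
                match = index
                break
        if match is not None:
            if literal:
                tokens.append(("data", bytes(literal)))
                literal = bytearray()
            tokens.append(("block", match))  # client already has this block
            pos += size
            weak = None
        else:
            literal.append(new[pos])  # no match here: send the byte itself
            if pos + size < len(new):
                weak = roll(weak[0], weak[1], new[pos], new[pos + size], size)
            pos += 1
    literal.extend(new[pos:])
    if literal:
        tokens.append(("data", bytes(literal)))
    return tokens

def apply_delta(old, tokens, size):
    # Client side: rebuild the server's file from tokens and local blocks.
    out = bytearray()
    for kind, value in tokens:
        if kind == "block":
            out += old[value * size:(value + 1) * size]
        else:
            out += value
    return bytes(out)
```

Note that the server side has to slide over its whole file for every
request, because the checksums it compares against come from the client.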

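So by precalculating the checksums on the server, you are asking it to
remember a 4 (or 8) byte checksum value for every possible block in the
file, that is, one for each byte offset. That is where the figure of 4 or 8
times the file size comes from.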
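
As a rough back-of-the-envelope check, the arithmetic can be sketched like
this. The file size and block size are made-up figures for illustration,
and the helper name is hypothetical.

```python
def per_offset_checksum_bytes(file_size, block_size, checksum_size):
    """Bytes needed to store one checksum for every possible block start."""
    offsets = max(file_size - block_size + 1, 0)
    return offsets * checksum_size

file_size = 1_000_000   # hypothetical file size in bytes
block_size = 1_000      # hypothetical block size in bytes
for checksum_size in (4, 8):
    total = per_offset_checksum_bytes(file_size, block_size, checksum_size)
    # Print checksum size, total storage, and storage relative to the file.
    print(checksum_size, total, round(total / file_size, 2))
```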

You can probably see that this algorithm can be reversed: the client asks
the server for a list of block checksums, matches those checksums against
what it has, and requests from the server only the data blocks it couldn't
match. That is fairly light on the server end, and precalculation is a win
here because there is only one checksum per block, so the checksums would
be less than 1% of the original file. Not sure why it hasn't been done yet.
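
A minimal sketch of the reversed scheme, in Python, might look like the
following. Everything here is an assumption made for illustration: the
function names, the use of MD5 as the strong checksum, the Adler-style weak
checksum, and the fetch_block callback, which stands in for whatever
request the client would send to get one missing block.

```python
import hashlib

M = 1 << 16  # modulus for the weak rolling checksum (assumption)

def weak_checksum(block):
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, size):
    # Slide the window one byte to the right in constant time.
    a = (a - out_byte + in_byte) % M
    b = (b - size * out_byte + a) % M
    return a, b

def server_block_list(data, size):
    # Server side, can be precalculated: one checksum pair per block.
    return [(weak_checksum(data[i:i + size]),
             hashlib.md5(data[i:i + size]).digest())
            for i in range(0, len(data), size)]

def find_local_blocks(local, block_list, size):
    # Client side: map server block index -> offset in the local file.
    wanted = {}
    for index, (weak, strong) in enumerate(block_list):
        wanted.setdefault(weak, []).append((strong, index))
    found = {}
    if len(local) < size:
        return found
    weak = weak_checksum(local[:size])
    for pos in range(len(local) - size + 1):
        for strong, index in wanted.get(weak, ()):
            if index not in found:
                if hashlib.md5(local[pos:pos + size]).digest() == strong:
                    found[index] = pos
        if pos + size < len(local):
            weak = roll(weak[0], weak[1], local[pos], local[pos + size], size)
    return found

def rebuild(local, block_list, size, fetch_block):
    # Client side: reuse matched local blocks, fetch only the rest.
    found = find_local_blocks(local, block_list, size)
    out = bytearray()
    for index in range(len(block_list)):
        if index in found:
            out += local[found[index]:found[index] + size]
        else:
            out += fetch_block(index)  # hypothetical request to the server
    return bytes(out)
```

Here the rolling scan runs on the client, and the server only serves a
static checksum list plus the blocks that are asked for.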

The reason it was done the other way first is that it requires only a
single request/response exchange, which can be streamed for extra
performance.

HTH,
-- 
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> Magnetism, electricity and motion are like a three-for-two special offer:
> if you have two of them, the third one comes free.


