[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Debian's problems, Debian's future



On Wed, Apr 10, 2002 at 09:22:50AM +0200, Michael Bramer wrote:
> On Wed, Apr 10, 2002 at 10:25:22AM +1000, Martijn van Oosterhout wrote:
> > With the standard rsync algorithm, the rsync checksum files would actually
> > be 8 times larger than the original file (you need to store the checksum
> > for each possible block in the file).
> 
> I don't see that the checksum file is larger than the origanl file. If
> the checksum file is larger, we will have more bytes to download... This
> was not the goal.

That's because the client doesn't not download the checksums. Look below.

> maybe I don't understand the rsync algorithm...
> 
> IMHO the rsync algorithm is:
>  1.) Computer beta splits file B in blocks.
>  2.) calculate two checksums 
>      a.) weak ``rolling'' 32-bit checksum
>      b.) md5sum
>  3.) Computer B send this to computer A.
>  4.) Computer A search in file A for parts with the same checksums from
>      file B
>  5.) Computer A request unmatch blocks from computer B and 
>      build the file B.
> 
> I get this from /usr/share/doc/rsync/tech_report.tex.gz

Computer A wants to download a file F from computer B.

1. Computer A splits it's version into blocks, calculates the checksum for
each block.
2. Computer A sends this list to computer B. This should be <1% the size of
the original file. Depends on the block size.
3. Computer B takes this list and does the rolling checksum over the file.
Basically, it calculates the checksum for bytes 0-1023, checks for it in the
list from the client. If it's a match send back a string indicating which
block it is, else send byte 0. Calculate checksum of 1-1024 and do the same.
The rolling checksum is just an optimisation.
4. Computer A receives list of "tokens" which are either bytes of data or
indications of which block to copy from the original file.

Notice that:
a. The server (computer B) does *all* the work.
b. The data forms a stream. The client can split itself into two and can be
analysing the next file while the server is still processing the current
one. Your above algorithm requires two requests for each file. The streaming
help performance over high latency links.
c. Precalculating checksums on the client is useless
d. Precalculating checksums on the server is also useless because the
storage would be more (remember, checksum for bytes 0-1023, then for 1-1024,
2-1025, etc). It's faster to calculate them than to load them off disk.

So, the main difference between what you are proposing is 1 versus 2
requests per file. And rsync definitly only has one.

Besides, look at the other posts on this thread. Diff requires less download
than rsync.
-- 
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Ignorance continues to thrive when intelligent people choose to do
> nothing.  Speaking out against censorship and ignorance is the imperative
> of all intelligent people.


-- 
To UNSUBSCRIBE, email to debian-devel-request@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org



Reply to: