
Re: Debian mirror scripts



Otto Wyss <otto.wyss@orpatec.ch> writes:

> Goswin von Brederlow wrote:
>> 
>> > Why there isn't already an rsync method for apt is probably a
>> > mystery nobody will ever solve.
>> 
>> It is not wanted due to rsync causing excessive server load.
>> 
> That is simply not true. This statement is repeated all the time but
> nobody has ever been able to show hard figures.

Rsync by default uses ~3% of the file size in RAM to store block
checksums. Consider a new kde-i18n release with its 200MB file (as an
extreme case): 33 downloads waste 200MB of RAM, 330 downloads waste
2GB. When do you think the mirror will start swapping?
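The arithmetic above can be sketched as a tiny calculation; the ~3% ratio and the 200MB example come from the text, while the function name is mine:

```python
# Rough per-server RAM estimate for rsync block checksums, using the
# ~3%-of-file-size figure quoted above (an approximation, not a
# measured rsync constant).

def checksum_ram_mb(file_mb, clients, ratio=0.03):
    """RAM the server needs to hold block checksums for all clients."""
    return file_mb * ratio * clients

# 33 simultaneous downloads of a 200MB file already need ~200MB:
print(checksum_ram_mb(200, 33))   # ~198 MB
print(checksum_ram_mb(200, 330))  # ~1980 MB, i.e. ~2GB
```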

The ~3% is filled with the client's checksums first, and then rsync
reads the full file, computing the adler32 checksum at every byte
offset and an md4sum for each potential match.
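The per-byte work described above can be sketched roughly as follows; `zlib.adler32` stands in for rsync's own rolling-checksum variant, and the MD4 confirmation step is omitted to keep the sketch short:

```python
import zlib

def rolling_matches(data, client_sums, block=700):
    """Slide a window over the file one byte at a time, computing a weak
    checksum at every offset and flagging offsets whose checksum appears
    in the client's set -- the server-side scan described above.
    (Real rsync updates the checksum incrementally instead of
    recomputing it, and confirms each hit with an MD4 hash.)"""
    hits = []
    for i in range(len(data) - block + 1):
        if zlib.adler32(data[i:i + block]) in client_sums:
            hits.append(i)
    return hits
```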

Even though adler32 and md4 are very fast, they cause more load than
just sending out the file. And all that for only a ~1% saving (without
rsyncable).

Worse is downloading multiple files (as you mention below) using
include and exclude patterns. Downloading 1000 files at once this way
takes somewhere around an hour just to build the listing of files to
get, doing a complete find over the archive (wasting tons of I/O).

But downloading files separately isn't much better, since then every
file opens a new connection and forks a new rsync on the
server. Starting a fresh full Debian-amd64 mirror with a 300ms ping
time to the server (roughly what I get) would waste 75 hours just
waiting on the initial three-way handshakes, and another 50 hours on
the round trips of sending a filename and getting the data back.
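Those hour figures are consistent with an archive of roughly 600,000 files (my assumption; the text only gives the resulting hours) if the three-way handshake is counted as 1.5 round trips and the filename/data exchange as one more:

```python
def wasted_hours(n_files, rtt_s, rtts_per_file):
    """Total time spent purely on round trips, in hours."""
    return n_files * rtt_s * rtts_per_file / 3600

RTT = 0.3          # the 300ms ping from the text
FILES = 600_000    # assumed archive size, not stated in the text

# three-way handshake (SYN, SYN-ACK, ACK) = 1.5 round trips:
print(wasted_hours(FILES, RTT, 1.5))  # ~75 hours
# one more round trip per file for request and response:
print(wasted_hours(FILES, RTT, 1.0))  # ~50 hours
```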

You can say all this is bad design in rsync and the solution is dead
simple (now): zsync.

> Where rsync produces much load is during the phase when it collects all
> the files for synchronisation, not during MD5 computation, but this is
> only due to badly designed scripts. DpartialMirror doesn't impose this
> phase since it only requires single-file transfers and does the file
> collecting phase on the client.

Depends on use and available client data.

>> New versions. The size of the Packages files is tiny compared to
>> all the debs. Even the 1% saving for rsyncing debs is hardly worth
>> it due to the extra traffic for the checksums and the server load
>> it causes.
>> 
> Sorry rsync reports the overall use, incl. checksums etc.

What I meant is the extra outgoing traffic on the client side. For
DSL users, for example, sending the checksums might actually slow
things down more than the 1% saving speeds them up.

> Of course 1% saving doesn't make much sense so that's the main reason I
> don't develop DpartialMirror further. Anyway the next time a
> distribution concept is designed it will be based on a p2p solution.
>
>> zsync has the option of looking into gzipped files and rsyncing them
>> as if they were ungzipped (while still just downloading chunks of the
>> gzipped file). It's a slightly more complex algorithm but works even
>> better than rsyncable files and rsync.
>>
> As long as zsync allows multi-file transfers it won't be better than rsync.
>
> O. Wyss

zsync works via HTTP. You have to know the filename to request, so
there is no big "find" like rsync does for multiple files, but through
keep-alive you can still fetch multiple files, even in parallel if you
code it that way, over a single connection.

Also note that zsync uses precomputed checksum files on the server
side and does the adler/md4 computations on the client. That's why it
can use plain HTTP/1.1. It's a win-win situation.

MfG
        Goswin

PS: I think the sarge/sid rsync has improved multiple-file
downloading, but that doesn't help woody.


