
Re: New mirror scripts for Debian mirrors



Joerg Jaspert (joerg@debian.org) wrote on 9 November 2008 21:39:
 >>>  >Unknown yet, currently it's only expected to get more.
 >>>  >4 or 6 of them seem to be a good number.
 >>> 6 is doable but it's the limit. For intervals shorter than 4h another
 >>> method would have to be used.
 >
 >>> The problem, in our case, is not the download, it's the scavenging of
 >>> an enormous tree to determine what has to be transferred. This is very
 >>> heavy because it's almost only disk seeks.
 >> Yes, this part of the sync process is time consuming and is what will
 >> eventually limit the frequency of pushes.
 >
 >Ay, rsync is annoying for that part of the sync.
 >
 >> I would like to see trials on a mirror by mirror basis before the
 >> pushing frequencies are increased arbitrarily. I'm sure we can take 3 a
 >> day and even 4. But more than that will probably jam things.
 >
 >I think 4 should be possible without too much effect for the mirror
 >network. That's one run every 6 hours "only".

Agreed.

 >Now, I would *love* to have hourly runs. Yes, I realize that *currently*
 >this is no option, thanks to the amount of files rsync has to check.
 >We need something better here first. I don't know yet what. Maybe the
 >"batch-mode" from rsync (never really tried it), maybe something totally
 >different. If someone has good ideas, I'm happy to hear them. (Even if
 >they go as far as changing the archive structure).

What an enthusiastic mirror/archive boss! :-) Count me in for hourly
updates. I just think that the probability of you convincing the other
bosses to radically change the archive structure for the sake of such
frequent updates is of the order of the inverse of Avogadro's number :-)

Seriously, the structure of the archive is very bad for updates
because all files from all releases are mixed. This forces the useless
stat of *many* files that never change. Besides, Debian is the largest
distro. The combination of these two factors makes it by far the
heaviest distro to update.

A method that is efficient enough for hourly updates is to use a
change file provided by the master. The mirrors pull this file, parse
it and create a --files-from that rsync uses to pull just new stuff,
without having to do any scavenging. This doesn't work for hard links
but can be used for the pool, doc and project dirs, because they don't
have hard links. Then the mirrors do a standard sync of dists and
indices, which is fast because they only have about 10,000 files.
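
To make the idea concrete, here is a rough sketch of the mirror side.
The change-file format is my assumption (one "A <path>" or "D <path>"
per line); nothing like it exists on the master yet, and the function
names are made up for illustration. Only --files-from is a real rsync
option.

```python
from typing import Dict, List, Tuple

# Hypothetical change-file format, assumed for this sketch only:
#   "A <path>"  file added or updated on the master
#   "D <path>"  file removed on the master
FAST_TREES = ("pool/", "doc/", "project/")  # trees without hard links


def parse_change_file(text: str) -> Tuple[List[str], List[str]]:
    """Split a change file into (paths to fetch, paths to delete)."""
    last_action: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        action, _, path = line.partition(" ")
        path = path.strip()
        if action not in ("A", "D"):
            continue
        if not path.startswith(FAST_TREES):
            continue  # dists and indices are left to the standard sync
        last_action[path] = action  # the latest entry for a path wins
    fetch = [p for p, a in last_action.items() if a == "A"]
    delete = [p for p, a in last_action.items() if a == "D"]
    return fetch, delete


def rsync_command(list_file: str, source: str, dest: str) -> List[str]:
    """Build an rsync call that transfers only the listed paths."""
    return ["rsync", "-a", "--files-from=" + list_file, source, dest]
```

The mirror would write the first list to a file, one path per line,
and run the resulting command. The second list is kept for the
deletion step.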

The change file has to include info on removed files as well. They
have to be removed "by hand" because rsync doesn't remove files
without a directory scan, which is exactly what we want to avoid. Of
course the master could provide the ready --files-from, which rsync
can pull automatically, and the list of deletions to be done after the
sync of dists and indices. The more centralized the process the more
reliable it is.
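
The "by hand" removal could look something like the sketch below. It
assumes the deletion list holds paths relative to the mirror root; the
function name and the safety check are mine, not part of any existing
script.

```python
import os
from typing import Iterable, List


def apply_deletions(mirror_root: str, paths: Iterable[str]) -> List[str]:
    """Remove listed files under mirror_root; return those actually removed."""
    root = os.path.abspath(mirror_root)
    removed = []
    for rel in paths:
        # Normalise without following symlinks, so a listed symlink is
        # removed itself rather than the file it points to.
        target = os.path.normpath(os.path.join(root, rel))
        # Refuse anything that would land outside the mirror tree.
        if target == root or os.path.commonpath([root, target]) != root:
            continue
        try:
            os.remove(target)
        except FileNotFoundError:
            continue  # already gone, nothing to do
        removed.append(rel)
    return removed
```

Running this only after the sync of dists and indices means the old
files stay available until the new index files are in place.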

This method would also significantly reduce the load on all mirrors,
so it'd be very welcome even without hourly updates (hint, hint) :-)

