On Mon, Sep 10, 2001 at 12:39:56PM -0500, Adam Heath wrote:
> On Mon, 10 Sep 2001, Zygo Blaxell wrote:
> > This has a nasty side-effect: there may be many machines attempting to
> > download all of the same packages in the same order through the same
> > proxy if there is a large number of packages upgraded that day. The HTTP
> > caches can't have a cache hit on a package until that package is fully
> > downloaded, so ultimately many machines will end up downloading the same
> > package at the same time through the same HTTP proxy and Internet feed.
>
> It's the fault of the web cache/proxy that it does not start serving out the
> partial content, and then multiplexing the content still coming in from the
> first request, out to all the subsequent requests.

Actually, I have played around with this kind of solution, using 'apt-get
--print-uris' to generate data to simulate a 'dist-upgrade', and HTTP proxy
cache logs to simulate an 'update'.  In a nutshell, the solution you propose
can be worse than using no cache at all, while the solution I propose
improves performance in some situations regardless of the kind of cache
used.

I have to deal with some eastern European site offices which have plenty of
available bandwidth, but average 30% packet loss (best case 10%, worst case
80%, plus an hour or two per month of no connectivity at all) between the
ISP and anything interesting, like a Debian mirror or another corporate
office.  Any one TCP connection can use less than 2% of the bandwidth
between sites--the TCP congestion window never opens up, because every third
packet disappears in transit.  If I open 10 or 20 TCP connections, each
fetching a different package, each one behaves the same as if I had opened
only a single connection--there is no bandwidth starvation and no
significant additional latency, because each congestion window never opens
more than one or two segments.
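The less-than-2% figure is consistent with the usual back-of-envelope model
of loss-limited TCP throughput (the Mathis et al. steady-state estimate,
rate ~= (MSS/RTT) * 1.22/sqrt(p)), which is if anything optimistic at loss
rates this high, since it ignores retransmission timeouts.  The MSS, RTT,
and loss values below are assumptions for illustration, not measurements
from my sites:

```shell
# Loss-limited TCP throughput estimate (Mathis et al. steady-state model):
#   rate ~= (MSS / RTT) * 1.22 / sqrt(p)
# mss, rtt, and p are assumed illustrative values, not measurements.
awk 'BEGIN {
    mss = 1460      # bytes per segment
    rtt = 0.2       # seconds round trip
    p   = 0.30      # packet loss rate
    rate = (mss / rtt) * 1.22 / sqrt(p)   # bytes per second
    printf "~%.0f KB/s per connection\n", rate / 1024
}'
```

Under those assumptions each connection tops out around 16 KB/s, so even a
dozen parallel connections leave a multi-megabit link mostly idle--which
matches the behavior I see.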
Some results of my simulations:

With no cache at all, 30% packet loss at the ISP, 3% of local bandwidth
consumed per TCP connection, and unmodified apt-get, the average run time of
a parallel 'apt-get dist-upgrade' or 'apt-get upgrade' is identical for any
machine pool sized between 1 and 10 machines.  Some machines finish faster
than others--there is considerable variance between machines.  No machine
ever uses enough bandwidth to affect packet loss, latency, or available
bandwidth for other connections, so the TCP connections don't interact with
each other at all.  Only once you get 20 or so machines running in parallel
do things start to get non-linear.

With a cache that behaves as I described (no cache hits until a complete
object is downloaded), under the same conditions, the total run time of
'apt-get dist-upgrade' run in parallel on 10 machines is about 30% less,
because some machines do get cache hits if they are sufficiently delayed
during one HTTP object fetch that the following object can be fetched from
the cache.  On a cache miss, behavior is identical to the uncached scenario.
I think the 30% packet loss and the 30% speed improvement in this simulation
are just a coincidence--I can't think of any mechanism by which they would
be related.

With a cache that behaves as you described (partial-content hits, with a
live upstream HTTP connection multiplexed amongst the downstream HTTP
clients), under the same conditions, the total run time is the same for all
machines, and equal to the worst-case run time of the uncached case.  It is
very close to having one machine perform the entire apt-get dist-upgrade
through the cache, followed by all the other machines using the cache
sequentially.  In practice, running all of the apt-get dist-upgrades
sequentially through a cache under these conditions is too slow to be worth
considering.
The Packages.gz files for unstable change while the dist-upgrades are
running, invalidating any cached copies of those files and losing packages
('update' has to be run again), and generally wasting an entire day.

If all of the apt-gets in a cluster can be configured somehow to fetch their
packages in parallel through a caching web proxy that behaves either as you
describe (joining all requests for the same object into one) or as I
describe (treating all requests independently until a complete object can be
cached), the run time on all machines is equal: the total download time of
the longest file in each of the groups of N packages fetched in the
update/upgrade.  N would be about 10, given the network (non-)performance
characteristics I've been describing.  If each apt-get fetches its objects
in random order, as I proposed, you get as close to this result as possible
without introducing an external synchronization mechanism.  If I understand
correctly, apt-get does not care about the order of downloaded packages,
since it won't actually install any until the download of all packages is
complete.

Hmmm...it occurs to me that in the process of gathering experimental data,
I've already got tools that could easily be adapted to pre-fill the HTTP
cache for a dist-upgrade.  Something like...

for x in \
    http://http.us.debian.org/debian/dists/{stable,testing,unstable}/{main,contrib,non-free}/binary-{i386,alpha}/Packages.gz \
    http://non-us.debian.org/debian-non-US/dists/{stable,testing,unstable}/{main,contrib,non-free}/binary-{i386,alpha}/Packages.gz \
; do
    wget --cache=on --delete-after "$x" &
done
wait

for x in `cat machines`; do
    ssh root@$x bash <<'COMMANDS'
apt-get update
apt-get --print-uris -y dist-upgrade | \
    while read url other; do
        case "$url" in
            \'*\') echo "${url//\'/}" ;;
        esac
    done | \
    tr '\n' '\0' | \
    xargs -0 -P10 -n1 -rt wget --cache=on --delete-after
COMMANDS
done

...would do the job crudely, but quite effectively.
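As a sketch of the random-order variant: each machine could shuffle its URL
list before the parallel fetch, so no two machines walk the package list in
the same order.  The 'shuf' reordering step and the sed extraction here are
my additions, not anything apt provides:

```shell
# Per-machine random fetch order: extract the single-quoted URLs from
# 'apt-get --print-uris' output, shuffle them, then fetch in parallel.
# 'shuf' is an assumed convenience; any random line-reordering would do.
apt-get --print-uris -y dist-upgrade \
    | sed -n "s/^'\([^']*\)'.*/\1/p" \
    | shuf \
    | xargs -P10 -n1 -rt wget --cache=on --delete-after
```

With every machine fetching in a different order, the chance that two
machines collide on the same uncached object at the same moment drops
sharply, which is exactly the property the simulations above rely on.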
A little bit of Perl could parse /etc/apt/sources.list and handle the
pre-fetching more gracefully than wget.  Hmmm.

> It is my opinion that this bug should be filed on the web proxy/cache
> software you are using, and not on apt, and that this bug should be
> closed.

Of course, the other issue is that ultimately I don't control all of the
HTTP caching software.  Two of the caches I use are supplied by the upstream
ISP or the corporate IT department...if I wanted my own caching, I'd have to
build yet another cache behind these machines...

-- 
Opinions expressed are my own, I don't speak for my employer, and all that.
Encrypted email preferred.  Go ahead, you know you want to.  ;-)
OpenPGP at work: 3528 A66A A62D 7ACE 7258 E561 E665 AA6F 263D 2C3D