On Mon, Sep 10, 2001 at 12:39:56PM -0500, Adam Heath wrote:
> On Mon, 10 Sep 2001, Zygo Blaxell wrote:
> > This has a nasty side-effect: there may be many machines attempting to
> > download all of the same packages in the same order through the same
> > proxy if there is a large number of packages upgraded that day. The HTTP
> > caches can't have a cache hit on a package until that package is fully
> > downloaded, so ultimately many machines will end up downloading the same
> > package at the same time through the same HTTP proxy and Internet feed.
>
> It's the fault of the web cache/proxy that it does not start serving out the
> partial content, and then multiplexing the content still coming in from the
> first request, out to all the subsequent requests.

Actually, I have played around with this kind of solution, using 'apt-get
--print-uris' to generate data to simulate a 'dist-upgrade', and HTTP proxy
cache logs to simulate an 'update'.  In a nutshell, the solution you propose
can be worse than using no cache at all, while the solution I propose
improves performance in some situations regardless of the kind of cache
used.

I have to deal with some eastern European site offices which have plenty of
available bandwidth, but average 30% packet loss (best case 10%, worst case
80%, plus an hour or two per month of no connectivity at all) between the
ISP and anything interesting, like a Debian mirror or another corporate
office.  Any one TCP connection can use less than 2% of the bandwidth
between sites--the TCP congestion window never opens up, because every third
packet disappears in transit.  If I open 10 or 20 TCP connections, each
fetching a different package, each one behaves the same as if I had opened
only a single connection--there is no bandwidth starvation and no
significant additional latency, because each congestion window never opens
more than one or two segments.
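The less-than-2% figure is consistent with the usual back-of-envelope model
of loss-limited TCP throughput (the Mathis et al. steady-state estimate,
rate ~= (MSS/RTT) * 1.22/sqrt(p)), which is if anything optimistic at loss
rates this high, since it ignores retransmission timeouts.  The MSS, RTT,
and loss values below are assumptions for illustration, not measurements
from my sites:

```shell
# Loss-limited TCP throughput estimate (Mathis et al. steady-state model):
#   rate ~= (MSS / RTT) * 1.22 / sqrt(p)
# mss, rtt, and p are assumed illustrative values, not measurements.
awk 'BEGIN {
    mss = 1460      # bytes per segment
    rtt = 0.2       # seconds round trip
    p   = 0.30      # packet loss rate
    rate = (mss / rtt) * 1.22 / sqrt(p)   # bytes per second
    printf "~%.0f KB/s per connection\n", rate / 1024
}'
```

Under those assumptions each connection tops out around 16 KB/s, so even a
dozen parallel connections leave a multi-megabit link mostly idle--which
matches the behavior I see.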
Some results of my simulations:

With no cache at all, 30% packet loss at the ISP, 3% of local bandwidth
consumed per TCP connection, and unmodified apt-get, the average run time of
a parallel 'apt-get dist-upgrade' or 'apt-get upgrade' is identical for any
machine pool sized between 1 and 10 machines.  Some machines finish faster
than others--there is considerable variance between machines.  No machine
ever uses enough bandwidth to affect packet loss, latency, or available
bandwidth for other connections, so the TCP connections don't interact with
each other at all.  Only once you get 20 or so machines running in parallel
do things start to get non-linear.

With a cache that behaves as I described (no cache hits until a complete
object is downloaded), under the same conditions, the total run time of
'apt-get dist-upgrade' run in parallel on 10 machines is about 30% less,
because some machines do get cache hits if they are sufficiently delayed
during one HTTP object fetch that the following object can be fetched from
the cache.  On a cache miss, behavior is identical to the uncached scenario.
I think the 30% packet loss and the 30% speed improvement in this simulation
are just a coincidence--I can't think of any mechanism by which they would
be related.

With a cache that behaves as you described (partial-content hits, with a
live upstream HTTP connection multiplexed amongst the downstream HTTP
clients), under the same conditions, the total run time is the same for all
machines, and equal to the worst-case run time of the uncached case.  It is
very close to having one machine perform the entire apt-get dist-upgrade
through the cache, followed by all the other machines using the cache
sequentially.  In practice, running all of the apt-get dist-upgrades
sequentially through a cache under these conditions is too slow to be worth
considering.
The Packages.gz files for unstable change while the dist-upgrades are
running, invalidating any cached copies of those files and losing packages
('update' has to be run again), and generally wasting an entire day.

If all of the apt-gets in a cluster can be configured somehow to fetch their
packages in parallel through a caching web proxy that behaves either as you
describe (joining all requests for the same object into one) or as I
describe (treating all requests independently until a complete object can be
cached), the run time on all machines is equal: the total download time of
the longest file in each of the groups of N packages fetched in the
update/upgrade.  N would be about 10, given the network (non-)performance
characteristics I've been describing.  If each apt-get fetches its objects
in random order, as I proposed, you get as close to this result as possible
without introducing an external synchronization mechanism.  If I understand
correctly, apt-get does not care about the order of downloaded packages,
since it won't actually install any until the download of all packages is
complete.

Hmmm...it occurs to me that in the process of gathering experimental data,
I've already got tools that could easily be adapted to pre-fill the HTTP
cache for a dist-upgrade.  Something like...

for x in \
    http://http.us.debian.org/debian/dists/{stable,testing,unstable}/{main,contrib,non-free}/binary-{i386,alpha}/Packages.gz \
    http://non-us.debian.org/debian-non-US/dists/{stable,testing,unstable}/{main,contrib,non-free}/binary-{i386,alpha}/Packages.gz \
; do
    wget --cache=on --delete-after "$x" &
done
wait

for x in `cat machines`; do
    ssh root@$x bash <<'COMMANDS'
apt-get update
apt-get --print-uris -y dist-upgrade | \
    while read url other; do
        case "$url" in
            \'*\') echo "${url//\'/}" ;;
        esac
    done | \
    tr '\n' '\0' | \
    xargs -0 -P10 -n1 -rt wget --cache=on --delete-after
COMMANDS
done

...would do the job crudely, but quite effectively.
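As a sketch of the random-order variant: each machine could shuffle its URL
list before the parallel fetch, so no two machines walk the package list in
the same order.  The 'shuf' reordering step and the sed extraction here are
my additions, not anything apt provides:

```shell
# Per-machine random fetch order: extract the single-quoted URLs from
# 'apt-get --print-uris' output, shuffle them, then fetch in parallel.
# 'shuf' is an assumed convenience; any random line-reordering would do.
apt-get --print-uris -y dist-upgrade \
    | sed -n "s/^'\([^']*\)'.*/\1/p" \
    | shuf \
    | xargs -P10 -n1 -rt wget --cache=on --delete-after
```

With every machine fetching in a different order, the chance that two
machines collide on the same uncached object at the same moment drops
sharply, which is exactly the property the simulations above rely on.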
A little bit of Perl could parse /etc/apt/sources.list and handle the
pre-fetching more gracefully than wget.  Hmmm.

> It is my opinion that this bug should be filed on the web proxy/cache
> software you are using, and not on apt, and that this bug should be
> closed.

Of course, the other issue is that ultimately I don't control all of the
HTTP caching software.  Two of the caches I use are supplied by the upstream
ISP or the corporate IT department...if I wanted my own caching, I'd have to
build yet another cache behind these machines...

-- 
Opinions expressed are my own, I don't speak for my employer, and all that.
Encrypted email preferred.  Go ahead, you know you want to.  ;-)
OpenPGP at work: 3528 A66A A62D 7ACE 7258 E561 E665 AA6F 263D 2C3D