On Wed, Apr 10, 2002 at 01:26:17AM -0700, Robert Tiberius Johnson wrote:
> - I tend to update every day. For people who update every day, the
> diff-based scheme only needs to transfer about 8K, but the
> checksum-based scheme needs to transfer 45K. So for me, diffs are
> better. :)
I think you'll find you're also unfairly weighting this against people
who do daily updates. If you do an update once a month, it's not as much
of a bother waiting a while to download the Packages files -- you're
going to have to wait _much_ longer to download the packages themselves.
I'd suggest your formula would be better off being:
bandwidthcost = sum( x = 1..30, prob(x) * cost(x) / x )
(If you update every day for a month, your cost isn't just one download,
it's 30 downloads. If you update once a week for a month, your cost
isn't that of a single download, it's four times that. The /x takes that
into account)
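To put rough numbers on that (using the 12KiB-per-diff and 1.5MiB full-file
figures I use below): someone updating daily pays about 12KiB a day, while
someone updating every 30 days pays 1.5MiB once, which is only about 51KiB
a day. Without the /x the monthly updater looks over a hundred times as
expensive as the daily one; per day it's really only a factor of four or so.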
Bandwidth cost, then, is something like "the average amount downloaded
by a testing/unstable user per day to update main".
My results are something like:
0 days of diffs: 843.7 KiB (the current situation)
1 day of diffs: 335.7 KiB
2 days of diffs: 167.7 KiB
3 days of diffs: 93.7 KiB
4 days of diffs: 56.9 KiB
5 days of diffs: 37.5 KiB
6 days of diffs: 26.8 KiB
7 days of diffs: 20.7 KiB
8 days of diffs: 17.2 KiB
9 days of diffs: 15.1 KiB
10 days of diffs: 13.9 KiB
11 days of diffs: 13.2 KiB
12 days of diffs: 12.7 KiB
13 days of diffs: 12.4 KiB
14 days of diffs: 12.3 KiB
15 days of diffs: 12.2 KiB
...which pretty much matches what I'd expect: at the moment, just to
update main, people download around 850KiB per day on average (the 0-day
figure above); if we let them download just the diff against yesterday,
the average would plunge to a few hundred KiB, and after that you rapidly
reach the point of diminishing returns.
I used figures of 1.5MiB for the standard gzipped Packages file you
download if you can't use diffs, and 12KiB for the size of each daily
diff -- if you're three days out of date, you download three diffs and
apply them in order to get up to date. 12KiB is the average size of
daily bzip2'ed ed-style (diff --ed) diffs over the last month for sid/main/i386.
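To make the client side concrete, here's a rough sketch of how such diffs
might be fetched and applied -- the Packages.diff.N.bz2 naming and the
fetch_full() stub are purely illustrative, not any existing scheme:
#!/usr/bin/python
# Illustrative sketch only: apply up to `ndiffs` daily ed-style diffs in
# order, or fall back to downloading the whole Packages file.
import os

def fetch_full():
    # Placeholder: grab the full 1.5MiB Packages.gz however you usually do.
    raise NotImplementedError

def apply_ed_diff(target, diff_bz2):
    # Decompress the bzip2'ed ed script, append a 'w' so ed writes the
    # result back out, and feed it all to ed.
    cmd = "(bzcat %s; echo w) | ed -s %s" % (diff_bz2, target)
    if os.system(cmd) != 0:
        raise RuntimeError("applying %s failed" % diff_bz2)

def update_packages(days_behind, ndiffs):
    if days_behind <= ndiffs:
        # eg three days out of date: apply the three most recent daily
        # diffs, oldest first, to bring the local Packages file current.
        for day in range(days_behind, 0, -1):
            apply_ed_diff("Packages", "Packages.diff.%d.bz2" % day)
    else:
        # Too far behind (or no local copy at all): get the full file.
        fetch_full()
(Whether you'd actually shell out to ed or reimplement the patching is a
detail; the point is just the N-diffs-then-fall-back logic that cost_diff()
below assumes.)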
The script I used for the bandwidth figures above was (roughly):
#!/usr/bin/python
# Expected per-day download size, given `ndiffs` days of diffs are kept.

def cost_diff(day, ndiffs):
    # Cost of one update when `day` days out of date: either `day` diffs
    # at 12KiB each, or the full 1.5MiB Packages.gz if too far behind.
    if day <= ndiffs:
        return 12 * 1024 * day
    else:
        return 1.5 * 1024 * 1024

def prob(d):
    # Assumed distribution of update habits: updating every d days has
    # probability (2/3)**d / 2 (roughly normalised over d = 1..30).
    return (2.0 / 3.0) ** d / 2.0

def summate(f, p):
    # Average per-day cost: someone updating every d days pays f(d) once
    # per d days, hence the / d.
    cost = 0.0
    for d in range(1, 31):
        cost += f(d) * p(d) / d
    return cost

for x in range(0, 16):
    print "%s day/s of diffs: %.1f KiB" % \
        (x, summate(lambda y: cost_diff(y, x), prob) / 1024)
I'd be interested in seeing what the rsync stats look like with the
"/ days" factor added in.
Cheers,
aj
--
Anthony Towns <aj@humbug.org.au> <http://azure.humbug.org.au/~aj/>
I don't speak for anyone save myself. GPG signed mail preferred.
``BAM! Science triumphs again!''
-- http://www.angryflower.com/vegeta.gif