
Re: Debian's problems, Debian's future



On Wed, Apr 10, 2002 at 01:26:17AM -0700, Robert Tiberius Johnson wrote:
> - I tend to update every day.  For people who update every day, the
> diff-based scheme only needs to transfer about 8K, but the
> checksum-based scheme needs to transfer 45K.  So for me, diffs are
> better. :)

I think you'll find you're also unfairly weighting this against people
who do daily updates. If you do an update once a month, it's not as much
of a bother waiting a while to download the Packages files -- you're
going to have to wait _much_ longer to download the packages themselves.

I'd suggest your formula would be better off being:

	bandwidthcost = sum( x = 1..30, prob(x) * cost(x) / x )

(If you update every day for a month, your cost isn't just one download,
it's 30 downloads. If you update once a week for a month, your cost
isn't that of a single download, it's four times that. The /x takes that
into account)
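
To make the /x correction concrete (my own illustrative arithmetic, not figures from the original comparison): a weekly updater pays four full downloads a month, but amortised per day that's much less than it first looks:

```python
# Amortised daily cost for a once-a-week updater under the current
# (no-diff) scheme, using the 1.5 MiB full Packages.gz figure below.
full = 1.5 * 1024                  # KiB, full gzipped Packages file
weekly = full / 7                  # cost(7) / 7: one download covers 7 days
print("%.1f KiB/day" % weekly)     # → 219.4 KiB/day
```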

Bandwidth cost, then, is something like "the average amount downloaded
by a testing/unstable user per day to update main".

My results are something like:
    
     0 days of diffs: 843.7 KiB (the current situation)
     1  day of diffs: 335.7 KiB
     2 days of diffs: 167.7 KiB
     3 days of diffs:  93.7 KiB
     4 days of diffs:  56.9 KiB
     5 days of diffs:  37.5 KiB
     6 days of diffs:  26.8 KiB
     7 days of diffs:  20.7 KiB
     8 days of diffs:  17.2 KiB
     9 days of diffs:  15.1 KiB
    10 days of diffs:  13.9 KiB
    11 days of diffs:  13.2 KiB
    12 days of diffs:  12.7 KiB
    13 days of diffs:  12.4 KiB
    14 days of diffs:  12.3 KiB
    15 days of diffs:  12.2 KiB

...which pretty much matches what I'd expect: at the moment, just to
update main, people download around 1.2MB per day; if we let them just
download the diff against yesterday, the average would plunge to only
a couple of hundred k, and you rapidly reach the point of diminishing
returns.
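As a sanity check on the first row of that table (my arithmetic, not part of the original posting): with prob(d) = (2/3)^d / 2 and a fixed 1.5 MiB download, the sum has a closed form, since sum over d >= 1 of x^d/d is -ln(1-x):

```python
import math

# Expected amortised daily cost with 0 days of diffs: every catch-up
# costs the full 1.5 MiB Packages file, so
#   sum( (2/3)**d / 2 / d * full ) -> full * (1/2) * ln(3)
full = 1.5 * 1024                      # KiB
baseline = full * 0.5 * math.log(3)
print("%.1f KiB" % baseline)           # → 843.7 KiB, matching the table
```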

I used figures of 1.5MiB for the standard gzipped Packages file you
download if you can't use diffs, and 12KiB for the size of each daily
diff -- if you're three days out of date, you download three diffs and
apply them in order to get up to date. 12KiB is the average size of
the daily bzip2'ed ed-style (diff --ed) diffs over the last month for
sid/main/i386.
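
For the curious, here's a minimal sketch (mine, not anything the archive actually runs) of how ed-style diffs get applied -- in practice you'd just pipe them through ed or `patch --ed`. The key property is that `diff -e` emits its commands from the bottom of the file upwards, so applying them in order never invalidates later line numbers:

```python
import re

def apply_ed_script(lines, script):
    """Apply an ed-style diff (as produced by `diff -e`) to a list of lines."""
    out = list(lines)
    i = 0
    while i < len(script):
        m = re.match(r"(\d+)(?:,(\d+))?([acd])$", script[i])
        i += 1
        start, end, op = int(m.group(1)), int(m.group(2) or m.group(1)), m.group(3)
        body = []
        if op in "ac":                 # 'a' and 'c' read text up to a lone '.'
            while script[i] != ".":
                body.append(script[i])
                i += 1
            i += 1                     # skip the terminating '.'
        if op == "c":                  # change: replace lines start..end
            out[start - 1:end] = body
        elif op == "d":                # delete lines start..end
            del out[start - 1:end]
        else:                          # append new lines after line `start`
            out[start:start] = body
    return out

old = ["alpha", "beta", "gamma", "delta"]
# ed script: delete line 4, then change line 2 (note the bottom-up order)
print(apply_ed_script(old, ["4d", "2c", "BETA", "."]))
# → ['alpha', 'BETA', 'gamma']
```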

The script I used for the above was (roughly):

#!/usr/bin/python3

# Cost (in bytes) to get up to date if your last update was `day` days
# ago, given that `ndiffs` days of diffs are kept on the server.
def cost_diff(day, ndiffs):
        if day <= ndiffs:
                return 12 * 1024 * day          # one 12 KiB diff per day behind
        else:
                return 1.5 * 1024 * 1024        # fall back to the full Packages file

# Probability that a given user's update interval is `d` days
# (geometric; sums to 1 over d = 1..infinity).
def prob(d):
        return (2.0 / 3.0) ** d / 2.0

# Expected amortised download per day: the cost of a catch-up, divided
# by the number of days it covers, weighted by that interval's probability.
def summate(f, p):
        return sum(f(d) * p(d) / d for d in range(1, 31))

for x in range(0, 16):
        print("%s day/s of diffs: %.1f KiB" %
              (x, summate(lambda y: cost_diff(y, x), prob) / 1024))


I'd be interested in seeing what the rsync stats look like with the
"/ days" factor added in.

Cheers,
aj

-- 
Anthony Towns <aj@humbug.org.au> <http://azure.humbug.org.au/~aj/>
I don't speak for anyone save myself. GPG signed mail preferred.

     ``BAM! Science triumphs again!'' 
                    -- http://www.angryflower.com/vegeta.gif
