Re: Debian's problems, Debian's future
On Wed, 2002-04-10 at 04:35, Michael Bramer wrote:
> > Scheme Disk space Bandwidth
> > -----------------------------------------------------------
> > Checksums (bwidth optimal) 26K 81K
> > diffs (4 days) 32K 331K
> > diffs (9 days) 71K 66K
> > diffs (20 days) 159K 27K
>
> can you explain your counts?
Sure. At the end of this message is a script that you can use with the
program "gp" to recreate my numbers. Here's a quick description:
Anthony Towns said that the average size of a diff between two Packages
files is 12K (after compression with bzip2). So if the server keeps d
days of diffs, this will take about d*12K of disk space. If I go for i
days without updating, then when I do update, if i <= d, then I will
need to download i diff files, using about i*(12K + 1K) bandwidth. (The
1K is for each GET request, since I'm downloading i files). If i > d,
then I need to get the whole Packages.gz file, which I estimate as about
1.6M. So let bwidth(d,i) = the amount of bandwidth used doing an update
after i days, where the server has kept d days of diffs.
So how much bandwidth is used _on average_? Well, it depends on how
often everybody updates. If everybody updates everyday, then everybody
would just need to download 1 diff, using 13K. If everybody updates
every week, then the average bandwidth is 7*13K=91K. In reality, we
don't know how often people update, but my guess is that people tend to
update often. So I just guessed that the probability that someone
updates after i days is prob(i)=((2/3)^i)/2. Why this formula? It
seemed good at the time. So then the average bandwidth used is
average_bwidth(d)=sum i=1,...,infinity of bwidth(d,i) * prob (i)
That's it for the diff stuff.
For the checksum scheme, the disk space required is the number of
checksums times the size of each checksum. The number of checksums is
the size of Packages.gz divided by the block size. Since the checksum
file has to be transferred to every client, the size of the checksum
file contributes to the bandwidth estimate, as well. Additionally, I
estimate that 75 packages change in debian every day (derived by looking
at debian-devel-changes in Feb. and March). Using a little probability,
I computed the average number of blocks in Packages.gz that will change
in i days. I then estimate that each of these blocks will have to be
transferred during an update, and use that to estimate the amount of
bandwidth required for an update. Then I average, as with the diff
scheme. Let me know if you think there's any problems with this..
I've been playing around recently with a more realistic (in my opinion)
user model. I now predict that the probability that a user will update
every i days is n/(i+1)^3 (n is a normalization factor). I like this
model because it predicts that if a user hasn't updated in a long time,
it'll probably be a long time before they update. This seems
intuitively correct to me. So here's some new numbers comparing the
diff scheme and the rsync scheme in this new user model. In my opinion,
diff still wins.
These numbers use prob(i)=(n/(i+1)^3)/i, so these numbers are the
average bandwidth, averaged over what the server sees. For an
explanation of this, see
http://lists.debian.org/debian-devel/2002/debian-devel-200204/msg01076.html.
Diffs:
days dspace ebwidth
-------------------------------
1 12.000K 296.80K
2 24.000K 110.70K
3 36.000K 58.800K
4 48.000K 39.100K
5 60.000K 30.000K
6 72.000K 25.300K
7 84.000K 22.600K
8 96.000K 21.000K
9 108.00K 19.900K
10 120.00K 19.200K
11 132.00K 18.700K
12 144.00K 18.400K
13 156.00K 18.100K
14 168.00K 17.900K
15 180.00K 17.800K
Checksum files:
bsize dspace ebwidth
-------------------------------
20 312.50K 315.40K
40 156.30K 161.10K
60 104.20K 111.00K
80 78.100K 86.800K
100 62.500K 73.100K
120 52.100K 64.600K
140 44.600K 59.100K
160 39.100K 55.500K
180 34.700K 53.000K
200 31.300K 51.500K
220 28.400K 50.500K
240 26.000K 50.100K
260 24.000K 50.000K
280 22.300K 50.100K
300 20.800K 50.500K
320 19.500K 51.100K
340 18.400K 51.800K
360 17.400K 52.700K
380 16.400K 53.700K
400 15.600K 54.700K
Best,
Rob
/***********************
* Info about debian
***********************/
/* The number of packages in debian */
npkgs=8000.0
/* How big is Packages[.gz] as a function of the number of packages */
compressed_bytes_per_pkg=200.0
uncompressed_bytes_per_pkg=800.0
Packagesgz_size=npkgs*compressed_bytes_per_pkg
Packages_size=npkgs*uncompressed_bytes_per_pkg
/* The average number of packages that change each day */
changes_per_day=75.0
/* The probability that a given package will change on a given day. */
p=changes_per_day/npkgs
/*****************************
* The probability that a given GET request has a If-Modified-Since
* field that is i days old.
*****************************/
/* My lame approximation */
/* tprob(i)=1.0/(i+1)^3 */
tprob(i)=(2.0/3)^i
/* A guess derived from the survey on debianplanet.org */
/*
tprob(x)=1/x^1.25
*/
/* Normalize the probability defined above */
tpn=sum(i=1,100000,tprob(i))
prob(i)=tprob(i)/tpn
/**************************
* Info about http/ftp queries
**************************/
/* The overhead of requesting a file download.
This is definitely an overestimate. */
query_size=1000.0
/***************************
* utility functions
***************************/
/* Convert bytes to KB, rounded */
toK(x)=round(10.0*x/1024)/10.0
/***************************
* Diff method
***************************/
/* From Anthony Towns: the average size of a bzip2 compressed
diff -ed between two successive Packages files is 12K. */
diff_bytes_per_change=12.0*1024/changes_per_day
/* Expected size of a compressed diff between two
Packages files seperated by s days. */
diff_size(s) = (1-(1-p)^s)*npkgs*diff_bytes_per_change
/* Amount of diskspace required to keep d days of diffs,
each diff spanning s days. */
diff_diskspace(s, d) = sum(x=1,s,diff_size(x))+(d-s)*diff_size(s)
/* Amount of bandwidth used by a user performing an
'apt-get update' after i days */
diff_bwidth(s, d, i) = if (i<=s, query_size + diff_size(i), \
if (i>d, query_size +
Packagesgz_size, \
query_size + diff_size(s) +
diff_bwidth(s,d,i-s)))
/* Average amount of bandwidth used by a user */
diff_ebwidth(s, d) =
sum(x=1,d,prob(x)*diff_bwidth(s,d,x))+sum(x=d+1,180,prob(x))*diff_bwidth(s,d,d+1)
/* Display a table of the average bandwidth used for varying numbers of
days */
diff_table(s,m) = print ("days\tdspace\t\tebwidth"); \
print ("-------------------------------"); \
for (d=1, m, printp
(d,"\t",toK(diff_diskspace(s,d)),"K\t\t", \
toK(diff_ebwidth(s,d)),"K"))
/**************************
* xdelta method
**************************/
/* Basically the same as the diff method, but with a slightly larger
bytes_per_change */
/**************************
* checksums method
**************************/
/* Number of blocks. */
csum_nblocks(bs)=Packagesgz_size/bs
/* The size of the checksum file. */
csum_diskspace(cs,bs) = csum_nblocks(bs)*cs
/* The average number of package descriptions that will live in one
block */
csum_pkgs_per_block(bs)=bs/compressed_bytes_per_pkg
/* The probability that a given block changes after s days. */
csum_block_prob_changed(bs,s)=1-(1-p)^(s*csum_pkgs_per_block(bs))
/* The expected number of changed blocks after s days */
csum_changed_blocks(bs,s)=csum_block_prob_changed(bs,s)*csum_nblocks(bs)
/* The expected amount of bandwidth used by a user doing an update after
i days. */
csum_bwidth(cs,bs,i)=query_size + csum_diskspace(cs,bs) +
csum_changed_blocks(bs,i) * bs
/* The average amount of bandwidth used by a user. */
csum_ebwidth(cs,bs)=sum(x=1,180,prob(x)*csum_bwidth(cs,bs,x))
csum_table(cs,minbs,maxbs,step)=print ("bsize\tdspace\t\tebwidth"); \
print
("-------------------------------"); \
forstep(bs=minbs, maxbs, step, \
printp (bs,"\t",
toK(csum_diskspace(cs,bs)), "K\t\t", \
toK(csum_ebwidth(cs,bs)),"K"))
/**************************
* rsync method
**************************/
--
To UNSUBSCRIBE, email to debian-devel-request@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Reply to: