[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Debian's problems, Debian's future



On Wed, 2002-04-10 at 04:35, Michael Bramer wrote:
> > Scheme                         Disk space         Bandwidth
> > -----------------------------------------------------------
> > Checksums (bwidth optimal)            26K               81K
> > diffs (4 days)                        32K              331K
> > diffs (9 days)                        71K               66K
> > diffs (20 days)                      159K               27K
> 
> can you explain your counts?

Sure.  At the end of this message is a script that you can use with the
program "gp" to recreate my numbers.  Here's a quick description:

Anthony Towns said that the average size of a diff between two Packages
files is 12K (after compression with bzip2).  So if the server keeps d
days of diffs, this will take about d*12K of disk space.  If I go for i
days without updating, then when I do update, if i <= d, then I will
need to download i diff files, using about i*(12K + 1K) bandwidth.  (The
1K is for each GET request, since I'm downloading i files).  If i > d,
then I need to get the whole Packages.gz file, which I estimate as about
1.6M.  So let bwidth(d,i) = the amount of bandwidth used doing an update
after i days, where the server has kept d days of diffs.

So how much bandwidth is used _on average_?  Well, it depends on how
often everybody updates.  If everybody updates everyday, then everybody
would just need to download 1 diff, using 13K.  If everybody updates
every week, then the average bandwidth is 7*13K=91K.  In reality, we
don't know how often people update, but my guess is that people tend to
update often.  So I just guessed that the probability that someone
updates after i days is prob(i)=((2/3)^i)/2.  Why this formula?  It
seemed good at the time.  So then the average bandwidth used is

average_bwidth(d)=sum i=1,...,infinity of bwidth(d,i) * prob (i)

That's it for the diff stuff.

For the checksum scheme, the disk space required is the number of
checksums times the size of each checksum.  The number of checksums is
the size of Packages.gz divided by the block size.  Since the checksum
file has to be transferred to every client, the size of the checksum
file contributes to the bandwidth estimate, as well.  Additionally, I
estimate that 75 packages change in debian every day (derived by looking
at debian-devel-changes in Feb. and March).  Using a little probability,
I computed the average number of blocks in Packages.gz that will change
in i days.  I then estimate that each of these blocks will have to be
transferred during an update, and use that to estimate the amount of
bandwidth required for an update.  Then I average, as with the diff
scheme.  Let me know if you think there's any problems with this..

I've been playing around recently with a more realistic (in my opinion)
user model.  I now predict that the probability that a user will update
every i days is n/(i+1)^3 (n is a normalization factor).  I like this
model because it predicts that if a user hasn't updated in a long time,
it'll probably be a long time before they update.  This seems
intuitively correct to me.  So here's some new numbers comparing the
diff scheme and the rsync scheme in this new user model.  In my opinion,
diff still wins.

These numbers use prob(i)=(n/(i+1)^3)/i, so these numbers are the
average bandwidth, averaged over what the server sees.  For an
explanation of this, see
http://lists.debian.org/debian-devel/2002/debian-devel-200204/msg01076.html.

Diffs:
days	dspace		ebwidth
-------------------------------
1	12.000K		296.80K
2	24.000K		110.70K
3	36.000K		58.800K
4	48.000K		39.100K
5	60.000K		30.000K
6	72.000K		25.300K
7	84.000K		22.600K
8	96.000K		21.000K
9	108.00K		19.900K
10	120.00K		19.200K
11	132.00K		18.700K
12	144.00K		18.400K
13	156.00K		18.100K
14	168.00K		17.900K
15	180.00K		17.800K

Checksum files:
bsize	dspace		ebwidth
-------------------------------
20	312.50K		315.40K
40	156.30K		161.10K
60	104.20K		111.00K
80	78.100K		86.800K
100	62.500K		73.100K
120	52.100K		64.600K
140	44.600K		59.100K
160	39.100K		55.500K
180	34.700K		53.000K
200	31.300K		51.500K
220	28.400K		50.500K
240	26.000K		50.100K
260	24.000K		50.000K
280	22.300K		50.100K
300	20.800K		50.500K
320	19.500K		51.100K
340	18.400K		51.800K
360	17.400K		52.700K
380	16.400K		53.700K
400	15.600K		54.700K


Best,
Rob

/***********************
 * Info about debian
 ***********************/

/* The number of packages in debian */
npkgs=8000.0

/* How big is Packages[.gz] as a function of the number of packages */
compressed_bytes_per_pkg=200.0
uncompressed_bytes_per_pkg=800.0
Packagesgz_size=npkgs*compressed_bytes_per_pkg
Packages_size=npkgs*uncompressed_bytes_per_pkg

/* The average number of packages that change each day */
changes_per_day=75.0

/* The probability that a given package will change on a given day. */
p=changes_per_day/npkgs

/*****************************
 * The probability that a given GET request has a If-Modified-Since
 * field that is i days old. 
 *****************************/

/* My lame approximation */
/* tprob(i)=1.0/(i+1)^3 */

tprob(i)=(2.0/3)^i

/* A guess derived from the survey on debianplanet.org */
/*
tprob(x)=1/x^1.25
*/

/* Normalize the probability defined above */
tpn=sum(i=1,100000,tprob(i))
prob(i)=tprob(i)/tpn

/**************************
 * Info about http/ftp queries
 **************************/

/* The overhead of requesting a file download.
   This is definitely an overestimate. */
query_size=1000.0

/***************************
 * utility functions
 ***************************/

/* Convert bytes to KB, rounded */
toK(x)=round(10.0*x/1024)/10.0

/***************************
 * Diff method 
 ***************************/

/* From Anthony Towns: the average size of a bzip2 compressed 
   diff -ed between two successive Packages files is 12K. */
diff_bytes_per_change=12.0*1024/changes_per_day

/* Expected size of a compressed diff between two
   Packages files seperated by s days. */
diff_size(s) = (1-(1-p)^s)*npkgs*diff_bytes_per_change

/* Amount of diskspace required to keep d days of diffs,
   each diff spanning s days. */
diff_diskspace(s, d) = sum(x=1,s,diff_size(x))+(d-s)*diff_size(s)

/* Amount of bandwidth used by a user performing an 
   'apt-get update' after i days */
diff_bwidth(s, d, i) = if (i<=s, query_size + diff_size(i),						\
                                 if (i>d, query_size +
Packagesgz_size,					\
                                          query_size + diff_size(s) +
diff_bwidth(s,d,i-s)))

/* Average amount of bandwidth used by a user */
diff_ebwidth(s, d) =
sum(x=1,d,prob(x)*diff_bwidth(s,d,x))+sum(x=d+1,180,prob(x))*diff_bwidth(s,d,d+1)

/* Display a table of the average bandwidth used for varying numbers of
days */
diff_table(s,m) = print ("days\tdspace\t\tebwidth");						\
                  print ("-------------------------------");					\
                  for (d=1, m, printp
(d,"\t",toK(diff_diskspace(s,d)),"K\t\t",			\
                                             
toK(diff_ebwidth(s,d)),"K"))

/**************************
 * xdelta method
 **************************/

/* Basically the same as the diff method, but with a slightly larger
bytes_per_change */

/**************************
 * checksums method
 **************************/

/* Number of blocks. */
csum_nblocks(bs)=Packagesgz_size/bs

/* The size of the checksum file. */
csum_diskspace(cs,bs) = csum_nblocks(bs)*cs

/* The average number of package descriptions that will live in one
block */
csum_pkgs_per_block(bs)=bs/compressed_bytes_per_pkg

/* The probability that a given block changes after s days. */
csum_block_prob_changed(bs,s)=1-(1-p)^(s*csum_pkgs_per_block(bs))

/* The expected number of changed blocks after s days */
csum_changed_blocks(bs,s)=csum_block_prob_changed(bs,s)*csum_nblocks(bs)

/* The expected amount of bandwidth used by a user doing an update after
   i days. */
csum_bwidth(cs,bs,i)=query_size + csum_diskspace(cs,bs) +
csum_changed_blocks(bs,i) * bs

/* The average amount of bandwidth used by a user. */
csum_ebwidth(cs,bs)=sum(x=1,180,prob(x)*csum_bwidth(cs,bs,x))

csum_table(cs,minbs,maxbs,step)=print ("bsize\tdspace\t\tebwidth");				\
                                print
("-------------------------------");			\
                                forstep(bs=minbs, maxbs, step,					\
                                        printp (bs,"\t",
toK(csum_diskspace(cs,bs)), "K\t\t",	\
                                                        
toK(csum_ebwidth(cs,bs)),"K"))

/**************************
 * rsync method
 **************************/




-- 
To UNSUBSCRIBE, email to debian-devel-request@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org



Reply to: