
Re: Rsyncable GZIP (was Re: Package metadata server)



On Sun, 2002-04-07 at 11:16, Otto Wyss wrote:
> What amazes me is that nobody is able or willing to provide any figures.
> So I guess no provider of an rsync server is interested in this subject
> and therefore it can't be a big problem. 

Here are some experiments, and a mathematical analysis of different
approaches.  The only missing piece of data that I had to fudge is: how
often do people run apt-get update?  If anyone can give me server logs
containing If-Modified-Since fields, that would be great.

Quick Summary:
------------------
Diffs compressed with bzip2 generated the smallest difference files, and
hence the smallest downloads.  Using the scheme described below, I
estimate that mirrors retaining 20 days worth of diffs will need about
159K of disk space per Packages file and the average 'apt-get update'
will transfer about 24.2K per changed Packages file.

xdelta would have slightly higher bandwidth and disk space requirements,
but would be applicable to binary files (such as debs).  rsync has no
disk space requirements, but uses over 10 times as much bandwidth and
requires more memory, cpu power, etc., on the server.  rsync also has
the advantage of already being implemented, but managing a series of
diffs is a trivial shell-scripting job.

So in my opinion, diff/bzip2 or xdelta looks like the way to go.

Example: Diffs between unstable main Packages files from Apr 6 and 7:
-----------------------
diff+bzip2: 12987 bytes
diff+gzip:  13890 bytes
xdelta:     15176 bytes
rsync:     163989 bytes (*)
(*) rsyncing uncompressed Packages files with 
    rsync -a --no-whole-file --stats -z -e ssh

The Scheme (proposed earlier by others)
----------
Assuming debian admins tend to update relatively frequently, the
following diff-based scheme seems to offer the best compromise on mirror
disk space and download bandwidth:

For the 20 most recent Packages files, the server stores the diff
between each pair of consecutive Packages files.  apt-get then simply
does:

do
{
  m = md5sum of my Packages file
  d = fetch Packages_diff_${m}.bz2
  if (d does not exist)
  {
    fetch Packages.gz
    break
  }
  patch my Packages file with d
} while (d is not an empty file)
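
For concreteness, here is a rough Python sketch of that loop.  The
mirror URL and file-name pattern are my own invention for illustration
(nothing like this exists on the mirrors today), and a real client
would of course live inside apt-get:

import bz2, hashlib, subprocess, urllib.error, urllib.request

MIRROR = "http://mirror.example.org/debian/dists/unstable/main/"

while True:
    with open("Packages", "rb") as f:
        m = hashlib.md5(f.read()).hexdigest()   # hash of what we have
    try:
        url = MIRROR + "Packages_diff_%s.bz2" % m
        d = bz2.decompress(urllib.request.urlopen(url).read())
    except urllib.error.HTTPError:
        # No diff for our version: we are too far behind, so fall back
        # to the whole file (decompressing it is left out of the sketch).
        with open("Packages.gz", "wb") as f:
            f.write(urllib.request.urlopen(MIRROR + "Packages.gz").read())
        break
    if not d:                   # empty diff means we are up to date
        break
    # Apply the diff to our Packages file with patch(1).
    subprocess.run(["patch", "Packages"], input=d, check=True)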

This scheme easily allows mirrors to tweak the parameters to best suit
their own disk space and bandwidth limitations, and they are not
required to have any cgi-scripts or extra services running.  For
example, a mirror that's tight on disk space can just delete some of the
older diffs, but it will incur a slight bandwidth penalty as a result.

The only disadvantage to using diffs (compared to rsync or some other
dynamic scheme) is the additional disk space requirement.  The disk
space requirement is very small, and disk space is cheaper than cpu
time, memory, and bandwidth.

Analysis:
---------

The analysis uses gp, a great math tool that's available in debian.

I. Diff vs. xdelta
------------------
By looking at debian-devel-changes, I figured that between Feb. 1 and
April 1, an average of 75 packages were uploaded each day.  There are
around 8000 packages listed in testing main, so the probability that any
given package changes on any given day is p=75/8000.  Thus the expected
number of packages that change in s days is (1-(1-p)^s)*8000.  For
example, the expected number of changed packages in 60 days is 3453. 
Comparing the Packages files from Feb. 7 and April 7 shows 3884 changed
packages, so the model seems reasonably accurate.

My experiments with diff, xdelta, bzip2, etc. concluded that if you diff
two Packages files with 75 changed packages between them, and then
compress the diff with bzip2 -9, the resulting file is about 7936 bytes,
or roughly 106 bytes per changed package.  Thus the average size of a
compressed diff between Packages files separated by s days is

diffsize(s)=(1-(1-p)^s)*8000*106

The xdelta of the same files is about 25% larger.  If this scheme is
extended to all deb files, not just Packages files, though, it may be
more convenient to use xdelta throughout.
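
For readers without gp handy, all of section I fits in a few lines of
Python (a sketch, using exactly the constants derived above):

p = 75 / 8000.0        # chance a given package changes on a given day
npkgs = 8000

def changed(s):        # expected number of packages changed in s days
    return (1 - (1 - p) ** s) * npkgs

def diffsize(s):       # expected bytes of bzip2'd diff spanning s days
    return changed(s) * 106

print(changed(60))     # ~3453, vs. the 3884 observed Feb 7 - Apr 7
print(diffsize(1))     # 7950 bytes for a one-day diff (measured: 7936)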

II. Successive diffs vs. all-at-once diffs
------------------------------------------
This analysis applies to either diff or xdelta; it doesn't matter.

A. Disk space
The next question is, should we diff consecutive Packages files, or
should we compute diffs between the last 20 Packages files and today's
Packages file?  The latter will allow apt-get to fetch just one diff and
be done with it.  However, it uses more disk space on the mirrors.  The
former may require apt-get to fetch several patches in order to update
its Packages file, but will use less disk space on the mirrors.

There is actually a spectrum of choices here.  A mirror may store diffs
between every 3, 4, 5, etc. Packages files.  So, if a client has the
Packages file from 14 days ago, it will first be given a patch bringing
its Packages file to 9 days ago, then 4, and then to the current
Packages file.  If a server stores diffs between Packages files
separated by s days, stored d days back, then it will need

dspace(s,d)=sum(x=1,s,diffsize(x))+(d-s)*diffsize(s)

bytes of disk space.  So, for example, to store 20 days of consecutive
patches, the server needs 159K.  For 20 days of all-at-once diffs, it
needs 1.6M.  For 5-days-at-a-time diffs, it needs 702K.
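
Again in Python, reproducing those three numbers (diffsize as in the
earlier sketch):

p = 75 / 8000.0

def diffsize(s):       # as above
    return (1 - (1 - p) ** s) * 8000 * 106

def dspace(s, d):
    # The first s diffs span 1..s days of changes; the remaining
    # d-s diffs each span s days.
    return sum(diffsize(x) for x in range(1, s + 1)) + (d - s) * diffsize(s)

for s in (1, 20, 5):
    print(s, round(dspace(s, 20) / 1000))   # -> 159, ~1575, ~703 (K)

The 1575K and 703K agree with the 1.6M and 702K above up to rounding.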

B. Bandwidth
Consecutive patches save disk space, but clients have to download more
patches, requiring more bandwidth.  If a user performs an update after
having gone i days without an update, then the amount of bandwidth she
uses is

bwidth(s,d,i)=if(i<=s,1+diffsize(i),if(i>d,npkgs*200,1+diffsize(s)+bwidth(s,d,i-s)))

(Here npkgs=8000 is the number of packages, so npkgs*200 approximates
the cost of downloading the whole Packages file, and I charge the user
a small fixed overhead for each separate file she has to download.)

So in order to figure out how much bandwidth each scheme requires on
average, we need a model of how frequently users perform apt-get
update.  I chose a simple geometric model, where users update every i
days with probability (2/3)^i / 2.  I would be very grateful if some
mirror maintainers could get me some more accurate statistics on this. 
Anyway, using this crude model, the expected bandwidth required by any
user is 

ebwidth(s,d)=sum(x=1,d,prob(x)*bwidth(s,d,x))+(1-sum(x=1,d,prob(x)))*bwidth(s,d,d+1)

So for the consecutive-patches scheme, the average bandwidth used by
each user is 24.2K.  For the all-at-once patch scheme, the average
bandwidth is 23.8K.  This is a minuscule bandwidth reduction at the cost
of explosive growth in disk space, so the consecutive-patches scheme
seems best.
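
Putting the whole bandwidth model in Python (same formulas and
constants; the printed figures match the ones above up to rounding):

p = 75 / 8000.0
npkgs = 8000

def diffsize(s):
    return (1 - (1 - p) ** s) * npkgs * 106

def prob(i):           # chance a user waits exactly i days between updates
    return (2.0 / 3.0) ** i / 2

def bwidth(s, d, i):
    if i <= s:
        return 1 + diffsize(i)
    if i > d:          # fell off the end: full Packages download
        return npkgs * 200
    return 1 + diffsize(s) + bwidth(s, d, i - s)

def ebwidth(s, d):
    covered = sum(prob(x) * bwidth(s, d, x) for x in range(1, d + 1))
    tail = (1 - sum(prob(x) for x in range(1, d + 1))) * bwidth(s, d, d + 1)
    return covered + tail

print(ebwidth(1, 20))    # ~24279 bytes: consecutive patches (the 24.2K)
print(ebwidth(20, 20))   # ~23843 bytes: all-at-once patches (the 23.8K)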

This analysis really depends on the model for how frequently users
perform apt-get update.  If I got that wrong, then the numbers are way
off.  So please, if any mirror maintainers can send me logs of the
If-Modified-Since fields in Packages GET requests, that would be very
helpful.  Thanks.

III. Rsync
----------
I used rsync to synchronize the unstable main Packages files from April
6 and April 7.  There were a large number of changed entries between
these two: about 190 changes.  The bzip2 compressed diff was under 13K. 
Using 'rsync -a --no-whole-file --stats -z -e ssh' to synchronize the
uncompressed Packages files, the total amount of traffic was 164K. 
Clearly, rsync is much less efficient for this application.

IV. Other factors
------------------
Since rsync negotiates the diff on the fly, the amount of bandwidth it
uses will increase both as the number of packages in debian grows and as
the number of changing packages in debian grows.  The size of a diff
depends only on the number of changed packages, not on the size of the
Packages file, so the diff scheme's advantage should increase over time.

The consecutive patches scheme requires producing only one new patch for
every Packages file, whereas the all-at-once scheme requires producing a
new patch against every retained version (20 in the scheme above), so
mirroring will be easier in the consecutive patch scheme.  Mirrors can
then merge these patches if they wish to tweak the setup for their
bandwidth and disk space constraints.
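
For what it's worth, the mirror-side step could look something like
this (file names hypothetical; an equivalent shell script would do just
as well):

import bz2, hashlib, subprocess

def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def publish_diff(old_pkgs, new_pkgs):
    # Store the bzip2'd diff under the md5sum of the *old* file, so a
    # client finds the next patch by hashing the file it already has.
    # (diff exits 1 when the files differ, so no check=True here.)
    diff = subprocess.run(["diff", old_pkgs, new_pkgs],
                          capture_output=True).stdout
    with open("Packages_diff_%s.bz2" % md5_of(old_pkgs), "wb") as f:
        f.write(bz2.compress(diff, 9))
    # An empty diff stored under the *new* file's hash tells clients
    # holding the current file that they are up to date.
    with open("Packages_diff_%s.bz2" % md5_of(new_pkgs), "wb") as f:
        f.write(bz2.compress(b"", 9))

publish_diff("Packages.yesterday", "Packages.today")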

The diff scheme requires no new services to run on the mirrors.  No
cgi-scripts, no rsync, nothing.  And it works over any transport which
already works with apt-get.

Some have suggested splitting up the Packages file.  I haven't done any
experiments on this approach, but my guess is that fetching lots of
little files will incur way too much overhead.  After all, 24K is hard
to beat.

If debian started producing xdeltas between successive versions of deb
files, then the bandwidth demand on mirrors could probably be reduced by
a factor of 50.  Not as many diffs are required for debs, since they
change more slowly than the Packages file, so I estimate that storing
diffs for all the packages in debian would increase the disk space
required on a mirror by less than 5% (probably much less).  A 50-fold
cut in bandwidth for a 5% increase in disk usage is probably a win.

Finally, the diffs really are small.  For example, including
mozilla in debian is a much greater burden on mirror sites than adopting
this patch policy would be, but nobody worries about mirror disk space
when adding new software to debian.

Thanks for the greatest operating system on earth.  I hope my
experiments and analysis are helpful.  Please let me know if you have
any questions.

Best,
Rob

P.S. I know this has to wait till after the release.  I just thought it
was a cool thread.


