
New method for Packages/Sources file updates



Hi,

I know the idea of better ways to update the Packages/Sources files
comes up frequently, but below I present a simple and efficient way of
achieving reduced traffic (<1% of today's), reduced server CPU load
(due to fewer downloads), and minimal extra disk space, while staying
compatible with the existing infrastructure.


What is the problem anyway?
---------------------------

The Packages and Sources files for Debian are huge and constantly
growing. The main Packages file is now 13 MB (3.3 MB gzipped, 2.5 MB
bzip2ed) and must be downloaded completely up to once a day (depending
on how often you update).

On the other hand, only a fraction of the contents of these files
changes each day.



How to download only changes?
-----------------------------

In the past people have suggested using rsync or diff files [Eduard
Bloch is providing some] to download only the changes. But rsync is
quite CPU intensive and rather ineffective on gzip/bzip2 files, and
diff files need a lot of space to provide patches over a longer
timeframe (one per day) and a lot of CPU time to generate.

My idea is quite simple:

1. sort the Packages(+) file by date, newest entries first
   
2. provide a list of packages that have been removed completely from
   the Packages file, newest removal first


Updating the Packages file then consists of 3 steps:

A) download all new entries for the Packages file

   Since the Packages file is sorted by date, the client just downloads
   the file until it hits an entry it already has in its existing
   file. At that point the client can stop downloading and prepend the
   new data to the old file. Note that the client can download the
   gzipped file and gunzip it on the fly.

   This already gives a fully functional Packages file, but it now
   contains some obsolete entries (which don't really harm anyone).
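Step A can be sketched in Python. This is only an illustration, not the
author's program: it simulates the stream locally instead of over HTTP,
and the stanza contents and function names are made up. The point is
that gunzipping incrementally lets the client stop as soon as a known
stanza appears in the decompressed output.

```python
import gzip
import zlib

def download_new_entries(gz_bytes, known_stanza, chunk_size=4096):
    """Decompress a newest-first Packages.gz stream chunk by chunk and
    stop as soon as a stanza we already have shows up.  Returns the new
    prefix (everything that appeared before the known stanza)."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+ accepts the gzip wrapper
    buf = b""
    for i in range(0, len(gz_bytes), chunk_size):
        buf += d.decompress(gz_bytes[i:i + chunk_size])
        pos = buf.find(known_stanza)
        if pos != -1:
            return buf[:pos]      # a real client would abort the download here
    return buf                    # nothing known found: the whole file is new

# Toy newest-first Packages file: two new stanzas, then older ones.
new = b"Package: foo\nVersion: 2.0\n\nPackage: bar\nVersion: 1.1\n\n"
old = b"Package: baz\nVersion: 0.9\n\nPackage: qux\nVersion: 3.0\n\n"
gz = gzip.compress(new + old)

prefix = download_new_entries(gz, b"Package: baz\nVersion: 0.9\n\n")
assert prefix == new   # only the new entries were "downloaded"
```

Searching the accumulated buffer (rather than each chunk) also handles a
known stanza that straddles a chunk boundary.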

B) remove entries that have been superseded by newer versions

   The package names in a Packages file are unique [excluding
   snapshots.debian.net here]. The client can remove all but the
   highest version of each package, which eliminates most obsolete
   entries.

   This gives nearly the right Packages file, but there are still
   entries left for packages that have been removed completely.
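A sketch of step B, again illustrative only: because the merged file is
sorted newest first, the first stanza seen for each package name is the
highest version, so no Debian version comparison is needed at all.

```python
def drop_superseded(packages_text):
    """Keep only the first (newest) stanza for each package name.
    Assumes the file is sorted newest entries first, so the first
    occurrence of a name is the highest version."""
    seen = set()
    kept = []
    for stanza in packages_text.split("\n\n"):
        if not stanza.strip():
            continue
        name = next(line for line in stanza.splitlines()
                    if line.startswith("Package: "))
        if name not in seen:
            seen.add(name)
            kept.append(stanza)
    return "\n\n".join(kept) + "\n\n"

text = ("Package: foo\nVersion: 2.0\n\n"
        "Package: bar\nVersion: 1.1\n\n"
        "Package: foo\nVersion: 1.0\n\n")   # obsolete foo 1.0 at the end
assert drop_superseded(text) == ("Package: foo\nVersion: 2.0\n\n"
                                 "Package: bar\nVersion: 1.1\n\n")
```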

C) download the list of removed packages (again only until you hit one
   you already have in the old removals file)

   For every new entry in the removals file, remove the corresponding
   entry from the Packages file. (If a package is re-added later it
   will be added again in step A and not removed here, since its
   removal entry isn't new. The package has to be removed from the
   removals file on the server, but not necessarily on the client.)

   Now we should have a Packages file identical to the one on the
   server. If not, something went horribly wrong.
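Step C is the simplest of the three; a hypothetical sketch (stanza
format and names made up as before), taking the new removal entries as
a list of package names:

```python
def stanza_name(stanza):
    """Extract the package name from an RFC-822-style stanza."""
    for line in stanza.splitlines():
        if line.startswith("Package: "):
            return line[len("Package: "):]
    return None

def apply_removals(packages_text, new_removals):
    """Delete every stanza whose package name appears in the list of
    new removals (names removed from the archive since our last update)."""
    gone = set(new_removals)
    kept = [s for s in packages_text.split("\n\n")
            if s.strip() and stanza_name(s) not in gone]
    return "\n\n".join(kept) + "\n\n"

text = ("Package: foo\nVersion: 2.0\n\n"
        "Package: bar\nVersion: 1.1\n\n")
assert apply_removals(text, ["bar"]) == "Package: foo\nVersion: 2.0\n\n"
```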


As you can see the steps for updating a Packages file are quite
simple. All it needs is a small change in the generation of the
Packages file and a tiny extra file for removals.

I also wrote a little program that takes yesterday's Packages.by.date
and Packages.removals files and today's Packages file and creates
today's Packages.by.date and Packages.removals. It takes about 1 s to
run on my amd64.
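The server-side regeneration could look roughly like this. To be clear,
this is my own reconstruction of what such a program might do, not the
actual one: new or changed stanzas go to the front of Packages.by.date,
unchanged stanzas keep their old order, names that vanished entirely are
prepended to the removals list, and re-added names are dropped from it.

```python
def regenerate(today_packages, yesterday_by_date, yesterday_removals):
    """Server side: build today's Packages.by.date and removals list
    from yesterday's helper files and today's plain Packages file."""
    def stanzas(text):
        return [s for s in text.split("\n\n") if s.strip()]

    def name(stanza):
        return next(l[len("Package: "):] for l in stanza.splitlines()
                    if l.startswith("Package: "))

    today = stanzas(today_packages)
    old = stanzas(yesterday_by_date)
    today_set, old_set = set(today), set(old)
    today_names = {name(s) for s in today}

    # New or changed stanzas first; unchanged ones keep their by-date order.
    by_date = ([s for s in today if s not in old_set] +
               [s for s in old if s in today_set])

    # Prepend names that vanished entirely; drop re-added names from
    # yesterday's list (as required for the server copy in step C).
    removals = ([name(s) for s in old if name(s) not in today_names] +
                [n for n in yesterday_removals if n not in today_names])

    return "\n\n".join(by_date) + "\n\n", removals

today = "Package: foo\nVersion: 2.0\n\nPackage: new\nVersion: 1.0\n\n"
ybd   = "Package: foo\nVersion: 1.0\n\nPackage: gone\nVersion: 0.5\n\n"
by_date, removals = regenerate(today, ybd, [])
assert removals == ["gone"]
assert by_date.startswith("Package: foo\nVersion: 2.0")
```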


I created sorted Packages files and removal files for all of October
and measured the changes. Here are the stats (*):

Date          Packages  removals      Packages  removals
                 daily update         last update on Oct 1st
2004.10.01      0       0               0       0
2004.10.02      36674   0               36708   0
2004.10.03      47192   25              83900   25
2004.10.04      19720   0               103620  25
2004.10.05      18573   0               122193  25
2004.10.06      17208   0               139401  25
2004.10.07      27870   0               167271  25
2004.10.08      26384   0               193655  25
2004.10.09      47707   0               241362  25
2004.10.10      25272   0               266634  25
2004.10.11      17835   221             284469  246
2004.10.12      18931   0               303400  246
2004.10.13      30634   0               334034  246
2004.10.14      21711   0               355745  246
2004.10.15      40805   52              396550  298
2004.10.16      0       0               396550  298
2004.10.17      33666   0               430216  298
2004.10.18      19882   0               450098  298
2004.10.19      18834   0               468932  298
2004.10.20      13869   0               482801  298
2004.10.21      22085   0               504886  298
2004.10.22      11224   1276            516110  1574
2004.10.23      19286   0               535396  1574
2004.10.24      8384    0               543780  1574
2004.10.25      56649   1986            600429  3560
2004.10.26      14095   0               614524  3560
2004.10.27      19438   0               633962  3560
2004.10.28      38879   0               672841  3560
2004.10.29      10600   0               683441  3560
2004.10.30      0       0               683441  3560
2004.10.31      23859   0               707300  3560


As you can see from those numbers, even a monthly update needs <25% of
the download of the full file, at the cost of a removals file a few KB
in size. On average that is ~25 KB of download per day, or <1% of the
current traffic.



Summary:
--------

The above idea achieves all the goals for a better Packages update:

- compatible with the existing method
- minimal server load to generate the helper files (a few extra
  seconds once a day on master won't hurt)
- minimal space requirements (only 3.5 KB extra for all of October)
- no load on the mirrors (no CPU load and no disk space to speak of)
- a simple, understandable algorithm
- updates at irregular intervals, even over months


Comments, suggestions, a million dollar welcome.

MfG
        Goswin


--
(+) From here on Packages means Packages or Sources

(*) All numbers are in bytes. The first column is the date, obviously.
The 2nd and 3rd columns say how much you need to download if you update
daily. The 4th and 5th columns give the numbers if your last update had
been on Oct 1st. Add a small overhead for the on-the-fly gunzip
buffering and the overlap needed to detect an existing entry.


