
RFC: Checking for updates



(I previously posted something like this in a comment at debianplanet)

It is possible to create a tool that checks for updated packages while
downloading only a minimal amount of data: guaranteed to be less than
90kB for sid in its current state, and more likely only about 45kB
(est).

The benefit would be that, instead of downloading a new 1MB (or
whatever) Packages.gz file just to see if anything you want has been
updated, you could first download the small package index files and
check whether anything you want has changed. Only if there is something
you want to install would you then do a traditional download, fetching
the big Packages.gz file that you avoided downloading previously.

So it would save bandwidth in the cases where a Packages.gz was
downloaded but wasn't used for anything.

All that is needed to check for new packages is a list of each package's
name, version and revision. Specifically, we don't need to download the
descriptions/dependencies/other fields of a package just to see if
it's new.
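
As a rough illustration (not the busybox-derived code mentioned later),
pulling just those three values out of an uncompressed Packages file
could look something like this. The function names are made up for the
example, and it assumes the revision is whatever follows the last hyphen
in the Version field, with any epoch left attached to the version.

```python
def split_version(full):
    """Split a Version field into (version, revision).

    The revision is taken to be whatever follows the last hyphen;
    a version without a hyphen gets an empty revision.
    """
    version, sep, revision = full.rpartition("-")
    if not sep:
        return full, ""
    return version, revision


def parse_packages(text):
    """Yield (name, version, revision) for each stanza in Packages text."""
    for stanza in text.split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Continuation lines start with whitespace; they are skipped
            # because only the Package and Version fields matter here.
            if line[:1] in (" ", "\t") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Package" in fields and "Version" in fields:
            yield (fields["Package"],) + split_version(fields["Version"])
```

Everything else in the stanza (Description, Depends and so on) is simply
ignored.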

So, to store information about a current release we need 4
tables/files.

(all based on current sid)
A file with just unique package names, there are 8233 unique package
names, which amounts to
94074 Bytes uncompressed
35644 Bytes compressed with bzip2 -9
41332 Bytes compressed with gzip -9

A second file with unique versions, there are 2150 unique versions,
which amounts to
16390 Bytes uncompressed
6460 Bytes compressed with bzip2 -9
7052 Bytes compressed with gzip -9

A third file with unique revisions, there are 111 unique revisions,
which amounts to
564 Bytes uncompressed
357 Bytes compressed with bzip2 -9
362 Bytes compressed with gzip -9

As you can see there is much duplication of data in version and
revision. By storing the data like this we can reduce each
name/version/revision to a number (its position in the file).
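
A minimal sketch of that idea, assuming each table is kept sorted so
that a value's number is just its position. The sorting and the helper
names are assumptions for the example, not something the existing code
necessarily does.

```python
def build_tables(entries):
    """Build sorted tables of unique names, versions and revisions.

    entries is a list of (name, version, revision) tuples.
    """
    names = sorted({name for name, _, _ in entries})
    versions = sorted({version for _, version, _ in entries})
    revisions = sorted({revision for _, _, revision in entries})
    return names, versions, revisions


def numbering(table):
    """Map each value in a table to its position (its number)."""
    return {value: number for number, value in enumerate(table)}
```

For example, three packages that share version "1.0" produce a versions
table with a single entry, and all three refer to it as number 0.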

These tables won't be changing all the time. The names table will change
every time a NEW package is added or an existing package is removed, but
not when a package is updated. How often the names table changes would
vary greatly depending on which dist you are running: stable shouldn't
change, while unstable and testing I guess would change every few days,
or at least every week.

Because there is so much duplication of version and revision numbers,
it's likely that a new version string will already be in its table, so
these tables would change less often than the names table; as a total
guess, every few weeks for sid.

To pull these three tables together we need a fourth table, the packages
table, which holds the entry numbers of each package's name, version and
revision.

I haven't correctly generated this file yet, but it will need exactly 5
bytes for each package entry: 2 bytes for the name number, 2 for the
version number and 1 for the revision number.
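
A sketch of what that 5-byte record could look like. The big-endian byte
order and the function names are assumptions for the example; the layout
(2 + 2 + 1 bytes) is the one described above.

```python
import struct

# Two unsigned 16-bit numbers and one unsigned 8-bit number, with no
# padding: 5 bytes per entry. Big-endian is an assumed choice.
RECORD = struct.Struct(">HHB")


def pack_table(entries, names, versions, revisions):
    """Encode (name, version, revision) tuples as 5-byte records."""
    name_no = {value: i for i, value in enumerate(names)}
    version_no = {value: i for i, value in enumerate(versions)}
    revision_no = {value: i for i, value in enumerate(revisions)}
    return b"".join(
        RECORD.pack(name_no[n], version_no[v], revision_no[r])
        for n, v, r in entries
    )


def unpack_table(blob, names, versions, revisions):
    """Rebuild the (name, version, revision) tuples from the records."""
    return [
        (names[n], versions[v], revisions[r])
        for n, v, r in RECORD.iter_unpack(blob)
    ]
```

With this layout the names and versions tables can hold at most 65536
entries each and the revisions table at most 256, which fits the counts
given above (8233, 2150 and 111).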

So we need a minimum of 5 x (approx.) 8000 == 40kB (uncompressed) to
represent the package status of sid, comfortably within the 45kB
estimated above.
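
As a quick check of the totals, using only the figures quoted above and
assuming the 90kB worst case means fetching the packages table plus all
three gzip-compressed tables at once:

```python
# Sizes in bytes, taken from the figures in this mail.
ENTRY_SIZE = 5
PACKAGE_COUNT = 8233

package_table = ENTRY_SIZE * PACKAGE_COUNT

gzip_tables = 41332 + 7052 + 362    # names + versions + revisions
bzip2_tables = 35644 + 6460 + 357   # the same three tables with bzip2

worst_case_gzip = package_table + gzip_tables
worst_case_bzip2 = package_table + bzip2_tables

print(package_table)      # 41165
print(worst_case_gzip)    # 89911
print(worst_case_bzip2)   # 83626
```

The md5sum header discussed further on would add a few dozen bytes on
top of these totals.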

On top of that we could do a binary diff using xdelta (it's a package)
to represent changes between successive versions of the tables.

The md5sum of each of the three dependent tables would have to be stored
in the header of the package table to prevent them getting out of sync.
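
One possible shape for that header, purely as a sketch: three raw
16-byte MD5 digests, in the order names, versions, revisions. The
one-entry-per-line serialisation of each table and the helper names are
assumptions made for the example.

```python
import hashlib

TABLE_LABELS = ("names", "versions", "revisions")


def table_bytes(table):
    """Serialise a table as one entry per line (an assumed format)."""
    return "\n".join(table).encode("utf-8") + b"\n"


def make_header(names, versions, revisions):
    """Concatenate the MD5 digests of the three tables (48 bytes)."""
    return b"".join(
        hashlib.md5(table_bytes(table)).digest()
        for table in (names, versions, revisions)
    )


def stale_tables(header, names, versions, revisions):
    """Return the labels of local tables whose digest does not match."""
    stale = []
    tables = (names, versions, revisions)
    for i, (label, table) in enumerate(zip(TABLE_LABELS, tables)):
        expected = header[16 * i:16 * (i + 1)]
        if hashlib.md5(table_bytes(table)).digest() != expected:
            stale.append(label)
    return stale
```

A client would call something like stale_tables() with the header from a
freshly downloaded packages table and its local copies of the three
tables, and fetch only the ones that come back as stale.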

So to update you would have to download the package table, then update
the name, version or revision table if necessary. Full package names
and versions could then be rebuilt and compared against the current
available file. You could then be presented with a list of packages that
have been updated; if any interest you, then you could do a traditional
update.
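
The comparison step might be sketched like this, assuming the rebuilt
remote entries are (name, version, revision) tuples and the local
available file has already been parsed into a dict. Note that this only
reports entries that differ; it does not decide which version is newer,
which would need proper Debian version ordering (as done by
dpkg --compare-versions).

```python
def find_changes(remote_entries, local):
    """Compare rebuilt remote entries against the local state.

    remote_entries: iterable of (name, version, revision) tuples.
    local: dict mapping name -> (version, revision).
    Returns (new, changed), each a sorted list of package names.
    """
    new, changed = [], []
    for name, version, revision in remote_entries:
        if name not in local:
            new.append(name)
        elif local[name] != (version, revision):
            changed.append(name)
    return sorted(new), sorted(changed)
```

If both lists come back empty, the big Packages.gz download can be
skipped entirely.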

It could be extended to hand out individual package descriptions and
rebuild the available file to keep that in sync as well, but that's
looking a bit far ahead at this stage.

I spent a couple of days on it, and have the code that generates the
four files above (but the package file is buggy). I need to work on
another project and get it done before I come back to this. If anyone
wants the code I have started, let me know; if not, I will get back to
this idea sometime in the distant future.

Much of the code is derived from busybox (which is the other project I
need to work on).



Glenn


