organisation of our metadata
I don't think we are handling our metadata as efficiently as we could.
I've attempted to describe the way we currently handle our metadata here: http://people.debian.org/~bug1/metadata.htm
It's mostly background information that most d-d would be aware of, and I'm sure it isn't entirely accurate, so take it with a grain of salt.
The problems with our metadata are evident in the size of Packages.gz. I realise rsyncable gzip would improve the current situation, but I think that ignores the real cause of the problem.
The problem is our metadata is fragmented and duplicated.
The worst offender is the control file from control.tar.gz. It exists:
- In the /var/lib/dpkg/status file, if the package is installed (1)
- In the /var/lib/dpkg/available file (I think apt-get uses an alternative to this file)
- In possibly multiple /var/lib/apt/lists/*Packages.gz files
- In the binary package
- In the source diff (or the .tar.gz for a Debian native package)
The control file is the most complete, but it doesn't have all the metadata in it. There are pieces of metadata in the copyright file (poorly formatted), the changelog (poorly formatted), the .changes file (discarded after use) and the .dsc file, but they are harder to get to.
If we put _all_ static metadata into a separate file for each package and make that file generally available, then we could avoid duplicating it, e.g.
- The /var/lib/dpkg/status file would only need the Package:, Version: and Status: fields. It would be reduced to the essential information, dropping anything that can easily be replaced, though again see (1).
- The /var/lib/dpkg/available file would just be an index listing the package names and versions.
- Packages.gz files could also be just an index of package names and versions.
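To make the split concrete, here is a rough sketch in Python. The Package, Version and Status field names come from the list above; the sample record, its values and the function names are invented purely for illustration, and this is not how dpkg or apt actually store anything.

```python
# Hypothetical sample record; the values are invented for illustration.
FULL_RECORD = {
    "Package": "hello",
    "Version": "1.0-1",
    "Status": "install ok installed",
    "Depends": "libc6",
    "Description": "example package",
}

INDEX_FIELDS = ("Package", "Version")
STATUS_FIELDS = ("Package", "Version", "Status")


def pick(record, wanted):
    """Return a copy of record reduced to the wanted fields, in order."""
    return {name: record[name] for name in wanted if name in record}


def split_record(record):
    """Split one full record into the three reduced views.

    index    -> what available / Packages.gz would list
    status   -> what the status file would keep
    metadata -> the static per-package file (everything except Status)
    """
    index = pick(record, INDEX_FIELDS)
    status = pick(record, STATUS_FIELDS)
    metadata = {k: v for k, v in record.items() if k != "Status"}
    return index, status, metadata


index, status, metadata = split_record(FULL_RECORD)
print(index)
print(status)
print(sorted(metadata))
```

Only the metadata part is static, so it is the only part that would need to live in the separate per-package file.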
These separate package metadata files could be downloaded separately, which avoids downloading the same information for each entry that is common to your multiple _*_Packages.gz files.
Downloading 9000 separate files instead of a single 1.6MB file could be a burden on servers initially, but on an update only NEW files would be downloaded. It would reduce bandwidth substantially at the expense of more open connections.
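As a sketch of what "only NEW files" could mean on the client side: compare the old index with the new one and fetch metadata only for entries whose version changed or that did not exist before. This is Python, and the function names, the name -> version dictionary layout and the sample versions are all my own assumptions, not an existing tool.

```python
def files_to_fetch(old_index, new_index):
    """Return the (package, version) pairs whose metadata is not cached yet.

    Both arguments map package name -> version, as an index-only
    Packages file would.
    """
    return sorted(
        (name, version)
        for name, version in new_index.items()
        if old_index.get(name) != version
    )


def stale_files(old_index, new_index):
    """Return cached (package, version) pairs that are no longer listed."""
    return sorted(
        (name, version)
        for name, version in old_index.items()
        if new_index.get(name) != version
    )


# Invented sample data: one upgrade, one removal, one new package.
old = {"bash": "2.05-1", "dpkg": "1.9.0", "hello": "1.0-1"}
new = {"bash": "2.05-2", "dpkg": "1.9.0", "sed": "3.02-8"}

print(files_to_fetch(old, new))
print(stale_files(old, new))
```

In this example dpkg is unchanged, so nothing is fetched for it; only bash and sed would cause a download.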
Also worth noting: this way the metadata for the ENTIRE distribution doesn't need to be downloaded. The debian installer could, for example, download just the metadata for essential and base packages. Handling smaller amounts of metadata would reduce memory requirements for the installer.
A possible location for all the metadata would be the .dsc file. Increasing its size wouldn't be a problem, as it's only used for source packages at the moment and doesn't hold much metadata. Using it for both binary and source packages could get messy, though. It would have to be appended with extra information after every autobuild, which would require the revision number to change. To ensure that the .dsc revision number didn't get out of sync with the other package components we could use a sub-revision number, i.e. -1.1 and -1.2 would both relate to revision 1 of the source and binary packages.
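A small Python sketch of how such sub-revisions might be handled. It assumes a revision is either a plain number ("1") or number.sub ("1.2"); that restriction, and all the function names, are my own simplification for illustration and not an existing Debian versioning rule.

```python
def split_revision(revision):
    """Split '1.2' into (base, sub) = ('1', 2); plain '1' gives ('1', 0)."""
    base, dot, sub = revision.partition(".")
    return base, int(sub) if dot else 0


def next_sub_revision(revision):
    """Return the revision the .dsc would get after one more autobuild."""
    base, sub = split_revision(revision)
    return f"{base}.{sub + 1}"


def same_package_revision(a, b):
    """True if two .dsc revisions belong to the same package revision."""
    return split_revision(a)[0] == split_revision(b)[0]


print(next_sub_revision("1"))
print(next_sub_revision("1.1"))
print(same_package_revision("1.1", "1.2"))
print(same_package_revision("1.2", "2"))
```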
I'm not sure that's a good idea, but I'm not sure how else to handle separate metadata in the archive. Anyone else have an idea?
Maybe this approach would be useful for handling translated metadata as well.
I did consider that it could be useful to change the format of our metadata to XML/SGML/RDF/WHATEVERML. The advantages would be that there are pre-existing libraries to handle the metadata, and it would make the metadata more accessible to future tools (other advantages?). However, I think the format we currently have is much more readable to people, simpler to parse, and more compact.
Weighing it all up, I don't think any other format would be worth the effort.
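To illustrate the "simpler to parse" point, here is a minimal Python sketch of a parser for the current "Field: value" format with indented continuation lines and blank lines between records. It is a simplification written for this mail, not the parser dpkg or apt use, and the sample records are invented.

```python
def parse_control(text):
    """Parse control-style text into a list of dicts, one per stanza."""
    stanzas = []
    current = {}
    last_field = None
    for line in text.splitlines():
        if not line.strip():
            # Blank line ends the current stanza.
            if current:
                stanzas.append(current)
            current, last_field = {}, None
        elif line[0] in " \t":
            # Continuation line: append to the previous field.
            if last_field is None:
                raise ValueError("continuation line without a field")
            current[last_field] += "\n" + line.strip()
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"not a field line: {line!r}")
            last_field = name.strip()
            current[last_field] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


# Invented sample input with two records.
SAMPLE = """Package: hello
Version: 1.0-1
Description: example package
 A longer description
 spanning two lines.

Package: sed
Version: 3.02-8
"""

stanzas = parse_control(SAMPLE)
for stanza in stanzas:
    print(stanza["Package"], stanza["Version"])
```

That is the whole parser, with no library needed, which is roughly what I mean by the current format being simple.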
Anyone who's good with markup languages care to comment?
(I'll be on vacation for a week starting in a day and a half; I wanted to get this out before I went.)
1. A subset of the control file is currently stored in the status file even if the package isn't installed, however I've read there are plans to remove it in a future version of dpkg.