
Re: organisation of our metadata



On Sat, 23 Feb 2002, Glenn McGrath wrote:

> The problem is our metadata is fragmented and duplicated.

It's not duplicated.  You don't understand the system.

> The worst offender is the control file from the control.tar.gz, it exists
>  - In the /var/lib/dpkg/status file if the package is installed(1)
>  - In the /var/lib/dpkg/available file (i think apt-get uses an alternative to this file)

* status contains the info for the currently installed packages.

* available contains what is available for installation.  available can
  contain a more recent record than status.

The real way to do this is to have available used only by dselect, while dpkg
would just use status.  See your (1).


>  - In possibly multiple /var/lib/apt/lists/*Packages.gz files

Again, no duplication.  There is one entry for each download location for a
package.  Apt does merge these internally, but for safety's sake, keeps the
original downloaded files around, for caching.
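
To illustrate "one entry per download location, merged internally", here is a
minimal Python sketch.  It is not apt's actual algorithm; the function name,
the source names, and the sample entries are all hypothetical, and the entries
are assumed to be already parsed into dicts.

```python
from collections import defaultdict


def merge_package_lists(lists_by_source):
    """Map (package, version) -> list of sources that offer it.

    lists_by_source: {source_name: [ {"Package": ..., "Version": ...}, ... ]}
    Illustrative only; this is NOT how apt implements its cache.
    """
    merged = defaultdict(list)
    for source, entries in lists_by_source.items():
        for entry in entries:
            key = (entry["Package"], entry["Version"])
            merged[key].append(source)
    return dict(merged)


# Hypothetical data: two download locations carrying overlapping entries.
LISTS = {
    "mirror-a": [{"Package": "hello", "Version": "1.0-1"}],
    "mirror-b": [
        {"Package": "hello", "Version": "1.0-1"},
        {"Package": "hello", "Version": "1.1-1"},
    ],
}

if __name__ == "__main__":
    for key, sources in merge_package_lists(LISTS).items():
        print(key, sources)
```

The per-source lists stay untouched on disk; only the in-memory view is
merged, so each downloaded file can still be refreshed on its own.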

>  - In the binary package
>  - In the source diff or (.tar.gz if debian native package)

>
> The control file is the most complete, but doesnt have all metadata in it,
> there are pieces of metadata in the copyright (poorly formatted),
> changelog (poorly formatted), .changes (discarded after use), and .dsc files,
> but it is harder to get to.

Which control file are you talking about above?

What metadata is there in the copyright?

Changelog is not poorly formatted.

.changes are not discarded.

> If we put _all_ static metadata into a separate file for each package and
> make that file generally available, then we could avoid duplicating it, eg.
>  - The /var/lib/dpkg/status file would only need the Package: Version: and
>    Status: fields, the status file would be reduced to information that is
>    essential, removing information that can be easily replaced, though again
>     see (1)
>  - The /var/lib/dpkg/available file would just be an index, listing the
>    package names and versions,
>  - Packages.gz files could also be just an index of package names and
>    versions.

Part of dpkg's slowness is the statting of lots of little files.  These files
are the .list files that reside in /var/lib/dpkg/info.  The reading of
/var/lib/dpkg/status is actually quite fast.

You are suggesting that we use lots of little files, for status.  This would
be a huge slowdown.
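
A rough way to check this on your own machine is sketched below in Python.  It
writes the same records once as a single file and once as many small files,
then reads both back.  The file names and record contents are made up for the
test, this is not how dpkg reads its database, and with a warm cache the gap
may be much smaller than on a cold disk.

```python
import os
import tempfile
import time


def write_layouts(root, count):
    """Write identical records as one big file and as `count` small files."""
    small_dir = os.path.join(root, "small")
    os.mkdir(small_dir)
    big_path = os.path.join(root, "status")
    with open(big_path, "w") as big:
        for i in range(count):
            record = f"Package: pkg{i}\nVersion: 1.0\n\n"
            big.write(record)
            with open(os.path.join(small_dir, f"pkg{i}"), "w") as small:
                small.write(record)
    return big_path, small_dir


def read_big(big_path):
    """One open, one sequential read."""
    with open(big_path) as f:
        return f.read()


def read_small(small_dir, count):
    """One stat and one open per record."""
    parts = []
    for i in range(count):
        path = os.path.join(small_dir, f"pkg{i}")
        os.stat(path)  # one metadata lookup per file
        with open(path) as f:
            parts.append(f.read())
    return "".join(parts)


if __name__ == "__main__":
    count = 2000  # arbitrary size for the experiment
    with tempfile.TemporaryDirectory() as root:
        big_path, small_dir = write_layouts(root, count)
        t0 = time.perf_counter()
        a = read_big(big_path)
        t1 = time.perf_counter()
        b = read_small(small_dir, count)
        t2 = time.perf_counter()
        assert a == b  # both layouts hold the same data
        print(f"one file: {t1 - t0:.4f}s, {count} files: {t2 - t1:.4f}s")
```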

> These separate package metadata files could be downloaded separately which
> avoids downloading the same information for each common entry in your
> multiple _*_Packages.gz files.
>
> Downloading 9000 separate files instead of a single 1.6MB file could be a
> burden on servers initially, but on an update only NEW files would be
> downloaded. It would reduce bandwidth substantially at the expense of more
> open connections.

Each update would require statting all of those 9000 files.  It would be a
burden every time.  Plus, the protocol overhead just to stat each file (think
about all the http headers sent in both directions, for example) would be high.
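
As a back-of-the-envelope check, the Python sketch below multiplies it out.
The 9000 files and the 1.6MB index come from the quoted text; the 400 bytes of
request plus response headers per check is purely an assumed figure, so treat
the result as an order-of-magnitude illustration, not a measurement.

```python
FILE_COUNT = 9000             # number of per-package files, from the proposal
INDEX_BYTES = 1_600_000       # the "1.6MB" index, taken as 1.6 * 10**6 bytes
HEADER_BYTES_PER_CHECK = 400  # ASSUMPTION: request + response headers per file


def per_file_check_overhead(file_count, header_bytes):
    """Bytes spent on headers alone when every file is checked once."""
    return file_count * header_bytes


overhead = per_file_check_overhead(FILE_COUNT, HEADER_BYTES_PER_CHECK)
print(f"header overhead per update: {overhead} bytes")
print(f"relative to one index download: {overhead / INDEX_BYTES:.2f}x")
```

Under that assumption the headers alone exceed the size of the single index,
before a single changed file has been transferred.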

> Also worth noting, in this way the metadata for the ENTIRE distribution
> doesnt need to be downloaded, the debian installer could for example just
> download the metadata for essential and base packages. Handling smaller
> amounts of metadata would reduce memory requirements for the installer.

You can't do directory listings with http.  So, there would still have to be
an index file that listed all possible separate files.  Again, more overhead.

> A possible location for all the metadata would be loading it into the .dsc
> file, increasing its size wouldnt be a problem as its only used for source
> packages at the moment, and doesnt have much metadata. using it for binary
> and source package could get messy. it would have to be appended with extra
> information after every autobuild, which would require the revision number
> to change, to ensure that the dsc revision number didnt get out of sync
> with the other package components we could use a sub revision number, i.e.
> -1.1 -1.2 all relate to revision 1 of the source and binary packages.

There can be binaries (debs) for which there is no source upload.  This occurs
when a package is recompiled against new libraries (this can happen when an
abi, or something, changes).

> Im not sure that thats a good idea, but im not sure how else to handle
> separate metadata in the archive, anyone else have an idea ?

This idea will never fly.  Attempting to Xu-ise the metadata is a sure fire
way to get yourself ignored, as trying to apply that technique to this is the
wrong approach.

> Maybe this approach would be useful for handling translated metadata as well.

It has nothing to do with that.  I already have ideas on how to handle it; it
will just take time to implement.

> I did consider that it could be useful to change the format of our
> metadata to XML/SGML/RDF/WHATEVERML, the advantages would be that there are
> pre-existing libraries to handle the metadata, and it would make the metadata
> more accessible to future tools (other advantages?), however i think the
> format we currently have is much more readable to people, simpler to parse,
> and more compact.

Wearing my dpkg-developer hat:

  I will never link dpkg to an xml library.  I will never write dpkg in c++.
  Dpkg absolutely MUST be kept simple, because of its critical nature to the
  system.

> Weighing it all up, i dont think any other format would be worth the effort.
> Anyone who's good with markup languages care to comment ?

XML can be rather space-inefficient.  What we have is rather efficient,
parseable and editable by humans, and has existed for decades (rfc822).
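
To show how little code the rfc822-style format needs, here is a minimal
Python sketch of a stanza parser.  It is not dpkg's parser and skips details
a real one must handle; the function name and the sample data are made up for
illustration.

```python
def parse_stanzas(text):
    """Split control-file style text into a list of {field: value} dicts.

    Simplified sketch: "Field: value" lines, continuation lines that start
    with whitespace, and blank lines separating stanzas.
    """
    stanzas, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current stanza.
            if current:
                stanzas.append(current)
            current, last_field = {}, None
        elif line[0] in " \t" and last_field is not None:
            # Continuation line: append to the previous field's value.
            current[last_field] += "\n" + line.strip()
        else:
            name, _, value = line.partition(":")
            last_field = name.strip()
            current[last_field] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


# Hypothetical sample data in the same field layout.
SAMPLE = """\
Package: hello
Version: 1.0-1
Description: example package
 A longer description line.

Package: world
Version: 2.3-4
"""

if __name__ == "__main__":
    for stanza in parse_stanzas(SAMPLE):
        print(stanza["Package"], stanza["Version"])
```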


> (Ill be on vacation for a week starting in a day and a half, i wanted to get this out before i went)

You should have thought about it more.

> 1. A subset of the control is currently stored in the status file even if
> the package isnt installed, however ive read there are plans to remove it in
> future version of dpkg

This is correct.


