
Re: organisation of our metadata



On Sat, 23 Feb 2002 03:11:59 -0600 (CST)
"Adam Heath" <doogie@debian.org> wrote:

> > The worst offender is the control file from the control.tar.gz, it exists
> >  - In the /var/lib/dpkg/status file if the package is installed(1)
> >  - In the /var/lib/dpkg/available file (i think apt-get uses an
> >    alternative to this file)
>
> * status contains the info for the currently installed file.
> 
Yes, the only information needed to perform this task is the Package:,
Version:, and Status: fields for any package that doesn't have Status:
"purge ok not-installed". All other fields are static. Separating out our
static information makes it easier to look after what is unique to the
status file, i.e. the Status field. The Status of a package is critical,
it can't be easily replaced; all the other metadata can.
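
A minimal sketch of that reduction, in Python purely for illustration. It
assumes the usual blank-line-separated stanzas of "Field: value" lines;
the function names and the sample data are made up for this example and
are not part of dpkg.

```python
ESSENTIAL = ("Package", "Version", "Status")


def parse_stanzas(text):
    """Split control-style data into a list of {field: value} dicts."""
    stanzas = []
    for block in text.strip().split("\n\n"):
        fields = {}
        last = None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and last is not None:
                # Continuation line: belongs to the previous field.
                fields[last] += "\n" + line.strip()
            elif ":" in line:
                name, _, value = line.partition(":")
                last = name.strip()
                fields[last] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas


def minimal_status(text):
    """Keep only the fields that cannot be regenerated from elsewhere."""
    result = []
    for fields in parse_stanzas(text):
        if fields.get("Status") == "purge ok not-installed":
            continue  # nothing worth remembering about this package
        result.append({k: fields[k] for k in ESSENTIAL if k in fields})
    return result


# Hypothetical sample data, not taken from a real system.
SAMPLE = (
    "Package: hello\n"
    "Status: install ok installed\n"
    "Version: 1.3-16\n"
    "Description: The classic greeting\n"
    " A longer description line.\n"
    "\n"
    "Package: gone\n"
    "Status: purge ok not-installed\n"
)

if __name__ == "__main__":
    for entry in minimal_status(SAMPLE):
        print(entry)
```

Everything dropped here (Description, Depends and so on) is the static
part that could live in the per-package metadata file instead.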

> * available contains what is available for installation.  available can
>   contain a more recent record than status.
> 
Yes, every <package_name>_<package_version> is unique, so there is no
reason why metadata files for different versions of a package can't be
stored on one system. Of course, once a file is no longer needed it can be
removed.

> The real way to do this, is to have available used only by dselect, and dpkg
> would just use status.  See your (1).
> 
A package's metadata can be used by lots of other tools as well, not just
for installing and removing packages, so making metadata more easily
accessible would be a good thing.

> >  - In possibly multiple /var/lib/apt/lists/*Packages.gz files
> 
> Again, no duplication.  There is one entry for each download location for a
> package.  Apt does merge these internally, but for saftey(?) sake, keeps the
> original downloaded files around, for caching.
> 
I have more than one entry for main in my /etc/apt/sources.list (i assume
lots of other people do as well); that way, if a package isn't found on
the first mirror, apt will look for it on the second mirror rather than
give up. The Packages.gz on every up-to-date mirror should be exactly the
same, but it is downloaded from every mirror and the individual package
entries are merged internally. What i am suggesting is working out whether
it is new information before downloading it.
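
One way that check could look, as a sketch only. It assumes each mirror
can cheaply advertise a checksum of its Packages.gz (for instance from a
small, separately fetched file); the function name, mirror names and
checksums below are hypothetical, and this is not how apt currently works.

```python
def plan_downloads(advertised, cached):
    """Decide which mirrors actually need their index fetched.

    advertised: {mirror_name: checksum of that mirror's Packages.gz}
    cached:     checksums of index files we already hold locally
    Returns one mirror per checksum that we have not seen yet.
    """
    seen = set(cached)
    to_fetch = []
    for mirror, checksum in advertised.items():
        if checksum not in seen:
            seen.add(checksum)  # identical copies on later mirrors are skipped
            to_fetch.append(mirror)
    return to_fetch


if __name__ == "__main__":
    advertised = {
        "mirror-a": "abc",  # same content as mirror-b
        "mirror-b": "abc",
        "mirror-c": "def",  # already cached
    }
    print(plan_downloads(advertised, cached={"def"}))
```

With two up-to-date mirrors carrying an identical index, only one copy
would be fetched, and nothing at all if that copy is already cached.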

> >  - In the binary package
> >  - In the source diff or (.tar.gz if debian native package)
> 
> >
> > The control file is the most complete, but doesn't have all metadata in it,
> > there are pieces of metadata in the copyright (poorly formatted),
> > changelog(poorly formatted), .changes (discarded after use), and .dsc files,
> > but it is harder to get to.
> 
> Which control file are you talking about above?
> 
The control file within a binary package's control.tar.gz.

> What metadata is there in the copyright?
> 
Normally the following data. Granted, it's not important information, but
it might be useful if it were more easily accessible:
 - Upstream author
 - Copyright license
 - Source location
 - Name of first package creator
 - Date/time of first package creation

> Changelog is not poorly formatted.
> 
It is poorly formatted for parsing by machine, as it's inconsistent with
the way other files are formatted; it is pleasing to the eye though. If
the only unique information in the changelog was the changes it wouldn't
matter, but there is other information in there which could be useful as
well, i.e.
 - Target distribution
 - Urgency
 - Uploader
 - Creation time of this version of the unpacked source.
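
For illustration, those four items can be pulled out of a changelog entry
with two regular expressions, which also shows that it needs its own
parser rather than the "Field: value" one used everywhere else. This is a
sketch that only handles the common header and trailer line shapes; the
names and the sample entry are made up.

```python
import re

# Header line: "<source> (<version>) <distribution>; urgency=<urgency>"
HEADER = re.compile(
    r"^(?P<source>\S+) \((?P<version>[^)]+)\) "
    r"(?P<distribution>[^;]+); urgency=(?P<urgency>\S+)",
    re.MULTILINE,
)
# Trailer line: " -- <uploader>  <date>" (two spaces before the date)
TRAILER = re.compile(r"^ -- (?P<uploader>.+?)  (?P<date>.+)$", re.MULTILINE)


def parse_changelog_entry(text):
    """Return the metadata of the first changelog entry as a dict."""
    header = HEADER.search(text)
    trailer = TRAILER.search(text)
    if header is None or trailer is None:
        raise ValueError("not a recognisable changelog entry")
    info = header.groupdict()
    info.update(trailer.groupdict())
    return info


# Hypothetical entry, for demonstration only.
SAMPLE = (
    "hello (1.3-16) unstable; urgency=low\n"
    "\n"
    "  * Fix a typo in the greeting.\n"
    "\n"
    " -- Joe Maintainer <joe@example.org>  Mon, 18 Feb 2002 10:00:00 +0000\n"
)

if __name__ == "__main__":
    for key, value in parse_changelog_entry(SAMPLE).items():
        print(key + ": " + value)
```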

> .changes are not discarded.
>
Well, they are archived and arbitrarily removed; it's not intended to be a
file that's kept around as long as the package is in use (as far as i
know). It does have some of the info from the changelog in a more
convenient format though.

> 
> > If we put _all_ static metadata into a separate file for each package and
> > make that file generally available, then we could avoid duplicating it, eg.
> >  - The /var/lib/dpkg/status file would only need the Package: Version: and
> >    Status: fields, the status file would be reduced to information that is
> >    essential, removing information that can be easily replaced, though again
> >    see (1)
> >  - The /var/lib/dpkg/available file would just be an index, listing the
> >    package names and versions,
> >  - Packages.gz files could also be just an index of package names and
> >    versions.
> 
> Part of dpkg's slowness is the statting of lots of little files.  These files
> are the .list files that reside in /var/lib/dpkg/info.  The reading of
> /var/lib/dpkg/status is actually quite fast.
> 
Arghh, i had wondered about that... cool

> You are suggesting that we use lots of little files, for status.  This would
> be a huge slowdown.
> 
I'm suggesting that the local authoritative place for each package's
metadata be a little file. They could be grouped together to create a
status file like we have now, but that would be just for convenience.

> > These separate package metadata files could be downloaded separately which
> > avoids downloading the same information for each common entry in your
> > multiple _*_Packages.gz files.
> >
> > Downloading 9000 separate files instead of a single 1.6MB file could be a
> > burden on servers initially, but on an update only NEW files would be
> > downloaded. It would reduce bandwidth substantially at the expense of more
> > open connections.
> 
> Each update would require statting all of those 9000 files.  It would be a
> burden every time.  Plus, the protocol just to stat each file(think about all
> the http headers sent in both directions, for example) would be high.
>
No, each update would require downloading an index file that has the name
and version (nothing else) of every package in the release. That file can
be compared with a list of locally stored files, and only the metadata
files for new or updated packages would have to be downloaded.
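
The comparison could be as simple as a set difference. The sketch below
assumes a hypothetical index format of one "name version" pair per line
and local metadata files named <package_name>_<package_version>; neither
exists in the archive today.

```python
def parse_index(text):
    """Turn 'name version' lines into a set of 'name_version' keys."""
    wanted = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            wanted.add(parts[0] + "_" + parts[1])
    return wanted


def plan_update(index_text, local_files):
    """Return (files to download, local files that are no longer listed)."""
    wanted = parse_index(index_text)
    have = set(local_files)
    return sorted(wanted - have), sorted(have - wanted)


if __name__ == "__main__":
    index = "hello 1.3-17\nbash 2.05a-11\n"
    local = ["hello_1.3-16", "bash_2.05a-11"]
    fetch, stale = plan_update(index, local)
    print("fetch:", fetch)
    print("stale:", stale)
```

No per-file request is needed to find out what changed; only the files in
the first list would be requested from the mirror.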

> > Also worth noting, in this way the metadata for the ENTIRE distribution
> > doesn't need to be downloaded, the debian installer could for example just
> > download the metadata for essential and base packages. Handling smaller
> > amounts of metadata would reduce memory requirements for the installer.
>
> You can't do directory listings with http.  So, there would still have to be
> an index file that listed all possible separate files.  Again, more overhead.
>
In that example yes, there would have to be a list somewhere of essential
and base packages, probably only a few kB. The package name and version
might be obtained another way, from the dependency list of another package
for example, in that case no list is required.
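
A sketch of that second route: start from a few known packages and follow
their Depends: fields. The fetch_metadata callable and the sample data are
hypothetical, only the first alternative of each "a | b" clause is
followed, and version constraints are ignored, so this yields names only.

```python
def dependency_names(depends):
    """Extract package names from a Depends: value, ignoring versions."""
    names = []
    for clause in depends.split(","):
        first = clause.split("|")[0].strip()  # first alternative only
        if first:
            names.append(first.split("(")[0].strip())
    return names


def closure(roots, fetch_metadata):
    """Collect every package reachable from roots through Depends:."""
    needed = set()
    queue = list(roots)
    while queue:
        name = queue.pop()
        if name in needed:
            continue
        needed.add(name)
        fields = fetch_metadata(name)  # one small metadata file per package
        queue.extend(dependency_names(fields.get("Depends", "")))
    return needed


if __name__ == "__main__":
    # Made-up metadata standing in for the per-package files.
    metadata = {
        "a": {"Depends": "b (>= 1.0), c | d"},
        "b": {"Depends": "c"},
        "c": {},
    }
    print(sorted(closure(["a"], metadata.__getitem__)))
```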

> > A possible location for all the metadata would be loading it into the .dsc
> > file, increasing its size wouldn't be a problem as it's only used for source
> > packages at the moment, and doesn't have much metadata. using it for binary
> > and source package could get messy. it would have to be appended with extra
> > information after every autobuild, which would require the revision number
> > to change, to ensure that the dsc revision number didn't get out of sync
> > with the other package components we could use a sub revision number, i.e.
> > -1.1 -1.2 all relate to revision 1 of the source and binary packages.
>
> There can be binaries(debs) for which there is no source upload.  This occurs
> when a package is recompiled against new libraries(this can happen when an
> abi(or something) changes).
> 
Still, i don't see why metadata wouldn't be available.

> > I'm not sure that that's a good idea, but i'm not sure how else to handle
> > separate metadata in the archive, anyone else have an idea ?
>
> This idea will never fly.  Attempting to Xu-ise the metadata is a sure fire
> way to get yourself ignored, as trying to apply that technique to this is the
> wrong approach.
> 
Well, i wasn't convinced it was a good idea, but i wasn't thinking of
X-fields; i was thinking other info could be appended (signed by the
appender) to the .dsc file. How would you suggest it be done?

> > Maybe this approach would be useful for handling translated metadata as
> > well.
>
> It has nothing to do with that.  I already have ideas on how to handle it, it
> will just take time to implement.
> 
It would allow metadata to be updated without updating the package
binaries and source, so that opportunity could be used to update
translated descriptions, for example.

> > I did consider that it could be useful to change the format of our
> > metadata to XML/SGML/RDF/WHATEVERML, the advantages would be that there are
> > pre-existing libraries to handle the metadata, and it would make the
> > metadata more accessible to future tools (other advantages?), however i
> > think the format we currently have is much more readable to people,
> > simpler to parse, and more compact.
>
> Wearing my dpkg-developer hat:
>
>   I will never link dpkg to an xml library.  I will write dpkg in c++.  Dpkg
>   absolutely MUST be kept simple, because of its critical nature to the
>   system.
> 
I guess it has been suggested before to make a dpkg-bin package that just
has dpkg and nothing else (no dselect etc).

Wearing my busybox dpkg hat:
 The status and Packages files are about 7MB. i was only parsing select
fields from these files and got down to as little as 350kB of memory and
no temp disk space, but trying to do a busybox apt-get made that approach
unviable, so i'm trying to store the whole lot. Using hash tables i can
store the info in about 4.5MB, but that is too much. Busybox dpkg still
doesn't do lots of stuff that it should... growing pains trying to
implement apt-get as well. b.t.w. i think i can do a cut down busybox
apt-get in about 70kB, but i keep falling down on proper dependency
checking.
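
To illustrate the "select fields only" approach (in Python rather than
the C that busybox uses, and not the busybox code itself): the file is
read one line at a time and only the wanted fields of the current stanza
are ever held in memory. The names and sample data are made up.

```python
import io


def iter_selected(lines, wanted):
    """Yield one dict per stanza, holding only the fields in `wanted`."""
    fields, keep = {}, None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line ends the stanza.
            if fields:
                yield fields
            fields, keep = {}, None
        elif line[0] in " \t":
            # Continuation line: kept only if its field is wanted.
            if keep is not None:
                fields[keep] += "\n" + line.strip()
        else:
            name, _, value = line.partition(":")
            keep = name if name in wanted else None
            if keep is not None:
                fields[keep] = value.strip()
    if fields:
        yield fields


# Hypothetical sample data.
SAMPLE = (
    "Package: hello\n"
    "Status: install ok installed\n"
    "Version: 1.3-16\n"
    "Description: The classic greeting\n"
    " A longer description line.\n"
    "\n"
    "Package: bash\n"
    "Version: 2.05a-11\n"
)

if __name__ == "__main__":
    for entry in iter_selected(io.StringIO(SAMPLE), {"Package", "Version"}):
        print(entry)
```

Long fields such as Description are skipped as they stream past, which is
where most of the saving comes from.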

> > Weighing it all up, i don't think any other format would be worth the
> > effort. Anyone who's good with markup languages care to comment ?
> 
> XML can be rather space-inefficient.  What we have is rather efficient,
> parseable and editable by humans, and has existed for decades(rfc822).
> 
Yea, my thoughts also. i don't know much about it, but looked into it
recently just because it seems to be so popular everywhere else.


Glenn


