Re: Gathering package upstream meta-data in the UDD. (was: Re: more formally indicating the registration URL)

To: debian-med@lists.debian.org
Cc: Debian QA List <debian-qa@lists.debian.org>
Subject: Re: Gathering package upstream meta-data in the UDD. (was: Re: more formally indicating the registration URL)
From: Andreas Tille <andreas@an3as.eu>
Date: Thu, 22 Oct 2009 09:49:10 +0200
Message-id: <20091022074910.GA20846@an3as.eu>
In-reply-to: <20091021153006.GA28743@kunpuu.plessy.org>
References: <4A7309D8.9010706@gmx.de> <20090803182130.GC975@an3as.eu> <20090805023802.GA22086@kunpuu.plessy.org> <20090805072124.GB16430@an3as.eu> <4A795981.7000606@gmx.de> <20091021153006.GA28743@kunpuu.plessy.org>

[debian-qa in CC because here we are discussing UDD issues.]

On Thu, Oct 22, 2009 at 12:30:06AM +0900, Charles Plessy wrote:
> First of all, let's summarise the situation. We want to integrate some metadata
> in our 'web sentinels', like 'http://debian-med.alioth.debian.org/tasks/bio'.

I would like to add that other use cases for this kind of data will most
probably evolve.  Keeping this in mind we might consider moving the topic
to debian-devel in the next stage of development.

> What I propose is to have a special file in the source packages for gathering
> all possible useful information, debian/upstream-metadata.yaml.

I have noticed this and I really like this effort very much (even if I
did not actively support it by adding such a file to the packages I
touched recently).

> In contrast to
> debian/control, this file would not contribute data to the Packages.gz files of
> the Debian archive. I think that there are enough source packages managed in
> version control systems that we can use them as the main source of our data.

I'm not really happy about this "we ignore packages which are not
maintained in VCS" attitude but it sounds reasonable to assume that in
practice all those packages that potentially contain this kind of
information are actually maintained in a VCS.  An alternative way to
gather the information popped up in my mind:  There is some code that
checks the translation status of upstream sources by unpacking all
source packages and checking for <lang>.po files.  So there is actually
some code which handles complete unpacking of Debian source packages
which might be used to fetch debian/upstream-metadata.yaml as well.
The pro is that it covers all packages - the con is that it only looks at
already uploaded packages.

> This makes debian/upstream-metadata.yaml available independently of the Debian
> archive, and more importantly, will allow updating the metadata without
> uploading the package, but in a way that only the maintainers can do the
> update, which keeps things under control.

This has a certain advantage in flexibility over the method I suggested
above.  I'm not sure which way I would prefer.  The VCS method is
probably much easier to implement - so we probably should stick to your
decision - but I wanted to mention an alternative way which IMHO might
have slightly better chances of being accepted on debian-devel for
general purposes because people there might be interested in
completeness.

> The missing piece of the puzzle is then an aggregator that would collect the
> information from the source packages and prepare tables for the UDD. I am drafting
> such a program at http://upstream-metadata.debian.net/. Currently, it does
> not do much:
> 
> http://upstream-metadata.debian.net/<package>/ALL gets debian/upstream-metadata.yaml if
> the package is in a subversion server that is available to 'debcheckout'. Luckily,
> most of our packages are.
> 
> http://upstream-metadata.debian.net/<package>/<key> gives the content of the
> metadata for one key.

This sounds really good.
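
A minimal sketch of how a client could build these lookup URLs.  The
host name and the <package>/<key> layout are taken from the quoted
description; the function name metadata_url is only an illustration and
nothing here checks whether the service answers.

```python
from urllib.parse import quote

# Host named in the quoted message; it may of course move later.
BASE_URL = "http://upstream-metadata.debian.net"


def metadata_url(package, key="ALL"):
    """Build the lookup URL for one key of one package (ALL = whole file)."""
    return f"{BASE_URL}/{quote(package, safe='')}/{quote(key, safe='')}"


print(metadata_url("samtools", "PMID"))
```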

> For instance, http://upstream-metadata.debian.net/samtools/PMID gives the
> PubMed identification number for the article describing SamTools, 19505943.
> 
> This is the proof of principle for data retrieval. Then, we need to construct
> the tables.  I plan to have the program store the results in a BerkeleyDB
> database, and to make it output tables at constant intervals, for instance
> daily. The update of the internal database would be done in two ways.

If you plan to propagate this data to UDD this might not be an optimal
solution.  UDD imports are usually a two step process:

  1. Fetch text data from whatever source in clear text.
  2. Delete table, read text data and put it into the table.

If we want to follow this scheme for our specific case IMHO the best idea
would be to just drop a <package>.yaml file in a directory where rsync or
wget can fetch these files.  The second step, reading the yaml files, is
quite simple.
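
To illustrate the second step, here is a rough sketch under several
assumptions: an in-memory SQLite database stands in for the real UDD
database, the table layout is the key/value variant, and the tiny parser
only understands flat "Key: value" lines (a real importer would use a
proper YAML library).  All function names are made up for this example.

```python
import sqlite3
import tempfile
from pathlib import Path


def parse_flat_yaml(text):
    """Parse only flat 'Key: value' lines; real YAML needs a real parser."""
    data = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        data[key.strip()] = value.strip()
    return data


def import_metadata(conn, directory):
    """Step 2: empty the table, then reload it from every <package>.yaml."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS upstream_metadata ("
        "package TEXT, key TEXT, value TEXT, PRIMARY KEY (package, key))"
    )
    with conn:  # delete and reload are committed together
        conn.execute("DELETE FROM upstream_metadata")
        for path in sorted(Path(directory).glob("*.yaml")):
            for key, value in parse_flat_yaml(path.read_text()).items():
                conn.execute(
                    "INSERT INTO upstream_metadata VALUES (?, ?, ?)",
                    (path.stem, key, value),
                )


# Demo with one fetched file (step 1 is assumed to have happened already).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "samtools.yaml").write_text("PMID: 19505943\n")
    conn = sqlite3.connect(":memory:")
    import_metadata(conn, tmp)
    rows = conn.execute(
        "SELECT package, key, value FROM upstream_metadata"
    ).fetchall()
    print(rows)
```

Because the table is emptied on every run, the import can be repeated
from cron without accumulating stale rows.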

> First, updates could be pushed with commit hooks when package maintainers
> commit changes to debian/upstream-metadata.yaml. It could be as simple as
> having an url that triggers an update, and using wget or curl to activate the
> aggregator.
> 
> Second, normal read access could trigger an update if the record is getting old.

Currently UDD updates are time based (per cron job) and not event based
(per commit of some data).  If you gather the data by any means at
upstream-metadata.debian.net this is not really relevant for the UDD import
(OK, it makes sense to synchronise the cron jobs to make sure that the
upstream-metadata cron job runs before the UDD cron job fetches data).  So I
would vote for the option which is safer to implement.  In this respect I
would prefer the second method and run the job once a day.  The reason
is that, if I'm not completely wrong, the VCS push would require
configuring *every* VCS which *potentially* might contain
upstream-metadata.yaml files.  This is a weak approach because you do not
have control over all VCSes, chances are very high that this will not
happen on all VCSes, and it sounds quite hard to propagate changes to the
commit hooks (imagine upstream-metadata.debian.net becomes
upstream-metadata.debian.org or whatever).  In this sense I would vote
for relying on the Vcs fields in the packaging information and fetching
the information via cron job using the Vcs specified in debian/control.
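
As a sketch of what such a cron job would need first: pulling the Vcs-*
fields out of the source paragraph of debian/control.  The function name
and the sample control text (including its URLs) are invented for this
example; the actual checkout, e.g. via debcheckout, is left out.

```python
import re

# Matches lines such as "Vcs-Svn: svn://..." at the start of a line.
VCS_FIELD = re.compile(r"^(Vcs-[A-Za-z]+):[ \t]*(\S.*)$", re.MULTILINE)


def vcs_fields(control_text):
    """Return the Vcs-* fields of the first (source) paragraph."""
    source_paragraph = re.split(r"\n\s*\n", control_text.strip(), maxsplit=1)[0]
    return {
        name: value.strip()
        for name, value in VCS_FIELD.findall(source_paragraph)
    }


# Made-up control file; the URLs are placeholders, not real repositories.
SAMPLE_CONTROL = """\
Source: samtools
Section: science
Vcs-Svn: svn://svn.example.org/samtools/trunk/
Vcs-Browser: http://svn.example.org/viewvc/samtools/

Package: samtools
Architecture: any
"""

fields = vcs_fields(SAMPLE_CONTROL)
print(fields)
```

Note that Vcs-Browser is only a web view; a fetcher would use one of the
other Vcs-* fields for the actual checkout.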

> In summary, I propose to store metadata in YAML format in the source packages,
> retrieve and store it in a central place using a web agent through the VCS in
> which the source packages are stored, and periodically output tables for the
> UDD, which keeps a central role for the generation of our web sentinel pages.

I like this approach.  But there is one thing I'm not really sure about:
How should we design the UDD table?  There are two options: 

CREATE TABLE upstream_metadata (
    package text,
    key1    text,
    key2    text,
    ...
    keyN    text,
    PRIMARY KEY (package)
);

with a defined set of keys allowed in upstream-metadata.yaml and exactly
one row per package.  Every unknown key will be ignored.  The
advantage of this approach is that tools *know* what keys to expect and
can rely on knowing how to handle them.
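
A small sketch of what "every unknown key will be ignored" would mean
for an importer with fixed columns.  The column set KNOWN_KEYS is
hypothetical (only PMID has come up in this thread so far), as is the
function name.

```python
# Hypothetical fixed column set; only PMID is taken from this thread.
KNOWN_KEYS = ("PMID", "Homepage")


def to_row(package, metadata):
    """One row per package: missing keys become None, unknown keys vanish."""
    return (package,) + tuple(metadata.get(key) for key in KNOWN_KEYS)


# "Colour" is not a known key, so it does not end up in the row.
row = to_row("samtools", {"PMID": "19505943", "Colour": "blue"})
print(row)
```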

Alternatively we could do

CREATE TABLE upstream_metadata (
    package text,
    key     text,
    value   text,
    PRIMARY KEY (package,key)
);

with an arbitrary number of rows per package but no duplicated keys for
one package.  This is more flexible: if you need some new kind of data
you do not need to touch the UDD table structure.  However, it restricts
each key to a single value per package.

The third option is to leave out the PRIMARY KEY constraint altogether, which
allows maximum flexibility (for instance there might be more than one
citation record).
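
The difference between the key/value table with and without the
constraint can be tried out quickly.  This is only a sketch using an
in-memory SQLite database as a stand-in for UDD; the table names and the
second identifier are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE with_pk (package TEXT, key TEXT, value TEXT, "
    "PRIMARY KEY (package, key))"
)
conn.execute("CREATE TABLE without_pk (package TEXT, key TEXT, value TEXT)")


def insert_twice(table):
    """Store two values under the same (package, key); True if accepted."""
    try:
        # The second value is a made-up placeholder identifier.
        for value in ("19505943", "00000000"):
            conn.execute(
                f"INSERT INTO {table} VALUES ('samtools', 'PMID', ?)",
                (value,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = {table: insert_twice(table) for table in ("with_pk", "without_pk")}
print(results)
```

With the constraint the second citation is rejected; without it both
rows are stored, and it is up to the consumers to cope with duplicates.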

BTW, I'm a bit concerned about mixing different database formats: On one
hand you are using yaml, on the other hand BibTeX.  Well, for sure having
a BibTeX record is very valuable.  But on the other hand the tools which
work with this data will need a BibTeX parser.  I did not dive
into this and for sure it is doable - but I just wanted to raise this
topic here to hear opinions.

> The proof of principle presented above is only a few lines of code, but I would
> prefer to discuss the idea further before putting more time into it.

Thanks for pushing this forward!

> Lastly, I have accumulated a dozen debian/upstream-metadata.yaml files in
> the packages I maintain, so that meaningful tests are doable for table
> generation later. I do not remember the list by heart, but it contains seaview,
> bwa, clustalw, clustalx, perlprimer, samtools, and most of the packages I have
> updated recently.
> 
> Since I am quite inexperienced in programming, help is of course most welcome.

As I said above: IMHO most of the work is done if you can provide a set
of <package>.yaml files at a freely accessible place.

Kind regards

       Andreas.

-- 
http://fam-tille.de
