Re: Gathering package upstream meta-data in the UDD. (was: Re: more formally indicating the registration URL)



> On Thu, Oct 22, 2009 at 12:30:06AM +0900, Charles Plessy wrote:
> > First of all, let's summarise the situation. We want to integrate some metadata
> > in our 'web sentinels', like 'http://debian-med.alioth.debian.org/tasks/bio'.

Dear Andreas and Olivier,

thank you for your encouraging comments. I have made one more step forward:
upstream-metadata.debian.net now stores its information in a Berkeley database,
and refreshes an entry only when it is accessed and its data is older than a
given age.
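
The refresh-on-access behaviour can be sketched roughly as follows. This is
only an illustration in Python, not the code running on the site: the names
get_cached, store, fetch and max_age are hypothetical, and a plain dict stands
in for the Berkeley database.

```python
import time


def get_cached(store, key, fetch, max_age, now=None):
    """Return the cached value for `key`, refetching only when it is stale.

    `store` is any dict-like mapping (a stand-in for the Berkeley database),
    `fetch` is a hypothetical callable that retrieves fresh data for `key`,
    and `max_age` is the maximum accepted age in seconds.
    """
    now = time.time() if now is None else now
    entry = store.get(key)
    if entry is not None:
        value, stored_at = entry
        if now - stored_at <= max_age:
            # Still fresh: serve from the store without touching the VCS.
            return value
    # Missing or too old: fetch again and record the time of the refresh.
    value = fetch(key)
    store[key] = (value, now)
    return value
```

With this layout nothing is refreshed in the background; an entry that is never
requested simply stays stale until someone asks for it.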

For the moment, we only have 17 source packages that have an
upstream-metadata.yaml file in their debian directory that is accessible
through a public VCS. Nevertheless, I think that it is enough for a proof of
principle.

After resetting the database, I ‘loaded’ the data by accessing it:

for package in bioperl clustalx mummer seaview perlprimer samtools dicomscope \
               clustalw r-cran-combinat r-cran-haplo.stats r-cran-qvalue \
               r-cran-randomforest r-cran-rocr r-other-bio3d mira bwa infernal
do
    # -q silences wget's progress output; -O - writes the reply to stdout.
    wget -q -O - "http://upstream-metadata.debian.net/$package/DOI"
done

After loading, the resulting table is available here:
http://upstream-metadata.debian.net/table/DOI

Obviously, not all packages contain programs that have been described in an
academic article (http://dx.doi.org/)…

For the moment, one has to access an arbitrary key to trigger a refresh, but
later the best would be to have a special key, for instance YAML-UPDATE, that
would force the update. If it is possible to have a per-file commit hook, then
each time an upstream-metadata.yaml is modified, the debian.net site can be
updated.

Next step is to feed the UDD. For the moment, the site produces one table per
keyword. The rationale is that for many keywords, the data will be too sparse
to be interesting for the UDD. My current idea is to generate the tables for a
limited set of curated keywords, assemble them (with the unix join command?),
and leave the result in a public place that the UDD can read.
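
As a rough sketch of the assembly step, here it is in Python rather than with
join. The function name join_tables, the input layout (one mapping from
package to value per keyword) and the keyword "Homepage" used in the example
are all assumptions made for illustration. Note that the join command by
default keeps only lines whose key is present in both inputs and expects sorted
input, whereas here every package found in any table is kept and missing
values are left empty.

```python
def join_tables(tables, keywords):
    """Merge per-keyword tables into one row per package.

    `tables` maps a keyword to a {package: value} dict (hypothetical layout).
    Returns rows of the form [package, value_for_kw1, value_for_kw2, ...],
    sorted by package, with "" where a package has no value for a keyword.
    """
    # Collect every package that appears in at least one selected table.
    packages = sorted(set().union(*(tables.get(k, {}) for k in keywords)))
    return [
        [pkg] + [tables.get(k, {}).get(pkg, "") for k in keywords]
        for pkg in packages
    ]


if __name__ == "__main__":
    # Placeholder values only; these are not real DOIs or URLs.
    demo = {
        "DOI": {"bwa": "doi-a", "mira": "doi-b"},
        "Homepage": {"bwa": "url-a", "seaview": "url-c"},
    }
    for row in join_tables(demo, ["DOI", "Homepage"]):
        print("\t".join(row))
```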

In parallel, as Olivier suggested, each table could be exported in RDF format.
But I am not sure I understand it. Olivier, could you suggest a Perl module to
use?

As long as we are in a draft phase, I think that we can live with the current
biggest limitation: the lack of support for packages that are not stored in a
VCS. One possible way to solve the problem is to provide a repository, for
instance in collab-maint on Alioth, where people can drop one yaml file per
source package. We could also unpack source files, as Andreas suggested.

For the UDD import, which of Andreas's two proposals would be the more
suitable?

> CREATE TABLE upstream-metadata (
>     package text,
>     key1    text,
>     key2    text,
>     ...
>     keyN    text,
>     PRIMARY KEY package
> );
 
> CREATE TABLE upstream-metadata (
>     package text,
>     key     text,
>     value   text,
>     PRIMARY KEY (package,key)
> );
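
To help compare the two layouts, here is a small sketch showing that the
second (package, key, value) form can still be presented as the first, one
column per key, at query time. It uses an in-memory SQLite database as a
stand-in for the UDD. The table is named upstream_metadata with an underscore
because an unquoted hyphen is not accepted in an SQL identifier; the helper
names load and wide_view and the keyword "Homepage" are hypothetical.

```python
import sqlite3


def load(rows):
    """Create the key/value table in memory and fill it with `rows`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE upstream_metadata (
               package TEXT,
               key     TEXT,
               value   TEXT,
               PRIMARY KEY (package, key)
           )"""
    )
    conn.executemany("INSERT INTO upstream_metadata VALUES (?, ?, ?)", rows)
    return conn


def wide_view(conn, keys):
    """Return one row per package with one column per requested key."""
    # One conditional aggregate per key; packages lacking a key get NULL.
    columns = ["package"] + ["MAX(CASE WHEN key = ? THEN value END)" for _ in keys]
    sql = (
        "SELECT " + ", ".join(columns)
        + " FROM upstream_metadata GROUP BY package ORDER BY package"
    )
    return conn.execute(sql, list(keys)).fetchall()


if __name__ == "__main__":
    # Placeholder values only; these are not real DOIs or URLs.
    conn = load([
        ("bwa", "DOI", "doi-a"),
        ("bwa", "Homepage", "url-a"),
        ("mira", "DOI", "doi-b"),
    ])
    print(wide_view(conn, ["DOI", "Homepage"]))
```

The key/value form accepts a new field without any schema change, at the cost
of a slightly more involved query whenever a wide table is wanted.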

Since the addition of more meta-data to our source packages is a frequent issue
raised on debian-devel, I think that there is a general interest in
standardising ‘field’ names, whichever technical solution is adopted. I will
try to find a proper place on wiki.debian.org to let people document the fields
they create, and if necessary discuss them.

Have a nice day,

-- 
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan

