Gathering package upstream meta-data in the UDD. (was: Re: more formally indicating the registration URL)
- To: firstname.lastname@example.org
- Subject: Gathering package upstream meta-data in the UDD. (was: Re: more formally indicating the registration URL)
- From: Charles Plessy <email@example.com>
- Date: Thu, 22 Oct 2009 00:30:06 +0900
- Message-id: <20091021153006.GA28743@kunpuu.plessy.org>
- In-reply-to: <4A795981.firstname.lastname@example.org>
- References: <4A7309D8.email@example.com> <20090803182130.GC975@an3as.eu> <20090805023802.GA22086@kunpuu.plessy.org> <20090805072124.GB16430@an3as.eu> <4A795981.firstname.lastname@example.org>
Le Wed, Aug 05, 2009 at 12:05:53PM +0200, Steffen Moeller a écrit :
> Andreas Tille wrote:
> > On Wed, Aug 05, 2009 at 11:38:02AM +0900, Charles Plessy wrote:
> >> I have been thinking a bit on the issue. How about the following workflow:
> >> - Create a new file with a “Name: contents” field syntax in the Debian source
> >> packages, for “online meta-data” that typically require internet access to
> >> be useful.
> > Sounds reasonable.
> I agree.
> Could we somehow prototype what we want to achieve?
> Could you pair that with an incremental implementation plan? And ask for help where you
> want help?
It took some time, but I now have a more concrete proposal.
First of all, let's summarise the situation. We want to integrate some metadata
in our “web sentinels”, like ‘http://debian-med.alioth.debian.org/tasks/bio’.
The simplest for creating these pages is to centralise all the information in
the Ultimate Debian Database (http://udd.debian.org/). Typical metadata is
bibliographic information or registration URL. The UDD is fed with tables that
have to be deposited in a trusted location. The issue is how to prepare the
tables with data collected by multiple package maintainers.
What I propose is to have a special file in the source packages for gathering
all potentially useful information, debian/upstream-metadata.yaml. Unlike
debian/control, this file would not contribute data to the Packages.gz files of
the Debian archive. I think that there are enough source packages managed in
version control systems that we can use them as the main source of our data.
This makes debian/upstream-metadata.yaml available independently of the Debian
archive and, more importantly, allows the metadata to be updated without
uploading the package, but in a way that only the maintainers can do the
update, which keeps things under control.
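To make the “Name: contents” idea concrete, here is a minimal sketch in Python of how such a flat file could be read. It is only an illustration: the sample content is hypothetical (only the PMID value appears in this message, and the Registration field name and its URL are assumptions), and the function handles plain one-line fields, not full YAML.

```python
# Hypothetical content of a debian/upstream-metadata.yaml file.
# The "Registration" field name and its URL are assumptions for illustration.
SAMPLE = """\
PMID: 19505943
Registration: http://example.org/register
"""


def parse_simple_metadata(text):
    """Parse flat 'Name: contents' lines into a dict.

    This is not a YAML parser: it only understands one field per line.
    """
    metadata = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so URLs in the value stay intact.
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"not a 'Name: contents' line: {line!r}")
        metadata[key.strip()] = value.strip()
    return metadata


if __name__ == "__main__":
    print(parse_simple_metadata(SAMPLE))
```

A real implementation would more likely use a proper YAML library, so that nested or multi-line values are handled as well.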
The missing piece of the puzzle is then an aggregator that would collect the
information from the source packages and prepare tables for the UDD. I am drafting
such a program at http://upstream-metadata.debian.net/. Currently, it does
not do much:
http://upstream-metadata.debian.net/<package>/ALL gets debian/upstream-metadata.yaml if
the package is in a Subversion repository that is accessible to ‘debcheckout’. Luckily,
most of our packages are.
http://upstream-metadata.debian.net/<package>/<key> gives the content of the
metadata for one key.
For instance, http://upstream-metadata.debian.net/samtools/PMID gives the
PubMed identification number for the article describing SamTools, 19505943.
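For anyone who wants to query these URLs from a script, a small client-side sketch in Python follows. It is not the aggregator itself, only an illustration of the <package>/<key> URL scheme described above. The function names are hypothetical, and the assumption that the service answers with plain UTF-8 text is mine.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://upstream-metadata.debian.net"


def metadata_url(package, key="ALL"):
    """Build the URL for one key of one package ('ALL' for the whole file)."""
    return f"{BASE_URL}/{quote(package, safe='')}/{quote(key, safe='')}"


def fetch_value(package, key="ALL", timeout=10):
    """Fetch the value over HTTP.

    Assumes the service replies with plain UTF-8 text and is reachable.
    """
    with urlopen(metadata_url(package, key), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


if __name__ == "__main__":
    print(metadata_url("samtools", "PMID"))
```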
This is the proof of principle for data retrieval. Then, we need to construct
the tables. I plan to have the program store the results in a BerkeleyDB
database, and to make it output tables at regular intervals, for instance
daily. The update of the internal database would be done in two ways.
First, updates could be pushed with commit hooks when package maintainers
commit changes to debian/upstream-metadata.yaml. It could be as simple as
having a URL that triggers an update, and using wget or curl to activate the
Second, normal read access could trigger an update if the record is getting old.
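The two update paths could be sketched as follows in Python. This is a sketch under assumptions, not the actual program: the class and method names are hypothetical, an in-memory dict stands in for the BerkeleyDB store, and the one-day threshold for “getting old” is an arbitrary choice.

```python
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed threshold for a record "getting old"


class MetadataCache:
    """In-memory stand-in for the internal database (hypothetical design)."""

    def __init__(self, fetch, max_age=MAX_AGE_SECONDS, clock=time.time):
        self._fetch = fetch      # callable: package name -> metadata dict
        self._max_age = max_age
        self._clock = clock
        self._records = {}       # package -> (timestamp, metadata)

    def push_update(self, package):
        """First path: forced refresh, e.g. triggered by a commit hook."""
        metadata = self._fetch(package)
        self._records[package] = (self._clock(), metadata)
        return metadata

    def read(self, package):
        """Second path: a normal read refreshes the record only if it is stale."""
        record = self._records.get(package)
        if record is None or self._clock() - record[0] > self._max_age:
            return self.push_update(package)
        return record[1]
```

Passing the clock and the fetch function as parameters keeps the refresh policy testable without network access.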
In summary, I propose to store metadata in YAML format in the source packages,
retrieve and store it in a central place using a web agent through the VCS in
which the source packages are stored, and periodically output tables for the
UDD, which keeps a central role for the generation of our web sentinel pages.
The proof of principle presented above is only a few lines of code, but I would
prefer to discuss the idea further before putting more time into it.
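As a basis for that discussion, the periodic table output could look roughly like the Python sketch below. The three-column (package, key, value) layout and the tab-separated format are assumptions, since the format expected by the UDD importers is not specified here, and a plain dict again stands in for the BerkeleyDB database.

```python
import csv
import io


def build_table(store):
    """Flatten {package: {key: value}} into sorted (package, key, value) rows."""
    rows = []
    for package in sorted(store):
        for key in sorted(store[package]):
            rows.append((package, key, store[package][key]))
    return rows


def write_table(rows, stream):
    """Write the rows as tab-separated text with a header line."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(("package", "key", "value"))
    writer.writerows(rows)


if __name__ == "__main__":
    store = {"samtools": {"PMID": "19505943"}}
    buffer = io.StringIO()
    write_table(build_table(store), buffer)
    print(buffer.getvalue(), end="")
```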
Lastly, I have accumulated a dozen debian/upstream-metadata.yaml files in
the packages I maintain, so that meaningful tests are doable for table
generation later. I do not remember the list by heart, but it contains seaview,
bwa, clustalw, clustalx, perlprimer, samtools, and most of the packages I have
Since I am quite inexperienced in programming, help is of course most welcome.
Have a nice day,
Debian Med packaging team,
Tsurumi, Kanagawa, Japan