UDD gatherer for DDTP translations
On Fri, 13 Feb 2009, Lucas Nussbaum wrote:
It depends on your data source. I'm not familiar with DDTP. If (package,
version) is enough as a primary key, let's just use that.
I committed a DDTP importer to collab-qa/udd. It is based on DDTP translation
files which are enriched with the package version (compared to those which are
distributed on all mirrors) and are available at
http://ddtp.debian.net/Translation_udd
(see svn://svn.debian.org/svn/collab-qa/udd/config_ddtp.yaml).
I met last weekend with Grisu in person and the explanation for the issue
is the following:
Originally Grisu had the concept that the MD5 sum of an English description
is sufficient as a key to assign a translation to a certain package. He
considered the version number of the package as uninteresting because a
description might be constant over several versions of a package. The
tools which work with the Translation files are adapted to this
philosophy. There are several of them. Grisu told me that he provides
only uncompressed text files which are grabbed by some ftpmaster tools which
check the contents of the files first (so adding an extra field would not
pass this test for the moment) and propagate compressed versions to the
Debian mirrors. Tools like apt and others might rely on this format.
It would probably be cheap to provide a patch which just ignores an
additional field - perhaps everything even works out of the box - but
currently we do not know this.
When I started with the UDD gatherer for DDTP I learned that there are
several translations for the same package in sid. The reason is that
some architectures might not catch up as quickly as others, and if
the description of such a package has changed you end up with two or
more translations for one package and have to make a reasonable assignment
to the packages which are inside UDD.
To tackle this I tried to calculate MD5 sums of the package descriptions,
which turned out to be quite error prone. The code became hard to read, hacky
and not really reliable (perhaps it is just me - but anyway). So the
best idea turned out to be adding the version information directly
to the Translation files. There was some arguing with Grisu about
redundancy. First, I think redundancy is not bad per se - there
might be cases where it makes sense - for instance if code becomes
more robust and reliable (and in addition avoids expensive calculations -
compare calculating an MD5 sum *and* comparing the result with just
comparing a version string). Moreover, it is not redundant inside the
DDTP table - it just adds the extra information about the version which
actually *is* in the package pool (as I explained above, an MD5 sum might
match several versions).
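
To illustrate the difference between the two keys, here is a minimal
Python sketch. The package records are made up, and the assumption that
the MD5 sum is taken over the UTF-8 description plus a trailing newline
is mine, not something taken from the DDTP code:

```python
import hashlib


def description_md5(description: str) -> str:
    """MD5 hex digest of a description.

    Assumption: the sum is computed over the UTF-8 text plus a trailing
    newline. Adjust this if the real files are produced differently.
    """
    return hashlib.md5((description + "\n").encode("utf-8")).hexdigest()


# Hypothetical records: (package, version, English description)
packages = [
    ("foo", "1.0-1", "example tool\n does things"),
    ("foo", "1.0-2", "example tool\n does things"),
    ("foo", "1.1-1", "example tool\n does more things"),
]

# Keyed by (package, md5): versions sharing a description collapse into
# one entry, so the key alone cannot say which version is meant.
by_md5 = {}
for package, version, desc in packages:
    by_md5.setdefault((package, description_md5(desc)), []).append(version)

# Keyed by (package, version): one entry per version, and a lookup is a
# plain string comparison with no hashing involved.
by_version = {(p, v): description_md5(d) for p, v, d in packages}
```

With the MD5 key the first two versions of "foo" end up in the same
entry, while the (package, version) key keeps all three apart.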
The result of these considerations was that Grisu now runs the very
same export job for the DDTP database twice: once into the established
format without version information and once into the version enriched
format for a simple import into UDD. If you agree I will try to make
this the "single official" format because I'm not really happy about
having an extra service for UDD - sooner or later things might diverge
and it is better to have a single default.
This is the current situation and the things I describe below are
based on these version enriched DDTP files.
Commits to svn://svn.debian.org/svn/collab-qa/udd/
1. config_ddtp.yaml
Configuration file to set path, location of the ddtp files and
the releases we consider. We import all packages which are
supported by ddtp - so no need to explicitly specify the
languages.
2. sql/ddtp.sql
Create the table in UDD. Some fields contain comments. I wonder
whether we should rely on the inline comments in this file or
whether we should implement "COMMENT ON TABLE ddtp IS ...".
Just tell me what you prefer.
3. scripts/fetch_ddtp_translations.sh
Fetch the Translation-<lang>.gz files from the DDTP server via
http using curl. I did not find a better method to obtain
"all files in a web directory" (we want all supported languages
safely even if some additions might occur) than using curl
together with the contributed script
http://cool.haxx.se/cvs.cgi/curl/perl/contrib/getlinks.pl.in
I'm not perfectly happy to use a not yet packaged script and
perhaps I should implement the fetching script using perl
LWP::UserAgent - just tell me if you see the current method
as a drawback and I'll change this.
4. scripts/getlinks.pl
The script from curl contrib mentioned above.
5. udd/ddtp_gatherer.py
The actual gatherer which parses the Translation-<lang>.gz
files fetched previously and injects the information into the
table ddtp of UDD. The table is emptied completely before every
import and then filled with the content of all fetched Translation
files.
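
The parsing step of item 5 could look roughly like the following sketch.
It is not the committed ddtp_gatherer.py. It assumes that the version
enriched files consist of blank-line separated stanzas with the fields
Package, Version, Description-md5 and Description-<lang>, and the layout
of the returned tuples is purely hypothetical:

```python
import gzip


def parse_stanzas(lines):
    """Yield one dict per blank-line separated stanza of 'Key: value' lines."""
    record, last_key = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current stanza.
            if record:
                yield record
            record, last_key = {}, None
        elif line[0] in " \t":
            # Continuation line: append it to the previous field.
            if last_key is not None:
                record[last_key] += "\n" + line[1:]
        else:
            key, _, value = line.partition(":")
            last_key = key
            record[key] = value.strip()
    if record:
        yield record


def rows(lines, lang):
    """Yield (package, version, lang, short, long, md5) tuples (hypothetical layout)."""
    for rec in parse_stanzas(lines):
        text = rec.get("Description-" + lang)
        if text is None or "Package" not in rec or "Version" not in rec:
            continue  # skip stanzas that lack the fields we key on
        short, _, long_ = text.partition("\n")
        yield (rec["Package"], rec["Version"], lang, short, long_,
               rec.get("Description-md5"))


def rows_from_file(path, lang):
    """Read a Translation-<lang>.gz file and return its rows as a list."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(rows(handle, lang))
```

Because every row carries the version, assigning it to a package in UDD
needs no MD5 calculation at all.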
Remark: I have some "more or less working hackish" code for
gathering the information of Translation files without versions.
Just tell me whether I should commit this for comparison.
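
Regarding the fetching in item 3: as an alternative to getlinks.pl (and
to the LWP::UserAgent idea), the "all files in a web directory" problem
could also be handled with the Python standard library only. This is
just a sketch under the assumption that the server returns a plain HTML
index page with one link per file; the pattern for the file names is my
guess and the function names are made up:

```python
import re
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Assumed naming scheme: Translation-de.gz, Translation-pt_BR.gz, ...
TRANSLATION_RE = re.compile(r"^Translation-[A-Za-z_@]+\.gz$")


def translation_urls(index_html, base_url):
    """Return absolute URLs of all Translation-<lang>.gz links in the index."""
    collector = LinkCollector()
    collector.feed(index_html)
    return [urljoin(base_url, href)
            for href in collector.links
            if TRANSLATION_RE.match(href)]


def fetch_index(url):
    """Download an index page and return it as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

New languages would be picked up automatically, since nothing but the
file name pattern is hard coded.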
The gatherer works if you try:
python udd.py config_ddtp.yaml update ddtp
python udd.py config_ddtp.yaml run ddtp
Please tell me what steps have to be done next to finally let this run as
an official UDD gatherer in the regular cron job.
Kind regards
Andreas.
--
http://fam-tille.de