
UDD gatherer for DDTP translations



On Fri, 13 Feb 2009, Lucas Nussbaum wrote:

It depends on your data source. I'm not familiar with DDTP. If (package,
version) is enough as a primary key, let's just use that.

I committed a ddtp importer to collab-qa/udd.  It is based on DDTP translation
files which are enriched by the package version (compared to those which are
distributed on all mirrors) and are available at
   http://ddtp.debian.net/Translation_udd
(see svn://svn.debian.org/svn/collab-qa/udd/config_ddtp.yaml).

I met last weekend with Grisu in person and the explanation for the issue
is the following:

  Originally Grisu had the concept that the MD5 sum of an English description
  is sufficient as a key to assign a translation to a certain package. He
  considered the version number of the package as uninteresting because a
  description might be constant over several versions of a package.  The
  tools which are working with the Translation files are adapted to this
  philosophy.  There are several of them.  Grisu told me that he provides
  only uncompressed text files which are grabbed by some ftpmaster tools which
  check the contents of the files first (so adding an extra field would not
  pass this test for the moment) and propagate compressed versions to the
  Debian mirrors.  Tools like apt and others might rely on this format.
  It would probably be cheap to provide a patch which just ignores an
  additional field - perhaps everything might work out of the box - but
  currently we do not know this.

  When I started with the UDD gatherer for DDTP I learned that there are
  several translations for the same package in sid.  The reason is that
  some architectures might not catch up as quickly as others, and if
  the description of such a package has changed you end up with two or
  more translations for one package and have to make a reasonable assignment
  to the packages which are inside UDD.

  To tackle this I tried to calculate MD5 sums of the package descriptions,
  which turned out to be quite error-prone.  The code became hard to read,
  hacky and not really reliable (perhaps it is just me - but anyway).  So
  the best idea turned out to be adding the version information directly
  to the Translation files.  There was some arguing with Grisu about
  redundancy.  First, I think redundancy is not bad per se - there
  might be cases where it makes sense - for instance if code becomes
  more robust and reliable (and in addition avoids expensive calculations -
  compare calculating an MD5 sum *and* comparing the result against just
  comparing a version string).  Moreover it is not redundant inside the
  DDTP table - it just adds the extra information about the version which
  actually *is* in the package pool (as I explained above, an MD5 sum might
  be valid for several versions).
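
  To illustrate the difference between the two keys, here is a minimal
  sketch.  The package data is made up, and the way the MD5 sum is built
  (UTF-8 text plus a trailing newline) is an assumption for this example,
  not taken from the DDTP code.

```python
import hashlib


def description_md5(description):
    """MD5 hex digest of a description.

    Assumption for this sketch: UTF-8 encoding plus a trailing newline.
    """
    return hashlib.md5((description + "\n").encode("utf-8")).hexdigest()


# Hypothetical packages: (name, version, English description)
packages = [
    ("foo", "1.0-1", "An example tool\n It does things."),
    ("foo", "1.0-2", "An example tool\n It does things."),
    ("foo", "1.1-1", "An example tool\n It does other things."),
]

# Key 1: MD5 of the description.  Several versions share one entry, so the
# version actually in the pool can not be recovered from the key alone.
by_md5 = {}
for name, version, desc in packages:
    by_md5.setdefault(description_md5(desc), []).append((name, version))

# Key 2: (package, version).  One entry per version, plain string comparison.
by_version = {(name, version): desc for name, version, desc in packages}
```

  With the MD5 key the first two versions collapse into one entry, while
  (package, version) keeps all three apart without any hashing.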

  The result of these considerations was that Grisu now runs the very
  same job to export the DDTP database twice: once into the established
  format without version information and once into the version-enriched
  format for a simple import into UDD.  If you agree I will try to make
  this the "single official" format because I'm not really happy about
  having an extra service for UDD - sooner or later things might diverge
  and it is better to have a single default.

  This is the current situation and the things I describe below are
  based on these version enriched DDTP files.
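
  A rough sketch of how such a file could be read and keyed by
  (package, version).  The field names (Package, Version, Description-md5,
  Description-<lang>) and the sample content are assumptions for this
  example; the real layout is whatever the files at the URL above contain,
  and this is not the code from udd/ddtp_gatherer.py.

```python
def parse_stanzas(text):
    """Yield one dict per blank-line-separated stanza of 'Field: value' lines."""
    stanza, last = {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current stanza.
            if stanza:
                yield stanza
            stanza, last = {}, None
        elif line[0] in " \t" and last is not None:
            # Continuation line: belongs to the previous field.
            stanza[last] += "\n" + line
        else:
            key, _, value = line.partition(":")
            last = key.strip()
            stanza[last] = value.strip()
    if stanza:
        yield stanza


def key_by_package_version(text, lang):
    """Map (package, version) to the translated description for one language."""
    field = "Description-" + lang
    rows = {}
    for s in parse_stanzas(text):
        if field in s and "Package" in s and "Version" in s:
            rows[(s["Package"], s["Version"])] = s[field]
    return rows


# Hypothetical sample in the assumed format.
SAMPLE = """\
Package: foo
Version: 1.0-1
Description-md5: 0123456789abcdef0123456789abcdef
Description-de: Ein Beispielwerkzeug
 Erledigt Dinge.

Package: foo
Version: 1.1-1
Description-md5: fedcba9876543210fedcba9876543210
Description-de: Ein Beispielwerkzeug
 Erledigt andere Dinge.
"""

rows = key_by_package_version(SAMPLE, "de")
```

  Two versions of the same package with different descriptions end up as
  two separate rows, which is exactly the case that was hard to handle
  with MD5 sums alone.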

Commits to svn://svn.debian.org/svn/collab-qa/udd/

  1. config_ddtp.yaml
     Configuration file to set path, location of the ddtp files and
     the releases we consider.  We import all packages which are
     supported by ddtp - so no need to explicitly specify the
     languages.
  2. sql/ddtp.sql
     Create the table in UDD.  Some fields contain comments.  I wonder
     whether we should rely on the inline comments in this file or
     whether we should implement "COMMENT ON TABLE ddtp IS ...".
     Just tell me what you prefer.
  3. scripts/fetch_ddtp_translations.sh
     Fetch the Translation-<lang>.gz files from the DDTP server via
     http using curl.  I did not find a better method to obtain
     "all files in a web directory" (we want all supported languages
     safely even if some additions might occur) than using curl in
     combination with the contributed script
      http://cool.haxx.se/cvs.cgi/curl/perl/contrib/getlinks.pl.in
     I'm not perfectly happy to use a not yet packaged script and
     perhaps I should implement the fetching script using perl
     LWP::UserAgent - just tell me if you see the current method
     as a drawback and I'll change this.
  4. scripts/getlinks.pl
     The script from curl contrib mentioned above.
  5. udd/ddtp_gatherer.py
     The actual gatherer which parses the Translation-<lang>.gz
     files fetched previously and injects the information into the
     table ddtp of UDD.  The table is completely deleted before every
     import, and then the content of all fetched Translation files is
     imported.
     Remark: I have some "more or less working hackish" code for
     gathering the information of Translation files without versions.
     Just tell me whether I should commit this for comparison.
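
If the unpackaged getlinks.pl turns out to be a problem, the "all files in
a web directory" step could also be done with the Python standard library
only.  This is just a sketch of that idea, not what
scripts/fetch_ddtp_translations.sh does; the names LinkCollector and
translation_urls are made up here, and it assumes the server returns a
plain HTML index with relative links.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def translation_urls(index_html, base_url):
    """Return absolute URLs of all Translation-<lang>.gz links in an index page."""
    collector = LinkCollector()
    collector.feed(index_html)
    # Language codes such as "de" or "pt_BR"; the pattern is an assumption.
    pattern = re.compile(r"^Translation-[A-Za-z_@]+\.gz$")
    return [urljoin(base_url, h) for h in collector.hrefs if pattern.match(h)]
```

The index page itself would still have to be downloaded first (with curl
or urllib.request.urlopen); new languages are then picked up automatically
because nothing is hard-coded except the file name pattern.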

The gatherer works if you try:

 python udd.py config_ddtp.yaml update ddtp
 python udd.py config_ddtp.yaml run ddtp


Please tell me what steps have to be done next to finally let this work as
an official UDD gatherer in the regular cron job.

Kind regards

      Andreas.

--
http://fam-tille.de

