[UDD] Fixing (most) email addresses in upload_history table
Hi,
I was aksed to do some investigation in who is active in maintainer
teams and thus I looked deeper into upload_history to find out who is
responsible for the changelog entries in the uploaded packages. When
doing so I noticed that a lot of email addresses in this table are
broken. This basically concerns:
udd# SELECT DISTINCT changed_by, changed_by_name, changed_by_email, regexp_replace(changed_by, E'^[^<]\+<\([.\\w]\+@[.\\w]\+\)>.*', E'\\1') AS changed_by_email_new, regexp_replace(changed_by, E'^[^\\w]*\([^<]\+[.\\w]\) *<[.\\w]\+@[.\\w]\+>.*', E'\\1') AS changed_by_name_new FROM upload_history WHERE changed_by_email NOT LIKE '%@%' AND changed_by LIKE '%<%@%.%>%' ;
changed_by | changed_by_name | changed_by_email | changed_by_email_new | changed_by_name_new
---------------------------------------------+-----------------+------------------+------------------------+-----------------------
Thomas Bushnell, BSG <tb@becket.net> | | Thomas | tb@becket.net | Thomas Bushnell, BSG
Ivan, Wong Yat Cheung <email@ivanwong.info> | | Ivan | email@ivanwong.info | Ivan, Wong Yat Cheung
Thomas Bushnell, BSG <tb@debian.org> | | Thomas | tb@debian.org | Thomas Bushnell, BSG
Adam C. Powell, IV <hazelsct@debian.org> | | Adam | hazelsct@debian.org | Adam C. Powell, IV
RISKO, Gergely <risko@debian.org> | | RISKO | risko@debian.org | RISKO, Gergely
Eric Gillespie, Jr. <epg@progeny.com> | | Eric | epg@progeny.com | Eric Gillespie, Jr.
TransNexus, Inc. <support@transnexus.com> | | TransNexus | support@transnexus.com | TransNexus, Inc.
Eric Gillespie, Jr. <epg@debian.org> | | Eric | epg@debian.org | Eric Gillespie, Jr.
John H. Robinson, IV <jaqque@debian.org> | | John | jaqque@debian.org | John H. Robinson, IV
$"Jaldhar H. Vyas <jaldhar@debian.org>" | | $ | jaldhar@debian.org | Jaldhar H. Vyas
And it turns out that basically the comma (',') in the maintainer name
has caused the parser to fail. UDD contains more than 700 entries of
this type (here hidden by the DISTINCT). I'm quite interested in
getting this fixed - at least the email field - because otherwise I
have no good chance to reliable get the stats for Debian Science Team
(Adam C. Powell is quite active here and he would simply vanish from
the stats if this is not fixed).
The question is now: As you can see there is a quite simple fix with
regular expressions which probably can easily ported to the importer
code and which helps in > 700 cases reliably (there are four cases which
can not catched by this means (see bottom of the mail) but we just fetch
the vast majority of broken but easy to guess e-mail addresses fixed.
Is there anything wrong in my arguing that this should be fixed. If
not I will provide a patch for the importer.
Kind regards
Andreas.
udd=# SELECT changed_by, changed_by_name, changed_by_email FROM upload_history WHERE changed_by_email NOT LIKE '%@%' AND changed_by NOT LIKE '%@%' AND changed_by != 'N/A' ;
changed_by | changed_by_name | changed_by_email
--------------------------------------+-----------------+--------------------
Jaldhar H. Vyas <jaldhar.debian.org> | Jaldhar H. Vyas | jaldhar.debian.org
ericvb | | ericvb
Jaldhar H. Vyas <jaldhar.debian.org> | Jaldhar H. Vyas | jaldhar.debian.org
ljlane | | ljlane
--
http://fam-tille.de
Reply to: