Re: [UDD] Fixing (most) email addresses in upload_history table
On 22/01/11 at 15:19 +0100, Andreas Tille wrote:
> Hi,
>
> I was aksed to do some investigation in who is active in maintainer
> teams and thus I looked deeper into upload_history to find out who is
> responsible for the changelog entries in the uploaded packages. When
> doing so I noticed that a lot of email addresses in this table are
> broken. This basically concerns:
>
> udd# SELECT DISTINCT changed_by, changed_by_name, changed_by_email, regexp_replace(changed_by, E'^[^<]\+<\([.\\w]\+@[.\\w]\+\)>.*', E'\\1') AS changed_by_email_new, regexp_replace(changed_by, E'^[^\\w]*\([^<]\+[.\\w]\) *<[.\\w]\+@[.\\w]\+>.*', E'\\1') AS changed_by_name_new FROM upload_history WHERE changed_by_email NOT LIKE '%@%' AND changed_by LIKE '%<%@%.%>%' ;
> changed_by | changed_by_name | changed_by_email | changed_by_email_new | changed_by_name_new
> ---------------------------------------------+-----------------+------------------+------------------------+-----------------------
> Thomas Bushnell, BSG <tb@becket.net> | | Thomas | tb@becket.net | Thomas Bushnell, BSG
> Ivan, Wong Yat Cheung <email@ivanwong.info> | | Ivan | email@ivanwong.info | Ivan, Wong Yat Cheung
> Thomas Bushnell, BSG <tb@debian.org> | | Thomas | tb@debian.org | Thomas Bushnell, BSG
> Adam C. Powell, IV <hazelsct@debian.org> | | Adam | hazelsct@debian.org | Adam C. Powell, IV
> RISKO, Gergely <risko@debian.org> | | RISKO | risko@debian.org | RISKO, Gergely
> Eric Gillespie, Jr. <epg@progeny.com> | | Eric | epg@progeny.com | Eric Gillespie, Jr.
> TransNexus, Inc. <support@transnexus.com> | | TransNexus | support@transnexus.com | TransNexus, Inc.
> Eric Gillespie, Jr. <epg@debian.org> | | Eric | epg@debian.org | Eric Gillespie, Jr.
> John H. Robinson, IV <jaqque@debian.org> | | John | jaqque@debian.org | John H. Robinson, IV
> $"Jaldhar H. Vyas <jaldhar@debian.org>" | | $ | jaldhar@debian.org | Jaldhar H. Vyas
>
> And it turns out that basically the comma (',') in the maintainer name
> has caused the parser to fail. UDD contains more than 700 entries of
> this type (here hidden by the DISTINCT). I'm quite interested in
> getting this fixed - at least the email field - because otherwise I
> have no good chance to reliable get the stats for Debian Science Team
> (Adam C. Powell is quite active here and he would simply vanish from
> the stats if this is not fixed).
>
> The question is now: As you can see there is a quite simple fix with
> regular expressions which probably can easily ported to the importer
> code and which helps in > 700 cases reliably (there are four cases which
> can not catched by this means (see bottom of the mail) but we just fetch
> the vast majority of broken but easy to guess e-mail addresses fixed.
>
> Is there anything wrong in my arguing that this should be fixed. If
> not I will provide a patch for the importer.
Yes, please fix this in the importer.
L.
Reply to: