[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

[UDD] Fixing (most) email addresses in upload_history table



Hi,

I was aksed to do some investigation in who is active in maintainer
teams and thus I looked deeper into upload_history to find out who is
responsible for the changelog entries in the uploaded packages.  When
doing so I noticed that a lot of email addresses in this table are
broken.  This basically concerns:

udd# SELECT DISTINCT changed_by, changed_by_name, changed_by_email,                                                                                                                             regexp_replace(changed_by, E'^[^<]\+<\([.\\w]\+@[.\\w]\+\)>.*', E'\\1') AS changed_by_email_new,                                                                                          regexp_replace(changed_by, E'^[^\\w]*\([^<]\+[.\\w]\) *<[.\\w]\+@[.\\w]\+>.*', E'\\1') AS changed_by_name_new                                                                        FROM upload_history                                                                                                                                                                      WHERE changed_by_email NOT LIKE '%@%'                                                                                                                                                       AND changed_by       LIKE '%<%@%.%>%'                                                                                                                                                  ;
                 changed_by                  | changed_by_name | changed_by_email |  changed_by_email_new  |  changed_by_name_new  
---------------------------------------------+-----------------+------------------+------------------------+-----------------------
 Thomas Bushnell, BSG <tb@becket.net>        |                 | Thomas           | tb@becket.net          | Thomas Bushnell, BSG
 Ivan, Wong Yat Cheung <email@ivanwong.info> |                 | Ivan             | email@ivanwong.info    | Ivan, Wong Yat Cheung
 Thomas Bushnell, BSG <tb@debian.org>        |                 | Thomas           | tb@debian.org          | Thomas Bushnell, BSG
 Adam C. Powell, IV <hazelsct@debian.org>    |                 | Adam             | hazelsct@debian.org    | Adam C. Powell, IV
 RISKO, Gergely <risko@debian.org>           |                 | RISKO            | risko@debian.org       | RISKO, Gergely
 Eric Gillespie, Jr. <epg@progeny.com>       |                 | Eric             | epg@progeny.com        | Eric Gillespie, Jr.
 TransNexus, Inc. <support@transnexus.com>   |                 | TransNexus       | support@transnexus.com | TransNexus, Inc.
 Eric Gillespie, Jr. <epg@debian.org>        |                 | Eric             | epg@debian.org         | Eric Gillespie, Jr.
 John H. Robinson, IV <jaqque@debian.org>    |                 | John             | jaqque@debian.org      | John H. Robinson, IV
 $"Jaldhar H. Vyas <jaldhar@debian.org>"     |                 | $                | jaldhar@debian.org     | Jaldhar H. Vyas

And it turns out that basically the comma (',') in the maintainer name
has caused the parser to fail.  UDD contains more than 700 entries of
this type (here hidden by the DISTINCT).  I'm quite interested in
getting this fixed - at least the email field -  because otherwise I
have no good chance to reliable get the stats for Debian Science Team
(Adam C. Powell is quite active here and he would simply vanish from
the stats if this is not fixed).

The question is now:  As you can see there is a quite simple fix with
regular expressions which probably can easily ported to the importer
code and which helps in > 700 cases reliably (there are four cases which
can not catched by this means (see bottom of the mail) but we just fetch
the vast majority of broken but easy to guess e-mail addresses fixed.

Is there anything wrong in my arguing that this should be fixed.  If
not I will provide a patch for the importer.

Kind regards

     Andreas.


udd=# SELECT changed_by, changed_by_name, changed_by_email                                                                                                                                  FROM upload_history                                                                                                                                                                      WHERE changed_by_email NOT LIKE '%@%'                                                                                                                                                       AND changed_by       NOT LIKE '%@%'                                                                                                                                                       AND changed_by       != 'N/A'                                                                                                                                                          ;
              changed_by              | changed_by_name |  changed_by_email  
--------------------------------------+-----------------+--------------------
 Jaldhar H. Vyas <jaldhar.debian.org> | Jaldhar H. Vyas | jaldhar.debian.org
 ericvb                               |                 | ericvb
 Jaldhar H. Vyas <jaldhar.debian.org> | Jaldhar H. Vyas | jaldhar.debian.org
 ljlane                               |                 | ljlane

-- 
http://fam-tille.de


Reply to: