[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: UDD contains names where spaces are not stripped



Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:
> On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
> > Hi,
> > 
> > by chance I realised that the uploaders table contains some names where names
> > are not stripped:
> > 
> > udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
> >      name_with_spaces     |                 uploader                  
> > --------------------------+-------------------------------------------
> >  " Mehdi Dogguy"          |  Mehdi Dogguy <mehdi@debian.org>
> >  " David Paleino"         |  David Paleino <dapal@debian.org>
> >  " Stéphane Glondu"      |  Stéphane Glondu <glondu@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> >  "Andreas Tille  "        | Andreas Tille   <tille@debian.org>
> >  " LI Daobing"            |  LI Daobing <lidaobing@debian.org>
> >  " David Paleino"         |  David Paleino <dapal@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> >  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
> >  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
> >  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
> > (20 rows)
> > ...
> >    UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
> > 
> 
> Uploaders is refreshed every few hours from archive data, so a one-time
> UPDATE would not help. UDD usually tries to preserve inaccuracies, so
> those might be interesting for QA work.

OK.

> In your case, why don't you use the email address to identify uploaders?

Since this also does not work:

udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader;
 count |              uploader              
-------+------------------------------------
     1 | Andreas Tille   <tille@debian.org>
     1 | Andreas Tille <andreas@an3as.eu>
  8785 | Andreas Tille <tille@debian.org>
(3 Zeilen)

> (possibly combining it with the carnivore data to identify different emails
> belonging to the same person ?)

I could fiddle around with carnivore but that's overkill for thst
purpose and I insist that not stripping blanks from names does not make
any sense, IMHO.  (1 Zeile)


BTW:  I found 

udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%'  group by name;
 count |     name      
-------+---------------
 16524 | Andreas Tille
(1 Zeile)

So why do I have 8707 uploads per uploaders but 16524 per upload_history?

Is my assumption wrong that both values should match (modulo some wrongly
spelled names)

Kind regards
    Andreas.

-- 
http://fam-tille.de


Reply to: