[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: UDD contains names where spaces are not stripped



On 07/12/23 at 20:24 +0100, Andreas Tille wrote:
> Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:
> > On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
> > > Hi,
> > > 
> > > by chance I realised that the uploaders table contains some names where names
> > > are not stripped:
> > > 
> > > udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
> > >      name_with_spaces     |                 uploader                  
> > > --------------------------+-------------------------------------------
> > >  " Mehdi Dogguy"          |  Mehdi Dogguy <mehdi@debian.org>
> > >  " David Paleino"         |  David Paleino <dapal@debian.org>
> > >  " Stéphane Glondu"      |  Stéphane Glondu <glondu@debian.org>
> > >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> > >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> > >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> > >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> > >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> > >  "Andreas Tille  "        | Andreas Tille   <tille@debian.org>
> > >  " LI Daobing"            |  LI Daobing <lidaobing@debian.org>
> > >  " David Paleino"         |  David Paleino <dapal@debian.org>
> > >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
> > >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> > >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> > >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> > >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> > >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
> > >  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
> > >  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
> > >  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
> > > (20 rows)
> > > ...
> > >    UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
> > > 
> > 
> > Uploaders is refreshed every few hours from archive data, so a one-time
> > UPDATE would not help. UDD usually tries to preserve inaccuracies, so
> > those might be interesting for QA work.
> 
> OK.
> 
> > In your case, why don't you use the email address to identify uploaders?
> 
> Since this also does not work:
> 
> udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader;
>  count |              uploader              
> -------+------------------------------------
>      1 | Andreas Tille   <tille@debian.org>
>      1 | Andreas Tille <andreas@an3as.eu>
>   8785 | Andreas Tille <tille@debian.org>
> (3 Zeilen)
> 
> > (possibly combining it with the carnivore data to identify different emails
> > belonging to the same person ?)
> 
> I could fiddle around with carnivore but that's overkill for thst
> purpose and I insist that not stripping blanks from names does not make
> any sense, IMHO.  (1 Zeile)
> 
> 
> BTW:  I found 
> 
> udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%'  group by name;
>  count |     name      
> -------+---------------
>  16524 | Andreas Tille
> (1 Zeile)
> 
> So why do I have 8707 uploads per uploaders but 16524 per upload_history?
> 
> Is my assumption wrong that both values should match (modulo some wrongly
> spelled names)

If you look at the uploaders table, there are three columns:
- 'uploader', than contains the raw data
- 'name' and 'email' that contain the parsed (and trimmed) data

udd=> select uploader, name, email, count(*) from uploaders where uploader ilike '%tille%' group by 1,2,3;
              uploader              |      name       |      email       | count 
------------------------------------+-----------------+------------------+-------
 Andreas Tille <tille@debian.org>   | Andreas Tille   | tille@debian.org |  8785
 Andreas Tille <andreas@an3as.eu>   | Andreas Tille   | andreas@an3as.eu |     1
 Andreas Tille   <tille@debian.org> | Andreas Tille   | tille@debian.org |     1

So, just use name and/or email?

Lucas


Reply to: