
Re: UDD contains names where spaces are not stripped



On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
> Hi,
> 
> by chance I realised that the uploaders table contains some names where
> leading or trailing spaces are not stripped:
> 
> udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
>      name_with_spaces     |                 uploader                  
> --------------------------+-------------------------------------------
>  " Mehdi Dogguy"          |  Mehdi Dogguy <mehdi@debian.org>
>  " David Paleino"         |  David Paleino <dapal@debian.org>
>  " Stéphane Glondu"       |  Stéphane Glondu <glondu@debian.org>
>  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
>  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
>  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
>  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
>  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
>  "Andreas Tille  "        | Andreas Tille   <tille@debian.org>
>  " LI Daobing"            |  LI Daobing <lidaobing@debian.org>
>  " David Paleino"         |  David Paleino <dapal@debian.org>
>  " Stefano Zacchiroli"    |  Stefano Zacchiroli <zack@debian.org>
>  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
>  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
>  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
>  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
>  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yoush@debian.org>
>  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
>  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
>  "Colin Tuckley "         | Colin Tuckley  <colint@debian.org>
> (20 rows)
> 
> 
> This causes slight errors when counting uploads per person.  My guess is
> that this is due to some old importer code (I checked the hit for my name,
> which is a pretty old upload).  Thus I wonder whether the easiest fix might
> be a suitable UPDATE statement that removes the unneeded spaces.  This
> statement does the trick in my local clone:
> 
>    UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
> 
> If I'm not mistaken, historic uploads will not be imported from scratch, so
> this would cure the situation.  Otherwise users always need to remember
> to add trim(name) when dealing with the uploaders.name column, not to
> mention that the uploader column is even harder to deal with, since it
> might feature extra spaces in the middle.
> 
> What do you think?

Hi,

The uploaders table is refreshed every few hours from archive data, so a
one-time UPDATE would not help: the next refresh would bring the spaces
back. UDD usually tries to preserve inaccuracies in the source data, since
those might be interesting for QA work.
In your case, why don't you use the email address to identify uploaders
(possibly combining it with the carnivore data to identify different emails
belonging to the same person)?
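
To illustrate the difference between counting by name and counting by
email, here is a minimal sketch. It does not run against UDD: it uses an
in-memory SQLite table as a stand-in for uploaders, modelling only the
name, uploader and email columns mentioned in this thread. The padded rows
are taken from the query output quoted above; the unpadded rows are
invented for contrast. The carnivore step is left out because its schema is
not shown here.

```python
import sqlite3

# Toy stand-in for UDD's uploaders table (assumption: only these three
# columns are modelled; the real table lives in PostgreSQL).
rows = [
    # Padded names as seen in the quoted query output.
    (" Mehdi Dogguy", " Mehdi Dogguy <mehdi@debian.org>", "mehdi@debian.org"),
    ("Colin Tuckley ", "Colin Tuckley  <colint@debian.org>", "colint@debian.org"),
    # Hypothetical clean rows for the same people, added for contrast.
    ("Mehdi Dogguy", "Mehdi Dogguy <mehdi@debian.org>", "mehdi@debian.org"),
    ("Colin Tuckley", "Colin Tuckley <colint@debian.org>", "colint@debian.org"),
    ("Colin Tuckley", "Colin Tuckley <colint@debian.org>", "colint@debian.org"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uploaders (name TEXT, uploader TEXT, email TEXT)")
conn.executemany("INSERT INTO uploaders VALUES (?, ?, ?)", rows)

# Grouping by name splits one person into several groups when the
# stored name carries stray spaces.
by_name = conn.execute(
    "SELECT name, count(*) FROM uploaders GROUP BY name ORDER BY name"
).fetchall()

# Grouping by email is unaffected by whitespace in the name column.
by_email = conn.execute(
    "SELECT email, count(*) FROM uploaders GROUP BY email ORDER BY email"
).fetchall()

print(len(by_name), "groups by name")
print(by_email)
```

In this toy data the name-based grouping yields four groups for two
people, while the email-based grouping yields one group per person, and
the stored data stays untouched.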

Lucas

