Re: UDD contains names where spaces are not stripped
On 07/12/23 at 20:24 +0100, Andreas Tille wrote:
> Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:
> > On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
> > > Hi,
> > >
> > > by chance I realised that the uploaders table contains some names where names
> > > are not stripped:
> > >
> > > udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
> > > name_with_spaces | uploader
> > > --------------------------+-------------------------------------------
> > > " Mehdi Dogguy" | Mehdi Dogguy <mehdi@debian.org>
> > > " David Paleino" | David Paleino <dapal@debian.org>
> > > " Stéphane Glondu" | Stéphane Glondu <glondu@debian.org>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > > "Andreas Tille " | Andreas Tille <tille@debian.org>
> > > " LI Daobing" | LI Daobing <lidaobing@debian.org>
> > > " David Paleino" | David Paleino <dapal@debian.org>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > > "Colin Tuckley " | Colin Tuckley <colint@debian.org>
> > > "Colin Tuckley " | Colin Tuckley <colint@debian.org>
> > > "Colin Tuckley " | Colin Tuckley <colint@debian.org>
> > > (20 rows)
> > > ...
> > > UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
> > >
> >
> > Uploaders is refreshed every few hours from archive data, so a one-time
> > UPDATE would not help. UDD usually tries to preserve inaccuracies, so
> > those might be interesting for QA work.
>
> OK.
>
> > In your case, why don't you use the email address to identify uploaders?
>
> Since this also does not work:
>
> udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader;
> count | uploader
> -------+------------------------------------
> 1 | Andreas Tille <tille@debian.org>
> 1 | Andreas Tille <andreas@an3as.eu>
> 8785 | Andreas Tille <tille@debian.org>
> (3 Zeilen)
>
> > (possibly combining it with the carnivore data to identify different emails
> > belonging to the same person ?)
>
> I could fiddle around with carnivore but that's overkill for thst
> purpose and I insist that not stripping blanks from names does not make
> any sense, IMHO. (1 Zeile)
>
>
> BTW: I found
>
> udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
> count | name
> -------+---------------
> 16524 | Andreas Tille
> (1 Zeile)
>
> So why do I have 8707 uploads per uploaders but 16524 per upload_history?
>
> Is my assumption wrong that both values should match (modulo some wrongly
> spelled names)
If you look at the uploaders table, there are three columns:
- 'uploader', than contains the raw data
- 'name' and 'email' that contain the parsed (and trimmed) data
udd=> select uploader, name, email, count(*) from uploaders where uploader ilike '%tille%' group by 1,2,3;
uploader | name | email | count
------------------------------------+-----------------+------------------+-------
Andreas Tille <tille@debian.org> | Andreas Tille | tille@debian.org | 8785
Andreas Tille <andreas@an3as.eu> | Andreas Tille | andreas@an3as.eu | 1
Andreas Tille <tille@debian.org> | Andreas Tille | tille@debian.org | 1
So, just use name and/or email?
Lucas
Reply to: