Re: UDD contains names where spaces are not stripped
On Thu, Dec 07, 2023 at 07:59:38PM +0100, Lucas Nussbaum wrote:
> On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
> > Hi,
> >
> > by chance I realised that the uploaders table contains some names whose
> > leading or trailing spaces are not stripped:
> >
> > udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
> > name_with_spaces | uploader
> > --------------------------+-------------------------------------------
> > " Mehdi Dogguy" | Mehdi Dogguy <mehdi@debian.org>
> > " David Paleino" | David Paleino <dapal@debian.org>
> > " Stéphane Glondu" | Stéphane Glondu <glondu@debian.org>
> > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > "Andreas Tille " | Andreas Tille <tille@debian.org>
> > " LI Daobing" | LI Daobing <lidaobing@debian.org>
> > " David Paleino" | David Paleino <dapal@debian.org>
> > " Stefano Zacchiroli" | Stefano Zacchiroli <zack@debian.org>
> > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yoush@debian.org>
> > "Colin Tuckley " | Colin Tuckley <colint@debian.org>
> > "Colin Tuckley " | Colin Tuckley <colint@debian.org>
> > "Colin Tuckley " | Colin Tuckley <colint@debian.org>
> > (20 rows)
> > ...
> > UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' <' || email || '>' WHERE name like ' %' or name like '% ' ;
> >
>
> Uploaders is refreshed every few hours from archive data, so a one-time
> UPDATE would not help. UDD usually tries to preserve inaccuracies, so
> those might be interesting for QA work.
OK.
> In your case, why don't you use the email address to identify uploaders?
Since this also does not work:
udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader;
count | uploader
-------+------------------------------------
1 | Andreas Tille <tille@debian.org>
1 | Andreas Tille <andreas@an3as.eu>
8785 | Andreas Tille <tille@debian.org>
(3 rows)
> (possibly combining it with the carnivore data to identify different emails
> belonging to the same person ?)
I could fiddle around with carnivore, but that's overkill for that
purpose, and I insist that not stripping blanks from names does not make
any sense, IMHO.
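Since the table is refreshed from archive data, the blanks can also be ignored at query time instead of being fixed in place. The following is only a minimal sketch: the CTE named sample is hypothetical stand-in data so the query runs on any PostgreSQL without access to UDD, and it assumes the name and email columns used in the UPDATE quoted above.

```sql
-- Hypothetical sample rows standing in for uploaders(name, email);
-- replace the CTE with the real table when running against UDD.
WITH sample(name, email) AS (
    VALUES ('Andreas Tille ', 'tille@debian.org'),   -- trailing blank
           ('Andreas Tille',  'tille@debian.org'),
           ('Andreas Tille',  'andreas@an3as.eu')
)
SELECT count(*)   AS n,
       trim(name) AS name,    -- strip leading and trailing blanks
       email
  FROM sample
 WHERE name ILIKE '%tille%'
 GROUP BY trim(name), email
 ORDER BY n DESC;
```

Grouping on trim(name) and email merges the two tille@debian.org variants into one row, while the second email address still shows up separately.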
BTW: I found
udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 row)
So why do I have 8707 uploads according to uploaders but 16524 according
to upload_history? Is my assumption wrong that both values should match
(modulo some misspelled names)?
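One possible explanation, not verified against the UDD schema, is that the two tables do not count the same thing: rows in uploaders might describe package versions currently in the archive, while upload_history might hold one row per upload. A rough way to check is to compare the row count with the number of distinct (source, version) pairs on each side. This is only a sketch: the CTEs u and h are hypothetical sample data, and the column names source and version are assumptions.

```sql
-- Hypothetical sample data; source and version are assumed column names.
WITH u(source, version, name) AS (
    VALUES ('foo', '1.0-1', 'Andreas Tille'),
           ('foo', '1.0-1', 'Andreas Tille ')   -- same version, second row
),
h(source, version, changed_by_name, maintainer_name) AS (
    VALUES ('foo', '0.9-1', 'Andreas Tille', 'Some Team'),
           ('foo', '1.0-1', '',              'Andreas Tille')
)
SELECT 'uploaders' AS tbl,
       count(*) AS n_rows,
       count(DISTINCT source || ' ' || version) AS n_versions
  FROM u
 WHERE trim(name) ILIKE '%tille%'
UNION ALL
SELECT 'upload_history',
       count(*),
       count(DISTINCT source || ' ' || version)
  FROM h
 WHERE CASE WHEN changed_by_name = '' THEN maintainer_name
            ELSE changed_by_name END ILIKE '%tille%';
```

If n_rows and n_versions differ a lot within one table, or n_versions differs a lot between the two tables, the counts were never expected to match.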
Kind regards
Andreas.
--
http://fam-tille.de