
Re: Bug#1014908: ITP: gender-guesser -- Guess the gender from first name



On Thu, Jul 14, 2022 at 08:14:13AM -0700, Russ Allbery wrote:
> Edward Betts <edward@4angle.com> writes:
> 
> > I've been writing some code to work out the gender balance of speakers
> > at a conference. It parses the pentabarf XML of the schedule and feeds
> > the speaker names to this module.
> 
> > Here's the results for Debconf 22.
> 
> > 72 speakers
> 
> > male              48   66.7%
> > unknown           16   22.2%
> > female             4    5.6%
> > mostly_male        2    2.8%
> > andy               1    1.4%
> > mostly_female      1    1.4%
> 
> I fear this may be an example of statistics that look meaningful but
> probably aren't because the error bar is much higher than the typical
> consumer of the statistic intuitively thinks it is.  Although maybe that's
> not a worry in this case since the program itself says that it totally
> failed to make a guess about a quarter of the time.

So instead of making a knowingly-bad guess it says it doesn't know?
That's an upside in my book.

> I don't really have any objections to the package being in the archive;
> this is certainly something that a lot of people seem to want to do and
> thus seem to find some utility in doing.  But unless one has a
> higher-quality source of data than just names (preferred pronouns, direct
> self-identification, etc.)

In practice, people who want to change their visible gender (ie, how others
view them) do pick a name that matches the gender they want to present to
the world.


As for actually using first names for statistics:
Several years ago, I did stats on who does uploads in Debian.
My methodology was:
1. limit packages to "key packages" (RT meaning, ie popcon/d-i/{b-,}deps)
2. take the last changed-by of every package (this avoids maintainers
   who haven't been seen in 20 years, etc)
3. for every unique name, manually:
   a. do I recognize that person?  If so, use gender I know.
   b. is the first name gender-specific? (I know western and slavic names)
   c. ~60 seconds of web search using DDG (I seemed to extend suspected
      females to >15 minutes somehow...)
   d. if none of the above gave an answer, say '?'
4. weight every name by the # of packages from 2. (ie, give count of
   packages)
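
A minimal sketch of the weighting in step 4, not the script I actually used.
The package names, person names and the gender_by_name table are all made-up
placeholders; the table stands in for the manual answers from step 3, and
anything missing from it counts as '?'.

```python
from collections import Counter

# Hypothetical input: (package, last Changed-By name) pairs, assumed to be
# already limited to key packages (steps 1 and 2). All names are invented.
last_changed_by = [
    ("pkg-a", "Alex Example"),
    ("pkg-b", "Alex Example"),
    ("pkg-c", "Sam Sample"),
    ("pkg-d", "Kim Placeholder"),
]

# Hypothetical outcome of the manual step 3. A name absent from this
# table means none of 3a-3c gave an answer.
gender_by_name = {
    "Alex Example": "m",
    "Sam Sample": "f",
}


def weighted_counts(pairs, genders):
    """Count packages per gender label, so each name is weighted by the
    number of packages it was last seen on (step 4)."""
    packages_per_name = Counter(name for _pkg, name in pairs)
    totals = Counter()
    for name, n_packages in packages_per_name.items():
        totals[genders.get(name, "?")] += n_packages
    return totals


totals = weighted_counts(last_changed_by, gender_by_name)
n_packages = sum(totals.values())
for label, count in totals.most_common():
    print(f"{label:<3} {count:>4} {100 * count / n_packages:5.1f}%")
```

Note that the result is a count of packages, not of people.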

Obviously every step introduces inaccuracies; eg. I used first-[mid...]-last
name combinations, merging distinct spellings only when I spotted them by
hand.  I seem to recall there are two DDs with the same name (I don't
remember who though); this methodology would count them as one person.  Of
course that causes no error here if they're of the same gender, but it would
for other uses of the input data.

Thus, my stats are _not perfect_.  But, as long as I divulge my methodology,
it is sound science.

A famous example is one of the first phone surveys, which worked by randomly
selecting phone numbers.  The results turned out to be totally wrong -- with
individually owned phones still being quite a new thing, phone owners tended
to be affluent and tech-friendly people, and their responses were not
representative of the population at large.

Thus, to be valid science, any use of statistics should disclose the
methodology used.  But that doesn't make the results any less valid; it
merely attaches a caveat.  Barring some other error (eg. bogus random number
generator, ignoring people who hang up, etc), that survey still provided
accurate info on the population of phone owners.
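
To put numbers on that distinction, here is a toy calculation.  Every figure
in it is invented for illustration (none comes from a real survey): a
population where phone owners are a minority and answer differently from
everyone else.

```python
# Invented illustration figures, not data from any real survey.
owners = {"size": 200, "yes": 140}       # phone owners
non_owners = {"size": 800, "yes": 240}   # everyone else

# What a perfectly executed phone survey converges to: the rate among
# phone owners only, since nobody else can be sampled.
owner_rate = owners["yes"] / owners["size"]

# What people wrongly read the result as: the rate in the whole population.
population_rate = (owners["yes"] + non_owners["yes"]) / (
    owners["size"] + non_owners["size"]
)

print(f"phone owners:     {owner_rate:.0%}")
print(f"whole population: {population_rate:.0%}")
```

The first figure is a correct answer to "what do phone owners think?"; it
only becomes wrong when presented without the caveat about who was sampled.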


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Ash nazg durbatulûk,
⣾⠁⢠⠒⠀⣿⡁   ash nazg gimbatul,
⢿⡄⠘⠷⠚⠋⠀ ash nazg thrakatulûk
⠈⠳⣄⠀⠀⠀⠀   agh burzum-ishi krimpatul.

