Bad science journalism: Gay facial recognition

Journalistic accounts of a soon-to-be-published study called “Deep neural networks are more accurate than humans at detecting sexual orientation from facial images” (by Michal Kosinski and Yilun Wang) have gone viral and already prompted outraged reactions from the LGBT groups GLAAD and the Human Rights Campaign. The study trained a deep neural network face recognition program on photos of white homosexual and heterosexual adults obtained from a dating website, and used it to create a “classifier” that rates which photographs were most distinctively those of gay or lesbian people. This classifier’s ability to distinguish gay and lesbian individuals was compared with that of human observers on test samples from the data, and on Facebook profile pictures with a stated sexual orientation.

This is all a vaguely interesting computer science project about self-presentation (all of the images were curated by the people involved and put on profiles stating an “interest in” one sex or the other), machine learning, and perception. Interesting, that is, until it is attached to fears about artificial omniscience and ubiquitous surveillance, and debates about nature and nurture. Then it becomes by turns frightening and polemical.

Before we get there (and I’ll update this post with some comments about the authors’ dubious understanding of the many social layers that separate, say, pre-natal hormones and early adult physical presentation, the fluidity of sexual orientation, and the presumed future capacity of artificial intelligence to make omniscient predictions), we have to ask whether the results of this study justify such grand implications. In other words, we first need to know what exactly the study shows.

Let me begin with four simple asks for journalists reporting on science:

  1. Read the whole scientific paper and explain to readers what actual evidence is being presented!
  2. Also, remember that “discussion” sections of papers lack the scientific validity that is attached to results of the research method involved.
  3. Be literate in math.
  4. Never ever present a numerical result without explaining what that number means.

Unfortunately, major accounts of the paper (such as this one in the Guardian) fail to follow these simple rules. And, as is often the case, the problem starts with the headline:

New AI can guess whether you’re gay or straight from a photograph
An algorithm deduced the sexuality of people on a dating site with up to 91% accuracy, raising tricky ethical questions

Now, does the paper show that the AI can guess your sexuality from a photograph with 91% accuracy? Nope.

As the paper states:

The AUC = .91 does not imply that 91% of gay men in a given population can be identified, or that the classification results are correct 91% of the time.

Here’s the 91% claim. The AI is shown five photos of each of two men from the dating website, one gay and one straight. Based on what it has learned from other photos, it guesses which of the two is more likely to be gay. In 91% of these comparisons it guesses correctly. Accurate headline:

AI can distinguish gay men based on five dating profile pics 91% of the time.

When presented with just one image of each man in a pair, the AI guessed right 81% of the time. Human judges (recruited through Mechanical Turk and untrained on any images) guessed right just 61% of the time. For women, both were right less often: 71% for the AI and 54% for the humans. In this test, 50% is rock bottom, the equivalent of zero gaydar.
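To make that pairwise reading concrete, here is a minimal Python sketch. It is not the paper's code: the scores are made-up random numbers (an assumption purely for illustration), and `pairwise_auc` is a hypothetical helper of my own. It shows that an AUC is simply the share of gay/straight pairs in which the gay member gets the higher score, and that a classifier with no information lands at 50%.

```python
import random

def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs in which the positive case
    gets the higher score. Ties count as half a correct guess."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier scores, NOT data from the study: the "gay" group
# is drawn from a distribution shifted slightly upward.
rng = random.Random(0)
gay_scores = [rng.gauss(1.0, 1.0) for _ in range(500)]
straight_scores = [rng.gauss(0.0, 1.0) for _ in range(500)]

auc = pairwise_auc(gay_scores, straight_scores)

# A classifier that gives everyone the same score cannot rank anyone:
# every pair is a tie, so it scores exactly 0.5 ("zero gaydar").
no_information = pairwise_auc([0.5] * 10, [0.5] * 10)

print(f"pairwise AUC with some signal: {auc:.2f}")
print(f"pairwise AUC with no signal:   {no_information:.2f}")
```

Note what the number does not say: it never asks the classifier to label a single person, only to pick one out of a pair that is guaranteed to contain exactly one gay man.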

But it gets worse. Let’s try to apply the paper to the original question raised by the headline. How well can this AI judge an individual person’s sexuality? That’s the critical ability, from which dystopian surveillance fears arise. For this, the researchers seem to have tuned the data very carefully. Remember too, this is still an operation performed on profile pics, this time from Facebook.

First, the AI classifier still seems to work, though not as well:

The classifier could accurately distinguish between gay Facebook users and heterosexual dating-website users in 74% of cases…

But when presented with the task not of telling a gay profile pic from a straight one, but of evaluating whether a given profile pic is gay, the machine’s performance fell apart:

The performance of the classifier depends on the desired trade-off between precision (e.g., the fraction of gay people among those classified as gay) and recall (e.g., the fraction of gay people in the population correctly identified as gay). Aiming for high precision reduces recall, and vice versa.

Let us illustrate this trade-off… We simulated a sample of 1,000 men by randomly drawing participants, and their respective probabilities of being gay, from the sample used in Study 1a. As the prevalence of same-gender sexual orientation among men in the U.S. is about 6–7%, we drew 70 probabilities from the gay participants, and 930 from the heterosexual participants. We only considered participants for whom at least 5 facial images were available; note that the accuracy of the classifier in their case reached an AUC = .91. Setting the threshold above which a given case should be labeled as being gay depends on a desired trade-off between precision and recall. To maximize precision (while sacrificing recall), one should select a high threshold or select only a few cases with the highest probability of being gay. Among 1% (i.e., 10) of individuals with the highest probability of being gay in our simulated sample, 9 were indeed gay and 1 was heterosexual, leading to the precision of 90% (9/10 = 90%). This means, however, that only 9 out of 70 gay men were identified, leading to a low recall of 13% (9/70 = 13%). To boost recall, one needs to sacrifice some of the precision. Among 30 individuals with the highest probability of being gay, 23 were gay and 7 were heterosexual (precision = 23/30= 77%; recall = 23/70= 33%). Among the top 100 males most likely to be gay, 47 were gay (precision = 47%; recall = 68%).

Tuned to its strictest setting, the machine found 9 of the 70 gay men and threw one straight man in the gay box. Set to a broader setting, the machine found 47 of the 70 gay men, but also labelled 53 straight men as gay.
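The arithmetic behind those figures can be checked in a few lines. This Python sketch uses only the counts given in the quoted passage; the function and variable names are my own, and nothing here reproduces the authors' classifier.

```python
TOTAL_MEN = 1000                         # simulated sample size in the quoted passage
TOTAL_GAY = 70                           # gay men in that sample (7% prevalence)
TOTAL_STRAIGHT = TOTAL_MEN - TOTAL_GAY   # 930 straight men

# (men flagged as gay, flagged men who actually were gay), as quoted
cutoffs = [(10, 9), (30, 23), (100, 47)]

def precision_recall(flagged, true_positives, total_positives=TOTAL_GAY):
    """Precision: share of flagged men who are gay.
    Recall: share of all gay men who were flagged."""
    precision = true_positives / flagged
    recall = true_positives / total_positives
    return precision, recall

rows = []
for flagged, true_positives in cutoffs:
    precision, recall = precision_recall(flagged, true_positives)
    false_positives = flagged - true_positives  # straight men labelled gay
    rows.append((flagged, true_positives, false_positives, precision, recall))
    print(
        f"top {flagged:>3}: {true_positives} gay, {false_positives} straight | "
        f"precision {precision:.0%}, recall {recall:.0%}"
    )
```

One small wrinkle: 47/70 rounds to 67%, while the quoted passage reports a recall of 68% for the top-100 cutoff. Either way, the broadest setting flags more straight men (53) than gay men (47).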

Now we have a big technical problem: the artificial gaydar can only find most of the gay people when it produces a pool of “gay looking” people that is majority straight. So no matter how repressive and homophobic the society, it’s hard to imagine that the “gay looking” 5% of the population will put up with this kind of system.

Of course, if we imagine that gay and straight people really have different faces and we just haven’t found the magic formula yet (and the authors seem to leap to this conclusion, for what it’s worth), then we can imagine a better AI figuring out how to tell the difference. But there are plenty of reasons to doubt that this ever has been or ever will be the case.