Bad science journalism: Gay facial recognition

Journalistic accounts of a soon-to-be-published study called “Deep neural networks are more accurate than humans at detecting sexual orientation from facial images” (by Michal Kosinski and Yilun Wang) have gone viral and have already prompted outraged reactions from the LGBT groups GLAAD and the Human Rights Campaign. The study trained a deep-neural-network face recognition program on photos of white homosexual and heterosexual adults obtained from a dating website, and used it to create a “classifier” that rates which photographs are most distinctively those of gay or lesbian people. The classifier’s ability to distinguish gay and lesbian individuals was compared with that of human observers, both on test samples from the data and on Facebook profile pictures with a stated sexual orientation.

This is all a vaguely interesting computer science project about self-presentation (all of the images were curated by the people involved and put on profiles stating an “interest in” one sex or the other), machine learning, and perception. Interesting, that is, until it is attached to fears about artificial omniscience and ubiquitous surveillance, and to debates about nature and nurture. Then it becomes by turns frightening and polemical.

Before we get there (and I’ll update this post with some comments about the authors’ dubious understanding of the many social layers that separate, say, pre-natal hormones and early-adult physical presentation, the fluidity of sexual orientation, and the presumed future capacity of artificial intelligence to make omniscient predictions), we have to ask whether the results of this study justify such grand implications. In other words, we first need to know what exactly the study shows.

Let me begin with four simple asks for journalists reporting on science:

  1. Read the whole scientific paper and explain to readers what actual evidence is being presented!
  2. Also, remember that the “discussion” sections of papers lack the scientific validity that attaches to results produced by the study’s actual methods.
  3. Be literate in math.
  4. Never ever present a numerical result without explaining what that number means.

Unfortunately, major accounts of the paper (such as this one in the Guardian) fail to follow these simple rules. And, as is often the case, the problem starts with the headline:

New AI can guess whether you’re gay or straight from a photograph
An algorithm deduced the sexuality of people on a dating site with up to 91% accuracy, raising tricky ethical questions

Now, does the paper show that the AI can guess your sexuality from a photograph with 91% accuracy? Nope.

As the paper states:

The AUC = .91 does not imply that 91% of gay men in a given population can be identified, or that the classification results are correct 91% of the time.

Here’s where the 91% figure comes from. The AI is shown five photos of each of two individuals from the dating website. Based on what it has learned from other photos, it offers a guess as to which of the two is more likely to be gay. In 91% of the cases where a gay man and a straight man are being compared, it guesses correctly. Accurate headline:

AI can distinguish gay men based on five dating profile pics 91% of the time.

When presented with just one image of each man in the pair, the AI guessed right 81% of the time. Human judges (recruited through Mechanical Turk and untrained on any of the images) guessed right just 61% of the time. For women, both were right less often: 71% for the AI and 54% for the humans. In this test, 50% is rock bottom, the equivalent of zero gaydar.
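
To make that distinction concrete, here is a minimal sketch (my own illustration with made-up scores, not the authors’ code or data) of what an AUC actually measures: it is the fraction of positive/negative pairs in which the positive example gets the higher score, which is why an AUC of .91 translates into “picks the gay man out of a gay/straight pair 91% of the time” and not “classifies 91% of individuals correctly.”

```python
# Minimal sketch with hypothetical scores (not the study's data): AUC is the
# probability that a randomly chosen positive example outranks a randomly
# chosen negative one -- a statement about pairs, not about individuals.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Made-up classifier scores: positives tend to score higher, with overlap.
pos_scores = rng.normal(loc=1.0, scale=1.0, size=500)
neg_scores = rng.normal(loc=0.0, scale=1.0, size=500)

# Fraction of (positive, negative) pairs ranked correctly.
pairwise = (pos_scores[:, None] > neg_scores[None, :]).mean()

# Standard AUC computed from the same scores.
y_true = np.concatenate([np.ones(500), np.zeros(500)])
y_score = np.concatenate([pos_scores, neg_scores])
auc = roc_auc_score(y_true, y_score)

print(f"pairwise win rate: {pairwise:.3f}")  # roughly 0.76 for these made-up scores
print(f"AUC:               {auc:.3f}")       # matches the pairwise figure
```

An AUC says nothing by itself about how many people get misclassified once you have to put a threshold somewhere; that depends on the base rate, which is where the paper’s numbers get much less impressive.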

But it gets worse. Let’s try to apply the paper to the original question raised by the headline. How well can this AI judge an individual person’s sexuality? That’s the critical ability, the one from which dystopian surveillance fears arise. For this, the researchers seem to have tuned the data very carefully. Remember, too, that this is still an operation performed on profile pics, this time from Facebook.

First, the AI classifier still seems to work, though not as well:

The classifier could accurately distinguish between gay Facebook users and heterosexual dating-website users in 74% of cases…

But when presented with the task not of telling a gay profile pic from a straight one, but of evaluating whether a given profile pic is gay, the machine’s performance fell apart:

The performance of the classifier depends on the desired trade-off between precision (e.g., the fraction of gay people among those classified as gay) and recall (e.g., the fraction of gay people in the population correctly identified as gay). Aiming for high precision reduces recall, and vice versa.

Let us illustrate this trade-off… We simulated a sample of 1,000 men by randomly drawing participants, and their respective probabilities of being gay, from the sample used in Study 1a. As the prevalence of same-gender sexual orientation among men in the U.S. is about 6–7%, we drew 70 probabilities from the gay participants, and 930 from the heterosexual participants. We only considered participants for whom at least 5 facial images were available; note that the accuracy of the classifier in their case reached an AUC = .91. Setting the threshold above which a given case should be labeled as being gay depends on a desired trade-off between precision and recall. To maximize precision (while sacrificing recall), one should select a high threshold or select only a few cases with the highest probability of being gay. Among 1% (i.e., 10) of individuals with the highest probability of being gay in our simulated sample, 9 were indeed gay and 1 was heterosexual, leading to the precision of 90% (9/10 = 90%). This means, however, that only 9 out of 70 gay men were identified, leading to a low recall of 13% (9/70 = 13%). To boost recall, one needs to sacrifice some of the precision. Among 30 individuals with the highest probability of being gay, 23 were gay and 7 were heterosexual (precision = 23/30= 77%; recall = 23/70= 33%). Among the top 100 males most likely to be gay, 47 were gay (precision = 47%; recall = 68%).

Tuned to its highest-precision setting, the machine could find only nine of the seventy gay men, and it threw one straight man into the gay box. Set to a broader setting, the machine found 47 of the 70 gay men, but also labelled 53 straight men as gay.
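
The mechanics behind those numbers are easy to reproduce. The sketch below is not the authors’ code, and it substitutes an assumed score distribution for the probabilities drawn from their Study 1a sample, but it walks through the same procedure: simulate 70 gay and 930 straight men, rank all 1,000 by classifier score, and compute precision and recall when only the top 10, 30, or 100 are flagged. The resulting figures land in the same ballpark as the paper’s, though they are not an exact reproduction.

```python
# Sketch of the paper's precision/recall trade-off. The score distributions
# are assumptions: two normal distributions separated by 1.8 standard
# deviations, which corresponds to an AUC of roughly .90 (close to the
# paper's .91). The authors instead drew probabilities from their Study 1a
# participants.
import numpy as np

rng = np.random.default_rng(1)

n_gay, n_straight = 70, 930                 # ~7% prevalence, as in the paper
gay_scores = rng.normal(1.8, 1.0, n_gay)    # hypothetical classifier scores
straight_scores = rng.normal(0.0, 1.0, n_straight)

scores = np.concatenate([gay_scores, straight_scores])
is_gay = np.concatenate([np.ones(n_gay, bool), np.zeros(n_straight, bool)])

ranked = np.argsort(-scores)                # indices, highest score first
for k in (10, 30, 100):                     # the cut-offs the paper reports
    flagged = ranked[:k]                    # the k "most gay looking" cases
    true_pos = int(is_gay[flagged].sum())
    print(f"top {k:3d}: precision = {true_pos / k:.0%}, "
          f"recall = {true_pos / n_gay:.0%}")
```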

Now we have a big technical problem: the artificial gaydar can only find most of the gay people by producing a pool of “gay looking” people that is majority straight. So no matter how repressive and homophobic the society, it is hard to imagine that this mostly straight pool of people flagged as “gay looking” would put up with such a system.

Of course, if we imagine that gay and straight people really do have different faces and we just haven’t found the magic formula yet (and the authors seem to leap to this conclusion, for what it’s worth), then we can imagine a better AI figuring out how to tell the difference. But there are plenty of reasons to doubt that this ever has been or ever will be the case.

6 thoughts on “Bad science journalism: Gay facial recognition”

  1. Thank you for commenting on my research. Naturally I do not control the press coverage and agree with you that it is often inaccurate and sensational.
    I do think, however, that you underestimate the accuracy of the algorithm. The AUC of .91 is comparable with the accuracy of spectroscopy when detecting breast cancer or state-of-the-art diagnostic tools for Parkinson’s disease. We widely use and trust those diagnostic tools (which, btw, were also tested on small and biased samples of people whose disease had been previously diagnosed).
    Also, in the example that you quote, you say “Set to a broader setting, the machine found 47 of the 70 gay men, but also labelled 53 straight men as gay.” and dismiss the accuracy as negligible.
    But this is a nearly seven-fold improvement in precision over not using a classifier! And yes, 53 people were straight, but 30 would be straight even if the classifier was perfect – we were looking for 70 ‘targets’ among 100 cases.
    I realize that the description lifted from the paper, and aimed at scientists, may not be the clearest. I tried to revise it using clearer language. See: https://goo.gl/spkqSu
    Finally, it is interesting that no one seems to be talking about the main issue here. What if it is true that technologies already widely used by companies and governments can be used to invade people’s privacy?
    Many well-meaning people, including gay rights organizations, dismiss the results as junk science. The study could be wrong. But what if they are wrong in dismissing it? In that case they are putting at risk the very people they are supposed to be protecting.

  2. Diagnostic tests are used by doctors on people who present symptoms or exhibit risk factors. Universal testing of everyone for, say, HIV status, would result in very high numbers of false positives. And that’s using tests with higher specificity and sensitivity than the AI described here could ever achieve. Of course diagnostic tests are cross-checked with clinical work. You can re-test for HIV, and then monitor T-cell counts, for instance.
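
    To put illustrative numbers on that (made-up sensitivity, specificity, and prevalence figures, not the specifications of any real test): even a screening test that is 99% sensitive and 99% specific, applied to everyone in a population where 0.3% are actually positive, yields a flagged pool that is mostly false positives.

    ```python
    # Back-of-the-envelope positive predictive value under universal screening.
    # Sensitivity, specificity, and prevalence here are illustrative assumptions,
    # not the specifications of any real diagnostic test.
    def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
        true_pos = sensitivity * prevalence                # truly positive and flagged
        false_pos = (1 - specificity) * (1 - prevalence)   # truly negative but flagged
        return true_pos / (true_pos + false_pos)

    print(f"PPV screening everyone (0.3% prevalence): {ppv(0.99, 0.99, 0.003):.0%}")  # ~23%
    print(f"PPV testing a higher-risk group (10%):    {ppv(0.99, 0.99, 0.10):.0%}")   # ~92%
    ```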

    My point in using the probabilities that are provided in the paper is to evaluate the Big Brother scenario where a government surveils the public and deploys this tech to apply discriminatory policies to gay men. This is exactly the scenario that Michal Kosinski and his coauthor reportedly raised with the Guardian editors [“the authors argued that the technology already exists, and its capabilities are important to expose so that governments and companies can proactively consider privacy risks and the need for safeguards and regulations.”]. At any level (including the appropriate level of 70 selected out of 1,000), it produces a huge constituency of false positives. If Big Brother plus AI ever got deployed, numerous government officials would themselves be falsely accused and would suppress the program.

    What is more plausible is person-to-person face recognition, i.e., do you look like one of the people on gay dating sites? Systems like that could actually pose problems for closeted LGBT people (and all kinds of other people engaged in clandestine but electronically recorded behavior) and deserve closer scrutiny. But they rely not on a putative physical difference between gays/lesbians and heterosexuals, but on the actually existing problem of Big Data aggregation and advancing biometric technology.
