Book Review: Cathy O’Neil’s Weapons of Math Destruction
"Big Data has plenty of evangelists, but I’m not one of them," writes Cathy O’Neil, a blogger (mathbabe.org) and former quantitative analyst at the hedge fund D. E. Shaw who became so disillusioned with her hedge fund modelling that she joined the Occupy movement.
Early in “Weapons of Math Destruction” she describes the case of Sarah Wysocki, a popular fifth-grade teacher at the MacFarland Middle School in Washington DC, who received a bad score on her IMPACT assessment. IMPACT, a teacher assessment tool, was developed with the intention of finding underperforming teachers and firing them.
Wysocki’s poor score was based on a new scoring system known as value-added modelling, which had been developed by Mathematica Policy Research, a consultancy based in Princeton. However, “there are so many factors that go into learning and teaching that it would be difficult to measure them all,” Wysocki says. "What’s more," O'Neil continues, "attempting to score a teacher’s effectiveness by analysing the test results of only twenty-five or thirty students is statistically unsound, even laughable. The numbers are far too small given all the things that could go wrong."
But there is another issue here, which is that statistical systems need feedback to tell them when they are off track.
When Mathematica’s scoring system tags Sarah Wysocki and 205 other teachers as failures, the district fires them. But how does it ever learn if it was right? It doesn’t. The system itself has determined that they were failures, and that is how they are viewed. Two hundred and six “bad” teachers are gone. That fact alone appears to demonstrate how effective the value-added model is. It is cleansing the district of underperforming teachers. Instead of searching for the truth, the score comes to embody it.
O’Neil uses the term a “weapon of math destruction” (WMD) to describe the worst kinds of mathematical models, of which IMPACT is an example. To qualify, a model must have three distinct elements: Opacity, Scale, and Damage. Over the course of the book she goes on to look at a whole variety of systems that impact the lives of large numbers of people as they go to college, borrow money, get sentenced to prison, or find and hold down a job. Examples abound: credit scores being used to evaluate potential hires on the misguided assumption that bad scores correlate with bad job performance; US-based for-profit colleges using data and aggressive retargeting through online advertising to prey on vulnerable people trying to improve their life chances, often plunging them into debt; predictive policing software causing police to focus on minor nuisance crimes in poor neighbourhoods while ignoring more serious crimes in more affluent ones.
In general, O’Neil argues, the models negatively impact the poor and disadvantaged whilst simultaneously making the lives of the well-off easier. Central to her thesis is the idea that people have a tendency to put blind faith in algorithms, imagining that since a model is mathematical, it must somehow be objective and fair. But of course this isn’t the case. For one thing, an algorithm can simply be poorly designed. Moreover, in the case of machine learning systems, the data used to train the model can have inherent biases. For example, if a Silicon Valley start-up has historically hired few or no women as engineers, and then, as it scales, builds a hiring algorithm trained on its historical data about which engineers performed well at the firm, the resulting algorithm will have an in-built bias against hiring women.
In our notional algorithmic hiring system the gender bias was not intentional. But if the measure of success is hiring smart engineers who stay with the firm for more than, say, two years, the firm wouldn’t necessarily be aware of the problem. As with IMPACT, on its own terms the system may well appear to be performing correctly. Moreover, if the model is hard to reason about - as machine learning models often are - then the firm may not even realise that its model has an inherent gender bias. It certainly won’t realise without, at the very least, specifically testing the model using identical resumes for a notional female and a notional male candidate.
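How that bias arises, and how a paired-resume test would expose it, can be sketched in a few lines. This is entirely my own illustration with made-up feature names, not anything described in the book - a deliberately naive scorer that learns per-feature "success rates" from a firm's own hiring history:

```python
# Hypothetical hiring model trained on a firm's own history. The scorer
# simply learns, for each feature value, what fraction of past hires
# with that value "succeeded".
def train(records):
    # records: list of (features_dict, succeeded_bool)
    counts = {}
    for features, succeeded in records:
        for key in features.items():
            pos, tot = counts.get(key, (0, 0))
            counts[key] = (pos + succeeded, tot + 1)
    return counts

def score(model, features):
    # Average success rate across the candidate's feature values;
    # feature values never seen in the history score 0.0.
    rates = []
    for key in features.items():
        pos, tot = model.get(key, (0, 0))
        rates.append(pos / tot if tot else 0.0)
    return sum(rates) / len(rates)

# Invented historical data: almost every past "successful" engineer is a
# man, so a feature correlated with gender (a women's-college degree,
# say) has essentially no positive examples at all.
history = [({"womens_college": False}, True) for _ in range(40)]
history += [({"womens_college": False}, False) for _ in range(10)]
history += [({"womens_college": True}, False)]  # the one such hire left early

model = train(history)

# Two otherwise-identical CVs - the paired-resume test mentioned above:
print(score(model, {"womens_college": False}))  # 0.8
print(score(model, {"womens_college": True}))   # 0.0
```

No one wrote "prefer men" anywhere; the skew in the training data is enough, which is exactly why the bias can go unnoticed without an explicit test.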
The problem of judging the success of algorithms is one that O’Neil returns to repeatedly throughout the book. Later, for example, she looks at employee scheduling software used by companies such as Starbucks, McDonald’s and Walmart. The model used by the software is optimised for efficiency and profitability, with scant regard for the justice or good of the employees. So much so that
Workers at major corporations in America recently came up with a new verb: clopening. That's when an employee works late one night to close the store or café and then returns a few hours later, before dawn, to open it.
These, often short-notice, chaotic schedules are becoming more common, with low-wage workers, and their families, the worst affected.
The software also condemns a large percentage of our children to grow up without routines. They experience their mother bleary eyed at breakfast, or hurrying out the door without dinner, or arguing with her mother about who can take care of them on Sunday morning. This chaotic life affects children deeply. According to a study by the Economic Policy Institute, an advocacy group, “Young children and adolescents of parents working unpredictable schedules or outside standard daytime working hours are more likely to have inferior cognition and behavioral outcomes.” The parents might blame themselves for having a child who acts out or fails in school, but in many cases the real culprit is the poverty that leads workers to take jobs with haphazard schedules - and the scheduling models that squeeze struggling families even harder.
The problem, O’Neil argues, is the model’s choice of objectives: here, efficiency and profitability. Since the model generally increases the profit per employee, these kinds of working practices will usually only change when the company in question is publicly criticised for them.
O’Neil also explores the issue of opaqueness in some detail. One of the most striking and fascinating accounts focusses on the work done by a New York data company called Sense Networks. Ten years ago Sense began analysing anonymised cell phone data showing where people went.
The team fed this mobile data on New York cell phone users to its machine-learning system but provided scant additional guidance. They didn’t instruct the program to isolate suburbanites or millennials or to create different buckets of shoppers. The software would find similarities on its own. Many of them would be daft - people who spend more than 50 percent of their days on streets starting with the letter J, or those who take most of their lunch breaks outside. But if the system explored millions of these data points, patterns would start to emerge. Correlations would emerge, presumably including many that humans would never consider... "We wouldn’t necessarily recognize what these people have in common," said Sense’s cofounder and former CEO, Greg Skibiski. "They don’t fit into the traditional buckets that we'd come up with."
...Sense was sold in 2014 to YP, a mobile advertising company spun off from AT&T. So for the time being, its sorting will be used to target different tribes for ads. But you can imagine how machine-learning systems fed by different streams of behavioral data will soon be placing us not just into one tribe but into hundreds of them, even thousands.
In other words, models trained to solve one particular problem can jump across to other fields, increasing the risks that come with their scale. Generally, O’Neil suggests, data scientists don’t get paid to think about this. Moreover, "in the era of machine intelligence, most of the variables will remain a mystery... [The models] will be highly efficient, seemingly arbitrary, and utterly unaccountable."
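The kind of label-free grouping Skibiski describes can be sketched with an ordinary clustering algorithm. This is my own illustration with invented data - Sense's actual system is not public - but it shows both how groupings emerge without anyone defining them, and why they can be hard to interpret:

```python
import random

random.seed(1)

# Invented "users", each reduced to the (latitude, longitude) where they
# spend most of their time. Two neighbourhoods, but no labels are given.
users = [(random.gauss(40.7, 0.01), random.gauss(-74.0, 0.01)) for _ in range(20)]
users += [(random.gauss(40.8, 0.01), random.gauss(-73.9, 0.01)) for _ in range(20)]

def kmeans(points, k, steps=10):
    # Standard k-means: assign each point to its nearest centre, then
    # move each centre to the mean of its assigned points, and repeat.
    centers = random.sample(points, k)
    for _ in range(steps):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            groups[i].append(p)
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

centers, groups = kmeans(users, 2)
# The algorithm carves the users into "tribes" on its own - nobody told
# it what a tribe is, which is also why the resulting buckets need not
# correspond to any category a human would have named.
```

Real systems cluster on hundreds of behavioural dimensions rather than two coordinates, which is where the "we wouldn’t necessarily recognize what these people have in common" problem comes from.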
The book does a remarkable job of describing, both in general terms and with concrete examples, the kinds of problems that the pervasive use of Big Data has the potential to cause. I felt it was weaker, however, on suggesting solutions to these problems. O’Neil argues convincingly that regulation is likely to be required, starting with regulations for the modellers themselves. But building such a regulatory model would undoubtedly be challenging since, as O’Neil herself highlights, it would have to try to measure hidden costs.
This is already the case for other types of regulation. Though economists may attempt to calculate costs for smog or agricultural runoff, or the extinction of the spotted owl, numbers can never express their value. And the same is often true of fairness and the common good in mathematical models. They’re concepts that reside only in the human mind, and they resist quantification. And since humans are in charge of making the models, they rarely go the extra mile or two to even try.
Auditing is also required, O’Neil suggests:
The first step, before digging into the software code, is to carry out research. We’d begin by treating the WMD as a black box that takes in data and spits out conclusions. This person has a medium risk of committing another crime, this one has a 73 percent chance of voting Republican, this teacher ranks in the lowest decile. By studying these outputs, we could piece together the assumptions behind the model and score them for fairness.
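As a rough illustration of what such black-box research might look like - my own sketch, not a method from the book - an auditor can probe an opaque scoring function with inputs that differ only in one attribute, holding everything else fixed:

```python
# Treat the model as a black box: feed it paired inputs that differ only
# in the attribute under test, and compare the average outputs.
def audit(model, base_inputs, attribute, values):
    """Return the model's average output for each value of `attribute`,
    with every other field held fixed."""
    averages = {}
    for v in values:
        outputs = [model({**inp, attribute: v}) for inp in base_inputs]
        averages[v] = sum(outputs) / len(outputs)
    return averages

# A toy "risk score" model with a hidden dependence on zip code
# (the zip codes here are hypothetical).
def risk_model(applicant):
    score = 0.3
    if applicant["zip"] in {"10451", "10452"}:
        score += 0.4
    return score

profiles = [{"age": a, "zip": "10001"} for a in (25, 35, 45)]
result = audit(risk_model, profiles, "zip", ["10001", "10451"])
# roughly 0.3 vs 0.7 - a gap that flags the model for closer inspection
```

The auditor never needs to see the model's code: a systematic gap between otherwise-identical inputs is itself the evidence of an embedded assumption.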
I longed to see more details here, even proof-of-concept ones, but, perhaps reflecting the relative youth of the field, there wasn’t much detail. That being said, O'Neil does highlight that techniques for auditing algorithms are starting to emerge from the academic community.
At Princeton, for example, researchers have launched the Web Transparency and Accountability Project. They create software robots that masquerade online as people of all stripes - rich, poor, male, female, or suffering from mental health issues. By studying the treatment these robots receive, the academics can detect biases in automated systems from search engines to job placement sites.
The book also focusses exclusively on US case studies; I would have liked to see examples from Europe or elsewhere to give a sense of whether the problem is US-only, or also present in markets that already tend to have more regulation.
These are minor criticisms though. Whilst biases in models and many of the other issues that she covers are also being discussed elsewhere, this is the first book I’m aware of that sets out the potential scale of the problem with such detailed research and with concrete examples. O’Neil writes with great clarity and passion, and this is an urgent and important book.
About the Book Author
Cathy O'Neil is the author of the blog mathbabe.org. She is a former Director of the Lede Program in Data Practices at Columbia University Graduate School of Journalism, Tow Center, and has worked as a Data Science Consultant at Johnson Research Labs. Cathy earned a mathematics Ph.D. from Harvard University and currently lives in NYC.
Typo? Swapped SES!
Overall, the problems cited are not necessarily mathematical, but are classic errors in performance-based management, survey instruments and research. If you measure the wrong thing(s), you will end up with bad conclusions.
Re: Typo? Swapped SES!
I have the sentence as I intended it, but you may have misunderstood the point. If you over-police a poor neighbourhood you’ll make more arrests in that neighbourhood for minor “nuisance” crimes that would otherwise go unreported. If you then feed that data into a predictive policing system of the kind increasingly used by US police forces, you end up with a system that tells the police to go to those neighbourhoods because that’s where the recorded crimes are. More arrests are made for minor offences, and that creates a feedback loop.
Crimes in more affluent areas, which are less likely to be policed and recorded in the same way, don’t end up in the data even if they are more serious. As a consequence you are more likely to end up with a criminal record for a minor offence if you happen to live in a poor neighbourhood rather than a more affluent one.
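That feedback loop is easy to simulate. A toy sketch of my own, not from the book, with two areas that have the same true crime rate but different starting levels of policing:

```python
# Two areas with the same underlying crime rate, but one starts out more
# heavily policed. Patrols are allocated where past arrests were
# recorded, and new arrests follow patrols - so the early skew compounds.
arrests = {"poor_area": 3, "rich_area": 1}

for _ in range(5):
    total = sum(arrests.values())
    # Allocate 10 patrols in proportion to recorded arrests...
    patrols = {area: round(10 * n / total) for area, n in arrests.items()}
    # ...and record new arrests in proportion to patrol presence, not in
    # proportion to the (equal) underlying crime rate.
    for area in arrests:
        arrests[area] += patrols[area]

print(arrests)  # the gap widens every round
```

On the system's own terms nothing is wrong - arrests really are concentrated where it sent the patrols - which is the same self-confirming failure as IMPACT's.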
Cathy and I discuss this in more detail in this week’s podcast: