How to Prevent Catastrophic Failure in Production ML Systems



Martin Goodson describes the unpredictable nature of artificial intelligence systems and how mastering a handful of engineering principles can mitigate the risk of failure. He talks about the kinds of errors artificial intelligence systems make, how to build systems that protect against common errors, and why evaluation can be much harder than it seems.


Martin Goodson is Chief Scientist/CEO of Evolution AI, specialist in natural language processing (NLP). He is the organizer of the London Machine Learning Meetup, Europe's largest community of machine learning practitioners. He blogs on data science:

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Goodson: First of all, I want to set the scene a little bit. The title of this track is "Machine Learning without a Ph.D." So, I want to talk about a particular class of error that I see in production deployments of machine learning systems, which often occurs, in my experience, when the people who have deployed the system don't necessarily have a formal education in machine learning research. Also, to be fair, it has to be said that it also happens a lot when people do have a formal education in machine learning research.

Just a little bit about me. I'm Martin Goodson, I'm the chief scientist and the CEO of a company called Evolution AI. We offer a platform for building production-quality natural language processing systems in enterprises like banks. These are typically organizations that kind of resist Machine Learning. We really need to convince decision makers that the systems are robust and they're not going to fail. So we've put a lot of effort into making a platform which makes it easy to detect the kind of errors that I'm going to talk about today.

This class of problems is called data leakage. Data leakage doesn't really have a very precise definition. It's kind of a vague term. It covers a lot of different things. But this is the closest I could come to a definition: data leakage is when a machine learning model uses information that it shouldn't have access to in order to learn. There are many different types, and we'll talk about four of them today, which I think are the most important.

Leaking Test Data into Training Data

The first type, this is the simplest type of data leakage, I think. Probably we've all seen something like this. This is leaking test data into the training data, or the other way around. So, this is based on some work that I did for an advertising company. They had a need to classify millions of web pages according to topic. The only information that they had for classifying the web pages was the URL. So it's just some strings, some characters, and some words in them. So the first two examples are science-related topics in the "Daily Mail" and "The Independent". And the last one is about health. So, it's just a simple multi-class classification problem.

They had this classifier that they'd already built. When this company approached me, they'd evaluated it with held-out test data, like all the textbooks say. Everything looked great, really fantastic performance, really good precision and recall. Things looked perfect. The problem is, when they tried to deploy this into production, it just didn't work. It couldn't classify any new web pages with a good degree of accuracy. It was better than random, but it wasn't good. It wasn't good enough for production and it led to lots of customer complaints. Again, this is a couple of URLs from the "Daily Mail". They're both about science and technology. You could probably figure out that this string "sciencetech" occurs in both of them. And the machine learning classifier figured this out as well. It was very dependent on these kind of weird words, not really English words, more like strings, which are part of the URLs for these different publications.

That wasn't really the problem, though. The problem was that the training data and the test data weren't segregated properly. URLs from the "Daily Mail" were in both the training data and the test data. That's really bad because the system was effectively just overfitting to the "Daily Mail" data. But this overfitting wasn't detectable, because URLs from the same publishers were in both the training data and the test data. So I asked the data scientists to segregate the data to make sure that publishers who were in the training data couldn't also be in the test data. When they did that and reran the evaluation, the results looked poor. But actually, this is much more reflective of real-world performance. Effectively, this system was massively overfitted to the training data. It couldn't generalize. When given new URLs from new publishers, it didn't have a clue what was going on and it couldn't give you very good results.
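One way to picture the segregation described above is a split that works at the publisher level rather than the URL level. The following is a minimal sketch, not the code used in the project: the URLs are invented, the host name is assumed to be a reasonable stand-in for the publisher, and `split_by_publisher` is a hypothetical helper.

```python
import random
from urllib.parse import urlparse


def publisher_of(url):
    """Return the host name of a URL, used here as a stand-in for the publisher."""
    return urlparse(url).netloc.lower()


def split_by_publisher(urls, test_fraction=0.25, seed=0):
    """Split URLs so that no publisher appears on both sides of the split."""
    publishers = sorted({publisher_of(u) for u in urls})
    rng = random.Random(seed)
    rng.shuffle(publishers)
    n_test = max(1, round(len(publishers) * test_fraction))
    test_publishers = set(publishers[:n_test])
    train = [u for u in urls if publisher_of(u) not in test_publishers]
    test = [u for u in urls if publisher_of(u) in test_publishers]
    return train, test


# Hypothetical URLs, made up for illustration.
urls = [
    "https://www.dailymail.co.uk/sciencetech/article-1/a.html",
    "https://www.dailymail.co.uk/sciencetech/article-2/b.html",
    "https://www.independent.co.uk/news/science/c.html",
    "https://www.example.org/health/d.html",
    "https://www.example.net/sport/e.html",
]
train_urls, test_urls = split_by_publisher(urls)
shared = {publisher_of(u) for u in train_urls} & {publisher_of(u) for u in test_urls}
print("publishers in both splits:", shared)
```

Splitting on the publisher instead of the individual URL means the evaluation measures how the classifier handles publishers it has never seen, which is closer to what happens in production.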

That's quite a simple example. But sometimes, this whole data leakage thing is really, really difficult to detect. This is an example from image classification. This is an open dataset called CIFAR-100, which has been around for about 10 years. It was discovered this year that this dataset has duplicates which are in both the training data and the test data. In fact, about 10% of the data is duplicated. No one realized this. Thousands of academic groups and industry groups have been using this dataset for training their image classification systems. And it turns out that the data is really, really heavily afflicted with this kind of data leakage. A new version of the dataset had to be created. A common theme throughout this talk is that in many of these cases, it took a long time for people to realize that something was wrong. It's really not easy. These things are really insidious.
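The simplest form of this check, finding items whose bytes are identical across the two splits, can be sketched with a hash lookup. This is only an illustration with made-up byte strings standing in for image files, and it would not catch near-duplicates such as images with altered brightness or contrast.

```python
import hashlib


def fingerprint(item: bytes) -> str:
    """Hash the raw bytes of an item so that identical items get identical keys."""
    return hashlib.sha256(item).hexdigest()


def find_cross_split_duplicates(train_items, test_items):
    """Return indices of test items whose exact bytes also appear in training."""
    train_hashes = {fingerprint(x) for x in train_items}
    return [i for i, x in enumerate(test_items) if fingerprint(x) in train_hashes]


# Made-up byte strings standing in for image files.
train_images = [b"\x00\x01\x02", b"\x10\x11\x12", b"\x20\x21\x22"]
test_images = [b"\x10\x11\x12", b"\x30\x31\x32"]
leaked = find_cross_split_duplicates(train_images, test_images)
print("test indices also present in training:", leaked)
```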

Leaking Data Temporally into Training Data

The next example is data leaking temporally into the training data. What do I mean by that? This is an example from Kaggle. It's a Kaggle competition for predicting prostate cancer from various biochemical markers. So, you give someone a blood test, and you look at some biomarkers, and you feed that into your machine learning classifier and try to predict whether this person has prostate cancer. So, there was a Kaggle competition. You know what Kaggle competitions are like? You basically get a data table with lots of data points in there, and you don't necessarily know what these data points are. In this case, each biomarker field has an abbreviation associated with it. It turned out that there was one field which was really, really predictive of whether someone had prostate cancer. This field is abbreviated PROSSURG. None of the contestants knew what this actually represented. It turns out, this is actually an abbreviation for prostate surgery.

Normally, you get prostate surgery after the diagnosis has already happened. The problem with this is that if you train this classifier and then deploy it into a real-world setting, this PROSSURG variable is always going to be false, because when people come for diagnosis, they haven't had the surgery yet, right? In the end, you'd end up with a system that had a massive rate of false negatives. You'd be sending people home who actually did have prostate cancer. That's quite a severe one. But there are many cases like this in competitions, actually, if you look into the literature.
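One symptom of this kind of temporal leak is a field that varies in the training table but takes a single value for everyone at prediction time. A rough sketch of that check follows. The rows and the names `MARKER_A` and `MARKER_B` are invented for illustration; only PROSSURG comes from the example above.

```python
def constant_in_production(train_rows, live_rows):
    """Return features that vary in training but take a single value in live data."""
    suspicious = []
    for name in train_rows[0]:
        train_values = {row[name] for row in train_rows}
        live_values = {row[name] for row in live_rows}
        if len(train_values) > 1 and len(live_values) == 1:
            suspicious.append(name)
    return suspicious


# Invented rows: in training, PROSSURG is recorded after diagnosis.
train_rows = [
    {"MARKER_A": 4.1, "MARKER_B": 0.9, "PROSSURG": True},
    {"MARKER_A": 1.2, "MARKER_B": 1.4, "PROSSURG": False},
    {"MARKER_A": 6.3, "MARKER_B": 0.7, "PROSSURG": True},
]
# At diagnosis time nobody has had the surgery yet.
live_rows = [
    {"MARKER_A": 3.8, "MARKER_B": 1.1, "PROSSURG": False},
    {"MARKER_A": 1.0, "MARKER_B": 1.3, "PROSSURG": False},
]
flagged = constant_in_production(train_rows, live_rows)
print("fields to investigate:", flagged)
```

A flagged field is not proof of leakage, only a prompt to find out what the field means and when its value becomes known.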

Leaking Predictions into Training Data: Feedback Loops

This third variety is when you leak predictions from your machine learning classifier back into the training set. We call this a feedback loop. This is the marketing material from this company called PredPol, it's a predictive policing software company. You've probably heard of this type of thing already. PredPol uses artificial intelligence to help you prevent crime by predicting when and where crime is most likely to occur, allowing you to optimize patrol resources and measure effectiveness.

Some statisticians took a look at this in the journal "Significance" in 2016, and they realized that one of the most important pieces of input data for PredPol was historical records of arrests. They were basically sending police to wherever arrests had happened in the past. And what these statisticians realized was that the arrest records in places like Oakland in California were very, very concentrated, particularly for drugs, in areas which had a large African-American population. So, this is just a heat map of where the arrests actually took place in Oakland. And what they figured out was that actual drug use is much more spread out. It's really not concentrated in African-American areas at all. But the arrest record didn't reflect that. They reasoned that if you use this data, and you input it into PredPol, PredPol is just going to send more police into these areas. You're going to end up with more arrests. And then that is going to feed back into the data, and then that's going to lead to even more resources being deployed, and you end up with this feedback loop.
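The loop described above can be illustrated with a deliberately crude toy simulation. It is not PredPol's algorithm and not the statisticians' model: the two areas, the numbers, and the rule "patrol wherever the records show the most arrests" are all assumptions chosen to make the mechanism visible.

```python
def simulate_patrols(initial_arrests, true_rate, days):
    """Send the single daily patrol to the area with the most recorded arrests.

    Arrests are only recorded where the patrol goes, so the records feed
    straight back into the next day's decision.
    """
    arrests = dict(initial_arrests)
    for _ in range(days):
        target = max(arrests, key=arrests.get)
        arrests[target] += true_rate[target]
    return arrests


# Both areas have the same underlying rate; the historical records differ by one arrest.
initial = {"area_a": 11, "area_b": 10}
true_rate = {"area_a": 5, "area_b": 5}
final = simulate_patrols(initial, true_rate, days=100)
share_a = final["area_a"] / sum(final.values())
print(final, round(share_a, 3))
```

In this toy setup a one-arrest difference in the starting records is enough for every later patrol to go to the same area, even though the underlying rates are identical.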

This is just theoretical. This is what they theoretically predicted would happen if this kind of feedback loop was set up. But that's academic work. And you're probably thinking, "No one would be stupid enough to actually put this kind of system into operation." This year, a really nice article was published in the "New York Law Review" which showed that Maricopa County in Arizona set up a contract with PredPol in 2016. They fed it all of their data. And the authors were able to show that, in this particular case, the Department of Justice had already investigated Maricopa and had been able to show that their policing practices were racist and that their data was completely racist and completely biased, and would lead to exactly the kind of situation that I've just outlined. That's a really nasty one.

Leaking Labels into Training Input Data

This last type is leaking information about labels into the training input data. To talk about this, I'm going to need to tell you about natural language inference. This is a task in natural language processing. The idea is that you get two sentences, a premise and a hypothesis. And you try to train a system that can tell you whether the premise implies the hypothesis, contradicts the hypothesis, or is neutral with respect to the hypothesis.

I'll just give you an example to make it clearer. In this example, the premise is "A man inspects the uniform of a figure in some East Asian country." The hypothesis is "The man is sleeping." And the correct label is contradiction. You can't inspect stuff if you're sleeping. Unfortunately, some researchers have been looking into some of the most important datasets in this area. There were, in particular, two datasets: the Stanford Natural Language Inference dataset (SNLI) and the Multi-Genre Natural Language Inference dataset (MultiNLI). Both of them come out of Stanford. These are used by everyone in the field. They've been around for about four or five years. And what these guys were able to show is that by training a classifier on the hypothesis alone, they could predict quite well whether the hypothesis was implied, was neutral, or was contradicted by the premise. Notice I said that they just looked at the hypothesis; they didn't look at the premise at all. So, how do you know whether the hypothesis is implied by the premise if you don't even look at the premise? That's the question that they raised.

Their results are in this table here. The first row is just their classification results. On SNLI, for instance, they get 67%. This is accuracy here. And if they just predicted at random, you'd expect them to get something like 34%, roughly a third. So, something strange was going on. Probably by now, you have figured out that this is a data leakage problem. I'm going to explain to you how this happens. It turns out that all of these datasets are created using crowdsourcing. So, effectively, you just give someone on Mechanical Turk a premise: "A woman selling bamboo sticks talking to two men on a loading dock." And you ask the human being to come up with an example of an entailed sentence, a neutral sentence, and a sentence that's contradicted.

It turns out that humans are very, very biased in this task. For instance, if you ask them to make a contradiction, very, very often, they will use words like “not” or “no”. If you ask them for something that's entailed, they'll start to use words like “at least”. If you ask them for a neutral sentence, well, number one, they will give you a much longer sentence than for the other two. And they also use purpose clauses quite a lot. It's just easier for them to come up with a sentence that's neutral if they use purpose-based clauses. There isn't a particular reason for it.
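A hypothesis-only baseline of the kind described above can be sketched in a few lines. This is not the researchers' classifier: it is a toy word-voting model trained on a handful of invented hypotheses that mimic the cues mentioned in the talk ("not", "at least", purpose clauses), and it never sees a premise.

```python
from collections import Counter, defaultdict


def train_hypothesis_only(examples):
    """Count how often each hypothesis word co-occurs with each label."""
    word_label_counts = defaultdict(Counter)
    for hypothesis, label in examples:
        for word in set(hypothesis.lower().split()):
            word_label_counts[word][label] += 1
    return word_label_counts


def predict(word_label_counts, hypothesis, default="entailment"):
    """Let every known word vote for the labels it was seen with."""
    votes = Counter()
    for word in set(hypothesis.lower().split()):
        votes.update(word_label_counts.get(word, {}))
    if not votes:
        return default
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(votes), key=votes.get)


# Invented training hypotheses; the premises are deliberately left out.
train = [
    ("the man is not awake", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the woman is not selling anything", "contradiction"),
    ("there are at least two people", "entailment"),
    ("at least one person is outdoors", "entailment"),
    ("the woman is selling sticks to pay for her rent", "neutral"),
    ("the man is waiting to meet his friend", "neutral"),
]
model = train_hypothesis_only(train)
held_out = [
    "the dog is not running",
    "at least three dogs are playing",
    "the boy is running to catch his bus",
]
predictions = [predict(model, h) for h in held_out]
print(predictions)
```

If a model like this beats chance by a wide margin on a real dataset, the labels are leaking through the wording of the hypotheses, and any headline accuracy on that dataset deserves a second look.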

How Widespread Is This Problem?

This is a really bad problem for this whole field of natural language inference. Again, these datasets have been around for four or five years, lots of groups have been using them, and no one realized that this was going on until now. This kind of problem has led some researchers to start asking, how bad a problem is this? Is it just a few isolated cases like this, where it slipped through the net, or is it something a bit more serious? Bear in mind here, I'm talking about academic research. This is completely transparent, so it's thousands of groups or hundreds of groups all around the world looking at this data. Well, I work in industry, so we don't have this kind of level of transparency. We can only assume that this problem is happening more in industry than it's happening in academia.

This is really nice work by a guy called Bob Sturm at Queen Mary University of London. This is looking at genre classification, so musical genre classification. Given a clip of audio of a piece of music, is it a tango, is it salsa, is it bolero, whatever? I don't know why this is all Latin-American music but it is. So, this guy looked at a really important dataset in this field. He trained a classifier. And he just asked a simple question. He said, "Okay, so this classifier has really good results, so it can classify the genre of lots of pieces of music really, really well. What happens if we change the tempo of these pieces of music very, very slightly?" Because for humans, that doesn't make any difference, you still know a piece of salsa is a piece of salsa even if the tempo has changed.

Just look at the solid lines here, don't look at the dotted lines. The y-axis is a measure of accuracy, it's F-score. And the x-axis here is the amount by which Bob Sturm changed the tempo of the audio before feeding it back to the classifier. These are small percentage changes here. They're just percentage points. A human can't really tell the difference in a tempo change of less than about 4%. So, most of these are imperceptible to a human being. But obviously, you can see, even after a 1% change, that the performance of this machine learning classifier plummeted. The F-score has come down by a lot in many cases, from 0.3 to 0.1 or less than 0.1.

He had a look at this and tried to figure out what's going on. The first thing he did was just plot the data. This is just the dataset, just arranged by index, by file number on the x-axis and then tempo on the y-axis. It's pretty clear what's going on. The data was just really, really constrained by tempo. If you look at quickstep there, you can see the tempo is on this very tight, narrow band. It's even worse if you look at cha-cha-cha. This is a very artificial dataset, it doesn't really reflect real-world music. What happens is, if you try and use this data to train a classifier, you're just building a tempo detector, you're not building a genre detector at all.
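Both observations, that tempo alone separates the classes and that a small tempo change breaks the classifier, can be reproduced on toy numbers. The sketch below is an assumption-laden illustration: the tempos are invented, and a one-dimensional nearest-centroid rule stands in for whatever classifier was actually studied.

```python
from statistics import mean


def fit_centroids(tempos_by_genre):
    """Learn one number per genre: the mean tempo of its training clips."""
    return {genre: mean(tempos) for genre, tempos in tempos_by_genre.items()}


def predict_genre(centroids, tempo):
    """Assign the genre whose mean tempo is closest."""
    return min(centroids, key=lambda genre: abs(centroids[genre] - tempo))


def accuracy(centroids, tempos_by_genre, tempo_scale=1.0):
    """Score the tempo-only classifier, optionally after scaling every tempo."""
    results = [
        predict_genre(centroids, tempo * tempo_scale) == genre
        for genre, tempos in tempos_by_genre.items()
        for tempo in tempos
    ]
    return sum(results) / len(results)


# Invented tempos (beats per minute), tightly banded per genre.
train = {"chacha": [122, 124, 126], "tango": [128, 130, 132], "quickstep": [202, 204, 206]}
test = {"chacha": [123, 125], "tango": [129, 131], "quickstep": [203, 205]}

centroids = fit_centroids(train)
baseline = accuracy(centroids, test)
perturbed = accuracy(centroids, test, tempo_scale=1.04)
print(baseline, round(perturbed, 3))
```

When a single incidental feature scores this well on its own, and a change a listener would barely notice moves clips across class boundaries, the evaluation is measuring the dataset's construction rather than genre recognition.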

This kind of analysis led Bob Sturm to look at the whole field. He looked at everything that was published on genre classification in music for over a decade. He really seriously looked, and he did this kind of analysis and tried to figure out exactly what these things were really classifying. Was it really genre detection? And this is his conclusion: that none of the evaluations in these many works is valid to produce conclusions with respect to recognizing the genre. So, the whole field was tainted with this problem, data leakage being a very important reason for that situation. The conclusion is that it's a severe problem.

A Recent Example That Caused Me Some Problems

I'm going to talk about an example now that I came across. I've chosen this example because it's now going to start to give you some clues as to how you can go about detecting this problem before it happens. This is a dataset that we came across, or my company came across, for training a sentiment detector on Twitter data. It actually comes from one of our competitors. This is what the data looks like. It just has four columns of data, tweet_Id, sentiment, author, and contents of the tweet. Obviously, the label is sentiment, you try to build a sentiment classification system. And the content is the input data that you're interested in, the text.

Actually, this is the Evolution AI platform. This is just a screenshot of one of our QA interfaces. All this is showing you here is, after building a classifier on this data, which columns were most important? Which variables are most important for drawing conclusions? So, content was the most important column, which kind of makes sense because you're looking at the text of the tweet. But weirdly, tweet_Id also had lots of information that could give you clues as to the label of the tweets. Which is odd; tweet_Id should be irrelevant to whether the tweet was positive or negative, or happy or sad, whatever.

That's the first hint that something weird was going on. And this took a lot of sleuthing. Our data science team took a very close look at this dataset. Well, this is one of the analyses that they performed. Each of these lines represents tweets of a different label. If you look at the red line, you're looking at all of the tweets which are labeled with the sentiment love. This is just on the x-axis, so these are just lined up by tweet_Id. So effectively, they're just lined up by time, the time that they were tweeted. And then, on the y-axis, we've just got the proportion.

What they figured out was that this data, I think it was 30,000 tweets, or maybe I want to say like 40,000 tweets, all of these tweets, they didn't just come from a year of data or from a long region of the data. They actually all came from three distinct days. That wasn't mentioned in any of the information about the dataset. The timestamp wasn't even in the data. We had to go to the Twitter API in order to figure out what the timestamps of these tweets were. But they all came from three different days, non-consecutive days. You can kind of see that on the red line, you can see it has three phases, these are three different days. So why was there so much love in the second day? It turns out this was Mother's Day. This dataset was actually really incredibly biased in a really tricky way. It meant that if you train a classifier on this data, the classifier starts to associate Mother's Day, or words like “day”, with love in a very unhealthy way.
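The analysis described above, lining tweets up by ID and looking at the proportion of one label along that axis, can be sketched as follows. The rows are invented and far smaller than the real dataset, and `label_share_by_id_bucket` is a hypothetical helper, not part of any platform mentioned in the talk.

```python
def label_share_by_id_bucket(rows, label, n_buckets):
    """rows: list of (tweet_id, sentiment). Returns the share of `label` per bucket.

    Rows are sorted by ID and cut into contiguous buckets, so each bucket
    covers a stretch of time if IDs increase with time.
    """
    ordered = sorted(rows)
    size = len(ordered) // n_buckets
    shares = []
    for b in range(n_buckets):
        start = b * size
        end = len(ordered) if b == n_buckets - 1 else start + size
        bucket = ordered[start:end]
        shares.append(sum(1 for _, s in bucket if s == label) / len(bucket))
    return shares


# Invented rows: three stretches of IDs, with far more "love" in the middle one.
rows = [
    (101, "love"), (102, "sadness"), (103, "worry"), (104, "neutral"),
    (201, "love"), (202, "love"), (203, "love"), (204, "worry"),
    (301, "sadness"), (302, "love"), (303, "neutral"), (304, "worry"),
]
shares = label_share_by_id_bucket(rows, "love", n_buckets=3)
print(shares)
```

A label share that jumps between neighbouring ID ranges is a hint that the collection process, rather than the text, is carrying information about the label.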

This, again, is just a screenshot of our platform, the QA interface. I'm just looking at an example tweet here. Actually, I just made this example up: "Not a huge fan of Mother's Day." And what the interface is trying to show here is which words were most useful to make the prediction. In this case, the prediction is “love”. The classifier is saying that the label is “love”. And it's really heavily leaning on words like “day” and “mother”, which it shouldn't be doing.

There's a few things going on here. One is the label of this tweet shouldn't be “love”. Not a huge fan of Mother's Day, so it's got nothing to do with love. But also, it's really revealing that it's using the word “day”. And it's showing that it's using the wrong kind of words. It's associating the wrong kind of words with love. That's because this dataset was so incredibly biased. So this dataset is basically useless for building a sentiment classification engine for tweets. It's useless. I guess the moral here is downloading a random dataset off the internet without doing really thorough due diligence is going to end in some problems.

How Can You Be Sure You Got Any of This Right?

I'm going to finish off here. I just want to be a bit more explicit about this. I want to just say a couple of things about what you can do to detect this kind of problem, in reality, because you need to detect this kind of thing happening before you move into production. This is really important, and I hinted at this in the last section. You need to understand the decision-making basis of your model. So in the case of natural language processing, you want to understand which words are being used to make a decision, for instance, in the example that I just gave.

In the case of this PROSSURG example that I spoke about earlier, if you use a really simple classification method like logistic regression, for instance, you have really easily interpretable coefficients that you can use to tell you something weird is going on. In this case, this is just an imaginary coefficient, but the idea is that PROSSURG has a ridiculously high coefficient, much higher than everything else. That's an indication that something's gone wrong, or at least an indication that you should do some due diligence on this field and figure out what this field really means.
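A crude version of this coefficient check can be sketched as below. The coefficient values and the feature names other than PROSSURG are invented, the threshold of ten times the median is an arbitrary assumption, and the comparison only makes sense if the features were put on comparable scales before fitting.

```python
from statistics import median


def outlier_coefficients(coefficients, ratio=10.0):
    """Flag features whose |coefficient| exceeds `ratio` times the median |coefficient|."""
    typical = median(abs(c) for c in coefficients.values())
    return [name for name, c in coefficients.items() if abs(c) > ratio * typical]


# Imaginary coefficients from a fitted logistic regression.
coefficients = {"AGE": 0.4, "MARKER_A": -0.7, "MARKER_B": 0.3, "PROSSURG": 9.5}
suspects = outlier_coefficients(coefficients)
print("fields to investigate:", suspects)
```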

That's logistic regression. But nobody really wants to use logistic regression now; everyone wants to use deep learning. This is a deep learning example. In the case of image classification, this is an example coming out of Facebook. This example is all about visual question answering. So, the task here is you give the system an image and you ask it a question. In the first case, the question is "What is covering the windows?" And then it gives you an answer, which is "blinds". It needs to look both at the question and the image to figure out the answer. You can do eye-tracking experiments in humans, and you can see what kinds of things they look at when they answer these questions. So in the first question, what's covering the windows, the human looked at the windows. That's the first column.

And the next question, what's the man doing, the answer is playing Frisbee. A human looks at the man in the middle and also looks at the Frisbee. That's the second column. But when these researchers looked at a machine learning system which actually gave very good results on this task, they also asked it to tell them what it was attending to, what it was looking at in these images. You can see the hotspots here on the third column, and they're basically random, it's just looking at random stuff. So even though it had really good results, it really wasn't looking at the kinds of things that humans were looking at. This is an indication that some kind of data leakage has happened, that it's looking at something that really has nothing to do with what you're asking. And it's just picking up on some kind of characteristic of the dataset that you don't really want it to.

I've already spoken about explainability in NLP. This is, I guess, a good example. This is a classifier that looks at the titles of news articles and tries to classify them into different topics. This is "Central Bank Chief Suspended in Latvia Corruption Scandal." The label at the bottom here is job changes. Which is correct, I guess. But the key thing is that we're highlighting, again, the words which we think are really important to make this decision. "Chief suspended" were the most important words here. That looks reasonable. I've just given you a subjective judgment. I'm just telling you that it seems reasonable, but that's not very scientific, right? But it is useful to make this kind of subjective judgment. And obviously, you can imagine more objective ways of doing it as well that we can talk about in questions.

But really, the most important thing that you can do is to take advantage, or not really take advantage, but really take on board the central theme of what I'm saying here. Because I've given you loads and loads of examples, but all of these examples had one characteristic in common. They all had really good results in evaluation on test data. All of the results, PredPol, everything, on test data, they worked really, really well. Everything looked great. In production, they failed. The answer is to use real-world data as quickly as possible, as early on as possible, and make sure that your systems are set up so that you can use real-world data as early as possible in order to reveal these kinds of problems, rather than leaving it right to the end, when it's probably too late. So with that, I'm going to finish. Thanks for listening. I'll take questions.

Questions & Answers

Participant 1: Thank you very much. Do you have any principles on how you can make sure that your data that you use for testing and then your data that you use for training are separate and won't conflict with each other?

Goodson: I don't think that there are any principles. I gave you the example of CIFAR-100. There was no principle that anyone could use to figure out that there were duplicates in the data. In fact, I'm speaking quite loosely here, because in many cases, they weren't actual duplicates; they were near-duplicates. For instance, things like the contrast or the brightness had been changed in those images. So, what principle could you use to figure out that they were near-duplicates? Unless you think, "Okay, so I'm going to need to come up with some kind of metric that shows you the distance between images," which is going to find near-duplicates even though they may not have any pixels in common, for instance. They may be shifted by two pixels. There are no pixels in common. And there's no principle that's going to help you to detect that. You actually need to think about it and think, what's a duplicate going to look like, how am I going to detect them, and figure it out for yourself. I don't think there's any other way.
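As one illustration of such a distance metric, the sketch below uses a very simple "average hash": one bit per pixel, set when the pixel is brighter than the image's mean, compared by Hamming distance. The tiny 2x2 images are invented, and this is an assumption about what a first attempt might look like, not the method used to audit CIFAR-100. It tolerates a uniform brightness change but, in line with the caution above, it is defeated by a shift.

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values. One bit per pixel: above the mean or not."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    return tuple(p > avg for p in flat)


def hamming(a, b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


# Invented 2x2 grayscale images.
image = [[10, 200], [30, 220]]
brighter = [[30, 220], [50, 240]]   # same image, every pixel +20
shifted = [[200, 10], [220, 30]]    # same image, columns shifted by one pixel

d_brighter = hamming(average_hash(image), average_hash(brighter))
d_shifted = hamming(average_hash(image), average_hash(shifted))
print(d_brighter, d_shifted)
```

The brightened copy is at distance zero while the shifted copy looks completely different, which is why the choice of metric has to follow from thinking about what a duplicate would look like in the data at hand.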

Participant 1: Just to follow up on that, can you use clustering across the test and the training dataset and see what's in, and then look at examples in those clusters and see ...?

Goodson: In some cases, that's going to work. But in the example that I just gave you, if the images have been shifted by two, what would you cluster on, I guess is the question.

Participant 1: But, well, I was thinking about the URL one, that might jump out that you've seen that there's

Goodson: Clustering is definitely a method that you could use in practice. But I don't think there is such a thing as a principle that's always going to work for you. You really have to think about what are you going to cluster, what are the variables that you're going to cluster on?

Participant 2: Things like looking at the sentiment analysis and text, etc., is very interesting. But how long does it take, how much effort does it take to actually investigate something like that, and to investigate, also, for example, the similarities in the images? I would expect it takes quite a long time.

Goodson: I have to admit that I don't do a lot of work on images. Most of my work is on natural language processing. In my experience, it takes a long time. If you're trying to do this on the command line, and you're trying to figure out which are the words which are most relevant to making this decision, which words are most relevant to making this classification decision? And then go back into the data and figure out where were these words coming from? What were the original training examples that had influenced the classification system? You have to do a lot of forensic work using command line tools. Personally, I find it quite difficult, which is why we created the platform. This is why we created the products of our company. We wanted to make it as easy as possible to browse data and figure out where the errors were and just do this kind of investigative work as quickly and painlessly as possible.

Participant 2: Are we talking about [inaudible 00:32:01]?

Goodson: I mean, who knows? Maybe you'll figure this out. In the case of the tweets, the sentiment stuff, that was a few weeks. There was one person for a few weeks looking into that.

Participant 3: Thank you very much. What percentage of published articles would you think that would have these kinds of leakage problems?

Goodson: Well, the only case that I could find where someone has really, really exhaustively looked at the research literature was the Bob Sturm case for music genre. He found out that it could have been present in all of the published cases. So, he didn't prove that it did take place. He showed that the investigators had not sufficiently ruled out this kind of problem in any of the cases.

Participant 4: You mentioned the dangers of a feedback loop. How would you get any reliable data that you can use to train on after you have something like that in production?

Goodson: What we have done in the past in production systems is that we just make sure that we're very careful with tracking the provenance of data. Because sometimes, this is a very baroque system that we're working with. We're working with enterprises, we might be dealing with data from a mainframe that kind of comes into our servers. We built the classifier. It goes back to them. And then comes back to us like six months later. It's kind of changed a bit, but we're not really sure what's changed in it. So, we just need to make sure that we track the provenance.

We realized, in one particular case that I'm thinking of, that the data they were sending back to us was the data that we'd already sent to them in the past. In the enterprise, it's very often a real challenge to change the schema for a mainframe. So we have to track on our side and make sure that we are not being misled by data that we ourselves have created, just by keeping track of provenance. Where has this data come from? Where has this label come from? Has it come from us? Has it come from another supplier, for instance?
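Keeping track of where each label came from can be as simple as carrying a source tag alongside every record and filtering on it before retraining. The sketch below is hypothetical: the `LabeledRecord` type, the tag values, and the records are all invented to illustrate the idea, not taken from any production system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledRecord:
    text: str
    label: str
    label_source: str  # hypothetical provenance tag, e.g. "human" or "our_model"


def trainable(records, own_sources=frozenset({"our_model"})):
    """Keep only records whose labels did not originate from our own predictions."""
    return [r for r in records if r.label_source not in own_sources]


# Invented records: one label was produced by our own classifier and came back to us.
records = [
    LabeledRecord("invoice for consulting services", "invoice", "human"),
    LabeledRecord("notice of account closure", "complaint", "our_model"),
    LabeledRecord("request to update address", "admin", "other_supplier"),
]
kept = trainable(records)
print([r.text for r in kept])
```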




Recorded at:

May 11, 2019