
Accuracy as a Failure



Vincent Warmerdam shares cautionary tales of the mistakes that can happen when we send data scientists on a wild goose chase for accuracy. It may surprise us, but highly accurate models can be more damaging than inaccurate ones. He shares some of the work his team is doing to make sure that chatbots don't fall into this trap.


Vincent Warmerdam works at Rasa. He has been evangelizing data and open source for the last 6 years.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Warmerdam: I've been doing data science for a couple of years. What I want to do is explain how I got this role at Rasa, and also some stuff I learned over the last couple of years. Because I think, maybe, if you're optimizing for accuracy, you're actually doing it wrong. I want to explain why, because it seems counterintuitive. I think it's something we should talk about. Also, this will be a presentation with lots of references to Pokémon. I've got a blog. I have also been organizing PyData Amsterdam. I'm also the YouTuber for the spaCy project; the video just went live yesterday. I have a couple of open source projects that I started, and some other projects that I start on the side. Recently, I started working at Rasa. I'm this open source community guy. It's what I do. It's what I intend on doing. It's my job.

How I got here is a story that's about eight years old now. Eight years ago, I graduated with a degree in operations research. I taught myself how to code. I was part of that first batch of MOOCs: Sebastian Thrun, Andrew Ng, Peter Norvig. That made me think, I really have to do this programming thing. What happened was, I taught calculus at a university for about half a year. The other half of the year I would travel. I called myself an independent contractor back in the day when no one called themselves a data scientist. Something I found out was that you can actually do most of that work remotely. So I did. Life was actually pretty good as long as there was Wi-Fi. It was also around that time that I became an internet meme. I started sending this photo around. People noticed. Then you start trending on Reddit a bit.

Then I figured I should start this blog, because at some point, you're done with world travel. You're settling down. I wanted to figure out what this data science thing was going to be. Six years ago, data science was super new. I figured whenever there's something non-obvious, I should write about it on the blog. Then I went to this big data conference in London, Strata. This was my first time at a big data conference. It was also, literally, a big conference. I met this guy in a suit, nerdy, and he wouldn't shut up about how he calculated the optimal portfolio of bonds using big data. The guy just wouldn't shut up about it. I thought it would be funny if I wrote a blog post that calculated the optimal portfolio of Pokémon, as a joke.


You research, what is Pokémon? I've never played it. It sounds like something that's funny to check. Turns out it's this video game where you select a couple of animals and they battle some other animals. I figured, let's try to find the best Pokémon. Around that time, there was actually a research paper called, "Classic Nintendo Games are Computationally Hard." I think this was Cambridge. I'm not exactly sure. It was an actual fundamental 10-page proof that Pokémon is NP-hard, just like Zelda, Super Mario Bros., Super Mario World, and Donkey Kong. I knew I was solving a hard problem. The funny thing here was that there were actually these fan websites, run by people who were really enthusiastic about the game. They had all of these formulas that would explain, if you have one Pokémon and it deals damage to another Pokémon, how much damage does it do? It depends on whether they're a fire Pokémon or a water Pokémon. I still don't know what Pokémon is, but I found the formula. Then there was also this other website, the Pokéapi, where you can download any statistics you want about Pokémon. I have no domain knowledge about Pokémon, but I do have data. The game that I'm playing here is going to give me a competitive edge, nonetheless.

The first thing I did was what a BI person would do, because if you were doing stuff with data back in those days, you would do the business intelligence thing, which usually means that you're making a giant visualization. What you're looking at here is a matrix where all the Pokémon are listed, and they're battling it out against one another. This is the result of the simulation. It's a 150 by 150 matrix, so you should see something symmetrical here. There's this red line at the bottom here, that's Magikarp. Magikarp loses against everyone and everyone wins against Magikarp. The same thing with Diglett. Using BI you quickly find out that Diglett and Magikarp are the worst Pokémon to have, which is great, but it's not what you're interested in. You want to find the best combination of Pokémon. This is one of the moments where BI has its limits. If you have a good BI tool, then what you can do is click a button, and the fancy thing happens where all the Pokémon are clustered together. You can see that all the fire Pokémon are weak against the water Pokémon. All the grass Pokémon are weak against the fire Pokémon. This was, I think, seven years ago. This is what people did with data. This was BI.

Financial Math

I figured, let's do a data science thing with this. What I can do is look at a row of all these numbers and say, this is a time series that's like a bond. If one row here is super green most of the time, that's good. That's like a bond. That's the return on investment of a Pokémon. If the variance is super high, that's a bad thing. That's risk. I figured, we can do math here. I can calculate the average and I can calculate the standard deviation, these two formulas. What you can then do is plot all the Pokémon, where the expected return is listed on the y-axis and the variance, the risk, is listed on the x-axis. Then you can say, let's select these couple of Pokémon. I'm using a little bit of math and a little bit of data to figure out which Pokémon might be best. That gives these five Pokémon. I later found out that you're supposed to have six Pokémon. At this point, you're here at a professional conference, and you might wonder, is this guy going to talk about Pokémon the entire time? The answer is yes.
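The mean-and-variance trick can be sketched in a few lines. This is a hypothetical reconstruction, not the original blog post's code: the battle matrix below is random numbers standing in for the real simulation results.

```python
import numpy as np

# Hypothetical 150x150 battle matrix: entry [i, j] is the simulated
# score Pokémon i achieved against Pokémon j (random stand-in data).
rng = np.random.default_rng(42)
outcomes = rng.normal(loc=0.5, scale=0.1, size=(150, 150))

# Treat each row as a return series: the mean is the expected
# return of that Pokémon, the standard deviation is the risk.
expected_return = outcomes.mean(axis=1)
risk = outcomes.std(axis=1)

# Pick the five Pokémon with the best return-to-risk trade-off.
best_five = np.argsort(expected_return / risk)[-5:]
print(best_five)
```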

Lesson Learned

There's also an interesting lesson here, because what happened next? Can you see when the blog post was written? No. You can see when Reddit picked it up. Apparently, there's a very active Reddit community of Pokémon fans. I would not have guessed it. They found my blog post and they were discussing it on Reddit, which is this international forum. I looked at the thread. It turns out that the Pokémon fans did not agree with my modeling methods whatsoever, but they did agree that the output of the model made sense. They did say, yes, the Pokémon he found, those are actually some of the pretty good ones, even though the way he simulated this is completely not how the game is supposed to work. The conclusions are actually sensible. This happened maybe five or six years ago. It was something I thought really mattered, because Pokémon is a relatively complicated, domain-heavy video game, and I was able to make a better decision without having to learn Pokémon whatsoever. I still don't know all the Pokémon, to be honest. My girlfriend plays the game more than I do. The only thing I did here was try to learn from data, using a couple of algorithms and some math tricks. That was the idea. This, to me, was evidence that this whole data science thing was fundamentally different from business intelligence. I'm not going to give you insights, I'm going to help you make better decisions. It's that part that seemed very interesting to me.

Business Intelligence vs. Machine Learning

On one end of the spectrum, we've got the BI stuff. On the other end of the spectrum is the ML stuff. I figured five years ago that the only thing I had to do to compete with all these BI and other consultancies was tell them this story of, "This is what we used to do. This is what we're going to do." That was the transition that made me think, I should invest in this career path, because I do think that there's a lot of good stuff to be done here.

The main takeaway was that we can quickly bootstrap solutions given data, even if we lack domain knowledge. If you're a consultant in IT, you don't have any domain knowledge. You can go to a supermarket and know nothing about supermarkets, but if you can do stuff with data, maybe you can still help them make better decisions. That's super interesting. That means that you have a solid background. This video game example seemed like a nice analogy for something that I might do in business. It's a really silly example, because this is a Japanese movie or something. But it is an example that made a lot of business people go, "I get what this data thing is about. This is different than BI." The future was looking bright. AI was going to replace BI, and AI was going to be the thing. It also meant that the old ways of doing things were going to go away. Data going in, we have some rules, and then we have some labels and decisions coming out, that was going to go away a bit. Instead, what we were going to do is say, let's put some labels and data into a model, and then have rules come out on the other side. That felt like the path we were supposed to take. This was five years ago.

My opinion on this has fundamentally shifted. I actually think it's less of a good idea now. What I hope to do today is explain why it's ok to be a little bit skeptical. I think the future looked so bright five years ago that we got blinded by it. We may be at risk of solving the wrong problem, but we're also at risk of solving it the wrong way. What I hope to explain with a few of these examples is that we should be very cautious. At the end, I will also give a remedy and a glimpse of recent work that I've been doing, hopefully to inspire some conversation afterwards. What I'll tell you is a tale about a textbook example going bad. I'll give an example of how most classification algorithms fail right out of the gate. I'll tell you a story about hormones and pregnancy, and one about elephants. I will discuss how some of the current solutions out there aren't really helping, and what the consequences are for chatbots.

How Do We Do ML?

How do we do ML? Typically what we do is we get our data, we get our labels, and we put that into some machine learning thing, and out come the rules. We can put those rules, either in a Jar or in a Python pickle, on a server somewhere. Then that model is in production. The idea is that new data comes in, and we can make a prediction. Then, hopefully, profit happens. That's the idea. We make a prediction about something. Hopefully, it allows us to make a better decision as well. This will lead to more money and happier customers. This is the flow. You start with data. Then you have machine learning: the data and labels going into the model, and out come the rules. The interesting thing, if you look at the books, and also at lots of conference talks, is that most of the attention goes into either which algorithm we're going to choose and how we're going to go about deciding that, or how we get that algorithm into production. That's where most of the attention is spent.

A machine learning algorithm usually consists of a couple of parts. We have this preprocessing part, and we've got this algorithm part. Usually, both of these parts have many different settings that you can try. Some people like to call them hyperparameters. Scikit-learn is an example of an API that really adheres to this as well. It's quite a good library. This is actually computationally quite expensive, because you've got some settings for the preprocessing part and some settings for the algorithm. If there are 2 settings for one part and 5 settings for the other, then you've got 10 combinations to figure out. Every extra setting you want to try makes this grow multiplicatively. If we're going to try all of these different settings, you have to be aware, it's going to be very costly. We don't mind doing that because we want to have the best algorithm. That's the thing we're trying to focus on. The best algorithm is the thing that's going to give us the most profit. Not only are we going to loop over all of these settings, but we're also going to cross-validate on top of that.

The Train, Test Split

I'm assuming this crowd has heard of the train/test split before. There's a dataset that you learn from and a dataset that you test on. Typically, what you would do is say, let's not have just one train and one test set. Instead, we're going to take our entire dataset and split it up into different segments. The idea is that the first segment is the test set, then the second segment is the test set, then the third, and the fourth. Then you have a couple of these test sets, and for each and every one of them, you also have a train set. You can train a model, get some scores out of the test set, and then average and summarize them. Then, hopefully, for every single setting that you try, you have a number that says how well that setting for that model is doing on your dataset for your task. You get this test score performance that comes out. You do this for all sorts of different settings.
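The folding scheme described here looks roughly like this in scikit-learn. The dataset and model are stand-ins, not the ones from the talk:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# A toy dataset standing in for whatever data you actually have.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Five segments: each takes one turn as the test set while the
# other four form the train set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv)

# One score per fold; the mean summarizes how well this model
# and setting combination does on this task.
print(scores.mean(), scores.std())
```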

There's a whole lot of compute power going on here, so it had better be worth it, what we're doing. If this is something we shouldn't be doing, that would be a bad thing, because a lot of resources are being spent on it. For every single cross-validation, for every single setting, we've got to train a model. Especially if we have a big dataset, this is going to get expensive, quick. A lot of engineering effort actually goes into scaling this on massive machine clusters. It had better be worth it that we're doing all of this, because the compute time is proportional to the number of settings times the number of cross-validations that we're doing.

At the end, what you typically get is you get this grid search result. I'll do a bit of live coding to show you what this looks like. Typically, what you then have is this long list of, I got all these settings. I got all these numbers that I maybe care for. Some people care about accuracy, other people care about precision. Typically, what you do is you report on the mean test statistic. There's all these settings that you check. You can determine yourself which metric is appropriate. Maybe you like precision that says, if I say that you're pregnant, how sure am I that you're actually pregnant? You can also say something about recall. I want to have all the pregnant people. Those are two different metrics. You can pick a metric that fits your business needs very well. In general, this is the way that you do things and this is what people typically do.
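Picking the metric that fits the business need comes down to the `scoring` argument in scikit-learn. A small sketch with a toy dataset (the pregnancy example is only an analogy here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy binary classification data.
X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

# Precision: if I say you're positive, how sure am I that you are?
precision = cross_val_score(model, X, y, cv=5, scoring="precision")
# Recall: of all the truly positive cases, how many do I catch?
recall = cross_val_score(model, X, y, cv=5, scoring="recall")

print(precision.mean(), recall.mean())
```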

Methodology Is Important

The goal here is that we want to prevent our model from overfitting on a dataset. I think that this idea has merit to it because we want our models to generalize. By splitting up into many test sets, we hope to get a robust single number that tells us how well our predictions are. Because we don't want to put something in production that's going to be making wrong decisions. The whole reason why we're doing this grid search and cross-validation, and all that, is because we don't want to have something in production that will make wrong decisions on our behalf. Then, if that's the case, I got to ask the question, is this grid search enough? Is this grid search really going to guarantee that no bad things will happen? It really seems to be the industry focus at this point. I meet a lot of people who basically say this, "I looped over a bunch of settings, and now I solved disease diagnosis by looping over all these settings. I have a number that says my machine learning model is great." I think it's good to then show you what a potential failure scenario might look like.


What I've got here is a notebook called "what could go wrong". Hopefully, there's something ominous about that. What I'm doing here in the beginning is importing a whole bunch of datasets. For those of you that are a little bit less familiar with Python, this is an interactive playground. It's easy to make plots in. If you're a data scientist, this is typically the work environment that you're used to. What I'm doing here is importing the Boston housing dataset. It's a dataset that's super frequently used in books to explain how machine learning works. It contains information about houses, and you have to predict the house price. How many square meters, how big is your garden, how far away from schools are you? What you typically then do is say, there's this thing I want to predict, that's something we like to call y, it's from the math books, and there are these things that we're allowed to use to make the prediction, and that is called X. Then you build a pipeline. In Python, I think scikit-learn is the most elegant way to declare that. I can say, I'm going to do some scaling on the data going in. I'm going to make sure that it's normalized before I give it to the algorithm. So there's a scaling step, and then there's this ridge regression step that I'll be doing.

What you can then do is ask, this pipeline, what are all the settings that I can tweak? There's a tolerance for the model. Do I scale with the mean or scale with the standard deviation? These are all things that we can try. What scikit-learn also offers us is a very convenient API to say, these are the settings I want to change, could you maybe spread that out over multiple cores and run that for me, please? That's what happens here. I have my dataset, my X and my y. I've got my pipeline of steps, so I'm going to normalize first and then do the prediction. I'm saying, let's do five cross-validations: split the dataset into five bits, and each bit takes a turn as the test set. Then there's this grid. We can say, I want you to scale with the mean, true or false. I want you to scale with the standard deviation, true or false. There's this alpha parameter in the model, I want you to try out all of those. Mean squared error is the thing that I mainly care for, so that's what I want you to optimize. That's also something I can pass in.

What's going to happen is it's going to loop over all of these different settings and all of these different cross-validations, and figure out which of the configurations has the best mean squared error. Then it's going to retrain on the entire dataset, and then I have the best model. This is what people typically do. A lot of data scientists heavily focus on this, because this is the methodology that research papers like to adhere to. You really have to cross-validate your data quite well. I have to run the whole thing then, because it just restarted. You call fit, and then this model is fit. There's this property on a fitted grid search called cv_results_. This will give you a nice little table with all the grid search results. I'm sorting and taking the top five rows. One thing that's always interesting is looking at this to ask, which parameters seem to be doing well? The scaling doesn't really matter much. The main thing that seems to work is this parameter alpha. That's something you can read from this grid search. You can zoom in on this and focus on it. Then you can convince yourself that what you're doing is pretty good.
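A minimal reconstruction of the notebook's flow, not the original code: `load_diabetes` stands in for Boston housing, since the latter has been removed from recent scikit-learn releases, and the parameter values are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Normalize first, then predict: the same two-step pipeline.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Settings for the preprocessing step and for the algorithm step.
grid = GridSearchCV(
    pipe,
    param_grid={
        "scale__with_mean": [True, False],
        "scale__with_std": [True, False],
        "model__alpha": [0.01, 0.1, 1.0, 10.0],
    },
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# cv_results_ holds one row per combination of settings.
results = pd.DataFrame(grid.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score"]].head())
```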

If you're a pretty good data scientist, what you then also do is make plots like this. You have your x-axis, you've got your y-axis. Then what you do is check if there's a correlation here. This is the predicted price, and this is the actual price. We see something interesting: the model actually makes this very stupid decision here, it predicts a negative price. Something's still wrong with the model, but that's a reasonably straight line. We can convince ourselves that we're doing the good thing. This is the flow a lot of data scientists have. They say, "We're doing the good thing. It's optimizing. The number we care about is high or low. That's what we like." What most data scientists, and most books on the matter, don't do is check the column names that we're using to make this prediction. Because to me it feels like, if I'm going to make a prediction, and it's my responsibility what data comes out of the model, then maybe the data that goes into the model is also something I've got to be concerned about.

Let's have a peek at which columns are in here and which we're using. I'd like to zoom in on this one. There's this weird formula where one of the parameters going in is the proportion of blacks by town. This dataset is from the '70s. I'm having a really hard time coming up with a use case for this particular dataset where the model is not data-laundering pre-existing bias. Here's the tricky thing: if you're a data scientist and you're focusing on having the best summary statistics, you're taught that way. You're taught to really think about it in a quantitative way. This is one of those things that you're just going to miss. You're missing a big one, because this is a pretty good recipe for a racist algorithm, if you think about it. There's some other stuff in here that I also don't really like, like "lower status of the population". I'm not too sure if I like that. One thing I do like about having these two variables in there is that I can also apply some fairness-type algorithms to compensate for them. If I didn't know about these two columns, there would be no way for me to compensate for them either. To throw everything into a model, call fit, predict, and then hope everything's great, feels extremely naive. It's actually embarrassing how often this dataset is used in books, even though this property is in there. I've been talking about this for a while now, and other people have too. This dataset is now going away from scikit-learn. They're removing it, and they're looking for a new dataset to replace it. The open source committers are actually aware of this and they're looking to change it. This is one of those things where machine learning is great, but maybe we shouldn't listen to this guy.

Classification Algorithms

Fairness is one obvious use case where a lot of machine learning fails. It's a fair thing to look at. There are also other areas where I think it might be harmful. I'll show you another example, which is my favorite example of artificial stupidity. This one concerns classification algorithms. I know of no algorithms that don't have this problem. Typically, you might say, we have some algorithm that outputs a prediction, predict_proba if you use scikit-learn. That prediction is usually between 0 and 1. There are two classes: one class is A, the other one is B. We can say it's class A if the probability is lower than 50%, and it's going to be the other class if it's higher than 50%. It's something that you can do with classification algorithms. You can convert a probability to a class. You typically do it this way.

If you're a little bit more concerned, what you can do is move the threshold a bit. I really care about getting my class B right, so I'm going to be more strict: only when the algorithm outputs a super high percentage am I going to act on it. What you can also do, and unfortunately most people don't, is have this won't-predict option. The idea here is that you say, only if I'm super sure that it's class A, or if I'm super sure that it's class B, am I going to automate the decision. I'm going to raise a won't-predict flag in other cases, which is a reasonable thing to do. You don't want to automate something that you're reasonably unsure about. Having that boundary in the middle to say, "Let's not automate this thing," seems really reasonable. It looks like this in one dimension: the probability of A goes down, B goes up, and the stuff in the middle, don't touch it.
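The won't-predict band can be sketched as a small helper around `predict_proba` output. The band boundaries below are arbitrary examples, not recommended values:

```python
import numpy as np

def classify_with_abstain(proba_b, lower=0.3, upper=0.7):
    """Turn class-B probabilities into labels, but refuse to
    automate anything inside the uncertain middle band."""
    return np.where(proba_b >= upper, "B",
                    np.where(proba_b <= lower, "A", "wont_predict"))

labels = classify_with_abstain(np.array([0.05, 0.45, 0.55, 0.95]))
print(labels)  # the two middle cases raise the won't-predict flag
```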

Example Dataset

Let's see where this idea goes horribly wrong. I've got this example dataset, and it's a fake dataset. It doesn't represent anything besides blue, green, and red dots. Let's train a classification algorithm on it. This is the actual dataset. What I'm now going to do is train a K-nearest neighbor algorithm. Then we're going to see how the decisions are split across the three groups. This is the output of the classification model. You can actually see there are clear boundaries now. There's this moment where you're class A and then you're class B. What I can now do is say, let's remove all the points where there's uncertainty. The stuff that's in the middle, those are areas we're uncertain about. Hopefully, the remaining points are points we can confidently say belong to the red dots, the blue dots, or the green dots. Again, not a bad idea, but something's going to go wrong. This is the starting point. What I'm now going to do is draw all these points in black, and the background is going to represent the color of the decision the classification algorithm is making. Where it's lighter, that's where the uncertainty threshold kicks in. We're not going to make any predictions there. Then we have the green, the red, and the blue parts. This is the decision the algorithm is making.

To show how incredibly stupid this is, let's zoom out a bit. Notice that you can be so far away you don't hear the bell curve ringing anymore, way out there, and the algorithm is still going to say, "Definitely red." Even though it's nowhere near any of the data points that we've seen before. To me, that feels super risky as a basis for a decision. The model is not going to help you here. Logistic regression, random forests, also deep learning, they all get fooled this way. I get why, because an algorithm always has to say it's either this class or that one. Distance-wise, this point over here is much closer to the red points than to the blue ones. I would prefer it if the algorithm were also able to say, I'm super uncertain here. I also hope it's clear that this whole predict_proba thing is not the same as certainty. It simply isn't. It's a proxy for it, an approximation of a probability. It's not the same thing as certainty.
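You can reproduce this failure in a few lines. The clusters below are made up, but the behavior is general: a nearest-neighbor model happily assigns a confident class to a point absurdly far from all training data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Three clusters of points near the origin.
X = np.concatenate([rng.normal(loc=center, scale=0.5, size=(50, 2))
                    for center in ([0, 0], [3, 0], [0, 3])])
y = np.repeat(["red", "blue", "green"], 50)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# A point absurdly far from every training example still gets a
# class, and often a very confident one; there is no "I don't know".
far_away = np.array([[1000.0, 1000.0]])
print(knn.predict(far_away), knn.predict_proba(far_away).max())
```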

The model is not going to save you, and neither is the grid search. The grid search is just going to tell you, on the data that we actually have, this is how well the model is performing, which should be regarded as a proxy. It's not ground truth. It doesn't tell you anything about what's going to happen in production. It is something we're overfitting on. Grid search is not going to help you. I also don't think this consultancy guy is going to help you. I do hope that we recognize the story here: the guy has a PhD and he's wearing a suit. If you're a senior person, you're going to go, "I've got to trust this. There's a chart and everything." I cannot blame Nimrod here, because Nimrod was probably taught as a PhD to really focus on his benchmarks. Because if you're writing an article or a paper, that's the thing you've got to prove your algorithm against.

Thinking Back

Remember, five years ago, when people said machine learning was going to replace business rules? I think those people were wrong. I think that ML is super useful inside of business rules. How cool is this? How about you say, I've got my X, it's going to go in. If we say that X is maybe an outlier, too far away from data points that we've seen before, then we introduce a fallback scenario. We do the same thing for when the model's uncertain about a class. Then, and only then, will we take an action. To me, this feels like really sane system design. Detecting outliers is not necessarily a solved problem, but we do have algorithms for it. It's not like we can't do that as an extra step in our pipeline. Yet we're taught not to do this. We're taught to really focus on the machine learning grid search. I hope that these two examples paint a clear picture of why that might be dangerous.
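One hedged sketch of what such a business rule around a model could look like, using `IsolationForest` as the outlier detector. The thresholds, data, and helper name are illustrative assumptions, not a recommendation:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

detector = IsolationForest(random_state=1).fit(X)   # outlier check
model = LogisticRegression().fit(X, y)              # the actual model

def decide(x, lower=0.2, upper=0.8):
    """Business rule around the model: fall back to a human when the
    input is an outlier or the model is unsure, else automate."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    if detector.predict(x)[0] == -1:        # -1 marks an outlier
        return "fallback: outlier"
    p = model.predict_proba(x)[0, 1]
    if lower < p < upper:
        return "fallback: uncertain"
    return f"automated: class {int(p >= upper)}"

print(decide([100.0, 100.0]))  # far outside the training data
print(decide([1.0, 1.0]))      # typical, confident region
```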

What Is Happening?

Algorithms, I think, are designed for interpolation, not extrapolation. It's really easy to calculate the middle number. The mean is accurate. The stuff that's on the tails of the distribution, that's usually quite hard to estimate. I think that constraining the algorithm by denying certain predictions is a really nice idea to have in production. Having a fallback scenario is probably the first thing you should think about before you start worrying about the algorithm. The algorithm is usually a small cog in your system anyway. It's all the stuff around the algorithm you've got to concern yourself with.

I'm going to tell you a story that I heard at PyData Poland, and it's actually bad. It's about a nurse in a hospital. When you're pregnant, while you're delivering the baby, what they sometimes do is give you a hormone that helps with the delivery. There's been lots of research on the amount of hormone they give you, but I believe it was done in the U.S. in the '70s, on women between 19 and 21 years old, or something like that. Then, if you train an algorithm on it, you're going to get something like this. You might plausibly wonder, if you have a little bit of domain knowledge, whether the way your body reacts to hormones is different if you're older. Maybe we shouldn't pretend that this is a good idea, even though an algorithm says so. In general, especially in the medical field, it's pretty well established by now that most of the tests that have happened, have happened on men. If they've happened on women, it's really dated data. It's an area of research to get that data in the first place. It's super naive to assume that the grid search has saved you here. It's not enough. We need more.

I think "artificial accuracy" is going to become a term this year. The problem here is that there's this ivory tower where there's a person saying, I solved the problem by looping over a bunch of settings, and there is this distance to the actual hospital. The person who is doing the stuff in the notebook is not the same person who's on the floor as a doctor trying to help patients. It's that distance, I think, that's part of the problem as well. I cannot completely blame Nimrod for being in an ivory tower, but I do think it's a good idea if he sometimes goes to the hospital. This year, I started wondering if constraining is enough. It seems like a super good idea, but it doesn't completely fix the ivory tower problem.

Rangers and Elephants

Imagine that you're a ranger and it's your job to protect elephants from poaching. These elephants are precious. I'm an animal fan. Elephants are cool. The thing with elephants is they're not like dogs. They don't like it if you stick a microphone or a sensor on them. You don't disagree with an elephant. You're going to get killed. A large part of protecting the elephant herds is understanding the movement patterns in the national park. The national parks, especially in Africa, can be the size of the Netherlands. These are huge national parks. You can't have a camera or eyes and ears everywhere. What you can do instead is have this mic array. You have all these different microphones across the entire park. Then what you do is you have these microphones try to detect elephants. That's the idea. If there's an elephant sound, detect that, and then tell the rangers about it. The thing is, all these microphones are going to generate so much audio, we cannot really listen to it by hand. Instead, what we do is we go to a PhD researcher, who happens to be an expert on audio analysis. We give him some elephant noises. Then he trains an algorithm. He says, "It's super accurate. We can totally push this to production now." The rangers get excited, they invest a ton of money. Then all these microphones in the field can detect elephants.

Then when they're trying it out, they figure out that whenever they hear an elephant and they go there, the elephant's gone. It's not really working for some reason. The thing that went wrong here is that the PhD person just said, I've got this algorithm, I'll just apply it. What the PhD person didn't do was actually listen to the audio clips, because then he would have heard gunshots. If you had heard the gunshots, you would have gone, "That's probably more interesting to detect." This is one of those moments where we have an actual problem. We translate that to an analytical problem. We optimize the analytical problem, which is great. But the translation error, back to the original problem, is not going to be compensated by the optimization you did on this side.

The Problems

I've been a consultant for a couple years and I've seen this happen a whole bunch. I think a lot of it is this. You have an actual problem and you want the production results to be good. You translate that to a data problem. Then you have these grid search results. Then the grid search is the thing that people optimize for, but the thing that really matters is the production results. It's great if you can squeeze 1% out of your grid search results. That's great. Let's do that. But if there's a 10% translation error happening anyway, then spinning up your giant compute cluster to optimize whatever is happening here feels like a farce. It's probably a better idea to make sure you're not in the ivory tower in the first place and make sure you understand the problem really well before you make this translation step anyway. We do have a tendency in data science, I think, to say, if it doesn't do fit and predict, it's not data science. We tend to see everything as nails because we've got this hammer. Maybe all we need is a for loop.

McKinsey wrote this article about this analytics translator role, that's going to be something that organizations need. I get what they're trying to do there, but I think they've gotten the wording wrong. I get that analytics is a buzzword, but we don't need people to translate the analytics to the problem. We need people to translate the problem to the analytics instead. That's the thing we should be going for. I think that's also a rarer skill set. Because you've got the domain knowledge people, and you've got the data expert people. It's probably a better idea to give a little bit of tech to the domain guy than to assume that the quantitative expert PhD guy is going to understand all of your domain knowledge in the short term. I think we may be overfitting on the qualification "artificial intelligence" here.

Another part of the problem is that people don't think ahead. You make a prediction, that in turn causes a decision, and that in turn changes the world, which is going to lead to more profit. The assumption is: we have a prediction, we can then anticipate things better and make a better decision. That's great. But we should not assume that the world doesn't change and that the predictions are going to remain as accurate. If you change the world, it also means that the training dataset you're using now is no longer relevant, because the world has changed. What you're learning on is a different version of the world than the one you've caused. There's this thing called drift in machine learning. If you're giving a different recommendation, people are going to click on different things, and you want to prevent the self-fulfilling prophecy. Not thinking about this feedback mechanism ahead of time, while you're deploying, is going to lead to a lot of pain. If you deploy machine learning to production and the accuracy keeps going up, be really worried, because you're hitting the self-fulfilling prophecy, I think.


I hope at this point, some of you think, this is actually a little bit of a cause for concern. These are valid points. We should also talk about remedies and solutions, I think. It's ok to point the finger and say stuff is wrong. It's also important to say, how can we fix some of these things? There are lots of solutions out there. I didn't make the full diagram of all the vendors out there. When people talk data science, they typically name some of these things, so there's Keras, and scikit-learn, and PyTorch. Some of the stuff you see here is from the cloud providers, some of the stuff is open source. When people think, let's do machine learning, let's do production, and all these things, then these are the names and tools that you might see. To me, they sound really familiar, Pandas and PyTorch. Could have been a Pikachu, or a Ditto, or a Snorlax, or something like that. We got to catch them all. We got to have all the tools in there.

At some point in my career, I noticed there are really a whole lot of tools and they keep introducing new ones. This is actually becoming a joke. Pokémon Blue came out and there's six new Pokémon. At some point, I made a joke about it, and it became a meme again. You might have seen this picture on LinkedIn. This was at a tech conference, at PyData Berlin. This was my LinkedIn profile as well: I do R, Python, JavaScript, Shiny, Dplyr, Ditto, Purrr, Canvas, Spark, Sawk. I basically interchanged the open source projects with Pokémon names. Then I typically ask recruiters to point out which of these are Pokémon before they attempt to hire me. If you now Google "meme recruiter Pokémon", this thing is trending. I've been contacted by the biggest tech recruitment conference in the United States, asking if I want to be a keynote speaker. Supposedly, all the recruiters now know who I am. Which is the exact opposite of what I was trying to achieve with that.

There's a Market for These Tools

I do think that there's this sense of, we got to collect all the tools, because all these tools solve a different problem. But what do these tools actually do? Because if I think about all the problems that I've just alluded to, do these tools actually solve them? Some of these problems occur because we're automating something and optimizing religiously for one single number. Maybe what I can do is plot all these tools on an axis and wonder where they stand. As I was looking at this, I figured, there's a machine learning production cycle, where on one side of the spectrum, you've got total production. That's cloud provider tools and that sort of thing. In the middle, you get analysis, exploration, data science, algorithm stuff. You got production on that end. Then I started noticing there's a gap at the beginning. Because all of these tools, they solve stuff that you can automate. You want to do a grid search? There's a for loop. Can we distribute that? Yes. Gift wrap around it? It's a product, we can sell it to people. Understanding the problem in your business, that's super hard to automate, unless you have a business that scales very well and is really common. A recommender for Netflix is fundamentally different from a recommender for a grocery store. It's really hard to come up with recommendation software that works for everyone as well. If I think about the original problems that we have, like this translation thing that's going bad, I don't see many tools that really try to address that. Maybe if you want to prevent artificial stupidity, we shouldn't worry too much about the stuff that's over here. Maybe we should worry a bit more about the stuff that's missing over here.


I also try to make some meaningful contributions here and there. If you're interested in anything that has to do with fairness, or prevention mechanisms for scikit-learn, and that sort of thing, come talk to me. I've started this project together with a couple of collaborators called scikit-lego. I'm really interested to hear if you folks need more features. It's a mere remedy, though. We need something there, something on that dataset quality part. That's something that's currently totally missing. It's not just datasets; there's loads of stuff that's missing there. I will say there's this one tool that is actually super cool. It was started by buddies of mine, the spaCy people. There's this labeling software called Prodigy. It really is quite good. The idea behind Prodigy is, let's empower the data scientists to label data themselves. In doing so, they're already a little bit less in the ivory tower. It also makes labeling a shared responsibility. It's no longer the data scientist going, I only do algorithms. Now everyone in the organization has to label. You get really good discussions about what makes a good label. It's not enough, though, because I also think stuff like ethics, how the world will react, governance, and data quality in general, it's got to be in there.

If I were starting a new data science project, sure, there's production stuff, but let's think about stuff we can do there. I think, if you're a data scientist, and you want to take yourself seriously in this day and age, you want to call yourself a senior or whatever, that's the thing you got to worry about. Because the moment that your data quality is off, you can have the best algorithm. You can have an algorithm that totally kicks XGBoost's ass. That's totally cool. But if your labels are wrong and if your data quality is wrong in general, it's all going to come crumbling down in your application layer. All this stuff builds on top of each other. If your analysis is wrong, production is going to fail. If your data quality is off, everything will fall like a house of cards.


That brings me to where I am now. I've been doing consultancy for the last six years. After a while you should really do something else, I think. So I joined this company called Rasa. What I now do is I try to focus more on natural language processing. I think it's a very interesting field. I think language in general is the most awesome invention we as a human race have ever made. It turns out it's really tricky, though. It's also an area where my dad became a skeptic. My dad actually came up to me during Christmas, saying, you're going to continue your crusade against hype at a company that specializes in chatbots? I would like to explain why the answer to that question is yes. There's this company called Hugging Face and they make cool stuff. I want to say that first and foremost. They have this blog post called, how to build a state of the art conversational AI with transfer learning. What they're doing there is genuinely quite impressive.

I'll show you what a typical conversation with a typical deep learning algorithm can be like. I say, "Hi, there." I get the reply, "Hey, how are you doing?" "I'm doing well. I just fed the cats." Then the reply is, "That is great. I just got done dressing up out to go clubbing." "To where?" "To a bar near my apartment." Then I ask, "What kind of music?" "Rap, country, and rap." "Come again?" "I hope to get a few more procedures done." "Like, at the dance floor?" "I hope to be happy with myself." "Why go out then?" "I'm saving money for surgery." A few things I want to point out. This is really hard stuff they're trying to do here. These are proper sentences; that already, on its own, is actually not bad. But this is not a conversation. Especially in a chatbot setting, this is not ready for prime time. You can have the best article about this. You can be number one at the [inaudible 00:43:45] conference, but this is not going to go to production.

What might be the best way of solving this? One way of preventing this stupidity from happening is to constrain it. Maybe the problem here is that the algorithm has a bit too much freedom. The bot can probably talk about whatever it feels like. As an algorithm designer, I didn't really have a button to say, you're only allowed to talk about these certain things. This bot has a lot of freedom. There's a reason why you put a fence near a ravine. You don't want people to go over it.

How do we do that over at Rasa? We do this. There's this notion of a story: when someone wants to be greeted, then we say, "Hello." Then we ask for features. This is what we call an intent. There's only a limited set of those and you have to provide them as a user. Explicitly, at the moment, you can only have a chatbot that does what you tell it to. The only intents that it's able to detect are the ones that you give it. It will not do anything with anything else that is being uttered. What we then do is we say, "For every single intent, we need you to provide us with lots and lots of examples." Then we'll have a classification algorithm say, "This is the intent right now." We then check if that intent matches a predefined story, because certain questions should follow up on each other, and other questions don't. Only when there's a proper match do we say, "This is the conversation. This should be the reply." That allows me to build all sorts of mechanisms that say, "That's not clear. Could you maybe check? I want to verify. Pardon?" All those things. You can say, in this case, I'm definitely constraining the algorithm. I would argue, for production purposes, this is a good thing. What's also a good thing, this is also how you design the chatbot in general. As a programmer, you actually have to put these examples in yourself. This is a markdown file. That's the basis of how we make chatbots. You can customize it with your own algorithms, and that's fine, but you are definitely touching this.
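To make the story idea concrete, here is a rough sketch of what such Markdown training files looked like in Rasa at the time of this talk; the intent names, example utterances, and response names are made up for illustration:

```markdown
<!-- stories.md: a predefined conversation flow -->
## greet then ask for features
* greet
  - utter_greet
* ask_features
  - utter_explain_features

<!-- nlu.md: example utterances for one intent -->
## intent:greet
- hi there
- hello
- good morning
```

The classifier is trained on the intent examples, and replies are only produced when the detected intent fits one of the stories, which is exactly the constraint being described.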

Good Chats Are Hard

There's also another reason why I think chatbots are hard. This was a secret photo taken at the first demo of the PayPal chatbot. The bot says, "Hi, I'm PayPal's virtual agent. To get started, simply ask me a question. I'm still learning. If I can't help you, I'll direct you to additional resources." There's a guy called Brady, and he says, "I got scammed." PayPal says, "Great." What happened here? There can be two things. Probably, one of the things at the core of all this is that the person who designed this chatbot did not think that maybe someone would say something about being scammed. It's well possible that you have lots of intents here, but not all the ones that people are going to be asking for. That's possible. There are two ways of thinking about that. One way is that we should have something of an outlier detection system here, something that detects, that's out of scope. Or, we should think about this angle. What Rasa currently does as an open source product is connect with TensorFlow, and the messengers, and all these different databases. That's stuff that we can handle for you. Something that we've been working on recently is this thing called Rasa X. What is that thing? It also does CI/CD. It also does labeling. It does live labeling, too. This is a screenshot of the user interface. You can have all sorts of conversations, and you can label where they went wrong. If you're a data scientist, you can just log into this thing. You can look at the conversation where it went wrong, relabel it, and retrain the algorithm. In fact, I think this is the best way to go about it, because that's the way that you're going to fix this distance. As a data scientist, you have to maybe listen to these people.


I hope that we're now a little bit more skeptical about these accuracy numbers. If we trust them too much, artificial stupidity is going to happen. I also hope that I've been able to convince you that maybe the focus should be moving away from the algorithm, and more towards the data and the use case. If the predictions going out of our systems are our own responsibility, then so is the data going in. It's part of it. You're unprofessional if you consider this not true. I do think we should start considering fairness a bit more. It's getting concerning. I also think that we should be doing stuff with outliers. I'm also in the research team at Rasa, so we are thinking about this. Outlier detection in chatbot settings is hard. This is definitely an area of research for us. Maybe we should appreciate quality more than quantity. Has an end user ever come up to you and complained about a low mean squared error or a bad ROC curve? Or, have you ever heard anyone say, this website sucks because they're not using Vue? Because usually what they say is, this website doesn't do what I expect. That's the problem. It's not that you're not using the latest tech, it's just that the website is not doing what you want it to. Maybe for building dialog systems, I shouldn't predict what to say. Maybe we should learn how to listen. I think that's an interesting vantage point. I'm building this chatbot for Pokémon. The only thing it does is ask users, what would you like me to be able to do?

Questions and Answers

Moderator: Is there a difference between the chatbot that you talked with and what Rasa does? Is there this language of open domain chatbots and closed?

Warmerdam: I got to be careful here because this is my third week at Rasa. I'm not super aware of everything that's in there. What I will say is, there is this research effort that tries to do end-to-end chatbots. It's just a deep neural network that you're trying to talk to. That's not the current flow that we adhere to at Rasa. What we say is, it's probably a better idea to just constrain it a little bit, and say, these are the things that we're classifying for. You can have something like BERT to detect the intent; that's stuff that we do allow. We do like this notion of having stories because, A, it constrains the algorithm a bit, but it's also nicer for domain people to get a grip on what's happening. It's a nice metaphor for how a conversation should go. You want some structure there. Just throwing that into a black box seems like it's not going to be ready for prime time anytime soon.

Participant 1: You talked a lot about basically crap in, crap out, which makes a lot of sense. I know some places that are actually doing research into data drift to make sure that you know when you need to retrain your model. Do you have any insights on good projects that are going on right now, which we should pay attention to?

Warmerdam: Over at scikit-lego, we are trying to figure this out as well. I've heard that there's this project called scikit-multiflow that might have stuff in it. I've not looked at the project; it's just some dude at a meetup who told me about it. I think SciPy has support for some of this stuff if you have one-dimensional drift: a one-dimensional number, if that's changing too much as a distribution, they have support for some of those metrics. The hard part is, what if you instead have a giant dataset with continuous variables, discrete variables, and times, and all sorts of things like that? I've not seen anything that makes me go, "That's the obvious solution."
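For the one-dimensional case mentioned here, a minimal sketch with SciPy's two-sample Kolmogorov-Smirnov test might look like this; the synthetic data and the 0.01 cutoff are illustrative assumptions, not a recommended policy:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # feature at training time
live_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # same feature in production

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# live distribution no longer matches the training distribution.
stat, p_value = ks_2samp(train_feature, live_feature)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2e}")
if p_value < 0.01:
    print("Drift detected, consider retraining")
```

As the answer notes, this only covers one number at a time; mixed continuous, discrete, and temporal data is where no obvious off-the-shelf solution exists.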

Participant 1: I might know someone you might want to talk to.

Warmerdam: I'm always interested in implementing other people's algorithms. That's always great.

Participant 2: Thank you for pointing out how the UX'ers should be closer to the data scientist. I've been preaching that at my consultancy company.

How would you, being a consultancy company, aggregate a company's data from different sources in a secure way, and moreover convince the tech leads or IT directors to let you do it, given that they're so afraid of digital breaches?

Warmerdam: Being afraid of a digital breach, I think that's normal. That's probably a sign of sanity, especially if you're in a regulated industry like healthcare, or whatnot. Most of my clients have not had this issue. One thing that you might want to think about is that your average data scientist is not the biggest security expert, typically. Typically, the person who's really good at security is not going to be in a data science team. These engineers are typically someplace else. What you can maybe think about is saying, how about we have this master table? You can say the engineers consolidate this one dataset that's encrypted and privatized, and that's the thing the data scientists can play with. The idea there is that you scope down where the breach might happen to this one single dataset. Then you can tell the security people, there's only one table you got to worry about. That might be the nicest way to communicate it. Security is a practice, something you have to take seriously, and it's not really my domain expertise, but this idea of, let's have one table be consolidated by the engineers, is good not just for security practices, but also for data quality practices. I've seen that work relatively well. That might be a thing to push for.




Recorded at: Jul 02, 2020