
Counterfactual Evaluation of Machine Learning Models



Michael Manapat discusses how Stripe evaluates and trains their machine learning models to fight fraud.


Michael Manapat manages Stripe’s Conversion Products group. He was previously a Software Engineer at Google, a postdoctoral fellow in applied mathematics at Harvard, and a graduate student in mathematics at MIT.

About the conference is a AI and Machine Learning conference held in San Francisco for developers, architects & technical managers focused on applied AI/ML.


Online Fraud

So I'm sure many of you know Stripe. It's a company that provides a platform for e-commerce. And one of the things that everyone encounters when conducting commerce online is, unsurprisingly, fraud. So before I get into the details of how we address fraud with machine learning, I want to talk a little bit about the fraud life cycle. What typically happens in fraud is that an organized crime ring installs malware on point-of-sale devices. For example, there was this famous breach at Target about five years ago. They'll exfiltrate the card numbers and then they'll sell them online in so-called dumps. So you can actually go online, if you go to the deep web, and buy credit card numbers that were taken from personal devices, ATMs and so forth.

What's kind of surprising and funny is that these criminals who are selling credit card numbers to smaller-time criminals are quite customer-service oriented. So you can say, "I want 12 credit card numbers from Wells Fargo or Citibank. I want credit card numbers that were issued in the zip codes 94102 to 94105 and so forth." Some of them are in fact so customer-service oriented that they guarantee that if you are unable to commit fraud with the cards you buy, they'll give you your money back.

Let's say five years at Stripe was enough for me. I decided to leave and become a criminal, using all my knowledge. So I'll go onto this website. I'll buy, say, 100 cards at $10 or $20 a pop. They tend to go for between $5 and $20 depending on how good the card is. More premium cards go for more. And then I'll go and purchase something from a retailer. Let's say I'll buy a TV from Best Buy online. The TV will arrive at my house. I'll go sell it on Craigslist or on eBay and then I'll pocket the proceeds. So as a fraudster I've made, say, half the dollar value of the TV. Eventually the real card holder will discover this transaction on his or her credit card statement and will dispute it: "No, I didn't buy the TV from Best Buy." And then when that charge is disputed, Best Buy's credit card processor has to go and say, "Hey, Best Buy, this was a fraudulent transaction. We have to claw the money back." So this is the life cycle. A large, sophisticated fraudster gets the numbers; a smaller-time fraudster buys the numbers, purchases goods, sells them, and pockets the money. Eventually the card holder disputes the charges and then the business is left holding the bag.


So Stripe has hundreds of thousands of businesses working with it and all of them would be exposed to this. So we obviously want, for their own good and for the positivity of their experience, to have some system for addressing online credit card fraud. So about a year and a half ago we launched Radar, our fraud product. Radar has three components. There's a rules engine. You can specify rules to say things like "block this transaction if the amount is greater than $100." There's a manual review facility. So you can say, "I want to review all transactions of intermediate risk and queue them up for review by our risk team." And then most importantly, there's a machine learning system which scores every payment for fraud risk and assigns a risk level and a default action.

Machine Learning for Fraud

So the setup we have here, for our machine learning problem, is there's a target we want to predict and that target is, will this payment be charged back for fraud? And to build a model that predicts this, you want to have a bunch of features or signals that you think are indicative or predictive of this target.

To give some examples, you can say, "Is the country of the IP address equal to the country in which the card was issued?" Or, "Is this IP address the IP of a known anonymizing or proxy service?" Or as another example, "How many transactions were declined on this card in the past day?" We actually use thousands of features, unsurprisingly, but this gives you a sense of the kinds of inputs you want in your model to predict this target of "will this payment be charged back for fraud?" I'm not going to spend too much time talking about feature engineering. That's a whole other topic, but I want to give you some sense of the framing of the problem.

Model Building

So here's instead what I'm going to talk about. Let's say it's December 31st, 2013 and we're going to train our ML model for fraud. We're going to train a binary classifier that predicts true or false for every new payment, using data from all of Stripe's payments between January 1st and September 30th, and then we're going to validate our model on data from October 1st to October 31st. So why don't we use data all the way until December 30th? The reason is that disputes take time to come in. Even if a transaction that's fraudulent happens today, it might take a week or two weeks or two months for the card holder to notice it and dispute it. So you have to give some time or buffer to allow the labels to come in. That's why we're not training or validating up until the present: we allow for a 60-day buffer.

So we'll train our model, validate it, and then based on the validation data, we'll pick some score threshold, some policy for actioning on our model. For example, we'll say why don't we block every transaction if the fraud score is above 50? So we do this and our validation data looks really good. You know, the ROC curve is an encouraging-looking curve for our model. So we decide to put it into production, and we begin blocking transactions with a score above 50.


So a few quick questions that I won't answer now, but you'll see why I'm asking them at this point. First, the validation data. Remember, we validated on October 1st to October 31st, so it's already two months old at the time you put the model into production. So how is the model doing in production? What are the precision and recall, not on the validation data but on the production data? And let's say a business complains to us on January 10th: "We know there's a high false positive rate here because our known good customers are complaining to us that their transactions are getting blocked." What if we say, "Okay, why don't we increase the blocking score to 70?" What's going to happen then? So I won't answer these just now, but keep them in mind.

Next Iterations

So next iteration. Let's say it's a year later. It's December 31st, 2014 and we're going to do exactly the same thing we did a year earlier, just shifted in time by a year. We're going to train a model with data from January 1st to September 30th and then we're going to validate it again on October data. When we did this, the validation results looked much worse. The ROC curve is significantly poorer than it was a year earlier. And nothing changed here: same amount of data but shifted in time by a year, same features, same model, same algorithm and so forth.

So why was there this degradation in performance? A frequent hypothesis is that fraudsters adapted: now they know what you're doing and they've changed their behavior. Changing fraudulent behavior definitely is a thing, but it's not the reason for such a dramatic change in performance. So we put the model into production, and in fact the bad validation results are confirmed by bad production results, and we determine that this model really is worse. So what's going on here?

Fundamental Problem

So it turns out that all of the things I mentioned are facets of the same problem. The things I mentioned are you have a model in production that's actioning or changing the world, by say blocking charges. How do you know what the precision and recall are? You retrain a model some time later using data from the world in which your previous model was making changes. These are all the same thing and what I mean is for evaluation of models in production, for changes to your policy, say from blocking a score of 50 to blocking at a score of 62, and for retraining models as time passes you want the same thing, and that thing is roughly an approximation of the distribution of transactions that you would have in the absence of any blocking at all.

So by putting a model into production, you're changing the world, right? You are blocking transactions that would previously have gone through unabated. But if you want to be able to evaluate your models and retrain new ones and consider policy changes, what you want to obtain is a statistical estimate for what the world would've looked like if you hadn't been actioning transactions with your model. Does that make sense?

First Attempt

So what do we do here? The most basic thing you can do is the following. You can say, for all of the transactions that we were going to allow anyway, allow them through, obviously. And then for all the transactions that you would've blocked, i.e. all the ones with a score above 50, let's let a small fraction of them go through uniformly at random. Say the probability is 0.05, right? So for all the transactions we previously would've blocked we'll say, "There's a 5% chance we'll allow them to proceed," and then we can see the outcome: was the transaction fraud or not? It's relatively straightforward to compute precision here, because what you do is you say, of all the transactions we would've blocked, we're letting through 5%. And of those 5%, what fraction were actually fraudulent? And that's your precision.


So how do you compute recall in this world? Let's actually look at some fake numbers. Let's say you have a million transactions that are processed by a production system, and 900,000 have a score below 50 and 100,000 have a score above 50. For all of the ones with a score below 50, you can observe the outcome because you're allowing the transactions to proceed, so you have information on all of them, or conversely you lack information on none of them. Of the 100,000 with a score above 50, remember you're letting 5% through, but on the other 95% you have no idea what would have happened to them, so we have no outcome on 95,000 of the 100,000 with a score above 50.

So again, let's pretend that of the 900,000 with a score below 50, 890,000 were legitimate. They were not fraudulent. And of the 5,000 that we let through with a score above 50, let's say 1,000 were legitimate. And then lastly, 10,000 of the ones with a score below 50 were fraudulent, and 4,000 of the 5,000 you let through with a score above 50 were fraudulent.

It's a lot of numbers. So how do we compute the recall here? What percentage of actual fraud is our model catching now that we have this 5% holdback on the high-risk or high-scoring transactions? So how many cases of fraud did our model actually catch? Well, we let through 5%. We let through 5,000 transactions. Of those 5,000, 4,000 were fraudulent. We were sampling at a rate of 1 in 20, so you can statistically expect that there are actually 80,000 cases of fraud among the transactions that scored above 50. So the total fraud is your estimate for all the transactions that score above 50 that are actually fraudulent, plus the 10,000 that you know were fraudulent that you let through because they had a score below 50. So the number of total cases of fraud (again, this is an estimate) is 90,000. So you've caught 80,000 of 90,000, and your recall here is 0.89, which is pretty good. So this is how you do the simple recall calculation if you have this policy of letting through, uniformly at random, 5% of the transactions that you would've ordinarily blocked.
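The arithmetic above can be written out as a short Python sketch. The function name and argument names are made up for illustration; the numbers are the fake ones from the talk, and the only technique used is scaling the sampled counts by one over the sampling probability.

```python
def uniform_holdback_metrics(fraud_below, fraud_let_through,
                             legit_let_through, allow_prob):
    """Estimate precision and recall of a blocking policy under a uniform holdback.

    fraud_below       -- observed fraud among transactions below the threshold
    fraud_let_through -- observed fraud among above-threshold transactions let through
    legit_let_through -- observed legitimate above-threshold transactions let through
    allow_prob        -- probability an above-threshold transaction is let through
    """
    scale = 1.0 / allow_prob  # each sampled transaction stands in for this many
    est_fraud_above = fraud_let_through * scale
    est_legit_above = legit_let_through * scale
    precision = est_fraud_above / (est_fraud_above + est_legit_above)
    recall = est_fraud_above / (est_fraud_above + fraud_below)
    return precision, recall


# The fake numbers from the talk: 5% holdback, 4,000 fraud and 1,000 legitimate
# among the 5,000 let through, and 10,000 fraud below the threshold.
precision, recall = uniform_holdback_metrics(10_000, 4_000, 1_000, 0.05)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Both numbers are estimates: the 80,000 fraud cases above the threshold are inferred from the 4,000 observed, not counted directly.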


So we can now ask: what do we do in training? In training you will only train on transactions that you allowed to go through. So every transaction that was blocked by our model doesn't exist as far as the next iteration of training goes. You are just ignoring it. But you're now not going to train with uniform weight. You're going to say, "For every transaction that would've been blocked but was allowed through, whether the final outcome was fraud or not, we're going to give it a weight of 20," because every such transaction is in some sense a representative of 20 total transactions that scored above 50, of which only one was allowed to go through. So if you have simple scikit code, what you would do is pass in all of your data samples, but now with a sample weight vector in which every transaction that is one of these reversals has a weight of 20. So it will be one, one, one, weights of one, for every transaction that had a score below 50, and then a weight of 20 for every transaction that was allowed through with a score above 50. Similarly, you do the exact same thing when validating. On your holdout set, you would have weights of 20 for transactions that were passed through and weights of one for everything else.
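As a rough illustration of that weight vector, here is a minimal Python sketch. The helper name `holdback_sample_weights` is hypothetical, and the snippet assumes the input contains only transactions that were actually allowed through. In scikit-learn, a vector like this is what you would pass as the `sample_weight` argument of an estimator's `fit` method; that call is shown only in a comment so the sketch stays dependency-free.

```python
def holdback_sample_weights(scores, threshold=50, allow_prob=0.05):
    """Build per-transaction training weights for a uniform holdback.

    `scores` holds the original model scores of transactions that were
    allowed through (blocked transactions are simply absent from training).
    Transactions above the threshold were only let through with probability
    `allow_prob`, so each one is weighted by 1 / allow_prob.
    """
    return [1.0 / allow_prob if s > threshold else 1.0 for s in scores]


weights = holdback_sample_weights([10, 45, 65, 80])
print(weights)

# With scikit-learn this would be used roughly as:
#   clf.fit(X, y, sample_weight=weights)
# and the same weights would be applied when computing validation metrics.
```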

Policy Curve

So the way you can depict this policy, pictorially or graphically, is by showing a mapping from the original score of the transaction which is on the x-axis to the probability we allow it through as part of this counterfactual evaluation policy. So if the transaction has a score below 50, we're always going to allow it to go through. So the probability that we allow it is one. And then if the score is above 50, we are allowing it with a uniform probability of 0.05. So this works, but it's not ideal. And what do I mean by that?

First, if your classifier is giving you a score of 100 for a transaction, you're quite certain, hopefully, if your model is good, that that transaction is fraudulent. On the other hand, if the score is, say, 49 or 50 or 51, you're less certain; you're at a decision boundary. So even if you would've blocked at a score of 50 or 51 before, the fact that something has a score of 51 doesn't mean it's definitely fraud, right? You're really uncertain. With our policy we were reversing transactions for evaluation at the same rate regardless of the score, but in some sense, you have more confidence the further you are to the right here. So if you have some review budget, if you will, you want to spend it more where you're less certain about the outcome, namely, at the decision boundary. And you want to use this process very sparingly for high-scoring transactions.

Better Approach

So what does that look like? It looks something like this. The red curve is a bit exaggerated. Actually, it's much flatter than that, but it gives you a sense that instead of doing a uniform policy, what you'll say is: let a lot through at 51, fewer through at 60, even fewer at 80, and almost never let them through at a score of 100, maybe one out of every 1,000 or 10,000 times.

So a couple of things here. This red curve and this blue curve are what we call a propensity function. It's mapping the original classifier score to the probability that we're going to allow the transaction to proceed for evaluation purposes. The higher the original score, the lower the probability we allow the transaction because, again, we want to get information precisely where we're most uncertain, and we're letting through less of the obvious fraud, i.e., the stuff with a high score. A way to think of this: maybe someone on your risk team says, "We're going to allow you to spend $10,000 a month incurring fraud intentionally to evaluate your model." So you have this budget for evaluation. You don't want to spend your budget uniformly. You want to spend it mostly around transactions with a score of 50 and you want to spend as little of your budget as possible around transactions with a score of 100, and having this decreasing propensity curve allows you to do that.

So what does this look like, really briefly? You have your original score, and the original decision to block or not is this original block boolean: was the score above the policy threshold of 50? Then, the action you decide to take depends on whether a randomly sampled number is less than the propensity score, i.e., the probability you allow the transaction. And then you behave according to the selected action, not the original one, and you log information. You log the ID of the transaction, the original score, the propensity score, whether you would've blocked it originally, and whether you are choosing to block it with this new policy.
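The slide this passage describes is not reproduced in the transcript, so the following Python sketch is only a reconstruction of the idea, not the code that was shown. The shape of `propensity` (an exponential decay above the threshold) and every name here are assumptions made for illustration; the talk only says the curve should decrease as the score increases.

```python
import math
import random

THRESHOLD = 50  # block threshold from the talk's running example


def propensity(score):
    """Hypothetical propensity curve: probability of allowing a transaction.

    Always 1.0 at or below the threshold; above it, decays with the score.
    The exact decay rate and floor are invented for this sketch.
    """
    if score <= THRESHOLD:
        return 1.0
    return max(1e-4, 0.3 * math.exp(-(score - THRESHOLD) / 10.0))


def decide(txn_id, score, rng=random):
    """Pick the action to take and return the record that should be logged."""
    p_allow = propensity(score)
    original_block = score > THRESHOLD        # what the plain policy would do
    selected_block = not (rng.random() < p_allow)  # what we actually do
    return {
        "id": txn_id,
        "score": score,
        "p_allow": p_allow,
        "original_block": original_block,
        "selected_block": selected_block,
    }


record = decide("txn_1", 65)
print(record)  # act on record["selected_block"], and persist the whole record
```

Logging `p_allow` at decision time matters: the later analysis needs the probability that was in force when the decision was made, even if the curve changes afterwards.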

So what you end up getting is a table something like this. For every transaction in production, you have the score, the probability that we allowed it, the original action, and the selected action, which is what you actually did. And what you can see here is that we get to observe the outcome (was the transaction fraudulent or not?) for all of the transactions for which the selected action is to allow. So just to go over this in some detail: if the score is 10 you're definitely going to allow it, the original and selected actions are both to allow, and we observe the outcome; it was not fraudulent. Same thing with the score of 45: definitely going to allow it, did allow it, observed it, and it was fraudulent. On the other hand, let's say we have a score of 55 and that maps to a probability of allowing the transaction of 0.3. So we picked our random number, compared it to 0.3, it was higher than 0.3, so we continued to block the transaction and therefore had no observed outcome. But in row 4 the score was 65. The propensity score was 0.2; again, we sample at random, it was less than 0.2, so even though we would have blocked it originally, we're now going to allow it. And because we've allowed it, we observe the outcome. It was fraudulent. So you're assembling a row of this table every time you have a transaction come in in production.


So how do we use this to analyze our model? Just as before, we're only going to consider transactions that were allowed. This includes everything with a score below 50 and it includes all the transactions for which the selected action was to allow, based on this propensity score sampling methodology. Now, on the other hand, we're going to weight each transaction that we allowed by one over the probability that we allowed it. Remember, this probability was 0.05 in the original example. Now, it's varying on a per-transaction basis. The rough idea here is that you have what you'd call a geometric series of probability. So if I were to say something like, "I flipped a fair coin an unknown number of times, but I'll tell you that there were 10 heads. How many times do you think I flipped the coin?" Your best guess is 20. That's dividing by the probability, which is what I'm describing here. So if I say, "Here's a transaction. We are allowing it through with a probability of 0.01. And it was fraudulent. How many similar cases of fraud were there that were not allowed through?" 99, because this one transaction represents 100 from the series.

Again, this is just the analogue of weighting by 20 in the uniform case. So what I have done here is I've taken our table from before and I've dropped all the rows for which we did not allow the transaction. So this only includes the rows where we got to observe the outcome because we allowed the payments to go through, and I have added a column here for weight, where the weight is just one over the probability that we allowed the charge. So let's say this is our data from production, and my first question is, how is our original policy of blocking if the score is above 50 doing? By that I mean, what are the precision and recall? So how do we compute that? We're blocking with a score above 50; how much total weight has a score above 50? It's five plus four, so it's nine. And then of that total weight with a score above 50, how much is actually fraud? Well, only charge four is fraud, so five of those nine units of weight are fraudulent. So your precision is 5 over 9, or 0.56.

So now the same question for recall. Recall is the fraction of all fraud you're catching. We have two cases of fraud, IDs two and four. They have a total weight of six. Of that total weight of six, what fraction had a score above 50? Just row four. So your recall is going to be 5 out of 6, or 0.83. So we're going to do a few more examples. What if we have a policy of blocking if the score is above 40? What's the precision? The total weight above 40 is 10. Of that weight above 40, 6 units are fraudulent, so your precision is 6 out of 10. And similarly you have a total weight of 6 that's fraudulent; what fraction of that is above a score of 40? All of it is, so your recall is one. And not to belabor the point, but just one more example. This is a harder one where we're raising the score for blocking, right? We're saying, "We were blocking at 50. Now, we're going to block at 62. We weren't previously observing any of the outcomes at a score above 50, so how is this policy doing?" Well, there are exactly five units of weight blocked at a score above 62, all of which was fraud, so the precision is one. And, again, there are six units of total fraud, of which five have a score above 62, so the recall is again 0.83.

So you're letting through transactions, each with its own weight. And then you are doing all of your calculations exactly as you would do them before but now you're dealing with weights for each transaction and not just counts, and this allows you to compute production metrics. […]
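The three worked examples can be reproduced with a small Python sketch. The table from the slide is not in the transcript, so the rows below are reconstructed from the numbers quoted in the talk. The row with weight 4 is only described by its weight, its non-fraud outcome and a score between 50 and 62, so its ID, score of 55 and allow probability of 0.25 are assumptions.

```python
# Observed rows only: (id, score, probability_allowed, was_fraud)
ROWS = [
    (1, 10, 1.0, False),
    (2, 45, 1.0, True),
    (4, 65, 0.2, True),     # weight 1 / 0.2 = 5
    (5, 55, 0.25, False),   # assumed row: weight 1 / 0.25 = 4
]


def weighted_precision_recall(rows, threshold):
    """Precision and recall of 'block if score > threshold', using 1/p weights."""
    flagged = fraud_flagged = fraud_total = 0.0
    for _id, score, p_allow, was_fraud in rows:
        weight = 1.0 / p_allow
        if score > threshold:
            flagged += weight
            if was_fraud:
                fraud_flagged += weight
        if was_fraud:
            fraud_total += weight
    precision = fraud_flagged / flagged if flagged else float("nan")
    recall = fraud_flagged / fraud_total if fraud_total else float("nan")
    return precision, recall


for threshold in (50, 40, 62):
    p, r = weighted_precision_recall(ROWS, threshold)
    print(f"block above {threshold}: precision={p:.2f} recall={r:.2f}")
```

Because the function only needs the logged rows, any candidate threshold can be evaluated after the fact without changing what runs in production.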

Analysis of Production Data

So the key thing to remember here is that these are all statistical estimates, of course, right? We're applying statistics to get an estimate of what our precision and recall are. The variance of these estimates, or the confidence interval around them, is going to decrease the more you allow through. So if I say, "We'll allow through 0.05 of transactions at a score of 50 and 0.0001 at a score of 100," that will get you some estimate and some error bars. But if I say, "Let's let through half of all transactions with a score of 50 and a tenth of all those with a score of 100," you're letting a lot go through. So your error bars will be much narrower, at the expense of incurring additional fraud cost. This is an example of the so-called exploration-exploitation trade-off, if any of you are familiar with bandits.

So if you're not doing this at all, if you're not letting through any transactions, you are fully exploiting your current model, but you are essentially inhibiting your ability to explore new models or policies. The more you let through, the more this propensity curve moves up the y-axis, the more you're shifting from exploitation to exploration. You're giving yourself more room to explore new models, new policies and so forth.

One way to compute error bars on these estimates is to do bootstrapping. Take the example of using the rows in this table with the various weights to compute precision and recall. If you want to ask something like, "How accurate is this estimate?", what you will do is this: our table had four rows, so you'll pick four rows from this table uniformly at random with replacement and then do the exact same computation for precision and recall, and then you'll do it again. Pick four rows with replacement. You might pick the same row multiple times. Repeat this procedure, say, 1,000 times. That will give you a variety of numbers for precision and recall. And then that distribution is your estimate, and you'll look at how wide it is, and so forth.

It doesn't really make sense, obviously, when you have only four samples in a table, but you can imagine that if you have tens of millions of rows here, bootstrapping will give you a pretty nice distribution so you can have a sense of how accurate your estimates are.
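A minimal sketch of that bootstrap in Python follows. The talk only says to look at how wide the resulting distribution is; summarising it with a percentile interval is a choice made here, and the function names and the four example rows (one of which is an assumed row, as the original table is not in the transcript) are illustrative.

```python
import random

# (score, probability_allowed, was_fraud); the (55, 0.25, False) row is assumed.
ROWS = [(10, 1.0, False), (45, 1.0, True), (65, 0.2, True), (55, 0.25, False)]


def precision_above_50(rows):
    """Weighted precision of 'block if score > 50'; None if nothing is flagged."""
    flagged = sum(1.0 / p for s, p, _ in rows if s > 50)
    fraud = sum(1.0 / p for s, p, f in rows if s > 50 and f)
    return fraud / flagged if flagged else None


def bootstrap_interval(rows, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over resampled rows."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(rows, k=len(rows))  # sample with replacement
        value = metric(resample)
        if value is not None:  # skip resamples where the metric is undefined
            estimates.append(value)
    if not estimates:
        raise ValueError("metric was undefined on every resample")
    estimates.sort()
    n = len(estimates)
    lo = estimates[int((alpha / 2) * n)]
    hi = estimates[max(0, int((1 - alpha / 2) * n) - 1)]
    return lo, hi


lo, hi = bootstrap_interval(ROWS, precision_above_50)
print(f"95% bootstrap interval for precision: [{lo:.2f}, {hi:.2f}]")
```

With four rows the interval is close to useless, as the talk notes; the same code applied to a large production table is where the width becomes informative.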

Training New Models

So when you train new models, this won't be that novel: you'll train on the weighted data just as in the uniform case and you will evaluate, unsurprisingly, using the weighted data. And the interesting thing to note here is that you could try to solve some of these original problems by, say, A/B testing models: I'm going to put the champion as A at 90% and the challenger model as B with 20%. What this procedure allows you to do instead is to test arbitrarily many new candidate models, new policies and so forth on your table of data. So you are not A/B testing one model against another. You have your incumbent model and you have this holdback against which you can evaluate any number of new models using the same data. So in some sense, it's much more flexible than just binary tests.

All this was inspired by a paper that came out of Microsoft and Facebook a few years ago on counterfactual evaluation, not of fraud models, but of advertising models. And their problem was you have a model for ranking ads, and obviously, what you decide to show impacts what people click on. So how can you continue to retrain models on click-through rate in some statistically sound way? And we adapted that to the case of fraud. But unsurprisingly, this methodology applies anytime you have an ML model that is actively changing the universe or the world.


So I’m going to wrap up with a few technicalities. One, the events that you are dealing with here have to be independent for this analysis to be sound. So each of these rows in the table can't be correlated in any way or everything that I just said will actually not be correct. So is that assumption sound in the credit card payment space? It probably isn't. You know, if I'm a fraudster I might try a card 10 times in a row, right? It gets declined, I'll try again. It gets declined, I'll try again. These transactions are surely not independent. They're being made by the same person who is attempting to achieve some goal. So independence here is not satisfied for individual transactions. So if you're applying this methodology, you have to think what are the actual independent events you care about here. And you want to reverse that entire event and not just parts of it.

So what I mean here in the payments case, for example, maybe your event is not an individual payment but the whole payment session. I go to check out. I attempt a few times, some are declined, but we want this analysis to work and to do that you'll say, “We are going to allow all of the transactions in this session to go through or we're going to block all of them.” So it's important to think carefully about what the events are here because they need to be independent.
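One way to picture session-level reversal is the Python sketch below. It is an illustration under stated assumptions, not a description of Stripe's system: the class name is invented, and fixing the decision using the score of the first attempt in a session is an assumption made here to keep the example short.

```python
import random


class SessionHoldback:
    """Make one allow/block draw per session and reuse it for every attempt."""

    def __init__(self, propensity, rng=None):
        self._propensity = propensity          # maps score -> P(allow)
        self._rng = rng or random.Random()
        self._decisions = {}                   # session_id -> (allow, p_allow)

    def decide(self, session_id, score):
        """Return (allow, p_allow) for this attempt.

        The draw happens only on the first attempt of a session (assumption:
        that attempt's score determines the propensity). Later attempts in
        the same session get the same answer, so the whole session is either
        allowed or blocked.
        """
        if session_id not in self._decisions:
            p_allow = self._propensity(score)
            allow = self._rng.random() < p_allow
            self._decisions[session_id] = (allow, p_allow)
        return self._decisions[session_id]


# Hypothetical flat propensity, just to exercise the class.
policy = SessionHoldback(lambda score: 0.5, rng=random.Random(0))
first = policy.decide("session_a", 70)
retries = [policy.decide("session_a", 72) for _ in range(5)]
print(first, all(r == first for r in retries))
```

In the analysis, each session would then contribute one weighted row rather than one row per attempt.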

One other technicality. So we took this methodology and we said, "We're now going to have a time series estimate for our production precision and recall. We'll compute it over a rolling 90-day window. So we'll have the past 90 days of transactions. We'll do exactly what I described, using the table of data to compute precision and recall, and that way we'll be able to monitor our precision and recall in production over time." So our first attempt to do this looked like this. It doesn't seem very stable; the numbers are jumping around pretty dramatically at some points. So what could be going on here, or why are these numbers so unstable? What turns out to be the case is that we are observing sample size effects. What I mean is that, depending on your propensity curve, if it's particularly steep, for instance, maybe you're allowing through 1 out of every 10,000 transactions that have a score of 100. And if you do that, when you allow through a single transaction with a score of 100, it has a weight of 10,000, and the presence of that one very high-weight transaction in the rolling window is enough to move the metrics by a huge amount. So these huge phase transitions are essentially very high-weight transactions moving into or out of the evaluation window.

So, again, to do this well, you have to think about what exactly is the shape you want for your propensity function. And then second, do you have enough data to do this so that you aren't observing the sample size effects with super high weight charges?
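A toy Python example makes the effect concrete. The window contents are invented, and the "largest weight share" diagnostic is an addition for illustration rather than something from the talk; it simply reports how much of the total weight sits on a single row.

```python
def weighted_precision(rows):
    """rows: (weight, was_fraud) for observed transactions above the threshold."""
    total = sum(w for w, _ in rows)
    fraud = sum(w for w, f in rows if f)
    return fraud / total


def largest_weight_share(rows):
    """Fraction of the total weight carried by the single heaviest row."""
    weights = [w for w, _ in rows]
    return max(weights) / sum(weights)


# Made-up window: 1,000 observed transactions, each let through with
# probability 0.25 (weight 4), half of them fraudulent.
window = [(4.0, False)] * 500 + [(4.0, True)] * 500
before = weighted_precision(window)

# One score-100 transaction let through at 1 in 10,000 enters the window.
window_after = window + [(10_000.0, True)]
after = weighted_precision(window_after)

print(f"precision before: {before:.2f}, after: {after:.2f}")
print(f"share of weight on one row: {largest_weight_share(window_after):.2f}")
```

A single row moves the estimate from 0.50 to roughly 0.86 here, which is the kind of jump described above when such a row enters or leaves the rolling window.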


So that's mostly what I wanted to cover today. Just to conclude: first, it's very easy to back yourself and your models into a corner. If you put them into production and they're actioning or changing the world and you don't have a counterfactual evaluation policy, then after some time you're going to realize: (1) you can't evaluate the models, and (2) you can't train any new ones. So before putting a model into production for the first time, make sure you think about how you're going to evaluate it and train new models down the road. Second, you can address this by injecting randomness into production, by allowing through a small fraction of the transactions that you would've blocked for fraud, for example. And doing so with propensity scores allows you to concentrate your reversal budget where it matters most. And then finally, this methodology allows you to test arbitrarily many candidate models and policies and procedures, and not just two as you would in an A/B test.

Thank you so much for taking the time. It's great to talk to you.



Recorded at:

May 10, 2018