### Key Takeaways

- You can use simulations to determine your confidence around estimates of a given metric.
- When your machine learning model is being used to take actions that affect outcomes in the world, you need to have a system for counterfactual evaluation.
- You can generate “explanations” for black-box model decisions, and those explanations can help with model interpretation and debugging (even if they’re very rudimentary).

*Machine learning has long powered many products we interact with daily–from "intelligent" assistants like Apple's Siri and Google Now, to recommendation engines like Amazon's that suggest new products to buy, to the ad ranking systems used by Google and Facebook. More recently, machine learning has entered the public consciousness because of advances in "deep learning"–these include AlphaGo's defeat of Go grandmaster Lee Sedol and impressive new products around image recognition and machine translation.*

*In this series, we'll give an introduction to some powerful but generally applicable techniques in machine learning. These include deep learning but also more traditional methods that are often all the modern business needs. After reading the articles in the series, you should have the knowledge necessary to embark on concrete machine learning experiments in a variety of areas on your own.*

*This InfoQ article is part of the series "An Introduction To Machine Learning". You can subscribe to receive notifications via RSS.*

Using machine learning to solve real-world problems often presents challenges that weren't initially considered during the development of the machine learning method, but encountering challenges from your very own application is part of the joy of being a practitioner! This article will address a few examples of such issues and will hopefully provide some suggestions (and inspiration) for how to overcome the challenges using straightforward analyses on the data you already have.

Perhaps you'd like to quantify the uncertainty around one of your business metrics. Unfortunately, adding error bars around any metric more complicated than an average can be daunting. Reasonable formulas for the width of the error bars often assume that your data points are independent (which is almost never true in any business—for example, you might have multiple data points per customer or customers connected to each other on a social network). Another common assumption is that your business metric is normally distributed across users, which often fails with "superusers" or a large portion of inactive users. But never fear—simulations and non-parametric methods can almost always be used to create error bars, and all you need is a few lines of code and some computing power.

Or perhaps you're using a binary classifier in production: for example, you may be deciding whether or not to show a website visitor a specific advertisement or whether or not to decline a credit card transaction due to fraud risk. A classifier that results in action being taken can actually become its own adversary by stopping you from observing the outcome for observations in one of the classes: we never get to see whether a website visitor would have clicked an ad if we don't show it, and we never get to see whether a credit card charge was actually fraudulent unless we process it. In both cases, part of our evaluation data is missing. Luckily, there are statistical methods for addressing this.

Finally, you may be using a "black-box" model: a model that makes accurate, fast predictions that computers understand easily but that aren't designed to be examined post-hoc by a human (random forests are a canonical example). Do your users want understandable explanations for decisions that model made? Simple modeling techniques can handle that problem too.

One of my favorite things about being a statistician-turned-ML-practitioner is the optimism of the field. It feels strange to highlight optimism in fields concerned with data analysis: statisticians have a bit of a reputation for being party poopers when they point out to collaborators flaws in experimental designs, violations of model assumptions, or issues arising because of missing data. But the optimism I've seen derives from the fact that ML practitioners have been doing their very best to develop techniques for overcoming these sorts of problems. We can correct expensive-but-badly-designed biology experiments after the fact. We can build regression models even if our data is correlated in surprising or unquantifiable ways that rule out standard linear regression. We can empirically estimate what could have been if we had missing data.

I mention these examples because they (and countless others like them) have led me to believe we can actually solve most of our data problems with relatively simple techniques. I'm loath to give up on answering an empirical machine learning question just because, at first glance, our data set isn't quite "textbook." What follows are a few examples of machine learning problems that at one point seemed insurmountable but that can be tackled with some straightforward solutions.

## Problem 1: Your model becomes its own adversary

Adversarial machine learning is a fascinating subfield of machine learning that deals with model-building within a system whose data changes over time due to an external "adversary," i.e., someone trying to exploit weaknesses in the current model, or someone who benefits from the model making a mistake. Fraud and security are two huge application areas in adversarial ML.

I work on machine learning at Stripe, a company building payments infrastructure for the internet. Specifically I build machine learning models to automatically detect and block fraudulent payments across our platform. My team aims to decline charges being made without the consent of the cardholder. We identify fraud using disputes: cardholders file disputes against businesses where their cards are used without their authorization.

In this scenario, our obvious adversaries are fraudsters: people trying to charge stolen credit card numbers for financial gain. Intelligent fraudsters are generally aware that banks and payment processors have models in place to block fraudulent transactions, so they're constantly looking for ways to get around them. We strive to keep our models current in order to stay ahead of bad actors.

However, a more subtle adversary is the model itself: once we launch a model in production, standard evaluation metrics for binary classifiers (like precision and recall, described in the first article of this series) can become impossible to calculate. If we block a credit card charge, the charge never happens and so we can’t determine if it would have been fraudulent. This means we can't estimate model performance. Any increase in observed fraud rate could theoretically be chalked up to an increase in inbound fraud rather than a degradation in model performance; we can't know without outcome data. The model is its own "adversary" in a loose sense since it works against model improvements by obscuring performance metrics and depleting the supply of training data. This can also be thought of as a very unfortunate "missing data" problem: we're "missing" the outcomes for all of the charges the model blocks. Other machine learning applications suffer from the same issue: for example, in advertising, it's impossible to see whether a certain visitor to a website would have clicked an ad if it never gets shown to that visitor (based on a model's predicted click probability for that user).

Having labeled training data and model performance metrics is business critical, so we developed a relatively simple approach to work around the issue: we let through a tiny, randomly-chosen sample of the charges our models ordinarily would have blocked and see what happens (i.e., observe whether or not the charge is fraudulent and fill in some of our missing data). The probability of reversing a model's "block" decision is dependent on how confident the model is that the charge is fraudulent. Charges the model is less certain about have higher probabilities of being reversed; charges given very high fraud probabilities by the model are approximately never reversed. The reversal probabilities are recorded.
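As a rough illustration of the idea above, here is a minimal sketch of randomized reversal. The function names, the 0.5 block threshold, the 5% ceiling, and the linear schedule are all hypothetical choices made for this example, not a description of Stripe's production logic. The point it shows is that the reversal probability shrinks as the model's confidence grows, and that the probability is recorded alongside the decision.

```python
import random


def reversal_probability(fraud_score, block_threshold=0.5, max_prob=0.05):
    """Probability that we observe the outcome of a charge.

    Hypothetical schedule: charges below the block threshold are never
    blocked, so their outcome is always observed. Blocked charges are let
    through with a probability that falls linearly from max_prob (at the
    threshold) to 0 (at a fraud score of 1.0).
    """
    if fraud_score < block_threshold:
        return 1.0
    return max_prob * (1.0 - fraud_score) / (1.0 - block_threshold)


def decide(fraud_score, rng):
    """Allow or block a charge, recording the probability used."""
    p = reversal_probability(fraud_score)
    allowed = rng.random() < p
    # Store the probability with the decision: it is needed later to
    # weight the observed outcomes.
    return {"score": fraud_score, "allowed": allowed, "observe_prob": p}


rng = random.Random(0)
decisions = [decide(score, rng) for score in (0.1, 0.55, 0.8, 0.99)]
```

Recording `observe_prob` at decision time matters because the probabilities may change as the model or the schedule changes, and the later analysis needs the value that was actually in force.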

We can then use a statistical technique called inverse probability weighting to reconstruct a fully-labeled data set of charges with labeled outcomes. The idea behind inverse probability weighting is that a charge whose outcome we know because a model's "block" decision was reversed with a probability of 5% represents 20 charges: itself, plus 19 other charges like it whose model block decisions weren't reversed. So we essentially create a data set containing 20 copies of that charge. From there, we can calculate all the usual binary classifier metrics for our model: precision, recall, false positive rate, etc; we can also estimate things like "incoming fraudulent volume" and create weighted training data sets for new, improved models.
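To make the weighting concrete, here is a small sketch using pandas. The toy data and the column name `observe_prob` are assumptions made for this example; the data frame holds only charges whose outcomes were observed. Each row's weight is the inverse of the probability that its outcome was observed, so a reversed block with a 5% probability counts as 20 charges.

```python
import pandas as pd

# Toy data (hypothetical): only charges whose outcome we observed.
# Allowed charges are always observed (probability 1.0); blocked charges
# appear only if the block was randomly reversed.
observed = pd.DataFrame({
    "fraud":           [True, True,  False, False, True],
    "predicted_fraud": [True, False, True,  False, True],
    "observe_prob":    [0.05, 1.0,   0.05,  1.0,   0.10],
})

# Inverse probability weight: how many charges each row stands in for.
observed["weight"] = 1.0 / observed["observe_prob"]


def weighted_precision_recall(df):
    """Precision and recall computed on weighted counts."""
    true_pos = df.loc[df.fraud & df.predicted_fraud, "weight"].sum()
    predicted_pos = df.loc[df.predicted_fraud, "weight"].sum()
    actual_pos = df.loc[df.fraud, "weight"].sum()
    return true_pos / predicted_pos, true_pos / actual_pos


precision, recall = weighted_precision_recall(observed)
```

Summing weights instead of physically duplicating rows gives the same metrics as the "20 copies" picture while keeping the data set small.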

Here, we first took advantage of our ability to change the way the underlying system works: we don't control who makes payments on Stripe, but we're able to be creative with what happens after the payment in order to get the data we need for improvements to fraud detection. Our reversal probabilities that varied with our model's certainty reflected the business requirements of this solution: we should almost always block charges we know to be fraudulent in the interest of doing what's best for the businesses depending on us for their payments. And even though the best business solution here was not the ideal solution for data analysis, we made use of a classic statistical method to correct for that. Keeping smart system modifications in mind and remembering that we can often adjust our post-hoc analyses were both key insights to solving this problem.

## Problem 2: Error bar calculations seem impossible

Determining the margin of error on any estimate is (a) very important, since the certainty in your estimate can very much affect how you act on that information later, and (b) often terrifyingly challenging. Standard error formulas can only get you so far; once you try to put error bars around any quantity that isn't an average, things get complicated very quickly. Many standard error formulas also require some estimate of correlation or covariance—a quantification of how the data points going into the calculation are related to each other—or an assumption that those data points are independent.

I'll illustrate this challenge with a real example from the previous section: estimating the recall of our credit card fraud model using the inverse-probability-weighted data. We'd like to know what percentage of incoming fraud is blocked by our existing production model. Let's assume we have a data frame, df, with 4 columns: the charge id, a boolean fraud indicating whether or not the charge was actually fraudulent, a boolean predicted_fraud indicating whether or not our model classified the charge as fraudulent, and weight (the inverse of the probability that we observed the charge's outcome). Then the formula for model recall (in pseudocode) is

`recall = ((df.fraud & df.predicted_fraud).toint * df.weight) / (df.fraud * df.weight)`

(Note that & is an element-wise logical AND on the df.fraud and df.predicted_fraud vectors, and * is a vector dot product.) There isn't a known closed-form solution for calculating a confidence interval (i.e., calculating the widths of the error bars) around an estimator like that. Luckily, there are straightforward techniques we can use to get around this problem.

Computational methods for error bar estimation work in virtually any scenario and have almost no assumptions that go along with them. My favorite, and the one I'm going to talk about now, is the bootstrap, invented by Brad Efron in 1979. The linked paper has a proof that confidence intervals calculated this way have all the mathematical properties you'd expect from a confidence interval. The main disadvantage to methods like the bootstrap is that they're computationally intensive, but it's 2017 and we live in a world where computing power is cheap, so what made this sometimes unusable in 1979 is basically a non-issue today.

Bootstrapping involves estimating variation in our observed data set using sampling: we take a large number of samples with replacement from the original data set, each with the same number of observations as in the original data set. We then calculate our estimated metric (recall) on each of those "bootstrap samples." The 2.5th percentile and the 97.5th percentile of those estimated recalls are then our lower and upper bounds for a 95% confidence interval for the true recall. Here's the algorithm in Python assuming df is the same data frame as in the example above:

```
from numpy import percentile
from numpy.random import randint

def recall(df):
    # Weighted recall: weighted caught fraud / weighted total fraud.
    caught = (df.fraud & df.predicted_fraud).astype(int)
    return (caught * df.weight).sum() / (df.fraud * df.weight).sum()

n = len(df)
num_bootstrap_samples = 10000
bootstrapped_recalls = []
for _ in range(num_bootstrap_samples):
    # Draw n rows with replacement from the original data frame.
    sampled_data = df.iloc[randint(0, n, size=n)]
    est_recall = recall(sampled_data)
    bootstrapped_recalls.append(est_recall)

# The middle 95% of the bootstrapped recalls is the confidence interval.
ci_lower = percentile(bootstrapped_recalls, 2.5)
ci_upper = percentile(bootstrapped_recalls, 97.5)
```

With techniques like the bootstrap and the jackknife, error bars can almost always be estimated. They might have to be done in batch rather than in real time, but we can basically always calculate accurate measures of uncertainty!

## Problem 3: A black box model's decisions need to be interpreted by a human

Machine learning models are commonly described as "magic" or "black boxes"—the important thing is what goes into them and what comes out, not necessarily how that output is calculated. Sometimes this is what we want: many consumers don't really need to see inside the sausage factory, as they say—they just want a tasty tubular treat. But other times, a prediction from a box full of magic isn't satisfying. For many production machine learning systems, understanding individual decisions made by a machine learning model is crucial: at Stripe, we recently made our machine learning model decisions visible to the online businesses we support, which means business owners can understand what factors led to our models' decision to decline or allow a charge.

As we noted in the introduction, random forests are a canonical example of a black box model, and they're the core of Stripe's fraud models. The basic idea behind a random forest is the following: a "forest" is composed of a set of multiple decision trees. The trees are constructed by finding the splits, or questions (e.g., "does the country this credit card was issued in match the country of the IP address it's being used at?") that optimize some classification criterion, e.g., overall classification accuracy. Each item is run through all of the trees, and the individual tree decisions are "averaged" (i.e., the trees "vote") to get a final prediction. Random forests are flexible, perform well, and are really fast to evaluate in real-time, so they're a common choice for production machine learning models. But it's very hard to translate several sets of splits plus tree votes into an intuitive explanation for the final prediction.
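The voting step can be sketched in a few lines. The three "trees" below are hand-written, single-question stand-ins invented for this example, and the feature names are hypothetical. Real trees are learned from data and contain many splits each, which is exactly why the combined vote is hard to explain.

```python
# Each hypothetical "tree" asks one question and votes 1 (fraud) or 0.
def tree_country_mismatch(charge):
    return 1 if charge["ip_country"] != charge["card_country"] else 0


def tree_large_amount(charge):
    return 1 if charge["amount"] > 500 else 0


def tree_new_card(charge):
    return 1 if charge["card_age_days"] < 1 else 0


FOREST = [tree_country_mismatch, tree_large_amount, tree_new_card]


def forest_predict(charge, trees=FOREST):
    """Average the individual tree votes into one fraud score."""
    votes = [tree(charge) for tree in trees]
    return sum(votes) / len(votes)


example_charge = {
    "ip_country": "FR",
    "card_country": "US",
    "amount": 40,
    "card_age_days": 300,
}
score = forest_predict(example_charge)
```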

It turns out there's been research done on this problem, so even though it likely won't be part of an introductory machine learning course, the body of knowledge is out there. But perhaps more importantly, this problem exemplifies the lesson that a simple solution can work really well. Again, this is 2017; raw computing power is abundant and cheap. One way to get a rudimentary explanation for a black box model is to write a simulation: vary one feature at a time across its domain, and see how the prediction changes—or maybe change the values of two covariates at a time and see how the predictions change. Another approach (that we used for a while here at Stripe) is to recalculate the predicted outcome probability treating each feature in turn as missing (and non-deterministically traversing both paths whenever there is a “split” on the omitted feature); the features changing the predicted outcome probabilities the most can be considered the most important. This produced some confusing explanations, but worked reasonably well until we were able to implement a more formal solution, which we now use to surface explanations to businesses processing payments with us.
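Here is a minimal sketch of the first approach, varying one feature at a time. The `toy_model` function is a hypothetical stand-in for a black box, and the feature names and grids are invented for this example. In practice `predict` would wrap the real model's scoring call.

```python
def feature_sensitivity(predict, row, feature_grid):
    """Vary one feature at a time and measure how far the prediction moves.

    predict: function mapping a dict of features to a score.
    row: the observation we want to explain.
    feature_grid: dict of feature name -> list of values to try.
    Returns the baseline prediction and, for each feature, the spread
    (max minus min) of the predictions seen while varying that feature.
    """
    baseline = predict(row)
    effects = {}
    for name, values in feature_grid.items():
        # Hold every other feature fixed at its observed value.
        preds = [predict({**row, name: value}) for value in values]
        effects[name] = max(preds) - min(preds)
    return baseline, effects


def toy_model(charge):
    """Hypothetical black box used only for this illustration."""
    score = 0.1
    if charge["ip_country"] != charge["card_country"]:
        score += 0.5
    if charge["amount"] > 500:
        score += 0.3
    return score


row = {"card_country": "US", "ip_country": "FR", "amount": 40}
grid = {
    "ip_country": ["US", "FR", "DE"],
    "amount": [10, 100, 1000],
}
baseline, effects = feature_sensitivity(toy_model, row, grid)

# Features with the largest spread are the rudimentary "explanation".
ranked = sorted(effects, key=effects.get, reverse=True)
```

This only captures the effect of each feature on its own. Interactions between features would need two or more features varied together, at a higher computational cost.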

With each solution we implemented, we anecdotally experienced a marked improvement in overall understanding of why specific decisions were made. Having explanations available was also a great debugging tool for identifying specific areas where our models were making systematic mistakes. Being able to solve this problem again highlights how we have the tools (straightforward math and a few computers) to solve business-critical machine learning problems, and even simple solutions were better than none. Having a "this is solvable" mindset helped us implement a useful system, and it's given me optimism about our ability to get what we need from the data we have.

## Other reasons I'm optimistic

I remain hopeful about being able to solve important problems with data. In addition to the examples in the introduction and the three problems outlined above, several other small, simple, clever ways we can solve problems came to mind:

- Unwieldy data sets: If a data set is too large to fit in memory, or computations are unreasonably slow because of the amount of data, downsample. Many questions can be reasonably answered using a sample of the data set (and most statistical techniques were developed for random samples anyway).
- Lack of rare event data in sampled data sets: for example, I often get random samples of charges that don't contain any fraud, since fraud is a rare event. A strategy here is to take all of the fraud in the original data set, just downsample the non-fraud, and use sample weights (similar to the inverse-probability weighting discussed above) in the final analysis.
- Beautiful, clever computational tricks to calculate computationally-intensive quantities: exponentially-weighted moving averages are a nice example here. Moving averages are notoriously hard to compute (since you have to keep all of the data points in the window of interest), whereas exponentially-weighted moving averages get at the same idea using a running aggregate, so they're much faster. HyperLogLogs are a lovely approximation of all-time counts without needing to scan the entire data set, and HLLSeries are their really cool counterpart for counts in a specific window. These strategies are all approximations, but machine learning is an approximation anyway.
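As one example of the tricks in the list above, an exponentially-weighted moving average can be sketched as follows. The function name and the default smoothing factor `alpha` are choices made for this illustration. Only the previous average is kept between updates, rather than every point in a window.

```python
def ewma(values, alpha=0.1):
    """Exponentially-weighted moving average of a stream of numbers.

    alpha is the weight given to the newest value: a larger alpha reacts
    faster to change, a smaller alpha smooths more. The first value seeds
    the average.
    """
    average = None
    averages = []
    for x in values:
        if average is None:
            average = x
        else:
            # One multiply-add per point; no window of past values needed.
            average = alpha * x + (1 - alpha) * average
        averages.append(average)
    return averages


smoothed = ewma([0.0, 2.0, 2.0, 2.0], alpha=0.5)
```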

These are just a few data-driven ways to overcome the everyday challenges of practical machine learning.

## About the Author

**Alyssa Frazee** is a machine learning engineer at Stripe, where she builds models to detect fraud in online credit card payments. Before Stripe, she did a PhD in biostatistics and fell in love with programming at the Recurse Center. Find her on Twitter at @acfrazee.
