
Introduction to Machine Learning with Python

Jan 28, 2017 17 min read


Key Takeaways

Logistic regression is appropriate for binary classification when the relationship between the input variables and the output we’re trying to predict is linear or when it’s important to be able to interpret the model (by, for example, isolating the impact that any one input variable has on the prediction).
Decision trees and random forests are non-linear models that can capture more complex relationships well but are less amenable to human interpretation.
It’s important to assess model performance appropriately to verify that your model will perform well on data it has not seen before.
Productionizing a machine learning model involves many considerations distinct from those in the model development process: for example, how do you compute model inputs synchronously? What information do you need to log every time you score? And how do you determine the performance of your model in production?

Machine learning has long powered many products we interact with daily–from "intelligent" assistants like Apple's Siri and Google Now, to recommendation engines like Amazon's that suggest new products to buy, to the ad ranking systems used by Google and Facebook. More recently, machine learning has entered the public consciousness because of advances in "deep learning"–these include AlphaGo's defeat of Go grandmaster Lee Sedol and impressive new products around image recognition and machine translation.

In this series, we'll give an introduction to some powerful but generally applicable techniques in machine learning. These include deep learning but also more traditional methods that are often all the modern business needs. After reading the articles in the series, you should have the knowledge necessary to embark on concrete machine learning experiments in a variety of areas on your own.

This InfoQ article is part of the series "An Introduction To Machine Learning". You can subscribe to receive notifications via RSS.

This series will explore various topics and techniques in machine learning, arguably the most talked-about area of technology and computer science over the past several years. Machine learning at a high level has been covered in previous InfoQ articles (see, for example, Getting Started with Machine Learning in the Getting a Handle on Data Science series), and in this article and the ones that follow it we’ll elaborate on many of the concepts and methods discussed earlier, emphasizing concrete examples, and venture into some new areas, including neural networks and deep learning.

We’ll begin, in this article, with an extended “case study” in Python: how can we build a machine learning model to detect credit card fraud? (While we’ll use the language of fraud detection, much of what we do will be applicable with little modification to other classification problems, such as ad-click prediction.) Along the way, we’ll encounter many of the key ideas and terms in machine learning, including logistic regression, decision trees, and random forests; true positive and false positive rates; cross-validation; and ROC curves and AUC (the area under the ROC curve).

Target: Credit card fraud

Businesses that sell products online inevitably have to deal with fraud. In a typical fraudulent transaction, the fraudster will obtain stolen credit card numbers and then use them to purchase goods online. The fraudsters will then sell those goods elsewhere at a discount, pocketing the proceeds, while the business must bear the cost of the “chargeback.” You can read more about the details of credit card fraud here.

Let’s say we’re an online business that has been experiencing fraud for some time, and we’d like to use machine learning to help with the problem. More specifically, every time a transaction is made, we’d like to predict whether or not it’ll turn out to be fraudulent (i.e., whether the authorized cardholder was not the one making the purchase) so that we can take action appropriately. This type of machine learning problem is known as classification as we are assigning every incoming payment to one of two classes: fraud or not-fraud.

For every historical payment, we have a boolean indicating whether the charge was fraudulent (fraudulent) and some other attributes that we think might be indicative of fraud—for example, the amount of the payment in US dollars (amount), the country in which the card was issued (card_country), and the number of payments made with the card at our business in the past day (card_use_24h). Thus, the data we have to build our predictive model might look like the following CSV:

fraudulent,charge_time,amount,card_country,card_use_24h

False,2015-12-31T23:59:59Z,20484,US,0

False,2015-12-31T23:59:59Z,1211,US,0

False,2015-12-31T23:59:59Z,8396,US,1

False,2015-12-31T23:59:59Z,2359,US,0

False,2015-12-31T23:59:59Z,1480,US,3

False,2015-12-31T23:59:59Z,535,US,3

False,2015-12-31T23:59:59Z,1632,US,0

False,2015-12-31T23:59:59Z,10305,US,1

False,2015-12-31T23:59:59Z,2783,US,0

There are two important details we’re going to skip over in our discussion, but they’re worth keeping in mind because they are just as important as, if not more important than, the basics of model building we’re covering here.

First, there is the data science problem of determining what features we think are indicative of fraud. In our example, we’ve identified the payment amount, the country in which the card was issued, and the number of times the card was used in the past day as features we think may be useful in predicting fraud. In general, you’ll need to spend a lot of time looking at data to determine what’s useful and what’s not.

Second, there is the data infrastructure problem of computing the values of features: we need those values for all historical samples to train the model, but we also need their real-time values as payments come in to score new transactions appropriately. It’s unlikely that, before you began worrying about fraud, you were already maintaining and recording the number of card uses over 24-hour rolling windows, so if you find that that feature is useful for fraud detection, you’ll need to be able to compute it both in production and in batch. Depending on the definition of the feature, this can be highly non-trivial.

These problems together are frequently referred to as feature engineering and are often the most involved (and impactful) parts of industrial machine learning.

Logistic regression

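Let’s start with one of the most basic possible models: a linear one. We’ll attempt to find coefficients a, b, …, Z so that

$$\text{Probability(fraud)} = a \times \text{amount} + b \times \text{card\_use\_24h} + \dots + Z$$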

For every payment, we’ll plug in the values of amount, card_country, and card_use_24h into the formula above, and if the probability is greater than 0.5 we’ll “predict” that the payment is fraudulent and otherwise we’ll predict that it’s legitimate.

Even before we discuss how to compute a, b, … Z, there are two immediate problems to address:

1. Probability(fraud) needs to be a number between zero and one, but the quantity on the right side can get arbitrarily large (in absolute value) depending on the values of amount and card_use_24h (if those feature values are sufficiently large and one of a or b is nonzero).
2. card_country isn’t a number: it takes one of a number of values (say US, AU, GB, and so forth). Such features are called categorical and need to be “encoded” appropriately before we can train our model.

Logit function

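To address (1), instead of modeling p = Probability(fraud) directly, we’ll model what is known as the log-odds of fraud, so our model becomes

$$\log\left(\frac{p}{1 - p}\right) = a \times \text{amount} + b \times \text{card\_use\_24h} + \dots + Z$$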

If an event has probability p, its odds are p / (1 - p), which is why the left side is called the “log odds” or “logit.”

Given values of a, b, … Z, and the features, we can compute the predicted probability of fraud by inverting the function above to get

$$p = \frac{1}{1 + e^{-(a \times \text{amount} + b \times \text{card\_use\_24h} + \dots + Z)}}$$

The probability of fraud p is a sigmoidal function of the linear function L = a x amount + b x card_use_24h + … and looks like the following:

Regardless of the value of the linear function, the sigmoid maps it to a number between 0 and 1, which is a legitimate probability.
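The relationship between the log-odds and the probability can be checked numerically. The following is a minimal sketch in plain Python; the function names `log_odds` and `sigmoid` are chosen here for illustration and are not taken from any library.

```python
import math


def log_odds(p):
    """Map a probability strictly between 0 and 1 to its log-odds (logit)."""
    return math.log(p / (1.0 - p))


def sigmoid(linear):
    """Invert the logit: map a value of the linear function into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear))


# Positive or negative values of the linear function both give a
# number that can be read as a probability.
for linear in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(linear, round(sigmoid(linear), 4))
```

Applying `log_odds` to the output of `sigmoid` recovers the original linear value (up to floating-point error), which is the inversion described above.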

Categorical variables

To address (2), we’ll take the categorical variable card_country (which, say, takes one of N distinct values) and expand it into N - 1 “dummy” variables. These new features will be booleans of the form card_country = AU, card_country = GB, etc. We only need N - 1 “dummies” because the Nth value is implied when the N - 1 dummies are all false. For simplicity, let’s say that card_country can take just one of three values here: AU, GB, and US. Then we need two dummy variables to encode it, and the model we would like to fit (i.e., find the coefficient values for) is

$$\log\left(\frac{p}{1 - p}\right) = a \times \text{amount} + b \times \text{card\_use\_24h} + c \times [\text{card\_country} = \text{AU}] + d \times [\text{card\_country} = \text{GB}] + Z$$

where each bracketed dummy is 1 when the condition holds and 0 otherwise.

This type of model is known as a logistic regression.

Fitting the model

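How do we determine the values of a, b, c, d, and Z? Let’s start by picking random guesses for a, b, c, d, and Z. We can define the likelihood of this set of guesses as

$$\text{Likelihood}(a, b, c, d, Z) = \prod_{\text{fraudulent samples}} p \times \prod_{\text{non-fraudulent samples}} (1 - p)$$

That is, take every sample in our data set and compute the predicted probability of fraud p given our guesses of a, b, c, d, and Z (and the feature values for each sample) using

$$p = \frac{1}{1 + e^{-(a \times \text{amount} + b \times \text{card\_use\_24h} + c \times [\text{card\_country} = \text{AU}] + d \times [\text{card\_country} = \text{GB}] + Z)}}$$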

For every sample that actually was fraudulent, we’d like p to be close to 1, and for every sample that was not fraudulent, we’d like p to be close to 0 (so 1 - p should be close to 1). Thus, we take the product of p over all fraudulent samples with the product of (1 - p) over all non-fraudulent samples to get our assessment of how good the guesses a, b, c, d, and Z are. We’d like to make the likelihood function as large as possible (i.e., as close as possible to 1). Starting with our guess, we’ll iteratively tweak a, b, c, d, and Z, improving the likelihood until we find that we can no longer increase it by perturbing the coefficients. One common method for doing this optimization is stochastic gradient descent.
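To make the fitting procedure concrete, here is a minimal sketch of stochastic gradient updates in plain Python. It rests on several assumptions not in the text: it works with the logarithm of the likelihood (which has the same maximizer and is numerically better behaved), it starts from zeros rather than random guesses, it uses a single made-up feature, and the toy samples, learning rate, and epoch count are invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_likelihood(coefs, intercept, samples):
    """Sum of log(p) over fraudulent samples and log(1 - p) over the rest."""
    total = 0.0
    for features, is_fraud in samples:
        linear = intercept + sum(c * f for c, f in zip(coefs, features))
        p = sigmoid(linear)
        total += math.log(p) if is_fraud else math.log(1.0 - p)
    return total


def sgd_fit(samples, n_features, learning_rate=0.05, epochs=500):
    """Nudge the coefficients after each sample to raise the log-likelihood."""
    coefs = [0.0] * n_features
    intercept = 0.0
    for _ in range(epochs):
        for features, is_fraud in samples:
            linear = intercept + sum(c * f for c, f in zip(coefs, features))
            error = (1.0 if is_fraud else 0.0) - sigmoid(linear)
            coefs = [c + learning_rate * error * f
                     for c, f in zip(coefs, features)]
            intercept += learning_rate * error
    return coefs, intercept


# Made-up samples: ((card_use_24h,), fraudulent)
samples = [
    ((0.0,), False), ((0.0,), False), ((1.0,), False), ((1.0,), False),
    ((1.0,), True), ((2.0,), False), ((2.0,), True), ((3.0,), False),
    ((3.0,), True), ((4.0,), True), ((5.0,), True),
]

coefs, intercept = sgd_fit(samples, n_features=1)
print("before:", log_likelihood([0.0], 0.0, samples))
print("after: ", log_likelihood(coefs, intercept, samples))
```

Library implementations typically use more refined optimizers, so this should be read as an illustration of the idea rather than of what scikit-learn does internally.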

Implementation in Python

Now we’ll use some standard open-source tools in Python to put into practice the theory we’ve just discussed. We’ll use pandas, which brings R-like data frames to Python, and scikit-learn, a popular machine learning package. Let’s say the sample data we described above is in a CSV named “data.csv”; we can load the data and take a peek at it with the following:
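A minimal sketch of the loading step, assuming pandas is installed. To keep the snippet self-contained, a few rows of the sample CSV are inlined as a string instead of being read from data.csv; with the real file, `pd.read_csv("data.csv")` would be the equivalent call.

```python
import io

import pandas as pd

# A few rows copied from the sample CSV shown earlier in the article.
CSV_TEXT = """fraudulent,charge_time,amount,card_country,card_use_24h
False,2015-12-31T23:59:59Z,20484,US,0
False,2015-12-31T23:59:59Z,1211,US,0
False,2015-12-31T23:59:59Z,8396,US,1
"""

data = pd.read_csv(io.StringIO(CSV_TEXT))

# Take a peek at the first rows of the data frame.
print(data.head())
```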

We can encode card_country into the appropriate dummy variables with
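One way to do this is sketched below, assuming pandas. The four-row data frame is made up so the snippet runs on its own, and the `cc` prefix for the dummy columns is a naming choice made here. `get_dummies` creates one column per country, and we then keep only the AU and GB columns, so that US is the implied value.

```python
import pandas as pd

# Made-up rows standing in for the real data frame.
data = pd.DataFrame({
    "fraudulent": [False, True, False, False],
    "amount": [20484, 1211, 8396, 2359],
    "card_country": ["US", "AU", "GB", "US"],
    "card_use_24h": [0, 3, 1, 0],
})

# One boolean column per country: cc_AU, cc_GB, cc_US.
encoded_countries = pd.get_dummies(data["card_country"], prefix="cc")
data = data.join(encoded_countries)

# Target and features are kept separate; cc_US is left out because it is
# implied when cc_AU and cc_GB are both false.
y = data["fraudulent"]
X = data[["amount", "card_use_24h", "cc_AU", "cc_GB"]]
print(X)
```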

Now the data frame data has all the data we need, dummy variables and all, to train our model. We’ve split up the target (the variable we’re trying to predict—in this case fraudulent) and the features as scikit takes them as different parameters.

Before proceeding with the model training, there’s one more issue to discuss. We’d like our model to generalize well—i.e., it should be accurate when classifying payments that we haven’t seen before and it should not just capture the idiosyncratic patterns in the payments we happen to have already seen. To make sure that we don’t overfit our models to the noise in the data we have, we’ll separate the data into two sets—a training set that we’ll use to estimate the model parameters (a, b, c, d, and Z) and a validation set (also called a test set) that we’ll use to compute metrics of model performance (see the next section on what these are). If a model is overfit, it will perform well on the training set (as it will have learned the patterns in the set) but poorly on the validation set. There are other approaches to cross-validation (for example, k-fold cross validation), but a “train-test” split will serve our purposes here.

We can easily split our data into training and testing sets with scikit as follows:
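A minimal sketch of the split, assuming a recent scikit-learn in which `train_test_split` lives in `sklearn.model_selection`. The feature matrix is synthetic and generated only so the snippet runs without the real data; `random_state` is fixed purely for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the payments data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 300
amount = rng.uniform(1, 200, n)
card_use_24h = rng.integers(0, 5, n)
country = rng.integers(0, 3, n)  # 0 = AU, 1 = GB, 2 = US
X = np.column_stack([amount, card_use_24h, country == 0, country == 1]).astype(float)
y = rng.random(n) < 0.2

# Hold out roughly one third of the samples for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
print(len(X_train), len(X_test))
```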

In this example, we’ll use ⅔ of the data to train the model and ⅓ of the data to validate it.

We’re now ready to train the model, which at this point is a triviality:
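A minimal sketch of the training step, assuming scikit-learn's `LogisticRegression`. The data is synthetic: fraud is generated to become more likely as card_use_24h grows, so that there is a pattern for the model to find. The `max_iter` value is a choice made here to give the solver room to converge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the payments data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 2000
amount = rng.uniform(1, 200, n)
card_use_24h = rng.integers(0, 5, n)
country = rng.integers(0, 3, n)  # 0 = AU, 1 = GB, 2 = US
X = np.column_stack([amount, card_use_24h, country == 0, country == 1]).astype(float)
y = rng.random(n) < 1.0 / (1.0 + np.exp(-(card_use_24h - 3.0)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# coef_ holds one coefficient per feature column; intercept_ holds the constant.
print("a, b, c, d:", lr.coef_[0])
print("Z:", lr.intercept_[0])
```

Note that scikit-learn applies regularization by default, so the fitted values are not exactly the unpenalized maximum-likelihood estimates.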

The fit function runs the fitting procedure (which maximizes the likelihood function described above), and then we can query the returned object for the values of a, b, c, and d (in coef_) and Z (in intercept_). Substituting those fitted values into the logistic regression equation gives our final model.

Evaluating model performance

Once we’ve trained a model, we need to determine how good that model is at predicting the variable of interest (in this case, the boolean indicating whether the payment is believed to be fraudulent or not). Recall that we said we’d classify a payment as fraudulent if Probability(fraud) is greater than 0.5 and that we’d classify it as legitimate otherwise. Two quantities frequently used to measure performance given a model and a classification policy such as this are

the false positive rate: the fraction of all legitimate charges that are incorrectly classified as fraudulent, and
the true positive rate (also known as recall or sensitivity): the fraction of all fraudulent charges that are correctly classified as fraudulent.

While there are many measures of classifier performance, we’ll focus on these two.
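The two definitions translate directly into code. The following plain-Python sketch uses made-up labels and predicted probabilities, and the helper name `rates_at_threshold` is hypothetical.

```python
def rates_at_threshold(is_fraud, probabilities, threshold):
    """Return (false positive rate, true positive rate) for one threshold."""
    flagged = [p > threshold for p in probabilities]
    true_pos = sum(1 for f, y in zip(flagged, is_fraud) if f and y)
    false_pos = sum(1 for f, y in zip(flagged, is_fraud) if f and not y)
    n_fraud = sum(1 for y in is_fraud if y)
    n_legit = len(is_fraud) - n_fraud
    return false_pos / n_legit, true_pos / n_fraud


# Made-up labels and model outputs for six payments.
is_fraud = [True, True, False, False, False, True]
probabilities = [0.9, 0.4, 0.6, 0.1, 0.3, 0.7]

fpr, tpr = rates_at_threshold(is_fraud, probabilities, threshold=0.5)
print(fpr, tpr)
```

With a threshold of 0.5, one of the three legitimate payments is flagged and two of the three fraudulent ones are caught.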

Ideally, the false positive rate will be close to zero and the true positive rate will be close to 1. As we vary the probability threshold at which we classify a charge as fraudulent (above we said it was 0.5, but we can choose any value between 0 and 1—low values mean we’re more aggressive in labeling payments as fraudulent and high values mean we’re more conservative), the false positive rate and true positive rate trace out a curve that depends on how good our model is. This is known as the ROC curve and can be computed easily with scikit:
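A minimal sketch of the ROC computation, assuming scikit-learn's `roc_curve` and `auc`. Because the data here is synthetic, the numbers it prints will not match the ones quoted in the text; picking the point whose threshold is closest to 0.5 is a choice made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the payments data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 2000
amount = rng.uniform(1, 200, n)
card_use_24h = rng.integers(0, 5, n)
country = rng.integers(0, 3, n)  # 0 = AU, 1 = GB, 2 = US
X = np.column_stack([amount, card_use_24h, country == 0, country == 1]).astype(float)
y = rng.random(n) < 1.0 / (1.0 + np.exp(-(card_use_24h - 3.0)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba is the predicted probability of the True class.
y_test_scores = lr.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_test_scores)
roc_auc = auc(fpr, tpr)

# Pick out the point on the curve whose threshold is closest to 0.5.
idx = int(np.argmin(np.abs(thresholds - 0.5)))
print("threshold:", thresholds[idx], "FPR:", fpr[idx], "TPR:", tpr[idx])
print("AUC:", roc_auc)
```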

The variables fpr, tpr, and thresholds contain the data for the full ROC curve, but we’ve picked a sample point here: if we say a charge is fraudulent if Probability(fraud) is greater than 0.514, then the false positive rate is 0.374 and the true positive rate is 0.681. The whole ROC curve and the point we picked out are depicted below.

The better a model is overall, the closer the ROC curve (the blue line above) will hug the left and top borders of the graph. Note that the ROC curve as a whole tells you how good your model is, and this can be captured with a single number: the AUC, or the area under the curve. The closer the AUC is to 1, the better the model is overall.

Of course, when you put a model into production to take an action, you’ll generally need to act on the probabilities the model outputs by comparing them to a threshold as we did above, saying that a charge is predicted to be fraudulent if Probability(fraud) > 0.5. Thus, the performance of your model for a specific application corresponds to a single point on the ROC curve; the curve as a whole describes the tradeoff between false positive rate and true positive rate, i.e., the policy options you have at your disposal.

Decision trees and random forests

The model above, a logistic regression, is an example of a linear machine learning model. Imagine that every sample payment we have is a point in space whose coordinates are the values of features. If we had just two features, each sample point would be a point in the x-y plane. A linear model like logistic regression will generally perform well if we can separate the fraudulent samples from the non-fraudulent samples with a linear function. In the two-feature case, that just means that almost all the fraudulent samples lie on one side of a line and almost all the non-fraudulent samples lie on the other side of that line.

It’s often the case that the relationship between predictive features and the target variable we’re trying to predict is nonlinear, in which case we should use a nonlinear model to capture the relationship. One powerful and intuitive type of nonlinear model is a decision tree like the following:

At each node, we compare the value of a specified feature to some threshold and branch either to the left or the right depending on the output of the comparison. We continue in this manner (like a game of twenty questions, though trees do not need to be twenty levels deep) until we reach a leaf of the tree. The leaf consists of all the samples in our training set for which the comparisons at each node satisfied the path we took down the tree, and the fraction of samples in the leaf that are fraudulent is the predicted probability of fraud that the model reports. When we have a new sample to be classified, we generate its features and play the “twenty-questions game” until we reach a leaf, and the predicted probability of fraud is reported as described.
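The traversal described above can be sketched in a few lines of plain Python. The tree below is hand-built and hypothetical: its features, thresholds, and leaf counts are made up for illustration and are not the tree from the figure.

```python
# Internal nodes compare one feature to a threshold; leaves record how many
# training samples landed there and how many of those were fraudulent.
TREE = {
    "feature": "card_use_24h", "threshold": 2.5,
    "left": {"fraud": 5, "total": 100},
    "right": {
        "feature": "amount", "threshold": 5000,
        "left": {"fraud": 10, "total": 40},
        "right": {"fraud": 30, "total": 40},
    },
}


def predict_fraud_probability(node, payment):
    """Walk from the root to a leaf and report the leaf's fraud fraction."""
    while "feature" in node:
        goes_left = payment[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["fraud"] / node["total"]


print(predict_fraud_probability(TREE, {"card_use_24h": 3, "amount": 9000}))
```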

While we won’t go into the details of how the tree is produced (though, briefly, we pick the feature and the threshold at each node to maximize some notion of information gain or discriminatory power—the gini reported in the figure above—and proceed recursively until we hit some pre-specified stopping criterion), training a decision tree model with scikit is as easy as training a logistic regression (or any other model, in fact):
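A minimal sketch of that training step, assuming scikit-learn's `DecisionTreeClassifier`. The data is synthetic, and `max_depth=3` is a stopping criterion picked here for illustration rather than a recommended setting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the payments data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 2000
amount = rng.uniform(1, 200, n)
card_use_24h = rng.integers(0, 5, n)
country = rng.integers(0, 3, n)  # 0 = AU, 1 = GB, 2 = US
X = np.column_stack([amount, card_use_24h, country == 0, country == 1]).astype(float)
y = rng.random(n) < 1.0 / (1.0 + np.exp(-(card_use_24h - 3.0)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Limit the depth so the tree cannot grow until every leaf holds one sample.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Predicted fraud probability for each held-out payment.
tree_scores = tree.predict_proba(X_test)[:, 1]
print(tree_scores[:5])
```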

One issue with decision trees is that they can easily be overfit (a very deep tree in which each leaf has just one sample from the training data will often capture noise pertinent to each sample and not general trends), but random forest models can help address this. In a random forest, we train a large number of decision trees, but each tree is trained on just a subset of the data we have available, and when building each tree we only consider a subset of features for splitting. The predicted probability of fraud is then just the average of the probabilities produced by all the trees in the forest. Training each tree on just a subset of the data, and only considering a subset of the features as split candidates at each node, reduces the correlation between the trees and makes overfitting less likely.
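A minimal sketch of a random forest, assuming scikit-learn's `RandomForestClassifier`. The data is synthetic, and the number of trees and the `max_features` setting are illustrative choices. The last lines check the averaging described above by comparing the forest's output with the mean of its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the payments data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 2000
amount = rng.uniform(1, 200, n)
card_use_24h = rng.integers(0, 5, n)
country = rng.integers(0, 3, n)  # 0 = AU, 1 = GB, 2 = US
X = np.column_stack([amount, card_use_24h, country == 0, country == 1]).astype(float)
y = rng.random(n) < 1.0 / (1.0 + np.exp(-(card_use_24h - 3.0)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Each tree sees a bootstrap sample of the rows and, at each split,
# only a random subset of the features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

forest_probs = forest.predict_proba(X_test)
tree_average = np.mean(
    [t.predict_proba(X_test) for t in forest.estimators_], axis=0
)
print(np.allclose(forest_probs, tree_average))
```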

To summarize, linear models like logistic regressions are appropriate when the relationship between the features and the target variable is linear or when you’d like to be able to isolate the impact that any given feature has on the prediction (as this can be read off the regression coefficient directly). On the other hand, nonlinear models like decision trees and random forests are harder to interpret, but they can capture more complex relationships.

Productionizing machine learning models

Training a machine learning model as described here is really just one step in the process of using machine learning to solve a business problem. As described above, model training generally must be preceded by the work of feature engineering. And once you have a model, you need to productionize it, i.e., make it available in production to take action appropriately (by blocking payments assessed to be fraudulent, for example).

While we won’t go into detail here, productionization can involve a number of challenges—for instance, you may use Python for model development while your production stack is in Ruby. If that is the case, you’ll either need to “port” your model to Ruby by serializing it in some format from Python and having your production Ruby code load the serialization or use a service-oriented architecture with service calls from Ruby to Python.
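As a rough illustration of the serialization route, a logistic regression can be reduced to its feature names, coefficients, and intercept and written out as JSON, which most languages can read. The sketch below is plain Python with hypothetical parameter values; in practice they would come from the fitted model, and the `score` function stands in for what the production code would reimplement.

```python
import json
import math

# Hypothetical fitted parameters (made up for illustration).
model = {
    "features": ["amount", "card_use_24h", "cc_AU", "cc_GB"],
    "coefficients": [0.0001, 0.8, 0.3, -0.2],
    "intercept": -3.0,
}
serialized = json.dumps(model)


def score(serialized_model, payment):
    """Load the serialized parameters and compute Probability(fraud)."""
    params = json.loads(serialized_model)
    linear = params["intercept"] + sum(
        coef * payment[name]
        for name, coef in zip(params["features"], params["coefficients"])
    )
    return 1.0 / (1.0 + math.exp(-linear))


payment = {"amount": 1211, "card_use_24h": 3, "cc_AU": 0, "cc_GB": 1}
print(score(serialized, payment))
```

Tree-based models need a richer format than a list of coefficients, which is one reason the service-oriented option can be attractive.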

A problem of an entirely different nature is monitoring: you’ll also want to maintain model performance metrics in production (as distinct from metrics as computed on the validation data). Depending on how you use your model, this can be difficult because the mere act of using the model to dictate actions can result in your not having the data to compute these metrics. Other articles in this series will consider some of these problems.

Supporting materials

A Jupyter notebook with all the code examples above can be found here, and sample data for model training can be found here.

About the Author

Michael Manapat (@mlmanapat) leads work on Stripe’s machine learning products, including Stripe Radar. Prior to Stripe, he was an engineer at Google and a postdoctoral fellow in and lecturer on applied mathematics at Harvard. He received a Ph.D. in mathematics from MIT.
