## Transcript

Michailidis: My name is Marios Michailidis. Thank you very much for coming today. I know this is the last session, so people must be tired. So I'll try to make it as entertaining as possible. And I started by throwing my mouse down. Today, I would like to show you our idea of what we call automated machine learning, and I will go through various aspects of it. Hopefully you will find a lot of these elements useful, and you should be able to apply them, and get similarly good results for something that is a lot of hype right now, to say the least.

Just a few words about me. I work as a data scientist for H2O.ai. H2O is a company that primarily creates software for predictive analytics. This is software that can help you extract information about the data, but specifically link the data with some outcomes. For example, I have historical credit history data for some customers. Can I predict whether they will default in the future? Or, I know how many products, or what products a customer has bought; what products will he or she buy in the future? So, different ways to link your data with certain outcomes, which is used a lot to drive business forward.

I've done my PhD in ensemble methods, and when I say ensemble methods, I mean how we combine many different algorithms in order to get better results. You may have heard of different algorithms in this space, like deep learning, random forest, logistic regression. There are many different techniques. I thought that instead of trying to reinvent the wheel, I would try to find a good way to combine all these algorithms to get a better outcome, and it worked quite well.

Something that I am quite proud of is that I have competed a lot in this platform called Kaggle. I'm not sure if people have heard of it. It's like Formula One for analytics. It's owned by Google, and they host predictive modeling competitions. So, different companies like Facebook, Google, Amazon, will give some data, and they will say, "Can you predict something out of this data?" And they make this a competition. So the ones who make the best predictions are going to win monetary prizes or get hired at these companies. They have also created a ranking system. So the more you compete, the more points you accumulate, similarly to tennis, if you know how tennis works. At some point, I was able to get ranked top out of 500,000 data scientists, after winning multiple competitions. But what I really take away from this is that because I have competed a lot and tried many different problems, I have been able to take some of this knowledge, having seen what works for different problems, and kind of incorporate it into these software solutions, but generally, into processes that help make predictive analytics more efficient.

## Challenges in the Machine Learning Workflow

I think a data science problem, like a prediction problem, is quite often formulated like what I'm showing on the screen. On the left hand side, you have a data integration phase where you pull data from different data sources, for example, different tables in a database. Consider a default prediction: you might want to bring the credit history of a customer or some descriptive details about the customer, like his age, gender if you're allowed, or you can also bring some demographics data. So you have this data integration phase, you bring the data together, you create a tabular view of your data, where each row is a customer. Then normally, you define what you want to predict. For example, I want to predict default next month based on all this data that I have up until now. And then you have a very iterative process where you conduct different experiments. You try to see the data from different angles in order to get better results. This process nowadays is very repetitive, because there are a lot of tools out there and people experiment in order to find what really works best.

Normally, people go through some insight and some visualizations in order to understand the data. Then they find a way to test how well the predictions will work. They try to recreate testing environments where they can try many different things and see how well those approaches would work. And then they spend a lot of time to transform the data. This is the concept we call feature engineering, which I will go into in detail later on. Then you have a set of different algorithms that can help you get these predictions, and there could be quite many to choose from. Then you have a stage where you actually try to tune the parameters for these algorithms. Just picking an algorithm doesn't solve the problem. You need to make certain that this algorithm is specifically tuned for the problem you're trying to predict or to optimize for.

Then, you may want to remove features which you no longer need, as you try to manage the complexity of what you've built so far. Then there may be a process where you try to combine all the different approaches that may have worked so far, and sort of create a more powerful approach out of this. Then you may spend some time to understand or interpret the results. This is for your own knowledge, and sometimes it can help debugging, but at the same time nowadays, it has become a requirement from a regulatory point of view to be able to explain why your model is making the predictions it does, particularly in banking. For example, when you deny credit to someone, you need to be able to say it's because of this, this, this, and that. Then obviously, you find a way to make your code ready to put into production. So, these things can be very iterative, especially from the feature engineering part up to the ensemble. It's a very iterative process where people try a lot of different things to see what works and what doesn't.

## Visualizations

We often start with some visualizations, and the main idea is we try to understand what the data tell us. We try to look at some of the distributions of some of the variables, and potentially spot outliers. Outliers are observations which are very different - not different as in a normal distribution - let's say from the majority of your people. These are cases which might make your model, your predictions, a little bit off just because they are very extreme cases. What you always need to ask yourself is, are these cases sensible even if they are extreme, or are they wrong? This is normally a phase where, if they are wrong, you just have to remove them. But if they're just extreme cases, sometimes it's better to deal with them through other methods, for example, algorithms that can be more robust to these kinds of problems.

Another thing you can do is look for correlations within your data. Again, it's to increase your own understanding about the dynamics of your data, but at the same time you don't want to duplicate information. If, let's say, two features tell you pretty much the same thing, you normally don't need one of these fields. There are other graphs, for example, like a heat map, that can help you transpose your data and look at it from different angles, and find different patterns. The main idea is you try just to see the data from different angles and just get an understanding, because this can help you later on to potentially formulate the problem better. But you also get confidence that an automated approach is really picking up the right things, because you can also see them.

I can give you an example. I remember we were trying to predict for an insurance company which policy someone was going to take. After looking at the data, you could see that 99% of the people were renewing. So actually, it was more valuable to say, "Will someone renew, yes or no?", breaking it down into two problems. If someone had a low probability to renew, then you try to find which policy was going to be picked, rather than trying to predict the policy straight away from the beginning. So these are just ideas of how you could formulate the problem better if you have more understanding about your data and what they tell you.

## Feature Engineering

Feature engineering is a very critical process, from my experience, especially in predictive modeling. It is very important to find the right representation for your features in order to help the algorithms associate them with what you try to predict. Let's say there is a column, a feature, a variable in your data called animal, that takes different distinct values like dog, cat, and fish. A lot of the algorithms that are used in machine learning right now understand numbers, they don't understand letters. So you need to find a way to represent that numerically, to help them associate it with what you try to predict.

One way to do it would be to use something called frequency encoding, where you just count how many times dog, cat, fish appear in the data and replace the value with this count. Then you have a variable that says how popular an animal is. A lazier way to do it is to just assign a unique index to each of these categories after you sort them alphabetically. Something that people do very commonly is to treat each one of these categories as a single binary outcome. So is it a dog, yes or no? Is it a cat, yes or no? Something that people in machine learning do with great success, let's say if you want to predict something like cost, is to estimate the average cost for each one of your different kinds of animals, in this case, and create a variable that actually represents that. Instead of using dog and cat, you represent the average cost. So you have a variable that already maps to what you try to predict, and quite often this helps your algorithms achieve better results.
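As a rough sketch, these encodings might look as follows in Python with pandas; the animal and cost values here are made-up toy data for illustration:

```python
import pandas as pd

# Toy data: a categorical "animal" column and a numeric "cost" target.
df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "fish", "dog", "cat"],
    "cost":   [50.0, 30.0, 55.0, 10.0, 60.0, 25.0],
})

# Frequency encoding: replace each category with how often it appears.
df["animal_freq"] = df["animal"].map(df["animal"].value_counts())

# Label encoding: a unique index per category, sorted alphabetically.
codes = {a: i for i, a in enumerate(sorted(df["animal"].unique()))}
df["animal_label"] = df["animal"].map(codes)

# One-hot encoding: one binary column per category ("is it a dog, yes or no?").
onehot = pd.get_dummies(df["animal"], prefix="is")

# Target (mean) encoding: replace each category with the average cost.
df["animal_target"] = df["animal"].map(df.groupby("animal")["cost"].mean())
```

One caveat worth adding: in practice the target averages should be computed on training data only (or with out-of-fold averages), otherwise the target leaks into the features.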

Another common transformation in this space, when you have a numerical variable, is binning. When I say binning, it's essentially converting the numerical feature to categorical by defining some bands. Say age from here to here will be one band, then from 40 to 50 another band. What you're trying to do is capture shifts in relation to your target variable. For example, when age is low, generally income is low, but that increases with age. When we reach middle age, income still increases, but potentially at a lower pace. Then when you start reaching retirement, income starts decreasing. So ideally, you want to be able to capture these shifts, and this is where you might want to create these bands, so that you help algorithms understand that these are the areas where I can see a shift in what I'm trying to predict, so I want you to focus on them.
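A minimal sketch of binning with pandas; the band edges and labels below are illustrative choices, not the ones from the talk:

```python
import pandas as pd

# Hypothetical ages; bin edges are arbitrary for this example.
ages = pd.Series([18, 25, 34, 42, 48, 57, 66, 71])

# pd.cut turns the numeric feature into categorical bands.
bands = pd.cut(ages, bins=[0, 30, 40, 50, 65, 120],
               labels=["young", "30s", "40s", "50-65", "retired"])
```

In practice you would pick the edges where the target (income, in the example above) visibly shifts, rather than at arbitrary round numbers.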

Let's say for the sake of the argument, these are the bins. The original variable would be transformed to categorical using this approach. Another thing that you might do if you have missing values is use the mean or median, or another methodology, to replace those missing values, to help algorithms make predictions for these cases where you might not have much information. There are other transformations, too, that normally help the variables become better inputs for other algorithms, like taking the logarithm or the square root of a numerical feature.
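A small sketch of these two transformations, assuming pandas and made-up income values:

```python
import numpy as np
import pandas as pd

# Toy income column with missing values.
incomes = pd.Series([30000.0, 45000.0, np.nan, 52000.0, np.nan, 38000.0])

# Replace missing values with the median of the observed values.
filled = incomes.fillna(incomes.median())

# Log transform (log1p also handles zeros safely) to tame skewed values.
log_income = np.log1p(filled)
```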

Something that people do is they search a lot for interactions between different features. They try to get combined information from two features into one, sometimes making it more powerful. For example, they could try to multiply two features, or add them. If you have two categorical features, you can do the same thing. You can just create one concatenated string out of two strings. Or again, if you have a categorical and a numerical variable, you can still create an interaction. You can use this target encoding approach I mentioned before, where you can estimate the average age, in this case of an animal, and replace the category with that. And you don't even need to limit yourself to just the mean; you can take other aggregated measures too, like maximum values or standard deviation.
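A sketch of these three kinds of interactions in pandas; the columns and values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "fish"],
    "color":  ["brown", "black", "white", "gold"],
    "age":    [4.0, 2.0, 6.0, 1.0],
    "weight": [20.0, 4.0, 25.0, 0.1],
})

# Numeric x numeric: multiply (or add) two features into one.
df["age_x_weight"] = df["age"] * df["weight"]

# Categorical x categorical: concatenate into one combined category.
df["animal_color"] = df["animal"] + "_" + df["color"]

# Categorical x numeric: aggregate the number per category
# (mean, but max, standard deviation, etc. work the same way).
agg = df.groupby("animal")["age"].agg(["mean", "max", "std"])
df["animal_mean_age"] = df["animal"].map(agg["mean"])
```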

Similarly, other types of data, like text, require different ways to transform the data to get a numerical representation. Something that people commonly do is they try to tokenize the sentences you have in your data. So they try to find all possible words within your data, and then they represent each single row, each single record in your data with which words are triggered, essentially, and how many times they're triggered in this sentence. This is called the term frequency matrix, which then can be fed exactly with this input to algorithms.

Or you can try different approaches that try to compress this, because as you can see, this may be very large if you consider all possible words. So there are techniques to do what we call dimensionality reduction, to try to compress all the information you have there into fewer fields. Another trick that people use is this Word2vec approach. There are techniques that can associate a word, through a series of numbers, to a vector of numbers in a way that you can do mathematical operations between words. So if from the word "king," you remove the word "man," the closest outcome is the word "queen." It doesn't always work so well or give you such interesting results. But generally, these vectors are very good at giving you an identity for what the word is telling you about. Normally, creating features out of this can give a very good essence of what your text is about.

There are other types of problems, for example, time series. Again, there are many different ways to represent the data. You can extract day, month, year from a date, and you can use these as your features and feed them to your predictive algorithms. Or, you can use what we call autoregressive features. So if I know the sales yesterday and two days ago, can I use these as features to predict sales today? I can also create moving averages, or averages of this past information, and I can do this for multiple past values in order to try and get a trend, a historical trend, for where my target variable is going to move.
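The same ideas as a pandas sketch, on an invented daily sales series:

```python
import pandas as pd

sales = pd.DataFrame({
    "date":  pd.date_range("2024-01-01", periods=8, freq="D"),
    "sales": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0, 18.0],
})

# Date parts as features.
sales["day"] = sales["date"].dt.day
sales["month"] = sales["date"].dt.month

# Autoregressive features: yesterday's and two-days-ago sales.
sales["lag_1"] = sales["sales"].shift(1)
sales["lag_2"] = sales["sales"].shift(2)

# Moving average of the last 3 days, shifted so it only uses past values.
sales["ma_3"] = sales["sales"].shift(1).rolling(3).mean()
```

The `shift(1)` before the rolling mean matters: without it, today's value would leak into its own feature.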

## Common Open Source Packages Used in ML

These are some of the open source packages that we've utilized, and I have also seen them be very successful in the predictive modeling context. LightGBM is made by Microsoft; it has very fast implementations of gradient boosting trees, similar to XGBoost and CatBoost. Scikit-learn has pretty much everything. I guess their best implementations, for me, are some of the linear models and the random forest packages. H2O also has a strong open source [inaudible 00:18:20] with different algorithms. Keras, TensorFlow, and PyTorch are very good for deep learning implementations. Vowpal Wabbit is a good tool for very fast linear models.

Then there are many other packages that try to help you pre-process the data and do different analyses. Just picking one of these algorithms, as I briefly mentioned before, is not enough. These algorithms are very parameterized; you need to find a good set of parameters that can help you maximize the performance against what you are trying to predict. Consider something like XGBoost, which contains many, many decision trees and takes a weighted average of them. Something that you might want to find is the maximum depth you should have for each one of these trees, and again, different problems will have different best values. You can play with different loss functions to expand these trees, or you can control the learning rate: how quickly should the algorithm learn, how much should one tree depend on the other, how many trees you want to have in your ensemble, and many, many other hyperparameters.
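As an illustration of these knobs, here they are under scikit-learn's gradient boosting implementation on toy data; XGBoost and LightGBM expose equivalents under slightly different names (for example `eta`/`learning_rate`, `n_estimators`, `objective`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy classification data standing in for a real default-prediction problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = GradientBoostingClassifier(
    max_depth=3,        # maximum depth of each individual tree
    learning_rate=0.1,  # how quickly the algorithm learns, tree to tree
    n_estimators=100,   # how many trees in the ensemble
    random_state=0,
)
model.fit(X, y)
```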

## Validation Approach

Normally, for all these packages, there is good documentation about the hyperparameters you need to tune in order to get good results. But even if you know these parameters, in order to get good results you have to be able to create a good validation strategy. So you need to be able to replicate an environment similar to what you are really going to apply your algorithm or your model on. For example, if you have a time-series problem, where you need to make predictions in the future and where time is really, really important, you always need to formulate your experiments to train your models, or make your approaches, using past data, and evaluate them on future data.

There are many different strategies here. Just as an example, you can have this moving window approach where you always train on some past data, which is the yellow part. This is where you train your algorithm, you try your features, you try hyperparameters, but you evaluate them on the gray part. You can do this with multiple different periods, in order to have an approach that's robust and can work in any time period. If the time element is not so important, then you normally use something called K-fold cross-validation or similar, where you divide your data set into K parts. Let's say, in this case, four parts; iteratively, what you would do is pick a subset of the data where you can try different things, different features, different algorithms, and then you can train a model.

Then you can use another subset of the data for validation to see how well you've done, let's say how much accuracy you have achieved. And then you repeat this process, but the next time, you're going to leave another part of the data outside this procedure. Again, you're going to make predictions, you're going to see how well you've done on this part. Basically, you continue this process until you get a fairly reliable estimate about what your approach is giving you. Normally, scikit-learn has very good packages to help you do this.
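Both strategies can be sketched with scikit-learn on toy data; a random forest stands in for whatever model you are validating:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# K-fold: train on K-1 parts, validate on the held-out part, then rotate
# until every part has served as the validation set once.
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=kfold)

# When time matters, a moving-window split only ever validates on later rows.
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()  # never train on the future
```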

## Genetic Algorithm Approach

So we have proposed a way to do this. As you can see, with different algorithms, different hyperparameters, different features, there are a lot of combinations that could work here. Ideally, you need to find a way to quickly iterate through some of these elements and be able to get good results, because you just don't have the capacity to search every possible combination. A nice approach is to follow an evolutionary approach. Just to demonstrate how that would work, let's say you have some features in your data and you try to predict an outcome, let's say whether someone will default. The way this would work is, initially, you can take these four features; you don't need to make many transformations on them, just use them as they are. Then you pick an algorithm and you try to predict the target, maybe within this cross validation approach. And then this will give you an X percent accuracy.

Also, you can see how much the algorithm relied on different features in your data in order to give you that accuracy. So what you can do next is use this information in order to make the next pass, the next iteration of looking at the same problem, a bit more efficient. So now, you can infer that maybe this x1 feature is probably not very important, but x2 and x4, it seems that the algorithm relied more on them. So when you start the new iteration, you can focus more on these two features that seem to have worked well. You can capitalize on leveraging their interaction, or just find better representations for these features.

But at the same time, because you don't want to get trapped into this very directed approach, you allow some room for random experimentation. You can still look at the other features in a fairly random way, so that you don't miss something that you just didn't see on the first approach. You essentially repeat this process. So there is this exploration-exploitation element: I always exploit what I see is giving me back some results, but at the same time I allow some room for exploration. You can repeat the same process: pick an algorithm, go through a cross validation approach, get a new accuracy. This will come back with a new ranking of features, which you can then use to make the next iteration through the loop a bit more directed.
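A heavily simplified sketch of this exploitation-plus-exploration loop, using a random forest's feature importances as the ranking signal; the subset sizes and iteration count are arbitrary choices for illustration, not what any particular tool does:

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)
features = list(range(X.shape[1]))
random.seed(0)

subset = features[:]  # first pass: all features, no transformations
for iteration in range(3):
    model = RandomForestClassifier(random_state=0).fit(X[:, subset], y)
    score = cross_val_score(model, X[:, subset], y, cv=3).mean()
    ranked = sorted(zip(model.feature_importances_, subset), reverse=True)
    # Exploitation: keep the features the algorithm relied on most...
    keep = [f for _, f in ranked[:6]]
    # ...exploration: plus a couple of random others, so nothing is missed.
    explore = random.sample([f for f in features if f not in keep], 2)
    subset = keep + explore
```

A real evolutionary search would also mutate feature transformations and hyperparameters, not just the feature subset.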

## Feature Ranking Based on Permutations

A lot of people have asked me in the past, how can you decide which features are important? A technique that has worked quite well in a predictive context is this: consider having a data set. What you can do is divide it into what we call training and validation. You can take an algorithm and fit it on the training data. In this case, I can use these four features to try to predict this binary outcome. And then you can make predictions on this validation data set. Let's say for the sake of the argument, I can get an 80% accuracy. So what you do next is take one of the columns in the validation data and randomly shuffle it. So you have one feature which is wrong in your data, but everything else is correct. You just repeat the scoring with the same algorithm. You don't retrain it; it's the same model. You just repeat the scoring, and you can see how much the accuracy has dropped.

So let's say now the accuracy went to 70%; this 10% difference is how important this feature was to your algorithm, and this ranking is extremely good for filtering which features your algorithm prefers to drive predictions, and discarding the noise. Then you just repeat this process. You bring this feature back to normal, then you move on to the next feature and do the same thing: randomly shuffle it, score, see how much the performance dropped.
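The shuffle-and-rescore loop can be sketched like this (scikit-learn also ships a ready-made `permutation_importance` helper, but the manual version shows the mechanics):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Fit once on the training part; all scoring below reuses this same model.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = model.score(X_val, y_val)

rng = np.random.default_rng(0)
drops = {}
for j in range(X_val.shape[1]):
    X_shuffled = X_val.copy()
    rng.shuffle(X_shuffled[:, j])  # corrupt only this one column
    # The drop from the baseline is how much the model relied on feature j.
    drops[j] = baseline - model.score(X_shuffled, y_val)
```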

## Stacking

Maybe one last thing I wanted to mention is that it may not be only one approach that was good. So throughout the process you might have found different models that actually work well. They may not have given you the best score, but you might want to consider blending them together in order to see if you can get a better result. We normally use an approach called stacking in order to do this. Just to demonstrate how this would work, consider we have three different data sets, A, B, and C. A has a target variable, and B has a target variable, but C is your test data set and doesn't have a target variable; it is the one you want to try and predict.

So what you could do is build an algorithm on A, and make predictions for B and C at the same time. So remember, for B, we actually know the target results. For C, we don't know. And we save these predictions; we create two new data sets just to save these predictions. Then we can pick a different algorithm, let's say a random forest this time. I do the same thing. I fit it on A, and I make predictions for B and C at the same time, and I stack these predictions onto my newly formed data sets. Then I can do this with as many algorithms as I want, basically. Then I have a data set which essentially is the predictions of all these algorithms, but I also know the actual target for this data set (it was B, so I know it), so I can now fit a new algorithm on this data set to predict on C. In that way, I have made all the predictions of the previous algorithms features in a new algorithm.
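A sketch of this stacking procedure with two base models; the A/B/C split and the choice of models are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# A and B have targets; C is the test set we ultimately want to predict.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_A, y_A = X[:300], y[:300]
X_B, y_B = X[300:500], y[300:500]
X_C = X[500:]

preds_B, preds_C = [], []
for base in (LogisticRegression(max_iter=1000),
             RandomForestClassifier(random_state=0)):
    base.fit(X_A, y_A)                             # train each base model on A
    preds_B.append(base.predict_proba(X_B)[:, 1])  # predict B (target known)
    preds_C.append(base.predict_proba(X_C)[:, 1])  # predict C (target unknown)

# The stacked predictions on B become features for a meta-model...
meta = LogisticRegression().fit(np.column_stack(preds_B), y_B)
# ...which then predicts C from the base models' predictions on C.
final = meta.predict(np.column_stack(preds_C))
```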

## Machine Learning Interpretability

One last thing I wanted to touch on is machine learning interpretability. There's a lot of discussion about this now: how we can make models more accountable. Generally, machine learning interpretability means being able to explain in an easy format how the model is making different predictions. I guess there are not just two ways, but there are two concepts regarding machine learning interpretability. The first is, I can make a model which is very interpretable, and I can tell you exactly how it's making predictions, but it's not going to be that accurate. For example, I can see historically from my data that if everyone is less than 30 years old, average income is, I don't know, 30K. And if someone is more than 30 years old, let's say average income is 45K. So that's my model. It's a very interpretable model. I know exactly how it works, but it's not going to be very accurate, obviously, because I'm not using any other information.

A different approach is to actually use a very complicated model that is able to achieve very high accuracy, combining many, many features at the same time in multiple ways, and then try to give approximate explanations for this model. We do this normally by using something called surrogate models. So we have a complicated model that makes predictions, and we take these predictions and we use them as inputs to simpler models. So instead of trying to predict our original target, whatever that was, let's say whether someone will default or not, we try to predict the output of our complicated model in order to understand it, and we use the simpler model for interpretation. For example, we can use a decision tree in order to understand how our model is making predictions, and we can use this tree for explanations about how our complicated algorithm works, and provide that information back if someone, a regulator, asks for it.
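A minimal surrogate-model sketch: a random forest stands in for the complicated model, and a shallow decision tree is trained on its outputs rather than the original target:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# The complicated, accurate model.
complex_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Surrogate: a shallow tree trained to predict the complex model's outputs,
# not the original target, so its rules approximate how the big model behaves.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, complex_model.predict(X))

rules = export_text(surrogate)  # human-readable if/then explanation
```

How faithful the surrogate is to the complex model (its fidelity) should be checked before trusting its explanations.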

## H2O is the Open Source Leader in AI

I wanted to very quickly say that the mission of H2O is to democratize AI, to give this knowledge back through open source software; we also have some proprietary tools. So on the left hand side, all these are packages. You can say they are similar to scikit-learn, only they have been built in Java and they can be distributed. But we also have this Driverless AI tool, which essentially tries to bind together all the elements I've mentioned so far, the validation approach, the feature engineering, the selection of algorithms, the hyperparameter tuning, the [inaudible 00:31:53], in an easy-to-use way, so that it can enhance and empower people to have access to this predictive analytics process.

The way this would work is, let's say you have a data set again, some variables, and you try to predict a certain outcome, let's say whether someone will default or not. You define an objective function: “I want to maximize accuracy,” or, “I want to minimize a form of error.” Then you allocate some resources, like how long you want the software to run. Based on the hardware you have, which might be CPUs or GPUs, it's going to run many, many experiments using this evolutionary approach, and then it's going to come back with some results. Results could be visualizations, could be the feature engineering, so how we transformed the data in order to get the best outcomes. Obviously, it can be predictions, or this model interpretability module, where we try to give back an explanation of how the model has made the predictions it does. Then you can also download the pipeline, either in Python or Java, in order to be able to do the scoring. So essentially, we write the scoring part of the code, and we make it available in these two languages, which you can then integrate with different systems.

Again, I don't have much time, so I just wanted to maybe very quickly show what this software looks like. Hopefully the internet hasn't failed me. It hasn't. So in this case, I just have two data sets about credit cards. I can quickly see the distributions for some of the variables. Essentially, each row represents a card holder, and I want to be able to understand whether someone will default or not next month. One means default, zero means the person did not default. I have some information about the credit card, like maximum balance, age, and also these payment statuses that show whether the person paid the credit card in the previous one, two, three, four, five months.

What I could do is essentially run an experiment. So I say, for that data set, I want to associate all these variables with predicting whether a person will default next month, and I can provide a test data set too, so that it can give me predictions. These settings here control how long the software is going to run, how much data it's going to use, and which algorithms it's going to use. Also, we have these settings that control how interpretable the final model will be. So putting a higher value here will make certain that the model is very interpretable, potentially at the cost of some accuracy.

Then you can just launch the experiment. What the software is going to do is remove redundant features, quickly start building models, and create a ranking of which ones are the most important features. Dynamically, you will also start seeing some results in terms of the metric you have chosen. In this case, I have selected to maximize something called AUC. While the software runs, it doesn't take much time, you can visualize the data. An interesting thing about the visualization is that it will mostly show patterns which are important. It will scan through everything, but it will only give you back what seems to be important. For example, features which are correlated together, or outliers, which you can click on to see specifically why this row was an outlier, for example because it was billed high amounts in past months. And while you wait for the experiment to finish, it dynamically starts giving you some results ...

I'm a bit cautious about the time, so I would actually like to stop here. I think it took a bit longer than I anticipated. Yes, I would like to thank you for your time, and I'm happy to take any questions. Also, if you want to connect with me, these are my details.
