Going beyond the Case of Black Box AutoML



Kiran Kate covers the basics of AutoML and then presents Lale, an open-source scikit-learn compatible AutoML library which implements Gradual AutoML.


Kiran Kate is a Senior Technical Staff Member working in the AI Programming Models department at IBM Research. She has been working in ML/AI for the past 13+ years and has built several solutions and frameworks using machine learning. She has published in top AI conferences and has filed patents in this area.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Kate: At the conference, there was a mix of techniques whenever ML came into the picture. There were the latest and greatest generative models. There were also mentions of some traditional machine learning, including the talk where we talked about XGBoost. We talked about BERT, which is traditional now. Then if you listened to the Spotify talk, they talked about a lot of machine learning models they use, which they indicated had a traditional machine learning component to them. What I'm going to talk about in the context of AutoML, the implementation that I mention, will apply to traditional machine learning techniques, and to early forms of deep neural networks as well. The implementation does not today apply to generative models, though, conceptually, there is no reason why it shouldn't. I will mention why it's a bit challenging: it's expensive to do AutoML with very expensive models. Now with that context, I also wanted to understand if any of you have used AutoML in the past.

Based on that, I'm going to spend a little bit of time explaining what AutoML is, so that at least for those of you who haven't used it, that's the takeaway for you. Those of you who have used AutoML would know that one of the marketing pitches of AutoML is that it can be done with three lines of code, or with just a few clicks on a graphical user interface. It works great for many cases, but there are many assumptions that go into this black box approach. What we want to do is to go beyond the black box AutoML, understand what exactly is going on within AutoML, and how you can customize and control it. Because as ML practitioners, many of us know that model building is an iterative process. If building a single model is an iterative process, why not AutoML? We will see how it can be done, and what the control points are for you.

Manual Machine Learning

To set some terms, before I go into AutoML, I want to highlight the framework or the terminology I'll be using in describing machine learning. This is manual machine learning in this context. You have a dataset and you have a task, say classification, or regression, or whatever prediction task you have. Then you have a trainable pipeline, which is a model that you're considering, so XGBoost, let's say, along with some data preprocessing steps. Machine learning is all working on floating-point numbers. If you have text, or if you have categorical attributes, you first need to convert them into numbers. Some of those steps can go into data preprocessing, into this pipeline. Once you pick the models that you want to use, you will also specify some hyperparameters. If you're using XGBoost, it's an ensemble of trees, so you're going to say this is the number of trees I want to use. That's a hyperparameter. In a trainable pipeline, you've given all specifications clearly. You say what operations you want to do and what the hyperparameters for those are. Once you have it, you have a dataset which you use to train the model. The train step converts a trainable pipeline into a trained pipeline. By that we mean that there are model parameters. If you're familiar with deep learning, the weights and biases are the parameters of a model, and by this step, they're fixed. They're all learned from your training data. Then, of course, you're not doing it in a silo. You want it for a specific task, which means that you have a metric against which you want to evaluate it. You evaluate it, maybe you're happy the first time, maybe you're not, and then you keep on iterating over this process by changing some of the things. You could change the dataset, or you could change the trainable pipeline's hyperparameters, and so on. We will come back to this definition of pipeline. As a data structure, it's a DAG, a directed acyclic graph that is composed of multiple operations. Once you're happy with your model, that's the trained pipeline, and that's the one you're going to deploy.
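The trainable-to-trained distinction can be illustrated with a minimal sketch in plain Python. The classes MinMaxScale and KNearest, the train and evaluate helpers, and the tiny dataset are all hypothetical stand-ins written for this illustration; they are not the XGBoost pipeline from the talk. The hyperparameter k is fixed before training, while the scaler bounds and the stored examples are parameters learned during training.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MinMaxScale:
    """Preprocessing step: its learned parameters are the min and max of the data."""
    lo: float = 0.0
    hi: float = 1.0

    def fit(self, xs: List[float]) -> "MinMaxScale":
        self.lo, self.hi = min(xs), max(xs)
        return self

    def transform(self, xs: List[float]) -> List[float]:
        span = (self.hi - self.lo) or 1.0
        return [(x - self.lo) / span for x in xs]


@dataclass
class KNearest:
    """Toy 1-D k-nearest-neighbours classifier; k is a hyperparameter."""
    k: int = 3
    memory: List[Tuple[float, str]] = field(default_factory=list)

    def fit(self, xs: List[float], ys: List[str]) -> "KNearest":
        self.memory = list(zip(xs, ys))  # the "learned" state of this toy model
        return self

    def predict(self, xs: List[float]) -> List[str]:
        preds = []
        for x in xs:
            nearest = sorted(self.memory, key=lambda p: abs(p[0] - x))[: self.k]
            labels = [label for _, label in nearest]
            preds.append(max(sorted(set(labels)), key=labels.count))
        return preds


def train(scaler, clf, xs, ys):
    """Turn a trainable pipeline (scaler, clf) into a trained one."""
    clf.fit(scaler.fit(xs).transform(xs), ys)
    return scaler, clf


def evaluate(scaler, clf, xs, ys):
    """Accuracy of the trained pipeline on held-out data."""
    preds = clf.predict(scaler.transform(xs))
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


train_x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
train_y = ["bad", "bad", "bad", "good", "good", "good"]
test_x, test_y = [2.5, 10.5], ["bad", "good"]

trained = train(MinMaxScale(), KNearest(k=3), train_x, train_y)
accuracy = evaluate(*trained, test_x, test_y)
print(accuracy)
```

If the accuracy were not good enough, the manual loop would change k or the preprocessing and train again, which is the iteration the talk describes.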

Automated Machine Learning (AutoML)

What we discussed just now was manual machine learning, so without any tooling that assisted you in coming up with the best pipeline. Here is the promise of AutoML. What it says is, ok, you have a task, you have a dataset, you probably know which optimization metric you care for, which is the quality metric, so let's say accuracy, or recall, or precision. You might have some constraints. Time is a good constraint for AutoML. Given just these three inputs, you can feed them into the black box, and the black box will give you a ranked list of pipelines. Why is that useful? That's useful because there are so many choices of what operations you can do on your dataset for a given task. You can encode the data differently. You can aggregate it differently. I think there was a Facebook talk that talked about aggregations on datasets. Once you do these different kinds of preprocessing operations, there are, again, hundreds of models that you could explore to see what works for your dataset. There is no one-size-fits-all thing in machine learning, so you want to explore all of it. For each of the models, you have so many hyperparameters you could tune. In deep learning, the learning rate is a hyperparameter. The number of epochs is a hyperparameter. You don't know which is going to work, so instead of manually doing trial and error, AutoML is a tool that takes all the inputs from you, and it has an optimization backend. It's doing some mathematical formulation of your problem and solving it in multiple different ways. There are multiple optimizers available, and we'll talk about some of them. That's essentially what it is.

It's useful. Three lines of code sometimes can give you a very good quality model. Probably, that's enough for you. If it's not, then this is what the black box looks like internally. We already talked about the trainable pipeline to trained pipeline and the metrics part. The top three boxes are what AutoML is adding to it. A planned pipeline is a graph, again, coming back to a DAG, but now it's not trainable anymore. It has many choices that it can explore. It says, ok, try XGBoost, try logistic regression, try random forests, try LightGBM, and so on, if classification is your task, and I think it applies to regression as well. For data preprocessing, if you have categorical attributes in your dataset, which need to be encoded to numbers, then you could use one-hot encoding, or you could use categorical encoding, ordinal encoding. There are so many choices for each of those. This planned pipeline is going to create a big DAG with all these options, and then it will convert it into a search space. I should also mention that now for each of the nodes in the DAG, you already have ranges of hyperparameters you can try.

Given all of this, an AutoML tool is going to generate a search space that the optimizer understands. I talked about optimizers earlier, but let's say a very simple AutoML tool is grid search, where you just give a grid of points in the search space. You say try learning rate values A, B, C, and try number of epochs P, Q, R, and it's going to just do a cross-product of these and create a grid. That's what GridSearchCV will explore. There are other optimizers which are slightly more intelligent, and they use something we call sequential model-based optimization. What they do internally is they build a model to get a sense of how this particular learning rate is performing for this dataset. That learning rate is a point in the search space, and as they acquire that point and evaluate it, they would know that, ok, this learning rate seems to be better than the other learning rate I just explored. They will eventually learn the area of your search space that is supposed to be doing better. The optimizer itself is learning. It gets better over time, so that you get a good ranked list of pipelines.
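The cross-product idea behind grid search can be sketched in a few lines of plain Python. The candidate values and the validation_score function below are made up for illustration; a real tool would train and cross-validate a pipeline at each grid point instead of calling a formula.

```python
from itertools import product

# Candidate values for two hyperparameters (illustrative, not from the talk).
learning_rates = [0.001, 0.01, 0.1]
epoch_counts = [5, 10, 20]


def validation_score(learning_rate, epochs):
    """Hypothetical stand-in for 'train the pipeline and score it on validation data'."""
    return 1.0 - abs(learning_rate - 0.01) * 5 - abs(epochs - 10) / 100


# Grid search evaluates every point of the cross-product, with no learning
# between evaluations, and ranks the results.
grid = list(product(learning_rates, epoch_counts))
ranked = sorted(grid, key=lambda point: validation_score(*point), reverse=True)
best = ranked[0]
print(len(grid), best)
```

A sequential model-based optimizer differs in that it would use the scores seen so far to decide which point to evaluate next, rather than enumerating the whole grid.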

Programming Model for Gradual AutoML

Many of the AutoML tools that exist today, whether open source packages or the offerings of cloud providers, most of whom have their own ML services and AutoML equivalents, hide this from you. Meaning the planned pipeline is predefined, and it is obviously an educated guess. The planned pipeline is restricted to models that are known to work well. In a way, it's a good starting point. Then, if you're a very experienced data scientist, you might have some intuitions or restrictions or domain specific constraints that you want to encode in this process. One example: I work at IBM Research, and one of my colleagues was working with our semiconductor facility. The semiconductor engineers wanted to learn a decision tree, but learn rules from it. They were not so happy with very huge rules. What they said is, give me a decision tree, but limit the depth to five. If you're using an AutoML tool that's a black box, there is no way you can do that. Or maybe I should modify my sentence and say that there is a way you can do that, at least with open source libraries, by modifying their code. That's not what you want to do; you want a more user-friendly option. What we did at IBM Research was we proposed a new programming model that we call gradual AutoML. The idea is that we want to go beyond the black box. We want to give ML practitioners more control over how they do this AutoML process, but iteratively. This is a paper that we published in NeurIPS 2021. NeurIPS is a top AI conference, and AutoML is one of the areas in many of the AI conferences today. Actually, AutoML has its own conference, as well.

Given the programming model, let me just show you what it means. This graph on the left is showing you the different levels of control that you can get. At the base is total automation, so three lines of code. Then you want more, you want to understand what happened. What has the pipeline search returned, and how can I deploy it in some other place? You can do some more inspection on the output of AutoML. That's the second step. The third step is, if you want to define your own search space, how can you do it easily? The planned pipeline, the DAG that we talked about, how can you define it using some simple constructs? The next step is to control the search space in terms of what ranges you want to explore, and so on. You can add your own custom operators. I'll talk about some of the machine learning operators for bias mitigation, for example. You can use those in AutoML. There are more controls that go up to writing your own pipeline grammars and optimizers. We will not talk about the last two steps, because those are really for expert users.

This programming model itself was built on three principles: progressive disclosure, least surprise, and orthogonality. If you're familiar with the Python ML ecosystem, then you know many of the existing open source libraries, and Scikit-learn is one of them. The principles of least surprise and progressive disclosure both take away the burden on the user of learning a new library. You don't need to learn a new library. Much of what I will show uses Scikit-learn compatible constructs. If you've used Scikit-learn, you know it's widely used, very popular. TensorFlow, Keras, and XGBoost have their own APIs compatible with Scikit-learn, so we will be using many of those. Orthogonality means that we introduce only a small set of constructs, so that everything you need to do here can be done using that small set of constructs.

Dataset (credit-g from OpenML)

For all the discussions, I will be using this credit-g dataset from OpenML. Again, if you're a practitioner, you're familiar with this dataset. It's a public dataset for credit rating prediction. It's a small dataset, easy to understand. That's why I picked it. Most of these are available in the open source repository that I will point to, so they're under examples, you will see many of these notebooks. What I'm doing here is I'm fetching the dataset, and Lale is the open source project that I mentioned, and I will give a link towards the end. We have an API to fetch the dataset. If we fetch it, then we could see that it's about 1000 rows and 20 predictors. The class label, which is the first column here in the snippet that I printed, is credit rating, and two values it can take are, good or bad. Whether the credit rating is good or bad. In terms of features, we have a combination of categorical and numerical features, as you can see, and we will work with those.

Total Automation

This is the base level AutoML, which is three lines of code, actually four if I include the import. Lale is a library, and it doesn't have to be Lale; all AutoML tools give you this interface. They will have an AutoPipeline object, maybe it's called something different. In this case, it's AutoPipeline, and you have to give prediction_type. This is the task basically, whether you're doing classification or regression. Many of these techniques are for supervised learning out of the box. Again, they apply to unsupervised techniques as well, as long as you have an appropriate optimization metric. You can say prediction_type. You can give accuracy, which is the metric we talked about. max_opt_time is how long you want the AutoML to run the search for you. This is 90 seconds. This is a small dataset, so 90 seconds is probably enough. Then I take the trainable pipeline, and I call fit on it. There I give my X and y, which are my set of features and the class label. I get a trained pipeline as output. Then I use the same thing to get predictions on a test set. Those of you in the Python ecosystem probably are familiar with fit and predict. Here, as I said, the DAG, the planned pipeline, is fixed. It's predefined. You can probably edit the code to change it, but you don't need to because there are better ways to do that.

Inspect Pipelines

By doing this, I got a list of pipelines. If I call get_pipeline, I get my best pipeline on my dataset. I can visualize it. This is the visualization output. Again, Lale has it implemented, but Scikit-learn also now has good visualization. As you can see here, we had a combination of categorical and numerical attributes. Just to explain what this pipeline looks like, the first node here is project. What it is doing is it's projecting numerical attributes, literally project in the relational algebra sense, and doing a simple-imputer on them, because you want to do missing value imputation. The second part down here is, again, doing project on categorical attributes. It's doing simple imputation, but most likely the strategy is different. For the numerical attributes, you could do mean or median. For categorical attributes, maybe you could do most frequent, and so on. Then you need to encode them. This is doing one-hot encoding on top of it. Finally, you need to concat the two because you want to work on both types of features, so you're concatenating them. You are probably doing some dimensionality reduction. Here, by default, I think we have PCA and Nyström in AutoPipeline, and it chose PCA after doing the search. It then picked the classifier as XGBoostClassifier. If you hover over that rectangle, it's giving you all the configurations that AutoML picked for you. It says the XGBoostClassifier gamma value picked is 0.72 something, and so on. You can see everything that AutoPipeline picked for you. Then, most often, and we have seen this use case very much, people are interested in getting the code that creates their model, because they want to run it somewhere else, deploy it in a different place, train it with another dataset, and so on. This pretty_print method will just print the code as Python, so this is all Python code that creates the pipeline towards the end. Again, any other AutoML library might do it differently. The main objective here is that users need to understand what is happening, what the output is, and how they can recreate it and make sure that it is reproducible.

Create a Pipeline from Scratch: Combinators

Here is where we go a little further. What we do is we try to create our own directed acyclic graph from scratch. In the gradual AutoML paradigm, we borrowed some concepts from functional programming. These are combinators, which will allow you to compose multiple nodes easily. In the table here, we didn't actually invent the top two operators; they are just syntactic sugar on existing pipeline creation. Spark MLlib has a notion of pipeline, as does Scikit-learn, and so on. They had these operators. We just use some syntactic sugar on top. The first one, >> (greater than, greater than), means you're adding a dataflow edge between op1 and op2, which means the output of op1 is given as input to op2. The ampersand operator (&) is a union. The concat features that you saw before is basically doing the union of those two operators, and the outputs of those will be concatenated if you add concat features. The third operator, though, is new to Lale, and we introduced it for AutoML. What it is saying is that there is a choice between op1 and op2. As a data scientist, I don't know which will work better, but I want the AutoML tool to explore and tell me what works for my dataset. I'm just going to use that OR combinator (|) when I define my graph.

I'll show an example of how this is done for the credit-g dataset. I have the project, I projected numbers. I already explained this graph, but we will see some OR operators in this. There is a simple-imputer strategy mean. Here, I did my imputation. Actually, the top rectangle that's bigger is a choice between scaling and PCA, or just Nyström. If you don't know what these operators are, they're some feature preprocessing operators. Because you don't know which is going to work well, you made a choice between those two. In the graph, you just added that OR combinator, and it's right there on the third line. There is an OR symbol there. For the second part, which is operation on categorical features, you don't know which encoder is going to work well, so you made a choice between those two. Now you give this graph to any optimizer. Whether it's grid search, or Hyperopt, or any Bayesian optimizer, it is going to explore everything that you see in this graph and tell you what works best for your dataset.
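To make the three combinators concrete, here is a toy sketch in plain Python that overloads >>, & and | on a small class and enumerates every concrete pipeline a planned pipeline stands for. The Op class and leaf helper are hypothetical and written only for this illustration; this is not how Lale is implemented. The operator names, including the two encoders and two classifiers offered as choices, are placeholders.

```python
from itertools import product


class Op:
    """A node in a toy planned pipeline; `alternatives` lists every concrete form."""

    def __init__(self, alternatives):
        self.alternatives = list(alternatives)

    def __rshift__(self, other):
        # op1 >> op2: dataflow edge, output of op1 feeds op2
        return Op(f"{a} >> {b}" for a, b in product(self.alternatives, other.alternatives))

    def __and__(self, other):
        # op1 & op2: run both branches, union of their outputs
        return Op(f"({a} & {b})" for a, b in product(self.alternatives, other.alternatives))

    def __or__(self, other):
        # op1 | op2: a choice the optimizer has to resolve
        return Op(self.alternatives + other.alternatives)


def leaf(name):
    """Wrap a single operator name as a planned pipeline with one alternative."""
    return Op([name])


# Parentheses matter: in Python, >> binds tighter than &, which binds tighter than |.
numeric = (
    leaf("ProjectNumbers") >> leaf("ImputeMean")
    >> ((leaf("Scale") >> leaf("PCA")) | leaf("Nystroem"))
)
categorical = (
    leaf("ProjectCategories") >> leaf("ImputeMostFrequent")
    >> (leaf("OneHot") | leaf("Ordinal"))
)
planned = (
    (numeric & categorical) >> leaf("Concat")
    >> (leaf("XGBoost") | leaf("LogisticRegression"))
)

for concrete in planned.alternatives:
    print(concrete)
```

Three binary choices give eight concrete pipelines. Enumerating them eagerly like this only works for tiny graphs; an optimizer would instead sample or search the space, including the hyperparameter ranges of each node.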

Execution Modes

How do we even run AutoML search on the graph that we just defined? We have a new execution mode. Fit and predict we saw earlier, which are for a trainable pipeline. For AutoML search, we defined a new API that we call auto_configure, meaning: using AutoML, configure this pipeline for me. What we are giving here is the dataset, so X and y, and the optimizer. In our implementation, we have implemented multiple backends, so Hyperopt, SMAC, GridSearchCV, HalvingGridSearchCV, and so on. Again, in theory, it can be any optimizer. The implementation is just a part of it. Then you say cross validation is 3, my scoring metric is accuracy, time is 300 seconds. Now do your search and return me the best_found pipeline. There is a summary method which also will give you all the pipelines that it tried, and how they performed. If you're interested, you can look at not just rank 1, but all the ranks, and visualize them and pretty_print them and so on. Now you can use that best_found pipeline to predict, deploy, whatever you want.
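What a search call of this kind does internally can be approximated by a short plain-Python sketch: score each candidate by k-fold cross-validation, rank the results, and return the best one with a summary. The function below borrows the name auto_configure from the talk but is a hypothetical toy with an invented signature, not Lale's API. The threshold rules stand in for concrete pipelines and, unlike real ones, ignore their training data.

```python
def k_fold_indices(n, cv):
    """Split range(n) into cv interleaved folds (sample i goes to fold i % cv)."""
    return [[i for i in range(n) if i % cv == fold] for fold in range(cv)]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def make_threshold_rule(threshold):
    """Toy concrete 'pipeline': predicts 'good' when the feature exceeds threshold."""
    def fit_predict(train_x, train_y, test_x):
        # A real pipeline would learn from train_x / train_y here.
        return ["good" if x > threshold else "bad" for x in test_x]
    return fit_predict


def auto_configure(candidates, X, y, cv=3, scoring=accuracy, max_evals=None):
    """Cross-validate each candidate; return the best name and a ranked summary."""
    summary = []
    for name, fit_predict in list(candidates.items())[:max_evals]:
        fold_scores = []
        for test_idx in k_fold_indices(len(X), cv):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            preds = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            fold_scores.append(scoring([y[i] for i in test_idx], preds))
        summary.append((name, sum(fold_scores) / cv))
    summary.sort(key=lambda row: row[1], reverse=True)
    return summary[0][0], summary


X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = ["bad" if x <= 4 else "good" for x in X]
candidates = {f"threshold={t}": make_threshold_rule(t) for t in (2, 4, 6)}

best_found, summary = auto_configure(candidates, X, y, cv=3, scoring=accuracy)
print(best_found)
for name, score in summary:
    print(name, round(score, 3))
```

The summary plays the role the talk describes: it lets you look past rank 1 and inspect how every tried candidate performed.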

Refine a Pipeline

However, like I said, AutoML or ML is an iterative process. If you followed Andrew Ng's machine learning courses, this is one thing that he emphasizes over and over. I'm a practitioner, so from my experience as well I know that it's so important. It's possible that even after you define your own search space and get an output, you're not happy with it, or suddenly a business constraint tells you that you only want interpretable classification. XGBoostClassifier is not interpretable, so maybe you just want logistic regression. How would you refine your existing pipeline, keeping all the left side, the prefix, as is, and just modify the last step? This is the kind of control you need, even though you're using AutoML for doing many of the other things. What we would do here is we would remove the last element. This is graph manipulation. You just have some APIs, you manipulate the graph, and you run auto_configure again. This time, you only want auto_configure to apply to that last choice. Based on how I manipulated my graph, I only want the last choice to be tried, and I want to freeze everything else. I'm happy with what preprocessing it has come up with.
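The refinement step can be pictured with a toy sketch in which a pipeline is just a list of step names. The helpers remove_last and refine, and the step names, are hypothetical and chosen for this illustration; Lale's own graph manipulation API is not shown here. The point is only that the prefix stays frozen and the new search covers nothing but the final step.

```python
# A previously found pipeline, written as an ordered list of step names
# (illustrative names, not output from a real search).
best_found = ["ProjectAndImpute", "OneHotEncoder", "ConcatFeatures", "PCA", "XGBClassifier"]


def remove_last(pipeline):
    """Drop the final step and keep the already-configured prefix."""
    return pipeline[:-1]


def refine(pipeline, final_choices):
    """Keep the frozen prefix and reopen only the final step as a choice."""
    prefix = remove_last(pipeline)
    return [prefix + [choice] for choice in final_choices]


# Only interpretable classifiers are offered for the last step this time.
candidates = refine(best_found, ["LogisticRegression", "DecisionTreeClassifier"])
for candidate in candidates:
    print(" >> ".join(candidate))
```

Because the prefix is fixed, the second search is much smaller than the first: one candidate per final-step choice, before hyperparameters of that step are taken into account.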

Search Spaces and Optimizers

Going back to what I said earlier, the way Lale implements this, and it doesn't have to be this way, is that we use some compilation techniques to convert this planned pipeline to search spaces for each optimizer. I will talk about a declarative way that we have to define search spaces. From that declarative specification, we get an intermediate representation using some compilation technique. Then that IR, the intermediate representation, gets converted to each of the optimizers' own specifications. If you know these tools, they have their own ways of specifying the search space. This is all done for you automatically. Then you can follow the third step. So far, we only looked at how we manipulated the graph, but for each of the nodes in the graph, we already said that there is a set of hyperparameters. For each of the hyperparameters, we have ranges of values we want to explore. Here, based on popular public opinion, I've considered deep learning as an operator. We have hidden layers as hyperparameters. For a neural network classifier, let's say hidden layer sizes is a hyperparameter. What you can configure is the number of hidden layers, and in each hidden layer, how many units you have. This is a default search space that we have in Lale, but you can customize it, and that's the main point here. By default, you have up to 20: AutoML is going to explore 1 to 20 hidden layers. For each of the hidden layers, it's going to explore 1 to 500 units. This is, in a way, similar to neural architecture search. There are different methods. This is a subset of neural architecture search for this example. Let's say you want to customize it. You can customize the search space by calling customize_schema. You say, ok, I don't want that deeper network, maybe I don't need it, or I don't have the money to train that bigger network. What I do is I limit my search space. I will say, ok, just use two hidden layers, each of which can have up to 50 units. This is how you can specify the search space for each of the nodes.
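The effect of narrowing the search space can be sketched without any library. The dictionary layout, the customize helper, and the sampler below are hypothetical; they only mimic the idea behind customize_schema and do not use its real signature. The sketch also assumes that "two hidden layers" means at most two.

```python
import random

# Default bounds as described in the talk: 1 to 20 layers, 1 to 500 units each.
DEFAULT_SPACE = {"max_layers": 20, "max_units": 500}


def customize(space, **overrides):
    """Return a narrowed copy of the search space, leaving the default untouched."""
    return {**space, **overrides}


def sample_hidden_layer_sizes(space, rng):
    """Draw one candidate value for a hidden_layer_sizes hyperparameter."""
    n_layers = rng.randint(1, space["max_layers"])
    return tuple(rng.randint(1, space["max_units"]) for _ in range(n_layers))


# Assumption: the customized space allows at most 2 layers of at most 50 units.
small_space = customize(DEFAULT_SPACE, max_layers=2, max_units=50)

rng = random.Random(0)
samples = [sample_hidden_layer_sizes(small_space, rng) for _ in range(100)]
print(samples[:5])
```

Every sample drawn from the customized space is a network that is cheap to train, which is the budget argument made in the talk.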

Now we have this pipeline graph, with each of the nodes customized if you want to, and it's going to convert it into a search space automatically. In doing so, it also uses the semantics of the combinators. As you put the nodes together, the search space compilation is also taking into account how these nodes interact with each other. That's the glue. This is the declarative search space specification that I talked about. We use a very standard schema language called JSON Schema. It's an open standard. It's used to define shapes of data. We use it here to define shapes of hyperparameters. It looks a bit complex, but if you know JSON Schema or even JSON, I think it's easy to understand. What we are saying here is that for simple-imputer, the strategy can be one of four values. It can be constant imputation, where you just give a constant value to impute the missing value, or mean, or median, or most frequent. Then I should highlight the minimum and maximum values for PCA's number of components. These are the values that the optimizer is going to use to generate the search spaces. However, you don't need to do this for popular operators. In Lale, which is the open source library, we have defined schemas for 216 operators from Scikit-learn, from imblearn, from aif360, and so on. They can also be deep neural networks. I've created my own BERT embedding based pipeline using Lale, and it's possible. You don't have to handwrite these schemas either; you can also infer them from documentation if you have it, and we have papers that we published in the past that describe how you would do that. If you want to add your own operator to AutoML, to the search, you can do that using make_operator, and the documentation describes how you do it.
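As a rough picture of what such a declarative specification looks like, the sketch below writes two small JSON Schema style fragments as Python dictionaries and derives candidate values from them. The four imputer strategies come from the talk; the PCA bounds are invented for illustration and are not Lale's actual defaults. The helpers candidate_values and is_valid are hypothetical and handle only the two schema shapes used here, not JSON Schema in general.

```python
# Hyperparameter "shapes" written as JSON Schema style fragments.
HYPERPARAM_SCHEMAS = {
    "SimpleImputer": {
        "type": "object",
        "properties": {
            "strategy": {"enum": ["constant", "mean", "median", "most_frequent"]},
        },
    },
    "PCA": {
        "type": "object",
        "properties": {
            # Bounds are illustrative only.
            "n_components": {"type": "integer", "minimum": 1, "maximum": 20},
        },
    },
}


def candidate_values(prop_schema):
    """List the values an optimizer could try for one hyperparameter."""
    if "enum" in prop_schema:
        return list(prop_schema["enum"])
    if prop_schema.get("type") == "integer":
        return list(range(prop_schema["minimum"], prop_schema["maximum"] + 1))
    raise ValueError(f"unsupported schema: {prop_schema}")


def is_valid(operator, hyperparams):
    """Check a concrete configuration against the operator's schema."""
    props = HYPERPARAM_SCHEMAS[operator]["properties"]
    return all(
        name in props and value in candidate_values(props[name])
        for name, value in hyperparams.items()
    )


strategies = candidate_values(HYPERPARAM_SCHEMAS["SimpleImputer"]["properties"]["strategy"])
print(strategies)
print(is_valid("PCA", {"n_components": 5}), is_valid("PCA", {"n_components": 0}))
```

Keeping the ranges in data rather than in code is what lets one specification be translated into the formats of several different optimizers.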

Class Imbalance Correction Using Over-sampling and Under-sampling

I'm going to talk about two somewhat complex use cases with ML. Many of the AutoML tools don't address them, because they're usually treated as separate steps away from the model building. One of them, a common problem that we see, is class imbalance. For credit-g, which was our dataset, I just plotted a histogram of the labels. As you can see, the first label here on the left, good, is roughly double bad. Which means the dataset has an imbalance in the class labels. There are existing techniques, these are not new techniques. SMOTE is a very popular one. It stands for Synthetic Minority Over-sampling Technique. What that means is that for the minority class, which is the class bad in this case, it's going to create synthetic examples from existing examples. Then it's going to add them to the dataset, so your class labels look balanced. What you could do to include SMOTE in AutoML is use these higher-order operators. If you see, the first statement says from lale.lib.imblearn import SMOTE, and it is using some prefix. Prefix is all the data preprocessing part we saw before, and it's adding SMOTE to it. Within SMOTE, it says a random forest classifier is my operator. What it does internally is it does all the preprocessing on your dataset, and just before you want to use classification, it will balance it, so change the data distribution, and apply a random forest classifier to it. Again, there is a k_neighbors equal to 12 there, which is a hyperparameter of SMOTE itself. When it creates a sample synthetically, it can consider 12 neighbors, but it's a hyperparameter, which means that it can be part of the search space. If you don't define it, AutoML can search over it. Just for completeness, I also included examples of other imbalance correction techniques, and one of them is under-sampling. Just like we did over-sampling with SMOTE, you can also do under-sampling, which means that you under-sample the majority class. You can also combine both these techniques.
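The core idea of SMOTE, creating a synthetic minority example by interpolating between an existing minority example and one of its nearest minority neighbours, fits in a short plain-Python sketch. This is a simplified illustration written for this purpose, not the imblearn implementation, and the 70 versus 30 two-feature dataset is made up to echo the roughly two-to-one imbalance mentioned above.

```python
import random


def smote(minority, n_new, k_neighbors=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest minority neighbours of the chosen point (squared Euclidean distance)
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )[:k_neighbors]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Made-up imbalanced data: 70 majority points and 30 minority points.
majority = [(float(i), float(i % 7)) for i in range(70)]
minority = [(100.0 + i, 50.0 + (i % 5)) for i in range(30)]

new_points = smote(minority, n_new=len(majority) - len(minority), k_neighbors=12)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Only the training data should be rebalanced this way; validation and test data keep their original distribution so that the evaluation reflects what the model will see in practice.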

AutoML with Imbalance Correction

Most often you don't know what's going to work well for your datasets. What you do, you take the help of AutoML and you define a choice. Here the last in the diagram is a choice between SMOTE and SMOTEENN, which are two techniques for imbalance correction, and you say auto_configure. Next slide I say auto_configure, find out what works best. Note two differences from the previous example, one is that scoring is a balanced accuracy metric now. Because it's an imbalanced dataset, I want to use balanced accuracy as my metric. I can use any of the metrics, at least with Lale implementations that are available in Scikit-learn. You can also define your own metric. The other thing I'm doing here is max_evals is 10, which means I want the AutoML search to only limit it to 10 points in the search space. AutoML did its magic and looks like SMOTE was a winner based on whatever search it did. We will go ahead with SMOTE.
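Why balanced accuracy is the better metric on an imbalanced dataset can be shown with a small worked example in plain Python. The label counts are illustrative, and the functions are written from the standard definition of balanced accuracy as the mean of per-class recall; they are stand-ins for a library metric, not a library call.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Illustrative imbalanced labels and a useless model that always says "good".
y_true = ["good"] * 70 + ["bad"] * 30
always_good = ["good"] * 100

plain = accuracy(y_true, always_good)
balanced = balanced_accuracy(y_true, always_good)
print(plain, balanced)
```

Plain accuracy rewards the model for ignoring the minority class, while balanced accuracy drops to chance level, so a search that optimizes balanced accuracy will not be misled by such a pipeline.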

Fairness and AutoML

The next use case I'm going to mention is more complex than this, but I still want to highlight that it's an important one, which is fairness and bias mitigation with machine learning. It's well known today, and it's a popular topic, that machine learning models can be biased. They can only be as good as your techniques and datasets. The first thing you could do is to measure your model and your datasets on well-known fairness metrics, and there are many. There are open source implementations of these available. aif360 is an open source package in Python that implements many of the fairness metrics and bias mitigation techniques. What you don't have in existing tools is the ability to use it in your AutoML search. What we do here is allow that. Before I get there, I have some highlights on the credit-g dataset to understand the problem we are talking about. Here we have y, which is still our class label, a combination of good and bad, but then in the features that we had, we had some attributes which we call protected attributes. They're protected because they're sensitive. If you plot the histograms conditioned on these attributes, you will see that there is a clear imbalance in how a loan application from a female applicant is treated compared to one from a male applicant. There is discrimination age-wise too. So we identified two protected attributes: personal status, which has many categorical values that combine gender and marital status, and age. Again, it can be clearly seen that you don't want to be biased on these when you're approving or disapproving loans. These are treated as protected attributes by these techniques, and the fairness metrics and mitigations will take that as input. You can say what my protected attributes are, and which are the ones I probably want to redact during model building. You could use all of that and do it in AutoML using auto_configure. Our documentation describes exactly which metrics and bias mitigation techniques you could use. They work the same way as they did for imbalance correction, which means that it's a higher-order operator. There is some nesting, so you can say, this is my bias mitigator, this is my classifier, nest them together, create a search space together, and give me what works best.
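One widely used fairness metric, disparate impact, is simple enough to compute by hand: it is the rate of favorable outcomes for the unprivileged group divided by the rate for the privileged group. The sketch below is a plain-Python illustration with made-up records and hypothetical group names "A" and "B"; it does not call aif360, and the talk does not say which metric was used.

```python
def favorable_rate(records, group):
    """Fraction of a group's records that carry the favorable label 'good'."""
    members = [r for r in records if r["group"] == group]
    return sum(r["label"] == "good" for r in members) / len(members)


def disparate_impact(records, unprivileged, privileged):
    """Ratio of favorable rates; 1.0 means both groups are treated alike."""
    return favorable_rate(records, unprivileged) / favorable_rate(records, privileged)


# Made-up outcomes: group A gets 8 of 10 favorable, group B gets 6 of 10.
records = (
    [{"group": "A", "label": "good"}] * 8
    + [{"group": "A", "label": "bad"}] * 2
    + [{"group": "B", "label": "good"}] * 6
    + [{"group": "B", "label": "bad"}] * 4
)

di = disparate_impact(records, unprivileged="B", privileged="A")
print(round(di, 2))
```

A value well below 1 signals that the unprivileged group receives the favorable outcome less often. In a fairness-aware search, a metric of this kind would be measured alongside, or combined with, the predictive metric when ranking pipelines.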


Here is a link to our open source repository. Most of the concepts I talked about can be decoupled from the implementation, but we have it implemented in Lale.

Questions and Answers

Kate: Lale means tulip in Persian. Back at IBM Research, I'm part of AI programming models team, which means that we have a combination of AI people, but also programming languages people. Many of the programming languages, I think they pick these objects to name them. There is no deep meaning other than that.

Participant 1: I have a question about bias. Let's not take the example of the credit card, and let's imagine that it's like for insurance, like car insurance. I happen to be married. I drive a lot differently now that I'm married than when I wasn't. What if the bias is true, if that makes any sense? How do you reconcile that with, ok, we're going to lose a lot of money if we say that you're just as good a driver when you're single than when you're married.

Kate: Other than the machine learning aspects to bias, there are many social aspects, and per domain, they will change. At least as a machine learning practitioner, it is left to the domain expert to tell us what are the protected attributes, whether the married status needs to be a protected attribute or not. That's what the machine learning model is going to use. If you tell me it's not, then I will not use it as a protected attribute. I don't understand the social aspect so well, but when we experiment with it, there are many datasets where people have identified what are protected attributes for this case, and that's the debate that they need to have.

Participant 2: What is a protected attribute?

Kate: It is a sensitive attribute that you do not want to include in your model building, one that you don't want the model to be biased on.

Participant 2: It won't be a feature.

Kate: It won't be a feature. Different techniques will treat it differently: some will just eliminate it from the set of features, and some will do some preprocessing on it. At a high level, these are features. You don't want them to be predictors, and you don't want them to create bias.

Participant 3: Do AutoML tools support finding different feature engineering or feature preprocessing techniques, like finding the best one?

Kate: There are some which will use existing techniques, like many of the preprocessing examples I gave. There are also some AutoML tools, and I know IBM's AutoAI product certainly does this, which create new features based on your existing features. They create some derived features. One popular example I know is body mass index. If you have weight and height, they will create a BMI. That's an example they give. They then see if that makes sense, and treat it as part of the search space.
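The BMI example can be sketched in a few lines. This is a hand-written derivation, not how AutoAI or any other AutoML tool generates features; the function name `add_bmi` and the column names `weight_kg` and `height_m` are hypothetical. It uses the standard formula, weight in kilograms divided by the square of height in meters.

```python
def add_bmi(rows):
    """Return copies of the rows with a derived 'bmi' feature: weight_kg / height_m ** 2."""
    derived = []
    for row in rows:
        new_row = dict(row)  # copy so the original features stay untouched
        new_row["bmi"] = row["weight_kg"] / row["height_m"] ** 2
        derived.append(new_row)
    return derived


# Hypothetical records with illustrative column names.
rows = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 90.0, "height_m": 1.80},
]
with_bmi = add_bmi(rows)
for row in with_bmi:
    print(row)
```

An AutoML tool that supports feature derivation would treat "with the derived feature" and "without it" as two alternatives in the search space and keep whichever scores better.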

Participant 4: A question about the SMOTE exercise you used for dealing with the imbalanced class. The SMOTE exercise actually pollutes the model in a certain way, because your raw data is different. Then why do you use a Sim model to try to test [inaudible 00:43:00] to the validation set? Would the result be the same or different?

Kate: What you're saying is that we are changing the training distribution, but the test distribution is still going to come from the real world. It is possible that the result will change, but without it the model would not do justice to the minority classes. Again, I think in AutoML, you can include pipelines both with SMOTE and without SMOTE in the search space, test them on your validation set, and see what works better. This is really just a tool to make it easier for you to take these decisions. Finally, it's up to you what you want to choose.

Participant 2: The class imbalance that you talked about, why is that a problem? Why is class imbalance a problem, in the training set?

Kate: We had good and bad. I don't know what the percentage is. Let's say 70/30 is our percentage split for class labels. Then a dummy classifier, which is always going to predict that the class label is good, will be almost as good as any other classifier. The best accuracy on the credit-g dataset is 76 point some percent today. In most of the tools and papers you will see, that's about the accuracy on that dataset. Whereas if you had a dummy classifier, it's giving you 70% accuracy without doing anything.
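The arithmetic behind that answer can be checked with a small sketch. It assumes the hypothetical 70/30 split used in the answer above, and the helper name `majority_baseline_accuracy` is made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a dummy classifier that always predicts the most common label."""
    majority_label, count = Counter(labels).most_common(1)[0]
    return majority_label, count / len(labels)


# Hypothetical 70/30 split, as assumed in the answer above.
labels = ["good"] * 70 + ["bad"] * 30
majority_label, accuracy = majority_baseline_accuracy(labels)

# The dummy classifier never predicts "bad", so its recall on that class is zero.
predictions = [majority_label] * len(labels)
bad_caught = sum(1 for y, p in zip(labels, predictions) if y == "bad" and p == "bad")
bad_recall = bad_caught / labels.count("bad")

print(majority_label, accuracy, bad_recall)
```

The dummy classifier reaches 70% accuracy while identifying none of the "bad" cases, which is why accuracy alone is a misleading target on imbalanced data.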

Participant 2: These are the classes, like the labels are distributed such that 70% of the labels are good reading and 30% are bad, and this is the source of truth.

Kate: This is the source of truth, yes.

Participant 2: The goal was to make it 50/50?

Kate: More or less.

Participant 2: Is a technique like programmatic labeling, is that what is used to build?

Kate: They do over-sampling, which means that they create synthetic samples. You take the minority class, and you create new samples that you obtained by averaging nearest neighbors. They said k nearest neighbors is 12. You take 12 neighbors, synthetically merge them and create a new sample for minority, which means that your bad labels are going to go up now.
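A minimal sketch of the over-sampling step follows. One caveat: the answer above describes it as averaging neighbors, whereas the standard SMOTE formulation picks one of the k nearest minority neighbors at random and interpolates between it and the base sample, and the sketch follows that standard formulation. It is plain Python, not the implementation used in the talk; the function name, the toy points, and k=2 are assumptions chosen to fit a four-point example.

```python
import math
import random


def smote_like_sample(minority, k, rng):
    """Create one synthetic point on the segment between a minority sample and one of its k nearest minority neighbors."""
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base sample.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbor = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


# Hypothetical 2-D minority-class points.
rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(4)]
print(synthetic)
```

Each synthetic point lies between two real minority points, so adding these points raises the minority count without duplicating rows exactly. Only the training data should be over-sampled this way; the validation and test sets keep their original distribution.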

Participant 5: Are there any AutoML tools for like SMOTE XGBoost, or things that are not like, I know this library is fully in Python, and we might not be set up for that as we would think?

Kate: Yes and no. Basically, in our implementation, we have some Spark support. In fact, even for preprocessing, we can use Spark SQL as the underlying engine. If you're doing simple imputation, you can pass it to pandas, scikit-learn, or Spark SQL. When it comes to the estimator, which is the final model, you can also pass it to Spark models.




Recorded at:

Feb 22, 2024