Transcript
Sameki: My name is Mehrnoosh Sameki. I'm a Principal Product Lead at Microsoft. I'm part of a team on a mission to help you operationalize this buzzword of responsible AI in practice. I'm going to talk to you about taking responsible AI from principles and best practices to actual practice in your ML lifecycle. First, let's answer, why responsible AI? This quote from a book called "Tools and Weapons," published by Brad Smith and Carol Ann Browne in September 2019, sums up the point really well: "When your technology changes the world, you bear a responsibility to help address the world that you have helped create." That all comes down to the fact that AI is now everywhere. It's here. It's all around us, helping to make our lives more convenient, productive, and even entertaining. It's finding its way into some of the most important systems that affect us as individuals across our lives, from healthcare, finance, and education, all the way to employment and many other sectors and domains. Organizations have recognized that AI is poised to transform business and society. The advancements in AI, along with its accelerated adoption, are being met with evolving societal expectations, though, and there is a growing body of regulation in this space in response to that growth. AI is a complex topic because it is not a single thing. It is a constellation of technologies that can be put to many different uses, with vastly different consequences. AI has unique challenges that we need to respond to. To take steps towards a better future, we truly need to define new rules, norms, practices, and tools.
Societal Expectations and AI Regulation
On a weekly basis, there are new headlines highlighting concerns about the use or misuse of AI. Societal expectations are growing. More members of society are considering whether AI is trustworthy, and whether companies who are innovating keep these concerns top of mind. Also, regulations are coming. I'm pretty sure you have all heard about the proposed AI Act from Europe. As I was reading through it, I noticed more and more that it talks about impact assessment and trustworthiness, security and privacy, interpretability, intelligibility, and transparency. We are going to see a lot more regulation in this space. We need to be 100% ready to still innovate, but also be able to respond to those regulations.
Responsible AI Principles
Let me talk a little bit about the industry approach. The purpose of this particular slide is to showcase the six core ethical principles in the book called "The Future Computed," which represents Microsoft's view. Note that a lot of other companies have their own principles as well, and at this point they are largely similar. When you look at them, there are the principles of fairness, reliability and safety, privacy and security, and inclusiveness, underpinned by two more foundational principles of transparency and accountability. The first principle is fairness. For AI, this means that AI systems should treat everyone fairly and avoid affecting similarly situated groups of people in different ways. The second principle is reliability and safety. To build trust, it's very important that AI systems operate reliably, safely, and consistently under normal circumstances, and in unexpected situations and conditions. Then we have privacy and security: it's also crucial to develop systems that can protect private information and resist attacks. As AI becomes more prevalent, protecting the privacy and security of important personal and business information is becoming more critical, but also more complex.
Then we have inclusiveness. For the 1 billion people with disabilities around the world, AI technologies can be a game changer. AI can improve access to education, government services, employment information, and a wide range of other opportunities. Inclusive design practices can help system developers understand and address potential barriers in a product environment that could unintentionally exclude people. By addressing these barriers, we create opportunities to innovate and design better experiences that benefit everyone. We then have transparency. When AI systems are used to help inform decisions that have tremendous impacts on people's lives, it's critical that people understand how those decisions were made. A crucial part of transparency is what we refer to as intelligibility, the useful explanation of the behavior of AI systems and their components. Finally, accountability. We believe that the people who design and deploy AI systems must be accountable for how their systems operate. This is perhaps the most important of all these principles. Ultimately, one of the biggest questions for our generation, as the first generation bringing all of this AI into society, is how to ensure that AI will remain accountable to people, and how to ensure that the people who design, build, and deploy AI remain accountable to everyone else.
The Standard's Goals at a Glance
We take these six principles and break them down into 17 unique goals. Each of these goals has a series of requirements that set out procedural steps that must be taken, mapped to the tools and practices we have available. You can see that under each section we have, for instance, impact assessment, data governance, fit for purpose, and human oversight. Under transparency, there is interpretability and intelligibility, communication to stakeholders, and disclosure of AI interaction. Under fairness, we have how we can address a lot of different harms, like quality-of-service harms, allocation harms, and stereotyping, demeaning, or erasing different groups. Under reliability and safety, there are failures and remediations, safety guidelines, and ongoing monitoring and evaluation. Privacy and security and inclusiveness are more mature areas, and they benefit a lot from previous standards and compliance processes that have been developed internally.
Ultimately, the challenge is that this requires a really multifaceted approach in order to operationalize all those principles and goals at scale in practice. We're focusing on four different key areas. At the very foundation, you see the governance structure to enable progress and accountability. Then we need rules to standardize our responsible AI requirements. On top of that, we need training and practices to promote a human-centered mindset. Finally, tools and processes for implementation of such goals and best practices. Today, I'm mostly focusing on tools. Out of all the elements of the responsible AI standard, I'm double-clicking on transparency, fairness, and reliability and safety, because these are areas that truly help ML professionals better understand their models and the entire ML lifecycle. They also help them understand the harms, the impact of those models on humans.
Open Source Tools
We have already provided a lot of different open source tools, some of them known as Fairlearn, InterpretML, and Error Analysis, and now we have a modern dashboard called the Responsible AI dashboard, which brings them together and also adds a lot of different functionalities. I'll talk about them one by one, first going through fairness, the philosophy behind it, and how we developed the tools. I will cover interpretability. I will cover error analysis. I will talk about how we brought them under one roof and created the Responsible AI dashboard. I'll talk about some of our recent releases, the Responsible AI Tracker and Responsible AI Mitigations, and I'll show you demos as well.
AI Fairness
One of the first areas that we worked on was AI fairness. Broadly speaking, based on the taxonomy developed by Crawford et al. at Microsoft Research, there are five different types of harms that can occur in a machine learning system. While I have all of their definitions available for you on the screen, I want to double-click on the first two: harm of allocation, which can occur when AI systems extend or withhold opportunities, resources, and information to specific groups of people, and harm of quality of service, whether a system works as well for one person as it does for another. I want to give you examples of these two types of harms. On the left-hand side, you see the quality-of-service harm, where we have a voice recognition system that might fail, for instance, on women's voices compared to men's or non-binary people's. Or, for instance, you can think of it as the voice recognition system failing to recognize the voices of non-native speakers compared to native speakers. That's another angle you can look at it from. Then on the right-hand side, we have the harm of allocation, where there is an example of a loan screening or job screening AI that might be better at picking candidates among white men compared to other groups. Our goal was truly to help you understand and measure such harms if they occur in your AI system, and resolve them, mitigate them, to the best of your ability.
We provided assessment and mitigation. On the assessment side, we give you a lot of different evaluations. Essentially, we give you the ability to bring a protected attribute, say gender, ethnicity, age, whatever that might be for you, and then specify one of the many fairness metrics that we have. We support two categories of fairness metrics, both in the family of what I would call group fairness, that is, metrics that help measure how different groups of people are being treated. The first category is disparity in performance: how model accuracy, or false positive rate, or false negative rate, or F1 score, or whatever else it might be, differs across different buckets of a sensitive feature, say female versus male versus non-binary. The second is disparity in selection rate: how the model's predictions differ across different buckets of a sensitive feature. Meaning, if we have female, male, and non-binary applicants, what percentage of each group is selected to get the loan, what percentage gets the favorable outcome for a job or a loan? You just look at your favorable outcome, or your prediction distribution, across these buckets and see how it differs.
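To make this kind of disaggregated assessment concrete, here is a minimal sketch using Fairlearn's MetricFrame. The names `model`, `X_test`, `y_test`, and the `gender` column are placeholders, not part of the talk's example code.

```python
# Hedged sketch: group-wise fairness assessment with Fairlearn.
# `model`, `X_test`, `y_test`, and the "gender" column are assumed placeholders.
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score, recall_score

y_pred = model.predict(X_test)

mf = MetricFrame(
    metrics={
        "accuracy": accuracy_score,        # disparity in performance
        "recall": recall_score,
        "selection_rate": selection_rate,  # disparity in selection rate
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=X_test["gender"],
)

print(mf.by_group)      # metric values per sensitive-feature bucket
print(mf.difference())  # largest gap between buckets, per metric
```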
On the mitigation side, we first enable you to specify a fairness criterion that you would like to have guide the mitigation algorithm, and then we have a couple of different classes of mitigation algorithms. For instance, two of the many fairness criteria we support are demographic parity and equalized odds. Demographic parity is a criterion that enforces that applicants of each protected group have the same odds of getting loan approval; the loan approval decision, for instance, is independent of which protected group you belong to. Equalized odds takes a different approach, in the sense that it requires qualified applicants to have the same odds of getting loan approval regardless of their race, gender, or whatever protected attribute you care about, and unqualified applicants to likewise have the same odds of getting loan approval, again regardless of their race or gender. We also have some other fairness criteria that you can learn more about on our website.
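Both criteria also exist as ready-made disparity summaries in Fairlearn; a small hedged sketch, reusing the same placeholder `y_test`, `y_pred`, and `gender` column as above:

```python
# Hedged sketch: summary disparity metrics for the two criteria described above.
from fairlearn.metrics import (
    demographic_parity_difference,
    equalized_odds_difference,
)

dp_gap = demographic_parity_difference(
    y_test, y_pred, sensitive_features=X_test["gender"]
)
eo_gap = equalized_odds_difference(
    y_test, y_pred, sensitive_features=X_test["gender"]
)
print(f"demographic parity gap: {dp_gap:.3f}, equalized odds gap: {eo_gap:.3f}")
```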
Once you specify a fairness criterion, you can call one of the mitigation algorithms. For instance, one of our techniques, which is state of the art from Microsoft Research, is called the reduction approach, with the goal of finding either a classifier or a regressor that minimizes error, which is the goal of the objective function of any AI system, subject to a fairness constraint. What it does is take a standard machine learning algorithm as a black box and iteratively call that black box, reweighting and possibly relabeling the training data. Each time it comes up with one model, so you get a tradeoff between the performance and fairness of those models, and you choose the one that is most appropriate for you.
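As an illustrative sketch of this reduction-style mitigation, Fairlearn's GridSearch wraps any sklearn-style estimator as a black box and returns several candidate models so you can pick your own fairness/performance tradeoff. The estimator choice, `X_train`, `y_train`, and the `gender` column below are assumptions, not the speaker's exact setup.

```python
# Hedged sketch of a reduction-based mitigation sweep with Fairlearn.
from fairlearn.reductions import GridSearch, DemographicParity
from sklearn.linear_model import LogisticRegression

sweep = GridSearch(
    LogisticRegression(max_iter=1000),   # any sklearn-compatible black box
    constraints=DemographicParity(),     # or EqualizedOdds()
    grid_size=10,
)
sweep.fit(X_train, y_train, sensitive_features=X_train["gender"])

# Each fitted candidate trades accuracy against the fairness constraint
# differently; evaluate them and choose the one appropriate for your scenario.
candidates = sweep.predictors_
```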
Interpretability
Next, we have interpretability. The tool that we put out there is called InterpretML. It provides you with functionality to understand how your model is making its predictions. Focusing on just the concept of interpretability, we provide a lot of what are known as glassbox models and black-box (opaque-box) explainers. Glassbox models are models that are intrinsically interpretable, so they can help you understand exactly how the model comes up with its predictions. You train such a model, and it's see-through: you can understand exactly how it makes decisions based on features. Obviously, you might know of decision trees, rule lists, and linear models, but we also support a state-of-the-art model called the explainable boosting machine, which is very powerful in terms of performance but also very transparent. If you are in a heavily regulated domain, I really recommend looking at EBM, the explainable boosting machine. Not just that, we also do black-box explainability. We provide the techniques that are out there under one roof, so you can bring your model and pass it to one of these explainers. These explainers generally don't care about what is inside your model; they use a lot of different heuristics. For instance, SHAP uses a game-theoretic technique to infer how your inputs have been mapped to your predictions, and it provides you with explanations. We provide capabilities like overall model explanations, how the model is making its predictions globally, and also individual prediction explanations, like, say, for Mehrnoosh, what are the top important factors driving the model's prediction of rejection for her?
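As a minimal sketch of the glassbox path with InterpretML's explainable boosting machine (assuming placeholder `X_train`, `y_train`, `X_test`, `y_test` DataFrames and a notebook environment for the interactive views):

```python
# Hedged sketch of a glassbox model with InterpretML's EBM.
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Global explanation: how the model behaves overall, feature by feature.
show(ebm.explain_global())

# Local explanations: why the model made these particular predictions.
show(ebm.explain_local(X_test[:5], y_test[:5]))
```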
Not just that, but inside InterpretML we also have a package called DiCE, which stands for Diverse Counterfactual Explanations. It not only allows you to do freeform perturbation and what-if analysis, where you can perturb my features and see whether and how the model changes its predictions; you can also take a look at each data point and ask, what are the closest data points to this original data point for which the model produces the opposite prediction? Say Mehrnoosh's loan got rejected: what are the closest data points that receive a different prediction? In other words, you can think about it as the bare minimum change you need to apply to Mehrnoosh's features in order for her to get the opposite outcome from the AI. The answer might be, keep all of Mehrnoosh's features constant but increase her income by 10k, or increase her income by 5k and have her build one more year of credit history; then the model is going to approve her loan. This is very powerful information to have. First, it's a great debugging tool: if the answer is that if Mehrnoosh had a different gender or ethnicity then the AI would have approved the loan, then obviously you know that's a fairness issue. It's also a really great tool if you want to provide answers to humans who might come and ask, what can I do next time to get approval from your loan AI? You can say, if you are able to increase your income by 10k in the next year, given all your other features stay constant or improve, then you're going to get approval from our AI.
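A rough sketch of generating such counterfactuals with the dice-ml package follows; the DataFrame, column names, and fitted classifier `clf` are placeholders for whatever your loan scenario actually uses.

```python
# Hedged sketch of DiCE (dice-ml) counterfactuals; all names are placeholders.
import dice_ml

data = dice_ml.Data(
    dataframe=train_df,                      # features plus the outcome column
    continuous_features=["income", "credit_history_years"],
    outcome_name="loan_approved",
)
model = dice_ml.Model(model=clf, backend="sklearn")
explainer = dice_ml.Dice(data, model, method="random")

# For one rejected applicant (a single-row feature DataFrame), find the
# smallest changes that flip the model's decision.
query_instance = X_test.iloc[[0]]
cf = explainer.generate_counterfactuals(
    query_instance, total_CFs=3, desired_class="opposite"
)
cf.visualize_as_dataframe()
```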
Error Analysis
After interpretability, we worked on another tool called Error Analysis. The idea for it came to us because of how many times, really and realistically, you see articles in the press mention that a model is 89% accurate, using a single accuracy score to describe performance on a whole benchmark. Obviously, these single scores are great proxies. They're great aggregate performance metrics for deciding whether to build that initial trust with your AI system or not. But often, when you dive deeper, you realize that errors are not uniformly distributed across your benchmark data. There may be pockets of data in your benchmark with a much higher error rate. This error discrepancy can be problematic, because there is essentially a blind spot: in this case, a cohort that has only 42% accuracy. If you miss that information, it could lead to so many different things. It could lead to reliability and safety issues. It could lead to a lack of trust from the people who belong to that cohort. The challenge is that you cannot sit down, combine all your possible features, create all possible cohorts, and then try to understand these error discrepancies by hand. That's what our Error Analysis tool does.
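To make the idea tangible, here is a conceptual sketch only, not the Error Analysis tool's actual API: fit a shallow surrogate tree on the model's mistakes to surface feature-defined cohorts whose error rate is far from the global average. It assumes a fitted `model` and a numeric feature DataFrame `X_test` with labels `y_test`, all placeholder names.

```python
# Conceptual sketch (not the tool's API): partition the feature space to find
# cohorts whose error rate diverges from the overall benchmark accuracy.
from sklearn.tree import DecisionTreeClassifier, export_text

errors = (model.predict(X_test) != y_test).astype(int)   # 1 = misclassified

surrogate = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20)
surrogate.fit(X_test, errors)

print("global error rate:", errors.mean())
print(export_text(surrogate, feature_names=list(X_test.columns)))
```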
Responsible AI Toolbox
I just want to mention that before the Responsible AI dashboard, which I'm about to introduce to you, we had these three individual tools: InterpretML, Fairlearn, and Error Analysis. What we realized is that people want to use them together, because they want to gain a 360-degree overview of their model health. That's why we first introduced a new open source repository called the Responsible AI Toolbox, which is an open source framework for accelerating and operationalizing responsible AI via a set of interoperable tools, libraries, and customizable dashboards. The first dashboard in the Responsible AI Toolbox was called the Responsible AI dashboard, which brings all of the tools I mentioned, and more, under one roof. If you think about it, when you go through machine learning debugging, you first want to identify what is going wrong in your AI system. That's where error analysis comes into the picture, helping you identify erroneous cohorts of data that have a much higher error rate compared to other subgroups. Fairness assessment can also come in here, because it can identify some of the fairness issues.
Then you want to move on to the diagnosis part, where you would like to look at the model explanations and how the model is making predictions. You may even want to diagnose the issue by perturbing some of the features or looking at counterfactuals. Or you might want to do exploratory data analysis to diagnose whether the issue is rooted in data misrepresentation or lack of representation. Then you want to move on to mitigation, because now that you've diagnosed the problem, you can do targeted mitigation. You can use the unfairness mitigation algorithms I talked about, and I will also talk about data enhancements. Finally, you might want to make decisions. You might want to tell users what they could do next time in order to get a better outcome from your AI; that's where counterfactual analysis comes into the picture. Or you might want to just look at historic data, forget about the model, and see whether there are any factors that have a causal impact on a real-world outcome, and provide your stakeholders with that causal relationship. We brought all of these tools under one roof called the Responsible AI dashboard, which is one of the tools in the Responsible AI Toolbox.
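As a rough sketch of how these components are assembled programmatically, the `responsibleai` and `raiwidgets` packages expose the dashboard from a notebook. The model, DataFrames, and column names below are placeholders; the component choices mirror the identify/diagnose/mitigate/decide stages just described.

```python
# Hedged sketch: assembling the Responsible AI dashboard in a notebook.
# `model`, `train_df`, `test_df`, and the column names are assumed placeholders.
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

rai_insights = RAIInsights(
    model, train_df, test_df,
    target_column="loan_approved",
    task_type="classification",
)

# Pick the components you need for identify, diagnose, and decide.
rai_insights.error_analysis.add()
rai_insights.explainer.add()
rai_insights.counterfactual.add(total_CFs=10, desired_class="opposite")
rai_insights.causal.add(treatment_features=["income", "credit_history_years"])

rai_insights.compute()
ResponsibleAIDashboard(rai_insights)
```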
Demo (Responsible AI Dashboard)
Let me show you a demo of the Responsible AI dashboard so that we can bring all this messaging home. Then I'll continue with two new additional tools under this family of the Responsible AI Toolbox. I have here a machine learning model that can predict whether a house will sell for more than the median price or not, and provide the seller with some advice on how best to price it. Of course, I would like to avoid underestimating the actual price, as an inaccurate price could impact seller profits and the ability to access finance from a bank. I turn to the Responsible AI dashboard to look closely at this model. Here is the dashboard. First, I can do error analysis to find issues in my model. You can see it has automatically separated the cohorts by error counts. I found out that bigger old houses have a much higher error rate, almost 25%, in comparison with large new houses, which have an error rate of only 6%. This is an issue. Let's investigate it further. First, let me save these two cohorts. I save them as new and old houses, and I go to the model statistics for further exploration. I can take a look at the accuracy, false positive rate, and false negative rate across these two different cohorts. I can also observe the prediction probability distribution, and observe that older houses have a higher probability of being predicted to sell for less than the median.
I can further go to the data explorer and explore the ground truth values behind those cohorts. Let me set that to look at the ground truth values. First, I will start with my new houses cohort. As you can see here, most of the newer homes sell for a higher price than the median. It's easy for the model to predict that and get a higher accuracy there. If I switch to the older houses, as you can see, I don't have enough data representing expensive old houses. One possible action for me is to collect more of this data and retrain the model. Let's now look at the model explanations and understand how the model has made its predictions. I can see that the overall finish quality, above-ground living area, and total basement square footage are the top three important factors that impact my model's predictions. I can further click on any of these, like overall finish quality, and understand that a lower finish quality impacts the price prediction negatively. This is a great sanity check that the model is doing the right thing. I can further go to the individual feature importance, click on one or a handful of data points, and see how the model has made predictions for them.
Further on, I come to the what-if counterfactuals. What I am seeing here is that, for any of these houses, I can understand the minimum change I could apply to, for instance, this particular house, which actually has a high probability of being predicted to sell for less than the median, so that the model predicts the opposite outcome. Looking at the counterfactuals for this one, only if the house had a higher overall quality, from 6 to 10, would the model predict that it would sell for more than the median. To conclude, I learned that my model is making predictions based on factors that make sense to me as an expert, and I need to augment my data in the expensive old house category and potentially even bring in more descriptive features that help the model learn about expensive old houses.
Now that we understand the model better, let's provide house owners with insights as to what to improve in these houses to get a better asking price in the market. We only need some historic data on the housing market to do so. Now I go to the causal inference capabilities of this dashboard to achieve that. There are two different functionalities that could be quite helpful here. First, the aggregate causal effects, which show how changing a particular factor, like garages, fireplaces, or overall condition, would impact the house price in this dataset on average. I can further go to the treatment policy to see the best future intervention, say switching the treatment to screen porch. For instance, here I can see that for some houses, if I want to invest in transforming the screen porch, I need to shrink it or remove it, while for other houses it recommends expanding it. Finally, there is also an individual causal effect capability that tells me how this works for a particular data point. This is a certain house. First, I can see how each factor would impact the actual price of this house in the market. I can even do causal what-if analysis, which asks something like, if I change the overall condition to a higher value, what boost am I going to see in this house's market price?
Responsible AI Mitigations and Responsible AI Tracker
Now that we saw the demo, I just want to introduce two other tools as well. We released two new tools as part of the Responsible AI Toolbox. One of them is Responsible AI Mitigations. It is a Python library for implementing and exploring mitigations for responsible AI on tabular data. The tool fills the gap on the mitigation end and is intended to be used programmatically. I'll introduce it to you in a demo as well. The other tool is Responsible AI Tracker. It is essentially a JupyterLab extension for tracking, managing, and comparing responsible AI mitigations and experiments. In some ways, the tool is also intended to serve as the glue that connects all these pieces together. Both of these new releases support tabular data, while the Responsible AI dashboard supports tabular data and also has support for computer vision and NLP scenarios, starting with text classification, image classification, object detection, and question answering. The new developments have two differentiators. First, they allow you to do targeted model debugging and improvement, which in other words means understanding before you decide where and how to mitigate. Second, they support the interplay between code, data, models, and visualizations. While there exist many data science and machine learning tools out there, as a team of ML professionals we believe that you can get the full benefits in this space only if you know how to manage, serve, and learn from the interplay of these four pillars of data science: code, data, models, and visualizations.
Debugging
Here you see the same pillars I talked about: identify, diagnose, mitigate. Now you also have track, compare, and validate. The Responsible AI dashboard as is covers identify and diagnose, because it has tools like error analysis, fairness analysis, interpretability, counterfactuals, and what-if perturbation for both identifying and diagnosing. With Responsible AI Mitigations, you can mitigate a lot of issues that are rooted in the data. With Responsible AI Tracker, you can track, compare, validate, and experiment to find which model is best for your use case. This is different from general techniques in ML, which merely measure the overall error and simply add more data or more compute. In a lot of cases, you're like, let me just go and collect a lot more data. That is obviously very expensive. This blanket approach is fine for bootstrapping models initially, but when it comes to carefully mitigating errors related to particular cohorts or issues of interest, for example underrepresenting or overrepresenting a certain group of people, it becomes too costly to just add generic data, and it might not tackle the issue at its root. In fact, sometimes adding more data does not do much and even hurts other cohorts, because data noise increases or unpredictable shifts happen. This mitigation step is very targeted. It essentially allows you to look at exactly where the issue is happening and then tackle it at its core.
Demo: Responsible AI (Dashboard + Tracker + Mitigations)
Now let's see a demo together to make sure we understand how these three offerings work together. Responsible AI Tracker is an open source extension to the JupyterLab framework, and it helps data scientists with tracking and comparing different iterations or experiments on model improvement. JupyterLab itself is the latest web-based interactive development environment for notebooks, code, and data from Project Jupyter. In comparison to Jupyter Notebooks, JupyterLab gives practitioners the opportunity to work with more than one notebook at the same time to better organize their work. Responsible AI Tracker takes this to the next step by bringing together notebooks, models, and visualization reports on model comparison, all within the same interface. Responsible AI Tracker is also part of the Responsible AI Toolbox, a larger open source effort at Microsoft to bring together tools for accelerating and operationalizing responsible AI. During this tour, you will learn how to use Tracker to compare and validate different model improvement experiments. You will also learn how to use the extension in combination with other tools in the toolbox, such as the Responsible AI dashboard and the Responsible AI Mitigations library. Let me show you how this works.
Here I have installed the Responsible AI Tracker extension, but I have not created a project yet. Let's create a project together. Our project is going to use the UCI income dataset. This is a classification task for predicting whether an individual earns more or less than 50k. It is also possible to bring in a notebook where perhaps you may have created some code to build a model or to clean up some data. That's what I'm doing right now. Let's take a look at the project that was just created and the notebook that we just imported. Here we're training on a split of the UCI income dataset. We are building a model with five estimators. It's a gradient boosted model. We're also doing some basic feature imputation and encoding.
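A minimal sketch of what such a baseline notebook might contain, assuming a local copy of the UCI income (Adult) data and placeholder file and column names; the exact preprocessing in the actual notebook may differ.

```python
# Hedged sketch of the baseline notebook: imputation + encoding + a small
# gradient boosted model (five estimators) on a split of the UCI income data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("adult.csv")                        # placeholder path
X, y = df.drop(columns=["income"]), df["income"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

categorical_cols = X.select_dtypes(include="object").columns
numeric_cols = X.select_dtypes(exclude="object").columns

categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("cat", categorical_pipe, categorical_cols),
    ("num", SimpleImputer(strategy="median"), numeric_cols),
])

clf = Pipeline([
    ("prep", preprocess),
    ("gbm", GradientBoostingClassifier(n_estimators=5)),  # five estimators
])
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```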
After training the model, we can then register the model to the notebook, so that in the future we can remember which code was used to generate the model in the first place. We just pick the model file. We're going to select the machine learning platform, in this case sklearn, and the test dataset we want to evaluate on. Next, we provide a few more pieces of information related to the formatting and the class label. Then we're going to register the model. We see that the overall model accuracy is around 78.9%. This doesn't yet give us enough information to understand where most errors are concentrated. To perform this disaggregated model evaluation, we are going to use the Responsible AI dashboard. This is also part of the Responsible AI Toolbox. It's a dashboard that consists of several visual components on error analysis, interpretability, data exploration, and fairness assessment. The first component that we see here is error analysis. This visualization is telling us that the overall error rate is 21%, which coincides with the accuracy number that we saw in Tracker. Next, it's also showing us that there exist certain cohorts in the data, such as, for example, this one, where the relationship is husband or wife, meaning that the individual is married, for which the error rate increases to 38.9%. At the same time, for individuals who are not married, on the other side of the visualization, we see that the error rate is only 6.4%. We also see other, more problematic cohorts, such as individuals who are married and have more than 11 years of education, for which the error rate is 58%.
To understand better what is going on with these cohorts, we are going to look at the data analysis and the class distribution for each of them. Overall, for the whole dataset, we see that there is a skew towards the negative label: there are more individuals who earn less than 50k. When we look at exactly the same visualization for the married cohort, we see that the story is actually more balanced. For the not married cohort, the distribution looks similar to the prior on the overall data. In particular, for the cohort with the very high error rate of 58%, married with more than 11 years of education, we see that the prior completely flips to the other end: there are more individuals in this cohort who earn more than 50k. Based on this piece of information, we are going to go back to Tracker to see if we can mitigate some of these issues and also compare the mitigations.
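The same per-cohort label-distribution check can be done in a couple of lines of pandas; this is only a hedged sketch, and column names such as `relationship`, `education-num`, and `income` are assumptions about how the dataset is encoded.

```python
# Hedged sketch: label distribution overall and per cohort (placeholder names).
is_married = test_df["relationship"].isin(["Husband", "Wife"])
high_edu = test_df["education-num"] > 11

print(test_df["income"].value_counts(normalize=True))                 # overall skew
print(test_df[is_married]["income"].value_counts(normalize=True))     # married
print(test_df[~is_married]["income"].value_counts(normalize=True))    # not married
print(test_df[is_married & high_edu]["income"].value_counts(normalize=True))
```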
Let's now import yet another notebook that performs a general data balancing approach. We are going to import this from our tour files in the open source repository. What this data balancing technique is basically doing is generating the same number of samples from both classes, the positive and the negative one. If we look at how the data looks after rebalancing, we can see that the full data frame is perfectly balanced. However, more positive labels have been sampled from the married cohort. That is probably because the married cohort initially had more positive examples to start with. After training this model, we can then register the model that was generated by this particular notebook. Here, we bring in the model. We register exactly the same test dataset so that we can have a one-to-one comparison. We give the class label and see how these two models compare. We can see that, overall, data balancing has helped, the accuracy has improved, but we want to understand more. We want to understand how the tradeoff between accuracy and precision plays out. Indeed, we can see here that even though the overall accuracy has improved by 3.7%, precision has dropped by a large margin.
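One simple way to implement the blanket balancing step described above, shown here only as a hedged sketch with placeholder names (the actual notebook lives in the repository), is to sample each class up to the same count across the whole training frame:

```python
# Hedged sketch of blanket class balancing on the full training DataFrame.
# `train_df` and the "income" label column are assumed placeholders.
n_per_class = train_df["income"].value_counts().max()

balanced_df = (
    train_df.groupby("income", group_keys=False)
            .apply(lambda g: g.sample(n=n_per_class, replace=True, random_state=0))
)
print(balanced_df["income"].value_counts())   # both classes now have equal counts
```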
When we look at the married and not married cohorts, which are the ones that we started with earlier in the Responsible AI dashboard, we can see that indeed most of the improvement comes from improvements in the married cohort. However, we also see that precision has declined a lot in the married cohort, which brings up the question of whether this type of balancing, a blanket and general approach, has hurt precision for the married cohort, mostly because most of the positive data was sampled from it. Then the question is, can we do this in a more customized way? Can we perhaps isolate the data balancing changes between the two cohorts so that we can get the best of both worlds? There are two ideas that come to mind here. One could be to perfectly balance the two cohorts separately. The other would be to balance the married cohort, but since the not married cohort has good accuracy from the start, maybe it would be better not to make any changes there. We will try both here. Both of these notebooks are available in our open source repository. I'm going to upload both of these into the project now: the one that balances both cohorts, and the one that leaves the not married cohort unbalanced. We are going to refer to the latter as the targeted mitigation approach.
First, let's see how we have implemented both of these mitigation techniques using the RAI Mitigations library. In particular, we are going to use the data balancing functionalities in the library. Here, we see that we have created the cohorts. The first cohort is the married cohort, and the other one is its complement, which in this case coincides with the not married cohort. We have defined two pipelines, one for each cohort. Through the CohortManager class, we can assign which pipeline needs to be run, in an isolated way, for each of the cohorts. We see that these are basically doing the same thing: they are sampling the data so that, in the end, we have equal frequencies of each of the classes. This is how the data looks after the rebalancing: the overall data is perfectly balanced, and so are the two cohorts. The other mitigation technique that we are going to explore here is to apply rebalancing only to the married cohort, because this is the one that had higher errors in the first place. For the second cohort, we are giving an empty pipeline. This is how the data looks after this mitigation: the married cohort is perfectly balanced, but we have not touched the distribution of the rest of the data.
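The exact cohort and pipeline syntax is in the RAI Mitigations library documentation; as a conceptual stand-in only (not the library's CohortManager API), the two mitigation variants amount to something like the following, with placeholder DataFrame and column names.

```python
# Conceptual stand-in (not the raimitigations API): balance each cohort
# separately, or balance only the married cohort for the targeted variant.
import pandas as pd

def balance(df, label="income"):
    # Upsample every class in this cohort to the size of its largest class.
    n = df[label].value_counts().max()
    return (df.groupby(label, group_keys=False)
              .apply(lambda g: g.sample(n=n, replace=True, random_state=0)))

is_married = train_df["relationship"].isin(["Husband", "Wife"])
married, not_married = train_df[is_married], train_df[~is_married]

# Mitigation A: balance both cohorts in isolation.
both_balanced = pd.concat([balance(married), balance(not_married)])

# Mitigation B (targeted): balance only the married cohort, leave the rest untouched.
targeted = pd.concat([balance(married), not_married])
```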
After this, we're going to compare all of these models together, across different metrics, but also across the cohorts of interest. After registering the new models that we just trained with the two new mitigation techniques, we can go back to the model comparison table and see how the model has improved in each case. First, let's look at the strategy where we balance both cohorts separately. We see that this mitigation technique is at least as good as the baseline. However, most of the improvement is focused on the married cohort, and we see a sudden drop in performance for the not married cohort. We often refer to these cases as backward incompatibility issues in machine learning, where new errors are introduced with model updates. These are particularly important for real-world deployments, because there may be end users who are accustomed to trusting the model for certain cohorts, in this case the not married cohort. Seeing these performance drops may lead to a loss of trust in the user base. This is the story for the mitigation technique that balances both cohorts.
For the next mitigation technique, where we target the data balancing only at the married cohort and leave the rest of the data untouched, we see that overall there is a 6% improvement, which is the highest that we see in this set of notebooks. At the same time, we see that there are no performance drops for the not married cohort, which is a positive outcome. Of course, there is good improvement in the cohort we set out to improve in the first place, the married cohort. The precision is higher than in the blanket approach where we just balanced the whole data. In this way, we can get a good picture of what has improved and what has not improved across all models, across different cohorts, and across different metrics. We can create more cohorts by using the interface. For example, earlier we saw that the performance drops were mostly focused in the cohort of married individuals who had more than 11 years of education. Let's bring in that cohort now and see what happened to it. Here, we are adding the married relationship as a filter. Then we add another filter related to the number of education years; we want this to be higher than 11. Let's save this cohort and see what happened here. We saw that, initially, the accuracy for this cohort was only 41.9%. Through the different mitigation techniques, we are able to reach up to 72% or 73% by targeting the mitigation at the types of problems that we saw in the first place.