Operationalizing Responsible AI in Practice



Mehrnoosh Sameki discusses approaches to responsible AI and demonstrates how open source and cloud integrated ML help data scientists and developers to understand and improve ML models better.


Mehrnoosh Sameki is a senior technical program manager and tech lead at Microsoft, responsible for leading the product efforts on operationalizing responsible AI in practice within the open source and Azure Machine Learning platforms. She co-founded Error Analysis, Fairlearn, and Responsible-AI-Toolbox and has contributed to the InterpretML offering.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Sameki: My name is Mehrnoosh Sameki. I'm a senior program manager and technical lead of the Responsible AI tooling team at Azure Machine Learning, Microsoft. I'm joining you to talk about operationalizing Responsible AI in practice. First, I would like to start by debunking the idea that responsible AI is an afterthought, or responsible AI is a nice-to-have. The reality is, we are all on a mission to make responsible AI the new AI. If you are putting an AI there, and it's impacting people's lives in a variety of different ways, you have the responsibility to ensure the world that you have created is not causing harm to humans. Responsible AI is a must-have.

Machine Learning in Real Life

The reason why it matters a lot goes beyond the traditional machine learning lifecycle itself. In that lifecycle, you have data, then you pass it to a learning phase where the learning algorithm extracts patterns from your data. That creates a model entity for you. Then you use techniques like statistical cross-validation and accuracy to validate and evaluate that model and improve it, leading to a model that then spits out information such as approval versus rejection for loan scenarios. This lifecycle matters, and you want to make sure that you're doing it as reliably as you can, but there are also lots of human personas in the loop that need to be informed at every single stage of this lifecycle. One persona, which is many of you, is ML professionals or data scientists. They would like to know what is happening in their AI systems, because they would like to understand whether their model is any good, whether they can improve it, and what features they should use in order to make reliable decisions for humans. The other persona is business or product leaders. Those are people who would like to approve the model: should we put it out there? Is it going to put us on the first page of the news another day? They ask data scientists a lot of questions: is this model racist? Is it biased? Should I let it be deployed? Are these predictions matching the domain experts' insights that I've got from surgeons, doctors, financial experts, insurance experts?

The other persona is end-users, or solution providers. By that I mean either the banking person who works at a bank and is providing people with that end result of approved versus rejected on their loan, or a doctor who is looking at the AI results and is providing some diagnosis or insights to the end user or patient, in this case. Those are people who deal with the end user. Or they might be the end user themselves. They might ask, why did the model say this about me, or about my patient or my client? Can I trust these predictions? Can I make some actionable movements based on that or not? One persona that I'm not showing here, but is overseeing the whole process, are the regulators. We all have heard about the recent European regulations and GDPR's right to explanation, or California act. They're all adding lots of great lenses to the whole equation. There are risk officers, regulators who want to make sure that your AI is following the regulations as it should.

Microsoft's AI Principles

With all of these personas in the loop, it is important to ensure that your AI is being developed and deployed responsibly. However, even if you are a systematic data scientist or machine learning developer and really care about this area, truth be told, the path to deploying responsible and reliable machine learning is still unpaved. Often, I see people using lots of different fragmented tools, or a spaghetti of visualizations or visualization primitives together in order to evaluate their models responsibly. Our team's mission is to help you operationalize responsible AI in practice. Microsoft has six principles to inform your AI development and deployment. Those are fairness, reliability and safety, privacy and security, and inclusiveness, underpinned by two more foundational ones: transparency and accountability. Our team specifically works on the items that are shown in blue, which are fairness, reliability and safety, inclusiveness, and transparency. The reason why we work on them is because they share a theme. All of these are supposed to help you understand your model better, whether through the lens of fairness, through the lens of how it's making its predictions, through the lens of reliability and safety and its errors, or through whether it's inclusive to everyone. Hopefully, they help you build trust in your model, improve it, debug it further, and draw actionable insights.

Azure Machine Learning - Responsible AI Tools

Let's go through that ecosystem. In order to guide you through this set of tools, I would like to start with a framing. Whenever you have a machine learning lifecycle, or even just data, you would like to go through this cycle. First, you would like to take your model and identify all the issues, such as fairness issues and errors, that are happening inside it. Without the identification stage, you don't know exactly what is going wrong. Next, another important step is to diagnose why that thing is going wrong. The diagnosis piece might look like this: I understand that there are some issues or errors in my data, and now I diagnose that the imbalance in my data is causing them. The diagnosis stage is quite important, because it discovers the root cause of the issue. That's how you can take more efficient, targeted mitigations in order to improve your model. Naturally, you then move to the mitigation stage where, thanks to your identification and diagnosis work, you can mitigate those issues. One last step that I would like to highlight is take action. Sometimes you would like to inform a customer or a patient or a loan applicant about, for instance, what they can do so that next time they get a better outcome. Or, you want to inform your business stakeholders about what you can offer some of the clients in order to boost sales. Sometimes you want to take real-world actions; some of them are model driven, some of them are data driven.

Identification Phase

Let's start with identify and the set of open source tools and Azure ML integrated tools that we provide for you to identify your model issues. Those two tools are error analysis and fairness tools. First, starting with error analysis, the whole motivation behind us putting this tool out there is the fact that we see people often use one metric to talk about their model's goodness, like they say, my model is 73% accurate. While that is a great proxy into identifying the model goodness and model health, it often hides this important information, that error is not uniformly distributed in your data. There might be the case that there are some erroneous packets of data, like this packet of data that is only 42% accurate. Versus, in contrast, this packet of data is getting all of the right predictions. If you go with one number, you're losing this very important information that my model has some erroneous packets, and I need to investigate why that cohort is getting more errors. We released a toolkit called error analysis, which is helping you to validate different cohorts, understand and observe how the error has been distributed across your dataset, and basically see a heat map of your errors as well.
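The point about one aggregate number hiding uneven errors can be illustrated with a minimal sketch. It uses only the Python standard library and hypothetical toy records (the cohort names and labels are made up for illustration); it is not the error analysis toolkit's own API, just the underlying arithmetic.

```python
from collections import defaultdict

# Hypothetical toy records: (cohort label, true label, predicted label).
records = [
    ("old_large", 1, 0), ("old_large", 1, 1), ("old_large", 0, 1), ("old_large", 1, 0),
    ("new_large", 1, 1), ("new_large", 1, 1), ("new_large", 0, 0), ("new_large", 1, 1),
    ("new_small", 0, 0), ("new_small", 0, 0), ("new_small", 1, 1), ("new_small", 0, 1),
]


def accuracy(rows):
    """Fraction of rows where the prediction matches the true label."""
    return sum(1 for _, y_true, y_pred in rows if y_true == y_pred) / len(rows)


def accuracy_by_cohort(rows):
    """Group rows by cohort and compute accuracy inside each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return {cohort: accuracy(members) for cohort, members in groups.items()}


overall = accuracy(records)               # single headline metric
per_cohort = accuracy_by_cohort(records)  # the breakdown that the headline hides
worst = min(per_cohort, key=per_cohort.get)

print(f"overall accuracy: {overall:.2f}")
for cohort, value in sorted(per_cohort.items(), key=lambda item: item[1]):
    print(f"  {cohort}: {value:.2f}")
print("cohort to investigate first:", worst)
```

In this toy data the overall accuracy looks moderate, while one cohort is mostly wrong and another is entirely right, which is exactly the information a single metric loses.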

Next, we worked on another tool called Fairlearn, which is also open source, to help you understand your model's fairness issues. It focuses on two different types of harms that AI often gives rise to. One is harm of quality of service, where AI is providing different quality of service to different groups of people. The other one is harm of allocation, where AI is allocating information, opportunities, or resources differently across different groups of people. An example of harm of quality of service is a voice detection system that might not work as well for, say, females versus males or non-binary people. An example of harm of allocation is a loan allocation AI or a job screening AI that might be better at picking candidates among white men compared to other groups. The whole hope behind our tool is to ensure that you are looking at the fairness metrics through the lens of group fairness, so how different groups of people are getting this treatment. We provide a variety of different fairness and performance metrics and rich visualizations, in order for you to observe the fairness issues as they occur in your model.
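To make the two harms concrete, here is a minimal sketch of group fairness arithmetic in plain Python. The data, group names, and helper functions are hypothetical, and the sketch deliberately does not call the Fairlearn API; it only shows the kind of per-group comparison such a tool reports. Per-group accuracy stands in for quality of service, and per-group selection rate (the share of favorable predictions) stands in for allocation.

```python
# Hypothetical toy data: 1 = loan approved, 0 = rejected.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
group = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]


def split_by_group(y_true, y_pred, group):
    """Collect (true, predicted) pairs for each sensitive group."""
    out = {}
    for t, p, g in zip(y_true, y_pred, group):
        out.setdefault(g, []).append((t, p))
    return out


def selection_rate(pairs):
    """Share of favorable (positive) predictions: an allocation measure."""
    return sum(p for _, p in pairs) / len(pairs)


def accuracy(pairs):
    """Share of correct predictions: a quality-of-service measure."""
    return sum(1 for t, p in pairs if t == p) / len(pairs)


by_group = split_by_group(y_true, y_pred, group)
selection = {g: selection_rate(pairs) for g, pairs in by_group.items()}
quality = {g: accuracy(pairs) for g, pairs in by_group.items()}

# Gap between the best- and worst-treated group for each measure.
selection_gap = max(selection.values()) - min(selection.values())
accuracy_gap = max(quality.values()) - min(quality.values())

print("selection rate by group:", selection, "gap:", round(selection_gap, 2))
print("accuracy by group:", quality, "gap:", round(accuracy_gap, 2))
```

The toy data is chosen so that both groups see the same accuracy while their approval rates differ widely, which shows why the two harms have to be measured separately.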

Both of these support a variety of different model formats: any Python model that follows the scikit-learn predict convention, as well as scikit-learn, TensorFlow, PyTorch, and Keras models. They also support both classification and regression. An example of a company putting our fairness tool into production is Philips Healthcare. They put fairness in production into their ICU models. They wanted to make sure that the ICU models they have out there are performing uniformly across patients with different ethnicities and gender identities. Another example is Ernst & Young in a financial scenario, where they used this tool in order to understand how their loan allocation AI is providing the opportunity of getting a loan across different genders and different ethnicities. They were also able to use our mitigation techniques.

Diagnosis Phase

After the identification phase, you know where the errors are occurring, and you know your fairness issues. You move on to the diagnosis piece. I will cover two of the most important diagnosis capabilities: interpretability, and perturbations and counterfactuals. One more thing to touch on briefly is that we're also in the process of releasing a data exploration and data mitigation library. The diagnosis piece right now entails the more basic data explorer. I will show that to you in a demo. It also includes interpretability. That's the module we provide to you which basically tells you the top key factors impacting your model predictions, that is, how your model is making its predictions. It covers both global explanation and local explanation: how the model is making its predictions overall, and how the model has made its prediction for an individual data point.
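The difference between a local and a global explanation can be sketched with a deliberately simple case. The example below assumes a hypothetical linear scoring model with made-up weights and three made-up houses; for such a model, each feature's local contribution is its weight times the feature's deviation from the dataset average, and a global importance can be taken as the mean absolute local contribution. This is an illustration of the concept only, not the algorithm used by InterpretML or the dashboard.

```python
# Hypothetical linear model: score = bias + sum(weight * feature value).
weights = {"quality": 2.0, "living_area": 0.5, "age": -0.1}
bias = 1.0

# Hypothetical dataset of three houses.
rows = [
    {"quality": 6, "living_area": 10, "age": 40},
    {"quality": 8, "living_area": 14, "age": 5},
    {"quality": 4, "living_area": 12, "age": 60},
]


def predict(row):
    """Score a single row with the linear model."""
    return bias + sum(weights[f] * row[f] for f in weights)


# The average row serves as the baseline that explanations are relative to.
means = {f: sum(r[f] for r in rows) / len(rows) for f in weights}


def local_explanation(row):
    """Per-feature contribution to this row's score, relative to the average row."""
    return {f: weights[f] * (row[f] - means[f]) for f in weights}


def global_explanation(dataset):
    """Mean absolute local contribution of each feature across the dataset."""
    totals = {f: 0.0 for f in weights}
    for row in dataset:
        for f, value in local_explanation(row).items():
            totals[f] += abs(value)
    return {f: total / len(dataset) for f, total in totals.items()}


local = local_explanation(rows[1])
overall = global_explanation(rows)
ranking = sorted(overall, key=overall.get, reverse=True)

print("local contributions for house 1:", local)
print("global importance ranking:", ranking)
```

A useful sanity check for this kind of additive explanation is that the local contributions sum to the difference between the row's prediction and the prediction for the average row.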

We do have different packages under Interpret ML capabilities that we have. It's a collection of black box interpretability techniques that can literally cover any model that you bring to us, no matter if it's Python, or Scikit, or TensorFlow, PyTorch, Keras. We also have a collection of glassbox models that are intrinsically interpretable models, if you have the flexibility of basically changing your model and training an interpretable model from scratch. An example of that is Scandinavian Airlines. They basically used our interpretability capabilities via Azure Machine Learning to build trust with their fraud detection model of their loyalty program. Of course, you can imagine that in such cases, you want to reduce and minimize and remove mistakes, because you don't want to tell a very loyal customer that they've done some fraudulent activity, or flag their activity by mistake. That is a very bad customer experience. They wanted to understand how their fraud detection model is making their predictions, and so they used interpretability capabilities to understand that.

Another important diagnosis piece is counterfactual and perturbations. You can do lots of freeform perturbations, do what-if analysis, change features of a data point, and see how the model predictions change for that. Also, you can look at counterfactuals and that is simply telling you what is the bare minimum changes to a data point's feature values that could lead into a different prediction. Say, Mehrnoosh's loan is getting rejected, what is the bare minimum change that I can apply to her features so that the AI predicts approved next time?
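The "bare minimum change" idea can be sketched as a brute-force search. The scoring rule, feature names, and value ranges below are all hypothetical stand-ins for a trained classifier, and the search only changes one feature at a time, so it is a simplified illustration rather than the counterfactual method the dashboard uses.

```python
# Hypothetical integer value range for each feature.
FEATURE_RANGES = {
    "income_k": (0, 100),
    "years_employed": (0, 40),
    "open_debts": (0, 10),
}


def approve(applicant):
    """Hypothetical loan rule standing in for a trained classifier."""
    score = (2 * applicant["income_k"]
             + 10 * applicant["years_employed"]
             - 5 * applicant["open_debts"])
    return score >= 100


def single_feature_counterfactuals(applicant):
    """For each feature, find the smallest change that flips the prediction.

    Cost is the absolute change divided by the feature's range, so that
    features on different scales can be compared. Results are sorted by cost.
    """
    original = approve(applicant)
    found = []
    for feature, (low, high) in FEATURE_RANGES.items():
        best = None
        for value in range(low, high + 1):
            candidate = dict(applicant, **{feature: value})
            if approve(candidate) != original:
                cost = abs(value - applicant[feature]) / (high - low)
                if best is None or cost < best[2]:
                    best = (feature, value, cost)
        if best is not None:
            found.append(best)
    return sorted(found, key=lambda item: item[2])


applicant = {"income_k": 30, "years_employed": 2, "open_debts": 4}
results = single_feature_counterfactuals(applicant)

print("currently approved:", approve(applicant))
for feature, value, cost in results:
    print(f"change {feature} from {applicant[feature]} to {value} (cost {cost:.2f})")
```

Features that cannot flip the outcome within their allowed range, such as the number of open debts here, simply produce no counterfactual.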

Mitigate, and Take Action Phase

Finally, we go to the mitigation stage, and also the take action stage. We cover a class of unfairness mitigation algorithms that can encompass literally any model. They have different flexibilities. Some of them are post-processing methods that adjust your model predictions in order to improve them. Some of them are reduction methods, a combination of pre-processing and in-processing. They can update your model's objective function in order to retrain your model so that it does not just minimize error, but also puts a control on a fairness criterion that you specify. We also have pre-processing methods that readjust your data by better balancing it and better representing the underrepresented groups. Then, hopefully, the model that is trained on that augmented data is going to be a fairer model. Last, we realized that a lot of people are using our model and Responsible AI insights for decision making in the real world. We all know models sometimes pick up correlations rather than causation. We wanted to provide you with a tool that works on your data, just historic data, and uses a technique called double machine learning in order to understand whether a certain feature has any causal effect on a real-world phenomenon. Say, if I provide a promotion to a customer, would that really increase the sales that that customer will generate for me? Causal inference is another capability we just released.
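As a rough illustration of the post-processing family, the sketch below leaves a model's scores untouched and only changes the decision cut-off per group so that each group is selected at the same rate. The scores, group names, and target rate are hypothetical, and this is a simplified stand-in rather than the specific algorithm shipped in Fairlearn.

```python
# Hypothetical model scores for two groups (higher means more likely approved).
scores = {
    "a": [0.9, 0.8, 0.7, 0.6, 0.3],
    "b": [0.65, 0.5, 0.45, 0.4, 0.2],
}


def selection_rate(values, threshold):
    """Share of a group whose score meets the threshold."""
    return sum(1 for v in values if v >= threshold) / len(values)


def threshold_for_rate(values, target_rate):
    """Score cut-off that selects roughly target_rate of the group.

    Assumes no tied scores at the cut-off; ties would select extra members.
    """
    k = round(target_rate * len(values))  # how many members to select
    if k == 0:
        return float("inf")
    return sorted(values, reverse=True)[k - 1]


# Before: one shared threshold gives very different selection rates.
shared = 0.6
before = {g: selection_rate(v, shared) for g, v in scores.items()}

# After: a per-group threshold chosen so both groups hit the same rate.
target = 0.4
thresholds = {g: threshold_for_rate(v, target) for g, v in scores.items()}
after = {g: selection_rate(v, thresholds[g]) for g, v in scores.items()}

print("before:", before)
print("per-group thresholds:", thresholds)
print("after:", after)
```

The design trade-off is that equalizing selection rates this way usually changes accuracy for at least one group, which is why the talk pairs fairness metrics with performance metrics.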

Looking forward, one thing that I want to mention is that, while I went through the different parts of identify, diagnose, and mitigate, we have brought every single tool I just presented under one roof, and that is called the Responsible AI dashboard. The Responsible AI dashboard is a single pane of glass, bringing together a variety of these tools with the same set of APIs and a customizable dashboard. You can do both model debugging and responsible decision making with it, depending on how you customize it and what you pass to it. Our next steps are to expand the portfolio of Responsible AI tools to non-tabular data, and to enable Responsible AI reports for non-technical stakeholders. We have some exciting work on PDF reports you can share with your regulators, risk officers, and business stakeholders. We are working on enabling model monitoring at scoring time, to bring all these capabilities beyond evaluation time, and make sure that as the model sees unseen data, it can still detect some of these fairness issues, reliability issues, and interpretability issues. We're also working on a compliance infrastructure, because we all know that nowadays there are so many stakeholders involved in the development, deployment, testing, and approval of an AI system. We want to provide the whole ecosystem to you.


We believe in the potential of AI for improving and transforming our lives. We also know there is a need for tools to assist data scientists, developers, and decision makers to understand and improve their models to ensure AI is benefiting all of us. That's why we have created a variety of tools to help operationalize Responsible AI in practice. Data scientists tend to use these tools together in order to holistically evaluate their models. We are now introducing the Responsible AI dashboard, which is a single pane of glass, bringing together a number of Responsible AI tools. With this dashboard, you can identify model errors, diagnose why those errors are happening, and mitigate them. Then, provide actionable insights to your stakeholders and customers. Let's see this in action.

First, I have here a machine learning model that can predict whether a house will sell for more than the median price or not, and provide the seller with some advice on how best to price it. Of course, I would like to avoid underestimating the actual price, as an inaccurate price could impact seller profits and the ability to access finance from a bank. I turned to the Responsible AI dashboard to look closely at this model. Here is the dashboard. I can do, first, error analysis to find issues in my model. You can see it has automatically separated the cohorts with error counts. I found out that bigger old houses have a much higher error rate of almost 25%, in comparison with large new houses that have an error rate of only 6%. This is an issue. Let's investigate that further. First, let me save these two cohorts. I save them as new and old houses, and I go to the model statistics for further exploration. I can take a look at the accuracy, false positive rate, and false negative rate across these two cohorts. I can also observe the prediction probability distribution and see that older houses have a higher probability of getting a prediction of less than median. I can further go to the Data Explorer and explore the ground truth values behind those cohorts. Let me set that to look at the ground truth values. First, I will start from my new houses cohort. As you can see here, most of the newer homes sell for a higher price than the median. It's easy for the model to predict that and get a higher accuracy for that. If I switch to the older houses, as you can see, I don't have enough data representing expensive old houses. One possible action for me is to collect more of this data and retrain the model.

Let's now look at the model explanations and understand how the model has made its predictions. I can see that the overall finish quality, above ground living room area, and total basement square footage are the top three important factors that impact my model's predictions. I can further click on any of these, like overall finish quality, and understand that a lower finish quality impacts the price prediction negatively. This is a great sanity check that the model is doing the right thing. I can further go to the individual feature importance, click on one or a handful of data points, and see how the model has made predictions for them. Further, when I come to the what-if counterfactual, what I am seeing here is that for any of these houses, I can understand the minimum change I can apply so that the model predicts the opposite outcome. Take, for instance, this particular house, which actually has a high probability of getting the prediction of less than median. Looking at the counterfactuals for this one, only if the house had a higher overall quality, going from 6 to 10, would the model predict that this house would sell for more than median. To conclude, I learned that my model is making predictions based on factors that make sense to me as an expert, and that I need to augment my data on the expensive old house category, and even potentially bring in more descriptive features that help the model learn about an expensive old house.

Now that we understood the model better, let's provide house owners with insights as to what to improve in these houses to get a better price ask in the market. We only need some historic data of the housing market to do so. Now I go to the causal inference capabilities of this dashboard to achieve that. There are two different functionalities that could be quite helpful here. First, the aggregate causal effect which shows how changing a particular factor like garages, or fireplaces, or overall condition would impact the overall house price in this dataset on average. I can further go to the treatment policy to see the best future intervention, say switching it to screen porch. For instance, here I can see for some houses, if I want to invest in transforming a screen porch, for some houses, I need to shrink it or remove it. For some houses, it's recommending me to expand on it. Finally, there's also an individual causal effect capability that tells me how this works for a particular data point. This is a certain house. First, I can see how each factor would impact the actual price of the house in the market. I can even do causal what-if analysis, which is something like if I change the overall condition to a higher value, what boost I'm going to see in the housing price of this in the market.
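The aggregate causal effect relies on the double machine learning idea mentioned earlier in the talk. A minimal sketch of that idea follows, under loud assumptions: the data is synthetic with a known true effect, there is a single confounder, and both nuisance models are one-variable linear regressions fitted with two-fold cross-fitting. It shows the partialling-out arithmetic only and is not the dashboard's implementation.

```python
import random

random.seed(0)

# Synthetic data with a known effect (all values are assumptions for illustration).
TRUE_EFFECT = 2.0
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]                # confounder
t = [0.8 * xi + random.gauss(0, 1) for xi in x]           # treatment depends on x
y = [TRUE_EFFECT * ti + 3.0 * xi + random.gauss(0, 1)     # outcome depends on both
     for ti, xi in zip(t, x)]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx


def double_ml(x, t, y, folds=2):
    """Partialling-out estimate of the effect of t on y with cross-fitting."""
    t_res, y_res = [0.0] * len(x), [0.0] * len(x)
    for k in range(folds):
        train = [i for i in range(len(x)) if i % folds != k]
        test = [i for i in range(len(x)) if i % folds == k]
        # Nuisance models: predict treatment and outcome from the confounder.
        st, it = fit_line([x[i] for i in train], [t[i] for i in train])
        sy, iy = fit_line([x[i] for i in train], [y[i] for i in train])
        # Residuals are computed only on data the nuisance models did not see.
        for i in test:
            t_res[i] = t[i] - (st * x[i] + it)
            y_res[i] = y[i] - (sy * x[i] + iy)
    # Final stage: regress outcome residuals on treatment residuals.
    return sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)


naive = fit_line(t, y)[0]       # ignores the confounder, so it is biased
estimate = double_ml(x, t, y)

print(f"true effect: {TRUE_EFFECT}, naive slope: {naive:.2f}, double ML: {estimate:.2f}")
```

The naive regression attributes part of the confounder's influence to the treatment, while the residual-on-residual step removes what the confounder explains before estimating the effect.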


We looked at how these tools help you identify and diagnose errors in a house price prediction model and make effective data-driven decisions. Imagine if this was a model that predicted the cost of healthcare procedures, or a model to detect potential money laundering behavior. Identifying, diagnosing, and making effective data-driven decisions would have even higher consequences on people's lives there. Learn more about the tool on, and try it on Azure Machine Learning to boost trust in your AI driven solutions.

Questions and Answers

Breviu: Ethics in AI is something I'm very passionate about. There's so much harm that can be done if it's not thought about. I think that showing the different tools and the different kinds of thought processes that you have to go through in order to make sure that you're making models that are going to not only predict well for accuracy, but also that they're not going to cause harm.

Sameki: That is absolutely true. I feel like the technology is not going to slow down. We're just starting with AI and we're expanding on its capabilities and including it in more aspects of our lives, from financial scenarios, to healthcare scenarios, to even retail, our shopping experience and everything. It's even more important to have technology that is accompanying that fast growth of AI and is taking care of all those harms in terms of understanding them, providing solutions or mitigations to them. I'm quite excited to build on these tools and help different companies operationalize this super complicated buzzword in practice, really.

Breviu: That's true. In so many companies, they might want to do it, but they don't really know how. I think it's cool that you showed some of the different tools that are out there. There was that short link that you provided that was to go look at some of the different tools. You also mentioned some new tooling that is coming out, some data tooling.

Sameki: There are a couple of capabilities. One is, we completely realized that the model story is incomplete without the right data tools, or data story. Data is always a huge part, probably the most important part of a machine learning lifecycle. We are also accompanying this with more sophisticated data exploration and data mitigation library, which is going to land under the same Responsible AI toolbox. That will help you understand your data balances, and that also provides lots of APIs that can rebalance and resample parts of your data that are underrepresented. Besides this, at Microsoft Build, we're going to release a variety of different capabilities of this dashboard integrated inside our Azure Machine Learning. If your team is on Azure Machine Learning, you will get easy access, not just to this Responsible AI dashboard and its platform, but also a scorecard, which is a report PDF, summarizing the insights of this dashboard for non-technical stakeholders. It was quite important for us to also work on that scorecard because there are tons of stakeholders involved in an end-to-end ML lifecycle. Many of those are not super data science savvy or super technical. There might be surgeons. There might be financial experts. There might be business managers. It was quite important for us to also create that scorecard to bridge the gap between super technical stakeholders and non-technical stakeholders in an ML lifecycle.

Breviu: That's a really good point. You have the people that understand the data and how to build the model, but they might not understand the business application side of it. You have all these different people that need to be able to communicate and understand how their model is being understood. It's cool that these tools can do that.

You talked about imbalanced data as well. What are some of the main contributing factors to ethical issues within models?

Sameki: Definitely, imbalanced data is one of them. That could mean many different things. You are completely underrepresenting a certain group in your data, or you are representing that group, but that group in the training data is associated with unfavorable outcomes. For instance, you have a certain ethnicity in your loan allocation AI dataset, however, all of the data points that you have from that ethnicity happen to have rejection on their loans. The model creates that association between that rejection and belonging to that ethnicity. Either not representing a certain group at all, or representing them but not checking whether they are represented well in terms of the outcome that is affiliated with them.

There are some other interesting things as well, after the data, which is probably the most important issue. Then there is the issue of problem definition. Sometimes you're rushing to train a machine learning model on a problem, and so you're using the wrong proxies as a predictor for something else. To give you a tangible example, imagine you have a particular model that you're training in order to assign different risk scores to neighborhoods, like security scores. Then you ask, how is that model trained? Imagine that the model is trained on data that comes from police arrest records. Just using arrest records as a proxy for the security score of a neighborhood is a very wrong assumption to make, because we all know that policing practices, at least in the U.S., are quite unfair. It might be the case that there are more police officers deployed to certain areas where certain ethnicities live, and far fewer police officers deployed to other areas where other ethnicities reside. Just because there are more police officers there, there might be more reporting of even misdemeanors, or of something that a police officer didn't like, or whatever. That will bump up the number of arrest records. Using that purely as a proxy for the safety score of that neighborhood has the dangerous outcome of creating an association between the race residing in that neighborhood and the security of that neighborhood.

Breviu: When those kinds of questions come up, I think about, are we building a model that even should be built? Because there's two kinds of questions when it comes to ethics in AI. It's, is my model ethical? Then there's the opposite, is it ethical to build my model? When you're talking about arrest records, and that kind of thing, and using that, I start worrying about, what is that model going to actually do? What are they going to use that model for? Is there even a fair way to build the model on that type of data?

Sameki: I absolutely agree. A while ago, there was this project from Stanford, it was called Gaydar. It was a project, which was training machine learning models on top of bunch of photos that they had recorded and captured from the internet and from different public datasets. The outcome was to predict whether the person is belonging to the LGBTQ community or not, or gay or not. At that time, when I saw that I was like, who is supposed to use this and for what reason? I think that started getting a lot of attention in the media that, we know that maybe AI could do things like that, questionable, but maybe. What is the point of this model? Who is going to use it? How are we going to guarantee that this model is not going to be used to basically perpetuate biases, stuff like that, against the LGBTQ community that are historically marginalized? There are tons of deep questions that we have to ask that whether machine learning is an appropriate thing to do for a problem, and what type of consequences it could have. If we do have a legit case for AI, could be helpful to make processes more efficient, could be more helpful to expedite certain super lengthy processes. Then we have to accompany it with enough checks and balances, scorecards, and also terms and services as how people use that model. Make sure that we do have a means of hearing other people's feedback in case they observe this model being misused in bad scenarios.

Breviu: That's a really good example of one that just shouldn't have happened. It always tends to be the marginalized or oppressed parts of society that are hurt the most, and oftentimes they aren't the ones that are even involved in building it, which is one of the reasons why having a diverse set of engineers matters for these types of models. Because I guarantee you, if you had somebody that was part of that community building that model, they probably would have said, this is really offensive.

Sameki: They would catch it. I always knew about the focus of the companies on the concept of diversity and inclusion before I joined this Responsible AI effort, but now I understand it from a different point of view that, it matters, that we have representation from people who are impacted by that AI in the room to be able to catch these harms. This is an area where growth mindset is the most important. I am quite sure that even if we are systematic engineers that truly care about this area and put all of these checks and balances, stuff happens still. Because this is a very sociotechnical area where we cannot fully claim that we are debiasing a model. This is a concept that has been studied by philosophers and social scientists for centuries. We can't come up suddenly out of the tech world and say, we've found a solution for it. I think progress could be made to figure out these harms, catching it early on, diagnosing why those happen. Mitigating them based on your knowledge, and documenting what you could not resolve and put some diverse groups of people in the decision making to catch some of those mistakes. Then, have a very beautiful feedback loop where you capture some thoughts from the audience and you are able to act fast and also very solid monitoring lifecycle.

Breviu: That's actually a good point, because it's not only just the ideation of it, should I do this? Ok, I should. Now I'm building it, now, make sure that it's ethical. Then there's the data drift and models getting stale and needing to monitor what's happening in [inaudible 00:35:32], so make sure that it continues to be able to predict well, and do that.

Any of these AI tools that you've been showing, are they able to be used in a monitoring format as well?

Sameki: Yes. Most of these tools could be, for instance, the interpretability. We do have support for scoring-time interpretability, which basically allows you to call the deployed model, get the model predictions, and then call the deployed explainer and get the model explanations for that prediction at runtime. The fairness and error analysis pieces are a little trickier. For fairness, you can specify the favorable outcome, and you can keep monitoring that favorable outcome distribution across different ethnicities, different genders, different sensitive groups, whatever that means to you. For the rest of the fairness metrics, or error analysis, and things like that, you might need to periodically upload some labeled data: take a piece of your new data, maybe use crowdsourcing or human labelers to label it, and then pass it in. The general answer is yes. There are some caveats. We're also working on a very strong monitoring story that works around these caveats and helps you monitor that during runtime.
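Monitoring the favorable outcome distribution needs no ground-truth labels, so it can run directly on the scoring stream. The sketch below is one hypothetical way to do it: a sliding window of recent predictions per group and an alert when the gap between groups exceeds a limit. The class name, window size, and gap limit are assumptions for illustration and do not correspond to an Azure Machine Learning API.

```python
from collections import defaultdict, deque


class FavorableRateMonitor:
    """Tracks the favorable-outcome rate per group over a sliding window."""

    def __init__(self, window=100, max_gap=0.2):
        self.max_gap = max_gap
        # One bounded queue per group; old predictions drop out automatically.
        self.recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, group, prediction, favorable=1):
        """Store whether this prediction was the favorable outcome."""
        self.recent[group].append(1 if prediction == favorable else 0)

    def rates(self):
        """Current favorable-outcome rate for each group seen so far."""
        return {g: sum(v) / len(v) for g, v in self.recent.items() if v}

    def gap(self):
        """Difference between the highest and lowest group rate."""
        current = self.rates()
        if len(current) < 2:
            return 0.0
        return max(current.values()) - min(current.values())

    def alert(self):
        """True when the gap between groups exceeds the configured limit."""
        return self.gap() > self.max_gap


# Hypothetical stream of (group, prediction) pairs arriving at scoring time.
monitor = FavorableRateMonitor(window=4, max_gap=0.3)
stream = [("a", 1), ("b", 1), ("a", 1), ("b", 0),
          ("a", 0), ("b", 0), ("a", 1), ("b", 0)]
for group_name, prediction in stream:
    monitor.record(group_name, prediction)

print("rates:", monitor.rates())
print("gap:", monitor.gap(), "alert:", monitor.alert())
```

Metrics that depend on correctness, such as per-group accuracy, cannot be computed this way and still require the periodically labeled sample described above.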

Breviu: Another example, one of the ones I've seen that makes me uncomfortable, is machine learning models as part of the interview process. It's one that actually happens a lot. There are already so many microaggressions and unconscious biases in the interview process, and I've read so many stories about using a model like this, even just on resumes, and how quickly it actually becomes biased. How do you feel about that particular type of use case? Do you think these tools can work on that type of problem? Do you think we could solve it enough to where it would be ethical to use it in the interviewing process?

Sameki: I have seen some external companies that are using AI in candidate screening, and they have been interested in using the Responsible AI tools. LinkedIn is now also part of the Microsoft family. I know LinkedIn is also very careful about how these models are trained and tested. I actually think these models could be great initial proxies to figure out some better candidates. However, if you want to trust the top-ranked candidates, it's super important to understand how the model has picked them, and so to look at the model explainability, because often there have been cases of associations.

There are two examples that I can give you. I remember that once there was this public case study from LinkedIn. They had trained a model for job recommendations, how you go to LinkedIn and it says, apply for this and this. Then they realized early on that one of the ways the LinkedIn algorithm was using the profiles to match them with job opportunities was whether the person was providing enough description about what they are doing and what they are passionate about. You have a bio section, and then you have your current position, which you can add text to. Then there was a follow-up study by LinkedIn which mentioned that women tend to share fewer details there, so in a way women tend to market themselves in a less savvy way compared to men. That's why men were getting better quality recommendations and a lot more matches compared to non-male users, basically women and non-binary people. That was a great wake-up call for LinkedIn: ok, this algorithm is doing this matching, and we have to change it so that it does not put too much emphasis on that text. It's great that they have this extra commentary and whatever. First of all, we have to maybe provide some recommendations to people who have not filled in those sections, like, your profile is this much complete, the way they give you signals to go and add more context.

Also, we have to revisit our algorithms to really look at the bare minimum stuff, like the latest position posted and experiences. Even then, women go on maternity leave and family care leave all the time. I still feel like when we have these companies receiving so many candidates and resumes, there is some role that AI could play to bring some candidates up. However, before deploying it in production, we have to look at the examples. We also have to have a little team of diverse stakeholders in the loop to get those predictions and look at them from the point of view of diversity and inclusion and from the point of view of explainability of the AI, and to intervene with some human rules in order to make sure it's not unfair to some underrepresented candidates.

Breviu: That speaks to the interesting thing that I think you said at the beginning, which is how the errors are not evenly distributed throughout the data. This is an example where your model might get a really great accuracy when you look at it holistically, and then you realize that on the non-male side, for women and non-binary people, it has a very high error rate. That's a really good example of that point you made in the beginning, which I found really interesting. Because many times when we're building these models, we're looking at our overall accuracy rate and our validation and loss score. Those look at it as a holistic thing, not necessarily on an individual basis.

Sameki: It's very interesting, because many people use platforms like Kaggle to learn about applied machine learning. Even on those platforms, we often see scoreboards where one factor is used to pick the winner, like accuracy of the model, area under the curve, whatever that might be. That implicitly gives the impression that, ok, there are a couple of proxies, and if they're good, great, go ahead and deploy. I think that's the mindset we would love to change in the market through these types of presentations. It's great to look at your model goodness, accuracy, false positive rate, all those metrics that we're familiar with for different types of problems. However, they're not sufficient to tell you about the nuances of how that model is truly impacting underrepresented groups. Or any blind spots, they're not going to give you the blind spots. It's not even always about fairness. Imagine you realize that your model is 89% accurate or 95% accurate, but you realize that those 5% errors happen every single time this autonomous car AI is running, the weather is foggy and dark and rainy, and the pedestrian has a darker skin tone and is wearing dark clothes. Ninety-nine percent of the time the pedestrian is missed. That's a huge safety and reliability issue that your model has. That's a huge blind spot that is potentially killing people. If you go with one score about the goodness of the model, you're missing the important information that your model has these blind spots.
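
The point about a single score hiding blind spots can be made concrete with a small Python sketch. The data below is synthetic and invented purely for illustration, and the helper names `accuracy` and `error_rate_by_cohort` and the cohort labels are assumptions, not part of any tool shown in the talk. The sketch computes one overall accuracy number and then breaks the error rate down by cohort.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def error_rate_by_cohort(y_true, y_pred, cohorts):
    """Fraction of wrong predictions within each cohort."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, p, c in zip(y_true, y_pred, cohorts):
        totals[c] += 1
        if t != p:
            errors[c] += 1
    return {c: errors[c] / totals[c] for c in totals}


# Synthetic example: 100 frames that all contain a pedestrian (label 1).
# The hypothetical detector misses every one of the 5 foggy, dark frames.
y_true = [1] * 100
y_pred = [1] * 95 + [0] * 5
cohorts = ["clear"] * 95 + ["foggy_dark"] * 5

overall = accuracy(y_true, y_pred)
by_cohort = error_rate_by_cohort(y_true, y_pred, cohorts)

print(f"overall accuracy: {overall:.2f}")
for cohort, rate in by_cohort.items():
    print(f"error rate in {cohort}: {rate:.2f}")
```

In this synthetic case the headline accuracy looks healthy while one cohort fails on every example, which is the pattern a disaggregated view is meant to surface.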

Breviu: I think your point about Kaggle and ethics just shows that ethics was an afterthought in a lot of this, and that's why these popular platforms don't necessarily have those tools built in, as Azure Machine Learning does. As we progress, people are also realizing more about data privacy. As data scientists, we've always understood the importance of data privacy, and now it's becoming more mainstream. I think that part, along with understanding ethics more, really will change the way that people build models and think about building models. I think AI is going to keep moving forward exponentially, in my opinion. It needs to move forward in an ethical, fully thought-out way.

Sameki: We build all of these tools in open source first, to help everyone explore these tools, augment them with us, build on them, bring their own capabilities and components, and put them inside that Responsible AI dashboard. If you're interested, check out our open source offering and send us a GitHub issue, send us your request. We are quite active on GitHub, and we'd love to hear your thoughts.




Recorded at:

Mar 31, 2023