Transcript
Silberman: At NewBank we like to be data driven, so we collect a lot of data. Annually, we ask our customers to allow us to collect this data. The thing is, we have a lot of personal information: as a credit card company, we have information on how you use the app. If you allow GPS data, we know where you are; when you make a purchase, we have all of that information. Today, if you think about it, there are two important issues that start to arise as you collect this massive amount of data. The first one is privacy, and the second one is that companies are going to use your data to make decisions about you. Here, we are not going to talk much about privacy, but more about how we, as a company, can make sure that we are not discriminating against anyone and are treating all our customers fairly.
Fairness
As a bit of history, this is a real graph in the sense that those bars represent the actual number of papers published up to 2017 - I didn't update it. You can see that at the beginning, everyone was saying, "Well, we don't have that much data, fairness is not really a concern for us. We may just use small datasets, we just have statistics." Then at some point you realize Google is amassing this huge amount of data, and there is a lot of value in collecting all of it. Now, a lot of people think something is happening: we need to do more research on fairness, because it's something that impacts a lot of people. Why is it impacting a lot of people? The way we do business today is really changing compared to 10, 20 years ago. 20 years ago, you might have just gone to some merchant to get a loan. Even in hiring, you just talked to people; the person in front of you didn't really know you, the only thing they knew was maybe your resume. Today, you are asking this customer to give you a lot of information.
For hiring, some companies are going to go on Facebook and start to look at what kind of pictures you have there. Then you are going to say, "I don't really want this company to know that last Saturday I went drinking. It has nothing to do with whether I can be a good candidate or not." What we have here are a lot of different areas: mortgage lending, even prison sentencing. It's like Minority Report - trying to predict in advance what people are going to do, to be able to stop a crime before it even happens. The thing is, in those examples, and especially if we take prison sentencing, you are putting the lives of people on the line. You are going to decide whether this person is going to spend 2 years, 5 years, 10 years in prison, and it has a really big impact on someone's life. At a different level, in the case of NewBank, the issues are usually more like, "Should we approve this customer? Should we increase someone's credit line?"
This is at another level, but it also impacts someone's life, because at NewBank we really want to improve the lives of our customers by giving them the best credit line, so they have access to credit, they are able to pay their bills, they are able to do things they couldn't do before.
There are some interesting papers that compile all these different definitions of fairness, so you will see in this presentation a lot of different metrics and different ways you can compute them. This is a good representation, a good summary of everything you can see. There are around 20, 21 definitions of fairness. Something interesting is that it's impossible to satisfy all those definitions at once. In some sense, even if you want to be the good guy and make sure you are not biased against your customers, in some way you will always bias against some part of your population. It's going to be pretty hard, and the message I want to pass on is that we can still try. It's going to be hard, maybe we are not going to succeed for every customer, but if you don't try, it means you are not going to do a good job. As a data scientist, you have this responsibility to make sure you are being fair with the customers that you have.
If you look at the second point, you can see that the research is very active. There are whole conferences being created just on fairness, ethics, and transparency, for example. Today, there is no single clear metric that we can say everyone should use, because depending on who you are and what you are doing, it's going to be really different. It depends on what you are going to do.
Example: Prison Sentencing
Let's take this example of prison sentencing. For those who don't know what this square is: in machine learning it would be called a confusion matrix. If you look at this row, what you have is the people that you are labeling, that you are predicting to be low risk. You got all this data about who this person is - maybe you know their gender, where they come from, what crime they committed, whether they committed other crimes before or not. Then you label this person as either low risk or high risk. What happens next is you take the decision: am I going to send them to prison, or am I going to keep them inside society? Then, after some point, maybe you say, "Ok, I'm going to look at this person again after one year." After one year you see: did this person commit a crime again? Either they recidivated, or maybe they didn't do anything. Then, depending on what happened: if they were labeled high risk and they recidivated, you were right, so it's a true positive. If they were labeled high risk but they didn't recidivate, it was a false positive - you took the action, but in fact the person didn't do anything. We are going to take that, and we are going to see that depending on who you are, the definition of fairness is going to be really different.
Let's say you are a decision maker and what you really care about is just putting the high-risk people into prison, because those are the most dangerous people and that's the only thing you want. Then you only care about this column of high-risk people. If you are a defendant, what you care about is that you don't want an innocent person put in jail. Then what you care about is all the people that did not recidivate, and whether your model labeled them as low-risk or high-risk.
Then you have society: what we really want is to be fair with everyone. We don't want to put innocent people in jail, but we also want to make sure that all the really bad people go to jail. You have to figure out how to satisfy everyone.
This is taken from Wikipedia. This small square that you see here is the confusion matrix, and as you can see, you can derive something like 18 scores from it. You can look at the true positives and true negatives, and then there is everything derived from that: false positive rate, specificity, prevalence, accuracy. Each of those scores is something you might care about and that can be useful for the decision you are taking.
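To make those derived scores concrete, here is a minimal Python sketch. The counts are hypothetical, chosen only to illustrate how a few of the rates fall out of the four cells of the confusion matrix (high-risk label versus observed recidivism):

```python
# Toy confusion-matrix counts (made-up numbers, for illustration only).
# tp: labeled high risk and recidivated, fp: labeled high risk but didn't,
# fn: labeled low risk but recidivated, tn: labeled low risk and didn't.
tp, fp, fn, tn = 45, 15, 10, 130

total = tp + fp + fn + tn
metrics = {
    "accuracy": (tp + tn) / total,
    "true positive rate (sensitivity/recall)": tp / (tp + fn),
    "false positive rate": fp / (fp + tn),
    "true negative rate (specificity)": tn / (tn + fp),
    "positive predictive value (precision)": tp / (tp + fp),
    "prevalence": (tp + fn) / total,
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```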
Terminology
Before we continue: in this field of fairness, as you saw, there is a terminology you need to understand a bit to be able to read all those papers. The first term is what we call the favorable label - the label or target, if you are doing modeling, that provides an advantage to someone. In our examples, that would be being hired, not being arrested, receiving a loan, or being accepted at NewBank.
Then you have the protected attribute. This attribute is the thing you want to make sure you are not discriminating on. Here, we might talk about race, gender, religion - there are a lot of possibilities. What's interesting is that it usually depends on what you are working on. Some of them might not even be discriminatory, because for some specific position - I cannot think of one right now - you might really only want a male, for example, or a female. In some sense, the protected attribute is not universal; it's going to be very specific to the application you are working on.
The privileged value: when you look at a protected attribute like race, gender, or religion - if you look at race, the privileged value might be being white and the unprivileged value being black. For gender, it might be male versus female.
Then you have group fairness, where the goal is usually to make sure that each of the groups in those protected attributes is treated in the same way. You're not giving an advantage to one group compared to the others.
Individual fairness is usually about making sure that similar individuals get the same kind of output or treatment from the model that you just built.
Bias is what we are going to talk about today. It's the systematic error: we get data, but this data was generated by humans before us. What usually happens is that humans are fallible; we can make bad decisions, or decisions based on, say, "this morning I woke up and felt it wasn't a good day. I'm going to talk to all these candidates, and maybe the whole day I'm going to hate everyone; even if they are good, I'm just going to say, you're disqualified." In the end, it wasn't based on anything in the data, just on your feelings. That's what we want to remove from the system.
We can tell whether we are doing a good job removing this bias by looking at some metrics. We are going to talk about fairness metrics. Usually these are just metrics that quantify how much we are being biased, or unfair, towards some specific group of people.
One of the solutions we are going to talk about is what we call bias mitigation algorithms. The goal is that, at some point, you can leave this room and say, "Ok, there is still hope. Maybe I can do something about all the data that I have, and by the end of this whole process my predictions are going to be fair to people." But wait, there is this issue. You might tell yourself, "I've been here before and I know that it is never good to discriminate on gender, race or religion. I am just removing every feature like race and gender, so I'm being fair. The feature is not there, so you cannot tell me I'm being unfair, because I'm not even using it." Usually, that's not really a good argument, and you can see it here. If you don't know it, this is a map of the Chicago area. Here in the middle is the center of Chicago, usually the financial district, so you don't have many people living there. Then all around you see the different neighborhoods, and every point you see is a person. You can see that most of those neighborhoods are already segregated. In the north of Chicago, it's well known that it's mostly a white population living there. In the south of Chicago it's usually a black population, and in the west it's usually a Hispanic and black population.
Let's say you are doing some analysis and you say, "I'm removing the race of the person," but I'm a credit card company. At some point, I need to send a credit card to someone, so I need this person's address. Then maybe you are a data scientist thinking, "My model has been stuck at this performance for a while and it's not doing that well. Maybe what I can do is take all this geolocation data about people and use it, and I'm pretty sure I'm going to get a pretty good bump in my model."
What you can see here is that you are not using race, but there is this huge segregation. A random forest is going to be pretty good at just splitting here and saying, "All the people south of here are going to be one population. The people between here and here are going to be another population." In fact, your model is going to learn where people live, and at the end of the day it's just going to learn the race of the people living in each neighborhood, which is pretty bad.
Fairness Metric
Let's go back to those fairness metrics. This is the confusion matrix; usually it's the easiest place to start when you are doing this kind of analysis. You have all those different quantities - I'm not going to explain everything, but those are true positives, false positives, true negatives, false negatives. Then you can have fancier metrics, like difference of means, disparate impact, statistical parity, odds ratio, consistency, generalized entropy index. Let's take some of them.
If we talk about statistical parity difference, you can find different names for it in the literature: group fairness, equal acceptance rates, or benchmarking. What you want to do here is look at the predictions you have at the end. Let's say you are hiring: you look at the probability of being hired for each of the groups you want to control for - let's say the groups are defined by gender. Those probabilities should be the same, or at least pretty close to each other, because if you have a really huge difference, it means something is wrong and you need to unbias this data. Then, at the end of the day, the model you are training is going to be unbiased and you are going to make better decisions for those groups of people.
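A minimal sketch of that computation, with a made-up hiring table (the column names and values are hypothetical): statistical parity difference is just the difference in favorable-outcome rates between the unprivileged and privileged groups.

```python
import pandas as pd

# Hypothetical hiring predictions: 'gender' is the protected attribute,
# 'hired' is the favorable predicted outcome (1 = hired).
df = pd.DataFrame({
    "gender": ["F", "M", "M", "F", "M", "F", "M", "M", "F", "M"],
    "hired":  [1,    1,   0,   0,   1,   1,   1,   0,   0,   1],
})

# P(hired = 1 | unprivileged) - P(hired = 1 | privileged)
p_unpriv = df.loc[df.gender == "F", "hired"].mean()
p_priv = df.loc[df.gender == "M", "hired"].mean()
spd = p_unpriv - p_priv

print(f"Statistical parity difference: {spd:.2f}")  # 0 means equal acceptance rates
```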
The other one is what we call disparate impact. A lot of these metrics are just byproducts of the confusion matrix. Say you have your predicted values and your true groups - either male or female - and you look at the ratio of favorable outcomes between the groups. There is this really funny 80% rule: if you take the ratio between one group and the other, you want it to always be above or equal to 80%. We have that because it comes from a rule from the 1970s in the U.S.: when companies hired people, to make sure they weren't discriminating against a gender or a race, they had to make sure that this ratio, computed on the group they were hiring, was always above 80%.
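Continuing the hypothetical hiring sketch above (it reuses the `p_unpriv` and `p_priv` rates computed there), disparate impact is a ratio rather than a difference, and the 80% rule is simply a threshold check on that ratio:

```python
# Disparate impact: ratio of favorable-outcome rates,
# unprivileged group over privileged group.
di = p_unpriv / p_priv
print(f"Disparate impact: {di:.2f}")  # 1 means both groups are hired at the same rate

# The '80% rule': the ratio should be at least 0.8.
print("Satisfies the 80% rule:", di >= 0.8)
```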
There are some libraries, and I am going to talk about them a bit later. Aequitas is one of them, and they have this approach where they give you a decision tree: "You want to make a decision; let's go through this tree, where the first question is, do you want to be fair based on disparate representation or based on disparate errors of your system?" Then you choose: "I care more about representation," or "I care more about the errors." Let's say I care more about representation.
Then, "Do you need to select an equal number of people from each group or proportional to their percentage in the overall population?" Let's say here, what we care about is the proportional because, let's say, we have this problem in DataSense where I am doing hiring and in hiring usually, the number of resumes I get from female candidates is going to be way lower than male candidates. Usually, it's just because in the population of people doing study in the area of DataSense, machine learning, you have this really imbalance where you have more men than women. Here, what I want is to select the number of resumes I have and usually, I have only 50% of those resume that are from female members. What I want to do is, to be really fair and have 50/50 but it's going to be really hard for me because in some sense, my proportion inside the population is only 50%. What I would want to do is I would want to care about the proportionally and not have an equal number. Here, at the end of the day what I want is ook at this metric which is proportional parity. What we called this before is the disparate impact. If I look at the ratio, I should have this ratio of 80%.
How About Some Solutions?
How about some solutions? So far, I talked about the metrics, about how you can tell if your dataset is biased, but what you care about is: if I know it, what can I do to change it and be a bit fairer with people? You can use a disparate impact remover, do some relabelling, learn fair representations, or maybe you can do reject option classification, adversarial debiasing, reweighing, or an additive counterfactually fair estimator. There are a lot of them.
At the end of the day, I advise you not to re-implement everything yourself, because it's going to be really hard. Today in data science you have all these really good libraries, and this is a list of the ones that are available - I think it's pretty extensive, it covers the main, more popular ones. The ones I put in red are the ones I tested, and I feel they are pretty good at what they do. Most of them you can just import, and if you're using a tool like scikit-learn, there is some kind of compatibility. One of the tools I am going to talk about is AIF360, which is a tool from IBM. What's interesting with this tool is that it is compatible with scikit-learn: it's based on the fit, transform and predict paradigm. It's really easy to incorporate it into a pipeline you might already have, run the whole pipeline, and end up with a fair prediction at the end.
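As a hedged sketch of that fit/transform style, here is roughly what the AIF360 documentation's typical usage looks like, measuring bias on the Adult census dataset and applying the Reweighing pre-processor. It assumes `pip install aif360` and that the raw Adult census files the `AdultDataset` loader expects have already been downloaded; the group encodings follow the dataset's defaults.

```python
from aif360.datasets import AdultDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# In this dataset's default encoding, sex = 1 is the privileged value.
privileged_groups = [{'sex': 1}]
unprivileged_groups = [{'sex': 0}]

dataset = AdultDataset()

# Measure bias on the original dataset.
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)
print("Statistical parity difference:", metric.statistical_parity_difference())
print("Disparate impact:", metric.disparate_impact())

# Pre-process with reweighing, scikit-learn style: fit, then transform.
rw = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
rw.fit(dataset)
dataset_transf = rw.transform(dataset)

metric_transf = BinaryLabelDatasetMetric(dataset_transf,
                                         unprivileged_groups=unprivileged_groups,
                                         privileged_groups=privileged_groups)
print("After reweighing:", metric_transf.statistical_parity_difference())
```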
Let's go back to some solutions. How can we fix these predictions? Something really interesting is that there are three places where you can intervene. As you know, when you are building models, there are roughly three steps. The first step is building your dataset; there, before you even train the model, you can add a step where you make some changes to the dataset, so that when you feed it to your algorithm it's going to be a bit fairer.
Then you have in-processing. With in-processing, you look directly at the model you are going to choose, and you make some transformation to it, so that the way it learns from your data is, again, fairer than if you didn't do anything. Then, after you train your model, you have your predictions, and this is one more place where you have the possibility to look at those predictions and apply some transformations so they are fairer to people.
There is this diagram that is pretty cool; I took it from the AIF360 paper. Like always, you start from raw data, then you have your original dataset. If you follow good practices in building models, you usually split it into a training set, a validation set and a testing set. From here, you have the three different paths we talked about before. One is pre-processing: you take the original dataset, pass it through the method you have, and you get a transformed dataset that is usually a bit better to feed into your classifier. At the end, you end up with a fair predicted dataset.
Otherwise, you go back: you have the raw data, you do your different splits, and you can feed them into a fair classifier. This fair classifier does the work - I am going to explain each of these a bit later. Same thing: you do the fair in-processing and you end up with a fair predicted dataset. Or, the last path: you go back to the original dataset, do the splits, select a regular classifier, get your predicted dataset, do some post-processing, and you end up again with a fair predicted dataset.
Pre-Processing, In-Processing and Post-Processing
There are different methods we can use for each of these. In pre-processing, the most well-known in the field is reweighing. You take your training set, and the only thing you do is say that each of the samples in this training set is going to have a different weight. Those weights are chosen relative to what you care about: let's say you have an under-representation of women in some dataset, you are going to make their weights higher than usual, and it's going to reduce the weights for all the men in the same way.
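A small sketch of how those weights can be chosen, in the style of the Kamiran and Calders reweighing scheme: each (group, label) cell is weighted so that group membership and label look statistically independent, which up-weights under-represented combinations. The table and column names are hypothetical; the resulting weights could then be passed as `sample_weight` to, say, a scikit-learn classifier's `fit`.

```python
import pandas as pd

# Hypothetical training set: 'sex' is the protected attribute,
# 'label' is the favorable outcome (1 = hired).
train = pd.DataFrame({
    "sex":   ["F", "F", "F", "M", "M", "M", "M", "M", "M", "M"],
    "label": [0,    0,   1,   1,   1,   0,   1,   1,   0,   1],
})

n = len(train)
# Reweighing-style weights: w(s, y) = P(s) * P(y) / P(s, y)
p_s = train["sex"].value_counts(normalize=True)
p_y = train["label"].value_counts(normalize=True)
p_sy = train.groupby(["sex", "label"]).size() / n

train["weight"] = [
    p_s[s] * p_y[y] / p_sy[(s, y)] for s, y in zip(train["sex"], train["label"])
]

# Under-represented cells (e.g. women with the favorable label) get weights > 1.
print(train.groupby(["sex", "label"])["weight"].first())
```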
Then you have optimized pre-processing. Here, you learn a probabilistic transformation that changes the features inside your dataset. Then there is something else we call learning fair representations. This is related to embeddings - you might say, "I don't want to look at my data directly. I'm going to feed this data into some embedding, and the embeddings I create are just numbers that represent my data, so I can still learn things on top of them, but the way I'm constructing these embeddings makes sure I cannot recover the protected attribute I had before, which can be gender, religion, race."
The last one is what we call the disparate impact remover. Here, what we want to do is edit the feature values, to increase group fairness while preserving the rank-ordering within each group.
The second category is in-processing; here we have to touch the model. There are two main ways you can do that. The first is the adversarial method. If you know about GANs, Generative Adversarial Networks: you train two models. The first one learns your task from your data, and the second one learns from the predictions of the first model and tries to recover the protected attribute. What you want is to make sure that, at some point, the second model is no longer able to tell the protected attribute apart from those predictions.
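This is not the full adversarial training loop (libraries such as AIF360 provide that as AdversarialDebiasing), but here is a hedged sketch of the adversary's role on synthetic data: if a second model can recover the protected attribute from the first model's scores well above chance, the predictions still leak it, and adversarial debiasing trains the first model to drive that back towards chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: one feature is a noisy proxy for the protected attribute.
n = 2000
protected = rng.integers(0, 2, size=n)            # 0/1 group membership
x_proxy = protected + rng.normal(0, 0.5, size=n)  # proxy feature
x_other = rng.normal(0, 1, size=n)
X = np.column_stack([x_proxy, x_other])
y = (0.8 * x_proxy + 0.5 * x_other + rng.normal(0, 1, n) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
    X, y, protected, test_size=0.5, random_state=0)

# First model: the task classifier.
clf = LogisticRegression().fit(X_tr, y_tr)
scores_tr = clf.predict_proba(X_tr)[:, [1]]
scores_te = clf.predict_proba(X_te)[:, [1]]

# "Adversary": tries to recover the protected attribute from the scores alone.
adv = LogisticRegression().fit(scores_tr, s_tr)
print("Adversary accuracy recovering the protected attribute:",
      round(adv.score(scores_te, s_te), 3))
```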
Then you have the prejudice remover. What's interesting here is that you take the model that you have and, like the regularization you add when you are overfitting, you add a term to the objective you want to optimize. This term just makes sure that you are not going to discriminate on the protected attribute that you had before.
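The actual prejudice remover penalizes the statistical dependence (roughly, the mutual information) between the prediction and the protected attribute; as a simplified, hedged sketch of the shape of such an objective, you can picture the usual loss plus a fairness penalty, for example:

```python
import numpy as np

def fairness_regularized_loss(y_true, p_pred, protected, eta=1.0):
    """Usual log-loss plus a simplified fairness penalty.

    The penalty here (difference in mean predicted probability between groups)
    is only a stand-in for the mutual-information term of the real prejudice
    remover; it illustrates how the regularizer is bolted onto the objective.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    protected = np.asarray(protected)

    eps = 1e-12
    log_loss = -np.mean(y_true * np.log(p_pred + eps)
                        + (1 - y_true) * np.log(1 - p_pred + eps))
    penalty = abs(p_pred[protected == 0].mean() - p_pred[protected == 1].mean())
    return log_loss + eta * penalty
```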
Then you have post-processing. You went all the way through, you have your predictions, and there are usually three methods. The first is equalized odds post-processing: you solve a linear program to find the probabilities with which to change the output labels, optimizing the equalized odds metric. Then you have calibrated equalized odds post-processing: again, you calibrate the scores you have at the end, just to make sure this metric is optimized.
The last one is reject option classification. It gives favorable outcomes to the unprivileged group and unfavorable outcomes to the privileged group for predictions that fall inside a confidence band around the decision boundary - the region where the model is least certain.
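Here is a minimal sketch of that idea (not the AIF360 implementation, and the threshold, band width and group encoding are hypothetical): outside the confidence band the usual thresholded decision is kept; inside it, the favorable outcome goes to the unprivileged group and the unfavorable one to the privileged group.

```python
import numpy as np

def reject_option_classify(scores, protected, threshold=0.5, band=0.1,
                           unprivileged=0):
    """Sketch of reject option classification on predicted scores."""
    scores = np.asarray(scores, dtype=float)
    protected = np.asarray(protected)

    # Default decision: threshold the scores.
    preds = (scores >= threshold).astype(int)

    # Inside the uncertainty band, favor the unprivileged group.
    in_band = np.abs(scores - threshold) <= band
    preds[in_band & (protected == unprivileged)] = 1   # favorable outcome
    preds[in_band & (protected != unprivileged)] = 0   # unfavorable outcome
    return preds

# Hypothetical usage: 0 = unprivileged group.
scores = [0.42, 0.55, 0.48, 0.90, 0.10]
protected = [0, 1, 1, 0, 1]
print(reject_option_classify(scores, protected))  # -> [1 0 0 1 0]
```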
Experiments
Let's take a look now at some experiments. These are also taken from the AIF360 paper. In this field of fairness and bias, there are three main datasets that people use a lot, usually pretty small datasets. The first one is the adult census income dataset: you have some information like the gender of the person, whether they are married or divorced, where they live, and the target is how much they make, in ranges - for example, less than $50,000 a year or more than $50,000.
The second one, German Credit, is a dataset where, same thing, you have customers that asked for credit, and what you get at the end is whether this person received the credit and whether they managed to pay it back. Then COMPAS, the last one, is the dataset related to sentencing, a bit like the example I showed you before.
Then what we are going to look at are these four different metrics: disparate impact, statistical parity difference (SPD), average odds difference and equal opportunity difference. And we are going to train three different classifiers: Logistic Regression, a Random Forest classifier and some Neural Nets. What's interesting about that choice is that Logistic Regression is the baseline people usually pick when they want to train a model and have a benchmark. A Random Forest is a bit more - not complicated, but usually it performs a bit better than Logistic Regression. And Neural Nets are the fancier, more hyped option. Then we have those different processing steps that we talked about before.
Let's look at the results. Here you see the three datasets - COMPAS, the credit one and the census. For each of those, there are two protected attributes we will look at: for COMPAS it's sex and race, for German Credit sex and age, and for the census sex and race again. Here is the metric. In blue is what you get if you compute this metric on the dataset before any pre-processing, for each of those protected attributes.
What we really want for SPD, to have a fair value, is for this metric to be really close to zero. As you can see, before the pre-processing we are pretty far from it, and after the pre-processing we are closer. We see that here for the reweighing technique. If we use optimized pre-processing, we see the same kind of effect, although a bit bigger than what we get with just reweighing.
Then we can do the same with the disparate impact metric. For disparate impact, having a fair value means your value should be pretty close to one. Same thing: we use those two different pre-processing techniques, you go from the blue to the orange, and you see that we are closer to one. It means your dataset was changed in a better way and it's going to be fairer for people.
Then we can look at the model results. These results are taken only from the census dataset, and the protected attribute was race. Let's just look at the best model here. Above this line is before the pre-processing: if we look at statistical parity, you can see it is not close to zero, and the accuracy of this model is pretty close to 76-77%. Same thing for disparate impact: it should be closer to one to be fair, and it's a bit below. Then we apply the transformation, and what you can see for these two methods, Logistic Regression and Random Forest - here you can find them again - is a bit of a loss of accuracy. You went from 77 to maybe 75, 76, but you got closer to the zero that we care about. For the Logistic Regression, you actually arrive at a fairer model, maybe losing a bit more, from around 74 to, let's say, 73.
The takeaway here is that, in the end, you are not losing that much by being fairer to people. Your accuracy is not dropping very much, but as you can see, you can usually get a fairer model according to the metrics we are looking at. If you look at the other results, where we chose other metrics - average odds difference and equal opportunity difference - we see the same kind of conclusion. We went from a model that is pretty close to 76 to another model pretty close to 75, 76, so not losing much accuracy but being fairer overall.
Questions and Answers
Moderator: You showed that there is this small trade-off in accuracy when you apply some of these methods. Have you encountered a situation where this trade-off in accuracy is actually quite significant? How do you think you should deal with that? Because if the trade-off in accuracy is quite significant, I believe there may be some pressure not to apply some of these methods.
Silberman: Sometimes there is a really huge decrease in accuracy, but usually it's when you are trying something that you should not try. Let's say one day you really want to know, "What if I put in gender? What is going to happen?" At NewBank, we usually try to avoid that from the start. We know the different features that we may or may not use. Depending on those features, we take a look, and maybe one day you say, "I really want to see the real impact of using this feature," and you go and see that impact. The issue is, today we don't really have it here, but if you think of Europe - and it's coming next year in Brazil - you have laws like GDPR, the data protection law, that are going to force companies to reveal which features the predictions they are making come from. If at some point you have to say, "My predictions are coming from gender," then you are going to be in trouble, because in our case, for example, the Central Bank can come and say, "What you did is really bad and you need to remove it," and maybe give us a penalty or even remove our license if we do something really stupid. We want to be really careful with that. What we want when we do all this research is to go the last mile of making sure that even if you remove those features, the model is fair - because, and that's what we saw before, just removing gender doesn't mean your model is not able to find gender again in the other features that you have.
Usually, what we want is to make sure that those features are not present somewhere else. In my experience, because you don't include the real feature from the start, the decrease in accuracy is not that much; it doesn't have that big an impact. At the end of the day, you could do better and have higher accuracy, but you still want to keep your license. In the case of NewBank, people really love us and we don't want to break this relationship with our customers. That's why we want to make sure that we are fair with them.
Moderator: That's a good point. If you start by not having this data in your dataset in the first place, you are not even tempted to start using it.
Participant 1: The example you showed us is based on structured data. Do you have any experience with, or know of any work on, unstructured data like images or videos, especially for face recognition and things like that?
Silberman: There have been a lot of studies on that. Not so long ago, I think someone did a study on the face recognition algorithms from Amazon, Google - I'm not sure Google is doing it - IBM, Microsoft. What they saw was that these models were doing pretty well for, let's say, men, whether black or white, and for white women, but really badly for black women. The way to address that is to create a dataset that is balanced, because from the start, some of the few datasets we had at the beginning to train face recognition were mostly extracted from maybe Google, and it was just movie stars, because it was really easy to get the data. The thing is, with movie stars, if you think about it, you have more representation of maybe white men, or even white women, than of black women, for example. In the end, it was relatively easy to fix, because the only thing to do was to get more data from the underrepresented groups, and they did that. They retrained, they looked at the data again, and then everything was back to normal and they were able to classify everyone correctly.
Participant 2: In your talk we saw lots of things related to the code and the data treatment in order to lessen the problems regarding sexism, racism. Do you think there are other areas we could improve in order to lessen the problem?
Silberman: Where it usually gets complicated is that all the solutions I showed you so far rely on the fact that you have those protected attributes. In some sense, you have the information about, let's say, the race of someone, or the gender of someone, or maybe the location of someone. If you don't have this information, how can you say that you are discriminating against a population if you cannot even check whether this population is being discriminated against?
There is this situation of, "I don't have the geolocations of people," but in my data, because of some issue, one feature is a proxy for this information, and I don't know it. The thing is, I'm training my model, I'm saying this feature is really good, and I'm really happy because my model is very good because of it. You put it into production, but in some way, maybe you are discriminating against someone, and you don't even know it, because you have nothing to compare against. That's where it's pretty hard, and that's the interesting part: how can we remove a bias that we don't even know about? How can we find it? Usually, you find it because, as a data scientist, you actually look at the data. We've been in a lot of cases where you see a feature that is really good and you just wonder, "Why is this feature so good? It doesn't make any sense." You just put in every feature that you have, even something that doesn't seem related to whether people are going to pay back some money - it's something like this.
We saw that the number of emails we were exchanging with people was very important. You think, "That's pretty good, I'll just put email in as a feature." In fact, what we saw is that email was an important feature because, even before the model classified this person as someone we needed to contact, some people were already thinking, "I need to contact this person in advance, because I want to make sure that everything is all right; maybe I can help him before." Then you end up in a situation where your model is just learning, "Ok, we already started to contact this person" - but in fact, you contacted him because you already knew this person was in difficulty. It's a feedback loop that you want to avoid, and it's pretty hard to understand and catch unless you really go deep inside the data, understand the whole situation, and don't just stop at, "This feature is good, my job is done, I'm going home, I'm happy."