Privacy: The Last Stand for Fair Algorithms



Katharine Jarmul discusses research related to fair-and-private ML algorithms and privacy-preserving models, showing that caring about privacy can help ensure a better model overall and support ethics.


Katharine Jarmul is a pythonista and co-founder of KIProtect, a data science and machine learning security company in Berlin, Germany. She's been using Python since 2008 to solve and create problems. She helped form the first PyLadies chapter in Los Angeles in 2010, and co-authored an O'Reilly book along with several video courses on Python and data.

About the conference is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.


Jarmul: This talk is based partially on a talk that I gave at Strange Loop last year, which was called "Privacy: The Last Stand for Fair Algorithms," so now we will revisit it together. When I was invited to revisit this topic, especially for an audience here in the Bay Area, in Silicon Valley, sometimes called the Belly of the Beast, I thought to myself, "Ok, I might change a few things and talk about privacy with an emphasis on the larger tech companies here and what they're doing, or in some cases not doing, around privacy."

To begin, we'll pick on Google, because why not? Is anybody here from Google? No, perfect, even better. Here is Mr. Peter Fleischer, Google's Global Counsel for Privacy. This means that he represents Google on a global level when it comes to privacy, and he's also their longest serving privacy officer. Privacy officers are similar to security officers: they tend to have a short term, let's just say. As soon as something happens, they're gone and they get replaced with somebody else, but Peter Fleischer has been in service for more than a decade.

Here he is, this is from his own blogspot, this is an image of him. He lives in France, so I'm assuming this is by the Alps, gorgeous, this is how he sees himself. I can't really see his face, so I said, “I've got to find a picture of this guy's face. How can we talk about him without showing his face?” so I went to his employer, Google, and I searched, and this is the top result on Google images. I said, “Ok, this is him speaking at a conference, it's a European conference on privacy, he's obviously dressed in his suit. He has a little bit of a funny look on his face. It's a little bit of a weird expression.”

I don't know how he feels, because how he sees himself is on the left, I also don't see any grey hairs on the left. On the right, he looks a little bit older, so how he sees himself and how he presents himself is the left image, but the top Google search result is, of course, the right image. How he sees himself and how Google, usually supported by algorithms and/or at least minimally computer programs, how they've generated an image because this is a YouTube screen grab, so how they've generated an image to represent him, this is quite different.

I think that this starts to touch upon consent, and this starts to touch upon how we define ourselves on the internet, and how other people define us on the internet, we're going to go a little bit deeper in that. Mr. Fleischer had some opinions on this. He, of course, has his own blogspot, and he's written quite a lot about the right to be forgotten. What he has written on the right to be forgotten is, "All of my empathy for wanting to let people edit out some of the bad things of their past doesn't change my conviction that history should be remembered, not forgotten, even if it's painful. Culture is memory." This is quite eloquent.

I'm curious, how many people here share this idea about Google and the internet, that we should never delete things? They're the cultural history, a cultural memory. He has gone on to say that Google will be the history of the future, this is how we can keep track of things. How many people here believe that the right to be forgotten simply erases history, that this is not something that we should allow? Keep your hands up. I want you to continue to keep your hands up if you think that Mr. Fleischer has been, or if you yourself have been, a victim of revenge porn. What about unlawful arrest? Do you think Mr. Fleischer was illegally or unlawfully arrested by police? Was he targeted or discriminated against? What about employment? Do you think Mr. Fleischer has had some problems with employment based on the search history, based on things that people can find about him on the internet? Do you think he suffered that? Do you think the people that believe there shouldn't be a right to be forgotten have suffered that?

I want to come to Mr. Fleischer with some facts and these facts were actually leaked by Google themselves. Ironically, in their transparency report, they published a bunch of data in the source code of the page, and they accidentally leaked their own data, which is something that Google might do. The point is that 95% of over 200,000 requests for the right to be forgotten were not related at all to a convicted criminal, to a public official, or to some other public figure.

These were people that wanted to remove their private information, often, from the internet. "I didn't know that my address was posted there, I would like it removed. I didn't know that this photo was posted there, I would like it removed." These are private citizens, these are not public figures, and they would like things removed from the internet that they just don't want in a search when their employer goes to search for them or when something wrongful happened to them. Ironically enough, Mr. Fleischer is actually a convicted criminal, but do you think it's affected his life? Did he get fired from Google? He was convicted in Italy for Google's crimes against privacy involving images of a child, but he still has a nice life in France hiking the Alps.

Privacy as Privilege, Power & Security

This touches on a deeper topic here, which is within our society today, privacy acts as a form of privilege. Those of us that have privacy, we can choose to opt in or out. We choose and we say, "Ok, I would like to give up some of my privacy to get this thing. I would like to give you some of my data to get this thing." but we don't need it as a core part of our life, we don't need it to survive, we don't need it to get food stamps, we don't need it to get coupons so that we can afford the things in our society.

How many people here, when you see the free Wi-Fi and you have to fill out a bunch of your personal data, how many people put their actual personal data there? We laugh because we know how that data is used and resold, and we would never give our personal data there, but do you think that everybody knows that? Do you think that the average person knows all that information, and do you think that this trade, essentially, of some service or some reward for data, is fair for people that don't have other ways to access these rewards? Is this a fair way to treat people? Is privacy equally distributed, do you think?

Privacy is also a form of power: privacy is privilege, and privilege is usually some form of power. We can think of privacy as power in terms of information asymmetry. When you enter a negotiation or any type of back-and-forth business interaction, what you want is equal footing: you want to know at least as much as the other side knows. This is essentially game theory, also business theory, and so forth.

What happens when we have erosion of our privacy and massive data collection, which we may or may not be aware of, is we have created now an information asymmetry that is so great, that our ability to negotiate with these people that hold our data is almost irrelevant, because they already have all of the information they need from us. We don't have any leverage to say, "Oh, well can I trade this extra piece of information?" "No, sorry, already got that. I bought it from somebody. It's fine." This comes into play a lot online when these larger tech companies, some of which may be your employers, use this to make decisions about what people see, about what people buy, about what prices people pay for things based on this information asymmetry.

As such, it also affects their personal security and their data security, and possibly our societal security. Think of when people's private data are breached, or sold, or given away, like the news today that Mark Zuckerberg used private data as leverage for business deals, that he said, "Hey, man, you want to do business with us, I'll give you some data. Fine. It's just my users, no big deal." We saw it again, of course, with Cambridge Analytica. You want to feel secure: that your data is protected, that it's stored in an encrypted way, hopefully, that somebody's looking out for you, that the data you've handed over to somebody via a trust mechanism is respected in some way, shape, or form, also technologically. Then there is a breach like Cambridge Analytica (it was classified as a breach, so we'll call it a breach), and that data is used against you, to manipulate you and people like you and potentially to influence elections around the globe. So privacy also operates as security.

I was thinking about it when I was walking around San Francisco two days ago, I was thinking about what is the ultimate lack of privacy? It's homelessness, right? The ultimate lack of privacy is you don't even have a private place to sleep, or eat, or do anything. As I walked around San Francisco and I see big, massive bank or tech building, and then homeless people outside, I think I saw quite a gulf there. To me, it really illustrated this idea of privacy as privilege, power and security, and the fact that we see such an example directly in front of a place that controls quite a lot of data, and therefore, people's privacy.

Privacy & Fairness

How does this relate to fairness? I hope some of the connections are somewhat clear to you already, because if it's privilege, power, and security, then also, of course, this might be unequally distributed, and therefore, this might correlate with fairness. I want to give you an even more illustrative example, I think that most people that work in applied AI, also myself, within the machine learning space, I've done lots of years of natural language processing, we see AI as something that can help us. We want to help people, we want to build good products, we want to do good work, we want to work on challenging problems.

Very rarely have we been hurt by this. In our experience, a lot of times, this automation, this AI driven change within our society, this has really benefited us, it has added convenience for us, it has added easier access to things, it has added quite a lot of positivity in our life. Usually speeding up processes, limiting paperwork, getting things done faster, or better, something like this.

Probably rarely have we been affected by the negative consequences of these automations, or the negative consequences of unfair treatment, or unequal access to privacy. Usually, we're not the one getting the loan denied message with no explanation, usually, that's not us. It might not even be anybody we know, so we're essentially separated both in our knowledge and use of privacy in other people's data, and also our understanding of fairness and fairness criteria within our space.

Fairness through Awareness

This was further advanced by Cynthia Dwork. If you're not familiar with Cynthia Dwork, she's a leading researcher, and I would say even activist, on privacy and machine learning. What Cynthia Dwork did, her initial research that really contributed to this space, was about a process called differential privacy. What happened was first she disproved the notion, there used to be this notion that you could do something called statistical disclosure. This notion was the idea that you could release something and not expose private information about an individual.

She actually disproved that theory mathematically, and also with a series of logic proofs. What she was then tasked with, really, is when you destroy the way that people have done anonymized releases forever, then you probably should rebuild something, what she rebuilt was differential privacy mechanisms. This essentially is what we consider, in a lot of ways, the state of the art way to measure privacy loss over time, so we can use differential privacy to quantify privacy loss and to measure that over a longer period of time with something that we often refer to as a privacy budget, this is some of her background work.
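To make the idea of a privacy budget concrete, here is a minimal sketch in Python. It is not taken from the talk: the `PrivacyBudget` class, the `private_count` helper, and the sample ages are hypothetical names and values chosen for illustration. It assumes the textbook Laplace mechanism for a counting query and simple sequential composition, where the epsilon values of successive queries add up.

```python
import random


class PrivacyBudget:
    """Tracks cumulative privacy loss (epsilon) across queries."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted")
        # Sequential composition: the losses of successive queries add up.
        self.spent += epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, budget, rng):
    """Answer a counting query with Laplace noise, charging the budget."""
    budget.spend(epsilon)
    true_count = sum(1 for record in records if predicate(record))
    # A count changes by at most 1 when one person is added or removed,
    # so its sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
budget = PrivacyBudget(total_epsilon=1.0)
ages = [23, 35, 41, 52, 29, 64]  # made-up data
noisy = private_count(ages, lambda age: age > 40, 0.5, budget, rng)
```

Each call spends part of the budget. Once the total is used up, further queries are refused, which is the sense in which privacy loss is measured over time.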

After her work on differential privacy, she decided, "Can I apply differential privacy principles to something outside of a query-based approach?" which is the original approach to differential privacy and she came up with this paper, "Fairness through Awareness." Her goal of "Fairness through Awareness" was to mathematically prove that privacy and fairness were correlated, that when we increased privacy for the individual, specifically for certain sensitive or private attributes, such as, let's say, gender, age, race, and so forth, that we could then improve the fairness of their treatment. The goal was "Can we treat similar people similarly?"

If it's between myself and yourself, and our only difference is gender or race, and, all else being equal, we have very comparable backgrounds, why should you get the job rather than me? Or why should I get the job rather than you, based on my skin color? Her idea, essentially, is that we can create a mapping space that removes the impact of these private or sensitive attributes, so we create a representation that removes the influence of these variables.
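"Treat similar people similarly" can be read as a Lipschitz-style condition: the difference in outcomes for two people should not exceed how different they are for the task at hand. The sketch below is a simplified illustration, not the paper's construction. It uses plain integer scores instead of distributions over outcomes, and the candidates, the `task_distance` metric, and both scoring functions are hypothetical.

```python
from itertools import combinations


def lipschitz_violations(people, score, distance):
    """Return name pairs whose scores differ more than their task distance."""
    violations = []
    for a, b in combinations(people, 2):
        if abs(score(a) - score(b)) > distance(a, b):
            violations.append((a["name"], b["name"]))
    return violations


# Hypothetical candidates: A and B differ only in gender.
people = [
    {"name": "A", "gender": "f", "years": 5},
    {"name": "B", "gender": "m", "years": 5},
    {"name": "C", "gender": "f", "years": 1},
]


def task_distance(a, b):
    # Assumed metric: only job-relevant experience counts; gender is ignored.
    return 10 * abs(a["years"] - b["years"])


def biased_score(person):
    # Adds 20 points for one gender, so similar people are treated differently.
    return 10 * person["years"] + (20 if person["gender"] == "m" else 0)


def blind_score(person):
    # Depends only on experience.
    return 10 * person["years"]


biased_pairs = lipschitz_violations(people, biased_score, task_distance)
blind_pairs = lipschitz_violations(people, blind_score, task_distance)
```

The hard part in practice is choosing the task distance. The check is only as fair as that metric.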

Learning Fair Representations

Mathematically, it worked, but mathematically working and working in practice are two different things. The research was expanded upon by several researchers at the University of Toronto, who actually implemented this mapping. I can highly recommend this paper; it's called "Learning Fair Representations." They compared it with several practices that were very popular at the time, including a fair Bayesian approach and a regularized logistic regression approach. Any time that you use fairness or other criteria alongside accuracy, you need to find some sort of optimization; you have to determine some optimization equation.
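As a hedged illustration of what such an optimization equation can look like, the objective in "Learning Fair Representations" is, as best it can be summarized here, a weighted sum of three terms. The notation below is a reconstruction for this transcript, not something shown in the talk, and the weights $A_z$, $A_x$, $A_y$ are hyperparameters to be tuned.

```latex
\begin{aligned}
L   &= A_z \, L_z + A_x \, L_x + A_y \, L_y \\
L_z &= \sum_{k=1}^{K} \left| M_k^{+} - M_k^{-} \right| \\
L_x &= \sum_{n=1}^{N} \lVert x_n - \hat{x}_n \rVert^{2} \\
L_y &= \sum_{n=1}^{N} \bigl( -y_n \log \hat{y}_n - (1 - y_n) \log (1 - \hat{y}_n) \bigr)
\end{aligned}
```

Here $M_k^{+}$ and $M_k^{-}$ are the average probabilities that members of the protected and unprotected groups are mapped to prototype $k$, so $L_z$ pushes toward group parity in the learned representation. $L_x$ is the reconstruction error of the input $x_n$, and $L_y$ is the prediction loss on the label $y_n$.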

Here we see two wants. The one on the left is minimum discrimination, so let's minimize the discrimination of the groups, and the one on the right is the maximum delta between accuracy and discrimination, so trying to essentially push the gulf between the highest accuracy with some error, essentially, for the discrimination. What they were able to show, it's the dark blue line here to the far right, is that the learned fair representations performed quite well, especially compared to some of the other fairness approaches, especially in terms of retaining a level of accuracy that was higher.
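The quantities being traded off here can be computed in a few lines. This is a minimal sketch that assumes discrimination is measured as the absolute gap in positive prediction rates between the protected group and everyone else, and that the "delta" is accuracy minus that gap. The toy labels and predictions are invented.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def discrimination(y_pred, protected):
    """Absolute gap in positive prediction rate between the two groups."""
    prot = [p for p, s in zip(y_pred, protected) if s]
    rest = [p for p, s in zip(y_pred, protected) if not s]
    return abs(sum(prot) / len(prot) - sum(rest) / len(rest))


def delta(y_true, y_pred, protected):
    """Accuracy minus discrimination; higher is better on both counts."""
    return accuracy(y_true, y_pred) - discrimination(y_pred, protected)


# Invented example: the first four rows belong to the protected group.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 1]
protected = [1, 1, 1, 1, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
disc = discrimination(y_pred, protected)
gap = delta(y_true, y_pred, protected)
```

In this toy case the model is 75% accurate but gives positive predictions to 25% of the protected group versus 75% of the rest, so the delta is only 0.25.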

Essentially, what we can see here and what we can take away here is that not only is it important for us to choose optimizations that allow us to evaluate things perhaps, like fairness and privacy, but also that these fair representations can actually be reused for other similar tasks, so tasks with similar objectives. This is something that you can actually take and utilize for more than one problem, this was really the goal of that paper.

Implement Private and Fair Machine Learning

When Wes asked me to come give the keynote, he said, "Can you make it a little practical? I know that this is a practical engineering conference, and we want something that we can take home, take to work, implement, and so forth." For the rest of the talk, we're going to shift from this theory and these ideas of privacy to how we can actually implement them in the systems that we're building and in the machine learning or AI systems that we want to create.

Another piece of research that was recently released was by a team called GoDataDriven. They're based in the Netherlands, and they're actively working on fairness problems, as well as other machine learning problems. What we see here is an iteration across similar datasets, this is a dataset that predicts whether you earn more than 50,000 or less than 50,000, and they try to essentially remove the sensitive attribute. Here we have a GAN, so if anybody's used GANs before, we have essentially two machine learning systems that are interacting with one another. And the first one is just working on the classification problem, so does this row earn more or less? The second one is a discriminator that's trying to guess the gender or the race based on that classification.

In a way, this is exactly what Cynthia Dwork was proving, that if we can remove our ability to guess the gender, then we are treating similar people similarly, if we can remove the influence of race, then we are treating similar people more similarly. They have open sourced their code on this, feel free to check it out. What they showed is that, yes, there is this tension between accuracy and fairness, and accuracy and privacy, but they were able to optimize for both using a GAN model, so it's a fairly novel approach.
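The tension between the two networks can be written as a single objective for the classifier: do well on the prediction task while making the adversary do badly at guessing the sensitive attribute. The sketch below is an illustration of that idea only, not GoDataDriven's code. The function names and the trade-off weight `lam` are assumptions, and no actual training loop is shown.

```python
import math


def binary_cross_entropy(targets, probabilities):
    """Mean binary cross-entropy, with probabilities clipped for stability."""
    eps = 1e-12
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


def classifier_objective(y_true, y_prob, z_true, z_prob, lam):
    """Loss the classifier minimizes in an adversarial fairness setup.

    y_*: income label and the classifier's predicted probability.
    z_*: sensitive attribute and the adversary's predicted probability.
    lam: assumed trade-off weight between accuracy and fairness.
    """
    clf_loss = binary_cross_entropy(y_true, y_prob)
    adv_loss = binary_cross_entropy(z_true, z_prob)
    # The classifier is rewarded when the adversary's loss is high.
    return clf_loss - lam * adv_loss


y_true, y_prob = [1, 0], [0.9, 0.1]
z_true = [1, 0]
# An adversary that can only guess 50/50 versus one that guesses well.
fooled = classifier_objective(y_true, y_prob, z_true, [0.5, 0.5], lam=1.0)
leaky = classifier_objective(y_true, y_prob, z_true, [0.99, 0.01], lam=1.0)
```

With the same task accuracy, the objective is lower (better) when the adversary is reduced to guessing, which is what pushes the classifier away from encoding gender or race.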

Collect Data with Privacy Guarantees: Group Fairness, Individual Privacy

Another approach that you can use is collecting data with privacy guarantees. There are many ways that we collect data and more often than not, we don't actually need all the data that we collect. I was at a conference recently and somebody said, "We just collect all the data so that we save it for a rainy day." I would say a rainy day is just a data breach waiting to happen, but that's just my opinion. Can we figure out how to collect data with privacy guarantees? Can we then utilize this data in order to show things like group fairness?
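One long-established way to collect data with a privacy guarantee is randomized response, where each person adds noise to their own answer before it is ever stored. The talk does not name this technique; it is used here as an assumed example. The function names and the `p_truth` parameter are hypothetical.

```python
import math
import random


def randomized_response(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Report the truth with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth: float) -> float:
    """Undo the noise in aggregate: observed = p * true + (1 - p) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p_truth) * 0.5) / p_truth


def epsilon(p_truth: float) -> float:
    """Local differential privacy level implied by this mechanism."""
    return math.log((1.0 + p_truth) / (1.0 - p_truth))


rng = random.Random(42)
truths = [rng.random() < 0.3 for _ in range(10_000)]  # simulated population
reports = [randomized_response(t, 0.5, rng) for t in truths]
estimate = estimate_true_rate(reports, 0.5)
```

No single stored answer can be held against the person who gave it, yet the group-level rate can still be estimated, which is the property needed for group fairness analysis.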

This is a nice use case, this is also from Google, it was reported on by The New York Times. A group of Googlers got together and said, "We're going to create a Google Sheet. Everybody there is going to put their gender and their rank, so to speak, their level of engineering title, and then they're going to put salary."

This isn't necessarily anonymized, because if I were to watch you and I knew that you just entered your data, then there's a lot of ways that one could attack this and re-identify the data, but we're not going to go into that. What we're going to go into is, the idea that each row would not be attributable to any one person. In aggregate, this showed a trend and this showed a trend that they actually were able to use in a class action lawsuit against Google that showed discriminatory pay for the women engineers that were working there.

If you're concerned about fairness, one way that you can enable people to share more information on this is to allow for individual privacy, because it means that I don't have to go up, as a female Google engineer and say, "Well, I heard my coworker is making $10,000 more than me. Can you please give me a raise?" The onus becomes not on the individual, but on the data that we can share as a group, and therefore, together we are actually stronger in this case.
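A minimal sketch of that kind of group-level reporting: rows carry no names, and any group that is too small to hide an individual is left out of the result. The rows, the `grouped_medians` helper, and the `min_group_size` threshold are all invented for illustration. A size threshold like this reduces re-identification risk but is not a formal privacy guarantee.

```python
from collections import defaultdict
from statistics import median


def grouped_medians(rows, min_group_size=3):
    """Median salary per (gender, level), suppressing small groups."""
    groups = defaultdict(list)
    for gender, level, salary in rows:
        groups[(gender, level)].append(salary)
    return {
        key: median(values)
        for key, values in groups.items()
        if len(values) >= min_group_size
    }


# Invented rows: (gender, level, salary in thousands).
rows = [
    ("f", "L4", 100), ("f", "L4", 110), ("f", "L4", 105),
    ("m", "L4", 115), ("m", "L4", 120), ("m", "L4", 118),
    ("f", "L5", 150),  # a group of one, which would identify the person
]

report = grouped_medians(rows)
```

The report shows the gap between groups at the same level without pointing at any one person's salary.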

Privacy by Design: Protect User Data in Software Design

Another thing that you can work on and do is implementing privacy by design. Privacy by Design principles allow us to create software where privacy is at the core. This means end-to-end encryption, this means very clear and transparent use of data, this means ensuring that the user can enable or opt in or out of privacy controls quite easily.
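One small piece of this, easy opt-in and opt-out with privacy as the default, can be sketched as follows. The `UserConsent` class, the `read_field` helper, and the purpose names are hypothetical and not taken from any framework. The only point is that data access is denied unless the user has opted in for that specific purpose.

```python
from dataclasses import dataclass, field


@dataclass
class UserConsent:
    # Privacy by default: no purpose is allowed until the user opts in.
    purposes: set = field(default_factory=set)

    def opt_in(self, purpose: str) -> None:
        self.purposes.add(purpose)

    def opt_out(self, purpose: str) -> None:
        self.purposes.discard(purpose)

    def allows(self, purpose: str) -> bool:
        return purpose in self.purposes


def read_field(record: dict, field_name: str, purpose: str, consent: UserConsent):
    """Return a field only if the user consented to this purpose."""
    if not consent.allows(purpose):
        raise PermissionError(f"no consent for purpose: {purpose}")
    return record[field_name]


consent = UserConsent()
record = {"email": "user@example.com"}  # made-up record
consent.opt_in("account_recovery")
email = read_field(record, "email", "account_recovery", consent)
```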

This is all a part of privacy by design, and it should sound a little bit familiar. Does it sound a little bit familiar in terms of topics? If you had to implement GDPR, you're probably very clear on the fact that Privacy by Design was a core piece of GDPR. In fact, there's an entire section or article within the GDPR that simply focuses on Privacy by Design, but this was a theory created by a working group in Canada before, I highly recommend you take a look at both.


Let's dive a little bit into GDPR, because it's really funny for me when I come here to the U.S. I'm originally a U.S. citizen; I live in Europe now, and have been in Germany for five years, and it's pretty funny to me the way people talk about GDPR here. It's a good point of view into the American psyche. I think people see GDPR as punitive, which is surprising to me, because when I think about GDPR as a European resident, I think that we've created first-class data citizens and everyone else. We've created a group that has control, ownership, and agency over their data, and we've created everyone else.

It's funny to me that even though I live in a place of GDPR and so forth, and yes, maybe I can't read the LA Times- I'm not really very concerned about the LA Times, I live in Berlin- but it's amazing to me, that the blocking of sites is somehow punitive towards me, when really, if I were you, and I lived here, I'd be pretty concerned about what the LA Times is doing with my data. That's how I would look at that, "Wait a second, what are they doing?" This is because, really, the LA Times has decided that it's easier to just block an entire region of the world than to implement Privacy by Design principles, so, have fun on the LA Times, let me know how that goes for you.

What I ask is maybe we should treat privacy as a right for everyone, maybe we shouldn't just let the nice European residents, like myself, have all the fun. If you've actually implemented GDPR at your company, it would be pretty easy to not offer this only to a geographic zone, it would be pretty easy to offer this to other users on your site. And you know what? This amount of transparency and openness and agency that you're giving the users is actually going to be seen as a nice, goodwill effort. I would say probably your trust and safety teams and your customer success teams will give you a thumbs up on allowing people to exercise some basic rights with regard to their data and how you're using their data.

Choose Fairness and Privacy Metrics Early and Often

Some final thoughts on fairness and privacy within machine learning. First and foremost is there was a talk yesterday from Google about a few different fairness metrics. I've spoken about a few privacy metrics. And what I suggest, if you're actually on an applied machine learning team, is that you choose and designate types of fairness and privacy metrics early and often, that before you go about training and modeling and so forth, you actually incorporate those into your choices. This is because you don't want to get to the end stage and then say, "Oh yeah, did anybody check the fairness thing?" and then you have bad and worse.

You need to choose this often. This diagram is from an algorithmic fairness group. This is a group of two different research departments. One is at the University of Utah and the other, I believe, is at Haverford College, and they have groups working on algorithmic fairness. The problem really is that there are at least 21 documented definitions of algorithmic fairness. This was released in a paper, I believe either at FAT ML or NIPS last year. What they saw is that some of them (these are correlation charts) are in directly adversarial relationships with one another, so they're negatively correlated. If I improve this fairness metric, I'm actually worsening that fairness metric. It is really essential that you get this right, because if you optimize for the wrong one, you're actually increasing inequality with the way that you're analyzing your fairness.
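A small worked example shows how two fairness definitions can pull against each other. The two metrics here, the demographic parity gap and the equal opportunity gap, are chosen for illustration and are not necessarily among those in the diagram. The labels and predictions are invented, with different base rates in groups "a" and "b".

```python
def positive_rate(y_pred, groups, group):
    """Share of a group that receives a positive prediction."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)


def true_positive_rate(y_true, y_pred, groups, group):
    """Share of a group's true positives that are predicted positive."""
    preds = [p for t, p, g in zip(y_true, y_pred, groups) if g == group and t == 1]
    return sum(preds) / len(preds)


def demographic_parity_gap(y_pred, groups):
    return abs(positive_rate(y_pred, groups, "a") - positive_rate(y_pred, groups, "b"))


def equal_opportunity_gap(y_true, y_pred, groups):
    return abs(
        true_positive_rate(y_true, y_pred, groups, "a")
        - true_positive_rate(y_true, y_pred, groups, "b")
    )


groups = ["a"] * 4 + ["b"] * 4
y_true = [1, 1, 1, 0, 1, 0, 0, 0]  # base rate 75% in "a", 25% in "b"

perfect = list(y_true)              # predicts every label correctly
parity = [1, 1, 0, 0, 1, 1, 0, 0]   # forces equal positive rates

perfect_gaps = (
    demographic_parity_gap(perfect, groups),
    equal_opportunity_gap(y_true, perfect, groups),
)
parity_gaps = (
    demographic_parity_gap(parity, groups),
    equal_opportunity_gap(y_true, parity, groups),
)
```

The perfect predictor has no equal opportunity gap but a large parity gap. Forcing parity closes that gap and opens an equal opportunity gap instead, so the metric has to be chosen for the task and the population before training.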

There's lots of research on this, you can dive in deep, but the point is, for the task at hand and for the population that your model will be touching or affecting or training on, then you need to think and define the fairness metric, and maybe reevaluate over time. I warn that you please avoid the fallacy of metrics. What is the fallacy of metrics? The fallacy of metrics is the idea that I've optimized the metrics, and so this means I've fixed it. I fixed data science, I fixed ethical machine learning, we can all go home, it's all done because I've created one model that wasn't racist, it means I totally fixed everything.

I hate to pick on Google- maybe I don't actually hate it- but the idea is that if you create a fair or an equal opportunity bomber, you still created a bomber. You can optimize fairness for quite a lot of things, but it doesn't actually make the work ethical, you have to think about the greater picture. Just because you improved equal treatment of users by one model doesn't mean that that model does the right thing from a moral standpoint. It doesn't mean that we fixed the problem in society, it doesn't mean that we changed anything by just looking at one fairness implementation.

I don't know how many people here are familiar with Danah Boyd and her work; she's been working in privacy, particularly in social media. I noticed that there are lots of social media companies and/or other companies here where the users are really your data, and the users are really your product. If you work at a social network and nobody's there, then you have no job. If you work at a place like Uber or Lyft and nobody's there, then you have no drivers and also no riders, so you have no job. When the users are the core part of your product, then you really have to think about them, you have to think about their perspective, because they can change their mind and go somewhere else, even if you think you have a monopoly now.

Danah Boyd has studied quite a lot of this, of social media context and what does privacy mean in social media context. There was a recent debate on a mailing list that I'm part of where somebody said, "Ok, fairness and transparency are at odds." This is Mr. Fleischer's thing, too, we can't have transparency and privacy, we can't have these, these are not available. She responded, and I want to read her response or a section of her response, "Privacy also has a long and complex history, in both legal and technical contexts. This often gets boiled down to who has access to the data. From my vantage point, that's a dreadful way to define privacy. Alice Marwick and I spent years interviewing teenagers about what they understood to be privacy violations, and what became clear to us is that privacy, in the colloquial sense, means control over a social situation. In other words, it's not about access, but about expected use. And this is where we realized that one line of legal thinking actually got this historically correct, a reasonable expectation of privacy."

What she's saying is not, of course, that privacy means security for the data, although I would argue that that's important as well as part of trust, maybe. Privacy is actually that the data is being used in a way that the user anticipates and expects, and that transparency actually joins that quite well, because if we're transparent with users about how we're using their data, if we're transparent with them about how we allow them controls for that data, then this means that privacy and transparency are actually working together in this case.

As Danah Boyd would say, and I would say, please give users some agency in this process. It's not that hard to give them some agency and some real transparency of how their data is being used, and the amount of trust and happiness you'll get from people by actually respecting them and asking them, and not just doing something and waiting for somebody to report about it in the newspaper, this is a bond that will not break, you will get lifetime users.

Privacy & Privilege Revisited

I want to bring back this concept of privacy and privilege, I want to do it with this idea of the right to be forgotten. I don't know how many people here saw this Tweet, or heard about it. It was in Oakland and there was this black man who was barbecuing by the lake and there was this white woman who called the police. She said that he was barbecuing illegally, even though this was a common area where people barbecued. This photo was posted on social media and got quite a lot of attention and reporting and so forth. Let's go back to the right to be forgotten, let's say that either both individuals in this photo or one or the other were tagged or indexed, that in the search results now for their name, there's this image. Let's say that one or both of them wants this to be removed, do either of them or both of them have the right to be forgotten?

We may not like the idea that the woman has the right to be forgotten because we may say, "Oh, well, her behavior was abhorrent. Everybody should know that she did that.", but let's think about the stakes for each of them, because we can't just apply right to be forgotten whenever we feel like. That's currently how Google does it, but we should probably try to be fair about it, so let's say the stakes for her, what's going to happen to her if her name and then this, comes up? Maybe she might get in trouble with her employer, maybe she gets even fired, although this really depends on the employer, maybe she'll get sensitivity training, maybe she'll lose that one black friend that she had.

What are the stakes for him? There's quite a lot of police call and arrest data that's being used for predictive policing. Numerous jurisdictions are actually using arrests without double checking that any conviction was ever made; many departments are doing that. They're feeding it into automated and machine learning systems, and they're saying, "Yes, no, it's totally fine." What I would point out is that these systems have been shown mathematically to be feedback loops: if you have an overrepresentation of, let's say, black men in incarceration and/or in arrests, whether they're lawful or not, and you feed that into a model, then you send more police to the neighborhoods where black men tend to live, and then you continue the cycle. It's mathematically proven, and it makes a lot of sense if you just think about it.
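The loop can be seen in a deliberately simple, deterministic toy model. It is an illustration under stated assumptions, not a description of any real policing system. Two neighborhoods have the same true rate of incidents. Each day the single patrol goes wherever more arrests have been recorded so far, and only patrolled areas generate new records. The neighborhood names and numbers are made up.

```python
def simulate_patrols(recorded, true_rate, days):
    """Send one patrol per day to the area with the most recorded arrests."""
    recorded = dict(recorded)  # work on a copy
    for _ in range(days):
        target = max(recorded, key=recorded.get)
        # Only the patrolled area produces new records, at its true rate.
        recorded[target] += true_rate[target]
    return recorded


start = {"north": 11, "south": 10}      # a small historical imbalance
true_rate = {"north": 0.5, "south": 0.5}  # identical underlying rates

result = simulate_patrols(start, true_rate, days=100)
share_before = start["north"] / sum(start.values())
share_after = result["north"] / sum(result.values())
```

A one-arrest difference at the start grows from roughly 52% of the records to roughly 86%, even though both areas behave identically, because the data the model sees is produced by its own earlier decisions.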

The stakes for him, if this right to be forgotten is not implemented and his data then goes somewhere and is used as another arrest statistic, are that he and/or people like him may be continually negatively affected by this data being available. We can see here, perhaps, a little bit of how privacy, fairness, and privilege operate together when we think about machine learning and automation in our country.

I want to close here with blind justice. This is Lady Justice, also sometimes referred to as Blind Justice, this is an icon for the times that really says she is just. She doesn't see things like your skin color, she doesn't see things like your gender, she doesn't see things like where you live or what type of phone you have, this is not a way that she judges you. What I want to challenge us to do is when we're thinking of applied machine learning, when we're thinking of how we treat user data, when we're thinking of all of this, that we think about whether we're treating data like Lady Justice would.

Are we implementing data collection that allows users to control their privacy, or are we just telling them, "Don't worry about it, we got it. We're totally doing all the security stuff, don't worry about it"? Are we being open and transparent about the scales that we're using, or are we just saying, "No, we're doing the fairness research, it's totally fine. I can't really answer questions about that because my employer doesn't let me answer that question"? Or are we pushing ourselves to a new plateau, a new ethical grade where we say privacy is actually important, it's connected with fairness, and users deserve minimally that?

I don't want us to be on the wrong side of history. I don't want us to be those people when we walk outside of the building with a homeless person and we're not concerned about privacy, or fairness, or implementing anything. We're instead concerned about implementing whatever our employer tells us to do. What I hope is that we can be this push forward, we can be the advocates for privacy, and that we can essentially be that last stand, and I believe this is the last stand for implementing truly fair and private algorithms.




Recorded at:

May 23, 2019