
Bias in BigData/AI and ML



Leslie Miley discusses how inherent bias in data sets has affected everything from the 2016 Presidential race to criminal sentencing in the United States. He shares lessons from the past and explains how forward-looking companies are developing mitigation strategies to prevent bias in their systems.


Leslie Miley is a Silicon Valley native who has held engineering leadership roles at Slack, Twitter, Apple, and Google. He has been featured in USA Today, TechCrunch Disrupt, and Wired's 2017 Next List. He advises several startups founded by women and minorities and is an investor in a fund dedicated to diverse entrepreneurs.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


So this morning's keynote is someone who has worked at Apple, at Google, he has worked at Twitter and, most recently, he came from Slack. I'm talking about Leslie Miley. Leslie is going to do a talk today on AI, ML, and the inherent bias within the data sets. He is going to touch on the things in the news, social networks and the zero-day accounts, how underrepresented groups can be hurt by AI and ML, and what we can do about it. Please join me in welcoming Leslie to the stage.

Leslie Miley: Good morning, everyone! This is a big crowd. I'm really surprised we had this many engineers up this early, including myself. So, before I get into this, a couple of things I have to do. I have to do a selfie, this is the age of social media. And so I'm going to just do a selfie with all of you, if you don't want to be in this picture, you can leave now. Anybody leaving? Seriously, I'm doing a selfie. Don't worry, I will only get the top of my head. Look, I got photo-bombed.

An Unprecedented Year Powered by Bias, ML and Social Media

This has been an unprecedented year for a lot of reasons, and it has been unprecedented because we have seen so much of social media being powered by bias, powered by ML, powered by angst, powered by fear. It has been one of the most tumultuous years that I can remember, politically and socially. And I started to think about the part that we all played in that. So I'm going to go through a little bit of it now.

So Facebook, in 2016, said fake news wasn't a problem. Right after the election, they said, fake news is not a problem on our platform. By October of 2016, they were like, eh, 10 million people saw ads that were fake or Russian-linked, and by November, that number went to 126 million. And by the end, I suspect that they will say everybody saw them. And I think that would probably be accurate. We have all seen information that's been fake, or false, or propaganda. And, you know, we have gotten so used to looking at ads, most of us in this room, that we really don't even notice them. But your friends, your parents, people who are outside of tech have probably sent them to you, and asked what is going on, but you don't think anything of it. You are just like, yeah, people send this information to me all the time.

So when Twitter went on Capitol Hill, they were like, yeah, we have 6,000 Russian-linked bots on Twitter that generated 131,000 tweets between September and November 2016, a lot of tweets, with 288 million views. We have 68 million users in the United States, so effectively every user saw them. So how does that happen? How does something that you and I can see, that we know is propaganda and is fake and false, how did that get by and get pushed to tens of millions, hundreds of millions of people?

I will give you a little bit of my history in this. At Twitter, I ran the abuse, safety, security, and accounts teams. During an investigative session, we discovered hundreds of millions of accounts that were created in Ukraine and Russia. Hundreds of millions. This was in 2015. I thought, I don't know what these are for, they are probably not good, I don't know why they are here, we should get rid of them. But I left Twitter and I don't know what happened. I would like to think they got rid of them.

I suspect that they didn't get rid of them all, because we see what has happened. And once again, I'm like, what happens if you have hundreds of thousands, millions, tens of millions, hundreds of millions of these accounts on Facebook, on Twitter? Twitter, excuse me, Facebook just came out and said 200 million of their accounts may be false or fake or compromised. Yes, that's only 10 percent of their active users, I know that doesn't sound like a lot, but that is 200 million. There is a problem that we're just, like, not addressing, and we're going to dig a little bit deeper. I think that's the tip of the iceberg.

In 2016, Twitter came out with their algorithmic timeline. Facebook has been doing this, and Instagram is now doing it. But Twitter said they wanted to ensure that you saw more tweets from the people you interact with. And they also said- this is a quote from Twitter- ensure the most popular tweets are far more widely seen than they used to be, enabling them to go viral on an unprecedented scale. I say mission accomplished; they did a magnificent job of creating a timeline. Facebook has done a magnificent job of creating a timeline, showing you, your friends, and your family the most popular tweets.

But there's a problem with that. They are media publishers, whether they want to believe it or not. More people see information on Facebook than they see on the New York Times, CNN, MSNBC, and Fox combined. They are a publisher, and a publisher with no accountability, none. They publish it and they say, we're the platform.

The system didn't deliver news, the system delivered propaganda. The system didn't deliver your cat videos, it delivered biased information. They told people to go out and protest against Black Lives Matter, they told Black Lives Matter people to go and protest against something, they told somebody to go and shoot up a pizza parlor in the middle of the country because the DNC was supposedly running a pedophile ring; somebody did that because of fake information they got from social media. This concerns me, and I ask this question: what if there were hundreds of millions of accounts sharing compromised information? What if they were tweeting it, sharing it on Facebook, what if it was on Instagram? What do you think these systems, like Facebook's or Twitter's algorithmic timeline, would do with all of this? They would take it in and say, this is being shared a lot, I will share it with more people who like this type of stuff, who like this content.
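That feedback loop can be sketched as a toy simulation. Everything here is invented for illustration: the item names, the share counts, and the number of fake accounts.

```python
# Toy model of an engagement-ranked timeline. All names and counts
# below are made up; this is a sketch of the mechanism, not any
# platform's actual ranking system.
items = {"cat_video": 500, "news_story": 450, "propaganda": 50}

# A coordinated block of fake accounts shares only one item.
FAKE_ACCOUNTS = 2000
items["propaganda"] += FAKE_ACCOUNTS

# A timeline that ranks purely on raw engagement cannot tell organic
# popularity from manufactured popularity, so the fake signal wins.
ranked = sorted(items, key=items.get, reverse=True)
print(ranked)
```

Ranking on raw engagement alone has no way to distinguish organic interest from a coordinated campaign; any signal that can be manufactured cheaply will dominate the feed.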

And yes, it is not the people in this room mostly, it is probably people, just your friends and family who send you these things, I saw this, is this true? I get this all the time from my family. Is this true? I don't know why you think this headline is even remotely true, but it looks true. And between Twitter's 100 and something million, and Facebook's, and this number is in dispute, potentially over 700 million accounts on their platform, you have a billion accounts that could be sending false signals into these systems. Signals that take advantage of their algorithms, take advantage of our bias, and get us to think different things, to vote different ways, to talk to people in different ways.

And Facebook did a great study- if you want to call it great- in 2014, where they started introducing different types of information into people's timelines to see if it affected their moods. It did; people would post different things, people would read different things. It would actually change what they were doing. And my thesis is that, once they found this out, they published it, it went out there. My question was, did they do anything to stop anyone else from doing that? I think we know that answer today. It is a frightening world when you can reach hundreds of millions of people with information that is wrong, information that is propaganda, and influence their moods.

And the funny thing is, they didn't see it coming. Twitter didn't see it coming, Facebook didn't see it coming. And they actually stood up and said it wasn't a problem, until they started looking into this. Does this start to sound vaguely familiar to anyone? I mean, would all of this information be part of the training data that determines what you see in your timeline? It is. Every day. It is part of the training data.

And it concerns me because it was hailed as the Next Big Thing, bringing relevant content and targeted, relevant ad serving. These systems were deployed at massive scale, and worked with little human input or insight, showing people what they wanted to see, whether it was true or not. And that is not a world I really like living in, personally. It is a very scary world.

And so, Facebook, hey, I have to give them credit, Facebook said, we're going to hire 20,000 people to tackle fake news. 20,000 people to tackle fake news. One, is there that much fake news, and two, do you need 20,000 people? Twitter is reviewing how they serve ads to you, but they never said they are going to change anything; they are just going to throw people at this problem.

Shades of the Mortgage Crisis

And as I was preparing for this, I thought, why is this resonating with me in a way I could not figure out? I had to do a lot of reading and thinking. I said, this is shades of the mortgage crisis. This is shades of people taking a bunch of information in, chopping it into little pieces, feeding it out to a hungry public, and not really understanding, after a long enough time, how the system even works anymore, why it works that way, what is even in the system, and how it is generating its outputs.

Banks tried to understand the risks they had after the 2008 crisis. They had to hire people to actually look at every mortgage to understand their exposure. They had to look at every piece of mortgage data that they had chopped up and thrown into a CDO, and a lot of the banks just threw their hands up and said, we are going to write off some number, let the market come back, and not worry about it. Which, lucky for them, is what happened. It was interesting because Warren Buffett talked about CDOs, and I'm sure everyone in here knows what it is, it is collateralized debt something-or-other. I don't even know what it means. It is a mortgage that is chopped up, bundled, and sold as a security rated AAA. But they weren't AAA, because nobody knew what was in them, nobody knew how they worked or operated, and when it all came crashing down, everybody was holding the bag but nobody knew what was in the bag.

The Next Big Thing Will Be an AI/ML Company

So, why it concerns me is that the Next Big Thing will be an AI/ML company. It may be Google getting bigger, Facebook getting bigger, it may be something that we don't know. It may be something that one of you in here is going to end up creating. And I wonder if we are just going to repeat the mistakes of the past. I don't know. I hope not.

Why the Concern?

So, I have explained some of my concern. I have explained why this is a problem. If anybody wants to talk to me after, catch me out the door, I'm going to run out the door so I don't have to defend any of this.

The reason that we have to look at this now, more than ever, is that there's a thriving industry growing up around this. These models are being applied pretty much everywhere. Everywhere. They are being applied to self-driving cars, they are being applied to ride sharing. Imagine that, you know, Uber or Lyft or some other ride-sharing company determines that in a certain neighborhood, the rides are always under $5. Are they going to send people to pick up there? Probably not. Or, they are going to send drivers who are lower-rated. This is happening now. What does that do for impoverished people, what does that do for people who are not advantaged?

It is just like redlining in the '40s and '50s. This is happening because nobody is looking at the data, where the data is coming from, there is no transparency in how the algorithms are put together. I will get a little real. This is happening in sentencing guidelines.

ProPublica did a great article on this. They decomposed a sentencing algorithm, software whose makers said, this is going to remove the bias and people will be treated fairly. Guess what. African-Americans were 20 percent more likely to get a harsher sentence. In some cases, they were 45 percent more likely to get a harsher sentence, with the exact same parameters, because the data set that was used to train this model was inherently biased. They did not recognize it or remove it, and now it is deployed in 25 states, and it is sending people I know, my family members, to jail longer and giving them harsher sentences.
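A minimal sketch of how that happens, assuming nothing beyond the mechanism itself. The groups, record counts, and rates below are invented for illustration; this is not ProPublica's data or the actual sentencing software.

```python
# Hypothetical historical sentencing records as (group, harsh_sentence)
# pairs. The numbers are made up, and the labels themselves encode past
# bias rather than any ground truth about risk.
history = [("A", 1)] * 45 + [("A", 0)] * 55 + [("B", 1)] * 25 + [("B", 0)] * 75

# A naive "model" that learns the harsh-sentence rate per group simply
# memorizes the historical disparity and reproduces it at scoring time.
rates = {}
for group in ("A", "B"):
    labels = [y for g, y in history if g == group]
    rates[group] = sum(labels) / len(labels)

print(rates)
```

Nothing in this code ever measures risk; the "model" is just a per-group average of historical outcomes, so whatever disparity the labels carry comes straight back out as a prediction, with identical inputs scored differently by group alone.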

This is real, this is happening, and this scares me. Because, at some point, it starts impacting us more than a self-driving car, or more than an election that we may not agree with. It is going to start making life-and-death decisions for you, it is going to start making decisions about your healthcare, it is going to start making decisions about your mortgage rates, it is going to make decisions that you do not understand, and that the people deploying them do not understand. And, as usual, we, the public, will be left holding the bag. Because, after the mortgage crisis, no one went to jail. No one was held accountable, and we got the tax bill that we will continue to pay and your children will continue to pay. Really uplifting, isn't it?

What Can We Do Now?

So what can we do now? I mean, we can not talk about it. We can put our heads in the sand, you know, or we can start to have a discussion around where the data comes from. We can start having a discussion of, is the data over-sampled or under-sampled? We can start bringing in other people to look at the data. One of my favorites is that we can be transparent about what it is that we are collecting, what it is we are using, and how we are using it. This is not a trade secret, it is data. Anybody can get it, and how you use it should not be a trade secret, particularly because it involves people, because it involves places, and it involves things in the public domain already. And that is kind of where I want to go next.
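The over-sampled/under-sampled question can be turned into a simple, repeatable audit. The column values, the 900/100 split, and the 20 percent threshold here are all illustrative assumptions, not a standard.

```python
from collections import Counter

# An illustrative demographic column from a training set; the values
# and the split are made up to show the audit, not real data.
samples = ["urban"] * 900 + ["rural"] * 100

counts = Counter(samples)
total = sum(counts.values())
shares = {group: n / total for group, n in counts.items()}

# Flag any group under 20% of the data for human review. The threshold
# is a policy choice each team has to make and defend.
under_sampled = sorted(g for g, s in shares.items() if s < 0.20)
print(shares, under_sampled)
```

Running a check like this before training makes the sampling question concrete: the flagged groups become the agenda for the "bring in other people to look at the data" conversation.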

Actionable steps: seek people outside of your circle. You have to talk to other people. I want to, like, call out QCon for really taking a large step towards making this a diverse and inclusive conference. It is amazing, I'm seeing people in here- yeah, give them a hand. They have done a great job. And the fact that I'm up on this stage means they did a great job. Thank you.

But, when you are creating these systems and deploying them, find people outside your circle. I know a few people who are doing people detection, making sure that they can identify people, and identify the right people and the wrong people, and I asked, who is your data coming from? They showed me: these are all very wealthy tech people. These aren't people of color, these are not people from different backgrounds or- particularly in California- different body shapes and sizes; they are healthy people that you are doing people detection on. You have to widen your data set; you cannot rely on that and roll it out, because it is going to have problems. I talked about radical transparency; it has to be radical, you have to put it out for people to see, and it has to be peer-reviewed. If not, you are going to continue to build your bias into your data sets. My thing: hire more women in engineering. Do it!

Every engineering team I have worked with that has had more women on it has been a better engineering team. I just get better output; it is a fact. Get over it. Do it. If you still think it is an issue, come find me. I will stay around for that.

If you want to talk to me about that, I have some friends I want you to talk to. And another thing: work on your empathy and your self-awareness. We can change the data, we can make it transparent, we can bring people in. But if we don't improve who and what we are, if we don't develop more empathy and self-awareness, we are just going to revert to the mean. And that is something that I have seen every time. You know, just recently, we had, what's the guy's name, Justin Caldbeck trying to repair his image, he was one of the VCs, and first of all, work on empathy and self-awareness, meaning that you should not show up. You have tainted such a large pool that you have no business going out there. And this is what we like to do, though. We think that we can just go and programmatically solve a problem. But sometimes the problem isn't out there, sometimes the problem is in here. Sometimes the problem is with us.

And I really challenge all of you to- this is one of my favorite Obama statements, it is totally non-political- every day, he tries to wring a little bias out. That is a great quote for the systems we are building: wring a little bit of the bias out. You cannot get all of it, or even most of it, but just a little bit every day. It is like refactoring, you know. It is like refactoring: nobody likes to do it, but it is really good. You really end up in a better place because of it. So refactor your empathy, refactor your self-awareness, constantly and consistently.


And so there are some sources here: the bias-variance trade-off, check that out, it is a great read. And Algorithm Watch, and the Algorithmic Justice League, which I think is really good timing. These are sources that you can all go to that will help you understand how to start attacking bias in your data sets.

Europe, as always, is ahead of the United States when it comes to protecting people. They have the General Data Protection Regulation; it is amazing, go and read it, you will learn a lot. If you are not following it, I advise you to start doing so now. It might not make it to the United States, but it is the right thing to do.

And Federica Pelzel- I sourced from her for this talk, and she goes in-depth on a lot of what I'm talking about. There are a lot of people talking about this now. Let's not make the same mistakes, let's not build a data/ML weapon of mass destruction, deploy it on unsuspecting people, and stand up a year or two, or five years, later and say, we're just the platform. Because nobody is buying that anymore. Thank you.

And, uh, this is- this is going a little faster than I thought. I thought I would talk slower. But the part that has concerned me is that we generally have worked without a lot of accountability in tech. We have, for years, been able to craft systems and platforms with little oversight. And I think that has been, for the most part, a good thing. Those times are changing, and they are actually coming to a close.

I would rather be self-regulated than regulated by the government. I'm sure most of you in here would rather be self-regulated than regulated by the government. But you have to start leading today. Don't wait for these problems that we've seen with Twitter and Facebook and Reddit and the other, you know, platforms that have been co-opted by foreign governments to spread false information.

The only way to do that is to start attacking this today, and to start attacking this in your data sets today. You might not be building the next Twitter or Facebook, but you are building the next Something. And I implore all of you to start thinking outward, to start thinking about the impact it has on people that don't look like you, people that don't come from your backgrounds, people that don't come from your schools, people that don't come from your families. People who are less privileged than everyone in this room.

It is so important, and I am spending time on this because I've spent 20 years in this industry, and I have watched it become a force of change in the world that today I'm more embarrassed than proud of. I'm more embarrassed that we built systems that were co-opted by foreign governments, systems that put information in front of us that was not true, that essentially started, or amplified, the racial animosity that has been brewing in this country for decades, amplified it in a way that nobody would have anticipated five years ago. I'm positive that Jack and Mark Zuckerberg were not thinking, we're going to build a system that does this. They never thought that. But as the systems grew beyond what they understood, and as they brought in people who scaled those systems, they did not ask those questions, and we cannot make that mistake again.

I always think I should have something pithy to say at the end, but I can't. This is a dark talk. You could say it is a black talk.

I want to, one, thank you all for being here and listening to what I have to say. It is an honor to get in front of people, it is something that I look forward to because I think, when you can speak to your peers, and they listen to you, you can move the dial a little bit. So thank you for helping me move the dial. Thank you for showing up early in the morning, and thank you for laughing at my really bad jokes.

Wes Reisz: So we're going to go ahead and wrap, but Leslie will be up here if you want to ask him direct questions. Thanks, Leslie.

Leslie Miley: Thank you.


Recorded at: Dec 23, 2017