Big Data's Ethical Drought: The Thirst for More Data Has Led to a Lapse in Ethics and Privacy

Summary

Katharine Jarmul provides examples of data (mis)use and asks how we can work with data without violating the trust and privacy of users, while still producing an ethical product.

Bio

Katharine Jarmul is a passionate and internationally recognized data scientist, programmer and lecturer. Her current work and research focus on securing data for data science workflows as co-founder at KIProtect. She has held numerous roles at large companies and startups in the US and Germany. She is an author for O'Reilly and a frequent keynote speaker at international software conferences.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Jarmul: When I bothered the program committee enough that they invited me to come keynote this year, I thought I would continue the conversation that I had here last year, which is around ethics and computing. I thought, "How do we expand the conversation from just talking about machine learning? How do we expand it to be applicable to all of the different folks in the room?" I thought, "Big data. Most of us here work in software architecture, maybe large data or scaled architecture." This is perhaps a shared understanding that we have.

What we're going to explore today is how big data has perhaps led us into an ethical drought, and perhaps we need to bring back the ethics in order to save the future of where we're going with software and data science.

How many people here know what this image is? Has anybody here seen this image? No? Clearly we have Fidel Castro; as an American, I know his face, and we have a woman. When this image was circulated, I found it because I decided to read a study on data science and an analysis of the recent Brazilian elections. This analysis was put together by a few students and researchers at the university in Belo Horizonte, along with "The New York Times" and several other media watch groups. What they found was that this image was one of the most shared in the study. This image is indeed Fidel Castro, but the woman is not Ms. Rousseff. The woman is some other woman that was affiliated with Castro at the time, because this image was actually taken in 1959, and Ms. Rousseff would have been 11 years old.

I bring this up not to single out the people that circulated this image, or the people that circulated many other images for different political beliefs. There were beliefs from the left wing and from the right wing. There were these extreme images, these fake images, or images that were real that merely had fake headlines. Similar to some other elections we'll talk about later today, we don't know how much these images affected the elections. This is incredibly difficult to study from sociology or political science. It's very difficult for us to know why people vote, why people don't vote, and so forth.

What we do know is this: could the circulation of this image have happened 20 years ago with the same velocity? No. With the creation of systems like WhatsApp - I love using WhatsApp - Signal, and other messaging and social networks, our ability to produce and spread information is ever-increasing. The ability to do that faster, let's say, than the fact-checking is ever-increasing. Perhaps this is a good microcosm or a good prism to look at the problems that we see within data science today.

From Big Data to AI

To get to big data's ethical drought, so to speak, we have to begin with, "What is big data?" I think a lot of people here work with Java, or with large-scale Apache systems like Hadoop and so forth, and help manage systems like this. Back in the day, 2002 was the beginning of the research around big data. By big data, here, I mean multi-compute, so, "Can we use multiple computers and storage systems together in some sort of query mechanism?"

The initial research was done by a group who was building software called Nutch. Nutch was a web crawler. This was pre-Google's dominance of that era. Nutch was a web crawler that wanted to document all of the internet. It was released as open-source software. As they built Nutch, what they wanted to do next was they said, "We have to be able to query this in a meaningful way," and going around to each computer and asking each computer became time-consuming.

They said, "Is there a way to parallelize these queries, and to do them more efficiently?" This was the beginning research that ended up in MapReduce. MapReduce was essentially research that was built on by the Nutch team, Yahoo researchers, and Google researchers at the time. This research eventually culminated in the development of the first open-source version of Hadoop, which was released in 2006. This is quite some time ago now, when Hadoop was released. I remember when Hadoop was released, because I was working on a web scraping project, and I thought, "Maybe we can use this thing." I thought it was pretty cool.
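
To make the idea of parallelizing a query more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in Python. It is an illustration under assumptions, not the Nutch or Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the toy pages are hypothetical, and on a real cluster the map and reduce steps would run on many machines at once.

```python
from collections import defaultdict

def map_phase(doc):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(docs):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical crawled pages, made up for this example.
pages = ["big data big compute", "big questions"]
counts = word_count(pages)
```

Because each call to `map_phase` only looks at one document and each call to `reduce_phase` only looks at one key, both steps can be spread across machines without the workers needing to talk to each other.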

With Hadoop, what you needed to do is, you needed to maintain your own architecture. How many people here work in infrastructure? You maintain your own data centers and servers? Is it a fun job maintaining your own machines? It's the worst job ever, we should get robots to do that or something. Anyways, maintaining your own infrastructure sucks, and so thankfully other people do it for us.

An engineer on one of the infrastructure teams at Amazon and his manager, Benjamin Black and Chris Pinkham, decided they were having problems. They were doing all these complex VPNs and other VLAN architectures, and they said, "There has to be a better way than continuously writing new scripts to do this, the new VMs, and so forth." They started having debates of, "Could we make this into several services, and make these services somewhat automated?" They started having debates, and then they said, "You know what? I think other people would pay for this."

Such was the brief that they wrote to Bezos, and Bezos tends to like to review these suggestions from engineers. He reads through the executive briefing, and he makes a go/no-go decision. He said, "Sure, let's try it out." That got put to several other engineering teams, and in 2006, the initial EC2 was released. This was an amendment, I think before that they had either SQS or something similar to that, and then they also had S3. This is the first time EC2 compute boxes were available.

I spun up my first EC2 in 2008, 2009 or so, and I was so excited. I was like, "I'm in the cloud, everybody!" and I thought it was really cool. A bunch of the engineers at the company that I worked at, at the time, were like, "This cloud stuff is garbage. We have to maintain our own data centers," whatever. I drew a little picture of a cloud and I had it pinned up by my desk. I was, "Look, you could see my cloud. I'm monitoring it," and I had a bunch of Bash scripts or whatever.

How many people were on EC2 pretty early, pre-2010? A few. It was a wild west a little bit. There was lots of Bash scripting and random other things. Thankfully, it has grown since then, we have this big data now, and we have this big compute that we can rent from other people whose problem it is to make sure it works.

From that came big machine learning, or what we now like to call AI, or artificial intelligence. You can see there's a little bit of a lag time; this is a Google Trends chart of the term "machine learning." You can see, in 2012 it starts to spike up a little bit, and it goes to - of course, we know the hype cycle of machine learning today. I would say it was more than hype: we had big data available, and now we had big compute available. What the researchers were able to show is that this actually allowed us to do new things with machine learning that previously we didn't have the compute power for, or we didn't have the data sets for. Most of the advances in machine learning are small iterations on different techniques that were available in the '50s and '60s.

The birth of deep learning around computer vision, of Word2Vec, which came out in 2013, our ability to use deep learning with language, all of this happened also because we had enough computers that we could have researchers working on these problems, and they could find new innovative ways to apply the same loss equations and come up with new things.

Big Data Collection: An Investigation Into Privacy and Ethics

Now we have a bit of a shared understanding of these origins, but through all these developments, were there any costs along the way? As we were working on making big data available, making big compute available, making AI available, did we cut any corners? I like to always include XKCD. I don't know if there are any XKCD fans, but I really like his work. In this XKCD, I am hinting a little bit at where we're going. It's almost always that we find some way to cut corners. We have to be efficient, and fast, and cheap, and everything else, and there's going to be some corners cut along the way. What we're going to do for the rest of the talk is we're going to explore, "What are these costs?"

We're going to start with big data collection because as with everything, especially when we're looking at big data or machine learning, we need to start with some data that's available to us. We're going to investigate maybe some previous examples and some current examples of privacy and ethics.

MNIST, LeCun computer vision, recognition of digits. I'm sorry if you took machine learning in school, because you're probably traumatized right now and you're so tired of seeing this, but it exists. We use it, and we use it in a lot of introductory classes, and it's used in a lot of research as the toy example. Does anybody know where the numbers came from? MNIST is short for Modified NIST. NIST is a U.S. entity, it's, let's say, government-supported, where government and industry get together. It's called the National Institute of Standards and Technology, and it's been around for decades. This is a computer programmer from the first digitization of the U.S. Census, so this is her working at her desk, digitizing the U.S. Census, and this was a NIST initiative. Is it the U.S. Census? Why do we have these digits? Who wrote them? These people wrote them.

This is an example of the handwriting collection form that was used for MNIST. That eventually became digitized, of course, for computer vision. What do we see here on this form? Is there any data which could be privacy-related? There's a name, at the top left, that's blocked out, thankfully. There's a date, there's a city, there's a state, there's a zip code, that's all where this person wrote this. Then the person is asked to write a series of letters, and digits, and part of the U.S. Constitution.

Who wrote these though? We still don't know. We know how they were collected, maybe, but who actually wrote them? It was primarily children. It was distributed at two different types of offices. It was distributed at the NIST offices of the census workers, and then it was distributed at a bunch of high schools. These are 14 to 18-year-olds filling this out. Do you think that their parents were asked? Do you think that they knew we'd be making fun of how they write their 2s for the rest of their lives and probably far beyond? Do you think they can revoke consent?

In a world of GDPR - and we'll get to LGPD later - can they say, "No, I want my numbers back. Please don't use them anymore"? Do you think this is feasible? How many of you have MNIST literally on your computer right now? This is important to think about. I'm not saying we shouldn't have MNIST, and I'm not saying we shouldn't have open data sets. I believe very strongly in our need for these, but can we get better at consent? What do you think? Did we get any better in 20 years or so?

The IBM Diversity in Faces dataset has a little bit of a long story in terms of how it evolved. I'm not sure if you're aware of the ethical problems with machine learning, but one of the largest ethical problems in machine learning is within facial recognition. Facial recognition for people of different races, ethnicities, and shades of skin doesn't always work. Sometimes it's very wrong.

Joy Buolamwini along with Timnit Gebru, two researchers, decided to study the misgendering of darker-skinned individuals. Specifically, they found that darker-skinned women had a 30% accuracy gap from white men. Part of how they were able to do this is they assembled a dataset of everybody in the UN. This had many different skin shades and, of course, different genders. They assembled this, and they tested a bunch of different APIs. One of the APIs they tested was IBM's, and IBM had one of the largest gaps in accuracy between gender identification of darker-skinned women and light-skinned men.
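
As a rough illustration of what an "accuracy gap" between groups means, the following Python sketch computes accuracy separately for each subgroup and then takes the difference between the best and worst group. The records are made up for this example and are not taken from the study, and the helper names (`accuracy_by_group`, `accuracy_gap`) are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, predicted_label, true_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}

def accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())

# Made-up predictions from a hypothetical gender classification API.
records = [
    ("lighter-skinned men", "male", "male"),
    ("lighter-skinned men", "male", "male"),
    ("lighter-skinned men", "male", "male"),
    ("lighter-skinned men", "male", "male"),
    ("darker-skinned women", "female", "female"),
    ("darker-skinned women", "male", "female"),
    ("darker-skinned women", "female", "female"),
    ("darker-skinned women", "male", "female"),
]
per_group = accuracy_by_group(records)
gap = accuracy_gap(per_group)
```

The point of splitting by group is that a single overall accuracy number (here 75%) can hide the fact that one group is served perfectly and another only half the time.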

They presented this as research, and IBM did the right thing - they invited them to come in. They called them up the next day after the paper was published, and they said, "Please come visit our offices. We'd like to investigate this further." They said, "Sure." They went and visited the offices, they talked with the teams, they showed them the dataset and IBM said, "Don't worry, we're going to release a new version." They released a new version, and it was much more accurate. I think it was a few percentage points difference. A few months pass, and IBM says, "You know what? We're going to be the leaders in diversity in AI. We're going to release this dataset Diversity in Faces." It's pretty revolutionary because, in it, we don't have man/woman. There's actually a spectrum of masculine versus feminine, which is an interesting new approach to how we tag this data. In it, they also said, "We have so many different skin tones and hairstyles, and it's very diverse."

There's two problems with this release. The first was, when they initially released it, they didn't credit Joy Buolamwini and Timnit Gebru, which is kind of just a jerk move, in my opinion. The second problem was how it was collected. The problem with how it was collected is it used this old Yahoo dataset. This Yahoo data set is not a million years old, but it's quite old. It has a lot of old Flickr data. Here, it has 99.2 million photos and 800,000 videos. They used this data set, and they said, "We went through and we took all the faces."

What consent was given for this dataset? How many people remember when Flickr was still popular? I guess people still use it. It used to be like everybody had all their photos on Flickr. You had a digital camera, because of course you did, and you're taking the photos. Then you're, "I'm going to share all my vacation photos with my friends, so I'm just going to upload the whole SD card to Flickr. I'm going to choose the default settings," because that's the only way you can do it with a bulk upload. "I'm going to put the default settings, then I'll put all my vacation pictures," and who knows if I asked the consent of all the people I took the photo of. There's a whole other issue. Minimally, "I'm going to just upload everything," and this is the dataset. As we can see, the data was shared under one of the various CC licenses, so it was "maybe" completely free use - maybe not. You don't have to be a researcher to use this dataset, anybody could download it. You just go put in your email right now. It's actually an automated process, it automatically just emails you an S3 link, of course.

The problem that I really have here is, if we're talking about building ethical datasets, shouldn't we also be critiquing the way that we develop and put these datasets together? Should we be slightly critical of the consent use? Maybe? I don't know. I think so.

Big Compute: A Study of Privacy and Ethics at Amazon

Unfortunately, I haven't seen a lot of massive development in more ethical data collection. I think, hopefully, with new regulations, this may or may not happen, but I'm calling on us to do that. Now that we're done with data collection, let's move on to compute. Because Amazon has the most compute of everybody, pretty much, we're going to study how compute is used at Amazon, because they got to have some free cycles here and there. Let's take a look at how they're using them.

Amazon decided they were going to automate recruitment, which makes sense to some degree. Recruiters cost a lot of money, it takes a lot of time, maybe this is something that we can automate. It's a painful process anyways, so, can we just make it better? What they did is they hired a team of researchers, I believe they were based in Ireland. They said, "From all these historical resumes, and all of these historical performance improvements, can you figure out a way to make it automatic? Can you figure out who's the good engineers, and who's the bad engineers, and who we should hire, and who we should fire?", and so forth.

They took that data, and they also took some data scraped and bought from different recruiters, and they put it together and they made an NLP model, a Natural Language Processing model. What could go wrong? Unfortunately, the NLP model found a good hack. The NLP model was clever - I mean, it wasn't clever, it was just learning what was in the data that was historically biased. What it figured out is that even though they had removed gender, and the racial category, and the age, and all this stuff, and they were, "We did a good job. Privacy is protected, and it should be ethical," what the Natural Language Processing model learned was that anytime it saw the word "women's," it was like - downvote. Lots of resumes that had things like women's soccer, women's hockey, where people were describing their extracurricular activities, it would find them and it would say, "No, you're not a good fit." It also learned the names, ironically, of the women's colleges. There are some all-women's colleges, and it learned those, and it was like, "Nah, don't think so. Probably not a good engineer."
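
A toy Python sketch can show how a proxy word ends up carrying the bias even when the gender column has been removed. This is not Amazon's model: the resumes, the labels, and the `token_hire_rates` helper are all hypothetical, and a per-token hire rate stands in for whatever weights a real NLP model would learn.

```python
from collections import defaultdict

def token_hire_rates(resumes):
    """resumes: list of (text, hired) pairs. Returns the hire rate per token."""
    seen = defaultdict(int)
    hired_count = defaultdict(int)
    for text, hired in resumes:
        # Count each token once per resume, so repeated words don't dominate.
        for token in set(text.lower().split()):
            seen[token] += 1
            hired_count[token] += int(hired)
    return {token: hired_count[token] / seen[token] for token in seen}

# Made-up historical decisions; note there is no gender field anywhere.
history = [
    ("python chess club captain", True),
    ("java chess club", True),
    ("python women's chess club captain", False),
    ("java women's soccer", False),
    ("python soccer", True),
]
rates = token_hire_rates(history)
```

In this made-up history, `rates["women's"]` comes out lower than the rate for any skill or hobby token, so anything trained to reproduce the historical decisions has an easy shortcut: penalize the word.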

The problem here that we have is that we have all of these historical problems within our societies, right, and we have an imbalance in terms of gender in engineering and data science. We've seen this in many studies of Word2Vec too, for example, "Man is to computer programmer as woman is to homemaker." The problem here is that we have so much compute, but we can't solve societal problems with compute, it's pretty hard to do that.

They didn't go on to describe any racial biases, any class or poor versus rich biases. These have all been studied in a variety of other automated mechanisms for recruitment. I would not be surprised if those were also in this recruitment system. They scrapped that, they're, "Well, that's pretty bad. We should just get rid of that." Amazon is an optimization company, it wants to optimize things. It wants to use all those compute cycles - I mean, you can't just have boxes lying around without 100% CPU usage.

They figured out some other ways that they could use the compute. What is this? A shelf? Yes, I see a shelf. Is this a torture device? When I saw it, I was, "I don't know. What website am I on right now?" This is a U.S. patent, and it's an approved U.S. patent for Amazon for using bracelets - there's a nice little bracelet there, it's a little hard to see - that give haptic feedback. When you're working in an Amazon warehouse, and you reach for the wrong shelf, "Wrong." You're reaching, putting it in the wrong box, "Wrong".

When this came out and was released, of course, the Amazon lawyers and PR people were immediately on the phone with the press, and they were, "No, it's just a patent, man. Everybody has patents, it's totally fine. Everybody has patents, and everybody's tracking their workers, and don't single us out." This may actually be true, it's still a little creepy.

Amazon's lawyers were enmeshed in a legal argument last year about a wrongful termination lawsuit, some people that were suing Amazon and saying, "They wrongfully terminated me," and so forth. In a rebuttal, Amazon's lawyers actually admitted that at some of their fulfillment centers - warehouses - hundreds of employees are fired every year for not meeting their quotas. How are these quotas generated? Do I go talk to my boss, and my boss says, "Please, can you do 10 more boxes today?" Is that how it works? Or is big compute at play here?

How it actually works, and this is further defined in the lawsuit, is that there are automated systems for defining quotas. These automated systems for defining quotas are based on the other workers at the warehouses mixed with the optimization of packaging. If you fail to meet your quota, you get what's called a warning slip. If you get three warning slips in a row, you get a termination slip, and this is all automated. Your manager can supposedly override it, but I'm curious of how the manager is then audited. Here we see the use of big compute, and the use of big data, and the use of statistics to create a working environment that nobody wants to live in. There's not a single human that wants to have a bracelet that zaps them and a machine that fires them.
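
To see how little human judgment such a pipeline needs, here is a minimal Python sketch of the rule as described above. It is only an illustration: the talk says quotas are based on other workers mixed with packaging optimization, so deriving the quota from the average of peer rates is an assumption, and the function names, the `factor` parameter, and the numbers are hypothetical.

```python
from statistics import mean

def quota_from_peers(peer_rates, factor=1.0):
    """Assumed rule: the quota is the peers' average rate times a tuning factor."""
    return mean(peer_rates) * factor

def review_worker(daily_counts, quota, max_warnings=3):
    """Return 'terminate' after max_warnings consecutive missed days, else 'ok'."""
    streak = 0
    for count in daily_counts:
        # A missed day extends the warning streak; a met quota resets it.
        streak = streak + 1 if count < quota else 0
        if streak >= max_warnings:
            return "terminate"
    return "ok"

# Made-up numbers for illustration.
quota = quota_from_peers([100, 120, 110])
missed_three_in_a_row = review_worker([105, 108, 109], quota)
recovered_in_between = review_worker([105, 112, 109, 100], quota)
```

Nothing in this loop knows why someone missed a day, which is the point: once the threshold is encoded, the decision runs without anyone having to look at the person it affects.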

Privacy and Ethics in Data Science and AI: Election Edition

We've seen some of the problems with big data, and we've seen some of the problems with big compute, and now we get to dive into data science. I love data science, and I love machine learning, it has been such an amazing field for me to work in. I feel really grateful to be alive now and to be able to see some of the advancements that we've made. As we look at data science, we're going to look at the elections, we're going to focus on data science and machine learning in elections.

2007, probably, or '06 was when this was designed, it was a simpler time. Barack Obama was a young senator, and he had no gray hairs. He decided that he would figure out if he could run for President of the United States, obviously, and nobody knew who he was. I actually got to interview him when he was still a senator on the congressional floor. At the time, I was doing my data journalism degree, and nobody knew who he was, he was just a senator with some opinions.

He decided, "Nobody knows me," and in politics that's a pretty bad thing. He decided that he would hire a team of really smart data scientists, and he would figure out if they could help him out. He would figure out if they could help him get more donations, because in the U.S., at least, it's kind of like whoever has the most money wins. It tracks pretty well. You at least have to have a lot of money, usually, to run for president.

He said, "Can we optimize my donations?" This is an image that they settled on, and we could see it was a time when you would just load this site, and there would be no cookie warnings, and there would be no other stuff happening. You just put in your email and your zip code and click the Learn More button. How is this designed? Did he just get a really awesome team of designers? It was actually designed like this, in a sense. This is essentially a statistical analysis, it's multivariate testing, that was used in order to figure out conversion lift. This was used to choose what image, it was used to choose what text.
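
The arithmetic behind "conversion lift" in a multivariate test can be sketched in a few lines of Python. The variants and counts below are made up for illustration and are not the campaign's data; the helper names (`conversion_rate`, `relative_lift`) and the choice of baseline are assumptions.

```python
def conversion_rate(signups, visitors):
    """Fraction of visitors who signed up."""
    return signups / visitors

def relative_lift(variant_rate, baseline_rate):
    """Relative improvement of a variant over the baseline."""
    return (variant_rate - baseline_rate) / baseline_rate

# Hypothetical (image, button text) -> (signups, visitors); numbers are made up.
results = {
    ("image_a", "Sign Up"): (750, 10_000),  # treated as the baseline
    ("image_a", "Learn More"): (800, 10_000),
    ("image_b", "Sign Up"): (900, 10_000),
    ("image_b", "Learn More"): (1_200, 10_000),
}

baseline_key = ("image_a", "Sign Up")
rates = {key: conversion_rate(*counts) for key, counts in results.items()}
best_key = max(rates, key=rates.get)
best_lift = relative_lift(rates[best_key], rates[baseline_key])
```

A real test would also check that the difference between variants is statistically significant before declaring a winner; this sketch only shows how the rates and the lift are computed.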

The computer scientists and the data scientists that he hired actually went on to create Optimizely. Before they had Optimizely, they were working on Barack Obama's campaign, and they saw a significant lift. I remember this, because by the time the donations were in, I was working as a data journalist at "The Washington Post," which is a very large newspaper in the U.S. We did a whole visualization and interactive, because Barack Obama exploded all records of small donations. He had the most small donations of anybody we had ever seen in the history of U.S. politics.

We were like, "What is this guy doing? How is he getting so many people to donate $10, $100, $200, $500, all these small donations?" Besides optimizing the website, what they did was they started to create personalization campaigns. They would send you an email if you had done a little small donation, and they would say, "Please, can you give $20 more?" They were testing the messaging, "What's going to get the most other small donations?" They were using data science to get money for Obama's campaign.

How could this go wrong? About five years later, Barack Obama is already president twice at this point in time. There was some research being done at Cambridge University, taking a look at how much information you could get from a person by looking at their Facebook likes. How much information do you think you can get? What these researchers wanted to do is, they were a little bit concerned about privacy, and they said, "Could we figure out private attributes of a person, sensitive things about a person by only looking at their Facebook likes?"

What they found was, indeed, that with more than 90% accuracy, they could guess your gender, and that with between 75% and 88% accuracy, they could guess whether you are gay or lesbian, or you identified as gay or lesbian. They could even identify, at the lowest percentage - they were, "We don't know. We could probably do better if we had more time," - whether your parents were divorced. You could figure out if your parents are divorced based on your Facebook likes, figure out your political party, ethnicity, whether you used drugs recreationally, and so forth.

These are all things that they were able to study via psychometrics. It's called the Psychometrics Centre at the University of Cambridge. Kosinski was the lead researcher on this, but there were several other researchers involved. I was curious, "How are these people chosen? How did you get so many people to agree to give you their likes?" In the abstract, you don't even have to read the full paper, they said that they had more than 58,000 volunteers. These volunteers gave them access to their likes, along with detailed demographic information. What is this? Just your profile. Facebook is, "Please, profile 80% complete, fill in more information about yourself." Then several psychometric tests, so personality tests primarily. Where am I going with this? What is this related to?

This is an example of one of their psychometric tests. Remember, they were at the University of Cambridge. This group of researchers was targeted, and several of them were hired by the guy who ran Cambridge Analytica. This was some of the research that the initial data collection and profiling of Cambridge Analytica was based on. This then went on to be used by the Leave campaign for Brexit, this went then on to be used by Donald Trump's campaign.

We don't know how this affected the elections. It is impossible for us to go back and do it again. What we do know is that messaging like what we saw in the Trump campaign, and like what we saw in the Leave campaign, that this affects society, that it affects our ability to have a reasoned, educated conversation. It affects our ability to not be polarized and to try to find solutions in some sort of meaningful way, some sort of compromise, some sort of moving forward. It minimally might affect turnout, might affect who decides to vote.

Technical Development and Increased Data Use – A Story of Winners and Losers

I want to figure out if there's a better way, because I really like the work of machine learning that I do. Right now, I work mainly in privacy and security because of the concerns. I like this work, and I like working with computers, and I don't probably think I'm very good at anything else. I'm only even half good at computers, so I got to figure out how to make this work. I think that a lot of other smart people feel the same way. I want to ask if we can turn this from winners and losers into some sort of non-zero-sum game.

I want to use game theory, and I want to see, is there a way where we can get to a place where we don't have such an extreme group of winners and losers, where we perhaps are not actively harming people in society with the technologies that we develop? I must bring up new developments in the world. Here in Brazil, you have LGPD, which, as far as I understood - I read the full English text yesterday, it was a fun read - has lots of actual similarities to GDPR. I was surprised how many things were almost directly related. How many people are actively reading or thinking about LGPD right now? It's still a few years until it goes into effect, so there's still plenty of time to procrastinate and wait until the week before, and then just shut your website down. If you want to follow in San Francisco's footsteps, then there you go.

LGPD - here's this new idea, and the idea is, "Can we enforce some form of ethics in data?" We can definitely argue about whether regulation is the right way to do this. However, we also haven't done a very good job at regulating ourselves. I can see the point of view when public opinion is starting to turn against us, and politicians then use this to make new regulations.

What does LGPD call for, actually? One of the biggest pieces of LGPD and also GDPR is around consent. It's around the idea that if you're going to be asking or collecting data from users, you should ask consent. The consent should be really clear, it shouldn't be like five pages of legal text where I have to keep scrolling and then I can finally click Accept. It should be really clear, "What is it used for? How are you going to process it? Who are you going to share it with? When might it be deleted?" Pretty clear stuff, not rocket science.

I like to say that consensual data collection is sexy, that consent, in general, is sexy. It's nice to be asked. I would say that, "Why not ask people?" If you change your mind, you should be able to revoke it, "No, not really. I'm not feeling the data collection today. Maybe tomorrow, ask me tomorrow." I think that that should be a normal thing, that should be ok. We think about consent in a lot of other areas, but I think in data science and in computing, a lot of times we're just, "No, get as much as you can. Store it away." It's like we got these little bunkers stored with data we're maybe going to use one day. I mean, give me a break, guys.
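
One way to picture clear, revocable consent is as a small record that states what was agreed to and can be switched off again. The Python sketch below is only an illustration of that idea, not something LGPD or GDPR prescribes: the `ConsentRecord` class, its field names, and the `may_collect` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    user_id: str
    purpose: str                       # what the data is used for
    processing: str                    # how it is going to be processed
    shared_with: list = field(default_factory=list)  # who it is shared with
    delete_after_days: Optional[int] = None          # when it might be deleted
    granted_at: Optional[datetime] = None
    revoked_at: Optional[datetime] = None

    def grant(self):
        """Record that the user said yes (again)."""
        self.granted_at = datetime.now(timezone.utc)
        self.revoked_at = None

    def revoke(self):
        """Record that the user changed their mind."""
        self.revoked_at = datetime.now(timezone.utc)

    @property
    def active(self):
        return self.granted_at is not None and self.revoked_at is None

def may_collect(record):
    """Collection is allowed only while consent is granted and not revoked."""
    return record.active
```

A short usage example: a freshly created record does not allow collection, `grant()` enables it, `revoke()` disables it, and the user can grant again tomorrow if they feel like it.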

Another part of LGPD is about ensuring that private information or sensitive information actually remains private. It has some of the principles, and this is very clear in GDPR, I would say less so in LGPD, that it has some of the principles of what we call privacy by design. Privacy by design is essentially just good privacy tech and good security. What it asks is that you store things in encrypted forms, and what it asks is, if you don't need access to data, then don't use that data or don't give access to the application. What it asks for is that if something is designated as sensitive or private, that this is under extremely restricted access. What I ask is that we ensure privacy for everyone, not just the limited few, and by this, I mean defaults. Privacy should be the default for everyone. It shouldn't be like uploading on Flickr and, "Oops, now my face is on a million different machine learning models." It should just be privacy and security by default. We know that this is the only way that works, because people don't choose.
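
"Privacy as the default" can be sketched in Python as settings that start out closed and an access helper that only returns what a role was explicitly granted. This is a minimal illustration under assumptions: the `SharingSettings` class, its field names, and `readable_fields` are hypothetical and not taken from any regulation or product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SharingSettings:
    # Every option starts out private; a user has to opt in, never opt out.
    public_profile: bool = False
    allow_ml_training: bool = False
    share_with_partners: bool = False

def readable_fields(record, granted_fields):
    """Return only the fields that were explicitly granted to the caller."""
    return {key: value for key, value in record.items() if key in granted_fields}

# Hypothetical user record and a role that was only granted the city field.
user = {"email": "someone@example.com", "birthday": "1990-01-01", "city": "Berlin"}
defaults = SharingSettings()
visible = readable_fields(user, {"city"})
```

With defaults like these, the person who never opens the settings page gets the most protective behavior rather than the least.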

How many of you go into your settings and you actually read everything, and you look, and you assemble things? Come on, I expect more of you. How many people here, when you have an open Wi-Fi access point, do you use your real email, your real name, and your real birthday? You use a little Facebook login, "What could go wrong? The airport just wants my Facebook login." No, we don't do this, because we're, "No, I know how that data is used. I'm not going to do that. No." The problem is that lots of people do, my mom does. I keep yelling at my mom, "Mom, stop doing that. Create some fake emails. I'll make some for you if you really want them." My mom does, I'm sure some of your moms do. Lots of other people's moms do too, and people that are not moms do it.

All this data is just being collected and stored away, "Maybe one day we'll use it," or, "Maybe I can get a few cents for this. Do you want an email? I'll charge you two cents." That's really the problem - we shouldn't have to be knowledgeable, we shouldn't have to be privileged, we shouldn't have to be data scientists, or computer scientists, or engineers in order to figure out how to protect our privacy. That shouldn't be the way.

This relates to Danah Boyd's work. Danah Boyd is a researcher who studied a lot of social networks, and she studied young people's opinions of social networks. How do young people view themselves, view the world via social networking? She's on a mailing list that I'm a part of, and I'm going to really quickly find my note from her. There was a debate on this mailing list, and the argument was, "Privacy is the enemy of free democracy. If we ensure privacy, then it means that we can't look at politicians, and we can't do journalism, and so forth."

Her response, I thought, was quite reasoned, "Privacy also has a long and complex history in both legal and technical contexts. This often gets boiled down to, 'Who has access to the data?' From my vantage point, that's a dreadful way to define privacy. Alice Marwick and I spent years interviewing teenagers about what they understood to be privacy violations. What became clear to us is that privacy in the colloquial sense means control over a social situation. In other words, it's not about access, but it's about expected use. This is where we realized that one line of thinking actually got this historically, and this is a reasonable expectation of privacy."

I think what we can take out of this is, let users help define what they think is a reasonable expectation of privacy. Help involve them in the conversation and make the controls easy to understand for non-engineers. If you're going to be using sensitive data, please take some time to think about transparency. A lot of people, if they're trusting you and using your service, I think they're ok with a lot of data use, but be transparent to them. Give them a clear understanding of how it's being used. They might even opt-in more if they understood it better, because they trust you. It's a trust interaction.

Build Systems That Do No Harm

Finally, I think we might be getting to a place where we should think about taking some sort of step towards an oath. I had the benefit of visiting a wonderful engineering team last week at Kunumi, Juliana's company. There they're actively working on AI for medicine, and so they're thinking about, "Do no harm," every day. When they're interacting with doctors, and they're working with medical data, they're literally helping people's lives, and they understand what's at stake. They understand, literally, if a mistake is made, it could mean somebody's care.

We don't all work with such extreme losses and costs. What I would say is that we don't always realize how much power we have. I would say that we have a lot more power, and as we expand machine learning, and use of data and data collection, we're having more and more power over time. Should we think about, "Do no harm"?

Microsoft: An Epilogue on Privacy and Ethics

As a final epilogue, I want to take a look at Microsoft, because Microsoft has been around for a long time, and so I think they're a good lens for seeing the growth of big data, big compute, and so forth. I want to take a look at Microsoft with a view of their privacy and their ethics. It would be silly to think about Microsoft and ethics and not mention the philanthropic work of Bill Gates. He's donated literally billions of dollars to help cure malaria, to help build renewable energy, to many initiatives, education, research, prizes, to try and make the world a better place. This is a really fantastic kind of mentorship of the entire tech community.

It would also be incredibly silly to mention Microsoft and privacy and not talk about the work of Cynthia Dwork - brilliant, fantastic researcher, cryptographer, privacy expert. Her work at Microsoft invented a new way of thinking about privacy. She literally created a whole new way for us to measure and think about privacy, and that's called differential privacy. She won a prize last year for it, but all of her research, or most of her research on differential privacy, was done while she was at Microsoft Research. Her research went on to prove that there is a link between ethics and privacy, literally a mathematical proof that ethics and privacy are connected, which is pretty cool.
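To make the idea of differential privacy a little more concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is an illustration under stated assumptions, not a hardened implementation: the function names are hypothetical, and it relies on the fact that adding or removing one person changes a count by at most 1, so noise with scale 1/epsilon is enough.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values matching `predicate`.

    A count has sensitivity 1, so Laplace noise with scale 1/epsilon
    gives epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Usage with made-up data: how many people are over 40?
rng = random.Random(0)
ages = [23, 35, 41, 52, 29, 61]
noisy = private_count(ages, lambda age: age > 40, epsilon=1.0, rng=rng)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate answer and weaker privacy.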

However, Microsoft is also the biggest vendor in the U.S. of predictive policing software. I'm not necessarily anti-police, I did grow up in Los Angeles, and the LAPD has a bit of a reputation, as you might know. Predictive policing, this sounds like a good idea. Can we save money? Can we make it safer for neighborhoods? Can we figure out ways to deploy police and to have this be more cost-effective, more helpful?

When we look at the examples of the women who are bad engineers, or the darker-skinned women who are men, when we look at the examples of how machine learning and statistics can go wrong, when we think about predictive policing, we have to be questioning. There were several papers at the Fairness, Accountability, and Transparency in Machine Learning Conference, which I highly recommend, that covered the fact that predictive policing follows what we call a feedback loop. If the historical data says, "This neighborhood is full of criminals," because it's poor, because it's more black, in the U.S. at least, then what happens is police are sent there, and then they only arrest people there. Then what do you think happens to the algorithm over time? Eventually, all of the police are deployed to one or two neighborhoods, and none of the police are solving crimes or helping anyone anywhere else. We have to question these systems, and I think that there is now a tide turning, that there are more and more engineers that are questioning these systems.
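The feedback loop can be sketched with a toy simulation. This is an invented, deliberately simplified model, not the model from any of the papers or from any vendor's product: the neighborhood names, the numbers, and the "send the patrol to the top hot spot" rule are all assumptions made for illustration.

```python
def simulate_hotspot_policing(recorded_arrests, true_crime_rate, days):
    """Each day, send the patrol to the neighborhood with the most recorded arrests.

    Arrests are only recorded where the patrol goes, so the data the
    system learns from is shaped by its own earlier decisions.
    """
    recorded = dict(recorded_arrests)
    patrol_days = {name: 0 for name in recorded}
    for _ in range(days):
        target = max(recorded, key=recorded.get)  # the "predicted hot spot"
        patrol_days[target] += 1
        # Arrests found depend on the true rate, which is the same everywhere here.
        recorded[target] += true_crime_rate[target]
    return recorded, patrol_days


history = {"north": 11, "south": 10, "east": 10}  # slightly skewed historical data
rates = {"north": 2, "south": 2, "east": 2}       # identical underlying crime
recorded, patrol_days = simulate_hotspot_policing(history, rates, days=30)
print(patrol_days)  # {'north': 30, 'south': 0, 'east': 0}
```

Even though the underlying crime rate is identical in all three neighborhoods, a one-arrest difference in the historical data is enough to send every patrol to the same place, and the recorded data then appears to confirm that choice.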

Hundreds of Microsoft employees signed an open letter to the CEO and said, "We're not going to work with ICE." ICE is Immigration and Customs Enforcement; they're the people doing the deportations in the U.S. This was shortly after the news of the family separations came out, where, of course, children are being separated from their parents as part of the deportation process. They said that they refuse to be complicit. They said that, "We are part of a growing movement comprised of many across the industry who recognize the grave responsibility that those creating powerful technology have to ensure that what they build is used for good and not harm."

They weren't the only ones, because the biggest contracts from ICE are with Amazon, and the second biggest with Microsoft, and then Palantir, and then Salesforce, and then Google. There were lots of other employees that decided to protest these actions. Yesterday, I was listening to the keynote. In the keynote, there was a bit of a joke made, "We create lots of problems, but we also solve them, and sometimes we create more problems than we solve." That resonated very much with me. I thought, "Yes, this is what I'm talking about tomorrow."

I also felt pretty inspired by that, because of course we're going to make mistakes, and of course we're not going to anticipate everything. Of course, there are going to be times where we try to meet a deadline, and we cut a corner on privacy, or ethics, or security. What I believe is that we are actually problem solvers, that we're pretty badass problem solvers, and that we have solved a lot of really hard problems over our careers, and that we're intelligent, and that we're motivated, and that we don't want to hurt ourselves, and we don't want to hurt other people, and we don't want to hurt society.

I believe that we can truly actually make a stand, that we can be the ones taking a stand instead of sitting next to problems that we see. We can take a stand and we can call out our companies. We can be the ones to implement privacy and security in our application. We can be the ones to say, "No, this isn't going to work, because it's going to turn out to be a very biased model." We can be those engineers.

What I ask of you today is that you solve the computing problems, but you also solve the privacy and ethics problems, so that we get out of the ethical drought that we're currently in and we go to a new and better future where we haven't completely forgotten or put by the wayside the point of privacy and ethics.


Recorded at:

Oct 17, 2019

Katharine Jarmul
