
Mind the Software Gap: How We Can Operationalize Privacy & Compliance



Jean Yang talks about some of the ways GDPR and CCPA can influence software, but also about practical solutions to protecting data privacy and security. Understanding software behavior makes up a big part of the compliance gap, and automated techniques can help.


Jean Yang is the founder and CEO of Akita Software, an enterprise data monitoring company. She was previously an Assistant Professor in the Computer Science Department at Carnegie Mellon University, where she led a research group working on techniques for automating software-based security and privacy.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Yang: We're going to talk today about "How to Help Developers Protect Their Data." I'll just say that what we're actually working on is still in stealth mode, so I'm not going to talk too much about that. I'm going to spend the next hour setting up the problem, so you also have a pretty good idea of how to solve it for yourselves when you go home today.

We're here today because data protection matters now. I think when that big $5 billion fine hit Facebook, and the recent Marriott fines and the British Airways fine happened, people started taking this very seriously. Before that I had friends who are very smart people who said, "Jean [Yang], this is all going to just blow up." I was, "I don't think so." It's not just about the monetary fines, but it's also about reputational damage. Someone who was heading up privacy at a pretty large company said last year, they're not scared of regulators, they're afraid of what the consumers will think. I've been working in this field for over 10 years and even I was surprised by how much people cared about this most recent Twitter scandal.

I think a couple of years ago nobody would have really batted an eye that Twitter was using two-factor phone numbers for ads. At the same time, maybe especially because GDPR forces everyone to report, it's become very clear that even the industry leaders are suffering from seemingly preventable issues. This last year, we saw many of the major companies suffer from leaking passwords in plain text. I'm going to get back later in this talk to exactly why this happens so often. When I first heard about this problem - I came from academia before - I was, "Wait, guys, this is solved now, isn’t it?"

How many of you guys still suffer from this problem in your organizations? That's actually pretty good because I think that a lot of the people at the bigger companies I talk to say, "This is just a total mess." I'm here today because nobody became an engineer to spend all their time tracking down stray passwords, locations, how many times I listened to Miley Cyrus this morning - don't ask. We really want to help engineers focus on what matters.

There's this thing that I'm going to call the software gap. On the one hand, you have regulators, compliance people, consumers saying that, "Software should do this with my data, I should be able to delete all my data from a platform, this is what data should do." On the other hand, you have the reality of software development. You have your services, your containers, your frameworks, you have this mangled-up pile of - we'll just call it an ecosystem. You have this ecosystem that's taken on a life of its own that is software itself, and there's really this gap between the two.

I define the software gap as the abyss between intent and reality, where code isn't just 100 lines of C that's distributed between two or three files with a makefile anymore. It's really this whole ecosystem, and we've lost understanding of what the code does, which means we've lost understanding of what the code is doing with the data. We can band-aid over this problem as much as possible with different security solutions, but at the end of the day, if we don't get back this understanding of what the code is doing with the data, we're going to have these problems over and over again.

You probably are here because you know this already, but let's just get on the same page about why the software gap matters. It leads to fines, it leads to reputational damage, money lost. It leads to teams being conservative about data use, and to developers spending a lot of time tracking down data. One company I talked to recently said they gave themselves multiple months to do their data subject requests, so those are the requests to get all the data that they have on a user. The reason they were giving themselves multiple months is sometimes it actually just takes that long to find all the data.

Then finally, this is only going to become a bigger problem. I was talking to Weng about this earlier, he said, "There's also the Brazil regulation, there's also the New York regulation, it's not just going to be Europe, it's not just going to be some small problem that only some companies have to deal with."

In this talk, first, we'll get on the same page about why GDPR is hard in modern software development environments. I'll shorthand all data regulations as just GDPR because that's the one we have the most data on. Then I'll talk about why silver bullets don't solve the problem. I feel like we're all pretty good on the fact that silver bullets and software engineering are not going to solve everything, but people haven't really come to see security and privacy as a software problem of the same nature yet.

I'll break down why differential privacy, why homomorphic encryption, why all those are not going to be silver bullets that solve the problem. Then finally, I'll talk about good software practices that you can adopt today to start closing the software gap. Before I go on, here's a little bit about me. I came from academia before starting Akita. I really fell in love with this idea that you can build software tooling to help people understand what's actually going on in your systems. I decided about 10 years ago that security and privacy was where this was going to matter most in our systems. I went to grad school because I felt like it was the best place to really get an understanding of this. I decided after interning at places like Google, Facebook, and Microsoft Research, that if I wanted to have an impact with this kind of tooling, one day I would have to start my own company, but it would have to be at a time when people cared.

Then when GDPR came along, I was, "This is a place where the regulation seems like it's ahead of the technology, it's all about understanding software, tracking data, it's time." In the last couple of years, I've talked to dozens, maybe hundreds of security, privacy, and DevOps teams, and so what I'll be talking about today is based on not just my own experience, but what I've seen other people do across the industry.


GDPR and CCPA, they're more than just annoying cookie banners that you just click without reading. A lot of you said that you've had to deal with this compliance already, so you're probably on the same page as me that essentially everything that's good for the consumer is really hard for the developers. In GDPR, consumers have the right to know what data companies have on them, which is great. This is what we want from a consumer perspective, but what this means is that companies now need to track all of the data associated with each user.

Similarly, with data deletion, consumers have a right to request data deletion. I love that sites now support, if there's some kind of breach, I can just be, "Look, take away all my data, I don't want you to keep any of it anymore." On the other side, companies now need to track everywhere data goes so they can delete it. Similarly, companies now have to use data for the purposes it was collected for. If you don't have consent to do targeted advertising, if you don't have consent for political specialization, you can't do that anymore under GDPR.

What this means on the other side now is that companies now need to track not just the data, but also permissions on that data. To me, GDPR was super interesting, because I was, "How are people going to track data? This is just an unsolved problem." As Weng was saying, "It's unsolved, not just for privacy, but for everything." Look at your machine learning pipelines, look at everything else that requires data which is everything these days. You're going to need to know how fresh is your data, where did that data come from? What did you do with the data before it got here? GDPR is really the first instance of the kind of problems where you need to start knowing what's going on with this.

Now let's look at what happens when GDPR meets modern software development. If we think about our modern software development stacks, codebases are larger and more fragmented than ever before. The slide is in smaller text than I thought, but one fun fact that comes out of this codebase-size graphic is that the average iPhone app is now four times the size of the first version of Unix, which just blows my mind. The codebases are bigger - I think they have on here the human DNA, and all this other stuff. Codebases are really big now.

There's also extensive use of third party APIs. I went to a college hackathon recently, and when I was in college people were programming in Assembly and C. Now people are calling out to Twilio, they're calling out to Salesforce, they're calling out to Box. This means that over the course of the weekend, these students are developing amazing things, and they're also relying on a lot of stuff that's just out there and sending sensitive data there. Something that was really surprising to me, actually, was that companies are also developing this way. A lot of companies aren't rolling everything by hand anymore. You're sending your data around, and when you actually need consent to do that, that's when this starts becoming problematic.

How many of you work in companies with service-based or microservice-based architectures? Ok, most of you. You probably also have realized that systems are taking on a life of their own. Before an architect could just say, "Here's the whole system, here's how everything can work," as soon as you start breaking things down into services and microservices, any service can call out to anything. Your intention with that credit card number or that location data is really hard to enforce unless you have some really strong top-down directive and everyone on board with how to do that. Yes, I'm surprised that companies are doing as well as they are these days.

If we think about what GDPR means in modern environments, it means taking these really clean sounding ideas of data should be deletable, data should be trackable, and then imposing them on to graphs that look like this. This tweet went viral a couple of weeks ago. It was the security engineer from this bank, Monzo in London, and he was saying, "We have 1,500 microservices," which is on the lower end if you look at the big companies. Every line in this graph is a line of enforced network traffic. In order to lock down all these microservices, just for security, not for any of the privacy or compliance stuff we talked about, this guy, and hopefully other people too, are tracking down all of these services and putting network rules here.

What GDPR means is, understanding data flows means tracking data across potentially thousands of services owned by hundreds of different teams. Each individual developer that I have talked to said, "This is painful," but the people who have been responsible for tracking down all the individual developers seem to be in much more pain. Then there's the fact that enforcing privacy policies means keeping track not only of whether policies get enforced, but which ones. It's not enough to just have a line there saying, "yes, some network rule is getting enforced" - which network rule is getting enforced is also tricky now, because you have different consents you're keeping track of. The data has different sources and there's just all kinds of stuff going on here.

Then there's also the people aspect. Ensuring compliance everywhere isn't just, "Ok, we set up a firewall, now we lock down all the data going out of our firewall." It's really getting all of these developers to cooperate. The big thing that happened with GDPR is that it's not just passwords that count as private data anymore, or social security numbers, or phone numbers. It's locations, it's anything identifying. It's all kinds of freeform text that's hard to just pattern match on, and it's everywhere. It really goes back to the fact that it's a time where we have to understand the software now.

Recall Our Plaintext Passwords

Now let's go back to our plaintext password problem now that we have this understanding. Some of my friends, even when I was telling them about this talk or what we are working on, said, "Why is this so hard?" I thought I would dig a little deeper and talk about why everyone is leaking plaintext passwords and why it's such a big deal. The issue is that passwords are supposed to be the most secret data of all; you're supposed to keep them encrypted at all times. If passwords are discovered in plaintext sitting in a database somewhere, what the company now has to do is tell everyone to reset their passwords. Breaches also carry more weight now, and companies have more responsibility.

What companies really don't want to do is store a plain text password somewhere because then they have to email out to all of their customers, tell them to change their passwords. The reason it happens is because you have user objects floating around, the users have passwords inside. Anytime you log that user, or anytime you do an error dump, or anytime anything happens with it, where you send it off to someone else who might log or do an error dump, or just do like a print for debugging that they forget to take out, that is a place where a password might accidentally get logged.

To give a really concrete example, my team and I found a bug in WordPress that we asked to disclose, where WordPress was taking URL query parameters and accepting passwords in there. What do people do with URL query parameters? You log them to Apache. Even in this very short data flow - take the parameter, log it - you already have a password leak. If you imagine a much more complicated system, you can imagine you took the parameter here, you did a bunch of things, then you log, but at the end of the day, you're logging these plaintext passwords.
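To make the shape of that bug concrete, here's a minimal Python sketch - the handler and log are hypothetical, not the actual WordPress code - showing how a password in a query parameter ends up in an access log:

```python
from urllib.parse import urlparse, parse_qs

access_log = []  # stand-in for an Apache-style access log


def handle_login(request_url: str) -> str:
    """Naive handler: appends the raw request line to the access log."""
    access_log.append(f"GET {request_url} 200")  # whole URL, credentials included
    params = parse_qs(urlparse(request_url).query)
    return params["user"][0]


handle_login("https://example.com/login?user=alice&password=hunter2")
# access_log[0] now contains "password=hunter2" in plaintext
```

Nothing here looks malicious on review - logging the request line is the default behavior of most web servers - which is exactly why the bug is so common.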

I don't know how exactly a lot of these leaks happen, but something that was interesting to me was most recently Robinhood. How many of you use the financial services app Robinhood? You had to change your password recently. Most recently, after the Robinhood breach, what was really interesting to me was, I think their security team was really bracing for the Internet, just saying, "Everyone should stop using this app," "How can they be irresponsible?" Then, if you look at the Hacker News comments, most of the time, people are, "How can this happen?" Half the comments were actually people defending the developer who accidentally logged the password saying, "It's actually really easy to make this mistake."

This person ecnahc515 said, "This happens at basically every large web app company. Turn on debug logging in the app which logs HTTP request headers," exactly the form of bug that we also found, "and likely doesn't strip out sensitive information. Easy mistake." Then there are many other people who said, "Look, I did this once, it was some kind of crash dump." Security teams actually say they look for 500s because whenever there's a 500, there's probably a dump, and then there's passwords in there and so you have the security teams chasing down the errors.
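One mitigation people reach for is scrubbing known-sensitive fields before anything hits the log. A rough sketch, assuming you can enumerate the sensitive keys up front - which, as the rest of this talk argues, is the hard part:

```python
REDACTED = "[REDACTED]"
# Deny-list of header/field names to mask; extend this for your own stack.
SENSITIVE_KEYS = {"authorization", "cookie", "x-api-key", "password"}


def scrub_headers(headers: dict) -> dict:
    """Return a copy of the headers that is safe to log, masking sensitive values."""
    return {k: (REDACTED if k.lower() in SENSITIVE_KEYS else v)
            for k, v in headers.items()}


safe = scrub_headers({"Authorization": "Bearer abc123",
                      "Content-Type": "application/json"})
```

The weakness is the deny-list itself: any sensitive value that travels under a key you didn't anticipate sails straight through, which is why detection still needs to be automated.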

Someone came out and said, "Look, if you don't automate detection of things like this, you will make the same mistake again." Yes, every developer can check everywhere they're logging, writing out to network doing anything, and making sure this isn't happening, but this is a mistake waiting to happen, especially as systems get more complex. What people on this thread also pointed out was that most major companies have had this bug in the last year. Apple, who's known for now privacy is their thing - I think it was their API error logging - had this exact issue. GitHub, also pro-developer, very good at engineering - same bug.

It's not just small companies who haven't figured out how to not log passwords yet, it's really a very ubiquitous problem. The shape of the problem is not just passwords, passwords are just the symbol of the most secret things. If you're even logging plaintext passwords, what else are you doing is kind of the idea. A similar shape bug is this Twitter two-factor incident. What happened was, Twitter took phone numbers for two-factor authentication. It said, "People, give us your phone number so we can text you a backup code to keep you more safe." Then somebody accidentally, I presume, used those phone numbers for targeted advertising. The Internet was totally scandalized, they're, "We're trying to keep ourselves more private, safer, and you go and do this?"

Matt Green, who's a cryptographer at Johns Hopkins said, "If you just want to secure the phone numbers, you just put them in a database called '2FA numbers don't sell to marketers'." I argue that that's really hard if we can't even keep passwords – passwords, encrypt, don't decrypt. People are going to take that information, you have a microservices service-based environment, someone's going to call out to something somewhere, and people are going to be, "Cool, phone numbers, I can make more money." At the same time, people still will get on your case for making these mistakes.

Matt Green said, "This stuff is like a bank leaving customers money lying around and then spending it on snacks." Obviously, that could happen, we just try to prevent it from happening because, you know, ethics. I think that on the one hand, yes, it's really easy to leak passwords all over the place. Should we do it? Probably not. This talk is really about, it's easy to do this, but we should really stop and here's how we can stop.

If there's nothing else you remember from this talk - I can only remember like one thing at a time - here's the one thing you should remember: it's all about data flows. A lot of what's new about these problems is really about how we figure out where the sensitive data is and then track it as it's flowing across the software. As you saw with the WordPress or the hypothetical Hacker News password leak, the data doesn't have to go very far. A data flow could just be: comes in, goes straight to the log, but you really want to track that path of where it's going.

How to Bridge the Software Gap?

My team and I have been thinking a lot about how do we bridge this gap and actually track the data flows? We came up with a set of criteria. I'm going to say now that these are broad strokes criteria, we can split hairs as much as possible for the rest of the talk about like how to make the categories, but this is just to give you an idea of how we're thinking about the problem. We think that there should be solutions that are adaptable. I was a programming languages researcher before, and you know, our papers are, "Here are the semantics," and we're, "We have come up with a solution for this big problem, just use this language." Everyone's, "We're not going to use this language."

Someone once said, "Jean [Yang], everybody in your field talks like this: you go to the doctor, and the doctor says, 'If you had just eaten an apple every day and exercised from the day you were born, you wouldn't be here now.'" It's not practical; we have to meet people where they are. I think the top priority for a lot of the solutions that we should be looking at is that they should be adaptable - they should meet modern tech stacks and languages, the heterogeneity of modern environments, where they are today. They should be automated.

If we think about the scale of modern software - human-genome-scale software - we really need to automate checking across that entire codebase somehow. The solutions should also be adaptable over time, so it's not like there's just one shot, one regulation that's never going to change, and everything is just going to be good from there. We need things that are modular with respect to the policies so that they're future proof.

We need solutions that are actionable. This is a pretty new one that we feel very strongly about now, after talking to a lot of security and privacy teams. It's not enough to just detect that there are problems and to address the symptoms of the issue. You actually want to make software systems better, and build understanding of how data is flowing, so that developers don't just see an alert and think, "Shoot, I'm leaking data again." They can say, "Ok, this one matters, this is how I'm going to fix it, and this is how I'm going to prevent this from happening next time." Then finally, a lot of solutions are great, but they don't actually address the data flow problem, which I think is a core reason why these systems are not meeting where regulators want them to be.

For the rest of this talk, I'm going to talk about why things are not silver bullets with respect to this list, and I'm going to talk about what you can do today, and how it measures up to this list. Then there's still a gap, which is why there's stuff for us to do, so I'll talk about that as well.

Data Anonymization

Let's move on to dispelling the myth of the silver bullet. There's no silver bullet, you already know this. I feel like people still often ask me about all these silver bullets for security, because they're, "Security is not software, it's just over there." I'm, "No, still no silver bullet."

The first one that people talk a lot about is data anonymization. This is what I have to say about it. There are all these headlines that say, "Sorry, your data can still be identified even if it's anonymized." If there's one sub takeaway, this is not even the main point of my work, I just feel really strongly about this. Every time I tweet about this, then there's the data anonymization mafia and they're, "But Jean [Yang], we proved in these cases," and I'm, "Yes, you proved in those cases, not all cases." You have to be really careful with data anonymization.

This was my Halloween costume from this last year, I was data anonymization. As you can see, my disguise is very poor, you can still tell exactly who I am, and exactly what's on my shirt. I think I wrote a tweet that said, "Because I didn't have a bespoke costume, you can still tell everything." That's how I think everybody should think about data anonymization.

There's a couple of parts of data anonymization we can talk about. There's data de-identification, and then there's also differential privacy, which people really like to talk about. Both of them can get you out of certain GDPR requirements, it looks like. What it says under GDPR is that if you make your data so that it cannot be re-identified, then consent and tracking and all that stuff goes away. This hasn't really been tested yet, though - no one has set strong precedent on what counts as data that cannot be re-identified. What I would say is, most data can be re-identified, so if you don't want potential fines, and you don't want potential reputational damage, I would be extremely careful there. Here's why. De-identification refers to taking certain fields off of your data. If you have your entire Grubhub history, or whatever, it will have your order history, but not your name, phone number, address. I mean, who else is ordering Asian Box every meal every day? That's me.

A lot of researchers have done work across all of these domains showing that even if you take off all of these identifying fields, records can be re-identified with very high probability. They might narrow it down to me and Weng, who also really loves Asian Box, but it's very easy to narrow it down to who it is. Right now, I think people are still figuring out what to make of this, but pretty much every time people have said, "This is the way to de-identify your data," somebody has come along and said, "I have re-identified it."

A big reason for that is linking; how many of you have thought about link data sets? If someone has my Grubhub history, and they have my Asian Box every day, all meals order, and then they have my credit card data that they bought from somewhere, then they can really tell that this is most likely me. This is the same person who's also buying all these Miley Cyrus band t-shirts. If you take all the data in isolation, re-identification maybe feels hard. If you think about if someone can buy all of your de-identified data together, and then try to put it back together, you probably don't want that. I'd say, be extremely careful if you go into this territory.
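A toy version of that linking attack, with made-up records, shows how little it takes - two data sets that are individually "anonymous" join cleanly on quasi-identifiers:

```python
# "De-identified" order history: names removed, but quasi-identifiers remain.
orders = [
    {"zip": "94103", "birth_year": 1985, "order": "Asian Box, every meal"},
    {"zip": "10001", "birth_year": 1990, "order": "pizza, weekly"},
]

# A second data set, bought elsewhere, that still carries names.
card_holders = [
    {"name": "Jean", "zip": "94103", "birth_year": 1985},
    {"name": "Weng", "zip": "10001", "birth_year": 1990},
]


def link(anon, named, keys=("zip", "birth_year")):
    """Re-identify records by joining the two sets on shared quasi-identifiers."""
    return [{**a, "name": n["name"]}
            for a in anon for n in named
            if all(a[k] == n[k] for k in keys)]


reidentified = link(orders, card_holders)  # every record gets its name back
```

Two fields are enough here because the toy data is tiny, but the famous real-world results (ZIP code, birth date, and sex identifying most of the US population) follow the same join.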

Same with differential privacy. I have about a 5-minute rant I can do on differential privacy that I'm not going to do in this talk because I don't think we should take up that kind of time, but I'm happy to do it for you anytime. Essentially, it goes after this dream that you can take individual records, keep them private, and then release aggregate information. The canonical example is, if you have employee records, you can keep the employee salaries private, but then the average salary gets released. As you can see, there are some potential issues here. For instance, what if you only have one employee? What does the average mean, etc.?

Differential privacy, I'll just say for now, is very sensitive to what your data set is, to what the computation is you're doing on your data, and to how effectively you can actually give the statistical guarantee - that the chances someone can tell what my salary is are very low. It's very dependent on both of those things. The people who always tweet back at me, "Jean [Yang], but it works," are, "We've proven that it works on this algorithm for this data set." I'm, "Yes, most people aren't going to do that." Unless you're going to prove that it works for a specific algorithm and a specific data set, I would say be very careful about re-identification attacks.
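To see why the data set matters so much, here's a sketch of the Laplace mechanism for a bounded average (simplified; a real deployment would use a vetted DP library). The noise scale is sensitivity divided by epsilon, and the sensitivity of an average is max_salary / n - so as n shrinks toward one employee, the noise needed to hide that one person grows to the size of the value itself:

```python
import random


def dp_average(salaries, epsilon, max_salary):
    """Laplace mechanism for a bounded average: add noise proportional to
    how much one record can move the output (max_salary / n)."""
    n = len(salaries)
    sensitivity = max_salary / n
    # Difference of two Exp(1) draws is a standard Laplace(0, 1) sample.
    laplace = random.expovariate(1.0) - random.expovariate(1.0)
    return sum(salaries) / n + (sensitivity / epsilon) * laplace


# With n = 1, sensitivity == max_salary: the noise that protects the single
# employee is as large as the salary itself, so the released "average" is noise.
```

This is the one-employee problem from the canonical example in numbers: the guarantee only coexists with useful output when the data set and the query are the right shape.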

My takeaway for data anonymization is you really need to make sure your anonymization method fits both the data and the data analysis algorithm. You don't have the mathematical guarantees otherwise, so if you're not sure, just don't do it.

Then bringing back our friendly checkbox - you can split hairs about the categories, and for some of them you could argue it belongs in that category. I would say that a big appeal of this is that it's really adaptable: you just anonymize your data and then do business as usual. But it really doesn't address a lot of the other problems, and there are a lot of dangers involved. I would say it is no silver bullet.


Encryption

Let's move on to crypto. It only helps if you actually use it, and then it only really helps if you're using it right. The fact that we have all these plaintext passwords floating around shows that if you're not actually encrypting, it's kind of problematic. Let's talk about the issues with crypto. People say, "If we just had better encryption, that solves all of our problems." I'm always, "It doesn't solve the GDPR, CCPA problems, because the problem isn't that stuff got hacked, it's that it wasn't encrypted at all, it was just sitting there." People say, "What about homomorphic and functional encryption?" I'll talk about that a little, but that's not going to save us either.

What encryption does is it gives access to data only to those who have a key, but you're still responsible for protecting your keys. Also, you're still responsible for locking your data up, so if you're leaving your stuff around in plaintext, that is not helping. The most relevant part is that encryption doesn't tell you when it's ok to give someone a key, or when to unlock a value for someone. I'm sure that these companies, most of the time are encrypting their passwords and encrypting other stuff, but sometimes you got to decrypt stuff so you can look at it and show it to people and that's when the bad stuff happens.

I call this the decryption problem. It's not that the basic encryption got hacked - these problems aren't happening because encryption was weak. It's that stuff was decrypted wrong, or not kept encrypted at the right times, or all this other stuff. People say, "What about homomorphic encryption?" A really quick rundown - this is way too fast, probably - but what homomorphic and functional encryption do is let you take an encrypted value and then do computations over it. At the limit, people have built homomorphically encrypted databases, where your database stays encrypted and you can still do some queries over it. A side point is that that's beautiful theory, but it's really expensive, and the things you can actually do are pretty restricted. The main point is, it doesn't tell you what happens when you decrypt. I can take my encrypted values, I can do computations over them, and then if I just decrypt the result and store it in plaintext, that still defeats the whole purpose of having encrypted it.
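The decryption problem in miniature: the "cipher" below is a toy XOR stand-in (use a real, vetted library in practice), because the weakness being illustrated has nothing to do with cipher strength - it's what happens after the legitimate decrypt.

```python
class Vault:
    """Toy stand-in for a real cipher; the point is what happens after decrypt."""

    def __init__(self, key: int):
        self._key = key

    def encrypt(self, value: str) -> bytes:
        return bytes(b ^ self._key for b in value.encode())

    def decrypt(self, blob: bytes) -> str:
        return bytes(b ^ self._key for b in blob).decode()


log_lines = []
vault = Vault(key=42)
stored = vault.encrypt("hunter2")       # encrypted at rest: good
password = vault.decrypt(stored)        # someone needs it for a comparison...
log_lines.append(f"login failed, password={password}")  # ...and the crypto is undone
```

The encryption worked perfectly at every step; the leak happened entirely in the data flow after decryption, which no cipher can police.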

Again, I would say it comes back down to the data flow problem. I think that encryption is great, we need it, but it's like you need to wear shoes, but you can't only wear shoes, you got to also wear your other clothes. To me, that's encryption.

We see how we measure up, it's super easy to adapt, you just encrypt all your values, do stuff with it. It's automated, so once you encrypt, it's encrypted everywhere, but you are still responsible for managing when you decrypt. This means that you are still responsible for managing all of your policies, tracking all of your data flows, and this is still how we end up with all the problems that we have.

Programming Languages

On that happy note, let's move on to how new languages can't fix everything either. I'm actually very surprised that people bring this up. I've had some meetings where a head of engineering was, "If we just had a new language and got all of our developers to write in that new language, it would solve all of our problems." I was, "When I was a first-year grad student, I also said that, and everybody laughed me out of the room for years." Since then, I've really learned that a language is not going to fix all of your problems either, for a lot of reasons.

This is a picture of Fred Brooks's paper, and what it says is, "There's no single language, tool, abstraction, or whatever that can solve all of your software problems." It's the same for security and privacy.

Yes, there's no single programming language or library that can solve your whole problem. Programming languages and libraries provide very nice guarantees, but they require you to port everything to that language or library. If you think about data flows, it's not just a single part of your program, like "Here's my payment server." I remember the first time I learned that Windows actually uses Prolog to do driver matching upon boot - I was, "Wow, Prolog." But that's one task it has to do, not everything across your entire software. If you actually want to track data using a language, it's got to be everywhere.

What about your third party libraries? What about that code that you brought in from somewhere else? What about anything else, like the side project of someone from 10 years ago that you now have to port? That is all very problematic. Really, for data flows, anything that's not part of your really nice world can subvert any of your guarantees. "Most of the time I kept everything safe, and in one place I didn't" - that's how we get breaches, that's literally how each one of these leaks happened.

This was actually the topic of my Ph.D.: how do we track data flows at the application level? These people are, "I know, I'll just write a library." I'm, "Look, I spent like 10 years studying this; there are a lot of subtle problems that come with tracking data flow." What happens when you have to put two data values together and say where they came from? What happens if you have a conditional that depends on a sensitive value, but the thing that's actually being leaked isn't sensitive itself? If this is something you're tempted to do and the first two points aren't deterring you, I hope the third point deters you a little bit.
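The two subtleties mentioned above show up clearly in even a toy taint-tracking sketch. This is a hypothetical illustration, not any real library: combining values must union their provenance, and a conditional on a secret leaks a bit without ever touching the labeled value.

```python
class Tainted:
    """Toy labeled value that remembers which sources it derives from."""
    def __init__(self, value, sources):
        self.value = value
        self.sources = set(sources)

    def __add__(self, other):
        # Subtlety 1: combining two values must union their provenance.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value,
                           self.sources | other.sources)
        return Tainted(self.value + other, self.sources)

ssn = Tainted("123-45-6789", {"ssn"})
name = Tainted("Alice", {"profile"})
record = name + ", " + ssn
print(record.sources)  # provenance of both inputs survives the combination

# Subtlety 2: an implicit flow. `flag` itself is never written out, but the
# branch taken depends on it, so `output` reveals one bit of the secret --
# and a purely value-based tracker gives it no label at all.
flag = Tainted(True, {"secret-flag"})
output = "yes" if flag.value else "no"
print(output)  # a plain, unlabeled string that still leaks the secret
```

Getting both of these right everywhere, for every operator and every control-flow construct, is why "just write a library" is much harder than it sounds.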

Programming languages are great, they can potentially address all of the points. I'd say their big weakness is that the adaptability is very low. In fact, the more it tracks the data flows, the less adaptable it's going to be.

To summarize, data anonymization is not relevant to tracking and protecting data as it flows across this graph. It also has the other issues that I've discussed, so it is not a silver bullet. Encryption protects individual values but doesn't say anything about rules for accessing any of this, so that's still up to you. I would say that's where a lot of the problems come from. Then finally, if you use a specialized language or library, you're applying it to something that looks like this, potentially more complicated, so good luck with that.

What You Can and Can’t Do Today

Now we can move on to what you can and can't do today. I was told that I have to leave you with hopefulness, so I had to add this section at the end of the talk. I think it's also really important to be aware of the limitations, so I will leave you with some anti-hopefulness as well.

The most practical thing to do is, from this point forward, you can start architecting your systems to take compliance into account. A lot of my friends who run engineering teams will say things like, "We did our whole software thing, and then at the end security was like, 'no'. Then we said no to them, and then we shipped it." That was maybe ok in the past, but as these fines get bigger and people start caring more about this, it's going to be less ok in the future. One thing to do is to just stay up to date on best practices for designing and deploying encryption, which, like I said, is necessary but not sufficient; at the very least, you're making sure there's that extra layer of protection everywhere. The important thing is that you're still responsible for keeping track of when you decrypt.

The other thing to do is to build tools and libraries for centralizing your policy management. A lot of companies have done things like make APIs for developers to use when they're reading from the database, writing to the database, and writing values to the outside, so that developers aren't responsible for writing those checks themselves; it's centralized. I think Uber has a paper somewhere about how they do it. There are also a lot of academic papers about the proper ways to do this. I was thinking about putting a blog post together - if you follow me, one day maybe you'll see.
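One way to picture that centralization is a thin gateway that every database read goes through, so the check lives in one place instead of being scattered across call sites. A minimal sketch, with a hypothetical policy table and function names:

```python
# Hypothetical policy table: column -> roles allowed to read it.
POLICIES = {
    "email": {"support", "admin"},
    "ssn": {"admin"},
}

def read_column(row: dict, column: str, role: str):
    # Every read goes through this one function. Developers supply the
    # arguments, but they never hand-write the check itself.
    allowed = POLICIES.get(column, set())
    if role not in allowed:
        raise PermissionError(f"role {role!r} may not read {column!r}")
    return row[column]

user = {"email": "a@example.com", "ssn": "123-45-6789"}
print(read_column(user, "email", "support"))   # permitted by the table
# read_column(user, "ssn", "support")          # raises PermissionError
```

Note what this does and doesn't buy you: the check is centralized and auditable, but a caller who passes the wrong `role` argument still subverts it, which is exactly the residual responsibility described next.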

It seems like, moving forward, if you set everything up so that developers aren't responsible for writing the checks themselves, that can go a long way; they're still responsible for making sure they're instantiating the checks with the right arguments. If you have this, it's pretty good, but it doesn't address your data flow issue, and especially if you have large legacy codebases, you're going to need to untangle your data flow hairball. There has also been a lot of effort towards this in the push to GDPR, so you probably know about some of these already. The first one is that the most effective thing still seems to be just a document, because this is so complicated that the tools, frankly, just haven't caught up.

I asked one privacy engineer at a pretty progressive high-tech company, what they do. They said, "If I have to track down data, I write a Jira ticket with all the people I'm going to talk to, and I talk to those people as quickly as I can." I think a lot of this information is still in people's heads, so getting people to put more of this into documentation is a good idea.

Metadata has also been really useful. This is everything from adding columns to each of your tables - to say where the data is coming from and what the permissions are, and making it so that when you join on the tables, the metadata gets propagated, that kind of thing - to some companies even making objects store this metadata, so every object carries its metadata inside it. Again, once you get into data flow, like information flow kinds of things, this starts getting a little bit tricky.
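The "metadata that propagates through joins" idea can be sketched in a few lines: tag each table with provenance/permission labels, and make the join operation union the labels of both sides. This is a hypothetical illustration with made-up names, not any particular company's schema:

```python
def tagged_join(left, right, key, left_tags, right_tags):
    """Join two lists of row dicts on `key`. The result carries the
    union of both sides' provenance/permission tags, so sensitivity
    labels survive the join instead of being silently dropped."""
    index = {r[key]: r for r in right}
    rows = [{**l, **index[l[key]]} for l in left if l[key] in index]
    return rows, left_tags | right_tags

orders = [{"user_id": 1, "total": 30}]
users = [{"user_id": 1, "email": "a@example.com"}]

joined, tags = tagged_join(
    orders, users, "user_id",
    left_tags={"source:orders"},
    right_tags={"source:users", "contains:pii"},
)
print(sorted(tags))  # the joined result is now marked as containing PII
```

The tricky part the talk alludes to is everything beyond joins: aggregations, derived columns, and conditionals quickly turn this into the full information-flow problem.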

Then finally, there's starting to be tooling that gives visibility into data flows. I would say we're still at the very beginning of this. How many of you use APM tools like New Relic or LightStep? Those will give you something like, this service talks to that service, and this is the superset of all of your possible data interactions. What they won't give you is, if your credit card number leaves your service, this is exactly the path it can take. That's something we hope to address, and there's also a small set of people working on that problem as well.

The screenshot in the bottom left corner is actually from Lyft's Cartography tool. It maps out, between your services, what credentials are shared and what resources you're sharing between your different AWS-hosted services - I'm not sure actually, go read the documentation. This is a very important part, especially if you have these complex legacy systems, of sorting out your data situation.

This is all great, but what you do with it is still an open question. Even this leaves gaps, so I would say it's really important not just to build guardrails, but also to install safety nets. I asked around, and it seems that one of the most tried and true ways of not leaking passwords is just to use sentinel values. In your tests, put in passwords that you know how to recognize. For any sensitive data, put in stuff that you know how to recognize. Put that into your system and see where it ends up. Someone who was the head of privacy at a very large company said that to this day, this is one of the techniques they still use, because it beats most other techniques for certain kinds of leaks.
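The sentinel-value technique is simple enough to sketch end to end: seed the system with a value no real user would ever have, then scan everything that leaves the system for it. The handler below is a deliberately buggy, hypothetical example that logs the raw password:

```python
SENTINEL = "CANARY-pw-7f3a9c"  # a password no real user would ever have

def handle_login(user: str, password: str) -> str:
    # Deliberately buggy hypothetical handler: it logs the raw request,
    # password included -- exactly the passwords-in-logs problem.
    return f"login attempt user={user} password={password}"

def scan_for_sentinel(outputs):
    # Run this over logs, API responses, analytics events -- anywhere
    # data exits the system. A hit means the sentinel escaped.
    return [line for line in outputs if SENTINEL in line]

logs = [handle_login("test-user", SENTINEL)]
leaks = scan_for_sentinel(logs)
print(leaks)  # non-empty: the sentinel escaped into the logs
```

The strength of this approach is that it needs no visibility into the code at all; the weakness, as the talk notes later, is that finding the leak still leaves you to figure out which code path put it there.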

There's also pen testers and bug bounties. People have traditionally used them for security purposes. Pen testers are people you bring in from the outside to beat up against your system. Bug bounties are a crowdsourced way of doing it. I haven't seen a lot of people doing bug bounties for information disclosure, for instance, but that's something that I think is doable. Especially as regulation gets stricter, I think that would be a really good idea to do.

Then finally, there are these data loss prevention solutions that can provide an extra layer of protection. These are solutions that sit at your API, at your network edge, at the boundaries of your system, and they'll match on sensitive data that might be flowing out. As I'll get into, this is also limited in some ways.

All these approaches are pretty good, they can catch a lot of stuff at the edges of your system, but I would say the thing they're missing is this actionability. We caught the errors, but why do they happen? How do we fix it? How do we prevent this from happening next time? How do we build our software to make it easier not to have this happen?

My caveats with this are, first, that a lot of the approaches I just talked about that you can do today rely on the developer, and relying on the developer has its limitations. Developers need to stay up to date with the latest processes, and teams may need to update legacy code to match new best practices. Then finally, any outside code that your organization brings in might not fit with this, and there's actually quite a bit of outside code if you think about what your org might be using today.

To dig into what I mean by lack of actionability, let's talk about data loss prevention. If you just have a safety net, it's not going to fix your problem. It'll help fix your symptoms, but you still have data flows that you don't have any idea about. Data loss prevention techniques operate at your points of egress, your APIs, your networks, your logs, to detect potentially sensitive data leaving the system.
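An egress-side DLP check is, at its heart, pattern matching on outbound payloads, which is exactly why it is both useful and prone to false positives. A toy sketch, assuming a rough credit-card-shaped regex (the pattern and function names are hypothetical):

```python
import re

# Matches 16 digits in groups of four -- a very rough credit-card shape.
CARD_RE = re.compile(r"\b(?:\d{4}[- ]){3}\d{4}\b")

def egress_check(payload: str):
    """Flag anything card-shaped leaving the system."""
    return CARD_RE.findall(payload)

print(egress_check("charging card 4111-1111-1111-1111"))  # true positive
print(egress_check("tracking id 1234-5678-9012-3456"))    # false positive!
# The scanner can say *that* something card-shaped left the system;
# it cannot say how it got there or which code path to fix.
```

The false positive on the tracking ID is the kind of alert-noise developers end up triaging by hand, and the final comment is the actionability gap the next paragraphs describe.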

First, they don't have visibility into how the data got there, so developers have to figure that out. What one security engineer told me once is, if all I'm doing is giving alerts to my developers, they're probably going to ignore them. They don't have time to go through hundreds of alerts and figure out which ones of them are legit. Also, these alerts don't give hints about how to change the code, because there's no visibility, it's completely on the outside. The developers, even if they agree that something is a leak, they have to figure out how to change it.

Also, the lack of visibility introduces a weakness into the tool itself. If you're just sitting on the outside of a system trying to figure out whether something is a password, or an address, or a location, there's some stuff you can match with regular expressions or simple AI, but it's actually really hard to get right. People have also complained about false positives and, in general, having to do a lot of manual work sorting through the results. There is some stuff you can do today, but it's also good to recognize that there's still some way to go.

The software gap remains. Today, the developers can bridge the software gap, but it takes a lot of time and energy on the part of developers. What we think is that software teams can be much more effective at finding and fixing data issues if they get an understanding of the underlying data flows, and then everybody can architect systems better to deal with compliance and privacy issues in the first place.

Again, if there's one thing that you take away from the talk, I'd like you to remember that it's all about the data flows. What GDPR and CCPA introduce in terms of complexity for developers is that everything that was pretty much designed to make development easy, scalable, modern, etc., made things very hard to reason about data flows. There's a lot of techniques that you can use to patch together today to help, but when it comes to the underlying data flows, we still have some work to do in building the tooling and process to untangle that.

Akita: Giving Visibility into Your Data Flows

I said I wasn't going to talk very much about what we're doing, but I'll just give a very quick sneak peek. What we're going after is giving visibility into those underlying data flow so that people can build better software. The immediate goal is to just help developers find data leaks faster. The canonical passwords and logs problem, existing tools don't do a very good job with that. We want to help developers find those kinds of leaks and others faster, and in a way where they can take action once they have the results. We think it's really important to be very easily integratable, so we've been building in conjunction with modern container infrastructure, with modern API infrastructure, and trying to fit in and meet people where they are.

Finally, we really believe that, as Weng said, this is not just about privacy, this is not just about compliance. It's really about understanding data flows to build better software for all purposes with data. We've been thinking about this as developer-centric tooling for data flows, rather than security tooling, or just privacy tooling or, something like that.

To end, I'll say that we are still pretty early; we're in a private alpha. This is a problem I'm very passionate about solving, so I would love to talk more and understand what your problems are, even if you don't think this is useful for you now. We're also hiring, so if you email us and say you want a job, I'm sure we would consider you. We'd also just love to keep in touch. We're trying to make a lot of fast progress on this problem, so hopefully we'll have updates soon. Let's close this software gap and make life easier for developers.

Questions and Answers

Participant 1: You pounded on the no-silver-bullets point repeatedly, and I think that many of the people in this room understand that there is no silver bullet, you can't just fix something. To what extent do you think that regulators or legislators understand that you can't just decree something and have it happen?

Yang: I think when GDPR first came out, I was pretty shocked because I was, "This is so far ahead of what people can do. The auditing isn't even turned on to do any of this. How are they going to do it?" What I hope will happen is that standard bodies arise and then technical people can get together and say, "This is what we agree, it’s the state of the art, and we can do some things." The one point where I was really interested in what they [GDPR] would do was around re-identification of data, and I think they have been pretty good there, not just saying, "If you use differential privacy or something like that, it's ok." They left it in terms of, if it's re-identifiable under whoever's ruling at the time. Yes, I think it's very tricky there.

Participant 2: I was curious if you're also thinking about the thing that keeps me up at night, which is the analytical side of this as well, where you have not just the software and log files and things like that, but some user who runs a report and all of a sudden drops it somewhere, or leaves it on their laptop running locally, or something like that. They aren't necessarily involved in all the best practices and automation and stuff like that.

Yang: Yes, that's a really good point. I would say, we've been really thinking a lot about the software part of this problem, but one of our advisors actually is working on the permissioning part of which users can access which data. Yes, that's a really big problem as well.

Participant 3: When I think about being able to track data flows, in my mind, it feels like there needs to be some kind of instrumentation done. Essentially, to get a really good picture you need to carefully instrument the system, so there'll be costs associated with instrumentation. You mentioned that using widely adopted programming languages might make that better. There's also the cost associated with generating this whole new sea of data. Is there a good way to quantify this and measure the effort? Because it might be easier to tell people, "This is a good idea, do it, and this is the cost associated with it."

Yang: Yes, that's a really good idea, that even if you're not adopting a new programming language, the instrumentation you might have to do to track this is very high. Something we've been trying to solve is how can we do this with as low instrumentation as possible, but it's an extremely hard problem. Yes, I think that it's a really good idea to actually say, "At some point, you might as well just use a whole new language because you're going to have to do all this work." Yes, that's super interesting. I agree, if you do this at runtime, there are runtime instrumentation overheads. Even if you do it in test, there are things you have to do there. Your question is precisely what we've been trying to go after, how can we take all the overheads and knock them as low as possible.

Participant 4: One of the things you talked about was legacy code, and I'm wondering when you're in a company that is trying to overcome tech debt and actually get not even to the future of privacy, compliance, but just trying to maintain status quo? How do you communicate the value of starting to wear down that tech debt with new applications, new business solutions? How do you think about that?

Yang: Something that I've noticed is, there seems to be a communication barrier between engineering teams and compliance, security, privacy. Something we've been thinking a lot about is how do we build essentially communication tools to bridge that? Communication tool is a very specific thing, but at a higher level, how can you help communication be better there? Because there's some kind of people talking past each other where people are, "It needs to be like this," and developers are just, "Look, we can only move so fast, we can't spend all of our time making this compliant. We also need software."

If there's more two-way communication, developers will also appreciate it more. A lot of what I've been seeing is people just saying, "Developers, you got to do this." And they're just, "No, it's not possible," and then they'll just ignore it. If you tell developers, "Do this one thing this month, and then we'll see what you're doing and understand what the status is," or, "Here's actionable things you can take," I think that is a way forward. Right now, if you just tell developers, "You got to do this thing," that sounds impossible, and they're just not going to.




Recorded at:

Dec 04, 2019