Privacy Tools and Techniques for Developers

Summary

Amber Welch talks about privacy engineering, from foundational principles like privacy by design and the OWASP Top 10 Privacy Risks to advanced techniques such as federated learning and differential privacy in ML, as well as upcoming technologies like homomorphic encryption. Each tool/technique has an explanation and example use cases with a description of the benefits and limitations.

Bio

Amber Welch works as a Privacy Technical Lead for Schellman & Company. She has been assessing corporate privacy compliance programs for the past year and prior to that, managed security and privacy governance for a suite of SaaS products. She has previously worked in companies creating ERP, CRM, event planning, and biologics manufacturing software.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Welch: We're going to have a sample flight of privacy today. We're not going to get into any really particular depth on any topic, we're just going to get a high-level overview and there's a GitHub repository that I'll provide at the end where you're able to get additional resources, if something looks like it might be usable for your kind of use case and the work that you're doing.

My name is Amber Welch, I am currently a Privacy Program Assessor for an organization called Schellman. I do privacy audits, primarily focused on things like GDPR, the upcoming CCPA, SOC for privacy, and ISO. I actually will be moving to McKinsey to do the data protection and risk management for their managed healthcare data services, so very exciting to finally be able to say that out loud.

I have some certifications that probably don't mean anything in this room, so I won't really talk about them, but the most important thing is that I love privacy. I've basically dedicated my life to it, and my goal is to kind of evangelize a little bit and get some of you excited about some of the opportunities in privacy as well.

Today, we're just going to get a high-level overview of what privacy engineering is. We're going to talk about privacy by design, which tend to be more soft concepts, they are more values and goals and design principles that you might run across. Then we'll go over a couple of privacy-enhancing technologies. These ones I've tried to focus on the more recent developments basically everything that's come out since 2017 and later, and two of them have come out in the last two weeks, so you're going to be getting the freshest stuff in privacy.

First, I have to start by apologizing to all of you on behalf of the privacy profession. It's heavily led by legal teams. Legal teams have not always done the best job of connecting with technical people. I usually talk at InfoSec conferences, and we still see the same thing there. Security people, legal, and technical developers just don't always see eye to eye. Sometimes each group sees the other as the problem in the room and itself as the solution. I know that some of you are smiling because you feel maybe some of that in your own organizations.

About 80% of privacy professionals are from a legal background. I do not have a legal background. I came out of the security governance side, so I kind of see some of the opinions that the legal side has, but at the same time I've also seen a few of the challenges that come from separating that into just a hard legal and compliance concept without sharing and walking across the aisle to the tech side.

There are some concepts in privacy which are very specific to developers, highly relevant to just the development aspect of it, but unfortunately, when developers are asked if they've ever actually used or even heard of these concepts, the majority of them say no, they've never been trained on it. That's not really your fault, how could you possibly be expected to implement something that you have never even heard of? Have any of you heard of any of the concepts up here? That's maybe 5%, which is quite low considering we're sort of banking on developers to be able to lead the charge on implementing all of this stuff. We just say "do privacy by design," and then you never really hear what that means, and it's a big disappointment for everyone.

Privacy Impact Assessment

We're going to go over just a couple of the privacy by design concepts, and privacy impact assessments are sort of the foundational starting point for that. What is a privacy impact assessment? It's a means of identifying privacy risk from the initiation of a project through the end of it, so this is really focusing on the entire data management life cycle as developers view it. This includes mapping personal data flows through the application and through your data lakes, if you have them. It's also a means of documenting any privacy risks and mitigations. It's similar to a security risk assessment but very focused on personal information.

Sometimes doing a privacy impact assessment at all is itself a regulatory requirement, but you can also use the privacy impact assessment as a means of identifying regulatory requirements that you might otherwise have overlooked. It's a more formal way to go about it.

Sometimes, especially from the development side, we tend to view these kinds of risk assessments as a checkbox before you release something, but the goal of a privacy impact assessment is to help inform the entire development process. It should be a living document that you engage with throughout development. Has anybody ever participated in a PIA at all? Ok, that looks like maybe five people - that's really sad.

Generally speaking, these should be done and led by a compliance or privacy team, but that doesn't mean that a developer who wants to care about privacy isn't able to do that. You guys have the ability to just do this yourself and approach it or maybe interact with your compliance teams and say that you'd like to start doing this, if you work with sensitive data especially. A privacy impact assessment should generally be in process until the application is in use, and even then doing an annual review is a good way to make sure that nothing's changed that would be really significant.

You want to use a PIA when you've got new applications that you are going to be developing from scratch, and then anytime you're adding really significant changes, new functions and features that make a big difference in how privacy is implemented in your system. Also, anytime you're collecting new types of sensitive personal data: maybe you had an application that really only held addresses and then you started inputting some health history. That might change its risk profile, so a privacy impact assessment is a really good way to make sure that you've actually looked at it from that privacy perspective and done your due diligence there.

This is also a really great way to make your annual audits and reviews a lot less painful because you may have to answer some questions about this anyway. Especially if you're in a regulatory environment, where you need to submit information about your processing environment, you might have to just give that information about how you've set up the system, so this is a way you can just kind of deliver that document without having to go through and do the painful auditor interviews.

Some of the benefits, legal compliance, both just doing a PIA at all, and then checking to make sure that you've done the compliance aspects that might be specific to your situation, but also identifying and reducing privacy risks, just specific to the risk profile that you have, and then catching any basic privacy errors. We're all human, we make mistakes, we forget about things sometimes. This is a good way to make sure that you have done that before production.

One of the drawbacks of doing this is that it does actually take some time. Anytime you sit down and you document something, it's some amount of hours, it takes resources. It's also the kind of thing that can become pointless busywork if you don't do a good job of implementing it, and you don't engage with it very well. If you're not going to do it well, just skip it, because it's not really going to add that much. It can add a lot of value if you do it thoroughly and you actually care about the work that you're doing.

Also, a privacy impact assessment can't replace a security risk assessment, they don't look at the same things. At the same time, a risk assessment can't replace a privacy impact assessment, so they might work together and be done at the same time, but they are different things and should be, you know, viewed as separate items.

Data Minimization and Retention

The privacy impact assessment is basically the foundation of a privacy by design program. Once you've got something like that in place, even if it's maybe a little informal, then some other concepts to keep in mind are data minimization and retention. This is a topic that's one of the oldest concepts in privacy, this is kind of where we started, partly because we know that companies like hoarding as much data as they possibly can, because they might be able to sell it, they might be able to get some kind of monetization out of it.

Throughout the data lifecycle, the goal of data minimization is to collect only the necessary data to actually do the processing that you're planning on doing. Then maintaining that information and updating it as needed, and deleting old data that isn't needed as well. The way to think about this is, whenever you're looking at collecting data, ask yourself, "If I really wasn't allowed to have this piece of information, would I still be able to do whatever it is I'm attempting to do? Does my application still work?" If it does, then why are you requiring it or why potentially are you even asking for it at all? Why do you even put in that field? Make sure that there's purpose behind all of the data that you collect from the beginning.

A good example of this is the birthday rewards program. What do you really need to provide birthday rewards for someone in a loyalty program? Do you need the year of birth, do you need the day of the birth? You can probably do it with just the month. I can say my birthday is in April, that's it, birthday month, done. That's a more privacy-respecting way of providing that same kind of benefit for the user without getting too much information from them that's not really necessary. Similarly, you can ask for just a ZIP Code perhaps or a county, instead of getting a full home address for something. If you're just trying to estimate someone's location, then there's really no point in asking for the full address.
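As a rough illustration of the birthday-rewards example, here is a minimal Python sketch of a loyalty record that stores only a birth month and a ZIP code. The names `LoyaltyProfile`, `make_profile`, and `is_birthday_month` are hypothetical and not from the talk; a real schema would depend on what the program actually needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoyaltyProfile:
    """Holds only what the birthday reward and a rough location need."""
    member_id: str
    birth_month: int   # 1-12; no day or year is collected
    zip_code: str      # coarse location instead of a full street address


def make_profile(member_id: str, birth_month: int, zip_code: str) -> LoyaltyProfile:
    if not 1 <= birth_month <= 12:
        raise ValueError("birth_month must be between 1 and 12")
    return LoyaltyProfile(member_id, birth_month, zip_code)


def is_birthday_month(profile: LoyaltyProfile, current_month: int) -> bool:
    """The reward check works without ever knowing the full date of birth."""
    return profile.birth_month == current_month


profile = make_profile("m-1001", 4, "10001")
print(is_birthday_month(profile, 4))   # True
```

Because the day and year are never collected, there is nothing extra to lose in a breach.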

New applications are where you would want to use this. This would also be applicable to everything from the field level, so when you're requesting data from a user in a form, all the way up to entire tables of information, so making sure that you're not just collecting data and retaining data because you can. One of the reasons we don't want to do that is because in the event of a data breach, you've now potentially lost a lot more information, people are going to be a lot more upset and you might be liable for a significant amount of data that you wouldn't have really had to have in the first place.

API integrations are also another area where we tend to forget about doing data minimization. It's a lot easier just to pull all the data, just hook it up and let the data flow, and take what you need when you want it. It does take a little bit more time to set up a good minimization structure, but it really adds a lot of value and reduces your risk. Same thing with adding new features and functions: you always want to think about it, don't just plug and play, don't drop modules in just because you had them already available. Collecting new personal information - same thing.
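One possible way to apply minimization at an API integration is an explicit field allowlist, sketched below. The field names and the `NEEDED_FIELDS` set are made up for illustration; the idea is only that anything not listed is dropped before it is stored.

```python
# Hypothetical allowlist: the only fields this integration actually needs.
NEEDED_FIELDS = {"customer_id", "zip_code", "plan"}


def minimize(record: dict, allowed: set = NEEDED_FIELDS) -> dict:
    """Keep only allowlisted fields; everything else is dropped before storage."""
    return {key: value for key, value in record.items() if key in allowed}


# Example payload as it might arrive from an upstream system.
upstream_record = {
    "customer_id": "c-42",
    "zip_code": "94107",
    "plan": "pro",
    "date_of_birth": "1990-04-12",
    "home_address": "1 Example Street",
}
print(minimize(upstream_record))
# {'customer_id': 'c-42', 'zip_code': '94107', 'plan': 'pro'}
```

An allowlist fails closed: a new upstream field is ignored until someone decides it is needed.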

Also, customer termination - this would include both users being deactivated potentially, or enterprise customers who've decided to terminate a contract with your company. If you're not deleting the data, have you really fulfilled your retention requirements? Also, users get really upset when they find out that they thought they deleted their information, but you just deactivated it. Once that comes out, it really makes a big difference in how your company is viewed.
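The difference between deactivating and deleting can be sketched as a small retention job. This is an assumption-laden example: the 365-day `RETENTION` value, the `deactivated_at` field, and the `purge_expired` helper are all hypothetical, and the real retention period would come from your contracts or regulations.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)  # assumed period; use your contract's value


def purge_expired(accounts: list, now: datetime) -> list:
    """Return only the accounts that may still be kept.

    Deactivated accounts past the retention window are dropped entirely,
    not just flagged, so the data is actually gone.
    """
    kept = []
    for account in accounts:
        closed_at = account.get("deactivated_at")
        if closed_at is not None and now - closed_at > RETENTION:
            continue  # past retention: delete
        kept.append(account)
    return kept


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
accounts = [
    {"id": "a1", "deactivated_at": None},
    {"id": "a2", "deactivated_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"id": "a3", "deactivated_at": datetime(2022, 1, 15, tzinfo=timezone.utc)},
]
print([a["id"] for a in purge_expired(accounts, now)])   # ['a1', 'a2']
```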

The way to summarize this is to think in terms of the sort of minimum viable data. What do I really need to make this functional? You can also think of it as least data, the principle of least data, like the principle of least privilege, only collect what's absolutely necessary.

Data minimization is sometimes required for legal compliance, and if you can show how you actually implemented it, especially for data retention, this is going to be highly important. Contracts usually include retention timelines as well. This also minimizes the amount of data that you lose when you're breached, and it's usually a when, not an if. This also improves data quality because users are a lot more likely to provide useful information if it's not old and it's really just the necessary stuff. They're a lot more likely to start putting in random stuff if you're asking for way too much or they don't feel comfortable giving that much information.

One of the frustrations that might come from this, though, is that users may become really annoyed if they think that their data is going to last forever, and then they log in and it's gone, because you didn't keep it for some reason or you severed an integration. Also, as we said, companies like to hoard data. Tell your product manager that you want to delete a bunch of data, and then let me know how happy they are. It can be a difficult battle to fight, but it's a very important one if we can advance those principles.

Default Settings

Default settings - this is the one that we don't tend to do a very good job of speaking about because it's a very fuzzy concept. It's something that's very easy to say, but it's hard to explain how to implement this because the way to implement it is very dependent on the environment. We can give some examples, but it can be pretty tough to actually sit down and give you a checklist of ways to make appropriate default settings. It takes a certain level of insightfulness and genuine thought about what you're doing. Nobody can really tell you what to do, you have to make these decisions yourself.

The goal of this is to minimize personal data, so like with the data minimization principle, you're able to implement that through default settings very often. You also want to prevent default data sharing, so think of things like ad tracking and affiliate sharing. If you can get away from sharing users' information just by default, then that's the better route to take. This also would ideally require users to enable more intrusive app settings, so if you have a setting that maybe some users would find useful but it requires more sensitive information like location data or allowing the microphone, then start by leaving that off, and if the users want it, then they can enable it. Don't just enable it by default.

You also want to avoid making data public by default, if possible. Venmo, as you all know, or hopefully you know - if you didn't know this, maybe check your settings - makes all transactions public by default. If you're doing anything illicit and putting it in Venmo, maybe just take that off the internet. Also, Amazon Wish Lists - a lot of people don't realize that Amazon Wish Lists are actually just shared by default, because the assumption is that you're telling people what you want for Christmas. If you're using that as a means of tracking whatever it is you're going to buy for a project, the entire internet can see that.
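A minimal sketch of privacy-protective defaults might look like the following. The `PrivacySettings` class, its fields, and `enable_location` are illustrative names only; the point is that every intrusive option starts off and only an explicit user action turns it on.

```python
from dataclasses import dataclass


@dataclass
class PrivacySettings:
    # Every intrusive option starts off; the user must opt in explicitly.
    share_with_affiliates: bool = False
    ad_tracking: bool = False
    location_access: bool = False
    microphone_access: bool = False
    profile_public: bool = False


def enable_location(settings: PrivacySettings, user_confirmed: bool) -> PrivacySettings:
    """Turn on location only after an explicit user action."""
    if user_confirmed:
        settings.location_access = True
    return settings


new_user = PrivacySettings()
print(new_user.profile_public)      # False
print(new_user.location_access)     # False
enable_location(new_user, user_confirmed=True)
print(new_user.location_access)     # True
```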

Here is a really interesting statistic, and this is key for you guys to remember. Less than 5% of general users will change any kind of application default setting. Programmers, all of you, change on average about 40% of the settings. I know that this is a little bit of a mindset shift, but we have to think in terms of default settings as user behavior. Whatever it is that you make the default behavior, that will be your users' behavior, because they're just not going to make those changes.

A way that we can think of this in financial terms is 401k plans. User enrollment in 401k plans was pretty abysmal until companies started enrolling them by default, and because nobody changes the default setting of anything, they just stayed enrolled in the 401k program. There are things that you can do to actually help users help themselves with privacy just by not invading it in the first place.

Amazon saves all voice recordings by default. Everybody know about this? I hope you know about this. If you have any kind of IoT, this is something that is rampant in IoT. The default setting is just to save all of the data. It's pretty concerning, because this information is stored without most people knowing about it. You can actually turn this off, but it takes quite a lot of effort. You can see how many steps you have to go through, if you even know that this is happening in the first place. It takes a higher-level user to even get to this point.

Also, AT&T requires users to opt out of targeted advertising. This is just an example, because it's pretty common. Most organizations, if they can get away with it and don't have requirements against it, like under the GDPR, are just going to send that advertising and make use of all of the data. If you can avoid doing this to your users, that's preferable.

Big kudos to Firefox, because they started ad blocking by default, so this is a great example of how the browser settings can actually become part of an individual user's privacy protection.

You want to do this because it gets you a reputation for privacy, it's a huge topic right now. Anytime that you're able to tell users that you actually care about their privacy, it's a differentiator and it's starting to become a fairly big one, so you could even do marketing on this. Apple's been doing it for a really long time and has a fairly good reputation for privacy, so make that your company if you can.

This also reduces user frustration, because there's always that sort of sinking feeling, this feeling of betrayal when you find out that your application has been sharing a bunch of data that you didn't even know was being collected. Most importantly, this protects less educated users, because users who aren't really power users don't know about this, so they're really the ones who are most often abused. It's not you guys, you know how to do this, you know how to protect yourselves, you're smart enough. Maybe younger people, older people, those without a lot of formal education, perhaps aren't really going to have that level of awareness, so think about them when you're changing default settings.

Some of the drawbacks, again, companies want to monetize data, so anytime you're trying to take it away from them, they get really upset. If there are any ways where you can exercise any autonomy over these items, then go ahead and do it, if you can, advocate for it where you're able to. This also requires privacy awareness at design. If you're operating under a model where you don't think about privacy until you're at the checkpoint, then it's not likely that this is going to happen.

Encryption

I did include encryption in here. I debated about this quite a bit, because I know that you guys all know about encryption, so I'm not going to explain to you what encryption is. I do just want to go over a couple of really common errors that do happen at the development level. You know to do all of this stuff, encrypt wherever you can, that's the best rule of thumb. If you can encrypt it, do encrypt it. That includes data in transit and data at rest. Your backups don't suddenly stop being data at rest, just because they're not your production environment.

A really excellent way to learn more about how to implement cryptography well for developers is the OWASP guide to cryptography. I included this in the references, so there's a lot of information in the GitHub about how to actually implement some of this stuff. There are some really common problems: making your own crypto and using deprecated crypto. This is where you might technically have encrypted something, but you haven't done a very good job of it perhaps or you did a quick and dirty way around it, or maybe haven't changed the encryption after something has become deprecated. Check what version you're on, and make sure that you aren't hard-coding keys anywhere; you want to keep the keys away from anything that users would be able to access.
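To make the hard-coded-key point concrete, here is one hedged sketch that reads a key from the environment instead of the source code. The variable name `APP_DATA_KEY` and the `load_encryption_key` helper are hypothetical; in production the key would more likely come from a dedicated secrets manager or key management service on a separate system.

```python
import os


def load_encryption_key(var_name: str = "APP_DATA_KEY") -> bytes:
    """Read a hex-encoded key from the environment instead of hard-coding it."""
    hex_key = os.environ.get(var_name)
    if not hex_key:
        # Fail loudly rather than silently falling back to a built-in default key.
        raise RuntimeError(f"{var_name} is not set")
    key = bytes.fromhex(hex_key)
    if len(key) != 32:
        raise ValueError("expected a 32-byte (256-bit) key")
    return key
```

The function refuses to run without a key, which avoids the quick-and-dirty habit of shipping a default key in the code.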

Also, don't store keys on the same server as the data. This is a pretty common flaw, because there's this mindset with developers that usually isn't a security mindset, because often devs are more focused on building things and less about breaking them. If your security teams haven't gotten on your case about this, then try to do it yourself. Also, avoid using just one key for everything. If you have a breach with that one key, then that's it, those are literally the keys to the entire kingdom. If you have any password management in your system, make sure you're doing the hash and salt. I know everybody's been saying this for so long, but it is important, so make sure you're doing it.
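For the "hash and salt" advice, a minimal sketch using only Python's standard library could look like this. PBKDF2 is used here as one reasonable choice, not as the method the talk prescribes, and the `ITERATIONS` value is an assumed work factor that should be tuned to your hardware and current guidance.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 600_000  # assumed work factor; tune for your hardware


def hash_password(password: str) -> tuple:
    """Return (salt, digest); a fresh random salt is generated per password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))                   # False
```

Only the salt and digest are stored; the plaintext password never is.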

Also, if you turn off certificate checks to do any testing, make sure you put some kind of checkbox in there or something where you remember to turn them back on, because that does happen fairly frequently, and just avoid using really old crypto libraries. Make sure you keep up-to-date on it, maybe do an annual check on your encryption that everything's working and there haven't been any major breaches or flaws in the last year.

Differential Privacy

We're done with the basic stuff. This is all the really exciting things from here on out. Differential privacy, I am really excited about. This is one of the most important aspects of making data usable in ways that it never has been before. Differential privacy is a highly mathematical concept. It is a way of adding statistical noise to a data set, and this is actually mathematically measurable, so we are able to generate the correct amount of noise for the size of the data set, so that we can tune the level of privacy versus the level of usability.

This prevents the identification of one individual's record, so you can't do inference attacks on that database, but it still provides nearly the same results as the raw data would. The goal behind this is to provide essentially the same results regardless of whether one individual user's data is in that query or not. That prevents this individual exclusion type of attack, where you're able to learn information about one user by excluding them from a query.

Don't worry too much about all of the gibberish on here, I just want to show that there's actually quite a lot of different privacy models that are available. You see the one closer to the bottom with the Epsilon value, that's differential privacy. The reason that it's preferable is because it's measurable and adaptable to the level of privacy that you need. An example of where you might use this would be a demographic study of disease presence in a certain user base. For example, if you have a database of new AIDS infections, then you would be able to get usable information about the demographics for that new infection without being able to identify any single user who was HIV positive. You can see pretty easily just from one example how this becomes very useful for health data and behavioral tracking without invading any individual's privacy.

I'm going to show you a couple of examples of how this looks in practice. This is an example data set with a very low Epsilon value, so that means that the data is highly private. If you run queries, the chances that you're going to match up to what you would get from that raw data is a lot lower, so it's highly private data, very secure, but maybe not quite as accurate. The results are going to vary a little bit more, but this would be more valuable for sensitive information.

This is the high Epsilon value, the other is the low Epsilon value. If the Epsilon value is high, then the results are going to match the raw data very closely. You would be able to get away with this more in an extremely large data set. For example, if you had aggregated income data, if you remove just one high-value individual, it's going to change the results more significantly when you have this kind of Epsilon setting versus the first one. You would be a little bit more likely to identify that one individual's income based on how it changed the results over several repeated queries.
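The low-versus-high Epsilon trade-off can be sketched with the Laplace mechanism, a standard way to add calibrated noise to a counting query. This is an illustrative example, not the data set shown in the talk: the `ages` data is randomly generated, and `laplace_noise`, `private_count`, and `mean_abs_error` are hypothetical helper names.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count. A count has sensitivity 1: adding or removing one
    person changes it by at most 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def is_over_65(age: int) -> bool:
    return age >= 65


rng = random.Random(42)
ages = [rng.randint(18, 90) for _ in range(1000)]   # synthetic data


def mean_abs_error(epsilon: float, trials: int = 200) -> float:
    """Average distance between the noisy answer and the raw count."""
    true_count = sum(1 for a in ages if is_over_65(a))
    errors = [abs(private_count(ages, is_over_65, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / trials


print("low epsilon (more private):  ", mean_abs_error(0.1))
print("high epsilon (more accurate):", mean_abs_error(5.0))
```

With the low Epsilon the answers wander further from the raw count; with the high Epsilon they track it closely, which is the trade-off described above.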

Why do you even want to do this really weird esoteric thing? This gives you an opportunity to limit insider threats and increase the data usability. Somebody who has maybe a lower clearance would be able to do valuable operations on data without learning anything about it or learning this protected information. It also allows for collaboration between multiple groups where maybe you want to allow someone else to do something with your data, but you don't want to give them the raw information.

The drawbacks of it, though, are that you really do need quite a large data set. Obviously, if you only have two users in a data set, how are you really going to add enough noise to cover up the removal of one of those individuals? Also, this really does need to be tuned fairly well, because there's a balance that you're trying to strike, so you would really need to work with a data scientist probably to find the correct Epsilon value for the application that you have.

Privacy Preserving Ad Click Attribution

This is pretty much my favorite one, privacy preserving ad click attribution. It just came out, Apple released it two weeks ago. This is essentially differential privacy for ad click attribution. It doesn't work on the same mathematical principles, but it's the same kind of concept. It's separating the ability to identify an individual while still letting companies identify that they did have an ad click that turned into a conversion. Apple was brilliant for doing this, because they basically saved the monetization of content. Now we're able to really still monetize content through the use of advertising, but in a way that isn't intrusive for users. It's pretty phenomenal, I'm really excited to see this. This prevents user ad click tracking at the individual level, and it cuts out the need for abusive cookies and third-party tracking technologies. This uses the browser itself to mediate the ad clicks.

The way that this works, it's three steps. In the first step, the ad click is stored by the browser, so when a user clicks on an advertisement, it recognizes that that click happened. It stores that information for a limited period of time, generally no more than a week. From there, the browser itself is what identifies whether or not that ad click actually turned into a conversion sale. The browser is what knows that the sale happened, not the third-party advertisers. Then in the third step, that ad click attribution that turned into a sale is sent back. It's reported as a successful conversion somewhere between 24 and 48 hours later, at a randomized time. The benefit of this is that companies are able to fairly quickly get the information about what advertising generated a sale; it's associated with a campaign ID, and that's all the information that they get.

The website where the ad click is happening and the website where the conversion happens, neither of them are able to see anything about the ad click data being stored or the user. They only are able to know basically that this is the kind of setup that they have.
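The three steps can be modeled with a small toy simulation. This is not the WebKit API and not Apple's implementation: `BrowserAttributionStore` and its methods are hypothetical names, and the sketch only mirrors what the talk describes (browser-side storage for up to a week, browser-side matching, and a randomized 24 to 48 hour report carrying only a campaign ID).

```python
import random
from datetime import datetime, timedelta

CLICK_LIFETIME = timedelta(days=7)   # "no more than a week" in the talk


class BrowserAttributionStore:
    """Toy model of the browser-side bookkeeping; not the WebKit API."""

    def __init__(self, rng: random.Random):
        self._clicks = {}   # destination site -> (campaign_id, click time)
        self._rng = rng

    def record_click(self, destination: str, campaign_id: int, now: datetime) -> None:
        # Step 1: the browser, not a third party, remembers the click.
        self._clicks[destination] = (campaign_id, now)

    def record_conversion(self, site: str, now: datetime):
        # Step 2: the browser matches the conversion to a stored click.
        entry = self._clicks.pop(site, None)
        if entry is None:
            return None
        campaign_id, clicked_at = entry
        if now - clicked_at > CLICK_LIFETIME:
            return None   # click expired; nothing is attributed
        # Step 3: schedule a report 24-48 hours out with only the campaign ID.
        delay = timedelta(seconds=self._rng.uniform(24 * 3600, 48 * 3600))
        return now + delay, {"campaign_id": campaign_id}


store = BrowserAttributionStore(random.Random(1))
clicked = datetime(2024, 1, 1, 12, 0)
store.record_click("shop.example", campaign_id=17, now=clicked)
send_at, report = store.record_conversion("shop.example", clicked + timedelta(days=2))
print(report)   # {'campaign_id': 17}
```

The report contains no user identifier and no click timestamp, and the randomized delay makes it harder to link the report back to a specific purchase.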

If you want to look at this, you can actually play with it right now in the experimental features. If you can enable dev mode in Safari, you're able to go in there. The GitHub for this is in the resources that I have linked, so try it out, find me on Twitter, and let me know if you liked it, you hated it, you thought it was useful. I'd love to see some opinions, it's brand new, so there's not a lot of information now about how it's turned out for actual developers.

The exciting thing about this is that websites are still able to monetize content, and Apple is hoping to make this a W3C web standard. If they're finally able to do that, it might be able to do some of what Do Not Track never accomplished, which really just backfired quite a bit.

Some of the drawbacks: you really are going to need to see a lot of widespread adoption to get this in place, so I'm hoping that all of you would go out and be ambassadors for this. One other thing that might be a potential challenge with this is that users are pretty used to thinking, "If I'm seeing advertising, it means my privacy is being invaded." We've broken all trust in ad tech, so users might take a lot of time and education to believe in the concept of privacy-preserving ad technology.

Federated Learning

Federated learning is also known as on-device machine learning. This has been in place at a commercial level since only 2017, so still fairly new. What federated learning does is train a central model on decentralized user data. It gets the benefit of real user data without actually importing any of that user data; only the results are viewed. It never transmits any device data that's not in the aggregate back to the central model. This is helpful because it gives iterative models, and in this kind of deployment you actually are able to do testing and result modeling at the same time, and then you can continue doing this machine learning quite quickly and over a large set of devices.

One of the other key ways that this is actually implemented in a private fashion is that it uses secure aggregation, so only the aggregate is ever decrypted. The individual user results that are sent back are not able to be decrypted. How this works is that in step A here, you see that the device itself operates on the training model, that's sent out to the device, and it works to provide information about the user behavior. Android has been using this on Gboard to do the keyboard prediction technology. They first implemented it in 2017, and it's been in effect since then, I guess two years. Then the user results are aggregated and only that combined consensus change is what's actually included in that next iteration, so then the model is updated and the next iteration begins again.
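The round described above (local training, aggregation, model update) can be sketched in plain Python. This is a toy under stated assumptions, not Gboard's implementation: the one-parameter linear model, the three synthetic devices, and the helper names are all invented for illustration. Real secure aggregation uses cryptographic key agreement between devices; here the pairwise masks are simulated in one process just to show that they cancel in the sum.

```python
import random


def local_update(global_weights, local_data, lr=0.1):
    """One gradient step of y = w * x on the device's own data.
    Only the weight delta is returned; raw data stays on the device."""
    w = global_weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
    return [-lr * grad]


def mask_updates(updates, rng):
    """Add pairwise masks that cancel in the sum, so an individual masked
    update reveals nothing useful but the total is unchanged."""
    masked = [list(u) for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            for k in range(len(updates[0])):
                m = rng.uniform(-100.0, 100.0)
                masked[i][k] += m
                masked[j][k] -= m
    return masked


def aggregate(masked_updates):
    """Server side: average the masked updates; the masks cancel out."""
    n = len(masked_updates)
    return [sum(u[k] for u in masked_updates) / n
            for k in range(len(masked_updates[0]))]


rng = random.Random(0)
device_data = [                      # synthetic data following y = 3x
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5)],
    [(1.0, 3.0), (3.0, 9.0)],
]
weights = [0.0]
for _ in range(20):                  # each loop is one training iteration
    updates = [local_update(weights, data) for data in device_data]
    avg_delta = aggregate(mask_updates(updates, rng))
    weights = [w + d for w, d in zip(weights, avg_delta)]
print(weights)                       # approaches [3.0]
```

The server only ever combines masked updates, which matches the idea that only the aggregate is usable while individual contributions are not.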

This is not intrusive to the users because only eligible candidate devices are selected for any kind of operation at any given time. That generally means outside peak hours, on Wi-Fi, and charging, so that you're not really taking advantage of their device to do all of your dirty work. Android's Gboard prediction model is the example here - I don't know for sure if any other commercial applications are using this right now, but I think that this is really a key area where we would be able to make better use of machine learning without being abusive to users. One of the primary drawbacks of machine learning is that it's been identified as such a privacy-intrusive kind of technology.

You might be able to use this for something like health diagnostics or learning behavioral preferences without tracking individuals, something like driver behavior in smart vehicles. You can speed up modeling and testing, it's quite rapid iterations and you're able to do it very broadly, you can do this on several user devices. This uses generally TensorFlow Lite, so if you're interested in that technology, then try to play with it.

In the GitHub repository I have some links to tutorials that will let you set up some test examples, so you can see how this would look. It's minimally intrusive and, again, individual data is never accessible to anybody who's working with that centralized model. Outlier information isn't going to be delivered back.

If you mess up the implementation, it's possible that you would have errors that would cause private data leakage. Before you would want to actually deploy this in a production setting, you would want to make sure that you had done your research and practice on this. It also requires quite a large user base. Again, it's very difficult to confidently aggregate information privately without having quite a large amount of people, because if you're aggregating two people's information, then it's pretty easy to tell where that came from.
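The point about needing a large user base can be shown with a small guard that refuses to release an aggregate from too few contributors, along with the arithmetic that makes a two-person aggregate leaky. This is a sketch only; the threshold of 100 and the function names are hypothetical, and choosing a safe minimum in practice needs real analysis.

```python
from typing import List, Optional

MIN_COHORT = 100  # hypothetical threshold, not a recommended value


def release_average(updates: List[float],
                    min_cohort: int = MIN_COHORT) -> Optional[float]:
    """Release the average only if enough devices contributed to it."""
    if len(updates) < min_cohort:
        return None  # too few contributors to hide any one of them
    return sum(updates) / len(updates)


def recover_other(total_of_two: float, my_value: float) -> float:
    """With only two contributors, either one can recover the other's value."""
    return total_of_two - my_value
```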

Homomorphic Encryption

We're on the last one, which is also the most, I suppose, complex one, homomorphic encryption. There are two key differences between the types of homomorphic encryption that are available for use now and the more theoretical stuff. Fully homomorphic encryption would be the ability to do full data processing on data sets that are encrypted without decrypting them. This is pretty much theoretical, it's going to happen around the time that we get quantum computing. It's pretty resource intensive, it's been a theory for a long time, and we've never really been able to achieve it because of the amount of processing power that it would take to actually do this.

We do have partially homomorphic encryption. In the example I'm going to talk about, you're able to execute add, count, and average functions on encrypted data. The reason you would want to do this is that it's a type of multi-party computation. If you have data between three different organizations and you want to jointly collaborate on some information, you're able to use this type of processing to allow each other to make use of those data sets without disclosing your raw data. What you want to remember from this is that it allows computation on ciphertext, which is really exciting and could completely change the way that we interact with and think about data.
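To show what computation on ciphertext looks like, here is a toy sketch of an additively homomorphic scheme. It uses the Paillier cryptosystem, which is my choice for illustration and is not named in the talk. The primes are far too small to be secure, and the function names are hypothetical; the only goal is to show that multiplying two ciphertexts yields an encryption of the sum of the plaintexts. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets
from typing import Tuple


def keygen(p: int = 1009, q: int = 1013) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Toy key generation; real keys use primes of 1024 bits or more."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1               # standard simple choice of generator
    mu = pow(lam, -1, n)    # modular inverse; valid because g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub: Tuple[int, int], m: int) -> int:
    """Encrypt an integer 0 <= m < n with fresh randomness."""
    n, g = pub
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub: Tuple[int, int], priv: Tuple[int, int], c: int) -> int:
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(pub: Tuple[int, int], c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)
```

A party holding only the public key can combine ciphertexts from several contributors with `add_encrypted`; whoever holds the private key decrypts just the total. An average follows by dividing that decrypted total by the number of contributions, which is not secret in this sketch.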

Similar to the federated learning models, only the calculation results are what's decrypted. There are some different use cases between why you would choose this instead of the federated learning model. Traditional encryption, you can encrypt in transit and you can encrypt at rest, but the moment you want to do anything at all with that information, you have to decrypt it, you have to expose it somewhere. This would be an example of fully homomorphic encryption where you don't ever have to decrypt any aspect of it. In partially homomorphic encryption, you have to set up quite a structured environment for it, but you are able to do those limited functions.

This will allow you to do computations on data shared across organizations using highly sensitive records, potentially with processing by employees with a lower clearance. In the last week, Private Join and Compute has been released by Google. This is open-source, so you're able to go to the GitHub, where there are some tutorials and information. If you do nothing else from this talk, I think this would be the most exciting one to play with because it's really novel and it's probably going to change the way that we think about data processing.

Google's Private Join and Compute actually uses two concepts. It uses private set intersection and homomorphic encryption, but it is one of the only ways that you're actually able to start testing on this concept now.
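Private set intersection, the first of those two concepts, can be sketched with commutative blinding: each party raises hashed items to its own secret exponent, and because the order of exponentiation does not matter, doubly blinded values match exactly when the underlying items match. The code below is a toy illustration of that general idea and not Google's implementation. The group, the hashing step, and every name are assumptions, and both parties run inside one function purely for demonstration.

```python
import hashlib
import math
import secrets
from typing import List

P = 2**127 - 1  # a Mersenne prime used as a toy modulus; not a secure choice


def hash_to_group(item: str) -> int:
    """Map an item to a value in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret() -> int:
    """Pick an exponent coprime to P - 1 so that blinding is one-to-one."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e


def blind(values: List[int], secret: int) -> List[int]:
    return [pow(v, secret, P) for v in values]


def private_intersection(items_a: List[str], items_b: List[str]) -> List[str]:
    """Party A learns which of its items B also holds, and nothing else from B."""
    a, b = new_secret(), new_secret()  # each secret stays with its own party
    a_once = blind([hash_to_group(x) for x in items_a], a)  # A sends to B
    b_once = blind([hash_to_group(y) for y in items_b], b)  # B sends to A
    a_twice = blind(a_once, b)       # B blinds A's values, returns them in order
    b_twice = set(blind(b_once, a))  # A blinds B's values with its own secret
    return [x for x, v in zip(items_a, a_twice) if v in b_twice]
```

In a join-and-compute setting, the values associated with the matching items would additionally be encrypted homomorphically so that only a combined result over the intersection is revealed; this sketch stops at the intersection itself.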

This, similar to federated learning, reduces insider threat, increases collaboration, and increases the data usability. It's also quite resource intensive; even partially homomorphic encryption still requires quite a lot of processing power to actually do this. It's exponentially greater than processing on raw data. There are only those limited functions, so you're not able to do many detailed calculations right now. Then again, there's no fully homomorphic encryption at this point.

Becoming a Privacy Champion

There is a lot of information here. There's everything from theoretical information to the most basic stuff possible. What I would like for all of you to do is to think about some ways that you would potentially be able to act as a privacy champion in your organization, things that you might be able to implement that would make even the most minor difference for your users. If there's something available for you to do, or some way that you can engage with your legal team, your compliance team, or your security team and improve the privacy, just try to take one action. We really need technical people to be aware of what things are available because privacy and compliance teams a lot of times don't even know that this stuff exists. They're not interested perhaps or they don't really think outside of legal compliance and governance.

Kind of like security, we want to bake privacy in from the beginning, that's the ultimate goal, and we can get there with some of the solutions. If you want to look at the GitHub repository with all the links, then it's there. You're also able to talk to me on Twitter if you have any questions or you want to tell me about something that you saw in your organization, I love hearing about use cases that I've never heard of. It's always exciting to see that.

Questions and Answers

Participant 1: Many things you mentioned in the first half, like don't ask for as much data and things like that. I'm a developer and these strike me as things that I would need to convince my product or business people to do. I would like your advice on how to talk to them, how to convince them that this is something they should care about?

Welch: That's a great question and certainly the most challenging aspect of implementing privacy. I think the best way to do it is usually to talk to them in terms of risk because that's a business concept that they're a lot more likely to understand. They may not care as much about the resources, but that's another way to approach it, so try to speak to them in terms of ROI. For example, cyber liability insurance is calculated on the amount of data that you have, so if you can get rid of some of it, then you have cheaper insurance. The same thing for what you would have to pay out if you had a breach. If you have less data to be breached, then it's less damaging to your company.

That's usually the easiest way, especially if you can show that you can do the exact same thing with that data and just not collect some of the really intrusive stuff. For example, in a job search recently, I saw an HR company that wanted the last four of my social and my full birth date in order to submit an application. Why would anybody ever think that that was important for that kind of task?

Participant 2: I don't know if it makes sense but a lot of what you were talking about was privacy information for customers or people who are using your applications. What about privacy for internal organizations data or users or...?

Welch: Employees generally?

Participant 2: Yes.

Welch: This is related to the data minimization most often, where sometimes organizations will put abusive oversight on employees. I'm, especially, not a fan of mobile device management technologies that are a little aggressive. There you can use the same principles, don't ask for more than you really need to complete the task. If your task is something like preserving mobile device company information, then maybe just the encryption alone and some user training would be sufficient.

Participant 3: Maybe more of a comment than a question, coming back to the first question, how do you convince your manager to limit the data you store? If there's a possibility that a European citizen uses your application, the GDPR requires you to only store data that you absolutely need for the process, so you're not legally allowed to store anything else.

Welch: Yes, and no. The GDPR is not actually scoped on citizenship. It's on residency; the word citizen doesn't actually appear in the text of the GDPR. If you have incidental European citizen data, that doesn't necessarily mean you're in scope. It's more about whether you're intentionally targeting users, data subjects who are, as it says, resident in the EU.

Participant 4: You mentioned that when a user asks to delete something, it's best to go ahead and delete it instead of doing a soft delete maybe. Is it appropriate to instead of fully deleting it just mask their personal information, so that you could still do things like analytics and tracking and look back at the past and see what users' behavior is, because you don't want to lose that?

Welch: There are two components to this. Don't use the word delete if you're not actually deleting, there's just an aspect of giving them information. If the answer is no, you can't delete some of that information, then that's the answer, but they really have a right to know that. The other thing that I didn't have time to get into is de-identification and anonymization techniques. There are ways to de-identify that information. It's pretty tricky to make sure that you're doing it well, but it is possible. That would be an option if you need to preserve maybe record data. Especially if it's something that's required for legal compliance, then, too bad, we're keeping it because we have to.

Participant 5: In answer to my first question talking about ROI, determining like the risk that the company's at now is not a thing I know how to do. Do you have resources that I could go and peruse to learn how to do that or learn more about that?

Welch: The best resource for anything related to just general privacy information is the IAPP. Your second-best source is going to be your security team, because that's basically what they do. The difference between privacy risk and security risk is pretty minimal, and the concepts are the same. The threat modeling works basically the same. Tell your security team that you want to start thinking about that and do some of it for privacy; they'll probably be excited about helping you.

Participant 5: Is that IAPP or...?

Welch: IAPP, the International Association of Privacy Professionals.


Recorded at:

Sep 17, 2019

Amber Welch
