What Should Software Engineers Know about GDPR?

Key Takeaways

  • You cannot afford to ignore GDPR, but you should not panic, either.
  • When creating software, you can easily expand documentation with GDPR-required details.
  • Privacy by default should be part of any software you craft.
  • Expanded user rights require some care and support.
  • You should revisit some software-building practices like logging.
  • A software designer should try to find ways to avoid being a data processor, and still be able to do the work. In other words, don’t unnecessarily access personally identifiable data, unless you are prepared to do so.

Are you going to create new software solutions in 2018? If so, it might be a good idea to read this. The EU General Data Protection Regulation (GDPR) moves out of its transition period and becomes enforceable on 25 May 2018. Violating its terms might lead you to face fines of up to 20 million euros or 4% of worldwide annual turnover, whichever is higher, which means much more for large organizations. In addition to the sanctions listed in the regulation, jail time is even possible for individuals responsible for great neglect or data breaches.

Obviously, this sounds severe. I have seen two extreme approaches to GDPR: one, to pretend it does not apply to you and try to ignore it and another to declare that skies are falling and that no development can focus on personal data anymore. Both approaches are wrong, misinformed, and could lead to huge losses. GDPR does not create an end-to-all-personal-data scenario; instead, it sets rules for transparent and secure handling of personal data and threatens those who ignore them with very juicy sanctions.

GDPR strongly emphasises risk-based thinking; you take every step to mitigate privacy risks until the risks become something you can tolerate. I appreciate this regulation — there is enough software that has absolutely no security or privacy built in the design. This sort of software and its breaches lead users to mistrust how their personal data is being used. It's time to change that.

Key points to understand

This topic is huge, so I am concentrating purely on the process of crafting new software solutions. There is a lot to be said about organizational support and legacy systems, but they are highly dependent on the starting point. The GDPR does not allow many exceptions to the rule, so big and small businesses, non-profits, and government organizations all need to know the main points.

One key point of the new regulation is transparency for the data subjects. When you have a registry — for example, a database — that contains personally identifiable data, the GDPR holds that its use should be transparent to the data subjects. This means that people whose data you are collecting should be able to find out what you are collecting, your purpose, who has access to the data, and how long the data lives within the systems. To cope with this requirement, you naturally should know all these things and document them. Along with transparency, you need to provide better access to said data. Your data subjects should be able to verify, correct, export, move, and erase their data as easily as they gave it to you in the first place.

Another important topic is privacy by design/default. This should actually be integrated into every bit of architecture from now on. It should have been an automatic element of design before this regulation, but people often don't want to pay for security or privacy until something happens. GDPR gives a powerful incentive to take care of this now: an incentive worth up to 20 million euros. Privacy by default means a lot of things, but it essentially aims to protect personally identifiable data and its privacy with suitable controls. This typically requires, for example, clear audit trails that show who did what and when, including and especially read access to personally identifiable information. Additionally, you should pay attention to data when it's being stored and in transit between different layers, and apply suitable encryption to avoid data leaking from your systems.

You should also have a valid basis for processing personal data, meaning what specifically gives you the right to collect and process the information. The basis, for example, could be a law that requires you to collect and store information on individuals for a period of time. The basis for processing personal data may be a contract, agreement, or transaction.

You can ask for consent to collect and process personal data but GDPR does not let you off easy here. It is not acceptable to have a checkbox already checked with a statement like “I accept that my information may be used for marketing purposes.” Consent must be clear, precise, and understandable — and cannot be pre-set. It should be as easy to cancel the consent as it is to consent in the first place. Software designers can decide none of this on their own but need to discuss it with whoever owns the software.
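The consent rules above translate fairly directly into a data model. What follows is a minimal sketch, not a prescribed design: the `Consent` class and its field names are hypothetical. The point it illustrates is that the default state is "not granted" and that withdrawing takes exactly as much effort as granting.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Consent:
    """One consent decision for one purpose (hypothetical model)."""
    subject_id: str
    purpose: str                              # e.g. "marketing-email"
    granted_at: Optional[datetime] = None     # None: never granted, so the default is "no"
    withdrawn_at: Optional[datetime] = None

    def grant(self, now: Optional[datetime] = None) -> None:
        # Only an explicit user action should ever call this.
        self.granted_at = now or datetime.now(timezone.utc)
        self.withdrawn_at = None

    def withdraw(self, now: Optional[datetime] = None) -> None:
        # Withdrawing is a single call, just like granting.
        self.withdrawn_at = now or datetime.now(timezone.utc)

    @property
    def active(self) -> bool:
        return self.granted_at is not None and self.withdrawn_at is None
```

Keeping both timestamps, rather than a single boolean, also gives you a record of when consent was valid, which helps when you need to demonstrate compliance later.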

Here's an interesting point. If the team members that build the software have access to actual personal data while building it, they become data processors and liable to the same sanctions and responsibilities. The same goes for the operations team. If they have access to databases and data, they are liable and responsible. You might want to think hard about that. It is possible to build and operate most systems without accessing actual customer data, after all.

Recognizing personally identifiable information (PII)

GDPR is only interested in personally identifiable information (PII). GDPR does not apply to data that is not attached to a person, such as product or accounting information. You might still classify it as sensitive and might still want to protect it, but GDPR considers it non-PII data and ignores those situations.

GDPR identifies two classes of PII data. There is data that can be used to uniquely identify a person, like a social-security number or e-mail address, or anything directly connected to these identifiers, such as purchase history. Then there is extra-sensitive data such as medical/health information, religion, sexual orientation, or any information on/collected from a minor.

Do note that according to GDPR, combinations of information that may not be unique in isolation can potentially identify an individual. So PII also includes identities that may be deduced from values like postcode, travel, or multiple locations such as places of purchase. Tiny datasets and rare combinations of values make personal identification easier.
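To make the point about combinations concrete, here is a small sketch that counts how many rows share the same combination of values. The idea is borrowed from k-anonymity, a technique GDPR itself does not name; the field names and sample rows are invented for illustration.

```python
from collections import Counter


def smallest_group(rows, fields):
    """Size of the smallest group of rows sharing the same values for `fields`.

    A result of 1 means at least one row is unique for that combination,
    so the combination could single out one person.
    """
    groups = Counter(tuple(row[f] for f in fields) for row in rows)
    return min(groups.values(), default=0)


# Invented sample data: no single column is unique on its own.
rows = [
    {"postcode": "00100", "birth_year": 1980},
    {"postcode": "00100", "birth_year": 1975},
    {"postcode": "33100", "birth_year": 1980},
    {"postcode": "33100", "birth_year": 1975},
]

per_postcode = smallest_group(rows, ["postcode"])              # 2
per_year = smallest_group(rows, ["birth_year"])                # 2
combined = smallest_group(rows, ["postcode", "birth_year"])    # 1
```

Each column on its own hides every person in a group of two, but the pair of columns makes every row unique. Running a check like this against a test or analytics dataset is a cheap way to spot combinations that deserve the same protection as a direct identifier.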

Since any information attached to or collected from a person is protected under privacy rules, most databases are going to contain PII, with some exceptions. I would estimate 70%-80% of typical systems data to be PII. It's not only social-security numbers and credit-card numbers that you should protect.

There's been a lot of discussion of access logs, audit logs, etc. that contain IP addresses or surrogate keys. Are these personal data? Are they registers? Do all personal rights extend to them? How strongly should they be protected? Experts seem to disagree about the answers. We have to wait and see how this evolves. I would advise, however, to avoid hysteria and to use common sense in grey areas. This sort of information could and should be protected to some extent, depending on how much harm a data breach would inflict. But I simply don't see every web server in the world becoming a PII registry in the most demanding sense of the definition.

Designing for privacy

The cheapest way to have your software comply with GDPR is to build the requirements right in. How comprehensively you want to do this depends on the risk level of the particular system in question:

  • Does your system contain extra-sensitive information?
  • Does your system contain something that, while not sensitive for purposes of GDPR, would be embarrassing/dangerous to publish?
  • If someone published your database content, how large a risk would that be to your business?
  • How large is your database of users?

If you have few users and the information that you collect is neither sensitive nor harmful, you might consider your system a low-risk environment and use more cost-effective controls to protect it. On the other hand, if your system contains sensitive data for many users, you would want to apply stronger protection.

A good audit trail is a minimal requirement. An audit trail not only shows that you have applied controls, it also helps you limit the damages in case of a data breach. After any data breach, whether by an internal or external party, the first thing you need to do is find forensics that can show which users are affected and which data were accessed. This is the information that you need to report to data-protection authorities. Additionally, these are the users you may need to notify about the data breach. If you have no forensics, you need to assume that a breach may have affected all users and all records.

A good audit trail also features non-repudiation; in other words, it cannot be altered or damaged, even by system administrators. You might want to use audit trails to see which data a system administrator accessed improperly, for example. This has happened before, and will happen again. Audit trails are also classified as PII: they have a unique identity and data directly connected to that.
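One common way to approximate that property in software is to chain the entries together with hashes, so that changing any earlier entry invalidates everything after it. The sketch below is an illustration under assumptions: the `AuditTrail` class and its field names are hypothetical, and hash chaining only makes tampering detectable. Real protection against an administrator would also need write-once storage or a copy held by another party.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # stand-in "previous hash" for the very first entry


def _digest(body: dict) -> str:
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AuditTrail:
    """Append-only log of who did what, to whose data, and when."""

    def __init__(self):
        self.entries = []

    def record(self, actor, action, subject_id, when=None):
        when = when or datetime.now(timezone.utc)
        body = {
            "actor": actor,            # who
            "action": action,          # did what ("read" counts, too)
            "subject_id": subject_id,  # to whose data
            "when": when.isoformat(),  # when
            "prev": self.entries[-1]["hash"] if self.entries else GENESIS,
        }
        entry = dict(body, hash=_digest(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False means an entry was altered or removed."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True

    def subjects_touched_by(self, actor):
        """Forensics helper: whose data did this actor access?"""
        return {e["subject_id"] for e in self.entries if e["actor"] == actor}
```

After a breach, a query like `subjects_touched_by("admin-7")` is what narrows the notification from "all users" down to the people actually affected.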

After audit trails, the next task is to limit the exposure of data. The best way to do this is to limit what data you collect and how long it is stored. By introducing some kind of archival/erasure mechanism in your software right from the beginning, you can document this for your users. If a data breach happens, it can only affect data that was actually in the targeted system at that point. Many systems continue to collect all data but never clean it up, even when the data becomes obsolete. GDPR encourages you to clearly define data lifecycles and to document them. You should also restrict access to data to only what's really necessary. This is especially true for sensitive data.
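A data lifecycle can start as something very small: a table of retention periods and a scheduled job that applies it. The following sketch assumes records are plain dictionaries with `id`, `kind`, and `collected_at` keys; the record kinds and the retention periods are made-up examples, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy: how long each kind of record may live.
RETENTION = {
    "support_ticket": timedelta(days=365),
    "marketing_lead": timedelta(days=90),
}


def purge(records, now=None):
    """Split records into those still within retention and the ids to erase."""
    now = now or datetime.now(timezone.utc)
    kept, purged_ids = [], set()
    for record in records:
        age = now - record["collected_at"]
        if age > RETENTION[record["kind"]]:
            purged_ids.add(record["id"])
        else:
            kept.append(record)
    return kept, purged_ids
```

The same `RETENTION` table can double as documentation: it is exactly the "how long do you keep my data" answer that data subjects are entitled to.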

I already mentioned that you should have sufficient protection mechanisms for data that's resting in a database or file system and that's moving through a network, especially to other parties. Encryption is efficient but it has its weak spots. The most powerful encryption technology encrypts early, secures your keys, and decrypts late. Unfortunately, this is a complex and costly solution to implement. Cloud services, on the other end of the spectrum, often let you cheaply and simply encrypt an entire database with a checkbox or offer to manage keys and encryption for you. While easy, these mechanisms have weak spots. You just have to find what works for you, based on risks and sensitivity of data.

It's worth mentioning that anonymization and pseudonymization mechanisms can help you with things like test data or analysis data. Anonymization basically removes all identifiable information by deleting or masking fields. Pseudonymization replaces identifiable information with pseudonyms, which typically keeps identities separate in the data. Both practices, however, are difficult to do right and may not offer perfect means to help your GDPR compatibility. Still, these are valuable tools in your toolkit.
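The difference between the two is easy to show in a few lines. This is a rough sketch using only the Python standard library; the function names are hypothetical, and a keyed hash (HMAC) is just one of several ways to produce pseudonyms. Note that as long as the key exists, pseudonymized data can be linked back to a person and should still be treated as personal data.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a stable pseudonym.

    The same input and key always give the same pseudonym, so records
    for one person stay linked together without exposing who it is.
    """
    mac = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:16]


def anonymize(record: dict, identifying_fields) -> dict:
    """Mask identifying fields outright; nothing links the rows afterwards."""
    return {
        key: ("***" if key in identifying_fields else value)
        for key, value in record.items()
    }
```

For test data, something like `pseudonymize(row["email"], key)` keeps joins between tables working, while `anonymize(row, {"name", "email"})` is the blunter tool for data that only needs aggregate analysis. Whether the remaining fields still identify someone in combination is a separate question that masking alone does not answer.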

You might want to revisit your logging standards and guidelines. It's easiest if you can make sure that your logs do not contain PII — otherwise, they become PII registries as well with all the implications. Some logs are attached to individuals already: access logs and audit logs for example. But don’t pollute operational debug logs by writing user IDs, names, or similar values in them. It's good to clearly separate logs that can be linked to individuals from logs that cannot be so linked but contain general system information.
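Discipline in what you write to the log is the real fix, but a safety net helps. Here is a minimal sketch using the standard `logging` module: a filter that masks anything that looks like an e-mail address before a record is emitted. The class name `RedactEmails` and the deliberately simple regular expression are my own assumptions, and a real system would need patterns for its own identifiers.

```python
import logging
import re

# Intentionally simple pattern; it will not catch every valid address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class RedactEmails(logging.Filter):
    """Mask e-mail addresses in the final, formatted log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so values passed as arguments are redacted as well.
        record.msg = EMAIL.sub("<redacted-email>", record.getMessage())
        record.args = ()
        return True  # keep the record, just with the PII removed


def make_debug_handler() -> logging.Handler:
    # Attaching the filter to the handler covers every logger that uses it.
    handler = logging.StreamHandler()
    handler.addFilter(RedactEmails())
    return handler
```

With this in place, a careless `log.debug("login failed for %s", user_email)` ends up in the operational log as `login failed for <redacted-email>`.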

Documenting your systems

GDPR loves documentation. One important point of the regulation is to be able to demonstrate compliance. You can do that by showing certificates, which in turn benefit from documenting your systems. You can build up, if necessary, from the level of documentation you are used to providing, which may vary based on many factors. But there is some additional documentation that would be useful to have from now on. Here's a brief checklist:

  • Document the personal data in your system.
  • Document lifecycles of collected data.
  • Document all parties that process the data.
  • Document your basis for collecting the data.
  • Inform data subjects of their rights and explain how they can exercise them.

You should document what data you collect, your purpose, how long you store it, and your basis for processing this data. You can best do this with a combination of document types. You might (and should) already have a general-policy document that explains the rules, but I've seen many software designers start to create a grid of data columns in which they can state GDPR classification. Basically, you use whatever documentation you already typically use as your domain model but then expand it with privacy information. These documents would then serve as the basis of the data-protection policy document that you would offer to your users. The first step in guaranteeing users’ rights is to understand which of their information your system collects.
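The grid of data columns mentioned above can live next to the code as plain data. The sketch below is one possible shape, with hypothetical class, table, and column names; the three sensitivity levels are my own simplification of the classes of data described earlier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Sensitivity(Enum):
    NONE = "none"          # not personal data
    PERSONAL = "personal"  # identifies or is linked to a person
    SPECIAL = "special"    # extra-sensitive, e.g. health data


@dataclass(frozen=True)
class ColumnPolicy:
    table: str
    column: str
    sensitivity: Sensitivity
    purpose: str = ""
    legal_basis: str = ""
    retention_days: Optional[int] = None


# Made-up inventory for illustration.
INVENTORY = [
    ColumnPolicy("customer", "email", Sensitivity.PERSONAL, "order confirmations", "contract", 730),
    ColumnPolicy("customer", "allergies", Sensitivity.SPECIAL, "meal planning", "consent", 365),
    ColumnPolicy("customer", "phone", Sensitivity.PERSONAL),
    ColumnPolicy("product", "sku", Sensitivity.NONE),
]


def personal_columns(inventory):
    return [c for c in inventory if c.sensitivity is not Sensitivity.NONE]


def undocumented(inventory):
    """Personal columns still missing a purpose, legal basis, or retention period."""
    return [
        f"{c.table}.{c.column}"
        for c in personal_columns(inventory)
        if not (c.purpose and c.legal_basis and c.retention_days)
    ]
```

A check like `undocumented(INVENTORY)` can run in a build pipeline, so a new personal-data column cannot ship without someone writing down why it exists.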

Another interesting facet is how the data moves over networks and which parties can access/process it. For this, you could create a data flow diagram that documents parties, tiers, and even protocols. In case of a data breach, you can also use this diagram to quickly understand and limit the exposure.

Additionally, if you wish, you might want to document what controls are used to protect the data and achieve a sufficient level of privacy.

Supporting expanded user rights

Most of the rights of user/data subjects already exist in the EU’s established Data Protection Directive. Here's a simple list of how they look under GDPR:

  • right of access,
  • right of rectification,
  • right to erasure,
  • right to restrict/object to processing,
  • right of data portability, and
  • right to be notified of data breaches.

Before you start designing all kinds of crazy APIs and systems to support that, it's worth noting that GDPR does not require that these be automated, real-time operations. In fact, you only need to respond to a request within one month, roughly 30 days. Responding that there is no basis to erase or export the data (because of laws, ongoing contracts, etc.) can be a legitimate response. When a person does make a request, it's very important that you identify them properly so you do not create a new data breach by manipulating or exporting some other person’s data. That response window allows you to scrape or erase the data in many ways, even handling it in systems that are simply not possible to integrate.

That being said, if your organization already has a concept of digital identity for customers/users/data subjects and you provide some self-services, it's a good idea to attach these identity rights to that self-service’s user interface. The more documentation you can cover with automated processes, the cheaper it becomes. Also, users are happier with real-time access, as opposed to making a request that takes 30 days to process.

It might be wise to prepare for data erasure and export functions when designing any new piece of software. You can achieve erasure by deleting information, but it’s easier to partially overwrite it, effectively anonymizing it. The format for data export does not seem to matter right now, but it might be a good idea to plan for it, even if your domain would not contain any GDPR user interfaces.
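Both functions can be quite small if the data model is prepared for them. The following is a sketch under assumptions: records are plain dictionaries, the list of identifying fields is hypothetical, and JSON is chosen for the export only because it is widely readable, not because the regulation asks for it.

```python
import json

# Hypothetical list of fields that identify a person in this data model.
IDENTIFYING = ("subject_id", "name", "email", "phone")


def erase_by_overwrite(record: dict) -> dict:
    """Return a copy with identifying fields blanked, keeping the rest.

    The row survives (so order totals, statistics, and foreign keys stay
    intact), but it no longer says whose row it was.
    """
    return {
        key: (None if key in IDENTIFYING else value)
        for key, value in record.items()
    }


def export_subject(records, subject_id: str) -> str:
    """Collect one person's records into a portable JSON document."""
    own = [r for r in records if r["subject_id"] == subject_id]
    return json.dumps(own, indent=2, sort_keys=True)
```

Whether the overwritten row is truly anonymous still depends on what remains in it, so it is worth checking the leftover fields for rare combinations of values.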

The most important thing to get right is the one-stop shop where data subjects can exercise their rights, leading to a process that identifies and validates the request and then to mechanisms that erase or export that data.
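As a rough illustration of that process, the sketch below models a request that must pass identity verification before anything happens, can be refused when there is a reason to keep the data, and carries a due date. The class, the status strings, and the fixed 30-day window are simplifying assumptions of mine; check the exact deadline rules that apply to you.

```python
from dataclasses import dataclass
from datetime import date, timedelta

RESPONSE_WINDOW = timedelta(days=30)  # simplification of the response deadline


@dataclass
class SubjectRequest:
    subject_id: str
    kind: str                      # "export" or "erase"
    received: date
    identity_verified: bool = False

    @property
    def due(self) -> date:
        return self.received + RESPONSE_WINDOW


def decide(request: SubjectRequest, must_retain: bool) -> str:
    """Decide what to do with a request before touching any data."""
    if not request.identity_verified:
        # Acting for the wrong person would itself be a data breach.
        return "pending: verify identity"
    if request.kind == "erase" and must_retain:
        # e.g. a law or an ongoing contract requires keeping the data.
        return "refused: retention obligation"
    return "approved"
```

Only an "approved" result should ever reach the erase or export mechanisms, and the `due` date gives the people handling manual requests something to track.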

Data processor or not?

When you work on a software project under GDPR responsibilities, you need to answer an important question: Do you intend to be a data processor or not? By default, you would wish not to be a data processor, since being one makes you liable to any sanctions. To avoid GDPR liability, simply make sure that you will not and cannot access any personally identifiable data in any circumstances. You also need to make sure that this is clearly stated in any contracts. It might be difficult to avoid PII processing, since personal data may hide in badly written log files, test environments, and any emergency patches to your production environment. But if you wish to avoid liability, you need to resolve all this. Protect yourself from the data; protect the data from you.

Another path to take is to embrace the status of data processor. This lets you have free access to personal data, as long as you document the activity, there’s a valid basis for processing, and access happens within defined boundaries. This makes you clearly liable and responsible, so you have to be mindful of any sanctions. But this is the route to take if you absolutely need access to PII databases.

Most software projects do not require exposure to actual PII data, and this is definitely the recommended path to take — but it might require new skills and tools.

GDPR myth-busting

No, a data subject may not erase debts or a criminal record by exercising user rights.

No, a data subject is not supposed to get everything connected to their identity when they request an export of their data. Only directly collected information is to be included. The spirit of the regulation is transparency and the option to change service providers.

User rights are not automatically exercised. It's important to first check user identity and the validity of a request before manipulating data. This might be difficult if a database does not carry unique and secure identifiers. There might be many valid reasons to refuse a request.

No, GDPR does not require you to encrypt everything with 2048-bit keys at rest and in transit. Controls to protect the data are only needed to mitigate risks until they become acceptable, and risks are different for every system and situation.

No, GDPR does not stop you from collecting and processing user data. Take care of transparency, data security, and legal basis, and do not collect more data than you need and you should be fine.

No, having a data breach does not automatically subject you to 20 million euros in fines. It might, but if you have read this article and followed the advice, you should already be well on your way to lower potential penalties. Fixing what you can right now, from the start, and having a plan for the rest goes a long way. GDPR lists about a dozen questions that will be used to decide the scale of the penalties if that time comes.

No, sanctions are not the main reason to start doing something about data privacy. This regulation is in place because more and more data is collected every day, and more and more data breaches are happening. Having a data breach in your system can cost you much more than fees and sanctions: it can cost you your customers’ trust. But the sanctions are a good way to motivate companies to spend a few euros on security and privacy when building and purchasing software and information systems.

No, cloud services are not a big no-no with GDPR. In fact, they might actually be more in sync with privacy-by-default requirements than many traditional data centres. Of course, moving confidential data to third parties makes things mildly more complicated when it comes to contracts and documentation.

No, GDPR does not require you to audit and log everything and have tools for intrusion detection and test-data management. Such tools might make life easier when used successfully, but the core of your approach should be risk-based assessment and suitable controls.


There is not a lot of time before GDPR becomes enforceable. Already, any new systems should be built as GDPR compatible. This is not a precise definition, especially as interpretations continue to evolve and many of them will only be clarified as data breaches, audits, and sanctions occur in the future. My hope is that this article may help you to avoid being among the first to pay the price.

I think that the upcoming data-protection regulations are strongly positive and surprisingly ambitious. Finally, you have reasons to put more emphasis on security and privacy. As you improve transparency and privacy and provide more control, users of your systems will trust you more and many of them will probably happily allow you to use their information for new kinds of analysis and marketing that no one is even aware of right now. There will probably be some turmoil in summer 2018 as some of the rules will be clarified, but I believe GDPR will lead to more security and transparency in the long run and I'm all for it.

When you find yourself in the grey areas of the regulations, unsure what to do, common sense does go a long way. Is the source of confusion something you could document and honestly explain to your software solution’s users/data subjects without embarrassment or shame? If so, it's probably going to be okay. Think about worst-case scenarios like data breaches. If your database, snapshot copy, or Excel export should fall into the wrong hands and be published somewhere, how would you be able to find out exactly what was leaked, by whom, and which of your users need to be notified? Would explaining how the data was protected embarrass you? If not, you probably have done what is humanly possible, and this will probably help to mitigate the sanctions, if any. Do your best, put the rest on a roadmap. Build for a more secure world with more transparency. In the end, everybody will be happier.

About the Author

Arto Santala is a software architect at Solita Oy. He has more than 20 years of experience in crafting software solutions to enable tomorrow's world right now. His greatest passions include automating just about everything, and using agile methodologies to get the right things done just right.