What Do Data Scientists and Data Engineers Need to Know about GDPR?

Key Takeaways

  • The GDPR, effective from May 2018, fundamentally alters the way global organizations collect and manage their data.
  • Violating the regulation could result in fines of up to 4 percent of global revenue for your organization.
  • Key requirements of the GDPR revolve around managing the way data is collected, maintaining visibility into how that data is used, and enforcing restrictions on data use.
  • New tools, frameworks, and ways of thinking about data management are going to be required to pass the basic “GDPR test” and avoid violating the regulation.
  • Ultimately, the GDPR presents an opportunity to modernize your data management strategy and empower your data science programs.

Data management is about to get a lot more difficult for global organizations, thanks to new privacy regulations in the EU. These regulations will have far-reaching effects on any program that uses data at scale.

Specifically, the EU’s General Data Protection Regulation (GDPR) takes effect on 25 May 2018. With fines of up to 4 percent of global revenue, the GDPR is the most consequential data regulation anywhere in the world.

While the GDPR theoretically applies only to EU “personal data,” the regulation defines this as any data that could lead to the identification of a person. In practice, this means that nearly any EU data at scale is likely to fall under the purview of the GDPR, as study after study has shown that enough data of nearly any kind can shed light on the individuals who generated it. To pick just one example, a group of researchers recently demonstrated that aggregate cellular location data (such as the number of users covered by a cellular tower at a specific timestamp), which in theory sounds like it should be anonymous, can actually identify an individual’s trajectory with 73 percent to 91 percent accuracy.

So, what should data scientists and data engineers - the people responsible for collecting, organizing and using data within organizations - think about GDPR? How should they design their data strategies?

What You Need to Know About the GDPR

From a high level, the GDPR creates legal requirements that fall into three basic buckets: collection management, data visibility, and restrictions on data use.

Collection management involves managing the data that organizations gather and the ways it’s collected. The GDPR mandates that privacy be prioritized at the time of data collection, for example, with many of the restrictions on data being tied to the consent of the data subject - meaning the data subject will frequently have to understand and agree to whatever your organization wants to do with their data. This means that when an EU subject generates data that your organization collects, understanding exactly why your organization is collecting that data and tagging that data at the time of collection is going to be paramount. (More on this below.)

Data visibility means understanding what data your organization has and how long you’ve had it for (and plan on keeping it). By now, most organizations understand that data is “the new oil” and many are doing their best to collect as much data as possible. But most of those organizations don’t fully understand the data they have, or where they’re storing it, or its provenance once it’s been stored.

We at Immuta frequently come across this as a combination of compliance and IT architecture issues, with data silos, different teams and database administrators responsible for a wide variety of data, and no single source of truth. With GDPR requirements in place, this level of variation can’t be the norm. If a user requests that their data be deleted - often known as the “right to be forgotten” - your organization will have to know where their data is and then delete it. Examples of this type of visibility requirement abound within the GDPR.
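
As a rough illustration of why a single source of truth matters for deletion requests, the sketch below keeps one registry of every store that may hold personal data, so a subject can be located and erased everywhere in one pass. The names `DataStore`, `StoreRegistry`, `locate` and `erase` are hypothetical and the stores are in-memory stand-ins; a real system would wrap actual databases and file stores.

```python
from typing import Dict, List


class DataStore:
    """Hypothetical stand-in for one silo (a database, a file share, ...)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: Dict[str, List[dict]] = {}  # subject_id -> records

    def add(self, subject_id: str, record: dict) -> None:
        self.rows.setdefault(subject_id, []).append(record)

    def delete_subject(self, subject_id: str) -> int:
        """Remove every record for a subject; return how many were removed."""
        return len(self.rows.pop(subject_id, []))


class StoreRegistry:
    """Single index of every store that may hold personal data."""

    def __init__(self) -> None:
        self.stores: List[DataStore] = []

    def register(self, store: DataStore) -> None:
        self.stores.append(store)

    def locate(self, subject_id: str) -> List[str]:
        """Names of the stores that currently hold data for the subject."""
        return [s.name for s in self.stores if subject_id in s.rows]

    def erase(self, subject_id: str) -> Dict[str, int]:
        """Delete the subject everywhere; report deletions per store."""
        return {s.name: s.delete_subject(subject_id) for s in self.stores}
```

The per-store counts returned by `erase` double as evidence that the request was carried out across every registered silo, not just the one a given team happens to own.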

Lastly, and perhaps most importantly, restrictions on data use mean that your organization is going to have to enforce purpose-based restrictions on data. If a user only consents to “marketing” as a purpose for their data, for example, you’re going to need a way to track and enforce that restriction, all the way from collection to use. The GDPR lists six broad purposes that are acceptable, and each organization is going to refine its own list of the purposes its legal department deems compliant with the GDPR. This guide, for example, suggests having only 15 purposes for data across an entire organization. Tracking these purposes - and proving that data with a given purpose restriction has only ever been used for that purpose - is going to be one of the most important and difficult requirements of the GDPR in practice.

How to Pass the Basic GDPR Test

Imagine the GDPR is already upon us, with data protection authorities across the EU enforcing the regulation.

At the time of writing, it’s clear that many of the GDPR requirements are still relatively ambiguous, and regulators will be fine-tuning them over the coming months, if not years. This means that, in all likelihood, regulators won’t be expecting 100 percent compliance with the GDPR the day it goes into effect. Rather, they’ll be expecting a reasonable, serious effort to comply with the regulation’s major tenets.

So what does passing the basic “GDPR test” mean?

It means organizations will need to be able to demonstrate compliance with each of the buckets outlined above: understanding the data they have, when they collected it, and the purposes they’ve used it for. They will also need to be able to prove all of this to regulators or data subjects, who may be entitled to reports illustrating compliance with these requirements.

From a practical standpoint, this means that, at a minimum, every piece of data that’s collected by your organization is going to need new required metadata with the fields “purpose” and “time of collection”. This way, you’ll be able to track and enforce restrictions on its use, and you’ll be able to enforce policies on data retention, meaning you’ll delete or attempt to anonymize that data after a certain period of time.
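
A minimal sketch of what that metadata could look like in code, assuming a hypothetical `TaggedRecord` wrapper and helper functions (`collect`, `read_for`, `is_expired`, `purge_expired`) that are illustrative names and not part of any real library. It shows the two checks described above: refusing a use that doesn't match the recorded purpose, and dropping data once a retention period has passed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, FrozenSet, List


@dataclass(frozen=True)
class TaggedRecord:
    payload: Any
    purposes: FrozenSet[str]  # purposes the data subject consented to
    collected_at: datetime    # time of collection (timezone-aware)


def collect(payload: Any, purposes: List[str], now: datetime) -> TaggedRecord:
    """Tag data at the moment of collection; untagged data is rejected."""
    if not purposes:
        raise ValueError("refusing to collect data without a stated purpose")
    return TaggedRecord(payload, frozenset(purposes), now)


def read_for(record: TaggedRecord, purpose: str) -> Any:
    """Return the payload only if the requested purpose was consented to."""
    if purpose not in record.purposes:
        raise PermissionError(f"purpose {purpose!r} not consented to")
    return record.payload


def is_expired(record: TaggedRecord, retention: timedelta, now: datetime) -> bool:
    """True once the record has been held for the full retention period."""
    return now - record.collected_at >= retention


def purge_expired(records: List[TaggedRecord], retention: timedelta,
                  now: datetime) -> List[TaggedRecord]:
    """Keep only the records still inside the retention window."""
    return [r for r in records if not is_expired(r, retention, now)]
```

For example, a record collected with `collect(data, ["marketing"], now)` can be read with `read_for(record, "marketing")`, while `read_for(record, "analytics")` raises `PermissionError`. Passing `now` explicitly, instead of reading the clock inside each function, keeps the retention logic easy to test.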

If you can demonstrate that at every point from data collection to data usage and deletion, you understand exactly what data you have, how long you’ve had it (and plan on keeping it), and what purposes it’s been used for - and that each of these buckets is in keeping with the GDPR requirements - your data management program will likely pass the basic “GDPR test” with flying colors.

The GDPR Opportunity

All that being said, smart organizations will see GDPR as more than a new set of demands. Agile, data-driven organizations will see GDPR as a true opportunity to rethink the way they approach their entire framework for gathering and using data.

When we look at the tech giants of the world - think about Amazon or Google and others - their key differentiator lies in how calculated they are about the data they gather and use. This is not a post hoc operation, but one based on careful planning and engineering. Having the right data is what allows them to disrupt verticals from marketing to retail to grocery stores and more.

Indeed, academic literature has long demonstrated that good governance translates to better performance. The same can be said about data management programs. Better, longer lasting data-driven insights will require more deliberate thought and planning into how data is collected, and what data an organization has at its disposal.

In fact, if there’s one major opportunity presented by GDPR, it’s to finally give data scientists a centralized understanding of what data they can access and use. I constantly see that the title “data scientist” is, in practice, more akin to “data scavenger” - where a good deal of a data scientist’s time is spent simply trying to find the data they need, then to get access to it, then to transform it into the right state, and only then to use it.

This process leads to a huge amount of time wasted and potential lost. Data scientists aren’t hired to scavenge for data, or to create one-off, per-project solutions to gaps in their organization’s data strategy. Data scientists are there to turn data into insight. That’s what they are good at - and that’s why they’re frequently so expensive.

Creating a holistic data strategy, and a centralized place for data management across your organization, will finally allow data scientists to do what they’re the best at - and to help your company move faster, becoming more efficient and more adaptable in the process.

What Comes After the GDPR?

Beyond the immediate opportunity presented by GDPR lies an entirely new way of thinking about data - one that is going to become increasingly important as new regulations on data emerge. Indeed, from Turkey to China and elsewhere, data is becoming more and more regulated, meaning that data management is going to be one of the most important enablers for data-driven organizations and also one of their biggest challenges.

A few insights about the future of data management:

  • There’s no such thing as a data lake. Oftentimes when it comes to data management, an organization’s first instinct is to think that putting all its data in one place will solve every problem it has. When it comes to data lakes for processing purposes (with tools like Spark), this makes a lot of sense. But for governance and data discovery, data lakes frequently create huge problems, quickly turning into data ponds and then data swamps, as new data is added, new tools for data storage emerge and the underlying IT architecture evolves. Thinking you’ll solve your data management problems by centralizing where you store your data is a recipe for long-term trouble.
  • Diversity is your friend. Instead of attempting to standardize the way you store your organization’s data, which can be nearly impossible in large organizations, I recommend thinking about the long-term adaptability of your approach to data management. That is, assume that you’re going to have diversity across your storage systems and data science tools - indeed, this diversity is inevitable. Once you realize that standardizing where or how your data is stored is not your number one priority, you can move on to thinking about how to enforce policies on that data and what policies to support, which is the backbone of any data management strategy.
  • Audit. Audit. Audit. If you can’t audit, you can’t prove that your data management framework is working and you can’t demonstrate that to regulators. So ensuring there’s a centralized ability to audit and to create audit reports is going to be a key component of any data management strategy. And make sure to test your audit abilities before they’re needed. Organizations frequently think they’re collecting the right data for their audit needs, and all too commonly learn about log errors once it’s too late.
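
To make the audit point concrete, here is a small sketch of a centralized audit log that can produce a per-subject report and check its own entries for gaps. `AuditEvent`, `AuditLog`, `report_for` and `incomplete_events` are hypothetical names chosen for illustration, and the in-memory list stands in for whatever durable log storage an organization actually uses.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditEvent:
    at: datetime
    actor: str       # who touched the data
    subject_id: str  # whose data was touched
    action: str      # e.g. "read" or "delete"
    purpose: str     # the purpose the access was made under


class AuditLog:
    """Append-only, in-memory log; a real one would be durable and central."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def record(self, actor: str, subject_id: str, action: str, purpose: str,
               at: Optional[datetime] = None) -> AuditEvent:
        event = AuditEvent(at or datetime.now(timezone.utc),
                           actor, subject_id, action, purpose)
        self._events.append(event)
        return event

    def report_for(self, subject_id: str) -> List[dict]:
        """Every logged event for one data subject, as plain dictionaries."""
        return [asdict(e) for e in self._events if e.subject_id == subject_id]

    def incomplete_events(self) -> List[AuditEvent]:
        """Events with an empty field; any result means logging has a gap."""
        return [e for e in self._events
                if not all((e.actor, e.subject_id, e.action, e.purpose))]
```

Running a check like `incomplete_events()` on a schedule is one way to find logging gaps before a regulator or data subject asks for a report.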

There are, of course, many more key tenets to a future data management framework for GDPR. But the major takeaway for your organization should be that data management can no longer be an incidental component of your data strategy - in the IT department or otherwise. The increasing importance of data science across organizations, combined with the rise in regulations on data, means that organizations will need to prioritize data management more and more.

About the Author

Andrew Burt is Chief Privacy Officer & Legal Engineer at Immuta, the world’s leading data management platform for data science.
