Privacy Architecture for Data-Driven Innovation

Feb 01, 2020 20 min read

Key Takeaways

By the time you start a privacy program, you have already collected a lot of data and possibly even made some privacy mistakes.
Modern data-powered businesses operate on a bottom-up model while privacy needs some level of centralized top down guidance and consistency.
A data-driven privacy architecture includes data governance for data you collect, and data sharing models that use anonymization and other privacy-centric techniques.
You want tools that categorize data based on risk and apply techniques to protect data based on risk levels.
For privacy, you need to think of “data sharing” anytime data leaves your domain.

The privacy landscape

Laws and regulations - The world and the U.S.

The privacy legal landscape has gotten challenging and confusing.

Privacy reached a major legislative milestone when the GDPR took effect in May of 2018.

Europe’s General Data Protection Regulation (GDPR):

Created data protection requirements,
Gave Europeans strong individual control over the collection, use, and sharing of their personal information,
Expanded the scope of protected personal data,
Granted individuals the rights to access and delete their data, and
Defined what legitimate interests justify the use of consumer data.

In the U.S. as I write this, the states seem to be driving the dialogue as far as privacy law is concerned:

All 50 states have enacted data breach notification laws and many are now looking at far-reaching privacy laws.
The most important so far is the California Consumer Privacy Act (CCPA), which took effect January 1, 2020 and is already influencing businesses.
CCPA differs from the European Union’s GDPR in certain respects, but is similar in key areas.

Other prominent efforts include the Washington Privacy Act (WPA), which was introduced in January 2019 and has a framework that mirrors GDPR, and Illinois’ existing Biometric Information Privacy Act (BIPA), which is having a growing impact as an expanding array of companies begin collecting biometric data from consumers, including fingerprints, iris scans, and facial recognition information.

Privacy and ambiguity

The most fundamental question in privacy policy is: what information deserves or requires protection?

A key concept is "personal data" or "personally identifiable information" (PII), but definitions vary and can be vague. Getting this definition right is a key challenge.

There is no universal approach to PII in the United States or in the jurisdictions overseas that have adopted comprehensive privacy regulation.
One federal agency defines it as "information that can be used to distinguish or trace an individual’s identity".
This definition requires a case-by-case assessment of the specific risk that an individual can be identified.

Privacy and security

On one hand, privacy begins only when adequate security is in place. Without security, you cannot have any meaningful privacy.

On the other hand, sometimes the need for privacy could make security harder. We are starting to see this as privacy laws become more expansive. Often you need to access data to better understand attack vectors, which then help build tools that could help protect privacy as well.

Let’s look at an example.

The WHOIS system makes the contact information of domain name registrants publicly available.
WHOIS data has been an important tool for security and fraud prevention, and in tracking down bad guys on the Internet.
The broad scope of GDPR may have created problems in administering this vital tool.

Given the complicated and often conflicting interpretation of privacy laws, building privacy solutions has gotten to be difficult.

Additionally, privacy is hard to implement because of the dynamics around modern business:

Companies typically start with privacy after a period of growth; growth makes privacy possible (by funding it), and necessary (by collecting vast amounts of data). By the time you start a privacy program, you have already collected a lot of data and possibly even made some privacy mistakes. For many companies, a privacy program is the outcome of an external forcing function, like a privacy-sensitive enterprise customer or some legal mandate.
The decentralized nature of development and unstructured data makes it hard for a centralized discipline like privacy to move the needle. Modern data-powered businesses operate on a bottom-up model while privacy needs some level of centralized top down guidance and consistency.
For privacy advocates and in-house privacy professionals, privacy is a passion that combines art and science; for revenue-driven and growth-oriented product managers and engineers, privacy is like doing laundry or going on a diet. They often defer and delay for as long as possible. This creates a context gap.

I propose a data-driven privacy architecture that has two parts:

Data governance for data you collect
Data Sharing models that use anonymization and other privacy-centric techniques

A team that drives this work at Uber - and fast-moving companies like Uber - ideally operates in parallel to a more roadmap-driven privacy engineering that builds.

The eponymous privacy architecture team at Uber pushes these strategic cross-functional initiatives. We will now examine the work underlying them.

Data Collection and Governance

According to TechRepublic.com, 80% of the data in companies is unstructured. Because of its nature, unstructured data is more difficult to analyse than structured data and not easily searchable, all of which makes the work of protecting it much harder.

Research by Gartner (titled "Guidance for Addressing Risks with Unstructured Data") found that:

75% of respondents claimed their greatest challenges were:
- Identifying unstructured data locations.
- Getting users to conform to new technologies and processes to protect unstructured data.
Nearly 63% claimed that removing unstructured data after retention periods expired was a challenge.

In order to overcome this challenge, at Uber we follow a 3-step process to help drive our program:

Classifying data and setting protection standards (Planning)
Inventorying data (Execution)
Enforcing Data Privacy (Execution)

The first two steps help provide for the centralized consistency to help drive privacy enforcement at scale, and leveraging our data platform, metadata identification methods and ML-driven classification tools enables us to do so without unnecessarily slowing down our engineering teams.

Data Classification and Standards

Privacy programs can take two approaches:

Lockdown
Tooling, Training and Trust

The first model will require engineers and others to go through stringent controls to access data. While this may be practical in some companies, it can have unintended consequences, especially if applied without context.

The second model is a combination of 3 Ts:

Tools (encryption, multi-factor authentication, deletion APIs, etc.),
Training, and
Trust: building an overall culture that will honor user privacy.

The right approach is often a combination of both. You want to lock down some data that is extremely sensitive and if leaked or improperly accessed, could hurt your customers and their trust in your business. But for an overall program, you also want tools that categorize data based on risk and apply techniques to protect data based on risk levels.

This is why you need a data classification based on privacy risk so you can tailor your data protection accordingly.
An example of such a classification would be as follows:

Tier	Type	Example
1 (Very Sensitive)	Government ID	SSN, Driver’s license
2 (Sensitive)	Biographical data	Name, DOB, email
3 (Semi-sensitive)	Vehicle data	License plates
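
To make the idea of tailoring protection to risk concrete, here is a minimal Python sketch built on the example table above. The type keys and the protection names (such as restricted_access) are hypothetical placeholders for illustration, not a description of any real system; the only part taken from the table is the three-tier structure.

```python
from enum import IntEnum
from typing import Dict, FrozenSet


class Tier(IntEnum):
    """Privacy risk tiers from the example table (1 is the most sensitive)."""
    VERY_SENSITIVE = 1
    SENSITIVE = 2
    SEMI_SENSITIVE = 3


# Hypothetical mapping from a data type to its tier, following the table.
TIER_BY_TYPE: Dict[str, Tier] = {
    "government_id": Tier.VERY_SENSITIVE,   # SSN, driver's license
    "biographical": Tier.SENSITIVE,         # name, DOB, email
    "vehicle": Tier.SEMI_SENSITIVE,         # license plates
}

# Hypothetical protections per tier; a real program would negotiate these.
PROTECTIONS: Dict[Tier, FrozenSet[str]] = {
    Tier.VERY_SENSITIVE: frozenset({"encrypt_at_rest", "restricted_access", "audit_reads"}),
    Tier.SENSITIVE: frozenset({"encrypt_at_rest", "audit_reads"}),
    Tier.SEMI_SENSITIVE: frozenset({"encrypt_at_rest"}),
}


def protections_for(data_type: str) -> FrozenSet[str]:
    """Return the protections for a data type.

    Unclassified types fall back to the most restrictive tier, so that
    missing classification never results in weaker protection.
    """
    tier = TIER_BY_TYPE.get(data_type, Tier.VERY_SENSITIVE)
    return PROTECTIONS[tier]
```

Defaulting unknown types to the strictest tier is a design choice in this sketch: it keeps the lockdown behavior for data nobody has classified yet, while classified data gets protection proportional to its risk.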

How the process works

A data classification process is the outcome of negotiations and discovery between several teams. The key players involved:

Privacy Legal
Technical Privacy
Security
Engineering
Product Management
Data Scientists

Although privacy is widely seen as a legal area, it would be a huge mistake to only let the legal team drive this classification process.

The lawyers may take an overly defensive approach by applying the law without business context or may give engineers too much of a free hand believing in their own ability to win in court.

Either approach is suboptimal. A better approach is as follows:

Here is how I (as the leader of the technical privacy initiative) have run the Data Classification process at 3 companies, all of which had very different cultures.

First, I worked with privacy legal to get a sense of how they would classify data. Concurrently, I worked with engineering, product management and data science to understand what data they needed for operations and analysis.

Second, I would produce an initial classification based on the regulation-focused input from legal, and real-world utilization from other stakeholders.

Third, I would open the draft of this classification to comments so that all the key stakeholders can comment at once on areas of disagreement.

Organizational alignment on Data Classification

This step is critical since you will uncover areas where key stakeholders may disagree on how privacy-critical a particular data element is.

You may find, for example, that engineering believes that a data ID is not privacy-sensitive since it is internal to the company and will not identify a customer externally.

Legal may disagree since it may be possible to join this ID to information that will personally identify a customer, like an email address.

You can bridge this gap by applying authorization (AuthZ) techniques so that only the appropriate teams can access and join these tables, while others cannot.

This is why it is vital to classify data, so that you can apply data protection at scale with context.

Data Inventory

In modern businesses, classifying data is not sufficient.

In order to protect your data, you need to apply that classification to data that you collect going forward and data you have already collected.

This process is called Data Inventory, which as you may remember, is step #2 in our Data Governance program. This process, whereby you will physically tag data to reflect its classification, is technical, and will require an infrastructural component.

The good news is that you can use the data classification exercise we just completed as a starting point for the data tagging process.

Creating the data tags

These classification tags should serve the following purposes:

Compatible with and supportive of external regulatory (e.g. GDPR, CCPA) requirements
Applicable to all data in these states: data at rest, data in transit, and data in use.
Tag definitions should be canonical, unambiguous, and machine-readable. They can be used either individually (e.g. for individual database column or API parameter), or as a group, represented as comma separated values, where applicable (e.g. for entire dataset, or API).
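
As an illustration of what "canonical, unambiguous, and machine-readable" could look like in practice, here is a small Python sketch. The tag vocabulary and the column names are assumptions made up for the example; the comma-separated group format is the one described above.

```python
from typing import Dict, FrozenSet

# Hypothetical canonical tag vocabulary; a real one comes out of the
# data classification exercise.
ALLOWED_TAGS: FrozenSet[str] = frozenset(
    {"government_id", "biographical", "vehicle", "non_personal"}
)


def parse_tags(raw: str) -> FrozenSet[str]:
    """Parse a comma-separated tag string and reject unknown tags."""
    tags = frozenset(t.strip().lower() for t in raw.split(",") if t.strip())
    unknown = tags - ALLOWED_TAGS
    if unknown:
        raise ValueError("unknown tags: %s" % sorted(unknown))
    return tags


def dataset_tags(column_tags: Dict[str, str]) -> str:
    """Roll column-level tags up into one dataset-level tag string.

    The result is sorted so that the same set of tags always produces
    the same string, which keeps the dataset tag canonical.
    """
    merged = set()
    for raw in column_tags.values():
        merged |= parse_tags(raw)
    return ",".join(sorted(merged))
```

For example, a dataset whose columns are tagged "government_id", "Biographical" and "vehicle, biographical" rolls up to the single dataset tag "biographical,government_id,vehicle".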

How tagging will affect your data

Data Inventory converts your entire data store into a queryable database.

Let’s imagine that your data store is a combination of the following:

Databases (structured data),
AWS S3 bucket,
3rd party SaaS Apps

Data Inventory is the process of tagging all the data using the tags we just discussed.

At that point, your physical stored data will reflect your data classification.

Timing Data Inventory right

A peer of mine from my Netflix days - where I led privacy engineering - had a pithy saying. "When it comes to protecting data, the best time to start is yesterday, the second best time is today."

Think of data coming into your company as a funnel. Once data enters your system, users will copy it, infer other data from it, etc. As that data moves deeper into your system from left to right, it grows in size, just like the funnel.

So, the right time to inventory your data is as early in the funnel as possible. To stick with the funnel analogy, you want to be as far left as possible. This will help you apply the most suitable data protection techniques before engineers start using the data.

The Data Inventory infrastructure

At Uber, the Data Inventory architecture has 5 key components:

Crawl various datastores,
Discover datasets,
Make those datasets and corresponding metadata available,
Provide extensibility to add new metadata in self-service fashion, and
Support the categorization of personal data.

As you can probably infer, the architecture needs to be equipped to detect data that is dissipated across various systems in your company, map it to the attendant metadata and then apply the classification tags.

I have argued in real life that steps 1 through 4 are required by data science teams anyway for improving data discovery and quality.

So, the privacy and data science teams have been able to split the cost and pool our abilities to build the best data inventory possible.

The Data Inventory workflow

In this diagram, we highlight how you can consolidate your data and metadata in one location (in step 1 of the process). You can do that using crawlers, event listeners, etc. You may also want to provide a UI portal for engineers to manually enter their data schemas.

In the middle column of this diagram, we are highlighting:

The ability to manually categorize data for engineers and data scientists who know their data
This manual classification is also a way to train ML-based models that will apply data classification tags to your data
This combination of manual and ML-based data inventory will help reduce our dependence on manual classification

Then, there is a separate decider process to ensure a final check for the most privacy-centric data classification.
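
One way to read the decider step is as a rule that reconciles the manual tag and the ML-suggested tag by keeping whichever is more protective. The Python sketch below assumes exactly that, plus the tier numbering from the earlier classification example (1 is the most sensitive); the function name and the fallback for untagged data are hypothetical.

```python
from typing import Optional

MOST_SENSITIVE_TIER = 1  # assumption: tier 1 is the most sensitive


def decide_tier(manual_tier: Optional[int], model_tier: Optional[int]) -> int:
    """Pick the final tier from a manual tag and an ML-suggested tag.

    Assumed rule: when the two disagree, keep the more sensitive tier
    (the lower number). Data with no tag at all is treated as the most
    sensitive tier until someone reviews it.
    """
    candidates = [t for t in (manual_tier, model_tier) if t is not None]
    if not candidates:
        return MOST_SENSITIVE_TIER
    return min(candidates)
```

Under this rule a column that an engineer tagged as tier 3 but the model flagged as tier 1 ends up in tier 1, which errs on the side of privacy and surfaces the disagreement for review.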

Note that all of this work happens before engineers and data scientists can use the ingested data. In this way, we place data inventory at the leftmost point possible in the data life cycle at Uber.

Data Sharing

Just as organizations collect vast amounts of data, they share data for product innovation and to grow their customer engagement. Data sharing has led to the growth of the internet, tools for consumers and profits for businesses. As we all know, however, data sharing is replete with risks.

Case study

Strava, the fitness tracking app, uses GPS to record its users' runs, bike rides, and other workouts. It also makes many of these routes available for public view on its Global Heatmap, which shows where people around the world go running and cycling. This cool feature ended up creating privacy headaches for Strava and the US military.

Share my data; lose my privacy?

US service members had been recording their runs around the compounds of their military bases. That information made it on the Strava heatmap and unknowingly revealed their locations. Twitter users figured out they could identify outlines and activity patterns on US military bases in places like Syria, Afghanistan, and Somalia.

The biggest potential threat was not the base locations themselves, which are public, but what went on in and around the bases. The map showed activity patterns within and around the base, giving away supply and patrol routes, as well as the precise location of facilities like mess halls and living quarters. Further, users could get location-specific data, allowing them to link map activity to specific profiles.

The result: You could

find out which service members were in which locations at any given point in time.
identify the soldiers even when differential privacy had been used to blur out the start and end of their runs.

Lessons

Here are some key lessons:

Data sharing is not just about sharing data between one company and another
Any time data you have collected from someone else leaves your company, you are essentially sharing that information with outside entities.
In the age of social media, publicly available information, data on the dark web obtained by way of breaches and ML-based tools to combine data, identifying people has become easier than ever.

So, for privacy, you need to think of "data sharing" anytime data leaves your domain. This is true when you are a company collecting user data.

A Privacy Architecture for Data Sharing

At Uber, just as we introduce data governance early in the funnel during the data ingestion phase, we provide for privacy controls early in the data sharing process.

We have always had a Privacy Impact Assessment (PIA) process run out of privacy legal to conduct reviews and avoid privacy risk.

To this, we have added Technical Privacy Consulting (TPC), an informal engineer-to-engineer resource aimed at prevention and automation early in the requirements and design process, so as to avoid overdependence on a review that occurs once the feature is well into development.

In the context of data sharing, TPC offers engineers several techniques whereby we can share data in a privacy-sensitive fashion.

Data-Sharing with privacy in mind

When we share data with third parties, the following represent areas of due diligence:

Will the data be secure (at rest and in transit)?
How granular must shared data be?
Will 3rd

Valid use cases for data sharing: an Uber point of view

City planners and regulators need access to data from transport providers to inform and enforce policy decisions. For example, cities need to understand the impact of transport services on traffic, parking, emissions, and labor practices. Common applications for shared data include collecting per-vehicle fees, enforcing parking rules for shared bikes or scooters, and responding to service or safety issues.

Typical data requested from transport providers may include driver’s license numbers or vehicle identification numbers that uniquely identify individuals. Requested data often includes individual trip data containing precise geolocations or quasi-identifiers that could identify individuals when correlated with publicly available data.

Drop-off geolocations are collected to analyze the impact on parking and traffic flow. Trip telemetry is being collected to detect when vehicles enter prohibited areas in order to issue enforcement citations. Vehicle or driver license numbers are collected to verify that all vehicles are permitted to operate within a city.

Anonymize before sharing

Even when valid, data sharing requires a level of anonymization so as to ensure privacy. And, in my experience, there is no one technique that offers a privacy guarantee, so the recommendation is to deploy all available techniques while recognizing the expected tension between precision and privacy.

Precision vs. retention

As a general principle, when you share data, you want to ensure that the precision of data has an inverse correlation to its retention. To that end, I recommended:

Delete unique identifiers and precise times and geolocations after 90 days.
Delete coarsened times and geolocations after 2 years.
Internal, indefinitely retained data should be at least 5-anonymous or ε=1.6 differentially private.
Bulk shared data should be at least 100-anonymous or ε=4.6 differentially private.
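
The first two recommendations above can be expressed as a simple retention schedule. The Python sketch below is one hedged way to encode them; the field names are hypothetical, and "2 years" is approximated as 730 days.

```python
from datetime import timedelta
from typing import FrozenSet

PRECISE_TTL = timedelta(days=90)        # unique IDs, precise times and locations
COARSE_TTL = timedelta(days=2 * 365)    # coarsened times and locations

# Hypothetical field names, used only for illustration.
PRECISE_FIELDS = frozenset({"unique_id", "precise_time", "precise_location"})
COARSE_FIELDS = frozenset({"coarse_time", "coarse_location"})


def fields_to_keep(age: timedelta) -> FrozenSet[str]:
    """Return which record-level fields may still be retained at this age.

    The more precise a field is, the sooner it is deleted. Past the
    coarse limit no record-level fields remain; only aggregates that
    meet the k-anonymity or differential privacy bar would be kept.
    """
    if age <= PRECISE_TTL:
        return PRECISE_FIELDS | COARSE_FIELDS
    if age <= COARSE_TTL:
        return COARSE_FIELDS
    return frozenset()
```

A scheduled job could call fields_to_keep with each record's age and null out everything that is no longer in the returned set.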

IDs: Obfuscation and replacement

A lot of businesses create internal identifiers to interconnect data, and while there is no set best practice on sharing these IDs with vendors, I recommend avoiding that unless absolutely necessary.

If you do share these IDs, the following best practices are recommended:

Ask your partners to dispose of identifiers provided by you and optionally replace with uniquely generated identifiers.
If you are not confident of your partners following through on anonymizing these IDs, you will want to hash your IDs before sharing. The goal of hashing is to prevent your IDs from appearing in the wild, and to ensure that each vendor has different IDs so that vendors cannot connect the dots with each other, and if there was a breach, you could identify the vendor.
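
A minimal sketch of per-vendor ID hashing follows. Note the swapped-in technique: the text above only says "hash", while this example uses a keyed hash (HMAC-SHA256 from the Python standard library) with a separate secret per vendor, which is one common way to get different IDs per vendor. The vendor names, keys, and function names are assumptions for illustration.

```python
import hashlib
import hmac
from typing import Dict, Optional


def vendor_scoped_id(internal_id: str, vendor_key: bytes) -> str:
    """Derive the ID that a specific vendor sees from an internal ID.

    Each vendor has its own secret key, so the same internal ID maps to
    a different value for every vendor and vendors cannot join on it.
    """
    digest = hmac.new(vendor_key, internal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def find_leaking_vendor(
    leaked_id: str, internal_id: str, vendor_keys: Dict[str, bytes]
) -> Optional[str]:
    """Given an ID found in the wild, work out which vendor it was issued to."""
    for vendor, key in vendor_keys.items():
        if hmac.compare_digest(vendor_scoped_id(internal_id, key), leaked_id):
            return vendor
    return None
```

A keyed hash is used here instead of a plain hash because internal IDs are often short or sequential, and an unkeyed hash of such values can be reversed by simply hashing every candidate. The per-vendor keys must stay inside your company for either property to hold.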

Lose your unique identity

One of the key identifying patterns for an Uber ride would be a unique start and end time, especially if for that period of time, the number of rides is relatively small. This is especially the case if external data can be leveraged to identify a user, as with the aforementioned Strava example.

As a result, Uber works with several partners to round start and end times of trips, and pick up and drop off locations to the end/start of the street.

That way, if you have two trips:

Trip A, started at 12:21 PM and ended at 1:02 PM
Trip B, started at 12:19 PM and ended at 1:09 PM

They are both shared as having started at 12:30 PM and ended at 1:00 PM, since each time is rounded to the nearest half hour. This again reduces identifiability of the trip.
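
The rounding in this example is consistent with rounding each timestamp to the nearest 30 minutes. Here is a minimal Python sketch of that operation; the 30-minute default and the function name are assumptions drawn from the example, not a documented policy.

```python
from datetime import datetime, timedelta


def round_to_nearest(ts: datetime, minutes: int = 30) -> datetime:
    """Round a timestamp to the nearest multiple of `minutes` past midnight.

    Exact halfway points round up (for example 12:15 becomes 12:30).
    """
    step = timedelta(minutes=minutes)
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # Adding half a step before the floor division rounds to the nearest bucket.
    buckets = (elapsed + step / 2) // step
    return midnight + buckets * step
```

Applied to the two trips, 12:21 PM and 12:19 PM both become 12:30 PM, and 1:02 PM and 1:09 PM both become 1:00 PM, so the trips can no longer be told apart by their times.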

Is anonymization enough before you share?

A lot of studies and recent reports have shown that anonymized data alone is not sufficient to protect privacy, and a recent tweet showed how even rounding off times and pickup/dropoff locations may not be sufficient.

In the image above, the grey box represents 3 decimal points of precision for a drop-off location. It's easy to see which building within a hospital a person went to.
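
To get a feel for how small that box is, the following Python sketch estimates the ground size of a grid cell for a given number of GPS decimal points. It relies on the standard approximation that one degree of latitude is about 111 km; the Boston-area latitude used in the note below is only an example.

```python
import math
from typing import Tuple

METERS_PER_DEGREE_LAT = 111_320  # approximate length of one degree of latitude


def cell_size_m(decimals: int, latitude_deg: float) -> Tuple[float, float]:
    """Approximate (north-south, east-west) size in meters of one grid cell.

    A cell is the area covered by coordinates that round to the same
    value at the given number of decimal points. The east-west size
    shrinks with the cosine of the latitude.
    """
    step_deg = 10 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE_LAT
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west
```

At a latitude of about 42.36 degrees, 3 decimal points correspond to a cell of roughly 111 m by 82 m, which is on the scale of a single large building.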

For temporal aggregation, most policies group by 15 minutes. It is exceedingly rare that two trips ever fall within this box at the origin and destination in the same 15 minute window, so even imprecise data has the same identifiability as raw trip data.
In the event that we share data with ineffective anonymization, and then the data has to be produced publicly or is breached, then you might end up with a lot of personal data leaked to the internet.
This is why at Uber we lean heavily on k-anonymization to address this issue. While sharing data, we drop the trip clusters with fewer than K uniques. The industry best practice is ~5, which means a K-Anonymity of 5.
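
Dropping small clusters can be sketched in a few lines of Python. This example assumes each trip has already been reduced to a tuple of coarsened quasi-identifiers (for instance a rounded start time and rounded pickup and drop-off cells); the tuple layout and function name are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def drop_small_clusters(trips: Sequence[Hashable], k: int = 5) -> List[Hashable]:
    """Keep only trips whose coarsened key is shared by at least k trips.

    Trips in clusters smaller than k are suppressed entirely, so every
    trip that remains is indistinguishable from at least k - 1 others.
    """
    counts = Counter(trips)
    return [trip for trip in trips if counts[trip] >= k]
```

The cost of this approach is data loss: the rarer the trip pattern, the more likely it is to be suppressed, which is exactly the precision versus privacy tension described earlier.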

We will discuss shortly how to use k-anonymization effectively, and how we can "zoom out" the data for more anonymity, i.e. how coarser spatial (longer distances) and coarser temporal (longer time periods) resolutions provide better privacy.

K-Anonymity in Boston

The privacy architecture team at Uber, in order to educate internal stakeholders, did a K-Anonymity study on 40,000 Uber rides in Boston.

The study yielded interesting results on the conflict between data precision and privacy. For context, in each of the tables, the vertical dimension is the number of GPS decimal points in the location, while the horizontal dimension represents the K-Anonymity value.

When you share 0 decimal points in location GPS

Decimal points	2	5	10	50	100	1000
0	100%	100%	100%	100%	100%	100%
1	100%	100%	100%	100%	100%	100%
2	100%	100%	100%	99.9%	99.9%	99.1%
3	99.9%	99.8%	99.5%	97.6%	95.3%	87.9%
4	97.4%	93.2%	89.3%	73.1%	59.3%	17.3%
5	68.4%	35.5%	18.3%	2.5%	1.5%	0.9%

Sharing 0 decimal points in location data coarsens the data to a degree that we were able to establish a K-Anonymity of 1000.

K-Anonymity with 4/5 decimal points

Next, we went in the opposite direction by examining the privacy impact of 4 or 5 decimal points.

Decimal points	2	5	10	50	100	1000
0	100%	100%	100%	100%	100%	100%
1	100%	100%	100%	100%	100%	100%
2	100%	100%	100%	99.9%	99.9%	99.1%
3	99.9%	99.8%	99.5%	97.6%	95.3%	87.9%
4	97.4%	93.2%	89.3%	73.1%	59.3%	17.3%
5	68.4%	35.5%	18.3%	2.5%	1.5%	0.9%

We observed that the percentage of users for a specific K-Anonymity value drops significantly when we go from 4 to 5 decimal points.

For example, when you display 5 decimal points, 68.4% of users have a K-Anonymity of 2, i.e. for 68.4% of users, you can find at least 1 other user with the same trip values.

If we shave off one decimal point and offer GPS locations with 4 decimals, we can find a similar ride for 97.4% of users, so you have a K-Anonymity of 2 for 97.4% of users.
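
A table like the one above can be produced by rounding coordinates to a given number of decimal points and counting how many records share each rounded value. The Python sketch below shows one simplified way to do this; it keys each record on a single (latitude, longitude) point, which is an assumption for brevity and not necessarily how the Boston study defined "same trip values".

```python
from collections import Counter
from typing import Sequence, Tuple


def share_k_anonymous(
    points: Sequence[Tuple[float, float]], decimals: int, k: int
) -> float:
    """Fraction of records that share their rounded location with at
    least k - 1 other records, i.e. that are k-anonymous at this precision."""
    if not points:
        return 0.0
    keys = [(round(lat, decimals), round(lon, decimals)) for lat, lon in points]
    counts = Counter(keys)
    covered = sum(1 for key in keys if counts[key] >= k)
    return covered / len(keys)
```

Running this for each combination of decimal points and K value over your own data gives you the same kind of grid, which lets you pick the finest precision that still meets your anonymity target.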

K-Anonymity of 5

Decimal points	2	5	10	50	100	1000
0	100%	100%	100%	100%	100%	100%
1	100%	100%	100%	100%	100%	100%
2	100%	100%	100%	99.9%	99.9%	99.1%
3	99.9%	99.8%	99.5%	97.6%	95.3%	87.9%
4	97.4%	93.2%	89.3%	73.1%	59.3%	17.3%
5	68.4%	35.5%	18.3%	2.5%	1.5%	0.9%

For the industry best practice of 5, we noticed a K-Anonymity of 5 for all users in our sample if we used 0-2 decimal points in the location data. When we started adding additional decimal points, we saw a drop in the number of users with K-Anonymity of 5.

This was just a sample cohort, so results will vary on a case by case basis. The larger point is that you’ll need to deploy several techniques and model your data to measure anonymity before sharing so as to preserve data and user privacy.

Closing thoughts

This article lays out how you build an internal data governance architecture early in the ingestion phase, which enables you to allocate risk to data and identify such data in your systems. You can then protect the data accordingly.

The second half of this article lays out various techniques to share data in a privacy-conscious manner.

All in all, this article reflects learnings and innovations to automate privacy at the architectural and strategic level for your business, and is applicable to all companies that use customer data to operate and grow.

About the Author

Nishant Bhajaria has started and led, over the last decade, privacy/trust programs at some of America’s largest and leading technology companies. As a cross-functional leader and advocate for privacy, he has led teams that build privacy tools and also helped shape the policy and PR message around privacy. Currently, he is building and leading a privacy architecture org at Uber that provides strategic and technical privacy governance across the company in partnership with several key partners and stakeholders.
