Secrets at Planet-Scale: Engineering the Internal Google KMS


Summary

Anvita Pandit covers the design choices and strategies that Google chose in order to build a highly reliable, highly scalable service. She talks about continued maintenance pain points and suggested practices for an internal key management service.

Bio

Anvita Pandit is a Software Engineer at Google, where she works on cryptographic key management and data protection. She has previously presented a talk at DEF CON 2019 on the fallibilities of genetics testing algorithms, and talks at Lesbians Who Tech in NYC and Montreal on digital currencies and society.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Pandit: Welcome to "Secrets at Planet-Scale." I'm a software engineer at Google, and my name is Anvita Pandit. I'll be presenting a system here today that was built by a team of software engineers, cryptographers, and site reliability engineers over many years. Some of them may be in the audience today. I'll go into the challenges we faced as we designed, built, deployed, and currently run the internal Google KMS, or key management system, which is a foundational building block for the security and privacy architecture at Google and handles tens of millions of queries per second. In 2017, we also released a white paper on encryption at rest; there's a link at the end of the slide deck. I'll try to hold 10 minutes at the end for questions and answers.

A little bit about myself: I have been developing on the data protection team, which is part of the security and privacy organization at Google, for two years now, both of them on the internal KMS team. I worked on a system for early detection of production-breaking config changes and another one for reducing the latency for customers to see their config changes in production, and then also some tech debt here and there. Before that, I was an engineering resident at Google. I did a rotation on the speech team on speech to text. If you're familiar with the little microphone on your Android device, the button that turns your spoken words into text, I was part of a team that was on that pipeline. You might have caught me earlier this year at DEF CON 2019 in Las Vegas, where I co-presented a talk at the Biohacking Village with my friend Anne Kim, that's her Twitter handle (@herroannekim), a talk called "Hacking Race."

To head off any potential confusion, despite the similarity in names, the system I'm talking about today is not the Google Cloud KMS that you might have interacted with. The clients of the Cloud KMS are external customers such as yourself, whereas the clients of the internal KMS are only internal Google systems, including the Cloud KMS. Our service underlies the authentication and storage systems at Google; those storage systems build on us, and they are in turn used by familiar names like Photos, and Maps, and Gmail. If you hear me say Keystore, we're just going to have to erase that from your memory at the end of this presentation.

We have a big agenda here today. I'll be covering the situations in which you might need to use a key management system and then walk through the essential product features. We'll go through the encrypted storage use case, which is our major use case, and then cover our system specs and architectural decisions. We'll walk through a major outage that we had back in 2014, what happened, what we did to fix it. Then we'll do some more architecture stuff and then cover one of the challenges that we faced and how we overcame it, which was safe key rotation in a distributed environment.

Why Use a KMS?

You may not use us directly, but you'll definitely recognize when our service is down. Gmail is a highly available service with over 1 billion monthly active users. On January 24th, 2014, Gmail was down for between 15 minutes and two hours, depending on where you were in the world. Gmail, like many other Google services, uses our KMS to store cryptographic keys that encrypt and decrypt user data.

During this talk, we'll learn why and how this outage happened. By the way, we were not the cause of all the Google Docs outages earlier in the summer. Just the one. In cryptography, a key, also known as key material, is a series of bytes that is used to encrypt and decrypt data. Keys are usually most effective when they're kept secret. Why do you need a KMS at all? The core motivation is that often your code needs secrets, and there aren't any good alternatives to using a KMS. Secrets are, for example, database passwords, third-party API tokens, OAuth tokens, encryption and decryption keys for data encrypted at rest in storage, and signing and verifying keys: any secret material that needs to be accessed programmatically and has to be stored somewhere your binaries can retrieve it.

What are your potential options for storing the secrets? You could store them directly in your code repository. You could do this. It would work, but it kind of sucks from a security point of view. For example, who here has accidentally committed their password to their GitHub repository? You're not alone. It's actually really common for people to commit passwords, API tokens, and AWS credentials to their GitHub repositories. In fact, when I gave a practice presentation for this talk last week, a friend of mine revealed to me that they had dropped their AWS credential in their GitHub repo a couple of years ago, and they only found out when they got a $40,000 credit card charge in the mail. Thankfully, Amazon dropped that for them, but pretty scary stuff. By the way, don't do this. There are mirrors and crawlers all over GitHub, so even after you remove the information from your repo, it still might exist in those mirrors, not to mention in the original commit history. You need to change your password if that happens to you - a key takeaway from this talk.

Even if your team manages to avoid publishing your secret keys online for all of perpetuity, at some point it just becomes reckless to expose the keys to everyone with access to the code base. You will run into a nefarious individual, a disgruntled employee, or even someone who just leaves their laptop unlocked at a bar, and Google, with around 60,000 software engineers, is way past that point.

Another option you could use is to store secrets directly on production hard drives to be manually deployed, but this kind of sucks from both the security and the operational point of view. Anyone with physical access to those hard drives, as well as malicious programs running on production machines, can retrieve your keys in the clear, not to mention you still have the issue of deploying those secrets out to production. Our available options, in the code repository or on hard drives, are not very exciting. Instead, we let our users store secrets in a managed way, and we allow for audited and authenticated access.

Centralized Key Management

Centralizing your key management solves key problems for everybody. Users can import their secrets to us or ask us to generate keys on their behalf. Then our KMS securely stores those secrets, encrypted with our own service master key, before putting them in durable storage. During the users' service operations, their keys are just one RPC call away. Centralizing that kind of service gives two really nice benefits. It lets you separate out the key handling code, and it also offers separation of trust. When you separate key handling code, you reduce the size of the code base that needs to handle sensitive information like keys. You've probably heard of "don't roll your own crypto." We take it a step further. As for separation of trust, that reduces the number of systems that need to be secured.

For example, take the case of a user that needs to store its database passwords with us. During development and testing, they don't need to worry about where their secret keys are or if they've been stored securely, and in production, they can ask us for their production password and we give it to them. It can be used and stay only in RAM, where it is mostly safe, and they're good to go. They don't need to take great precautions as a regular user on what kind of machine they're running on or who has access to the persistent storage on those machines. It excludes lower-layer storage systems from the trust boundary.
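To make that flow concrete, here's a minimal sketch of the idea, not Google's implementation: a toy KMS that only ever persists customer secrets after wrapping them under its own service master key, and hands them back on request so they live only in the caller's RAM. The names (TinyKMS, import_secret, get_secret) are hypothetical, and it assumes the Python cryptography package for AES-GCM.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

class TinyKMS:
    """Toy stand-in for the internal KMS; not Google's implementation."""

    def __init__(self):
        # The service master key. In the real system this key is itself
        # protected by higher levels of the key hierarchy and is never kept
        # in clear text on persistent storage.
        self._master_key = AESGCM.generate_key(bit_length=256)
        self._store = {}  # name -> (nonce, ciphertext); durable-storage stand-in

    def import_secret(self, name: str, secret: bytes) -> None:
        nonce = os.urandom(12)
        wrapped = AESGCM(self._master_key).encrypt(nonce, secret, name.encode())
        self._store[name] = (nonce, wrapped)  # only wrapped bytes are persisted

    def get_secret(self, name: str, caller: str) -> bytes:
        # ACL and audit checks elided here; see the sketches that follow.
        nonce, wrapped = self._store[name]
        return AESGCM(self._master_key).decrypt(nonce, wrapped, name.encode())

kms = TinyKMS()
kms.import_secret("billing-db-password", b"hunter2")
assert kms.get_secret("billing-db-password", caller="billing-service") == b"hunter2"
```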

Of course, there are some downsides to centralizing your keys. You've created this big honeypot of really juicy secrets for any attacker to come and try to steal. We believe that we've worked towards creating a situation where even a highly privileged and knowledgeable attacker could not unilaterally do damage to those keys, which we'll learn about further when we go into the root of trust slide in a bit.

Now that we understand the benefits of the KMS, we'll go into the most important product features. We form a single central chokepoint for access control, auditing, and logging. That allows us to provide safe defaults to customers with minimal security awareness required on their part. I'll go over the two most basic product features, the most important ones you'll need to have, which is access controls and auditing.

Access control lists, also known as ACLs, determine who is allowed to use the key, and then, separately, who is allowed to make updates to the key configuration. ACLs can be a list of humans or service identities or predefined groups of the same. Since our keys are under centralized management, we can make strong guarantees on who has access to those keys, down to the list of names, which you can't do if your keys are scattered all around the codebase. The identities here are managed by an internal authentication system separate from us. It has a white paper out, known as ALTS; there's a link. The KMS is able to observe the identity of our RPC issuers and then use that to enforce the ACL.
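As an illustration of the two separate ACLs, and not the actual configuration format, here's a hedged sketch where the caller identity is just a string; at Google the identity would come from ALTS, and the field names here are made up.

```python
from dataclasses import dataclass, field

@dataclass
class KeyConfig:
    key_name: str
    use_acl: set = field(default_factory=set)     # who may wrap/unwrap with the key
    update_acl: set = field(default_factory=set)  # who may change this configuration

def check_use(config: KeyConfig, caller: str) -> None:
    # `caller` stands in for an authenticated RPC identity.
    if caller not in config.use_acl:
        raise PermissionError(f"{caller} is not allowed to use {config.key_name}")

cfg = KeyConfig("gmail-storage-kek",
                use_acl={"gmail-storage-prod"},
                update_acl={"gmail-storage-oncall"})
check_use(cfg, "gmail-storage-prod")    # allowed
# check_use(cfg, "random-batch-job")    # would raise PermissionError
```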

It's pretty straightforward so far, but it can get tricky. For example, sometimes you'll need to make a sweeping change to the security posture at your company, such as adding another hoop to jump through for access, the way GDPR introduced the concept of geographic region for some people. With thousands of users and no centralized management of keys, the work involved can quickly spiral.

For the second feature, auditing, you need to be able to answer: who touched my keys, and what did they look like? The KMS is tied into another internal system that provides binary verification and binary provenance. Binary verification gives you the ability to check whether someone who's contacting you is running unreviewed or unsubmitted code. With that kind of ability, our users can then make additional enforcements like, "I only want verified binaries contacting this key," and that gets enforced in addition to the ACL. That ensures unreviewed code doesn't contact privileged keys, whether maliciously or by accident. The KMS also provides logging of key accesses: client details, server details, geographic zones. You'll want everything required to debug a cryptographic request except the secret key material. This can be surprisingly difficult in a distributed system where you might have a separate logging service and a separate key service.
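A rough sketch of what such an access-log record might contain, everything needed to debug a request except the key bytes; the log_access helper and its fields are illustrative, not the real logging schema.

```python
import json
import time

def log_access(caller: str, key_name: str, op: str,
               verified_binary: bool, zone: str) -> None:
    record = {
        "ts": time.time(),
        "caller": caller,
        "key": key_name,              # the key *name*, never the key material
        "op": op,                     # e.g. "wrap" or "unwrap"
        "verified_binary": verified_binary,
        "zone": zone,
    }
    print(json.dumps(record))         # stand-in for a real logging service

log_access("gmail-storage-prod", "gmail-storage-kek", "unwrap",
           verified_binary=True, zone="us-east1-b")
```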

I'm going to talk about what our secrets look like and the encrypted storage use case. There are two kinds of secrets that we store. The first kind is given to us by customers and returned to them, so they leave our service boundary. API tokens and database passwords fall into this category. The second kind is most heavily used by the storage case. They're called key-encrypting keys or master keys, and they never leave our service boundary because they're only used to encrypt other keys which are passed in. I'm going to further explain the storage use case.

To store data, the storage system subdivides it into chunks across multiple machines. A unique data encryption key, or data key, is generated for each chunk using Google's common cryptographic library. That partitioning of data means the blast radius of a compromise of a single data key is limited to just that data chunk. Each chunk is encrypted with its data key before being distributed and replicated in storage. Then, the data keys are wrapped with the storage system's master key or key encryption key - the KEK here - with a call to our KMS. The wrapped data key is stored next to the encrypted chunk while it's being replicated and persisted, along with some authenticated metadata on who can retrieve it. That's done mainly for performance, so that a key unwrapping operation is very efficient and the number of keys we need to centrally manage is much smaller.
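Here is a minimal envelope-encryption sketch of that storage flow, assuming the Python cryptography package: a fresh data key per chunk, the chunk encrypted locally, and only the small data key sent to the KMS to be wrapped under the storage system's KEK. The helper names are illustrative, not Google's APIs.

```python
import os
from dataclasses import dataclass
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

KEK = AESGCM.generate_key(bit_length=256)   # held only inside the KMS

def wrap_data_key(dek: bytes) -> bytes:     # the KMS-side operation
    nonce = os.urandom(12)
    return nonce + AESGCM(KEK).encrypt(nonce, dek, b"kek-wrap")

def unwrap_data_key(wrapped: bytes) -> bytes:
    nonce, ciphertext = wrapped[:12], wrapped[12:]
    return AESGCM(KEK).decrypt(nonce, ciphertext, b"kek-wrap")

@dataclass
class EncryptedChunk:
    nonce: bytes
    ciphertext: bytes
    wrapped_dek: bytes   # stored next to the chunk and replicated with it

def encrypt_chunk(chunk: bytes) -> EncryptedChunk:
    dek = AESGCM.generate_key(bit_length=256)   # unique per chunk
    nonce = os.urandom(12)
    ciphertext = AESGCM(dek).encrypt(nonce, chunk, b"chunk")
    return EncryptedChunk(nonce, ciphertext, wrap_data_key(dek))

def decrypt_chunk(enc: EncryptedChunk) -> bytes:
    dek = unwrap_data_key(enc.wrapped_dek)      # one RPC to the KMS
    return AESGCM(dek).decrypt(enc.nonce, enc.ciphertext, b"chunk")

enc = encrypt_chunk(b"user data goes here")
assert decrypt_chunk(enc) == b"user data goes here"
```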

Google’s Root of Trust

Now that we understand what our keys look like, we're going to do a tutorial on where we sit in Google's root of trust, or security hierarchy. There's a hierarchy of key management systems at Google that parallels the encryption key hierarchy. I'll be focusing on the availability and performance of the system here in green, which is a direct dependency for the storage systems. Our main security goal with this somewhat extensive hierarchy is to avoid keeping secret material in clear text on persistent storage. It's really hard to ensure that your secret material is safe when it's on a hard drive somewhere. There are just so many ways to snoop the key bytes, from stealing the hard drive to side-channel attacks, and all of this is just multiplied when you have a very large distributed system.

We learned about how the storage systems create data keys and use them to encrypt chunks. Those storage systems keep their master keys in the KMS, that's the green box, along with a bunch of passwords from other services. The KMS exists on tens of thousands of production machines globally. The KMS master key, shown here in red, is used to encrypt all keys before they're put into durable storage; that red key is stored in another service called the root KMS. The root KMS is much smaller than the KMS, both in number of keys and in number of jobs. It runs only on dedicated and hardened machines in each data center. The root KMS has its own master key, shown here as the gold key. That gold key is stored in a third service known as the root KMS master key distributor.

The root KMS master key distributor uses a gossiping protocol to distribute and hold the root KMS master key only in the RAM of dedicated machines all around the world. To guard against all instances of the distributor dying simultaneously, we also keep a couple of copies of that root KMS master key in clear text on hard drives kept in physical safes. That's the only point in the chain where keys are unencrypted in non-volatile storage, and we have a very thorough policy in place securing any sort of access to those keys. It is backed up on secure hardware devices stored in safes in locked-down locations, in three physically separated, Google-controlled locations around the world. Fewer than 20 Google employees have access to those safes at any given time. There must be at least two of them present at a time, and there's a physical audit trail.

Here, each succeeding level in the hierarchy manages fewer and fewer keys, but it can't be turtles all the way down. At the root of the hierarchy, you end up with a handful of keys, and since we've managed to avoid durable storage until the very root of the hierarchy, we can orchestrate a thorough and auditable process for any action that touches those keys.

Design Requirements

Now that we understand where our service fits in the security environment, we'll start diving deeper into our system requirements and the service architecture. I'll run through the critical design requirements we needed to satisfy and how they influenced our design decisions. Although we are not a Google cloud service, we support cloud services and so we're driven by availability and latency metrics that are cloud-wide. Most notably, the expectation that major cloud services have five nines of global availability. Five nines allows for at most one error in 100,000 requests, or approximately just over five minutes of downtime a year. Expectations differ by region, so something smaller than global, for example, a zone, approximately a cell in a data center, would only have three nines.
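As a quick sanity check on those numbers, the downtime budgets work out roughly like this:

```python
# Back-of-the-envelope arithmetic for the availability figures quoted above.
minutes_per_year = 365.25 * 24 * 60
for label, availability in [("three nines", 0.999), ("five nines", 0.99999)]:
    downtime = (1 - availability) * minutes_per_year
    print(f"{label}: about {downtime:.1f} minutes of downtime per year")
# three nines: about 526.0 minutes of downtime per year (~8.8 hours)
# five nines: about 5.3 minutes of downtime per year
```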

Availability is actually our most important system requirement, as you probably realized. When our system is down, users like you can't access your data at Google. That places a floor on how available our service has to be.

As for latency, as a point of comparison, human noticeable latency is about 100 milliseconds. The encryption and decryption ops can be in the critical path for handling user requests. Additionally, since we're pretty deep down in the serving stack, a higher level operation like a file open might involve multiple KMS ops.

Scalability, not much more to say. As for security, of course, keys have to be safe, but we'll also cover an additional requirement, which was safe key rotation.

Key Wrapping and Stateless Serving

To encrypt everything, we needed a design for key management at scale, with high availability and without sacrificing security. How are we going to do it? We decided not to be an encryption and decryption service. The KMS only operates on small blobs of a uniform size, that is, keys, and that means we can place really strong speed and reliability guarantees on when requests finish. We decided not to be a traditional database, and so we only hold a small number of keys. I'll outline two of our features, key wrapping and stateless serving.

For key wrapping, instead of managing all the data keys, of which there are millions and millions, we only manage the customer master keys, or key-encryption keys, and let customers handle the data keys themselves. The customers can change the data keys very quickly. They can make them very fine-grained, for example per file, whereas we change the customer master keys pretty slowly. Reducing the number of keys we centrally manage improves our availability but then requires more trust in the client, because their data keys can be lost or corrupted or stolen without our service being able to prevent it.

An insight here is that since our keys are changing slowly, cryptographic key bytes are not mutable state at RPC time. Clients can't make changes to their KMS keys at RPC time. When we couple this with slow updates of the key material and key wrapping, we get a stateless server, which allows for trivial scaling. That's because the server doesn't have to coordinate with any other instances or mutate state directly. Key wrapping allowed us to have our cake and eat it, too. Since the KMS holds only a few thousand keys and their associated metadata, it can hold them all pretty easily in RAM, and that allows us to easily meet our latency requirements.
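A sketch of the stateless-serving idea, with hypothetical names: each replica loads an immutable keyset into RAM at startup, request handling never mutates it, and identical replicas can be added without any coordination. This again assumes the Python cryptography package for the wrap and unwrap operations.

```python
import os
from types import MappingProxyType
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

class KeysetServer:
    """Hypothetical stand-in for one stateless KMS replica."""

    def __init__(self, key_configs: dict):
        # Read-only view: key material is fixed at RPC time. Changes arrive
        # only through a fresh config load, never through client requests.
        self._keys = MappingProxyType(dict(key_configs))

    def wrap(self, key_name: str, data_key: bytes) -> bytes:
        nonce = os.urandom(12)
        return nonce + AESGCM(self._keys[key_name]).encrypt(nonce, data_key, None)

    def unwrap(self, key_name: str, wrapped: bytes) -> bytes:
        nonce, ciphertext = wrapped[:12], wrapped[12:]
        return AESGCM(self._keys[key_name]).decrypt(nonce, ciphertext, None)

# Any number of replicas can be started from the same config with no
# coordination between them, which is what makes scaling trivial.
config = {"storage-kek": AESGCM.generate_key(bit_length=256)}
replica_a, replica_b = KeysetServer(config), KeysetServer(config)
dek = os.urandom(32)
wrapped = replica_a.wrap("storage-kek", dek)
assert replica_b.unwrap("storage-kek", wrapped) == dek
```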

What Could Go Wrong?

It was all going so well. What went wrong that day? The root cause of the outage was a truncated configuration file, which caused our KMS servers to temporarily forget many encryption keys. This was a rare blip in our service, and although there was only a 15-minute downtime on our part, it led to cascading failures lasting over two hours in some cases, including Gmail's, where other services couldn't access their keys and hence couldn't decrypt user data or serve anything. We're going to walk through the step-by-step config push process that led to that outage.

During normal operation, the KMS serves from a local configuration file. Each instance has a locally stored file with keys, which it uses to serve clients. Let's back up and see where that config came from.

Each team at Google is responsible for maintaining their own KMS key configuration. Our engineering and operations teams set guidelines, do reviews, and help out, but in the end, their configuration stands alone. All the KMS key configurations are stored in Google's monolithic version control repo, and any key material in those files is encrypted with our service master key. Then a regularly scheduled cron job automatically gathers all the teams' configuration files and merges them into a single file, which is validated, and, of course, teams like to iterate quickly, so that cron job also runs quickly.

Then a fast live data distribution system, in 2014, copies all the new config files to every single serving shard of the KMS in Google production within minutes, approximately five minutes in this case. Then on that fateful day five years ago, an inopportune preemption in the Borg job scheduler triggered a bug in our config merger job, which then produced a configuration file missing many keys, which was then dutifully copied to all of the KMS shards leading to sad clients and sad users.
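The talk doesn't spell out what the validation step checks, but as a hedged illustration, one guard against exactly this failure mode is to refuse to publish a merged config that has lost a large fraction of the keys currently being served:

```python
def validate_merged_config(new_keys: set, served_keys: set,
                           max_missing_fraction: float = 0.01) -> None:
    """Refuse to publish a merged config that drops too many served keys."""
    missing = served_keys - new_keys
    if served_keys and len(missing) / len(served_keys) > max_missing_fraction:
        raise ValueError(
            f"refusing to publish: {len(missing)} of {len(served_keys)} "
            f"currently served keys are missing from the merged config")

validate_merged_config({"a", "b", "c"}, {"a", "b", "c"})   # publishes fine
# validate_merged_config({"a"}, {"a", "b", "c"})           # would refuse to publish
```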

A centralized KMS, by nature, is always kind of a single point of failure. But we had become a startup dependency for many services and also a runtime dependency for many services, so this outage was pretty painful. We got a lot of Google-wide attention, and we learned that we could not fail globally. We made some changes. We introduced a slow and controlled rollout of binaries and configuration, and teams that need fast access to configuration updates, for example if their service isn't critical, can access a limited pool of servers that get those updates, but there are no more fast, everywhere-at-once config updates. Then we introduced regional failure isolation, so if one region is getting overloaded or unhealthy traffic, it won't spill over to other regions. Then also isolation by client type: we isolated the encrypted storage users to their own pool of Keystore servers so that they could not affect the more general pool of Keystore servers.

For our second takeaway from this talk, a service is only as reliable as the weakest of its dependencies. We realized we really had to minimize our dependencies, and we had to make sure all of our dependencies also had regional failure isolation. The resulting system had a number of characteristics that met and exceeded our performance requirements. We've effectively had no downtime since that Gmail outage in January 2014. The error rates are significantly lower. We now see dozens of errors per trillions of requests, measured over one-hour intervals. Basically, you can call us Tekashi because we have six nines. For latency, 99.9% of requests finish in under six milliseconds. Symmetric cryptography and holding keys in RAM help here. Our 50th percentile is 170 microseconds, so pretty happy with that. Scalability-wise, we have tens of millions of QPS served using tens of thousands of processor cores.

Safe Key Rotation

We've covered all the performance aspects, and now we're going to go into the security aspect that really impacts availability. That is key rotation. Once our core system was in place, we really wanted to enable users with best practices by default. Key rotation, being able to change your key regularly, is super important, but a lot of people don't do it because it's hard, time-consuming, and dangerous. What happens if you don't rotate your keys? Your keys can be compromised and leaked at any time. Additionally, flaws, or sometimes intentional backdoors, are discovered in cryptographic algorithms all the time. With such a flaw, access to the ciphertext alone can be enough to lose your data. Rotating keys limits your window of vulnerability if a key ever gets lost. Instead of an attacker being able to decrypt your user data from the beginning of time, they can only decrypt the data encrypted while that key version was in use.

Rotating keys also means there's a potential for permanent data loss. Basically, if the encrypting key for some ciphertext is lost, then you just cannot recover that ciphertext anymore. You might have discovered this the hard way if you ever lost the password to an old website. I know I can no longer access my Neopets account. It was a sad day finding that out. It's called crypto shredding, and you really don't want to do it by accident.

We had three goals for key rotation. One, we needed users to design their systems with rotation in mind. Rotating your keys should not be done for the first time in an emergency! That's the third takeaway for this talk. That is going to be a bad time for everyone. Rotating regularly exercises your associated code paths for rotation and also your operational procedures, which are both very important. The second goal was that we wanted multiple key versions to be no harder to use than a single key. That is, when you're rotating, you will have different versions of the key available at the same time; a key will actually be a keyset, which we'll go into in a moment. The third goal was that we wanted to make it pretty much impossible to lose your user data.

From the user's perspective, they can choose the frequency of rotation. Thirty days is most common. You can do 60 days, 90 days, or a couple of months. Users also choose the time to live of the ciphertext. The time to live is the period after generating a ciphertext during which the user still expects to be able to decrypt it. If you need a ciphertext to last longer than, say, 90 days, you need to re-encrypt it. Google internal systems like Bigtable will do this automatically as a result of normal operation. Given these parameters, our KMS guarantees a safety condition: all ciphertext produced within the time to live can be decrypted by a key in the KMS.
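That safety condition implies a simple counting argument for how many key versions a keyset has to retain. This is only an illustration of the arithmetic, not the exact scheme, which comes up in the next section.

```python
import math

def versions_to_retain(rotation_days: int, ttl_days: int) -> int:
    # The current primary, plus every version that was primary at some point
    # within the last `ttl_days`, so anything encrypted inside the TTL still
    # has its key available.
    return 1 + math.ceil(ttl_days / rotation_days)

print(versions_to_retain(rotation_days=30, ttl_days=90))   # -> 4
```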

For the second goal, we're pretty tightly integrated with Google's standard cryptographic libraries, which provide multiple key versions in a key. From a user's perspective, the timing of the rotation can be invisible. Additionally, each key version can be a different cipher. That's really useful for being able to upgrade your keys to a newer algorithm every few years, which is important for long-lived keys. Tink is one such cryptographic library; it's open source now.
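The same keyset-with-a-primary pattern can be seen in miniature with MultiFernet from the Python cryptography package, used here only because it's compact, not because it's what Google uses: the first key encrypts, every listed key can still decrypt, and rotate() re-encrypts old ciphertext under the current primary.

```python
from cryptography.fernet import Fernet, MultiFernet

v1, v2 = Fernet(Fernet.generate_key()), Fernet(Fernet.generate_key())

old_keyset = MultiFernet([v1])            # v1 is the primary
token = old_keyset.encrypt(b"user data")

new_keyset = MultiFernet([v2, v1])        # v2 promoted to primary, v1 still decrypts
assert new_keyset.decrypt(token) == b"user data"   # callers are unaffected
token = new_keyset.rotate(token)          # re-encrypt under v2 before v1 is dropped
```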

For the third goal, to make it hard to lose data, we devised a scheme for key rotation in a distributed system. What's shown here is a simple scheme that can be improved upon. We have keys V1 and V2, which are two different versions in a keyset. Time increases from left to right. Keys can be in three different statuses: active, primary, and scheduled for revocation. The important thing to remember is that only primary keys can be used to encrypt, but all three can be used to decrypt. Starting at T0, V1 is generated, and it is pushed out to all readers, all instances of the KMS, by the end of T0. It's active, so it's not being used to encrypt anything yet.

Then, in the next interval, T1, V1 is promoted to primary and can now be used to start encrypting data. The KMS guarantees that a key version is available to all readers before being promoted to primary. Transactional semantics are actually not required here. It allows for version skew transparently and safely, because readers who see the key at T0 and at T1 can safely interoperate with each other. At T6, V1 is demoted to SFR, scheduled for revocation.

Scheduled for revocation, or SFR, is a key status. We keep one key version beyond the expected lifetime of the key and alert if it's used. There's a trade-off here between holding our users to the contract and preventing an outage. The contract is "You weren't supposed to use this key beyond this time," but if our users don't have access to the key when they're trying to decrypt data, they're going to fail, and that's probably going to be customer-visible. It's a trade-off between security and the availability of our clients' systems. Over here, V2 was introduced to the keyset at T2 as an active key. It took over as primary at T3 and continues. Note that the generation and deletion of key versions here is completely separate from the serving system. I won't go into it in detail, but feel free to ask me about it afterwards.
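A small sketch of that version lifecycle, with illustrative names: only the primary encrypts, every retained status can still decrypt, and promotion doesn't need transactional semantics because a version is only promoted after it has already reached all readers.

```python
from enum import Enum

class Status(Enum):
    ACTIVE = "active"                  # pushed to all readers, not yet encrypting
    PRIMARY = "primary"                # the only version used to encrypt
    SCHEDULED_FOR_REVOCATION = "sfr"   # still decrypts, but use triggers an alert

class KeyVersion:
    def __init__(self, version: int, status: Status = Status.ACTIVE):
        self.version = version
        self.status = status

    def can_encrypt(self) -> bool:
        return self.status is Status.PRIMARY

    def can_decrypt(self) -> bool:
        return True    # every retained status may still decrypt

def promote(keyset: list, new_primary: int) -> None:
    # Safe without transactions: the new version was already ACTIVE on all
    # readers, so readers seeing the old or the new primary interoperate.
    for kv in keyset:
        if kv.status is Status.PRIMARY:
            kv.status = Status.ACTIVE              # old primary keeps decrypting
        if kv.version == new_primary:
            kv.status = Status.PRIMARY

def schedule_for_revocation(kv: KeyVersion) -> None:
    kv.status = Status.SCHEDULED_FOR_REVOCATION    # kept one period longer, alerted on use

keyset = [KeyVersion(1, Status.PRIMARY), KeyVersion(2)]   # T2: V2 introduced as active
promote(keyset, new_primary=2)                            # T3: V2 takes over as primary
schedule_for_revocation(keyset[0])                        # T6: V1 demoted to SFR
```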

To recap, we have this availability-versus-security trade-off both in rotation itself, where you need to rotate your keys but rotating introduces danger, and in keeping a key version around versus actually acting on the revocation. On the KMS side, we derive the number of key versions to use, and then we add, promote, demote, and remove the key versions as necessary.

Key Sensitivity Annotations

We made a recent push to start capturing in a structured format what would happen if our users' keys were to be compromised. This requires some analysis from user teams. Oftentimes, the keys were made in the distant past - our service is eight years old at this point - by someone who may or may not still be at Google, and everyone else on the team doesn't know what the key is used for anymore. Moreover, the systems often evolve over time, and their use of the key grows or becomes something different entirely, because people will see that their team already has a key and think, "I'll just repurpose it for this other thing I have." That introduces challenges because it creates a lot of risk for the team. If no one knows what the key is used for, then how do you know what to do when it gets compromised? That creates a lot of headaches for our threat analysis teams as well.

At Google, we don't believe in security by obscurity. We have teams assess the security properties of their keys, that is, determine what would happen if they were to be compromised, using the CIA triad. Not that CIA. The CIA triad is a standard model for information security, and it posits three properties of your security stance. Confidentiality means that your key hasn't been seen by anyone other than the authorized parties. Integrity means the key hasn't been altered except by authorized parties. Then availability means the key is there for your users when they need to access it. We work very closely with a team of users and security engineers to determine the correct consequences for a key. Sometimes it's not that easy to know what the key should be classified as. For example, we had a user email in a couple of weeks ago asking how they should label a key they had in our KMS which was controlling remote access to a large, physically movable machine in a machine shop. We learned about a lot of novel uses for our keys with this push.

Structured consequences let you set policy for each key based on its severity. For example, you could say only verifiably built programs can contact a key that leaks user data if it's lost. You don't want to unnecessarily impose a lot of restrictions on users who just want a testing and monitoring key or a key for a noncritical service, but you also need to ensure that keys that leak user data or that deny service to customers are locked down. Note that we currently surface these as recommendations, but in the glorious future we're going to make them defaults. This feature lends itself really well to both security and scalability of the service, because when you have 2 to 5 teams onboarding daily, it's way more efficient for these to be built into the use of your service than hashed out every time during privacy reviews.
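Illustratively, and not the real annotation schema, structured consequences could look like a small record per key from which policy recommendations are derived. The field names and policy rules here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class KeySensitivity:
    key_name: str
    confidentiality_loss: str   # e.g. "leaks user data", "leaks test data only"
    integrity_loss: str         # e.g. "attacker can forge wrapped data keys"
    availability_loss: str      # e.g. "denies service to Gmail users"

def recommended_policy(s: KeySensitivity) -> list:
    policy = []
    if "user data" in s.confidentiality_loss:
        policy.append("require verified binaries on the use ACL")
    if "denies service" in s.availability_loss:
        policy.append("restrict the update ACL to the owning on-call group")
    return policy

print(recommended_policy(KeySensitivity(
    "gmail-storage-kek",
    confidentiality_loss="leaks user data",
    integrity_loss="attacker can forge wrapped data keys",
    availability_loss="denies service to Gmail users")))
```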

Google KMS – Summary

To close off the talk: implementing encryption at scale required highly available key management. At Google scale, we're working towards five nines of availability. To achieve all these requirements, we have several strategies: best practices for change management, staged rollouts, minimizing dependencies, isolating by region and client type, combining immutable keys with wrapping to achieve scale, and a declarative API to rotate keys.

Questions and Answers

Participant 1: Outages are often caused by DNS and cert failures, either in the direct systems or in the infrastructure surrounding them, and those outages can reduce you below five nines. I'm curious how you tackle those issues.

Pandit: I wish I knew more about this topic. I know Google doesn't use DNS directly, but it has a Google DNS service on top. I assume we use that to sort of gloss over some potential DNS outages. I feel like we're so compartmentalized that I don't really know specifically more. I can look into it more and get back to you.

Participant 2: If keys keep expiring over time, does it mean that Google needs to re-encrypt all the data all the time, like my five-years-old email content?

Pandit: Yes. If your data is living longer than 30 days, it's getting re-encrypted regularly. A lot of the time to live of ciphertext is actually built in at multiple levels, so we also have time-to-lives built into the storage system itself, where a file will be stored with a "destroy after 30 days" tag.

Participant 3: I have two questions. One is, you said that the key store only stores relatively small blocks of data. I'm wondering, just to get a sense of how small that is…

Pandit: Less than one kilobyte. You had another question?

Participant 3: I was going to say that 30 days seems like a long time for a key to not be rotated, but I guess if there are a ton of keys, that's not really a problem because there's very limited data that that key could access.

Pandit: Right. Additionally, with the tiered structure, it's the key-encryption key that's getting rotated, whereas hundreds of millions of data encryption keys are being generated daily under it. Those keys are much more partitioned, each to a small chunk of data.

Participant 4: You mentioned that individual teams can manage their own keys. Then there's a chance that a key is being used by 20 applications within the team. How do you have the visibility that this particular key is used by these 20 applications, where one of them may need a longer duration and another may need a shorter duration? How do you manage that?

Pandit: The key bytes are not visible to anyone, including us, but the key configuration is, and the configuration includes the ACLs and whatever auditing information the client wants. The customers will send us config reviews where they can say, "I want to generate a key with this ACL. Does this look OK to you?" Then we would provide advice.

Participant 4: You mentioned that there's a keyset where you retire one key and the new key becomes the active, then primary, key. Does it mean that data that was encrypted by the key which got retired can still be decrypted by the new key? Or do you have to re-encrypt all the data before the old key goes over its time limit?

Pandit: The idea is that multiple key versions exist at the same time in the keyset, so that as one key version nears the 30-day deadline, the version that the customer used to encrypt the original material gets retired, and then the next key version becomes primary, so it'll be used to encrypt. At that point, if the customer asks to re-encrypt their key material, it'll use the new key version.

Participant 4: They re-encrypt with the new key?

Pandit: Yes, so then it'll continue to live past that re-encryption until that next key version is rotated out.

Participant 5: Do you actually re-encrypt user data, like my data in Google Docs, or you re-encrypt the data encryption key, the DEK, that is attached to the data? Because it's a significant difference in size. You may be speaking about the length of a key versus a big chunk of data. Do you actively re-encrypt user data?

Pandit: No. The user data isn't passed to us. That all stays in the storage system that manages it. They will generally create a new data encryption key at the time they want to re-encrypt the user data and then decrypt and re-encrypt with that new key. That never gets passed to our service. That would be a lot.

Participant 5: The user data is actually re-encrypted.

Pandit: Yes.

Participant 6: How do you generate the keys, by the way? Are hardware appliances used for generating them? How do you come up with really good cryptographic keys for encrypting data?

Pandit: I believe we use a randomly generated seed, per the NIST standard, taken from the Linux kernel randomness, which uses a couple of different sources of entropy. It's all there in the Tink docs.

Participant 7: I was super fascinated by the 20 employees traveling around the world with hard drives to put them in safes. We don't have that kind of infrastructure. You probably can't tell me much about how it actually works at Google, but for smaller organizations who are using off-the-shelf key stores, do you have approaches that sort of mimic that but are more appropriate on a smaller scale, that isn't the shortlist of bunkers under Swiss mountains or wherever it is that you store your hard drives?

Pandit: You could still go with the idea of a shortlist of people. Perhaps multiple safes, with a different group of people allowed to access each safe, and coordination required between the groups to do any sort of key rotation or new key generation action. I think that would be pretty good for a midsized organization.

Participant 8: Maybe what he's really asking is, is there some way, from outside of Google, to use your code in some way? I think you said that in your first slide.

Pandit: If you figure out a way, we're hiring. It seems like a good application to me.

Participant 8: Doesn't Google cloud externally use your key management system underneath it?

Pandit: Yes.

 


Recorded at:

Jan 14, 2020
