Reducing Risk of Credential Compromise @Netflix

Summary

Will Bengtson and Travis McPeak talk about Netflix Infrastructure Security.

Bio

Will Bengtson is senior security engineer at Netflix focused on security operations and tooling. Travis McPeak is a Senior Cloud Security Engineer at Netflix.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Bengtson: My name is Will Bengtson, and I'm here with Travis McPeak. We're both members of the security tools and operations team at Netflix. We're not like your typical sec ops team or your security operations team, in that we aren't necessarily just incident response for the entire security group. We do do incident response, but it mainly focuses around cloud security. Our team owns our AWS infrastructure and the security therein, all the way from the IAM levels up at the very AWS-centric levels. And today, we're here to talk to you about building a Netflix security pizza.

McPeak: Wait, we're going to make pizza?

Bengtson: Yes, man, pizza.

McPeak: All right, I got this. Let's do it. Let's make this pizza.

Bengtson: We're ready now. So today, we want to build the analogy of building security controls and layering them as if you were building a pizza from scratch. So over the last few years, we've built some very special ingredients at Netflix, and we'd like to actually introduce you to those today, and let you decide which ingredients you like to build your own pizza. We're going to show you how we've built our very attractive, yet disgusting pizza for an attacker. So we're going to cover things like credential compromise detection, Repokid, which Travis built in open source last year, a thing that we call Role Protect, which deals with our delivery engineering tool, some anomaly detection techniques that we've built around some of the tools and monitoring that we have, how we got rid of static keys in our AWS environment, a thing called API protect, and much, much more.

We're going to, hopefully, leave you salivating after this talk, fully stuffed, knowing exactly what a terrible pizza looks like, but also how you can actually layer different security ingredients together to make the pizza of your choice at your company. And to get us kicked off, Travis is going to start by talking about how we actually built out our accounts.

Segment Environment into Accounts

McPeak: Okay, so we're starting a pizza. What do we need? We need the freshest, most doughy crust that we can get our hands on. For that, we're going to talk about segmenting the environment into accounts. Now, the way I like to think about it is that we obviously don't want our accounts to become compromised. We never want that to happen. But if they do, and we have the proper segmentation, then we can prevent a relatively big deal from becoming catastrophic. I think about it like a firebreak. And a firebreak is basically like this. You have - hopefully, it doesn't happen - but if the forest catches fire, a firebreak will keep a little segment of the forest burning down from turning into the entire forest burning down. It looks like this.

Since we're in San Francisco, I'll share a little fact that I learned recently. And that is, that in the 1906 great earthquake in San Francisco, at 5:12 in the morning, there was a 7.9 magnitude quake. Relatively shortly after that, a few fires broke out. And shortly after that, those fires became one big fire. By the evening of the next day, that fire became so big that they thought it might burn down the entire city. So the army came in and they had a great solution. What they did was, they started dynamiting mansions along Van Ness Avenue, and that became a massive firebreak. Because of that, the fire did not take the entire city, the city was saved, and so that's a pretty cool use of a firebreak.

Now, I think that's a perfect analogy for this because we're going to do the same thing with our AWS accounts. Let's say that we have a team that wants to be power users, and they need broad permissions. Now, most of the time, we, as a security team, build tools that make it really easy for developers to do the right thing. We don't want developers to have to worry about how to manage these AWS resources. So we create tooling. We call that the paved road. There are cases, however, where developers might need to do something that's not on the paved road. Power users, they might need to Terraform something, for example. That requires really broad permissions, much more so than we normally give out. And for that, we can just put them in their own account. They're completely segmented in there. They can be essentially power users. If the account gets compromised, it's contained.

Similarly, we can use it to separate duty. So the power users that we were just talking about can be in their own account, and then other power users can be in a different account. And both sets can be separated from our main product. It's also a really useful control when applications are really sensitive or when you have sensitive data. For example, we have a separate account just for our payment card stuff. It's super locked down, almost nobody has access to it, and that's a great control that we can layer on for our most sensitive assets. Now, if you want to use this control, you are probably going to want to invest in some tooling. Creating and deleting accounts can be a very manual, tedious process. And so, if this control seems tempting to you, I would recommend creating tooling to make it easier to make these accounts and update them and stuff like that.

Remove Static Keys

Next step, what do we need? Nice, fresh marinara sauce. For this, we're talking about removing static keys. Now, the problem with static keys is that they never expire, and so that has led to some very unpleasant surprises for developers. We, on the security team, are in the business of preventing our developers from having a bad day. I would like to shed light on some bad days that other developers have had, and why we've decided to remove static keys, such as this person who says, "My AWS account was hacked and now I have to pay $50,000," or this poor developer who says, "I was billed for $14,000 in AWS after I checked in static keys into Git." Or even Ryan Hellyer's AWS nightmare, where leaked keys led to a $6,000 bill overnight. These are all bad things that can happen, and the problem is that credentials can get leaked. You can put them somewhere you don't want to, you inadvertently check them into Git, and when you do that, the access is permanent for an attacker, so they can run up tons of damage, attack people, all kinds of bad things we don't want to have happen.

What do we want instead? We want short-lived keys that are delivered securely and rotated automatically. Now, in our case, we're going to attach those keys, the dynamic keys that we've created, to roles that applications use and roles that our users use. But in either case, we don't have the static keys that led to these nightmares that other developers are having.

Permission Right Sizing

Cheese time. Love cheese, right? The gooey-est, most sauciest, good cheese that you can get your hands on. That's what we want for our security pizza at Netflix. What we're talking about here is permission right sizing. So we continuously and automatically remove permissions from our accounts as they're no longer required. The way that I like to think about this is it's like getting fitted for a custom suit. So you'll have an off the rack suit that fits most people, but you want to get it so it fits you perfectly. The first step in doing that is that you have excess fabric. For us, the way that we do that is, every new application at Netflix gets a base set of permissions. And the base set of permissions are things that we've noticed over time that developers frequently need. It's going to cover 90%, 95% of these cases. We know, by definition, that this is overprovisioned. Not every app is going to need all of those permissions. So what's the next step?

We have a tailor, we're going to measure the application and see what are the permissions that it has that it needs, and what should go away. To do that, we use Access Advisor and CloudTrail, which are two awesome services that Amazon provides us. They give us a lot of data about which permissions a given role is using. Finally, once we have those perfect measurements for the suit, we can go and remove the excess fabric. In our case, we're using a tool that I'll talk about in a second, called Repokid, that removes the fabric.

Now, the cool part about this is that, not only do we have least privilege, which is awesome for security team, not only do we have this system where developers don't have to go and manually ask for permissions, and it converges to least privilege, but unused applications converge to zero permissions. This is really important because unused applications are a huge pocket of risk in the environment. Think about it. You have these developers, they're using an application, they're actively developing it, and then what happens? They leave the company. They go work on something else. People forgot it exists. It's not getting patched anymore, but it's still there. It still has their permissions. For us, as soon as we stop exercising those permissions, they go away, which is really cool.
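The right-sizing loop described above can be sketched in a few lines. This is a minimal illustration, not Repokid's actual code: the `right_size` function, the 90-day idle threshold, and the shape of the usage data (service name mapped to days since last use, roughly what service-level last-accessed data provides) are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def right_size(
    actions: Iterable[str],
    days_since_used: Dict[str, int],
    max_idle_days: int = 90,  # assumed threshold, not taken from the talk
) -> List[str]:
    """Keep only the actions whose service was used within max_idle_days."""
    kept = []
    for action in actions:
        # IAM actions look like "service:Operation"; usage is tracked per service.
        service = action.split(":", 1)[0].lower()
        idle = days_since_used.get(service)  # None means the service was never observed
        if idle is not None and idle <= max_idle_days:
            kept.append(action)
    return kept


# Hypothetical base permission set handed to every new application.
base_permissions = ["s3:GetObject", "s3:PutObject", "sqs:SendMessage", "dynamodb:GetItem"]

# Hypothetical usage data: service -> days since the role last used it.
usage = {"s3": 2, "sqs": 400}

trimmed = right_size(base_permissions, usage)   # only the S3 actions survive
unused_app = right_size(base_permissions, {})   # an app with no activity keeps nothing
```

The second call shows the property called out in the talk: a role with no observed usage converges to zero permissions.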

This is the tool that I mentioned, it's called Repokid. This is my baby. I spent a lot of time working on this. I would love feedback though. If you're interested in it, I would love to have a conversation. If you've been using it, awesome. With that, I'm going to pass it over to Will, and he's going to talk about our next spicy ingredient.

Paved Road for Credentials

Bengtson: Thanks, Travis. I love cheese pizza like the rest of you, but I love to work out. I need my protein. So we're going to add some pepperoni to this pizza. Travis mentioned that we got rid of static keys in our environments. Everything is with a role, and you might be asking yourself or wanting to ask me, "Well, how do you do that? How do developers get access to your accounts if they do not have static keys to actually use with the different accounts?" For us, it's through a tool called Consoleme. We have a central place to gather credentials within our environment that gives you access to console, gives you access as an actual application, whatever your need may be as a developer, it's centralized, and we do so for many different reasons.

This is a picture of the Presidio Modelo. It's the Panopticon effect. You have a central guard shack that's able to see into all prison cells, see the activity, grant access to different cells. This is how we visualize how we do credentials at Netflix. We have a central tool called Consoleme that has access into every account, and gates that access of who can actually get into what accounts, what role. How valid are those credentials? Are they locked to certain computer or not? The central shack, Consoleme, has access to monitor and grant the permission needed for our developers.

So if you take a look at how Consoleme might actually work in our environment, we have our app Consoleme here in the middle. We have an identity service at Netflix called Pandora. So you could think of it as a merger between your HR system, your Google Groups, any sort of AD if you might have it. This system is a central authority of what identity actually means, who you are, who your manager is, who your teammates are, what groups you are a member of. You can ask questions like, "Is Travis a member of sec ops? Is he good?" You can't ask that question, but you can ask the membership questions that you need.

So with Consoleme, when you want credentials, say I want to develop as Repokid, I can ask Consoleme for those credentials. Consoleme looks up, by talking to our delivery tooling, who should have permission to actually get Repokid credentials. It will then ask Pandora, "Hey, is Will a member of this group?" Pandora can say yes or no. In the case that Pandora comes back and says, "Hey, Will is not a member of this group," Consoleme will then reply to me and say, "I'm sorry, you are not authorized to get those credentials." In the case that I am authorized, Consoleme will then reach out to the IAM service within AWS.

Now, the IAM service, for those that aren't familiar, has a tight pairing with the security token service, which allows you to create temporary credentials that Travis mentioned. These credentials can be valid from 15 minutes up until 1 hour, depending on how you're actually doing credential chaining. We got rid of SAML Federation in our environment, so we no longer have credentials valid longer than an hour, even for the console. But you can, if you are doing SAML Federation or something like that, get credentials that are valid for up to, I believe, 36 hours these days. But in this case, Consoleme is reaching out to AWS and saying, "Hey, can you please give me some credentials that are temporary for Repokid?"

The most important thing here, though, that we do is - I don't know if you can see in the back - we inject IP restrictions on these credentials. Here's a little-known fact about AWS: when you're doing an assume role, you can actually inject sub-policies on that role. When I ask Consoleme for credentials, it's going to give me credentials that are locked to the VPN that I'm actually logged into. Those credentials are only valid from my environment, from my machine. If I accidentally check them into Git or leak them somehow through a vulnerable service, they are not valid outside of Netflix, which is very, very powerful for us. It gives us the assurance that we can go to bed at night and our developers can develop securely as their applications without much risk of being compromised, should they do something like expose a credential accidentally.
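As a rough illustration of the sub-policy injection described here, the sketch below builds a session policy that only allows calls from a given set of source addresses. The helper name `build_session_policy` and the example address are assumptions; the talk does not show the actual policy Consoleme injects. Because a session policy can only narrow what the role already allows, an `Allow` with an `aws:SourceIp` condition is enough to restrict the resulting credentials.

```python
import json
from typing import List


def build_session_policy(allowed_cidrs: List[str]) -> str:
    """Return a session policy document that permits calls only from allowed_cidrs."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Effective permissions are the intersection of the role's own
                # policies and this document, so "Allow *" here grants nothing new.
                "Effect": "Allow",
                "Action": "*",
                "Resource": "*",
                "Condition": {"IpAddress": {"aws:SourceIp": allowed_cidrs}},
            }
        ],
    }
    return json.dumps(policy)


# Placeholder address standing in for the requester's VPN egress IP.
session_policy = build_session_policy(["203.0.113.10/32"])

# With boto3, the document would be supplied through the Policy argument of
# sts.assume_role(), alongside RoleArn, RoleSessionName and DurationSeconds.
```

Credentials issued this way stop working as soon as they are used from any address outside the list, which is the property the speaker relies on for leaked keys.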

The IAM service will then give us those access keys back and Consoleme will transparently pass those through to the developer. This creates a pretty seamless flow. We've even gone as far as developing a metadata service that you can run locally on your laptop as if you're running on an EC2 instance. This metadata service will then talk to Consoleme continuously on the back end and automatically refresh these tokens, so that you can do long-lived jobs like S3 downloads, and develop all day without actually having to run command line things all the time to renew your credentials. It creates a very seamless environment, but most importantly, a secure development environment from the security standpoint.

If you think about a central place to gather all credentials, it provides us with a central place to audit and log. I can see how many times Travis has requested a credential. I can see if Travis has tried to request credentials from my application when he doesn't have permission. It allows us to detect abuse and make sure that everything's running smoothly.

Now, we've given talks before on some anomaly detection techniques that we've done called Trainman, which learns our environment and what applications we use. The way we view an account and role is as a single app. If you think about our production account and the role admin, for us, that's considered an app. So every account and role that I log into, you can think of it as a new app. We start to learn which applications you actually use as an employee at Netflix. Then we can start getting anomaly detection like, "Hey, you've never actually used this application before. Is this valid? Do I need to be concerned about this?" So we can do some automatic analyzing on that or surface it up for our teams to actually view and see if something seems suspicious. That's pretty powerful for us.

Prevent Instance Credentials from Being Used

All right, enough with the protein. Let's get some olives on there. Any olive fans out here? I see some smiles in the front. So we have credentials, they rotate. But is there a way that we can actually say and assert that the credentials in our EC2 instances, the credentials providing API access for applications, cannot be used off instance? And the answer is yes, given certain situations. That's something that we've done. What we wanted is that if an attacker were to steal credentials, they would not work from the attacker's environment.

We do this by understanding how application API calls work and the network flows from the given server in AWS, and what the environment looks like. It's not as hard as it sounds. It's essentially enumerate your account, describe the NAT gateways that you have, and encapsulate those IPs into a list, describe your VPC IDs and your VPC Endpoint IDs, and craft a managed policy or a policy that you inject, that basically says, "Deny everything unless these conditions are true." And the conditions that you're checking for are: the source IP is one of my NAT gateway IPs, the VPC ID equals one of the VPCs, or the VPC Endpoint ID equals one of the VPC Endpoints. It's important to cover all three due to the different ways that IAM is evaluated based on the service that you're calling. But with those three components, you can actually lock a credential to your environment.
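The "deny everything unless these conditions are true" policy can be sketched as follows. This is a simplified, single-statement version written for illustration, not the policy Netflix deploys: the function name and the example IDs are assumptions, and a production policy may need additional handling for how individual services populate these keys. The condition keys `aws:SourceIp`, `aws:SourceVpc`, and `aws:SourceVpce` are standard IAM global condition keys.

```python
from typing import Dict, List


def build_lock_policy(nat_ips: List[str], vpc_ids: List[str], vpce_ids: List[str]) -> Dict:
    """Deny every call that arrives from outside the known network environment."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # Condition entries are ANDed: the Deny only applies when the call
                # matches none of the known NAT IPs, VPCs, or VPC endpoints.
                "Condition": {
                    "NotIpAddress": {"aws:SourceIp": nat_ips},
                    "StringNotEquals": {
                        "aws:SourceVpc": vpc_ids,
                        "aws:SourceVpce": vpce_ids,
                    },
                },
            }
        ],
    }


# Placeholder values; in practice these would come from enumerating the account
# (NAT gateways, VPCs, and VPC endpoints).
lock_policy = build_lock_policy(
    nat_ips=["198.51.100.7/32"],
    vpc_ids=["vpc-0abc1234"],
    vpce_ids=["vpce-0def5678"],
)
```

A call that comes through a known NAT gateway, from inside a known VPC, or through a known VPC endpoint fails at least one of the negated conditions, so the Deny does not fire and the role's normal permissions apply.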

Now, I mentioned understanding the network flow. For us, the way that we can apply this is if our application is deployed on the internal subnet only, we can apply this. You can still expose your application publicly through an ELB deployed on an external subnet. But the reason that caveat exists is that we need to know the IP addresses that our calls would come from. Our external presence is so large that we have many, many IPs. We can't possibly know them all the time, and then encapsulate that into a policy. So for us, we apply that to internal only roles. We acknowledge there's a gap. As you'll know or notice in this talk, we're going to talk about lots of ingredients and how we layer those together. And hopefully, you'll see how the different layers address these gaps as we continue.

So you might be asking, "Hey, does this really work?" The answer is yes. In real life, we had a scenario once where a third party application running in our environment had a vulnerable plugin. A bug bounty researcher was able to trick that application into requesting credentials from the metadata service. Those credentials were returned to the researcher, who took them to their local machine and started to use them. He got lots and lots of denies. Even though the role had permissions for the actions that he was using, the checks on the environment the calls were coming from failed, because his environment did not equal ours, so he was denied. We actually had worked with this researcher in the past, so we told him, "Hey, if you're able to actually bypass this protection, we'll pay you more." So we got some free enumeration out of it. We got lots of alerts that went off. But what it did for us is check that box and say, "Yes, this control works and it's pretty cool." So something to think about. This could be a very, very powerful ingredient in your own kitchen. But yes, it's been really, really effective for us.

Delivery Lockdown

On that note, delivery lockdown. Let's add some mushrooms to this pizza. I'm not a mushroom fan, but I'm hoping my attacker isn't either. So let's add it anyway. It's probably well known that we use a tool called Spinnaker at Netflix. Has anyone in the crowd heard of Spinnaker? For those that haven't, Spinnaker is a delivery mechanism. It's a tool that was open sourced by Netflix originally, but has a lot of community involvement and many contributors these days. We use Spinnaker to deploy just about everything. Most of the time, our employees at Netflix never have to log into the AWS console. They log into Spinnaker and get a single view of all of our accounts, and can manage their applications therein.

It's very nice, but probably what's not well-known is how we actually integrate security controls in Spinnaker. Travis mentioned that we run with roles, and we actually have a broad set of permissions from the beginning, and then we tailor that suit back. We do that by integrating some of our security tooling with Spinnaker. When you deploy the application in Spinnaker, it actually calls out to a Lambda function that we own. That Lambda function will go create a role from a base template in the account that you actually want to deploy in. From that moment forward, you have that permission set to go forth and do great things and continue rolling, and then Repokid will come in behind you.

Once you have that application and role deployed, Spinnaker is actually applying a Spinnaker tag to it, to both the application and role. What that is actually doing is saying, "Hey, only Travis's application can deploy with Travis's role." If I'm Will's app, I can't actually use Travis's role. So we're actually locking IAM roles to a given application. I'm not sure how familiar you are with IAM policy within AWS. But one of the most important things to hand out in AWS if you want to launch a service is the iam:PassRole permission, which can be very, very powerful. It's a common way to privilege escalate.

If Travis has S3 permissions and I have DynamoDB permissions, but I need to get a file from S3, it's 2:00 a.m. in the morning, I'm definitely really late and I just need it now, the common thing that I might do is, "Hey, I'll just borrow Travis's permissions for the night." With this protection that we've put in place, it's impossible to do that. And it's been really, really powerful for us. We can go on every day with other tests, not worrying about are people abusing the roles in our environment?

On top of that, when you actually deploy an application, the typical Spinnaker mode is, if you have access to an account, you have access to all applications in an account. We actually took a contribution from Google called Fiat, which gives you authorization controls at the application level. We embedded this into Spinnaker, in our environment, and now we're able to actually say, not only can you lock a role to an application, but you can actually say that only my team can manage my application. No one can edit my pipeline and deploy their AMI instead of mine so that they would still be able to privilege escalate with the role. I can't use Travis's role, but maybe Travis can use my AMI and there I go, I use Travis's role. With these two things together, we have successfully mitigated a potential privilege escalation in our environment, which has proven to be very powerful against something that we'd been worried about for a really long time.

We've rolled this out environment-wide. We've learned some things. As with anything, we've made some assumptions, we've broken some things. I think a peer could probably attest to how we broke Cassandra. But it's out, we figured it out, and we failed fast. And hopefully, now we're in a much, much better place. Similar to Consoleme, with Spinnaker being our central deployment engine, it's a central source for us to get that logging and detection. So if Travis tried to launch my role, we could actually see that because Spinnaker will actually publish a security event to us, and now we know that so and so's trying to launch with a role that they're not supposed to. Most of the time, it's harmless and we reach out to the developer and understand, "Hey, what are you trying to do? How can we help you?" So it's a way for us to just see what's happening in our environment as well as be proactive with the developers. Sometimes we get the, "Hey, I'm impressed. How did you know that I was trying to do that?" So, it's kind of self-assuring for us.

Detect Instance Credentials Used Off-Instance

But now, we're going to spice things up. Who wants to learn about detecting instance credentials off the instance? Everyone raised their hands, I saw it. So adding some peppers to the pizza. This year at Black Hat, I presented on a new methodology for how to detect credential compromise in AWS. What's important with this detection mechanism is, I'm talking about your AWS. Your kitchen detecting compromise from your pizza. I stole the pizza from your kitchen, not my kitchen. Think of it that way.

We already talked about how we can block credentials from being used off instance and enforce that. We talked about a gap, in that you can only apply those to internal subnets, depending on how your network routing is. But how can we detect if it's being abused? You might say, "Hey, Will, can't you just detect explicit deny error messages?" If you're not familiar, some services within AWS, when you have an IAM policy that says, "Explicitly deny this call," will actually pass that through to the error message. But it's not consistent across services. Some services might just tell you unauthorized. And that might just be a developer trying to develop a new piece of technology. It might be that Repokid took away a permission that's only seldom used, and so it was just normal business. So for us, it's not necessarily a good signal, it's mostly noise. So we had to come up with a different mechanism.

Back to the temporary credentials. When an EC2 server launches in AWS, it goes out to the STS service and assumes the IAM role that you want to provide to that server. That actual call is seen in the CloudTrail audit trail. So what we do is, we actually track that call, push it into a table that we're going to keep track of, and that gives us instance level tracking. Because the way that EC2 works is every time it launches a new server, it assumes the role for that particular instance and actually passes the instance ID in as the session name. So you can actually see what temporary credential belongs to which instance in your environment. As we're analyzing CloudTrail, we see when those temporary credentials are used. The first use that we see, we lock to the IP that is given in the audit trail. Then from that point forward, for every event after that we see being made with that temporary credential, if it's not the same IP, that's a potential compromise. So it's something that's very easy to put in place, and it learns as your environment spins up. If you start it today, within six hours, you'll have full coverage of your environment, so it's pretty powerful and has worked really, really well.
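The first-seen-IP approach can be sketched over CloudTrail-style records. This is a minimal illustration under stated assumptions: the function name is hypothetical, state is kept in memory rather than in the table the speaker mentions, and it ignores cases a real deployment would need to handle, such as calls made by AWS services on the instance's behalf or instances with more than one address. The `userIdentity` and `sourceIPAddress` fields are standard CloudTrail record fields.

```python
from typing import Dict, Iterable, List, Tuple


def find_off_instance_use(events: Iterable[dict]) -> List[Tuple[str, str, str]]:
    """Return (instance_id, first_seen_ip, offending_ip) for each suspicious event."""
    first_ip: Dict[str, str] = {}
    alerts = []
    for event in events:
        identity = event.get("userIdentity", {})
        if identity.get("type") != "AssumedRole":
            continue
        # For instance credentials the session name (last ARN segment) is the instance ID.
        session = identity.get("arn", "").rsplit("/", 1)[-1]
        if not session.startswith("i-"):
            continue
        ip = event["sourceIPAddress"]
        # Lock the session to the first IP observed; later events must match it.
        locked = first_ip.setdefault(session, ip)
        if ip != locked:
            alerts.append((session, locked, ip))
    return alerts


# Hypothetical records: two calls from the instance, then one from elsewhere.
role_arn = "arn:aws:sts::123456789012:assumed-role/app-role/i-0abc123"
sample_events = [
    {"userIdentity": {"type": "AssumedRole", "arn": role_arn}, "sourceIPAddress": "10.0.1.5"},
    {"userIdentity": {"type": "AssumedRole", "arn": role_arn}, "sourceIPAddress": "10.0.1.5"},
    {"userIdentity": {"type": "AssumedRole", "arn": role_arn}, "sourceIPAddress": "198.51.100.23"},
]

alerts = find_off_instance_use(sample_events)
```

Because the baseline is learned per instance from the environment's own traffic, the check does not depend on a list of known-bad addresses.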

A good example of this is when a developer once tried to actually pull credentials down to their local system. Our detection here was able to see that the first call from that credential was from the server, and then when they tried to make a call from the VPN, we got to actually see that, "Hey, this is actually happening in our environment." So it's pretty neat to actually say that you can now detect credentials being used outside of just your application or your Netflix environment, and not just AWS in general. That's one thing that I like to call out: it learns your environment. So if you're an attacker and you're trying to avoid something like GuardDuty and put your instance credential on your box, this detection mechanism will actually detect it. So that's why I say this spices some things up. But now, Travis is going to add some veggies.

Detect Anomalous Behavior in Environment

McPeak: Onions. I love onions on pizza but Will doesn't, and let's hope that the attackers are more like Will. So we're talking about detecting anomalous behavior in our accounts. In this case, what we do is we track that baseline behavior. So I mentioned earlier that we have Repokid, and Repokid uses CloudTrail to understand what's normal for an application, what's being used. An additional benefit of that is that we have a really good idea of what our accounts are doing on a normal basis for a whole bunch of different cases.

For example, some regions in our accounts should never be used. Now, I know what those regions are, Will knows what they are, [inaudible 00:27:34] knows what they are, but the attackers do not. So if you're an attacker and you come into the environment, you have to be really careful because if you go and do something in the wrong region, we're going to notice it immediately and start quarantining, kicking you out of the account, whatever. Similarly, some resources sprinkled through our environment, obviously I'm not going to say which ones, but there are some that should not be used. If we see an attacker using those resources, requesting them, we know that developers aren't doing it, and we know that we're going to go and quarantine the attackers.

Finally, some services, we don't use at all. For an attacker, you would have to know, “Not only do I not want to step on the wrong region, not only do I not want to hit that one resource, but what are the services that Netflix uses?” You might be able to guess some of them, but if you get the wrong one, we're going to know immediately. This is one of the benefits to us continuously watching CloudTrail. It's a little bit expensive at our volume. CloudTrail becomes expensive to ingest and analyze, but we get so many benefits out of it that we think it's worthwhile.
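The account-level tripwires described here (regions and services that are never used) can be sketched as a simple filter over CloudTrail-style records. The function name, the allow-lists, and the sample events are all hypothetical; the talk deliberately does not say which regions or services Netflix treats as off limits. The `awsRegion`, `eventSource`, and `eventName` fields are standard CloudTrail record fields.

```python
from typing import Iterable, List, Set, Tuple


def tripwire_hits(
    events: Iterable[dict],
    allowed_regions: Set[str],
    allowed_services: Set[str],
) -> List[Tuple[str, str, str]]:
    """Return (kind, value, event_name) for calls outside the account's normal footprint."""
    hits = []
    for event in events:
        region = event.get("awsRegion", "")
        # eventSource looks like "s3.amazonaws.com"; keep the service prefix.
        service = event.get("eventSource", "").split(".")[0]
        name = event.get("eventName", "")
        if region not in allowed_regions:
            hits.append(("region", region, name))
        elif service not in allowed_services:
            hits.append(("service", service, name))
    return hits


# Hypothetical footprint and traffic, for illustration only.
sample_events = [
    {"awsRegion": "us-east-1", "eventSource": "s3.amazonaws.com", "eventName": "GetObject"},
    {"awsRegion": "ap-south-1", "eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
    {"awsRegion": "us-east-1", "eventSource": "lightsail.amazonaws.com", "eventName": "GetInstances"},
]

hits = tripwire_hits(sample_events, allowed_regions={"us-east-1"}, allowed_services={"s3", "ec2"})
```

The value of the control comes from the asymmetry the speaker describes: defenders know the allow-lists, while an attacker enumerating the account does not.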

Now an example: an attacker, in this case a researcher, got into our environment. They exploited some application vulnerability and they wanted to get a bigger payout. So what did they do? Start screwing around, trying to get to AWS, exploit us, go further. They ran an AWS exploit framework tool that immediately began enumerating a bunch of common payloads, AWS calls, to see what it could do. Now, unbeknownst to this attacker, one of those payloads is something we never use, and so we immediately knew, "Hey, there's something bogus going on here." And we were able to stop it. So it's kind of a cool benefit to knowing what's normal for your account and then seeing a quick deviation from that.

Detect Anomalous Behavior by Roles

Who likes anchovies on their pizza? Really, you do? I was going to say, "Nobody likes anchovies on their pizza," but apparently, that's not correct. When we showed this slide deck to our VP, he said, "That doesn't even look like anchovies." He said that looks like whole barracudas on the pizza. Do you like barracudas on your pizza? Okay, in any case, we're going to put a bunch of barracudas on the pizza and hope that the attacker can't finish it. So what we're talking about here is detecting anomalous behavior by roles. We were already talking about doing that at the account level, with regions and resources, but we can do the same thing for application roles. It turns out that applications have relatively consistent behavior. Once they're bootstrapped, they do the same things, day in and day out. That's exactly the principle that Repokid works on. The application is going to do the same things and we take away the excess stuff.

But we can also notice when an application deviates from that. It's a really good signal. So you can look for common first attacker steps, for example. Let's say we have an application, it's chugging along, it does Dynamo stuff, and then all of a sudden, it does S3 ListBuckets. That's a common first attacker step. It's pretty benign. There's a lot of applications that have it. An attacker might just say, "What are the buckets that I have access to?" Now, that's fine, but if your application never does S3 stuff and we see that, we're going to know something's up. We could look for STS GetCallerIdentity. That's something that very few applications use, and if we see it on an application that's never used it before, that's pretty sketchy.
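A per-role baseline of this kind can be sketched as a set of previously seen calls for each role. The `RoleBaseline` class and the role name are hypothetical, and the talk does not describe how Netflix stores or ages its baselines; this only illustrates the "never done this before" check, using CloudTrail-style service and event names.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

Call = Tuple[str, str]  # (service, event name), e.g. ("s3", "ListBuckets")


class RoleBaseline:
    """Tracks which API calls each role has made during a learning period."""

    def __init__(self) -> None:
        self._seen: DefaultDict[str, Set[Call]] = defaultdict(set)

    def learn(self, role: str, call: Call) -> None:
        """Record a call observed while the role was behaving normally."""
        self._seen[role].add(call)

    def is_anomalous(self, role: str, call: Call) -> bool:
        """A call is anomalous if this role has never made it before."""
        return call not in self._seen.get(role, set())


baseline = RoleBaseline()
# Hypothetical application role that only ever talks to DynamoDB.
baseline.learn("app-role", ("dynamodb", "GetItem"))
baseline.learn("app-role", ("dynamodb", "PutItem"))

normal = baseline.is_anomalous("app-role", ("dynamodb", "GetItem"))
first_s3_call = baseline.is_anomalous("app-role", ("s3", "ListBuckets"))
whoami_call = baseline.is_anomalous("app-role", ("sts", "GetCallerIdentity"))
```

In practice a deviation like this is a prompt to investigate rather than proof of compromise, as the example that follows in the talk shows.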

For example, we saw exactly what I said. We saw an application that had very consistent behavior. All of a sudden, at 8:00 I got a signal that this application called S3 list buckets, all of a sudden. So I freaked out a little bit. I didn't freak out, I'm a professional. But yes, I freaked out a little bit. I reached out to the developer and I said, "Hey, what's going on with this application?" And it turned out, in this case, it was just the developer screwing around, doing a little bit of research, relatively early in the evening. So that's good, no reason to worry. But I have no doubt, at some point in the future, we are going to catch somebody doing this just by virtue of the fact that we know what normal looks like for our applications.

Now we're going to shift gears a little bit. We're going to start talking about future controls. And we're going to shift to photo-realistic pizza and much less of the barracudas. The future control that I'm most excited about is one role per user. Now, we talked about earlier how we are federating users into common classes. For example, we might have a regular user in an account or a power user, but they're all sharing the same role. In the future, we're going to split them out. And so every user in an account gets a role that is completely custom to them. We're going to take the same approach we talked about earlier.

We're going to grant the benign set of default permissions. We're going to use data and custom fit that suit to the exact user. So they will have the exact permissions they need and none extra. That's good from a least privilege standpoint, but it also gives us the ability to do anomaly detection. We can see what's normal for that user. And if they start doing the S3 list buckets, all of a sudden or something that's equally odd for them, we can notice it and quarantine them, kick them out of the account, ask the user what's going on, maybe like swing by any office or something. In some way, we'll be able to respond to this and do something to quarantine the attacker. So I think that's pretty cool. Next up, I'm going to pass it to Will with this juicy veggie pizza.

Future: Remove Users from Accounts They Don’t Use

Bengtson: I don't like veggies, but I might eat this one. So we've talked about some pretty powerful things, one role per user. But is there something that we can do now before we actually get to that step? If you think about one of the biggest problems in security, as you move teams, you gain access to things, and you continually gain access to different accounts, different groups that you're never removed from. This is a problem everywhere. It's a really, really hard problem. If you have a solution, go start a company, you'll make millions of dollars.

But now that we have a central way to log into accounts, we can detect who's using what. We can see who's actually accessing certain accounts. Think of the scenario where I have access to 10 accounts because I got 5 of them from back when I was on the dev tools team, and I got 5 more when I joined security. Over the last quarter, I've only accessed the security accounts and I stopped accessing the dev tools accounts. Why not automatically remove me from the group providing me that access and make it more least privilege for me? Instead of tackling the actual role itself that I'm using, just auto-remove me from the AWS groups that I shouldn't be in anymore. So this reduces the risk of user workstation compromise or just me accidentally pushing a credential somewhere. If I don't have access to all the accounts that I used to have access to because I haven't used them, then a stolen cookie from my environment might not be as bad, or a complete RCE on my laptop and Chrome headless won't be as bad either.
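Editor's sketch: a minimal illustration of the "what haven't you accessed?" check, assuming the central login system can report a last-access time per account. The function name, account names, and 90-day window are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta


def unused_accounts(granted, last_access, now, window_days=90):
    """Return accounts a user can reach but has not accessed within the window."""
    cutoff = now - timedelta(days=window_days)
    stale = []
    for account in sorted(granted):
        last = last_access.get(account)
        # Never accessed, or last accessed before the cutoff: candidate for removal.
        if last is None or last < cutoff:
            stale.append(account)
    return stale


now = datetime(2019, 1, 1)
granted = {"devtools-1", "devtools-2", "security-1"}
last_access = {
    "security-1": datetime(2018, 12, 20),  # used recently
    "devtools-1": datetime(2018, 6, 1),    # not used for months
}
to_remove = unused_accounts(granted, last_access, now)
```

Here `to_remove` contains the two dev tools accounts. The removal itself, and the quick path to get access back, are outside this sketch.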

This is what we think is going to be pretty powerful from an initial standpoint. We can go look and say, "What haven't you accessed?" We're going to remove that and we're going to provide a way for users to get the access back securely, without much friction, so that we can arguably tell them, "Hey, we're going to remove this, but if you needed it back, just come ask. It'll take like a minute. We'll give it back to you and you'll be off and rolling again." What we've experienced is we think our developers will actually be okay with this, we'll ask at dinner tonight what he thinks because this is new for him, but we think it's going to be pretty powerful from reducing our risk from an actual workstation perspective.

Future: Metadata Work

This one I'm most excited about. I'm biased because it's my personal baby. Travis has Repokid. I've been focusing on the whole IAM protection space. We talked about locking credentials to an actual server, and how we can detect when a credential has been compromised. But what if there's a future where you could actually prevent, rather than detect, a credential from being compromised altogether? There are several common classes of vulnerabilities that bug bounty researchers know about, and if you're running in AWS, they target those first. There's one called Server-Side Request Forgery, which we'll walk through here shortly. And that's essentially tricking an application to request a URL on your behalf.

The common attack pattern that we've seen is a researcher will ask the application to request credentials from the metadata service. This is a very well-known port, and it takes two URL calls to get credentials. One, to understand what role you're running as, and then the second one to say, "Hey, give me credentials for that role." As a bug bounty researcher, that's your number one hope that you can find. Getting AWS credentials in an environment in a bug bounty program is typically a critical vulnerability, and you can read many, many public bug bounty posts where this leads potentially to complete account compromise.
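Editor's sketch: the two calls map onto a single path prefix on the EC2 instance metadata service, which makes them easy for a defender to recognize in logs. The helper below only classifies request paths and makes no network calls. The function name and the role name `example-role` are hypothetical.

```python
# Path prefix under which the metadata service lists the role and serves its credentials.
CREDENTIAL_PREFIX = "/latest/meta-data/iam/security-credentials/"


def classify_metadata_path(path):
    """Label a metadata request path as a role lookup, credential fetch, or other."""
    if not path.startswith(CREDENTIAL_PREFIX):
        return "other"
    role = path[len(CREDENTIAL_PREFIX):]
    # Call one has no role name after the prefix; call two appends the role name.
    return "fetch-credentials" if role else "list-role"


first_call = classify_metadata_path("/latest/meta-data/iam/security-credentials/")
second_call = classify_metadata_path(
    "/latest/meta-data/iam/security-credentials/example-role"
)
unrelated = classify_metadata_path("/latest/meta-data/instance-id")
```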

What we want to do is actually prevent compromise at the credential source. So we started investigating, is this even possible? How would we do it? This is what SSRF actually looks like. A normal person requests a given URL from a web application. That web application, in turn, requests something from a remote application, it responds. Then the web application combines that information together and provides a single look to you. In the SSRF world, your attacker has actually injected a URL to the metadata service. This can be very hard to actually block in a web application firewall. There are many, many different URL representations for the 169.254.169.254 metadata IP. On top of that, if you blacklisted everything, I just throw a bitly link in there and then I bypass your firewall.
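Editor's sketch: one reason string-based blocking is fragile is that the same IPv4 address can be written as a single integer or in hex, and some URL parsers and HTTP clients accept those forms. The naive filter below is a hypothetical stand-in for a blocklist rule, not any particular firewall's behavior.

```python
import ipaddress

METADATA_IP = ipaddress.ip_address("169.254.169.254")

# The same address as a single integer and as hex.
as_int = int(METADATA_IP)
as_hex = hex(as_int)

# Hypothetical blocklist that only knows the dotted-quad spelling.
NAIVE_BLOCKLIST = ("169.254.169.254",)


def naive_filter_allows(url):
    """Return True if no blocklisted string appears in the URL."""
    return not any(bad in url for bad in NAIVE_BLOCKLIST)


dotted_url = "http://169.254.169.254/latest/meta-data/"
decimal_url = f"http://{as_int}/latest/meta-data/"

blocked = not naive_filter_allows(dotted_url)  # dotted form is caught
bypassed = naive_filter_allows(decimal_url)    # integer form slips through
```

A redirect through a URL shortener, as mentioned above, avoids the filter in the same way: the blocked string never appears in the request the firewall inspects.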

So it's very hard for a web application firewall to protect against, but this is what it looks like. An attacker tricks your application into requesting the metadata service. Here I have the metadata service pictured separate from the application, but you can imagine this encapsulated on a single instance because the metadata service is just available locally on the EC2 server. Your web application will then go request credentials from the metadata service, and it will return them, no problem. And then the web application is going to combine those results together and give them back to the attacker. The attacker will then take those credentials, pull them to their laptop and hope that they work.

We've mentioned earlier that we had that gap, internal only for our API enforcement, so how can we actually protect it altogether? We approached AWS with this problem and we said, "Hey, let's try to solve this together." If you're a developer or an AppSec engineer, you might think, "Well, hey, just put a header on the metadata service." With Server-Side Request Forgery or XML External Entity injection, you cannot control headers. So if you require a header on the metadata service, that mitigates your risk. That’s cool. So we approached AWS, said, "Let's work through this together and do this." We couldn't come up with a path forward on how to make that happen, so it fell to the ground.

I had done some research on how to actually rotate credentials in AWS, given a credential was exposed from a server, and found that there were some things needed in the SDK to actually support that. I already had a relationship built with the SDK teams, so I approached them and said, "Hey, can you add a header for me? If you do this, then I can actually build a proxy in front of the metadata service and mitigate this attack." They went back and forth with their different teams, and then came back and decided, "We don't feel comfortable supporting something that the metadata service doesn't support out of the box. So, sorry, we can't help you right now. Maybe in the future."

I was kind of bummed by this. I went and built a metadata proxy just to see what the data looks like. Can I log all the paths? Can I do some sort of anomaly detection and say, "Hey, I know that only these paths on the metadata service are ever visited by this application, what does that actually look like?" And as I started seeing the data come through the proxy, I had a eureka moment and thought, "Why didn't I think of this from the beginning?" What is common with every web request? Are there any developers or AppSec engineers? What does every web request have? The answer is a user agent. I didn't think of this in the beginning, but I saw it.

I started spinning up different applications with the different AWS SDKs and noticed that they weren't setting the user agent at all. It was whatever library they were using. So in Python, in the Boto3 world, it was a Python request object. In Java, the default Java library puts Java and then the version of Java that you're running. So I started making PRs to the AWS SDKs; they're open source. I started with Ruby, I put up a PR, convinced them to accept it, they merged it. I took that PR to Java land. I wrote Java code and I hadn't written Java code for a long, long time. I took the Ruby PR, I got them to accept it, and I just took that momentum on.

Once I got the different SDKs for the languages that we use in Netflix to accept it, I went to the global SDK team and said, "Hey, I need your help. I got your teams to accept these. Can you help me and make sure they don't change?" So they actually said yes. What does this actually look like? It's super exciting. But if you were to imagine this running on a single server, we're running an application that integrates with AWS, it's a Python app. So when the Python AWS SDK is actually requesting credentials to create a session or client object, it is now setting a user agent that starts with Boto3. So my proxy can whitelist what user agents are actually allowed in. In this case, I have a user agent that is allowed, I proxy that to the metadata service, it returns the credentials and I pass this forward, all is well.

But what does this look like in an SSRF scenario? If I'm running a Python application that's vulnerable to Server-Side Request Forgery, you are now able to actually block this at the proxy level. As I mentioned earlier, you cannot control headers. So when you've tricked the application to make the request to the metadata service, it is actually going to use whatever library you're using for fetching URLs. The most common one that I know of, at least in Python, is called requests. Its user agent starts with Python-requests. So in this case, the metadata proxy sees Python-requests, and then you can block that and return whatever you want. In this case, in my diagram, I'm saying 401 Unauthorized; you can return a 403, a 514, what's your ZIP code or area code? Whatever you want to return, return it. But the most important thing is, we've actually prevented the ability to compromise credentials in our environment, and we covered that last piece of the gap that I mentioned earlier. So this is exciting and being rolled out and being proven to be very, very successful for us. Now, Travis.
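Editor's sketch: the proxy's decision reduces to a prefix check on the User-Agent header. The snippet below shows only that decision, not a working proxy, and is not the speakers' code. The allowlist contents, function name, status codes, and example agent strings are assumptions for illustration. Real SDK user-agent strings should be confirmed from observed traffic before they are allowlisted.

```python
# Hypothetical allowlist: prefixes of user agents the AWS SDKs are expected to send.
ALLOWED_AGENT_PREFIXES = ("boto3",)


def decide(user_agent):
    """Return (status_code, reason) for a request reaching the metadata proxy."""
    agent = (user_agent or "").lower()
    if agent.startswith(ALLOWED_AGENT_PREFIXES):
        return 200, "forward to metadata service"
    # Anything else, including a missing header, is refused.
    return 401, "user agent not allowlisted"


sdk_request = decide("Boto3/1.9.0 Python/3.6.0")  # SDK fetching its own credentials
ssrf_request = decide("python-requests/2.20.0")   # app tricked into fetching a URL
no_agent = decide(None)
```

The check works because the attacker controls the URL the application fetches but not the headers the application's HTTP library sends.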

Hot & Ready

McPeak: We've got this lovely pizza. It's full of onions and barracudas and all kinds of stuff. What we've hoped we've done here is, we've taken individual ingredients that by themselves are pretty cool. And you can totally make a dinner out of cheese and pepperoni. I wouldn't be ashamed to do that. I think it's pretty good sometimes. But if you take them and you put them all together, and you bake them, and you let them mix together, you get this awesome pizza bite that we hope is so big and so hard to tackle that the world's best attackers won't be able to eat the whole thing.

Bengtson: Don't forget disgusting.

McPeak: And also very disgusting, thanks to the barracudas. The way I think about it is, Netflix show pitch time. This is Ultimate Beastmaster, if you haven't seen it. It's an obstacle course show. The idea is that you have these individual obstacles, and they're pretty hard by themselves. I couldn't do a single one of them, I promise. By themselves, they're pretty difficult. You have to be in really good shape. But if you put them all together, you end up in this situation where the world's best athletes have a hard time getting through this course.

That's exactly what we want to do with our security controls in our pizza. We want to layer individual controls so deeply that the world's best attackers will decide that it's not worth the effort for them to mess with us, and they'll go elsewhere. That's what we hope we've done. If any of these individual ingredients or any of the stuff we talked about today is interesting to you or you'd like to contribute with us to work on the next ingredients together, we'd love to chat. And with that, I would like to thank you very much for attending our talk and staying till the end today.

Moderator: I think we have time for questions.

Bengtson: Yes, the one thing to note is that we have put all the ingredients on our pizza, but you don't have to. Whatever fits your team and your environment, and makes sense for you based on your risk profile, is up to you. But we hope that these ingredients might prompt some thoughts and get you thinking of how you might go back to your kitchen, AKA office, and start putting some of these in place.

Questions & Answers

Participant 1: Great presentation. My question is regarding the recent Facebook-related hack that happened, wherein the access tokens were stolen and, of course, Netflix allows Facebook as an identity provider. Would any of these practices have prevented...? With that access token, could whoever ended up getting it do any malicious activity with Netflix?

McPeak: I don't know enough about the technical details of how the Facebook attack actually worked. I know that it was several layered steps chained together. So I can't speak to whether it would or wouldn't have been effective as an entry point in our environment. But I can say that the point of layering these controls from the infrastructure perspective is that even if the attacker is able to get into the environment, we can make it so difficult for them to do anything impactful, or we can notice it so quickly, that I hope we can stop them from doing anything that's really going to hurt us. Thank you for the question.

Participant 2: I was wondering if you got any blowback from internal employees, as far as all that tracking of activities and being able to discern what they work on and what type of resources they access, etc.? Also how much friction this introduced at the beginning?

Bengtson: Yes. Travis could speak more to the Repokid thing, because we actually told employees - there was a lot of debate internally, like, "Do we just take permissions away? Do we let them know beforehand?" The thing we did was let them know beforehand. I don't know if that created a lot of friction, but it created a lot of awareness of what was going on, and asked a lot of questions. We thought of some things that maybe we didn't think about from the beginning, but it's proven to be pretty effective.

Once we gave them the context of like, "Hey, here's why we can tell this is what your application does and why we're not going to break you," I think people felt better. But with the actual locking credentials down to an actual system, that created some friction, in that, "Hey, this call actually makes this other call in AWS and that's breaking it." So those kinds of things that we're just unaware of always create some sort of friction, but I think, overall, our employees understand that, "Hey, if Netflix went down ..." I don't know if you've ever seen the few times that we've gone down, but on Twitter, it's like everyone's life is over. I'd hate to be the cause of that. When I rolled out API protection, it was the most stressful six hours of my life, because I would push out protections to some roles in our environment. I'd watch and hope that our streaming number didn't just drop. Because, as everyone knows, at the IAM level it's global. There's no such thing as actually affecting a single region. There's no failover for me if I did something wrong. Have you experienced any?

McPeak: Yes, I would say, one of the things we've tried to do is really message ahead of time what we're doing and why we're doing it, and even why it's beneficial to developers. We want to have a secure environment, and developers want us to have a secure environment. They just don't want to be very inconvenienced. We tried to have a lot of good answers in place for like, "What happens if we screwed up?" We have an automatic rollback system, we can press a button and it gets fixed. Let them know ahead of time what's going to be happening and why, and even give a system to opt out.

If we explain to you what's going on and you really don't want it, that's fine. We're not going to force it on you. We try to explain to developers that we're actually helping to de-risk their application and it should be very transparent to them. But of course, unfortunately, we have inconvenienced some developers and we take a lot of precautions to make sure that we're not doing that, as much as possible.

Bengtson: Yes, I would say one of the first principles we have is, "We won't do something unless there's an easy way for you to get it back," kind of deal. So we won't take the permission away unless it's easy for us to give it back, because the last thing we want to do is create more burden for us from an interrupt standpoint in operations. So we're trying to make tooling to make it easier on us as well. We often dogfood it for many weeks, sometimes months, before we actually roll it out to everyone.


Recorded at:

Apr 20, 2019
