Transcript
Ramani: I'm going to go through what I think might be a familiar experience for some of you. The year is 2010, and this is PacFeed, a newly created independent digital media company. PacFeed is best known for its listicles, like "21 Ways to Lose a Ghost." The technical team is small, only one or two developers; they work on the website, and the source code is stored in a monolithic repository.
A couple of years later, PacFeed grows to support other types of media content, like video. It's now known for producing viral videos, a lot of them about cats. With its growth into different types of content, PacFeed's technical team has tripled in size over the years to be able to support a content team. Adding more business logic to PacFeed has become cumbersome as growth continues. Deploying any change, big or small, requires so much coordination between developers that it borders on anarchy. Something needs to change, so they turn to microservices.
PacFeed has split out into multiple microservices that make up different parts of the system, and interact with each other in different ways. PacFeed grows to support even more types of media content. They're now known for their award winning investigative journalism team and their brands, like Tasty, videos to make cooking more accessible.
With this media expansion, PacFeed's technical team grows. Its responsibility is to make sure that content is accessible and discoverable across a wide variety of platforms, from PacBook, to PacTube, to its own PacFeed platform. With the move to a microservice architecture, it's easier to build these services, which is great.
Then comes the need to create internal tools for employees in tech and other parts of the business organization, to aid their workflows. The creation of internal tools introduces a new question: "How do we secure our services so that only employees who are authorized to access tools can access them?" Today, that's what I'd like to talk to you about.
My name is Shraya, and I work at BuzzFeed, which might have a few similarities to the problems that PacFeed had. I work as a software engineer on security infrastructure tools; one tool that I worked on is what we use to solve the problem of securing internal services.
I'm going to start off with some background on our infrastructure and the evolution of auth at BuzzFeed. I'll go through the different types of auth that we considered, and what we initially went with before we started to feel some scaling problems. This evolution ultimately led us to build our own solution to securing services, called SSO. I'm going to go through the different parts of SSO and what the user experience looks like. SSO was recently open sourced, so finally I'd like to talk a little bit about why and how we ended up going through an open source process, and how we maintain the project.
Infrastructure Goals
Let's start with how we have historically approached auth at BuzzFeed, and the auth infrastructure problems we have faced as we continue to grow into the organization that we are today. BuzzFeed reached a point where it was apparent that it would need to be able to build fast to be able to support its business growth. In order to do so, we invested time in building scalable infrastructure. The first big improvement to our infrastructure was the introduction of a homegrown tool to standardize the way we deploy and test our services, called Rig.
This is what Rig looks like. Rig is a platform our engineers use that creates an opinionated workflow for creating, deploying, and testing services. It leverages containerization and is built on Docker and Amazon ECS. A user pushes a change to the monorepo on GitHub, a Docker image is built and stored in Amazon ECR, and then the service with the built image can be deployed from a separate UI that we call Deploy UI to staging and production environments.
Having Rig made creating new applications easy, and it allows developers to focus more of their efforts on actually developing software. Rig was developed with some goals in mind, and these same goals have influenced the way we write much of our infrastructure and develop our tools. We want the tools we create to provide consistency across the org. Having consistency allows for a unified developer workflow that we can document, makes it easy to onboard new developers, and enables mobility across teams.
From an operational standpoint, we want our infrastructure engineers to have a simple experience, automating as much of the workflow as possible. From a security standpoint, we want access control to our infrastructure to be as granular as possible, so that only those who need access have access. Lastly, we want the experience for developers using these tools to be as pleasant as possible.
Going through our goals with Rig, we get consistency in our deployment workflow across all services, and managing and operating this infrastructure becomes much easier because it's all automated. We have granular access control because we have different keyrings and permissions on deploying services. Lastly, developers are happier because they can deploy their code themselves, get build and test results faster, and be more productive overall.
With Rig, we were able to build even faster. Here's a bit of a breakdown of the current state of our tech ecosystem. We have around 600 services that are deployed on Rig, and they can be broken down into a few different categories, 250 UI services, 150 API services and 150 message queue reader services for publishing and consuming messages.
Let's focus on these 250 UI services, these are front end internal tools, like a UI for deploying services or a tool for video producers to be able to upload and edit their videos. This means that they need to be secured, which brings us back to this question, "How do we secure our services?" Well, we considered a few auth options against our infrastructure goals, with the ideal solution being something that would check all of these boxes.
The first option that seemed like it would be a reasonable choice was a VPN. Many organizations have their employees connect to a VPN to be able to access whatever services they need within the private network. In terms of infrastructure goals, a VPN didn't seem like the best option. A VPN may provide consistency in our infrastructure, but I'm also sure many of us know how annoying it is to always have to connect to a VPN, let alone have to set up, maintain, and monitor the infrastructure for one. We'd also still need an additional solution for granular access control, as a VPN would not give that to us on its own.
Another option was to punt on coming up with a way to secure our services, and allow service owners to bake that into their applications however they might see fit. This inconsistency was something that we did not want. From a security observability perspective, we would have no insight into how secure our services were. There could be granular access control, but that would be up to the service owners. Instead of solving the problem with one solution, every developer would need to relearn how to solve this authentication problem over and over again.
Identity Aware Proxy
Our last option that we considered was something called an Identity-Aware Proxy. This comes from Google's BeyondCorp philosophy, based on the principles of zero trust networks. What are all these things, you may ask? Well, let's go back almost a decade, to early 2010 again, where Sneaky Panda, the Elderwood Gang, and the Beijing Group are suspects in a series of cyber-attacks dubbed Operation Aurora. These attacks compromised the networks of 30 companies, including Yahoo, Adobe, Morgan Stanley, and Google, to steal their data. Google was actually the only company to disclose the breach, in a blog post released in January of 2010.
As an interesting aside to this, they mentioned in their blog post that they had evidence to suggest that the attack was to access the Gmail accounts of Chinese activists, this is actually what led them to cease their operations in China.
Why is this relevant? Well, this event was the impetus for an industry-wide shift in our approach to security, moving from a model of relying purely on a very secure perimeter to a more distributed and granular form of security. It was during this time that the philosophy of zero trust networks came to be.
Basically, Sneaky Panda was able to figure out a way to get past the perimeter in this network, whether because of an unreliable VPN provider, stolen credentials from Pacman, or a poor encryption algorithm on a security protocol. Once Sneaky Panda got through, the whole network would be vulnerable to attack. This is what zero trust networks aimed to prevent, based on these three tenets.
First, network locality is not sufficient for deciding trust in a network. The network is always assumed to be hostile, and external and internal threats exist on the network at all times. Second, every device, user, and network flow should be authenticated and authorized, meaning granular access control. Lastly, policies must be dynamic and calculated from as many sources of data as possible. Having good monitoring and flexibility in the system is important.
Using this philosophy, Google created their BeyondCorp philosophy, which is defined in a white paper that was written in December of 2014. From this white paper came the concept of an Identity-Aware Proxy, which is considered a building block towards a BeyondCorp architecture model.
The purpose of an Identity-Aware Proxy is to shift authentication and access control to be based on the user, rather than on what network the user is in. It does this by having the proxy service sit between the user and the upstream. It uses a third party provider, like Google, to authenticate and authorize the user trying to log into the service.
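To make the idea concrete, here is a minimal sketch in Go of what an identity-aware proxy does conceptually. This is not BuzzFeed's implementation; the host names, cookie name, and helper function are hypothetical, and real session validation is omitted.

```go
// A minimal sketch of the identity-aware proxy idea: trust is decided per
// request from the user's session, not from the network the request came from.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical upstream address; in practice this comes from configuration.
	upstream, _ := url.Parse("http://pac-land.internal:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Only forward requests that carry a valid, unexpired session
		// established via the third party provider (e.g. Google).
		cookie, err := r.Cookie("_sso_session")
		if err != nil || !sessionIsValid(cookie.Value) {
			// No trusted identity: send the user to the auth provider
			// instead of the upstream, regardless of where on the
			// network the request originated.
			http.Redirect(w, r, "https://auth.pac-world.io/sign_in", http.StatusFound)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	http.ListenAndServe(":4180", nil)
}

// sessionIsValid is a placeholder for verifying and decrypting the session
// cookie and checking the user's authorization (e.g. allowed groups).
func sessionIsValid(value string) bool {
	return false // real logic omitted in this sketch
}
```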
With this kind of model, if Sneaky Panda gets into the network, he still can't really get access to any service within the network secured by this Identity-Aware Proxy, without maybe having the user's credentials and MFA. With granular access control, if Sneaky Panda gets Pacman's credentials and MFA, then only those services that Pacman had access to would be vulnerable. Hopefully, critical services that have admin access would be protected and inaccessible to Sneaky Panda.
This was everything that we wanted: consistent, easy to maintain, granular access control, low developer setup overhead. Now we just needed to figure out which one to use. Like any good developer, we looked to open source. The implementation we ended up using is a Go binary from Bitly called Oauth2_proxy. Every service would have an auth proxy service in front of it, which runs the Oauth2_proxy binary with the appropriate configurations.
Scaling Problems
This was great for a while, when we only had a few user-facing services, but we soon started to question its scalability as more user-facing services, therefore more auth proxy services, were created. We started to see more and more scaling problems, felt all around. To start, the users accessing the services behind Oauth2_proxy would have a frustrating experience. Many of our users use multiple tools in our microservice ecosystem. For example, a video producer might use one tool to upload and package their videos, another one to see when they'd be published to the appropriate platforms, another one to get metrics on how the performance of the video did.
This was not only frustrating for users, because they'd have to click through and sign into every new service. It also enforced bad security practices, as users would blindly click through all the auth proxies without actually checking that they're using their credentials for the correct service.
Developers and operators also felt scaling problems. For developers creating a new service, the process of adding auth meant copying over this boilerplate Oauth2_proxy template and modifying config values. To most developers, what these configurations meant was confusing and completely opaque. Then maintaining these Oauth2_proxy services was a frustrating experience as well.
For example, there was a critical security fix for Oauth2_proxy, and that meant updating every single Oauth2_proxy service, and grepping our code base for every single one of them. This was not only tedious for the engineers maintaining the services, it also created a larger surface area for a potential attack.
Also, Oauth2_proxy did not have any metrics tracking baked into it, so debugging an issue like that critical security fix was not easy since we had no visibility into that part of our system. Adding that visibility would be tedious, because it meant changing it in every single Oauth2_proxy service. We were hoping that our next solution would be able to improve our infrastructure.
With that, we realized that even though we got consistency and granular access control, doing anything with the Oauth2_proxy services was a pain. We decided that we wanted to build something new. When coming up with our new solution, we did not want to stray too far from what we already had. We decided to use the logic of Oauth2_proxy to create what we like to call the SS-Octopus, or SSO, a single sign-on version of Oauth2_proxy.
What Is SSO?
What is SSO? Well, SSO is an implementation of an established standard protocol called the CAS protocol, which stands for Central Authentication Service. It consists of a CAS server, which is responsible for authenticating users and granting access to the services, and a CAS client, which protects the service and uses the CAS server to retrieve the identity of the user. SSO has these two binaries: sso-auth is the CAS server, and sso-proxy is the CAS client.
This is a sample auth flow; I'm going to go into a little bit more detail on all the different parts of this flow now. First, there's sso-proxy: it acts as an Identity-Aware Proxy for the upstreams, sitting between them and the user. It stores a short-lived session cookie that refreshes every minute or so with sso-auth, and this whole refresh process is done behind the scenes.
Then there's sso-auth, which acts as the auth provider for sso-proxy, and it uses a third party provider like Google to authenticate the user, and stores the user identity information in a long-lived session cookie that lasts about 15 days. When this session cookie expires, the user is redirected to a sign in page, and then must sign in with Google again.
This is because the third party auth provider is the source of truth on information about the user. It provides the user identity information to sso-auth, which in turn provides the user information to sso-proxy. We use Google as our provider, and we're working with the community to support other providers, like GitHub, Okta, and Azure AD.
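As an illustration of how these two session lifetimes interact, here is a small, hypothetical Go sketch, not the actual sso code; only the roughly one-minute refresh and the roughly 15-day lifetime are taken from the talk.

```go
// Illustrative only, not the actual sso code: sso-proxy holds a short-lived
// session it refreshes with sso-auth roughly every minute, while sso-auth
// holds a long-lived session (about 15 days) backed by the provider.
package main

import (
	"fmt"
	"time"
)

type session struct {
	email       string
	refreshedAt time.Time // last refresh between sso-proxy and sso-auth
	issuedAt    time.Time // when the user last signed in with the provider
}

const (
	proxyRefreshInterval = 1 * time.Minute     // sso-proxy -> sso-auth
	authSessionLifetime  = 15 * 24 * time.Hour // sso-auth -> provider
)

// nextStep decides what should happen on an incoming request.
func nextStep(s session, now time.Time) string {
	switch {
	case now.Sub(s.issuedAt) > authSessionLifetime:
		// Long-lived session expired: the user must sign in with the
		// third party provider (e.g. Google) again.
		return "redirect to provider sign-in"
	case now.Sub(s.refreshedAt) > proxyRefreshInterval:
		// Short-lived session is stale: refresh behind the scenes with
		// sso-auth, so revoked users are cut off within about a minute.
		return "refresh session with sso-auth"
	default:
		return "proxy the request to the upstream"
	}
}

func main() {
	s := session{
		email:       "pacman@pac-world.io",
		refreshedAt: time.Now().Add(-2 * time.Minute),
		issuedAt:    time.Now().Add(-24 * time.Hour),
	}
	fmt.Println(nextStep(s, time.Now())) // refresh session with sso-auth
}
```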
Then there are the upstreams, and these are the services that are secured by SSO. They're defined in a YAML file, because we wanted to get on the YAML bandwagon as well. Developers can add their services to this shared file, which starts off with a mapping from the internal service address to sso-proxy's address. Then we have other options like allowed groups, which allows for group-based authorization, or request timeouts. They can also set up overrides for addresses to allow for a better user experience, so that a user could just go to pac-land.com instead of having to go to pac-land.sso.pac-world.io, in this example.
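For a rough idea of what such a shared file might look like, here is a small Go sketch that parses a made-up upstream entry. The field names and values below are approximations for illustration, not the exact schema sso uses, so check the project documentation for the real format.

```go
// Parses a hypothetical upstream entry of the kind described above.
package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

const exampleConfig = `
- service: pac-land
  default:
    from: pac-land.sso.pac-world.io    # address users hit through sso-proxy
    to: pac-land.internal.pac-world.io # internal service address
  options:
    allowed_groups:
      - pac-eng@pac-world.io           # group-based authorization
    timeout: 10s                       # request timeout
`

type upstream struct {
	Service string `yaml:"service"`
	Default struct {
		From string `yaml:"from"`
		To   string `yaml:"to"`
	} `yaml:"default"`
	Options struct {
		AllowedGroups []string `yaml:"allowed_groups"`
		Timeout       string   `yaml:"timeout"`
	} `yaml:"options"`
}

func main() {
	var upstreams []upstream
	if err := yaml.Unmarshal([]byte(exampleConfig), &upstreams); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", upstreams[0])
}
```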
This is the new experience for someone logging into two services. In this example, I'm logging into our Deploy UI service, which I use all the time, and get directed to log in with Google. After signing in with Google, I get directed to the Deploy UI page, and then can go ahead and deploy some services.
Now, say I want to check the service that I just deployed. I'm going to open a tab and go to Httpbin, which is a sample service that we have behind SSO. This time, I immediately get redirected to Httpbin. Sso-proxy refreshes every minute, so if my account were to get hacked by Sneaky Panda, by the time I leave the stage it could be locked down, and neither I nor Sneaky Panda would be able to access either service.
Having SSO in production checks all the boxes for our infrastructure goals. With centralized auth, operating our infrastructure is a much simpler, streamlined process. Rather than having to maintain hundreds of individual auth proxies, our infrastructure engineers only need to focus on one code base and have those changes be reflected in all of our services. This has made it easier to ship bug fixes, to add better monitoring and metrics tracking, and to generally audit all of the services that we have secured by SSO, without having to grep for days and days.
Open Source? Why?
That's a little bit about SSO. It's been in production for a few years, and we have hundreds of services behind it. After a couple of years of having this in production, we made a decision to open source. This decision was not taken lightly; open sourcing security is scary, and you don't want to be giving away the keys to your kingdom. We realized that open sourcing SSO shows the lock to our kingdom without actually giving away the keys. Still, we had many justifications to make before open sourcing SSO.
First, SSO was born out of an open source project, and so it only seemed natural to give back to the community. We understood, from talking to folks in similar roles at other companies, that the need for centralized auth was a common problem among platform engineering teams. We learned that many teams had built their own solution internally, because there was no ideal open source solution. We hoped to work together in the open to tackle this. We also knew empirically that Oauth2_proxy, from which SSO was originally forked, has a large and active community of users. We felt confident that SSO would achieve similar traction.
Finally, we believed that granting access to our code base would improve our security practices; I'll discuss how we did this in the next section. Security encompasses a variety of risk factors that you can't really prepare for. We know that it's really difficult to get right, and we hoped that this transparency would shine a light on our security footprint.
Open Source? How?
When thinking about how we're going to execute this open source, we made sure to consider what would make an open source project successful, and what we could do to mitigate the risks surrounding open sourcing a significant piece of our security infrastructure.
To start, we made sure that we had significant and substantial documentation to make getting started with SSO as easy as possible. We focused a lot of our efforts on a quickstart guide to lessen the barrier of entry for users to use the project. We think that having this quickstart guide was a major part in its success for getting people to start contributing and using SSO.
Another thing that we decided was important to do was to change the name of our project. The original name was COAP, for Centralized Oauth Proxy, and we didn't think this was a suitable name, for two reasons. First, we wanted to step away from this acronym, because we wanted the name of this tool to be something that was inclusive of all communities that might want to use an open source tool. We also thought that, for the future of the project, we wanted to expand past Oauth, and rebranding would allow us to do that.
I started by polling engineering on naming suggestions, and got some really interesting ones: Soap, Cap. This is what happens when you don't ask a marketing team and instead ask a bunch of engineers to come up with names. We ended up going with something simple and self-defining, SSO, because it seemed like the right thing to do.
In addition to these refactoring and documentation changes, when beginning the open source process, we had larger conversations around what risks we were going to be taking by open sourcing SSO, and what we could do to mitigate these risks. We decided to take a three pronged approach to auditing the security of our project.
One tool that we take advantage of, for many different parts of our code base, is a third party bug bounty program called HackerOne. It allows us to pay for security vulnerabilities found by security experts. Since the start of having SSO in production, we have been using HackerOne for SSO. Before open sourcing, we contacted a few known hackers who we worked with, and gave them access to the code to see if they could find any vulnerabilities by having the two side by side. We also hired a third party consultant to pen-test SSO and provide code review.
We vetted a few companies based on cost and process, and we decided on one that worked for us, and gave them access to the code base for a one-week review. While nothing significant came up from this, we were happy that we had the peace of mind of getting this review done.
Lastly, we have an in-house security consultant who did an architectural review of SSO, and actually found something interesting in the way we encrypt our session states and cookies.
Some background: we encrypt our session states, which contain the user email and the access and refresh tokens generated by the third party provider, and we store this encrypted session state in cookies on sso-auth and on sso-proxy, which is what powers this whole auth flow. We previously used AES-GCM encryption, but our security architecture consultant informed us that this type of encryption is not resistant to a nonce-reuse misuse attack. If a nonce, a number used once, is reused, the XOR of the plaintext messages is leaked, so Sneaky Panda can see your token and hijack your auth session. Not good.
There is a different type of encryption, called AES-SIV, that has nonce-reuse misuse resistance. Ciphers with this type of resistance don't fail catastrophically when the nonce is reused. The only information that's leaked when the same nonce is used twice under the same key is whether two messages were identical, rather than their plaintext contents. We ended up going with an open source package called Miscreant that implements AES-SIV. This was a great learning experience on security best practices for our team, and made us feel more confident going forward with the open source.
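For context, here is a short standard-library sketch of the AES-GCM pattern being discussed; the session payload and key are placeholders. It shows why a fresh nonce per message matters. The project's actual fix, as described above, was to switch to AES-SIV via the Miscreant package, which tolerates nonce reuse far more gracefully.

```go
// Sketch of AES-GCM session encryption. The nonce passed to Seal must never
// repeat under the same key; with GCM, a repeated nonce leaks the XOR of the
// plaintexts, which is the failure mode described in the talk.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

func encryptSession(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	// A fresh random nonce per message; reusing one under the same key is
	// the "nonce-reuse misuse" problem.
	nonce := make([]byte, aead.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so it can be recovered at decryption time.
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

func main() {
	key := make([]byte, 32) // demo key only; use a real secret in practice
	ciphertext, err := encryptSession(key, []byte(`{"email":"pacman@pac-world.io"}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes of nonce+ciphertext\n", len(ciphertext))
}
```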
After going through this security audit process, we felt more prepared. Beyond preparedness, understanding that security is never completely done was crucial. Our team has a learning and growth mindset about all of our work. This includes acknowledging that unknown unknowns exist, and that we're going to have to continuously adapt. Nothing is ever 100% guaranteed to be secure, but careful planning, good communication, and clear expectations allowed us to assuage our initial fears.
Then we finally open sourced SSO, and it went really well, we are really proud of it and everyone who's worked on it. The open source process has been really beneficial for us. While it often seems like open sourcing an internal security project involves more risk than reward, we've had an incredibly powerful and positive experience overall.
Maintaining SSO
After the open source happened, we recognized that there was a lot of work to be done maintaining this project, and we made this a priority for our team. To make maintaining SSO sustainable, we took a few approaches. First, we created a maintainers working group that meets once a week. It's made up of people from the infra teams, other stakeholders, those who are just interested in learning Golang, and those who are just interested in contributing to SSO. In our meetings, we discuss all of the issues and pull requests that come in during that week. Sometimes we also do knowledge shares on new and existing features.
From this working group, we also came up with an on-call rotation that rotates through the members of the group, and we established internal SLAs for responding to issues and PRs. At the same time that we established those SLAs, we also made sure to have an external contributing doc with guidelines on how to contribute to the repository.
This is our contributing guide; it includes a link to the code of conduct and step-by-step instructions on how to get started. Our goal is to figure out ways to lower the barrier to entry to be able to contribute to and use SSO. We try to be clear about what we will and will not accept in issues and PRs, and try our best to lean towards an optimistic merging policy.
This term was coined by Pieter Hintjens, and essentially says that it's better to accept and merge all pull requests that come in, in a timely manner. This can be difficult, especially when maintaining a critical piece of our security infrastructure. We tend to fall somewhere in the middle, between that and a rigid, pessimistic merging policy, because we want to make sure that contributors know that they're making a difference by contributing, which will hopefully encourage them to continue to do so.
Maintaining an open source project is a lot of work, and required us to invest time and energy in approaches to make it sustainable. In the end, knowing that the work that we put into maintaining the project has made people's lives even a little bit better is worth it.
That's a little bit about how we approach our security infrastructure and why and how we ended up building an open source solution for microservice auth at BuzzFeed. With the SS-Octopus, PacFeed is able to have a centralized solution to microservice auth, following best infrastructure practices and making sure that only authorized users can access the tools that they need to.
If you'd like to try out SSO, it's there. We work on actively maintaining it, and we would love any feedback in the form of issues or pull requests.
Questions and Answers
Participant 1: After centralizing all the authorization you have before distributed, did you face any issues with any latency in the request or something?
Ramani: No, not really. Oauth2_proxy and SSO have the same mechanism, so there wasn't really any added latency. The session cookie is stored in the client and then in sso-auth, so it's the same flow; they're just cookies that are encrypted and stored in the headers.
Participant 1: From what I saw there, you are specifying all the permission stuff in a YAML file. Do you have any way to make the checks dynamic or something?
Ramani: That's something that actually we've gotten some requests for in issues, and something that we haven't really been able to prioritize, but would love to have people contribute.
Participant 2: Could you elaborate a little bit on what made you select CAS as your authentication protocol, in contrast to the others, SAML, or OIDC, or whatever?
Ramani: Yes, I think it was because it worked the best with our existing Oauth2_proxy solution. When we were developing it, we wanted something that wouldn't be too far out of what we already had going on. We did consider other options, but I think that was a big draw, and part of what justified building our own solution.
Moderator: Just to clarify something, you showed the administration, how you configure the service for the upstreams, and how you manage which users have access to that. Is that managed like a Google Group for the emails? Because you just showed the emails, I wanted to make sure.
Ramani: Yes, there's group-based authorization, and we manage that using the Google Admin API. We've also been getting a lot of requests for having just email-based authorization in the YAML file itself. That's something that we've been discussing, and I think we've had some pull requests opened for that, so we're trying to get that in.
Participant 3: Assuming that Sneaky Panda is already on your network, how do you stop them directly accessing any of the internal services without going through the proxy?
Ramani: If you don't have the Oauth proxy, how do you...?
Participant 3: I assume that the Oauth proxy is external facing, and then it's redirecting internally to those services. If they were already on your network, why can't they just directly connect to those services?
Ramani: We have IAM rules on our tasks, and everything in our infrastructure is also secured, so we have additional security in our infrastructure on top of the proxy.
Moderator: I think if I can take a crack at it, the challenge is that you can isolate a network inside. I mean, you can prevent anybody from accessing it, but you can't really, because you need your developers to access it. The challenge is, "Ok, now you need to open it up," and the natural motion is to say, "Ok, I'll just open it up so that a privileged machine, like a developer's machine, will just be able to SSH into it," and that's when you get in trouble.
What this does is it says, "Ok, the systems don't open up to any network." They don't care about network access; these are the upstreams that are protected. They allow access through a specific form of authorization, and that form of authorization is, in turn, authorized or not through this proxy. If you compromise the developer's machine, and they haven't logged in with their two-factor auth and all of the components, they wouldn't have access; similarly, a malicious attacker who compromised that developer's machine wouldn't either.
The network segregation is already past network segregation, it's already its own network, and the question is, "Who can access that network?" It's not about the service being compromised, it's about the developer accessing the system being compromised.
Participant 3: Ok, so the access token is getting forwarded on to the actual internal service, and that's where the authorization is happening.
Moderator: Yes, I think so. Maybe we can do another update.
Participant 4: Once you made this repository open source, did you find it expensive to maintain, considering you might have to go back to someone like HackerOne as contributions come in, to find out from them whether you're still compliant with good security practices?
Ramani: Because we use HackerOne for a lot of different parts of our organization, we haven't really seen an uptick in HackerOne reports being opened after open sourcing SSO. Who knows? Maybe we'll get a security issue at some point in time, but we haven't really had any cost upticks after open sourcing SSO.
Moderator: Maybe I'll add that HackerOne is free for open source, so in this case that works for BuzzFeed as well. I think BuzzFeed uses HackerOne premium internally. If you have an open source project — I don't know if they're here or not — I believe that it's free, or at least there is some scheme for it to be free for open source.
Participant 5: How do you actually protect flows that don't involve external users, that is service-to-service communications?
Ramani: For our service-to-service communications, we don't use SSO. We actually have an internal API gateway that we use. SSO is mostly just for user-facing services, so that employees can access the UI services that they need to.
Moderator: I'd say probably the leader in the open source space is Envoy Proxy, maybe as part of Istio; that is the one that's most discussed, though I haven't had experience with it.
You mentioned Rig and SSO — what other sort of top open source projects do you consume, or have published, that you want to share about?
Ramani: Rig isn't open sourced yet, but maybe one day. This is actually the first big open source project that BuzzFeed's released, we have some Android libraries that we've open sourced in the past. I think we're trying to use the same model to open source some of our storage infrastructure applications, actually, Felicia works on that stuff.
Participant 6: Thanks for the talk, first of all. I wanted just to ask about one point, when you mentioned, regarding the Oauth2, where the proxy had to be set up manually for each application. How does it work with SSO? Is it like a single key, handwritten?
Ramani: Yes, instead of having to add all this boilerplate code, we have just a shared YAML file. When a service is created, you can add just a blob of YAML that will have it configured and ready to use with SSO.