Architecting a Production Development Environment for Reliability

Summary

At Meta, developers use servers (devservers), including virtual machines and physical hosts, and On Demand containers to perform their daily work. In this talk, we discuss their software architecture and the mechanisms we employ to ensure that they closely address our engineering needs, are kept up-to-date, remain reliable and available, even in the presence of maintenance workflows and outages.

Bio

Henrique Andrade is a Software Engineer, currently disguised as a Production Engineer, who leads the Developer Environments production engineering team, focusing on the reliability and stability of the development platform used daily by most of the software engineering workforce at Meta.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Henrique Andrade: My name is Henrique Andrade. I am a Production Engineer at Meta.

We will be talking about the development environment infrastructure that we have at the company.

The main focus of this conversation is what engineers joining the company can expect from the development environment they will be using to do their work.

It doesn't really matter here how senior an engineer is. The environment we put in place is designed to tackle the needs of software engineers and production engineers, irrespective of their seniority. It is meant to be generic, and most of our engineers are going to be exposed to it.

We are also going to be talking about the main tenets that make this environment reliable, so you, as a developer, joining the company, don't have to worry about OS upgrades, your server going away due to random failures, or any maintenance operations that might change the server's state.

All of these operations, which could be disruptive to you in a self-managed environment, won't usually be of any concern. In this way, you should be able to work on your software projects without much friction.

We are going to cover a lot of ground in this talk. One very large theme is the infrastructure that makes this environment as reliable as it can be.

Finally I want to say that I am here as a messenger and as someone who has used and helped build this infrastructure. Naturally, there is a large team of engineers that have been developing and supporting our development environments for many years now. I also want to take this opportunity to thank them for their hard work and their thoughtfulness in striving to continually improve the software stack behind our services.

Outline

I am going to give you an introduction on what this talk will cover.

First, we will talk about options. As someone who is just joining the company, what are your alternatives in terms of the environments that you have at your disposal to do code development at Meta?

Second, and before we go deeper into the design of these environments, what is the role of production engineers in helping develop and operate these services for the software engineering community at the company? In essence, why are PEs part of the DevEnv team, the team that provides these environments?

Subsequently we are going to talk a little bit about the development environment architecture. We will talk about servers, our devservers. We are going to be talking about containers. We are going to be talking about tooling. We're also going to be talking about the user interface where you interact with these environments.

Following that, we are going to be talking about a few challenges behind supporting this development environment architecture at the scale that we do.

Towards the second half of the presentation, we are going to be talking about designing for reliability. As a software engineer, you really don't want to be concerned about how reliable your development environment is - you just want it to be reliable.

On that note, what makes this environment reliable?

As part of this discussion, we will be talking about how intentional we are in terms of disaster preparedness.

A lot of the reliability built into the design of these environments comes from constantly thinking about what can go wrong and how we can smooth the user experience when facing outages, whether an outage is the result of a disaster recovery exercise or an actual external disaster.

Finally, I am going to conclude this talk with a few lessons learned and a discussion of our future challenges.

What Is It Like Developing Code at Meta?

What is it like developing code at Meta?

If you are just joining the company, as is the case at most companies, you will go through onboarding training.

Meta refers to this training as bootcamp and it provides an overview of the technical landscape, including our developer environments, to the incoming software engineers. In other words, a new engineer is not going to go directly to their team to start working, they will spend a few weeks becoming familiar with our internal terminology, technology, as well as our internal processes.

While going through this bootcamping exercise, they will be introduced to some of Meta's core technologies, the processes to get things done at the company, and everything else that an engineer will need to know to be productive when they actually join their team, from day one.

At Meta, we have a large community of internal developers. Developers can be software engineers or they can be production engineers, like myself. These two are the main groups that the developer environment infrastructure is designed for. But we also have data engineers, and data scientists. All of these individuals will be using the same developer environments. We also have enterprise engineers and hardware engineers, and our team's services also provide partial support for their functions as well.

For those who don't know, enterprise engineers are the software engineers that design and implement the internal systems that support functions like HR, finance, and other corporate organizations.

We also have hardware engineers working on designing chips, accelerators, and all of the gear that Meta has in terms of the hardware infrastructure for our data centers as well as consumer products.

Given this large community of developers, the main focus of the development environment organization is on providing highly usable services and scalably supporting the user community and the fleet of servers behind these services.

Our main goal is to minimize friction, so as a developer, you are able to instantiate your own development environment, fire up VS Code or your favorite editor, and be productive right away.

How do we make this happen?

The first order of business is to quickly onboard the developers. As an engineer, you get all that you need, on day one. And it should all be ready to go.

In other words, you are not going to spend a week trying to configure your environment or install a piece of software that you need. All of these things should be there, automatically, for you.

This means having access to up-to-date tooling - and it all should be there for you, automatically.

When it comes to the actual coding, you want to be able to put in your code changes as easily as possible. The tooling to access source control, repositories, the tooling that is necessary for you to create pull requests (or diffs, as we refer to them), all of these things should be there for you right away.

Nevertheless, if you do need to make configuration changes to the environment that you are working on, all of that should also be automatically persisted. You shouldn't have to do anything special from then on should you move to another devserver.

More importantly, you should (mostly) be insulated from external environmental changes that could otherwise affect your development workflow.

The environment is meant to be as stable as possible. If there is maintenance going on, if there are changes in the source control system, or new versions of tooling being installed, all of these operations should be more or less transparent to you.

As a developer who just joined or who has been with the company for a while, you will have something tailored to your needs, stable, and ready to go more or less immediately.

Main Offerings

What are the main offerings that we have in the development environment platform?

There are basically two choices. The first option is what we call devservers. The second option is what we call On Demand containers. We are going to discuss what their tradeoffs and differences are.

The devserver option consists of a Linux server that is located in one of the Meta data centers. Engineers have a choice of which data center to use. In other words, a devserver can be selected based on the geographical location that is most suitable with respect to the engineer's own physical location, to minimize latency.

We offer different flavors of devservers, from VMs, which come in different sizes in terms of CPU processing power, memory, and storage, to physical servers, for engineers who need that. For instance, engineers who are working on low-level kernel development, native device drivers, or OS optimizations might need physical servers.

Then there are certain special flavors of devserver hardware. For example, if an ML engineer is working on software that requires access to a GPU, or if they have the need to access certain types of accelerators or network interface hardware, there are certain flavors of devservers that they can pick from, which might be more suitable for their work.

In terms of a devserver's lifespan, when an engineer reserves one, they might request it temporarily with a short-term lease, or they might reserve it permanently.

For instance, suppose that an engineer is working on a short-term project to improve a device driver. In this case, they might need a physical server just for a couple weeks.

But an engineer also has the choice to permanently reserve a server, because she is going to be using that server, continuously, throughout her career at Meta.

The interesting thing about devservers is that they run in the production environment and network. As an engineer is testing and debugging their code, they are doing so in the same environment that gives them access to everything that powers the Meta infrastructure, easing the ability to reproduce and debug problems in a setting that is similar to where the software will ultimately run.

An engineer can also choose to have certain utilities and tooling pre-installed. We will talk about provisioning in more detail later, but there is a way for an engineer to request every feature and tool they need and have them pre-installed on other servers they might get in the future. In other words, if you need to get a new devserver later on, all of these tools will be automatically pre-installed.

When it comes to using a devserver, an engineer has remote terminal access, so they can SSH directly into the box. Alternatively, they can use VS Code and connect to the devserver remotely and work on their laptop like they are connected directly to that devserver.

Every devserver is managed by a provisioning system. This means that they are continuously kept up-to-date. In other words, if there are updates to external or internal software, these updates are, for the most part, automatically deployed. In a nutshell, all of the upkeep that is necessary to maintain a devserver is done for the owner, automatically.

As we had mentioned, the devservers have default access to internal resources, but they do not have direct access to the Internet. There are tools and infrastructure to provide that access when needed, but that is not necessarily available out-of-the-box.

In general, we try to minimize wide open access to external resources, because such access introduces potential risk as well.

There is also the ability to install pre-defined development "features". By features, we mean software packages and the infrastructure around these packages that might help, or be needed for, an engineer's development workflow.

There is also what we call one-offs. These are user-specific configurations or tooling that you, as a developer, might be using. For instance, a spellchecker that you like or an editor that you are more familiar with.

You can also set this up so that it is installed and configured on any devserver that you get from this point forward.
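
To give a feel for this, here is a rough sketch of what a per-user manifest of features and one-offs might look like - the format, field names, and tooling below are hypothetical illustrations, not our actual internal system:

# Hypothetical sketch: "features" are curated packages plus configuration,
# "one_offs" are personal tools; any new devserver the user requests would
# be provisioned from the merged manifest.
from dataclasses import dataclass, field

@dataclass
class DevserverManifest:
    owner: str
    features: list = field(default_factory=list)   # e.g. ["www_sandbox", "android_sdk"]
    one_offs: list = field(default_factory=list)   # e.g. ["neovim", "aspell"]

    def merge(self, other: "DevserverManifest") -> "DevserverManifest":
        """Combine, say, team defaults with personal one-offs."""
        return DevserverManifest(
            owner=self.owner,
            features=sorted(set(self.features) | set(other.features)),
            one_offs=sorted(set(self.one_offs) | set(other.one_offs)),
        )

team_defaults = DevserverManifest(owner="alice", features=["www_sandbox"])
personal = DevserverManifest(owner="alice", one_offs=["neovim", "aspell"])
print(team_defaults.merge(personal))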

Devservers can also be shared. Sometimes you are working with a team and need to closely collaborate with team members by sharing a server.

For instance, when you were just hired, you might be working with someone else in your team or in a different team. Having a shared server might reduce friction when it comes to testing and debugging code and, hence, you might choose to share access to a devserver with that team member.

There is also the ability to migrate between devservers. Suppose that for one reason or another, you need to get a bigger devserver, or you might need to get a devserver in a different region. In these cases, you can migrate from one to the other quite easily.

One thing that is important to highlight here is that some devservers are virtual machines, layered on top of the same virtualization infrastructure that powers Meta. There isn't anything special about them and this is something that will be important later on as we discuss how rolling maintenance takes place in our server fleet.

The second offering we have is the On Demand containers.

The interesting thing about containers is that they are pre-warmed and pre-configured with respect to source control repositories. In other words, the repositories that an engineer might be working on, linters, and other software development tooling are all there, in their most recent versions, when they log in. But the container instances themselves are ephemeral, lasting a little over a day.

Another interesting aspect is that On Demand containers present an ephemeral environment, for a specific development platform. For instance, if an engineer is doing iOS development or Android development, this engineer is going to get all of the tooling that they need to do development for that particular platform.

The On Demand containers are also available with multiple software stacks and hardware profiles. This means that the amount of memory, whether they have GPU access, and whether they have access to certain accelerators or source code repositories are all options that engineers can choose from.

On Demand containers are primarily accessible via the VS Code IDE but can, alternatively, be accessed via a CLI, which provides regular terminal access.

Choosing a suitable developer environment really depends on how an engineer prefers to work.

On Demand containers are optimized for certain workflows and the software stack an engineer might be working on. As we said before, iOS, Android, Instagram, mobile development, or Jupyter Notebooks, whatever an engineer might be working with usually has an associated On Demand type.

Some On Demand containers also include a web server sandbox (as do devservers). This sandbox replicates the prod environment for Meta's large product surfaces. So, suppose that an engineer is making changes to the Facebook desktop backend software. The sandbox on an On Demand container provides a replica of that environment at their fingertips, which they can use to test their code changes, locally.

While this is also true for devservers, the point here is that an On Demand container is both fresh and ephemeral. It's up to date and it's ready to go at a click of the mouse.

Similarly to devservers, you can also further configure your On Demand environment. Suppose that you need certain development features. Again, a feature in this context is a piece of software plus the configuration associated with it; if you want that delivered to your container, you can have that as well.

This container infrastructure is layered on top of Meta's Twine, which is somewhat similar to Kubernetes. Twine provides both the container infrastructure and the orchestration that goes with it. If you're interested in learning more about this, there is a good talk presented at @Scale 2019 that goes deeper into Twine's architecture and capabilities.
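
Twine's actual specs and APIs are internal, but purely as an illustration, an On Demand request could be modeled along these lines (every name and field below is a hypothetical assumption):

# Hypothetical model of an On Demand container request layered on a
# Twine-like orchestrator; field names are illustrative only.
from dataclasses import dataclass

@dataclass
class OnDemandSpec:
    platform: str           # e.g. "android", "ios", "www"
    hardware_profile: str   # e.g. "default", "gpu", "high-mem"
    region_preference: str  # picked to minimize latency to the engineer
    ttl_hours: int = 36     # ephemeral: tasks last roughly a day and a half

def request_on_demand(spec: OnDemandSpec) -> dict:
    """Pretend scheduler call: hands back a pre-warmed container."""
    return {
        "task_id": f"{spec.platform}-{spec.region_preference}-0001",
        "expires_in_hours": spec.ttl_hours,
    }

print(request_on_demand(OnDemandSpec("android", "default", "region-a")))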

Production Engineering

Why do we have production engineers involved in supporting the Development Environment services?

I am part of the team supporting the development environment services. As a production engineer, I want to highlight the fact that, as is the case for many internal services, PEs and SWEs work together on the software stack necessary to support the DevEnv services, sharing the codebase, support and oncall load, and working on the design of new features.

The interesting thing about Meta's internal services is that in many groups, production engineers are an integral part of the organization supporting those services.

In general, production and software engineers have different focus areas. While production engineers are software engineers, they are usually more interested in the production-related aspects behind a service, including the integration with other services and its day-to-day operations. PEs are usually the ones responsible for managing the service deployment and overall health, its interaction with other services, troubleshooting, and are also generally concerned with its scalability.

PEs also tend to have a bias towards working on a service's reliability, capacity planning, and scalability issues.

They are always focused on a service's deployment issues, on running upgrades efficiently, on configuration management, and also, on a day-to-day basis, performance tuning and efficiency.

Many teams at Meta have PEs embedded with them. Other companies have similar organizations like the SRE organization at Google or at Bloomberg, where I used to work.

These SWE/PE organizations provide an interesting mix of engineering talent necessary to efficiently run distributed services at scale. What do PEs do, specifically, in the DevEnv team?

Our main mission is to ensure that our internal developer community can work efficiently.

In companies like Meta, Google, as well as in other companies that are software intensive, the company's overall productivity is predicated on how productive the engineers who are writing and maintaining code are.

DevEnv PEs focus on the same top-level issues as any other PE team at Meta usually does. Yet, in our case, we have a particular concern with developer efficiency.

Not only do we want to make the services awesome, but they should also be as frictionless and reliable as possible.

For instance, as we mentioned before, if you are joining the company as a new software engineer, we aim to provide a development environment that is functional for you - from the outset, relieving you from having to spend a month trying to figure out how to get your server to build a particular piece of code. In a nutshell, everything is provided to you from the get-go.

The second thing is, we are obsessed with automation. We are a relatively small team, even counting the combined SWE and PE team members, tasked with supporting a very large community of thousands of software engineers, so automation is a must.

And with our community of software engineers being both talented and very opinionated, we want to make sure that the services are always reliable, work as expected, and are fast - and expose all of that good stuff necessary to get our community to be as productive as possible.

Indeed, the close integration and engagement between PEs and SWEs in the combined DevEnv team is actually part of the reason why we can provide such a reliable infrastructure to our community.

The services are implemented as a white box, meaning both PEs and SWEs understand the software stack and codebase behind the scenes.

The combined team contributes to the same code base; code reviews, planning and design workflows, and most other processes are shared across PEs and SWEs. Our combined leadership also strives to maintain very good synergy between the teams.

In other words, it is a shared pool of resources and talent used to maintain and evolve the services that we have in place.

Finally, our combined expertise even extends to sharing the oncall workload, working together by assembling the weekly oncall sub-teams, which are always a mix of PEs and SWEs.

Development Environment Architecture at a Glimpse

Let's talk about the development environment service architecture.

Since the central point of this talk is on how to make the development environment infrastructure reliable, we will discuss a few of its architecture underpinnings.

The first design principle was to reuse as many of the internal services as possible. In other words, the software stack used by DevEnv is not, on the whole, special-cased for our specific services.

The code base and its testing and deployment framework are organized in the same way as any other project or product at Meta.

At the lowest layer, we have the server hardware and the infrastructure for provisioning these servers. As mentioned earlier, an engineer can get an On Demand instance or a devserver in most regions where Meta has data centers.

Our services are layered on top of the same infrastructure that any other service at the company employs. This means that when it comes to server monitoring and server lifecycle management, we are under the same regimen as the hardware fleet supporting any other product.

When it comes to server provisioning, for example, there are mechanisms in place that allow us to specialize the provisioning system for our own purposes. These mechanisms consist of well-supported plugin interfaces to the software stack that will provision any server in the company, very much like a server that you would have in place to host a database system, or logging service, or any of the myriad of other distributed services in the company.
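
The plugin interfaces themselves are internal, but as a rough sketch of the idea - a hypothetical API, not the real one - a role-specific provisioning hook might look something like this:

# Hypothetical sketch of a provisioning plugin: the common provisioning
# pipeline calls role-specific hooks, so devserver setup is layered on top
# of the same steps used for any other server role in the fleet.
class ProvisioningPlugin:
    def applies_to(self, server_role: str) -> bool:
        raise NotImplementedError

    def run(self, hostname: str) -> None:
        raise NotImplementedError

class DevserverPlugin(ProvisioningPlugin):
    def applies_to(self, server_role: str) -> bool:
        return server_role == "devserver"

    def run(self, hostname: str) -> None:
        # Illustrative steps only.
        print(f"[{hostname}] installing developer tooling")
        print(f"[{hostname}] registering host in the devserver inventory")
        print(f"[{hostname}] enabling automated user backups")

def provision(hostname: str, role: str, plugins: list) -> None:
    for plugin in plugins:
        if plugin.applies_to(role):
            plugin.run(hostname)

provision("dev1234.region-a", "devserver", [DevserverPlugin()])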

When it comes to service management, the basic monitoring infrastructure that is in place for us is also available to most other services in the company.

The same is true for our lifecycle management of our services. Many of the company's distributed services run as Twine tasks. They are monitored, employ the logging and metric-collection frameworks, and all of that relies on the very same infrastructure that everybody else has access to.

When it comes to the servers that our engineers have access to, whether it is a physical server or a virtual machine, or even a container, our developer environments are either sitting on top of the physical servers, or the virtualization and containerization infrastructure that powers the rest of the company.

Then on top of that, we have the actual services, the devservers themselves and the On Demand containers as well as the software stack that powers them.

Finally, on top of this whole stack, we have the developer and the development tools that they are going to be using, which can be installed via RPMs or via a common internal packaging process.

There is very little that is special about the overall DevEnv environment when it comes to the tooling and the software stack compared to what is generally used within the company.

Designing for Reliability

Now let's talk about designing for reliability.

If you are a software engineer working for Meta, the last thing you want to worry about is how to maintain your devserver - you never want to ask yourself: did I install the latest updates to my devserver?

You also don't want to worry about performing periodic backups. In a scenario where your particular devserver crashes, which would potentially lead to a loss of work, you simply want to pick up where you left off on a different server.

There were many architectural decisions when it came to designing for reliability. It all stems from our desire to make our developers as productive as possible. As we indicated before, the whole software stack behind our services relies on the internal infrastructure that supports other services in the company, which means using hardened code and software components.

We also design our services to be scalable and reliable from the outset.

Why are those two aspects of our services important?

Scalability is important. The company has been growing headcount at a very high pace, onboarding new developers who go on to create more services and expand existing ones. We needed the operational capability and infrastructure to ramp up all of these incoming engineers, who were being hired at a very fast clip.

We also have to provide a reliable service. Our team is small. We don't have the bandwidth to handhold every single developer, so most of the operations related to using the developer environments have to just work or be self-service-able.

Providing reliability in a company like Meta or in many of our peers requires designing for a dynamic and potentially unreliable hardware and software infrastructure. Switches and servers die, and software is shipped with bugs, causing localized outages.

All of this means that our services have to be designed to insulate our user community from all of these dynamic and, in some cases, disruptive changes that happen when you have a very large fleet of servers, powered by an ever-changing code base.

We rely on a number of internal services that were themselves designed to cope with an unreliable world. From DNS onward, the components that power these internal systems are themselves highly reliable, available, and scalable.

In part, this reliability is provided by Service Router, Meta's global service mesh, which is a framework that allows users or clients of a service to find available servers that can service requests on behalf of other services or tooling.
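
Service Router's API is internal; conceptually, though, a client resolves a logical service name to any healthy replica rather than a fixed host, roughly along these lines (a simplified, hypothetical sketch):

# Conceptual sketch of mesh-style service discovery: callers ask for a
# logical service name and get back a healthy replica, so individual host
# failures stay invisible to them. All names here are hypothetical.
import random

REGISTRY = {
    "devenv.backup": [
        {"host": "backup1.region-a", "healthy": True},
        {"host": "backup2.region-b", "healthy": False},  # down, skipped
        {"host": "backup3.region-c", "healthy": True},
    ],
}

def resolve(service_name: str) -> str:
    replicas = [r for r in REGISTRY[service_name] if r["healthy"]]
    if not replicas:
        raise RuntimeError(f"no healthy replicas for {service_name}")
    return random.choice(replicas)["host"]

print(resolve("devenv.backup"))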

For instance, we rely on the Meta MySQL infrastructure, which at a very high level, provides fully managed database servers running in master-slave mode, supporting workload distribution and all of the good stuff that makes for a reliable and performant data storage layer.

We also rely on the company-wide provisioning infrastructure. As a user, you can instantiate a new devserver very quickly as, in general, the servers available to end-users are pre-provisioned and ready-to-go.

The general provisioning framework supports the customization of a server's provisioning steps based on the role this server is going to fulfill. Thus, all of the infrastructure to bring a fresh server to its running state can be done in a couple of hours. This process is fully automated to always ensure the availability of pre-provisioned devservers ahead of demand.
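
As a rough illustration of the idea - not our actual tooling - a buffer-maintenance loop that keeps pre-provisioned devservers ahead of demand might boil down to something like this:

# Hypothetical buffer-pool check: keep a number of already-provisioned
# devservers free in each region so a new request can be satisfied
# immediately instead of waiting hours for provisioning.
TARGET_BUFFER = 50

def replenish(free_pool: dict) -> dict:
    """Return how many new servers to provision per region."""
    return {
        region: max(0, TARGET_BUFFER - available)
        for region, available in free_pool.items()
    }

print(replenish({"region-a": 48, "region-b": 75, "region-c": 10}))
# -> {'region-a': 2, 'region-b': 0, 'region-c': 40}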

We also rely on the company-wide virtualization infrastructure. It is very easy to turn up a new virtual machine to potentially supply additional devservers to our users.

Similarly, our On Demand service relies on the company-wide containerization infrastructure.

Furthermore, there are many other services we rely on. One of them is the auto-remediation infrastructure, which supports the creation of automated steps that apply fixes to common failures a server might face.
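
To make this concrete, here is a minimal, hypothetical sketch of what such a remediation rule might look like; the failure names and steps are illustrative only:

# Hypothetical auto-remediation mapping: a detected failure is matched to a
# predefined sequence of fix-up steps that runs without human involvement.
REMEDIATIONS = {
    "disk_full": ["clean_build_artifacts", "rotate_logs", "alert_owner_if_still_full"],
    "ssh_daemon_down": ["restart_sshd", "verify_login", "open_task_if_unrecovered"],
}

def remediate(failure: str) -> None:
    for step in REMEDIATIONS.get(failure, ["open_task_for_oncall"]):
        print(f"running remediation step: {step}")

remediate("disk_full")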

In other words, when something goes wrong, there is a predefined set of logical steps that will automatically run to rectify that particular failure.

Finally, the DevEnv organization also relies on other internal processes designed to provide and increase the reliability of our services. For instance, one of the integral parts of the reliability and better engineering culture at Meta is what we call SEV (or Site Event) reviews.

Every time we have an outage, whether it is caused by a failure in our own software or in one of the services we depend on, it is formally tracked and acted on until it is mitigated and, eventually, closed. As part of the post-mortem analysis of a SEV, we review the conditions that led to that site event, aiming to improve the software and/or the operational processes currently in place.

The rationale behind the SEV review process is very much like the one behind accident investigations in the aviation industry. If something goes wrong, we want to be able to learn from it and improve software and processes to ensure that something similar does not reoccur.

Another important aspect of designing for reliability consists of fitting in with the overall company posture when it comes to disaster preparedness and recovery.

Meta as a company has a well-defined process for running disaster recovery exercises, and for documenting and, potentially, automating the steps one might need to perform in the face of a natural disaster or while executing a disaster recovery exercise.

If interested, there are externally available talks that discuss disaster preparedness at Meta in more detail. Suffice it to say that the architecture and tooling supporting the development environment services fit in very well with the overall disaster recovery strategy for the company.

For instance, there are multiple strategies in place when it comes to ensuring operational business continuity. In terms of team-facing strategies, we have continuous oncall coverage, and we continuously curate and maintain well-defined runbooks describing what to do when facing outages, which the oncall engineers can rely on during their shift.

Naturally, there is continuous coverage with good reporting and hand-off procedures and workflows dictating how the outgoing oncall team should transition pending/ongoing issues from one oncall shift to the next.

There are also well documented workflows for most of the oncall operational tasks, which helps all engineers whether they are someone who has been doing this for a long time or someone who has just joined the team. We constantly spread operational knowledge and communicate internally on an ongoing basis to help the team holding the fort during a given oncall shift.

Then there are certain strategies we employ that are user-facing and contribute to projecting a sense of overall stability.

For instance, how do we ensure that our users perceive the developer environment services as reliable?

The main strategy here is to communicate with our users. We work constantly on developing and improving high signal communication mechanisms to interact with the user community, from manning support forums to employing tooling to directly communicate with our end-users.

For example, if we know that there is going to be a service outage in a particular region, because we are running a company-wide disaster recovery exercise, we employ tooling to communicate with the user community to alert them that they should be on the lookout for disruptions and that they should prepare for that exercise.

More importantly, there are operational procedures in place to minimize the pain when our services are affected by a disaster or a disaster recovery exercise. For instance, transparent and automated user backups, which allow our engineers to quickly migrate from one devserver in an affected region to another one in a region that is functional.

As hinted at before, we design our software for an ever-changing world, where OS and software upgrades are a constant. And this is not optional. Without these strategies in place, how could we perform these upgrades without disrupting the users, while still giving them a reliable environment to work on?

User-Facing and Infra-Facing Automations

Let's talk a little bit about user-facing and infra-driven automations.

Some of these automations can be intrusive, but ultimately they ensure that the development environment services remain reliable. We try to design them in such a way that they don't overly disrupt the typical development workflows that make up an engineer's day-to-day work.

In general, most common maintenance operations are touchless from a server's owner's standpoint and they are mostly designed to be self-service.

From a user's standpoint, if something goes wrong, many of our tools have a mode in which they can self-diagnose and auto-correct common problems. In other words, for most of the tooling a developer interacts with, there will be scripts and additional tools that will go through a set of validations, which will eventually identify and correct whatever is wrong on your server for you.

In the worst case, some of these tools provide a "rage" mode, which, in addition to attempting to self-correct problems, will also collect evidence and logs in case the self-correcting measures don't work. With that information, the team that owns that particular tool/component can look at logs and user data to ultimately help rectify the problem.

On the side of team-facing infra-driven automations, we have internal tools like the Drain Coordinator and the Decominator. These are software tools that help with common maintenance operations that would otherwise require manual intervention.

Suppose, for example, that you have a server that is going to undergo maintenance (e.g., some of its hardware needs to be replaced). The Drain Coordinator can perform a bunch of choreographed steps to minimize the disruption to the end-user who currently owns that server. One of the things it might do is to live-migrate a virtual machine. For instance, if an engineer has a devserver that happens to be a virtual machine, it can potentially move it to a different physical hypervisor without disrupting that end-user.
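
As a simplified, hypothetical sketch of that choreography (not the actual Drain Coordinator code):

# Hypothetical drain choreography: before a hypervisor goes down for
# maintenance, the VM-based devservers on it are live-migrated to hosts
# with spare capacity, so their owners never notice.
def drain_hypervisor(hypervisor: str, vms: list, capacity_by_host: dict) -> None:
    for vm in vms:
        target = max(capacity_by_host, key=capacity_by_host.get)  # most free slots
        print(f"live-migrating {vm} from {hypervisor} to {target}")
        capacity_by_host[target] -= 1
    print(f"{hypervisor} is now empty and safe to hand over for maintenance")

drain_hypervisor(
    "hv17.region-a",
    vms=["dev-vm-001", "dev-vm-002"],
    capacity_by_host={"hv21.region-a": 3, "hv22.region-a": 1},
)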

There is also the Decominator, a tool that automates the process of sending a devserver to the shredder. For instance, if a devserver has reached its end of life and needs to be decommissioned, it will alert its owners and perform the other tasks necessary to drain that particular devserver, indicating to its users that they have to move over to a different devserver and ultimately taking it out of circulation.

Preparing for Disasters

The next thing that I want to talk a little bit about is, again, disaster recovery and disaster preparedness.

If you're a developer, you don't want to be doing your own planning for disasters when it comes to your developer environment. More importantly, you don't want to be concerned with when disaster preparedness exercises are going to be run.

How do we prepare for disasters in such a way that we don't negatively impact developer efficiency and productivity for the company as a whole, as well as for individual developers?

The first aspect to consider is capacity planning. We design our services to run replicas in multiple regions and to have spare server fleet capacity, under the assumption that, if engineers need to allocate servers in other regions because of maintenance or loss of capacity in a particular data center, we can accommodate that efficiently. If they're using a devserver in an affected region, they can easily request a devserver in a different region and migrate to it.

For On Demand instances, people can just migrate transparently from one region to another. Note that On Demand containers were designed for ephemeral access. In fact, the majority of our developers tend to use these development containerized environments, which are always fresh and short-lived. Every time that an engineer gets a new one, they get the freshest setup possible, again, because these are short-lived containers (or tasks as we call them).

They live for a day and a half and are then disposed of. When you get a new one, you have the latest version of the source control system, linters, and whatever tooling that comes pre-installed in the container. It all comes to you brand new.

As should be clear by now, we run these disaster recovery exercises periodically. We have two kinds of exercises that impact the developer environments.

We have "storms". Storm is a term we use internally. It comes from actual storms, which tend to hit the East Coast of the U.S. rather frequently. They can be as disruptive as taking down a whole data center.

We also have "dark storms" which, during a typical exercise, will employ tooling to potentially wipe the contents of random devservers. Their objective is to ensure that we have the infrastructure and people prepared to cope with these random losses.

Another important aspect when it comes to disaster preparedness is to have adequate tooling to help manage the outage, at your fingertips. For instance, the DevEnv engineers must be able to drain servers from a particular region quickly. They must also be able to communicate about what is happening and what is being done to our end-users.

In other words, if we are not going to have access to the resources in a particular data center, we need to make sure that we don't let anybody use servers and services in that particular region. Specifically, if an engineer is trying to get a new devserver or an On Demand container, they should be steered away from the servers in that region, transparently. For the engineers that are currently on servers in the affected region, you want to basically drain them out as soon as possible, so they don't lose any work as the exercise takes place.
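
A minimal, hypothetical sketch of that steering logic (the region names and placement function are illustrative only):

# Hypothetical placement filter used during a storm or drain: new devserver
# and On Demand requests are steered away from the affected region.
AFFECTED_REGIONS = {"region-a"}

def pick_region(preferred: str, candidates: list) -> str:
    usable = [r for r in candidates if r not in AFFECTED_REGIONS]
    return preferred if preferred in usable else usable[0]

print(pick_region("region-a", ["region-a", "region-b", "region-c"]))  # -> region-b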

We also invest a lot in writing and curating both manual as well as automated runbooks.

A runbook is like a cooking recipe. In the context of disaster recovery exercises, they, for example, describe the steps of sending out end-user notifications as well as the steps to perform service draining.

As mentioned before, we also have the capability to perform VM-based devserver live migration. This means that we have the ability, by relying on Meta's server virtualization software stack, to move one devserver from a physical hypervisor to a different one, without disrupting the end-user. When a live migration is taking place, an engineer doesn't even have to power down their VM.

We also invest in automated (and transparent) backups, which support our devserver migration workflow. For instance, if an engineer loses their devserver, they have the ability to, as quickly and painlessly as possible, allocate a new one.

Finally, we have tooling in place to survive internal DNS failures. If that does occur, we have a tool our engineers can use to get to their devserver bypassing DNS lookups.
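
The actual tool is internal, but conceptually it boils down to falling back to a cached, last-known-good inventory of addresses when the normal lookup fails - a hypothetical sketch:

# Hypothetical DNS fallback: if the normal lookup fails, use a locally
# cached inventory of devserver addresses so engineers can still reach
# their machines during a DNS outage.
import socket

CACHED_INVENTORY = {"dev1234.example.internal": "10.20.30.40"}

def resolve_devserver(hostname: str) -> str:
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        return CACHED_INVENTORY[hostname]  # last-known-good address

print(resolve_devserver("dev1234.example.internal"))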

One last thing we want to highlight is our ability to communicate with our end-users, at scale. For instance, if the company is running a disaster recovery exercise, we have the ability to email and send chatbot notifications to our end users. We can also open tasks (in our Jira-like environment) when we need them to perform a manual operation on their devservers.

Specifically, we can send chatbot notifications to indicate to developers that they have devservers in a region affected by a disaster, indicating that they have to temporarily get a new one in a different location, and potentially restore their latest backup, so they are again up-and-running as quickly as possible.
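
As a rough, hypothetical sketch of that kind of targeted fan-out (the channels, names, and wording are illustrative only):

# Hypothetical notification fan-out during an outage: only the owners of
# devservers in the affected region are contacted, over several channels,
# with a concrete action to take.
def notify_affected(owners_by_region: dict, region: str, action: str) -> None:
    for owner in owners_by_region.get(region, []):
        for channel in ("email", "chatbot", "task"):
            print(f"{channel} -> {owner}: your devserver in {region} is affected; {action}")

notify_affected(
    {"region-a": ["alice", "bob"]},
    region="region-a",
    action="please allocate a temporary devserver elsewhere and restore your latest backup",
)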

Storms and Drains

Let's talk a little bit more about storms and drains.

There are two types of exercises related to disaster readiness: storms and drains.

Storms are exercises where we completely disconnect a data center from the network. A storm simulates the complete loss of that particular data center.

We also have drains. A drain takes place when we selectively remove, in a controlled way, the normal workload directed to a data center and fail it over to a different site. During drains, the network remains up.

Why do we do these exercises?

First and foremost, we want to periodically test all of the infrastructure together and collect signals regarding components that might not be able to tolerate the loss of a single data center.

Why do we do them periodically?

Because once you work the kinks out of the system, the overall infrastructure won't necessarily remain in good shape going forward. The reality is that our own software stack as well as what we depend on is constantly evolving.

It is possible that after addressing all of the design shortcomings observed in a DR exercise, someone might introduce a feature that might create a single point of failure.

That's the reason for going through this process frequently and continuously.

The main objective really is that we want to be prepared for large-scale power outages, network incidents, and even self-inflicted regressions that might occur as the software and hardware landscape evolve.

In other words, we do this on a periodic basis to continuously validate and ensure that our design decisions, our architecture, and everything else is in place to provide the highly available environment that we aim to provide to our end-users.

What types of signals do we collect when we run these exercises?

First and foremost, capacity and availability. Do we have enough of it - such that it would be possible for affected engineers to migrate quickly from devservers that are no longer accessible to devservers in a different region? Or, in the case of the On Demand service, do we have enough spare containers in other regions to accommodate the loss of servers in a region affected by an exercise?

When we have failovers, some of the services that we run in that data center will become unusable. Do we have the ability to fail over to different regions and support the increased load on the remaining regions?

On the trailing end of an exercise, are we able to recover from these failures? Do we have all of the process orchestration in place to make sure that everything will remain operational once capacity is returned, after a disaster or DR exercise?

These are some of the reasons why we run storms and drains!

Runbooks

Let's talk a little bit more about runbooks. In particular, let's talk about runbooks in the context of our own organization.

As we mentioned before, a runbook is a compilation of routine procedures and operations that an engineer needs to carry out in the face of a particular problem.

To make this more concrete, let's again talk about what needs to be done in the face of a disaster.

The goal of a runbook is that all of the engineers in our own group should be able to perform these remediation steps in a repeatable fashion. One of the implications of using a runbook is that we should attempt to automate as many steps as possible, if not the whole thing, to minimize the chances of human error.

Meta has a runbook tool that allows its steps to be written as code, with all of the steps that need to be carried out being fully spelled out, including manual steps.

A runbook can be nested. For instance, one runbook can invoke another runbook as a single step. A step can be an operation or a logical barrier. With a barrier, you are basically waiting for a bunch of prior steps to be completed.
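
Our runbook tool is internal, but purely to illustrate the shape of "runbook as code" - ordered steps, nesting, and barriers - here is a small, hypothetical sketch:

# Hypothetical "runbook as code": steps run in order, a runbook can be
# nested as a single step, and a barrier waits for prior steps to finish.
class Step:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def run(self):
        print(f"step: {self.name}")
        self.fn()

class Barrier(Step):
    def __init__(self, name):
        super().__init__(name, lambda: None)  # a real tool would block here

class Runbook:
    def __init__(self, name, steps):
        self.name, self.steps = name, steps
    def run(self):
        print(f"runbook: {self.name}")
        for step in self.steps:
            step.run()

notify = Runbook("notify-users", [Step("send-email", lambda: None),
                                  Step("send-chatbot", lambda: None)])
drain = Runbook("drain-region", [
    Step("stop-new-allocations", lambda: None),
    notify,                          # nested runbook used as a single step
    Barrier("wait-for-notifications"),
    Step("migrate-remaining-vms", lambda: None),
])
drain.run()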

When it comes to runbook development, there is a whole environment supporting this, which, for instance, allows you to validate and debug it. You can also rely on pre-existing templates for operations that are more or less general-purpose.

At runtime, when an engineer invokes one of these runbooks, there's tooling that automatically captures the dependencies that will allow for precise step orchestration as well as tooling to capture the execution timeline and corresponding logs. This is all in place so engineers can actually do a future post-mortem of a runbook invocation following an actual disaster or a DR exercise should something not work as expected.

Comms

Let's talk a little bit more about comms.

A good example again is when the company is running a DR exercise even though that's not the only circumstance where a good communication infrastructure is necessary.

These exercises themselves can potentially be highly disruptive. For instance, when it comes to developer environment services, they can be disruptive to users who have physical devservers, which, eventually, become inaccessible due to the loss of network connectivity to the data center where they are located.

One of the investments we have been making, so that our users are quickly informed about the state of their devservers and about other events that might be disruptive to them, is a well-defined, high-signal, timely, and scalable strategy for how and when we communicate with them.

The aim of this effort has been to maximize the efficiency of individual developers, so they can remain productive even if the company is running a DR exercise. While DR exercises occur without prior notice, we try to communicate with our users as soon as the information becomes available to us, whenever that's possible.

Obviously, when the company is running a disaster recovery exercise, the whole point is that it should look and feel like an actual disaster. Therefore, as part of fostering a culture of preparedness, we also want to educate our users themselves so they are aware that they might potentially lose their devservers at a random time, with very little notice.

In other words, we want our users to be aware of what they themselves need to do to be able to survive an outage, whether the impact is temporary, just lasting a few hours where they might not have access to their devserver because of a DR exercise, or due to a real emergency.

Another issue we want our users to be aware of is that they should never be running a production service on their devserver because it's an environment that can disappear without notice and it's also an environment that provides no redundancy or fault-tolerance support.

Finally, we also want to empower our users to correct, on their own, any problems that might occur, and to continue to work.

It should be clear by now, but why is all of this so important?

In a nutshell, we want developer efficiency to remain high, even in the face of potential server losses. Ultimately, getting actionable information to the users quickly - indicating whether there is an actual disaster, a DR exercise, or any other kind of outage - via emails, tasks, and chatbot messages that directly target the affected users and tell them what to do (for instance, temporarily obtain a new devserver) allows them to minimize the amount of time that the outage will affect them directly.

Live Migration

Another automation that I want to talk about a bit more is our ability to live-migrate virtual machine-based devservers.

The devservers set up as virtual machines are organized into a "virtual" data center. We use the term "virtual" data center because every server in such a data center has a mobile IP address, which enables us to migrate it from one physical hypervisor to a different physical server without interrupting the user's workflow.

This capability is very useful when it comes to supporting data center maintenance workflows in a way that is transparent to the end users. For instance, a hypervisor might have a hardware problem - say, a fan might have died - and a DC technician might have to physically repair that part, which might necessitate stopping the hypervisor. Because we can easily migrate all of the devserver VMs that are on that physical host to a different one, the maintenance workflow can take place without disrupting the users who own those VM devservers.
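
As a simplified, hypothetical sketch of the idea - the VM keeps its address and identity while the mapping to a physical hypervisor changes underneath it:

# Hypothetical view of live migration: the devserver's address stays fixed,
# only the VM-to-hypervisor placement changes, so SSH sessions and running
# work are not interrupted.
PLACEMENT = {"dev-vm-001": "hv17.region-a"}   # VM -> current hypervisor

def live_migrate(vm: str, target_hypervisor: str) -> None:
    source = PLACEMENT[vm]
    print(f"copying memory pages of {vm} from {source} to {target_hypervisor}")
    PLACEMENT[vm] = target_hypervisor         # the VM's address does not change
    print(f"{vm} now runs on {target_hypervisor}")

live_migrate("dev-vm-001", "hv21.region-a")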

The ability to live-migrate VM devservers relies on a software component called ILA. There's a very good talk about it from Network@Scale 2017.

Learn From Quasi-Disasters and Other Kinds of Outages

What is the point of running disaster exercises if we don't learn from them?

The main reason behind running these exercises is that we want to be able to learn from them.

Every time we run a DR exercise, we open a preemptive SEV specific to our services to track relevant events throughout the duration of the exercise. This is used to document its impact on our own software stack and also the impact on our end users. Ultimately, this SEV is used to broadcast in real-time what is happening to our own team, our users, and to interested external parties.

Then, after the exercise has taken place, that SEV is closed and we summarize all of the information related to that event in-depth in the SEV management tool.

Subsequently, we have a SEV review presentation meeting. Most SEVs at Meta are reviewed, at the very least, at the team level. The owner of that DR-tracking SEV, in the case of the DevEnv team, the engineer who was oncall during that exercise, will put together an incident report. The SEV management tooling ensures that we always do this reporting in a consistent way.

This report is then reviewed by a group of senior engineers and whoever else might be interested in the impact and learnings from that SEV. During the review, the facilitator of that meeting will, when necessary, open tasks to drive the process of fixing and/or improving whatever needs improving, based on what was observed during the DR exercise.

It is possible that there will be critical tasks for issues deemed very important and which require immediate engineering action. The work outlined by these critical tasks is reviewed in a timely fashion to ensure that it is indeed addressing the root cause of whatever did not work as expected.

We might also have medium priority tasks created as part of the SEV review. These tasks outline work that will allow us to mitigate, in the near future, problems that surfaced during the exercise. They might also capture work that will allow us to remediate minor problems or prevent problems from escalating.

Finally, the review group might also opt to open exploratory tasks. Such a task captures work geared towards driving future architectural changes, which might include, potentially, redesigning services that have proven not to be totally reliable in the face of disaster-like scenarios.

The key observation here, again, is that we have a process by which we learn from outages and that drives continuous improvements throughout our software stack.

All of this is to ensure that we can provide developer environments where Meta engineers are always productive and don't need to spend time worrying about stability, even in the face of outages.

The Future

We have many plans going forward!

As it should be clear from what you have heard so far, our mission is to ensure that all of our engineers remain as productive as possible. To deliver on this, we are continually hardening our services to ensure that our infrastructure can tolerate outages and failures without disturbing our end users. We are currently in the process of better integrating our services with the company's reliability program maturity model. We are improving our on-calls. We are investing a lot in terms of observability, incident management, and also in terms of our response to critical outages.

What are the things that we can do to better respond to potential failures that we might have in our development environment services?

The DevEnv services are critical in handling SEVs in other products and services. Oftentimes, having access to one's devserver is a fundamental requirement for a PE or a SWE who is working on an actual SEV elsewhere. In many cases, these engineers will eventually need to make code or configuration changes as part of the process of addressing the root cause of these SEVs and making these changes is normally done via one of the developer environment products.

Therefore, our services have to be bulletproof to enable our engineers to work through outages that might be affecting other parts of Meta's computational infrastructure.

Because of our overall system architecture, there are interdependencies between our services and other components, managed by other teams.

Earlier in this talk, we mentioned our own dependency on a functional internal DNS infrastructure.

Devservers are Linux servers. How does someone connect to a Linux server if DNS is down?

As mentioned before, we worked with other infrastructure teams to ensure that Meta engineers can work around this limitation as part of our ongoing projects on increasing the resiliency of our services.

Similarly, we are working with the build and source control teams to ensure that, even with degraded access to source control, continuous integration, and continuous delivery, our engineers can still make and ship code and configuration changes. In most cases, to fix the problem behind a SEV, engineers have to ship code or, potentially, undo changes that have been incorrectly deployed.

Finally, we are making investments to improve our own reliability and security practices. We are investing in upleveling our team with respect to architecture, security, and code reviews. These efforts are being put in place to make sure that, from the outset and as we add new features, we are not creating potential failure points in our services.

Furthermore, there is work related to carrying out periodic reassessments of the state of our production services to address questions related to healthy and reliable operations. For instance, how do we make sure that our software components, services and everything else that makes up our developer environments don't decay, just because we were not paying close attention?

There is also work on identifying recurring issues and problem areas. What are the problems that our oncall engineers are seeing day-in and day-out? We are putting effort into continuously addressing these issues, as well as sponsoring better engineering initiatives where we intentionally address tech debt.

Again, all of the work and all of the architectural effort that you saw here is in place to enable software engineers, production engineers, data scientists, to work as efficiently as possible without having to worry about the developer environment they are using.

Thank you!

 


 

Recorded at:

Dec 19, 2023
