Key Takeaways
- When managing dependencies for distributed microservices, you must consider the different types of growth the product may experience as it evolves, such as growth in the number of users, changes in user behavior, and changes in the interactions between services and subsystems.
- Stateless services are often easier to manage than stateful ones.
- Colocate service components for greater performance, easier failure isolation, and SLO alignment.
- Having isolated serving stacks may prevent a global outage of your service. This includes dependencies of your service that are run by third parties or in the cloud.
- Some architectural strategies may allow your service to have a graceful degradation during an outage instead of immediately returning errors to the user, e.g., having a cache between the API and the database.
- When building an SLO, take into account the current SLO of all backends and the different user journeys, including edge cases for bad responses and degraded experiences.
- Work with backend owners and allow extra time for resource allocation and architectural changes.
Last year, during QCon Plus, I shared some of the Pitfalls and Patterns in Microservice Dependency Management that I encountered while working at Google for over 10 years. Rather than being focused on any particular product or team, this talk was about sharing my own experiences and personal learnings as a software engineer at Google.
With that in mind, I presented three scenarios. Each of these scenarios had key moments of epiphany, where I realized the importance of various aspects of the microservices environment. There were also moments where things failed or went wrong. I made sure to single out these complications and explain what I did to avoid similar situations in the future, so you can look for the signs leading to a similar potential failure within your own environment.
All these scenarios actually happened while I was working with people in other roles in engineering. At one point I was a software engineer working with a project manager. At another point, I was a site reliability engineer working with other developers. And in the last scenario, I was working with basically everybody on the team with the goal of building reliable services.
Each of these scenarios can be useful for a variety of roles within a team: the success (or failure) of building a reliable microservices environment isn't solely dependent on one person or one role.
Every time you change your system, the change will affect many other parts, and many other components of the whole product - these components might be run by your company, in the cloud, or by a third-party provider. System changes create a ripple effect that goes all the way to the customer: the person that you want to keep in mind always. For that reason, you need to view systems from a holistic perspective - which is also part of what I tried to convey during my talk.
Before going through these scenarios (and all that can be learned from them), let's take a quick look into how the industry transitioned from service monoliths into microservices, and then take one last jump to include services running in the cloud. We will also look at patterns of traffic growth, failure isolation, and how we can plan reasonable service-level objectives (SLOs) in a world where every backend offers different SLOs.
Monoliths, Microservices, and into the Cloud
Our journey (and we'll use a generic service here) starts with a single binary, which eventually (and rapidly) evolved to include more complex functionalities such as databases, user authentication, flow control, operational monitoring, and an HTTP API (so our customers can find us online). At first, this single binary was meant to run in a single machine, but as the business grew it was necessary to replicate this binary in multiple geo-locations, also allowing for extra room for traffic growth.
Not too long after replicating our monoliths, several reasons required them to be decoupled into separate binaries. Perhaps you can recognize some of these reasons. One common reason is related to the complexity of the binary: as we added more complex functionalities to it, the codebase became almost impossible to maintain, let alone extend with new features. Another common reason to break a monolith into separate binaries is related to the individual requirements of independent logical components. It might be necessary, for example, to grow the hardware resources for a specific component without impacting the performance of the others.
Scenarios like this ultimately motivated the birth of microservices: a collection of loosely coupled services that are independently deployable, highly maintainable, and organized to form (or serve) a complex application. In practice, that means multiple binaries deployed and communicating over a network, where each binary implements a different microservice but all of them serve and represent a single product. Figure 1 shows an example of an API product decoupled into five independent microservices (API, Auth, Control, Data, and Ops) that communicate with each other over the network. In a microservice architecture, the network is also an important part of the product, and as such you must always keep it in mind. Each service - now both a single binary and a component of the application - can grow its hardware resources independently, and its lifecycle can be easily controlled by engineering teams.
Figure 1: An API product decoupled into five independent microservices
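To make the picture more concrete, here is a minimal sketch of how the API frontend might fan out to its backend microservices over the network. The hostnames, ports, and endpoint paths are hypothetical placeholders, not part of the original example:

```python
# A minimal sketch of an API frontend calling its backend microservices over
# HTTP. Hostnames, ports, and endpoint paths are hypothetical placeholders.
import requests

BACKENDS = {
    "auth":    "http://auth.internal:8081",
    "control": "http://control.internal:8082",
    "data":    "http://data.internal:8083",
    "ops":     "http://ops.internal:8084",
}

def get_picture(request_token: str, picture_id: str) -> bytes:
    # Each step is a network call to an independently deployed binary, so every
    # dependency needs its own timeout and failure handling.
    auth = requests.get(f"{BACKENDS['auth']}/check",
                        params={"token": request_token}, timeout=0.1)
    auth.raise_for_status()

    data = requests.get(f"{BACKENDS['data']}/pictures/{picture_id}", timeout=0.5)
    data.raise_for_status()

    # Monitoring is best-effort: a failure to reach Ops must not fail the request.
    try:
        requests.post(f"{BACKENDS['ops']}/metrics", json={"op": "read"}, timeout=0.05)
    except requests.RequestException:
        pass

    return data.content
```

Every one of these calls crosses the network, which is why the network's latency and failure modes become part of the product itself.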
Benefits of Microservices
Running a product in a microservice architecture provides a series of benefits. Overall, the possibility of deploying loosely coupled binaries in different locations allows product owners to choose among cost-effective and high-availability deployment scenarios, hosting each service in the cloud or on their own machines. It also allows for independent vertical or horizontal scaling: increasing the hardware resources for each component, or replicating components across different independent regions.
Another benefit is related to the development lifecycle. Since each service is logically decoupled from the others and has low internal complexity, it is easier for developers to reason about changes in their implementation and guarantee that new features have a predictable outcome. That also means independent development of each component, allowing for localized changes in one or more services without disturbing others. Releases can be pushed forward or rolled back independently, promoting a faster reaction to outages and more focused production changes.
Challenges of Microservices
Despite all its advantages, having an architecture based on microservices may also make it harder to deal with some processes. In the following sections, I'll present the scenarios I mentioned before (although I changed some real names involved). I will present each scenario in detail, including some memorable pains related to managing microservices, such as aligning traffic and resource growth between frontends and backends. I will also talk about designing failure domains, and computing product SLOs based on the combined SLOs of all microservices. Finally, I will share some useful tips that hopefully will save you time and prevent eventual customer outages.
Scenario #1: PetPic
Our first scenario revolves around a fictional product called PetPic. As shown in Figure 2, PetPic is a global service that provides pictures of dogs for dog lovers in two distinct geographic regions: Happytails and Furland. The service currently has 100 customers in each region, for a total of 200 customers. The frontend API runs in independent machines, each located in one of the regions. As a complex service, PetPic has several components, but for the purpose of this first study, we will consider only one of them: the database backend. The database runs in the cloud in a global region and it serves both regions, Happytails and Furland.
Figure 2: The global PetPic service
>Problem: Aligning Traffic Growth
The database currently uses 50% of all its resources at peak. Taking this into consideration, the product owner decided to implement a new feature in PetPic so it could also serve pictures of cats to its customers. Once the new feature was implemented, the engineers decided to launch it in Happytails first. That way, they could look for any unexpected user traffic or resource usage changes before making the new feature available to everybody. Considering that the user base is the same size in both regions, this seemed to be a very reasonable strategy at the time.
In preparation for the launch, the engineers doubled the processing resources for the API service in Happytails and increased the database resources by 10%. The launch was a success. There was a 10% growth in customers, which could indicate that some cat lovers had joined PetPic. The database resource utilization was at 50% at the peak, again showing that the extra resources were indeed necessary.
All signals indicated that 10% growth in users requires a 10% growth in database resources. In preparation for launching the new feature in Furland, PetPic engineers added an additional 10% of resources to the database. They also doubled the API resources in Furland to cope with the requests from new customers. These were exactly the same changes made when the new feature was launched in Happytails.
They launched the new feature for Furland users on a Wednesday. Then, during lunchtime, the engineers started receiving lots of alerts about users seeing HTTP 500 errors from the service - meaning that the users couldn't use the service at all. This was a completely different outcome from the Happytails launch. At this point, the database team reached out to engineering and mentioned that the database resource utilization had reached 80% two hours earlier (shortly after the launch). They were trying to allocate more CPU to handle the extra traffic, but that change was unlikely to happen the same day. At the same time, the API team checked the user growth graphs and reported nothing unexpected: the service was at a total of 220 customers. Since nobody on the engineering team could identify an apparent reason for the outage, they decided to abort the launch and roll back the feature in Furland.
Figure 3: Unexpected repercussions of traffic growth
The feature launch in Happytails had a 10% customer growth aligned with a 10% traffic growth to the database. However, after analyzing the logs, the engineering team observed that once the feature was launched in Furland, there was a 60% traffic growth to the database even without a single new user registered. After the rollback, unhappy customers who were eager to see cat pictures during their lunch break opened several customer support tickets. The engineers learned that customers in Furland were actually cat lovers who never had much interest in interacting with PetPic when only dog pictures were available.
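A back-of-the-envelope calculation helps illustrate what happened. The per-user request rates below are illustrative assumptions, not figures from the launch, but they show how an unchanged user count combined with higher engagement can produce a much larger jump in database traffic:

```python
# A back-of-the-envelope model of database load. The per-user request rates are
# illustrative assumptions, not measurements from the PetPic launches.
SECONDS_PER_DAY = 86_400

def db_qps(users: int, requests_per_user_per_day: float) -> float:
    return users * requests_per_user_per_day / SECONDS_PER_DAY

baseline   = db_qps(users=200, requests_per_user_per_day=500)         # before any launch
happytails = db_qps(users=220, requests_per_user_per_day=500)         # +10% users, unchanged behavior
furland    = db_qps(users=220, requests_per_user_per_day=500 * 1.45)  # same users, far more engaged

print(f"Happytails launch: {happytails / baseline - 1:+.0%} database traffic")  # roughly +10%
print(f"Furland launch:    {furland / baseline - 1:+.0%} database traffic")     # roughly +60%
```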
>Tips
The scenario above tells us that the cat picture feature was a huge success in engaging existing customers in Furland, but the rollout strategy in place for the new feature could never have predicted that overwhelming success. An important lesson here is that every product experiences different types of growth. As we saw in this scenario, growth in the number of customers is different from an increase in engagement from existing customers - and different types of growth are not always associated with each other. The necessary hardware resources to process user requests may vary according to user behavior, which can also vary depending on a number of factors - including geographic region.
When preparing for launching a product in different regions, it's a good idea to run feature experiments across all regions to have a fuller view of how the new feature will impact user behavior (and, consequently, resource utilization). Also, whenever a new launch requires extra hardware resources, it is wise to allow backend owners extra time to actually allocate those resources. Allocating a new machine requires purchase orders, transportation, and physical installation of the hardware. The rollout strategy needs to account for this extra time.
Scenario #2: Failure Isolation
From an architectural perspective, the scenario we just studied involved a global service operating as a single point of failure, and a localized rollout that caused an outage in two distinct regions. In the world of monoliths, isolating failures across components is quite difficult, if not impossible. The main reason for this difficulty is that all logical components coexist in the same binary and, thus, in the same execution environment. A huge advantage of working with microservices is that we can allow independent logical components to fail in isolation, preventing failures from spreading widely across the system and compromising other components. The design process that analyzes how services fail together is often called failure isolation.
In our example, PetPic is deployed independently in two different regions: Happytails and Furland. However, the performance of both regions is strongly tied to the performance of the global database serving them. As we have observed so far, customers in Happytails and Furland have quite distinct interests, making it hard to tune the database to efficiently serve both regions. Changes in the way Furland customers access the database can degrade the user experience of Happytails customers, and vice versa.
There are ways to avoid problems like this, such as using a bounded local cache, as shown in Figure 4. The local cache can guarantee an improved user experience since it also reduces response latency and database resource usage. The cache size can be adapted to the local traffic rather than global utilization. It can also serve saved data in case of an outage in the backend, allowing for graceful degradation of the service.
Caches may also introduce problems of their own, depending on the application or business requirements - for example, if you have strict data freshness or scaling requirements. Common issues include latency slowly increasing as the cache hits its resource limits, and inconsistency when querying multiple caches. Also, a service should not rely on cached content in order to be able to serve at all.
Figure 4: Failure isolation using a bounded local cache
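As a rough illustration of this pattern, here is a minimal sketch of a bounded, read-through local cache with a stale-serving fallback. It assumes a hypothetical fetch_from_db function and is not PetPic's actual implementation:

```python
# A minimal sketch of a bounded, read-through local cache with a stale-serving
# fallback. `fetch_from_db` is a hypothetical function that raises on a backend
# outage; this is not PetPic's actual implementation.
import time
from collections import OrderedDict

class BoundedCache:
    def __init__(self, max_entries: int = 10_000, ttl_seconds: float = 60.0):
        self._entries = OrderedDict()   # key -> (value, stored_at)
        self._max = max_entries
        self._ttl = ttl_seconds

    def get(self, key, fetch):
        entry = self._entries.get(key)
        if entry is not None:
            value, stored_at = entry
            if time.monotonic() - stored_at < self._ttl:
                self._entries.move_to_end(key)   # fresh hit: no backend call at all
                return value
        try:
            value = fetch(key)                   # miss or stale entry: go to the database
        except Exception:
            if entry is not None:                # backend outage: degrade gracefully by
                return entry[0]                  # serving stale data instead of an error
            raise
        self._entries[key] = (value, time.monotonic())
        self._entries.move_to_end(key)
        if len(self._entries) > self._max:       # bounded: evict the least recently used entry
            self._entries.popitem(last=False)
        return value

# Usage: cache = BoundedCache(max_entries=1_000, ttl_seconds=30)
#        picture = cache.get("dog-42", fetch_from_db)
```

The bound ties the cache's memory footprint to local traffic, and serving stale entries when the backend errors out is one way to provide the graceful degradation described above.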
What about the other components in the product architecture? Is it reasonable to use caching for everything? Can you isolate services running in the cloud to specific regions? The answer to both of these questions is yes, and you should implement these strategies if you can. Running a service in the cloud does not prevent it from being the source of a global outage. A service running in different cloud regions can still behave as a global service and thus as a single point of failure. Isolating a service to a failure domain is an architectural decision, and it's not guaranteed solely by the infrastructure running the service.
Let's consider a different scenario using PetPic, but this time focusing on the Control component. This component performs a series of content quality verifications. The development team recently integrated an automated abuse detection routine into the Control component, based on machine learning (ML), which allows every new picture to be validated as soon as it's uploaded into the service. Problems start to arise when a new customer in Happytails starts uploading a large number of pictures of different animals into PetPic, which is expected to serve only pictures of dogs and cats. The stream of uploads activates the automated abuse detection in the Control component, but the new ML routines cannot keep up with the number of requests.
The component runs a pool of 1000 threads and limits the number of threads dedicated to the abuse routine to half of them, i.e. 500 threads. This should help prevent thread starvation in case a large bulk of long-processing requests arrives at once, as in our example here. What the engineers didn't expect was that those 500 threads would end up consuming all the available memory and CPU, causing customers in both regions to experience high latency when uploading images to PetPic.
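The thread-budget idea can be sketched roughly as follows; run_ml_abuse_check is a hypothetical placeholder, and the point of the sketch is that a thread cap alone is not a resource cap:

```python
# A rough sketch of the thread-budget idea described above: at most half of the
# 1000-thread pool may run the expensive ML abuse check at any time.
# `run_ml_abuse_check` is a hypothetical placeholder.
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 1000
ABUSE_CHECK_LIMIT = POOL_SIZE // 2               # at most 500 threads for the ML routine

executor = ThreadPoolExecutor(max_workers=POOL_SIZE)
abuse_slots = threading.BoundedSemaphore(ABUSE_CHECK_LIMIT)

def handle_upload(image):
    if not abuse_slots.acquire(blocking=False):  # budget exhausted: fail fast instead of queuing
        raise RuntimeError("abuse-check capacity exhausted, rejecting upload")
    try:
        # Capping the thread count does not cap the CPU or memory each ML request
        # consumes, which is exactly the gap this scenario exposed.
        return run_ml_abuse_check(image)
    finally:
        abuse_slots.release()

# Each incoming upload is handled as: executor.submit(handle_upload, image)
```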
How can we mitigate the pain users are experiencing in this scenario? If we isolate the Control component's operations to a single region, we can further restrict the impact of abuse situations like this one. Even if a service runs in the cloud, making sure that each region has its own dedicated instance of the Control component guarantees that only customers in Happytails are affected by the stream of bad image uploads. Note that stateless services can easily be restricted to a failure domain. Isolating databases isn't always possible, but you may consider local reads from a cache with eventual cross-region consistency as a good compromise. The more you can keep the processing stack regionally isolated, the better.
>Tips
Keeping all services in the service stack colocated and restricted to the same failure domain can prevent widely spread global outages. Isolating stateless services to a failure domain is often easier than isolating stateful components. If you can't avoid cross-region communication, consider strategies for graceful degradation and eventual consistency.
Scenario #3: Planning SLOs
In this final scenario, we will review the PetPic SLOs and check how they align with the SLOs provided by each backend. In a nutshell, SLOs are the service-level objectives that we can contractually bind in an SLA with our customers. Let's take a look at the table in Figure 5:
Figure 5: PetPic's SLOs
This table shows the SLOs engineers believe will provide PetPic customers with a great user experience. Here, we can also see the SLOs offered by each internal component. Note that the API SLOs must be built based on the SLOs of the API's backends (such as Control and Data). If a better API SLO is needed but cannot be met with the current backends, we need to consider changing the product design or working with the backend owners to provide higher performance and availability. Considering our latest architecture for PetPic, let's see if the SLOs for the API make sense.
Let's start with our operational backend (which we'll refer to as "Ops"), which is the part of the backend that collects health metrics about PetPic APIs. The API service only calls Ops to provide monitoring data about the requests, errors, and processing time of the operations. All writes to Ops are done asynchronously, and failures do not impact the API service quality. With these considerations in mind, we can disregard the Ops SLO when designing the external SLO for PetPic.
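A minimal sketch of this asynchronous, best-effort reporting pattern might look like the following, where send_to_ops is a hypothetical transport call. The API thread only enqueues the metric; a background worker ships it to Ops, so Ops failures never block or fail user requests:

```python
# A minimal sketch of asynchronous, best-effort monitoring writes. `send_to_ops`
# is a hypothetical transport call to the Ops backend.
import queue
import threading

metrics_queue = queue.Queue(maxsize=10_000)

def record_metric(name: str, value: float) -> None:
    """Called on the API request path: enqueue and return immediately."""
    try:
        metrics_queue.put_nowait((name, value))
    except queue.Full:
        pass  # drop the metric rather than slow down a user request

def _ops_writer() -> None:
    while True:
        name, value = metrics_queue.get()
        try:
            send_to_ops(name, value)   # hypothetical call to the Ops backend
        except Exception:
            pass                       # Ops failures must not affect API quality

threading.Thread(target=_ops_writer, daemon=True).start()
```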
Figure 6: Aligning read-SLO with the database
Now, let's take a look at the user journey for reading a picture from PetPic. Content quality is only verified when new data is ingested into PetPic, so data reads won't be affected by the Control service's performance. Besides retrieving the image information, the API service needs to process the request, which our benchmarks indicate takes about 30 ms. Once the image is ready to be sent, the API needs to build a response, which takes about 20 ms on average. This adds up to 50 ms of processing time per request in the API alone.
If we can guarantee that at least half of the requests will hit an entry in the local cache, then promising a 100 ms SLO for the 50th percentile is quite reasonable. Note that if we didn't have the local cache, the request latency would be at least 150 ms. For all other requests, the image needs to be queried from the database. The database takes from 100 to 240 ms to reply, and it may not be colocated with the API service. The network latency is 100 ms on average. If we consider the worst-case scenario with these numbers, the longest time a request could take is 50 ms (API processing) + 10 ms (accounting for the cache miss) + 100 ms (network) + 240 ms (database), summing up to a total of 400 ms. If we look at the SLO in the left column of Figure 6, we can see that the numbers are well aligned with the API backend structure.
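Summing the budget in code makes the read-path numbers easy to check. This sketch assumes the 10 ms cache-lookup cost applies to hits as well as misses:

```python
# Read-path latency budget, using the figures quoted above. The sketch assumes
# the 10 ms cache-lookup cost applies to hits as well as misses.
API_PROCESSING_MS = 30 + 20      # request handling + response building
CACHE_LOOKUP_MS   = 10
NETWORK_MS        = 100          # average latency to the remote database
DB_READ_MS        = (100, 240)   # database reply time, best to worst case

cache_hit_ms  = API_PROCESSING_MS + CACHE_LOOKUP_MS                                # 60 ms, under the 100 ms p50
cache_miss_ms = API_PROCESSING_MS + CACHE_LOOKUP_MS + NETWORK_MS + DB_READ_MS[1]   # 400 ms worst case
print(cache_hit_ms, cache_miss_ms)   # 60 400
```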
Figure 7: Aligning write-SLO with Control and Database
Following the same logic, let's check the SLO for uploading a new image. When a customer requests a new image upload to PetPic, the API must ask the Control component to verify the content, which takes between 150 ms and 800 ms. In addition to checking for abusive content, the Control component also verifies whether the image is already present in the database. Existing images are considered verified and don't need to be re-verified. Historical data shows that customers in Furland and Happytails tend to upload the same set of images in both regions. When an image is already present in the database, the Control component can create a new ID for it without duplicating the data, which takes about 50 ms. This journey covers about half of the write requests, with the 50th percentile latency summing up to 250 ms.
Images with abusive content usually take longer to process. The time limit for the Control component to return a response is 800 ms. Also, if the image is a valid picture of a dog or a cat, and assuming it's not already in the database, the Data component may take up to 1000 ms to save it. All numbers considered, in the worst-case scenario it may take almost 2000 ms to return a response. As you can see in the left column of Figure 7, 2000 ms is far above the current SLO engineers projected for writes, indicating that they might have forgotten to include bad scenarios when proposing the SLO. To mitigate this mismatch, you could consider bounding the 99th percentile SLO by the request deadline. A mismatch like this may also lead to poor or inconsistent service behavior. For instance, the database may finish writing the image after the API has already reported to the client that the operation deadline was exceeded, causing confusion on the customer side. In this case, the best strategy is to work with the database team on either improving the database performance or adjusting the write SLO for PetPic.
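The write path can be budgeted the same way. The sketch below assumes the same 50 ms API processing cost as the read path and the same 100 ms average network latency to the remote database; neither figure is stated explicitly for the write path in the scenario:

```python
# Write-path latency budget, using the figures quoted above. API processing and
# network costs are assumed to match the read path; they are not stated
# explicitly for the write path.
API_PROCESSING_MS = 50
CONTROL_MS        = (150, 800)   # content verification, typical case to deadline
DEDUP_NEW_ID_MS   = 50           # image already stored: only a new ID is created
NETWORK_MS        = 100
DB_WRITE_MS       = 1000         # worst case to persist a brand new image

dedup_path_ms = API_PROCESSING_MS + CONTROL_MS[0] + DEDUP_NEW_ID_MS            # 250 ms, roughly the p50
worst_case_ms = API_PROCESSING_MS + CONTROL_MS[1] + NETWORK_MS + DB_WRITE_MS   # 1950 ms, "almost 2000 ms"
print(dedup_path_ms, worst_case_ms)  # 250 1950
```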
>Tips
It is important to make sure your distributed product offers the correct SLOs to customers. When building an external SLO, you must take the current SLOs of all backends into account. Consider all different user journeys and the different paths a request may take to generate a response. If a better SLO is required, consider changing the service architecture or working with backend owners to improve the service. Keeping service and backends colocated makes it easier to guarantee SLO alignment.
About the Authors
Silvia Esparrachiari has been a software engineer at Google for 11 years, where she has worked in User Data Privacy, Spam and Abuse Prevention, and most recently in Google Cloud SRE. She has a bachelor's degree in Molecular Science and a master's degree in Computer Vision and Human-Computer Interaction. Her current focus at Google is to promote a respectful and diverse environment where people can grow their technical skills.
Betsy Beyer is a Technical Writer for Google in NYC specializing in Site Reliability Engineering (SRE). She coauthored Site Reliability Engineering: How Google Runs Production Systems (2016), The Site Reliability Workbook: Practical Ways to Implement SRE (2018), and Building Secure and Reliable Systems (2020). En route to her current career, Betsy studied international relations and English literature, and she holds degrees from Stanford and Tulane.