
Pitfalls and Patterns in Microservice Dependency Management


Summary

Silvia Esparrachiari shares stories of how small changes can have far-reaching effects, and discusses the importance of keeping a broad view of a system to better understand the impact of any single change.

Bio

Silvia Esparrachiari has been a software engineer at Google for 10 years, having worked on User Data Privacy, Spam and Abuse Prevention, and most recently on Cloud and Infrastructure/SRE. Her current focus at Google is to promote a respectful and diverse environment where people can collaborate and grow their technical skills.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Esparrachiari: My name is Silvia. I'm going to share with you some pitfalls and patterns in microservice dependency management that I bumped into while working at Google for over 10 years. The content and examples in this presentation are based on my own experiences as a software engineer at Google. I'm not focused on any particular product or team.

We're going to start by taking a quick look at the transition from monoliths to microservices, and then take one last jump to include services running in the cloud. We will continue our journey through patterns in traffic growth, failure isolation, and how we can plan reasonable SLOs in a world where every backend offers different guarantees.

Monoliths, Microservices, and into the Cloud

In the beginning, we all wrote a single binary, often called Hello World, which evolved to include more complex functionality like a database, user authentication, flow control, operational monitoring, and an HTTP API so our customers can find us online. This binary runs on a single machine, but could also have many replicas to allow for traffic growth in different geo-locations. Several reasons pushed our monoliths to be decoupled into separate binaries. A common reason is the complexity of the binary, which made the code base almost impossible to maintain and extend with new features. Another common reason is the need for independent logical components to grow their hardware resources without impacting the performance of the remaining components. These reasons motivated the birth of microservices, where different binaries communicate over a network but all serve and represent a single product. The network is an important part of the product and must always be kept in mind. Each component can grow its hardware resources independently, and it's much easier for engineering teams to control the lifecycle of each binary. Product owners may choose between running their binaries on their own machines or in the cloud. A product owner may even choose to run all their binaries in the cloud, which is often associated with higher availability and lower cost.

Benefits of Microservices

Running a product in a microservice architecture provides a series of benefits. It allows for independent vertical or horizontal scaling: the hardware resources of each component can grow independently, and components can be replicated in different regions independently. It brings better logical decoupling and lower internal complexity, which makes it easier for developers to reason about changes in the services and guarantee that new features have a predictable outcome. It enables independent development of each component, allowing for localized changes without disturbing components that are unrelated to a new feature. Finally, releases can be pushed forward or rolled back independently, promoting a faster reaction to outages and more focused production changes.

Challenges of Microservices

An architecture based on microservices may also make some processes harder to deal with. We will see some useful tips that will hopefully save you time and some customer outages. Some memorable pains from my own experience in managing microservices include aligning traffic and resource growth between frontends and backends, designing failure domains, and computing product SLOs based on the combined SLOs of all microservices.

PetPic

Let's start by understanding our example product. PetPic is a fictional product that we will use to exemplify these challenges. PetPic serves pictures of dogs for dog lovers in two regions: Happytails and Furland. It currently has 100 customers in each region, for a total of 200 customers. The frontend API runs on independent machines in Happytails and Furland. The service has several components, but for the purpose of this first example, let's consider only the database backend. The database runs in the cloud in a global region and serves both regions, Happytails and Furland.
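To keep this setup in mind, here is a rough sketch of the topology just described; the names and structure are illustrative assumptions, not actual PetPic configuration from the talk.

```python
# Hypothetical sketch of PetPic's initial deployment, as described in the talk.
petpic_topology = {
    "frontend_api": {
        "Happytails": {"customers": 100},   # regional API deployment
        "Furland": {"customers": 100},      # regional API deployment
    },
    # A single global database in the cloud serves both regions,
    # which later becomes a shared point of failure.
    "database": {"region": "global", "serves": ["Happytails", "Furland"]},
}
```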

Aligning Traffic Growth

The database currently uses 50% of all its resources at peak. PetPic's owner decided to launch a new feature to also serve pictures of cats to their customers. PetPic engineers decided to launch the new feature in Happytails first, so they could watch for any user, traffic, or resource usage changes before making the new feature available to everybody. This looks like a very reasonable strategy. In preparation for the launch, engineers doubled the processing resources for the API service in Happytails and increased the database resources by 10%. The launch was a success. The engineers observed a 10% growth in customers, which might indicate that some cat lovers had joined PetPic. The database resource utilization was at 50% at peak again, showing that the extra resources were indeed necessary.

All signals indicate that a 10% growth in users requires a 10% growth in the database. In preparation for the launch in Furland, engineers added 10% more resources to the database again. They also doubled the API resources in Furland to cope with requests from new customers. They launched it on a Wednesday, and waited. In the middle of lunchtime, pagers started ringing with alerts about users seeing 500s. Yes, a flood of 500s. What's happening? The database team reaches out and mentions that resource utilization had reached 80% two hours earlier, and that they were trying to allocate more CPU to handle the extra traffic, but that's unlikely to happen today. The API team checks the user growth graphs and there's no change: still 220 customers. What's happening? They decided to abort the launch and roll back the feature in Furland. Several customer support tickets were opened by unhappy customers eager for some cat love during their lunch break. Engineers scratched their heads and looked at the monitoring logs to understand the outage.

In the logs, they can see that the feature launch in Happytails had a 10% customer growth aligned with a 10% traffic growth to the database. Once the feature was launched in Furland, the traffic to the database rose 60% even without a single new user registered in Furland. They learned that customers in Furland were actually cat lovers, and had never had much interest in interacting with PetPic before. The cat picture feature was a huge success in regaining these customers, but the rollout strategy could never have predicted that.

Tips

What can we do better next time? First, keep in mind that every product experiences different types of growth. Growth in the number of customers is not always associated with more engagement from customers. The amount of hardware resources needed to process user requests may vary according to user behavior. When preparing for a launch, run experiments across all different regions, so you can have a better view of how the new feature will impact user behavior and resource utilization, as the sketch below illustrates. When requesting extra hardware resources, allow backend owners extra time to actually allocate them. Allocating a new machine requires purchase orders, transportation, and physical installation of the hardware.
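As a back-of-the-envelope illustration of that first tip, this sketch sizes the database from the peak traffic each region actually sends, rather than from user counts alone. The function name and all numbers are hypothetical.

```python
# Hypothetical sketch: estimate database capacity from observed per-region
# request rates, not from user counts. All numbers are illustrative.

def required_db_capacity(per_region_peak_qps, qps_per_db_unit, headroom=0.5):
    """Size the database from the traffic each region actually sends,
    keeping a `headroom` fraction of capacity free at peak."""
    total_qps = sum(per_region_peak_qps.values())
    return total_qps / (qps_per_db_unit * (1 - headroom))

# An experiment in each region (not just one) would have revealed very
# different per-user engagement in Furland before the global rollout.
observed_peak_qps = {"Happytails": 1100, "Furland": 1600}  # hypothetical data
print(required_db_capacity(observed_peak_qps, qps_per_db_unit=100))  # -> 54.0 units
```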

Failure Isolation

We just observed a scenario where a global service operated as a single point of failure and caused an outage in two distinct regions. In the world of monoliths, isolating failure across components is quite difficult, if not impossible. The main reason is that all logical components coexist in the same binary, and thus in the same execution environment. A large benefit of working with microservices is that we can allow independent logical components to fail in isolation, preventing failures from spreading widely and compromising the performance of other system components. This design process is often called failure isolation, or the analysis of how services fail together.

In our example, PetPic is deployed independently in two different regions: Happytails and Furland. Unfortunately, the performance of these regions is strongly tied to the performance of the global database serving both. As we have observed so far, customers in Happytails and Furland have quite distinct interests, making it hard to tune the database to efficiently serve both regions. Changes in the way Furland customers access this data can reflect poorly on the user experience of Happytails users. There are ways to avoid that. A simple strategy is to use a bounded local cache. The local cache can guarantee an improved user experience, since it also reduces response latency and database resource usage. The cache size can be adapted to the local traffic rather than global utilization. It can also serve saved data in case of an outage in the backend, allowing for graceful degradation.
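A minimal sketch of such a bounded local cache might look like the following; the class name, sizes, and fetch callback are assumptions for illustration, not PetPic code.

```python
from collections import OrderedDict

class BoundedLocalCache:
    """Hypothetical per-region LRU cache in front of the global database.
    Sized for local traffic; keeps serving cached data if the backend is down."""

    def __init__(self, fetch_from_db, max_entries=10_000):
        self._fetch = fetch_from_db          # callable: image_id -> bytes
        self._entries = OrderedDict()        # image_id -> bytes, in LRU order
        self._max = max_entries

    def get(self, image_id):
        if image_id in self._entries:
            self._entries.move_to_end(image_id)   # refresh LRU position
            return self._entries[image_id]        # hit: no cross-region call
        try:
            data = self._fetch(image_id)          # miss: cross-region database read
        except Exception:
            return None                           # degrade gracefully on backend outage
        self._entries[image_id] = data
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)     # evict the least recently used entry
        return data
```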

What about other components in the product architecture? Is it reasonable to use caching for everything? Can I isolate services running in the cloud to my regions? Yes, and you should. Running a service in the cloud does not prevent it from being the source of a global outage. A service running in different cloud regions can still behave as a global service and a single point of failure. Isolating a service to a failure domain is an architectural decision, and it's not guaranteed solely by the infrastructure running the service.

Let's take a look at a practical example. The control component performs a series of content quality verifications. The developer team recently integrated an automated abuse detection routine into the control component, which allows validating content quality at the time a new picture is uploaded. An ill-intentioned customer starts uploading pictures of deviant animals into PetPic, which is expected to serve only pictures of dogs and cats. The stream of uploads activates the automated abuse detection in our control component, but the new ML routines cannot keep up with the amount of requests. Although control limits the number of threads dedicated to processing abuse to 50%, they end up consuming all processing resources, and customers in both regions start experiencing high latency when uploading images to PetPic. If we isolate the control component's operations to a single region, we can further restrict the impact of abuse situations like this one. Even if a service runs in the cloud, making sure that each region has its own dedicated instance of control will guarantee that only customers in Happytails are affected by the stream of bad image uploads. Notice that stateless services can easily be restricted to a failure domain. Isolating databases isn't always possible, but you may consider implementing local reads from the cache, with eventual cross-region consistency, as a good compromise. The more you can keep the processing stack regionally isolated, the better.
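A small sketch of that regional pinning, with hypothetical instance addresses and function names, could look like this; the key point is that there is deliberately no cross-region fallback.

```python
# Hypothetical sketch: one dedicated control instance per region, so an abuse
# flood in Happytails cannot consume Furland's verification capacity.
CONTROL_INSTANCES = {
    "Happytails": "control.happytails.internal",   # illustrative addresses
    "Furland": "control.furland.internal",
}

def verify_upload(region, image, call_control):
    # call_control(endpoint, image) -> True if the content is acceptable.
    # Deliberately no fallback to another region's control instance: if the
    # local instance is overloaded, only this region degrades, keeping the
    # failure domain small.
    endpoint = CONTROL_INSTANCES[region]
    return call_control(endpoint, image)
```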

Tips

Keeping all services in the service stack colocated and restricted to the same failure domain can prevent widespread global outages. Isolating stateless services to a failure domain is often easier than isolating stateful components. If cross-region communication cannot be avoided, consider strategies for graceful degradation and eventual consistency.

Planning SLOs

In this last example, we will review the PetPic SLOs and verify whether they are achievable given the SLOs provided by each backend. The SLOs are the contract we have with our customers. This table presents the SLOs engineers believe would provide PetPic customers with a good user experience. Here, we can also see the SLOs offered by each internal component. The API SLOs must be built based on the SLOs of the API's backends. If a better API SLO is required but not possible, we need to consider changing the product design and working with the backend owners to provide better performance and availability. Given our latest architecture for PetPic, let's see if the SLOs for the API make sense.

Let's start with our operational backend, that is, the backend that collects health metrics about PetPic's APIs. The API service only calls Ops to inject monitoring data about the requests, errors, and processing time of operations. All writes to Ops are done asynchronously, and failures do not impact the API service quality. With these considerations in mind, we can disregard the Ops SLO when computing the external SLO for PetPic.
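A minimal sketch of that asynchronous, fire-and-forget export to Ops might look like the following; the queue size and names are assumptions.

```python
import queue
import threading

# Hypothetical sketch: monitoring data is exported to Ops off the request path,
# so Ops latency or failures never show up in the external SLO.
_metrics_queue = queue.Queue(maxsize=10_000)

def record_request_metric(metric):
    """Called on the request path: never blocks, never raises."""
    try:
        _metrics_queue.put_nowait(metric)
    except queue.Full:
        pass                         # drop the metric rather than add latency

def start_exporter(send_to_ops):
    """Background export to Ops; its failures never reach user requests."""
    def _loop():
        while True:
            metric = _metrics_queue.get()
            try:
                send_to_ops(metric)
            except Exception:
                pass                 # Ops being down doesn't affect the API
    threading.Thread(target=_loop, daemon=True).start()
```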

Let's take a look at the user journey for reading a picture from PetPic. Content quality is only verified when new data is injected into PetPic, so reads won't be affected by the control service's performance. Besides retrieving the image information, the API service needs to process the request, which our benchmarks indicate takes about 30 milliseconds. Once the image is ready to be sent, the API needs to yield a response, which takes about 20 milliseconds on average. This adds up to 50 milliseconds of processing time per request in the API alone. If we can guarantee that at least half of the requests will hit an entry in the local cache, then promising a 50th percentile of 100 milliseconds is quite reasonable. Notice that if we didn't have the local cache, the 50th-percentile latency would be at least 150 milliseconds, that is, 50% higher. For all other requests, the image would need to be queried from the database. The database takes from 100 to 240 milliseconds to reply, and it may not be colocated with the API service. The network latency is 100 milliseconds on average. The longest time a request could take is 50 milliseconds, plus 10 for the cache miss, plus 100, plus 240 milliseconds, which adds up to 400 milliseconds. Looks like the read SLOs are well aligned with the API backends.
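Written out as a quick back-of-the-envelope check, the read-path arithmetic looks like this; the latency figures are the ones quoted above, the variable names are mine.

```python
# Read-path latency check, using the figures quoted in the talk.
api_processing_ms = 30 + 20      # request processing + yielding the response
cache_lookup_ms = 10
db_read_ms_max = 240             # database replies in 100-240 ms
network_ms = 100                 # API and database are not colocated

# Cache hit (at least half of the read requests): well under the 100 ms target.
p50_read_ms = api_processing_ms + cache_lookup_ms                       # 60 ms

# Cache miss, worst case: failed lookup, then a cross-region database read.
worst_read_ms = api_processing_ms + cache_lookup_ms + network_ms + db_read_ms_max
print(p50_read_ms, worst_read_ms)                                       # 60, 400 ms
```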

Let's check the SLO for uploading a new image. When a customer requests a new image upload to PetPic, the API must ask control to verify the content, which may take from 150 milliseconds to 800 milliseconds. Besides checking for abusive content, control also verifies whether the image is already present in the database. Images present in the database are considered good and don't need to be re-verified. Historical data shows that customers in Furland and Happytails tend to upload the same set of images in both regions. When an image is already present in the database, control can create a new ID for it without duplicating the data, which takes about 50 milliseconds. This journey fits about half of the write requests, leading the 50th-percentile latency to be 50 plus 150 plus 50, summing up to 250 milliseconds.

Images with abusive content usually take longer to be processed. The deadline for control to return a response is 800 milliseconds. If the image is considered bad, or a verdict cannot be reached, the response usually takes 50 plus 800 milliseconds, that is, 850 milliseconds to complete. If the image is a valid picture of a dog or a cat, and it's not already present in the database, the database may take up to 1000 milliseconds to save it. For good images, it may take up to 50 plus 100 plus 800 plus 1000 milliseconds, or almost 2000 milliseconds, to return a response. This is way above the current SLO engineers have projected for writes. One could consider bounding the 99th-percentile SLO with the request deadline, but this can also mask poor performance of the service. For instance, the database may finish writing the data after the API has already reported a deadline-exceeded response to the client, causing confusion on the customer side. It's better to work with the database team on a strategy to improve the performance, or to adjust the write SLO for PetPic.
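The write-path figures can be checked the same way; again, the numbers are the ones quoted above, and the variable names are mine.

```python
# Write-path latency check, using the figures quoted in the talk.
api_processing_ms = 50            # same 30 + 20 ms as for reads
control_dedup_ms = 50             # image already known: control only mints a new ID
control_check_min_ms = 150        # full content verification, best case
control_deadline_ms = 800         # control's response deadline
network_ms = 100                  # API and database are not colocated
db_write_ms = 1000                # worst case to persist a brand-new image

# About half of the writes are re-uploads of images already in the database.
p50_write_ms = api_processing_ms + control_check_min_ms + control_dedup_ms   # 250 ms

# Bad image, or no verdict reached: control runs up to its deadline.
rejected_write_ms = api_processing_ms + control_deadline_ms                  # 850 ms

# New, valid image: full check plus a cross-region database write.
worst_write_ms = api_processing_ms + network_ms + control_deadline_ms + db_write_ms
print(p50_write_ms, rejected_write_ms, worst_write_ms)                       # 250, 850, 1950 ms
```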

Tips

Let's review some tips to make sure your distributed product offers the correct SLOs to customers. When building an external SLO, take into account the current SLOs of all backends. Consider all the different user journeys and the different paths a request may take to generate a response. If a better SLO is required, consider changing the service architecture or working with backend owners to improve the service. Keeping services and backends colocated makes it easier to guarantee SLO alignment.

Takeaways

When managing dependencies for distributed microservices, consider the different types of growth when evolving the product: the number of users, user behavior, and service relationships. Stateless services are often easier to manage than stateful ones. Colocate service components for better performance, easier failure isolation, and SLO alignment. When building an external SLO, take into account the current SLOs of all backends and the different user journeys. Work with backend owners, and allow extra time for resource allocation and architectural changes.

 


 

Recorded at: Jul 15, 2021
