Five Ways to Not Mess Up Microservices in Production
Earlier this month Takipi's Alex Zhitnitsky wrote about five ways not to mess up microservices in production (OK, "mess" wasn't exactly the word he used!). He builds on the output of the panel discussion we reported on a few months back, as well as on input from Takipi's own users, examining the top production problems with microservices and how to solve them. The emphasis, as we will see, is on distributed debugging and on approaches that can help make it a tractable effort. The first problem he discusses is monitoring:
The point is that monitoring wise, the system becomes highly fragmented and a stronger need arises for centralized monitoring and logging, to have a fair shot at understanding what’s going on.
Alex mentions a scenario described in a recent podcast where a bad version had to be rolled back, which entailed identifying the correct microservice and also the possible impact on others when the rollback occurred. He concludes with:
Takeaway #1: If you thought monitoring a monolith architecture was hard, it’s 10x harder with microservices and requires a bigger investment in planning ahead.
The second problem he covers is logging:
With a monolith architecture, your logs are probably already scattered in different places, since even with a monolith mindset you probably had to use a few different layers that probably log to different places. With microservices – your logs break further down. Now when investigating a scenario related to some user transaction, you’ll have to pull out all the different logs from all the services that it could have gone through to understand what’s wrong.
Which leads to the second of his takeaways:
Takeaway #2: Microservices are all about breaking things down to individual components. As a side effect, ops procedures and monitoring are also breaking down per service and lose their power for the system as a whole. The challenge here is to centralize these back using proper tooling.
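To make the idea of centralizing logs concrete, here is a minimal sketch in Python. It is not taken from Alex's article: it assumes every service stamps its log records with a shared transaction identifier, and the field names (`txn_id`, `ts`, `service`, `msg`), the service names and the helper `group_by_transaction` are all hypothetical. Real deployments would normally rely on a log aggregation tool rather than hand-rolled code.

```python
from collections import defaultdict


def group_by_transaction(*service_logs):
    """Merge per-service log records and group them by transaction id.

    Assumes each record is a dict carrying a 'txn_id' and a 'ts'
    (timestamp) field; both names are illustrative, not a standard.
    """
    merged = defaultdict(list)
    for records in service_logs:
        for record in records:
            merged[record["txn_id"]].append(record)
    # Order each transaction's records by time to get a single timeline.
    for records in merged.values():
        records.sort(key=lambda r: r["ts"])
    return dict(merged)


# Hypothetical log records from three separate services.
gateway_log = [
    {"ts": 1, "service": "gateway", "txn_id": "t-1", "msg": "request received"},
    {"ts": 5, "service": "gateway", "txn_id": "t-2", "msg": "request received"},
]
orders_log = [
    {"ts": 2, "service": "orders", "txn_id": "t-1", "msg": "order created"},
]
billing_log = [
    {"ts": 3, "service": "billing", "txn_id": "t-1", "msg": "charge failed"},
]

timeline = group_by_transaction(gateway_log, billing_log, orders_log)
for record in timeline["t-1"]:
    print(record["ts"], record["service"], record["msg"])
```

With the records grouped this way, investigating one user transaction means reading a single ordered timeline instead of pulling logs from each service separately.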
The third problem relates to the oft-quoted statement from Leslie Lamport on the definition of a distributed system: "A distributed system is one where a machine I’ve never heard of can cause my program to fail." Or as Alex puts it, "An issue that’s caused by one service, can cause trouble elsewhere." The takeaway from this is also fairly obvious:
Takeaway #3: With monoliths you usually know that you’re looking in the right direction; microservices make it harder to understand the source of the issue and where you should get your data from.
Finding the root cause of a problem in a distributed system is the basis of the next problem, and this is where your logs help but are only part of the solution:
In most cases, the first few bits of variable data you HOPEFULLY get from the log won’t be the ones that move the needle. They usually lead to the next clue which requires you to uncover some more of the magic under the hood and add another beloved log statement or two. Deploy the change, hope for the issue to reoccur, or not, because… sometimes merely adding a logging statement seems to solve issues. Some sort of a diabolical reverse Murphy’s law. Oh well.
So we have our next takeaway:
Takeaway #4: When the root cause of an error in a microservice spans across multiple services, it’s critical to have a centralized root cause detection tool in place.
Alex's final problem is about version management and cyclic dependencies between services. As he mentions, there are two problems related to keeping your dependencies in check:
1. If you have a cycle of dependencies between your services, you’re vulnerable to distributed stack overflow errors when a certain transaction might be stuck in a loop.
2. If two services share a dependency, and you update that other service’s API in a way that could affect them, then you’ll need to update all three at once. This brings up questions like, which should you update first? And how to make this a safe transition?
More independent services mean a higher probability that each will have its own release cycle, which adds to the complexity, especially around reproducibility. Hence the fifth and final takeaway:
Takeaway #5: In a microservice architecture you’re even more vulnerable to errors coming in from dependency issues.
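Both dependency problems can be checked mechanically once the service dependencies are written down. The following is a rough sketch rather than anything proposed in the article: it uses Python's standard `graphlib` module (Python 3.9+), the service names and the helper `plan_updates` are hypothetical, and it assumes that updating a shared dependency before its consumers is the safer order, which only holds if the change is backward compatible.

```python
from graphlib import CycleError, TopologicalSorter


def plan_updates(dependencies):
    """Return (order, cycle) for a service dependency map.

    `dependencies` maps each service to the set of services it calls.
    If the graph is acyclic, `order` lists services with dependencies
    first and `cycle` is None. Otherwise `order` is None and `cycle`
    is the list of services forming the loop.
    """
    try:
        order = list(TopologicalSorter(dependencies).static_order())
        return order, None
    except CycleError as err:
        # graphlib reports the detected cycle as the second exception argument.
        return None, err.args[1]


# Two services sharing one dependency (hypothetical names).
shared = {"orders": {"auth"}, "billing": {"auth"}, "auth": set()}
order, cycle = plan_updates(shared)
print("update order:", order)

# A cycle of dependencies: a transaction could loop between these services.
cyclic = {
    "orders": {"billing"},
    "billing": {"inventory"},
    "inventory": {"orders"},
}
cyclic_order, cyclic_cycle = plan_updates(cyclic)
print("cycle found:", cyclic_cycle)
```

In the first case the shared dependency comes out first in the update order; in the second no order exists and the loop itself is reported, which is the situation the first point above warns about.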
It's good to get even more input into the microservices discussion from groups putting them into production. It's also good to see that many of them agree on the core problems and on some common approaches to tackling them.