Sam Newman: Practical Implications of Microservices in 14 Tips
What are the practical concerns associated with running microservice systems? And what do you need to know to embrace the power of smaller services without making things too hard? At GeeCon 2014 in Kraków, Sam Newman tried to answer those questions by giving 14 tips about how microservices can interface and how they can be monitored, deployed, and made safer.
Benefits of microservices
When we talk about microservices, says Sam, we talk about fine-grained systems, where each component is small: around a thousand lines, 500 lines, or even fewer in some situations.
Among the benefits that microservices offer, Sam highlights a few. Microservices:
- Can easily reflect domain-level entities and operations.
- Can be more easily aligned to the structure of an organization and adapt to its evolution.
- Can be independently deployed, which is of great help for continuous delivery, since it allows you to redeploy a single service independently of the rest of the system.
- Make it easier to adopt new technologies. Indeed, in a single monolithic system it is very hard to mix, say, Scala or Clojure into a Java codebase. With microservices, on the other hand, you can select a service, perhaps a lower-risk one, to try a new technology on, and in the process you may learn that the approach is the best fit for that problem.
- Allow a fine-grained approach to performance tuning and scaling: instead of scaling your whole system, you can scale only those parts that really need it.
Sam compares two different criteria when building a system:
- Standardisation, i.e., looking for a single standard way of doing things; this approach is typically taken to build monolithic systems, where consistency and safety concerns are put first.
- Free for all, i.e., allowing for some degrees of freedom when it comes to choosing the technology stack inside a service or the way of working. The big advantage of this is that it allows autonomy.
This duality introduces a tension about where you put the boundary: what you are going to standardise when you break down your system into small components or add a new one.
This question has a clear answer for Sam, based on an architectural analogy. When building a city, great freedom is allowed at the level of single buildings, while roads and utilities are highly standardised. The equivalents of roads and utilities in software architecture are APIs, monitoring, deployment management, and architectural safety, says Sam. His 14 tips are structured along those boundaries.
- Standardise in the gaps between services, be flexible about what happens inside them.
A nice way of doing this is defining interfaces for each microservice and enforcing them so that coupling is reduced as much as possible.
- Avoid RPC-mechanisms or shared serialisation protocols to avoid coupling.
Integration style defines how the individual nodes collaborate. Sam proposes an evolutionary view of integration styles, going from the least desirable one, which he calls data oriented and which revolves around a central database, to the more desirable document oriented and resource oriented styles. In the middle sits the procedure oriented style, based on RPC mechanisms and serialisation protocols. Sam considers this style a really poor fit for microservices, since it limits the possibility of integrating diverse microservices.
- Have one, two, or maybe three ways of integrating, not 20.
Another suggestion from Sam related to integration style is not to multiply integration styles, but to focus on a small number of them, so that the system does not become unmanageable.
- Pick some sensible conventions and stick with them.
Shared conventions are really necessary if you want to allow a certain degree of freedom. Sam's suggestion is to pick a good book, such as Apigee's Web API Design or REST in Practice, stick to it, and not mix different conventions.
- Avoid distributed transactions if possible.
Distributed transactions make a great promise, but they are really difficult to get right and make work, as the CAP theorem suggests. So, Sam says, it is better to simply avoid them.
- Capture metrics and logs, for each node, and aggregate them to get a rolled up picture.
Capturing metrics and logs is of capital importance when it comes to monitoring. The most difficult part of this is managing the log files stored on different nodes. One option is using SSH multiplexing, but this is far from optimal when tools such as Logstash and Kibana are available to make the task so much easier. Another good option, according to Sam, is Graphite, which can be fed through collectd on Linux boxes or NSClient++ on Windows boxes.
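Graphite ingests metrics through a simple plaintext protocol: one "path value timestamp" line per reading, typically sent to its carbon listener on port 2003. A minimal sketch of formatting such a line follows; the `web.` path prefix and metric names are assumptions for illustration, not from the talk:

```python
# Sketch of the Graphite plaintext protocol. In a real setup this line
# would be written to a TCP socket connected to carbon (port 2003).
import time

def graphite_line(host, metric, value, ts=None):
    ts = int(ts if ts is not None else time.time())
    # A per-node path segment keeps metrics separable per node while still
    # easy to roll up with Graphite wildcards such as "web.*.requests".
    return f"web.{host}.{metric} {value} {ts}"

line = graphite_line("node1", "requests", 17, ts=1000)
```

Tools like collectd produce lines of exactly this shape on each node, which is what makes the rolled-up picture cheap to assemble.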
- Use synthetic transactions to test production systems.
Rather than focusing on low-level behaviour like CPU usage and response times, Sam suggests focusing on domain-level operations and building an alert system that tells you when things go wrong. You need to track the low-level behaviour of your system, for sure, but that is only useful when you are "drilling down" to understand what caused a problem. On the other hand, Sam describes himself as a huge fan of suites of tests run against the production system to check that key user journeys are working correctly. This technique is known as synthetic transactions or semantic monitoring.
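A synthetic transaction might be sketched as follows. The `place_order` and `check_order_status` operations are hypothetical stand-ins for real service calls, invented for this example:

```python
# Semantic monitoring sketch: exercise a key user journey end to end and
# report a domain-level pass/fail, rather than watching CPU graphs.

def place_order(client):
    return client("POST /orders")

def check_order_status(client, order_id):
    return client(f"GET /orders/{order_id}")

def synthetic_order_journey(client):
    """Return True if the end-to-end 'place an order' journey works."""
    order = place_order(client)
    if order.get("status") != "accepted":
        return False
    status = check_order_status(client, order["id"])
    return status.get("state") == "processing"

# Fake client standing in for real HTTP calls in this dry run.
def fake_client(request):
    if request == "POST /orders":
        return {"status": "accepted", "id": 42}
    return {"state": "processing"}

journey_ok = synthetic_order_journey(fake_client)
```

In production, a scheduler would run such journeys continuously against the live system and raise an alert the moment one returns False.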
- Use correlation IDs to track down nasty bugs.
In a microservice-based system, it usually happens that a task is carried out across different nodes. This entails the risk of losing the big picture when investigating issues. An easy way to correlate tasks on different nodes that belong to the same domain-level operation is to associate a correlation ID with the higher-level operation and let this ID flow through the system, so you can later reconstruct the whole history of the operation as it moved across the system.
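The idea can be sketched in a few lines; the `X-Correlation-ID` header name is a common convention, used here as an assumption rather than anything prescribed in the talk:

```python
# Correlation ID propagation sketch: mint an ID at the edge of the system
# and copy it onto every downstream call, so log lines from different
# nodes can later be joined into one story.
import uuid

def handle_incoming(headers):
    # Reuse the caller's correlation ID, or mint one at the system's edge.
    return headers.get("X-Correlation-ID") or str(uuid.uuid4())

def outgoing_headers(cid):
    # Attach the same ID to every downstream call, unchanged.
    return {"X-Correlation-ID": cid}

# A request entering the edge gets a fresh ID...
edge_cid = handle_incoming({})
# ...and every hop after that carries it along.
hop = outgoing_headers(edge_cid)
```

Each service then includes the ID in its log lines, which is what makes the aggregated logs searchable by operation.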
- Abstract out underlying platform differences to provide a uniform deployment mechanism.
While ensuring you give your microservices the autonomy necessary to explore new technology stacks and approaches, it is still a good idea to find a way to handle the different deployment strategies that different technology stacks use, so that deployment does not become problematic. This can be done, according to Sam, by building abstractions on top of the different deployment methods, particularly using scripting. He mentions Fabric, a Python library for managing deployment on local and remote machines. Once you have your deployment process defined in higher-level terms, adding a new technology stack only requires implementing a few primitive behaviours specific to that stack.
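A hedged sketch of this abstraction, with made-up stack names and commands: each stack registers its primitives, and one driver produces the same high-level deployment plan for all of them (a real script would execute these over SSH, as Fabric does):

```python
# Uniform deployment sketch: per-stack primitives behind one driver.
# Stack names and commands below are illustrative, not from the talk.

DEPLOY_PRIMITIVES = {
    "jvm":    {"package": "mvn package",   "start": "java -jar app.jar"},
    "python": {"package": "pip install .", "start": "gunicorn app:api"},
}

def deploy(stack, host):
    steps = DEPLOY_PRIMITIVES[stack]
    # Same high-level process for every stack; only the primitives differ.
    return [f"{host}: {steps['package']}", f"{host}: {steps['start']}"]

plan = deploy("jvm", "prod-1")
```

Adding, say, a Go stack then means adding one entry of primitives, not a new deployment process.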
- Have a single way of deploying services in any given environment.
Another advantage of scripting is that it makes it possible to manage deployment uniformly across different environments, such as a development machine or a production system, even though the latter rarely resembles the former.
- Use consumer-driven tests to catch breaking changes.
Deploying usually implies going through a pipeline of steps, such as ensuring the code compiles, smoke testing, integration testing, and acceptance testing. Once all those steps have passed, you can deploy. What we look for when deploying a new component is quick feedback, so we know as soon as possible if we have broken anything. Integration tests are one way to deal with this, but they are usually slow, says Sam, and sometimes fail for causes that have nothing to do with the deployed system, e.g., the network being temporarily down. A better approach, according to Sam, is using consumer-written tests, an idea that resembles using key user journeys to monitor the system.
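A minimal sketch of a consumer-driven contract (the field names and contract format are invented for illustration): the consumer declares the response fields it depends on, and the provider's build verifies a sample response against that declaration before anything is deployed:

```python
# Consumer-driven contract sketch. The consuming team records the shape
# of the response it relies on; the providing team's pipeline runs the check.

CONSUMER_CONTRACT = {
    "endpoint": "/customers/123",
    "required_fields": {"id", "name", "email"},
}

def provider_satisfies(contract, sample_response):
    # A breaking change (renaming or removing a field the consumer relies
    # on) fails here, in the provider's build, before deployment.
    return contract["required_fields"] <= set(sample_response)

# A sample response from the provider's current code:
response = {"id": 123, "name": "Ada", "email": "ada@example.com", "tier": "gold"}
ok = provider_satisfies(CONSUMER_CONTRACT, response)
```

Note that extra fields (like `tier` above) do not break the contract; only removing something a consumer depends on does, which is exactly the fast, relevant feedback the pipeline needs.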
- Don't let changes build up - release as soon as you can, and preferably one change at a time.
Deploying several microservices at the same time is a bad idea: if something goes wrong, it becomes hard to understand how to fix the problem. Furthermore, deploying multiple services at the same time defeats one of the major advantages of microservices, which is that they allow you to deploy small changes one at a time, independently.
- Use timeouts, circuit breakers, and bulk-heads to avoid cascading failure.
When a node starts behaving incorrectly, there is a chance that this could affect other nodes. As an example, Sam mentions the case where a thread pool becomes exhausted in one node; threads in other nodes quickly start blocking while waiting for that pool to start working again, and their number can grow from a few to thousands, at which point those nodes crash too. Such a failure can then propagate further up and make things worse. Cascading failures are a vulnerability and should be avoided. According to Sam, a great book that goes into detail on this issue is "Release It!" by Michael T. Nygard. It describes useful patterns such as the circuit breaker, whereby a node that cannot talk to a downstream node kills the connection and fails fast instead of retrying indefinitely. Circuit breakers are so essential that Sam goes as far as to say it is entirely reasonable to have a circuit breaker for each node in the system. Another useful pattern for avoiding cascading failures is bulkheads, where separate thread pools are used for different downstream nodes, so that if one pool gets exhausted, the others are still available.
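A circuit breaker can be sketched in a few lines; the thresholds and names below are illustrative, and production implementations (including those described in "Release It!") handle half-open probing and concurrency more carefully:

```python
# Minimal circuit breaker sketch: after repeated failures, stop calling
# the downstream node and fail fast until a reset timeout elapses.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of queueing behind a dead service.
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

cb = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)

def flaky():
    raise ConnectionError("downstream is down")

# Two failures trip the breaker...
for _ in range(2):
    try:
        cb.call(flaky)
    except ConnectionError:
        pass

# ...after which calls fail fast without touching the downstream node.
try:
    cb.call(lambda: "ok")
    tripped = False
except RuntimeError:
    tripped = True
```

The key property is that the caller's threads are released immediately while the circuit is open, which is what stops one slow node from exhausting thread pools upstream.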
- Consider service templates to make it easy to do the right thing.
A powerful approach that Sam suggests is separating the specific behaviour of a service from what is needed to make it work within the system. As an example, one could have an integration layer for upstream and downstream nodes and a metrics layer surrounding the service. This provides a way of codifying how services are standardised.
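As a toy illustration of the idea (the metrics layer and the handler are invented for this sketch), the standardised concerns can wrap the service-specific behaviour, so every new service gets them for free:

```python
# Service template sketch: the handler carries only domain behaviour,
# while the template layers in the standardised concerns (here, metrics).

def with_metrics(metrics, name, handler):
    def wrapped(request):
        metrics[name] = metrics.get(name, 0) + 1  # count calls per operation
        return handler(request)
    return wrapped

metrics = {}

def price_quote(request):
    # The service-specific part: everything else comes from the template.
    return {"quote": request["amount"] * 1.2}

service = with_metrics(metrics, "price_quote", price_quote)
result = service({"amount": 10})
```

A real template would stack further layers the same way (logging, correlation IDs, circuit breakers), keeping "the right thing" the path of least resistance.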
About the Author
Sergio de Simone is an independent iOS developer and consultant. Sergio has been working as a software engineer for over fifteen years across a range of different projects and companies, including such diverse work environments as Siemens, HP, and small startups. Currently, his focus is on development for mobile platforms and related technologies.
Most important points from my experience:
* Prefer asynchronous over synchronous communication patterns, else your microservice-based system will perform an order of magnitude worse than a monolith, at twice the development and operational cost.
* Don't use schemaless messaging (e.g. JSON). Async RPC / specified messages and protocols have big productivity advantages.
A service inherently owns a schema (the way it expects messages to look). Not specifying it will give you nothing but an untestable, wobbly string mess over time. Once the system/protocol of a service changes, you need strict control and checks on whether received messages are well formed. Message versioning is also important (at least if you aim for a loosely coupled microservice cluster architecture).
* Flow control and backpressure. Once microservices talk to each other, there will be a slowest service, in turn bogging down other services which rely on it. Once this cascades, your cluster can easily end up in an undefined state. Detecting overload and cascading backpressure is hard, but required to guarantee system availability [agree with that part of the article].
* Failure detection and failure recovery. (False failures, failing consensus etc.)