During Ben Sigelman's years at Google, the company was building what we today call a microservices architecture. Mistakes were made during this adoption, and he believes the rest of the industry is repeating them today. In a presentation at QCon London 2019, Sigelman shared his recommendations for avoiding these mistakes when starting with microservices.
Sigelman, CEO and co-founder of LightStep, started with a brief summary of his experiences during his years at Google, and noted that the most respected projects from that time shared some common characteristics:
- Identification and leverage of horizontal scale-points
- Well-factored application layer infrastructure, for instance, RPC, discovery, load-balancing, authorization
- Rolling upgrades and weekly releases
Sigelman’s first recommendation when starting with microservices is that you should understand why you are doing microservices. He refers to Vijay Gill, SVP of engineering at Databricks, who in a recent presentation argued that the only good reason to use microservices is that you will inevitably ship your organizational chart. For Sigelman, the main reason for adopting microservices is human communication. We still have not found a way for more than a dozen engineers to work effectively on a single piece of software, so we need team independence to maintain development velocity.
For Google, it was technical reasons, such as planet-scale requirements, that led to a microservices architecture. But this also caused problems, since technical requirements and throughput, not team velocity, were the main reasons for the adoption. Sigelman therefore emphasizes that we should be very mindful about why we adopt microservices: are the reasons team independence and velocity, or technical aspects?
High velocity is important, but Sigelman doesn’t agree with the assumption that the best way to achieve velocity is to give each team total freedom over the technology they use. One important part of a microservices architecture is that the cost of common concerns, like security and monitoring, should be shared by the whole organization. Having every team build these things on their own will be very costly. Sigelman therefore recommends that there should be one or a few supported platforms from which each team can select the one most suitable for their services.
For Sigelman, serverless computing is the way we should be doing computing, but he points out that Function as a Service (FaaS) is extremely limited, and that we commonly confuse FaaS with the idea of serverless computing. One important aspect when using serverless is the huge difference in latency between calling a function within the same process and making a network call. For example, a memory reference is orders of magnitude faster than sending a packet across the Atlantic Ocean. He refers to a paper by Hellerstein et al., Serverless Computing: One Step Forward, Two Steps Back, where these issues are described and quantified. Although he believes serverless makes total sense in certain situations, like serverless at the edge, he is sceptical about using serverless as the backbone for a microservices architecture. It should be treated with caution and a lot of upfront estimation, because it’s a different approach compared to adopting microservices from a team perspective.
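To give a rough sense of the gap Sigelman describes, the sketch below compares a few approximate latency figures. The numbers are commonly cited ballpark values and are an assumption made here for illustration; they were not taken from the talk or from the paper, and real values vary with hardware and network conditions. The names `LATENCY_NS` and `slowdown` are hypothetical.

```python
# Commonly cited ballpark latencies in nanoseconds.
# Assumption: illustrative values only, not figures from the presentation.
LATENCY_NS = {
    "main_memory_reference": 100,
    "same_datacenter_round_trip": 500_000,
    "intercontinental_round_trip": 150_000_000,
}


def slowdown(operation: str, baseline: str = "main_memory_reference") -> float:
    """Return how many times slower `operation` is than `baseline`."""
    return LATENCY_NS[operation] / LATENCY_NS[baseline]


if __name__ == "__main__":
    # Print each operation relative to a single memory reference.
    for name in LATENCY_NS:
        print(f"{name:30s} {slowdown(name):>12,.0f}x")
```

Under these assumed figures, an intercontinental round trip is about six orders of magnitude slower than a memory reference, which is why a call that was cheap inside one process can dominate response time once it becomes a network hop.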
Giant dashboards showing a lot of time-series data are a great way of visualizing variance over time, but they are bad at pinpointing an actual problem. The main reason for adopting a microservices architecture is to reduce communication between teams, but during an outage this works against finding the root cause. To overcome this and be able to find the reason for an outage, we must reduce the search space, and in practice Sigelman sees observability of a system as two activities:
- Detection of critical Service Level Indicators (SLIs), like latency, error rate and request rate. This is a small subset of all time-series data, but a subset that really matters for a user
- Explaining the variance found; for microservices, this is especially variance over time and variance in the latency distribution
By making a good hypothesis for where this variance originates you have probably come very close to resolving the issue. He points out that visualizing everything on a giant dashboard is a terrible way to explain variance and can lead to more confusion.
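As a minimal sketch of the detection step, the following code reduces one time window of requests to the three SLIs mentioned above. The `Request` type, the `compute_slis` function, and the choice of nearest-rank percentiles are assumptions made for illustration; they do not represent any specific monitoring product or the method described in the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    is_error: bool     # True if the request failed


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def compute_slis(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize one time window of requests into a handful of SLIs."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if r.is_error)
    return {
        "request_rate_per_s": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p99_ms": percentile(latencies, 99),
    }
```

Reporting both a median and a high percentile, rather than an average, keeps the shape of the latency distribution visible, which matters when the variance to be explained sits in the tail.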
Sigelman describes distributed tracing as a log of individual transactions as they pass through each microservice, enabling us to follow the lifetime of a transaction throughout the system. One of the biggest challenges in tracing is the huge data volume in large systems. You must use sampling to reduce the amount of data created, but then you also face the risk of not retaining intermittent faults. Another problem is that although visualizing individual traces is necessary, the raw distributed trace data is too overwhelming for a person to reason about. For him, a superior approach is to:
- Consume 100% of raw distributed trace data
- Measure SLIs with high precision
- Explain variance with sampling designed for a specific SLI, and real statistics
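The sampling trade-off can be illustrated with a small sketch. The first function computes the chance that uniform random sampling retains none of the traces that contain a rare fault; the second shows one hypothetical way to bias sampling toward a latency SLI by always keeping slow traces. Both functions, their names, and the dictionary shape of a trace are assumptions made for illustration and are not Sigelman's or LightStep's implementation.

```python
import random


def probability_all_missed(sample_rate: float, faulty_traces: int) -> float:
    """Chance that uniform sampling at `sample_rate` keeps no faulty trace."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # Each faulty trace is dropped independently with probability 1 - sample_rate.
    return (1.0 - sample_rate) ** faulty_traces


def sample_for_latency_sli(traces, threshold_ms, baseline_rate, rng):
    """Keep every trace slower than `threshold_ms`, plus a uniform sample of the rest.

    Assumption: each trace is a dict with a "latency_ms" key.
    """
    return [
        trace for trace in traces
        if trace["latency_ms"] > threshold_ms or rng.random() < baseline_rate
    ]


if __name__ == "__main__":
    # With 1% uniform sampling and 10 faulty traces, the fault is usually lost.
    print(f"{probability_all_missed(0.01, 10):.3f}")
    traces = [{"id": 1, "latency_ms": 12.0}, {"id": 2, "latency_ms": 950.0}]
    print(sample_for_latency_sli(traces, 500.0, 0.1, random.Random(0)))
```

With a 1% uniform sample rate and ten faulty transactions, there is roughly a 90% chance that none of them is retained, whereas the SLI-aware sampler always keeps traces that breach the chosen latency threshold.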
He notes that tracing is extremely valuable as a data source, but that the industry's use of it is still primitive. He describes it as the youngest and most immature of the three pillars of observability.
In a keynote at the conference, Sigelman discussed distributed tracing and observability in more detail.
Most presentations at the conference were recorded and will be available on InfoQ over the coming months. Sigelman's presentation from QCon San Francisco in November 2018 is already available: Lessons from the Birth of Microservices. The next QCon conference, QCon.ai, will focus on AI and machine learning and is scheduled for April 15-17, 2019, in San Francisco. QCon London 2020 is scheduled for March 2-6, 2020.