InfoQ recently spoke with Ben Sigelman, CEO of LightStep and founder of the OpenTracing and OpenTelemetry projects, to discuss the challenges of managing microservices in deep systems, where individual service owners interact with a huge number of service dependencies that they do not own.
As described in Sigelman's recent keynote at Systems @ Scale, the issue resides in the gap between control and responsibility, and in how teams can accurately determine what is happening within and between each service. Development teams control several services that call, or are called by, other services; they then inherit responsibility for those connected services despite not owning them. As services call each other, the chain of communication grows deep and complicates a team's ability to quickly diagnose exactly where an error or slowdown occurs.
Unlike standard performance monitoring, microservice performance issues can creep in through changes in a communication pattern. For example, monitoring may indicate slowness in one service even though its code runs within established parameters, because the real issue is a change in a calling service that has significantly increased its demand.
The key to resolving issues in microservice systems is enabling observability and control within services, so teams can quickly identify where a performance issue exists, either within or between microservices, removing ambiguity. "Finger-pointing occurs when the data is unclear," explains Sigelman. "When an issue occurs, there's a term MTTI, which stands for Mean-Time-To-Innocence. When data is clear, MTTI is low, and when there is no data or that data is not clear, MTTI is high." High MTTI leads to long, expensive meetings in which many people debate root cause and blame.
Observability means the ability to quickly see what is happening within and between services without requiring a change to those services. Control is the ability to act on what you have observed. The aim of observability is to gain control. Both observability and control benefit from the standardization of OpenTelemetry, which enables different tools to extract the right metrics/KPIs in the right way so that teams can act on what they see.
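To make the idea concrete, here is a toy, in-memory sketch (not the real OpenTelemetry API; all service names are illustrative) of the core mechanism tracing standardizes: each unit of work records a span with its parent, so latency can be attributed to a specific layer of the call chain rather than appearing as one opaque number.

```python
import time
from contextlib import contextmanager

# Illustrative, in-memory stand-in for a tracer: each span records
# which service did the work, for how long, and under which parent.
spans = []

@contextmanager
def span(service, parent=None):
    start = time.perf_counter()
    record = {"service": service, "parent": parent}
    try:
        yield record
    finally:
        record["ms"] = (time.perf_counter() - start) * 1000
        spans.append(record)

# A three-layer call chain: gateway -> checkout -> payments.
with span("gateway") as g:
    with span("checkout", parent=g["service"]) as c:
        with span("payments", parent=c["service"]):
            time.sleep(0.05)  # the real work (and the real latency) lives here

# The trace attributes latency per layer instead of one opaque total.
for s in spans:
    print(s["service"], s["parent"], round(s["ms"]))
```

In this sketch the gateway's latency is almost entirely inherited from the payments layer two hops down, which is exactly the situation a flat per-service dashboard obscures and a trace makes obvious.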
The transcript of the InfoQ Q&A with Ben Sigelman can be found below:
InfoQ: What is the relationship between OpenTracing and OpenTelemetry?
Ben Sigelman: OpenTelemetry is the standard for telemetry data, its structure, and what to gather. In essence, it turns on the spigot. OpenTracing was similar but narrower, focusing specifically on distributed tracing instrumentation. For someone starting today, I would direct their attention towards OpenTelemetry -- it does more and carries OpenTracing forward.
InfoQ: Your keynote talk covered the role of "deep systems." Why are systems becoming "deep", what problem do they solve, and what new problems do they introduce?
Sigelman: When we say microservices, everyone thinks about the services themselves, not the shape of the larger system. If you have a lot of microservices, they grow deep; there are not just many services, but many layers. If you have 500 services, it's not as if one router or API gateway talks to all 500 services; they talk to each other. And a service can only be as fast as its slowest dependency, so each layer adds a new way for things to go wrong.
The reason the industry moved to microservices was to facilitate autonomy and independence across devops teams, though ironically the depth of these systems often creates friction, inefficiency, and a reduction in overall velocity. This is because it's so difficult to track issues between microservices, understand the complex ways that they rely on each other, and determine which service needs adjustment to restore an SLO (Service Level Objective).
InfoQ: How do control and responsibility factor into microservices?
Sigelman: In any system, you're responsible for your dependencies, your dependencies' dependencies, and so on, yet you only control the services you can build and deploy. In a deep system, the size of your dependency tree grows geometrically with your system depth, and there is too much metric and logging data to sift through using conventional tools. The only way to solve this is to take advantage of tracing data at the core of the observability system. Tracing data is the only data with context about the layers in your system.
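The geometric growth Sigelman describes is easy to quantify with a back-of-the-envelope calculation (the fan-out and depth numbers below are illustrative, not measurements of any real system):

```python
# Illustrative numbers: if each service calls `fanout` downstream services,
# the dependency tree you are responsible for grows geometrically with depth.
fanout = 3

for depth in range(1, 6):
    # services you depend on at exactly this layer
    at_layer = fanout ** depth
    # everything you are responsible for, across all layers so far
    total = sum(fanout ** d for d in range(1, depth + 1))
    print(f"depth {depth}: {at_layer} at this layer, {total} total dependencies")
```

Even with a modest fan-out of three, a five-layer-deep system puts a team on the hook for hundreds of transitive dependencies, which is why per-service metrics and logs alone cannot localize a problem.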
InfoQ: For Java, the OpenJDK team recently open-sourced Flight Recorder / Mission Control, which provides low-overhead performance monitoring of a JVM. How is OpenTelemetry different and where do each of them excel?
Sigelman: Flight Recorder / Mission Control is great for looking at a particular JVM, but microservices are different. Most technical fires that occur within organizations are the result of a code or configuration push. For example, a team above you changes the way they call your service, making 100 calls instead of one -- maybe they need to stop doing that. Profiling can show you that the code is running hot; it won't show you why it's running so often or how it's connected to other services. If you do need to profile a single JVM, though, that tool is phenomenal.
InfoQ: If the aim of OpenTelemetry is to gain observability across these deep services, what are some key aspects that teams should observe?
Sigelman: Services should measure success in the eyes of their consumers. This means establishing an SLI or SLO that measures success: things like response times, error budgets, and so on. You should know what your consumers care about and have precise goals based on that. If investigations begin with an SLI/SLO, the observability system knows what problem you're trying to address, which greatly reduces the size and scope of potential root causes.
InfoQ: When teams establish what success means for their consumers, what concepts often sound alluring but actually lead to problems?
Sigelman: Base system metrics, like CPU and RAM usage, are often not indicative of root problems. Another issue is teams with microservices thinking that loose coupling means total independence and the autonomy to make completely different decisions. Netflix, for example, has a "paved path" of tools and frameworks that work well. When people choose this paved path, they get supported languages, libraries, security checks, and other assistance. Going off this path to something no one else knows makes control difficult, because other teams cannot help you as easily.
InfoQ: For those who need to analyze complex microservices, what are specific tools that can help them?
Sigelman: As a first step towards modern observability, OpenTelemetry can be integrated into systems to gather high-quality telemetry data. Teams can do this with zero code changes, through the Auto Instrumentation Agent. This agent assists with getting access to data, but does not provide analytical tools. For those who'd like to analyze and visualize that data, I would recommend the free LightStep tier.
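For a JVM service, zero-code-change instrumentation typically means attaching the OpenTelemetry Java agent at startup; the jar path, service name, and collector endpoint below are illustrative placeholders:

```shell
# Attach the OpenTelemetry Java agent at JVM startup -- no code changes.
# The jar path, service name, and endpoint are illustrative placeholders.
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.service.name=checkout-service \
     -Dotel.exporter.otlp.endpoint=http://collector:4317 \
     -jar checkout-service.jar
```

The agent rewrites bytecode for common libraries (HTTP clients, JDBC, and so on) at load time to emit spans automatically; the exported data still needs a separate backend for analysis and visualization, as Sigelman notes.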
Both Liz Fong-Jones, of Honeycomb, and I regularly discuss managing microservices at scale on Twitter.