
Embracing Observability in Distributed Systems



Michael Hausenblas discusses good practices and current developments around CNCF open source projects and specifications, including OpenTelemetry and Fluent Bit.


Michael Hausenblas is an Open Source Product Developer Advocate in the AWS container service team focusing on observability. Before AWS, Michael worked at Red Hat, Mesosphere, MapR, and in two research institutions in Ireland and Austria.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.


Hausenblas: Welcome to my session, embracing observability in distributed systems. My name is Michael Hausenblas. I'm an Open Source Product Developer Advocate in the AWS container service team. The focus here is really on observability in the context of distributed systems, for example, containerized microservices.


Let's have a look at the more traditional setup, which would be a monolith. You would have a number of different modules in that monolith. For example, it might be an eCommerce application. You have interfaces with external systems like payment, single sign-on, risk profiles, or an ERP stock application. Obviously, you're talking to your clients; you want to sell something. If you take that monolith and break it apart into a number of microservices, the better a job you did of modularizing your original monolith, the easier it is for these simpler, smaller microservices to exist and interact. What are the characteristics of the overall setup of such a containerized microservices system?


What speaks for it, on the one hand, is increased developer velocity, because different teams can now be responsible for different microservices and iterate independently of each other, with different release cycles and testing. That makes the whole thing faster. You end up with a polyglot system, in the sense that you potentially have different programming languages and different datastores that you can optimize for the task at hand. For example, you might write the renderer in Node.js, and the payment options microservice might be in Java. You also have what I would call partial high availability, meaning that parts of the system might go down, yet to the end user it still looks like some functionality is available. Think of, for example, an eCommerce setup: you might not be able to search for something, but you can still check out what's in your shopping basket.


What are the cons? It is a distributed system now, and very likely those different microservices end up on different nodes. Think of, for example, Kubernetes, where each microservice might be a Deployment and the Pods owned by that Deployment. These end up on different nodes, and the different microservices now use networking to talk to each other. It is much more complex than a monolith. It's already hard to figure out how different parts work together in the case of a monolith, but in the case of a distributed system, a microservices system, you have a lot of additional complexity. One of the biggest challenges in the context of a microservices setup is the observability of the overall system. That is equally true for developers and for operations folks.

Observability Challenges

Let's have a look at the challenges. Given that we're talking about a distributed system, one thing you have to wonder is how to keep track of the time and location of different signals. You might wonder what the right retention period for a signal is. How long should you keep the logs around? Maybe you're required to keep the logs for a certain period of time for regulatory purposes. You have to consider the return on investment. By that I mean that it is a certain effort, for example, for developers to instrument their application, their microservice. It costs money to keep the signals around, to store them, and to run applications to look at these different signals. You want to make sure that for whatever effort and money you put in, you have a clear outcome and a clearly defined scope of what you get for it. Different signals may be relevant to different roles in different circumstances. For example, a developer troubleshooting or profiling their microservice along a request path might need a different set of tools compared to someone from the infrastructure team looking at a Kubernetes cluster.

Observability, End-to-End

Before we get into the landscape and what is currently going on, especially in CNCF, let's have a look at the observability basics. When I talk about observability, I mean the entirety of all the things you see there. The sources might be apps or microservices, in our case here, or infrastructure sources like VPC Flow Logs, or databases and datastores, which you typically treat as opaque: you don't know what's going on inside, you just get some signals out. Then you have the compute unit, for example containers or functions, and the compute engine that actually runs and executes your code. You have the telemetry bits, which usually include agents, SDKs, and protocols that take, route, and ingest the signals from the sources into some destinations. There are a couple of different types of destinations. There are things like dashboards, where you can look at how metrics are doing over time, for example. You might have alerts. You have long-term storage, for example putting logs into an S3 bucket. Ultimately, this is what you really want: the sources and the telemetry bits are what you have to invest in. You have to instrument your code, deploy agents to collect and forward signals, and ingest them into some destination, because you ultimately want to consume those signals, generate insights, and make decisions based on them. The last thing you want to do in the context of a distributed system is obviously to [inaudible 00:07:45].


A different way to view this, not from this pipeline point of view but from a more conceptually decomposed point of view, is to perform what is called a morphological analysis. This is a problem-solving method developed by the Swiss astrophysicist Fritz Zwicky. The basic idea is that you decompose your solution into small units that are more or less independent. In this case, I came up with six dimensions; of course, you can have more or fewer depending on how you view it. First, analytics, which, as I said, is what you actually want: you want to consume and store the signals. Then the telemetry bit: again, agents and protocols, like OpenMetrics, for example. Then the programming languages, which, as a developer, you're most interested in: does a certain set of telemetry technologies support your programming language? Are they available there? Can you use them in your programming language? Then the infrastructure piece, where you have on the one hand compute-related sources, for example Docker logging drivers, VPC Flow Logs, or S3 bucket logs, but also datastores. Very important: you almost always have some state involved, and very often these are opaque boxes, so you get some signals out but you can't really look inside. Then the compute unit, here highlighted with what we have in AWS: think of EKS, for example, which is managed Kubernetes, or a Lambda function. Compute unit refers to how a certain service or microservice is scheduled and exposed. Finally, the compute engine, the actual runtime environment, for example EC2, Fargate, or Lightsail.

This allows a relatively straightforward way to answer, for a specific workload, for a specific example, what options are available. Let's say, for example, you're running EKS on Fargate. You're interested in logs, so you have the logging driver there from Docker. You're writing your microservice in Java. You might be using Fluent Bit to ship and route the logs, and you're consuming the logs in Elasticsearch. There you have one particular path through these six dimensions, and you can imagine that many combinations are possible.
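As a toy illustration of this morphological view, the following Python snippet enumerates the combinations across the six dimensions. The option lists are illustrative examples drawn from the talk, not an exhaustive catalog of what exists.

```python
from itertools import product

# Six observability dimensions from the talk; the options per dimension
# are illustrative examples, not a complete catalog.
dimensions = {
    "compute engine": ["EC2", "Fargate", "Lightsail"],
    "compute unit": ["EKS", "ECS", "Lambda"],
    "infrastructure": ["Docker logging driver", "VPC Flow Logs", "S3 bucket logs"],
    "language": ["Java", "Go", "Python", "Node.js"],
    "telemetry": ["Fluent Bit", "OpenTelemetry", "Prometheus"],
    "analytics": ["Elasticsearch", "Grafana", "X-Ray"],
}

# Every combination of one option per dimension is one candidate path
# through the morphological box.
paths = list(product(*dimensions.values()))
print(len(paths))  # 3 * 3 * 3 * 4 * 3 * 3 = 972 combinations
```

The EKS-on-Fargate example above (Java, Fluent Bit, Elasticsearch) is exactly one of those paths, which is why taking stock of what you already have is such an effective way to prune the space.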


Let's move on to signals. We have essentially the three pillars. Logs are discrete events that are usually timestamped and can be structured, for example as JSON. Metrics are regularly sampled numerical data values, usually with dimensions and labels that capture their semantics; a very popular destination to view them is Grafana. Traces are the signals that happen along the request path through a number of microservices. Think of a request that comes in at the frontend of our microservices system and propagates through the system, touching different microservices; all these things taken together are a trace. Here I've shown Jaeger, which is a very popular frontend to render traces, where you see at a glance how long a request takes. Then you can drill in and see where exactly along the request path, in which of the microservices, the time is spent. Usually you also get access to the logs and can specifically look at the logs of a particular microservice.
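As a small illustration of logs as discrete, timestamped, structured events, here is a Python sketch that renders each log record as one JSON line. The `checkout` service name is made up, and this is just one of many ways to produce structured logs.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line: one discrete, timestamped event."""
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "checkout",  # illustrative service name
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Emits e.g. {"ts": "...", "level": "INFO", "service": "checkout", "msg": "order placed"}
log.info("order placed")
```

Structured lines like this are what downstream routers and stores (Fluent Bit, Elasticsearch, and the like) can parse and index without fragile regex scraping.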

Observability at CNCF

Let's have a look at what is going on in terms of observability at the Cloud Native Computing Foundation, the CNCF. As you can probably guess from this picture, a lot is going on here. This is just a snapshot; if you have a look at the live CNCF landscape, you get a more up-to-date picture. There are a lot of open source and commercial offerings in this space along the different signals that we already discussed, the metrics, the logs, and the traces, but also chaos engineering. CNCF, rightly so, takes chaos engineering as part of the observability story.

Observability Activities at CNCF

In terms of open source projects and specifications, we have various graduation levels. The three graduated projects are Prometheus, Jaeger, and Fluentd. Then we have Cortex and Thanos, which are incubating, and OpenTelemetry, OpenMetrics, Chaos Mesh, and Litmus, which are sandbox projects. Depending on adoption and the due-diligence outcome, projects evolve and move up in that hierarchy. There is an additional special interest group called SIG Observability, which I'm a member of and contribute to. You can think of it as working across the different projects, doing things like due diligence and white papers, and serving as a general forum for exchanging experiences around observability. I would encourage you to have a look at that and maybe get involved.

CNCF End User Technology Radar

In late September 2020, the CNCF End User Community did a technology radar on observability solutions and came up with a nice blog post and a video that you might want to study; have a look and make up your own mind about these assessments.

Routing Logs with Fluent Bit

Let's move on to concrete examples for the different signals. First, let's have a look at logs. Logs are relatively established; when I look at the discussions I'm having with our customers at AWS, this is something that pretty much everyone is already doing and has been doing for quite a while. In this case, we're looking at Fluent Bit in ECS, where it's called FireLens. In a more or less simple, declarative manner, you can route your logs from your containers, from your tasks, through the FireLens container to the destinations you see on the right-hand side. The basic idea is that, as a user, beyond the instrumentation effort for your own microservices and applications, the only telemetry work you should have to do is configuration; the rest should be provided by the overall system, by the platform you're running on.
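To make the declarative routing idea concrete, here is a small Python sketch of tag-based matching in the spirit of Fluent Bit's `[OUTPUT]` `Match` rules. The tags and destination names are made up, and this is a simplification: it is first-match-wins, whereas Fluent Bit can deliver a record to every output whose `Match` pattern applies.

```python
import fnmatch

# Ordered routing table: (tag pattern, destination). Illustrative only,
# in the spirit of Fluent Bit's declarative [OUTPUT] Match rules.
routes = [
    ("app.checkout.*", "elasticsearch"),
    ("app.*",          "cloudwatch"),
    ("*",              "s3-archive"),
]

def route(tag):
    """Return the destination of the first rule whose pattern matches the tag."""
    for pattern, destination in routes:
        if fnmatch.fnmatch(tag, pattern):
            return destination
    return None

print(route("app.checkout.payments"))  # elasticsearch
print(route("app.frontend"))           # cloudwatch
print(route("kube.system"))            # s3-archive
```

The point is that the application only emits tagged records; where they end up is pure configuration, owned by the platform rather than the code.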

Prometheus and Grafana for Metrics

Let's move on to metrics. Remember, metrics are numeric values emitted at regular time intervals. In this case, the use case is a service mesh that produces metrics. Here we're using Linkerd, but you could also think of App Mesh or Istio, which use Envoy in the data plane. You have a proxy sitting as a sidecar that intercepts all the traffic, and because it intercepts all the traffic, it can also emit metrics about what is going on on the wire. That is scraped by Prometheus and, in this case, ingested into Grafana, where you can build dashboards and ask: what is the success rate? How many 200s at the HTTP level have I seen? Especially in the context of service meshes, this nowadays works almost out of the box, which means you really don't have a lot of work to do. Again, coming back to this point: you should be focusing on the configuration, declaratively telling the system where and how you want to consume your signals.
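As a sketch of the arithmetic behind such a success-rate dashboard panel, here is the computation over illustrative counter values; the status codes and counts are made up, and a real dashboard would express this as a query over rates rather than raw totals.

```python
# Counters as a mesh sidecar might expose them, broken down by HTTP
# status code; the values are illustrative.
http_requests_total = {"200": 1027, "404": 12, "500": 3}

total = sum(http_requests_total.values())
# Treat only 5xx responses as failures for this panel.
successes = sum(v for code, v in http_requests_total.items()
                if not code.startswith("5"))
success_rate = successes / total
print(f"{success_rate:.2%}")
```

Which codes count as failures is itself a design choice; a 404 might be perfectly healthy for one service and a bug for another.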

Distributed Tracing with X-Ray

Moving on to the last example, distributed tracing, in this case with AWS X-Ray, a managed service that we offer. In the setup I'm showing here, the source is a service deployed in EKS, which is a managed Kubernetes service. The telemetry bit is ADOT, the AWS Distro for OpenTelemetry. As you can see, it's pretty straightforward to consume these traces in X-Ray, akin to what we've seen earlier with Jaeger. You can essentially view what is going on and where exactly along a request path the signals are happening. Then you can drill down, but you can also look at percentiles.


OpenTelemetry, which I personally am super excited about, is becoming a standard in the CNCF. It has its roots in the merger of two tracing efforts, OpenTracing and OpenCensus; those two projects merged in the CNCF, creating OpenTelemetry. Given those roots in tracing, it's not very surprising that the tracing part of OTel, OpenTelemetry, is already stabilized, while work on metrics is ongoing. Logs as a supported signal type are also planned and currently experimental. As you can see on the right-hand side, it covers pretty much all the relevant modern programming languages you would hope for and expect, including Java, Python, Go, and .NET. There are a couple of things you want to pay attention to. What is labeled as a backend in the OpenTelemetry context is what I called a destination earlier on. On the telemetry side, you have the collector and, in your application, the library, the SDK for your programming language of choice, allowing you as a developer to instrument your application and emit whatever signals you wish to consume further down the line.
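To show conceptually what such a tracing SDK does under the hood, here is a stdlib-only Python sketch. This is deliberately not the OpenTelemetry API; it just illustrates the core idea that every span along a request path shares one trace ID and records its parent and duration. The service names are made up.

```python
import time
import uuid
from contextlib import contextmanager

# Finished spans would normally be exported to a backend (Jaeger, X-Ray);
# here we just collect them in memory.
finished_spans = []

@contextmanager
def span(name, trace_id, parent_id=None):
    """Record one unit of work: same trace ID across the request, plus parent and duration."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    yield span_id
    finished_spans.append({
        "trace": trace_id, "span": span_id, "parent": parent_id,
        "name": name, "duration_s": time.perf_counter() - start,
    })

trace_id = uuid.uuid4().hex  # one trace per incoming request
with span("frontend", trace_id) as root:
    with span("payment-options", trace_id, parent_id=root):
        pass  # the call to the payment microservice would happen here

print([s["name"] for s in finished_spans])  # ['payment-options', 'frontend']
```

Real SDKs handle the hard part this sketch skips: propagating the trace context across process and network boundaries so the next microservice can attach its spans to the same trace.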


I think we are at a very interesting point in time, in that you do have freedom of choice. You can use the best-in-class tool for a particular use case, meaning that along workloads and signals, you might choose different tools and methods to get the result you want. Remember, return on investment. The idea is that by relying on open standards and open source, you can minimize your exposure in terms of being locked into a certain solution. That's certainly something you want to pay attention to. Last but not least, especially in the context of moving workloads from on-premises into the cloud, with open standards and open source you can build portable observability systems. Meaning, you can deploy, for example, a Prometheus-based solution on-premises, and when you move to the cloud, maybe offload certain things to your cloud provider of choice, but essentially reuse that setup. A lot of it is portable.

Questions and Answers

Betts: I saw that you had a great slide that showed six dimensions of a system and the observability aspects, starting with the compute engine, working up to your datastore, language, and eventually get into telemetry and analytics. You highlighted one possible path up to there, where can people go to get information to help them decide and figure out what's the right path for the solution they're trying to implement?

Hausenblas: What is the right path through these six dimensions that I came up with? Again, this is up for discussion. I strongly believe that there is not a single right path. It very much depends on your requirements. It depends on what you already have. You might be on-premises or in the cloud, already invested in certain systems, or you might already pay a vendor for something. You want to start with what you currently have and see how a given set of solutions might fit in. For example, you might already be using CloudWatch and X-Ray, and you may be thinking of using AMP, our managed Prometheus offering. Then you work backwards from there, seeing how to get signals in there, for example using OpenTelemetry. There are already quite a few good resources out there. We're also investing in that space and definitely producing more. At the end of the day, I would always recommend people first look at what they already have, what is already established in their organization. Certain things, like the programming languages, are usually a given; that's not something you're going to change after the fact. Take stock of what you have and work backwards from where you want to get.

Betts: I think one of the other aspects you talked about was that you need different signals for different roles. Your developers that are doing troubleshooting and profiling, they have different needs than the infrastructure team. How do you provide both, and especially, how do you not get in each other's way and step on each other's toes with too much information or not enough that the other one needs?

Hausenblas: The question is essentially how to make sure that every role that we see in such distributed systems, containerized microservices, or functions, or whatever, gets what they need to do their job. The basic idea is to make sure you understand what the most important types of signals are, but also the most important tasks that someone has to take care of. For example, say I'm part of the platform team, responsible for provisioning and making available a Kubernetes cluster to teams that then grab a namespace or a cluster and deploy their applications there. Then I have questions around certain SLOs and SLAs that come with that. How fast can I provision a cluster? How much disruption does an upgrade cause? What signals can help me answer those questions or support me in meeting my SLOs?
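To make the SLO discussion concrete, here is a back-of-the-envelope error-budget calculation in Python. The 99.9% target and the 30-day window are illustrative assumptions, not figures from the talk.

```python
# Error-budget math for an availability-style SLO, such as the cluster
# provisioning example; target and window are illustrative.
slo_target = 0.999   # 99.9% of operations must succeed
window_days = 30

total_minutes = window_days * 24 * 60
error_budget_minutes = total_minutes * (1 - slo_target)
print(round(error_budget_minutes, 1))  # 43.2 minutes of allowed downtime per 30 days
```

The budget is what connects signals to decisions: if upgrades routinely burn more than their share of those 43.2 minutes, that is a concrete, role-specific question the platform team's metrics need to answer.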

The same is true for developers. With developers, there's one trend that we're now seeing in the context of shift left: more focus on the actual operation of the microservices or functions. Increasingly, especially in the cloud environment, the infrastructure part of the heavy lifting is offloaded, or people intentionally offload it, to the cloud providers, meaning that developers potentially might be on-call for their code. You can go all the way and actually hand out pagers, or nowadays it's probably apps and not the good old pagers, but essentially have people be on-call for their code. We do that in Amazon quite successfully. There are other widely adopted methods and methodologies, like SRE, that companies across the board look at and increasingly also adopt. It really depends on what the goals are, what the SLAs, SLOs, and SLIs are, and then looking at the different signals that can help me, in a specific role, do that.

Betts: Then I think that leads to one of the questions, when you talk about specific goals. How might some of the tracing tools allow us to measure latency for streaming data that's passing through many services?

Hausenblas: There are two parts to it. The one where I'm personally more optimistic and sure about what we are doing, we as in the community, the CNCF and the wider open source community, is the context of your own code. If you ask me, or anyone out there, how we go about instrumenting code, that's pretty much done. You have all these SDKs, all these libraries that you can use to instrument your code. Where it's a little bit harder, or less completely covered, is when it comes to what I called infrastructure data: essentially, everything that is opaque to you, where you just have an API and you don't know what's going on behind it; you just get some logs or whatever out of it. In my experience, that is certainly something where we, as a community of practitioners, need to invest more, so that we can effectively create and support that experience of actual tracing and the like for streaming data as well.

Betts: For an organization that's just getting started with observability and working with a managed service, that you don't have direct access to the systems, what are the first steps that you would recommend when you have that black box scenario?

Hausenblas: I always recommend, and I'm a big fan of, small steps. Ideally, in a greenfield environment if possible, you want quick results and iterations, quick feedback loops, so that you see whether you are on the right track, whether a certain direction actually pays off. You don't want to end up in a situation where your overall observability setup, in terms of footprint, cost, stored signals, and the effort people have to put in, amounts to more than 10% or 20% of the actual functionality of the system. You want to keep an eye on that in terms of budget and license costs; that's always the easiest to assess. How much do I pay vendor X to use a certain dashboard? But do you also consider how much effort it is for your developers to instrument code, and how much work there might be in deploying agents and everything else that is part of the telemetry? You want to keep an eye on that, and by iterating quickly, figure out whether you are on the right track or not. The bottom line is that it's almost always an organizational challenge, not so much a tooling challenge.

Betts: I think that goes to the next question about instrumenting applications with vendor specific code using a bridge to get it into your current vendor, so if you already have monitoring tools or analytics tools. Have you worked with those? Are you seeing success with using a bridging solution?

Hausenblas: Yes, I would say that at the current point in time, we are in a transition phase. I mentioned earlier that OpenTelemetry came out of OpenCensus and OpenTracing, and with that, a number of more vendor-specific SDKs. We are now, as we speak, March 2021, moving towards a situation where OpenTelemetry, of course with the different distributions that are maintained by different vendors, but essentially the OpenTelemetry Collector and the SDKs, enables this portability. It shouldn't be a huge issue. I'm super interested if you have any data points, even if they're anecdotal; please do share them with me. I'm super interested in that topic as well.

Betts: You showed a small part of the CNCF landscape with all those little cards on it. I think, just for observability, and analysis, they had over 100 different options, which can be daunting. I think the full landscape has almost 1000 now at this point, of different tools and vendors. You've mentioned OpenTelemetry obviously, a few times. That sounds like a good first step. What are some other first things people should be looking at and saying, "I want to start using the CNCF guidance to decide what to use in our system?" What else should they be looking at as the first option?

Hausenblas: The no-brainer, to me at least, is Prometheus for metrics. This is low-hanging fruit; if you are not already using it, you definitely should. Related to that, the Prometheus exposition format was the basis for what is now called OpenMetrics, a wire format to transfer and represent these metrics. For logs, definitely Fluent Bit. I'm calling out Fluent Bit, not Fluentd, although both are CNCF projects, because we found that Fluent Bit has a better footprint profile, even though there are fewer plugins available for Fluent Bit compared to Fluentd. Going forward, in terms of future investment, if you're adopting now, I would recommend Fluent Bit over Fluentd. Obviously, with Prometheus comes Grafana. That's another low-hanging fruit, where very often you find Grafana dashboards almost integrated, coming out of the box as part of the solution, as in service mesh land.
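To give a feel for that wire format, here is a hand-rolled Python sketch that renders a counter in the Prometheus text exposition style. In practice you would use a client library such as prometheus_client rather than writing this yourself, and the metric names and values here are made up.

```python
def render_counter(name, help_text, samples):
    """Render a counter in Prometheus text-exposition style.

    samples maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

# e.g. HTTP responses as a mesh sidecar might count them (illustrative values)
samples = {
    (("code", "200"), ("method", "GET")): 1027,
    (("code", "500"), ("method", "GET")): 3,
}
print(render_counter("http_requests_total", "Total HTTP requests.", samples))
```

This plain-text, label-per-sample shape is what a `/metrics` endpoint serves and what Prometheus scrapes; OpenMetrics standardizes and extends it.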

In general, I would recommend being part of those communities, as far as you can. There's a general question of how you assess the healthiness and traction of a certain open source project. There are Slack communities for that, and there are GitHub issues. You should be part of that and make it part of your strategy. It's not that you just grab some piece of code and use it; by using it, you're part of that community. The very least you can do is provide feedback. You also usually get a forum to exchange experiences with your peers.

Betts: You did talk a little bit at some point about portability. I like the idea especially as you're talking about migrating from a monolith to microservices, or just a distributed system architecture and the idea of moving from on-prem to the cloud. If the company has a lot of aging on-prem infrastructure, and they're saying we're going to make this transition to the cloud, is there something that people can do to start adding observability now that might help during that transition process?

Hausenblas: In terms of portability across different environments, I'm particularly focused on setups where something is on-premises, and we've seen in the last couple of months and years that this is accelerating. You want to make sure, again, always coming back to open specifications and, further down the line, open standards and open source, because on-premises you can run it yourself. You can deploy it yourself. You have full visibility there. If necessary, and if you have the engineering muscle, you can fix stuff yourself. Or maybe even, and that's something where many organizations may need to do a little bit more thinking, there is investment in the community.

We very often see open source as "I get something for free." Sure, you can use open source for free, but don't expect to get free support. One very important part of that strategy is considering yourself part of the community. It's not just a vendor that you get something from; you are part of the community, raising issues and, potentially, if you can, even sending in pull requests to advance something or fix a bug. That should be part of the strategy. Again, if you listen carefully, this is not a technological issue. This is an organizational issue: is the organization, as such, ready to do that?

Betts: What are key metrics that AWS uses for observability?

Hausenblas: Key metrics? It's an ambiguous term. I'm not entirely sure if you mean metrics as the signal type. I'm going to interpret it that way, unless I see something in the chat that corrects me.




Recorded at:

Aug 22, 2021