Transcript
Hausenblas: My name is Michael Hausenblas. I work in the AWS Open Source Observability Service Team. I want to talk about the state of OpenTelemetry: where we are, and what is next.
What Is Observability?
Let us have a very quick look at what observability really is. Observability is the capability to continuously generate and discover actionable insights based on signals from the system under observation, with the goal of influencing that system. We have sources; those might be compute, like a Kubernetes cluster or a Lambda function, or a database or datastore. Those sources generate signals. We have agents, and then we have destinations, backends where we store these signals, graph them, and interact with them, filter them, and alert on them. A human might consume those signals, to investigate something or understand something, or a piece of software might, think of autoscaling, for example. What about the agent? That is the piece of software that sits between the sources and the destinations, collects all the signals, and ingests them into the backend destinations.
Signals
We're dealing mostly with four major signal types. Logs are signals that have a textual payload; they capture events and are mostly meant to be consumed by humans. Metrics are numerical signals, aggregates that typically have their semantics encoded in the name and/or via labels, and they carry numerical values. Distributed traces are all about propagating an execution context along a request path. Then we have profiles, which OpenTelemetry does not yet cover, but hopefully will in the future. Those are about resource usage in the context of code execution.
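To make these signal types a little more concrete, here is a minimal Go sketch using the OpenTelemetry API; the instrumentation scope and instrument names are made up for illustration, and since OpenTelemetry logs are not yet GA, a standard logger stands in for the log signal:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
)

// handleOrder emits one example of each signal type around a unit of work.
func handleOrder(ctx context.Context) {
	// Trace: propagate an execution context along the request path.
	ctx, span := otel.Tracer("checkout").Start(ctx, "process-order")
	defer span.End()

	// Metric: a numerical aggregate; semantics live in the name and labels.
	counter, err := otel.Meter("checkout").Int64Counter("orders.processed")
	if err == nil {
		counter.Add(ctx, 1)
	}

	// Log: a textual payload capturing an event, mostly meant for humans.
	// (OpenTelemetry logs are not yet GA, so a standard logger stands in.)
	log.Println("order processed")
}

func main() {
	handleOrder(context.Background())
}
```

Without a configured SDK, the global tracer and meter are no-ops, so this compiles and runs as-is; wiring up an exporter is what makes the signals leave the process.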
The Problem and Solution
What is the problem we're trying to solve here? The first bit is really all about the journey from the sources to the destinations. Currently, we widely use a number of different agents to collect the signals from the sources and ingest them into backends. The solution going forward is to replace all these various vendor-specific agents, proprietary protocols, and formats with one agent that rules them all, and that is OpenTelemetry. Not just the agent, but also the instrumentation.
OpenTelemetry Concept
Let's have a closer look at what OpenTelemetry is on a conceptual level. Formally, OpenTelemetry, or OTel, is a Cloud Native Computing Foundation (CNCF) project. You might know the CNCF from big hits like Kubernetes, and also Prometheus, and many others. What does OpenTelemetry really do? It provides a set of specifications, a protocol (OTLP), an agent that we call the collector, and libraries, the SDKs. Again, think of it as sources, agent, destinations, with OpenTelemetry sitting in the middle. OpenTelemetry aims to support all major signal types. Currently, we're focusing on traces, metrics, and logs, across 11 programming languages, from Java and Python to things like Erlang and Elixir. The big advantage of OpenTelemetry, besides being an open standard with all the vendors, ISVs, and cloud providers behind it, is that it turns the telemetry challenge, instrumenting your code, collecting the different signal types, and ingesting them, into table stakes. On top of that, you get correlation of the different signal types, so you can more easily jump between these different signals.
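As an illustration of the "instrument once, ingest anywhere" idea, here is a minimal Go sketch, not a definitive setup, that wires the SDK to an OTLP trace exporter; it assumes a collector (or other OTLP endpoint) is listening on the default localhost:4317 and uses a plaintext connection for a local setup:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP trace exporter over gRPC; targets localhost:4317 by default.
	// WithInsecure uses a plaintext connection, fine for a local collector.
	exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	// Batch spans before export and register the provider globally, so any
	// code instrumented via the OpenTelemetry API picks it up.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// From here on, spans created with otel.Tracer(...).Start(...) are
	// exported over OTLP; the collector decides which backend they land in.
	_, span := otel.Tracer("demo").Start(ctx, "hello")
	span.End()
}
```

The point is that the application only ever talks OTLP; swapping the backend destination is a collector configuration change, not a code change.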
OpenTelemetry Collector
If we zoom in, in the middle, into this collector, what does that look like? Conceptually, we're talking about so-called pipelines. Pipelines are per signal type: a pipeline for logs, a pipeline for metrics, a pipeline for traces, and, in the future, potentially a pipeline for profiles. Each pipeline, again conceptually, has three different types of components that you can use, a bit like Lego bricks. You have receivers, which are inbound or ingress; that is where signals from the sources, downstream, come into the collector. For example, you might have an OTLP receiver, so a native OpenTelemetry receiver. Then there are processors; in the middle of the pipeline you want to do something with, for example, logs. You might want to drop certain logs, or redact them because there's PII, Personally Identifiable Information, in there. Or you want to batch them, so rather than sending one signal after another, you batch them up, for 10 seconds, or for a certain number of metrics or traces, for example. Then there are the exporters, which allow you to ingest those signals into the backend destinations, for example, Jaeger or Prometheus. You can have many pipelines, including many pipelines that cover the same signal type, and treat them independently. You could have one log pipeline for a specific environment like development that lands the logs in a certain backend, and another one for production. You see, this OpenTelemetry Collector is a very substantial part of the OpenTelemetry project and the overall value prop. What are the three main component types in a pipeline? Receivers, processors, and exporters. The pipeline wires up these three component types and lets you build different routing and filtering pipelines as you see fit, as the configuration sketch below shows.
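Here is a minimal, illustrative collector configuration that wires an OTLP receiver, a batch processor, and two exporters into a traces pipeline and a metrics pipeline; the endpoints are placeholders, and the exact component names and options depend on your collector distribution and version:

```yaml
receivers:
  otlp:                     # ingress: native OpenTelemetry protocol
    protocols:
      grpc:
      http:

processors:
  batch:                    # buffer signals and send them in batches
    timeout: 10s

exporters:
  jaeger:                   # traces to a Jaeger backend (placeholder endpoint)
    endpoint: jaeger:14250
    tls:
      insecure: true
  prometheus:               # expose metrics for a Prometheus server to scrape
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```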
Distros
There are three fundamental approaches to how you can use the agent, the collector, and different vendors and cloud providers indeed take different approaches. I used the official documentation, opentelemetry.io/vendors. For each of the providers, I dug into the descriptions and tried to figure out which signal types they currently support and in what state, like GA, preview, or beta. How do they deal with the collector? Do they maintain their own collector, or do they use the upstream collector provided by the project? What about the SDKs? Is there a provider-specific SDK, or, again, upstream? And, relatedly, do the providers, across the board, have a managed OTLP endpoint, that is, do they natively allow you to ingest OpenTelemetry data?
OpenTelemetry Adoption
That's a basic overview of OpenTelemetry. Let's see where we are in terms of adoption. I will present data from two different surveys. On the one hand, the first two slides draw on the OpenTelemetry community quarterly survey. The results are not very surprising, given where we are with adoption: traces went GA in 2021, metrics are going GA as we speak, a number of those things are stable, and we'll get back to that in the roadmap. Logs will be going GA in 2023. So it's not too surprising that currently half of the people who responded to that survey said they're using it for tracing, which makes a lot of sense, and roughly a third for metrics. Looking into the future, the picture changes slightly; again, as is to be expected, logs will take a bigger part, and metrics as well, slightly.
Continuing with this survey, it also asked what components, in the widest sense, people are using, both the collector and across the programming languages. There you see the collector, along with Go, Java, Python, and JavaScript, leading the pack. Go doesn't surprise me too much, because the whole cloud native ecosystem, from Kubernetes to Prometheus to the OpenTelemetry Collector, is written in Go, so there is a certain affinity there, for early adopters at least.
Moving on to a second survey, which I ran myself, essentially asking people to provide their feedback. The first two questions really just set the scene. What agents are you currently using? I was a little surprised to see already quite a good share, two-thirds, saying that they are using the OpenTelemetry Collector. It might be a selection bias: folks who are already using OpenTelemetry are more open to responding to such a survey. Then, for the backend destinations, where do you send signals to? Prometheus is clearly leading the pack there, followed by others, and across the board, CloudWatch and Elasticsearch. The most interesting part, and I really want to point your attention to this, is: what are the biggest pain points of your current agent setup? Interestingly enough, lack of correlation is, with half of the respondents, indeed the number one, which is a perfect fit for OpenTelemetry, to be very honest. It is followed by too many agents. Obviously, that's the value prop of OpenTelemetry: rather than having multiple agents running, you want to consolidate to one agent.
Moving on to the second part, I asked about adopting OpenTelemetry. What's the motivation, what drives you to adopt OpenTelemetry? That it is an industry standard and that it reduces vendor lock-in are pretty much the two main reasons why folks are adopting OpenTelemetry. Asking further, you see that 71 out of 91 people answered this question: if you're already using OpenTelemetry, what setup are you using? Indeed, that also reflects the distro overview I presented earlier: a good share are using the upstream distribution of the collector, which is in line with what you would expect, because the majority of distributions indeed use upstream. There are certain challenges when you're using upstream or rolling your own. It's not bad, don't get me wrong, but it means that you're responsible, you're on the hook. You need to apply security patches. You need to make sure the resource usage is in check. You're responsible for everything that is going on in the collector.
One last bit of information here, which I also found very interesting: assuming someone is already into OpenTelemetry, what are the reasons that slow you down? What are the road blockers? What are the paper cuts? Again, very much as expected, almost half of the people are saying that what they need, for example logs, is not yet fully available. That is not a big surprise and is probably to be expected, given where we are in mid-2022. Other insights there that we as a community need to work on: a lack of documentation or tutorials, and the software not being stable enough. That also includes the SDKs.
Roadmap
Now that you have a somewhat better understanding of the adoption, where and how and why folks are using OpenTelemetry, let's have a look at the roadmap. Where are we? Where are we going? Distributed traces are already GA, as of the end of 2021. Everything is stable there; you can use it in production. For metrics, this year, in May to be precise, most of these things became stable. We're still in the process of the various SDKs implementing metrics and making their way to GA. Release candidates exist, and you can use metrics in production. Logs, on the other hand, are still under active development. While we are stable on the protocol level, there are a number of things that still need to be figured out. That's where we need your feedback; we need to understand what exactly the usage is. What are the expectations? How do you want to use logs? Clearly, as you can see from the data, people want logs. People are, to a certain extent, also waiting for logs to become GA so that they can finally start to consolidate and adopt everything.
Summary
OpenTelemetry is the vendor-neutral telemetry standard. It's an open standard for all signal types. It enables you to instrument once and ingest anywhere, making telemetry effectively table stakes. Vendors at large have agreed that they do not want to compete on the telemetry bits, the agents, the performance there, and so on, but on the backends, allowing you to consume the different signals, correlate them, and so on. OpenTelemetry has broad industry adoption. All major ISVs in this space and all major cloud providers are behind it and have respective teams; I'm an example myself, a product manager for our distribution of OpenTelemetry at AWS. In terms of investment, if you ask yourself whether you should be investing in OpenTelemetry, this is a big plus. This is something where you have safety and security for the future.
In 2021, traces went GA. This year, metrics go GA. In 2023, logs will go GA, which means that if you're considering adopting OpenTelemetry, now is the time. There's super interesting stuff going on in the community. Earlier this year, we had an initiative to bring profiles in as a signal type, think of continuous profiling, things like Pixie, Parca, and Pyroscope, bringing that into OpenTelemetry. There's a working group around that, and you can participate if you want. Then there's real user monitoring. There are collector improvements. There are so many things going on. By and large, the current focus is really on logs. Once logs are out the door, the community will probably move on and focus on the other things that I mentioned here. I'm currently writing a book with Manning called "Cloud Observability in Action," where I cover these topics as well.