
Cloud Native and Kubernetes Observability: Expert Panel

Key Takeaways

  • Observability is more about a cultural shift within organizations than merely about adopting the technologies, tools, and platforms that the industry is converging upon.
  • Observability needs to answer the tougher questions and enable “forensics,” or Root Cause Analysis, on an unknown issue using the signals that are emitted. Although metrics and instrumentation are integral parts of observability, by themselves they are aimed at answering basic questions like “is my system healthy?”
  • Learning the different open source tools and platforms might seem daunting, when synthetic probes and a few features might suffice for a start. However, the advice is to resist the temptation of creating and using homegrown tools.
  • Distributed systems and microservices built around platforms like Kubernetes are coalescing around Prometheus and OpenTelemetry as standards for data interchange.
  • Although there are a number of efforts underway in observability, a lot of work still remains. The call to action from the panelists is for everyone interested in the topic to get involved in the different communities.

Kubernetes, microservices, and distributed systems are fundamentally changing the conventional landscape. Traditional site reliability techniques like metrics, instrumentation, and alerting need to be supplemented with other signals, such as tracing, which is integral to observability. Further, these signals need to be correlated to enable troubleshooting, both proactively and reactively, support more in-depth Root Cause Analysis, cater to multiple personas, and so on.

InfoQ recently caught up with observability experts to discuss several topics including fundamental questions about what observability really entails, the misconceptions and challenges that the users are facing, and the open standards that are influencing the industry in general.

InfoQ: Can you please introduce yourself briefly and talk about a favorite practical encounter or a nightmare scenario that made you think of observability (especially cloud-native observability) differently or you wish had better observability in your software system?

Suereth: I’m Josh Suereth, one of the committers of OpenTelemetry, an author, speaker, and technical lead at Google Cloud. The first time I really felt the need for “observability” and not monitoring was during a demo to a VP for a new product feature. Even though we had tested our feature thoroughly, we saw a huge response time lag when demoing and I was unable to (quickly) figure out why and give a reasonable answer. A few months later (and a whole lot more instrumentation), I could definitively pinpoint the problem and outline the class of users that would suffer similar response times if we hadn’t fixed some underlying issues. At the time, I remember thinking “our monitoring should have caught this.” However, while we had extensive alerts and SLOs, we were still lacking observability: and that limited our ability to answer new questions.

Branczyk: I’m Frederic Branczyk, and I am the CEO and founder of Polar Signals. Previously I was the architect for all things observability at Red Hat, which I joined through the CoreOS acquisition. I’ve worked on Prometheus and its connecting ecosystem to Kubernetes for the past 5 years. I’m a Prometheus maintainer, a Thanos maintainer, and one of the tech leads for the special interest group for instrumentation within Kubernetes. A phenomenon I find to be prevalent among users of observability tools is that most observability tools and products are mere databases of seemingly endless amounts of data. As such, users have a very hard time cutting through the noise and don’t actually get the understanding of their running systems that they desire. Instead, they are overwhelmed by the amount of disconnected information.

Fong-Jones: I’m Liz Fong-Jones, and I’m a long-time Site Reliability Engineer who currently works as the Principal Developer Advocate at Honeycomb. I’m a contributor to OpenTelemetry’s Go SDK and serve on the OpenTelemetry governance committee. Struggling with a wall of 20 dashboards each with 20 graphs on them as a Bigtable SRE at Google caused me to resolve to do better by the next generation of engineers building cloud-native systems. Searching for answers brought me to evangelize high-cardinality exemplars as a fusion of monitoring and tracing at Google, and later to Honeycomb. At Honeycomb, we build tooling that helps developers gain observability into their systems using distributed tracing. To me, service level objectives and observability techniques go together like peanut butter and jelly: you need one to know when things are too broken, and the other to be able to quickly diagnose and mitigate.
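Fong-Jones’s pairing of SLOs with observability can be made concrete with a small error-budget calculation. The helper below is an illustrative sketch only; the function name and numbers are hypothetical, not Honeycomb’s or Google’s implementation:

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget left for a request-based SLO.

    The error budget is the allowed number of failures over the window,
    (1 - SLO target) * total requests; we compare observed failures to it.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures;
# 250 observed failures consume a quarter of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

Alerting on how quickly the budget is being consumed, rather than on raw failure counts, is what tells you when things are “too broken”; observability tooling then answers why.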

Plotka: Hi, my name is Bartłomiej Płotka (you can call me “Bartek”), and I am Principal Software Engineer @ Red Hat. I perform a mix of Developer, Architect, and SRE roles in the Red Hat Monitoring Team. I co-created Thanos, maintain Prometheus and other open-source projects. I am helping others in the CNCF as the SIG observability Tech Lead. I am also writing a book about programming with O’Reilly. My observability nightmare is the cost of deploying, running, or using multiple observability platforms. Usually, you need three or four inconsistent systems that might be hard to integrate or duplicate the collected data. If not done carefully, it often surpasses the cost of running workloads we want to monitor or observe. 

InfoQ: Is Cloud/Kubernetes/distributed systems observability different from observability in general? What are the three top observability trends that developers/architects should pay attention to? Also, what are the three top observability challenges that enterprises will face, and how can they address them?

Suereth: I see the trends relating to challenges, and I think the first trend we saw was companies looking to deal with scalability issues in their monitoring tools. Companies are looking for more and more metrics, across more and more systems with more and more retention. With this, we also see a rise in managed offerings, where a company can outsource the scaling issues to someone else. I think companies need a clear strategy for how they plan to scale their own observability solutions: Do they seek help from vendors or do they grow expertise in house? Having a plan can help make critical decisions quickly when scaling pains start to show up.

The second challenge/trend I think most enterprises are facing is a divergence of tooling and standards. While a given tool may be the best APM solution on the market, it might not scale well or even support metrics and dashboards. Companies are forced to make a myriad of choices that impact many personas at the company. It’s possible the tool chosen by the developers isn’t the one ops would prefer, and you’re forced to make a choice. These choices might be made in isolation, e.g., one group may be asking for an APM tool, while another team is setting up dashboards and monitoring. Now, teams are asking questions like “How do I investigate performance issues I see in the metrics dashboard in my APM tool?” Many vendors are starting to offer “observability suites” instead of just dashboarding or APM.

Similarly, the agent instrumentation landscape is diverse, robust, and mind-numbingly deep. If I want insight into my nginx server, for example, I’m given no fewer than seven options, some open source, some managed. Picking the right instrumentation for all the components and services your company should use can be a daunting task. Even for a given vendor, you may need to choose between several options (e.g., Prometheus-based or vendor-specific). Here, we see many observability vendors joining together in the OpenTelemetry project to try to standardize and simplify the decision going forward.

Branczyk: Other than scale and the number of disjoint systems, I see Cloud/Kubernetes observability as no different from anything else. Cloud and Kubernetes, for better or worse, have led many organizations to rethink and in many cases re-implement their software architecture. I don’t see this as a direct result of Cloud/Kubernetes though; I think there were multiple movements happening at the same time, which led to the sudden and often dramatic re-platforming.

I think what the industry has struggled with, and is starting to embrace, is that observability isn’t a checklist. You can have the “three pillars of observability” (metrics, logs, and tracing systems) and yet still be clueless about what is actually happening in your systems. Observability needs to be embraced culturally as much as technologically. We have many great open-source projects that can store and query these types of data, like Prometheus, Grafana Loki, and Jaeger, but it is just as important, if not more so, that we take care of the quality of the data we put into these systems and make use of it in thoughtful ways.

What I think we are finally arriving at is that observability is no longer defined as a specific type of data, but rather as a concept that describes how useful data is for understanding a system. Any data that allows us to understand our running systems is observability data, and our ability to connect different sources of data (often referred to as correlation) is key to how quickly we can build that understanding. In this trend we are starting to look at other sources of data that are useful for understanding a running system, such as continuous profiling.

Fong-Jones: Just to start, we need to define what observability means. To me, observability is the ability to understand the behavior of your systems, without attaching a debugger or adding new instrumentation. The concept of observability is really agnostic to where you’re running your workload, but the added complexity of multi-tenancy, cloud-native workloads, and containerization leads to a rising need for observability. Single-tenant monoliths can be easier to make observable because all the functionality is right there, but as you add more services and users there’s a chance that a bug will only manifest for one particular combination of services, versions of those services, and user traffic patterns.

The most important thing is to be aware of when you’re about to outgrow your previous solutions, and to be proactive about adding the right instrumentation and analysis frameworks to achieve observability before it’s too late. When you stop being able to understand the blast radius each change will have, and when you stop being able to answer the questions you have about your system because the underlying data has been aggregated away…that’s the point at which it’s too late. So be proactive and invest early in observability to both improve developer productivity and decrease downtime. It doesn’t necessarily matter what subset of logs, traces, and/or metrics you’re collecting; instead, what matters is whether you are able to answer your questions with the signals you have.

Plotka: Distributed workloads in clouds (like Kubernetes) undoubtedly pose new challenges to observability and monitoring. First of all, by design, pieces of often complex software are now highly distributed: scattered across machines, racks, or geographies, often replicated or sharded. Furthermore, the industry did not stop at microservices. It started to federate and replicate whole Kubernetes clusters, grow and shrink them dynamically, and keep them short-lived and reusable. This ultra-distributed architecture requires observability to be decoupled not only from applications but from clusters too. Such shifts often surprise enterprises that are not prepared for this dynamicity. That’s why CNCF projects evolve in this direction: Prometheus in Agent mode, OpenTelemetry with OpenMetrics, or Thanos replicating additional observability data like scrape targets, alerts, and exemplars between clusters. Those are only some examples of the work we do to make distributed observability easier for open-source users.

InfoQ: Observability does affect multiple personas. Why this sudden increased interest in observability, however? What are some misconceptions or pitfalls that practitioners should avoid?

Suereth: To me, observability is a sign that we’re starting to ask the harder questions of our system. “Is my system healthy?” is a fine first question, but it leads to harder-to-answer questions: why does it take so long for this button click? Can we curb the growth rate of our database? Which users are most affected by latency? These are questions we must answer, and metrics alone aren’t always enough.

I think there are two major pitfalls in observability (and I’ve fallen into both): Waiting too long to add observability and, conversely, expecting too much from observability.

Observability is a principle of a system. It’s your organization’s ability to answer questions about behavior that allow your business to improve, adapt and grow. If you wait until you have a critical question to answer, then it’s (likely) too late to add observability. In my example on a slow demo for the VP, it took a few months to implement and wait for data collection before we could start answering the questions we needed to.

The second major pitfall is expecting too much from observability alone. While data and visibility can help answer a lot of questions, they are not a panacea. Observability won’t change how your data storage scales or prevent outages by itself. Observability helps the organization answer why outages are occurring, but your organization still needs to solve the underlying problem (with better visibility). Observability should always be tied to questions that lead to actions, and these actions need to be prioritized.

Branczyk: I can only explain the sudden increase by open-source solutions getting significant community traction, and that traction being carried back into companies. There have been studies from major companies for years showing that low latency and high availability increase customer conversion, but the tools and workflows used to be rather inaccessible to most companies, as you needed to build them in-house. So in a way, I believe the CNCF democratizing these tools has led companies to believe they can finally capture these opportunities as well, since the tools are now available to them.

Following my earlier thoughts about checklists, though, a common misconception is that organizations install one or more of the popular open-source stacks and think they are done and have observability now. Observability needs constant thought and care; it’s not a one-off. Much like testing, we need to embrace it as part of our software engineering practices and culture. Along the same lines, it can’t be carried by a single team within an organization; everyone must participate. Certain teams can take care of certain parts, but everyone must pull on the same string.

Fong-Jones: The importance of observability has risen as services become more complex. But there’s also an attempt to co-opt the term as “catchier monitoring” rather than understanding it as a different capability requiring an entirely new approach. The pitfall to watch out for, therefore, is advertisements of “automatic observability” or the “three pillars of observability,” popularized by those trying to adapt traditional APM or metrics-and-logging approaches into a claim that they do observability too. The “three pillars” definition is a trap intended to encourage consumers to spend more money storing redundant data. You don’t need to buy or build a solution for all “three pillars” to achieve observability, and while I might disagree with Bartek on whether the most important signal to start collecting first is traces or metrics, we both agree that you shouldn’t feel a need to collect all the signals just because a vendor told you so or an excited engineer wanted to build a system. What’s important is collecting just enough data to be able to answer unknowns about your systems, regardless of what shape that data takes.

Plotka: Observability was always important, but indeed we have seen considerable growth in demand for such data. There are many reasons. The complexity of distributed systems, which are now the default way of solving problems, causes more unknowns. Similarly, enormous growth in public and hybrid cloud usage increased unknowns and disabled older, non-cloud-native, or more manual monitoring and debugging tools. Another aspect is the shift in the developer role, encouraging developers to run the software they create. Older patterns had a separation between development and operations. Now, all those are mixed (for good reason) thanks to ideas like DevOps and Site Reliability Engineering. This forced the cloud industry to take multi-tenant observability and monitoring more seriously. The primary pitfall? Observability practitioners often aim to collect every possible piece of information: high-cardinality metrics, all the log lines, all the traces for all the requests. Because who knows when they will be useful? Such an approach ends up being an enormous overspend of money and effort. Suppose each request to your HTTP service generates from 3 to 100 tracing spans. If you collect them all, your clusters will spend more time processing observability data than doing actual work. Be reasonable: start with metrics, then introduce on-demand logging, tracing, or profiling.
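Plotka’s span-volume arithmetic is why sampling matters: at 3 to 100 spans per request, collecting everything multiplies your traffic many times over. Below is a minimal sketch of the kind of consistent head-based sampling many tracing systems use, where the keep/drop decision is derived from the trace id so every service in the request path agrees on it. This is illustrative Python, not Prometheus or OpenTelemetry code:

```python
import hashlib

def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministic head sampling: hash the trace id into [0, 1) and
    compare against the configured rate, so every service that sees the
    same trace id makes the same keep/drop decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Interpret the first 8 bytes as an unsigned integer in [0, 2^64).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Keeping ~10% of traces cuts span volume by roughly 10x.
kept = sum(should_sample(f"trace-{i}", 0.1) for i in range(10_000))
```

Because the decision is a pure function of the trace id, no coordination between services is needed; tail-based sampling (deciding after the trace completes) trades that simplicity for the ability to always keep slow or failed traces.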

InfoQ: OpenTelemetry is emerging as a standard for all three major signals. Will this be a silver bullet and replace proprietary protocols/agents? For those using other tools and platforms today can you suggest a migration path (if needed) for the future?

Suereth: I have a personal mantra that “nothing is a panacea.” That said, I think OpenTelemetry has all the right people at the table to replace proprietary protocols and agents with an open standard the industry can rely on. We’ve been investing in OpenTelemetry from its inception and still think it will live up to its promise.

For those using other tools, I think OpenTelemetry provides a lot of compatibility tooling and “incremental steps” companies can take advantage of:

  • The OpenTelemetry Collector can be used as an agent/service to adapt alternative telemetry signals into its own format. It supports protocols like StatsD, collectd, Prometheus/OpenMetrics, Jaeger, Zipkin, and OpenCensus.
    • Companies don’t need to re-instrument their applications to start in the OpenTelemetry ecosystem. They can start using the collector to "lift and shift."
  • The OpenTelemetry "auto instrumentation" components provide a means to get telemetry signals without writing (much) code. For example, the Java instrumentation project provides an agent that works against most major frameworks and application servers and produces traces and metrics. This can be adopted as needed or not at all.
  • OpenTelemetry APIs provide the best mechanism for correlated telemetry creation and give companies the ability to provide business-specific observability into their systems.
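As a concrete illustration of the "lift and shift" path described above, a minimal Collector configuration might scrape an existing Prometheus endpoint and accept Zipkin spans while exporting both as OTLP. The endpoints below are placeholders, and exact receiver/exporter options should be checked against the Collector documentation for your version:

```yaml
receivers:
  prometheus:                      # scrape existing Prometheus targets
    config:
      scrape_configs:
        - job_name: legacy-app
          static_configs:
            - targets: ["localhost:9090"]   # placeholder target
  zipkin:                          # accept spans from Zipkin-instrumented apps
    endpoint: 0.0.0.0:9411

exporters:
  otlp:                            # forward everything as OTLP
    endpoint: backend.example.com:4317      # placeholder backend

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
    traces:
      receivers: [zipkin]
      exporters: [otlp]
```

Nothing in the applications changes; the Collector does the protocol translation, which is exactly the no-re-instrumentation starting point described above.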

Branczyk: I have my doubts about a single library that integrates everything everywhere being beneficial in the long run; having worked on the Kubernetes Heapster project, I’ve seen one of these integrate-everywhere projects fail gloriously firsthand. While there is a lot of momentum and vendors are eager to add integrations to their platforms, as soon as momentum is lost, things go unmaintained quickly and projects can struggle easily.

That said, I understand the appeal of OpenTelemetry for migration paths and disparate tooling integrations in companies. If you look at any sufficiently large company, it’s very possible that a handful or more observability products are used in production today. Migrating or even just supporting all of those vendors in a single stack is rather appealing.

I am personally most excited about the wire formats that are being standardized, such as OTLP for tracing and OpenMetrics for metrics. These build a much better foundation for interoperability: in my opinion, people can build libraries and agents in whatever language, on whatever device they want, and plug into existing solutions and ecosystems without necessarily relying on a particular vendor’s implementation of a library. Today, we already see companies like Slack extending or changing fundamental aspects of existing tracing libraries, simply because one-size-fits-all solutions tend not to exist. Therefore, I think it is good to focus a lot of our time on the wire protocols, as that’s where actual interoperability comes from. The library implementations, at the end of the day, are very important as well, but in my view secondary to the higher goal of uniting the industry.

Fong-Jones: The OpenTelemetry Collector is a Swiss Army Knife that can adapt almost any telemetry protocol into an OpenTelemetry Protocol (OTLP) or into a vendor’s export format. In that respect, a lot of the problem of translation and migration in place is solved. However, context propagation to ensure trace ids and span ids are consistent across systems is a problem that the OpenTelemetry Collector can’t solve by itself. OpenTelemetry’s language SDKs support the B3 and W3C standard propagation mechanisms, but if a vendor SDK doesn’t support either B3 or W3C, then the path to future interoperability becomes more challenging.
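The W3C propagation mechanism mentioned here rests on a simple header format: `traceparent` carries a version, a 16-byte trace id, an 8-byte parent span id, and flags, all hex-encoded. A rough sketch of parsing it and forwarding the same trace id with a new span id (illustrative code, not the OpenTelemetry SDK):

```python
import re
from typing import Optional

# W3C Trace Context: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> Optional[dict]:
    """Split a `traceparent` header into its fields, or None if malformed."""
    match = TRACEPARENT_RE.match(header)
    return match.groupdict() if match else None

def propagate(incoming: dict, new_span_id: str) -> str:
    """Outgoing header: same trace id, new parent span id, flags preserved."""
    return (f"{incoming['version']}-{incoming['trace_id']}-"
            f"{new_span_id}-{incoming['flags']}")

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = propagate(ctx, "b7ad6b7169203331")
```

Because the trace id survives every hop, any backend that respects this header can stitch the spans back into one trace, which is exactly the cross-system consistency a Collector alone cannot provide.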

We’ve seen end-users seamlessly migrate between providers or perform proof of concepts with multiple providers at once with OpenTelemetry, so the benefits of OpenTelemetry in improving consumer choice and flexibility are already coming to fruition.

Plotka: We already see fantastic progress in this direction—agreeing on the open-source protocols for the main observability signals. It all depends on the usability, performance, and stability factors. OpenTelemetry designed a solid tracing spec, but logging and metrics are still in the making. Among other things, the OpenTelemetry community works hard to bring the benefits of the heavily adopted OpenMetrics (an evolution of the Prometheus metrics format) to the core agents. The challenge with such an ambitious project is to stay focused. The number of features, integrations, and use cases is enormous. Time will tell if OpenTelemetry as the generic solution will win over specialized solutions in this area. Join the OpenTelemetry community if you want to learn more or help.

InfoQ: Can you clarify the role of Application Performance Monitoring (APM) and observability, especially since many of these tools are homegrown? Is it integral to observability and will the emerging industry standards in the observability area have an impact on these tools?

Suereth: The example I mentioned earlier, where we had a very slow demo for our VP, showed me the critical nature of APM to observability. Effectively, it gives you three critical components that are game-changing for root cause analysis and diagnosing almost anything in your distributed system. They are:

  • Context Propagation. A system may behave differently based on a myriad of components. If you don’t propagate request context through the whole system you have "blind spots" in your understanding of where a slowdown occurs. This can lead to dramatic prioritization changes from "optimize this data query" to "we need a full-on cache."
  • Distributed Tracing. APM (and distributed traces) can help you home in on problematic scenarios that lie outside your statistical norm for behavior but have a huge negative impact on users. E.g., we found in one server that a system restart could lead to a huge latency penalty, and we were able to tone down this effect by "self-warming" the server before making it available for traffic.
  • Correlated logs. The ability to tie logs to a trace and see this in context is a huge boon, and something being made more practical and likely with technologies such as OpenTelemetry. Finding a bottleneck and immediately being able to ask "why" with visibility into logs is game-changing.
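The correlated-logs idea above can be sketched in a few lines: each structured log line carries the ids of the active span, so a log backend can join every line to the trace it was produced under. The field names below are illustrative (OpenTelemetry defines its own log data model):

```python
import json

def correlated_log(message: str, level: str,
                   trace_id: str, span_id: str) -> str:
    """Emit a structured (JSON) log line that embeds the active trace and
    span ids, so the line can later be joined to the distributed trace."""
    return json.dumps({
        "level": level,
        "message": message,
        "trace_id": trace_id,   # same id that appears in the trace backend
        "span_id": span_id,     # pinpoints which operation logged this
    })

line = correlated_log("cache miss", "warn",
                      "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
```

With the ids in place, "show me the logs for this slow span" becomes a simple equality query rather than a timestamp-guessing exercise.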

Branczyk: I fundamentally see no difference between APM and observability. They tend to describe the same thing—data that helps us understand our running systems, whether that is data collected from infrastructure components or from applications running on top of it. That said, it’s tempting to start out thinking one only needs a tiny fraction of the features available in vendor or open-source products, and having to learn these tools seems daunting when all people want at the start is synthetic probes. I highly discourage creating entirely homegrown systems though. Time and time again over the last decade I have encountered companies that did this and ultimately suffered. Building these tools tends to be far enough out of scope for companies that they eventually get neglected, and companies then need to undertake high-risk migrations for some of their most critical pieces of infrastructure.

Fong-Jones: There’s always the temptation to home grow a tool, but I’d suggest avoiding it. There are competent open source self-hosted, managed open source, and pure vendor solutions in the marketplace that can address your needs at any cost point. APM solutions can fulfill much of the need for observability in a monolithic environment, but as your environment grows in complexity and number of services, there may be better answers out there that look more like distributed tracing. Regardless of what backend you choose, the automatic instrumentation of OpenTelemetry and its API for instrumenting additional metadata is your best bet for vendor-neutrality and interoperability as you make the transition from APM to observability at scale.

InfoQ: Briefly, anything else you would like to add? Something the community needs to pay more attention to, maybe a call for action, etc.? Any session you would be closely watching at Kubecon EU virtual?

Branczyk: It’s a great time to get involved in observability; there is still so much room for exploration. No type of data (at least that we know of today) is exhaustive in allowing us to understand every nuance of our running systems. I encourage everyone to look at workflows they may still perform manually for troubleshooting purposes today—it may just be the next big thing in observability. CNCF SIG Observability is always excited about new ideas in the space, so drop by and participate.

Fong-Jones: We’re always looking to hear from users who are adopting OpenTelemetry, so let us know what you think. You can add yourself to the ADOPTERS file in the community repo or stop by the #opentelemetry channel on CNCF Slack. Also, a quick plug for two upcoming community events I’m involved in as a program committee member: come by to learn more from the observability community.

Plotka: It’s worth adding that you are always welcome to join our CNCF SIG Observability community. We are working on various items, docs, and initiatives around cloud-native observability. Say hello on the chat or join our meetings. I would also recommend joining Project Office Hours during the virtual KubeCon 2021 (“Meet maintainers” sessions). Feel free to be active, ask questions, say hello, propose features, ask for context, anything. Thanks.

Distributed systems and microservices add to the existing challenges of traditional observability, which has often centered on monitoring, alerting, and so on. Observability involves a cultural change, since it affects multiple personas in a typical digital enterprise, large or small. It needs to answer more insightful questions than “is my system healthy?” and often involves multiple signals, tools, and platforms, and the ability to correlate them.

The KubeCon + CloudNativeCon 2021 schedule includes an observability track with several interesting sessions, covering both in-depth and getting-started topics. OpenTelemetry is an evolving platform worked on by multiple vendors and users, with numerous opportunities to get involved in the different communities and the work underway.

About the Panelists

Bartłomiej Plotka is a Principal Software Engineer @RedHat | Co-author of @ThanosMetrics | @PrometheusIO maintainer | @CloudNativeFdn SIG Observability Tech Lead | LinkedIn.


Liz Fong-Jones is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights. LinkedIn, Twitter: @lizthegrey

Josh Suereth is a staff software engineer at Google. He has worked as an application developer, database administrator, and software architect on a variety of applications, and is particularly interested in software design and multi-tiered/distributed architecture. Always keen to try out cutting-edge solutions, he is a Scala maintainer, an OpenTelemetry committer, and active in the open-source community. LinkedIn, Twitter: @jsuereth

Frederic Branczyk is CEO/Founder at Polar Signals. A former Senior Principal Software Engineer at Red Hat (through the CoreOS acquisition), he contributed to Prometheus and Kubernetes to build state-of-the-art modern infrastructure and monitoring tools. He discovered his interest in monitoring tools and distributed systems in his previous jobs, where he used machine learning to detect anomalies indicating intrusion attempts. He also worked on projects involving secrets management and threat detection for distributed applications to build sound and stable infrastructure. LinkedIn, Twitter: @fredbrancz
