Observability must evolve with serverless, event-driven architectures, Martin Thwaites argued in his talk Observability and the Art of Software Engineering at GOTO Copenhagen. OpenTelemetry can decouple telemetry from vendors, letting developers emit consistent, high-quality data that explains real system behavior. Shared vocabularies and good telemetry make debugging faster and improve reliability, speed, and developer productivity.
Modern observability is tightly coupled to the definitions of "modern" systems, "modern" development processes, and "modern" architecture. It’s a way of saying that the way we architect, build, and therefore support systems has changed since the days of monoliths and servers, Thwaites explained:
We’re now building Serverless, Event Driven, Cell-based architectures, therefore the way we think about the telemetry, and ultimately observability around them, should also change.
OpenTelemetry is the glue that sits between your systems, which document what’s happening by emitting their telemetry, and the system (or potentially several systems) that helps you make sense of that data, Thwaites said. It’s not tied to any single way to investigate that data, which means it’s not tied to the way a particular vendor or solution chooses to focus:
This decoupling makes it a developer-focused tool. You can concentrate on producing the best telemetry you can, instead of tailoring it to make it work within your current product.
Good telemetry is data that’s focused on describing how the system "works" in production, Thwaites said. By "works" in this context, we’re referring to how each service is serving a particular request or interaction, he explained:
It will allow you to, from that data, understand what makes this interaction different from another, and what that caused to happen in the system, whether that’s specific database calls, or whether it’s particular, unique, codepaths that were executed.
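The idea of telemetry that explains what made one interaction different from another can be sketched as a span enriched with both standard and business-specific attributes. This is an illustrative sketch in plain Python, not the OpenTelemetry API; the `app.*` attribute names are hypothetical examples of team-defined business context:

```python
# Illustrative sketch: a span whose attributes capture both the mechanics
# of a request and the business context that makes it unique.

def handle_order(order_id: str, customer_tier: str, items: int) -> dict:
    """Serve a request and record the context that distinguishes it."""
    return {
        "name": "POST /orders",
        "attributes": {
            # Standard-style attributes describe the mechanics of the call...
            "http.request.method": "POST",
            "db.operation.name": "INSERT",
            # ...while hypothetical app.* attributes carry business context.
            "app.order.id": order_id,
            "app.customer.tier": customer_tier,
            "app.order.item_count": items,
        },
    }

span = handle_order("ord-42", "premium", 3)
print(span["attributes"]["app.customer.tier"])  # → premium
```

With attributes like these on every span, a slow request from a "premium" customer with 500 items tells a very different story than a fast one with 2 items, which is exactly the kind of question debugging needs to answer.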
If this is done consistently, debugging of production issues is amazingly simple and quick, Thwaites concluded.
One of the things that people have found over years of monitoring systems is that consistency in telemetry is important. Inconsistency in how teams describe their systems’ performance has become a bigger problem as the complexity of those systems has increased, Thwaites said. He mentioned Weaver, a tool to document the telemetry emitted by systems that goes beyond the standard attributes you might expect, like HTTP or gRPC:
It allows teams to define a shared vocabulary of telemetry in a way that observability backends, AI tooling, and ultimately humans, can use to understand that complex system.
Weaver also provides live checking and exception tracking against telemetry to ensure that you’re using only the approved conventions, and code generation to make adoption easier.
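The core idea behind live checking, validating emitted telemetry against an approved vocabulary, can be sketched in plain Python. This is not Weaver’s actual schema or API (Weaver consumes a registry of semantic conventions); the registry contents and attribute names here are hypothetical:

```python
# Minimal sketch of "live checking": flag telemetry attributes that are
# unknown to, or mistyped against, a team's approved vocabulary.
# The registry below is hypothetical, not Weaver's real format.

APPROVED_ATTRIBUTES = {
    "http.request.method": str,   # standard convention
    "app.order.id": str,          # team-defined business attribute
    "app.order.item_count": int,
}

def check_span(attributes: dict) -> list[str]:
    """Return a list of violations for unknown or mistyped attributes."""
    violations = []
    for name, value in attributes.items():
        if name not in APPROVED_ATTRIBUTES:
            violations.append(f"unknown attribute: {name}")
        elif not isinstance(value, APPROVED_ATTRIBUTES[name]):
            violations.append(f"wrong type for {name}: {type(value).__name__}")
    return violations

print(check_span({"http.request.method": "GET", "app.order.total": 9.99}))
# → ['unknown attribute: app.order.total']
```

Running a check like this in CI, or generating typed instrumentation helpers from the registry, is what makes a shared vocabulary enforceable rather than aspirational.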
Producing good telemetry is the single greatest thing that will move the needle in how your team supports its production systems, Thwaites argued:
The best teams I’ve worked with have spent as much time curating the telemetry they output as they have writing the code that performs the business outcome.
It’s a development task, not an operations task, Thwaites said. Once teams embrace telemetry as a core part of developing good software, you’ll see its effect in many different ways, from MTTR (mean time to recovery) and MTTD (mean time to detection) to developer happiness and defect rate, he concluded.
InfoQ interviewed Martin Thwaites about observability and telemetry.
InfoQ: What can observability do for artificial intelligence applications?
Martin Thwaites: Observability is designed as a means to ask questions of your production system that you didn’t know that you needed to ask while you were writing the code, which is exactly what we need when a system can use AI to perform tasks. We don’t know how that system is going to react to a given input, and that input can and will change as users interact with it.
It’s now even more important that we get robust telemetry, that includes our unique business context, out of our systems so that we can answer those weird and wonderful questions.
InfoQ: How are telemetry and test-driven development related?
Thwaites: Telemetry is a core output of our applications; it’s how we understand whether an action from a user did the right thing. If we’re writing tests in a TDD workflow (i.e. writing tests before the implementation), and we’re using telemetry as part of those tests to understand that an action was performed correctly, then the code we produce is designed to be observable from the start.
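A TDD workflow that asserts on telemetry can be sketched as follows. This is an illustrative, self-contained example, not a real OpenTelemetry test harness; the in-memory exporter, function names, and attribute names are hypothetical:

```python
# Sketch: a test asserts on the telemetry the code emits, so observability
# is part of the code's contract from the start.

emitted_spans: list[dict] = []  # stand-in for an in-memory test exporter

def record_span(name: str, **attributes) -> None:
    """Record a span; in a real system this would go to the tracer."""
    emitted_spans.append({"name": name, "attributes": attributes})

def cancel_order(order_id: str) -> None:
    # Business logic would go here; the telemetry is part of the contract.
    record_span("cancel_order", order_id=order_id, outcome="cancelled")

def test_cancel_order_emits_outcome():
    emitted_spans.clear()
    cancel_order("ord-42")
    span = emitted_spans[0]
    assert span["name"] == "cancel_order"
    assert span["attributes"]["outcome"] == "cancelled"

test_cancel_order_emits_outcome()
print("test passed")  # → test passed
```

Writing the test first forces the question "what telemetry would prove this worked?" before the implementation exists, which is how observable-by-design code emerges.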