A good observability strategy focuses on the business goals of a system, and uses data from across a distributed system to identify whether those goals are being achieved. Successfully implementing such a strategy requires "making the right thing easy," by simplifying how teams share their data. These were some of the ideas discussed during the InfoQ Live roundtable discussion on observability patterns for distributed systems, held online on August 25. The panelists included Liz Fong-Jones, Luke Blaney, Idit Levine, and Daniel "Spoons" Spoonhower.
How a company can get started improving observability depends on what already exists in its system. Trace IDs, which let you see how a request flows through your system, can be propagated using existing networking components, whether that is a service mesh or just a load balancer or proxy.
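The propagation idea can be sketched in a few lines: each hop reuses the trace ID it received, or mints one if it is the first hop. The header name and helper below are illustrative, not from the discussion (the W3C Trace Context standard defines a `traceparent` header for exactly this purpose).

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # assumed name; W3C Trace Context uses "traceparent"

def outbound_headers(incoming_headers: dict) -> dict:
    """Reuse the caller's trace ID if present, otherwise start a new trace,
    so every hop in the request's path shares one ID."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {TRACE_HEADER: trace_id}
```

A proxy or load balancer doing the same thing, copying the header through or injecting one when it is missing, gives you request correlation across services before any application code changes.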
Make it easy for teams to onboard their existing services into the observability platform. Blaney described how the Financial Times uses a System Code to identify components. By including this one data field in logs and messages, a team will immediately see the benefit of being included in the observability platform. This is usually more effective than simply mandating that all teams must follow strict guidelines for reporting.
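As a sketch of that "one data field" idea, a log formatter can stamp every record with the service's identifier automatically, so teams get the tag without having to remember it on each call. The code value `content-api` is hypothetical, not a real Financial Times System Code.

```python
import json
import logging

SYSTEM_CODE = "content-api"  # hypothetical identifier; real FT System Codes differ

class JsonFormatter(logging.Formatter):
    """Emit each record as JSON, always stamped with the service's system code."""
    def format(self, record):
        return json.dumps({
            "systemCode": SYSTEM_CODE,
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("article published")
# emits: {"systemCode": "content-api", "level": "INFO", "message": "article published"}
```

Once every service emits the field, the observability platform can group, route, and attribute data by system without any per-team configuration.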
All the panelists said some variation of, "make the easiest path the correct path," with Fong-Jones observing that, "teams are super lazy." Because most teams are focused on developing their service, find ways to create automatic dashboards and update runbooks. Spoons emphasized the need to create machine-readable central documentation.
Similarly, structured logging makes information digestible and greatly aids searching for patterns. One of the behaviors to encourage is forming and testing hypotheses: having all the data from across a distributed system can become overwhelming, so you need ways to narrow your focus.
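A toy illustration of why structure helps: with one JSON object per line, a hypothesis like "most errors come from one service" becomes a few lines of filtering instead of regexes over free text. The field names and sample events below are invented for the sketch.

```python
import json

# Hypothetical structured log stream: one JSON object per line.
raw_logs = """
{"systemCode": "checkout", "level": "error", "msg": "payment timeout"}
{"systemCode": "search", "level": "info", "msg": "query served"}
{"systemCode": "checkout", "level": "error", "msg": "payment timeout"}
""".strip().splitlines()

def errors_by_system(lines):
    """Count error-level events per system code -- the kind of quick
    aggregation that structured logs make trivial."""
    counts = {}
    for line in lines:
        event = json.loads(line)
        if event.get("level") == "error":
            code = event.get("systemCode", "unknown")
            counts[code] = counts.get(code, 0) + 1
    return counts

print(errors_by_system(raw_logs))  # {'checkout': 2}
```

The same query against unstructured text would need brittle parsing; against structured events it is a filter and a group-by, which is what makes hypothesis testing fast.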
The practice of site-reliability engineering requires a different mindset than "ordinary" software engineering. Although DevOps has been an attempt to apply software engineering to IT operations, SRE takes an opposite approach when thinking about failure. This can be thought of as the duality between monitoring, which is looking for what is anticipated, and observing, where the focus is on what is unexpected.
Each of the panelists had a few pitfalls that they've seen, and hope people will avoid. Fong-Jones is wary of products that promise to easily add observability into your system. People think this comes from data and tools, but it's a capability that evolves over time, and your mindset and behavior really matter.
Spoons said individual excitement around adding observability needs to be focused on what matters most to the business. If someone goes off and does a lot of work, but it's outside the critical path, it offers no benefit, and ends up being forgotten.
Blaney wants developers to make sure they're asking the right question. Too often, a check only verifies that a dependency is up, without determining whether it can actually be used. That's a subtle distinction, and people new to the field often miss its importance.
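The distinction can be made concrete with a toy dependency (the class and checks below are illustrative, not from the discussion): a database can accept connections while being unable to do the one thing you actually need from it.

```python
class Database:
    """Toy stand-in for a dependency that can accept connections
    yet still be unable to do useful work (read-only mode, full disk, etc.)."""
    def __init__(self, accepting_connections: bool, writable: bool):
        self.accepting_connections = accepting_connections
        self.writable = writable

def shallow_check(db: Database) -> bool:
    # "Is it up?" -- passes as soon as the port accepts connections.
    return db.accepting_connections

def deep_check(db: Database) -> bool:
    # "Can I actually use it?" -- exercises the operation we depend on.
    return db.accepting_connections and db.writable

# Reachable but unable to take writes: passes the shallow check,
# fails the deep one -- exactly the gap Blaney warns about.
degraded = Database(accepting_connections=True, writable=False)
```

A shallow check on this dependency reports healthy while users see failures; the deep check asks the question the business actually cares about.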
Levine, who created the Squash debugger for microservices, described how working with a distributed system means activities such as debugging are only really applicable at development time. Different techniques are needed to troubleshoot a distributed system in production. The compiled binaries have to contain not only business logic, but also some operational logic. Levine has found using Envoy as a sidecar to be extremely helpful for adding details about traffic and service behavior.
Fong-Jones also emphasized the need to "Test in prod. Test in prod. Test in prod." You will never be able to fully recreate the production environment. To the greatest extent possible, developers should use the same tools and techniques during development as they will use to troubleshoot production outages.
The conversation wrapped up with the question of what's next for observability. In the future, systems will provide even more data, and Spoons hopes that will result in less manual effort by humans. Levine believes we will see more self-healing systems, as they become able to analyze the state of the system and respond accordingly. She also thinks that will help address the signal-to-noise problem common when observing large systems.
Blaney and Fong-Jones both cautioned against getting lost in all the data and tooling, and to keep focusing on what really matters. It's important to always ask, from a business perspective, if the system is in a good state.