Building Observable Distributed Systems

Today’s systems are increasingly complex: microservices are distributed over the network and scale dynamically, which creates many more ways to fail, including ways we can’t always predict. Believing that we can build a perfect system leads to a false sense of security, so we need to be prepared. Investing in observability gives us the ability to ask our systems questions we never thought to ask before. Tools that can be used for this include metrics, tracing, and structured and correlated logging.

Pierre Vincent, site reliability engineering (SRE) manager at Poppulo, spoke about building observable distributed systems at QCon London 2018. InfoQ is covering this conference with Q&As, presentations, summaries, and articles.

InfoQ interviewed Vincent about applying observability when building distributed systems.

InfoQ: You mentioned in your talk that reaching production is just the beginning. Can you elaborate?

Pierre Vincent: We are really good at everything before production, but we rarely improve on what happens afterwards. The more I look at this, the more I see that spending all that time focusing on pre-production not only comes with diminishing returns but is almost counter-productive. We may believe we can build the perfect thing, but at the end of the day systems fail, and failure gets very hard to deal with, especially in distributed systems.

As a developer I had it backwards for a long time: production was the end game. Once we reached production we were done, it was somebody else’s problem, and we moved on to the next story or feature. We have grown comfortable thinking that all the important things we do pre-production are enough: TDD, integration testing, staging, end-to-end testing and so on.

Nowadays we’re dealing with much more complex systems, distributed over the network, scaling dynamically, etc. This brings in so many more ways things can fail, and we have to think about that. Believing that we can build the perfect thing gives a false sense of security: when things fail, we simply won’t be prepared for it. And the bad news is that it’s going to fail much harder, because we have nothing at our disposal to detect the problem or investigate it.

InfoQ: How can observability help us to be prepared for problems that happen in production?

Vincent: Investing in observability means being prepared to spend time instrumenting systems, giving ourselves the tools to cope with the unknowns that come in production.

There are different ways to give ourselves that information in production. It can be very simple at the start, such as some basic health-checks. There are a lot of tools that do powerful things with time series metrics. Metrics, tracing, logging, correlation, structured logging, events: no single one solves everything, but combined they make a really powerful solution. We have to admit things may go wrong, but when they do, we have all of these things to help us as detectives and to help us react and recover faster.
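To make the "basic health-checks" starting point concrete, here is a minimal sketch using only the Python standard library. It is not taken from the talk: the `/health` path, the check names, and the `database_reachable` and `queue_reachable` helpers are hypothetical placeholders for whatever dependencies a real service has.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


# Hypothetical dependency checks; a real service would ping its database, queue, etc.
def database_reachable() -> bool:
    return True


def queue_reachable() -> bool:
    return True


CHECKS: Dict[str, Callable[[], bool]] = {
    "database": database_reachable,
    "queue": queue_reachable,
}


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return an HTTP status code plus a JSON-ready body."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {exc}"
    healthy = all(state == "ok" for state in results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":  # assumed endpoint path
            self.send_error(404)
            return
        code, body = run_health_checks(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling `serve()` exposes the endpoint; a load balancer or monitoring probe could then poll `/health` and treat a 503 response as a signal that one of the dependencies needs attention.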

InfoQ: Which observability techniques and practices do you recommend?

Vincent: It can be tempting to just stop at metrics or monitoring, while there are other things worth looking into.

Metrics are very efficient at aggregating data over low-granularity dimensions, which makes them a good choice for service-level monitoring. However, metrics don’t scale well to high cardinality, which is what debugging and exploration need. We have tried to use Prometheus for high-cardinality time series and it wasn’t fun!
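A rough way to see why high cardinality hurts is to count the time series a single metric can produce: in a label-based system such as Prometheus, each distinct combination of label values is its own series. The sketch below computes that upper bound. The label names and counts are made-up numbers for illustration, not figures from Poppulo.

```python
from math import prod
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical service-level labels: coarse dimensions only.
service_level = {"service": 20, "endpoint": 10, "status_code": 5}

# Same metric with a per-customer label added (assumed 10,000 customers).
per_customer = {**service_level, "customer_id": 10_000}

print(series_count(service_level))  # 1000
print(series_count(per_customer))   # 10000000
```

Not every combination occurs in practice, so real numbers are lower, but the multiplication shows why a single customer-level label can push a metrics store well beyond what it was designed for.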

Good logs, especially structured logs, play a huge role in increasing our understanding of how applications behave. By good logs, I mean logs that are easy to search and provide sufficient context to understand how events occurred. Correlation of logs is also one of the first things to look at, with things like a Request Id following a request through the entire system. Logs can be expensive though: logging libraries often have non-trivial overhead, it can be tricky to manage their volume, and sampling at the log level isn’t easy.
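As an illustration of structured, correlated logging, the following sketch emits one JSON object per log line and attaches the same request id to every line written while a request is being handled. It uses only the Python standard library; the logger name `orders`, the field names, and the `handle_request` function are assumptions made for the example rather than anything described in the interview.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the id of the request currently being processed.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object so logs are easy to search."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Extra structured context passed via `extra={"context": {...}}`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, customer_id, incoming_request_id=None) -> str:
    """Reuse the caller's request id if present, otherwise create one."""
    request_id = incoming_request_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("order received", extra={"context": {"customer_id": customer_id}})
        logger.info("order stored", extra={"context": {"customer_id": customer_id}})
    finally:
        request_id_var.reset(token)
    return request_id


if __name__ == "__main__":
    buffer = io.StringIO()
    handle_request(build_logger(buffer), "acme", "req-123")
    print(buffer.getvalue(), end="")
```

Because every line carries `request_id`, searching the log store for one id returns the full path of that request, and fields such as `customer_id` give the context needed to understand what happened.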

We’ve had some positive experiences with Zipkin (OpenTracing) in our stack - we actually use Trace Ids as our log correlation Ids, which makes it even easier to jump from logs to traces. Within the first few hours of looking at traces, we discovered a few things we had no clue were happening. Unfortunately, distributed tracing has a big instrumentation cost for existing systems. We’re more than a year in and we haven’t yet instrumented everything. A word of advice: if you’re starting now, build tracing in; it’s so much cheaper upfront.
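The idea of reusing a trace id as the log correlation id can be sketched without a tracing library. The example below assumes the trace id travels in the `X-B3-TraceId` HTTP header used by Zipkin's B3 propagation format; whether Poppulo propagates ids this way is not stated in the interview. It is deliberately simplified: real tracing instrumentation also manages span ids and sampling decisions, which are left out here, and the helper names are hypothetical.

```python
import re
import secrets
from typing import Dict, Mapping

B3_TRACE_ID = "X-B3-TraceId"
# B3 trace ids are 16 or 32 lower-hex characters (64-bit or 128-bit).
_HEX_ID = re.compile(r"^(?:[0-9a-f]{16}|[0-9a-f]{32})$")


def trace_id_from_headers(headers: Mapping[str, str]) -> str:
    """Return the caller's trace id, or start a new 128-bit one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == B3_TRACE_ID.lower() and _HEX_ID.match(value):
            return value
    return secrets.token_hex(16)


def log_fields(trace_id: str) -> Dict[str, str]:
    """Fields to attach to every log line written for this request."""
    return {"correlation_id": trace_id}


def outgoing_headers(trace_id: str) -> Dict[str, str]:
    """Headers to forward so downstream services log the same id."""
    return {B3_TRACE_ID: trace_id}
```

With the same value in both places, a correlation id found in a log line can be pasted directly into the tracing UI to open the matching trace.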

Something we’re looking into right now is exception shipping, in particular for client-side browser errors. This is a blind spot we’ve had for a while, and we’re trying to see if tools like Sentry can increase visibility, especially for customer-impacting errors.

Overall, you have to spend time using each of these things for what they’re good at. A single tool won’t work; the trick is to find the right balance so that they best complement each other.

InfoQ: What are the benefits that observability can bring?

Vincent: The immediate benefit for us was visibility. When we introduced Prometheus a few years ago, our reaction was pretty clear: how did we ever manage without knowing these things? From then on, it’s a virtuous circle: asking more questions, and instrumenting more if we can’t answer them.

As I said before, it doesn’t come without issues - we’ve had to review our strategy when metrics weren’t the right fit. One example is that we started relying on structured logging for customer-level granularity (which is high-cardinality), which gave us the ability to debug on a customer-by-customer basis.

InfoQ: What will the future bring for observability?

Vincent: Improvements around ease of use and developer experience really come to mind. If systems are hard to instrument and outputs aren’t easy to interpret, this is a losing battle.

If I had to pick one thing, I think logging is going to get a lot more love in the near future. Even though it has its limits right now, there is a lot of potential in structured logging: providing more detailed context and enabling better debugging.
