Applying Observability to Ship Faster


To get fast feedback, ship work often, as soon as it is ready, and use automated systems in Live to test the changes. Monitoring can verify that things are good and raise an alarm if they are not. Shipping fast in this way can mean fewer tests and can make you more resilient to problems.

Dan Abel shared lessons learned from applying observability in a talk in 2020.

Shipping small changes, daily or hourly, sounds simple, but it can be hard to be great at. It really helps to have independent, loosely coupled systems, said Abel. He mentioned that thinking constantly about the desired design while coding really helps:

The Bounded Context concept (from Domain Driven Design) is a great guide to thinking how things can start to be separated and operate independently. I also have a rule that if several services have to be shipped together, we’ve probably built ourselves something too coupled to test, ship and monitor in the way we do.

Getting our products and features released as soon as we can allows us to learn as we go, Abel said. "We can often better see and solve issues in the small; learning good patterns before we scale up," he mentioned.

InfoQ interviewed Dan Abel, a software engineer, consultant, and coach, about applying observability to ship faster.

InfoQ: What purpose do metrics, monitors, and dashboards serve?

Dan Abel: For us at Tes, these allow us to check in on the health of our systems. They give us confidence that things are working as intended and crucially that our users are reaching their goals.

We instrument our applications to gather metrics: information from our Live systems. We then get visible displays of our service health via dashboards. Crucially, we can assert on this data using monitors.

We verify our systems by tracking success and failure metrics, and setting expectations on these via monitors to get alerted if user success drops too low, or if errors rise.

For example, we can ask, "Did we fail to render those PDFs?" "Did we serve those downloads okay?"

For me, that’s test automation in production.
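The pattern Abel describes can be sketched in a few lines: instrument the code to count successes and failures, then have a monitor assert an expectation on those live numbers. The metric names and threshold below are illustrative assumptions, not Tes's actual setup.

```python
# A minimal sketch of "test automation in production": count successes and
# failures per operation, then assert an expectation on the live numbers.
from collections import Counter

metrics = Counter()

def record(operation: str, ok: bool) -> None:
    """Instrumentation: count each success or failure as it happens."""
    metrics[f"{operation}.{'success' if ok else 'failure'}"] += 1

def monitor(operation: str, min_success_rate: float) -> bool:
    """Monitor: return False (raise an alarm) if user success drops too low."""
    ok = metrics[f"{operation}.success"]
    failed = metrics[f"{operation}.failure"]
    total = ok + failed
    return total == 0 or ok / total >= min_success_rate

# Simulate live traffic for the "Did we fail to render those PDFs?" question.
for _ in range(97):
    record("pdf_render", ok=True)
for _ in range(3):
    record("pdf_render", ok=False)

print(monitor("pdf_render", min_success_rate=0.95))  # → True (97% success)
```

In a real system the counter would live in a metrics backend and the monitor would page someone, but the assertion at the heart of it is the same.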

InfoQ: What happened when you arrived at a new company that was shipping fast with fewer tests?

Abel: When I arrived at Tes, I found working in this new way both exciting and challenging. It felt weird to not be running integration suites and thoroughly testing every nook before shipping. I was there to learn about new things and be an engineer with service ownership, so: challenge accepted.

I found myself on a team that was building a replacement job application system. The existing system was extremely valuable to the business, so we needed to find new ways to keep releasing and learning.

"Move fast and break things" couldn't quite apply here. If we wanted to keep shipping, we needed speedy and accurate feedback from our services in production.

So we asked, "What would happen if we applied what we cared about from test automation, using what we knew about production monitoring?"

And of course - we are engineers. Once we cracked that, we found we could do more complex measuring and monitoring.

InfoQ: What have you done to get faster feedback?

Abel: Shipping new work often, and as soon as it is ready, combined with automated systems in Live to tell me that each change is good.

As soon as I have a good level of confidence from my test automation, I want to get the software into my users’ hands. Rather than wait to hear about an issue from a user via customer support, I support my users by keeping an automatic eye on the system.

I want accurate feedback, as well as fast feedback. I want to know that my change has not damaged our users’ ability to use our system to reach their goal - like applying for a job. So we built monitoring to verify that things are good, and to raise an alarm if not.

The final part of the faster feedback puzzle is to record the nitty-gritty: what’s our code doing when it’s serving user requests? This means that when an issue arises, we can use this data as a spyglass to observe what’s going on inside our services, helping us react quickly. That’s a really useful superpower.
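One way to record that nitty-gritty is to emit a structured event at each step a request passes through, so the trail can later be filtered to see what a service was doing. This is a minimal sketch; the event fields and the in-memory list (standing in for a real log pipeline) are assumptions for illustration.

```python
# Record one structured event per step of a request, so an engineer can
# later filter the trail and see what the service did for a given user.
import json
import time

events = []  # stand-in for a log pipeline or event store

def emit(event: dict) -> None:
    """Stamp and store a structured event as JSON."""
    event["ts"] = time.time()
    events.append(json.dumps(event))

def handle_application(user_id: str, job_id: str) -> None:
    """A hypothetical request handler for a job application."""
    emit({"event": "application.received", "user": user_id, "job": job_id})
    # ... business logic would run here ...
    emit({"event": "application.stored", "user": user_id, "job": job_id})

handle_application("u42", "j7")

# Later, the "spyglass": which steps did user u42 actually reach?
trail = [json.loads(e)["event"] for e in events
         if json.loads(e)["user"] == "u42"]
print(trail)  # → ['application.received', 'application.stored']
```

If the trail stops at `application.received`, you know exactly which step to investigate, without reproducing the issue locally.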

InfoQ: What new skills were needed and how did people develop them?

Abel: We needed to learn to think in monitoring terms, learn more about monitoring tooling, and how best to monitor.

Most monitoring systems are set up for platform and operations monitoring. Using them for application monitoring takes both the tools and our engineering somewhere new. Early on, we got some weirdness out of our monitoring: the system was telling us we had issues when we didn’t. It sounds silly now, but reading and re-reading the monitoring system documentation until we really got it helped. Digging deeper into how different types of metrics and monitors were designed to be used allowed us to build a more stable monitoring system.

We also found that there were things we wanted to do, that we couldn’t do with out-of-the-box monitoring. Our early application monitoring was noisy and misfired. Too frequently it told us we had problems that we didn’t have. We kept iterating.

We ended up building more of the monitoring in code than we expected, but it was well worth the time. We got the bare bones of a monitoring system early, and by using it in the real world, we worked out what we really needed.

A skill we learnt late: as teams, we need to keep tending to our monitoring.

Our instrumentation, monitoring, and observability are as important as, maybe even more important than, our test automation. We knew how to manage our tests like we manage our code. We needed to learn to always manage our operability and observability like we manage our tests. That means reviewing as a team, keeping what’s useful, and removing or adjusting stale alerts that slow us down.

InfoQ: What have you learned?

Abel: Ship quickly, and pay attention to what goes on in Live! Why? It helps us provide a better service to users. It also has a big effect on how we build our systems. It’s guided us to focus on being resilient to problems. This is a big win.

Production bugs are inevitable. We found that rather than the occasional HUGE issue, we got many small problems. This drove us to think about making our systems more resilient to failure earlier than we otherwise would have. Often those actions were based on feedback which drove us to build in better viewpoints of the internal behaviour of our system.

I’ll always fight to get a system running in Live and launched to users as soon as I can, with the plan to observe, learn and improve.

InfoQ: How could someone start down this path?

Abel: A key learning is to use what I have and build towards what I need.

I read a lot about observability. There are some really cool new technologies out there that sound like the bee’s knees, and there is a lot of debate about suitable technologies.

All this new tech looks absolutely great, but don’t stop if it’s not available to you. I got great results from recording an audit trail in a database and monitoring for exceptional behaviours. What’s important is getting feedback from the place where your users are. Ensure you have the fidelity you need to make good swift choices to help your users, and keep building a better system for them.
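Abel's low-tech starting point, an audit trail in a database plus a check for exceptional behaviour, can be sketched roughly as follows. The table layout and the threshold check are illustrative assumptions; sqlite3 is used here only so the example is self-contained.

```python
# A sketch of "audit trail in a database": record what happened in Live,
# then query the trail for behaviour that exceeds expectations.
import sqlite3

db = sqlite3.connect(":memory:")  # a real system would use a shared DB
db.execute("CREATE TABLE audit (action TEXT, outcome TEXT)")

def audit(action: str, outcome: str) -> None:
    """Append one row to the audit trail."""
    db.execute("INSERT INTO audit VALUES (?, ?)", (action, outcome))

# Simulate some live activity: three good downloads, one error.
for outcome in ["ok", "ok", "ok", "error"]:
    audit("download", outcome)

def exceptional(action: str, max_errors: int) -> bool:
    """Monitor the trail: did errors for this action exceed our expectation?"""
    (n,) = db.execute(
        "SELECT COUNT(*) FROM audit WHERE action = ? AND outcome = 'error'",
        (action,),
    ).fetchone()
    return n > max_errors

print(exceptional("download", max_errors=0))  # → True, so raise an alarm
```

A scheduled job running a query like this gives you feedback from where your users are, with no new observability stack required.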
