An Engineer’s Guide to a Good Night’s Sleep

Aug 20, 2019 18 min read

Key Takeaways

Support for highly available systems must adapt as architectural models and team organisation evolve.
System ownership means we need to examine more than business deliverables.
Build a technically complex, yet supportable system, using five best practices.
Rethinking support, automating recovery, catching errors early, understanding what your customers really care about, and a solid mix of breaking things and practicing are the best hope for a good night’s sleep.

We are building really complicated distributed systems. Increased microservices adoption, fueled by the move into the cloud where architectures and infrastructure can flex and be ephemeral, adds complexity every day to the systems we create and maintain. All of this complexity takes place alongside operating models with autonomous and totally empowered teams, so each distributed system has its own tapestry of technical approaches, languages, and services.

Thankfully, despite such complications, you can build a technically complex, yet supportable system, using a set of best practices gathered from my experience building a new generation of APIs in a completely greenfield manner.

Complexity is the Cost of Flexibility

Image: Skyscanner service tracing

Microservices are autonomous and self-contained, and allow us to deploy independently without being shackled to lengthy release coordination. Being cloud native allows us a lot more flexibility around scaling. But these new approaches and technical capabilities are used for organisational reasons, such as scaling teams while still maintaining momentum, not to reduce technical complexity. So, all of this comes at an operational cost and a rise in complexity.

Martin Fowler talks of the trade-offs of microservices, with one being:

"Operational Complexity: You need a mature operations team to manage lots of services, which are being redeployed regularly."

While working for the Financial Times, I had the opportunity to help a team build a new generation of content APIs in a complete greenfield manner. We were empowered - a self-organised team that could choose how to re-architect, deploy, and maintain all this new functionality. We could choose all aspects of the tech stack, but we also had to define our support model—we were fully accountable and we knew it, so we built with that in mind. This accountability made us view the operational-support model differently than on previous projects.

Initially, our implementation was very flaky, with APIs regularly unavailable or, worse still, serving unreliable data. This drove our initial major consumer, a new version of the ft.com site, to guard against us by using caches to store our data in case the API was down. When it came to the go-live of this new site, they wanted assurances from us in the form of service-level agreements (SLAs).

The most important thing to the Financial Times is breaking news. We were therefore asked for a very tight SLA of 15 minutes recovery for the APIs involved in breaking news. Now, this is pretty tough, especially as we had some pretty big datastores behind the scenes. This SLA not only brought with it a set of technical challenges, but also several organisational challenges, such as out-of-hours support. Luckily for us, we were empowered to define how to do that in the way that best suited us. We did do it and managed to virtually eliminate out-of-hours calls.

We didn’t build a mythical, magical system that was fault-free, but we did build a resilient technical and operational system. Five best practices brought us to that point of resilience:

1. Enable the Engineer

This is ultimately the most important part of building a resilient system. Every engineer must start thinking of operability as a primary concern, not as an afterthought. They need to start questioning their implementations continuously—"What do I want to happen if there is an error: Get called regardless of the time? Retry? Log something to look at later (perhaps)?"

We are now working in, or at least moving toward, autonomous, empowered teams or squads. This approach gives us accountability, but also influence over the support model we want to use. That holds whether you operate in the traditional model of a dedicated, company-wide, first-line support team, or in teams that are called directly from tools like PagerDuty.

Of course, working with other teams complicates the process, but can also take the pressure off the engineers—it is a trade-off and often depends on the company-wide model. Teams need to define how they work together best and ensure they have adequate tooling. Above all, for the teams to work effectively, they need to foster and evolve the relationships within each team and among collaborating teams.

Rethinking Out-of-hours Support

When the SLA changed for the content APIs with the new site rollout, there had to be commitments about out-of-hours support. We took the opportunity to examine the way we did this. We were keen to reduce the effect of out-of-hours support on people’s home life. We achieved this using two buckets of engineers, as shown in the following figure.

The support model defined by Content Programme at the Financial Times

The operations team could round robin call the engineers in the bucket, with one bucket being "on call" from week to week. The engineers were not obligated to answer the call, and if they didn’t, then the next person would get called. This allowed people to spontaneously head to the pub, out for a bike ride, or swim, without worrying they might get called. This system was completely voluntary and well compensated—people got overtime if they answered a call. It is important to recognise that some people are just unable to commit to this, and also important to recognise this is not part of the day-to-day job, so they should be paid.

This support model extended into the daytime also. There were two people who responded to incidents, queries, and also worked to improve the platform and operational processes. They rotated in from the engineering teams. This was a dedicated role because we recognised this as a critical function and no operational improvements would happen if they were working on scheduled work. Additionally, this meant engineers were raising pull requests across the entire estate, not only the areas with which they were familiar. This platform-wide issue handling was an important aspect. It helped bolster the confidence of the engineers to cope with calls out of hours and encouraged them to think of operational concerns when implementing business functionality. The pain of bad practices would bite them while on rotation, or worse still, at 3 am.

People design systems differently if they understand their code could result in an out-of-hours call. The question to ask is, "Will this error affect the business?" For the Financial Times, the question was always, "Will this error affect the brand?" But the impact could be a whole lot worse if you worked for a hospital or power station!

Handling Errors

Our next improvement was hooking up the errors with alerting in a sensible manner. Thinking about severity levels is a useful way to differentiate between something that needs an immediate call and something that can be fixed during normal hours. Not all errors are going to cause issues to the primary functionality of the system. Here is an example of what we adopted:

The severity level is clearly stated in the aggregated system-health check.

We had a containerised platform, starting with one we hand-rolled ourselves before migrating to Kubernetes. We wanted to understand immediately when our platform was unable to perform critical business activities such as publishing content. However, there were some "not-working" services that could be recovered in-hours, at our leisure. The example above shows a health-check page that aggregated the health of the entire platform and if the aggregate went to critical, we got called. However, in the example shown, we are warned we have an out-of-date version of CoreOS, the container operating system we used. This warning was very important to address because it could have critical security patches; but, it does not need to be done at 3 am on a Sunday morning. It can be addressed post-coffee on a Monday instead.
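A minimal sketch of that kind of severity-aware aggregation follows. It is not the Financial Times implementation; the Check type, the check names, and the three severity levels are assumptions made for illustration. The point it shows is that only a critical aggregate results in a call, while warnings wait for working hours.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    OK = 0
    WARNING = 1   # fix in working hours
    CRITICAL = 2  # call someone now


@dataclass(frozen=True)
class Check:
    name: str            # hypothetical check name
    healthy: bool        # result of the individual health check
    severity: Severity   # how bad it is when this check fails


def aggregate(checks):
    """Return the worst severity among failing checks, or OK if none fail."""
    return max((c.severity for c in checks if not c.healthy), default=Severity.OK)


def should_call(checks):
    """Only a critical aggregate justifies an out-of-hours call."""
    return aggregate(checks) == Severity.CRITICAL


platform = [
    Check("content-publish", healthy=True, severity=Severity.CRITICAL),
    Check("coreos-version-current", healthy=False, severity=Severity.WARNING),
]
print(aggregate(platform).name, should_call(platform))  # prints: WARNING False
```

Declaring the severity next to each check keeps the "does this deserve a 3 am call?" decision with the engineers who own the service, rather than in the alerting tool.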

Engineers can address operational issues going forward, as well as glance back and keep updating older applications as the operational approach is refined.

Lehman’s laws of software evolution state:

"The quality of a system will appear to be declining unless it is rigorously maintained."

"As a system evolves, its complexity increases unless work is done to maintain or reduce it."

This can apply to technical implementations and the operation-support model. Engineers need to maintain and enhance the operational aspects of the system to keep it functioning as required.

2. Avoid out-of-hours calls, by catching more in-hours calls

Problems introduced by daytime activity are probably the largest cause of out-of-hours calls. However, there are a few things you can do to reduce this likelihood.

Releases

Releases are the biggest culprit. Even with the best will in the world, we can’t completely guarantee a problem-free release. Scenarios could lie undiscovered, such as a slow "leak" of an issue that becomes more noticeable over hours, days, or weeks. Releases are always going to increase the chance of getting called, but we can do some things to reduce that risk.

Refusing to release at 5 pm is not a solution. Fear of late-day releases is probably a good indicator that there is low confidence in the release or in the team supporting the release. Of course, there is always a balance of risk:

What is the impact of release?
How quickly could this be rolled back?
How confident am I that this release isn’t going to break things?

Charity Majors blogs about this, likening Friday freezes to "murdering puppies:"

From "Friday Deploy Freezes Are Exactly Like Murdering Puppies" blog post by @mipsytipsy

Deployments should be as quick as possible; don’t underestimate people’s attention span. If it takes a long time, people will wander off and make a cup of tea or, worse still, head home before verifying the release. If you are unable to make your deployments super quick, then make sure you get poked to verify the release has gone out successfully. A simple Slack alert hooked into your CI would suffice. Alternatively, you can automate the verification.
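As a rough illustration of automating that verification, the sketch below retries a smoke test after a deployment and reports the outcome. The probe and notify callbacks are hypothetical: probe might hit a health endpoint, and notify might post to a chat channel from your CI. Neither is tied to a real API here.

```python
import time


def verify_release(probe, notify, attempts=5, delay_seconds=10.0, sleep=time.sleep):
    """Run a smoke test a few times after a deploy and report the outcome.

    probe:  hypothetical callable returning True when the release looks healthy.
    notify: hypothetical callable taking a message, e.g. a chat webhook wrapper.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            notify(f"release verified after {attempt} attempt(s)")
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # give the new version time to come up
    notify(f"release NOT verified after {attempts} attempts; consider rolling back")
    return False
```

Run as the last step of a pipeline, a check like this means nobody has to remember to come back and look at the release.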

Once in production, verification in some manner is critical and does not necessarily need to be sophisticated. This means testing in production, which is not as scary as it may sound. I don’t mean for anyone to throw their code into production and hope for the best. Some options:

Have production details and be able to perform a manual test
Use feature flags with a small number of users exposed to the change initially
Turn off the new feature overnight, while you build confidence in the release
Constantly replay requests to highlight any issues early on
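The feature-flag option can be sketched as a deterministic percentage rollout. This is one common approach rather than a description of any particular flagging product; the flag name and user identifier are placeholders.

```python
import hashlib


def in_rollout(flag, user_id, percent):
    """Return True if this user falls inside the first `percent` of buckets.

    Hashing the flag name together with the user ID gives each user a stable
    bucket from 0 to 99, so the same user always gets the same answer and
    raising `percent` only ever adds users.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Setting percent to 0 turns the feature off for everyone, which is one way to switch a new feature off overnight while confidence in the release builds.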

James Governor called this "Progressive Delivery" and Cindy Sridharan also talks of testing in production.

Besides releases, the other thing that can massively increase your chances of a call in the middle of the night is batch jobs.

Batch Jobs

Batch jobs that do all the heavy lifting of a job in the middle of the night mean you get delayed feedback for an issue, at exactly the time you don’t want it.

The next figure illustrates this:

An order system that takes real-time orders, but transforms them at 3 am

Here we have a simple order system that is getting orders in an event-driven manner. However, it aggregates and transforms all those orders at 3am, before sending a file off to third-party reconciliation.

If we could move much of the heavy computation back into the day, we shift the call back into business hours, shown below:

An improved order system that moves the transformation aspect to real-time

So, if we transformed or aggregated the orders in real-time, the job at 3 am merely has to FTP the file off to a third party.
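A small sketch of the idea, assuming a hypothetical order event carrying a customer and an amount: each order is validated and aggregated as it arrives, so a malformed order fails during the day, and the overnight job only has to write out lines that are already computed.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation


class RunningOrderTotals:
    """Aggregates orders as they arrive, so bad input fails at arrival time."""

    def __init__(self):
        self._totals = defaultdict(Decimal)  # Decimal() is zero

    def on_order(self, customer, amount):
        """Validate and aggregate one order event (hypothetical shape)."""
        try:
            value = Decimal(amount)
        except InvalidOperation as exc:
            raise ValueError(f"unparseable amount: {amount!r}") from exc
        if not value.is_finite() or value < 0:
            raise ValueError(f"invalid amount: {amount!r}")
        self._totals[customer] += value

    def export_lines(self):
        """The 3 am job only needs to write out these precomputed lines."""
        return [f"{c},{t}" for c, t in sorted(self._totals.items())]
```

The file format and the validation rules are assumptions; what matters is that the error-prone work happens while people are at their desks.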

3. Automate failure recovery wherever possible

Don’t get woken up for something a computer can do for you; computers will do it better anyway.

The best thing to come our way in terms of automation is all the cloud tooling and approaches we now have. Whether you love serverless or containers, both give you a scale of automation that we previously would have had to hand-roll.

Kubernetes monitors the health checks of your services and restarts on demand; it will also move your services when "compute" becomes unavailable. Serverless will retry requests and hook in seamlessly to your cloud provider’s alerting system.

These platforms have come a long way, but they are still only as good as the applications we write. We need to code with an understanding of how they will be run, and how they can be automatically recovered.

Develop applications that can recover without humans

Ensure that your applications have the following characteristics:

Graceful termination: Terminating your applications should be considered a norm, not an exception. If you lose a node of your Kubernetes cluster, then Kubernetes will need to schedule your application on a different VM.
Transactional: Make sure your application can fail at any point and not cause half-baked data.
Clean restarts: We are guaranteed to lose VMs in the cloud, and we need to withstand this, whether it’s a new function spinning up or a container being moved.
Queue backed: One way to get your applications to pick up where they left off if suddenly terminated is to back them with a queue.
Idempotency: This means that repeated requests will have the same outcome, and is really useful for being able to replay failed events.
Stateless: Stateless microservices are much easier to move around; so, where possible, make them stateless.
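Two of these characteristics, queue backing and idempotency, can be sketched together. This is an in-memory illustration only: the deque stands in for a real queue, and the set of processed IDs would need to live in durable storage in practice.

```python
from collections import deque


class IdempotentConsumer:
    """Processes (message_id, payload) pairs at most once per message_id."""

    def __init__(self, handler):
        self._handler = handler
        self._processed = set()  # assumption: durable storage in a real system

    def process(self, queue):
        while queue:
            message_id, payload = queue[0]   # peek; do not remove yet
            if message_id not in self._processed:
                self._handler(payload)       # may raise; message stays queued
                self._processed.add(message_id)
            queue.popleft()                  # acknowledge only after success
```

Because a message is removed only after the handler succeeds, a crash mid-way leaves it on the queue to be picked up after a restart, and because processed IDs are remembered, replaying a message that already went through has no further effect.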

Allow for the possibility of complete system failure

There are also techniques for dealing with situations when an outage is greater than one service, or if the scale of the outage is not yet known. One such technique is to have your platform running in more than one region, so if you see issues in one region, then you can failover to another region.

For example, at the Financial Times, we had an ACTIVE/ACTIVE setup, where we had our entire cluster running both in the EU and in the US, and requests were geo-location routed to either cluster, as shown below:

An ACTIVE/ACTIVE multi region full system deployment

We also had an aggregated health check for each cluster, and if any one of them went critical, we would direct all traffic to the other cluster. On recovery, it would automatically fail back to serving from both. If you have an ACTIVE/PASSIVE setup, you might need to script some state recovery/verification into your fail back options.
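The routing decision can be sketched as below. The region names and the shape of the health map are assumptions, and in practice this logic would usually live in DNS or a load balancer rather than in application code.

```python
def route(preferred, critical):
    """Pick the cluster that should serve a request.

    preferred: region chosen by geo-location, e.g. "eu" (hypothetical name).
    critical:  mapping of region -> True if its aggregated health is critical.
    """
    if not critical.get(preferred, True):
        return preferred              # normal case: serve from the nearest cluster
    for region in sorted(critical):
        if not critical[region]:
            return region             # fail over to any healthy cluster
    raise RuntimeError("no healthy cluster available")
```

Because the decision is made from the current health state on every call, fail back needs no extra step: once the preferred region stops reporting critical, it is chosen again.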

4. Understand what your customers care about

Understanding what your customers really care about is not as straightforward as you would first think, because if you ask them, they will say "everything!" When you understand your domain, however, you start to understand the critical nature of certain functionality. For example, for the Financial Times, their brand is utterly critical and linked to that is the necessity to be able to break the news, so while working on the content platform we viewed the publication and ability to read content as critical.

Once you understand the critical flows, you need to know when things are broken BEFORE your customer. Don’t let your customers be the ones to inform you of outages or issues. You want to make sure you alert on critical flows without additional noise.

"Only have alerts that you need to action." - Sarah Wells, Director of Operations and Reliability @ the Financial Times

"Alert fatigue" means that if you alert on every error, you become almost blind to those alerts. Only have alerts for things that you will take action on right away. That can mean leaving an error you have no plans to do anything about unalerted, because you can address it in the future if and when it becomes important to the system. You still have logs you can query for those unalerted issues if you want to fix them. After all, there is no such thing as a perfect system.

Not all applications are equal. Only alert on business-critical apps that affect those flows you’ve previously identified. Some breaks need a call, while others can wait until you have a coffee in the office the following day.

Having alerts in place is great, but you don’t necessarily want to wait until there is a real failure to send an alert, and this is particularly true if you have spiky traffic. A content platform is a really great example of this. The Financial Times is global, but most content is written inside the UK, so in the early hours of Sunday there will be very few publishes, if any. So, this situation presents a "Schrödinger’s platform": is the platform dead, or have there been no publishes?

To ensure we had a steady flow of feedback on our critical flows, we introduced something called synthetic requests: republishes of existing content that we pumped through the system continuously. This enabled us to alert if something broke without an actual publish being involved.

These synthetic requests gave us some amazing additional benefits—we deployed monitoring and these synthetic requests in lower environments, so we actually ended up using our monitoring as end-to-end tests also.
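A synthetic request check might look roughly like the sketch below. The publish, read, and alert callbacks are hypothetical stand-ins for the platform’s own clients, and a real check would poll for the content with a deadline rather than read it once.

```python
def synthetic_check(publish, read, content_id, body, alert):
    """Republish known content and confirm it can be read back.

    publish, read and alert are hypothetical callbacks supplied by the caller.
    Returns True when the critical flow worked end to end.
    """
    try:
        publish(content_id, body)
        ok = read(content_id) == body
    except Exception as exc:  # any failure in the flow is worth an alert
        alert(f"synthetic publish failed: {exc!r}")
        return False
    if not ok:
        alert(f"synthetic publish of {content_id} was not readable")
    return ok
```

Run on a schedule, a check like this produces a signal even when nobody is publishing, and pointed at a lower environment it doubles as an end-to-end test.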

We instrumented our code with rudimentary tracing, by just embedding a transaction identifier. There is now a plethora of tooling such as Zipkin that can provide much richer tracing information. Ben Sigelman did a great keynote at QCon London 2019 on the importance of tracing: "Restoring Confidence in Microservices: Tracing That’s More Than Traces."

This tracing allowed us to track our publishes through the system and react when they got stuck somewhere along the line. We initially wrote a service that did the monitoring, but it became quite brittle and one needed to understand a lot of business logic dispersed across the system, which seemed like an anti-pattern in the microservices world. So, we moved toward using structured logging to identify monitoring events. We could then feed those logs into Kinesis and run streaming SQL over the results, as shown below:

Analysis of logs using streaming SQL on a kinesis stream

This approach meant we could ask, "Have we got all the expected results within a certain timeframe?"
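The same question can be illustrated in plain Python as a stand-in for the streaming SQL, which the article does not reproduce. The log field names and event names below are assumptions.

```python
import json


def stuck_transactions(log_lines, now, timeout_seconds):
    """Return IDs of publishes that started but did not finish in time.

    Each log line is assumed to be a JSON object with the hypothetical fields
    "transaction_id", "event" ("publish_start" or "publish_end") and
    "timestamp" (seconds since the epoch).
    """
    started = {}
    finished = set()
    for line in log_lines:
        event = json.loads(line)
        tid = event["transaction_id"]
        if event["event"] == "publish_start":
            started.setdefault(tid, event["timestamp"])
        elif event["event"] == "publish_end":
            finished.add(tid)
    return sorted(
        tid
        for tid, started_at in started.items()
        if tid not in finished and now - started_at > timeout_seconds
    )
```

Keying everything on the transaction identifier means the monitor needs to know only which events mark the start and end of a flow, not the business logic in between.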

5. Break things and practice

After working through the previous best practices, it may seem you’ve done everything possible to not get called, but what do you do if all else fails and you do get that 3 am call?

"Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions."

We can test how resilient our system is to failure by pulling bits and pieces out or down. This can be done at the system level, but can also be very useful in testing the human element of operational support.

When should you release the chaos monkeys? Everyone is at a different point in the journey of their system’s maturity, so the answer—it depends. However, if you are doing the fairly common journey of migration from a monolith to microservices, then when do you want to start experimenting? You obviously can’t do this when you still have your monolith, because you would take down the whole system. However, I don’t think you have to wait until you are at the very end of your journey to do this style of experimentation. Equally, you don’t need to implement anything automated or elaborate to get value from such experimentation. You can manually take services or VMs down and see how the system reacts.

There are other ways to practice supporting the system, too. On the content platform at the Financial Times, there was a single point of failure at the entry for publishes, as shown below:

A simplified view of the Content Platform at the Financial Times

This single point of failure was definitely something we wanted to fix, but the impact was quite wide-reaching, so we hadn’t managed to allocate any effort to it. It was also something that didn’t cause as much pain as you would expect. In fact, it actually gave us some benefit: when we wanted to release an update to it, we had to fail over to the other region while we released.

So, in effect, we were practicing failover quite regularly. If you rely on a mechanism for urgent support processes, make sure you practice it regularly to ensure it continues to work and also to reduce the fear and uncertainty around using it. Following practices like those below will be your best bet for a good night’s sleep:

Build teams that understand what being on call means, and as they own the system, they also own all the aspects of supporting it.
Ensure that your development process adheres to best practices to reduce the risk of causing calls from standard daytime activities.
Automation of failure recovery is becoming increasingly straightforward to achieve, so utilize it as much as possible.
Deeply understanding your customers' needs means you can react to failure more quickly.
Don’t wait until a catastrophic failure to test your responses—put processes and approaches in place you can use to test every aspect of your support.

Regular practice increases confidence in supporting the system, which in turn increases the confidence to know what to do if you’re called at 3 am.

About the Author

Nicky Wrightson is currently a principal engineer working on the data platform at Skyscanner, the world's travel search engine. Prior to this, Nicky worked as a principal engineer at the Financial Times, a media organisation, where she led the rollout of a new content and metadata platform with a revamped operational model alongside it. She has a passion for driving change, not just in the creation of new cloud native architectures but also in the cultural evolution that must accompany them.
