
Failover Conf Q&A on Building Reliable Systems: People, Process, and Practice

Key Takeaways

  • One of the biggest challenges associated with maintaining or increasing the reliability of a system is knowing where to invest time and energy within an organisation.
  • Continuous delivery automation can provide a foundation to start building confidence around reliability. If engineers can track key deployment metrics, they are in an excellent position to be able to analyze and set service level objectives (SLOs) around systems and processes. 
  • When testing, engineers must understand the system’s architecture, its dependencies, and the expected behaviors of each. Engineers should take on a “fail-fast” mentality, and not be wasteful with resources.
  • Shipping small features with good observability (monitoring etc) and using incremental release strategies, e.g. canary releasing and feature flagging, can help maintain reliability and limit the blast radius of issues.
  • Access to operational data is critical for systems of record workloads. Increased reliability can be implemented via load balancers, active-passive systems, and backup/restore mechanisms. This can also be implemented using an active-active approach.
  • In order to effectively deal with failover events, focus should be placed on the practice of handling failure (via "game days" or "fire drills"), building trust, and establishing routine communications and transparency among responsible parties/stakeholders.
  • Incident retrospectives and post mortems are an opportunity both to improve the systems that failed and your response process.

One of the biggest engineering challenges associated with maintaining or increasing the reliability of a system is knowing where to invest time and energy. InfoQ recently sat down with several engineers and technical leaders that are involved with the upcoming Failover Conf virtual event, and asked their opinion on the best practices for building and running reliable systems.

The COVID-19 pandemic has negatively impacted many lives and economies, and has also disrupted the running of many in-person events. The Gremlin team assembled Failover Conf with the goal of allowing people across the globe to come together virtually and share experiences and ideas related to reliability.

This year we expected the opportunity to gather in person and share our knowledge and experiences of building production systems with one another. Then the unexpected happened, forcing many events to cancel or postpone. But we're resilient: when one opportunity falls through, we "failover" to another.

The following virtual Q&A includes a discussion of the relationship between reliability and a variety of topics, such as continuous integration, testing, releasing features, data storage, and organisational processes and human interactions.

Responses have been lightly edited for clarity and brevity.

InfoQ: What is the biggest challenge associated with the topic of reliability within IT?

Angel Rivera: The biggest challenge associated with reliability in IT is building and maintaining system resiliency. Teams are under immense pressure to ensure that uptime is adequately met, especially those in automation, communication, entertainment, and emergency response services. The systems that provide these services must be designed to recover from unexpected failures with minimal impact on performance and consistency.

Tiffany Jachja: One of the most significant challenges around reliability within IT today is achieving reliability at scale. Over the past few years, cloud providers have focused on enabling delivery at enterprise scale. So today, it's not about whether we can do something a certain way, but about how we do it in a repeatable and sustainable fashion. It's not "can we deploy a serverless function?" or "can we deploy a service onto Kubernetes?"; it's about how, within complex landscapes, we build that confidence within our IT organizations.

Heidi Waterhouse: The hardest part of reliability in IT is that it's a moving target. If we never changed anything in our systems, reliability wouldn't be very hard, but we have to balance the need for progress with the need for stability.

Jim Walker: Ultimately, it really comes down to what is your weakest link. Reliability is much like security in this way but also very different. With reliability, I've always felt the simpler the better. Reducing the number of things that will fail eliminates some risk.  With security, you typically add layers to make sure nothing gets through. For Cockroach Labs, it is all about building redundancy for your database so you can get resilience without having to add layers of redundancy to mitigate risk. We work to simplify this challenge for your application architectures.

Laura Hofmann: The biggest challenge associated with the topic of reliability is knowing where to invest your time and energies. We’re never ‘done’ making a system reliable, so how do we know what components are most critical? Where will we get the highest ROI? Furthermore, how do we decide that a system is reliable enough? To answer that last question, set recovery time and recovery point objectives (RTOs and RPOs) and let yourself be guided by them. Based on those metrics, decide where you should be investing your time.

To decide where to start improving the overall reliability of your system, you need to know how all of the components interact, and identify the most critical components and bottlenecks. You can spend all of your time making a database reliable, but that won’t matter if it sits behind a heavily used but unreliable caching layer. Dependency graphs are great for visualising how the components of your service fit together and will allow you to identify the places where you will reap the biggest reliability rewards. The challenge here is that dependency graphs get stale ridiculously quickly unless they are automated.
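The idea of using a dependency graph to find the highest-leverage reliability investment can be sketched in a few lines. The service graph below is entirely hypothetical; the point is that the component with the most transitive dependents (here, the database behind the caching layer) is where a failure hurts the most:

```python
from collections import defaultdict

# Hypothetical service dependency graph: each key depends on the listed services.
DEPENDENCIES = {
    "web-frontend": ["api", "cache"],
    "api": ["cache", "database"],
    "cache": ["database"],
    "reporting": ["database"],
    "database": [],
}

def transitive_dependents(deps):
    """Count how many services depend, directly or indirectly, on each service."""
    # Invert the graph: service -> services that depend on it directly.
    dependents = defaultdict(set)
    for svc, needed in deps.items():
        for d in needed:
            dependents[d].add(svc)

    def reach(svc, seen):
        for parent in dependents[svc]:
            if parent not in seen:
                seen.add(parent)
                reach(parent, seen)
        return seen

    return {svc: len(reach(svc, set())) for svc in deps}

counts = transitive_dependents(DEPENDENCIES)
# The service with the most transitive dependents is often the place
# where reliability work pays off the most.
most_critical = max(counts, key=counts.get)
print(most_critical, counts)
```

In this toy graph every other service ultimately depends on the database, which matches Hofmann's point: hardening the cache alone would not help if the database behind it is the true bottleneck. Keeping such a graph automated, rather than hand-drawn, is what stops it from going stale.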

Dave Nielsen: These are all great answers. In addition, I’ll mention that data is naturally stateful, and constantly changing.  When operating databases at high scale, backups become out of date the moment you make them. Keeping a live copy of your data on another server may be your best option. Some call this “active-active”.

InfoQ: How can continuous integration (CI) and continuous delivery (CD) impact or help reliability?

Rivera: CI/CD enables teams to build and test code with every iteration and provides immediate feedback on code changes. This feedback provides precise details on failures that developers can use to quickly resolve issues. Developers can also build new tests for previously unknown bugs that could negatively impact system resiliency in production.

Jachja: CI/CD automation gives you a foundation to start building that confidence around reliability. If you can track key deployment metrics (deployment frequency, change failure, lead time to change and mean time to restore), you’re in an excellent position to be able to analyze and set service level objectives (SLOs) around your systems and processes. 

A poor change failure rate and infrequent deployment frequency could mean that you don’t have enough quality gate or unit tests, or maybe that your deployments are too big. These metrics level the playing field. Anyone with a software delivery stake can look at these metrics and say: “Hey, in this iteration, we need to improve our time to production by X amount and ensure we have enough code coverage and tests to validate our systems. Let’s have 20% of our work this sprint be around, ensuring this happens.” 

We know that delivery teams are cross-functional; they have to be. CI/CD makes it possible for anyone to understand the results of the process and call out the next iteration of improvements.
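As an illustration of the deployment metrics Jachja mentions, here is a minimal sketch that computes deployment frequency and change failure rate from a deployment log. The log entries and dates are made up; a real pipeline would pull these from its CI/CD tooling:

```python
from datetime import date

# Hypothetical deployment log for one service: (date, succeeded) pairs.
# A False entry represents a change that required a rollback or hotfix.
deployments = [
    (date(2020, 4, 1), True),
    (date(2020, 4, 3), False),
    (date(2020, 4, 3), True),
    (date(2020, 4, 8), True),
    (date(2020, 4, 10), True),
]

days_observed = (deployments[-1][0] - deployments[0][0]).days or 1
deployment_frequency = len(deployments) / days_observed          # deploys per day
change_failure_rate = sum(1 for _, ok in deployments if not ok) / len(deployments)

print(f"{deployment_frequency:.2f} deploys/day, "
      f"{change_failure_rate:.0%} change failure rate")
```

Numbers like these are what make the conversation Jachja describes possible: anyone with a stake in delivery can look at a 20% change failure rate and argue for investing a slice of the next sprint in tests or smaller deployments.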

InfoQ: What’s the most effective way to codify quality tests within a delivery pipeline?

Rivera: First, I believe you have to understand the system's architecture, its dependencies, and the expected behaviors of each. Once those are captured, teams can implement testing strategies such as smoke testing, load testing, fuzz testing, and security/compliance tests, which can be executed automatically within the preferred CI/CD tooling.

Jachja: Realize that there are all kinds of quality tests that require different types of setup and cost to run. Integration tests and end-to-end testing can be costly in terms of configuration, maintenance, resources, and creation. If you can test a component of your system with a unit test, do so. This minimizes your lead time, and you can catch these issues earlier on within your pipeline. You don’t want to deploy an application into the QA environment, run a series of tests, fail, then get the results a few hours later and realize a unit test could have caught that locally on your dev machine. 

Take on a “fail-fast” mentality, and don’t be wasteful with resources. If you can track metrics around your test suite, do so. Static code analysis can help your development teams avoid code quality regressions, and it’s fairly low effort to implement. You’ll probably notice if there are duplicate tests across your different types of tests when you start to consider tests as an ecosystem.
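The fail-fast ordering Jachja describes can be sketched as a pipeline that runs the cheapest checks first and stops at the first failure, so expensive integration stages never run on code a unit test would have rejected. The stage functions here are stand-ins, not a real CI configuration:

```python
# Fail-fast pipeline sketch: cheapest stages first, stop on first failure.
# Each stage function is a placeholder for a real check.

def lint():        return True   # static analysis: cheap, runs in seconds
def unit_tests():  return False  # pretend a unit test fails here
def integration(): return True   # expensive: environments, data, long runtimes

STAGES = [("lint", lint), ("unit", unit_tests), ("integration", integration)]

def run_pipeline(stages):
    for name, check in stages:
        if not check():
            return f"failed at {name}"   # later, costlier stages are skipped
    return "passed"

print(run_pipeline(STAGES))  # the integration stage never runs
```

The ordering is the whole point: the failure surfaces minutes after the commit, on the developer's machine or an early pipeline stage, rather than hours later in a QA environment.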

InfoQ: In modern software systems, how does release differ from deploy?

Waterhouse: Deployment is the act of getting code to its destination. It's a technical process and belongs to the technical team. Release is a business-value decision and should be decoupled from deployment. 

A team could deploy hundreds of times before the cumulative value of the changes is enough for the product/business team to trigger a release.

Hofmann: Deployment is about getting code out safely. Releasing is about getting new functionality in front of customers. The difference lies between pushing out a new code path, and actually activating it. 

If all code paths you deploy are active from the moment you start serving traffic, then deploying and releasing happen as part of the same process. And, depending on what your needs for a given service are, doing that might be an a-okay way of releasing! Is it in production? Do you care if the service breaks? How quickly can you fix an issue or shift traffic back to an older version? 

Deploy and release can happen at the same time, but they don’t have to. For the times that you do care, feature flags are an essential tool for managing what you want to release, when, and to whom. 

At scale, you could feasibly put every change you make behind its own feature flag and deploy constantly without your customers ever knowing. Doing that is liberating because it allows you to treat your deployment and release strategies as entirely independent problems. Better yet, it allows someone else, say, the product team, to have full agency and control over when to release each of those changes without requiring input from engineering.
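The decoupling Hofmann describes can be illustrated with a hand-rolled flag check. This is a deliberately minimal sketch; the flag store, flag name, and user IDs are hypothetical, and in practice a flag service such as LaunchDarkly would manage this state:

```python
# Minimal feature-flag sketch. Both code paths below are deployed;
# the flag decides which one is actually "released" to a given user.

FLAGS = {
    "new-checkout-flow": {"enabled": True, "allowed_users": {"beta-tester-1"}},
}

def flag_enabled(flag_name, user_id):
    flag = FLAGS.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False
    # An empty allow-list means the feature is released to everyone.
    return not flag["allowed_users"] or user_id in flag["allowed_users"]

def checkout(user_id):
    if flag_enabled("new-checkout-flow", user_id):
        return "new checkout"
    return "old checkout"

print(checkout("beta-tester-1"))  # new path: released to beta users only
print(checkout("someone-else"))   # old path: deployed but not released
```

Flipping `enabled` or widening `allowed_users` is a release; no deployment is involved, which is exactly what lets a product team control the release without engineering input.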

InfoQ: What strategies can engineers use to mitigate the risk of release failure?

Waterhouse: Engineers can reduce release risk by paying attention to both risk reduction and harm mitigation. Risk reduction includes testing in production, and engineering for each change to be individually controlled. Harm mitigation includes strategies like canary deployments, limiting blast radius, and enabling kill switches and rapid rollbacks.

Hofmann: If you manage your releases separately from your deployments, release failure is a good thing! There's no better way to learn and iterate quickly, as long as you have the observability to know when a feature or component is working as expected (including whether or not your customers are converting), and the tools to fail back quickly. Again, feature flags, canaries and gradual rollouts are all part of your release management arsenal. Decide what your threshold of risk is and let that determine your release strategy.

Feature management is risk management, and the level of risk will vary depending on the change you are releasing. Are there times when you have greater traffic volume? If you're an e-commerce company, maybe Black Friday isn't the best time to roll out a change to your payment page. What audiences do you start with? Do you have an identified group of beta users? Are you particularly interested in certain platforms, browsers, geographies?

Once you've established that, how quickly do you want to roll out the change in question? Is it the kind of thing where you would know immediately if something is broken, or do you want to let it sit at a fractional percentage of your traffic volume for a while and just see what happens? Most importantly, if something goes sideways with a well-instrumented feature release, roll it back down to zero, ship a fix, and roll it out again.

Small features behind flags, good monitoring and observability tooling, and fast deployment cycles will all help you become more comfortable with the risk of release failure. When a release fails, it will be one feature and not the whole application. You’ll know about the release failure and have the ability to roll it back and fix it quickly.
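A common way to implement the fractional rollouts described above is to hash each user into a stable bucket, so the same user stays enrolled as the percentage ramps up. This is a sketch under assumed names (`in_rollout`, the `"new-search"` feature), not any particular vendor's API:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percentage: float) -> bool:
    """Deterministically place a user in [0, 100) for gradual rollouts."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000 / 100.0   # stable bucket in [0, 100)
    return bucket < percentage

# Ramping a hypothetical feature from 10% to 50%: each user's bucket is
# fixed, so users enrolled at 10% stay enrolled at 50% and nobody flaps
# between the old and new behavior.
users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if in_rollout(u, "new-search", 10)}
at_50 = {u for u in users if in_rollout(u, "new-search", 50)}
assert at_10 <= at_50
print(len(at_10), len(at_50))
```

Hashing on `feature:user_id` rather than `user_id` alone means different features enroll different slices of the user base, so one unlucky cohort doesn't absorb every experiment at once.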

InfoQ: What role do data stores play in disaster recovery and business continuity (DR/BC) planning?

Walker: Data is everything, especially for online transactional workloads. All the mechanisms put in place are really there to make sure you have access to data, so I would say that they are often the central focus of DR/BC. Access to operational data is critical for systems of record (SoR) workloads and often we will see architects design for this using load balancers, active-passive systems, and backup/restore mechanisms. But all of this is super complex and incredibly expensive to both support and implement. 

Over the past few years, there has been a movement to incorporate some of these capabilities into the database itself. Hadoop and HDFS in particular helped orgs learn that storing replicas of data could help with the availability and survivability of failures. These concepts are now baked into databases like CockroachDB, which not only protect against many failure zones but also ensure global access to data no matter where it is being asked for. 

Nielsen: DR can be implemented using Active-Active, but it is difficult to implement on your own. That is why database vendors need to provide it as part of their product or service, out of the box.

InfoQ: How can engineers build or utilise reliable data store systems? How do they make the storage of data more resilient?

Walker: Data and storage are related but also quite different. Often, we will see organizations implement distributed, redundant storage as well as distributed, redundant databases on top of this. This increases the replication rate of data and further protects from any loss of data.  These layers are fairly straightforward. However, the trick is managing recovery points and times. 

Everything fails and no system is completely safe from this fundamental truth. So, in the event of a failure of a data store or storage layer, we have to think through how long a system will be offline, and how to remedy any loss or any out of sync systems.  

We believe the database should be active-active, and that active-passive architectures are antiquated and risky. Having an always-active database can bring these recovery objectives to near zero and save you from an application or service being unavailable, and it reduces the risk of something even worse: a manual recovery and remediation of data conflicts in a passive backup system.

Nielsen: There are many ways to implement reliable data storage systems. One way is to use built-in redundancy provided by the database and service providers. Another is testing, testing, testing...

Automated testing is an important part of reliability. Introducing random and catastrophic failure as part of your automatic testing is also important because failure is often unpredictable. Chaos engineering, a term coined by Netflix, is a popular technique to test resilience, especially in production.
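The kind of randomized failure injection Nielsen describes can be sketched in a test harness: wrap a dependency so it fails at random, then verify that the retry logic survives. All names here are illustrative, and real chaos engineering injects failure into running systems, not only into tests:

```python
import random

def flaky(call, failure_rate, rng):
    """Return a version of `call` that raises ConnectionError at random."""
    def wrapped():
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return call()
    return wrapped

def fetch():
    return "ok"            # stand-in for a real network call

def fetch_with_retries(call, attempts=5):
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise

rng = random.Random(42)    # seeded so the injected chaos is reproducible
chaotic_fetch = flaky(fetch, failure_rate=0.3, rng=rng)

successes = failures = 0
for _ in range(100):
    try:
        fetch_with_retries(chaotic_fetch)
        successes += 1
    except ConnectionError:
        failures += 1
print(successes, failures)
```

Seeding the random generator keeps the "chaos" reproducible in CI, which matters when a test run needs to be debugged; production chaos experiments deliberately give up that determinism.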

InfoQ: What is the best way to get people and technology working together effectively during a failover event?

Rivera: Ultimately, establishing routine communications and transparency among responsible parties/stakeholders, along with a solid DevOps culture, is critical to dealing effectively with failover events. Establishing these human-related elements is hugely beneficial: it enables everyone involved to leverage technology and automation to capture valuable telemetry, which can surface unexpected behaviors and help efficiently identify root causes and potential solutions for these incidents.

Jachja: I think it’s important to remember the human aspects of chaos. The causes of failover could be entirely out of our control. So it’s essential to rely on your systems and your backup plans. And also realize that if you don’t have the right systems or plans in place, that it’s all the more reason to find out how your people and technologies can start to work together effectively. Tooling and team structure do play a huge role in how we manage chaos, so a big part of that collaborative effectiveness is around how we’ve built a foundation for supporting minor and major events. 

Waterhouse: The best way to get people and technology aligned during a failover incident is to practice. Tools need to serve human needs all the time, but especially when those humans are stressed out and under time pressure. Practicing allows teams to see where they are likely to have problems in advance, and then to build guardrails so the strength of the tool does not become a liability.

Walker: Communication and information are always the key to surviving failure, and this depends not only on how well systems repair themselves, but also on how well they deliver warnings and notifications to the people involved in and responsible for uptime. We work closely with customers to develop a "runbook" for every implementation, as this has become a useful best practice that helps organizations gain insight into these types of events. There are also some very cool commercial solutions that focus on this problem, like PagerDuty, which we see all over the place.

Nielsen: Trust is a key part of preparing for a failover event. One way to develop trust is through fire drills, where each person on a team can see how their teammates contribute to the whole effort. Fire drills also help teams learn about each person's interpersonal style, which can improve communication during a failover event. Of course, the other benefit of fire drills is that they help you perform better under stress, and perhaps even identify a way to avoid the failover to begin with!

Hofmann: First and foremost, have a plan for when things go wrong, and document and practice that plan. All engineers and stakeholders outside of engineering should be familiar with the process before they ever have to put it into action. Equally as important as having and practicing the plan is to continually review and update it. Is it still serving you, and could it be serving you better? Incident retrospectives (postmortems, whatever your org calls them) are an opportunity both to improve the systems that failed and your response process.

If you're looking for some inspiration in the form of incident response rabbit holes during this period of working from home, I really enjoy reading about emergency and incident response in fields outside of engineering and tech. There are a ton of transferable lessons in how groups like FEMA and NASA prepare for and respond to failures. If you're looking for a book recommendation, check out Chris Hadfield's 'An Astronaut's Guide to Life on Earth' to learn just how much space travel can teach you about site reliability engineering.

About the Authors

Angel Rivera started his career as a US Air Force Space Systems Operations specialist at Cape Canaveral AF Station, where he realized his passion for technology and software development. He has extensive experience in the private, public, and military sectors, and his technical experience includes military/space lift operations, technical writing, software development, and SRE/DevOps engineering. He also has a wealth of experience in the defense and federal sectors, including contracting, information systems security, and management.

Tiffany Jachja is a technical evangelist at Harness. She is an advocate for better software delivery, sharing applicable practices, stories, and content around modern technologies. Before joining Harness, Tiffany was a consultant with Red Hat's Consulting practice, where she used her experience to help customers build software applications that live in the cloud.

Heidi Waterhouse is a developer advocate with LaunchDarkly. She delights in working at the intersection of usability, risk reduction, and cutting-edge technology. One of her favorite hobbies is talking to developers about things they already knew but had never thought of that way before. She sews all her conference dresses so that she's sure there is a pocket for the mic.

Jim Walker is a recovering developer turned product marketer who has spent his career in emerging tech. He believes product marketing is a strategic go-to-market function for growth companies, and he helps organizations translate complex concepts into a compelling and effective core narrative and market strategy. The startups where he has held full-time roles include ServGate, Vontu (purchased by Symantec), Initiate Systems (purchased by IBM), Talend (Nasdaq:TLND), Hortonworks (Nasdaq:HDP-CLDR), EverString, CoreOS (purchased by Red Hat/IBM), and his current gig, Cockroach Labs.

Dave Nielsen, as Head of Community & Ecosystem Programs, engages with emerging technologies and open source projects like microservices, serverless, and Kubernetes to bring the magic of Redis to the broader developer community. Dave has extensive ecosystem experience from his years in the early web, cloud, and big data communities. Prior to Redis Labs, Dave led the relationship between Intel's Deep Learning in Apache Spark project and public cloud providers, and at PayPal he helped pioneer web API developer evangelism. Dave is also co-founder of CloudCamp and several Silicon Valley user groups.

Laura Hofmann is a Senior Site Reliability Engineer at Optimizely. In her 3.5 years at Optimizely, she has seen the company grow from an A/B testing tool to a full-fledged experimentation and progressive delivery platform. Her recent focuses include incident response as well as driving towards progressive delivery and trunk-based development in monolithic applications.
