
Can We Trust the Cloud Not to Fail?

Key Takeaways

  • It is crucial for technical decision-makers to understand the specifics of what’s at the core of systems and how they work to provide the promised guarantees.
  • Failure detectors are one of the essential techniques to discover failures in a distributed system. There are different types of failure detectors offering different guarantees, depending on their completeness and accuracy.
  • In practical real-world systems, a lot of failure detectors are implemented using heartbeats and timeouts.
  • To achieve at least weak accuracy, the timeout value should be chosen so that healthy nodes eventually stop receiving false suspicions. This can be done by increasing the timeout value after each false suspicion.
  • Service Fabric is one example of a system implementing failure detection - it uses a lease mechanism, similar to heartbeats. Azure Cosmos DB relies on Service Fabric.

In short, we can’t trust the Cloud to never fail. There are always some underlying components that will fail, restart, or go offline. On the other hand, does it matter that something went wrong if all the workloads are still running successfully? Very likely we’ll be okay with it.

We are used to talking about reliability at a high level, in terms of certain uptime to provide some guarantees for availability or fault tolerance. This is often enough for most decision makers when choosing a technology. But how does the cloud actually provide this availability? I will explore this question in detail in this article.

Let’s get to the very core of it. What causes unavailability? Failures - machine failures, software failures, network failures - the reality of distributed systems. These failures, and our inability to handle them. So how do we detect and handle failures? Decades of research and experiments shaped the way we approach them in modern cloud systems.

I will start with the theory behind failure detection, and then review a couple of real-world examples of how the mechanism works in a real cloud - on Azure. Even though this article includes real-world applications of failure detection within Azure, the same notions could also apply to GCP, AWS, or any other distributed system.

Why should you care?

This is interesting, but why should you care? Customers don’t care how exactly things are implemented under the hood, all they want is for their systems to be up and running. The industry is indeed moving towards creating abstractions and making it much easier for the engineers to work with technologies, ultimately focusing on what needs to be done to solve business problems. As Corey Quinn wrote in his recent article:

"I care about price, and I care about performance. But if you’re checking both of those boxes, then I don’t particularly care whether you’ve replaced the processors themselves with very fast elves with backgrounds in accounting."

This is true for the absolute majority of the end users.

For technical engineering leaders and decision makers, it can be crucial to understand the specifics of what’s at the core of a system and how it works to provide the promised guarantees. Transparency around the internals can provide better insight into the further development of such systems and their future prospects, which is valuable for better long-term decisions and alignment. I gave a keynote talk at O’Reilly Velocity about why this is true, if you are curious to learn more, or you can read a summary here.

Theoretical Tale of Failure Detectors

Unreliable Failure Detectors For Reliable Distributed Systems

The paper by Chandra and Toueg has been groundbreaking for distributed systems research and is a useful source of information on the topic, which I highly recommend reading.

Failure detectors

Failure detectors are one of the essential techniques for discovering node crashes or failures in a cluster in a distributed system. They help processes in a distributed system change their action plan when they face such failures.

For example, failure detectors can help a coordinator node to avoid the unsuccessful attempt of requesting data from a node that crashed. With failure detectors, each node can know if any other nodes in the cluster crashed. Having this information, each node has the power to decide what to do in case of the detected failures of other nodes. For example, instead of contacting the main data node, the coordinator node could decide to contact one of the healthy secondary replicas of the main data node and provide the response to the client.

Failure detectors don’t guarantee the successful execution of client requests. They help nodes in the system to be aware of known crashes of other nodes and avoid continuing the path of failure. Failure detectors collect and provide information about node failures. It’s up to the distributed system logic to decide how to use it. If the data is stored redundantly across several nodes, the coordinator can choose to contact alternative nodes to execute the request. In other cases, there might be failures that could affect enough replicas, then the client request isn’t guaranteed to succeed.
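As an illustrative sketch of the coordinator behavior described above (the function and replica names are hypothetical, not an API from any real system), the routing decision could look like this:

```python
def route_request(replicas, suspected):
    """Pick the first replica that the failure detector does not suspect.

    `replicas` is an ordered list of candidate nodes; `suspected` is the
    set of nodes the failure detector currently believes have crashed."""
    for replica in replicas:
        if replica not in suspected:
            return replica
    # No known-healthy replica is left: the client request may fail.
    return None
```

For example, if the primary data node is suspected, the coordinator would contact the first healthy secondary instead.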

Applications of failure detectors

Many distributed algorithms rely on failure detectors. Even though failure detectors can be implemented as an independent component and used for things like reliable request routing, failure detectors are widely used internally for solving agreement problems, consensus, leader election, atomic broadcast, group membership, and other distributed algorithms.

Failure detectors are important for consensus and can be applied to improve reliability and help distinguish between nodes that have delays in their responses and those that crashed. Consensus algorithms can benefit from using failure detectors that estimate which nodes have crashed, even when the estimate isn’t a hundred percent correct.

Failure detectors can improve atomic broadcast, the algorithm that makes sure messages are processed in the same order by every node in a distributed system. They can also be used to implement group membership algorithms and k-set agreement in asynchronous dynamic networks.

Failure Suspicions

Because of the differences in the nature of the environments our systems run in, there are several types of failure detectors we can use. In a synchronous system, a node can always determine if another node is up or down because there’s no nondeterministic delay in message processing. In an asynchronous system, we can’t make an immediate conclusion that a certain node is down if we didn’t hear from it. What we can do is start suspecting it’s down. This gives the suspected node a chance to prove that it’s up and didn’t actually fail, just taking a bit longer than usual to respond. After we have given the suspected node enough chances to reappear, we can start permanently suspecting it, concluding that the target node is down.

Diagram 1: Suspecting Failures
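The suspicion lifecycle above can be sketched as a small state machine. This is a minimal illustration of the idea; the class names and the "number of chances" parameter are made up for the example, not taken from any real system:

```python
from enum import Enum

class SuspicionState(Enum):
    TRUSTED = "trusted"
    SUSPECTED = "suspected"                        # temporary: the node may still reappear
    PERMANENTLY_SUSPECTED = "permanently_suspected"

class NodeView:
    """One node's view of a single peer in an asynchronous system."""

    def __init__(self, max_chances=3):
        self.state = SuspicionState.TRUSTED
        self.missed = 0
        self.max_chances = max_chances  # chances the peer gets to prove it is alive

    def report_silence(self):
        """Called when no message arrived from the peer within the timeout."""
        self.missed += 1
        if self.missed > self.max_chances:
            self.state = SuspicionState.PERMANENTLY_SUSPECTED
        else:
            self.state = SuspicionState.SUSPECTED

    def report_heartbeat(self):
        """The peer proved it is up; forgive the temporary suspicion."""
        if self.state != SuspicionState.PERMANENTLY_SUSPECTED:
            self.state = SuspicionState.TRUSTED
            self.missed = 0
```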

Properties of a Failure Detector

Can we discover all the failures? How precise can we be in our failure suspicions? There are different types of failure detectors offering different guarantees.

Completeness is a property of a failure detector that helps measure whether some or all of the correctly-functioning nodes discovered each of the failed nodes. What are possible variations of completeness when working with failure detectors? How do we measure whether a failure detector has a strong or weak completeness property?

When every correctly-operating node eventually discovers every single failed node, the failure detector has strong completeness. "Eventually", because a node can first suspect another node of having possibly failed for a while before marking it as permanently suspected.

Strong completeness property means that every node will eventually permanently suspect all the failed nodes. Weak completeness means that some nodes will eventually permanently suspect all the failed nodes.

Diagram 2: Strong Completeness

  • a) All nodes are healthy.
  • b) Nodes 2 and 3 crashed.
  • c) The rest of the nodes eventually notice all the crashed nodes.

Diagram 3: Weak Completeness

  • a) All nodes are healthy.
  • b) Nodes 2 and 3 crashed.
  • c) Node 4 eventually discovers all the crashes.

Note that it’s not enough for a failure detector to be strongly complete to be effective. By definition, a failure detector that immediately suspects every node will have strong completeness, but being oversuspicious doesn’t make it useful in practice. We need to have a way to measure whether the suspicion is true or false. Can a node start suspecting another node that hasn’t crashed yet?

Accuracy is another important property of a failure detector. It shows how precise our suspicions are. In some situations, we might be wrong, incorrectly suspecting that a node is down when, in reality, it is up. The accuracy property defines how big a mistake a failure detector is allowed to make when suspecting the failure of another node.

With strong accuracy, no node should ever be suspected of failure by anyone before its actual crash. Weak accuracy means that some nodes are never incorrectly suspected by anyone. In reality, it is hard to guarantee either strong or weak accuracy, because neither allows a failure detector to ever be mistaken in its suspicion. The name weak accuracy can therefore be misleading. For example, there could be a scenario where a node suspects another node of failure but stops suspecting it after a very short while. This scenario is pretty common in reality; however, it doesn’t satisfy even weak accuracy, since even weak accuracy doesn’t allow an incorrect suspicion.

Strong and weak accuracy properties are sometimes called perpetual accuracy. Failure detectors with perpetual accuracy must be precise in their suspicion from the very first attempt.

Diagram 4: Eventual, Not Perpetual Accuracy

Because the failure detector starts suspecting the failure of a node that is in reality up, and later changes its suspicion, this scenario shows eventual accuracy. Failure detectors with perpetual accuracy won’t allow this scenario.

To allow a failure detector to change its mind and forgive temporarily wrong suspicions, there are two additional kinds of accuracy - eventually strong and eventually weak. In contrast to perpetual accuracy, failure detectors with eventual accuracy are allowed to be initially mistaken in their suspicions. But eventually, after some period of confusion, they should start behaving like failure detectors with strong or weak accuracy properties.

Eventually strong accuracy means a healthy node should never be suspected by anyone, after a possible period of uncertainty. Eventually weak accuracy means that some healthy nodes should never be suspected by anyone, after a possible period of uncertainty.

Diagram 5: Eventually Strong Accuracy

  • a. All nodes are healthy.
  • b. 10 seconds later: nodes 1 and 5 are down and are suspected by the rest of the nodes.
  • c. 10 more seconds later: nodes 1 and 5 are still down and are suspected by the rest of the nodes.
  • d. 10 more seconds later: nodes 1 and 5 are still down and are finally permanently suspected by the rest of the nodes.

Diagram 6: Eventually Weak Accuracy

  • a. All nodes are healthy.
  • b. 10 seconds later: nodes 1 and 5 are down and are suspected by some of the nodes.
  • c. 10 more seconds later: nodes 1 and 5 are still down and are suspected by some of the nodes.
  • d. 10 more seconds later: nodes 1 and 5 are still down and are finally permanently suspected by some of the nodes.

Types Of Failure Detectors

It is important to understand the concepts of completeness and accuracy, as they are the foundation for comprehending classes of failure detectors. Each failure detector type embodies a combination of specific completeness and accuracy properties, resulting in the eight types of failure detectors in T.D. Chandra and S. Toueg’s classification.

  • Perfect Failure Detector: Strong Completeness, Strong Accuracy
  • Eventually Perfect Failure Detector: Strong Completeness, Eventual Strong Accuracy
  • Strong Failure Detector: Strong Completeness, Weak Accuracy
  • Eventually Strong Failure Detector: Strong Completeness, Eventual Weak Accuracy
  • Weak Failure Detector: Weak Completeness, Weak Accuracy
  • Eventually Weak Failure Detector: Weak Completeness, Eventual Weak Accuracy
  • Quasi-Perfect Failure Detector: Weak Completeness, Strong Accuracy
  • Eventually Quasi-Perfect Failure Detector: Weak Completeness, Eventual Strong Accuracy

Failure Detectors In Asynchronous Environment

Each of the failure detector types can be useful in different environments. In a fully asynchronous system, we can’t make any timing assumptions. There is no time bound on when a message should arrive. In reality, we are often dealing with something in between an asynchronous and a synchronous system. We might not know how long it’s going to take for a message to arrive, but we know it should eventually be delivered. We just don’t know when.

Some distributed system problems are proven to be impossible to solve in a fully asynchronous environment in the presence of failures. For example, the FLP impossibility result showed that it’s impossible to solve consensus or atomic broadcast in an asynchronous system with even one crash failure. In such an environment, we can’t reliably tell if a node is down, or just takes longer to respond.

To provide a way to overcome the impossibility result, Chandra and Toueg showed that it is indeed possible to solve consensus in an asynchronous system by adding a failure detection mechanism. They introduced the failure detector types described above, showing that some of the types can be transformed into others. Not only did they create a classification model, they also described failure detector applications for solving consensus and showed which failure detectors are minimally sufficient for solving consensus problems. They proved that the Eventually Weak Failure Detector is the minimally sufficient type for solving consensus in an asynchronous system with a majority of correct nodes, i.e., tolerating f < N/2 failures. They also proved that the Weak Failure Detector is the minimally sufficient type for solving consensus in such a system with any number of failures, f ≤ N - 1.

From Theory To Practice

What does this mean for the distributed systems we are building in reality? In practical real-world systems, a lot of failure detectors are implemented using heartbeats and timeouts. For example, a node can regularly send a message to everyone sharing that it’s alive. If another node doesn’t receive a heartbeat message from its neighbor for longer than a specified timeout, it adds that node to its suspected list. If the suspected node shows up and sends a heartbeat again, it is removed from the suspected list.

The completeness and accuracy properties of this timeout based failure detector depend on the chosen timeout value. If the timeout is too low, nodes that are sometimes responding slower than the timeout will alternate between being added and removed from the suspected list infinitely many times. Then, the completeness property of such a failure detector is weak, and the accuracy property is even weaker than eventual weak accuracy.

To achieve at least weak accuracy, we need to find a timeout value after which a node no longer gets falsely suspected. We can do this by increasing the timeout value after each false suspicion. Eventually, the timeout should grow big enough for nodes to deliver their heartbeats in time. In practice, this puts the described failure detector with an increasing-timeout algorithm in the range of Eventually Weak Failure Detectors.
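A minimal sketch of such a heartbeat-and-timeout detector with an increasing timeout might look like the following. The class, its parameters, and the doubling backoff are illustrative assumptions for the example, not code from any real system:

```python
import time

class HeartbeatDetector:
    """Timeout-based failure detector sketch: each false suspicion grows
    the timeout, so a slow-but-alive node eventually stops being suspected
    (the eventually-weak-accuracy behavior described in the text)."""

    def __init__(self, initial_timeout=1.0, backoff=2.0):
        self.timeout = initial_timeout
        self.backoff = backoff
        self.last_heartbeat = {}  # node id -> time of its last heartbeat
        self.suspected = set()

    def on_heartbeat(self, node, now=None):
        now = now if now is not None else time.monotonic()
        self.last_heartbeat[node] = now
        if node in self.suspected:
            # False suspicion: the node was alive after all.
            self.suspected.discard(node)
            self.timeout *= self.backoff  # give more slack next time

    def check(self, now=None):
        """Add every node silent for longer than the timeout to the suspected list."""
        now = now if now is not None else time.monotonic()
        for node, last in self.last_heartbeat.items():
            if now - last > self.timeout:
                self.suspected.add(node)
        return set(self.suspected)
```

With the timeout starting at one second, a node that falls silent for two seconds is suspected; once it reappears, the timeout doubles, so the same delay no longer triggers a suspicion.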

Reducing, or transforming, a failure detector algorithm with one set of completeness and accuracy properties into a failure detector algorithm with another set of such properties means finding a reduction algorithm that complements the original failure detector and guarantees that it behaves the same way as the target failure detector in the same environment, given the same failure patterns. This concept is formally called reducibility of failure detectors.

This matters because, in reality, it can be difficult to implement strongly complete failure detectors in asynchronous systems. In this sense, the original timeout-based failure detector described earlier was reduced, or transformed, into an Eventually Weak Failure Detector by using increasing timeouts. Chandra and Toueg also showed that failure detectors with weak completeness can be transformed into failure detectors with strong completeness.

Detecting Failures in the Wild (Cloud)

How does one actually implement and apply the concept of failure detection in distributed systems that other systems and components rely on? This is something that generally stays behind the scenes, but it is a powerful foundation for distributed systems to operate reliably and stay available for the end users. Let’s dive into one of the biggest distributed systems in the world, the Azure cloud, as an example. Internally, Azure ultimately relies on a system called Service Fabric. Like Kubernetes, Service Fabric isn’t limited to running only on Azure, and can be deployed anywhere.

Service Fabric

Service Fabric is a distributed systems platform developed at Microsoft for building highly available, scalable, and performant microservice applications in the cloud, as the ACM publication defines it. It is used for services like Azure SQL, Azure Cosmos DB, Azure Event Hubs, Service Bus, Event Grid, etc., as well as the core control plane of Azure.

According to Matthew Snider, who has been working on Service Fabric as its first technical Program Manager for more than a decade:

"when people ask what is the core replication or consensus algorithm - when in the raft paper it’s mentioned that a certain optimization is left out - Service Fabric has it. It’s a fighter jet that you don’t need to take to go to the grocery store". 

Service Fabric is a system that serves a specific purpose: critical workloads with high availability and data safety requirements, used for building other complex systems. Clemens Vasters, Principal Architect and Strategist behind Azure Messaging, wrote that Service Fabric is an advanced system and an integral backbone behind many Azure cloud services.

Service Fabric defines the Federation Subsystem, based on a Distributed Hash Table (DHT), responsible for things like failure detection and routing. The Service Fabric Reliability Subsystem works on top of the Federation Subsystem to provide leader election, replication, availability, orchestration, and reliability.

Diagram 7: Service Fabric

Credit: "Service fabric: a distributed platform for building microservices in the cloud"

In Service Fabric, nodes are organized into a consistent, scalable cluster of machines - a virtual ring. A ring is bootstrapped by seed nodes, distributed across hardware fault boundaries. When joining a cluster, a new node contacts the seed nodes and counts how many acknowledgements it receives in return. If the number is at least a majority of the seed nodes, the new node joins the neighborhood. There isn’t a strong connection between every single node. A Distributed Hash Table (DHT) on top of the ring gives nodes identities for addressing, routing, and keyspace assignment. Each node keeps track of several of its immediate successor and predecessor nodes in the ring - called the neighborhood. Neighbors are used to run Service Fabric’s membership and failure detection protocol.

The failure detection protocol uses a lease mechanism, similar to heartbeats. Service Fabric uses indirect leases: instead of relying only on direct neighbors for the leases, it uses a few extra connections between a given node and other nodes in the neighborhood set. Leases are established at the machine level in Service Fabric. The arbitration process happens at the seed nodes, which are responsible for deciding what to do with nodes that other nodes suspect to have failed. If a node gets an acknowledgement back from a majority of the arbitration group nodes, it can stay in the cluster. It is possible for two nodes to both get a majority and stay. If a node loses arbitration, it just leaves.
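The majority check at the heart of arbitration can be sketched in a few lines. This is only an illustration of the idea described above, not Service Fabric's actual implementation:

```python
def survives_arbitration(acks_received, arbitration_group_size):
    """A node whose lease was lost asks the arbitration (seed) nodes
    whether it may stay; it survives only with acknowledgements from a
    majority of the arbitration group."""
    majority = arbitration_group_size // 2 + 1
    return acks_received >= majority
```

With five seed nodes, three acknowledgements are enough to stay; two are not, and the node leaves the cluster.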

When a Service Fabric cluster encounters partitions, leases between many nodes get lost. In this case, the side that has the most seed nodes will stay up, and the other side of the partition will go down. The nodes agree, based on the arbitration mechanism, without a consensus protocol. In Service Fabric, membership, communication, and failure detection are separated from leader election and replication. This way, membership and failure detection can be fully distributed, without central points of failure and consensus requirements.

In the Reliability Subsystem, Service Fabric maintains multiple organizational components, for example, the failover manager and reconfiguration agents. A configuration defines what a reliable arrangement of the instances of a service looks like. Reconfiguration is the process of moving from one configuration to another. The Failover Manager’s job is to initiate and drive reconfigurations and work with individual reconfiguration agents on nodes to ensure services stay in their desired state.

If a process crashes, the crash itself is identifiable. Things get hard when the process experiences thread exhaustion or heap contention, when something else changes in the flow of the application, or when malformed messages are returned, the process freezes, halts, etc. How do we know if the process is just slow, or if it is dead (the Byzantine Generals Problem)?

In Service Fabric, there isn’t a lease on individual processes, just a lease with the machines. Service Fabric provides an API for services to indicate that they are dead. A thread that is determining whether a process is alive or dead might itself be broken or stuck; that is why the core failure detection services in Service Fabric are kernel drivers that won’t get preempted.

If there is a problem with a service that crashes Node A, we don’t always want to automatically transfer everything to Node B, because it might cause the same error and just move the problem instead of solving it. With failure detectors, a node can be confidently marked as crashed if the machine itself is unresponsive. But with services, it’s more important to understand the underlying reason for the error to understand what the right action is. If health errors are encountered while an upgrade is in progress, the upgrade will roll back.

Service Fabric also supports health probes, where we can get an instruction to execute every few seconds, and if there isn’t a response, the service will be marked as yellow or red, and then restarted. The system itself says: "as far as we know the machines are here, as far as we know the core processes are here, heartbeats are successful, you’ve got some problem - figure it out." Health probes have a way to identify the status, but Service Fabric won’t automatically act unless the engineer tells it to.

Service Fabric can run outside of a cloud infrastructure, outside of Azure. However, when running in the cloud, where critical cloud services like Cosmos DB or Event Hubs rely on Service Fabric behind the scenes, what happens if Azure infrastructure decides to deallocate a VM? Azure generates repair notifications for infrastructure changes, and Service Fabric has the ability to push back on them. It can stall the repair operation and gracefully move or shut down the affected nodes in order to make the change safe.

In sensitive environments like Cosmos DB and SQL, Service Fabric can push back on Azure infrastructure changes for as long as necessary: it can keep saying "no, you can’t shut down this VM" every 15 seconds, indefinitely. For patch or upgrade maintenance, Service Fabric can hold the change up for a certain time according to different durability tiers per upgrade domain. This way, Azure configuration and infrastructure changes don’t affect critical services that rely on Service Fabric.

With other platforms (other clouds or standalone platforms), infrastructure integration can be used to deal with underlying changes. The Service Fabric Repair Manager is where we can say: I’d like to format or reboot this machine - take the infrastructure signals, watch for them, and call the Repair Manager saying that the machine will be rebooted. Service Fabric will handle the rest.

Numerous services on Azure rely on Service Fabric for membership, failure detection, consensus, and more. Azure Cosmos DB is one of them.

Azure Cosmos DB

Azure Cosmos DB is a distributed database. Cosmos DB has a notion of containers - the fundamental units of scalability. Containers consist of partitions, which are composed of replica sets - one replica set per region per partition. Each replica set is composed of individual database replicas. Today, Cosmos DB uses a replication factor of four with a dynamic quorum. OS and DB software upgrades are rolled out one replica at a time, during which the quorum is downshifted to be based on three online replicas to preserve fault tolerance.
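The dynamic quorum arithmetic can be illustrated with a simple majority calculation. This is a sketch of the idea, not Cosmos DB internals:

```python
def write_quorum(online_replicas):
    """Majority quorum over the currently online replicas: 3 of 4 in
    normal operation, downshifted to 2 of 3 while one replica is being
    upgraded, preserving fault tolerance during the rollout."""
    return online_replicas // 2 + 1
```

So a replica set with all four replicas online needs three acknowledgements to commit a write, and a set temporarily reduced to three replicas needs two.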

Diagram 8: Cosmos DB

Credit: "Global data distribution with Azure Cosmos DB - under the hood"

Within a replica set, Cosmos DB heavily leverages Service Fabric for orchestrating replicas - adding and removing replicas, performing leader election, and other operations. Cosmos DB relies on Service Fabric for managing local replica-set configurations. Service Fabric provides a mechanism to detect the failure of the nodes; however, it may not necessarily detect the failure of Cosmos DB itself. Any change in local configuration is governed by a strongly consistent central authority. This way there isn’t a risk of a split-brain situation where some replicas of the same replica set believe they exist in different configurations. Cosmos DB can also supply Service Fabric with health information to help it detect failures that it cannot detect out of the box. Service Fabric is the central authority that deals with failure signals and manages configuration in response.

For regional outages, Cosmos DB implements a different mechanism. Because Service Fabric is confined to one federation in one region, every federation runs an independent copy of Service Fabric, which prevents Cosmos DB from using it to detect regional outages. Instead, Cosmos DB has an automatic failover mechanism, which will detect a regional outage and cause a global reconfiguration.

In Cosmos DB there is a central replica set, called a hub or write region, which replicates data to the rest of the regions. Depending on the consistency level, the global replication mechanism may act a bit differently, but the overall goal is to deliver the data to all regions. In strong consistency mode, this requires synchronous delivery: all regions must acknowledge receiving and replicating data within their replica sets. Other consistency modes allow a bit more slack, and some regions may trail behind a bit. The failure detection problem here is twofold: detecting regional failures, and detecting the failure of the central replica-set hub that drives replication to the regions.

Detecting regional failures actually happens at the hub. For example, in strong consistency mode Cosmos DB runs heartbeat messages and regional leases to detect any problems. If a region is not responsive, its lease is going to expire. After lease expiration, all read operations halt in that region, and the central replica-set hub can temporarily remove the failed region from the global replication quorum. Thus, in strong consistency, we do lose availability for new writes for longer than the duration of a lease, but reads remain available. Strong consistency mode is the most interesting, as in other modes we can mask some failures and slowdowns, but in strong consistency it is an all-or-nothing kind of operation.

For detecting the failure of the central replica-set hub, Cosmos DB has a decentralized failure detector that runs off the heartbeat messages. When regions do not see the heartbeat for too long, they suspect a hub failure. To make sure this failure is globally observable, Cosmos DB initiates a voting phase in which the regions exchange their observations of the hub’s health status. The voting is initiated by any region that suspects the hub to be down. There can be multiple voting phases running at the same time, initiated by multiple regions, because Cosmos DB knows to react to only one of them later. Cosmos DB tries to limit concurrent voting phases for the same configuration epoch by staggering the vote phase start at different delays in different regions. If a quorum of regions agrees that the hub is down, hub failover is initiated. The failover is coordinated from outside of any replica set, in a more centralized management service, to make sure there are no duplicate failover attempts and there are additional safety checks.
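The voting phase described above can be sketched as a simple quorum count over the regions' health observations of the hub. This is a hypothetical illustration, not the actual Cosmos DB implementation:

```python
def should_fail_over(observations, total_regions):
    """Regions vote with their local observation of the hub's health
    (True = hub looks healthy, False = hub looks down); a hub failover
    is initiated only if a majority of all regions agrees it is down."""
    down_votes = sum(1 for healthy in observations.values() if not healthy)
    return down_votes > total_regions // 2
```

So with three regions, two of them observing a silent hub is enough to trigger the failover, while a single suspicious region is not.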

In practice, Cosmos DB uses an Eventually Weak Failure Detector.

  • Weak Completeness – Eventually every process that crashes is permanently suspected by some correct process.

    For Cosmos DB, some means more than half. This is based on the assumption that no more than half of the regions will fail or be partitioned at the same time. It also avoids costly failure-recovery operations when detecting failures across Azure regions over weak WAN links.

  • Eventual Weak Accuracy – There is a time after which some correct process is never suspected by any correct process.

    The eventual weak accuracy property makes a timeout-based lease mechanism possible for Cosmos DB to preserve consistency when removing a suspected region from the write quorum. The lease does not need to assume a shared wall clock, only that a time duration is measured the same in every region, which can usually be satisfied with CPU ticks at coarse granularity.

When a failure is suspected, there are two types of server-side recovery:

  • Failure in a read region: the write region waits out the lease time, so it’s safe to remove the failed region from the write quorum, unless the quorum size is already less than half.
  • Failure of the write region: triggers a forced failover and elects a new read region as the next write region.

There are other types of failure recovery in Cosmos DB as well, such as client-side redirection. Client-side recovery mechanisms can occur much faster than server-side failover; this is a big part of what enables the improvement from four nines of availability to five nines (allowing a very small window for downtime). Server-side failovers take time to both detect the failure and orchestrate the failover. In an active-active Cosmos DB topology, if one region goes down, requests can simply be re-routed to another active region. This can happen immediately on the client, upon seeing the initial timeouts - well before the server has detected there is an issue.

Upon initial startup, the Cosmos DB SDK client fetches a list of the available regional endpoints for each active region. This list is subsequently refreshed using a heartbeat mechanism to keep it up to date. The SDK’s internal retry mechanism increments a shared in-memory counter on a number of failure conditions, such as timeouts and server-side failures like 500-type errors. If the number of failures exceeds a threshold within a short time window, the SDK determines that its requests cannot reach the desired region - which can happen for a number of reasons beyond server-side failures, such as an issue with the intermediate network. The requests then begin to redirect to an alternative region, according to a priority list based on spatial locality.
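A rough sketch of such a client-side redirection mechanism - counting recent failures and switching to the next region in a priority list once a threshold is crossed - might look like this. All names, thresholds, and window sizes here are illustrative assumptions, not the Cosmos DB SDK:

```python
import time

class RegionalRouter:
    """Client-side redirection sketch: track failure timestamps in a
    sliding window and, past a threshold, route requests to the next
    region in a locality-ordered priority list."""

    def __init__(self, regions, threshold=5, window=30.0):
        self.regions = list(regions)  # ordered by spatial locality
        self.threshold = threshold
        self.window = window          # sliding window, in seconds
        self.failures = []            # timestamps of recent failures

    def record_failure(self, now=None):
        """Called on a timeout or a 500-type server-side error."""
        now = now if now is not None else time.monotonic()
        self.failures.append(now)
        # Keep only failures inside the sliding window.
        self.failures = [t for t in self.failures if now - t <= self.window]

    def current_region(self, now=None):
        now = now if now is not None else time.monotonic()
        recent = [t for t in self.failures if now - t <= self.window]
        if len(recent) >= self.threshold and len(self.regions) > 1:
            return self.regions[1]  # redirect to the next region by priority
        return self.regions[0]
```

For example, after three timeouts within the window (with a threshold of three), a client configured with `["westus", "eastus"]` starts sending requests to `eastus` without waiting for any server-side failover.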

As for how often failure detection is triggered in Cosmos DB: in a random 24-hour period, across all multi-region strong consistency accounts of Cosmos DB in all regions, there were 278 suspected failures due to missing heartbeats, and 47 of them actually resulted in a temporarily revoked lease.


Distributed systems aren’t straightforward and there are a lot of moving pieces in play to ensure their availability to the end-user. I hope this article and the examples helped you better understand how distributed systems in the cloud work to overcome failures.

If this was useful, follow @lenadroid on Twitter for more insights on cloud, architecture, data engineering, distributed systems, and machine learning topics.

About the Author

Lena Hall is a Director of Engineering at Microsoft working on Azure, where she focuses on large-scale distributed systems and modern architectures. She is leading a team and technical strategy for product improvement efforts across Big Data services at Microsoft. Lena is the driver behind engineering initiatives and strategies to advance, facilitate and push forward further acceleration of cloud services. Lena has more than 10 years of experience in solution architecture and software engineering with a focus on distributed cloud programming, real-time system design, highly scalable and performant systems, big data analysis, data science, functional programming, and machine learning. Previously, she was a Senior Software Engineer at Microsoft Research. She co-organizes a conference called ML4ALL, and is often an invited member of program committees for conferences like Kafka Summit, Lambda World, and others. Lena holds a master’s degree in computer science. Twitter: @lenadroid. LinkedIn: Lena Hall.


