Multi-Region Data Streaming with Redpanda

Summary

Michał Maślanka introduces the design of Redpanda’s Multi-Region feature, and describes how they leveraged Raft’s properties, a constraint solver, automatic data balancing, and tiered storage.

Bio

Michał Maślanka has 10+ years of experience in software engineering across different industries, with focus on distributed systems. He joined Redpanda in 2019 where he is one of the primary contributors to Redpanda core. Michal is currently responsible for Redpanda's Raft implementation and cluster orchestration bits.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Maslanka: I would like to talk about multi-region streaming with Redpanda. This is a photo of a modern data center cooling system. Those large towers with lots of small fans are called chillers. Those devices cool down the warm water running through those black pipes. This water is then used to keep server temperatures low. Last year, a record summer heatwave here in the UK caused a major outage in Google and Oracle data center cooling systems. The temperature in the buildings went up, causing a subset of the computing infrastructure to go into a protective shutdown. This issue led to almost 13 hours of outage in the Google EU-West2 data center. For an average business in 2022, a minute of unexpected cloud outage cost $5,800. For more than 44% of enterprises, an hour-long, unexpected cloud outage may cost more than a million dollars. This is why we need to focus on creating reliable applications that are able to survive major cloud outages. My name is Michal Maslanka. I am an engineer at Redpanda Data. On a daily basis, I deal with the replication layer in Redpanda, the Raft protocol, and the parts that orchestrate single nodes to form a cluster and communicate with each other.

Outline

This talk is going to be divided into two parts. First, we'll be more general and talk about multi-region fundamentals. We'll look at the benefits and the challenges. We'll talk about the tradeoffs that have to be made when deploying an application in multiple regions. In the next part, I would love to share with you the experience that we have with adding multi-region capabilities to Redpanda. We will go over Redpanda multi-region internals. We'll take a look at the flexible deployment options that we try to provide. I will show you a little bit of our future plans. The key takeaways that I see from this talk are that multi-region is the new table stakes nowadays, as organizations scale and depend more on their key infrastructure. Tradeoffs must be made when deploying distributed applications. The tradeoffs should be driven carefully by requirements and business needs. Also, I would like to highlight that Redpanda can be seen as a flexible multi-region streaming data platform.

Multi-Region Fundamentals - Benefits

Let's talk about the fundamentals. You're probably wondering why those images with red pandas are here? I wanted to find a metaphor for the conflict between benefits and challenges. After the conflict is done, we are finding the trail. One of the primary benefits of multi-region cloud deployments is increased resilience and disaster recovery capabilities. By having an application deployed in multiple regions, in geographically distributed data centers, we can ensure that even if one data center is down because of a major cloud outage or even a natural disaster, the application will still be available. Routing and load balancing will redirect your clients' traffic to a region that is operating. This way, you will be able to provide undisturbed service to your customers. Another important benefit of multi-region is the ability to scale applications globally. This has two major parts. One is reduced latency. The second one is increased scalability and better performance. Obviously, reduced latency comes from the fact that your data and your application are closer to your client. The lower the latency, the higher customer satisfaction usually is, especially when it comes to web services. We never want to wait long. By having your data deployed close to the client, the client can access it fast. The other part is scalability. By having your application in multiple data centers, you can sustain traffic peaks better, and the risk of performance problems is reduced.

Data sovereignty is another important factor. Data sovereignty means that data is subject to the laws and regulations of the geographic location where it is collected and processed. It is a country-specific set of rules stating that data must remain within the borders of the jurisdiction where it originated. Many countries require that some vulnerable customer-related data resides within a certain geographical area. For example, a shipping vendor might need to keep customer profile data within a certain region, for example in the EU, to satisfy GDPR, the General Data Protection Regulation, while other parts of the data, for example those related to the shipment itself, may be distributed and replicated somewhere else. More countries nowadays enforce data governance rules that mandate writes to be local to a geography and require all users' data to be held within their country. For example, China mandates that all customer data be written first to the local data center. Read operations can then span other geographies. In this shipping company example, you can have the customer profile replicated locally, but the package information replicated globally to provide shipment tracking services with low latency. Those are the three main benefits: resilience, global scale, and data sovereignty.

Challenges

Let's try to face the challenges. When considering entering the multi-region deployment space, besides the benefits, we need to consider the challenges. With multi-region deployments, all the difficulties related to creating distributed systems are lifted to another level. This table shows the median round trip time between some AWS regions. All the table cells that are marked green indicate a latency that is lower than 100 milliseconds, yellow is between 100 and 180, and the red ones are really bad: more than 180 milliseconds of round trip time. Probably lots of us have felt this first-hand when logging in over SSH to a shell on a machine that was far away, in the U.S. or even farther. You remember that latency on every keystroke.

The speed of light provides a lower bound on the latency. Let's consider an example of London and Frankfurt. London is where the EU-West data center is located and Frankfurt is where the EU Central data center is located. In a straight line, the distance between London and Frankfurt is roughly 640 kilometers, and it takes light about 2 milliseconds to travel these 640 kilometers. In total, that gives us the lower bound on the round trip time, which is around 4 milliseconds. In reality, the latency is much higher. This is related to the need to cross multiple pieces of network infrastructure, routing, packet retransmits, packet losses, and so on. I'm talking about this latency because the network delay is really critical: several existing impossibility results establish a lower bound on operation latency as a function of the network delay for different consistency levels. These results show that any algorithm guaranteeing a particular level of consistency cannot perform operations faster than some lower bound.
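
The London to Frankfurt arithmetic can be reproduced with a few lines of Python. This is an illustrative sketch, not material from the talk: the 640 km distance is the figure quoted above, while the fiber slowdown factor (light in optical fiber travelling at roughly two thirds of its vacuum speed) is an added assumption.

```python
# Back-of-the-envelope lower bound on round-trip time (RTT) between two sites.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum


def min_rtt_ms(distance_km: float,
               speed_km_per_s: float = SPEED_OF_LIGHT_KM_PER_S) -> float:
    """Return the minimum round-trip time in milliseconds over a straight line."""
    one_way_s = distance_km / speed_km_per_s
    return 2 * one_way_s * 1000.0


LONDON_FRANKFURT_KM = 640.0  # approximate straight-line distance from the talk

# Assumption: signal speed in optical fiber is about 2/3 of the vacuum speed.
FIBER_SPEED_KM_PER_S = SPEED_OF_LIGHT_KM_PER_S * 2 / 3

vacuum_rtt = min_rtt_ms(LONDON_FRANKFURT_KM)
fiber_rtt = min_rtt_ms(LONDON_FRANKFURT_KM, FIBER_SPEED_KM_PER_S)

print(f"vacuum RTT lower bound: {vacuum_rtt:.2f} ms")
print(f"fiber RTT lower bound:  {fiber_rtt:.2f} ms")
```

Even under the fiber assumption the bound stays in the single-digit millisecond range, so the remainder of a measured cross-region round trip comes from routing, queuing, and retransmissions rather than from physics.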

You can see that the notation in this table is similar to what is used in complexity theory to describe the running time of an algorithm. However, rather than being a function of the size of a problem or of an input, here the lower bound is a function of the network delay d. In this table, the first row, linearizability, is the safest, the strongest consistency level that we can provide. You can see that its latency is lower-bounded by the network delay for both writes and reads. This table may have many rows, but the most important conclusion is that we need to pay in terms of latency for increased consistency levels. We can see that the last row here, eventual consistency, the weakest one, may be realized without waiting for a network round trip, both in reads and writes.

Let's try to find the practical implications of that table. In practice, most multi-region applications rely on either synchronous or asynchronous replication. In synchronous mode, a writer has to wait for a remote region to acknowledge a write before the write is acknowledged to the writer and claimed successful. Asynchronous replication, on the other hand, works the other way: the main region doesn't wait for the remote region to acknowledge. This way, with synchronous replication, the lower bound for the write latency depends on the network delay, whereas for asynchronous replication, the write latency is independent of it. We need to tolerate a different recovery point objective: with asynchronous replication we need to be able to accept data loss, whereas synchronous replication provides a recovery point objective equal to zero, which means that there will be no data loss. Basically, even in case of failure, we are guaranteed that all successfully acknowledged data will be present in the system and will not be lost. Obviously, that holds if the system is bug-free.
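
The difference between the two modes can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not Redpanda code: the `Region` class, the function names, and the latency numbers (1 ms local write, 90 ms cross-region round trip) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A region holding an in-memory log of records (toy model)."""
    log: list = field(default_factory=list)


def write_sync(primary: Region, remote: Region, record: str,
               local_ms: float, rtt_ms: float) -> float:
    """Synchronous: acknowledge only after the remote region has the record."""
    primary.log.append(record)
    remote.log.append(record)
    return local_ms + rtt_ms  # latency includes the cross-region round trip


def write_async(primary: Region, backlog: list, record: str,
                local_ms: float) -> float:
    """Asynchronous: acknowledge immediately, ship to the remote region later."""
    primary.log.append(record)
    backlog.append(record)  # not yet present in the remote region
    return local_ms  # latency is independent of the network delay


def lost_on_primary_failure(acknowledged: list, remote: Region) -> list:
    """Acknowledged records missing from the surviving region (the RPO gap)."""
    return [r for r in acknowledged if r not in remote.log]


# Hypothetical numbers: 1 ms local write, 90 ms cross-region round trip.
sync_primary, sync_remote = Region(), Region()
sync_latency = write_sync(sync_primary, sync_remote, "order-1", 1.0, 90.0)

async_primary, async_remote, backlog = Region(), Region(), []
async_latency = write_async(async_primary, backlog, "order-1", 1.0)

print(sync_latency, lost_on_primary_failure(sync_primary.log, sync_remote))
print(async_latency, lost_on_primary_failure(async_primary.log, async_remote))
```

If the primary region fails before the backlog is shipped, the asynchronous variant has acknowledged a record that the surviving region never received, which is exactly the nonzero recovery point objective described above.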

I find the PACELC theorem a very good reasoning framework when looking at the consistency and latency tradeoff. PACELC is an extension of the CAP theorem that is meant to be used for reasoning about latency and consistency tradeoffs in distributed databases. PACELC defines the tradeoff that must be made during normal operation, when there is no network partition, and, on the other hand, in case of failure, when a network partition happens. The left-hand side of this picture is roughly the CAP theorem. It means that when a partition happens, we need to choose either availability or consistency. The system will be either consistent or available in case of a partition. On the right-hand side, we see that when no partition happens, latency has to be traded off against consistency. You can optimize your system either for latency or for consistency. This is, again, pretty much the same conclusion as the table that we showed two slides before. We need to pay with latency for stronger consistency levels and easier reasoning.

Unfortunately, all of this comes at an increased cost. Entering the multi-region deployment area must be underpinned by a good business requirement. Distributed applications are more complex to build, as developers and architects must consider all the challenges we have talked about up to this point. Additional complexity is added by the network infrastructure and load balancing. Also, debugging and troubleshooting multi-region applications is really hard. Complexity increases the cost of developing and creating an application. The more complex the application, the more expensive it is going to be to build. It will take longer to build and to test. Also, an additional factor that must be considered is the cost of operation. We need to factor in the cost of making more complex applications, and cloud infrastructure costs, cross-region traffic, observability, metrics traffic, and so on. To sum up this part, we may say that building multi-region applications is a challenging, complex, and hard, but at the same time incredibly interesting task.

Redpanda Experience with Multi-Region

In this part, I would love to share our experience and show you how we used Redpanda internals as building blocks to build a flexible and hopefully easy to use multi-region solution. Redpanda is a Kafka compatible streaming data platform built on top of Seastar, a shared-nothing, thread-per-core asynchronous programming framework. Seastar is an asynchronous C++ framework designed for building I/O intensive applications. Seastar is unique because its application threads are pinned to cores. Such a pinned thread is called a shard in Seastar lingo. I will be using shard to mean this thread pinned to a particular CPU core. Seastar has its own user space scheduler implementing cooperative multitasking, so your code can explicitly yield control to the scheduler. The framework forces application developers to use a shared-nothing architecture, where shards do not share memory, hence they do not need to use locks in the classical sense. All inter-thread communication is based on message passing. This approach allows Redpanda to minimize the number of context switches, better utilize CPU caches, and scale with the hardware. Redpanda also uses fragmented buffers to optimize memory allocation, shard-local memory buffer pools, and direct I/O to explicitly bypass the system page cache and maximize I/O performance. If you are interested in learning more about the Seastar programming model and how it is used in Redpanda, I encourage you to go to the talk by my colleague, John Spray, on adventures in thread-per-core architectures with Redpanda.

Probably most of you are familiar with the Kafka data model and topic semantics. I will briefly describe these concepts, as Redpanda is Kafka compatible at the protocol layer and we use exactly the same abstractions. In Redpanda, data are organized in topics. A topic is a logical abstraction over a set of partitions sharing the same configuration. A partition is an instance of an immutable append-only log. Each partition, each immutable append-only log, is implemented in Redpanda as an independent Raft group. A partition is hosted on a set of Redpanda nodes that form its replica set. On this diagram you can see a partition with a replication factor of three. Each Redpanda face is a node, and those small boxes with numbers represent single messages. A line represents a log. A leader is responsible for processing client requests. When a writer wants to replicate a message, the leader first writes it to its own log and then sends out a request to the followers. In the Kafka protocol, clients have flexibility in when to claim a write operation successful. It can be expressed with the acks property. The acks property defines how many in-sync replicas have to acknowledge a write before success is returned to the client. There are three possible values: 0, 1, and all. I'm going to focus on acks=all during this talk, as this option provides the strongest consistency guarantee. In Apache Kafka, acks=all means that all in-sync replicas must acknowledge the write before it is done.

In Redpanda, we use different replication. Writing with acks=all in Redpanda is treated as a Raft write. A write in Raft is acknowledged as soon as a majority of replicas have safely replicated it and persisted it to disk. In Redpanda, we explicitly fsync before acknowledging the write, both on the leader and on the followers. This observation is critical: to accept a write, we do not require all nodes to concur. This way, the replica with the highest latency does not influence the write latency perceived by the leader, and this comes without compromising availability and consistency. Do you remember that cross-region latency table that I showed before? Being able to skip the latency coming from one of the red fields is already a huge win. This is a property of the Raft protocol itself that we leverage.
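
The effect of waiting for a majority instead of for every replica can be shown with a small calculation. The sketch below is a simplified model rather than Redpanda's implementation: each replica is reduced to a single hypothetical number, the time after which the leader knows that replica has persisted the write, and the example latencies are invented.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def ack_latency_all(persist_times_ms: list) -> float:
    """Latency when every replica must confirm: bounded by the slowest one."""
    return max(persist_times_ms)


def ack_latency_majority(persist_times_ms: list) -> float:
    """Latency when a majority must confirm: the k-th fastest replica decides."""
    ordered = sorted(persist_times_ms)
    return ordered[majority(len(ordered)) - 1]


# Hypothetical times (ms) until the leader knows each replica has fsynced:
# the leader itself, two nearby followers, and two followers in far regions.
persist_times_ms = [1.0, 12.0, 15.0, 80.0, 190.0]

print("all replicas:", ack_latency_all(persist_times_ms))
print("majority:    ", ack_latency_majority(persist_times_ms))
```

With five replicas the majority is three, so the write is acknowledged once the third-fastest replica has persisted it, and the two slowest links never appear on the critical path.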

We decided to use the same labeling for our topology nodes as cloud vendors do, to prevent confusion. Redpanda nodes can be assigned to regions and availability zones. Regions usually represent separate geographic areas, whereas availability zones, or racks, exist within regions. They are also separate, but they may be in the same data center. Each region may or may not contain multiple Redpanda nodes. Basically, each tree node here may have Redpanda servers assigned to it. With the hierarchical deployment model and the properties of the stretched Raft protocol, failure survivability is defined by a simple equation, n = 2f + 1. This equation says that if a partition, a Raft group, is required to survive f failures, it requires 2f plus 1 instances at a given hierarchy level. In the simplest form of this equation, we take n to be the number of replicas. This way, we know that to tolerate one node failure, we need at least three replicas: 2 times 1 plus 1 is 3 replicas. The same rule applies to availability zones and regions. In order to tolerate a region failure, we need at least three regions. In order for this rule to apply, we need to remember that the number of replicas in each entity at each topology level must be smaller than the majority, where the majority is basically half of the replicas, rounded down, plus 1. In the case of 5 nodes, the majority is 3.
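
The survivability rule is simple enough to write down directly. The helper names below are hypothetical and the snippet is only a sketch of the arithmetic from this section, with the majority taken as the integer half of the replica count plus one.

```python
def required_instances(f: int) -> int:
    """Instances (nodes, zones, or regions) needed to survive f failures."""
    return 2 * f + 1


def majority(n: int) -> int:
    """Majority of n replicas: integer half plus one."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Failures that n instances can survive while keeping a majority."""
    return (n - 1) // 2


for n in (3, 4, 5, 6):
    print(f"n={n}: majority={majority(n)}, tolerates={tolerated_failures(n)}")
```

Note that an even count does not buy extra tolerance: four instances tolerate one failure just like three, and six tolerate two just like five.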

Example Deployment of Multi-Region Partition

Let's apply this formula to a practical example. Let's consider a deployment that can sustain the failure of a whole region, an availability zone failure, and the failure of two nodes. Having at most two nodes in a region forces a leader to communicate with at least one node from another region. This property allows the partition to stay alive even if a whole region is offline. In this diagram, we can see a partition with five replicas: two of them in region A, two of them in region B, and one of them in region C. This is done to be cost effective, as we don't need a sixth node, a sixth replica, here, because it doesn't increase availability. Intuitively, if a leader is to contact a majority, it will always have to send at least one message outside its own region. This one message will have to be acknowledged by a node from either region A or region C.
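
The claim about the 2-2-1 layout can be checked mechanically. The following sketch uses hypothetical function names and models a placement only as a count of replicas per region; it is not how Redpanda represents placements.

```python
def majority(n: int) -> int:
    """Majority of n replicas: integer half plus one."""
    return n // 2 + 1


def survives_any_region_loss(replicas_per_region: dict) -> bool:
    """True if a majority remains after losing any single region."""
    total = sum(replicas_per_region.values())
    return all(total - count >= majority(total)
               for count in replicas_per_region.values())


def max_node_failures(replicas_per_region: dict) -> int:
    """Number of arbitrary node failures the placement can absorb."""
    total = sum(replicas_per_region.values())
    return total - majority(total)


layout_221 = {"A": 2, "B": 2, "C": 1}  # the layout from the diagram
layout_311 = {"A": 3, "B": 1, "C": 1}  # region A alone holds a majority
layout_222 = {"A": 2, "B": 2, "C": 2}  # with a sixth replica added

for name, layout in [("2-2-1", layout_221), ("3-1-1", layout_311),
                     ("2-2-2", layout_222)]:
    print(name, survives_any_region_loss(layout), max_node_failures(layout))
```

The 3-1-1 layout fails the check because losing region A removes a majority, while the 2-2-2 layout tolerates the same failures as 2-2-1, which is why the sixth replica adds cost without adding availability.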

Partition Replicas Allocation

We covered the properties of a stretched Raft group and how they can be leveraged to provide multi-region replication. Now let's take a closer look at the partition allocator. The partition allocator is a component in Redpanda that answers the question of where to place partition replicas. Whenever a partition is created, during topic creation or partition addition, the allocator takes the allocation request together with its constraints and responds with a replica placement. The central component of the partition allocator is a hierarchical constraint solver. Obviously, the allocator uses its current state to make the allocation decision. The Redpanda constraint solver has an interface that is organized around constraints. Constraints allow users to express allocation preferences. At the low level, we define two types of constraints, hard and soft. Each constraint class is really a factory creating a constraint evaluator. An evaluator is a function returning an evaluation result. Hard constraints define whether a node is eligible to host a partition, whereas soft constraints express a preference for a replica to be placed on a particular node. The higher the score returned by the soft constraint evaluator, the higher the preference. The constraint interface makes it easy to solve them with a greedy algorithm.

Let's take a look at solving the hard constraints. First, we use the factory interface to create constraint evaluators. This step allows us to execute some preliminary logic for each constraint before actually entering the evaluation phase and executing the evaluator for each possible allocation node. Then, in the next step, we go over all allocation nodes. A node is eligible to hold a partition only if none of the constraints is violated. Solving the soft constraints is actually finding the best fit. This is done by calculating and summing up the scores for each of the nodes, and then the score is normalized. If it happens that a few of the nodes share the same score, we take a random one of them to prevent hotspotting and to provide an even spread over those nodes.
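
The two phases described here, filtering by hard constraints and then picking the best fit by soft constraints, can be sketched as a greedy loop. This is a minimal Python illustration, not Redpanda's C++ solver: the `Node` fields, the constraint names, and the scoring in the range 0 to 1 are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: int
    rack: str
    capacity: int    # hypothetical total replica slots
    free_slots: int  # hypothetical remaining replica slots


# Each constraint is a factory: given the replicas placed so far,
# it builds an evaluator that is then run for every candidate node.
def distinct_nodes(placed):
    used = {n.id for n in placed}
    return lambda candidate: candidate.id not in used      # hard: bool


def has_capacity(placed):
    return lambda candidate: candidate.free_slots > 0      # hard: bool


def least_allocated(placed):
    return lambda c: c.free_slots / c.capacity             # soft: 0..1


def distinct_racks_preferred(placed):
    used = {n.rack for n in placed}
    return lambda c: 0.0 if c.rack in used else 1.0        # soft: 0..1


def pick_replica(nodes, placed, hard, soft, rng):
    hard_evals = [factory(placed) for factory in hard]
    soft_evals = [factory(placed) for factory in soft]
    # Phase 1: keep only nodes that violate no hard constraint.
    eligible = [n for n in nodes if all(ev(n) for ev in hard_evals)]
    if not eligible:
        return None
    # Phase 2: sum the soft scores and break ties randomly.
    scores = {n.id: sum(ev(n) for ev in soft_evals) for n in eligible}
    best = max(scores.values())
    return rng.choice([n for n in eligible if scores[n.id] == best])


def allocate(nodes, replication_factor, hard, soft, seed=0):
    rng = random.Random(seed)
    placed = []
    for _ in range(replication_factor):
        node = pick_replica(nodes, placed, hard, soft, rng)
        if node is None:
            raise ValueError("allocation constraints cannot be satisfied")
        placed.append(node)
    return placed


nodes = [Node(1, "r1", 10, 5), Node(2, "r1", 10, 5),
         Node(3, "r2", 10, 5), Node(4, "r3", 10, 5)]
placement = allocate(nodes, 3, [distinct_nodes, has_capacity],
                     [least_allocated, distinct_racks_preferred])
print(sorted(n.id for n in placement))
```

Because every soft score here already lies between 0 and 1, the sketch skips a separate normalization step. The random tie-break is seeded only to keep the example reproducible.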

Let's take a look at some constraint examples. This is a simple constraint that pins a replica to a certain node. You can see that this evaluator is really easy. This function returns a lambda, that is, a function that returns true only if the candidate ID is equal to the requested ID. This constraint is used in Redpanda internals to give users the ability to express so-called custom topic allocation, so a user can pin partition replicas to certain nodes. The next one is a slightly more complicated example. In Redpanda we use a strong typing approach, which means that each abstraction representing a different concept has its own type defined, even though the underlying type is exactly the same. Here we can see that rack and region are both strings, but in C++ those are actually different types. This forces us to use some generic code so that we do not have to define the constraint separately for each type.
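
The slide code itself is not part of the transcript, so the following is only a Python analogue of the pinning constraint and of the strong typing idea. `NodeId`, `RackId`, `RegionId`, and `on_node` are hypothetical names, and `typing.NewType` only distinguishes the types for a static type checker, not at runtime.

```python
from typing import Callable, NewType

# Distinct static types over the same underlying representation.
NodeId = NewType("NodeId", int)
RackId = NewType("RackId", str)
RegionId = NewType("RegionId", str)


def on_node(requested: NodeId) -> Callable[[NodeId], bool]:
    """Hard constraint factory: accept only the requested node."""
    return lambda candidate: candidate == requested


pin_to_three = on_node(NodeId(3))
print(pin_to_three(NodeId(3)), pin_to_three(NodeId(4)))

# A static checker such as mypy would reject passing a RackId where a
# RegionId is expected, even though both are plain strings at runtime.
rack = RackId("rack-a")
region = RegionId("eu-west")
```

The evaluator is a closure over the requested ID, which mirrors the description above: the factory runs once, and the returned function is then called for each candidate node.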

This is a constraint that spreads partition replicas equally over a given label. It accepts the label name, which can be either rack or region and is used for debugging and logging purposes, then a vector of possible labels over which we want the replicas to be distributed, and a mapper function. A mapper function is a function that maps a node to a certain label, like a region or a rack. When a node is not assigned to any label, it returns an empty optional. Here we leverage the first phase, where we create an evaluator. Before evaluating all possible nodes, we populate a frequency map. The frequency map represents the number of replicas already present in a given label. Then we allow allocation only when the frequency of the candidate's label is equal to the minimum frequency. This way, we spread replicas evenly across the possible labels.
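
A Python approximation of this even-spread constraint might look as follows. It is a sketch under assumptions: nodes are plain integers, the mapper is a dictionary lookup that returns `None` for unlabeled nodes (standing in for the empty optional), and the function names are invented for the example.

```python
from typing import Callable, Optional, Sequence


def spread_evenly(label_name: str, labels: Sequence[str],
                  mapper: Callable[[int], Optional[str]]):
    """Hard constraint factory spreading replicas evenly over `labels`."""
    if not labels:
        raise ValueError(f"no {label_name} labels to spread over")

    def make_evaluator(placed: Sequence[int]) -> Callable[[int], bool]:
        # Preliminary phase: count replicas already placed in each label.
        frequency = {label: 0 for label in labels}
        for node in placed:
            label = mapper(node)
            if label in frequency:
                frequency[label] += 1
        min_frequency = min(frequency.values())

        def evaluate(candidate: int) -> bool:
            # Allow only labels that currently hold the fewest replicas.
            label = mapper(candidate)
            return label in frequency and frequency[label] == min_frequency

        return evaluate

    return make_evaluator


# Hypothetical topology: node id -> region; node 5 has no region assigned.
region_of = {1: "A", 2: "A", 3: "B", 4: "B"}
constraint = spread_evenly("region", ["A", "B"], region_of.get)

first = constraint([])    # nothing placed yet
second = constraint([1])  # one replica already in region A
print(first(1), first(3), first(5))
print(second(2), second(3))
```

After one replica lands in region A, only region B holds the minimum count, so the next replica is steered there and the spread stays even.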

Let's take a look at an example of constraint chaining. This is how we use this interface in practice. Those are the default constraints for Redpanda topic creation. We want the partition replicas to be on distinct nodes, so unique nodes. We don't want two replicas to be held on one node. That's obvious. Then we want the nodes not to be fully allocated, so they need to have some free capacity. The node has to be active. We prefer the least allocated node, and nodes located in distinct racks. Here's another example of constraint chaining. This is an example that I created to present how we can deploy the five replicas from the diagram that I showed before, the deployment that is able to sustain a region failure. There we had only one replica in one region, and this is expressed exactly with this constraint, an exact number of replicas in a label. Then we have another constraint that spreads replicas equally across regions A and B. We want region and rack to be distinct across the nodes. The partition allocator will return an error whenever the allocation cannot be fulfilled, or otherwise a list of nodes where the replicas should be placed. Expressing the constraints in this way means users do not have to over-specify the partition placement in the cluster. We don't require a user to say exactly where to place the partitions. Rather, Redpanda can make an informed decision about the placement based on its internal state. This way a user can express their preferences.

Automatic Partition Data Balancing

The partition allocator that I presented answers the question of where to allocate partition replicas. However, as the system evolves over time, an allocation that was valid in the past may no longer be optimal, or it may even violate the current constraints. To solve this problem, we introduced a partition balancer service. The service is triggered by events that are known to cause constraint violations. For example, a node being down for a long time, leaving a partition waiting to be replicated because one of its replicas is not available, or nodes being decommissioned. This reconciliation loop is also [inaudible 00:32:58] to discover other violations like, for example, low disk space. This service is responsible for calculating a fitness score for the current partition allocation and for detecting hard constraint violations. If the allocation can be optimized, the partition will be reallocated. Automatic rebalancing uses the same constraints code, but instead of calculating an allocation from a set of constraints like before, it evaluates the current replica placement together with the constraints to find violations. If a violation is detected, the partition is moved to a place in the cluster where the violation no longer exists.
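
The idea of re-evaluating an existing placement against constraints can be sketched as below. Everything in it is hypothetical: the `NodeState` fields, the check names, the 80% disk threshold, and the naive choice of the first healthy node as a target are assumptions for illustration, not Redpanda's balancer logic.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    alive: bool = True
    decommissioning: bool = False
    disk_used_ratio: float = 0.0


# Hard checks evaluated against the current placement. Each returns True
# when the node is still acceptable. The 0.8 threshold is an assumption.
CHECKS = {
    "node unavailable": lambda s: s.alive,
    "node decommissioning": lambda s: not s.decommissioning,
    "low disk space": lambda s: s.disk_used_ratio < 0.8,
}


def find_violations(replica_sets: dict, nodes: dict) -> list:
    """Return (partition, node_id, reason) for every violated check."""
    return [(partition, node_id, reason)
            for partition, replicas in replica_sets.items()
            for node_id in replicas
            for reason, ok in CHECKS.items()
            if not ok(nodes[node_id])]


def plan_moves(replica_sets: dict, nodes: dict) -> list:
    """Return (partition, from_node, to_node) moves that clear violations."""
    healthy = [n for n, s in nodes.items()
               if all(ok(s) for ok in CHECKS.values())]
    moves = []
    for partition, replicas in replica_sets.items():
        current = list(replicas)
        for node_id in replicas:
            if node_id in healthy:
                continue
            candidates = [n for n in healthy if n not in current]
            if not candidates:
                continue  # nowhere to move this replica for now
            target = candidates[0]
            current[current.index(node_id)] = target
            moves.append((partition, node_id, target))
    return moves


nodes = {1: NodeState(), 2: NodeState(alive=False), 3: NodeState(),
         4: NodeState(), 5: NodeState(disk_used_ratio=0.95)}
replica_sets = {"topic/0": [1, 2, 3], "topic/1": [3, 4, 5]}

print(find_violations(replica_sets, nodes))
print(plan_moves(replica_sets, nodes))
```

A real balancer would choose the target with the same soft constraints used at allocation time. The sketch only shows that detection and correction can share one set of checks.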

Leadership Preference

Another important component of Redpanda's multi-region capabilities is the so-called leadership preference. Leadership preference allows users to define where the preferred leader should be located. This has two components. One is an extension to the Raft protocol that we implemented, where we define priorities that express a preference for leaders. Basically, you can specify which replica is most likely to win the leader election. Another component in Redpanda is the leader balancer. It's like an optimizer that optimizes leader placement around the cluster. At the beginning, we only used it to spread leaders out equally in the cluster. Now, with multi-region capabilities, we give users the ability to express a leadership preference, so that the leader balancer actually moves leadership to the place where the user wants it to be. Why is that so important? It is especially important with multi-region deployments, because if the consumer workload is in a different region, you may have to pay for cross-region traffic, and that may be much more expensive than communicating with nodes that are in the same region where the application lives. An additional penalty will obviously be the latency, as the latency of cross-region communication will be much higher.
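
As a rough illustration of priority-driven leader selection, the sketch below ranks live replicas by an ordered list of preferred regions. It deliberately ignores what a real Raft election requires (terms, votes, and an up-to-date log), and the `Replica` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    node_id: int
    region: str
    alive: bool = True


def leader_priority(replica: Replica, preferred_regions: Sequence[str]) -> int:
    """Lower value means higher priority; unlisted regions rank last."""
    if replica.region in preferred_regions:
        return preferred_regions.index(replica.region)
    return len(preferred_regions)


def pick_preferred_leader(replicas: Sequence[Replica],
                          preferred_regions: Sequence[str]) -> Optional[Replica]:
    """Pick the live replica with the best priority, lowest node id on ties."""
    candidates = [r for r in replicas if r.alive]
    if not candidates:
        return None
    return min(candidates,
               key=lambda r: (leader_priority(r, preferred_regions), r.node_id))


replicas = [Replica(1, "A"), Replica(2, "A"), Replica(3, "B"),
            Replica(4, "B"), Replica(5, "C")]
leader = pick_preferred_leader(replicas, ["B", "A"])

# If region B is offline, leadership falls back to the next preferred region.
degraded = [Replica(1, "A"), Replica(2, "A"), Replica(3, "B", alive=False),
            Replica(4, "B", alive=False), Replica(5, "C")]
fallback = pick_preferred_leader(degraded, ["B", "A"])
print(leader.node_id, fallback.node_id)
```

Placing the preferred region next to the main producers and consumers keeps their traffic local, which is the cost and latency argument made above.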

Reading From Follower Replica

Another important part of the whole multi-region picture is follower fetching. As the name states, follower fetching allows consumers to read not only from the leader. If read [inaudible 00:36:15] is an acceptable tradeoff, an application may choose to read from a follower. This is a part of the Kafka protocol. The Kafka protocol provides a way to select the replica that is closest to the consumer. A follower may obviously be missing some offsets from the leader, but that may be acceptable for the application. The Raft protocol guarantees that even when reading from a follower, and even if a failure happens, the consumer will never read beyond what was actually acknowledged as committed by the leader. The consumer will never read beyond the high watermark.
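
The guarantee that a consumer never reads past the high watermark can be modeled in a few lines. This is a toy model with hypothetical names: the high watermark is treated as an exclusive upper bound on readable offsets, and each replica only knows the value it has learned so far.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    log: List[str]
    high_watermark: int  # exclusive bound: offsets below it are committed

    def fetch(self, offset: int, max_records: int = 100) -> List[str]:
        """Return committed records starting at `offset`, never beyond the HW."""
        end = min(self.high_watermark, len(self.log), offset + max_records)
        return self.log[offset:end]


# The leader has appended "e", but a majority has not persisted it yet.
leader = Replica(log=["a", "b", "c", "d", "e"], high_watermark=4)
# The follower lags: it holds four records and has learned a HW of 3.
follower = Replica(log=["a", "b", "c", "d"], high_watermark=3)

print(leader.fetch(0))
print(follower.fetch(0))
print(follower.fetch(3))
```

The follower returns fewer records than the leader, which is the staleness an application accepts when it opts into follower fetching, but neither replica ever exposes an uncommitted record.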

Tiered Storage

Another important component is tiered storage. Tiered storage in Redpanda was intended to provide extended retention for data by offloading them to an object store. Before data are evicted from local storage, a local hard drive, they are uploaded to the object store, and only then can they be evicted. The data in the object store are indexed and are still readable through the Kafka protocol, with some latency. The object store acts here as a slower storage tier with higher capacity and lower cost. Tiered storage is also a very good building block for multi-region capabilities, as in Redpanda we provide a way to create a topic as a so-called read replica. A read replica is a lazily populated topic backed by the data available in the object store. We have two separate clusters: one uploads the data to the object store while the other one is able to read them. Obviously, writes are blocked. Writes to the read replica topic are not possible. This approach trades off read latency, because uploading data to S3 may take some time, and there is also a gap between what has already been uploaded and what has not been uploaded yet. There is no direct traffic between the clusters. The only way that data is exchanged is through the object store.

Future Plans

We are still at the beginning of our multi-region journey with Redpanda. This is a very exciting project. In the future, among other things, we plan to add read-only non-voting replicas as a type of replica that will allow users to achieve asynchronous replication with the Raft protocol. We already have a notion of learner nodes. Those are nodes that are members of a Raft replica set but are not counted toward the majority, which is effectively asynchronous replication, as they are not required to acknowledge the write. Additionally, we plan to add peer-to-peer replication in Raft to alleviate the increased cost of multi-region deployments. In Raft, the leader must communicate with each follower directly, but with peer-to-peer replication, it will only send data to one follower in a remote region, reducing cross-region traffic cost.
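
The traffic saving from peer-to-peer replication can be estimated by counting how many copies of each write cross a region boundary. The function below is a hypothetical sketch that assumes the leader sends one copy per remote follower today and one copy per remote region with peer-to-peer replication, with the receiving follower forwarding it locally.

```python
def cross_region_sends(replicas_per_region: dict, leader_region: str,
                       peer_to_peer: bool) -> int:
    """Copies of each write that the leader sends across region boundaries."""
    remote = {region: count for region, count in replicas_per_region.items()
              if region != leader_region and count > 0}
    if peer_to_peer:
        return len(remote)       # one copy per remote region
    return sum(remote.values())  # one copy per remote follower


layout = {"A": 2, "B": 2, "C": 1}  # the five-replica layout, leader in A

direct = cross_region_sends(layout, "A", peer_to_peer=False)
relayed = cross_region_sends(layout, "A", peer_to_peer=True)
print(direct, relayed)
```

For the 2-2-1 layout the saving is one copy per write, and it grows with the number of followers placed in each remote region.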

PACELC - Where is Redpanda?

Where are we with Redpanda, and where is Redpanda in terms of PACELC? With its current capabilities, when we think about multi-region, we always trade availability for consistency. We value consistency because it provides easy reasoning and also data safety, which is very important right now for most of the use cases. By adding these read-only non-voting replicas, we will give users the ability to trade consistency away, and actually be available and have lower latency.

Summary

There is an increasing need to create multi-region data applications as organizations scale and face new challenges. Tradeoffs have to be made when creating geo-distributed applications. We need to develop an intuition for the tradeoff between latency and consistency. In Redpanda, we used stretched Raft, a hierarchical constraint solver, tiered storage, and other components to give users flexibility when creating multi-region applications.

 

Recorded at:

Mar 06, 2024
