
Relational Data at the Edge


Summary

Justin Kwan and Vignesh Ravichandran discuss Cloudflare’s edge database architecture, unique challenges and practices for data replication, failover and recovery, and custom performance techniques.

Bio

Justin Kwan is a Software Engineer Intern - iCloud Edge @Apple, previously @Cloudflare. Vignesh Ravichandran is the Co-founder of Omnigres. Earlier, he led the database teams at Cloudflare, Ticketmaster, and Ford.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Kwan: The edge has become a very popular application of distributed systems. What exactly is the edge? The edge is simply something that runs very close to clients geographically. We think of services such as DNS resolution, content delivery, and IP firewalling as living near the edge, close to the client. Traditionally, relational databases have lived in centralized data centers, further away from clients. The edge is really just a relative measure, where services that run closer to the clients are considered more edgy. Location-based network latency and costs are massively reduced when we can serve client requests at the edge of the network, closer to where they are.

Background, and Outline

My name is Justin. I'm a software engineering intern at Apple, building out services for iCloud's edge network. Previously, I worked at Cloudflare as a systems software engineer, focusing on storage infrastructure, multi-tenant resource isolation, and even database internals, where Vignesh was my super amazing engineering manager. Vignesh has been an engineering manager at Cloudflare from 2019 up until now, running the entire storage infrastructure org as well as the database team. He's also working on a startup called spinup.host, and he's a member of PgUS. From 2016 to 2019, Vignesh gained a lot of database engineering experience at the ground level as a database engineer at Ticketmaster.

I'm excited to share how we design and operate our high availability database architecture at the edge. Together, we'll break down the high availability setup in detail and consider the tradeoffs we had to make around each part of the system. Then we'll explore some of the interesting performance challenges we faced when bringing our database infrastructure closer than ever to the edge, and dive deep into the solutions we've implemented. Finally, we'll take a sneak peek at some of the interesting recurring patterns we see emerging in relational data storage colocated at the edge.

Cloudflare

Cloudflare is the market leader in network security, performance, and edge computing. Some numbers to give you an idea of the scale we're dealing with: over 27 million internet properties depend on our protection, which means our network needs to handle over 46 million HTTP requests per second on average. We leverage our network of Points of Presence spread across 250 locations globally. This level of traffic translates into over 55 million row operations per second on average against our busiest Postgres database cluster, and we store over 50 terabytes of data across all of our clusters.

Cloudflare's control plane is an orchestration of many microservices that power our API and dashboard. Our control plane manages the rules and configurations for critical network services along the data path. Our database clusters store things like DNS record changes, firewall and DDoS mitigation rules, API gateway routes, and even data for internal services like billing entitlements and user auth. We've seen application teams use Postgres in some pretty interesting ways. Since the early days of Postgres, stored procedures have been commonly used to execute business logic as units of work. The robust PL/pgSQL language enables intricate branching logic, which is used for enabling and disabling domains with transactional consistency. We've also commonly seen service teams use Postgres as an outbox queue for capturing domain events. One example is when DDoS rules are generated from traffic analysis in a centralized data center: they're first written into a Postgres table, and a separate daemon pulls these events from Postgres and pipes them through Kafka to services at the edge. This means that latency-critical services can avoid directly hitting the database while enjoying the durability that Postgres and Kafka provide.
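As a rough sketch of that outbox pattern, here is a minimal relay loop; the table name, columns, topic, and connection strings are hypothetical illustrations, not Cloudflare's actual schema or daemon.

```python
# Hypothetical outbox relay: poll unpublished events from a Postgres table
# and pipe them into Kafka. Table, topic, and DSN names are made up.
import json
import psycopg2
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
conn = psycopg2.connect("host=primary.internal dbname=control_plane")

def relay_batch(limit=500):
    with conn.cursor() as cur:
        # Lock a batch of unpublished events so concurrent relays skip them.
        cur.execute(
            """
            SELECT id, payload
            FROM ddos_rule_events
            WHERE published = false
            ORDER BY id
            LIMIT %s
            FOR UPDATE SKIP LOCKED
            """,
            (limit,),
        )
        rows = cur.fetchall()
        for _, payload in rows:
            producer.send("ddos-rules", payload)
        if rows:
            cur.execute(
                "UPDATE ddos_rule_events SET published = true WHERE id = ANY(%s)",
                ([event_id for event_id, _ in rows],),
            )
    conn.commit()  # events are marked published only after being handed to Kafka
```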

Keep in mind that at Cloudflare, we run our entire software and service stack on our own bare metal hardware. Unlike many other major service providers, we spend a great deal of time thinking about things like rack-mounted servers, the network cards that power our high bandwidth backbone, and the operational maintenance required. On-premise data storage offers us the highest degree of flexibility across the entire stack. We can meticulously fine-tune elements such as our solid-state RAID configuration. We've implemented features in our open source cluster management system, and have even applied custom performance patches against PostgreSQL itself. This level of transparency and control would be impossible had we used a managed vendor such as AWS RDS. Unfortunately, it also means there's no magical autoscaling button I can push to immediately increase capacity when we start experiencing load. This is one of the most challenging aspects of running Cloudflare's entire stack on-premise.

Architecture

Let's dive into Cloudflare's database architecture that sits near the edge. We need to process transactions on the order of millions per second. When designing and selecting the components of our system, we considered what was most needed from an ideal architecture. The most important thing is high availability. The control plane for our critical services needs to remain available 24/7 to protect our customers. We aim for an SLO of five nines of availability across all of our services. That means we get about five and a half minutes of downtime per year across the entire software stack, which gives our databases even less leeway. Achieving this level of availability requires our system to operate at scale, and our transactional workloads typically skew read heavy. Our infrastructure has to handle a high rate of read and write RPS with minimal latency while maintaining fault tolerance. Of course, we also care about observability, which is important in any distributed system. We internally leverage our Anycast network, which means clients are naturally load balanced across our PgBouncer proxy instances. This is the first step of our journey in pushing the databases to the edge, since queries can be processed faster in closer proximity to the clients. While BGP Anycast allows us to proxy queries to the optimal region closest to the clients, write queries still need to be forwarded all the way back to the primary region, where the primary database instance resides, while read queries can be served locally from the closest nearby region.
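To make that read/write split concrete, here is a simplified, hypothetical application-level sketch. In Cloudflare's actual setup this routing happens in the Anycast and proxy layer (PgBouncer and HAProxy), not in application code, and the DSNs below are made up.

```python
# Illustrative only: send writes to the primary region, reads to the nearest replica.
import psycopg2

PRIMARY_DSN = "host=primary.pdx.internal dbname=control_plane"          # primary region
LOCAL_REPLICA_DSN = "host=replica.local.internal dbname=control_plane"  # nearest region

primary = psycopg2.connect(PRIMARY_DSN)
replica = psycopg2.connect(LOCAL_REPLICA_DSN)

def run(query, params=()):
    # Writes must travel back to the primary region; reads stay local to cut latency.
    is_write = query.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
    conn = primary if is_write else replica
    with conn.cursor() as cur:
        cur.execute(query, params)
        if is_write:
            conn.commit()
            return None
        return cur.fetchall()
```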

Here's a picture of our entire architecture. Let's start from the very top, at the proxy layer. At the top we have PgBouncer, which manages the database server connection pools used by our application clients. Queries are then passed through HAProxy and load balanced across several database cluster instances to prevent any single Postgres database from becoming overloaded with a disproportionate number of queries. In the middle, a typical Postgres deployment contains a single primary instance, which replicates data to multiple replicas to support a high rate of read queries. Our primary database handles all the writes; our asynchronous and synchronous replicas handle reads. Our Postgres cluster topologies are managed by a high availability cluster management tool called Stolon. It's backed by the etcd distributed key-value store, which uses the Raft protocol under the hood for leadership consensus and election in the case of primary failovers. You'll notice that we've adopted an active-standby model to ensure we have cross-region data redundancy. Our primary region in Portland serves all inbound queries, while our standby region in Luxembourg is ready to handle all incoming traffic if we ever need to evacuate the region or fail over. This means our primary Postgres cluster replicates data across regions to the standby as well. There's a lot going on in this high-level diagram, so we'll dive into each layer, each of the components, and the tradeoffs we considered.

Persistence

Ravichandran: We decided we had to build this highly distributed relational database at the edge. When we set out to do this, we wanted to look at the fundamental things that would define the architecture. As first principles, we looked at the CAP theorem, since this is a distributed application spread geographically, built on open source software, Postgres, and running entirely in our own data centers. Just a quick recap on the CAP theorem: in a distributed application, under a network partition you can have only one of the two, consistency or availability. A lot of the time we pick availability over consistency. Let's go over the cases where we make those tradeoffs and what they are. This is the typical architecture we deploy in a single data center: there is a primary database, a synchronous replica, and at least a couple of asynchronous replicas. The same topology is replicated across multiple data centers; right now we have at least three core data centers replicating in real time. Why do we pick this kind of semi-synchronous replication topology, with a primary database, at least one synchronous replica, and a bunch of asynchronous replicas? If an asynchronous replica goes down, we don't have any problem with respect to the applications; they can continue working without any impact, because it's just an asynchronous replica. We made similar tradeoffs for cross-region replication: say we have a replication region in Europe as well as in Asia, and either of those goes down; our applications in the U.S. can still continue working well. However, if the synchronous replica goes down, we bring the entire application to a halt, because we don't want to take reads and writes, especially writes, without a synchronous copy of the data. Pretty much every time there is a failover of the primary database, we pick the synchronous replica to be the new primary. That's the reason we went with this semi-synchronous replication topology.
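For reference, this is roughly what that semi-synchronous setup looks like in Postgres terms. It's a hedged sketch with a hypothetical replica name; in production this configuration is managed by Stolon rather than applied by hand.

```python
# Sketch: make commits wait for exactly one named standby (semi-synchronous).
import psycopg2

conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    # FIRST 1 (...) means one listed standby must confirm each commit.
    # Losing an async replica is harmless; losing the sync replica blocks
    # writes instead of risking data loss.
    cur.execute("ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (sync_replica_1)'")
    cur.execute("ALTER SYSTEM SET synchronous_commit = 'on'")
    cur.execute("SELECT pg_reload_conf()")
```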

As I mentioned earlier, there are two fundamental constraints: one, it's going to be based on Postgres; two, everything runs on-prem. Postgres by itself is not a distributed database. It started in the '90s in Berkeley, and it was designed with a monolithic architecture in mind. How are we going to turn this Postgres into a distributed Postgres? We rely heavily on its replication capabilities: one is logical replication, and the other is streaming replication. There is a third derivative, what I call cascading replication, which is built on top of the above two, either logical or streaming. Before we get into replication and how it helps us build a distributed Postgres, I want to give a quick primer on write ahead logs. How many of you are aware of write ahead logs? In any database, not just Postgres - whether you pick MySQL or Oracle - relational database management systems achieve the durability in ACID using write ahead logs. In simple terms, whenever you make a change to your database, it doesn't need to be synced directly to the database files. Rather, the changes are captured sequentially in a log file, and those log entries are synced to the database files asynchronously. This provides the durability we need in any database system. It also turns out to be a really good building block for a replication mechanism, where we capture these changes and then replicate them out to another replica.

Let's go into streaming replication first. Streaming replication is pretty simple: a replica creates a TCP connection to the primary and streams pretty much every log entry from operations within a transaction. The interesting part is that it's very performant; we can capture changes at very high rates - pretty much 1 terabyte per second of changes - out to the replicas on the other end, with minimal delay. There are delays, and Justin will go over some of them and how they arise. The caveat with streaming replication is that it's all or nothing. If you set up a replica, you have to replicate every single piece of data, because it works at the filesystem block level. There is no way, at least so far, to choose which data is replicated. The newer form of replication is logical replication. As the name suggests, it works at the logical level instead of the block level, where you replicate data block by block. In logical replication you replicate at the SQL level: the changes captured at the end of a transaction are replayed on the other side, in what you can think of as a publisher and subscriber model. It's definitely more flexible. Compared to streaming replication, you can create publishers and subscribers at a table level, and even more granularly at a row level or a column level. You have much more flexibility than streaming replication provides. However, it has two caveats that hinder us, at least as of now, from adopting logical replication. The biggest one is schema changes: no DDL changes are replicated. You have to figure out some way, build custom tools, so that when you make schema changes they're replicated to all the other replicas. That's just not easy. The other big problem is that logical replication is still not performant at scale. For a cluster of maybe gigabytes in size it's fine, but not at the multiple terabyte scale. Logical replication is not yet there.
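Here's a hedged sketch of the publish/subscribe model just described, with hypothetical table names and connection strings; it also reflects the caveats above (DDL is not replicated, and throughput lags streaming replication at scale).

```python
# Logical replication sketch: publish one table on the primary and subscribe
# to it from another cluster. Names and hosts are hypothetical.
import psycopg2

# On the publisher (primary):
pub = psycopg2.connect("host=primary.pdx.internal dbname=control_plane")
pub.autocommit = True
with pub.cursor() as cur:
    # Publish a single table; newer Postgres versions also allow row and
    # column filters for finer-grained replication.
    cur.execute("CREATE PUBLICATION dns_changes FOR TABLE dns_records")

# On the subscriber (another cluster or region):
sub = psycopg2.connect("host=replica.lux.internal dbname=control_plane")
sub.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
with sub.cursor() as cur:
    cur.execute(
        """
        CREATE SUBSCRIPTION dns_changes_sub
        CONNECTION 'host=primary.pdx.internal dbname=control_plane'
        PUBLICATION dns_changes
        """
    )
```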

Cluster management. One of the biggest reasons we have cluster management is, as we all know, database failures. Whether we like it or not, failures do happen, for a multitude of reasons: something as simple as a logical failure, like data corruption on one of the primary databases - what do we do then? - or more severe things like natural disasters. In any of those situations we have to trigger a failover, and doing that manually is not fun. Another fact: at least 5% of hardware is faulty at any given time. On a fleet of multiple thousands of servers across multiple data centers, at pretty much any point in time there is some kind of failure going on. It could be as simple as a RAID failure on a disk, all the way up to an availability zone going down. How are you going to be prepared for these failures without good cluster management?

We picked Stolon, which is a pretty thin layer running on top of the Postgres clusters themselves. It's open source software written in Go, and around 2018 or 2019 it started picking up. The features that really drew us to Stolon are, one, it's Postgres native: it speaks Postgres. Also, it supports multi-site redundancy, in the sense that you can deploy a single Stolon cluster distributed across multiple Postgres clusters, and those clusters can be located in a single region or distributed across multiple regions. There are a few more interesting aspects. It provides failover, which is obviously what you would expect from a cluster management tool, and it's stable: we have seen very few false positives. Digging deeper into our cluster management, the first component is the Keeper. Keepers are the parent processes that manage Postgres itself; any change that has to happen goes through the Keeper. You can think of Keepers as basically the Postgres process. Then there are the Sentinels, which I consider the orchestrators: they look at the health checks of each Postgres component and make decisions such as, is the primary healthy, or should we start an election and figure out who the new primary should be. Finally, there's the proxy layer, which all the clients connect to; the proxies make sure traffic goes to the current write primary, so we avoid any kind of multi-master situation.

Connection Pooling

Kwan: Database connections are a finite resource, which means we need to manage them efficiently. We all know there's overhead when you open a new TCP or Postgres connection, but what's actually happening under the hood? Postgres connections are built on top of TCP, so they require the three-way handshake. Following this, we also have an SSL handshake to secure the communication. Also, since Postgres dedicates a separate operating system process to each connection, the main postmaster process has to fork itself to execute queries for the connection. This forking, as we all know, involves new page allocations and copying memory from the parent process to the new child address space. In other words, each connection requires a nonzero amount of CPU time and consumes finite resources such as memory. When we have thousands of these connections open concurrently, these three steps start eating up a lot of CPU time that should otherwise be used for transaction processing in the execution engine. A server-side connection pooler opens a finite number of connections against the database server itself, while exposing a black box interface that speaks the same wire protocol Postgres supports. Clients connect to the pooler the same way they would connect to the database for queries to be sent. Managing pools in a separate server allows us to recycle previously opened connections while minimizing the total number of open connections on the database server. It also allows us to centrally control tenant resources for each client, such as the number of allocated connections. Internally, we've actually done a lot of work in forking PgBouncer and adding interesting features that take inspiration from Vegas congestion avoidance to give us stricter multi-tenant resource isolation.
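A rough way to see the cost being described, assuming a reachable database at a hypothetical DSN: a brand-new connection pays for the TCP handshake, TLS negotiation, and a backend fork, while reusing an already-open session pays only for the query.

```python
# Hypothetical timing sketch: fresh connection vs. reused connection.
import time
import psycopg2

DSN = "host=db.internal dbname=control_plane sslmode=require"  # made-up DSN

start = time.perf_counter()
conn = psycopg2.connect(DSN)          # TCP + TLS + server-side fork every time
print("new connection:", time.perf_counter() - start, "seconds")

start = time.perf_counter()
with conn.cursor() as cur:            # reusing the session costs only the query
    cur.execute("SELECT 1")
    cur.fetchone()
print("reused connection:", time.perf_counter() - start, "seconds")
```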

In our specific setup, we opted for PgBouncer as our connection pooler. Like we said, PgBouncer speaks the same wire protocol as Postgres, so clients just connect to it and submit their queries like they normally would. We particularly chose PgBouncer to shield clients from the complexity of database switches and failovers. Instead of clients having to think about where the new database is after a failover and where they should connect, PgBouncer manages this and abstracts it away. PgBouncer operates as a very lightweight single-process server which handles network I/O asynchronously. This allows it to handle a much higher number of concurrent client connections than Postgres. PgBouncer introduces an abstract concept known as client-side connections, which is not the same as a direct Postgres server connection. When a client establishes a new connection, it obtains one of these client-side connections from PgBouncer. PgBouncer then multiplexes queries, relaying queries originating from many client-side connections across a smaller number of actual Postgres server-side connections, which go through HAProxy. Why is PgBouncer so lightweight? PgBouncer employs a non-blocking model to handle network I/O efficiently. Unlike the traditional approach of allocating a separate thread or process for each TCP connection, which Postgres follows for compute-bound requests, PgBouncer utilizes a single thread executing a single event loop. We only need one thread's stack space for all the required connections; there is, however, a bit more complexity in managing the state for all of these clients. The loop monitors sockets for read and write events using the epoll mechanism. Instead of actively polling or spawning a new thread for each new connection in userspace, we just tell the kernel, these are the socket file descriptors I want monitored, and let the kernel do the rest. PgBouncer then calls epoll_wait, which tells the operating system, put me to sleep until TCP packets have arrived on one of these file descriptors, then notify me. When data arrives, the kernel wakes PgBouncer and provides the associated file descriptor. We can then process logic in userspace, such as making an authentication call against Postgres' pg_shadow, before forwarding the query, making it a memory-efficient solution for managing and relaying between a large number of client and server connections. One of the interesting challenges we've had with setting up PgBouncer is having multiple single-threaded PgBouncer processes utilize all the CPU cores on a single machine, because it is a single-process program. We wanted them all to listen on the same port, to reduce complexity for application teams so they don't have to think about which port to use for a specific PgBouncer process. Luckily, we looked to the operating system again for a solution and found that you can bind multiple TCP sockets from different processes to the same port. We've actually patched PgBouncer in open source to use the SO_REUSEPORT socket option when opening a new TCP socket to listen on.
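The sketch below illustrates those two operating system facilities in a hedged way: several processes can bind the same port with SO_REUSEPORT, and an epoll-style event loop (Python's selectors module uses epoll on Linux) sleeps in the kernel until a registered socket is ready. It is only an illustration of the mechanism, not PgBouncer's actual C implementation.

```python
# Illustration: SO_REUSEPORT listener plus an epoll-backed event loop.
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux

lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # many processes, one port
lsock.bind(("0.0.0.0", 6432))  # PgBouncer's conventional port
lsock.setblocking(False)
lsock.listen()
sel.register(lsock, selectors.EVENT_READ)

while True:
    # Sleep in the kernel until at least one registered socket is readable.
    for key, _ in sel.select():
        if key.fileobj is lsock:
            client, _ = lsock.accept()
            client.setblocking(False)
            sel.register(client, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(4096)  # a real pooler would relay this to a
            if not data:                   # pooled server-side connection here
                sel.unregister(key.fileobj)
                key.fileobj.close()
```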

Load Balancing

Finally, let's chat about load balancing. We use load balancers in front of our Postgres instances to distribute incoming database queries across multiple Postgres servers. Specifically, we've chosen HAProxy as our load balancer. Round robin load balancing has been an effective way to distribute queries evenly across Postgres instances, preventing any single instance from becoming overwhelmed. Similar to PgBouncer, a load balancer such as HAProxy also provides high availability and fault tolerance by automatically routing traffic away from failed or unresponsive Postgres servers to healthy ones, which avoids downtime from degraded database instances. We use HAProxy to relay TCP connections at layer 4, incurring minimal overhead. Fun fact: it uses the kernel splice system call, which attaches an inbound and an outbound TCP stream in the kernel. This means that data received from the inbound TCP socket is immediately transferred to the outbound socket and forwarded without having to be copied into userspace.
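For the curious, here is a hedged sketch of that splice idea in Python (os.splice requires Linux and Python 3.10+). HAProxy does this in C; the detail worth noting is that splice moves bytes between the sockets through a kernel pipe, so they never enter userspace buffers.

```python
# Illustration of splice(2): relay bytes between two connected sockets
# entirely inside the kernel, via an intermediate pipe.
import os

def relay(src_sock, dst_sock, chunk=65536):
    r, w = os.pipe()  # splice requires a pipe as one end of each transfer
    try:
        while True:
            n = os.splice(src_sock.fileno(), w, chunk)   # socket -> pipe (in kernel)
            if n == 0:                                   # peer closed the connection
                break
            os.splice(r, dst_sock.fileno(), n)           # pipe -> socket (in kernel)
    finally:
        os.close(r)
        os.close(w)
```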

Challenges and Solutions

We're going to dive into some of the interesting problems and challenges across our database infrastructure, and share some of the performance tips and tricks that helped us in our journey to high availability at the edge. First up, replication lag. We found that Postgres replication lag becomes extremely pronounced under heavy traffic, especially for applications that have many auxiliary or redundant data structures and schemas, such as bloated indices. In these cases, write operations become amplified, because we're updating them on the primary, rebalancing our index [inaudible 00:26:24], and we need to replicate all of that as well. Other bulk write-heavy operations such as ETL jobs, data migrations, and even mass deletions for GDPR compliance are also common offenders. Another interesting cause of replication lag is automatic storage compaction, otherwise known as autovacuum in Postgres, which executes on an interval. When data is deleted, dead tuple slots become fragmented on disk, so this process goes ahead and cleans up those slots, avoiding fragmentation. Replication lag is an unavoidable problem in any replicated distributed system, since data needs to cross the network path. Internally, we target an SLO of 60 seconds of replication lag, and we give that number to upstream application teams to work with. To minimize replication lag, we first try to batch our SQL writes into smaller chunks to avoid replicating large blocks of data all at once. One way we sidestep replication lag and maintain read-after-write consistency is by caching, or by reading directly from the primary or the synchronous replica right after writing. Finally, we can avoid replication lag by simply ejecting all replicas from the cluster, leaving only the primary: no replicas means no replication lag. While this might sound insane, this approach requires a deep understanding of a tenant's query workload and its potential volatility, and it can significantly change the system's resilience. You can think of it as the unsafe keyword in Rust.
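As a hedged illustration of watching that 60-second lag SLO, the queries below use standard Postgres statistics views; the host names are hypothetical and this is not Cloudflare's actual monitoring.

```python
# Sketch: measure replication lag from both sides of the stream.
import psycopg2

# On the primary: how far behind is each standby, in bytes of WAL?
primary = psycopg2.connect("host=primary.pdx.internal dbname=postgres")
with primary.cursor() as cur:
    cur.execute(
        """
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
        """
    )
    for name, lag_bytes in cur.fetchall():
        print(name, lag_bytes)

# On a replica: how old is the last replayed transaction, in seconds?
replica = psycopg2.connect("host=replica.local.internal dbname=postgres")
with replica.cursor() as cur:
    cur.execute("SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
    lag_seconds = cur.fetchone()[0]
    if lag_seconds is not None and lag_seconds > 60:
        print(f"replication lag {lag_seconds:.0f}s exceeds the 60s SLO")
```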

Another interesting incident, one of our largest public-facing incidents, happened back in 2020, when a series of cascading failures severely impacted our database's performance and availability. Our public API services experienced a drastic drop in availability, plummeting all the way down to 75%, and our dashboards for other critical network services became 80 times slower. From what we saw, both regions' primary databases had executed a failover and promoted their synchronous replicas cleanly. However, the primary database in the primary region soon started to crumble under the massive load of RPS traffic. This led to a second failure, of the newly elected primary database, which left us with no more synchronous replicas to promote. This is starting to sound like the beginning of a high availability nightmare. We found ourselves facing two choices: either promote an asynchronous replica and risk potential data loss, or suffer additional downtime by manually initiating a failover to the standby cluster in our standby region. For us, data loss is an unacceptable option, so we chose the latter approach.

What just happened here? After further investigation, we found that a network switch had partially failed and was operating in a degraded state. The rack with the misbehaving switch included one server in the etcd cluster which handles leadership election. When a cluster leader fails, etcd utilizes the Raft protocol to maintain consistency and select a new leader for promotion. However, the Raft protocol makes a simplification: all members simply state whether they're available or unavailable, and they're expected to provide accurate information or none at all. This works well for full failures like machine crashes, but you run into cases of undefined behavior when different nodes in your Raft cluster start telling each other different things. In this case, we had node 1, node 2, and node 3, and the network link between node 1 and node 3 was degraded by the faulty switch. Node 1 no longer believed node 3 was the leader, but node 1 and node 2 still had a link, and node 2 and node 3 still had a link, so node 2 acknowledged that node 3 was still the leader. In this situation, node 1 tried to initiate many leadership elections and would continuously vote for itself, while node 2 still saw node 3 and would vote for node 3. We ended up in a deadlock state where no leader could be elected, and when that happens, the etcd cluster becomes read only. Because the cluster was read only, the clusters in both regions were no longer able to coordinate with each other, which caused both Postgres clusters to fail over and promote their synchronous replicas, as we saw. What happens to the old primary? When we fail over and the synchronous replica is promoted, the old primary needs to begin undoing transactions that it had committed, because its transaction history could have diverged from the new primary's while we kept it available under a network partition. It unwinds all the way back, undoing our diverged history kind of like a git branch being undone, and then it needs to receive and replay all the new, correct transactions happening on the new primary that it missed. Our new primary then failed once again, because we had no more synchronous replicas to absorb the RPS. Once it failed, there was no longer a synchronous replica, meaning we had downtime. We also needed to figure out why it was taking so long for the resynchronizing replica to replay all of those transaction logs from the new primary.

There are three things going on here; this is a pretty large-scale issue. Number one, we had a hardware failure, which is just an unfortunate thing that happens. Number two, we had a Byzantine fault with Raft. This is a really rare event, and it's a known limitation of the Raft consensus protocol. Most implementations of Raft choose the simpler algorithm over a Byzantine fault tolerant variant that uses cryptographic hashing, which is harder to understand. The third problem was that Postgres was taking a very long time to resynchronize, when the new primary was sending back the write ahead logs for the demoted primary - now the new synchronous replica - to replay. We decided to tackle the third problem, because it was the one most under our control: optimizing Postgres internals. Typically, the write ahead log looks like this: you have the original timeline with the original history, and you have the old timeline, which diverges on its own path because it's no longer communicating with the original one and receiving its write ahead logs. What we found when digging into our logs was that most of the time was spent in rsync, copying over 1.5 terabytes of log files from the primary to the old synchronous replica that was resynchronizing. What's the solution? We optimized Postgres: instead of going all the way back in history and copying all of those files, we go back only to the last divergence point where the timeline forks, and copy files from there. That's about 5% of the data that was originally copied.

Doing this, our replica rebuild times dropped from over 2 hours to just 5 minutes - about a 95% speedup. We also submitted our patch as an open source contribution. As another benefit, we're now able to do this failover and resynchronization across clusters in multiple regions. What are the key lessons of this confusing, large-scale incident? Number one, we need to anticipate the effects of degraded state, not just fully failed state. For this, we have internal tools that run chaos experiments randomly on a cluster. Another thing we found is that it's a good investment to build and fix open source software yourself. You build in-house expertise that would otherwise be very hard to find. It's better than saying, let's just switch to a different tool because we're having problems with this one - you don't know if it's going to be any better. That's our open source commit that we're submitting to the Postgres CommitFest.
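For context, stock Postgres ships a tool, pg_rewind (which the speakers reference later), that applies the same divergence-point idea when resynchronizing a demoted primary against a newly promoted one. Here is a hedged sketch of how it is typically invoked, with hypothetical paths and hosts; it is not Cloudflare's internal patch.

```python
# Sketch: resynchronize an old primary's data directory from the new primary,
# copying only what changed since the timelines diverged. Paths/hosts are made up.
import subprocess

subprocess.run(
    [
        "pg_rewind",
        "--target-pgdata=/var/lib/postgresql/data",   # the demoted primary's data dir
        "--source-server=host=new-primary.pdx.internal port=5432 "
        "user=replication dbname=postgres",
        "--progress",
    ],
    check=True,  # raise if pg_rewind fails so we don't restart a broken node
)
```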

Access Data from the Edge? (Cluster of Clusters)

Ravichandran: It's not fun just maintaining one cluster in one region; think about all the different kinds of failures, like hardware failures. If components fail cleanly, that's so much better, but there are cases like this Byzantine hardware failure, where components are not fully failed but operating in a degraded state, which is much more challenging to handle than architecting your systems for full failures. We take these monolithic Postgres clusters in a couple of regions and distribute them across three or four regions. From the standpoint of an edge application - take a Cloudflare Worker - it looks like a cluster of clusters. This is where we are currently heading. We had a primary region based, most of the time, in the U.S.; we spread to the EU, where we have a couple of regions, and now we have branched into Asia. Pretty much all of them use Postgres streaming replication in a hub and spoke model, where the primary region replicates to all the other regions; there is no cascading replication involved here. We have to watch out for all the fun things Justin mentioned, like replication lag, because keeping all of them in sync is important. Also, when there is very high replication lag, we sometimes have to make sure we still meet the SLAs we've given to the application teams, which means diverting traffic all the way back to the primary region itself.

Having this kind of distributed Postgres is cool, but what is the actual benefit? Where can we use the strengths of a distributed Postgres? This is where we introduced smart failovers: look at this new topology and figure out which region should be the primary. The first attribute is as simple as follow the sun: keep your primary region moving as the day progresses. For example, start in the U.S.; when the day ends there, move your primary region to Asia. That largely solves the latency problem, because your writes are much faster when your primary database is closer to the vast majority of the population, or the client applications. All we're really doing is working around the fundamental limitation that Postgres is a single-primary system; if Postgres could be made multi-primary without a lot of the downsides, we wouldn't need to do this dance. The second attribute is capacity: certain regions have more capacity than others. Shipping hardware to a particular continent can be challenging - during COVID, for example, we couldn't get hardware to Asia in time, so we had to keep our primaries away from Asia most of the time. The third is traffic, which I already touched upon. The last one is compliance, where we are running into more challenges: for regulatory reasons we have to keep certain kinds of data in certain regions, and smart failover gives us a way to do that.

It's not that simple, though. It seems like you just take Postgres in one region, put it out in five regions, keep failing over, and with some of the optimizations we talked about earlier, especially pg_rewind, rebuild clusters in less than one minute. You have your primary region in the U.S., fail over to Asia, and what used to be your primary region in the U.S. can be rebuilt within a minute, which is good. It seems like we could keep doing this all day. It's not that simple. One of the biggest challenges we run into is the dependency graph. Failing over just your databases is not that hard - it's fairly hard, but it's a solvable problem. The biggest challenge comes when you bring your applications into the picture. How do we fail over just the databases without failing over entire applications? Especially nowadays, applications are duct-taped together from multiple components - message queues, Kafka, analytical databases - so it's not easy. Anyone here thinking, I'm just going to do this cross-region and fail over: look at your application dependency graph. Not just the first layer of application dependencies; you have to go all the way down and track every dependency layer, because even the last dependency is still going to suffer from a cross-region network call. The other big aspect is that it requires a lot of coordination. We haven't completely automated this yet, so we have to jump on a bridge call with 20 different team members from different teams to figure out how to do it. In our case, the dependency graph looks something like this: databases are level zero of the dependency graph, then there are fundamental systems - for example entitlements, billing, authentication, and configuration - which are used by the rest of the primitive applications like R2, SSL, and so on.

Database Trends

Finally, I want to quickly touch on the database trends I've been watching as a practitioner. One is providing embedded data at the edge, where you still keep your monolithic Postgres in three or four regions, and bring a lot of the key data into an embedded database, for example SQLite. You keep Postgres somewhat monolithic, or only somewhat edgy, and do the real edge with SQLite. When you use SQLite at the edge, make sure it stays Postgres wire compatible: there are open source projects that take the Postgres wire protocol and put SQLite underneath as the storage engine, so applications feel like they're still talking to Postgres. The second idea is persistence at the edge; with Cloudflare Workers and the like, others are definitely looking at bringing more data to the client side. Another interesting one, which you might have noticed, concerns separated storage and compute - the typical architecture where you have Cloudflare Workers running in Europe while your database is in North America. That's changing: a lot of the newest features colocate storage and compute. For example, Cloudflare Workers has a new feature called Smart Placement, which moves the Workers close to where the database is. Earlier, we pushed the code closer to the clients, but we noticed that a lot of those client applications spend most of their time talking to their database. Pretty much any business application is chatty; it wants to talk to the database five times just to respond to a single client request. So we've come full circle: we moved compute out close to the end users, and now we are slowly bringing that compute back to where the database is. Lastly, an area where we are running into a lot of new challenges is data localization, especially for the Europe region, while still using Postgres. This is where we are betting heavily on logical replication, which gives us the flexibility to replicate at the row level or even at a specific column level. This is still in an exploratory phase - we haven't put anything into production using logical replication yet - but I have a strong feeling that's the way it's going to go.

 


 

Recorded at:

Apr 03, 2024
