How Do You Distribute Your Database over Hundreds of Edge Locations?

Summary

Erwin van der Koogh explains a new model that Cloudflare has developed to distribute a database over hundreds of locations, and where it could go next.

Bio

Erwin van der Koogh has been passionate about technology in general and programming in particular since his first BASIC program at age eight. In his 20+ year professional career, he has helped dozens of companies on three different continents develop better software.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Van der Koogh: My name is Erwin van der Koogh. I'm a product manager at Cloudflare, working on the Workers platform. This entire talk is going to be about evolution. Seeing as this is a conference about cloudy things, we're going to be talking about cloud evolution. Over the past couple of years, we've seen a trend of logic moving from the region to the edge. By regions, I mean things like AWS's US-East-1. All major cloud providers like Azure, Google, or Alibaba have a similar concept. When I talk about the edge, I mean PoPs, or Points of Presence. They're spread out geographically, so that as many customers as possible are as close as possible to a particular PoP. One of the most important things they do is cache responses, so that your request doesn't have to travel halfway across the internet before you can get a response.

There are some obvious differences between these two, which greatly influence how things evolve on them. Regions have lots of specialized, different kinds of servers, and lots of different products and services running on top of them. A region consists of multiple data centers, but they are extremely close to one another. The edge, on the other hand, is very different. There are far fewer servers, but they're very uniform; they all look much the same. That's because there are far fewer products running on these servers. They're pretty isolated, because you want to distribute them geographically all over the world. Of course, there's the region's cousin, the multi-region, which is more regions, which means they're spread out geographically.

History of Compute

Let's first look at the history of compute. It starts out simply enough, with mainframe timesharing. Then we get to things like virtual machines, cgroups, containers, and micro-VMs. Then we start to see the first successful commercial Functions as a Service. AWS Lambda turns up, quickly followed by Azure Functions and Google Cloud Functions. Finally, we have our first product moving from running in a single region to running in multiple. Lambda@Edge now allows you to run the same function in 13 different AWS regions. But there is this massive chasm here. None of these products have made any progress towards the edge. There's a good reason for that, and it has to do with the environment. A product that has evolved in a region won't be able to thrive on the edge, which is why, if you want to be successful there, you have to grow up on the edge.

What does it look like over in the edge world? We started out with things like reverse proxies and virtual hosts. Nothing programmable yet, mostly static configurations. Then VCL came along, the Varnish Configuration Language, which allowed people to write much more programmable configurations. Things started to take off properly with V8, which formed the basis of Cloudflare Workers. WebAssembly came along, which Fastly quickly adopted for their Compute@Edge. Akamai and Netlify soon followed suit with their JavaScript-based solutions. We know that this chasm is impossible to cross, because when AWS, which already had Lambda and Lambda@Edge working, wanted to run something on the edge, they had to adopt NGINX's njs to build CloudFront Functions on top of.

Data Patterns

Let's talk about data. What patterns can we see there? It starts out pretty similar. We've got relational databases and SQL, which leads to things like MySQL and other databases, and then to things like RDS, a managed service. Then there is, of course, NoSQL. Because if you have SQL, you need NoSQL, which leads to things like document stores and key-value stores. A popular document store is MongoDB, and a popular key-value store is DynamoDB. We also see our first multi-region native database appear, which is Azure's Cosmos DB. MongoDB releases MongoDB Atlas, which is their multi-region offering. DynamoDB Global Tables go multi-region. Of course, there are things like GraphQL and FaunaDB that do multi-region as well. Still, the exact same chasm: none of these tools have made it all the way to the edge. Why? The first reason is pretty simple, and it's money. Storage is cheap, but it's not "hundreds of copies of my entire database" cheap. The next constraint is consensus. It's impossible to create consensus on the state of an entire database with latencies of 100-plus milliseconds and hundreds of nodes.

Evolution of Cloudflare

What does growing up at the edge look like for data? To answer that, I want to start with the evolution of Cloudflare itself. What did it look like for us? Our first database product was an internal thing. Quicksilver is a key-value store that is used as the backend for our control plane. It is one of the most critical components of our infrastructure, because without it, no updates can be propagated to our edge locations. Quicksilver wasn't affected much by the money constraint, because configuration data is really small; we only need to store the latest value, not an entire history. Because we have centralized writes, and then fan out any changes to the edge, we didn't have a consensus problem either. The next iteration was Workers KV. Workers KV is a key-value store for Cloudflare Workers. Money wasn't much of a problem, because the full dataset is only stored in a few central locations; at the edge, we only cache recently used entries. The story with consensus is pretty similar. Not only are the writes centralized, but there's no attempt to create consensus: the last write to arrive at the server wins. Many people, myself included, wouldn't call a database with last-write-wins semantics a proper database. The truth of the matter is that if you have an application with lots of reads, some writes, and a few updates, KV is a perfectly fine database for your use case. It turns out a lot of web applications fall into this category.
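As a rough illustration of that read-heavy pattern from a Worker's point of view, here is a minimal sketch, assuming a KV namespace bound to the Worker as MY_KV (the binding and key names are made up for the example):

export default {
  async fetch(request, env) {
    // Reads of recently used keys are served from a nearby edge cache;
    // everything else falls back to the central storage locations.
    let profile = await env.MY_KV.get("user:123", { type: "json" });

    if (profile === null) {
      profile = { name: "unknown" };
      // Writes go to central storage and propagate to the edge over time.
      // If two writes race, the last one to arrive at the server wins.
      await env.MY_KV.put("user:123", JSON.stringify(profile));
    }

    return new Response(JSON.stringify(profile), {
      headers: { "content-type": "application/json" },
    });
  },
};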

Durable Objects

Then we get to durable objects, and this is where things get really interesting, because durable objects are the new thing in town. They are actor-like objects with attached persistent storage. This sidesteps the money problem by sharding the full dataset over multiple clusters. Those clusters are spread out all over the world. A cluster is made up of a small number of nearby nodes, which is also what helps it solve the consensus problem. How does this all work in practice? One of the most fundamental things to understand about durable objects is that there can be only one: there can be only one durable object per identifier. If you now have the Princes of the Universe theme from Queen in your head, I'm not sorry. What this means is that there is only ever one durable object for a given identifier, and we can create durable objects close to where your customers are.

How to Get a Durable Object

Let's show some code examples to make things a bit more concrete. How do we get a durable object? There are two different methods that each have slightly different semantics. The first one is idFromName, which allows you to create a durable object with an external identifier, such as a user name or a company identifier. This will check if a durable object with that name already exists somewhere in the world, even in other clusters. If it does, it will connect to that existing one, wherever it may be. If not, it will create a new durable object. The newUniqueId method will create a durable object with a unique identifier. The most important thing here is that it creates that object as close as possible to your user. Everything else is the same; only these two methods differ. Also note that the only way to invoke a durable object is to go through a regular worker. The code shown here is a worker that gets a Counter durable object. It calls the fetch method on it and awaits a response.
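The slide itself isn't reproduced in this transcript, so here is a minimal sketch of such a worker, assuming a durable object namespace bound as COUNTER; the binding name and the name passed to idFromName are only examples:

export default {
  async fetch(request, env) {
    // Option 1: derive the ID from an external identifier such as a user name.
    // If an object with this name already exists anywhere in the world,
    // the stub talks to that one; otherwise a new object is created.
    const id = env.COUNTER.idFromName("some-user");

    // Option 2: create a brand-new object with a unique ID, close to the user.
    // const id = env.COUNTER.newUniqueId();

    // Get a stub for that single object and forward the request to its fetch method.
    const counter = env.COUNTER.get(id);
    return counter.fetch(request);
  },
};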

How to Initialize a Durable Object

The next bit is about initializing a durable object. A durable object is just a class. Initializing can be a bit tricky, because it's possible to have multiple requests coming in at the same time. If you follow this template, everything will work out just fine. We have a constructor, and an initialization promise that we store on the class and await before we proceed with any processing of requests. In the case of errors, it will handle the error properly and retry initialization.
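That template looks roughly like the following. This is a sketch of the lazy-initialization pattern described above, not the exact slide; the Counter class and the "value" key are just examples:

export class Counter {
  constructor(state, env) {
    this.state = state;
    // No async work in the constructor; initialization happens lazily in fetch().
    this.initializePromise = undefined;
  }

  async initialize() {
    // Load persisted state into memory once.
    this.value = (await this.state.storage.get("value")) || 0;
  }

  async fetch(request) {
    // Several requests may arrive concurrently; they all await the same promise.
    if (!this.initializePromise) {
      this.initializePromise = this.initialize().catch((err) => {
        // On failure, clear the promise so a later request retries initialization.
        this.initializePromise = undefined;
        throw err;
      });
    }
    await this.initializePromise;

    // ...request handling goes here...
    return new Response(String(this.value));
  }
}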

How to Use a Durable Object

Then, in the rest of the fetch method, you can do whatever you want, with both instance variables and persistent storage. You don't have to use either of those. In the example here, we increase the current value if the path name of the URL is /increment. We store that value before returning it to the worker, which can transform it again, if it so desires.
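The body of that fetch method might look roughly like this, continuing the Counter sketch above (the initialization guard from the previous snippet is left out for brevity):

  async fetch(request) {
    const url = new URL(request.url);

    if (url.pathname === "/increment") {
      // Update the in-memory value and persist it before responding.
      this.value += 1;
      await this.state.storage.put("value", this.value);
    }

    // Return the current value; the calling worker can transform it further.
    return new Response(String(this.value));
  }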

Full Transactional Support

Durable objects have full transactional support, like a real, proper database. For example, if you want to create a lookup table that can look up a Twitter handle by name, or a name by Twitter handle, you can do those two puts inside a transaction, and either both of them will succeed or both of them will fail.
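A sketch of what that lookup table could look like, using the transactional storage API inside a durable object (the key prefixes and method name are made up for the example):

  // Store both directions of the lookup atomically, so the table
  // never ends up half-written.
  async storePair(name, handle) {
    await this.state.storage.transaction(async (txn) => {
      await txn.put(`handleByName:${name}`, handle);
      await txn.put(`nameByHandle:${handle}`, name);
      // If anything throws before the closure returns, both puts are rolled back.
    });
  }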

What Would You Use It For?

This leads us to an important question: what would you use durable objects for? There are a lot of different things. You can do things like per-user caching: create a new durable object with a new ID, and it gets initialized close to the user, who can now query a cache that is just for them. Maybe you want to create a queue to do offline processing of things. You can do event streaming. Durable objects can receive and initiate WebSockets. A user can create a WebSocket to a durable object, and then the durable object can stream events to the user. Or you can take that one step further and do things like collaborative editing, where you have one durable object per document, and multiple people have a WebSocket connection open to that durable object.
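For the WebSocket use cases, the durable object can accept the connection itself. Here is a minimal sketch of a per-document object that broadcasts every message to all connected editors; the class name is made up, and error handling and cleanup of closed sockets are omitted:

export class Document {
  constructor(state, env) {
    this.state = state;
    this.sessions = [];
  }

  async fetch(request) {
    if (request.headers.get("Upgrade") !== "websocket") {
      return new Response("Expected a WebSocket", { status: 426 });
    }

    // One end of the pair goes back to the client; the other end stays here.
    const [client, server] = Object.values(new WebSocketPair());
    server.accept();
    this.sessions.push(server);

    server.addEventListener("message", (event) => {
      // Broadcast each edit to every editor connected to this document.
      for (const session of this.sessions) {
        session.send(event.data);
      }
    });

    return new Response(null, { status: 101, webSocket: client });
  }
}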

Another area where there is going to be a lot of interest going forward is regulation, with more and more rules coming into place that require certain governments or large organizations to store their data in a particular region. What we have to do today is spin up copies of our applications in all these different regions to keep data in those regions. With a durable object, you can specify a jurisdiction in which you want to store that durable object and its associated data. You can guarantee that a user's data, for example, never leaves the EU. A much more fun one is multiplayer games. We released the original Doom multiplayer on durable objects. More importantly, we haven't even begun to scratch the surface of what's possible with this region 'global.' This is just our journey, our evolution, and our model. It's quite possible that different people will come up with different ideas on how to store data at the edge. That has me so extremely excited, because data on the edge is just in its infancy. I, for one, can't wait to see where this will take us.
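A sketch of what pinning an object to a jurisdiction looks like from the fronting worker, assuming a namespace bound as USER_DATA and "eu" as the jurisdiction tag; in a real application you would create the object once and store its ID for reuse:

export default {
  async fetch(request, env) {
    // The object and its attached storage will only ever live in EU data centers.
    const id = env.USER_DATA.newUniqueId({ jurisdiction: "eu" });
    const userData = env.USER_DATA.get(id);
    return userData.fetch(request);
  },
};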

Questions and Answers

Fedorov: You mentioned that you want to store durable objects close to the users, and ultimately to the Workers that process data and send it to the users. Today at Cloudflare, how do you decide where to place the durable objects? Because I assume you run many points of presence, many locations.

Van der Koogh: Yes, of course. Durable objects aren't run in every server location, because we've got everything from small deployments, like a single rack in an ISP, to our own data centers. Clusters are a handful, four or five colos, that are close together. For example, a user connects to a worker and is looking to create a durable object. If that colo supports durable objects, the object will be created in that colo, in that data center. If it doesn't, it'll just go to the nearest colo that has durable object support.

Fedorov: Do you have a control plane, or do you use something like DNS to actually connect the request or user with a specific server? What's driving the routing of this request?

Van der Koogh: The answer is yes and no. There are the two methods to create a durable object. The easy one is when you say, give me a new durable object with a new unique identifier. In that case, we don't need to do much. The colo itself can generate that ID, because the colo ID is part of the identifier. Nothing else needs to be done: that colo specifically is the owner of that durable object, and the rest of its cluster is the backup for that durable object in the case of any outages. That doesn't require much of a control plane.

What is slightly different is when you have an external identifier, like when you use idFromName. Then we need to check if there's a copy running already, because there can only be one durable object instantiated at any time. That goes through the control plane to get consensus on: is there a durable object with that particular identifier? If there is, it will get back the identifier and connect to the correct data center to talk to that durable object. If there isn't, it will generate an ID. That ID will be cached by every data center for future reference, so they don't need to go back to this central repository. That's how the control plane works. Looking at an ID, everyone can figure out where to talk to. You first try the data center it's supposed to be running in, and if that fails, there are fallbacks to other machines in the same cluster.

Fedorov: What do you do when there are network splits, when you lose a segment of your network? What would be the resolution for you to recover from that?

Van der Koogh: The data center that the durable object is running in is also the master of the persistent storage, and the rest of the cluster is used to replicate the data. Network splits between clusters are fine, because there's nothing shared between different clusters. If you get a network split within a cluster, I'm not entirely sure what happens. One of the other questions was whether we use any existing database technology. The thing that we use to store the data and run the clusters is CockroachDB, and I'm not 100% sure how exactly those are configured. Network splits are very unlikely; I'm not going to say impossible, because it's probably not impossible.

Fedorov: What about a fire in a single PoP, in a data center?

Van der Koogh: A fire in a single PoP is definitely something we can survive. If the heartbeat of a durable object stops, the rest of the cluster will take ownership of its durable objects. Again, that's where the fallback comes into place.

Fedorov: Then your replication strategy minimizes the risk of losing multiple data centers at once, at which point it becomes an almost impossible event.

Van der Koogh: Yes. Again, that's the difference between the region and the edge. In a region, all your data centers are pretty close to one another; they're kilometers, or tens of kilometers at most, away from each other. We don't have that. We may have two, at most three, data centers in one city, so our nodes are spread out, by definition, basically.

Fedorov: What's the ideal size for durable objects? Do you have any recommendations, or do you know the range of what's supported?

Van der Koogh: The fun thing about durable objects is that they're great even without persistent storage, just using them as a method of coordination. One of the things we built at the company that I ran before it was acquired by Cloudflare was build log streaming. We were serving data to zero to n participants, and a durable object was perfect for that, because the builder would just stream the build log to a durable object. If anyone was interested in it, they would connect to the durable object and start streaming the events. Because it was a durable object, we only used in-memory state for storing the build log up to that point. That was really useful. You don't even need to use storage. If you do, I'm not aware of any upper limits. Hundreds of megabytes or gigabytes of data will be fine. I wouldn't worry about it.

Fedorov: Another question on semantics. You call them objects. Why are they called objects and not documents? Is there some object-oriented style that can be used to design the APIs, or is there another reason?

Van der Koogh: This was before I joined Cloudflare. There was a big back and forth between, do we call them durable actors or durable objects? Because they have a bit of both. They definitely have that encapsulation of behavior, because it's just JavaScript that you run in your fetch. You have [inaudible 00:28:56] data. It fits that object in the traditional object-oriented model.

Fedorov: How do you manage access to the durable objects? Do you have any access control and how do you manage permissions if you do?

Van der Koogh: You need to have a worker in front of your durable object, and that's where you would most likely do your access management. Or you can just pass that information through to the durable object for the durable object to do it. There are no built-in permissions or access controls. You have to do that in the worker itself, or you forward the HTTP request to the durable object and it does the check there.
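As a sketch of that pattern, here is a worker that does a simple check before forwarding to the durable object. The pre-shared key in the API_KEY binding is an assumption for the example; a real application would verify a session cookie or signed token instead:

export default {
  async fetch(request, env) {
    // Do access control in the worker, before the durable object is ever reached.
    if (request.headers.get("X-Api-Key") !== env.API_KEY) {
      return new Response("Unauthorized", { status: 401 });
    }

    // Authorized: forward the original request to the durable object, which can
    // also inspect headers itself if it needs finer-grained checks.
    const id = env.COUNTER.idFromName("shared-counter");
    return env.COUNTER.get(id).fetch(request);
  },
};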

Fedorov: What if you detect a remote durable object, so you don't have anything stored locally on the node that processes the request? Do you then fetch and materialize it locally, or just pass the information remotely? How do you manage the storing and the location of that object?

Van der Koogh: There will only be one. We will never move a durable object away from its data. If it's a remote durable object, we will connect to that remote durable object. It's a request-response; it's even modeled like an HTTP request and response, even though it isn't technically HTTP. You basically make an HTTP request to a remote durable object wherever it is in the world. That's why it works so well: your in-memory state is always going to be the same. If we were to instantiate it locally, you would have multiple copies and your state would be ruined.

Fedorov: In that case, if users move, do you have any data migration mechanism? Say the object was created while the user was in the U.S., but then the user moved to Europe. Will they have to keep talking to the U.S., or do you have some mechanism to move that data while keeping this single-copy concept?

Van der Koogh: Right now, you'd still be talking to the original location. It's definitely one of the things the team is working on; it's on the roadmap: how do you automatically migrate durable objects? It's not just if the user moves. If you have two things that connect to a durable object frequently, say one in the U.S. and one in Asia, there could be a possibility to say, maybe it would be better if we moved this thing to Europe, because it's closer to both. There's a bunch of scenarios where we could do automatic migrations of your durable objects. The groundwork is there, but it isn't finished yet. It's definitely one of the things we're really excited about: below this abstraction layer there's a lot of potential for optimizations.

Fedorov: Are you looking into similar optimizations for when the same user accesses from two different devices, or potentially on two different networks? Would that fall into the same category of optimization, or are there more nuances?

Van der Koogh: No, it's the same thing. It's just, how do you move the durable object to the place where it's the most optimal? It's not in the product right now.

Fedorov: This is a relatively young product, and there are plenty of ways for applications to find their users and the proper optimizations for them.

Van der Koogh: Absolutely.

Fedorov: Durable objects, as you presented them, make sense if you run your code on the edge, or maybe on the client side. What if you run the code in the data centers? In the introduction you mentioned the previous architecture, where we've been running code in a region or multi-region. How can one benefit from durable objects if your code is still in a data center, in a full region, not at a PoP?

Van der Koogh: This isn't an all-or-nothing thing. Again, take the build log streaming example we had: that was one of those things where the rest of the application was running perfectly fine in US-East-1, and we picked up durable objects just for the build log streaming. You get by far the most benefit out of it if you move to what we at Cloudflare call region global, or region Earth. That's how we're thinking; this is one of the foundations that we are building for that region Earth. A world where there is no U.S., or EU, or APAC; you've just got data, and we'll figure out where to store that data, depending on the constraints that you've set. That's the thing I hinted at before. For example, you can set jurisdictions for individual DOs, for durable objects. I can say, Sergey is in the EU, so his data is going to be stored in the EU, whereas I'm in Australia, so my data is going to be stored in an Australian data center. It's those kinds of things where it starts to become really powerful.

Fedorov: There is a question about leveraging legacy systems that do ID management. Imagine you have a legacy control system that manages the IDs for you. How would you introduce the durable object concept by integrating that with the control plane that already exists?

Van der Koogh: You mean if you have global IDs in your external system, and you want to use those IDs with durable objects? I'm not sure those legacy systems need to be part of the control plane. I assume that you may or may not want to have a durable object for things that either are or aren't yet in your legacy system. Durable objects, again, because they're just JavaScript, could check whether something is or isn't in there. I don't think there are any plans to allow external parties to integrate with the control plane.

 


Recorded at:

Feb 10, 2022
