Understanding Architectures for Multi-Region Data Residency

Summary

Alex Strachan discusses the challenges of building multi-region data storage: understanding why and when a business needs it, who the real stakeholders are, and who owns what.

Bio

Alex Strachan is a Staff Software Engineer working on Rippling’s regional cell-based architecture and next generation identity framework. Prior to joining Rippling he tackled a number of problems, from scaling Lob’s print and mail API to handle the volume from Fortune 100 customers to helping Minted re-model their products to reduce the client data requirements by 1000x.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Strachan: This talk is really about how to do data residency without making yourself and others miserable. The big thing, the motivating example is we want to put our data somewhere other than just one place. You could do that a lot of ways. You could do that for a lot of reasons. You could do disaster recovery. You could have geo redundancy. We're really talking about, I want some of my data to be somewhere else and have not any of it shared between those two regions. Truly the data lives somewhere else. That's what we mean by residency.

Roadmap

Who am I? What are the first principles we're going to be talking about? Then we'll actually go through some architecture diagrams towards the end. I work at Rippling. I'm a staff software engineer on the performance and architecture team. I've been working specifically on making Rippling a multi-region data residency enabled platform. We've been a global HR, and IT, and finance company for a while now. We are enabling EU, UK, and Canadian data residency in the coming years. We've been actively producing this exact work that we're doing today.

First Principles: Know Your Customer (KYC)

The first thing that you really need to know is, who are you doing this for, and why. You really need to know who your customer is. It turns out that this is not a GDPR requirement. The first thing anyone says, you come up and you say, we're doing data residency, we're doing data sovereignty, and everyone goes, the GDPR thing? You actually don't have to do this for GDPR. That's the reason to go find out who actually wanted this, because it's not a direct GDPR requirement. You can comply with GDPR, and many very large companies at the top of the industry do, without doing European data sovereignty. Why are we doing this? It turns out that actually, this is a thing that your customers want. Someone wants to buy this from you. Typically, this person is a company in the EU who is concerned about data privacy, is concerned about regulatory exposure to the U.S. There are a number of perfectly legitimate reasons that someone living in the EU doesn't want their data to go through the U.S., where it could be exposed to U.S. jurisdictional concerns. You, as the architect of this, need to go find out who actually wanted this thing. Why do they want it? You need to go talk to sales. You need to go talk to your customers. You need to go talk to legal, because the requirements for this project are actually going to be, what do you want to be able to say in the contract? What is it that you want to promise people about their data residency? Step one is, figure out what exactly the requirements are. At Rippling, our biggest thing was, we have customers in Germany who had specifically asked for this. We want to enable European customers to feel safe about having their data in Europe. That means a whole set of restrictions around what we could actually do. We don't have our regions able to talk to each other. Other people can, because they have different requirements. It's really important. I can't stress this enough, like two-thirds of the first phase of this project is just going to be you sitting in meetings with legal, understanding exactly what it is, and who it is that wants this thing.

Other reasons that you might want to do this, cell-based architectures are great. You can help scale your system. There's a ton of reasons to say that like, "I have a lot of European customers, I would love to have their data near them." You can do that without doing fully separated, fully isolated data residency. That wouldn't be a great use of it. Having your data separated and partitioned like this actually can be really good from a scaling standpoint, because you have independent databases and independent architecture. Great, distribute things. I just told you this wasn't for GDPR, but this can be a part of your compliance story. You may have jobs, audit jobs, all sorts of stuff that runs that's part of government A's package of compliance. The UK has a bunch of labor laws that are interesting and different than the U.S. We have to comply with them but we don't necessarily need the UK jobs running against the U.S. data. If you have everything in one big place, you have these jobs, and there is the potential, however hard you work against it, that the job could access the wrong data. It happens all the time when things are in the same place and share credentials. You may find that some governments don't get along that well with each other, and that their rules are mutually incompatible, or that their compliance jobs are odious to other governments. Those would be great reasons if you have to implement jobs like that, to have it in a different region, so that the data against which your compliance jobs are run is specific to the government to which it applies. That's why. Know who you're doing this for, know why you're doing it. Don't just say, "We have to do it for GDPR. We have to do XYZ thing." It is really going to depend on you, your company, and what you're trying to do.

First Principles: Truth and Trust

The next big principle that we're going to talk about is truth and trust. In a multi-region deployment, you need to have a firm answer on what the source of truth is for a particular piece of data. That is generally true. That's not just a multi-region concept. You should probably know where your data is supposed to be. However, in multi-region, I think it's very important to separate that which can be divided from that which cannot. You should pick something that your business treats as an atom, a container in which every single thing lives within one region. At Rippling, that is called a company. At your company, that might be something else. It could be a team, because your teams are in different regions. It could be an org. We say that all of the sub companies and subsidiaries all live in the same region. Again, that's going to be up to you. I highly encourage you to define an atom. Then you can say that inside of the atom, everything that is part of it will be in one region. When you make a cross-atom call, necessarily you are talking about a cross-region scenario. It is not hypothetical that you will have something in a different region. You may get lucky and they're in the same region, but any cross-atom call has to be treated as a cross-region call. Pick your source of truth.

Trust, so you have different regions, they have different data. You're going to find yourself having to have them talk to each other. There's a lot of reasons they need to talk to each other, things like auth tokens. My user signed in in the EU, and is actually part of a U.S. company. They need the ability to log into their U.S. instance and get the U.S. data. Does that mean that the U.S. needs to trust the EU issued auth token? Maybe. Maybe not. You're going to have to decide from first principles, because this is not something you can do after the fact. You have to decide what level of trust you can accommodate in your business between these two regions. For starters, if your only regions are going to be U.S. and EU, that's actually a relatively high degree of trust. If you have governments involved in your cell architecture, and you're going to put a cell in a country whose government is hostile to another government, you are going to have to think very carefully about how much those cells can trust and talk to each other. This comes up all the time with like, we have a U.S. data center, and we're going to add a Mainland China data center. It is very important to think about this before you do it the first time. Because if you have good answers to all of this, and you have a good philosophy and architecture grounding your design before you've built it, when they say, "We're going to use XYZ country, it's our next big market," you're ready. If you're grappling with that later, you're going to find yourself having to revisit a lot of the fundamental design patterns that you chose the first time.

You're going to have to do a threat model. Straight up, you're just going to have to sit down with your security team, maybe security consultants. You're going to want to understand what is it exactly that we are protecting ourselves from? What exactly are the regulatory requirements in the various countries? What can we trust? The big ones are, access tokens are going to be for sure something you're going to have to think about. The other one is database access. You may have a dozen other things you have to think about. Those two are absolutely concrete. Every single application that does this is going to have to answer those two questions. There's cryptography you can do to enable the trust. You can make a direct call once you get in. There are all sorts of stuff. If you go stateless, is it ok that the window of trust extends from when you authored the token? Revocation delays might not be acceptable with replication. It's going to depend entirely on what you're trying to do. At least those two you will have to think about. Usually, those two conversations, when you have them with your team, will lead to a bunch of other conversations around the other things that are relevant to your business. Start with those two.

First Principles: Do the Same Thing Every Time

The last core principle of multi-region, and of a lot of software engineering. This is not particularly multi-region specific. However, the problem here is exacerbated by multi-region. We'll go through this. Do the same thing every time. If you have an option where you have a fork in the road, and you could just not have that fork in the road, that's a better design. We'll go through a couple of examples of what I mean, and then we'll apply it to multi-region. This is an example from network routing that is from back in the day. My mentor Tim Worsley told me this back when I worked at Minted. You say, I have a router, I'm designing a router, and I'm designing a routing path. I want it to be as fast as possible most of the time. Then if there's a lot of load, I have this queue, and I can just queue up requests. On the face of it, that actually sounds like a relatively reasonable design. You're like, ok, most of the time, it'll be pretty fast, and that's cool. Everyone likes low latency connections. Then I have some strategy for dealing with overflow and spikes. Diagram this situation. We've got our router. We have 100 units of computing, whatever it is. Routing your request takes 1 unit, and adding to the queue takes 1 unit. Problem: if you get a single traffic spike, you may be knocked off the internet forever. How did we start with such a seemingly simple, straightforward design, and end up with a pretty bad failure case? Say you're running at more than 50% capacity. If we go back to the diagram, routing takes one cycle, so I could do up to 100 requests a second in regular routing mode. It takes another cycle to add to the queue, so that is two cycles per request for queued requests. If I have 51 requests a second, and I go into queue mode, that's 102 compute cycles to do my original baseline load of 51 requests a second. Not only does my queue grow to infinity, I also have the spike to add on, because I will never dequeue those. I'm just permanently delayed. This is never going to recover.
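The arithmetic above can be checked with a small simulation. This is a minimal sketch, not the model from the slides: the 100 units of capacity and the 1-unit route and enqueue costs come from the talk, while the `simulate` function, the one-second spike of 150 requests, and the rule that queue mode persists until the backlog is empty are assumptions made for illustration.

```python
CAPACITY = 100      # compute units per second, as in the talk
ROUTE_COST = 1      # units to route one request
ENQUEUE_COST = 1    # units to add one request to the queue


def simulate(arrivals):
    """Return the backlog (requests still queued) at the end of each second."""
    backlog = 0
    history = []
    for arriving in arrivals:
        if backlog == 0 and arriving * ROUTE_COST <= CAPACITY:
            # Fast path: everything is routed directly, nothing is queued.
            pass
        else:
            # Queue mode: every request is enqueued and later routed,
            # so each one costs two units and only 50 finish per second.
            backlog += arriving
            served = CAPACITY // (ROUTE_COST + ENQUEUE_COST)
            backlog = max(0, backlog - served)
        history.append(backlog)
    return history


spike = [150]  # assumed size of the one-second traffic spike
busy = simulate([51] * 5 + spike + [51] * 60)
calm = simulate([40] * 5 + spike + [40] * 60)
print(busy[5], busy[-1])  # 100 160: the backlog keeps growing after the spike
print(calm[5], calm[-1])  # 100 0: the backlog drains and the fast path resumes
```

At a baseline of 51 requests a second the router never leaves queue mode, while at 40 it recovers about ten seconds after the spike. The only difference between the two runs is which side of 50% capacity the baseline sits on.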

Obviously, we don't have situations where we just have queues growing to infinity very often. You're like, add more workers. I bet a lot of your systems, when you go in Datadog or something, look like that right-hand chart. We've modeled a router with 10 requests a second of baseline load and a 25 requests a second limit. There's some degradation when it goes into queue mode. In the period from 40 to 43 seconds, I gave it just 20 extra requests. In theory, I had capacity to handle that in a second and a half. What I actually get is this big latency spike, where my latency goes up to three-and-a-half seconds over like an 11, 12-second window. With just this extra load, less than one extra second of capacity, in this real system I get this huge latency spike on the right. This actually happens a lot if you have a garbage collection pause, or stuff like that. How does that apply to multi-region? You're going to be doing a lot of routing, not network routing, but like a person comes into your system and needs to get to the right data center. You have a choice. You could say, "They're going to be at the nearest data center most of the time, because the EU customer data is in the EU. Then, only if they're in the other region will I incur this extra cost." Try to avoid designs like that. Stick with designs where you do the same thing every time, and you say, I'm going to come in, I'm going to figure out where you go, and I'm going to route you. We'll cover that more concretely.

The other thing to do is maximize the likelihood of it working by reducing the number of code paths you have to cover. I put the clown emoji here, because this is meant as a joke. As your feature set grows to infinity, the percentage of those features that work converges to zero. No one has infinite features. You do have a limit to how much you can QA and test a specific code path. As you add complexity, as you add new features, stuff that wasn't tested or wasn't covered in tests, or that people aren't using, tends to break at a constant rate. That's work to fix buttons that people haven't clicked on. If you have a part of the app that people don't go to regularly or that your QA has been neglecting, what you'll find is that things fall off the working path. The actual thing is, the probability that something works is directly proportional to the fraction of users that use it divided by the complexity of the feature. It's not really good math. If you tried to do a probability distribution as a percentage, it's not going to work. This is conceptual. Take our scenario, where most of the people are going to route to the local region. That "most" means that the people who are routed to another region are a small minority, a low percentage of users. Routing between regions is a high complexity item. The probability that it works is actually quite poor. Really try to reduce the number of branches, try to reduce the number of forks in the road that you have to deal with. Do the same thing every time.

Things specifically to be cautious of: the geographic nearest region, the one we just covered. The problem with that is you have this very rarely used code path where you fork in the road, and you go to another region in exceptional circumstances. That's actually really hard to test. Because to have your QA do that, they need to set up an account for the other region. First off, you have to have region architecture stuff for them to test on. These brittle, rarely used code paths could be broken, and the six customers it impacts, you'll find out in a week that their experience is degraded, and now you've got to go debug it. It's brittle. It's rarely used. Try to avoid that. Another really similar example would be, most of our writes are actually going to U.S.1, like most of our customers are there. We have the shared table that we're going to replicate out from the U.S. data center. The problem with that is, your application is able to rely on synchronous consistency in the U.S., so you write to the database, and you read, and you get the same answer. When the EU path runs that same code, it writes to the database, and it reads, and it will not have synchronous semantics, it will be asynchronous. Try to avoid situations where you have that difference, because those bugs are going to be hard to track down, because your test probably ran in an environment similar to U.S.1 where you had synchronous consistency. The same goes for stuff like write latency. When you have to go cross-region to write, it's going to vary between the regions. You may not have a performance issue with a job, and then the first time you try to run it in the EU, it's going to make this huge network hop, generously 150 milliseconds, to U.S.1 in Virginia or something. It's used to having 1 millisecond latency to the database. It is 150 times slower all of a sudden. That's probably going to break the first time you run it.

An application of this principle as a diagram. The left-hand diagram is exactly what we said. We're going to have the app write to the database in region 1 and region 2. We're just going to have region 2 change out its write path to write to the U.S. database, and we'll replicate it back. Region 1 here has a synchronous read path to its local database, region 2 doesn't. This would be an example of something that would create subtle bugs and should probably be avoided. The right-hand diagram, however, doesn't have this problem, because we have perfect symmetry between the two regions. When possible, try to maintain this symmetry. You maybe have a little more complexity, because you have to stand up this global region of data, but it's a nice design: region 1, region 2, and presumably your test architecture, where you'll implement some version of the write database and replica, all have the same code path. If it works in one, it works everywhere. Now the probability of it working, if we go back, is proportional to the percentage of users divided by the complexity. The complexity term doesn't really come into play if you're at 100% of the users. Your app is just broken if the path doesn't work, and you'll know about that a lot faster than if you broke a niche edge case.
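The symmetric layout can be sketched in a few lines. This is a minimal illustration that assumes the right-hand diagram means one shared write database plus a replica read by each region; the `GlobalTable` class, its method names, and the `tax_rate` key are hypothetical, and `replicate` stands in for asynchronous replication catching up.

```python
class GlobalTable:
    """Hypothetical shared table: one global primary, one replica per region."""

    def __init__(self, regions):
        self.primary = {}
        self.replicas = {region: {} for region in regions}
        self.pending = []  # writes not yet copied to the replicas

    def write(self, key, value):
        # Every region writes to the same global primary.
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, region, key):
        # Every region reads from its own local replica.
        return self.replicas[region].get(key)

    def replicate(self):
        # Stand-in for asynchronous replication catching up.
        for key, value in self.pending:
            for replica in self.replicas.values():
                replica[key] = value
        self.pending.clear()


table = GlobalTable(["region1", "region2"])
table.write("tax_rate", 0.2)
before = [table.read(r, "tax_rate") for r in ("region1", "region2")]
table.replicate()
after = [table.read(r, "tax_rate") for r in ("region1", "region2")]
print(before)  # [None, None]
print(after)   # [0.2, 0.2]
```

No region gets read-after-write consistency here, so code that wrongly assumes it fails in every region, including the test environment, rather than only in the remote one.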

Architecture Patterns: Where to Draw the Circles

Let's talk through what we actually do to do this. Your application is an onion. You have your edge. You have your app servers. You have your database. We're going to put circles around it somewhere. To everything outside of this layer, the inside contents of the circle present as one item. Client routing: the easiest one is you just give everyone subdomains, eu1.yourapp.com, us1.yourapp.com, and you have the client know. This was the first-generation solution for cell-based architecture. For the real pioneers in the field, this was all that was available. This works great, but it is a little bit annoying to your customers because they have to implement all of the routing logic themselves, they have to know. For most of your customers, as long as their domain of concern is within the atom that you decided on in the source of truth, they're going to have a fine time, because they just keep interacting with the same region, and everything works the way they think. It's going to get hard if you move someone's region, though. Your customer submits a ticket saying, I'm in region 1, I want to be in region 2. All of their integrations are going to break, because all of their webhook URLs, all the stuff they stuck in their code somewhere, every implicit assumption they made about the URL based on client-side knowledge of your routing, all that is going to break. You will find use cases where this is important, and is something that you have to do. However, maybe don't consider it as your first-choice strategy.

Gateway routing. The next thing is we have atoms. We talked about the source of truth and an indivisible part of our application. What if you put the atom ID in the request, and you have a map at the edge that just knows where the things are? That actually works pretty well for most use cases. For traffic that lives within the atom boundary, that actually is a really good way to route you. You just include, in our case, the company ID. You include the company ID in the request as a header, as a query param, it doesn't matter, path param, fine, anything. That atom ID is now your routing key, and you just know what region. If you change regions, just update the mapping in the edge and the traffic will go to the correct region. This doesn't work so well if you have cross-atom traffic. You have a partner that needs to access two companies and they do different things in their different regions, you can't actually route this way. You can get to the partner's first region and then fan out and do stuff. You're going to need to do something other than just gateway routing in that case.
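A minimal sketch of that edge lookup, assuming the company ID arrives either as a header or as a query parameter. The `X-Company-Id` header, the `company_id` parameter, the region names, and the in-memory `REGION_BY_COMPANY` map are all hypothetical; a real gateway would hold the mapping in whatever store the edge can read.

```python
# Hypothetical atom-to-region map held at the edge.
REGION_BY_COMPANY = {"company-a": "us1", "company-b": "eu1"}


def resolve_region(headers, query, region_by_company=REGION_BY_COMPANY):
    """Pick the region for a request from its company ID (the routing key)."""
    company_id = headers.get("X-Company-Id") or query.get("company_id")
    if not company_id:
        raise ValueError("request carries no company ID to route on")
    if company_id not in region_by_company:
        raise LookupError(f"unknown company: {company_id}")
    return region_by_company[company_id]


print(resolve_region({"X-Company-Id": "company-b"}, {}))  # eu1
REGION_BY_COMPANY["company-b"] = "us1"  # the company moved regions
print(resolve_region({}, {"company_id": "company-b"}))  # us1
```

Every request takes the same lookup whether or not it ends up in the nearest region, and moving a company is a one-line change to the map rather than a change to client URLs.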

What do we do now? You could have your regions be able to talk to each other. That's actually a pretty good idea. When you have these regions, they are able to make calls. Again, you're going to come into the idea of trust between the regions. However, if you can have a situation where access cross-region is allowed by your security policy, by whatever it is you're trying to do, and it's ok to have cross-region data access, which again, often is the case, because this is not a GDPR compliance matter, data access for a legitimate business purpose is often permissible. Not always, check first. You will find that something like having your application be able to talk cross-region is going to be useful. This is another great example of, I have a query that needs to access 10 companies that I know about. If you have your atoms in some routing key, that's a great way where you can build a batch API. You have like a list of company IDs that this applies to, and the framework can automatically route to the appropriate regions. Again, use your atoms when you're able to. Sometimes you just won't be able to. You need to run an audit job on everything in every region. You could do that too. You just set up a service that doesn't work around the atoms, and then the regions can crosstalk. The problem with this is it's very complicated, and you get a lot of regional latency when you do actually have to make these requests. You got to remember the New York to Shanghai latency, just about the worst it gets is like 400 milliseconds. Most good APIs are already returning before that round trip has been made. That can be a killer. You'd carefully consider how many of these you're doing. You'll need to structure your app in such a way that it gathers the data requirements upfront, and they flow through the request, rather than, "I was missing this piece of data. I was missing this," and just iterating through and making a ton of these calls.
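The batch idea can be sketched as a scatter-gather that groups company IDs by region, so each region is called once per batch instead of once per company. This is an illustration under assumptions: `plan_batch`, `run_batch`, and the `call_region` callback are hypothetical names, and the calls run sequentially here where a real framework would likely issue them concurrently.

```python
from collections import defaultdict


def plan_batch(company_ids, region_by_company):
    """Group company IDs by region so each region is called once."""
    plan = defaultdict(list)
    for company_id in company_ids:
        plan[region_by_company[company_id]].append(company_id)
    return dict(plan)


def run_batch(company_ids, region_by_company, call_region):
    """Scatter one request per region, then gather results per company."""
    results = {}
    for region, ids in plan_batch(company_ids, region_by_company).items():
        results.update(call_region(region, ids))
    return results


regions = {"c1": "us1", "c2": "eu1", "c3": "us1"}
calls = []


def fake_call_region(region, ids):
    # Stand-in for the cross-region request; records that it happened.
    calls.append(region)
    return {company_id: f"data from {region}" for company_id in ids}


print(plan_batch(["c1", "c2", "c3"], regions))
# {'us1': ['c1', 'c3'], 'eu1': ['c2']}
result = run_batch(["c1", "c2", "c3"], regions, fake_call_region)
print(len(calls))  # 2 regional calls for 3 companies
```

Gathering the full list of company IDs upfront is what keeps the number of cross-region round trips bounded by the number of regions.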

Database-Managed Locality

Database-managed locality. There's a lot of databases out there that will say, we're a global database. They often are. They're actually really cool. They do stuff like replication. You can give a region key and it'll store the data in that region. We do have to talk through that a little bit, because you have to think through what kind of use case they were trying to enable to decide whether this magic database is going to work for you. There's accelerators. This is typically an active-passive topology, think like a CDN. You write once, all the regions get an eventually consistent copy of it from which they can pull to accelerate the data flow in the local region. Global databases, these are true global databases where they have some genuine active-active topology cross-region, where you can write from anywhere, you can read from anywhere, and they guarantee eventual consistency on it. Accelerator is not really that applicable for data residency, because inherently it is copying the data, which was the whole point of the project. However, they're really good at some stuff. My example of this is like the U.S. tax code is the same everywhere. Everyone who has dealings with U.S. business follows the same tax code. That data is a perfect example of something to replicate cross-region, using one of these eventually consistent accelerator type patterns where you write, and it replicates everywhere, and everyone has a nice local copy of this. Why is that a great example? The data is consistent. It's not related to a specific company. It doesn't change very often. That'd be great data to use this for. This can be a part of your solution. Obviously, it doesn't speak directly to the residency aspect of the solution.

With true global databases, you do run into a problem. Either you have an active-active topology where last write wins, and that can cause some issues, obviously, depending on your use case. If you have regions that may act on the same object at the same time, one region can say, I committed this. The second region says, I committed this, but only the data from one of those two commits is going to survive the reconciliation. If that's a write pattern that's applicable to your use case, that could be great. This could actually be a huge win for you. That's not often the case. If you have a conflict on a write, you probably need to resolve it some other way than just having last write win, for a lot of systems. Or you do this synchronously, and then you have to have a quorum between all the nodes, which is going to cost you the cross-region latency to achieve. That's going to be really slow. In the spec sheet, you don't typically see something about partial replication, where you can say, I want this set of data replicated, and not this other set. A few vendors can absolutely do that. You're going to have to carefully read your spec sheet, understand what technology is going to work for you and what is not, and then think this through. Unfortunately, "we're just going to stick a database proxy in front" is not typically a one size fits all solution for enabling data residency, as convenient as it sounds like it should be.
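A small sketch of why last write wins can lose data. The `Commit` type, the record fields, and the timestamps are made up for illustration, and the sketch assumes timestamps are comparable across regions, which real systems have to work to guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    region: str
    timestamp: float  # assumed comparable across regions
    record: dict


def last_write_wins(a, b):
    """Keep whichever commit has the later timestamp; the other is discarded."""
    return a if a.timestamp >= b.timestamp else b


# Both regions accepted a write to the same object at nearly the same time.
us = Commit("us1", 10.0, {"title": "Staff Engineer", "manager": "dana"})
eu = Commit("eu1", 10.5, {"title": "Engineer", "manager": "lee"})
winner = last_write_wins(us, eu)
print(winner.region)           # eu1
print(winner.record["title"])  # Engineer: the us1 title change is gone
```

Both regions told their caller the commit succeeded, but only one record survives reconciliation. Whether that is acceptable depends on the write pattern.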

Summary

Pulling it all together, you are going to have to ask your team who it is you're building this for, and what exactly the requirements are. The answers may surprise you. This is an incredibly important part of the exercise. When you're working through trust and truth, it's important to define an indivisible part of your architecture, an encapsulating idea, so that things above it can be cross-region and things below it never are cross-region. That will help immensely in keeping yourself sane when developing code. Do the same thing every time. When you see forks in the road, especially ones where region resolution gets tangled up with business logic, I would strongly encourage you to think of ways to work around those forks so that the code path followed is consistent. Again, when you're building your test and rollout plan, and when you're making changes in the future, it's going to be hard to know that you've broken one of these delicate mechanisms that's on a rare code path. Try to get rid of those.

Questions and Answers

Participant 1: We got a strong slant for data residency from a regulatory reason or some other business reasons. What if you're just interested in having data reside close to users for faster round trips, and you don't want or don't trust database replication?

Strachan: First off, the big differentiator between those use cases is, is it ok for my data to reside elsewhere also? Maybe data replication isn't the technology that you're going to choose to do it with. The differentiator between, my users want a fast experience, and I want to do data sovereignty and data residency, that makes data residency harder, is the lack of ability to replicate the data. Absolutely, you can use a lot of these techniques. Particularly the database accelerators are great. You can run audit jobs. If you're in a world where eventual consistency is ok, that's an awesome way to go about it. A lot of the disaster recovery scenarios and whatnot. The thing that makes it tricky to say like there's a huge win for if you don't trust database replication, is that all of the other layers at which you can do these are pretty difficult to combine effectively. If you're in a situation where the atom of truth for you is very well aligned with your business, and not a lot of things are going to make any cross-atom call, or even you could say, nothing is allowed to make a cross-atom call, then you can use some of these designs, like the gateway routing would be a great way to do that, if that's applicable to your business.

Participant 2: I totally agree with trust as a first principle. I got a little bit lost with the example of, do I trust the authentication token in another locality? Is that not just a policy decision at runtime?

Strachan: It can be. At Rippling, you can be employed by multiple companies. You in the context of your first login is with A company, and then you may switch to go see a W-2 from another company, you're a contractor for two companies. There's a ton of different scenarios where you could legitimately interact with multiple regions. The question is, you logged in, we gave you a token at first. You bounced to the EU. We decided that that first token for us is not good enough. You're actually going to have to go through a token exchange and get a token specifically. We did not trust in general the tokens. The policy that we use is like, if you're going to access a role, you're going to access your employment with the company, you need a company specific token as a general user. For stuff like where our partners are talking across companies, that's a different use case, we have a different set of challenges there. For regular users, what we actually said was you need an atomic token, which, multi-region or not, you're going to need to log into that specific company. For us, we don't trust the token at all. We don't even trust the token within the context of a multi-company call. You just get a specific company, which is an atom and a specific atomic token for that region. That was our solution to this.
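The company-scoped token idea can be sketched with an HMAC-signed token that names exactly one company and is checked against the issuing region's key. This is an illustration only: the talk does not describe Rippling's token format or exchange protocol, so the function names, claim names, and keys below are hypothetical, and a real token would also need expiry and revocation.

```python
import base64
import hashlib
import hmac
import json


def issue_company_token(user_id, company_id, region_key):
    """Sign a token that is valid for exactly one company."""
    claims = json.dumps({"user": user_id, "company": company_id}, sort_keys=True)
    body = base64.urlsafe_b64encode(claims.encode()).decode()
    mac = hmac.new(region_key, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{mac}"


def authorize(token, company_id, region_key):
    """Accept the token only if this region signed it for this company."""
    body, _, mac = token.partition(".")
    expected = hmac.new(region_key, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims.get("company") == company_id


US_KEY, EU_KEY = b"us1-demo-key", b"eu1-demo-key"  # demo secrets only
token = issue_company_token("alex", "company-b", EU_KEY)
print(authorize(token, "company-b", EU_KEY))  # True
print(authorize(token, "company-a", EU_KEY))  # False: wrong company
print(authorize(token, "company-b", US_KEY))  # False: signed by another region
```

Because the token is scoped to one atom, the same check applies whether or not the user's other companies live in a different region.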

Participant 3: The thing is, usually when you write a client to get anything, you need to have expectation of time. If you're making a call locally that time bound is pretty tight, cross-region is much higher. Also, it's dependent on how far the other region is, and that varies a lot. Are there any kinds of rules of thumb saying, if the region is this close, and then you can paper over that difference, say, I would just set a generous bound, and treat every region more or less the same. Or there's a point when you say we'll have per-region expectation, so, essentially, forward the configuration depending on the destination. Is any of the design patterns you talk about better or worse?

Strachan: First off, by doing that for configuration, we'd be in violation of do the same thing every time. It's a great principle, you might not always be able to apply it. The design I would suggest that probably is most applicable to this is your application layer routing. You probably don't want to leak this detail of cross-regionality or latency to your clients. They really would prefer not to care. They're going to make a request. Who they're making it to, and what atom they're trying to interact with, is going to be a detail of what they have to do. They don't get to choose. They don't really get to control it. It's going to take as long as it takes given the scenario at play. Obviously, your client is going to have expectations. Typically, those expectations are derived by their typical use case. Your developer is going to start working with this API and get some pattern and latency. Part of setting those expectations is to have your thing work the way it's going to work in prod for them while they're developing it. If they're always interacting with their own company, and it's always 400 milliseconds, because it's very far away, at least, it's consistent. I think, to some extent, there's no true answer to that question of like, how do you stitch this all together, so that, truly, the client does not have to care at all and can have the same expectation no matter what region they hit. The obvious answer is like, if you do hit New York to Shanghai, and it's 400 milliseconds, unless you're going to add a sleep 390 in every other region, you're going to have a different experience.

In the fanout step, you have this scatter-gather pattern within the application routing framework. You've got all the remote regions. You might say, I need to run this on every remote region. If you can scope that, so if you do atomic routing on your scatter-gathers and say, I'm going to access these companies, the odds that you hit a very far region, like the South Africa region, go down, unless there's a legitimate use case to do it. Again, it's all about, when you can, use your design pattern. Because again, if the client has a legitimate use case for accessing highly remote data, it will take longer, and then it's just an expectation management problem. To the extent that you can, try to keep things consistent. Especially for your first time, or early deployments of multi-region, I would highly advise that. You can do this, and then as your company grows, and you find you have these niche use cases, you can add routing infrastructure later that will help deal with some of these specific use cases. Probably for the beginning, I would start simple.
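A scoped scatter-gather along these lines could look like the sketch below. It fans out only to the regions that own the requested companies rather than to every region. The mapping `COMPANY_REGION`, the stub `query_region`, and the function `scoped_scatter_gather` are assumptions made for illustration, and the remote call is replaced with a local stand-in.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical atom-to-region mapping; "za" stands in for a far-away region.
COMPANY_REGION: Dict[str, str] = {
    "acme": "us",
    "initech": "us",
    "globex": "eu",
    "umbrella": "za",
}


def query_region(region: str, company_ids: List[str]) -> List[str]:
    """Stand-in for a remote call to one region (assumption: no real I/O)."""
    return [f"{region}:{company_id}" for company_id in company_ids]


def scoped_scatter_gather(company_ids: List[str]) -> Dict[str, List[str]]:
    """Fan out only to the regions that own the requested companies."""
    by_region: Dict[str, List[str]] = defaultdict(list)
    for company_id in company_ids:
        by_region[COMPANY_REGION[company_id]].append(company_id)

    # One worker per touched region; at least one so an empty request is valid.
    with ThreadPoolExecutor(max_workers=max(1, len(by_region))) as pool:
        futures = {
            region: pool.submit(query_region, region, ids)
            for region, ids in by_region.items()
        }
        return {region: future.result() for region, future in futures.items()}


if __name__ == "__main__":
    print(scoped_scatter_gather(["acme", "initech", "globex"]))
```

Because the request names its companies, the far region is contacted only when one of its companies is actually requested.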

Participant 4: What happens when you start the process if the final atom size isn't what you want to be data resident and that actually changes over time. Your next customer wants more things to be data resident, if you didn't make that decision in the beginning, how do you deal with the transition to decide, do you move the line or how do you adjust the customer expectations? More practically, how important is it to build the ability to move customers between regions as the first feature you might want to build?

Strachan: I can tell you exactly what our reasoning is. It's quite simple. If you're a Rippling customer today, obviously, it wasn't a blocker to you buying Rippling. If you're not a Rippling customer, it may be a blocker, and we have enough people who are not Rippling customers who are telling us they want this. As a business, you could go out and say, "If it wasn't a blocker, clearly, you've already bought us, we don't care. We're going to build it only for new customers." That'll probably get you until the first renewal cycle, when the customers for whom it was a tough sell to say, we're going to host you in the U.S., are like, you have data residency now, move us. It's going to be a tough sell at that renewal to say, "We've built this thing that does exactly what you wanted, always in your dreams, but you can't have it." The answer, I would say, is that it is not a launch blocker to multi-region. However, you should expect to need to do it probably within the first year or so, or at least have a good timeline or answer on when you are going to enable it.

The balancing act here is between the complexity of breaking up those atoms and not. If you draw the line higher, and you say more of these things are going to be data resident at the beginning, you have more problems, because you've made more complexity for your initial launch. That complexity pales in comparison to moving the line afterwards. Set the line as high up as you can. Everything that is customer data, ask your legal team what it means to have customer data. Another example at Rippling is like, you're a person, and you have some personal data, but you're employed by the company, and they own the data about your employment. Where do you draw that line? What is personal data, what is company data? For us, the answer is actually that much more of it is company data than you would think, maybe, if you came into this naively. If you understand the HR and compliance space, it actually probably wouldn't be a surprise to you where we drew the line. It's something where it's like, if you draw the line too big, you're making a small amount of pain upfront. If you draw the line too small, you're making a great amount of pain later. I would err on the side of saying something like, all company data has to be within the same atom. You could go larger than the company, you could say the whole org, like a collection of companies, actually lives colocated in one region. That may or may not work for your business. Those are different things you're trying to balance in terms of where you draw that line.

Participant 5: I come from the telecom space. Our main KPI is basically to drive down the cost to serve. We're basically driven by ensuring that our customers are on-demand, [inaudible 00:43:30]. We operate a lot of ephemeral instances. We have to do a lot of renewal ourselves also. Do you have any tips for dealing with, let's say, dynamic risk or atoms that dynamically respond to load? In this case a region doesn't always exist; sometimes it exists for inputting this, and then it goes back out.

Strachan: Yes, to the extent that you can follow the gateway routing pattern, you're going to have a much better time with ephemeral atoms, because, yes, it pops into existence. As long as you update the global datastore that is powering this mapping of thing to region, you're going to be fine with ephemeral regions. If you can't do this, it gets ugly. If you have this scenario, you're going to have to update the map dynamically in the application router, and that's going to get hairy.
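One way to picture the "update the global datastore" step for ephemeral regions is the sketch below. It uses an in-memory, lock-protected directory as a stand-in for the global datastore that a gateway would consult. The class `RegionDirectory` and all of its method names are hypothetical and chosen only for illustration.

```python
import threading
from typing import Dict, Optional, Set


class RegionDirectory:
    """In-memory stand-in for a global atom-to-region datastore (assumption)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._live_regions: Set[str] = set()
        self._atom_to_region: Dict[str, str] = {}

    def region_up(self, region: str) -> None:
        """Record that an ephemeral region has come into existence."""
        with self._lock:
            self._live_regions.add(region)

    def assign(self, atom_id: str, region: str) -> None:
        """Map an atom to a region; refuse regions that are not live."""
        with self._lock:
            if region not in self._live_regions:
                raise ValueError(f"region {region!r} is not live")
            self._atom_to_region[atom_id] = region

    def region_down(self, region: str) -> None:
        """Retire a region and drop every mapping that pointed at it."""
        with self._lock:
            self._live_regions.discard(region)
            self._atom_to_region = {
                atom: r for atom, r in self._atom_to_region.items() if r != region
            }

    def lookup(self, atom_id: str) -> Optional[str]:
        """Return the atom's current region, or None if it has none."""
        with self._lock:
            return self._atom_to_region.get(atom_id)


if __name__ == "__main__":
    directory = RegionDirectory()
    directory.region_up("burst-1")
    directory.assign("tenant-42", "burst-1")
    print(directory.lookup("tenant-42"))
    directory.region_down("burst-1")
    print(directory.lookup("tenant-42"))
```

The gateway only ever calls `lookup`, so regions can appear and disappear without any change to the routing code itself.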

Recorded at:

May 24, 2024

Alex Strachan
