BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Presentations Architecting a Centralized Platform for Data Deletion at Netflix

Architecting a Centralized Platform for Data Deletion at Netflix

49:41

Summary

The speakers discuss the architectural challenges of executing safe data deletion across distributed datastores. Balancing durability, availability & correctness, they explain how to orchestrate multi-system deletion propagation without impacting live traffic. They share lessons on controlling tombstone accumulation, building continuous audit loops, and gaining trust with a centralized platform.

Bio

Vidhya Arvind is Tech Lead & a founding architect for the Data Abstraction Platform @Netflix, Previously @Box and @Verizon. Shawn Liu is Senior Software Engineer @Netflix, building reliable and extensible systems for consumer data lifecycle at scale.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Vidhya Arvind: Have you run this command and immediately panicked? Yes, I have done that. The next command I ran is pwd, because I want to see where I ran this command. Imagine a single command can become disastrous, and years of production load gone in the wind. A chilling reminder that we have to take data loss seriously and deletion seriously. I want to talk now about a production incident that happened and walk you through what happened there. A late-night deployment, an engineer had a lot of caffeine and he wanted to deploy across the fleet during off-peak hours, of course, and a fatal command, rm -rf. I want to delete everything. Then that keystroke reminded him that there is disaster waiting for him, and panic hits.

A cascading impact of that, he has to deal with it for the next few days or maybe even weeks. Have you ever felt safe deleting data? I never feel safe about deleting data. There is unintended deletes and there's not deleting. Both have cost. Unintended delete is when you have an incident where you didn't intend to delete anything but it happened, now you have to deal with the precautions. Not deleting anything also has cost, storage cost and customer trust being eroded. When you have these two and you have critical systems databases which are having different variety of mechanisms to delete the data, really, it's a balance between durability, availability, and correctness. How we are doing that is going to come with human cost.

This is how I looked when I deleted the data. There is a human cost involved in all of these incidents, when there is a data loss. That is, you're racing against time. You have stress, guilt, and fear hidden. You're trying to see, how can I have not done it or undo it? The core of this crisis is to see, how can I never do it? Can I put 100 guardrails before somebody hits the delete button? Deletion cannot be an afterthought. It has to be thought through very carefully. I'm Vidhya Arvind, Staff Engineer at Netflix.

Shawn Liu: I'm Shawn, Senior Software Engineer at Netflix.

Pillars of Safe Deletion - Durability

Vidhya Arvind: I want to start by telling you about the pillars of safe deletion: durability, availability, and correctness. Durability ensures that we are deleting data, and the data stays deleted is important. How do datastores delete data? When you think about that, you want to think about a few ways of deleting data. You have time to live. You can set a time to live in a database. You can do hard deletes, like issue a delete command, or you can do soft deletes like GC does, like mark-and-sweep. When you are maintaining databases like this, like Cassandra, EVCache, RDS, DynamoDB, Redis, Elasticsearch, all of these systems do deletions in a very different way. The tradeoff for each of these operations are performance, operational risk and cost.

Let's talk about time to live. It's a very common operation. You set a time to live in some of these databases and the database takes care of deleting the data when the time hits. Cassandra natively supports TTLs. It does not delete the data immediately. It keeps the data until the GC grace period passes and then it tries to compact away the data. DynamoDB, you can mark with an attribute with expire at, and then it goes and does some background tasks. EVCache, it's an LRU cache. It is not automatic but lazily deleting data. Redis, similar. It supports TTL and it has periodic background cleanup that happens. Elasticsearch does not. Elasticsearch and both RDS does not support native TTLs.

You have to do some scheduling or mark-and-sweep approach. Elasticsearch has lifecycle management that you can do. In all of this, the common thing is background processes like compaction, vacuuming, or some code that you've written that goes and deletes the data. What is the cost of these deletes? The cost of delete is when compaction, vacuums, background tasks are running, it has a resource utilization cost. It has CPU costs, CPU spikes, when there are too many things to delete. There's increase in latencies because you're reading tombstones along with the data. There's read timeouts that can happen. There's storage footprint that can increase.

Let's talk about hard deletes, another type of delete I talked about. It's, you're issuing a delete command. Each of the systems, again, deals with it differently. Cassandra, again, when you issue a delete, it's just a marker. It's all immutable data. It just issues a marker and it's, again, tombstones underneath. DynamoDB immediately deletes data. EVCache, it has LRU cache lazily evaluated, but then your slab allocation is not gone away. It's still in place and some kind of merging has to happen. Redis, very similar story. Fragmented memory, you need some kind of merge process. Elasticsearch, segments are not merged. It stays until the background process comes and does something.

RDS also needs vacuuming. Again, some of the systems need some kind of background processing, not all, and the risks are different. Soft deletes, again, this is application level. You do a mark of the data by saying my column is deleted, and then you run a background process to come and delete the data. The hidden cost of deletion for Cassandra and Elasticsearch and RDS are high. Others like DynamoDB and EVCache are low. Everything has cost and you really need to look at the cost. Can you trust automation to delete the data? You often want to automate all of this so that we can get away with stuff.

It's like pushing a domino and thinking, ok, it's going to flow through to the end and delete every resource, and you often feel that automation sometimes fell short. There can also be ghosts in the system. Have you noticed the ghosts in the system? You have resurrecting data, lingering data that reappears again and again. Zombie processes, data resurrection, all of these are really important things to look at. I want to talk to you about an incident that happened just last month at Netflix. We had a misconfiguration in our Cassandra cluster. That misconfiguration was overlooked and our processes were not getting restarted greater than 24 hours. What happens when your process is down for more than 24 hours?

Your GC craze hits and the data that was supposed to be deleted is not deleted. Now you bring the node up again. It's an operator error. When it comes back again, the data reappears. The ghost comes back. It's a cascading impact. Now the data that's supposed to be deleted is coming back again. Now what do we do? How do we fix this data? First you have to identify it. What complicates this problem more?

Shawn Liu: One of the biggest complications comes from the way the data is copied and transformed across different systems. For example, you may have a primary Cassandra cluster as your source of truth. The same data can be indexed in Elasticsearch for search, cached in EVCache for fast access, and stored in S3 for backup or analytics. All of these are just a copy of the same underlying data stored in different places to serve different needs, whether it's performance, redundancy, or analytics. Now the real challenge comes when you need to delete data. When a single piece of data exists in so many places in the system, deleting it is not as simple as deleting it from the source. You have to make sure the delete operation is propagated to all the downstream systems, whether it's cache, search indexes, or databases. No copies are left behind.

That brings us to a key question, do we really know where all our data is stored and how the data is connected and transformed? Keeping track of all this is getting much more difficult as your system grows and your data flows get more and more complex. What are our options when it comes to deleting data in such complicated systems? One option is to just delete the source or root record but leave behind any copies anywhere else untouched. Here's the problem. If you only delete data from the source but don't delete it in all downstream systems, you end up with dangling pointers.

For example, you may delete a record from Cassandra cluster. The copy of the same data could still exist in EVCache, S3, DynamoDB, and Elasticsearch. Although maybe an LRU cache can eventually evict the stale data on its own, other systems like S3, DynamoDB, Elasticsearch will continue to store the data, leading to unnecessary storage costs. That means you are not only paying for unnecessary storage but you also leave behind the reference to the data that should have been deleted. A better approach is to delete data not just from the source but from all the downstream systems elsewhere. That means when you delete a record, you also asynchronously propagate the deletes to every copy, whether it's caches, search indexes, or other databases.

This fanout strategy helps us make sure that no stale or orphaned data exists anywhere in the systems. By doing this, you're making sure you avoid dangling pointers and reduce unnecessary storage costs. It's a more comprehensive way to ensure that data is truly deleted everywhere. Actually, this is the approach we used when building the centralized data deletion system at Netflix. We'll dive deeper into how we build that.

Availability

Before moving on to deletion itself, it's important to highlight one more key aspect, which is availability, which means making sure the system stays safe and operational when deletions are happening. As we think about keeping systems safe during deletes, a natural question comes up, is it actually safe to delete data? This is especially important when they have accumulated a large amount of data that should have been deleted. What are the risks and the challenges involved? Let's take bulk deletes in Cassandra as an example. When bulk deletes happen in Cassandra, there are many tombstones created. Then, when a read comes in, Cassandra still has to scan all these tombstones to check which records are actually valid.

This actual work can increase the latency and slow down the reads. In the worst case, this will cause timeout and missing service level objectives, directly impacting user experience. Before the compaction kicks in, cleanup hasn't triggered yet, so the system continues to accumulate tombstones. Once compaction starts, it becomes very resource-intensive. It consumes a lot of CPU, memory, and I/O resources. If those operations are not controlled well, it can end up starting critical production workloads, leading to service degradation or even outage as the system struggles to handle both delete workload and regular traffic. Bulk deletes can also cause compaction storms. This is especially challenging in LSM-tree based datastores like Cassandra.

This happens when a large number of tombstones accumulate and compaction tests are triggered at once. Those processes consume a lot of resources. It can create cascading effects that can destabilize the entire cluster. Until compaction is completed, those tombstones are still sitting on disk, so there's ongoing storage costs for the data pending compaction. In summary, bulk deletes can exhaust resources, slow down reads, trigger compaction tombstone, and compaction storm, causing cascading effect that destabilize the system and lead to storage bloat. That's why we need to handle them very carefully to keep the system healthy when the deletions are happening.

As Vidhya mentioned earlier, deletion comes with hidden costs, and background processes can trigger resource contention and unexpected performance issues. How do we mitigate these risks? Here are a few effective techniques we can use. One of them is partition-level deletes. Deleting individual rows one by one can cause a lot of overhead and leave behind many tombstones, which will increase the storage cost and complicate data management. Instead, deleting an entire partition at once is much more efficient because it minimizes the number of tombstones created and make it easier to manage your storage footprint. The second technique is to spread out your deletes instead of letting them all happen at once. You can do this by adding TTL jitter, which introduces some randomness to each item's expiration time.

By clustering the deletes together, those deletes can spike resource usage at the same time. Instead, if you spread the deletes across the day, you avoid the spikes, and your system keeps running smoothly. Instead of a huge resource hit, you get a steady, controlled, manageable flow of deletes. You can also use resource utilization metrics to control how quickly the deletes are processed. For example, we track the datastore, compute and storage usage, and make that information available to the consuming applications. Those applications can use that information to prioritize requests based on available resources. Low-priority delete requests don't have to be handled immediately. We can delay them and process them asynchronously. This can help reduce the pressure on the system and prioritize live traffic.

To protect the system, it's very important to rate limit the speed of deletes. We start with low rates and increase gradually as we gain confidence that the system can handle the load safely. We also monitor metrics like compaction and resource utilization to dynamically adjust the rate as needed. If there are failures, we use exponential backoff to avoid overwhelming the system. This gives the system time to recover and helps us avoid cascading issues and outage. On the left, this is some out-of-control behavior we want to avoid. On the right, this is something we are aiming for, deletion running smoothly at a controllable, managed flow.

Correctness

Now we've covered durability and availability. Let's turn our attention to correctness, making sure our deletes are truly accurate and complete.

Vidhya Arvind: Correctness is important when you especially have concurrent writes and race conditions in your system. In a distributed system, that's a normal process. In your left, client A is doing delete of id A. In your right is a client B doing an insert of id A. Which one wins? Do we ever know without concurrent writes? When you have concurrent writes, the conflict resolution has to happen within the system. There's a few techniques that the industry uses, last write wins. Cassandra uses last write wins. Conditional writes, RDS and Postgres systems use conditional writes. There are systems like EVCache which does best effort conflict resolution.

In those cases, really the end state is unknown, but that's ok for a cache. A small amount of TTL you have, it can expire quickly. A last write win system, like Cassandra, you want to issue deletes with timestamps option and deletes. Here, I have an example of client A doing x equal to 1 at time T1. Again, client B goes and does x equal to 2 at time T2. Client A again goes with x equal to 3 at time T3. All three operations go to database. What is persistent is x equal to 3, the last thing that was written to the database using the timestamp. During concurrent updates, it's really important to dedupe the data, dedupe the writes, so that you get the last data as written.

Idempotency is a very important token that you use to do all your writes. What is idempotency token? When you do writes, you want to generate a timestamp that is unique to that write and use an auto-generated token or a randomly generated token, attach it to your every write that goes in. That gives you the safety net that you need. It ensures order of operations and it allows us to hedge and retry these requests. Idempotency token is critical for your correctness, especially in a last write wins database. Where do you really generate these timestamps? When the database itself generates timestamps. Is that enough? By the time it reaches the database, the order of operation is really not known. Only the client knows the order of operation because the timestamp is generated in the client side.

It is advisable to use the timestamp from the client side and not take into account when the data reaches the database. Compare-and-set, another semantic that we can use during these operations. Client A here reads pending status and issues a delete. Client B also reads the pending status and issues a delete. When both of these operations go in, one wins over the other. The second operation gets rejected. The result is a successfully applied operation. One single operation got applied, and that is a good scenario to have. You can even use timestamps in scenarios during conditional writes as well.

What are the consequences of not deleting data in the system? Trust but verify. You want to give your customer enough information to delete the data, but sometimes they don't delete the data or accidentally this data remains in the system that we want to audit and verify. Shawn talked earlier about deleting data from everywhere. How we do that with Netflix's centralized architecture is the next thing we are going to talk about. In our lifecycle of deleting data, we first identify what data needs to be deleted, wherein all is the data stored. That is the identification process for us. Then you want to audit the system to see, is this data supposed to be deleted or did it resurrect, or what is the MRO for it?

Then you issue deletes. You want to issue deletes in an optimal fashion. Then you continuously monitor things, like there is data resurrection possible. Let's talk first about identification. When you want to identify data, you want to take input from multiple sources. Data owners have best information about what data is held in each of these systems. You want to use a self-serve mechanism to identify these sources. Then you can use schema registry and annotations also to mention that this is a critical piece of data that needs to be deleted. Here I have a URI which specifies a dataset that has critical information. Customer ID is in the ID field, and I want all of that data to be deleted when the actual customer goes away. Next is audit and validation.

Shawn Liu: Once identification is done, the next step is audit and validation. Let's walk through how we handle that. Let's start from the offline audit scheduled workflows. We build these workflows to figure out exactly what data can be deleted. Here's how it works. First, we back up the identifier table and data table in S3. Then we run a scheduled audit job against those backups to match and validate and find the data that are eligible for deletion. Finally, the results are written to an audit table which gives us a clear and complete list of the data to be deleted. This systematic approach helps us to ensure that we are deleting the right data and nothing gets missed. After finding what needs to be deleted, we move on to validation.

In this step, we double check our list of deletable data is both accurate and up-to-date. For example, we can use a tester identifier table to check if those IDs present in the data table. By validating and checking them, we make sure that everything flagged for deletion is correct. Then the validated result is written to a final set. Because those backups could be a few hours ago, some data may have been already reinserted or restored in the meanwhile. The validation job also accounts for those updates and ensure that only the data that are truly safe to delete ends up in the final set. This careful step helps us to ensure our deletion process is both safe and effective.

After we identify, audit, and validate the data that needs to be deleted, we are finally ready to perform deletion in our systems. Let's take a look at how that works in practice. At this point, audit and validation jobs have already given us the complete final results of data that can be deleted. From there, our deletion service takes over and removes data, whether it's in Cassandra, RDS, or through delete endpoints in different backend systems. As Vidhya mentioned earlier, different storage engines behave differently, and it can have its own performance implications. We took those factors into account when designing the delete service. For example, if the database is experiencing CPU or storage usage spike, we'll back off and wait until it's safe to proceed the deletes.

This approach helps us avoid overloading any system. We also make sure all the delete operations are logged in our journal service, so we have a complete record of what was deleted, when it happened, and by which service. This is done asynchronously via a Kafka topic so it doesn't slow down the main delete process. Following this careful and controlled process, we can ensure that deletes are performed safely, efficiently, and with full traceability.

Last but not least, the final phase in our lifecycle is continuous monitoring. Just because we've issued a delete doesn't mean our job is done. We need to still monitor the system to make sure everything gets deleted as expected. Continuous monitoring means we keep running regular audit cycles to catch any failed or incomplete deletion. Anything that doesn't get deleted in the last cycle, it will show up again in the next audit and get queued for deletion again. We also track the key metrics, for example, number of deletable records, how long the data has exceeded its retention window, number of successful versus failed deletions.

These metrics help us quickly spot issues and ensure the deletion process stays reliable over time. To give our users more control and flexibility, we also support a pluggable deletion interface. This is especially useful in the cases where the order of deletes matters or where users want to manage their deletion themselves. Here's an example of how a pluggable deletion interface works in practice. You can see a user can specify their customer deletion plugins, callback URLs, and payload templates. This allows delete service to trigger deletion in different systems using whatever protocol or logic that works best for each situation. This approach makes our deletion framework highly extensible, so it can support a wide range of requirements across different teams and datastores.

Here is a high-level view of our deletion architecture. Everything starts with control plane, which triggers the audit jobs. The audit jobs scan both identifier tables and data tables to find what needs to be deleted. Next, the validation job confirms that the right data has been identified and prepares a final set of records for deletion. Then the delete service takes over, removing data from corresponding systems. Also, we make sure to journal all the deletion operations and results in our journal service so we have a complete record and we can recover if needed. This architecture ensures our deletion process is complete, reliable, and traceable from end to end. Once we build this foundation, the next step is to earn our customers' trust.

Vidhya Arvind: Gaining customer trust is a very important step in all this, especially when you're centralizing things. Nobody trusts a central team to delete their data. Their data is very important for them. I want to take an example of a tester account where your device is doing some testing, from device all the way to the microservices. There are a few microservices in between. Your data is passing through all of these microservices and reaching a database. Here, the example, the tester accounts are created in accounts database, profiles database, playback systems, gaming systems. All of these data is proliferated everywhere. It's spread across the system and growing over time. You want to not have that data grow all the time.

You want to validate that this is a valid delete and go delete it. The other thing you want to do is trust but verify. Your customer is saying, "I deleted the data. You, as a central team, don't go delete my data". When they say that, it's great that you are deleting the data, but I also want to trust but verify. Audit jobs help us validate that the data that was supposed to be deleted remains deleted, and it is really deleted. It's not because there's a failed process or a bug in the system that did not delete this data. You don't want to find that out later. We also want visibility when we are deleting the data. You want centralized dashboards, which whenever you are auditing the system or finalizing a set of deletables or deleting the system, you want a centralized place where all the dashboards are collected and every step of the way you're monitoring the system.

You want customers to trust you more, so you want a robust data recovery system. The journal that we wrote will help you recover the data within 30 days. You can't store the data forever, even otherwise. It's a low storage option. It's just logs. It can stay in S3 object store, and can stay there for 30 days before it expires. It helps you recover the data. When recovery is in place, there is more customer trust that as soon as an issue or a bug is found, you can recover the data. Real time deletes has its own dilemma. You are processing the system when there's live reads and writes happening. Legacy deletes can slow down the system.

You want to do slow and gradual propagation of deletes. You want to rate limit. You want to orchestrate all the deletes. You want to adapt best practices for each and every database. Can we do better? Of course, we can. We are working on a future which is bulk delete with optimal optimization strategy where our journey is to take all of those deleted rows, like tombstones, create those tombstones as SSTables. In this case, it's an example of Cassandra. We create Cassandra SSTables. Cassandra imports can be used to download the S3 files and load it up into Cassandra, and immediately compact away the data. This is where we are going so that we can reduce ingestion costs and improve performance.

Lessons Learned

What are the lessons learned? You want identification to be in the forefront, lineage, wherein all your data lives. Identify that early and have a system to have the lineage information. Audit the system. Mistakes can happen. Data can remain. Failures are easy in a distributed system. When failures happen, the data remains in your system for longer than what you want. Trust but verify. You want to automate all validations and recovery process so that we can gain customer trust. Make it event driven so you don't have to manually intervene and do anything to minimize reliance.

Impact and Insights

Shawn Liu: Let's start with outcomes. We've been able to identify and manage deletion across a large number of datasets. Most importantly, throughout the entire journey, we have data loss incidents. If you look at the chart below, you can see our daily deletion counts increasing over time as adoption grows. Even with this growth, the system is still stable and deletion keeps running smoothly and safely. This really shows that with the right guardrail and process in place, you can run large scale deletion efficiently, safely, and with confidence even at this scale. Building on what we just saw, I want to highlight the scale of our work. We've systematically identified 1,300 datasets that are eligible for deletion across the platform. Out of those, we've been able to do auditing on 125 datasets so far.

We are actively tracking and validating a growing set of our datastores, and that coverage continues to expand. Finally, we've identified 76.8 billion rows of deletable data across those datasets. Those really show the scale of our work and the scale of the data we are managing, and also shows the importance of having a robust automated process to handle deletion safely and efficiently.

Now we've talked about the impact of the work, let's turn into some key takeaways that you can bring back to your own systems. Here are a few key lessons we've learned along the way. First, continuous audit. Even with automated workflows, checking what actually gets removed helps catch issues early and keep us confident in the system. Second, build a centralized deletion platform. This will give you a clear visibility, making troubleshooting easier and keep everything consistent. Third, adapt to each system. Every storage engine behaves differently. What works for Cassandra may not work for RDS or S3, so tune your approach to each system. Next, apply aggressively those resilience techniques to build resilience into your process. Spread out TTL. Use resource usage to slow down.

Rate limit, and use load shedding to keep the system stable under heavy load. Next, optimize for batch deletes using native format and tools. This will reduce overhead and increase and improve the performance. Finally, build trust in the delete process. A reliable, predictable process helps everyone trust the system end-to-end. When you put all this strategy together, you can now push the first domino, knowing that everything will fall into place safely and smoothly.

Questions and Answers

Participant 1: I was curious, what makes data deletable at Netflix? Then also, your deletion volume seemed pretty high for a sustained period, and it's curious. Is that expected to go to zero at some point?

Vidhya Arvind: Yes, that's right. What you're seeing is we are auditing the system continuously, but we have not audited all our systems. When we add more systems that we are auditing, then those numbers go up. When we do forward deletes, which is from today on, whatever came in as deletable, we delete the data, then it's a straight line. After we do a backlog of deletes using our bulk load, it goes to zero. MRO has to be zero at all times when we finish the backlog deletes, as well as the forward deletes. That's the goal. What is deletable is any data. Mainly when you see tester data that you're running in production, that kind of spills over everywhere. You want to test end-to-end, and you want to test in the production system.

When that happens, it spills over everywhere, that's deletable data. Any data that a previous customer had, that's deletable data. The system failed and did not delete the data, that's deletable data. In the centralized view of things, anything that the identifier table thinks as deletable is identified as deletable.

Participant 2: My question is the journal entry, what's the intent for that? Is that just log in the records that got deleted, or is it actually capturing the detail so that you can do a restore in the event of a delete that wasn't truly intended? You said you have your journal where you log the deletes. Is that just log in the keys for the items that were deleted, or the details?

Vidhya Arvind: No, it logs the whole thing, because a recovery system needs the actual data to recover the data.

Participant 2: It gives you a more targeted way to recover so you don't have to do it from a backup.

Vidhya Arvind: Yes, but we store it in very low cost S3 systems where a petabyte or a terabyte could be like a dollar or something like that, so that we can reduce the cost in storing that for 30 days, but still, it's possible for us to recover. Usually any issues that shows up, it shows up within a week. For us to have more buffer, we keep it for 30 days.

Shawn Liu: Also, we have a strict TTL on the journal tables that limit our recovery window, currently it's one month. You have to spot an issue in one month and we'll be able to offer you a recovery capability.

Participant 3: I guess the deletion service has some fine-tuned code so it doesn't take too much of the resources?

Vidhya Arvind: We use the downstream systems. For example, if Cassandra has a compute and storage utilization metric that Cassandra emits to the downstream processes so that we can monitor them and slow down our deletes, that's how it works. RDS has the same. Cassandra has the same. EVCache has the same. You get the utilization from the downstream system itself for us to know.

Participant 3: I wanted to know if the ratio of resources it can consume is applied across the board or it's defined by each one of the clients that are using the service?

Vidhya Arvind: It depends on the system. We have some buffer in the system. For example, Cassandra, we usually have 30% buffer in the system. When the live traffic is even taking that buffer and going against that utilization buffer, we slow down our deletes. At least these kinds of deletes are not that important that it has to go right there. This data has not been accessed. It's been sitting there. It was supposed to be deleted, it somehow hanged on. Tester data, sometimes we just once leave it there forever to remain. Those cases, it's not important to go and delete the data immediately. We can wait and slow down your deletes to go under the radar.

Participant 4: For the audit job, you said there are times where the same data might be copied to different systems. Are these audit jobs independently checking each system or do they know where the data has been copied throughout the systems?

Vidhya Arvind: I'll take Cassandra as an example because that's majority of the database, or larger fleet than any other database. For Cassandra, we store the backup copy of the data in S3. We can go from S3 to get the audit results by reading the S3 files and processing the data immediately. We don't do online auditing because that also affects the system and the resource utilization goes higher when you try to scan the whole table and find the delete data. It's also read throughput and things like that. You don't want to affect your online system in any way when you're performing these deletes.

Participant 5: My question is about your delete service. Let's say you call the delete service, ask to delete the Cassandra, but delete service is down. Then, how do you manage that? Because you have different sources, you are storing in S3, Elasticsearch. Then, delete service is down. How did you roll back that?

Shawn Liu: Two aspects. First, we have a continuous audit. First deletion, especially for the expired member or tester data, doesn't have to be handled online or really quickly. We can catch up on the next day. The next audit cycle will cover what gets missed. Also, we have separate stacks or a separate fleet for different delete services. There's a direct delete service running on a series of instances. We also have an async stack, which contains the delete service instance, who calls the customer deletion endpoint. We do this by shards. One fleet issue will not affect the delete service in the other shard.

Participant 6: How does your recovery system handle recovering the data back to all the places where it was deleted from? Where do you maintain all of that log?

Vidhya Arvind: The journal service itself, think of it as like a time series database. We have timestamps, and we have information about the datasets, which had the delete. Take an example of a customer comes in and says, "I am not seeing my data. It's supposed to be here. It was there yesterday". Immediately, we can query the database and see if that was in the deleted journal. As soon as you know that it's in the deleted journal, you know that there is some bug that happened. That date is known. It goes against the timestamp. Using the timestamp and trying to see how far this has happened, and going back and getting the list of events that happened during that time, all the deletes that happened, it's already in the journal system.

If you are trying to recover through the online system, again, you have to run these insert commands. How we really do for last write wins command is you don't want to replace the inserts. If there is an insert that happened after we deleted the data, we don't want to go before that insert. We take the original timestamp and add a millisecond to the original timestamp and insert our data in the middle, so that newer inserts are still intact. Those are the conflict resolution strategies we use. That's online deletes. What is more effective is bulk deletes of offline. You create an SSTable with all the recovered data that needs to be there, and then you load it immediately. That's going to be even more faster.

 

See more presentations with transcripts

 

Recorded at:

Jun 04, 2026

BT