Transcript
Zhou: We're going to talk about how we delivered millions of notifications within a few seconds during the Super Bowl. I'm Zhen. How many of you know Duolingo? Duolingo is the most popular language learning platform. We have the most downloaded language learning app worldwide. Our mission is to develop the best education in the world and make it universally available. What do we have? We currently have 31.4 million DAU, which is Daily Active Users. We have over 100 courses offered in 40-plus languages. On top of that, we also have a very unhinged social presence. You may know us from a lot of things, and I assume internet memes are one of them. You may have heard of things like "Spanish or vanish", and some other stuff. You may have seen some of the pretty wild TikToks or memes that revolve around Duo or our employees. That is really our marketing team hard at work.
Let's switch gears a little bit and look at this number, 123.4 million. Does anyone have a guess of what this number could be? That's actually the number of viewers of the last Super Bowl. This year, the Super Bowl game between the Kansas City Chiefs and the San Francisco 49ers brought an average of 123.4 million viewers. That is the highest number of people watching the same broadcast in history. If you ask ChatGPT to create an image of a typical Super Bowl ad, it looks like this. Basically what you would expect from a ChatGPT image. It's all over the place. It has cars, angels, neon signs, everything. Why are we really talking about this? Because everything was fine for us engineers at Duolingo for a while, until one day we heard that Duolingo would run its first Super Bowl ad as it expands beyond TikTok. Our chief marketing officer expected the brand to be even more unhinged in 2024. Then we saw some design docs that revolved around Duo's Butt, so we knew what was coming. What really dawned on us was that we were buying a Super Bowl ad. This is a 5-second ad that was played during this year's Super Bowl. At the time, all we knew was that it had something to do with Duo's Butt. There's one last thing. The marketing team asked us to do a push notification at the same time. What could possibly go wrong?
Overview
We'll take you on a deep dive into the Superb Owl service, which we named conveniently after the Super Bowl. First section, we'll be talking about how we dissect this impossible task. Then we'll talk about how we cope with the changing requirements. Followed by talking about the architecture deep dive into how exactly the service works. Last but not least, we will talk about what we learned and what we would have done differently.
A Dissection of An Impossible Task
Let's start out with a dissection of an impossible task. When I use the word impossible, it may sound strong to you. It's just sending notifications. When the marketing team came to us, they asked us, we want you to send millions of notifications in a few seconds, can you do it? We were like, hold on, let's do some research. We did some research. We came back to them and said, I don't think we can do that. Then, the next thing we heard was them telling us, we already paid for the ad. At that moment, it wasn't really up to us, we had to do it. They told us that they wanted to send 4 million notifications within 5 seconds. When they first came to us, we were actually a little nervous, because, as you see here, our notifications typically went out at a rate of 10,000 notifications per second. If you do the math, that's actually a lot of notifications per day. What they were asking was to send notifications 80 times faster. If you're a site reliability engineer, or you work in the backend, there might already be alarms going off. This is what I'm talking about. Back in 2022, Coinbase launched its bouncing QR code campaign during the Super Bowl, and it was so popular that it crashed their site. We probably don't want to do the same. On top of that, we also discovered an issue with our Android app: when we send out a notification to the Android app, it sends a request right back to the backend. This sounds a little bit similar to DDoSing ourselves. If you want to send 4 million notifications within a few seconds, you probably want to avoid doing that.
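The arithmetic behind that request can be checked in a few lines. This is only a sketch using the figures quoted above; the per-day figure assumes the usual rate were sustained for a full 24 hours, which the talk does not claim.

```python
# Back-of-the-envelope numbers quoted in the talk.
TARGET_NOTIFICATIONS = 4_000_000   # requested audience
TARGET_WINDOW_S = 5                # requested delivery window, in seconds
USUAL_RATE_PER_S = 10_000          # typical send rate at the time

# Rate needed to hit the target, and how it compares with the usual rate.
required_rate = TARGET_NOTIFICATIONS / TARGET_WINDOW_S
speedup = required_rate / USUAL_RATE_PER_S

# Assumption: the usual rate held constant for a whole day.
usual_per_day = USUAL_RATE_PER_S * 24 * 60 * 60

print(f"required rate: {required_rate:,.0f} notifications/s")
print(f"speed-up over the usual rate: {speedup:.0f}x")
print(f"usual rate sustained for a day: {usual_per_day:,} notifications")
```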
In theory, leadership just came to us, "You should just ship it." I don't actually know if ship it is still a principle at Duolingo. In practice, this is a complicated problem. Because here you see, we want to build a system to send notifications fast, we also want to make sure that we do not DDoS ourselves. We also want to handle the organic traffic, because people actually click on those notifications and come back to the app to practice learning new languages. Our solution is basically somewhere in between those three problems. To make matters worse, we have people coming to us from different perspectives, product managers, or other engineers, they come to us with suggestions or inquiries. Can we add an icon to the notification? Can we send to more cities? Can you localize the push into different languages? Can we use the vendor for this? All of these questions. At this point, you might be imagining there's like your favorite superhero from Marvel or DC to come and save the day. In this case, we would probably call it the coding man or something. There's no coding man in real life. What we did was just very simple, we divide and conquer the task. I was in charge of building the system to send notifications fast. Two of my wonderful colleagues, Louise and Milo, they're respectively in charge of making sure that we're not DDoSing ourselves, and making sure that we can handle the organic traffic.
Coping with Changing Requirements
Let's move on to the next section. I'll talk about how we coped with the changing requirements. On the first day of the project, my technical program manager, Paul, came to me and asked me, "Can you give me a timeline on this project?" At the time, I wasn't really a senior engineer. I said, "Sure, but where do I even start?" Because it's such a complicated project. All we knew was, from August 2023, we know we want to send 4 million notifications in less than 5 seconds. Super Bowl happens the next year on February 11th, and they're not going to move that date for us. How do we account for all these uncertainties in between? As you may know, to make matters worse, requirements are always changing. People are suggesting all sorts of things to do. You should do this. You should design this system in one way than the other. To give you an example, I said, they came to us saying we want to send 4 million notifications within 5 seconds. That was actually a lie. That actually changed. Because when they first came to us, they told us that they want to send notifications to 3 markets, with 2.6 million notifications in total, and they want to send all of it within 8 seconds. A few days later, they changed their mind. They were like, how about we round out to 3 million notifications? We want to send all that in 5 seconds. A few months have gone by, they came to us, they told us, we actually closed a new deal, and we now have the ability to show the ad in 7 markets. Wonderful, but that also means we want you to send to 4 million users. Because we're in the United States, and people have a lot of digital devices, and potentially, we want to send to 6 million plus devices, which is 6 million plus notifications, all of that within 5 seconds. You can see how requirements can be changing all the time in a project.
Our solution was to figure out, what does not change? We figured that our operating principle does not change. One of our operating principles at Duolingo here is to test it first. We're obsessed with A/B testing. We can't A/B test this project, though, but I'll talk just in a bit how we tested it. Our principle in this project is to not ship anything that we cannot test. Here's a few examples. When they came to us and asked if it is ok to send to a larger audience, we said, yes, maybe, because we can probably test it. When they came to us and asked whether we can send notifications with an icon. We said, yes, because we can test it. When they asked, is it ok to add the ability to send notifications to different markets at different times? Sure, as long as we can test it. Is it ok to add 2 million users the day before, after basically we finalized the audience list? No, probably not, because we cannot test it. After that, we established the timeline. Back in August, we know that we want to send to roughly 4 million users in less than 5 seconds. We want to have the MVP ready by September 2023. We know that we want to make sure that we cannot DDoS ourselves by the end of December 2023, so that enough time could be given so that new client changes could be rolled out to users. By January 2024, we want to make sure that all testings are done, so that when Super Bowl really starts, we're ready because we have tested everything.
Architecture Deep Dive
Let me take you on the core of this presentation which is the deep dive into the architecture of the system to send notifications. To send 4 million notifications in 5 seconds, it's a lot of challenge. At Duolingo, we have an operating principle of, embrace challenges. To summarize the challenges in this project, we have three. The first one is speed. I'll call it how fast. The second one is scale, which is how big. Third one is timing, which is what time. The first challenge is speed, of course. The speed is 800,000 notifications per second, roughly. If you think of the population of Boston, last time I checked on Google, Boston had a population of 650,000 people. I think that was back in 2022. Now it's probably a little more. To send notifications to all these people, at our rate, it will take less than one second. It's just that fast. The second problem and challenge is scale. At Duolingo, we host a lot of our microservices on AWS. Unfortunately, that cost us money. I wish that it was free. To build a system that sends notifications really fast, you also have to make sure that it's scaled up to the point that it can do its job. You also want to make sure that your backend is scaled up fast enough to do its job. On top of that, you probably don't want to spend too much money on cloud infrastructures. That is also a challenge at scale. The third challenge here is timing. For those of you who are into sports event, you'll know that sports events are unpredictable, because people get injured or sometimes it goes faster or slower. Even the more predictable ones like the soccer games, they're usually fixed, 90 minutes, but they have added time. It's also the same case with the Super Bowl. Super Bowl has timeouts and everything. The problem with an ad campaign during Super Bowl is that you don't know exactly when your ad is going to air. You see, our plan is to show the notification right after our ad airs. 
We have to build a system that is on-demand, so that when some marketer presses a button, potentially a big red button, then our notification goes out. We have to build such an on-demand system. We want to make sure that maybe if two marketers press the same button at the same time, no duplicate messages go out. That's the challenge of timing. To summarize, we have the following technical requirements to build this system. We want to send notifications at the target rate. We want to get all the cloud resources we need just in time. We want to ensure resiliency and idempotency. I just explained idempotency here by the marketer example. I hope you all understand.
Here is our system. It's basically an asynchronous system. We have a lot of our services on AWS. I'll walk you through some of the use cases that we have, to show you how the system works. Let's assume that it's a few months before the Super Bowl, and you want to upload a notification campaign. It starts out with an engineer creating a campaign with a whole list of user IDs. The server, after receiving that request, will acknowledge the request and then asynchronously fetch all the data from DynamoDB. After getting the data, it will put the parsed data into S3 and log the result into CloudWatch. A few months have gone by, and maybe it's Super Bowl day already. You want to get ready, because you want your backend infrastructure to be scaled up well enough to handle the incoming traffic, as well as being able to send out a notification. What happens then? The cloud operations admin will have to scale up the ASG, the Auto Scaling group, first, and then the engineer who is responsible, at the time it was me, would scale up the workers in the ECS console, just change a few numbers. The workers will then fetch data from S3, where we previously stored all the data, and store that data in memory. The data is a mapping from the user ID to the device IDs. After all that, the workers will log their complete status to CloudWatch. Maybe a few hours have gone by and someone was watching the game, and the ad shows up and someone presses the big red button. What happens after that? The marketing admin hits the Go button. After that, the API server will send out 50-plus messages to the FIFO, First In, First Out, SQS queue. The interim workers in between, after receiving those messages, will dispatch 10,000-plus SQS messages to the next queue. Finally, the last tier of notification workers will send notifications by calling the batch APIs of APNs and FCM. Those are Apple's and Google's notification APIs. 
After all that, after we're done making all the network requests, the workers will log their process time into CloudWatch, so that we can analyze their performance.
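The two-tier fan-out described above (50-plus trigger messages expanding into 10,000-plus batch messages) can be sketched in memory. The talk does not say what the messages contain, so everything here is an assumption for illustration: the `ShardMessage` and `BatchMessage` types, the round-robin assignment, and the exact counts of 50 and 10,000.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardMessage:
    """Hypothetical tier-1 message: names one slice of the campaign."""
    campaign_id: str
    shard: int
    total_shards: int


@dataclass(frozen=True)
class BatchMessage:
    """Hypothetical tier-2 message: names one batch of recipients."""
    campaign_id: str
    batch_index: int


def trigger_messages(campaign_id: str, total_shards: int) -> list[ShardMessage]:
    """What the API server might enqueue when the go button is pressed."""
    return [ShardMessage(campaign_id, s, total_shards) for s in range(total_shards)]


def expand_shard(msg: ShardMessage, total_batches: int) -> list[BatchMessage]:
    """What an interim worker might enqueue for one shard.

    Batches are assigned round-robin, so every batch index belongs to
    exactly one shard and no batch is sent twice.
    """
    return [
        BatchMessage(msg.campaign_id, i)
        for i in range(msg.shard, total_batches, msg.total_shards)
    ]


if __name__ == "__main__":
    tier1 = trigger_messages("demo-campaign", total_shards=50)
    tier2 = [b for m in tier1 for b in expand_shard(m, total_batches=10_000)]
    print(len(tier1), "shard messages ->", len(tier2), "batch messages")
```

A small first tier keeps the button press cheap, and the expansion work is spread across the interim workers rather than done by the API server.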
Let's come back to some of the technical requirements that we discussed earlier, because if the system doesn't answer all these questions, it's still not working. The first question is, does this send notifications at the target rate? It's hard for me to convince you, but some of you who are familiar with SQS might notice that SQS in itself has an in-flight message limit of about 120,000 messages. Sending 4 million notifications within 5 seconds, one message per notification, would put us at 800,000 messages per second. How does that work? We simply used the technique of batching. We batched 500 iOS users together in one message, 250 for Android, and that put us under the limit. The second question we want to answer is, can we provision all the cloud resources we need in time? To solve this problem, we got in touch with a technical contact from AWS, who helped us draft an IEM, an Infrastructure Event Management document, which included detailed steps, like when and how you would need to scale up, as well as some of the more concrete items, such as the cache and cache connection limits, and the DynamoDB limits that you have to consider in these kinds of situations. We also gave the Superb Owl service its own dedicated ECS cluster. I'll tell you why we did that. The last question we want to answer is, can we ensure idempotency? To solve this problem, we actually cheated a little bit: we used a FIFO, First In, First Out, queue from the AWS SQS service. You see, the FIFO queue in itself deduplicates messages by message identifiers. It has a deduplication window of 5 minutes. When multiple marketers press the same big red button, they'll try to send the same messages through the queue, and the queue will deduplicate the messages for us. We basically just hand off that work. Keep in mind that the FIFO queue itself has limited capacity: 300 requests per second. 
If you're building a system like this for your own purpose, you also might want to use a cache or a table if you want to get past some of those limits.
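To make the batching and deduplication ideas concrete, here is a minimal sketch. The batch sizes and the 5-minute window come from the talk; the `dedup_id` scheme, the `FifoQueueModel` class, and `press_go` are hypothetical and model the behaviour in memory rather than calling SQS. With the real service, a deterministic ID like this would be passed as the message's `MessageDeduplicationId`. On the batching side, 4 million recipients in batches of 250 is 16,000 messages, or 3,200 messages per second over 5 seconds.

```python
import hashlib

IOS_BATCH_SIZE = 500        # users per message, from the talk
ANDROID_BATCH_SIZE = 250    # users per message, from the talk
DEDUP_WINDOW_S = 300        # FIFO deduplication window: 5 minutes


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def dedup_id(campaign_id, shard):
    """Deterministic ID: the same campaign and shard always hash the same."""
    return hashlib.sha256(f"{campaign_id}:{shard}".encode()).hexdigest()


class FifoQueueModel:
    """Toy in-memory model of FIFO deduplication, not the real service."""

    def __init__(self):
        self._first_seen = {}   # dedup ID -> time it was first accepted
        self.delivered = []

    def send(self, body, dedup, now):
        first = self._first_seen.get(dedup)
        if first is not None and now - first < DEDUP_WINDOW_S:
            return False        # duplicate inside the window: not delivered
        self._first_seen[dedup] = now
        self.delivered.append(body)
        return True


def press_go(queue, campaign_id, shards, now):
    """One marketer pressing the button: enqueue one message per shard."""
    for shard in range(shards):
        queue.send({"campaign": campaign_id, "shard": shard},
                   dedup_id(campaign_id, shard), now)


if __name__ == "__main__":
    print(len(chunk(list(range(1200)), IOS_BATCH_SIZE)), "iOS batches for 1,200 users")
    queue = FifoQueueModel()
    press_go(queue, "demo-campaign", shards=50, now=0.0)
    press_go(queue, "demo-campaign", shards=50, now=0.2)  # second marketer
    print(len(queue.delivered), "messages delivered after two presses")
```

Because the ID depends only on the campaign and shard, it does not matter who presses the button or how many times within the window.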
Let's get back to the system diagram graph. Does it send notifications at the target rate? Yes. Can we provision all the cloud resources we need in time? Yes. Does it ensure idempotency? Yes. Some of you may say that we're probably done here. I'm inclined to say so, but actually not. Let's look at this timeline of where we are now. We basically are still at September 2023, is when we built our MVP. Next subject is everyone's favorite subject, which is testing. How we tested it, at Duolingo, we love to test things by A/B testing. We are obsessed with A/B testing every single detail in the app. This is honestly not a thing that we can A/B test, because we don't want to leak our precious marketing creatives to our users. You don't want to show the Duo's Butt on something else, because that will ruin all the fun. We had to come up with our own way of testing the service. I'll walk you through a few examples of how we tested the service. First, we tested the throughput of the service. This is what I call, how we tested the MVP. At the beginning, we tested with not really notifications, we tested with what we call silent notifications. This is effectively an empty payload that we send to the client devices and see what happens. It turns out, it seems that we ran into some bottlenecks. The bottleneck we ran into was the thread count. Here you see we actually built our service in Python. I know many of you might disagree. If you are an expert in Python, you will know that when you build multi-threaded programs in Python, you run into issues with global interpreter lock and all sorts of bad things. That's the issue that we ran into. What could we do? We decreased the number of threads, and continued with the test, decrease it from 10 to 5 to 1, and the bottleneck seems to be gone. Then, we decided to test with a larger audience. Because at the beginning we tested with 500,000 users, now we want to test with 3 million users. Then the bottleneck came back. 
As you can see, we decreased the number of threads, but to get the same performance, we have to have a lot of tasks. It was really problematic to scale up to that many tasks. We decided to bin pack and put multiple processes within the same task. We experimented with a few setups and got to a point that we're ok with.
The next thing we tested was the cloud resources. This is when we tested whether we can scale things up fast enough. The first thing we ran was pretty straightforward. We want to make sure that we can scale up the system that sends the notifications really fast. In this case, it's the Superb Owl service. It's actually relatively simple. You just go in, you change one number, and it scales up, just like magic. Then, we want to scale up the backend. When we did that, we had to go through a few more steps and understand what services we needed to scale up. We found those services. We changed a few more numbers. They scaled up just as usual. Then the third question we want to answer is, can we actually scale up both the Superb Owl service and the backend at the same time? It turns out that we ran into some trouble. When you have services in the same AWS ECS cluster, and you try to scale up all of them, AWS is solving something like a bin packing problem, trying to figure out the optimal solution, or at least a good one. In reality, what we observed was that when we placed the Superb Owl service in the same cluster as the backend, it had to wait in a virtual queue. It had to wait for all the other backend services to scale up before it could scale up, or vice versa. That's not really the scenario we want to see happening, because we don't want our backend services waiting because our notification service is scaling up. We don't want our notification service unable to scale up because the backend is trying to consume more resources. We gave it a dedicated ECS cluster, which got rid of that queue. Last but not least, we want to ensure that we can scale up both the backend and the Superb Owl service in less than 3 hours, so that we don't have to waste money keeping this service running, maybe for days, before the Super Bowl. We did that. We tested. It works.
We move on to testing on real users. I said that we don't want to leak the creatives of Duo's Butt to our users. I never said that we couldn't test on real users. Actually, we tried to test in plain sight using our new notification system. In October, we tested on 1 million users using a Halloween-themed notification. In November, it was a year-in-review themed notification. In January, we tested on 4 million users with a welcome back from New Year message. This is all us testing in plain sight without leaking the creative. One of our internal tools, called Zombie mode, really helped us in this process. Zombie mode is essentially a mode that we created so that when we turn it on, the user's device will stop making requests to the backend for a certain amount of time, until it checks that flag again. It helps in scenarios where the backend cannot handle a surge in traffic and needs time to recover. Thanks to that mode, we were more comfortable with testing on real users, because we knew that we wouldn't have extended periods of outages. One of our lessons learned here is to always send yourself a copy before sending it to the user, because you never know in what way your internal representation of the message can leak out to the user. You never want that to happen.
Last but not least, we tested within our microservice architecture. Here are some of the things that we saw in that IEM, Infrastructure Event Management document. The first thing that we considered was, is our memcache or Redis hitting its connection limit, and can we use a proxy for some of the tables that we have? This is actually a real thing that we did. One of our services was having problems scaling up past 400-ish tasks because it was hitting the connection limit. It wasn't really a priority to address, so people put it off. At this point, we really needed to scale up, so what we did was put a proxy in front of the datastore. Now it can scale up to however many tasks you want. Problem solved. The second problem is, will our DynamoDB throttle? DynamoDB claims to be infinitely scalable, but some of you may know that it doesn't handle spiky traffic really well, especially if you're using provisioned capacity. We considered, should we switch some of these tables to on-demand briefly, or should we overprovision them to a point that they wouldn't throttle even with a spike in traffic? The last two are things that we considered but never addressed, but I thought they would be relevant to mention here. One of the things that we considered was whether our retry policy was reasonable. We have so many microservices, and when one microservice calls another it might retry three times, and that one calls a third one, three times, and it goes on and on, and it grows exponentially. You don't want that to happen. Last but not least, we know that we are going to request a certain subset of our users. Maybe we could have cached some of the user data that we knew was going to be requested. Because of the time limit of this project, we obviously didn't get into the detail of all that. We got to a point where we were happy.
On the day of the Super Bowl, we had a pretty well written playbook. We had engineers on Zoom carefully watching everything, scaling things up. We had marketers potentially at their TVs, holding popcorn. That's how I imagined what they were doing. We had a square go button. Some of you might ask, why isn't it round? I might explain it to you just a bit later. When the Super Bowl started, everybody was watching. We were watching for different reasons. Our marketers were waiting anxiously for the ad to show up. It didn't show up until 20 minutes after it was supposed to. When it finally did, two marketers clicked on that square button, and one message went out. The campaign results were actually wonderful. Ninety-nine percent of notifications were out in 5.7 seconds, 95% in 3.9 seconds. There was a Figma blog post written about it. There's also a Wall Street Journal article on our Super Bowl campaign. I think someone on Twitter or X posted about our notification campaign as well.
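The retry concern above is easy to quantify. The sketch below assumes each service makes up to three attempts against the next one in the chain (the talk says "retry three times", which could also be read as four attempts in total); `worst_case_calls` is a hypothetical helper, not something from the talk.

```python
def worst_case_calls(attempts_per_hop: int, hops: int) -> int:
    """Upper bound on calls reaching the last service in a chain.

    One incoming request hits the first service once. Each service then
    makes up to `attempts_per_hop` calls to the next one, so the bound
    multiplies at every hop.
    """
    return attempts_per_hop ** hops


if __name__ == "__main__":
    for hops in range(1, 5):
        calls = worst_case_calls(3, hops)
        print(f"{hops} hop(s): up to {calls} calls at the deepest service")
```

Common mitigations, none of which the talk says were applied, include retrying at only one layer of the chain or capping retries with a shared budget.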
What We Learned, and What Would We Do Differently?
It's time to move on and talk about what we learned and what we would have done differently in this project. Some of the mistakes we wish we had avoided relate back to the very fundamentals. We started out this project using Python. It turns out, it gave us a ton of performance bottlenecks. We thought about using async Python and looked into it, but it didn't really work with the tooling that we had. We didn't end up going down that path because, at the time, our original notification stack was built in Python. We didn't want to reinvent all the technology, all the authentication, the connections that we had established with the outside APIs. We stopped there because we wouldn't have had enough time to test. Actually, one month after the project, my manager came to me and said, have you seen this HackerNoon article from 2019 that says, "Using Go to build a system that sends millions of notifications within a few seconds?" I was like, that's interesting. I wish I had read that article. It could have given us a lot more inspiration and lessons. That also uses Go, which is a completely different stack than what we used at Duolingo. That could have helped us. The second thing that we wish we had improved was memory. I told you that at the end, we wanted to send 6 million plus notifications. Packing all these 6 million plus device IDs and user IDs into the memory of a service was actually a lot. We ended up using 7 gigs of memory for each task. You may remember that we have a lot of tasks, which means that it requires a lot of cloud cost and intervention. My colleague Milo worked really hard on that, and we finally got it across. If we had had time, we would have optimized memory and resource usage in general. Last but not least, we want to have better observability into the client behavior. I told you earlier that our Android app used to have a problem: when you send a notification to the app, it sends a request right back. That's effectively DDoSing ourselves. 
That's not a problem anymore, but we discovered new problems. We discovered that when someone clicks on the notification, it triggers multiple requests to the backend; that number may be 10, 20, or 50. That's not good. Even if only 5% of people click on that notification, with a multiplier of 50 you're effectively getting more than two times as many requests back as the notifications you sent. That's a lot of requests back to your backend in a short amount of time.
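The click-amplification estimate can be worked through directly. This sketch plugs in the figures mentioned above (a 5% click rate and 50 requests per click) against a 4-million-notification send; `requests_from_clicks` is a hypothetical helper, and the real click rate and multiplier are not given in the talk.

```python
def requests_from_clicks(notifications: int, click_rate: float,
                         requests_per_click: int) -> float:
    """Backend requests generated by users who tap the notification."""
    return notifications * click_rate * requests_per_click


sent = 4_000_000
backend_requests = requests_from_clicks(sent, click_rate=0.05, requests_per_click=50)
ratio = backend_requests / sent   # requests coming back per notification sent

print(f"{backend_requests:,.0f} backend requests, {ratio:.1f}x the notifications sent")
```

With these inputs the ratio comes out at about 2.5 requests per notification, which is why the per-click multiplier matters as much as the send rate itself.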
Some of our lessons learned here are the following. The first one is to be open minded about the design, but be more rigorous with testing. I wouldn't say our design was the best, it was really far from the best. We went with one of the designs that we could iterate and test on. We know that testing will be the core of this project. Because, as long as you can test it, you can make sure that the system works the way that it's intended. The second lesson learned is to always build the system with resilience and robustness. As we just mentioned, multiple marketers can press the same button at the same time. You don't want duplicate messages to go out to your users. This is just one example of the things that you should take care of when you design a system like this. The third thing is, things can always go wrong, so accept it. I'm not really saying to do nothing about it. Actually, we took a rather active approach. We wrote a very detailed handbook to tell engineers on that day what you should do on the exact time of the day. Also, it included backup plans. What if this A thing fails? What if this B thing fails? We even have a backup plan for the whole notification not going out. What should we do then? I think what we are saying here is to face it with a more active attitude.
Questions and Answers
Participant 1: I was curious about the number of instances in your ECS cluster, and if you guys try to vary that number or do some tests. Was it fixed at 5k as the max amount for the ECS cluster?
Zhou: We did experiment with different instance sizes, and everything. One of the things that we could have done better was to optimize memory. At the end, we were bounded by memory, because we tried to store all this information in memory. That led us in the decision of how we decided to select the instance types, which ended up being memory intensive instance types. Five thousand, I think is just one of the numbers that we were comfortable with that makes sure that we can send out notifications that fast. There could be a limit, but I think it's mostly handled by my colleague with some negotiation with AWS.
Participant 2: When we talk about workloads that need to scale very rapidly and are very occasional, my mind goes to serverless. I'm curious, did you consider maybe using AWS lambdas for some of the workloads instead of looking heavily into your container cluster, and what was the role of serverless in general?
Zhou: I think that was definitely one of the suggestions that was being made early on in the project. I think someone suggested this to me and promised me, like even if it's lambda, it can also basically pre-spin up all the things under the hood, and you know you're going to get all these resources. At the time we just felt safer working with all the existing containers and the ECS services. I think, without seeing the machines spin up for you, it still puts us into the area of like, we're going to be more concerned whether this is going to work or not. This is definitely valid, because this is basically like a one-shot operation and you pay for what you do. It would make sense in another scenario, if we could test it to use lambda.
Participant 3: Did your team worry about whether FCM or APNS was going to handle the load, or freak out? Did you have to reach out to anyone to make sure that that didn't happen?
Zhou: Basically, the question was about whether we're concerned that we're going to get rate limited or throttled by FCM or APNS.
Yes, we were concerned. Also, at the very early stage of the project, we took initiative and we reached out, we asked some of our business people to reach out to the Google and Apple's reps to understand whether we are going to get rate limited or not. We only actually green flagged most of these after we knew that it's not going to get rate limited.
Participant 4: I'm curious about how you folks approach testing, because you talked a little bit about the cost. How many times do you have to actually keep testing that, and then actually multiplying the cost versus the time that you have to do testing? A little bit about how you approach the problem, testing times cost.
Zhou: Our philosophy was that, at least we want to minimize the risk. The costs in tests are relatively smaller, compared with the cost of the system not working at all, first of all. Also, we try to leverage the system as much as we can when we test on real users. As you can see, we tested on millions of real users in actual campaigns. In those campaigns, we would have spent the money anyway. That was basically no cost for that test. On top of that, we did run multiple cloud tests, basically, to try to scale up the infrastructure only, and not try to send anything. I think it's worth it to do maybe 3, 5, or even 10 times, until you're at the point that you're comfortable with the system's ability to scale up, because it's actually our first Super Bowl campaign, and we don't want to mess it up. We want to be as much risk averse as possible here. Cost is a little less of a factor if you're considering we are already spending on the Super Bowl ad.
Participant 5: As the Super Bowl is a very popular event, were there any concerns about capacity not being available when you wanted to scale up because this was such a high spike usage?
Zhou: Absolutely. I think this is definitely something that we talked with our AWS representative. This is something we definitely raised early on, whether we are going to have enough memory, whether we are going to have enough resource allocation for us. That's definitely something that you would have to discuss when you anticipate that your program is either compute or memory intensive.
Participant 6: For this project, I know you have to do a lot of scaling. Was there any strategy around optimization for the service itself? Did you use any specific tools for that?
Zhou: I think we are not known for performance as a language learning company, as you may tell. In this project we mainly focused on getting it to a point where we can send notifications that fast. The way that we tested was actually just pretty simple. We logged some of the process times to CloudWatch. We would process those logs afterwards and see how much time it takes for each function or each step to execute, and figure out the bottlenecks and speed them up. I think there are definitely more performance frameworks out there that we could leverage. With the constraints of this project, we didn't go with all that. Definitely it's a good question and concern. If we were to do this again, we'd definitely do it right.
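A minimal sketch of that kind of log analysis: group per-step timings and rank the steps by mean duration to see where the time goes. The record format, the step names, and `slowest_steps` are all made up for illustration; the talk does not describe the actual log schema.

```python
from collections import defaultdict
from statistics import mean


def slowest_steps(records):
    """Group (step, seconds) records by step and rank by mean duration."""
    by_step = defaultdict(list)
    for step, seconds in records:
        by_step[step].append(seconds)
    summary = [
        (step, mean(times), max(times), len(times))
        for step, times in by_step.items()
    ]
    # Slowest step on average comes first.
    return sorted(summary, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample records standing in for parsed log lines.
    sample = [
        ("load_batch", 0.02), ("call_push_api", 0.90),
        ("load_batch", 0.03), ("call_push_api", 1.10),
        ("write_log", 0.01),
    ]
    for step, avg, worst, count in slowest_steps(sample):
        print(f"{step:15s} mean={avg:.3f}s max={worst:.3f}s n={count}")
```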
I initially asked my colleague, JP, to make a big red button, a round button. Later, a marketer discovered that when you click on the corners around the button, which appear to be white, it still triggers the same function as clicking the button itself. In video game terms, you would say the hitbox is still a square. That's not good. As much as we want to avoid human errors, we definitely want to avoid that, so we made it a square, so everybody knows what they're doing.