
Slack’s DNSSEC Rollout: Third Time’s the Outage


Summary

Rafael de Elvira Tellez discusses a case study of what happened when a large SaaS company enabled DNSSEC.

Bio

Rafael de Elvira Tellez is a Senior Software Engineer for the Demand Engineering team at Slack. Demand Engineering enables fast and reliable delivery.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Tellez: In this presentation, we are going to talk about Slack's DNSSEC rollout and a very interesting outage that we had on September 30th last year. My name is Rafael de Elvira. I'm a senior software engineer on the traffic team at Slack. Our team looks after ingress load balancing, TLS, DNS, CDNs, and DDoS protection at Slack, among other things.

What's DNSSEC?

What is DNSSEC? DNSSEC is a suite of extensions intended to secure data exchanged by the DNS protocol. It provides cryptographic authentication of data, authenticated denial of existence, and data integrity, but not availability or confidentiality. These are the record and key types used by DNSSEC, which I will be referencing later on. The first one is the DS record, the delegation signer. It's used to secure delegations between nameservers; we also refer to this record as the chain of trust. Then we have the DNSKEY record, which holds the public key that resolvers can use to verify the DNSSEC signatures in the RRSIG records. These RRSIG records hold the DNSSEC signature for a record set; resolvers can verify the signature with the public key stored in the DNSKEY record that we just mentioned. Then we have NSEC and NSEC3 records. These can be used by resolvers to verify the non-existence of a record name and type. Then in terms of keys, we have the Zone Signing Key: its private portion digitally signs each record set in the zone, while the public portion, published as a DNSKEY record, verifies those signatures. Then the Key Signing Key signs the DNSKEY record set exclusively, creating an RRSIG record for it.
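
As a concrete illustration of these record types (this example is mine, not from the talk), here's a minimal sketch using Python's dnspython library that looks up the DS, DNSKEY, RRSIG, and NSEC records for a placeholder zone and labels each DNSKEY as a KSK or ZSK by its flags field:

import dns.exception
import dns.resolver

ZONE = "example.com"  # placeholder zone, not one of the zones discussed in the talk
resolver = dns.resolver.Resolver()

# DS lives in the parent zone (the chain of trust); DNSKEY, RRSIG, and NSEC
# live in the signed zone itself.
for rdtype in ("DS", "DNSKEY", "RRSIG", "NSEC"):
    try:
        for rdata in resolver.resolve(ZONE, rdtype):
            line = f"{ZONE} {rdtype}: {rdata.to_text()}"
            if rdtype == "DNSKEY":
                # Flags 257 = Key Signing Key (SEP bit set), 256 = Zone Signing Key.
                line += " (KSK)" if rdata.flags == 257 else " (ZSK)"
            print(line)
    except dns.exception.DNSException as exc:
        print(f"{ZONE} {rdtype}: lookup failed ({exc})")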

Here's a successful DNS record request sequence: the user makes a request to the recursive resolver, and the recursive resolver then attempts to track down the DNS record that the user requested. It does this by making a series of requests until it reaches the authoritative DNS nameserver for the requested record, which is the common use case. This is a very basic interpretation of a DNS record request. When the resolver does DNSSEC validation, the sequence is the same one; the only difference is that additional records are requested and returned in order to perform this validation. Let's keep in mind that DNSSEC only validates the DNS responses received up to the recursive resolver. It ensures that the response hasn't been tampered with in transit and is legitimately what the authoritative nameserver holds. Bad actors can still man-in-the-middle between the recursive resolver and the end user.
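
From a client's point of view, the difference between a plain lookup and a validated one is visible in the flags of the response. A small sketch of my own, using dnspython, that sets the DNSSEC OK (DO) bit and then checks whether the recursive resolver (Google's 8.8.8.8 assumed here) reported the answer as Authenticated Data (AD):

import dns.flags
import dns.message
import dns.query
import dns.rcode

QNAME = "example.com"    # placeholder name
RESOLVER_IP = "8.8.8.8"  # a validating public recursive resolver

# want_dnssec=True sets the DO bit, asking the resolver to include DNSSEC records.
query = dns.message.make_query(QNAME, "A", want_dnssec=True)
response = dns.query.udp(query, RESOLVER_IP, timeout=5)

print("rcode:", dns.rcode.to_text(response.rcode()))
# The AD flag means the recursive resolver validated the answer via DNSSEC.
print("AD (validated by resolver):", bool(response.flags & dns.flags.AD))
for rrset in response.answer:
    print(rrset)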

The DNSSEC Journey at Slack

Now that we've briefly touched on DNSSEC, let's go through our DNSSEC journey at Slack. Back in May last year, we started off by doing a proof of concept, where we replicated most of our DNS use cases in a controlled and isolated environment. In this environment, we carried out multiple tests and gained knowledge on how to operate DNSSEC. Once our proof of concept was successfully completed, we worked on an internal RFC that outlined all the necessary work to enable DNSSEC signing on all Slack domains. Then, prior to starting the work to enable DNSSEC signing on any of these domains, we spent significant time simplifying our DNS setup and making some necessary zone changes in preparation for DNSSEC.

Once we deemed our zones to be ready for DNSSEC, we started on the actual rollout plan. By design, DNSSEC can only be enabled on a per-domain basis, making it impossible to enable DNSSEC per subdomain or subzone. This design choice significantly increases the risk of DNSSEC rollouts. To overcome this, we categorized all our domains by risk: low, medium, and high. This way, in the case of a failed attempt or any bugs in the early stages, we would only impact domains that are used for things like monitoring or other non-critical, less impactful Slack features. For each of these domains, we followed the same extensive testing and validation steps, making sure that all these domains were still resolving correctly after DNSSEC signing was enabled.

Third Time's the Outage

At this point, you must be thinking, yes, but how did it all go wrong? Slack.com was the last domain in our DNSSEC rollout. Prior to slack.com, we rolled out DNSSEC to all other Slack public domains successfully, with no internal or customer impact. Obviously, slack.com had to be different. In this section, we're going to cover the three times we failed to roll out DNSSEC to slack.com. First, on September 7th last year, we made our first attempt to enable DNSSEC signing on slack.com. However, prior to publishing the slack.com DS record at the DNS registrar, which is the last step of a DNSSEC rollout, a large ISP in the U.S. had an outage, and many of our customers reported being unable to access Slack as a result. Early in that incident, and before we were even aware of this ISP issue, we decided to roll back the slack.com zone changes as a precaution.

One day later, on September 8th, we enabled DNSSEC signing on slack.com following the same procedure as on the previous attempt. Again, prior to publishing the DS record, our customer experience team noticed a small number of users reporting DNS resolution problems. We started investigating and realized most of the affected users were accessing Slack through a VPN provider. Once again, we decided to roll back signing on the slack.com zones as a precaution. After further investigation, it turned out that some resolvers become more strict when DNSSEC signing is enabled at the authoritative nameservers, even before the DS record is published to the .com nameservers, as in this case. This strict DNS spec enforcement will reject a CNAME record at the apex of a zone, including the apex of a subzone, which was our case. This was the reason that customers using VPN providers were disproportionately affected: many VPN providers use resolvers with this behavior. To overcome this problem, all our sub-delegated zones were updated to use alias records rather than CNAMEs at the apex.
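
One way to catch this class of problem before a rollout is to check every delegated subzone apex for a CNAME. A rough pre-flight sketch of mine (the subzone names are placeholders, not Slack's real zones):

import dns.resolver

# Placeholder list of delegated subzone apexes to check before enabling signing.
SUBZONE_APEXES = ["sub1.example.com", "sub2.example.com"]

resolver = dns.resolver.Resolver()
for apex in SUBZONE_APEXES:
    try:
        answer = resolver.resolve(apex, "CNAME")
        # Strict validating resolvers may reject this once signing is enabled.
        print(f"WARNING: {apex} has a CNAME at the apex: {answer.rrset}")
    except dns.resolver.NoAnswer:
        print(f"OK: {apex} has no CNAME at the apex")
    except dns.resolver.NXDOMAIN:
        print(f"{apex} does not exist")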

Then, after migrating all the CNAMEs at the apex of subzones, we were ready for another attempt. On September 30th, and after careful planning, we made our third attempt to enable DNSSEC signing on slack.com. We started off by enabling signing on the authoritative nameservers and on the delegated subzones as the last two changes. We left a soak time prior to publishing the DS record at the registrar, as we had had issues on the previous two attempts. After a 3-hour soak time and successful validations, we were confident enough to publish the slack.com DS record at the registrar. This instructed resolvers to start validating DNS responses for slack.com. At this point, we had fully enabled DNSSEC signing for slack.com.
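
That last step, publishing the DS record, can also be cross-checked independently: the DS the parent serves should be a hash of the child zone's key signing key. A sketch of that check with dnspython (placeholder domain, SHA-256 digest assumed; this is my illustration, not Slack's tooling):

import dns.dnssec
import dns.name
import dns.resolver

ZONE = "example.com"  # placeholder signed zone
resolver = dns.resolver.Resolver()

dnskeys = resolver.resolve(ZONE, "DNSKEY")
published_ds = {rd.to_text() for rd in resolver.resolve(ZONE, "DS")}

for key in dnskeys:
    if key.flags != 257:
        continue  # only key signing keys are hashed into DS records
    computed = dns.dnssec.make_ds(dns.name.from_text(ZONE), key, "SHA256")
    status = "matches the parent DS" if computed.to_text() in published_ds else "NOT published at the parent"
    print(f"KSK keytag {dns.dnssec.key_id(key)}: {status}")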

Everything looked great for a while: our tests, our external probes, and our monitoring were all looking good, and we were so happy to have completed the last domain in our DNSSEC rollout. Then, after a little while, things started to look a bit less idyllic. Our customer experience team again started getting reports from users seeing DNS resolution errors in their Slack clients. To give you some context, Slack applications are based on Chromium, and they integrate a way to export a NetLog capture. NetLog is an event logging mechanism for Chromium's network stack that helps debug problems and analyze performance. NetLogs are extensively used at Slack. On the left, you will find the DNS configuration of a client available inside the NetLog capture. On the right, you will see an attempt to resolve app.slack.com with all the relevant information. During our initial investigation, all the reports were caused by NODATA, or empty, DNS responses from some DNS recursive resolvers, most of which were from a large customer using a private corporate resolver and from Google's public resolver, Quad8. We were both surprised and confused by this, as Google DNS failing to resolve slack.com should mean a much larger impact than our monitoring and customer reports indicated.
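
The failure signature described here, a NOERROR response with an empty answer section, is easy to probe for against a handful of public resolvers. A sketch of mine (placeholder name; the resolver IPs are just well-known public services):

import dns.message
import dns.query
import dns.rcode

QNAME = "app.example.com"                      # placeholder name
RESOLVERS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]  # Google, Cloudflare, Quad9

for ip in RESOLVERS:
    query = dns.message.make_query(QNAME, "A", want_dnssec=True)
    try:
        resp = dns.query.udp(query, ip, timeout=5)
    except Exception as exc:
        print(f"{ip}: query failed ({exc})")
        continue
    if resp.rcode() == dns.rcode.NOERROR and not resp.answer:
        # NODATA: the name exists but the resolver returned no records for it.
        print(f"{ip}: NODATA (NOERROR with an empty answer section)")
    else:
        print(f"{ip}: {dns.rcode.to_text(resp.rcode())}, {len(resp.answer)} answer RRset(s)")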

Given the data we had at the time, we decided to roll back our changes for slack.com. We attempted to do so by pulling the DS record from the DNS registrar, effectively removing the slack.com DS record from the .com zone, to stop the bleeding so we could focus on the NODATA DNS responses reported by the customers. After nearly an hour, there were no signs of improvement and the error rates remained stable. Our traffic team was confident that reverting the DS record at the registrar was sufficient to eventually mitigate the DNS resolution problems. As things were not getting any better, and given the severity of the incident, we decided to roll back one step further. This was a bad idea. We rolled back DNSSEC signing in our slack.com authoritative and delegated zones, wanting to return our DNS configuration to the last known healthy state, which would allow us to completely rule out DNSSEC as a problem during the incident. We had a false sense of trust in rolling back the zone signing changes because of the previous two attempts, where we had done this successfully, with the slight difference that this time the DS record had been published at the registrar. As soon as the rollback was pushed out, things got much worse.

This is an example DNS resolution alert that paged our traffic team. That was me that day. It basically said that slack.com was failing to resolve for multiple resolvers and probes around the world. This is not a page you want to get. Our understanding at the time, and what we expected, was that the DS record at the .com zone was never cached, so pulling it from the registrar would cause resolvers to immediately stop performing DNSSEC validation. This assumption turned out to be untrue, as the .com zone asks resolvers to cache the slack.com DS record for 24 hours by default. Stopping signing on the authoritative zone is a high-risk operation unless the DS record is no longer published, and, as we learned, no longer cached. However, the assumption that there were no cached DS records from the .com zone made us think we were in the same state as in the previous two rollbacks, with the same end result. Unlike our previous attempts, this time resolvers had cached the DS record for up to 24 hours once the rollback was completed. At this point, all validating resolvers that had recently cached this DS record started returning SERVFAIL for slack.com lookups.
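
The caching window that caught us out can be observed directly: the DS record served for a zone comes with a TTL, and that is how long a validating resolver may keep demanding valid signatures after the record is pulled. A sketch that asks a .com gTLD server for a zone's DS record and prints its TTL (placeholder zone; a.gtld-servers.net is one of the public .com nameservers):

import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

ZONE = "example.com"                # placeholder zone delegated from .com
GTLD_SERVER = "a.gtld-servers.net"  # one of the public .com authoritative servers

# Resolve the gTLD server to an IP, then query it directly for the DS record.
gtld_ip = next(iter(dns.resolver.resolve(GTLD_SERVER, "A"))).address
resp = dns.query.udp(dns.message.make_query(ZONE, "DS"), gtld_ip, timeout=5)

for section in (resp.answer, resp.authority):
    for rrset in section:
        if rrset.rdtype == dns.rdatatype.DS:
            # Resolvers may keep this DS cached for up to its TTL after it is pulled.
            print(f"{rrset.name} DS TTL: {rrset.ttl} seconds")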

We quickly realized the mistake we had just made, and we started contacting major ISPs and operators that run public DNS resolvers, asking them to flush all the cached records for slack.com. Additionally, we flushed caches ourselves for resolvers that offer a public cache-flush page, like Google's Quad8 or Cloudflare's Quad1. As resolvers flushed their caches, we slowly saw signs of improvement in our internal monitoring. Slack clients send their telemetry to a different domain called Slack B. Thanks to this isolation, we were still able to receive client metrics that helped us quantify the impact of this outage. Ironically, Slack B had DNSSEC enabled successfully prior to this incident.

Throughout the incident, we considered re-signing the slack.com zone without publishing the DS record at the .com zone again. Restoring signing would have slightly improved the situation for some users, but it wouldn't have solved the issue for all the users who were initially getting NODATA responses after signing was enabled. For us to restore signing, we would have had to recover the key signing key that had been deleted when we rolled back the changes to the slack.com zone. In Route 53, the key signing key is backed by an AWS KMS key; KMS is an AWS service for storing keys securely. This KMS key was recoverable as per our configuration, meaning that the key signing key would be the same. We were uncertain whether the zone signing key would have been the same or different, as the zone signing key is fully managed by AWS and we have no visibility into it.
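
On the Route 53 side, the zone's signing state and its key signing keys can be inspected programmatically before deciding whether re-enabling signing is safe. A hedged sketch with boto3, assuming the GetDNSSEC API and a placeholder hosted zone ID (again my illustration, not Slack's tooling):

import boto3

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"  # placeholder hosted zone ID

route53 = boto3.client("route53")
# GetDNSSEC returns the zone's signing status and its key signing keys.
dnssec = route53.get_dnssec(HostedZoneId=HOSTED_ZONE_ID)

print("Signing status:", dnssec["Status"]["ServeSignature"])
for ksk in dnssec["KeySigningKeys"]:
    print("KSK:", ksk["Name"], "status:", ksk["Status"])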

After the incident, Amazon's Route 53 team confirmed that the zone signing key would indeed have been different: when re-enabling signing on the zone, the zone signing key is generated as a new key pair, meaning that all DNSSEC-validating resolvers would have continued failing to validate at least until the DNSKEY TTL expired. During the incident, we decided not to restore signing due to lack of confidence; we now know it wouldn't have helped. At this point, some hours into the incident, we were in a situation where Slack was resolving correctly for the vast majority of customers, but a long tail of users of smaller DNS resolvers which had not flushed the slack.com DS record were still impacted. Then, as public resolvers' caches were flushed and the 24-hour TTL on the slack.com DS record expired, the error rates went back to normal. At this point, all we had left was a mystery to solve.

Chasing the Root Cause

Once the impact to Slack customers was mitigated, we aimed to determine the root cause of this outage. Prior to attempting DNSSEC on slack.com, our team had successfully enabled DNSSEC on all our other Slack public domains. We had a lot of questions. How was slack.com different? Why did we not see any issues with any of the previous domains? Why did this only impact some customers' corporate internal DNS resolvers and some of Google's public DNS lookups, but not all of them?

To understand what happened, we replicated the rollout steps using a test zone that was set up identically to slack.com, using a resolver that we knew had problems during the outage. Surprisingly, all our attempts to reproduce this behavior were unsuccessful. At this point, we said, let's go back to the NetLogs that we received from customers during the outage and see what we can get from them. Here's where we found an extremely interesting clue. Clients were able to resolve slack.com successfully during the incident, but they failed to resolve app.slack.com. Why? This indicated that there was likely a problem with the *.slack.com wildcard record. Then we realized that none of the other zones where we had rolled out DNSSEC in the past contained a wildcard record; this was the only one. It was clearly an oversight that we did not test a domain with a wildcard record before attempting it on slack.com.

We collected evidence and reached out to the Amazon Route 53 team, who jumped on a call with us to figure it out. After walking them through the evidence we had, they quickly correlated this behavior to an issue with the NSEC responses that Route 53 generated for wildcard records. When a request is made to a DNSSEC-enabled domain for a wildcard-covered name whose requested type does not exist, for example a AAAA record type, the answer is a signed NSEC record confirming that while the domain name exists, the requested type does not. This signed NSEC response can include some additional information, specifically designed to not break wildcard records, but Route 53 was not returning this additional information.

What this means is that when a client does a AAAA lookup for app.slack.com, it correctly gets back a signed NSEC saying there is no record of that type. This is correct, since we don't publish any AAAA records in this zone. The expected client behavior then is to fall back to requesting an A record. Since Route 53 wasn't returning that extra information, some public resolvers took the response to mean that no other record type exists for this record name, so there's no need to query for an A record afterwards. This NSEC response comes with a TTL, so it gets cached by resolvers, meaning that impacted DNS resolvers were being unintentionally cache poisoned by clients innocently checking for the existence of AAAA records. Luckily, this cache poisoning would only affect a regional cache, which means only users of an impacted regional cache for one of these resolvers would be affected, while other regional caches would not. Once we knew this, we were able to reproduce the issue against our test zone using Google DNS, but not using Cloudflare DNS, for example. We had a second mystery, which was whether or not one of these major public resolvers was out of spec. We went back to reading the DNS RFCs, and we found out it is protocol-compliant to use a cached NSEC record to decide a subsequent response. This is called aggressive NSEC caching (RFC 8198); it's an optional spec that Google DNS implements. In the end, we concluded that both behaviors were acceptable, and the only real way to avoid being in this ambiguous situation is for your authoritative nameserver to correctly return the existing record types in the NSEC type bitmap.
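
The bug is visible in the NSEC record that accompanies a NODATA answer: its type bitmap should list the record types that do exist at the name (for instance A), which is what tells a validating resolver it can still ask for them. A sketch of mine that sends a AAAA query with DNSSEC and prints any NSEC/NSEC3 records from the authority section (placeholder name and resolver):

import dns.message
import dns.query
import dns.rcode
import dns.rdatatype

QNAME = "app.example.com"  # placeholder for a wildcard-covered name
RESOLVER_IP = "8.8.8.8"

query = dns.message.make_query(QNAME, "AAAA", want_dnssec=True)
resp = dns.query.udp(query, RESOLVER_IP, timeout=5)

print("rcode:", dns.rcode.to_text(resp.rcode()), "- answer RRsets:", len(resp.answer))
for rrset in resp.authority:
    if rrset.rdtype in (dns.rdatatype.NSEC, dns.rdatatype.NSEC3):
        for rdata in rrset:
            # The textual form ends with the type bitmap, e.g. "... A RRSIG NSEC".
            # If existing types such as A are missing here, aggressive-NSEC
            # resolvers may cache "no other types exist" for this name.
            print(rrset.name, dns.rdatatype.to_text(rrset.rdtype), rdata.to_text())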

Our mystery was finally solved, and after a bit of time, the Route 53 team fixed the NSEC type bitmap bug described above, together with a fix to allow recovering the same zone signing key if you re-enable signing on a zone within a certain timeframe. We're very sorry for the impact this outage had on everyone who uses Slack, and we hope that this talk will shed some light on some of the corner cases of DNSSEC, and maybe prevent someone else from having a similar issue in the future.

The Final Attempt

There's more. Once all the post-incident work was completed, where we did things like writing an extensive internal review, figuring out incident remediation items, and working closely with Route 53 to fix the bug, we spent some time writing a post for our engineering blog where we explained what happened and some of the key takeaways from the incident. After all of this, we started planning for the next attempt. We collected all the learnings from the incident, and we started to figure out what had to happen for us to be comfortable doing another attempt, hopefully one last attempt. This is the last status update in the DNSSEC project channel a few days prior to attempt number four, where we finally enabled DNSSEC on slack.com with no issues whatsoever. I think this list is quite interesting, as we had various focuses. The first one is documentation and testing. We focused a lot on runbooks, improving them and making sure they were really solid and contained all the necessary steps for all operations related to DNSSEC. We tested zone signing key restore, after the Route 53 fix. Then we tested key signing key restore, verifying that the DNSKEY RRSIG is the same after re-enabling or restoring this key.

The second focus is observability. We did some work to have additional logging of DNS traffic, so we have interesting metrics like top resolvers, traffic patterns when resolvers start asking for DS records, SERVFAILs, and things like that. We built a dashboard with all this DNS traffic and other key Slack metrics, and we kept this dashboard on a monitor, refreshing constantly during the rollout. These additional metrics are nothing other than Route 53 logs: we enabled query logging on the zone and ingested the logs into our logging pipeline. We made sure that we could keep track of all these metrics as part of the rollout.
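
Route 53 public DNS query logs are plain space-separated lines; assuming the documented field order (version, timestamp, hosted zone ID, query name, query type, response code, protocol, edge location, resolver IP, EDNS client subnet), a small script can already answer questions like top resolvers and SERVFAIL counts. A rough sketch under that assumption (the file name is a placeholder):

from collections import Counter

top_resolvers = Counter()
rcodes = Counter()

# Assumed field order per line:
# version timestamp zone-id qname qtype rcode protocol edge resolver-ip edns-subnet
with open("route53-query-logs.txt") as fh:  # placeholder export of the query logs
    for line in fh:
        fields = line.split()
        if len(fields) < 10:
            continue
        rcodes[fields[5]] += 1         # response code, e.g. NOERROR or SERVFAIL
        top_resolvers[fields[8]] += 1  # resolver IP address

print("Top resolvers:", top_resolvers.most_common(5))
print("Response codes:", dict(rcodes))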

The third focus is comms. We communicated heavily, both internally at Slack and within Salesforce. We communicated extensively with the Route 53 team, and also with the NANOG and DNS-OARC groups. NANOG is the North American Network Operators Group, and DNS-OARC is a community of DNS operators; both contain a lot of network and DNS operators who can relay what they're seeing in their networks and resolvers to us, and whom we can reach out to for help in case of another failed attempt. The fourth focus, but not least, is the rollout plan. We agreed on running the rollout during the lowest traffic time in the week. We made some zone changes before enabling signing to reduce the impact time if things went wrong. And the last one is basically to stick to the runbook during the rollout: just follow the runbook precisely.

The fourth attempt was the charm. Today slack.com has had DNSSEC enabled for two months without any issues. This is a great accomplishment for us.

Questions and Answers

Bangser: You mentioned that you spent a fairly significant amount of time doing what I've heard described as pre-factoring: doing some work before you go into the change in order to make the architecture more friendly to the change. How did you go about scoping this out, estimating it, and advocating for it within the organization? Because I find people can often struggle to do that.

Tellez: Most of the work was simplifying our DNS zones. We work with two vendors at the moment, one of which is Route 53, and the other is NS1. We had this very convoluted environment where we had delegated many zones to NS1, because we really liked some of the features they offer for some specific Slack features. The thing is, we had many other domains which were just delegated to NS1 and were no longer using those features. We had this very complicated setup where we had to keep another link in the chain of trust to have a successful DNSSEC setup. What we did was go and do an analysis of all the zones we look after, and we outlined how many of them needed to be migrated back to Route 53 as the authoritative nameserver. We built up a plan, factored it into the RFC document, and got signoff from key stakeholders saying, yes, this is what we should do. It's extra work, but that extra work will save us time in the future, both in operating DNSSEC and in avoiding possible outages coming out of human error, like not establishing that chain of trust correctly. The fewer delegations we have, in this case, the better for us.

Bangser: It sounds like it also gave you a little bit more alignment over your zones so that your rollout could be a bit simpler, you could reuse some of the same across the zones, which is good.

Tellez: Yes, exactly. We had some interesting things happening there. Some of the migrations required us to build temporary zones and shift traffic between zones, because we had to rebuild some of them in NS1, specifically. We couldn't just create a zone and shift traffic there right away, because it takes a while to populate all the records. We had to create empty temporary zones that were not getting any traffic, populate them, then shift traffic there, perform the operations we needed on that specific zone, and then shift traffic back. It was a fair bit of work that we had to do prior to actually getting hands-on with DNSSEC in prod.

Bangser: I think this in-prod theme that we're talking about across the whole track, and that you touched on, is really interesting. It's that phrase: everyone's testing in production, just some people are paying attention to the results. This rollout to slack.com is a great example of that. You'd done all the pre-work. You'd done all the testing that you could have thought of. You weren't skimping and assuming you could just monitor for results. Yet it was still a surprise when you ran into things. You mentioned that before you went through, I think it was the fourth and final successful rollout, you added some more monitoring, some more logging, some more observability and telemetry. Are these things that were short-term adds, or do you still find yourself using that data, and finding yourself expanding it in other ways?

Tellez: They're more long-term things. For some zones, we figured it's not always useful to have the logs enabled all the time. For zones that are critical to us, I think it's really important that we have that visibility always available. In the example of Route 53 logs, those are really important for us, and we have built dashboards around them. Whenever we need to do any DNS operations, or we have any network-type outage, we can quickly pinpoint if there's a specific resolver whose traffic is dipping. Or we can even see a map of traffic divided by the client EDNS subnet, so we know where our customers are and which point of presence they're reaching, and all those kinds of things. It's not only figuring out the errors on the requests; it's also additional data that helps us figure out when's the best time to do certain things, or where we should invest in a new POP if we want to do that, and those kinds of things.

Bangser: I've been across the infra side and the software service side as well, and I think for a lot of people, each side is a bit of a black box. It's like, how do you debug DNS? What does observability even look like when you're talking about something that is as distributed as DNS and has so many dependencies? Do you feel like there are any learnings that you've brought in from software observability experiences, if you have any, into trying to bring observability to DNS? Or vice versa: experiences from this infrastructure and DNS side of things that you feel could enable software service owners to get even more insight into their services?

Tellez: I think it's more the former. I was able to bring a more software-style way of doing things into the DNS world in this case, because typically, operating DNS is just making changes in the zone, rather than actively monitoring the traffic and all those things. Thanks to this way of thinking, like the software development lifecycle, and all the metrics that we have on the Slack clients, we were able to get really valuable data. This Slack B thing that we were mentioning was critical to us, because we actually get telemetry from the clients about what's going on. The fact that it goes to a separate domain is perfect, because any impact to our primary production domains does not affect this Slack B domain where we get the telemetry. We can rely on this separate ingestion pipeline to make decisions while we're in an incident. Some of the metrics we were seeing during this incident, in particular, suggested everything was stable and everything looked good. That's because the customers that were reaching us were fine, but the ones that were not fine were not even reaching us. We didn't have that observability at the level of Slack's infrastructure, but we had it in the Slack B portion of the observability stack.

Bangser: I think there are always those realizations at the wrong time, when your status page is hosted on the same infrastructure as your app, and all of a sudden you can't even report your status being down. That's always a bit of a facepalm moment that seems so obvious in retrospect, but doesn't always get identified upfront.

 


 

Recorded at:

Nov 04, 2022
