Atlassian Engineering recently published how it exceeded 99.9999% availability with its Tenant Context Service (TCS). Atlassian achieved this high availability by implementing highly autonomous client sidecars that can proactively shield themselves from complete AWS region failures. To accomplish this, the sidecars query multiple TCS deployments concurrently while keeping the handling of those requests entirely isolated internally.
The TCS is a critical infrastructure service at Atlassian, called multiple times in the path of every web request for most of Atlassian's cloud products. It provides a highly-available, read-optimised view of "tenant metadata". In July 2022, the TCS served up to 32 billion daily requests, with a peak request rate of 586,308 requests per second. The overall availability exceeded 99.999%, and the highest-throughput clients saw median response times of around 11μs during peak times.
Atlassian engineers designed the TCS with the CQRS pattern to achieve these metrics. As changes occur in the "tenant metadata" catalogue, TCS ingests transformed views of the metadata into AWS DynamoDB. In addition, TCS uses L1 in-memory caches extensively, along with an SNS-based cache invalidation broadcast system. Sidecars deployed with client applications act as remote extensions of the web server caches and increase availability by communicating with multiple TCS deployments. The following diagram depicts TCS' architecture.
Source: https://www.atlassian.com/engineering/atlassian-critical-services-above-six-nines-of-availability
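The read path of this design can be approximated with a small sketch: an in-memory L1 cache sitting in front of the DynamoDB-backed read view, with entries evicted when an invalidation message broadcast over SNS names the tenant. The class and method names below are illustrative assumptions, not Atlassian's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative L1 cache for the read-optimised "tenant metadata" view.
// Entries are loaded from the DynamoDB-backed store on a miss and evicted
// when an invalidation message (broadcast via SNS) names the tenant.
public class TenantMetadataL1Cache {

    private final Map<String, TenantMetadata> cache = new ConcurrentHashMap<>();
    private final Function<String, TenantMetadata> dynamoDbLoader; // hypothetical loader

    public TenantMetadataL1Cache(Function<String, TenantMetadata> dynamoDbLoader) {
        this.dynamoDbLoader = dynamoDbLoader;
    }

    // Read path: serve from memory, fall back to the transformed view in DynamoDB.
    public TenantMetadata get(String tenantId) {
        return cache.computeIfAbsent(tenantId, dynamoDbLoader);
    }

    // Called by the SNS-driven invalidation listener when the catalogue changes.
    public void onInvalidation(String tenantId) {
        cache.remove(tenantId);
    }

    // Hypothetical shape of a tenant metadata entry.
    public record TenantMetadata(String tenantId, String shard, String region) {}
}
```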
As the sidecars' cache-hit ratio generally exceeds 99.5%, cache misses are relatively rare, so the sidecars can afford to preemptively send duplicate requests on every miss - one to a chosen "primary" parent TCS and one to a random secondary TCS. One benefit of this approach is that the sidecar seamlessly handles parent or network failures: it does not need to detect a failure before reacting, as the "fallback" request is already in flight.
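In effect, the sidecar issues a hedged pair of requests and completes with whichever answer arrives first. A minimal sketch of that idea, using CompletableFuture and a hypothetical parent-TCS abstraction (not Atlassian's actual client), might look like this:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of the "duplicate request on cache miss" pattern:
// query the primary parent TCS and one randomly chosen secondary in parallel,
// and complete with whichever response arrives first.
public class HedgedTcsClient {

    private final TcsParent primary;
    private final List<TcsParent> secondaries;

    public HedgedTcsClient(TcsParent primary, List<TcsParent> secondaries) {
        this.primary = primary;
        this.secondaries = secondaries;
    }

    public CompletableFuture<String> fetchTenantMetadata(String tenantId) {
        TcsParent secondary =
                secondaries.get(ThreadLocalRandom.current().nextInt(secondaries.size()));

        CompletableFuture<String> fromPrimary = primary.query(tenantId);
        CompletableFuture<String> fromSecondary = secondary.query(tenantId);

        // No failure detection is required: if one parent is down or slow,
        // the other in-flight request supplies the result. Note that anyOf
        // completes with the first outcome, success or failure, so a real
        // client would also handle the case where the faster request errors.
        return CompletableFuture.anyOf(fromPrimary, fromSecondary)
                .thenApply(result -> (String) result);
    }

    // Hypothetical parent-TCS abstraction; a real sidecar would issue an HTTP or gRPC call.
    public interface TcsParent {
        CompletableFuture<String> query(String tenantId);
    }
}
```

The design trade-off is deliberate: because misses are rare, doubling the request volume on a miss costs little while removing the need for failover logic.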
David Connard, a principal developer at Atlassian, explains the nuances involved in this approach:
While the logic makes sense and it works pretty well for fail-fast scenarios, you also need to plan for when things fail-slow. Often that's the most problematic failure mode for a system to handle. This is where proper isolation can be most critical. For us, to have proper isolation implies that a failure of any single parent TCS, AWS service, or entire AWS region must not impact our sidecar's ability to operate against a different region.
To achieve this high degree of isolation, Atlassian engineers handle requests using independent task queues and thread pools, entirely isolated for each parent TCS (even down to the HTTP connection pool instances). They utilise request load shedding (selectively dropping requests) and dynamic thread pool sizing (limiting the thread pool size for TCS deployments with lower latency) to guard against fail-slow scenarios where tasks queue up and consume additional resources.
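A simplified illustration of that per-parent isolation gives every parent TCS its own bounded queue and thread pool and sheds work once the queue fills, so a fail-slow parent can only exhaust its own resources. The pool sizes, queue length, and names below are assumptions for the sketch, not Atlassian's values:

```java
import java.util.Map;
import java.util.concurrent.*;

// Illustrative per-parent isolation: each parent TCS gets its own bounded
// queue and thread pool, so a slow ("fail-slow") parent can only exhaust
// its own resources. When a queue fills up, excess requests are shed.
public class IsolatedParentExecutors {

    private final Map<String, ThreadPoolExecutor> executorsByParent = new ConcurrentHashMap<>();

    public ThreadPoolExecutor forParent(String parentId) {
        return executorsByParent.computeIfAbsent(parentId, id ->
                new ThreadPoolExecutor(
                        2, 8,                                   // assumed sizes; tuned per parent latency
                        30, TimeUnit.SECONDS,
                        new ArrayBlockingQueue<>(100),          // bounded queue per parent
                        new ThreadPoolExecutor.AbortPolicy())); // reject = shed the request
    }

    public CompletableFuture<String> submit(String parentId, Callable<String> request) {
        try {
            return CompletableFuture.supplyAsync(() -> {
                try {
                    return request.call();
                } catch (Exception e) {
                    throw new CompletionException(e);
                }
            }, forParent(parentId));
        } catch (RejectedExecutionException shed) {
            // Load shedding: the duplicate request to the other parent is
            // still in flight, so dropping this one is acceptable.
            return CompletableFuture.failedFuture(shed);
        }
    }
}
```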
On the server side, the invalidation broadcast system makes cross-region calls to publish invalidation messages. As cross-region latency is significantly higher than in-region latency, it can delay the invalidation broadcast. Connard explains how engineers protect TCS from this issue:
A cross-region outage (like an AWS SNS failure in one target region) must not delay or prevent invalidation broadcasts to other regions from that TCS server. To achieve isolation against this, the TCS server invalidation broadcast system replicates all invalidation broadcast data and processing threads into separate region-specific queues. Isolated worker threads then publish to each target region from just one of those queues. A slowdown or complete failure to send broadcasts into one target region will slow processing for that region only, and will have no impact on the publication of messages to other target regions.
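The same isolation idea on the server side can be sketched as one queue and one dedicated worker per target region, so a slow SNS publish in one region only backs up that region's queue. The snippet below is an illustrative assumption; SnsPublisher stands in for the real AWS SDK publish call:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

// Illustrative per-region invalidation broadcaster: each target region gets
// its own queue and dedicated worker thread, so a slowdown or outage of SNS
// in one region cannot delay broadcasts to the other regions.
public class RegionIsolatedInvalidationBroadcaster {

    // Stand-in for the real AWS SDK SNS publish call.
    public interface SnsPublisher {
        void publish(String region, String invalidationMessage);
    }

    private final Map<String, BlockingQueue<String>> queuesByRegion = new ConcurrentHashMap<>();

    public RegionIsolatedInvalidationBroadcaster(List<String> targetRegions, SnsPublisher publisher) {
        for (String region : targetRegions) {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            queuesByRegion.put(region, queue);

            // One isolated worker per region drains that region's queue only.
            Thread worker = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        publisher.publish(region, queue.take());
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "invalidation-broadcast-" + region);
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Replicate every invalidation message into each region's queue.
    public void broadcast(String invalidationMessage) {
        queuesByRegion.values().forEach(queue -> queue.add(invalidationMessage));
    }
}
```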
Atlassian engineers also employed several methods to scale the system in addition to making it highly available. These methods include using an SNS fanout pattern, a custom request load-balancing strategy incorporating the sidecars' network-monitoring capabilities, and the adoption of gRPC as a lower-latency, highly efficient alternative to HTTP APIs.