QCon London 2026: Shielding the Core: Architecting Resilience with Multi-Layer Defenses

Anderson Parra, staff software engineer at SeatGeek, presented Shielding the Core: Architecting Resilience with Multi-Layer Defenses at QCon London 2026, where he discussed strategies for handling significant traffic spikes that can overwhelm even well-designed infrastructure.

Parra kicked off his presentation by describing the environment in which SeatGeek operates, including what he characterized as a "traffic stampede." The problem isn't the traffic itself, he maintained; it is traffic that arrives faster than a system can adapt.

As shown in the picture below, several signals indicate when a system may collapse.

The Noisy Neighbor Problem refers to a multi-tenant system where one tenant disproportionately consumes shared resources, which degrades performance for other tenants.

The Scaling Gap is defined as the period when scaling lags behind demand. Systems must survive the scaling gap, Parra maintained, and this is where shielding the core begins.

The strategy to shield the core is threefold: Absorb the Burst by handling sudden traffic spikes before they reach core systems; Control the Flow by applying fairness, rate limits, and admission control; and Protect the Core by keeping critical services stable during demand spikes.

The defense layer deployed by SeatGeek uses a multi-shield approach:

Edge Shield
Gateway Shield
Platform Shield

Edge Shield

The responsibilities of the Edge Shield include: a Cache that serves requests without hitting the origin; a Queue to absorb sudden traffic bursts; and a Filter to detect bots and invalid traffic.

Using the Cache as a resilience mechanism addresses a self-reinforcing failure loop: as failures increase, fewer responses are cached; with fewer cache hits, more traffic reaches the origin; and as origin traffic increases, failures increase further.

Parra maintained that everything changes with the combined use of the cache with rate limiting. The service remains stable, the cache warms up more safely, and there is a decrease in origin load.
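The combination can be illustrated with a minimal sketch, assuming an in-process cache and a fixed budget of origin fetches per time window. The names ShieldedCache, fetch_origin, and max_origin_calls are hypothetical and do not come from the talk; SeatGeek's actual edge implementation was not described at this level of detail.

```python
from typing import Callable, Dict, Optional


class ShieldedCache:
    """Toy cache that caps origin fetches per window (all names hypothetical)."""

    def __init__(self, fetch_origin: Callable[[str], str], max_origin_calls: int):
        self.fetch_origin = fetch_origin
        self.max_origin_calls = max_origin_calls  # origin budget per window
        self.origin_calls = 0
        self.store: Dict[str, str] = {}

    def new_window(self) -> None:
        """Reset the origin budget, for example once per second."""
        self.origin_calls = 0

    def get(self, key: str) -> Optional[str]:
        if key in self.store:
            return self.store[key]        # cache hit: the origin is not touched
        if self.origin_calls >= self.max_origin_calls:
            return None                   # budget spent: shed instead of hitting origin
        self.origin_calls += 1
        value = self.fetch_origin(key)
        self.store[key] = value           # the cache warms at a bounded pace
        return value
```

With a budget of two origin calls, a burst of misses for three distinct keys reaches the origin only twice; the third request is shed, and later requests for the first two keys are served from the cache.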

SeatGeek also implements a Virtual Waiting Room that absorbs the traffic and controls the flow.
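A waiting room of this kind can be sketched as a first-in, first-out queue that releases visitors at a fixed pace. This is a simplified illustration under assumed names (WaitingRoom, admit_per_tick); the talk summary does not describe how SeatGeek's waiting room orders or admits visitors.

```python
from collections import deque
from typing import Deque, List


class WaitingRoom:
    """Toy FIFO waiting room that admits a fixed number of visitors per tick."""

    def __init__(self, admit_per_tick: int):
        self.admit_per_tick = admit_per_tick
        self.queue: Deque[str] = deque()

    def join(self, visitor_id: str) -> int:
        """Enqueue a visitor and return their 1-based position in line."""
        self.queue.append(visitor_id)
        return len(self.queue)

    def tick(self) -> List[str]:
        """Release up to admit_per_tick visitors, oldest first."""
        admitted: List[str] = []
        while self.queue and len(admitted) < self.admit_per_tick:
            admitted.append(self.queue.popleft())
        return admitted
```

The burst is absorbed by the queue, while the downstream system only ever sees admit_per_tick new visitors per tick, regardless of how many arrive at once.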

Gateway Shield

The responsibilities of the Gateway Shield include: a Rate Limit that controls the rate of requests; Fair Access that protects legitimate users; and Validation that rejects invalid traffic.

Rate limiting uses a Rate Limit Gate to protect the platform from overload. This allows client requests during normal traffic but triggers an HTTP 429 Too Many Requests response during high-traffic spikes.
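The gate's behavior can be approximated with a token bucket, shown here as a sketch. The choice of algorithm is an assumption, since the summary does not say which one SeatGeek uses, and the class name and the rate and burst parameters are hypothetical. Time is passed in explicitly so the example stays deterministic.

```python
class RateLimitGate:
    """Token bucket: refill at `rate` tokens per second, hold at most `burst`."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start full so normal traffic passes
        self.last = now

    def check(self, now: float) -> int:
        """Return the HTTP status the gate would answer with at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 200
        return 429  # Too Many Requests
```

Under normal traffic the bucket refills faster than it drains and every request passes; during a spike the bucket empties and excess requests receive 429 until tokens accumulate again.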

Sources of traffic include humans (fans who legitimately want to purchase tickets) and automated agents (sophisticated bots and distributed automation). The SeatGeek Fair Access Policy applies rate limits to users and their accounts, and to consumers and their API keys. Limits by IP address are used as a fallback.
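One way to picture the policy is as a function that chooses which identity a limit is counted against. The sketch below is illustrative: the function name, the key prefixes, and the precedence of user over API key are assumptions; only the use of IP address as the fallback comes from the talk.

```python
from typing import Optional


def rate_limit_key(user_id: Optional[str], api_key: Optional[str], client_ip: str) -> str:
    """Pick the identity a limit is counted against; IP is only the fallback."""
    if user_id:
        return "user:" + user_id        # authenticated fan or account
    if api_key:
        return "api_key:" + api_key     # API consumer (order vs. user is assumed)
    return "ip:" + client_ip            # nothing better is known about the caller
```

Keying on identity rather than IP address means fans who share a network are not throttled as one client, while anonymous traffic still falls under some limit.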

Platform Shield

The responsibilities of the Platform Shield include: Resource Isolation, which applies CPU limits and scheduling priorities to prevent noisy neighbors; Prioritization, which protects critical paths; and Observability Signals, such as queue size, CPU saturation, and scaling signals.

Parra described a scenario of three services (labeled A | B | C) and compared them with and without isolation and the subsequent cascading events (or Noisy Neighbor Problem) when service A is affected. Without isolation, when service A experiences a significant increase in CPU time, service B incurs increased latency, followed by a collapse in service C. Conversely, limiting CPU time in service A provides stability for both service B and service C.
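The effect of a CPU limit in that scenario can be seen in a toy model. The sketch assumes CPU is shared in proportion to demand whenever total demand exceeds capacity; this is a simplification for illustration, not how any particular scheduler works, and the function name and numbers are hypothetical.

```python
from typing import Dict, Optional


def allocate_cpu(demand: Dict[str, float], capacity: float,
                 limits: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Share `capacity` CPU units in proportion to (optionally capped) demand."""
    limits = limits or {}
    # Apply per-service limits first; services without a limit keep their demand.
    capped = {svc: min(d, limits.get(svc, d)) for svc, d in demand.items()}
    total = sum(capped.values())
    if total <= capacity:
        return capped                     # no contention: everyone gets what they ask for
    scale = capacity / total              # contention: every service is squeezed
    return {svc: d * scale for svc, d in capped.items()}


demand = {"A": 8.0, "B": 1.0, "C": 1.0}  # service A is the noisy neighbor
shared = allocate_cpu(demand, capacity=4.0)
isolated = allocate_cpu(demand, capacity=4.0, limits={"A": 2.0})
```

In the shared case, services B and C receive less CPU than they need because A's demand dominates; with A capped at 2.0 units, B and C each receive their full 1.0 unit.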

The Flow of Signals and Scaling maps as follows:

Spike in traffic --> Increase in queue size (a signal) --> Reaction by the scaling mechanism (invocation of the Horizontal Pod Autoscaler (HPA)) --> Increase in capacity (more available pods) --> Decrease in queue size.
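The flow can be traced with a small simulation. The replica calculation follows the formula documented for the Kubernetes HPA, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue); everything else (the tick model, per_pod_rate, the queue-per-pod target, scale-up only, and the floor of one replica) is an assumption made for illustration.

```python
import math
from typing import List, Tuple


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Replica count from the HPA formula: ceil(current * metric / target)."""
    return max(1, math.ceil(current_replicas * current_metric / target_metric))


def simulate(arrivals: List[int], per_pod_rate: int,
             target_queue_per_pod: float, pods: int = 1) -> List[Tuple[int, int]]:
    """Return (queue, pods) after each tick; new pods take effect on the next tick."""
    queue = 0
    history: List[Tuple[int, int]] = []
    for arriving in arrivals:
        queue = max(0, queue + arriving - pods * per_pod_rate)
        history.append((queue, pods))
        if queue > 0:  # only scale up in this sketch; scale-down is omitted
            pods = max(pods, desired_replicas(pods, queue / pods, target_queue_per_pod))
    return history


history = simulate([20, 100, 100, 20, 20], per_pod_rate=10,
                   target_queue_per_pod=10.0, pods=2)
# history == [(0, 2), (80, 2), (100, 8), (20, 10), (0, 10)]
```

The queue keeps growing for one tick after the spike begins because the extra pods arrive late. That lag is the scaling gap described earlier, and the queue only drains once capacity catches up.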

Signals originate from all three layers of the SeatGeek defense system. Parra stated that every system needs signals and that a resilient system depends on early ones. The earlier the scaling mechanism receives a signal, the faster the queue in the Flow of Signals and Scaling drains.

The Four Core Principles include: Composition, where resilience is layered; Protect the Core to preserve critical paths; Observe Pressure because signals reveal stress; and Controlled Failure to fail gracefully, if necessary.
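Protect the Core and Controlled Failure can be combined in a load-shedding rule that rejects lower-priority work first as pressure rises. The sketch below is hypothetical: the priority classes, the thresholds, and the normalized load value between 0.0 and 1.0 are all assumptions and were not presented in the talk.

```python
def admit(priority: str, load: float) -> bool:
    """Shed lower-priority work first as load (0.0 to 1.0) rises."""
    # Hypothetical thresholds: the load above which each class is rejected.
    shed_above = {"background": 0.6, "browse": 0.8, "checkout": 0.95}
    return load <= shed_above[priority]
```

At a load of 0.9, for example, this rule still admits checkout requests while rejecting browse and background traffic, so the system fails in a controlled way at the edges rather than collapsing on the critical path.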

The best signals appear before failure, and Parra concluded by stating, "Internet stampedes are inevitable; system collapse, however, is not."

More details on this topic may be found in this white paper.

About the Author

Michael Redlich
