Network Automation at Fastly

Ryan Landry, senior director for TechOps at edge cloud platform Fastly, has shared how network automation enables the company to manage traffic peaks during popular live-streamed events such as Super Bowl LIV.

Fastly is directly connected to numerous ISPs across the US and endeavours to keep live video traffic on these direct paths with its partners, so that video streams are delivered as close to the end-user as possible. However, when traffic demand increases, these interconnection points can become congested, which degrades quality. Viewers of a live stream may then experience performance issues such as video buffering or reduced stream quality as a result of packet loss. When users have a poor online experience, a majority of them will abandon the broadcast within a couple of minutes.

Fastly has built-in network automation (known internally as Auto Peer Slasher, or APS, and underpinned by StackStorm) that activates when the interconnection points become congested and link utilisation is nearing full capacity. APS automatically diverts a small portion of traffic in order to keep the link under congestion thresholds. This traffic is then automatically rerouted via alternate best paths to the given ISP, typically via IP transit. With very large live streaming traffic, this can happen multiple times in a matter of minutes, causing the platform to shed traffic from interconnect partners to IP transit repeatedly. In most cases, the connection state is maintained, eliminating the need for the player to restart a session from scratch. Towards the end of a live event, when peak traffic declines, APS unwinds those actions and returns the network to its starting position.
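The shed-and-unwind behaviour described above can be illustrated with a minimal sketch. This is not Fastly's APS code: the thresholds, the step size, and the names `utilisation`, `next_steps` and `simulate` are all assumptions chosen for illustration, and the model simply treats each action as diverting a fixed fraction of demand away from the direct link.

```python
# Hypothetical tuning values, not taken from Fastly.
STEP = 0.05   # fraction of demand diverted away from the link per action
HIGH = 0.90   # shed more traffic when utilisation rises above this
LOW = 0.70    # unwind a step only if the link would stay below this


def utilisation(demand_gbps: float, capacity_gbps: float, steps: int) -> float:
    """Link utilisation after `steps` diversion actions are in effect."""
    kept = max(0.0, 1.0 - steps * STEP)  # share of demand still on the link
    return demand_gbps * kept / capacity_gbps


def next_steps(demand_gbps: float, capacity_gbps: float, steps: int) -> int:
    """Decide whether to shed one more step, unwind one step, or hold."""
    if utilisation(demand_gbps, capacity_gbps, steps) > HIGH:
        return steps + 1  # link is near capacity: divert a bit more
    if steps > 0 and utilisation(demand_gbps, capacity_gbps, steps - 1) < LOW:
        return steps - 1  # demand has dropped: safe to undo one action
    return steps


def simulate(demands_gbps: list[float], capacity_gbps: float) -> list[int]:
    """Return the number of active diversion steps after each demand sample."""
    steps, history = 0, []
    for demand in demands_gbps:
        steps = next_steps(demand, capacity_gbps, steps)
        history.append(steps)
    return history
```

For a 100 Gbps link, `simulate([60, 95, 110, 110, 80, 60, 40], 100.0)` returns `[0, 1, 2, 3, 3, 2, 1]`: diversions accumulate while demand peaks and are undone one at a time as it falls. The gap between the two thresholds is a design choice in this sketch that stops the controller from flapping between shedding and unwinding.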

Link utilisation is one measure, but it doesn't necessarily highlight potential congestion deep inside certain backbones or ISP networks. Rates of loss and retransmissions are other measures that Fastly observes and acts on in real time, using a technique they call Fast Path Failover (FPF). Their edge caches monitor the forward progress of individual end-user TCP flows. If a flow appears to stall on a given path, the cache automatically attempts to forward the flow via an alternate path, aiming to maintain a stable connection and its quality. When the amount of automatically diverted traffic exceeds the available capacity of alternate paths, or if FPF is unable to find uncongested alternate paths, Fastly's engineers decide how to reroute traffic next.
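The per-flow idea behind FPF can be sketched as follows. This is an illustrative model only: the article does not describe how Fastly detects a stall, so the `Flow` class, the `observe` function, and the rule of "no newly acknowledged bytes for `stall_limit` consecutive samples" are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    paths: list[str]        # candidate paths, preferred path first
    path_index: int = 0     # index of the path currently in use
    last_acked: int = 0     # highest acknowledged byte count seen so far
    stalled_samples: int = 0

    @property
    def path(self) -> str:
        return self.paths[self.path_index]


def observe(flow: Flow, acked_bytes: int, stall_limit: int = 3) -> bool:
    """Record one progress sample; return True if the flow was moved."""
    if acked_bytes > flow.last_acked:
        # Forward progress: remember it and clear the stall counter.
        flow.last_acked = acked_bytes
        flow.stalled_samples = 0
        return False
    flow.stalled_samples += 1
    has_alternate = flow.path_index + 1 < len(flow.paths)
    if flow.stalled_samples >= stall_limit and has_alternate:
        # Stalled for too long: try the next candidate path.
        flow.path_index += 1
        flow.stalled_samples = 0
        return True
    return False
```

A flow created as `Flow(paths=["peering", "transit"])` that reports 1000, 2000, 2000, 2000, 2000 acknowledged bytes is moved to `"transit"` on the fifth sample, after three samples without progress. When no alternate path remains, `observe` keeps returning `False`, which corresponds to the case where engineers have to step in.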

Fastly has learned through experience that using an 'all hands on deck' approach to traffic engineering adds complexity. Whilst the network engineering team at Fastly is a lean and efficient group, they further reduce the number of engineers at the controls for major live events, to around twelve on average. They break the geography into quadrants and assign a lead engineer to each. Each lead engineer is partnered with a co-pilot engineer who monitors alerts and thresholds and feeds information to their quadrant leader as necessary, while providing secondary validation and verification of changes made by the lead. When the automatic shifting of traffic from direct ISP links begins to reach the upper limits of available point of presence (POP) capacity, the engineering pair works together to decide how and where to migrate traffic next, usually by altering Fastly's Border Gateway Protocol (BGP) anycast announcements or influencing end-user POP selection via their Domain Name System (DNS) management platform.

The automation and systems run 24/7. During one recent major multi-day event, over a forty-eight hour period, the team observed APS performing a total of 349 actions against the network across the ten most active POPs and interconnect partners. While APS handles much of the heavy lifting, the team spends their time tuning the system and attending to other elements of the edge cloud platform's performance. In February 2020, APS carried out more than 2,900 automated actions across the global network in response to changing internet conditions, while the next closest on-call engineer carried out just over 500.

Read more about Fastly's network automation and see real examples of their network automation running here.
