Microsoft recently published a case study describing how a massively multiplayer online (MMO) game used Microsoft Azure to support tens of thousands of players in a single space battle. The case study looks at how architectural considerations like connectivity, latency, and scale can be addressed in an elastic cloud environment that must respond quickly to unexpected bursts in demand. InfoQ reached out to the implementation team to learn even more about this design.
The case study pointed out that designing MMO games is difficult because of the vastness of the world the players live in. These games are processor-intensive and companies traditionally purchase larger and larger machines to shoulder the burden. Game-makers often lessen the system strain by introducing constraints on the number of players or the freedom of the user to explore the world. The Age of Ascent team believed that “creating a game that can scale out across hundreds or thousands of machines is the only way to break these constraints.” Such a distributed design brought new challenges as the game creators had to connect more individual services and build a communication backplane that synchronized data across machines. The team chose Microsoft Azure as a platform, and mapped four key capabilities to the Azure service catalog [emphasis added by InfoQ]:
- The game needed to be responsive, so the ability to partition the game across a number of datacenters, geographically distributed across the globe, offering high-bandwidth, low-latency network access, was critical.
- The game designers didn’t want to explicitly manage the process of connecting gamers to the best point of presence for them, so they used the Microsoft Azure Traffic Manager, which provides geo-routing to the datacenter that offers the best game experience, based on the fastest response time.
- It used the Microsoft Azure Content Delivery Network (CDN), which provides additional “edge” nodes across the globe, to cache frequently downloaded files and higher-latency shared data (e.g., distant events).
- To persist game and user data it used the Microsoft Azure storage system, which offers a high-availability, triple-copied, geo-replicated store, to ensure that once data was committed to the system, it would never be lost and would always be available.
The front end of this browser-based game uses HTML5 and WebGL, a JavaScript API for rendering graphics. Age of Ascent also uses Three.js for animating 3-D graphics. Browsers connect to the server-side code via web sockets over SSL. All user-initiated actions – such as moving around the world or firing at opponents – are sent as messages down the web socket to the server-side environment in Microsoft Azure. The game logic spans Azure locations, and the Azure Traffic Manager directs the user to the nearest data center. Microsoft helped the game maker design a handful of services that do everything from governing usage to rendering the right part of the “universe” for the gamer.
The gatekeeper service sits in front of the game’s landing page; it validates users, rejects illegal requests, and limits the number of users when the game is under extremely heavy load. This service also downloads game assets from the Azure CDN in order to pre-warm the user’s session and make game entry faster. Once a gamer has joined Age of Ascent, the routing service kicks in. The user’s web socket session is anchored to a worker node in the routing service, and these worker nodes act as the link between the browser and the backend services.

The interest management engine is described as the “process of reducing the vast size of the universe, and the objects that inhabit it, down to those aspects that are directly relevant to a specific user, at a given point in time and space.” Spaceships or explosions in the immediate area of the user are more relevant than a planet in the far distance. Each worker role in the engine owns a piece of the Age of Ascent universe and is aware of what’s happening in that domain. As a user moves through the game, the engine hands them off to the worker roles that “own” that piece of the universe. The team built a communication backplane that carries traffic between clusters of routing and interest management servers.

With a seemingly endless array of events happening around the universe at any one time, and so many statistics constantly being aggregated, how does the system avoid swamping each client browser with data? The high latency service sends the browser an incremental feed of major events that are happening far away, along with statistics like team score. These events and statistics can be rendered on a delay as they become relevant to the user.
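The case study doesn’t include code, but the core of interest management – deciding which objects matter to a given player at a given moment – can be illustrated with a short sketch. The C# fragment below is a hypothetical illustration (the class, field names, and radius are not from the Age of Ascent codebase): a worker role filters the objects it owns down to those within a player’s area of interest before anything is pushed down the web socket.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of distance-based interest management.
// Type names, fields, and the radius are illustrative only.
public class WorldObject
{
    public int Id;
    public double X, Y, Z;
}

public class InterestManager
{
    // Objects owned by this worker role's slice of the universe.
    private readonly List<WorldObject> _owned = new List<WorldObject>();

    // Only objects inside this radius are streamed to the player in real time;
    // distant events are left to the high latency feed.
    private const double InterestRadius = 5000.0;

    public void Track(WorldObject obj)
    {
        _owned.Add(obj);
    }

    public IEnumerable<WorldObject> RelevantTo(double px, double py, double pz)
    {
        double r2 = InterestRadius * InterestRadius;
        return _owned.Where(o =>
        {
            double dx = o.X - px, dy = o.Y - py, dz = o.Z - pz;
            return dx * dx + dy * dy + dz * dz <= r2;   // squared 3D distance check
        });
    }
}
```

Comparing squared distances avoids taking a square root on every check, which matters when the same test runs millions of times per second, as the numbers Duizer quotes later in this article suggest.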
Instead of estimating the number of gamers they planned to support and provisioning a corresponding amount of hardware, the Age of Ascent team chose to define “scale units” for rapidly adding capacity to the game.
A scale unit is the smallest unit of deployment, which for this project is composed of:
- 12 routing service worker roles
- 12 interest management worker roles
- 2 backplane service worker roles
The high latency service runs as a global service outside the scale unit.
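To make this model concrete, the hypothetical C# snippet below encodes the composition above; the types and the planner are illustrative only, as the case study does not describe the team’s actual deployment tooling. The point is that capacity is added by stamping out whole units rather than by tuning individual role counts.

```csharp
// Hypothetical representation of a scale unit; the role counts come from the
// case study, but the types and the planner are purely illustrative.
public class ScaleUnit
{
    public const int RoutingRoles = 12;
    public const int InterestManagementRoles = 12;
    public const int BackplaneRoles = 2;

    public static int TotalRoles
    {
        get { return RoutingRoles + InterestManagementRoles + BackplaneRoles; }
    }
}

public static class CapacityPlanner
{
    // Capacity is added by deploying whole units, never by resizing one.
    public static int RolesFor(int scaleUnits)
    {
        return scaleUnits * ScaleUnit.TotalRoles;   // e.g. 4 units -> 104 roles
    }
}
```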
This model was tested earlier this year and the system processed 267 million personalized messages per second. By their calculations, this was equivalent to over 100,000 concurrent users fighting in the same battle.
Microsoft isn’t the only vendor courting this next generation of gaming systems. Running web-scale games requires cloud providers to deliver a network stack that can support tens or hundreds of thousands of sessions, a data-center footprint that can serve a global audience with low latency, and fast-provisioning compute services to respond to activity bursts. Amazon Web Services has a dedicated sub-site targeted at game designers that highlights its wide range of app services, analytics capabilities, and low cost. Google also touts its game-hosting friendliness, with a focus on devices, the Google App Engine PaaS, and a robust storage offering.
Anko Duizer, Director of Technical Evangelism & Development EMEA, responded to InfoQ’s request for more information about the Age of Ascent architecture on Microsoft Azure.
InfoQ: What's the actual inventory of Azure services used?
Duizer: We used Azure Table Storage, Blob Storage, Azure Queues, the old Caching service (not Redis), the new CDN (the EdgeCast one), Traffic Manager, and Worker roles for the majority of the game, while using Katana and Web roles for the security token servers. We did not use BizTalk.
InfoQ: What language/platform was the backend service written in, and which OS does it run on?
Duizer: C# with .NET 4.5.1 on Microsoft Azure.
InfoQ: How is the solution able to handle failures at each level: routing server, scale unit, data center?
Duizer:
- The client JavaScript deals with failures of the routing server by retrying a connection.
- If the game server goes down, the routing server will reconnect to a new game server.
- If the back plane goes down, we have a spare backplane server that is running live.
- If the scale unit goes down, the client will again retry a connection and Traffic Manager will bounce them to another data center.
InfoQ: How exactly is messaging done in the communication backplane? Using an Azure service like Queues or Service Bus, or via something with even less latency?
Duizer: It’s all web sockets between the roles – that is, direct connections and binary messages.
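As a rough illustration of what that looks like, the sketch below uses the .NET ClientWebSocket class to push a small binary frame from one role to another. The endpoint, message layout, and framing here are assumptions made for the example, not details of the game’s actual protocol.

```csharp
using System;
using System.Net.WebSockets;
using System.Threading;
using System.Threading.Tasks;

// Minimal sketch of role-to-role backplane messaging over a direct web socket.
// The URI, message layout, and framing are assumptions, not the game's protocol.
public class BackplaneClient
{
    private readonly ClientWebSocket _socket = new ClientWebSocket();

    public async Task ConnectAsync(Uri peer)
    {
        // e.g. new Uri("ws://10.0.0.12:8080/backplane") - an internal role endpoint
        await _socket.ConnectAsync(peer, CancellationToken.None);
    }

    public async Task SendPositionAsync(int shipId, float x, float y, float z)
    {
        // Pack a small fixed-size binary frame: a 4-byte id plus three 4-byte floats.
        var buffer = new byte[16];
        BitConverter.GetBytes(shipId).CopyTo(buffer, 0);
        BitConverter.GetBytes(x).CopyTo(buffer, 4);
        BitConverter.GetBytes(y).CopyTo(buffer, 8);
        BitConverter.GetBytes(z).CopyTo(buffer, 12);

        await _socket.SendAsync(new ArraySegment<byte>(buffer),
            WebSocketMessageType.Binary,
            true,                       // endOfMessage
            CancellationToken.None);
    }
}
```

Skipping a brokered service and text serialization keeps per-message overhead and latency to a minimum, which is presumably the motivation for the direct-socket design.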
InfoQ: How does the team deploy updates to each various component? Are they done "in place" via rolling updates, or through platform downtime?
Duizer: We deploy updates “in place” via rolling updates. Thanks to the scalable architecture, this doesn’t cause downtime in the game, because when a unit is down, the client will simply retry via the Traffic Manager.
InfoQ: What constraints, if any, of Azure itself did the team bump into and architect around?
Duizer: One problem was not being able to directly address a server behind a load balancer on port 443. To ensure we could connect the player to the server that maintained the state for their region of space, we created the routing layer and basically dedicated half of our servers to it.
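A hypothetical sketch of that routing layer follows, reusing the BackplaneClient type from the earlier backplane example; the real system’s lookup, handoff, and failover logic are omitted. The idea is simply that the player’s public connection terminates on a routing role, which then relays traffic to the internal game server that owns the player’s region of space – the server the public load balancer cannot address directly.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical sketch of the routing layer: the player's public web socket
// terminates here, and messages are relayed to the internal game server
// that owns the player's region of space.
public class RoutingRole
{
    // regionId -> connection to the game server owning that region
    private readonly ConcurrentDictionary<int, BackplaneClient> _regionServers =
        new ConcurrentDictionary<int, BackplaneClient>();

    public async Task ForwardAsync(int regionId, int shipId, float x, float y, float z)
    {
        BackplaneClient server;
        if (_regionServers.TryGetValue(regionId, out server))
        {
            // Relay the player's action to the stateful server for that region.
            await server.SendPositionAsync(shipId, x, y, z);
        }
        // In the real system a miss would trigger a lookup/handoff, not a silent drop.
    }
}
```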
InfoQ: Do players stay pinned to a region to prevent the need to replicate user data globally?
Duizer: Yes. With 100,000 incoming messages every second, they would need to be distributed to all 200 servers – that’s 20,000,000 messages per second just to keep the state up-to-date. It would also require each server to check every message against each player’s interest list to see if they should receive that message – that would be something like 30 million 3D distance calls every second, on every server.