Key Lessons from Overwatch 2's Server Issues

The launch of Overwatch 2 was marred by server disconnects, black screens, DDoS attacks, login failures, and endless queues. Unfortunately, this has become a familiar scenario for AAA games: Forza Horizon 5, Splatoon 3, Halo Infinite, and Battlefield 2042 all suffered major network infrastructure problems at launch. These problems often persist for weeks or even months because they are genuinely hard to diagnose and resolve; developers must work through a long list of potential causes, and fixing one problem can surface another. Overwatch 2's launch, for instance, was disrupted by massive server issues, some of which still linger weeks after release.

So what causes these infrastructure failures? It's rarely a single thing, but a handful of patterns recur.

Many game studios rely on centralized servers to handle data processing and management. That can be cost-effective and easier to operate, but it also creates significant vulnerabilities: concentrating valuable data in one location makes it a prime target for DDoS attacks and creates bottlenecks that congest player traffic.

Large studios also tend to standardize on specific, high-performance hardware with limited global availability, which makes scaling difficult. The high-clock-speed CPUs that Unreal servers favor, for example, can be hard to source, especially when game instances need ever faster and more powerful CPUs to support large player counts. And when QA teams certify game servers only on specific machine models, DevOps and LiveOps teams lose the ability to respond to traffic surges and are forced to stick to established procedures.

There's also a temptation to use large servers with many CPU cores to save money. But packing many players onto each node makes those nodes easier for attackers to target, and a single attacked or failed node can ripple across thousands of players. (A sketch of the opposite "spread" placement policy appears at the end of this section.)

Planning for infrastructure issues is challenging regardless of studio size, but there are straightforward mitigations. One approach is a distributed network, where data processing and management are spread across the entire network rather than concentrated in a central location; this makes the network more flexible, more scalable, and more resistant to bottlenecks and DDoS attacks. Integrating with cloud-based or edge infrastructure providers can also reduce dropped connections and other network issues by positioning players closer to servers and shortening the distances data has to travel (see the routing sketch below). And testing games on multiple infrastructure providers and on widely available machines helps mitigate disruption risk and surface potential issues in advance.

Today's infrastructure is complex, and automation is essential to use it effectively. Studios have talented engineers, but those engineers should focus on game development rather than rebuilding existing automation tooling. The financial repercussions of getting this wrong are significant, as Blizzard's Overwatch 2 and Roblox have shown.

Provisioning enough servers to meet demand is its own challenge, because forecasting player numbers is difficult. Flexibility is key: overprovisioning wastes resources, while underprovisioning frustrates players (the provisioning sketch below makes this tradeoff concrete).

Finally, infrastructure issues can have long-lasting, adverse effects on a game, driving players to abandon it for alternatives. The recent launch of World War 3 and its server problems showed how angry players can review-bomb a game, damaging its visibility and reputation.
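To make the density point concrete, here is a minimal Python sketch of a "spread" placement policy: new matches go to the least-loaded node, and every node has a hard player cap so a single attacked or failed machine affects fewer players. All of the names here (`Node`, `MAX_PLAYERS_PER_NODE`, `place_match`) are hypothetical, not any particular orchestrator's API.

```python
from dataclasses import dataclass

# Hypothetical cap per node; tune to the blast radius you can tolerate.
MAX_PLAYERS_PER_NODE = 120

@dataclass
class Node:
    name: str
    players: int = 0

def place_match(nodes: list[Node], match_size: int) -> Node | None:
    """Place a match on the least-loaded node that can still take it."""
    candidates = [n for n in nodes if n.players + match_size <= MAX_PLAYERS_PER_NODE]
    if not candidates:
        return None  # fleet is full: scale out rather than overpack nodes
    target = min(candidates, key=lambda n: n.players)
    target.players += match_size
    return target

fleet = [Node("node-a"), Node("node-b"), Node("node-c")]
for _ in range(5):
    node = place_match(fleet, match_size=12)
    print(node.name if node else "scale out")
```

The cap is deliberate: when no node has room, the right response is to add nodes, not to squeeze more players onto the ones you have.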
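Positioning players closer to servers usually starts with latency-aware routing: probe each candidate region and connect to the one with the lowest round-trip time. The sketch below assumes a stub `measure_rtt_ms` function; a real client would ping QoS endpoints exposed by the edge or cloud vendor.

```python
import random

REGIONS = ["us-east", "us-west", "eu-central", "ap-southeast"]

def measure_rtt_ms(region: str) -> float:
    # Stand-in for a real ping/QoS probe against the region's endpoint.
    return random.uniform(10, 200)

def pick_region(regions: list[str], samples: int = 3) -> str:
    # Sample each region a few times and compare medians, which is more
    # robust than trusting a single probe.
    def median_rtt(region: str) -> float:
        rtts = sorted(measure_rtt_ms(region) for _ in range(samples))
        return rtts[samples // 2]
    return min(regions, key=median_rtt)

print(pick_region(REGIONS))
```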
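And the over/underprovisioning tradeoff often reduces to a headroom calculation with an explicit floor and ceiling. Here is a minimal sketch, assuming illustrative capacity numbers; in practice the concurrent-user figure would come from telemetry and forecasts.

```python
import math

PLAYERS_PER_SERVER = 100   # hypothetical capacity of one game-server instance
HEADROOM = 0.25            # keep 25% spare capacity for sudden spikes
MIN_SERVERS = 4            # floor: never scale to zero during quiet hours
MAX_SERVERS = 500          # ceiling: cap spend if the forecast is wrong

def desired_servers(ccu: int) -> int:
    """Desired instance count for a given concurrent-user (CCU) figure."""
    raw = math.ceil(ccu * (1 + HEADROOM) / PLAYERS_PER_SERVER)
    return max(MIN_SERVERS, min(MAX_SERVERS, raw))

for ccu in (0, 2_000, 35_000, 80_000):
    print(f"{ccu} players -> {desired_servers(ccu)} servers")
```

Raising HEADROOM trades money for resilience; the MIN/MAX clamps are where the over- versus underprovisioning decision becomes explicit.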
Looking ahead, Microsoft's potential acquisition of Activision Blizzard may push the studio toward greater reliance on Azure's cloud infrastructure, but history has shown that depending on a single provider is no guarantee against failure. The takeaway is to stay open-minded about using infrastructure flexibly rather than committing to a static setup. A new generation of flexible solutions can help ensure game launches succeed while minimizing the impact on studios' bottom lines.