Learning from Overwatch 2's Server Troubles: Expert Advice for a Smoother Game Launch
The launch of Overwatch 2 was marred by server disconnects, black screens, DDoS attacks, and endless queues, a scenario all too familiar in online multiplayer games. Recent releases such as Forza Horizon 5, Splatoon 3, Halo Infinite, and Battlefield 2042 all suffered significant network infrastructure problems at launch, leaving many to wonder what goes wrong and how studios can avoid these pitfalls.

Diagnosing these problems is complex: developers must sift through a multitude of potential causes, and even then, resolving one problem can create another. Overwatch 2's launch, with its massive server disruptions, is a prime example; some issues persisted for weeks after release.

So what causes these problems? While pinpointing a single cause for Overwatch 2's infrastructure woes is difficult, AAA shooters share several common weaknesses. Centralized servers, though cost-effective and easier to manage, create attractive targets for DDoS attacks and bottlenecks for players. Large studios often run games on specific, hard-to-find hardware, which makes scaling difficult, especially when QA teams certify servers on particular models and thereby limit the ability to expand to other providers during traffic surges. The temptation to use large servers to save money can also backfire: packing many players onto a single node raises the stakes of an attack and creates single points of failure.

Planning for infrastructure issues is challenging regardless of studio size, but there are straightforward mitigations. Distributed networks, in which data processing and management are spread across many nodes rather than centralized, offer flexibility, scalability, and a reduced risk of bottlenecks and outages. Integrating with cloud-based or edge infrastructure providers can also reduce dropped connections and latency by positioning players closer to servers.
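To make the edge-infrastructure idea concrete, here is a minimal sketch of latency-based region selection: the client measures round-trip times to several candidate regions and connects to the closest one. The region names and latency figures below are hypothetical, and a real matchmaker would measure RTTs live (e.g., via ping probes) rather than read them from a static table.

```python
def pick_region(rtts_ms: dict[str, float]) -> str:
    """Return the region with the lowest measured round-trip time."""
    return min(rtts_ms, key=rtts_ms.get)

# Hypothetical measurements for a player on the US east coast.
measured = {"us-east": 38.0, "us-west": 72.0, "eu-west": 110.0}
print(pick_region(measured))  # -> us-east
```

Spreading players across whichever region is closest to them, rather than funneling everyone to one central cluster, is what reduces both latency and the blast radius of a single outage.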
Testing games on multiple infrastructure providers and on widely available machines also helps mitigate disruption risks. Automation and deployment tools such as Kubernetes and containerized workloads solve some of these problems but introduce new ones, underscoring the need for serious automation investment so studios can leverage better infrastructure without overburdening engineers.

The financial repercussions of downtime, even brief downtime, can be significant, as Blizzard's move to a free-to-play model for Overwatch shows: every hour offline means lost revenue from in-game purchases. Flexibility is key, whether in provisioning servers to meet demand, avoiding overprovisioning locked in by long-term contracts, or pursuing multiple infrastructure partnerships to spread risk. World War 3's server problems and the review-bombing that followed on Steam illustrate the long-term damage infrastructure failures can do to a game. Even a company as large as Microsoft, with its pending acquisition of Activision Blizzard and a likely move to Azure, should weigh the benefits of additional infrastructure partnerships beyond a single provider to protect quality of service and the player experience.

The takeaway is to stay open-minded about leveraging infrastructure flexibly, rather than relying on static solutions, so that launches succeed with minimal impact on the studio's bottom line.
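The "provision to meet demand" point can be sketched as a simple capacity calculation: given the current concurrent player count and per-server capacity, compute how many instances to run, with a safety headroom so a traffic spike doesn't immediately hit queues. The numbers, the 20% headroom, and the minimum-fleet floor here are hypothetical illustrations, not any studio's actual policy; an autoscaler such as Kubernetes' would evaluate logic like this continuously.

```python
import math

def desired_servers(concurrent_players: int, players_per_server: int,
                    headroom: float = 0.2, min_servers: int = 2) -> int:
    """Instances needed for current load plus a safety buffer,
    never dropping below a small always-on minimum fleet."""
    needed = math.ceil(concurrent_players * (1 + headroom) / players_per_server)
    return max(needed, min_servers)

print(desired_servers(100_000, 120))  # -> 1000 (120k player-slots for 100k players)
print(desired_servers(50, 120))       # -> 2   (minimum fleet keeps capacity warm)
```

Recomputing this figure from live metrics, instead of locking in a fixed fleet size via a long-term contract, is what lets a studio scale up for launch week and back down afterward.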