Lessons Learned from Overwatch 2's Server Issues

The launch of Overwatch 2 was marred by server disconnects, black screens, DDoS attacks, and endless queues. Unfortunately, this has become a common phenomenon in the gaming industry, with many AAA titles experiencing similar issues. Forza Horizon 5, Splatoon 3, Halo Infinite, and Battlefield 2042 are just a few examples of games that have faced significant network infrastructure problems at launch. The question on everyone's mind is: what's going wrong and how can studios prevent their games from suffering the same fate?

Diagnosing infrastructure issues is a complex and time-consuming process. Developers must work through a long list of potential problems to identify the root cause. Even then, fixing one issue may lead to another. The Overwatch 2 launch is a prime example, with Blizzard's servers experiencing massive disruption and some problems still persisting weeks after release.

So, what's behind these issues? While it's difficult to pinpoint a single cause, we can look at the typical network infrastructure used by AAA shooters like Overwatch. Often, the problems boil down to a few key issues:

Many game studios rely on centralized servers to handle data processing and management. While this approach has its benefits, such as cost-effectiveness and ease of management, it also has significant drawbacks. Centralized servers can create bottlenecks, making them vulnerable to DDoS attacks and congestion. When players experience network congestion, it's often due to traffic overload on a single node rather than a lack of servers.

Larger studios often use specific hardware with limited global availability, making it challenging to scale their games. QA teams may only certify game servers on specific models, limiting the DevOps/LiveOps team's ability to adapt to traffic surges. This can lead to frustration when other models could be used, but the desire to follow established QA procedures prevents expansion to other providers.

Studios may be tempted to use large servers with multiple CPU cores to save money. However, this approach can lead to a high density of players per physical node, making the system more vulnerable to attacks and failures. Each node becomes a single point of failure for a large group of players, which makes it an easier target for attackers.

Planning for infrastructure issues is challenging, regardless of the studio's size. However, there are simple solutions to these problems, depending on the type of servers and hardware used.

A distributed network, where data processing and management are spread across multiple locations, can provide more flexibility and scalability. This approach allows new servers to be added as needed and can reduce the risk of bottlenecks and outages. Using multiple providers can also mitigate the risk of service outages.

Integrating with cloud-based or edge infrastructure providers can help reduce dropped connections and other network issues. Edge servers can also address latency and bandwidth concerns by placing servers closer to players, reducing the distance data has to travel.

When testing online games, operating and testing on multiple infrastructure providers and widely available machines can help identify potential issues and mitigate risks. Working with platforms and partners that don't require additional internal resources can save time and money.

Today's infrastructure and the diversity of services offered by studios have created complex puzzles. Automation and deployment solutions like Kubernetes, containerized payloads, microservices, and CI/CD can solve problems but also bring new challenges. Serious automation is necessary to take advantage of better infrastructure, and studios should focus on developing the best game possible rather than rebuilding existing tools.

The financial implications of infrastructure issues can be significant. When a popular multiplayer game is down, even for a short period, it can result in lost revenue.
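
To put a rough number on lost revenue during downtime, the sketch below prorates a game's daily revenue over the length of an outage. It assumes revenue accrues evenly across the day, which real games do not, so treat the result as an order-of-magnitude estimate. The function name and the figures are hypothetical and are not taken from any studio.

```python
def estimated_outage_loss(daily_revenue: float, outage_hours: float) -> float:
    """Rough revenue lost during an outage.

    Assumption: revenue is spread evenly over 24 hours, so peak-hour
    outages will cost more than this estimate and off-peak ones less.
    """
    if daily_revenue < 0 or outage_hours < 0:
        raise ValueError("daily_revenue and outage_hours must be non-negative")
    return daily_revenue * (outage_hours / 24.0)


# Hypothetical example: a game earning $1,000,000 per day, down for 6 hours.
loss = estimated_outage_loss(1_000_000, 6)
print(f"Estimated loss: ${loss:,.0f}")
```

An estimate like this ignores longer-term effects such as players who leave and never return, so it is best read as a lower bound.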

Blizzard's move to a free-to-play model for Overwatch 2 means that the primary revenue stream will come from in-game purchases, which were likely impacted by the game's launch issues. Roblox, which generates over $5 million daily, experienced an outage in 2021, highlighting the potential financial implications of such events.

Efficiently provisioning servers to meet demand is a significant challenge. Forecasting player numbers is difficult, and overprovisioning can lead to unnecessary costs. Flexibility is key, and studios should consider using cloud-based or edge infrastructure providers to reduce the risk of infrastructure issues.

Infrastructure issues can have long-term consequences, even if they're short-lived. Players experiencing significant issues at launch may abandon the game and move on to something else. The recent launch of World War 3 and its server problems is an example of how infrastructure issues can impact a game's reputation and visibility.

In conclusion, being open-minded about flexibly leveraging infrastructure is crucial. Relying on a static infrastructure can lead to problems when issues inevitably arise. Current cloud and server providers want predictable forecasts and revenue, but the traffic flow for online games can be unpredictable, making it a challenge to deal with from both a technical and business perspective. Thankfully, new flexible solutions are emerging to fill this gap and ensure game launches have the best experience with minimal impact on the studio's bottom line.
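
As a rough illustration of the provisioning trade-off discussed above, the sketch below sizes a fleet for a forecast peak and shows how player density per node changes the share of players dropped when one node fails. All names, the 20% headroom default, and the player counts are hypothetical assumptions chosen for illustration only.

```python
def servers_needed(peak_players: int, players_per_server: int,
                   headroom_pct: int = 20) -> int:
    """Servers required to host a forecast peak plus a safety margin.

    Uses integer arithmetic (ceiling division) to avoid float rounding.
    """
    if peak_players < 0 or players_per_server <= 0 or headroom_pct < 0:
        raise ValueError("invalid capacity inputs")
    target = peak_players * (100 + headroom_pct)   # players x 100, with headroom
    per_server = players_per_server * 100
    return -(-target // per_server)                # ceiling division


def share_lost_per_node(peak_players: int, players_per_server: int) -> float:
    """Fraction of peak players dropped if one fully loaded node fails."""
    if peak_players <= 0 or players_per_server <= 0:
        raise ValueError("invalid capacity inputs")
    return min(1.0, players_per_server / peak_players)


# Hypothetical forecast: 100,000 concurrent players at peak.
peak = 100_000
for density in (500, 2_000):  # players hosted on one physical node
    count = servers_needed(peak, density)
    share = share_lost_per_node(peak, density)
    print(f"{density} players/node: {count} nodes, "
          f"{share:.1%} of players affected per node failure")
```

Denser nodes need fewer machines, but each failure or targeted attack affects a larger share of players, which is the trade-off a studio weighs when choosing between a few large servers and a more distributed fleet.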