An overview of the physical multiplayer barriers in online gaming, with a focus on Poker.
This is a direct request from a reader related to some of my past Poker engine work, so if you want a .NET post, catch me later. The primary focus of the question was the design and architecture instituted by the larger poker sites in order to support many thousands of simultaneous users and games while still handling all of the more basic aspects of online gaming, like account management, transactional betting, and chat. Rather than diving straight into that, let's first look at the real physical barriers to entering a multiplayer gaming market with a piece of software.
- The 2 user problem - Getting two users to play is the first hurdle. Here you are tackling the issues of getting your network communications working, really understanding the synchronization of multiple players on the server side, and basically running through the majority of the work to create a successful multiplayer game (assuming you only want to support a couple hundred users).
- The several hundred user problem - There is a barrier, generally at the server level, where supporting several hundred users becomes almost impossible. At this stage you are running into message routing issues, your CPU is getting bogged down, and you are stressing your hardware a bit. This is where you start to realize a couple of things. First, you can still double or triple your base by fixing the server and the message protocol and doing some more performance tuning. We've discussed some heinously simple ways to compress cards, for instance, into a single network packet along with a bunch of other information. The second thing you realize is that distributing and adding machines is also going to help, but at an expensive management cost and with built-in synchronization problems that have to be handled between machines.
- The several thousand user problem - If you are still working on one machine, which is possible, then you've hit a hard limit in socket and network performance. You have to start distributing at this point and/or buying special hardware. At this boundary you are most likely going to offload chat, or at least general chat, to a separate server, while still carrying in-game chat directly over the game channel. (Game channel communication is generally fast, which is why most games support party chat natively but hand global chat off to other software and/or servers.) To get past this stage you have to solve all of the distributed synchronization issues, so that you can add new servers easily as capacity needs rise.
- The 50,000 user problem - The holy grail of network gaming is getting to 50,000 users. It's very hard with MMORPG-style applications, because the server cluster representing a game world is really capped at around 2,500 to 5,000 users. That means a game at this scale is truly distributed and runs at least 10 game worlds (50,000 / 5,000). Here you start to get fragmentation of your population, and it is hard for people to find those they want to play with. At this point you have dedicated servers for varying levels of play, and you've set aside specific servers for tournaments and competition. Solving the user connection problem is going to be difficult, but it will be the primary and possibly only goal at this stage (discounting the bandwidth and throughput issues you are having with your ISP, and them bitching about the cost of bandwidth going up when in fact that cost has steadily decreased. I love my ISP folks, they actually give me rebates when that happens.)
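To give a rough feel for the kind of card compression mentioned under the several hundred user problem, here is a minimal Python sketch. It assumes a standard 52-card deck and a made-up wire format of one count byte followed by 6 bits per card; the names `pack_cards` and `unpack_cards` and the byte layout are hypothetical illustrations, not the encoding from my earlier posts.

```python
# Hypothetical wire format: one count byte, then 6 bits per card, big-endian.
RANKS = "23456789TJQKA"
SUITS = "cdhs"
BITS_PER_CARD = 6  # 52 distinct cards fit in 6 bits (2**6 = 64)


def card_index(card: str) -> int:
    """Map a two-character card such as 'As' to an integer in 0..51."""
    return RANKS.index(card[0]) * 4 + SUITS.index(card[1])


def card_name(index: int) -> str:
    """Inverse of card_index."""
    return RANKS[index // 4] + SUITS[index % 4]


def pack_cards(cards: list[str]) -> bytes:
    """Pack cards into ceil(6 * n / 8) bytes, prefixed by a count byte."""
    value = 0
    for card in cards:
        value = (value << BITS_PER_CARD) | card_index(card)
    n_bytes = (len(cards) * BITS_PER_CARD + 7) // 8
    return bytes([len(cards)]) + value.to_bytes(n_bytes, "big")


def unpack_cards(data: bytes) -> list[str]:
    """Read the count byte, then peel 6-bit card codes off the low end."""
    count = data[0]
    value = int.from_bytes(data[1:], "big")
    cards = []
    for _ in range(count):
        cards.append(card_name(value & 0x3F))
        value >>= BITS_PER_CARD
    return cards[::-1]  # the last card packed sits in the lowest bits
```

With this layout a seven-card hold'em view (two hole cards plus five board cards) costs 42 bits, so the whole hand fits in 7 bytes including the count, leaving plenty of room in one packet for bets, seat numbers, and timers.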
Lots of problems at every level. No reason not to tackle them, though. In fact, it isn't all that difficult to manage. Look at the MSN Online Gaming Zone as a great example. They've logically broken their entire site down into the concept of rooms. Once you've gotten into a room, you are either sharing a server with some other rooms or you are on your own server. At this level players can choose opponents from those available and start a game. Each room has only 50 tables going at once, and while those 50 tables probably aren't all hosted by a single game server, they could be. Players are always running on a dedicated server, and synchronization issues with the back-end are kept to a minimum. Basic consistency checks for security can ensure there aren't dual logins and that only one game server can update a player's account at a time. Failures are easily traced to particular servers, which can be brought offline and fixed quickly while games continue across the rest of the farm.
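The two consistency checks above (no dual logins, and only one game server writing to an account at a time) can be sketched as a small registry. This is a minimal in-process Python illustration: the class and method names are hypothetical, and the `threading.Lock` stands in for whatever shared database or lock service a real back-end would use across machines.

```python
import threading


class SessionRegistry:
    """Hypothetical back-end registry: one login per account, and one game
    server owning writes to an account at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()         # stand-in for a shared lock service
        self._logins: dict[str, str] = {}     # account_id -> session_id
        self._owners: dict[str, str] = {}     # account_id -> game server id

    def login(self, account_id: str, session_id: str) -> bool:
        with self._lock:
            if account_id in self._logins:
                return False                  # reject the dual login
            self._logins[account_id] = session_id
            return True

    def logout(self, account_id: str, session_id: str) -> None:
        with self._lock:
            if self._logins.get(account_id) == session_id:
                del self._logins[account_id]
                self._owners.pop(account_id, None)

    def claim_account(self, account_id: str, server_id: str) -> bool:
        """Grant write ownership unless another server already holds it."""
        with self._lock:
            if account_id not in self._logins:
                return False
            return self._owners.setdefault(account_id, server_id) == server_id

    def release_server(self, server_id: str) -> None:
        """Drop every claim held by a server being taken offline for repair."""
        with self._lock:
            for account_id in [a for a, s in self._owners.items() if s == server_id]:
                del self._owners[account_id]
```

Because ownership is tracked per server, pulling a failed box out of the farm is just a `release_server` call, after which its players' accounts can be claimed by whichever server picks them up.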
How do online poker sites solve this problem? They don't! In almost every case they buy the software pre-canned. There are many reasons for this, but the most basic is being licensed to serve gambling content. Ever been to a casino and played the electronic poker games? Those are extensively tested to make sure their odds are consistent and then licensed for use. You are dealing with people's money, so there has to be some oversight of how the money is handled to ensure some level of security. Think of the gambling site as a brokerage and you, the player, as a stock trader. You'd be pretty scared if the brokerage's software wasn't licensed and checked to make sure your money couldn't leak elsewhere. You'd be even more frosty if your shares and/or payouts weren't properly credited back to your account because of some glitch in the system or even a server crash.
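The payout concern above is what "transactional betting" is about: moving a pot to a winner must either happen completely or not at all, even if something fails halfway. As a minimal sketch, this uses Python's built-in `sqlite3` module as a stand-in for a real ledger; the schema, the `open_ledger` and `pay_out_pot` names, and the use of integer cents are assumptions for illustration, not how any licensed engine actually stores money.

```python
import sqlite3


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open a toy ledger; isolation_level=None means we issue BEGIN/COMMIT ourselves."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts ("
        " id TEXT PRIMARY KEY,"
        " balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pots ("
        " table_id TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0))"
    )
    return conn


def pay_out_pot(conn: sqlite3.Connection, table_id: str, winner_id: str) -> int:
    """Empty a table's pot into the winner's account as one atomic transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading the pot
    try:
        row = conn.execute(
            "SELECT amount_cents FROM pots WHERE table_id = ?", (table_id,)
        ).fetchone()
        if row is None:
            raise ValueError(f"no pot for table {table_id!r}")
        amount = row[0]
        conn.execute("UPDATE pots SET amount_cents = 0 WHERE table_id = ?", (table_id,))
        cur = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE id = ?",
            (amount, winner_id),
        )
        if cur.rowcount != 1:
            raise ValueError(f"unknown account {winner_id!r}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # the emptied pot is restored if anything failed
        raise
    return amount
```

The pot is zeroed before the winner is credited on purpose: if the credit fails, the rollback visibly restores the pot, so money is never destroyed or duplicated by a half-finished payout.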
Obviously the original sites produced some of their own software, but there is a burgeoning market selling full-blown poker engines that are already licensed and ready to go. The company buys a support license and, after the appropriate amount of marketing, has hopefully reached critical mass and is making enough money to move forward. Gambling is one of the few industries where a site can actually start breaking even in a rather short time, especially sites that are not bent on massive growth but instead support a steady, fixed population of users. After all, 3% of a $50k table is going to yield you just as much as 3% of several hundred casual tables whose stakes add up to the same $50k. That gives you a market to serve if you just want to put a small server out there and still make decent cash by playing to those who want special services.
To wrap up, I'll give some architecture design guidelines that should help you overcome many of these boundaries. Many of them are common sense for hardware gurus or those familiar with distributed processing, but you'd be surprised how easy they are to overlook when you are designing your system.
- Have a connection management plan from the beginning. There are hard limits on connections, and you have to either make sure you can add servers or support more connections through better (probably expensive) hardware. Adding servers is the easier option, so design in that direction. If you can write a disconnected protocol and the overhead of reconnecting doesn't swamp you, then plan for that.
- Separate core services and extended services at the connection level. It should be easy to run lobby services (extended), tournament management services (extended), account management services (required, but extended), and any others on their own hardware. Core services support the game, and that means everything that has to go over the wire during play. Chat is an example of a core/extended service, since you can easily add it to your table services, but at the same time you may get better options by offloading it to specialized chat services. Core services also use guaranteed protocols and require you to build in lots of fail-safes and contingencies, whereas extended services (yeah, I know, the account manager) don't need nearly as much of that engineering. Chat can be run over UDP without guaranteed delivery, and many other services can be run on generic web servers.
- Logically break down the game process into modules. This is the most important consideration for distributed play. You have to ensure that you can easily spin off a new table to a server that is capable of handling it. Remember the 2 user problem? Well, while solving that problem you created a consistent engine. That engine could logically run anywhere, but most likely it ran within the context of a server: either one of the users was the server, or a specialized server existed somewhere else that they both connected to. The engine hosted by the server, then, is your module for distribution. Note that the game servers themselves will still host services such as authentication, connection management, message routing, and account interaction.
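Putting the first and last guidelines together, once the table engine is a self-contained module, placing a new table becomes a simple scheduling decision against each server's connection budget. This is a minimal Python sketch under two stated assumptions: each seat costs one connection, and tables go to the least-loaded server that still has room. The `GameServer` and `TableScheduler` names and the capacity numbers are hypothetical; real farms may weigh CPU, bandwidth, or stakes level instead.

```python
from dataclasses import dataclass, field


@dataclass
class GameServer:
    name: str
    max_connections: int  # hypothetical per-box connection budget
    tables: dict[str, int] = field(default_factory=dict)  # table_id -> seats

    @property
    def load(self) -> int:
        return sum(self.tables.values())  # assumes one connection per seat

    def can_host(self, seats: int) -> bool:
        return self.load + seats <= self.max_connections


class TableScheduler:
    def __init__(self) -> None:
        self.servers: list[GameServer] = []

    def add_server(self, server: GameServer) -> None:
        self.servers.append(server)  # growing capacity is just adding a box

    def place_table(self, table_id: str, seats: int) -> str:
        """Put a table engine on the least-loaded server that has room."""
        candidates = [s for s in self.servers if s.can_host(seats)]
        if not candidates:
            raise RuntimeError("no capacity left: add another game server")
        target = min(candidates, key=lambda s: s.load)
        target.tables[table_id] = seats
        return target.name

    def drain(self, name: str) -> dict[str, str]:
        """Take a server out of the farm and re-place its tables elsewhere."""
        server = next(s for s in self.servers if s.name == name)
        self.servers.remove(server)
        return {t: self.place_table(t, seats) for t, seats in server.tables.items()}
```

The scheduler never needs to know anything about poker; it only moves opaque table modules around, which is exactly the reusability the modular design buys you.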
Just those three considerations will get you well along the way. There are still many problems to solve, but hey, if you are making the big dollars and want me to give you some more hints, then share the wealth and we'll talk ;-) These problems used to be cutting-edge research, but they are now well-known domains with many interesting solutions. Most solutions at the higher marks are still very customized, and rather than maintain the distributed model, many implementations try to squeeze extra performance out of what they have. I don't have a problem with that, but it does kill the reusability of their solution just so they can support an extra few thousand users who could have been supported with an extra server. In many cases they don't have proper load balancing and didn't isolate the game modules for distribution. In others, the load balancing can be overridden by the users (almost every MMORPG is helpless if every user decides to show up in the same zone). Newer games such as City of Heroes have overcome these issues with ingenious solutions that we either love or hate. Hopefully we'll see more of that modular approach, with a focus on reusable and portable solutions.
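One generic way to keep users from defeating load balancing by piling into the same zone is instancing: cap the population of each copy of a zone and open another copy when the existing ones fill up. The following is a minimal Python sketch of that idea with a hypothetical `ZoneInstancer` class and an arbitrary cap; it is not a description of how City of Heroes or any other particular game does it.

```python
class ZoneInstancer:
    """Hypothetical helper: cap each zone instance and open a new one when full."""

    def __init__(self, cap: int) -> None:
        self.cap = cap
        self._population: dict[str, int] = {}    # instance id -> player count
        self._by_zone: dict[str, list[str]] = {}  # zone -> instance ids, oldest first

    def enter(self, zone: str) -> str:
        """Return the instance a player joins, creating one if all are full."""
        instances = self._by_zone.setdefault(zone, [])
        for instance_id in instances:
            if self._population[instance_id] < self.cap:
                self._population[instance_id] += 1
                return instance_id
        instance_id = f"{zone}/{len(instances) + 1}"  # ids are never reused
        instances.append(instance_id)
        self._population[instance_id] = 1
        return instance_id

    def leave(self, instance_id: str) -> None:
        if self._population.get(instance_id, 0) > 0:
            self._population[instance_id] -= 1
```

Filling the oldest instance first keeps players together where possible, and each new instance is just another module that a scheduler can place on whichever server has spare capacity.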