The Infrastructure and Cloud Deployment Strategy of X.D. Network Inc.

Two major projects have taken up most of my time this year. One is Ragnarok Online; the other is To the Moon Mobile. Ragnarok Online is the first project in my experience to reach the magnitude of 10Gbps of bandwidth over the public network. To the Moon, on the other hand, carries a lot of responsibility for me, since it is a classic game with a very high reputation.
Now that To the Moon Mobile has been successfully released, I can take a short break and get back to some writing.

X.D. Network Inc. (aka X.D.) is a game publisher headquartered in China. As the CTO of this major online game publisher, I am in charge of the infrastructure deployment strategy that provides sufficient servers for our customers and players. Along the way, we have accumulated quite a lot of experience, which I want to share with you now.

History

In the beginning, X.D. only deployed bare-metal servers in rented rack space in data centers. When our games needed to be released worldwide instead of only in China, we also began to rent servers from Linode and AWS. That was the first time we got to know the concept of cloud computing.

By 2015, X.D. had made several attempts to try out cloud computing. With our contractor’s help, we set up a full OpenStack cluster, providing servers for our development teams as well as some backend services for our players.

As cloud service providers matured, X.D. started to use Aliyun, UCloud and Tencent Cloud. When AWS entered China, we got in immediately and deployed servers in the new AWS Beijing region (cn-north-1).

I think it’s safe to say that we have tried almost all the key players in the cloud computing world, either in production or at least through serious evaluation.

Problems of cloud computing

Problem 1: Cloud is expensive. It’s still very expensive even now.

Let me share some numbers from our standardized bare-metal data center deployments. X.D. always buys bare-metal rack servers with similar specs for easier management. For example, the latest batch we bought is 2 x 12 cores (48 threads with Hyper-Threading), 512GB RAM, and 12 x 4TB HDD in RAID 10, which gives 24TB of storage space. It’s a typical rack server, nothing special, except that we prefer larger RAM because in most cases an online game server needs to hold many players’ data in memory.

Such rack servers can safely be expected to last 3–5 years. If we spread the cost over each month, it comes to only about 300 USD/month. As for the rack space we rent from data centers, even with all the network infrastructure such as routers, switches and firewalls taken into account, the data center cost is no more than 100 USD/month per server. Therefore, the running cost of a typical 48-core, 512GB-memory, 24TB-storage bare-metal server is only about 400 USD/month for us.

Compare this number with cloud providers’ price lists and you will find that most cloud servers with similar specs cost at least 3–7 times more than our bare metal. For some high-end providers such as AWS, it can be more than ten times as much.
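To make the arithmetic concrete, here is a minimal sketch of the comparison in Python; the hardware purchase price and the cloud list price are hypothetical figures chosen only to be consistent with the amortization above, not actual quotes:

    # Rough bare-metal vs. cloud cost comparison.
    # All figures are illustrative assumptions, not actual contracts or quotes.
    HARDWARE_PRICE_USD = 14400        # hypothetical purchase price of one rack server
    LIFESPAN_MONTHS = 48              # assuming 4 years, within the 3-5 year life span
    COLOCATION_USD_PER_MONTH = 100    # rack space, power, routers, switches, firewalls

    bare_metal_monthly = HARDWARE_PRICE_USD / LIFESPAN_MONTHS + COLOCATION_USD_PER_MONTH
    print(f"Bare metal: ~{bare_metal_monthly:.0f} USD/month")                # ~400 USD/month

    CLOUD_MONTHLY_USD = 2800          # assumed list price of a similarly sized cloud instance
    print(f"Cloud premium: ~{CLOUD_MONTHLY_USD / bare_metal_monthly:.1f}x")  # ~7.0x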

Furthermore, from what we learned from our own OpenStack deployment, this extra cost is not because cloud providers are greedy for profit. Building a reliable and flexible cloud service requires substantial investment in infrastructure such as distributed file systems and SDN. Hence, the high price is embedded in the bones of traditional cloud services; there is no easy fix. (I believe the future belongs to container management systems and serverless computing, but that’s a whole other topic.)

Problem 2: Even with the same specs, a cloud server is less powerful

Higher price is not the only issue with cloud computing. Virtual machine technology also has a performance downside.

First, many cloud providers simply cannot offer high specs such as a 48-core CPU, 512GB of RAM and 24TB of storage. And even when we compared instances with the same specs, we found that cloud servers perform much worse than bare metal.

It makes sense if you consider the architecture: the virtual machine host must reserve resources for scheduling, managing guests, and keeping the SDN and the distributed file system available. The host has many responsibilities, and they all have a cost. This is a tradeoff that current VM systems can’t avoid. Needless to say, competition among multiple VM guests also reduces the computing resources each guest can get.

Moreover, storage I/O and network throughput, especially PPS (packets per second), do not merely underperform. Cloud providers have to set quotas on these key performance factors to avoid instability when competition between VM guests gets intense, and the quota for each guest VM is far lower than what bare metal can deliver.
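If you want to verify this on your own instances, here is a minimal sketch that samples the packet rate a machine is actually sustaining, using the psutil library; the sampling interval is an arbitrary choice, and you would run your own load generator against the server in parallel:

    import time
    import psutil  # third-party package: pip install psutil

    def measure_pps(interval_s: float = 5.0) -> float:
        """Combined inbound + outbound packets per second over the interval."""
        before = psutil.net_io_counters()
        time.sleep(interval_s)
        after = psutil.net_io_counters()
        packets = (after.packets_sent - before.packets_sent) + \
                  (after.packets_recv - before.packets_recv)
        return packets / interval_s

    if __name__ == "__main__":
        # On many cloud instances this number plateaus far below bare-metal levels.
        print(f"Observed PPS: {measure_pps():.0f}")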

Problem 3: All elasticity has its limits

Yes, cloud computing did give us flexibility. But we also found the limits of that flexibility.

Cloud providers also build their systems on bare metal in data centers. When we demand 100–200 high-end cloud servers, even the biggest providers have trouble. On Google Cloud and AWS, there is a quota on how many servers you can buy. Of course you can open a ticket to request more, but let me warn you: we have experienced refusals on such requests.

If you are on good terms with a cloud provider, they might ask you to give them a heads-up on how many servers you will need, preferably 1 to 2 months before you actually need them.

Even then, we still run into situations where the servers we bought simply won’t boot up, because there are not enough resources left on the platform.

Problem 4: High availability still means it will break

Of course, one could argue that the high price buys high availability and redundancy. Multiple regions and zones give your business layers of protection that bare-metal servers cannot match.

We thought so too. But it turns out that isn’t entirely accurate.

I still have a clear memory of the first time that I received a “retirement” email from AWS.

“Retirement” in such an email means that one of your running instances is going to be terminated. The tasks running on that instance will not continue unless you do something, and even if you follow the instructions, the public IP will still change, which causes trouble if you are not prepared for it. We receive 4–5 such retirement emails every year.
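Rather than waiting for the email, you can also poll AWS for scheduled events yourself. Here is a minimal sketch using boto3; the region and the printing logic are just assumptions, so wire it into your own alerting:

    import boto3  # AWS SDK for Python

    def list_scheduled_events(region: str = "cn-north-1") -> None:
        """Print EC2 instances that have scheduled maintenance or retirement events."""
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_instance_status(IncludeAllInstances=True)
        for status in resp["InstanceStatuses"]:
            for event in status.get("Events", []):
                # Event codes include values such as "instance-retirement".
                print(status["InstanceId"], event["Code"], event.get("NotBefore"))

    if __name__ == "__main__":
        list_scheduled_events()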

Some providers, such as Aliyun or UCloud, make this easier for their customers: they can do live migration with only a slight amount of downtime. But when there is a larger-scale change in the underlying infrastructure, the problem is the same: hours of downtime unless you are up for a manual migration.

Another example is the AWS S3 outage earlier this year, which I don’t think I need to elaborate on. Even the unbreakable breaks.

Deployment Strategy

Of course, even with all these problems, cloud services still have advantages for many types of tasks:

  • It suits tasks that require less than a full rack of bare-metal servers.
    Bare metal is powerful, so running only low-load tasks on it would be a waste.
  • It suits fast-growing businesses.
    It’s hard to predict the demand of a fast-growing business, and deploying bare metal takes time, which can hold back that growth.
  • It suits cost-insensitive bosses.
    No need to explain this one if your boss doesn’t care about cost.
  • It suits teams that lack system operators.
    Maintaining bare-metal systems and the database layer requires experienced, senior system operators. Cloud services suit teams that don’t have such people.
  • It suits projects so important that no one dares to take responsibility.
    You can always play the blame game and tell your boss it’s the cloud provider’s fault when anything goes wrong. Personally, I don’t recommend doing your job this way, but I believe such thinking does exist in many companies.

Therefore, even though cloud computing costs more and delivers less, with a good strategy you can still get the most out of it.

At X.D. Network Inc., our online game services are not like a traditional dot-com business. The overall system load of each of our games is quite predictable.

As you can see, if we deploy bare-metal servers according to predicted peak demand, most of the servers will sit idle once the peak has passed. If we don’t have a new game to deploy at that point, it becomes a huge waste of computing resources.

X.D. Network Inc. separates our online games into two categories. The first is projects that we predict will have low system loads, or that are still in development or testing. During such periods, the system load is low and changes rapidly, so cloud services are the first choice for games in this category.

The other category is our top-class games. For this category, we treat cloud services more like an extra resource pool: bare metal handles the system load up to a baseline, and cloud services absorb the load above it during peak times.
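Here is a minimal sketch of that split; the capacity numbers and the instance size are illustrative assumptions, not our production figures:

    import math

    # Illustrative capacity figures only.
    BARE_METAL_BASELINE_CCU = 200_000  # concurrent users the bare-metal fleet handles comfortably
    CLOUD_INSTANCE_CCU = 5_000         # assumed capacity of one cloud game-server instance

    def cloud_instances_needed(predicted_peak_ccu: int) -> int:
        """How many cloud instances to rent for the load above the bare-metal baseline."""
        overflow = max(0, predicted_peak_ccu - BARE_METAL_BASELINE_CCU)
        return math.ceil(overflow / CLOUD_INSTANCE_CCU)

    # Example: a launch-week peak of 260,000 concurrent users
    print(cloud_instances_needed(260_000))  # -> 12 cloud instances on top of bare metal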

With this strategy, bare metal handles the system load during the stable stage of a game at a low upkeep cost, while we take advantage of the elasticity and flexibility of cloud services at the same time. This gets us to a sweet spot with the best of both worlds: cost and flexibility.

That is the infrastructure and cloud deployment strategy of X.D. Network Inc. for now.