Finding optimal data centers (Part-1)

Rahul Mishra
Jan 9, 2020 · 7 min read
Store attached Data Center in Walmart Intelligent Retail Lab (IRL), Levittown, New York. Source:https://corporate.walmart.com/photos/walmart-irl-store

Introduction

It’s a cliché that data is the new oil. And, with so much data being generated, effective management of data has become critical for companies in order to leverage it to advance their business. We require reserves for oil to be stored. Similar to oil reserves and rigs, there exists data centers with appropriate infrastructure where massive amounts of data can be stored and manipulated.

Across Walmart’s thousands of stores in the United states, humongous amounts of data are being generated every day, owing to digitization of several processes. It is very important on how we store the data to facilitate fast, efficient functioning of our stores as well as giving cover for downtimes and disaster recovery cases.

Interestingly, this problem arised due to an engineering requirement where the architects were looking into sharding strategies. High availability and high performance were major points of concern. High performance is a criteria because as the size of a database and number of transactions increase linearly, the response time for querying the database also increases. High availability is a criteria as we want to leverage our presence in more than one region as a part of disaster recovery strategy.

There are 2 parts to this problem… first, we need to decide how many data centers are needed and how many shards are needed within. Second, we need to figure out which store should connect to which data center. In this blog series we look into only data center based aspect and not how many shards per data center.

Heatmap of number of stores in each state

As illustrated above, there is uneven spread of stores. States like Texas has a high number of stores while Alaska has very few stores. This uneven spread of stores across a huge geography presents us with interesting problems to solve when we aim to support the operations of these stores using a handful of data centers spread across the country.

Contents of this blog are as follows:

  1. Introduction (above)
  2. Problem understanding
  3. Mapping business requirement to standard problem in literature
  4. Solution approaches
  5. Variants of the problem
  6. Conclusion

Problem understanding:

If we say that the data centers are managed and made available by a cloud platform partner (i.e.; all details about the data center like location, latencies, etc. are provided by that partner), then the problem can be summarized as: Given m number of data centers and n number of stores, we have to find

i. k optimal number of data centers amongst m

ii. allocation of store to data center

The keyword here is optimal. How do you define optimal? It is the most important part of this task and for our case, it is explained in the solution section later in this blog.

Mapping problem to the standard problem:

The problem at hand is very close to Generalized Assignment Problem (GAP) in literature. It is a classical combinatorial optimization problem and it is NP-hard. Let us put it mathematically here,

Let us now draw parallels of our specific problem to GAP. p_ij in the objective function is the profit gained by assigning store_j to data-center_i. As mentioned above, how to find p_ij is the most crucial part and is discussed in solution approach. Each store can be assigned to only one data center and is the second constraint mentioned in above equation. Each data center/shard has its own limits and therefore, w_ij defines how much resources of data-center_i does store_j burn.

Solution Approach:

The solution to this assignment problem is a combinatorial explosion. Given, m data centers and n stores the number of possible combinations are m^n. And therefore, solve it at your own peril if you are not using any heuristics.

Some of the important things that needs to be taken into account when solving this problem is:

• Allocating too many stores to a data center will compromise its performance

• Allocating too few will lead to under-utilization of data center

• Increase in number of transactions per unit time, data size will exponentially increase, hence the response time for querying the data

• Cost incurred to shift a store allocated to one data center to another data center is prohibitively high.

Therefore, it is critical to allocate the most optimal data center to a store with a strong underlying logic supplemented with data-driven approach or else it is going to cost the organisation a fortune.

Some of the inputs that is required to solve this are:

  1. The location of the stores (Latitude, Longitude)
  2. The location of the Data Centers (Latitude, Longitude)
  3. Database capacity of the data center
  4. Network bandwidth connecting store to data center and peak bandwidth
  5. Cost of obtaining a data center
  6. Amount of incoming and outgoing data for a store

For worse-case analysis, we have extrapolated the data generated and also decreased database capacity.

As mentioned earlier, objective function is the most critical part. We have treated this as a multi-objective optimization problem. Various objectives that need to be optimized for are:

  1. Maximize Balance of the data centers using variance/std. deviation using capacity utilization and bandwidth utilization
  2. Minimize Network latency with distance taken as proxy (Here, it means greater the distance more the network latency
  3. Minimize cost (Infrastructure and Annual)

The solution to the classic multi-objective optimization problem can be obtained using different techniques, mostly using linear/non-linear programming or using bio-inspired multi-objective optimization approaches like genetic programming and particle swarm optimization. On one hand, weighted objectives in linear/non-linear programming setup will give single solution whereas on the other hand, proper multiple-objective approaches will give a pareto front. We can then choose an optimal solution on that pareto front. Infrastructure cost plays a lesser role during optimization but is one of the most important things where business has to make decision on number of data centers.

Also, the initial solution is one of the most critical things for the optimization problems. It reduces computation by many folds and therefore, is the heart of this solution. Rest all is simply running the solver with some parameter tuning. For example, in genetic programming some of the parameters that needs to be tuned are mutation rates, mutation direction, crossover probability, penalty, tolerance, pareto fraction, etc. More on this will be discussed in a Part-2. In bio-inspired optimization, it has been proved again and again that initial solution is the most critical thing. Here, we used a two-pass algorithm based on business heuristics to provides initial solution which is near the optimal and can be fed as the starting point. Finally, some metrics are provided to decision makers for evaluating the solutions and comparing them, they are…

  1. Net load per shard/data-center
  2. Net distance per shard/data-center
  3. Net number of stores per shard/data-center
  4. Balance parameter
  5. Capacity utilization
  6. Infra Cost and Annual cost
  7. and more

Methodology for initial solution

Let us consider an example of 3 data centers. In the first pass, the stores were allocated to the closest data center. Load distribution was uneven ( East:South:West :: 4.3:2.1:1 ) and heavily skewed towards East US data center because of higher number of stores and more store density.

Hence, in the second pass, the second closest data centers for each of the stores is determined and the stores for which the distance between the two closest data centers is in the tolerance limits are redistributed to the optimal data center to improve the load balance ( East:South:West :: 2.4:2.4:1 ).

As an example, comparision of allocation after 2 passes is as in figure below:

Comparison of store to data center allocation at different stages

It is seen here that, many of the green colored stores have turned red indicating they are now allocated to South-Central data centre, and some of the red ones have turned black.

For the above solution instance,

  • Load Distribution Ratio (East US: South US: West US) — 2.4:2.4:1
  • Average Distance between Store & Data center — 519 Miles
  • Minimum Days of Retention — 454
  • Maximum Capacity Utilization — 8%

A preview of how these allocations change depending on number of data centers can be seen in the below gif image :

Change in allocations with change in number of data centers and algorithm pass

Variants of the problem:

Finding optimal data center locations across the US instead of having pre-defined locations as above, can have large search space and needs to be approached in a totally different way that comprises affinity for natural disasters, security, compliance, etc. as well. This would help us to decide where to build data centers and acceptable fail-over mechanism for the stores. Although, an interesting problem, it has bigger computation explosion. I believe some of the CDN providing companies are already doing this.

Another interesting variant is one store being connected to more than one data center with a max cap to achieve higher availability. The problem formulation remains same except for modification of one constraint.

Conclusion:

In this Part-1, we looked into problem context and formulation, magnitude of problem and solution approach. Here, we also defined metrics that are vital for decision making and reached an initial solution.

In Part-2 of the blog, we would leverage initial solution and defined metrics to look into solutions obtained using multi-objective optimization. Stay tuned!

Acknowledgements:

This entire work is carried out along with Praneeth Doguparthy.

Walmart Global Tech Blog

We’re powering the next great retail disruption.