Structure & Attribute Based Graph Partitioning

As applied in Walmart Assortment Analytics

Walmart provides a huge assortment of products to its customers. For many analyses, it is useful to group products within similar substitutable groups rather than considering them independently. One such outcome of the analysis is a tree like structure, called the Customer Behavioral Tree (referred hereafter as CBT) with the items as the leaf-nodes. ‘Behavioural’ here implies that grouping is not based on explicit item characteristics but implied item similarity based on customers’ displayed preferences. The tree represents an structure where child nodes of the same parent must be more substitutable than the child nodes of different parents (represented best by a dendrogram). This CBT acts as a useful grouping for all further assortment and pricing decisions.

While grouping millions of items itself is challenging, complexity multiplies as Walmart has 4500 odd stores in the US, with more than 90 departments, aggregating to around 7200 odd categories, serving to millions of customers. Apart from the humongous scale of Walmart, some categories, notably General Merchandise categories, don’t have sufficient repeat purchase information for a given household, leading to restriction in the usability of market basket analysis and association rules. After all, people don’t buy items like Hammers or Infant Furniture as frequently as they buy Yogurt or Canned Beans.

Assortment Analytics Team in Walmart Labs in Bangalore tackled this problem by modeling Point of Sales (PoS, transaction) data as an appropriate graph and using graph partitioning approaches to cluster items. For a given category, a structural graph is created using the transaction information for all items, across all stores, for past few years. Each item is considered as a vertex, and graph has an edge between a pair of items if and only if, that pair has been purchased at least once, by at least one household in the considered period. Each edge is given an appropriate weight (computed only from the Point of Sales data) which might be considered as a measure of substitutability.

This post will go into process of partitioning graph for clustering. But, before that, an high level overview of graph partitioning approach is presented.

Understanding Graphs

Graph is mathematical structure — represented as set of ‘vertices’ (nodes) connected by ‘edges’ (lines) — which represents relationship between multiple objects (nodes). A graph could be directed or undirected, or weighted or unweighted.

The graph partition problem aims to partition a given graph into smaller components, with specific properties (usually similar sub-graphs). In other words, the goal of graph partitioning is to partition vertices in a large graph into clusters based on various criteria such as vertex connectivity or neighborhood similarity.

The basic principle of clustering:
Objects in the same group (or cluster) are more similar (in some sense or other) to each other, than to those in other groups (or clusters) “

Similarly, a good partition of a graph is defined as one, in which the number of edges running between separated components is small. One popular algorithm for partitioning is Google’s PageRank algorithm.

Typically, Graph Partition problems fall under the category of NP-hard Problems. Solutions of these problems are generally derived using heuristics and approximation algorithms.

Graph Partitioning

The main motif of a structural graph partitioning is to partition a graph G = (V, E) into k sub-graphs such that each sub-graph is as densely connected as possible and the aggregate weight of edges cutting through partitions is minimal.

Since, this is a NP-hard problem, practical solutions are based on heuristics. There are two broad categories of solutions, local and global.

Local solutions akin to the Greedy Algorithm, though faster, are heavily dependent on the starting point. Global solutions (mainly based on shortest paths) rely on the properties of the entire graph and hence tend to fall slow for large and/or dense graphs.

The Greedy Algorithm starts with a random vertex and greedily incorporates into its path the edge with the largest edge-weight at every step, thus traversing through the graph; whereas the global solutions always look into the shortest path between any two vertices, and look to cluster the graph using that information.

All the above-mentioned methods consider only the structure of the graph, hence assuming uniformity in the characteristics of the vertices. The characteristics of the vertex would depend on the explicit nature of the data, on which a graph is created. In the Walmart use-case, since the vertices were items, their characteristics would adhere strongly to the item characteristics, like attributes (brand, color, size, material, durability, etc.), price, availability, loyalty etc. Attribute based Graph Partitioning targets to partition a graph G = (V, E) into sub-graphs such that all vertices in a given sub-graph have similar attributes and are different from vertices of another sub-graph. For our purposes, we would ideally want to have sub-groups of items which have very similar item attributes, such that they might be treated as substitutes.

Planned Approach

Why not we pool the best of both algorithms, to try and partition a graph based on both its structure and vertex attributes? In this way, we would obtain sub-graphs out of the parent graph, each of which would consist of similarly attributed and densely connected vertices.

High Level Algorithm is explained below :

The Algorithm

Step 1: Define the parent graph G = (V, E, W)

Consider a weighted graph G = (V, E, W),

where

V: vertex set

E: edge set

W: matrix of edge-weights (boils down to a multiple of

for un-weighted or uniformly weighed graphs).

For example, consider the following graph, with 6 vertices, 8 edges and edge-weights Wij:

Figure 1 : a graph with 6 vertices, 8 edges and edge weights Wij

Step 2: Define the attributed graph G_a = (V, E, W, V_a, W_a)

On the graph G defined above, we have each vertex in V having information regarding n attributes, where the ith attribute has k_i level, i in 1, 2, …, n.

Also known is the relative importance of each attribute (attribute weight), defined as a_i, i in 1, 2, …, n.

Here, V_a is the set of vertices, labelled with their corresponding attribute values and W_a is the set of attribute weights.

For example, consider the attribute information of the above mentioned graph with n = 2 attributes, namely Attribute A (with 2 levels and attribute weight a) and Attribute B (with 3 levels and attribute weight b).

Attribute information as per above mentioned example

To redraw the above graph along with the attribute information we have the following, where each vertex is labelled with their attribute values:

Step 3: Overlay the parent graph G with the attributed graph G_a to obtain G*_a = (V*_a, E*_a, W*_a)

Now, we would have to graphically imbibe the attribute information from

into G. In order to do the same, we introduce k_1 + k_2 + … + k_n new vertices (henceforth referred to as attribute vertices), each corresponding to different levels of all the attributes.

So, V*_a = V U {attribute vertices}

From, each attribute vertex, define an edge to each original vertex of G (henceforth referred to as structural vertex), if the concerned vertex pertains to the respective attribute (such an edge would henceforth be referred to as an attribute-edge). Thus, each structural vertex would have exactly n attribute edge.

So, E*_a = E U {attribute edges}

Edge weight of each attribute edge thus defined would be as per the attribute weights defined in G_a.

The diagrammatic representation of G*_a thus formed using the above examples would be:

Step 4: Partitioning G*_a into sub-graphs

Consider a random walk on G*_a, wherein the Transition Probability Matrix is defined as P =((p_ij)),

Define the Unified Neighborhood Matrix R = ((r_ij)), where r_ij is the probability of going from vertex i to vertex j.

Define,

c = random walk restart probability (user defined)

L = maximum possible length of a random walk (user defined, bigger the better)

Then,

In a way, R is antonymous of a distance matrix. In order to create a distance matrix out of R, one could create a distance matrix R* as the following:

Now, one may use common methods of clustering like k-medoids, k-means or any other hierarchical clustering techniques on the structural vertices, to obtain the partitioned sub-graphs.

Implementing the proposed algorithm on an example

Above algorithm is demonstrated on the graph mentioned in Figure 1. We have randomly generated the structural edge-weights from a U(10, 50) distribution and the attribute weights from a U(80, 100) distribution.

In the random sample obtained,

w_12 = 26.60983

w_13 = 27.13616

w_14 = 13.86644

w_23 = 15.31082

w_25 = 37.38426

w_46 = 49.21425

w_45 = 37.61334

w_56 = 47.53184

and

a = 81.42791

b= 87.21310

The Transition Probability Matrix (P) as obtained:

The Uniform Neighborhood Matrix (R) with c = 0.01 and L = 200 is:

Upon applying k-means and k-medoids on the structural-vertex sub-matrix of R*= ((1/r_ij)), we obtain the optimum no. of cluster as 2, as depicted below.

For the problem of item grouping in Walmart, we further infused the flavor of item attribute into the structural graph defined earlier, by labelling each vertex of the graph with their corresponding item attributes, thus returning an attribute incorporated structural graph on the Walmart transaction data. Thereafter the aforementioned graph partitioning algorithm was run on it, to obtain the desired CBT.

Such item groupings have significant role in assessing item’s loyalty within substitutable group, making item deletion decisions, understanding customer demand transfers among items when existing item is out of stock or new item is added, or understanding price elasticities of related items.