Using pandas MultiIndex.from_product to Fill In Missing Data

Eric Ness
When I Work Data
Published in
3 min readAug 29, 2018

While analyzing data there are occasionally cases when you need to get all possible combinations of two dimensions. One such case is when you have event data as one data set and time data as the other. This could occur in financial transactions, web events and many other domains. There are likely some time periods which don’t include any events. If you want to analyze the data for every time period somehow those empty time periods need to be filled in.

One convenient way to do this in pandas is to use the MultiIndex.from_product function. This function will create the set of all possible combinations between the elements of two or more iterables. Here is an example of using this function on two short lists.

[(1, 'a') (1, 'b') (1, 'c') (2, 'a') (2, 'b') (2, 'c') (3, 'a')
(3, 'b') (3, 'c')]

The output shows that an index entry was created for every possible combination of elements from the two lists.

Let’s look at an example now that shows a more practical use. Assume that we are running a business that gets regular orders from repeat customers. Our raw order data would look something like this:

   customer order_date  amount
0 1 2018-01-03 25
1 2 2018-01-05 42
2 3 2018-01-07 116
3 2 2018-01-08 21
4 3 2018-01-09 83
5 1 2018-01-10 4
6 1 2018-01-10 67

One useful report would be a running daily total of orders for the month for each customer. Although the cumsum function can easily give us a running total, there are several obstacles that need to be surmounted first:

  • Some date-customer combinations appear more than once in the data. These will have to be aggregated before a daily running total can be calculated.
  • There are many date-customer combinations that are missing. These will need to be filled in.

The first problem is straight-forward to solve. The amounts just need to be summed by date and customer.

customer order_date   amount     
1 2018-01-03 25
2018-01-10 71
2 2018-01-05 42
2018-01-08 21
3 2018-01-07 116
2018-01-09 83

The second problem will require us to build a MultiIndex. This will be done by creating a list of unique customers and cross joining it using from_product to the date range between 2018–01–01 and 2018–01–10.

[(1, Timestamp('2018-01-01 00:00:00', freq='D'))
(1, Timestamp('2018-01-02 00:00:00', freq='D'))
(1, Timestamp('2018-01-03 00:00:00', freq='D'))
(1, Timestamp('2018-01-04 00:00:00', freq='D'))
(1, Timestamp('2018-01-05 00:00:00', freq='D'))]

This code creates a MultiIndex with combinations of each customer and each date in the date range. Even customer and date combinations where there isn’t an order will be present. The next step will be to reindex the existing order data with the new index so that all of the date gaps will be filled in. The resulting data will be grouped by customer and a running total calculated using cumsum.

amount  running_amount
customer order_date
1 2018-01-01 0 0
2018-01-02 0 0
2018-01-03 25 25
2018-01-04 0 25
2018-01-05 0 25
2018-01-06 0 25
2018-01-07 0 25
2018-01-08 0 25
2018-01-09 0 25
2018-01-10 71 96
2 2018-01-01 0 0
2018-01-02 0 0
2018-01-03 0 0
2018-01-04 0 0
2018-01-05 42 42
2018-01-06 0 42
2018-01-07 0 42
2018-01-08 21 63
2018-01-09 0 63
2018-01-10 0 63
3 2018-01-01 0 0
2018-01-02 0 0
2018-01-03 0 0
2018-01-04 0 0
2018-01-05 0 0
2018-01-06 0 0
2018-01-07 116 116
2018-01-08 0 116
2018-01-09 83 199
2018-01-10 0 199

The final result shows that each customer has a running daily total along the entire date range.

This is just one example of how the MultiIndex.from_product function can help you turn sparse data into a fully populated dataset. Adapt the entire script for this post on Github to your own needs and turn a complicated analysis into child’s play.

--

--

Eric Ness
When I Work Data

Principal Machine Learning Engineer at C.H.Robinson, a Fortune 250 supply chain company.