Using pandas MultiIndex.from_product to Fill In Missing Data
While analyzing data there are occasionally cases when you need to get all possible combinations of two dimensions. One such case is when you have event data as one data set and time data as the other. This could occur in financial transactions, web events and many other domains. There are likely some time periods which don’t include any events. If you want to analyze the data for every time period somehow those empty time periods need to be filled in.
One convenient way to do this in pandas
is to use the MultiIndex.from_product
function. This function will create the set of all possible combinations between the elements of two or more iterables. Here is an example of using this function on two short lists.
[(1, 'a') (1, 'b') (1, 'c') (2, 'a') (2, 'b') (2, 'c') (3, 'a')
(3, 'b') (3, 'c')]
The output shows that an index entry was created for every possible combination of elements from the two lists.
Let’s look at an example now that shows a more practical use. Assume that we are running a business that gets regular orders from repeat customers. Our raw order data would look something like this:
customer order_date amount
0 1 2018-01-03 25
1 2 2018-01-05 42
2 3 2018-01-07 116
3 2 2018-01-08 21
4 3 2018-01-09 83
5 1 2018-01-10 4
6 1 2018-01-10 67
One useful report would be a running daily total of orders for the month for each customer. Although the cumsum
function can easily give us a running total, there are several obstacles that need to be surmounted first:
- Some date-customer combinations appear more than once in the data. These will have to be aggregated before a daily running total can be calculated.
- There are many date-customer combinations that are missing. These will need to be filled in.
The first problem is straight-forward to solve. The amounts just need to be summed by date and customer.
customer order_date amount
1 2018-01-03 25
2018-01-10 71
2 2018-01-05 42
2018-01-08 21
3 2018-01-07 116
2018-01-09 83
The second problem will require us to build a MultiIndex
. This will be done by creating a list of unique customers and cross joining it using from_product
to the date range between 2018–01–01 and 2018–01–10.
[(1, Timestamp('2018-01-01 00:00:00', freq='D'))
(1, Timestamp('2018-01-02 00:00:00', freq='D'))
(1, Timestamp('2018-01-03 00:00:00', freq='D'))
(1, Timestamp('2018-01-04 00:00:00', freq='D'))
(1, Timestamp('2018-01-05 00:00:00', freq='D'))]
This code creates a MultiIndex
with combinations of each customer and each date in the date range. Even customer and date combinations where there isn’t an order will be present. The next step will be to reindex the existing order data with the new index so that all of the date gaps will be filled in. The resulting data will be grouped by customer and a running total calculated using cumsum
.
amount running_amount
customer order_date
1 2018-01-01 0 0
2018-01-02 0 0
2018-01-03 25 25
2018-01-04 0 25
2018-01-05 0 25
2018-01-06 0 25
2018-01-07 0 25
2018-01-08 0 25
2018-01-09 0 25
2018-01-10 71 96
2 2018-01-01 0 0
2018-01-02 0 0
2018-01-03 0 0
2018-01-04 0 0
2018-01-05 42 42
2018-01-06 0 42
2018-01-07 0 42
2018-01-08 21 63
2018-01-09 0 63
2018-01-10 0 63
3 2018-01-01 0 0
2018-01-02 0 0
2018-01-03 0 0
2018-01-04 0 0
2018-01-05 0 0
2018-01-06 0 0
2018-01-07 116 116
2018-01-08 0 116
2018-01-09 83 199
2018-01-10 0 199
The final result shows that each customer has a running daily total along the entire date range.
This is just one example of how the MultiIndex.from_product
function can help you turn sparse data into a fully populated dataset. Adapt the entire script for this post on Github to your own needs and turn a complicated analysis into child’s play.