Predicting Demand at Hundreds of Stores with Multi-Task Learning and Good Features
Harlan Seymour, Sr. ML and Data Engineer @ Afresh
At Afresh, we are building technology that helps grocers reduce food waste and increase their profitability through better store-level forecasting, ordering, and operations for fresh food. Globally, one-third of the food produced each year goes to waste. In the United States, 40 percent of all food waste occurs at the retail level, with the highest occurrence in fresh food departments. Our goal is to apply technology to optimize the supply chain and make fresh food more accessible.
Fresh food is complex, and the myriad variables that can impact demand require equally complex modeling. We are always looking for new features to feed into our machine learning models to improve the performance and overall robustness of our system. Every day, we recommend ordering decisions for tens of thousands of store-items across hundreds of grocery stores. Part of this system is a large multi-task demand forecasting model covering all the store-items in a department (think produce, meat, etc.), built on many features, including these categorical ones:
- item_id: which item?
- store_id: which store?
- store_item_id: which store-item? A cross-product of store_id x item_id.
Sales patterns vary by item, by store, and indeed by each store-item, so these are all key features.
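To make this concrete, here is a minimal pandas sketch of how the store-item cross could be built as a categorical feature; the table, column names, and values are illustrative, not our actual schema.

```python
import pandas as pd

# Toy sales rows; the column names and values are illustrative only.
sales = pd.DataFrame({
    "store_id": [12, 12, 27, 27],
    "item_id":  [4011, 4046, 4011, 4046],
    "qty_sold": [130, 85, 210, 60],
})

# store_item_id is the cross-product of store_id x item_id,
# treated as a single categorical feature alongside the other two.
sales["store_item_id"] = (
    sales["store_id"].astype(str) + "_" + sales["item_id"].astype(str)
).astype("category")
sales["store_id"] = sales["store_id"].astype("category")
sales["item_id"] = sales["item_id"].astype("category")

print(sales)
```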
An important property we want from our demand forecasting model is that it generalizes to new stores (and new items) that were not seen at training time. But when a new store opens, its store_id and store_item_id values were never present in the training set, so those features give the model nothing to work with. To address this, we can describe the store and department with new features like the department's size in square feet, or geographic information like city, longitude, and latitude.
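As a rough illustration, a store could be described by a small attributes table joined onto the training rows in place of the raw identifier. This is only a sketch; the store attributes and values below are hypothetical.

```python
import pandas as pd

# Hypothetical store attributes; the values are made up for illustration.
stores = pd.DataFrame({
    "store_id":  [12, 27],
    "dept_sqft": [1950, 2386],
    "city":      ["Austin", "Reno"],
    "lat":       [30.27, 39.53],
    "lon":       [-97.74, -119.81],
})

# A few toy training rows keyed by store_id.
sales = pd.DataFrame({
    "store_id": [12, 27, 27],
    "item_id":  [4011, 4011, 4046],
    "qty_sold": [130, 210, 60],
})

# Swap the raw identifier for descriptive features that also exist
# for a brand-new store at inference time.
features = (
    sales.merge(stores, on="store_id", how="left")
         .drop(columns=["store_id"])
)
print(features)
```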
One problem with these new features is that they tend to be constant over time, so the machine learning model may treat them as semi-categorical features, effectively learning that 2,386 sqft must mean store_id = 27, which carries the same disadvantages as using the identifiers directly. The upshot is that forecasting for a new store, not found in the training set, with a previously unseen sqft value may still degrade forecast quality for that store's items.
Another idea is to use a rolling 28-day average of department sales (dept_sales) as a proxy for store_id. The rolling average has the advantage of changing (naturally jittering) day by day, with stores' department sales criss-crossing each other over time, while still carrying a lot of information about the sales character of the department.
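A minimal sketch of that feature with pandas, on made-up daily department sales for two stores, might look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=120, freq="D")

# Made-up daily department sales for two stores.
daily = pd.DataFrame({
    "store_id": np.repeat([12, 27], len(dates)),
    "date": np.tile(dates, 2),
    "dept_sales": np.concatenate([
        rng.normal(5000, 400, len(dates)),  # store 12
        rng.normal(8000, 600, len(dates)),  # store 27
    ]),
}).sort_values(["store_id", "date"])

# 28-day rolling average of department sales per store: a smoothly varying
# proxy for the store identifier that still separates the stores.
daily["dept_sales_28d"] = (
    daily.groupby("store_id")["dept_sales"]
         .transform(lambda s: s.rolling(window=28, min_periods=1).mean())
)

print(daily.groupby("store_id")["dept_sales_28d"].tail(3))
```

In a real pipeline, this window would likely also be lagged so the feature only reflects sales already known at forecast time.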
Here is a chart of sales at 6 stores over a nearly two-year period. Note how an individual store department's rolling average sales varies over time, and how different store departments' sales criss-cross each other.
Our demand forecasting model is an ensemble of XGBoost gradient-boosted tree (GBT) models and deep neural network (DNN) models. Within the XGBoost trees, here are the 4 most-branched-upon features (of many!), with branch counts standing in as a heuristic for feature importance (a quick way to pull these counts out of XGBoost is sketched after the list):
- dom (day-of-month)
- woy (week-of-year)
- store_id
- store_item_id
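For reference, here is a small sketch of how such branch counts can be read out of a trained XGBoost model; the "weight" importance type counts how many times each feature is used as a split. The booster/model names are placeholders.

```python
import xgboost as xgb

def top_split_features(booster: xgb.Booster, k: int = 4):
    """Return the k features most branched upon across all trees.

    The "weight" importance type counts how many times a feature is used
    as a split, which is the heuristic described above.
    """
    scores = booster.get_score(importance_type="weight")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Usage with a fitted model (placeholder name):
#   top_split_features(model.get_booster(), k=4)
```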
Clearly store_id and store_item_id are very important features. Replacing them with dept_sales (department sales) in a newly trained model yields these 3 most-branched-upon features:
- dom (day-of-month)
- woy (week-of-year)
- dept_sales
So dept_sales looks like a good stand-in for the store identifiers, at least in terms of how heavily the model leans on it. But how about overall model performance? We measure this with relative L1 error. We also want the ratio of forecast sales to actual test-set sales to be close to 1.0 for each store-item (a perfect forecast), and we don't want the standard deviation of those store-item ratios to be large (every ratio at 1.0, giving a stdev of 0.0, is best!).
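As a sketch, assuming relative L1 error means total absolute error divided by total actual sales, and using per-store-item forecast/actual ratios, the two metrics could be computed like this (the numbers below are made up):

```python
import pandas as pd

# Made-up per-store-item totals over a test period.
results = pd.DataFrame({
    "store_item_id": ["12_4011", "12_4046", "27_4011", "27_4046"],
    "forecast": [118.0, 90.0, 205.0, 66.0],
    "actual":   [130.0, 85.0, 210.0, 60.0],
})

# Relative L1 error: total absolute error relative to total actual sales.
rel_l1 = (results["forecast"] - results["actual"]).abs().sum() / results["actual"].sum()

# Per-store-item forecast/actual ratios: a mean near 1.0 with a small
# standard deviation indicates a well-calibrated forecast.
ratios = results["forecast"] / results["actual"]
print(f"relative L1: {rel_l1:.3f}, ratio mean: {ratios.mean():.3f}, ratio stdev: {ratios.std():.3f}")
```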
Per the table and figure, replacing the store identifiers with dept_sales actually improved model performance slightly, though the improvement is within the margin of error, so we can't declare an actual, significant decrease in relative L1 error. But it clearly improved robustness by allowing the model to forecast for new stores that were not in the training set. When the model with dept_sales was tested on items at a new store, it performed well, though not quite as well as on the existing store-items.
In summary, the dept_sales feature is an important addition to our arsenal for forecasting demand, which in turn leads to better ordering at the store level, and significant savings when it comes to food waste.
We are growing our team rapidly and are looking for passionate, talented engineers to help us solve more problems like this. If you are interested in joining, take a look at our current openings.