A data science approach to preventing food waste in commercial kitchens.

food saver: Preventing food waste in commercial kitchens!

Sanchika Gupta
9 min read · Jul 1, 2019

Food waste, as they say, is the world's dumbest environmental problem, and yet we still haven't solved it. Every year, 133 billion pounds of food is wasted in the US alone. That food could feed the 48 million hungry people in the country.

I recently joined a post-doc program run by Insight Data Science and came across a company trying to solve this problem. That got me excited, and I decided to take it on as my project.

The DATA

The crux of the project is to help organizations predict, and therefore reduce, their food waste. I have roughly one year of data for more than 300 organizations: restaurants, kitchens, company cafeterias, and so on. The data tracks food waste that can be grouped (and hence donated) into categories such as beverages, baby food, bread and tortillas, pizza, and savory snacks. There are 40 categories of wasted food in total.

Now let's jump into the project details. I was given the data in xlsx format, and the first thing I did was convert it into a pandas DataFrame.

import pandas as pd

df = pd.read_excel('Historical Customer Waste Tracking Data.xlsx')

From this DataFrame, I identified 10 organizations with at least 5 months of data (any model needs sufficient data to be fit properly). A rough sketch of that filtering step is shown below.
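This is a minimal sketch rather than the original code; it assumes the company and pickupat_date columns used later in the post and counts how many distinct months of data each organization has.

# Sketch: keep only organizations with at least 5 months of recorded data.
df['pickupat_date'] = pd.to_datetime(df['pickupat_date'])
months_per_company = (
    df.groupby('company')['pickupat_date']
      .apply(lambda s: s.dt.to_period('M').nunique())
)
eligible = months_per_company[months_per_company >= 5].index
df = df[df['company'].isin(eligible)]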

Questions to answer

Once I had the data, the next step was to define the questions I needed to answer with it. Organizations can reduce their food waste effectively if they can predict it along the following three dimensions:

1. by category (descriptor)

2. by weekday

3. by weight (amount of food waste)

Data Pre-processing

This step involves handling missing values, identifying useful features, and segregating the data by company. The reasons to segregate the data are as follows:

1. First, I have to predict waste for each company individually, so there is no point in predicting waste for all the companies together.

2. Second, if I were to apply a time series model to all the companies together, the problem is that the companies do not all waste the same categories of food, so pooling them makes little sense.

df = df[['company', 'pickupat_date', 'year', 'month', 'weekday', 'weeknum', 'valid_pounds', 'descriptor']].copy()

So I decided to segregate the companies:

def segregate_company(df, company_name, start, end):
    """Return the rows for a single company within the given date range."""
    df.pickupat_date = pd.to_datetime(df.pickupat_date)
    data1 = df[(df['company'] == company_name) &
               (df['pickupat_date'] > start) &
               (df['pickupat_date'] <= end)]
    return pd.DataFrame(data=data1)

I then took care of null values and missing data.

# finding out rows for which address is null
null_columns = df.columns[df.isnull().any()]
print(df[df["address"].isnull()][null_columns])
print(df[df.isnull().any(axis=1)][null_columns].head())

# finding out rows for which company is null
print(df[df["company"].isnull()][null_columns])

# drop all the rows with null address values
df.dropna(subset=['address'], inplace=True)

# fill in missing nan values for descriptor and company
df['descriptor'] = df['descriptor'].fillna("Other")
df['company'] = df['company'].fillna("undefined")

Visualizing Food Waste Time Series Data

# Let's see a historical view of the valid_pounds for a single company
c['company_name']['valid_pounds'].plot(legend=True, figsize=(20, 4))

Now that we’ve seen the visualizations for the waste produced by a company each day, let’s go ahead and calculate the moving average.

# moving average for 7, 15 and 30 days
ma_day = [7, 15, 30]
for ma in ma_day:
    column_name = "MA for %s days" % (str(ma))
    temp2[column_name] = temp2['valid_pounds'].rolling(ma).mean()

temp2[['valid_pounds', 'MA for 7 days', 'MA for 15 days', 'MA for 30 days']].plot(subplots=False, figsize=(20, 4))

# We'll use pct_change to find the percent change for each day
c['temp']['Daily change'] = c['temp']['valid_pounds'].pct_change()
c['temp']['Daily change'].plot(figsize=(12, 4), legend=True, linestyle='--', marker='o')

Indexing with Time Series Data

Here I convert the data into a time-series format, using the date, the category, and the amount of waste produced in each category. To work with time-series data in pandas, I use a DatetimeIndex as the index of my DataFrame, as sketched below.
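As a small sketch of that indexing step (the company name and date range are placeholders, not values from the dataset), the pickup date becomes a DatetimeIndex so pandas treats the frame as a daily time series:

# Sketch: index one company's data by pickup date.
temp2 = segregate_company(df, 'company_name', '2018-07-01', '2019-06-30')
temp2 = temp2.set_index(pd.DatetimeIndex(temp2['pickupat_date'])).sort_index()
print(temp2.index)  # DatetimeIndex of daily pickup dates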

# This function pivots an individual customer's data, with the date as the
# index and the categories as the columns, on a daily basis.
def cat_pivot_table(raw, pounds, start, end):
    """Process the raw data into a table with each category as a column."""
    df = raw[(raw['pickupat_date'] >= start) &
             (raw['pickupat_date'] <= end)][['pickupat_date', 'descriptor', pounds]]
    dr = pd.date_range(start=start, end=end)
    df = pd.DataFrame(index=dr).join(
        df.pivot_table(index='pickupat_date', columns='descriptor',
                       values=pounds, aggfunc=sum),
        how='outer').fillna(0)
    return df

ARIMA time series model

I am applying one of the most commonly used methods for time-series forecasting, known as ARIMA, which stands for Autoregressive Integrated Moving Average.

ARIMA models are denoted with the notation ARIMA(p, d, q). The three parameters capture the autoregressive, trend (differencing), and noise (moving-average) structure of the data:

p is the parameter associated with the autoregressive part of the model, which incorporates past values. For example, if the temperature was high over the past few days, it is likely that the temperature will be high tomorrow as well.

d is the parameter associated with the integrated part of the model, i.e. the amount of differencing applied to the series to make it stationary. You can think of it as forecasting that tomorrow's weather will be similar to today's if the daily temperature, rainfall, and so on have been similar over the past few days. (A short stationarity-check sketch illustrating how d is chosen follows this list.)

q is the parameter associated with the moving-average part of the model, which incorporates past forecast errors.
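This is an illustrative sketch rather than part of the original analysis: it runs the Augmented Dickey-Fuller (ADF) test on the same df_1.Grains series used in the plots below. A p-value under 0.05 suggests the series is already stationary and d can stay at 0; otherwise, difference once and test again.

from statsmodels.tsa.stattools import adfuller

# ADF test on the original series
result = adfuller(df_1.Grains.dropna())
print('ADF statistic: %.3f, p-value: %.3f' % (result[0], result[1]))

# If the series is not stationary, difference it once and test again (d = 1)
diffed = df_1.Grains.diff().dropna()
result = adfuller(diffed)
print('After one difference: ADF statistic: %.3f, p-value: %.3f' % (result[0], result[1]))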

import matplotlib.pyplot as plt
import statsmodels.api as sm

# ACF plot of the stationary series (PACF line left commented out)
plt.rcParams.update({'figure.figsize': (9, 3), 'figure.dpi': 120})
fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df_1.Grains); axes[0].set_title('Stationary')
axes[1].set(ylim=(0, 5))
# plot_pacf(df_1.Grains.dropna(), ax=axes[1])
sm.graphics.tsa.plot_acf(df_1.Grains.squeeze(), lags=40, ax=axes[1])
plt.show()

This step is the parameter selection for my food-waste ARIMA time series model. Using auto_arima, I search for the best parameters and then fit the model with them.

from statsmodels.tsa.arima_model import ARIMA
import pmdarima as pm

model = pm.auto_arima(df_1.Grains, start_p=1, start_q=1,
                      test='adf',        # use the ADF test to find the optimal 'd'
                      max_p=3, max_q=3,  # maximum p and q
                      m=1,               # frequency of the series
                      d=None,            # let the model determine 'd'
                      seasonal=False,    # no seasonality
                      start_P=0, D=0,
                      trace=True,
                      error_action='ignore',
                      suppress_warnings=True,
                      stepwise=True)
print(model.summary())

The above output suggests that ARIMA(2, 0, 1) yields the lowest AIC value of 1020.839, so I considered this the optimal configuration.
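As a quick cross-check (a sketch, not from the original analysis), the same order can also be fit directly with statsmodels and its AIC inspected:

# Fit the selected order directly and print its AIC
arima_201 = sm.tsa.ARIMA(df_1.Grains, order=(2, 0, 1)).fit()
print(arima_201.aic)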

model.plot_diagnostics(figsize=(10, 7))
plt.show()

It is not perfect; however, the model diagnostics suggest that the residuals are close to normally distributed, with mean 0 and standard deviation 1, i.e. N(0, 1).

The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). Again, this is a strong indication that the residuals are normally distributed.

The residuals over time (top left plot) don't display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the residuals have low correlation with lagged versions of themselves.

Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values. Now let’s forecast for the next 7 periods with our model.

import numpy as np

# Forecast
n_periods = 7
fc, confint = model.predict(n_periods=n_periods, return_conf_int=True)
index_of_fc = np.arange(len(df_1.Grains), len(df_1.Grains) + n_periods)

# make series for plotting purposes
fc_series = pd.Series(fc, index=index_of_fc)
lower_series = pd.Series(confint[:, 0], index=index_of_fc)
upper_series = pd.Series(confint[:, 1], index=index_of_fc)

# Plot
plt.plot(df_1.Grains)
plt.plot(fc_series, color='darkgreen')
plt.fill_between(lower_series.index, lower_series, upper_series,
                 color='k', alpha=.15)
plt.title("Forecast of Food Waste")
plt.show()

Hmm… The forecasted part looks pretty flat. Is this really a good forecast? Let’s try Facebook’s Prophet.

Time Series Modeling with Prophet

Released by Facebook in 2017, the forecasting tool Prophet is designed for analyzing time series that display patterns on different time scales, such as yearly, weekly, and daily. It also has advanced capabilities for modeling holiday effects and custom changepoints, so I am using Prophet to get a model up and running.

I converted the raw data into a dictionary of category-indexed, Prophet-format data frames.

def process_to_cats(raw, pounds):
    """Process the raw data into a dictionary of category-indexed Prophet-format data frames."""
    dfc = cat_pivot_table(raw, pounds)
    cats = dfc.columns
    return dict(zip(cats,
                    [pd.DataFrame(dfc[cat]).reset_index(level=0)
                       .rename(index=str, columns={"index": "ds", cat: "y"})
                     for cat in cats]))

I incorporate US holiday data into the Prophet model, because there may be more (or less) wastage on certain days due to a festival or the holiday season.
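The construction of the us_holidays frame used below isn't shown, but one possible sketch is a Prophet-style holidays DataFrame with 'holiday' and 'ds' columns (the dates here are just a few illustrative US holidays):

# Sketch: a Prophet holidays frame; 'lower_window'/'upper_window' extend the
# effect to the days around each holiday.
us_holidays = pd.DataFrame({
    'holiday': 'us_public_holiday',
    'ds': pd.to_datetime(['2018-07-04', '2018-11-22', '2018-12-25',
                          '2019-01-01', '2019-05-27', '2019-07-04']),
    'lower_window': 0,
    'upper_window': 1,
})

Alternatively, newer versions of Prophet can add built-in country holidays with m.add_country_holidays(country_name='US').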

If the trend changes are overfitting or underfitting, you can adjust the strength of the sparse prior using the input argument changepoint_prior_scale. By default, this parameter is set to 0.05. Increasing it will make the trend more flexible. So I have set the changepoint_prior_scale to 1.

Prophet does not fit monthly seasonality by default, so I added it using the add_seasonality method with a fourier_order of 20.

# fitting the model
def prophet_forecast_hol(dfp, periods):
    # Produces the forecast for however many periods,
    # taking US holidays into account.
    m = Prophet(changepoint_prior_scale=1.0,
                daily_seasonality=False,
                seasonality_mode="additive",
                holidays=us_holidays).add_seasonality(name='monthly',
                                                      period=30.5,
                                                      fourier_order=20)
    m.fit(dfp)
    future = m.make_future_dataframe(periods=periods, include_history=True)
    forecast = m.predict(future)
    fig1 = m.plot(forecast)
    return forecast[['ds', 'yhat']].tail(periods)

# Produces the forecast for the next days, using the Prophet forecast model.
# (Fragment from the per-category routine: `pt` is the output of
# prophet_forecast_hol for one category, `forecast_results` collects the
# per-category forecasts, and negative predictions are clipped to zero.)
forecast_results['date'] = pt['ds']
forecast_results[cat] = list(map(lambda x: max(x, 0), pt['yhat']))
dfm = pd.melt(forecast_results, id_vars=['date'])
return dfm.rename(columns={'variable': 'category', 'value': pounds})

I evaluate the forecast by predicting a window into the future, computing the sum over that window, and comparing it with what actually happened.

pred = prophet_forecast_hol(dfp.head(t), window)
comb = pd.merge(pred, dfp)
print(sum(comb['yhat'].head(window) - comb['y'].head(window)))

I used the mean absolute error (MAE) to validate my model. In statistics, MAE is a measure of the difference between paired observations. If you picture a scatter plot of n points, where point i has coordinates (x_i, y_i), the MAE is the average vertical distance between each point and the identity line, which is also the average horizontal distance between each point and that line.

The Mean Absolute Error is given by:
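\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

where y_i is the observed waste and ŷ_i (yhat) is the forecast for day i, over n forecasted days.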

I compared the MAE of my Prophet model with a baseline model, which aggregates past values of the same weekday (for example, averaging the past 7 Tuesdays to forecast the next Tuesday); a rough sketch of this baseline follows. Here I am showing the results for 5 categories, but my model gave better results for all the categories.
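This is my reconstruction of such a weekday baseline, not the original code; weekday_baseline is a hypothetical helper and dfp is the Prophet-format frame ('ds', 'y') used above:

def weekday_baseline(dfp):
    # Forecast each weekday as the mean of the last 7 observations of that weekday.
    dfp = dfp.copy()
    dfp['weekday'] = pd.to_datetime(dfp['ds']).dt.dayofweek
    return dfp.groupby('weekday')['y'].apply(lambda s: s.tail(7).mean())

baseline = weekday_baseline(dfp)  # one forecast value per weekday (0 = Monday)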

This graph shows that our model clearly outperforms the baseline model.

From the above graph, you can also see peaks around the 4th of July, which is likely the effect of the holiday: more food is wasted on that day.
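One way to confirm this (a small sketch; m and forecast are the fitted model and prediction from the Prophet fitting step above) is to look at Prophet's component plots, which break the forecast into trend, holiday, and seasonality terms:

# Sketch: inspect the trend/holidays/seasonality components of the forecast
fig2 = m.plot_components(forecast)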

There is still room to improve if we can get data on:

  1. Special events calendar within the companies.
  2. At least 2 years of data.
