Application of modern data science techniques for sustainable menu planning
This project was carried out as part of the TechLabs “Digital Shaper Program” in Münster (winter term 19/20).
As part of our learning experience at TechLabs we applied modern data science techniques to data from a large university canteen. Our aim was to predict sales figures more accurately in order to reduce food waste caused by overproduction. We gathered data from various sources, then cleaned, visualized, and interpreted it. However, our conclusions are limited by inconsistencies in the database.
Key Takeaways from Data Science Track
- Having never done a complete data science project before, we can now apply the process described in this blog in our studies or jobs and utilize the power of state-of-the-art open-source data science technologies.
- The many tools introduced to us, such as Slack for group communication, a Trello Kanban board for project management, Jupyter Notebooks and PyCharm as Python development environments, Git and GitHub for version control of code, and e-learning platforms like DataCamp, have brought our tech skills to a new level.
- In data science, everything depends on the availability and quality of the data. If data is ambiguous and a matter of interpretation all conclusions will be, too.
- By far the most effort in a data science project goes into data cleaning.
- Be aware of misleading correlations.
- Writing a script in Python (e.g. to categorize or visualize data) can be hard work and is not necessarily the most efficient way. Using Excel can be more straightforward, especially for small data sets.
Our data analysis process consists of four major steps: (1) defining the scope and building hypotheses, (2) data gathering, (3) data cleaning, and (4) data visualization and interpretation. In addition, the application of machine learning algorithms (5) is a bonus topic we tackled after finishing the key tasks. As the goal of the TechLabs Data Science track is to impart knowledge and skills in this field, this blog article focuses not only on the tasks of each step of the data analysis process, but also on the obstacles the team had to overcome and the experiences it gained.
1. Defining the scope and building hypotheses
Food waste is one of the world’s major issues. Thus, our plan was to analyze whether modern data science techniques could help predict the number of menus sold in a large-scale canteen. If so, this would have a noticeable impact on reducing food waste in places where a lot of food is processed.
We came up with various questions we wanted to answer during the process:
● How do sales figures develop over the years? Is there a trend towards vegetarian and vegan dishes?
● How do sales figures fluctuate during the year? Does the holiday season have an effect on sales figures?
● How do sales figures depend on other external variables? Is the demand for soup lower if the temperature is above 30 degrees Celsius?
● How does one menu affect another? Is Schnitzel the one and only favorite no other menu can keep up with?
2. Data gathering
In order to be able to answer the questions we needed three different types of data: A) menu plans and sales figures of a large-scale canteen, B) weather data, and C) holiday dates.
A) Menu plans and sales figures
We contacted the Mensa am Ring, a large canteen of the University of Münster, a university with more than 40,000 students. They were happy to provide us with several Excel sheets containing menu names, sales figures, and additional information from 2011 to 2018. This is what the data we received looked like:
B) Weather data
The German Meteorological Service (DWD) offers free and easily accessible historical weather data. Although Münster itself has no weather station of its own, the nearby Münster-Osnabrück airport does. For our purposes we assumed that the weather in Münster corresponds to the airport’s weather. Data from DWD is professionally structured and available for download as comma-separated values files (.csv).
C) Holiday dates
Because we used data from a canteen which is mainly used by students, we considered three different types of holiday data: (1) semester breaks (i.e. the time from the last lecture until the official beginning of the next semester), (2) lecture-free time (i.e. the time from the last lecture until the first lecture of the next semester), and (3) official public state holidays of North Rhine-Westphalia. We looked up (1) and (2) on the university’s official website and (3) on a publicly available online calendar. We created an Excel sheet and assigned a binary code per holiday type to each day.
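This kind of binary coding can be sketched with pandas; the dates below are illustrative stand-ins, not the actual semester or holiday dates:

```python
import pandas as pd

# One calendar day per row; the real sheet covered several years.
days = pd.DataFrame({"date": pd.date_range("2018-07-01", "2018-07-07")})

# Hypothetical ranges, only for illustration.
semester_break = (pd.Timestamp("2018-07-03"), pd.Timestamp("2018-07-06"))
public_holidays = {pd.Timestamp("2018-07-04")}

# One binary column per holiday type: 1 if the day falls in the range, else 0.
days["semester_break"] = days["date"].between(*semester_break).astype(int)
days["public_holiday"] = days["date"].isin(public_holidays).astype(int)
```

In the real sheet this was done by hand in Excel, but the result is the same: one 0/1 column per holiday type, ready to merge with the daily sales data.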
3. Data cleaning
Some people say that up to 80% of the effort in data science is data cleaning. Honestly, we could not really believe it until we had to clean our own data. Both (A) the menu plans and sales figures and (B) the weather data had to be cleaned because they were originally saved for different purposes. As the tasks and obstacles differed, the following paragraphs outline the procedures separately. There was no need to clean the holiday data; due to manual collection it already had the right format. Unless otherwise specified, all data cleaning steps were done in Python.
A) Menu plans and sales figures
Cleaning the menu plans and sales figures was one of the most time-consuming parts.
First, we decided not to use the years 2011 to 2014 because our contact person at the canteen had already told us that the values for these years are not really meaningful.
Second, according to the data, a lot of menus were sold only once. This is because the cash register system is not linked to the inventory management system. There is a manual workaround to transfer the numbers, but this process does not always work (e.g. if the person responsible is on vacation). We decided not to alter the data: exclusion would have reduced our database too much, and imputation techniques would have had a significant effect on the results. By leaving the data as it was, we assumed that the errors are equally distributed and compensate each other. Nevertheless, this is the major limitation of our analysis, especially regarding the correctness of its conclusions.
Third, there is one column with menu names, but due to manual input the same dishes are sometimes described differently (e.g. Frikadellen vs. Fleischbällchen), an annex is added (e.g. Falafel Minz Joghurt vs. Falafel Minz Joghurt Creme), or there are simply spelling mistakes (e.g. Hänchenkeule). To overcome this problem, we decided not to use the menu names themselves. Instead, we categorized the menus (e.g. vegetarian or non-vegetarian). We tried to do this with an algorithm, but because of the variety of menu names, this required a lot of manual sorting and programming effort and still left many wrong categorizations. Thus, an additional laborious manual Excel-based classification of nearly 1,000 unique dishes was performed.
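A minimal version of the keyword-based categorization we attempted might look like the sketch below. The keyword list is purely hypothetical; in practice, the sheer variety of names and spellings is exactly why a manual classification step was still needed:

```python
# Hypothetical keyword list; the real menu names were far more varied,
# which is why automatic classification alone was not reliable.
MEAT_KEYWORDS = ["schnitzel", "frikadelle", "fleisch", "haehnchen", "hähnchen",
                 "wurst", "gulasch", "keule"]

def categorize(menu_name: str) -> str:
    """Naive classifier: 'non-vegetarian' if any meat keyword appears in the name."""
    name = menu_name.lower()
    return "non-vegetarian" if any(kw in name for kw in MEAT_KEYWORDS) else "vegetarian"

print(categorize("Schnitzel mit Pommes"))   # non-vegetarian
print(categorize("Falafel Minz Joghurt"))   # vegetarian
print(categorize("Hänchenkeule"))           # caught via 'keule' despite the typo
```

A dish like "Gemüsebratling" is categorized correctly only because no keyword matches, while any meat dish missing from the list slips through, which illustrates the long tail of manual corrections we faced.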
B) Weather data
The DWD data files contain a variety of weather variables at hourly resolution. We focused on temperature and sunshine duration. For our analysis we calculated, for each day, the average temperature in degrees Celsius from 11 a.m. to 2 p.m. as well as the average sunshine time in % over the same period (i.e. minutes of sunshine divided by 180 minutes).
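The daily aggregation can be sketched with pandas; the column names and values below are illustrative and do not match the actual DWD schema:

```python
import pandas as pd

# One day of toy hourly weather data (real DWD files use different column names).
hourly = pd.DataFrame({
    "timestamp": pd.date_range("2018-06-01 00:00", periods=24, freq="h"),
    "temp_c": list(range(24)),                           # dummy temperatures
    "sunshine_min": [0] * 11 + [60, 45, 30] + [0] * 10,  # minutes of sun per hour
})

# Keep the hours 11:00-13:59, i.e. the 180 minutes from 11 a.m. to 2 p.m.
midday = hourly[hourly["timestamp"].dt.hour.between(11, 13)]
daily = midday.groupby(midday["timestamp"].dt.date).agg(
    avg_temp=("temp_c", "mean"),
    sunshine_pct=("sunshine_min", lambda m: 100 * m.sum() / 180),
)
```

Restricting the average to the lunch window instead of the whole day keeps the weather features aligned with the canteen's opening hours.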
Finally, our composed data frame looked like this:
4. Data visualization and interpretation
We chose a step-by-step approach for data visualization and interpretation. Three aspects were the focus of our analyses: A) the development of weekly vegetarian and non-vegetarian sales figures, B) average vegetarian and non-vegetarian sales figures per weekday, and C) the development of daily sales figures per food counter. We used Python’s matplotlib.pyplot and seaborn packages to visualize the data. The most interesting findings are outlined below.
A) Development of vegetarian and non-vegetarian sales figures per week
Over the next paragraphs we build up a more and more complex visualization which in the end comprised a number of different variables.
First, graph 1 simply depicts the sales figures of vegetarian and non-vegetarian dishes from 2015 to 2018. Over these years, non-vegetarian sales figures (red line) decreased while vegetarian sales figures (green line) increased. In 2017 and 2018, they were nearly on the same level. Noticeably, data for January 2018 is missing completely.
Second, in addition to the first graph, the second graph plots the average temperature from 11 a.m. to 2 p.m. (blue dotted line). Summer and winter times are clearly recognizable. Even the extraordinary hot summer in 2018 is visible.
Third, because new variables make the graph more complex, we smoothed the data series. This way, developments remain visible while the weekly fluctuations are less distracting.
Fourth, graph 4 depicts sunshine time in % (yellow line) instead of temperature. Because sunshine time does not seem to correlate with the sales figures, sunshine time is not plotted in graph 5.
Fifth, the semester break (dark grey area) and lecture-free time (light grey area) are plotted. These periods seem to influence both vegetarian and non-vegetarian sales figures: the longer the students’ break, the stronger the effect. It becomes clearly visible that sales figures correlate mostly with semester breaks and not with temperature.
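The smoothing and the shaded holiday areas described above can be sketched as follows; all numbers and dates are dummy values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted plotting
import matplotlib.pyplot as plt
import pandas as pd

weeks = pd.date_range("2018-01-07", periods=10, freq="W")
veg = pd.Series([90, 130, 85, 120, 95, 140, 100, 150, 110, 145], index=weeks)

# A centered rolling mean keeps the trend visible while damping weekly jumps.
smoothed = veg.rolling(window=3, center=True, min_periods=1).mean()

fig, ax = plt.subplots()
ax.plot(smoothed.index, smoothed, color="green", label="vegetarian (smoothed)")
# Shade a hypothetical semester break as a grey area, as in graph 5.
ax.axvspan(pd.Timestamp("2018-02-05"), pd.Timestamp("2018-03-05"),
           color="grey", alpha=0.3, label="semester break")
ax.set_ylabel("menus sold per week")
ax.legend()
fig.savefig("weekly_sales.png")
```

Adding one variable per iteration like this made it easy to judge which additions (temperature, sunshine, holidays) actually carried information.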
B) Average vegetarian and non-vegetarian sales figures per weekday
Another aspect we analyzed is the average number of dishes sold per weekday, as shown in graph 6. The average for Saturday is based on a single value for one dish, so it can be assumed to be an error in the original data. Non-vegetarian (0) sales figures are higher than vegetarian (1) sales figures on every day. The sales figures for non-vegetarian dishes peak on Tuesday and for vegetarian dishes on Wednesday. From their respective peaks, sales figures decrease until the end of the week. Friday is by far the day with the lowest sales figures.
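Weekday averages like those in graph 6 come down to a simple groupby; the numbers here are made up:

```python
import pandas as pd

# Dummy daily totals; 0 = non-vegetarian, 1 = vegetarian, as in graph 6.
df = pd.DataFrame({
    "weekday": ["Mon", "Mon", "Tue", "Tue", "Wed", "Wed"],
    "vegetarian": [0, 1, 0, 1, 0, 1],
    "sold": [500, 300, 550, 320, 520, 340],
})

# Average number of dishes sold per weekday and category,
# with one column per category after unstacking.
avg = df.groupby(["weekday", "vegetarian"])["sold"].mean().unstack()
```

The resulting table (weekdays as rows, the two categories as columns) can be passed directly to a bar plot.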
C) Average sales figures per food counter
The last aspect of our analyses is the food counters. There are several food counters in the canteen: grill (orange line), menu (purple line), vegan (red line), stew (blue line), and salad/vegetables/pasta (green line). In addition to the sales figures of the food counters, the temperature (black line) is plotted. The graphs show that separate reporting began in 2016 for vegan dishes and at the end of 2016 for stew. Because of false values, the line for salad/vegetables/pasta is missing almost entirely. The sales figures of the menu and grill counters are several times higher than those of the vegan and stew counters. Each summer, there is a period in which the grill counter’s sales figures drop noticeably below those of the menu counter; over the rest of the year they are more or less on the same level. For stew there seems to be a selling stop in the summer months. Further conclusions about the food counters would be very vague given the data, so this analysis was not pursued further.
All in all, our main experience was creating the graphs in Python and exploring its nearly unlimited possibilities, even though we sometimes struggled to do in Python what could have been done in Excel with a few clicks. As the data set still contains outliers as well as missing data, further cleaning steps need to be conducted before reliable conclusions can be drawn from it.
5. Bonus: Application of machine learning algorithms
In addition to the key project scope, we applied machine learning algorithms to our data to forecast sales figures. In this context, the TechLabs Jupyter notebook on ‘Bikesharing in Münster’ was of great help.
To forecast a continuous variable like sales figures from known future predictor variables like the day of the week, we used regression algorithms. With more than 8,000 data points and around 30 features, we chose supervised learning algorithms. After describing A) the data preparation process, B) the process of model training, testing, and evaluation is illustrated using the best-performing machine learning algorithm, before C) an outlook is given at the end.
A) Data preparation
We went through the following steps to prepare our data for machine learning starting with the pandas data frame shown in section 3.
● As machine learning algorithms can only handle numerical data, we converted all categorical features to binary dummy features (e.g. Monday gets its own column ‘weekday_monday’ with entries set to 0 or 1 depending on whether the day was a Monday).
● For feature correlation analysis, we used seaborn plots to determine whether the features correlate with each other and with the sales figures. Based on an educated guess, the features deemed most important were chosen.
● We split the data into features and target variable as well as into a training and test set. The training set is used to train the machine learning models, whereas the test set is needed to test their performance on an unseen dataset. As we have a forecasting problem with a time-dependent structure, it is important to not split the data set randomly. We used historical data of the years 2015–2017 as training set and 2018’s data as test set for forecasting.
● The pandas data frames were transferred to numpy arrays as input to the scikit-learn machine learning algorithms.
● The feature data was scaled so that all features carry comparable weight.
● The target variables were skewed, meaning they do not follow a normal distribution. This was addressed with a square root transformation.
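On a much smaller toy frame, the preparation steps above can be sketched with pandas and scikit-learn (column names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for our composed data frame.
df = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2018],
    "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "temp": [5.0, 12.0, 20.0, 25.0, 18.0],
    "sold": [400, 450, 380, 300, 260],
})

# 1) categorical -> binary dummy features
df = pd.get_dummies(df, columns=["weekday"])

# 2) time-based split: train on 2015-2017, test on 2018 (no random shuffling!)
train, test = df[df["year"] < 2018], df[df["year"] == 2018]
X_train = train.drop(columns=["sold"]).to_numpy(dtype=float)
X_test = test.drop(columns=["sold"]).to_numpy(dtype=float)

# 3) scale the features (fit the scaler on the training data only)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 4) square root transformation of the skewed target
y_train = np.sqrt(train["sold"].to_numpy())
```

Fitting the scaler on the training set only avoids leaking information from 2018 into the model, the same reason the split itself is chronological rather than random.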
B) Model training, testing and evaluation
A couple of scikit-learn algorithms were tested: starting with simple linear regression, we also tried ridge, lasso, random forest, and extreme gradient boosting. Each model was assessed with roughly the same steps:
● The models were trained on the training set using cross-validation techniques. If a model has hyperparameters, grid-search cross-validation was applied to find their optimal values.
● Afterwards, each model’s performance was evaluated on the test set data it had never seen before. As performance metrics we looked at the mean absolute error (MAE) as well as the mean absolute percentage error (MAPE).
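On synthetic data, these two steps look like this (ridge regression stands in for any of the tested models; the alpha grid values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 5)), rng.normal(size=100)
X_test, y_test = rng.normal(size=(20, 5)), rng.normal(size=20)

# Grid-search cross-validation on the TRAINING set to tune the hyperparameter.
grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Evaluate on the test set the model has never seen.
mae = mean_absolute_error(y_test, grid.predict(X_test))
```

The refit best estimator inside `grid` is then used exactly like a plain model, so swapping in lasso or a random forest only changes the estimator and the parameter grid.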
Two machine learning results are described in more detail: (i) linear regression and (ii) random forest regression.
(i) Linear regression
Linear regression is the most basic model. The coefficients m and b of a linear equation y = m*x + b, where x is the feature and y the variable to be predicted, are fitted by minimizing a loss function (with one coefficient m per feature in the multivariate case). This simple model helped us understand how machine learning works in general. However, one disadvantage of linear regression is that large coefficients can lead to overfitting. This can be overcome with ridge or lasso regression, which penalize large feature coefficients, an idea called regularization. We used lasso regression with grid-search cross-validation to select the most important features, as it shrinks the coefficients of less important features to 0. As an example, the feature coefficients of the lasso regression are shown in the following graph:
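The shrinking effect can be demonstrated on synthetic data in which only two of five features matter; this is a sketch, not our actual feature set:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Only features 0 and 2 influence this toy target.
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

grid = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0]}, cv=5)
grid.fit(X, y)
coefs = grid.best_estimator_.coef_  # irrelevant features shrink towards 0
```

Plotting `coefs` as a bar chart reproduces the kind of feature-coefficient graph described above: the informative features keep large coefficients while the rest collapse towards zero.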
(ii) Random forest regression
This is our model of choice, as it delivered the best prediction results. Without going into the algorithmic details, the prediction results are described below. The following graph shows the weekly sales figures of 2018, with the algorithm trained on the data of 2015 to 2017.
Looking at 2018 in more detail and comparing the prediction with the ground truth, one can see that the bigger trend in the sales figures is captured quite well. However, there are some larger deviations, especially in times with higher numbers of menus sold.
Investigating possible reasons by looking at the error distribution and comparing it with the feature distributions, we found that sales figures during lecture-free times are captured well, in contrast to the semester times. This issue should be analyzed in further detail, but was not followed up at this point.
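The forecasting setup itself, train on three years and predict the fourth, can be sketched on synthetic data; the weekly granularity and feature count here are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# 3 years x 52 weeks of toy training features, 52 test weeks for "2018".
X_train = rng.normal(size=(156, 4))
y_train = 300 + 50 * X_train[:, 0] + rng.normal(scale=10, size=156)
X_test = rng.normal(size=(52, 4))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
forecast = model.predict(X_test)  # one predicted value per week of 2018
```

Plotting `forecast` against the actual 2018 figures yields exactly the kind of prediction-vs-ground-truth comparison discussed above.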
To summarize, we were quite happy with our first machine learning application. Certainly, there are issues that need to be analyzed further and understood in more depth, but overall first predictions are quite promising.
By cleaning the data in more detail and spending additional time on optimizing a regression model, a canteen assessment tool could be developed. For improved predictions, the dependencies between menus sold on the same day should also be considered. To achieve this on an abstract level, menus could be sorted into mutually exclusive and collectively exhaustive (MECE) categories like pasta or dessert.
The ultimate idea is to have an app or browser front-end, where the canteen staff puts in the planned menu for the next week. Automatically, weather forecast data as well as holiday and semester data are added. The app delivers the predicted sales figures per menu and the required amounts of ingredients are directly ordered from the supplier. Every day, sales figures are automatically transferred to the system, which learns from the past and thereby steadily improves. Thus, food waste due to overproduction could be reduced.
Finally, this tool would also answer the question we started our project with:
Is Schnitzel the one and only favorite no other menu can keep up with?
Daniel Gustav Kronbein
Our special gratitude goes to the Mensa am Ring. They gave us menu plans as well as sales figures of the last years to work with and answered our questions on how to interpret certain variables. In addition, we would also like to thank our TechLabs mentor, Jason O’Reilly, as well as the whole TechLabs team for this great learning experience.