Data Science x Project Planning

James Chen
5 min read · Mar 22, 2018


A non-technical, beginner-friendly guide to the k-NN algorithm and its application to forecasting, from a project planning point of view.

Introduction

The intended audience for this short blog post is data science practitioners who want to implement predictive algorithms in a business-project setting, with a special focus on the work process flow. We will briefly introduce the k-Nearest Neighbors (k-NN) algorithm and put more emphasis on the key phases of a project, rather than walking through the technical theory behind the algorithm and its prediction performance.

The example business project here is a typical sales forecasting problem where we want to accurately predict the quantity sold of a number of products in the future, in order to manage our inventory more wisely.

Methodology

The k-NN algorithm is probably better known for its classifier application, where we use a number of nearby points to determine the outcome of our target. The rationale is straightforward: if we use height and age as our inputs and gender as our target, then it makes sense to say that a person who is 25 years old and 6 feet tall is more likely to be male, because, say, the 5 people closest in age and height happen to be male.
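
To make that intuition concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier; the (age, height) sample and the choice of 5 neighbors are made up purely for illustration.

    # Minimal k-NN classifier sketch with a made-up (age, height in inches) sample.
    from sklearn.neighbors import KNeighborsClassifier

    X = [[24, 71], [26, 73], [25, 72], [27, 70], [23, 74],   # taller mid-20s examples labeled male
         [24, 64], [26, 63], [25, 65], [28, 62], [22, 66]]   # shorter examples labeled female
    y = ["male"] * 5 + ["female"] * 5

    # Classify a new person (age 25, 6 feet = 72 inches) by majority vote
    # among the 5 nearest neighbors.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X, y)
    print(model.predict([[25, 72]]))  # -> ['male']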

However, k-NN can also be applied in a more unsupervised fashion, where we first find similar data points. After finding them, we can predict our target from their statistical distribution, using the mean, median, or mode. This approach also works for time-series data, where we use the outcomes of similar historical trends to predict our target. It is non-parametric and differs from conventional time-series methods such as ARIMA and LSTM.

Stages

1. Preparation (40% of time)

Key: understanding of the business

During the preparation stage, the objective is to collect relevant and accurate data from our data sources, as well as to determine the testing setup: which products to model on (sampling) and which time window to evaluate our performance over (backtesting).
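
As an illustration of the backtesting decision, the sketch below holds out the last few weeks of a daily sales history for evaluation; the column names and the 28-day holdout are assumptions made for the example, not a prescription.

    # Minimal time-based backtest split, assuming a pandas DataFrame `sales`
    # with columns ["date", "product_id", "quantity"] and a datetime "date" column.
    import pandas as pd

    def backtest_split(sales: pd.DataFrame, holdout_days: int = 28):
        """Train on everything up to the cutoff; evaluate on the final holdout window."""
        cutoff = sales["date"].max() - pd.Timedelta(days=holdout_days)
        train = sales[sales["date"] <= cutoff]
        test = sales[sales["date"] > cutoff]
        return train, test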

This is also the most time-consuming stage, as we need a deep and thorough understanding of our business, checking and validating our assumptions, as well as agreeing on evaluation methods that, to the best of our knowledge, introduce the least bias.

While we may be tempted to move quickly on to the modeling stage, the work here requires a lot of patience and experience and has a significant impact on the quality of our predictions, even though the process may be less interesting. It is also common for managers who are not familiar with the process to assume that data preparation is simple and, often, negligible.

It is recommended to allocate more project time and resources to this stage than to any of the following stages: garbage in, garbage out.

2. Preprocessing (30% of time)

Key: understanding of the data

Preprocessing plays a critical role when we are dealing with real-world data instead of toy datasets from textbooks. We will have to handle outliers, missing values, and short selling histories (newly launched products or sold-out situations).

There are also data transformation techniques, including log transformation, normalization, dimensionality reduction, and so on. Depending on our understanding of the data's characteristics, we can then decide which techniques to apply.
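
As a rough illustration, the sketch below applies a few such steps to one product's daily quantity series with pandas and NumPy; the specific choices (forward-fill, IQR-based capping, a log transform) are assumptions for the example rather than recommendations.

    import numpy as np
    import pandas as pd

    def preprocess_series(qty: pd.Series) -> pd.Series:
        """Illustrative cleaning of one product's daily quantity series."""
        qty = qty.ffill().fillna(0)                  # fill gaps from missing selling days
        q1, q3 = qty.quantile([0.25, 0.75])
        qty = qty.clip(upper=q3 + 1.5 * (q3 - q1))   # cap extreme outliers (IQR rule)
        return np.log1p(qty)                         # log transform to reduce skew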

Another interesting point to cover here is trend extraction with a sliding window. The idea is to break the entire history down into smaller pieces and thereby create more data points.

For example, if we have a product with 300 selling days, we could take the first 100 days as our first record and slide the time window by 7 days; that is, the second record will be the 8th to the 107th day, the third record the 15th to the 114th day, and so on. We will end up with (300 − 100) / 7 ≈ 28 additional windows, or 29 records in total.
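
A minimal sketch of this sliding-window step, assuming the history is a plain 1-D array of daily quantities:

    import numpy as np

    def sliding_windows(series, window=100, step=7):
        """Cut one long series into overlapping fixed-length windows."""
        series = np.asarray(series, dtype=float)
        starts = range(0, len(series) - window + 1, step)
        return np.array([series[s:s + window] for s in starts])

    # 300 selling days -> 29 windows of 100 days each
    print(sliding_windows(np.arange(300)).shape)  # (29, 100)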

3. Modeling (15% of time)

Key: understanding of the model

The performance of the algorithm depends on a few choices: the quality of the pool of data points our target is compared against, the number of nearby points taken into the calculation, and our definition of similarity or difference (the distance measure).

The key idea here is how we use the algorithm. After finding the nearby 100-day trends, we use the sales quantity each of them recorded on its 101st day as the basis of our prediction, taking the mean or median of those quantities.
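
Putting the pieces together, here is a sketch of that forecasting step using scikit-learn's NearestNeighbors: find the k historical 100-day trends most similar to the target's latest 100 days, then aggregate the value each neighbor recorded on its following day. The array names and the Euclidean distance are assumptions for the example.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_forecast(history_windows, next_day_values, target_window, k=5, agg=np.median):
        """Predict the target's next-day quantity from its k most similar trends.

        history_windows : (n, 100) array of past 100-day trends (from the sliding window)
        next_day_values : (n,) array of the sales observed right after each trend
        target_window   : (100,) array, the latest 100 days of the product to forecast
        """
        nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(history_windows)
        _, idx = nn.kneighbors([target_window])
        # mean or median of the neighbors' "101st day" values
        return agg(np.asarray(next_day_values)[idx[0]])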

It may be helpful to conduct a few quick experiments before trying a large number of parameter combinations, as this gives us a better estimate of the time needed to finish parameter tuning.

By recording the results of each experiment with different parameters, we can better understand the model's performance on the given data. For example, we may observe that using 3 neighbors yields better results than both 5 and 7 neighbors; a possible explanation is that within our pool of 100-day trends, not many are truly similar to each other, so performance suffers when we increase the number of nearby points.

Another example: if we find that taking the median outperforms the mean, it is highly likely that the referenced data points contain outliers, and we may need to go back to the preprocessing stage to clean our data.

4. Evaluation (15% of time)

Key: understanding of the result

The evaluation KPI is usually determined back in the preparation stage, through discussions with project stakeholders. Just like algorithms and data-handling techniques, every KPI has its pros and cons. In time-series forecasting, common KPIs include RMSE, RMSLE (its log variant), MAPE (or MPE), wMAPE, and sMAPE. It is important to know the limitations and assumptions of each metric in order to better understand future prediction performance.

For example, if we optimize our model toward RMSE, which measures the error in the same absolute units as the quantity itself, it is hard to know whether an overall RMSE of 100 is acceptable, since that depends on whether the actual quantity is around 10 or 10,000. On the other hand, by choosing MAPE, we may underestimate the predictive strength of our model when the actual quantities are small; MAPE is also insensitive to the direction of the error (over- versus under-forecasting).
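
For reference, here is a sketch of three of these metrics computed with NumPy; the toy actual/predicted arrays exist only to show the calculations.

    import numpy as np

    def rmse(actual, pred):
        actual, pred = np.asarray(actual, float), np.asarray(pred, float)
        return np.sqrt(np.mean((actual - pred) ** 2))            # error in original units

    def mape(actual, pred):
        actual, pred = np.asarray(actual, float), np.asarray(pred, float)
        return np.mean(np.abs((actual - pred) / actual)) * 100   # undefined when actual == 0

    def wmape(actual, pred):
        actual, pred = np.asarray(actual, float), np.asarray(pred, float)
        return np.abs(actual - pred).sum() / actual.sum() * 100  # weights errors by volume

    actual, pred = [10, 120, 30, 500], [12, 100, 25, 520]
    print(rmse(actual, pred), mape(actual, pred), wmape(actual, pred))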

From a project planning point of view, selecting the most appropriate metric to optimize up front saves a lot of time that would otherwise be spent re-running the modeling stage with a different KPI.

End Notes

So far we have covered four stages of a data science project from a project planning perspective. It is highly recommended to think through the stages before even pulling the data (though that is hard to resist). Of course, we may face unexpected issues during the project, but by carefully considering the key points above, we can significantly reduce the risk of redoing work and failing to deliver the result within the agreed timeline.

For more advanced readers, please also refer to the technical resources below, if you wish to learn more about time-series prediction:

  1. Uber’s backtesting tool: https://eng.uber.com/omphalos/
  2. k-NN on time-series: https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/39876
  3. XGBoost and linear regression: https://www.analyticsvidhya.com/blog/2016/02/hand-learn-time-series-3-hours-mini-datahack/
  4. Model-stacking approach: https://www.datasciencecentral.com/profiles/blogs/modern-approaches-for-sales-predictive-analytics-1

Open to project-based work.
jchen6912@gmail.com


James Chen

Engineer by training. Analytics by passion. R and Python addict who hacks and decodes data for marketers.