Predictive Modeling: Best practices and lessons learnt the hard way

Syed Misbah
Published in Data Decoded · Oct 18, 2018 · 7 min read

Having spent the last six months working extensively on predictive modelling in the FMCG industry and making more than a few mistakes, I’ve decided to put together a list of best practices and common pitfalls to avoid.

Unlike my earlier posts, however, this one will be more to the point and succinct. It will broadly be divided into the following sections.

  1. Structuring and setting up the environment
  2. Relevant metrics — FMCG
  3. Variable design/feature engineering
  4. ADS Creation
  5. Modeling
  6. Language-specific guidelines and pitfalls to avoid — R and Python

Structuring and setting up the environment

  • Whenever you start a modeling exercise, use a proper structure to organize your files and folders.
  • As a guide, use the template shown below.
  • The idea is to have one folder containing all the datasets, scripts and results of the exercise.
  • Treat this structure as a flow: start by putting your files in the 0. Raw Data folder, then move on to data pre-processing, and so on. This flow makes it easy to create an end-to-end analysis pipeline.
  • The results of any script in the current folder should be saved to the next folder. For example, after preparing the ADS using the ADS Creation Script, the ADS can be saved to the next folder (2. Data — ADS Only). A small scripted version of this setup is sketched below.
[Image: folder structure template]
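As a rough illustration, the skeleton can be scripted in a few lines of R. This is a minimal sketch: only "0. Raw Data" and "2. Data — ADS Only" are named in this post, so the other folder names below are hypothetical placeholders.

```r
# Create the project skeleton in one go.
# Only "0. Raw Data" and "2. Data - ADS Only" appear in the post;
# the remaining folder names are hypothetical placeholders.
folders <- c(
  "0. Raw Data",
  "1. Scripts - Data Prep",
  "2. Data - ADS Only",
  "3. Scripts - Modeling",
  "4. Results"
)
for (f in folders) dir.create(f, showWarnings = FALSE, recursive = TRUE)
```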

Relevant metrics — FMCG

  • Know the basic metrics — Unit Sales, Dollar Sales, Volume Sales. Read more about them here.
[Image: sales metrics overview (pic courtesy the CPG Insights website)]
  • Volume is sometimes used interchangeably with units. However, when you're doing sales analysis, use units unless you're explicitly asked to use volume sales.
  • ACV is slightly tricky to understand; however, CPG Insights provides a pretty good explanation here. Read it thoroughly.
  • To calculate discount depths, you will need base prices and volume. There are multiple approaches available; read this for a detailed explanation. One simple convention is sketched below.
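To make this concrete, here is a minimal R sketch of one simple convention: take the modal (most frequent) shelf price per product as the base price and measure the discount depth against it. The data and column names are hypothetical, and the linked article covers more robust definitions.

```r
library(dplyr)

# Hypothetical weekly prices, one row per product-week
prices <- data.frame(
  product = rep(c("A", "B"), each = 4),
  week    = rep(1:4, times = 2),
  price   = c(10, 10, 8, 10, 5, 4, 5, 5)
)

discounts <- prices %>%
  group_by(product) %>%
  mutate(
    # Base price: the most frequent observed price for this product
    basePrice     = as.numeric(names(which.max(table(price)))),
    # Discount depth: how far below base the shelf price is
    discountDepth = pmax(0, 1 - price / basePrice)
  ) %>%
  ungroup()
```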

Variable design/feature engineering

  • This is THE most important step of the entire modeling exercise. Choose your variables carefully.
  • KNOW the level of your data and, consequently, of your model. Sales modelling is typically done at a market-product level, since each product is different and will respond differently in different markets. However, this might not hold for smaller geographies with little disparity across regions.
  • Generally, in a sales driver model, you'll have the following variables:

[Image: typical sales driver variables]
  • The best way is to prepare each variable in a separate script at the level of your data and then merge it with the main dataset at the same level.
  • Whether you remove or keep a certain variable depends on the specific use case. For example, a staple like rice might not exhibit much seasonal variation compared to a beverage. Know your use case and study your EDA results to decide whether to keep or drop a variable.
  • Sometimes a model needs to describe variables in terms of their interactions. For example, suppose I want to find how discount depth and execution affect the sales of my product in conjunction with each other. In that case, I would create a combined (interaction) variable for Execution × Discount Depth and then run a regression model (see the sketch after this list). It might show that the effect of a 20% discount combined with 'High' execution is the same as that of a 40% discount combined with 'Low' execution. Hence, I can unlock growth without increasing my trade spend.
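As a rough sketch of such an interaction model in R, on simulated data (all variable names hypothetical, and the coefficients made up purely for illustration):

```r
set.seed(42)

# Simulated ADS at some market-product-week level
ads <- data.frame(
  discountDepth = runif(200, 0, 0.4),   # % discount off base price
  executionHigh = rbinom(200, 1, 0.5)   # 1 = 'High' in-store execution flag
)
ads$salesUnits <- exp(1 + 2.0 * ads$discountDepth +
                      0.3 * ads$executionHigh +
                      1.5 * ads$discountDepth * ads$executionHigh +
                      rnorm(200, sd = 0.2))

# '*' expands to both main effects plus the interaction term
model <- lm(log(salesUnits) ~ discountDepth * executionHigh, data = ads)
summary(model)
# A significant discountDepth:executionHigh coefficient would say that
# execution changes how strongly sales respond to a given discount depth.
```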

ADS Creation

  • If you’ve done all the above correctly, this step shouldn’t be too complicated.
  • Ideally, all variables would be ready at the level of your data. You would just be merging the variables with the main dataset at that level.
  • Outlier treatment is a very, very important step, especially when running linear regression models: LR is very sensitive to outliers. However, not all outliers should be dropped or imputed. Read this article to decide what to do with them; a simple capping approach is sketched below.
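One common treatment, shown here as a minimal sketch rather than a universal recipe, is to cap (winsorize) a variable at chosen percentiles instead of dropping rows outright:

```r
# Cap a numeric vector at the 1st and 99th percentiles
capOutliers <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

x <- c(rnorm(98), 50, -40)  # two extreme points
summary(capOutliers(x))     # extremes get pulled in to the percentile bounds
```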

Modeling

  • Most driver models, including sales driver models, are built using linear regression (or a variation thereof, such as Ridge, Lasso or ElasticNet).
  • Calling lm() is very easy; understanding the math and intuition behind linear regression is hard. Read this link, and if you do not get it at first, read it again. Read it until you have 100% intuition for LR.
  • Now read this link to understand why we build a log-log model for sales driver analysis. Understand what elasticity means (not just in terms of price). This will also help you interpret the coefficients of the model results.
  • Initially, a simple multiple linear regression (maybe stepwise) model will do. However, to improve accuracy and reduce overfitting, we will need to use ElasticNet. Read about ElasticNet here.
  • A model's accuracy is measured by MAPE (Mean Absolute Percentage Error), which should always be calculated on a held-out test dataset. Cross-validation should be used to ensure consistent results. A minimal end-to-end sketch follows after this list.
  • Advanced techniques like PCA might produce a lower MAPE, but they defeat the purpose of a driver model: they transform variables that represent physical entities into abstract mathematical combinations.
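Tying these points together, here is a minimal end-to-end sketch in R using the glmnet package: a log-log ElasticNet fit with cross-validation, scored by MAPE on a held-out test set. The data is simulated and the driver names are hypothetical.

```r
library(glmnet)
set.seed(42)

# Simulated log-log ADS: X holds the drivers, y is log(unit sales)
n <- 500
X <- matrix(runif(n * 3), ncol = 3,
            dimnames = list(NULL, c("logPrice", "discountDepth", "executionHigh")))
y <- 2 - 1.5 * X[, "logPrice"] + 0.8 * X[, "discountDepth"] + rnorm(n, sd = 0.1)

# Hold out 20% as a test set; cross-validate ElasticNet (alpha = 0.5) on the rest
test <- sample(n, n %/% 5)
fit  <- cv.glmnet(X[-test, ], y[-test], alpha = 0.5)

# Back-transform to the unit scale before computing MAPE
predicted <- exp(predict(fit, newx = X[test, ], s = "lambda.min"))
actual    <- exp(y[test])
mape      <- mean(abs((actual - predicted) / actual)) * 100
```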

Coding and language-specific 'gyaan' (wisdom)

Which language should I use?

Short answer. Both.

Long answer. Both R and Python are versatile and come with their own set of pros and cons.

[Image: Python vs R comparison]
  • I prefer a hybrid approach: all steps up to ADS preparation in Python, and modeling in R.
  • However, if you're going to set up a pipeline, it's better to stick to one language, preferably R.

Guidelines and pitfalls

Generic

  • Mention the author and a two-line description of the code at the top.
  • For the love of God, please comment your code: not only will you forget what you've written after a while, whoever has to maintain or reuse your code will suffer otherwise. Additionally, comment on why you did something, not what you did; the what is what your code is for.
  • Use proper naming conventions (names that explain the purpose, e.g. priceHouse). x, y, z, i, k are fine for loop counters, but anything else absolutely needs a proper name.
  • While naming variables, use a standard format, preferably camelCase.
  • While naming functions, use snake_case. Yes, that's the actual name.
  • Write modular (functional) code, using user-defined functions as much as possible. If you find yourself repeating even one line of code, create a function for it (a tiny example follows below).
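A tiny example of these conventions in practice; the function and column names here are made up:

```r
# snake_case for functions, camelCase for data objects.
# Why a function: raw feeds sometimes encode missing prices as 0, and this
# cleanup would otherwise be copy-pasted into every script.
clean_price_column <- function(df, col) {
  df[[col]][df[[col]] == 0] <- NA
  df
}

priceHouse <- data.frame(price = c(10, 0, 12))
priceHouse <- clean_price_column(priceHouse, "price")
```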

R Specific

  • When importing CSVs etc., use stringsAsFactors = FALSE, since R imports strings as factors (categorical variables) by default. Factors are a strict no-no in R; they're known to cause surprising errors. Read more here. Always use characters (strings) instead of the factor type.
  • If using categorical variables as model inputs, encode them manually as flags using the fastDummies package. lm()'s internal factor encoding is not dependable and fails often.
  • Continuing along the same lines, the only variables that go into your model should be numeric (categorical values encoded as flags).
  • 'dplyr' and 'plyr' should not be loaded in the same script, since they mask each other's functions. If you have to use a certain functionality from both packages, use namespacing to make sure the correct package's function is called. Namespaced calls are a good pattern in general, e.g. use dplyr::summarise() instead of just summarise().
  • Never use attach() to put datasets on the search path; there are lots of potential issues there.
  • R was created by statisticians and not coders; hence arrays start at 1. Relevant meme.
  • When working with dates, use the POSIXct format and set the time zone to UTC to ensure correct merges. Read the help page here.
  • If building multiple models from one dataset, use the 'broom' package combined with 'dplyr' to tidy up your model results and save them in a data frame (see the sketch after this list). Read more here.
  • Do not use column numbers/indexes to refer to columns. If any column shifts left or right in your input dataset, you'll be in a fix.
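Pulling a few of these tips together, here is a minimal sketch (hypothetical data and column names) of flag-encoding a categorical driver with fastDummies and tidying lm() output with broom:

```r
library(fastDummies)
library(broom)

# Hypothetical ADS with a categorical driver; keep strings as characters
ads <- data.frame(
  salesUnits = c(100, 120, 90, 150, 80, 110),
  execution  = c("High", "High", "Low", "High", "Low", "Low"),
  stringsAsFactors = FALSE
)

# Encode the categorical as an explicit flag instead of relying on lm()'s
# internal factor handling ('High' is dropped as the reference level)
ads <- dummy_cols(ads, select_columns = "execution",
                  remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# broom::tidy() turns the model summary into a plain data frame
results <- tidy(lm(salesUnits ~ execution_Low, data = ads))
```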

Python Specific

  • Because Python was written by computer scientists, it is very robust and cleanly written; not many surprises there.
  • However, just as an FYI: Scikit-learn — the most widely used package for modelling and machine learning in Python — lacks a few commonly used methods such as stepwise linear regression, so you might want to use R for those.
  • When checking for NULLs, NAs, etc., take a glance at the unique elements of the column (apart from using the isnull/isna functions). Sometimes NULLs and NAs are read in as the literal strings 'NULL'/'NA', in which case isnull/isna will not flag them. Be careful with this.

I've tried to cover as much breadth as possible and have linked the relevant resources for going deeper. If you need help with any of the topics mentioned above, feel free to reach out.

Please do let our team @ Decoding Data know your thoughts, questions or suggestions in the comment section below.
