Understanding Marketing Analytics in Python. [Part 9] Working with Categorical variables.
This is part 9 of the series on Marketing Analytics. For an introduction to the entire series, with details of each part, see here.
Exploring Associations in Survey Responses [ordinal variables] from the Marketing Dataset:
Categorical variables often hide interesting information in a data set, so it is crucial to learn methods for dealing with them.
In our example dataset, cust_df, the survey responses are ratings given by customers, which makes them ordinal categorical variables [read more about types of variables here].
It can be a bit tricky to assess associations among ordinal (ranked) variables.
For example, many marketing datasets include variables where customers provide ratings on a discrete scale, such as a 5- or 7-point rating scale.
In cust_df, the dataset we have been working on, there are two such variables: customers' responses on a 5-point scale for two satisfaction items, satisfaction with the retailer's service (sat_service) and with the retailer's product selection (sat_selection).
If we plot these two variables against each other, we get the following:
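A minimal sketch of how such a plot could be drawn with matplotlib. Since cust_df itself is not included in this post, the code builds a small synthetic stand-in called demo_df (a hypothetical name, filled with made-up ratings); only the column names sat_service and sat_selection come from the series. With the real data you would use cust_df in its place.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for cust_df: 500 made-up, positively related 1-5 ratings.
rng = np.random.default_rng(0)
latent = rng.normal(loc=3.0, scale=1.0, size=500)  # shared "overall satisfaction"

def to_rating(x):
    """Round to the nearest integer and clip to the 1-5 scale."""
    return np.clip(np.round(x), 1, 5).astype(int)

demo_df = pd.DataFrame({
    "sat_service": to_rating(latent + rng.normal(0, 0.7, size=500)),
    "sat_selection": to_rating(latent - 0.5 + rng.normal(0, 0.7, size=500)),
})

# Plain scatter plot: identical (x, y) pairs are drawn on top of each other.
fig, ax = plt.subplots()
ax.scatter(demo_df["sat_service"], demo_df["sat_selection"])
ax.set_xlabel("sat_service")
ax.set_ylabel("sat_selection")
```

In a notebook the figure is displayed automatically; in a script, add plt.show() at the end.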
The resulting plot is not very informative:
1. These two variables take only integer values between 1 and 5.
2. The points for customers who gave the same responses are drawn on top of each other.
3. All it shows is that customers reported most of the possible pairs of values, except that ratings rarely differed between the two items by 3 or more points (there were no pairs for (1, 4), (1, 5), (5, 1), or a few other combinations).
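One way to see what the overlapping points hide is to count how many customers gave each pair of ratings. The sketch below uses pd.crosstab() on a synthetic stand-in, demo_df (a hypothetical name with made-up ratings, not the series' data); with the real data the same call should work on cust_df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for cust_df: 500 made-up, positively related 1-5 ratings.
rng = np.random.default_rng(0)
latent = rng.normal(loc=3.0, scale=1.0, size=500)

def to_rating(x):
    """Round to the nearest integer and clip to the 1-5 scale."""
    return np.clip(np.round(x), 1, 5).astype(int)

demo_df = pd.DataFrame({
    "sat_service": to_rating(latent + rng.normal(0, 0.7, size=500)),
    "sat_selection": to_rating(latent - 0.5 + rng.normal(0, 0.7, size=500)),
})

# Rows are sat_service values, columns are sat_selection values,
# and each cell is the number of customers who gave that pair of ratings.
pair_counts = pd.crosstab(demo_df["sat_service"], demo_df["sat_selection"])
print(pair_counts)
```

Pairs that nobody reported show up as zeros, or are missing from the table entirely if a rating value was never used.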
Beyond these observations, it is difficult to assess the strength of the association, so we need to take further steps to improve the visualisation.
Jitter: Make Ordinal Plots More Informative
One way to make a plot of ordinal values more informative is to jitter each variable, adding a small amount of random noise to each response. This moves the points away from each other and reveals how many responses occur at each combination of (x, y) values.
We can use np.random.normal() to do this:
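A minimal sketch of the jitter step, under a few assumptions: the helper name jitter and the noise standard deviation of 0.1 are illustrative choices, not values taken from the series, and demo_df is a synthetic stand-in for cust_df with made-up ratings.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for cust_df: 500 made-up, positively related 1-5 ratings.
rng = np.random.default_rng(0)
latent = rng.normal(loc=3.0, scale=1.0, size=500)

def to_rating(x):
    """Round to the nearest integer and clip to the 1-5 scale."""
    return np.clip(np.round(x), 1, 5).astype(int)

demo_df = pd.DataFrame({
    "sat_service": to_rating(latent + rng.normal(0, 0.7, size=500)),
    "sat_selection": to_rating(latent - 0.5 + rng.normal(0, 0.7, size=500)),
})

def jitter(values, sd=0.1):
    """Add Normal(0, sd) noise to each value (sd=0.1 is an assumed default)."""
    return values + np.random.normal(loc=0.0, scale=sd, size=len(values))

np.random.seed(0)  # only so the jittered plot is reproducible

# Jitter both variables so that identical responses no longer overlap.
fig, ax = plt.subplots()
ax.scatter(jitter(demo_df["sat_service"]), jitter(demo_df["sat_selection"]), alpha=0.5)
ax.set_xlabel("sat_service (jittered)")
ax.set_ylabel("sat_selection (jittered)")
```

The noise is used only for plotting; the ratings stored in the data frame stay unchanged. A small sd keeps each cloud of points centred on its original integer pair, while a larger one would start to blur neighbouring ratings together.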
Result:
With jitter, it is easier to see that the ratings (3, 2) and (3, 3) were the most common responses. It is also clear that there is a positive relationship between the two satisfaction variables: people who are more satisfied with selection tend to be more satisfied with service.
This is a very simple example of handling categorical variables. There are many more ways to do so, depending on the type of data and the business use case.
For more details, refer to this DataCamp article.
Reference:
https://www.datacamp.com/tutorial/categorical-data
With this, we end our series on doing marketing analytics with various kinds of variables in the data. Please refer back to the first story (series key) for a full list of what we have discussed in all 9 stories.
Hope you enjoyed it :)
Check out more stories on further topics in marketing analytics; I shall keep posting!