Three diverse machine learning applications made understandable

Andreas Berentzen
Published in incentro
Jun 24, 2021 · 15 min read

Introduction

In this article, three powerful machine learning applications are featured and explained. All three hold relevance across a range of economic sectors and have proven their usefulness in a diverse collection of business models. For each application, the core concepts are explained, along with a brief overview of the technical aspects of a few common techniques through which it can take shape. In addition, case studies are used to make each application more tangible and to give an impression of its potential. This article is written to be as clear as possible; a background in machine learning or computer science is not required.

Demand forecasting

The first topic is demand forecasting. The goal of demand forecasting is to gain insight into the future demand for a certain product or service. This insight is almost always gained through some form of analysis of historical data that comes in the form of a time series. A time series is a sequence of measurements of some quantity taken at regular points in time, usually visualized as a graph. This can be anything from the electricity usage in Utrecht over the past 3 months to the number of shampoo bottles sold by L’Oréal from 1960 to 1999.

example of a time series graph

Time series graphs can be very useful for business strategists to identify patterns in sales behaviour and anticipate future sales figures. What is potentially even better is when a computer model can identify these same patterns and create a detailed prognosis for future demand and revenue, all the while incorporating much more detail than a human analyst is capable of.

Mathematical models

Luckily, multiple state-of-the-art techniques have been created to serve this exact purpose. This section delves into forecasting models that are mathematical in nature. The premise of these models is to deconstruct the time series into independent constituents that can be approached and mimicked mathematically. The models in this section are called autoregressive models and work with the assumption that a time series can be deconstructed into three constituents that can thereafter be used to create a forecast.

Technical explanation

The first constituent is a moving average, which is used to capture the broad trends in the course of the time series irrespective of local fluctuations. Secondly, there is the autoregressive part, which predicts the fluctuations around the earlier established moving average. These fluctuations are heavily influenced by patterns of similar fluctuations in the past, which makes this part of the model especially effective at predicting seemingly random fluctuations that recur consistently through the years. The third component is noise. The function of noise is to incorporate a certain degree of randomness into the course of the time series, under the assumption that there is also a degree of randomness and unpredictability in the real-world process that generates the data. This is done by capturing all the unpredictable fluctuations of the past in a probability distribution. That distribution is then used as an error term when new data points are predicted, so that the forecast upholds the same degree of randomness as the rest of the time series. A probability distribution for the error can be created with a maximum likelihood estimation or a parametric density estimation.
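
To make this decomposition more concrete, here is a loose, illustrative Python sketch. The numbers are synthetic and the coefficients are invented, not fitted from real data; the point is only to show a series built from a broad trend, autoregressive fluctuations and noise, and a simple moving average that recovers the trend afterwards.

```python
import numpy as np

rng = np.random.default_rng(42)
n_months = 120

trend = np.linspace(100, 180, n_months)        # broad upward movement in demand

# Autoregressive fluctuations: each deviation depends on the previous one.
fluctuation = np.zeros(n_months)
for t in range(1, n_months):
    fluctuation[t] = 0.8 * fluctuation[t - 1] + rng.normal(0, 3)

noise = rng.normal(0, 2, n_months)             # purely unpredictable part

demand = trend + fluctuation + noise

# A 12-month moving average smooths out local fluctuations and exposes the trend.
window = 12
moving_avg = np.convolve(demand, np.ones(window) / window, mode="valid")
print(moving_avg[:5].round(1))
```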

example of demand forecasting with an autoregressive model

Not all autoregressive models use these three constituents as a blueprint, but this explanation gives an impression of the functionality and applicability of autoregressive models. We’ll now delve further into some examples of autoregressive models and show their efficacy in demand forecasting.

ARIMA

ARIMA is one of the more sophisticated autoregressive models and stands for AutoRegressive Integrated Moving Average. It is a model with a good degree of adaptability and consists of the three constituents explained earlier. The mathematical core and parameters of this model are beyond the scope of this article, but in essence the model calculates the next unknown data point and can use different scopes of the historical data in the time series to do so. A great example of an ARIMA application is a case study on electricity usage in China. For a rapidly developing country like China, where a yearly growth in electricity usage of 34% is not unheard of, it is of utmost importance to get a grip on future electricity demand. With usage data from 2006 to 2010 and a variant of the ARIMA model, a prognosis was created with a MAPE of 3%. MAPE stands for mean absolute percentage error and denotes the average absolute difference between the forecast and the actual figures, expressed as a percentage of the actual values. Thanks to ARIMA models, Chinese researchers have been able to anticipate future electricity demand in different provinces with great accuracy.¹ An example on an even larger scale comes from the Indian region of Kanchipuram: with 1200 different rainfall datasets recorded from 1902 to 2002, researchers created a rainfall prognosis spanning 4 years with an ARIMA model. This forecast achieved a MAPE of 6.5%.² Processes in the natural world, as well as human-dictated processes, have proven to be forecastable.
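
As an illustration, the sketch below fits an ARIMA model with the Python library statsmodels on a synthetic monthly series and scores the forecast with MAPE. The series, the train/test split and the ARIMA order (1, 1, 1) are arbitrary choices for the example, not the setup used in the cited studies.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly usage data: a rising trend plus random noise.
rng = np.random.default_rng(0)
index = pd.date_range("2006-01-01", periods=60, freq="MS")
usage = pd.Series(100 + np.arange(60) * 2.0 + rng.normal(0, 5, 60), index=index)

# Hold out the last 12 months to evaluate the forecast.
train, test = usage[:-12], usage[-12:]

model = ARIMA(train, order=(1, 1, 1))
fitted = model.fit()
forecast = fitted.forecast(steps=12)

# MAPE: average absolute error as a percentage of the actual values.
mape = np.mean(np.abs((test.values - forecast.values) / test.values)) * 100
print(f"MAPE: {mape:.1f}%")
```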

Time series forecasting techniques such as ARIMA are especially useful for long time series with a somewhat seasonal pattern. Nevertheless, extreme external factors remain a big problem for autoregressive models, and for all predictive models for that matter. Think of natural disasters or economic crises. Even enterprises with the most sophisticated forecasting techniques will be caught off guard by sudden external events like these; some developments simply cannot be predicted based on the past. In addition, not all processes are suited for autoregressive models. If the course of a time series is extremely volatile and appears completely random, a different approach to forecasting could prove much more effective. Neural networks are also capable of making predictions on time series data; due to their complex and highly malleable nature they can capture deeper structures and patterns in the data that classical statistical models are simply incapable of. On the other hand, neural networks are so diverse in architecture that selecting the right network for a given process can prove very time consuming. Neural networks also operate as a “black box”, meaning they provide no transparency into how the model arrives at its prediction.³ Their forecasts have to be taken at face value, which makes them significantly less valuable than forecasts created by statistical models whose workings are far more transparent.

Furthermore, there is a principle in data science and epistemology which holds that models and solutions that are as simple as possible are a better way to approach reality. It is called “Occam’s razor” and is definitely relevant when it comes to model selection. There is an issue in artificial intelligence called overfitting that occurs when overly complicated models are used. The result is a model that captures the training data so perfectly that it no longer generalizes to novel data produced by the same process. This, among other reasons, is why neural networks are mostly reserved for problems that are too complicated for statistical models, or for when all other options are exhausted.

Data and conditions

For time series forecasting it is very important to have consistent and accurate data. Consistent in the sense that there are no missing values: if a few data points from 4 years ago are missing, it is not a catastrophe for the efficacy of the model, but continuous data is very valuable for time series forecasting. It is also crucial that the data itself is accurate; if, say, 10% of a company’s monthly sales are not included in the dataset, that can mean the difference between accurately forecasting a seasonal fluctuation and missing such a development completely. The final choice between a statistical model such as an autoregressive model and a more complex machine learning model depends on the quantity and reliability of the time series data, as well as on the motivation of the company that wants to acquire insights from its data.

Recommendation systems with collaborative filtering

Recommendation systems, or recommendation algorithms, are terms that have become all too familiar to most of us. From Netflix to webshops, almost all of them use some sort of recommendation system. In this section, one of the most widely used and effective forms of recommendation systems is explained in detail. It is called collaborative filtering and is used by companies such as Netflix, Amazon and YouTube. “Collaborative filtering” sounds complicated, but the fundamental premise is very intuitive: the core assumption is that if two people rate products similarly, a product that is liked by one of them will also appeal to the other.

Technical explanation

All ratings of users can be represented in a matrix in which every user has a column containing their ratings. This column is often called a vector, and in the context of collaborative filtering this vector is seen as the quantification of a user’s taste: a sort of taste profile. With a metric called “cosine similarity”, the similarity between vectors can be calculated. This is not the only way to measure similarity, and in different situations other similarity measures can prove more effective, but cosine similarity simply measures how similar the direction of two vectors is, which is extremely useful when we want to compare the ratings of different users. With this measure, we can generate ratings for products a user has not seen before. Such a rating is generated by taking a weighted average of the ratings that a number of other users with a high cosine similarity to the target user have given to that product. The weight is determined by the degree of similarity between the two users (more similar users contribute more to the rating than less similar users). In a recommendation system, the unrated product that receives the highest generated rating is recommended to the user. This is a classical implementation of a collaborative-filter-based recommendation system, in which similar users unconsciously recommend their favorite products to each other.
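
Below is a minimal sketch of this user-based approach. The rating matrix and the choice of which user and product to predict for are invented for the example; following the text, every user is a column, every product a row, and 0 marks a product that has not been rated yet.

```python
import numpy as np

# Rows are products, columns are users; values are ratings (0 = not rated yet).
ratings = np.array([
    [5, 4, 1, 0],   # product 0, rated by users 0..3
    [4, 5, 2, 4],   # product 1
    [0, 4, 1, 5],   # product 2
    [1, 1, 5, 1],   # product 3
], dtype=float)

def cosine_similarity(a, b):
    """Similarity in direction between two rating vectors (1 = identical taste)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

target_user, target_product = 0, 2   # estimate user 0's rating of product 2

user_vector = ratings[:, target_user]
others = [u for u in range(ratings.shape[1]) if u != target_user]

similarities = np.array([cosine_similarity(user_vector, ratings[:, u]) for u in others])
their_ratings = np.array([ratings[target_product, u] for u in others])

# Weighted average: more similar users contribute more to the estimated rating.
estimate = np.dot(similarities, their_ratings) / similarities.sum()
print(f"Estimated rating for user {target_user}, product {target_product}: {estimate:.2f}")
```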

There is another way of approaching a collaborative filter in which the products play the central role instead of the users. This technique rests on the same principles: every product gets a rating vector instead of every user, which can be achieved by using the rows instead of the columns of the exact same rating matrix. In a similar fashion, the cosine similarities are calculated, and the products with a high degree of similarity to products the user already likes are recommended. If a lot of people bought or liked two products, there is a good chance that these products go well together.
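
The item-based variant reuses the same matrix and the same similarity measure; only the vectors change from columns (users) to rows (products). A short sketch, again with invented numbers:

```python
import numpy as np

# Item-based variant: each row of the rating matrix is the product's rating vector.
ratings = np.array([
    [5, 4, 1, 0],
    [4, 5, 2, 4],
    [0, 4, 1, 5],
    [1, 1, 5, 1],
], dtype=float)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of every other product to product 0; the most similar products are
# candidates to recommend to someone who already likes product 0.
sims = {p: cosine_similarity(ratings[0], ratings[p]) for p in range(1, ratings.shape[0])}
best_match = max(sims, key=sims.get)
print(sims, "-> recommend product", best_match)
```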

infographic of the two types of collaborative filters

Applications

The choice between an item-based and a user-based collaborative filter depends entirely on the quantity and dimensionality of the available data. Whichever plays the central role, users or items, there must be enough data to facilitate comparisons between them. A great example of a successful implementation of a collaborative filter comes from Netflix: over the years it has developed a very advanced recommendation system that uses different types of data, such as watch duration, search history and demographics, to personalize recommendations even further, but collaborative filtering still plays a crucial role. This recommendation system is so effective that Netflix itself estimated in 2016 that it is responsible for 80% of watch time and saves the company 1 billion USD annually.⁴ E-commerce stores and online businesses can also benefit hugely from a recommendation system. Personalized suggestions catered to the individual customer are simply an effective way to hold attention for longer or to sell more products, and a collaborative filter is a great way to personalize these suggestions.⁵

Data and conditions

The most important data for a collaborative filter is sales data or rating data. The optimal situation is one in which many customers each have a large number of purchases or ratings, so that the similarities are more accurate and there are more users with high similarities. The effectiveness of collaborative filtering depends on every customer already having some data points on which to base a recommendation. This is where the first problem with collaborative filters comes in. It is called the “cold start problem” and occurs when a brand new user signs up and starts shopping: needless to say, there are no ratings yet on which to base recommendations for this user. There are more than enough elegant solutions for this situation, and other techniques can even be combined with a collaborative filter for a more sophisticated recommendation system. As with other machine learning techniques: the more and the denser the data, the more effective the model will be, and in this case, the better the recommendations.

Customer lifetime value estimation

This section on customer lifetime value estimation (CLVE) relates to businesses that sell products on a non-contractual basis. For subscription-based business models there are different techniques for CLVE and customer segmentation that are not discussed in this article.

Customer lifetime value estimation is a technique in which the value a customer will provide to a business over their entire lifespan is estimated and quantified. The goal is to separate the loyal customers from the one-off or ex-customers, so that this information can be used for strategic or marketing purposes. An important principle for customer lifetime value estimation is RFM, which stands for recency, frequency and monetary value. Recency simply points to the date of the last purchase: the more recent the interaction, the higher the recency score. Recency is not strictly limited to purchases; other forms of interaction with the company can also count (e.g. the last time a specific customer added something to their shopping cart). Frequency refers to the number of purchases or interactions that a customer has engaged in within a specific period of time. Customers that interact frequently with the company are likely to be more loyal than relatively inactive customers; one-off customers are usually segmented into their own category. Monetary value denotes the revenue contribution of a certain customer: simply the amount of money a customer has spent with the company within a specific period of time.

The three RFM variables make it possible to review and rank customers in terms of their profit potential. The most basic version of CLVE is the RFM calculation: simply a formula that uses the three variables to return a score denoting the future potential of a customer. A bit more sophisticated is the approach where a customer receives a score for each of the RFM variables, and the three scores are combined into a final score through a weighted average. These scores can be assigned through a technique called customer segmentation: the partitioning of customers into groups based on their characteristics (in this case their RFM values). There are numerous unsupervised machine learning techniques, such as k-means clustering, that do exactly this. Once the different segments have been created by the algorithm based on the RFM values, each cluster is given a score that represents its potential for the future. The resulting scores and their associated customer groups can be very useful for the marketing and strategy departments of a company, giving insight into different customer behavior and enabling more personalized marketing.
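
A minimal sketch of this segmentation step, using pandas and scikit-learn on an invented purchase history; the column names, the snapshot date and the cluster count are arbitrary choices for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented purchase history: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "date": pd.to_datetime([
        "2021-01-05", "2021-03-20", "2021-02-14", "2021-04-01",
        "2021-05-30", "2020-07-11", "2021-05-02", "2021-05-25",
    ]),
    "amount": [50, 80, 20, 35, 40, 200, 15, 25],
})

snapshot = purchases["date"].max() + pd.Timedelta(days=1)

# Recency (days since last purchase), frequency (number of purchases),
# monetary value (total spend) per customer.
rfm = purchases.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)

# Cluster customers on their standardized RFM values; each cluster can then be
# given a score that reflects its future potential.
scaled = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(rfm)
```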

Of course, there is more than these relatively simple RFM-based techniques. There are plenty of models that use purchase data and RFM scores to create a prognosis. In the next subsection, we delve further into specific model families that have been fine-tuned for customer lifetime value estimation. These approaches are accompanied by examples and case studies to make their functionality and applicability clearer.

BTYD models

A lot of mathematical models that estimate customer lifetime value fall within the category of BTYD models (Buy ’Till You Die). These models consist of two constituents.

  1. The churn probability: this models the chance that a customer has permanently left the business and won’t buy a product again in any future period.
  2. An approximation of the purchase process for a customer, in the form of a probability distribution. This process estimates how much and how frequently a customer will buy as long as they are “alive” according to the churn probability.

The first constituent can be seen as a coin flip: once the coin flip is successful and the customer is assumed to be “alive”, the stochastic probability distribution is activated to calculate the number of purchases or the net cash flow from the customer to the company for a given period of time. These probabilities are completely dependent on the customer’s purchasing history. In summary, BTYD models forecast the number of products a specific customer will buy given their purchasing history.
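
As a toy illustration of these two constituents (the alive probability and purchase rate below are invented, not estimated from data), a single period for one customer could be simulated like this:

```python
import numpy as np

rng = np.random.default_rng(1)

p_alive = 0.8          # the "coin flip": churn probability is 1 - p_alive
purchase_rate = 2.5    # expected purchases per period while alive

def simulate_next_period():
    if rng.random() > p_alive:         # customer has permanently churned
        return 0
    return rng.poisson(purchase_rate)  # purchases drawn from the purchase process

print([simulate_next_period() for _ in range(10)])
```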

An example of a BTYD model is Pareto/NBD. This is a relatively old technique that was first developed in the 1980s and has been used frequently, with mostly positive results.⁶ It is a mathematical model that, just like other BTYD models, forecasts future purchasing behavior for every individual customer. The model assumes a negative relationship between the time passed since the last purchase and the number of products a customer will buy in the future, and it has proven especially effective in customer populations with a long and relatively active history. A beautiful application of different models in the BTYD family is based on two datasets from supermarket chains spanning 146 weeks (2001–2003); these datasets were used for a sales prognosis of roughly 90 weeks afterward.⁷ The results are visible below.

Repeat purchases as predicted by different BTYD models
Weekly repeat purchases predicted by different BTYD models

An advantage of BTYD models is that every forecasted sale is bound to a specific customer: the model predicts per customer, and the graphs show the cumulative forecasted sales of all these customers. These types of models are so popular that the well-known programming language Python has a library called lifetimes that is designed to make customer lifetime value estimation and customer segmentation simple and straightforward.
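
As a hedged sketch of what working with the lifetimes library can look like, the snippet below fits a BG/NBD model (a close relative of Pareto/NBD within the BTYD family) on the sample dataset that ships with the library; the penalizer value and the 26-week horizon are arbitrary choices for the example.

```python
from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

# Bundled sample data: frequency, recency and customer age (T) per customer.
summary = load_cdnow_summary(index_col=[0])

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

# Expected number of repeat purchases per customer over the next 26 weeks,
# given each customer's purchasing history.
summary["predicted_26w"] = bgf.conditional_expected_number_of_purchases_up_to_time(
    26, summary["frequency"], summary["recency"], summary["T"]
)
print(summary.sort_values("predicted_26w", ascending=False).head())
```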

There are a plethora of ways to approach customer lifetime value estimation, from a simple formula to complex machine learning models that we unfortunately don’t have the space to cover in this article. How complex a company’s approach should be also depends on how important the results are for the company’s strategy. A good rule of thumb: as long as there is a rich recorded history of purchase data, there should be a technique for customer lifetime value estimation that is in line with the needs of the organization.

Data and conditions

The most important data for customer lifetime value estimation is quite intuitive. All models work with recency, frequency and monetary value data, and all of these variables can be derived from a purchase history dataset that records every purchase with its respective customer and date. This data fuels every model related to this technique, from the simplest RFM formula to sophisticated machine learning methods. The more comprehensive the data and the more active the customers, the more useful and reliable the results will be.

A general message

There are still some important factors to consider when implementing machine learning or statistical models for commercial purposes. First, there are model parameters. Practically all machine learning models have parameters that dictate the exact functionality of the model, and changes to these parameters result in changes in the model’s output. Unfortunately, there is no room to explain all relevant parameters of the aforementioned models, as that would make this article significantly less approachable. Tuning model parameters to the optimal values for a task often requires someone with a background in data science or experience with the specific model, since there are plenty of ways to configure parameters improperly and end up with useless forecasts.

Secondly, all forecasting techniques in this article forecast step by step. This means that unknown data points are generated one by one: data points forecasted for the distant future also factor in forecasts of the nearby future, which creates a forecast based on a forecast. The further into the future we try to forecast, the less reliable the result becomes. This is why companies that use forecasting techniques extensively often create monthly or weekly forecasts that utilize the most recent data available, to ensure all forecasts stay as close to reality as possible.
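
A tiny sketch of this step-by-step mechanic (the coefficient and intercept are made up, as if they came from a fitted model): each forecast is appended to the history and immediately used to produce the next one, which is why errors compound the further ahead we look.

```python
import numpy as np

history = [102.0, 105.0, 103.0, 108.0, 110.0]
coefficient, intercept = 0.9, 12.0   # pretend these came from a fitted model

forecasts = []
for step in range(6):
    next_value = intercept + coefficient * history[-1]  # forecast built on the last value
    forecasts.append(next_value)
    history.append(next_value)       # this forecast now feeds the next forecast

print(np.round(forecasts, 1))
```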

Closing words

Economic sectors are becoming increasingly efficient and competitive. As interest rates are low and large capital injections play a strong role in stimulating the world economy, supply chains are continuously being optimized and companies are looking for increasingly creative ways to achieve a sustained competitive advantage in their market. In addition to this, more and more economic activity takes place online and there are absolutely no signs that this is slowing down. The combination of these developments creates a suitable environment for statistical and machine learning models to provide such a sustained competitive advantage for the companies that have the capabilities in their organization to leverage the power of these models effectively. For instance, by creating insights into customer behavior or by applying these insights automatically to influence customer behavior. There are countless examples of lucrative machine learning applications that create a sustained competitive advantage for the companies that have the know-how to utilize them well. Is your organization ready to use its data as a catalyst for growth?

Originally published at http://docs.google.com.
