Deep Learning for business problems

Cyrille Brun
7 min readDec 15, 2017


Figure 1 — Mobile usage signature for 30 days (data, SMS, voice) for 36 subscribers (12x3)

Deep learning techniques have proven to be a breakthrough in many machine learning fields such as image classification, natural language processing and reinforcement learning. However, we are still waiting to see the benefits of these techniques for more business-oriented use cases such as churn classification, behavior segmentation or forecasting. Those problems currently rely on more traditional supervised learning approaches with manual feature engineering, and we would like to show that deep learning exhibits some advantages over those techniques:

  • Considering the raw transactional data directly as input to the model, instead of manually aggregating transactions (with averages, sums or maxima). This reduces the manual work and the human bias involved in selecting the variables to aggregate and the aggregation method. The deep learning model automatically creates its own features with regard to the quantity it is trying to predict.
  • Considering additional fields, such as mobile plan or device, through embeddings of categories. Traditional approaches would feed a model flags for only the most common plans, while deep learning can consider all the plans together and derive meaningful representations with regard to the quantity it is trying to predict.
  • Discovering customer or event signatures by interpreting the network. By looking at the intermediate layers of the deep learning model, a business analyst can understand the different dimensions driving churn. Although deep learning models can be complex, it is critical to understand why a prediction is made.
  • Flexibility to branch in new data sources. Given the architecture of the network, adding a branch for a new data type is effortless, hence one network can take pictures, transactions and profile data as inputs without the need for pre-processing.

The purpose of this article is to present the use of convolutional networks and embeddings with business data, including transactions and categorical characteristics of customers, to predict prepaid churn. Convolutional networks have proven very useful for handling multi-dimensional temporal data, while embeddings make it possible to represent, meaningfully and automatically, categorical attributes or sequences of attributes that were not considered in traditional predictive models.
In this post, we attach a lot of importance to visualizing the results of our network via the concept of the “behavior as a picture” and the results of our embeddings.

1. The data

We are considering telco churn for a prepaid market. For simplicity's sake, we define churn as no usage activity for one month, and we consider the 30 days of usage prior to the start of the churn assessment window (with a short buffer window in between, called the marketing and data delay window). The churn flag is 1 if the subscriber has churned and 0 otherwise.
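As a minimal sketch, the labeling above can be expressed as follows. The synthetic daily counts and the 7-day length of the delay window are assumptions for the example, not the original setup:

```python
import numpy as np

# Hypothetical daily activity counts: usage[s, d] is the total number of
# events (calls, SMS, data sessions) of subscriber s on day d, covering the
# 30 feature days, a delay buffer and a 30-day churn assessment window.
rng = np.random.default_rng(1)
usage = rng.poisson(1.0, size=(200, 67))  # 30 + 7 + 30 days

FEATURE_DAYS = 30
DELAY = 7  # length of the marketing/data delay window (assumed)

features = usage[:, :FEATURE_DAYS]                 # input to the model
assessment = usage[:, FEATURE_DAYS + DELAY:]       # churn assessment window
churn = (assessment.sum(axis=1) == 0).astype(int)  # 1 = no activity at all
```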

The usage data contains 12 variables with information such as the number of outgoing calls, the volume of downloaded data in MB, the number of data connections or the number of SMS. All this information is available for each of the 30 days and normalized per column, hence each subscriber has a 12x30 picture representing their usage behavior. 36 such pictures/subscribers are shown in the first image (figure 1).
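A sketch of the per-column normalization, with randomly generated usage standing in for the real data:

```python
import numpy as np

# Hypothetical raw usage tensor: (subscribers, 30 days, 12 variables).
rng = np.random.default_rng(0)
usage = rng.poisson(3.0, size=(100, 30, 12)).astype("float64")

# Normalize each of the 12 variables (columns) across all subscribers
# and days, so that the variables are on a comparable scale.
mu = usage.mean(axis=(0, 1), keepdims=True)
sd = usage.std(axis=(0, 1), keepdims=True)
pictures = (usage - mu) / (sd + 1e-9)  # one usage "picture" per subscriber
```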

In addition, we include the latest plan of each subscriber. There are 55 distinct plans, ranging from data-only plans, voice plans and social media plans to different package amounts.

The data is comprised of 13,818 subscribers.

2. Convolutional Network for usage

The input to the network is the 12x30 picture representing the usage behavior over the last 30 days.

Figure 2 — Architecture Convolutional Network

We chose a simple architecture with two convolutional layers, each followed by a dropout layer, one dense layer and the prediction. We use ReLU activation functions for all the layers except the last one, where a sigmoid is needed.

  • The first convolutional layer applies 7x1 filters on the rows for each of the variables. It can be seen as a weekly temporal filter, and we do not group any columns together as the order of the variables does not hold any meaning.
  • The second convolutional layer applies 1x12 filters, hence taking all the columns together (again, the order of the variables does not hold any meaning).
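Assuming a Keras implementation, the architecture described above can be sketched as follows. The filter counts, dropout rate and dense-layer size are illustrative choices, not the original hyper-parameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input: one usage "picture" per subscriber, laid out as 30 days x 12 variables.
inputs = keras.Input(shape=(30, 12, 1))
# 7x1 filters: a weekly temporal filter applied per variable, without
# mixing columns (the order of the variables holds no meaning).
x = layers.Conv2D(16, kernel_size=(7, 1), activation="relu")(inputs)
x = layers.Dropout(0.25)(x)
# 1x12 filters: take all 12 variables together at each remaining time step.
x = layers.Conv2D(32, kernel_size=(1, 12), activation="relu")(x)
x = layers.Dropout(0.25)(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # churn probability
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```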

Without much fine-tuning, we reach an AUC of 0.87 on unseen data. As a comparison, a fine-tuned random forest fitted with all the variables simply aggregated (mean, max, sum) does not go above 0.82.
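For reference, the aggregated-features baseline can be sketched like this. The data is synthetic, so the AUC here is meaningless; the point is the 36 hand-made features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
usage = rng.poisson(3.0, size=(500, 30, 12)).astype("float64")
churn = rng.integers(0, 2, size=500)  # synthetic labels, illustration only

# Manual aggregation over the 30 days: mean, max and sum of each of the
# 12 variables, i.e. 36 hand-engineered features per subscriber.
features = np.concatenate(
    [usage.mean(axis=1), usage.max(axis=1), usage.sum(axis=1)], axis=1
)

X_tr, X_te, y_tr, y_te = train_test_split(features, churn, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```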

What the network sees

Following (1), we look at the input images that maximize the activations of the intermediate layers, in particular the convolutional layers.

Figure 3 — Type of inputs maximizing the activation of the convolution filters (top is 1st conv. layer below is the 2nd layer)

We clearly observe that each of the filters codes for a different time period, a different variable or a combination of variables. For example, the two top-right pictures show that this filter represents the second-to-last week, with some differences in the treatment of the last week and of the 15 days in the center. The second bottom picture codes for data connections, especially during weekends, and for download volume and connection duration (columns 6, 9 and 11). As expected, the second layer exhibits more complex patterns than the first one, as the network builds higher and higher level features with regard to churn.
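The visualization technique of (1) is gradient ascent in input space: start from a random picture and push it to maximize the mean activation of one filter. A minimal sketch with an untrained toy model standing in for the trained network:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-in for the trained network; only the first conv layer is needed.
inputs = keras.Input(shape=(30, 12, 1))
conv1 = layers.Conv2D(16, (7, 1), activation="relu", name="conv1")(inputs)
feature_model = keras.Model(inputs, conv1)

# Gradient ascent: modify the input image to increase filter 0's activation.
img = tf.Variable(tf.random.uniform((1, 30, 12, 1)))
for _ in range(20):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(feature_model(img)[..., 0])
    grad = tape.gradient(loss, img)
    img.assign_add(0.1 * grad / (tf.norm(grad) + 1e-8))  # normalized step
```

The resulting `img` is the kind of input the filter responds to most, which is what figure 3 displays for the trained model.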

3. Plan Embedding

To enhance our model and our understanding of the churn dynamics, we would like to add the subscriber's plan to the model. To first understand what the embedding does to the data, we build a small network that predicts churn from the plan alone, through a single embedding layer. The model creates a one-hot encoding of size 56 (the number of distinct plan ids) and projects that representation into a 2-dimensional vector.
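Assuming Keras again, this small plan-only network could look like the following sketch, including the frozen final dense layer (weights 1, bias 0):

```python
from tensorflow import keras
from tensorflow.keras import layers

N_PLANS = 56  # number of distinct plan ids

plan_in = keras.Input(shape=(1,), dtype="int32")
# Project each plan id into a 2-dimensional vector.
emb = layers.Embedding(input_dim=N_PLANS, output_dim=2, name="plan_embedding")
x = layers.Flatten()(emb(plan_in))
# Final dense layer frozen with weights 1 and bias 0, so the sigmoid
# output is driven entirely by the learned embedding.
head = layers.Dense(1, activation="sigmoid", trainable=False,
                    kernel_initializer="ones", bias_initializer="zeros")
model = keras.Model(plan_in, head(x))
model.compile(optimizer="adam", loss="binary_crossentropy")

# After training, the 2D coordinates of each plan can be read off directly:
plan_coords = model.get_layer("plan_embedding").get_weights()[0]  # (56, 2)
```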

Figure 4 — Architecture Embedding network

To focus on the embedding effect, we freeze the final dense layer with weights of 1 and a bias of 0. We inspect two outputs of the model: the final average probability for each plan and the 2D projections of the embedding layer. We observe that:

  • The 2D embedding projection is actually composed of only one dimension, which is the average propensity per plan (left figure).
  • The network fits the predicted propensity per plan exactly to the actual average propensity per plan (right figure).
  • The plans with no churn at all (for example 55, 25, 53, …) are projected clearly separated from the plans that contain some churn cases.
Figure 5 — Output of the embedding network

With even a very simple network, we managed to transform a categorical variable that could not be used directly within a model (50 categories is slightly too many, given our number of observations, for a tree-based algorithm) into two new variables (embedding dimension 1 and embedding dimension 2). These new variables can be used by the rest of the network, and the representation of the categories can be adjusted jointly with the convolutional network, for example.

4. Multi-branch model

Now that we have inspected the effect of the embedding, we can combine both inputs (usage and plans) into one network. We use the same architecture as explained above for each branch, merge the two branches and add one dense layer at the end.

Figure 6 — Architecture multi-branch network

Combining the usage and the plan with this architecture leads to an AUC of 0.88 on unseen data, which is higher than the single convolutional network with usage data only. Note that it is very straightforward to combine these two branches through the concatenate layer, and this can easily be extended to more branches, i.e. more different input data.
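A sketch of the two-branch network, joining the convolutional branch and the embedding branch through a concatenate layer (layer sizes are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Branch 1: convolutional network over the 30-day x 12-variable usage picture.
usage_in = keras.Input(shape=(30, 12, 1), name="usage")
x = layers.Conv2D(16, (7, 1), activation="relu")(usage_in)
x = layers.Dropout(0.25)(x)
x = layers.Conv2D(32, (1, 12), activation="relu")(x)
x = layers.Dropout(0.25)(x)
x = layers.Flatten()(x)

# Branch 2: 2D embedding of the subscriber's plan id.
plan_in = keras.Input(shape=(1,), dtype="int32", name="plan")
e = layers.Flatten()(layers.Embedding(56, 2)(plan_in))

# Merge the branches and add one dense layer before the prediction.
merged = layers.concatenate([x, e])
merged = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(merged)
model = keras.Model([usage_in, plan_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```

Adding a third branch (a device embedding, say) would only mean one more input and one more entry in the concatenate call.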

Conclusion

In conclusion, we showed that we can use deep learning methods for typical business problems and achieve higher performance than traditional methods. We could also directly use features that would not have been available without embeddings, and this generalizes to any other “non-univariate numerical” features such as pictures, text, etc. As understanding the outputs and decisions taken by the model is critical, we put emphasis on how to interpret the results of the embeddings and of the model via visualization. For example, the network reconstructs some temporal patterns and variable interactions, which can be seen in Figure 3.

As networks and architectures become more and more complex, we believe that representing and understanding what each layer codes for will be central to the acceptance of this type of model in business use cases.
