Talking Data User Demographics

Kaggle Competition

Tataji Yerubandhi
Published in Analytics Vidhya · Apr 18, 2020

ML/AI Role in Digital Marketing

Business Problem

Source: Competition Link

Nothing is more comforting than being greeted by your favorite drink just as you walk through the door of the corner cafe. While a thoughtful barista knows you take a cappuccino every Wednesday morning at 8:15, it’s much more difficult in a digital space for your preferred brands to personalize your experience.

Talking Data, China’s largest third-party mobile data platform, understands that everyday choices and behaviors paint a picture of who we are and what we value. Currently, Talking Data is seeking to leverage behavioral data from more than 70% of the 500 million mobile devices active daily in China to help its clients better understand and interact with their audiences.

In this Kaggle competition, participants are challenged to build a model that predicts users' demographic characteristics from their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts that are relevant to their users and catered to their preferences.

Real-world/Business objectives and constraints

1. No low-latency requirement.

2. Probability of a data-point belonging to each class is needed.

Existing Approach to the Problem:

https://github.com/Gautam-v-ml/TalkingData-Mobile-User-Demographics

This repository covers EDA and data cleaning.

The author's best score was a log loss of 2.38, achieved with basic ML models like Logistic Regression.

Improvement to the Existing Approach:

I improved on this by ensembling different DL and ML models.

Now let me walk through my approach.

Data Overview

The data can be downloaded from the competition's Data page and includes:

1. gender_age_train.csv, gender_age_test.csv — These files contain the details of devices to train and test the model respectively.

2. events.csv, app_events.csv — When a user uses the TalkingData SDK, an event is logged along with its timestamp. Each event has an event id and a location (lat/long), and corresponds to a list of apps in app_events.

3. app_labels.csv — Maps each app used by a user to its corresponding label_id's.

4. label_categories.csv — Consists of the app label ids present in app_labels.csv and their categories. For example, label id 4 falls under the category game-Art Style.

5. phone_brand_device_model.csv — Contains the device id of each user's device along with its phone brand and device model.

Performance Metric

Multi-Class Log-Loss: Each device has been labeled with one true class. For each device, we have to predict a set of predicted probabilities (one for each class). The formula is

logloss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)

where N is the number of devices in the test set, M is the number of class labels, log is the natural logarithm, y_ij is 1 if device i belongs to class j and 0 otherwise, and p_ij is the predicted probability that device i belongs to class j.

Exploratory Data Analysis :

Importing Required Libraries

Start of Analysis on each csv we have as part of data :

gender_age_train Data

So, gender_age_train contains 4 columns:

· device_id: The User’s device id registered with Talking Data

· gender: The user’s gender denoted by M or F for Male or Female respectively

· age: The user’s age

· group: This is the Target class for our problem and contains the classes which we need to predict. The first letter denotes the gender of the user and it is followed by the age group to which the user belongs

The plot below shows the count for each group.

From the plot above we can infer that there are more male users than female users. This is also evident from the pie chart, where roughly 65% of users are male and 35% female.

gender_age_test Data

This is the test data for which we need to predict the group (target variable).

The test data has only one column: device_id.

phone_brand_device_model Data

We have device_id, phone_brand, and device_model here; note that some phone_brand and device_model values are in Chinese.

First we will check for duplicates.

There are a few duplicates, and they need to be removed.
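As a sketch, duplicates can be counted and dropped with pandas; the data below is a hypothetical miniature of phone_brand_device_model.csv, not the real file:

```python
import pandas as pd

# Hypothetical miniature of phone_brand_device_model.csv;
# only the column names follow the competition data.
phone = pd.DataFrame({
    "device_id": [1, 1, 2, 3, 3],
    "phone_brand": ["Xiaomi", "Xiaomi", "Huawei", "samsung", "samsung"],
    "device_model": ["MI 2", "MI 2", "Mate 7", "Galaxy S4", "Galaxy S4"],
})

# Count rows that are exact duplicates of an earlier row, then drop them.
n_dupes = int(phone.duplicated().sum())
phone = phone.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(phone))  # 2 duplicate rows removed, 3 unique rows remain
```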

Events Data

First, let's check whether a device id can have multiple event id's.

For that, let's look at the events data for the device with id 29182687948017175.

events[events['device_id']==29182687948017175].head(10)

From the above we can see that a device id can have multiple events.

Since every event has a timestamp, we can look at the overall start and end times to see the period over which events were recorded.

The events data available to us covers a period of 8 days, from midnight on 30th April 2016 to midnight on 8th May 2016.

Now, the most important thing to check: do all devices have events?

The plot below summarizes the event information.

Almost 69% of the devices have no events.
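The no-events percentage can be computed by checking which training device ids appear in the events table. A sketch on toy data (the real files are far larger):

```python
import pandas as pd

# Toy stand-ins for gender_age_train and events; ids are made up.
train = pd.DataFrame({"device_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
events = pd.DataFrame({"device_id": [1, 4, 4, 7],
                       "event_id": [10, 11, 12, 13]})

# A device "has events" if its id appears at least once in events.
has_events = train["device_id"].isin(events["device_id"])
pct_no_events = 100 * (~has_events).mean()
print(f"{pct_no_events:.0f}% of devices have no events")  # 70% on this toy data
```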

Next, let's see how the data is spread across the globe.

The tables above show that the majority of the events happen around (0, 0), which is located in the middle of the Atlantic Ocean. It is safe to assume that these location logs come from users not wanting to share their position, and they are therefore useless. The majority of the other coordinates are located in China; only a few pin users in other parts of the world. Although a third of the data bears no location information, I still think it is worth using the position information, as the two figures show a certain difference in distribution between females and males.

app_labels Data

Now let's find out how many unique app labels we have.

We have 507 unique app labels, and a particular app id can have multiple label id's associated with it. Let us consider the app with id 7324884708820027918, which has multiple labels.

app_labels[app_labels['app_id']==7324884708820027918]

app_events Data

We can find the number of unique apps and events with the code below:

Analysis of the is_active column

Most of the apps were inactive during an ongoing event: almost 61% of the app entries were inactive and 39% were active.
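The split can be read off with a normalized value_counts; a sketch on a hypothetical app_events fragment:

```python
import pandas as pd

# Hypothetical app_events fragment; is_active is 1 when the app
# was actively in use during the logged event.
app_events = pd.DataFrame({
    "event_id": [1, 1, 1, 2, 2],
    "app_id": [11, 12, 13, 11, 14],
    "is_active": [0, 1, 0, 0, 1],
})

# Proportion of inactive (0) vs active (1) rows.
active_share = app_events["is_active"].value_counts(normalize=True)
print(active_share)  # 60% inactive, 40% active on this toy fragment
```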

Conclusions on Exploratory Data Analysis

1. Only 31% of both the train and test devices have events and app-related features.
2. We need to use the phone brand and device model data for devices without events.
3. For devices that contain event information, we can use event-related features along with the phone brand and model features.

Data Preparation

Below are the steps that I followed for preparing the data used in my Models:

1. Our data has two types of devices: those with event details and those without. I separated them and created one dataset for devices with events and one for devices without events.

2. For devices which don’t have events data, I have used only the phone brand and device model as features.

3. For devices which have events data, I used phone brand and device model along with event-based features: median latitude, median longitude, the hour at which events occurred, the day of the week on which they occurred, the is_active flag from app_events, the list of all apps used/installed on the device, and the list of all app labels, grouped by device id.

For more details and the data preparation code, please check my repository: My Github Repository

Now it's time to use different models to get better predictions!

Below are the different approaches I followed:

1. Since phone brand and device model details are present for all the devices in our data, I used these two features to train a Logistic Regression and two different Neural Network models. I used these models to predict the class probabilities for the devices in test data which do not contain events data that we separated in our earlier step.

2. For devices which contain event details, I used the event-related features we extracted for these devices and trained two different Neural Network models using only the devices which contain event details. I then used these models to predict the class probabilities for the devices in test data which contain events data.

3. Finally I concatenated the test data predictions for devices with events, devices without events and created the whole test data prediction file.

Feature Engineering with two different data sets

Since we have two separate data for devices with events and devices without events, I have prepared the features separately for these data.

1. Devices Without Events: create one-hot encoding of phone brand, device model features.

2. Devices With Events: create one-hot encoding of phone brand, device model, apps belonging to a device, app labels belonging to a device, TFIDF encoding of hours feature, day of the week, apps is_active, standardized latitude and longitude features. Let us call this Event Feature Matrix.
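One way to build such sparse one-hot blocks and stack them into a single feature matrix is sketched below; the devices are made up, and the real code would also append the TFIDF and standardized lat/long columns to the same hstack:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature data: one row per device.
brands = np.array(["Xiaomi", "Huawei", "Xiaomi", "samsung"])
models = np.array(["MI 2", "Mate 7", "MI 4", "Galaxy S4"])

def one_hot(values):
    """Encode a categorical column as a sparse one-hot matrix."""
    codes = LabelEncoder().fit_transform(values)
    n = len(values)
    # One entry of 1.0 per row, placed in the column for that category.
    return csr_matrix((np.ones(n), (np.arange(n), codes)),
                      shape=(n, codes.max() + 1))

# Horizontally stack the per-feature blocks into one sparse matrix.
X = hstack([one_hot(brands), one_hot(models)], format="csr")
print(X.shape)  # (4 devices, 3 brand columns + 4 model columns) = (4, 7)
```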

Modeling

As mentioned in my approach, in this section I will walk you through the Models that were used in my solution. There are two parts to this namely, modeling on devices data which don’t have event details and modeling on devices data which contain event details. Let us start with modeling of devices without event details first.

Devices without Event Details

Before going into modeling I will create Train, Validation and Test Data for my Model.

X_train_one_hot is the one-hot encoding of the phone brand and device model features of the train data for all devices (with and without events). X_test_no_events_one_hot is the same encoding for only the test devices that do not contain event details. The reason, as mentioned in my approach, is that the phone brand and device model features are available for all devices. So I trained the models on these features for all devices in the train data and used them to predict for the test devices that don't have event details.

Logistic Regression

The Logistic Regression model is trained on train_1; cv_1 is the validation data and test_1 is the test data on which we predict the class probabilities.

I did hyperparameter tuning and found the best C to be 0.1 (lowest log loss).

So we use the best C value of 0.1.

Using only the phone brand and device model one-hot encodings as features, the Logistic Regression model achieves a CV log loss of 2.38.
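The C search can be sketched as follows; synthetic data stands in here for the one-hot features (the real problem has 12 classes, and on the real data the best C came out as 0.1):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-hot brand/model features.
X, y = make_classification(n_samples=600, n_features=40, n_informative=20,
                           n_classes=4, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, random_state=0)

# Grid-search C by validation log loss, keeping the best value.
best_c, best_loss = None, np.inf
for C in [0.001, 0.01, 0.1, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    loss = log_loss(y_cv, clf.predict_proba(X_cv))
    if loss < best_loss:
        best_c, best_loss = C, loss
print(best_c, round(best_loss, 3))
```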

Now we move on to Neural Networks.

Neural Network 1

The Neural Network architecture was taken from the competition discussion page.

input_shape here is the number of features in X_train_one_hot.

I trained the Neural Network 1 model 5 times using different random Train/CV splits of the X_train_one_hot data, as shown below.

model_list_1 contains 5 models trained on different versions of X_train_one_hot data split using different random seeds. I used each model in model_list_1 and made a prediction of probabilities on test data (test_1) and took the average of all the predictions.
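The train-several-times-and-average pattern looks like this in outline. Purely for illustration I use scikit-learn's MLPClassifier on synthetic data; the actual Neural Network 1 is a Keras model trained on X_train_one_hot:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; shapes and class count are arbitrary here.
X, y = make_classification(n_samples=400, n_features=30, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model_list = []
for seed in range(5):
    # A fresh random Train/CV split per seed; the CV part would be used
    # for monitoring/early stopping in the real Keras setup.
    X_tr, X_cv, y_tr, y_cv = train_test_split(X_train, y_train,
                                              random_state=seed)
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                          random_state=seed).fit(X_tr, y_tr)
    model_list.append(model)

# Average the predicted class probabilities across the 5 models.
avg_pred = np.mean([m.predict_proba(X_test) for m in model_list], axis=0)
print(avg_pred.shape)  # (n_test_devices, n_classes)
```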

Below are the TensorBoard scalars for the above 5 models.

Neural Network 2

The Neural Network architecture was taken from the competition discussion page.

Here, input_dim is the number of features in X_train_one_hot.

I trained the Neural Network 2 on train_1 once for 30 epochs and used this model to make prediction on test data (test_1).

Please refer to My Github Repository for the TensorBoard scalars.

Devices with Event Details

Let’s create Train, Validation and Test Data for my Model just like we did before for devices without event data.

Here X_train_events_one_hot_1 is the Event Feature Matrix for train data which we created in our feature preparation step. X_test_events_one_hot_1 is a similar Event Feature Matrix for test data.

Neural Network 3

The Neural Network architecture was taken from the competition discussion page.

The dropout in the input layer adds value here: it gives variability to the predictions because, with dropout applied to the input layer, each run takes only a random subset of the features as input to the model.

Here, input_dim is the number of features in X_train_events_one_hot_1.

I trained the Neural Network 3 model 20 times.

Then I used each of the 20 models in model_list_2 to predict on test data (test_2) and took the average of prediction probabilities.
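A minimal Keras sketch of the input-dropout idea (not the exact competition architecture; input_dim, the dropout rates, and the layer sizes here are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 1000  # placeholder for the width of the Event Feature Matrix

# Dropout directly on the input means each training run sees a different
# random subset of features, which makes repeated runs diverse enough
# that averaging their predictions helps.
model = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dropout(0.4),                     # dropout applied to the raw input
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(12, activation="softmax"),  # 12 demographic groups
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
print(model.output_shape)  # (None, 12)
```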

Please refer to My Github Repository for the TensorBoard scalars.

Neural Network 4

This Neural Network is a variation of Neural Network 3, but with 2 dense layers and a different number of hidden units.

Here, input_dim is the number of features in X_train_events_one_hot_1.

Similar to my approach in training Neural Network 3, I trained the Neural Network 4 model 20 times.

Then I used each of the 20 models in model_list_3 to predict on test data (test_2) and took the average of prediction probabilities.

Please refer to My Github Repository for the TensorBoard scalars.

Model Ensemble

1. Devices with No Events Data:

2. Devices with Events Data:

Final Predictions:

Finally, I concatenated the test data predictions for devices with events and devices without events to create the full test prediction file, containing 112071 rows, where each row gives the predicted probability of the device belonging to each of the 12 classes.
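The concatenation step can be sketched like this; the device ids and probabilities below are placeholders, and the 12 column names are the group labels from the training data:

```python
import numpy as np
import pandas as pd

classes = ["F23-", "F24-26", "F27-28", "F29-32", "F33-42", "F43+",
           "M22-", "M23-26", "M27-28", "M29-31", "M32-38", "M39+"]

# Placeholder averaged predictions for the two device groups.
ids_no_events = [101, 102]
ids_events = [201]
pred_no_events = np.full((2, 12), 1 / 12)  # uniform probabilities as stand-ins
pred_events = np.full((1, 12), 1 / 12)

sub_a = pd.DataFrame(pred_no_events, columns=classes)
sub_a.insert(0, "device_id", ids_no_events)
sub_b = pd.DataFrame(pred_events, columns=classes)
sub_b.insert(0, "device_id", ids_events)

# Stack the two groups into one submission file
# (112071 rows in the real data, 3 rows here).
submission = pd.concat([sub_a, sub_b], ignore_index=True)
submission.to_csv("submission.csv", index=False)
print(submission.shape)  # (3, 13): device_id plus 12 class probabilities
```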

Results

1. No Events Data: the model is trained on the one-hot encoding of phone brand and device model for all devices.

2. Events Data: the model is trained on the Event Feature Matrix, only for the devices which contain event details.

3. "Avg" in a model name indicates the model was run multiple times and the predictions were averaged.

The submission score for the concatenated test predictions:

Further Improvements

1. We can use different weights for different model combinations to improve the log loss.

2. We can ensemble with other ML models, such as Random Forest.

I have tried to explain this in the simplest way and hope you were able to follow along. If you have any queries, please feel free to reach out to me via My LinkedIn Profile.

Thank You for Reading !!

References

1. https://www.kaggle.com/c/talkingdata-mobile-user-demographics/

2. https://www.kaggle.com/c/talkingdata-mobile-user-demographics/discussion/23424

3. https://machinelearningmastery.com/model-averaging-ensemble-for-deep-learning-neural-networks/

4. https://www.appliedaicourse.com/lecture/11/applied-machine-learning-online-course/3081/what-are-ensembles/4/module-4-machine-learning-ii-supervised-learning-models
