Understanding Twitter Engagement with a Constant

Andrea Fiandro
Published in Analytics Vidhya · Sep 10, 2020

This article outlines the solution proposed by the POLINKS team, which ranked sixth in the RecSys Challenge 2020. This challenge is one of the most important competitions in the field of recommender systems. This year the competition was hosted by Twitter, which provided a really big dataset with almost 80 million unique tweets.

The team was supported by FITEC srl, LINKS Foundation and Politecnico di Torino.

The three organizations involved in the challenge.

The goal of the challenge: predict the engagement

Twitter aims to always provide engaging content, tailored to each user’s preferences. To do so, it has to constantly improve its recommender engines, and that is what the teams participating in the challenge try to do.

Let’s see an example to make things clearer.

In a typical interaction on Twitter, we have two kinds of actors:

  • The tweet author
  • The user reading the tweet
A typical Twitter interaction: this is what we have to predict. P(yes/no) represents the probability of a positive interaction between the user and the author.

The author writes a tweet that, depending on the author’s popularity, might be seen by a wide range of users. Each engaging user can perform different kinds of actions:

  • Reply: writing a comment about the tweet
  • Retweet: sharing the content of the tweet
  • Retweet with comment: sharing the content along with a personal comment
  • Like: a very well known action

The goal of the RecSys Challenge 2020 is to predict the probability of a positive engagement for each kind of action.

Evaluation metrics

The results on the public leaderboard are evaluated by means of two different metrics:

  • PRAUC (Area under the Precision-Recall Curve)
  • RCE (Relative Cross Entropy): a metric slightly different from the classical cross entropy. It measures how much the model’s cross entropy improves over that of a naive baseline that always predicts the average CTR (click-through rate).

Let’s look at an example of CTR to clarify this metric a bit.

Click-through rate for a typical user session

It simply represents the number of positive actions, divided by the total number of interactions.
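To make the two metrics concrete, here is a minimal sketch of how CTR and RCE can be computed for one engagement type with scikit-learn; the helper name and the toy numbers are ours, not part of the official evaluation code.

    import numpy as np
    from sklearn.metrics import log_loss

    def relative_cross_entropy(y_true, y_pred):
        # CTR: positive actions divided by the total number of interactions
        ctr = y_true.mean()
        # straw-man baseline that always predicts the CTR
        ce_naive = log_loss(y_true, np.full(len(y_true), ctr))
        ce_model = log_loss(y_true, y_pred)
        # positive when the model beats the naive baseline, negative otherwise
        return (1.0 - ce_model / ce_naive) * 100.0

    y_true = np.array([0, 0, 1, 0, 1])            # toy labels for one engagement type
    y_pred = np.array([0.1, 0.2, 0.7, 0.1, 0.6])  # toy predicted probabilities
    print(relative_cross_entropy(y_true, y_pred))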

First approach: gradient boosting

A pretty common solution that usually enjoys a lot of success in data science competitions is gradient boosting.

Since we were focused on getting a good leaderboard position, this was our first attempt, but if you are thinking this is the approach that took us to sixth place, you’ll be disappointed.

Feature engineering

We spent the majority of our time generating useful features, to make it easier for the gradient boosting algorithm to predict the probability of each type of engagement.

Feature engineering. Train. Repeat.

We generated 59 features that can be grouped into six different categories:

  1. Dataset features (12 features): given directly by the dataset, they can be included in the model with little or no adjustment (e.g. number of hashtags, language of the tweet).
  2. Author features (18 features): useful to profile each author in the training set. They include some pre-computed features describing the behaviour of each author over the history covered by the dataset. The two most important features of this category are:

Author engagement ratio: the number of actions of a particular type received by the author’s tweets, divided by the total number of tweets published by that author (see the code sketch after the figure below).

Number of received engagements: the total number of interactions received by the author for each type of engagement (like, retweet, reply, retweet with comment).

3. User features (18 features): similar to the ones computed for the authors, but calculated with respect to the user interacting with the tweet (e.g. the total number of likes given by the user, or the total number of actions).

4. Language spoken (1 feature): the main intuition behind this feature is that understanding the language of a tweet plays a key role in a possible interaction. For this reason we computed all the languages spoken by each user, to determine whether they are able to understand the text of the tweet.

5. Previous actions (4 features): we performed another precomputation to capture the history of previous interactions between users and authors.

6. Time-related features (6 features): additional features derived from the timestamp of the tweet, such as the time of day, the day of the week and so on.

The different categories of features for the Gradient Boosting
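As an illustration, here is a minimal sketch of how the author engagement ratio and the language feature could be computed with pandas. The column names (author_id, user_id, tweet_language, like) are placeholders, not the exact field names of the challenge dataset.

    import pandas as pd

    # toy interaction log: one row per (author, user, tweet) pair
    df = pd.DataFrame({
        'author_id':      ['a1', 'a1', 'a2', 'a2', 'a2'],
        'user_id':        ['u1', 'u2', 'u1', 'u3', 'u3'],
        'tweet_language': ['en', 'en', 'it', 'it', 'en'],
        'like':           [1, 0, 1, 1, 0],
    })

    # author engagement ratio: likes received by the author / tweets published by the author
    author_stats = df.groupby('author_id')['like'].agg(['sum', 'count'])
    df['author_like_ratio'] = df['author_id'].map(author_stats['sum'] / author_stats['count'])

    # language spoken: the set of languages each user has positively engaged with
    spoken = df[df['like'] == 1].groupby('user_id')['tweet_language'].agg(set)
    df['user_speaks_language'] = [
        lang in spoken.get(user, set())
        for user, lang in zip(df['user_id'], df['tweet_language'])
    ]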

Scalability issues

Now that we have defined a lot of features to help our gradient boosting algorithm make correct decisions, we have to face the biggest problem of this challenge: scalability.

The training set is really huge (about 70 GB): it contains 148 million rows along with the roughly 60 columns, representing the features we detailed above.

It’s easy to see that fitting the whole dataset in memory is not simple; for this reason, a classical in-memory implementation of gradient boosting wouldn’t work.

Fortunately, the XGBoost library we used for the implementation provides a good solution for this kind of situation: the external memory version.

Since this kind of implementation is not well known, we share some code to help other people facing the same issue.

  1. Write the dataset, along with the generated features, to a CSV file. In our solution we wrote three different files to disk: training, test and validation.
  2. Save the names of the columns: they will be really useful, for example, when plotting the importance of each feature. Otherwise you will simply get a number instead of a name.
  3. Write the labels in a column of the CSV, keeping the index of that column in mind, since you will lose the headers.
  4. Import the external memory CSV into the DMatrix data structure.

It’s important to append the suffix #dtrain.cache to the file path: in this way we tell XGBoost not to load the whole dataset in memory.
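Putting steps 1 to 4 together, here is a minimal sketch based on our reading of the XGBoost external memory API; the toy DataFrame, file names and cache prefix are placeholders.

    import pandas as pd
    import xgboost as xgb

    # toy stand-in for the DataFrame holding the engineered features (steps 1-3)
    train_df = pd.DataFrame({'label': [0, 1, 0, 1],
                             'n_hashtags': [0, 3, 1, 2],
                             'author_like_ratio': [0.1, 0.6, 0.2, 0.5]})

    cols = ['label'] + [c for c in train_df.columns if c != 'label']  # label in column 0
    train_df[cols].to_csv('train.csv', index=False, header=False)     # no header on disk
    with open('columns.txt', 'w') as f:                                # keep the column names
        f.write('\n'.join(cols))

    # step 4: '?format=csv&label_column=0' points XGBoost at the label column, while the
    # '#dtrain.cache' suffix enables the external memory mode, so the CSV is never
    # fully loaded in memory
    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0#dtrain.cache')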

5. Train the XGBoost model. To make the training even more robust to the huge size of the data, we suggest using the gpu_hist tree method and playing with the subsample parameter if you still run into problems:
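The following is an illustrative sketch, not our exact configuration; dtrain is the external memory DMatrix from step 4, and the hyperparameter values are placeholders.

    import xgboost as xgb

    params = {
        'objective': 'binary:logistic',
        'tree_method': 'gpu_hist',  # histogram algorithm on GPU; fall back to 'hist' on CPU
        'subsample': 0.7,           # row subsampling per tree, helps if memory is still tight
        'eval_metric': 'logloss',
    }

    # dtrain is the external memory DMatrix built in step 4
    model = xgb.train(params, dtrain, num_boost_round=300, evals=[(dtrain, 'train')])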

Results

This solution is pretty complicated, and it didn’t give us good results on the public leaderboard. In particular, we got a strongly negative RCE score, which suggested looking in other directions.

The best solution: CTR optimized constant

After the disappointing results of the gradient boosting model, we realized that we needed a better understanding of the metrics to climb the leaderboard.

The starting question, which ironically led to our best solution, was:

What is the best constant that provides a good balance between the scores of the two metrics?

To answer that question, we followed these steps:

  1. We assigned the same randomly extracted number to all the possible user-tweet pairs
  2. We calculated both metrics and tuned that number to maximise the score
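A minimal sketch of this sweep for a single engagement type, using scikit-learn for both metrics (the toy labels and the candidate grid are ours):

    import numpy as np
    from sklearn.metrics import average_precision_score, log_loss

    rng = np.random.default_rng(0)
    y_true = rng.binomial(1, 0.4, size=100_000)   # toy labels; in practice, the real ones

    ce_naive = log_loss(y_true, np.full(len(y_true), y_true.mean()))
    for c in [0.01, 0.1, 0.3, 0.5, 0.7, 0.9]:
        pred = np.full(len(y_true), c)
        prauc = average_precision_score(y_true, pred)        # identical for every constant
        rce = (1 - log_loss(y_true, pred) / ce_naive) * 100  # peaks when c is close to the CTR
        print(f'constant={c:.2f}  PRAUC={prauc:.4f}  RCE={rce:.2f}')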

The result of this investigation is detailed in the following table:

RCE and PRAUC results for each constant.

The first thing we noticed is that the optimal value differs depending on the engagement type. In particular, the optimum for Like is much higher because it is, by far, the most popular action.

From this investigation we get two useful outcomes:

  • PRAUC: any constant produces the same score
  • RCE: has different optimal values, depending on the engagement type.

It is obvious that we have to find the values that optimize the RCE.

Action distribution

The RCE is strictly related to the click-through rate (CTR), and the CTR depends on the distribution of the actions in the training set.

For example, the RCE for the Like action yields better results for higher constants than the other actions do, because Like is the most common action.

The distribution of each action over the training set

At this point it is obvious that the best constant for each type of action is the CTR itself. The following table shows the numerical values of the CTR for each type of engagement:

Optimized constant for each type of engagement

By comparing these results with the previous table, we understand why 0.5 gives a good result for Like while for the other actions the best constant is 0.1: the best value is simply the one closest to the CTR.

The final solution is based on the CTR constant, following these steps:

  • Calculate the CTR for each type of engagement over the training set
  • The predicted probability for each kind of action is the CTR itself, replicated for all the user-tweet pairs. For example:
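A minimal sketch of this, where the DataFrames and column names are stand-ins for the actual challenge files:

    import pandas as pd

    actions = ['reply', 'retweet', 'retweet_with_comment', 'like']   # hypothetical column names

    # toy stand-ins for the training interactions and for the pairs to predict
    train_df = pd.DataFrame({a: [0, 1, 0, 0] for a in actions})
    test_df = pd.DataFrame({'user_id': ['u1', 'u2'], 'tweet_id': ['t1', 't2']})

    # CTR of each action over the training set: positive actions / total interactions
    ctr = {a: train_df[a].mean() for a in actions}

    # the predicted probability is simply the CTR, replicated for every user-tweet pair
    for a in actions:
        test_df['pred_' + a] = ctr[a]

    print(test_df)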

Results

Although this approach is really simple and requires only a few lines of code, it gives an amazing score on the leaderboard, probably because it exploits some weaknesses in the way the ranking is calculated.

The final ranking is computed in different steps:

  • Average of the PRAUC score across the four engagements
  • Average of the RCE score across the four engagements
  • Ranking computation for both metrics
  • Sum of the two obtained rankings

This kind of computation favors solutions with a good score on the least competitive metric.

Our team extensively tested the constant model to understand whether the good results on the public leaderboard were just a coincidence or might also hold on the official ranking.

To do so, we divided the training set into chunks of the same size as the official validation set, to see how the RCE score is affected by the different distribution of actions.

The results are outlined in the following picture:

Behaviour of the constant based on CTR over different time spans.

As we can see, there is no big difference in score between the different training chunks, so we can be pretty sure that the final leaderboard computation will not affect our solution much.

Final Considerations

The lesson we learned from this challenge is that a data science competition can be really different from real life.

The dataset provided was very well structured, and Twitter is one of the most common data sources for researchers, so it would be unfair to consider this kind of problem artificial.

The real difference is that, to climb the leaderboard, you sometimes have to use tricks, such as exploiting data leaks or metric weaknesses, that cannot be applied when the goal is to deliver a product.

In a real-world scenario, a solution based on a constant would be totally useless, and we would probably have spent more time improving the gradient boosting solution with more features and better scalability.

If you are interested, the code for both solutions can be found in our GitHub repository.
