LinkedIn Posts Engagement Composite Score with Python
How to rank your company posts' engagement with a three-tier composite score (gold, silver, bronze).
Introduction
Youhosty is an Italian property-management startup with a strong technological focus, providing quality services for short- and medium-term rentals.
They offer a wide range of services: property management, guest management, cleaning and maintenance, marketing, and management of financial flows.
They are a team of professionals who have worked in the sector for years and decided to join forces to offer property owners a complete service.
To know more about Youhosty, visit their English website here.
Data Project Overview
In this project we use the LinkedIn post data of Youhosty's company profile to analyse post engagement. The data covers the period from 01/01/2021 to 31/01/2022.
The post topics are mixed: some promote properties for rent, others share sector insights and updates.
The main goal is to find patterns between post components (words, sentences, etc.) and user engagement, and to build a scoring system that evaluates post engagement.
The scoring system is based on tiers (gold, silver, bronze), where each tier has a different score.
The higher the tier, the higher the score. After implementing the scoring system we will create a word cloud of the most used words in the posts to identify the most important topics.
The posts are in Italian; I will provide a translation of the most important words and sentences. You can find the full code in my GitHub repository.
The key question is:
Will the flat-rental promotion posts land in the gold and silver tiers, or will the insights posts take the gold and silver medals, given that LinkedIn is a professional social network which, besides job-related posts, is mainly used to promote insights and thought-leadership content?
About the dataset
The dataset was collected by downloading the LinkedIn post data from the Youhosty account. It contains the following columns:
- Post Date: date of the post
- Update Post: text of the post
- Impressions: number of impressions of the post
- Clicks: number of clicks on the post
- Click through rate (CTR): percentage of impressions that resulted in a click
- Likes: number of likes on the post
- Comments: number of comments on the post
- Shares: number of shares of the post
- Follows: number of follows generated by the post
- Engagement rate: engagement rate of the post
Action Plan:
- Cleaning the data (data type adjustments, missing values, special characters, emoji, hashtags)
- Descriptive analysis: likes, shares, comments and engagement count, correlation patterns (if any)
- Scoring system implementation
- Finalize the dataframe to implement composite scoring and wordcloud of most engaging posts
1. Cleaning the data (data type adjustments, missing values, special characters, emoji, hashtags)
In this section we will clean the data, removing the missing values, the special characters, the emoji and the hashtags. We will also adjust the data type of the columns.
But first, let’s import the libraries we will need for the analysis.
The libraries we will use are:
- Pandas: to manipulate the data
- Numpy: to manipulate the data
- Matplotlib: to plot the data
- Seaborn: to plot the data
- Neattext: to clean the text
- Wordcloud: to plot the wordcloud
#import main libraries for reading the excel file, cleaning the text
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import neattext as nfx
# Reading the file with posts into a dataframe
df = pd.read_excel('youhosty_updates_1_jan_2021_23_jan_2021.xls', sheet_name='Update engagement')
Dataset Overview:
Drop the column with missing values (Follows) and the rows with missing post text:
# drop the column with missing values ==> Follows
df.drop(columns='Follows', inplace=True)
# drop the rows with missing values in the column Update Post and reset the index
df.dropna(axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
Cleaned Dataset:
Summary of posts Metrics
- Total Impressions: 7852
- Total Clicks: 192
- Click Through Rate Average: 0.02
- Total Likes: 175
- Total Comments: 4
- Total Shares: 42
- Average Engagement Rate: 0.06
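The summary above can be reproduced with a short aggregation over the dataframe. Here is a minimal sketch with made-up sample rows (the real figures come from the Youhosty export; column names follow the LinkedIn export used in this project):

```python
import pandas as pd

# Hypothetical sample rows with the same columns as the LinkedIn export
df = pd.DataFrame({
    'Impressions': [500, 300, 200],
    'Clicks': [20, 5, 3],
    'Click through rate (CTR)': [0.04, 0.017, 0.015],
    'Likes': [10, 4, 2],
    'Comments': [1, 0, 0],
    'Shares': [3, 1, 0],
    'Engagement rate': [0.066, 0.033, 0.025],
})

# Totals for count metrics, averages for rate metrics
summary = {
    'Total Impressions': df['Impressions'].sum(),
    'Total Clicks': df['Clicks'].sum(),
    'CTR Average': round(df['Click through rate (CTR)'].mean(), 2),
    'Total Likes': df['Likes'].sum(),
    'Total Comments': df['Comments'].sum(),
    'Total Shares': df['Shares'].sum(),
    'Average Engagement Rate': round(df['Engagement rate'].mean(), 2),
}
print(summary)
```

Counts are summed while CTR and engagement rate, being ratios, are averaged.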
Cleaning the text in the Update Post column with the neattext library, removing special characters, emoji, hashtags, etc.
# apply the text-cleaning functions to the column Update Post
def clean_column(df):
    # remove user handles
    df['Cleaned_posts'] = df['Update Post'].apply(nfx.remove_userhandles)
    # remove emojis
    df['Cleaned_posts'] = df['Cleaned_posts'].apply(nfx.remove_emojis)
    # remove hashtags
    df['Cleaned_posts'] = df['Cleaned_posts'].apply(nfx.remove_hashtags)
    # remove special characters
    df['Cleaned_posts'] = df['Cleaned_posts'].apply(nfx.remove_special_characters)
    # remove URLs
    df['Cleaned_posts'] = df['Cleaned_posts'].apply(nfx.remove_urls)
    # remove punctuation
    df['Cleaned_posts'] = df['Cleaned_posts'].apply(nfx.remove_puncts)
    # remove stopwords and short words
    df['Cleaned_posts'] = df['Cleaned_posts'].apply(nfx.remove_stopwords)
    df['Cleaned_posts'] = df['Cleaned_posts'].apply(nfx.remove_shortwords)
    # drop the original column, keeping only the cleaned one
    df.drop(columns='Update Post', inplace=True)
    return df

clean_column(df)
# move the cleaned posts column before the Impressions column
def move_column(df, col_name, before_col_name):
    before_col_index = df.columns.get_loc(before_col_name)
    df.insert(before_col_index, col_name, df.pop(col_name))
    return df

move_column(df, 'Cleaned_posts', 'Impressions')
Dataset cleaned:
2. Descriptive analysis: likes, shares, comments and engagement count, correlation patterns (if any)
Checking correlations between the numerical variables to identify the stronger relationships and select the key metrics for the engagement scoring model.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)
Correlation Heatmap
Boxplot of numerics features
# boxplots of key numerical metrics
df.boxplot(['Likes','Clicks','Shares', 'Comments', 'Click through rate (CTR)'], figsize=(15, 10))
Sort by engagement rate in descending order to get a first sense of which kinds of posts achieve the highest engagement rate.
# sort by Engagement rate descending and look at the top 5 posts
df.sort_values(by='Engagement rate', ascending=False).head(5)
Observations from the heatmap and boxplots:
- CTR and clicks are highly correlated, so we can drop CTR
- shares and likes show a (not high) correlation
- clicks also show some correlation with likes
- clicks, likes and shares should be the components of our own engagement rate
- impressions and engagement rate are negatively correlated
- looking at the top 5 posts by clicks, likes, shares, etc., posts about sector topics, insights and research seem to drive most of the engagement
Next steps of feature engineering:
- Keep only shares, likes, clicks and impressions, dropping CTR and Engagement rate, to build our own engagement rate ratio as a base for the engagement scoring
- We exclude CTR to keep our engagement scoring clean and avoid skewing the data.
- We exclude comments because they contain too many zero values and add nothing to our own engagement rate.
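The feature selection above can be sketched as follows. This is a minimal example with hypothetical sample data; in the project, `df_clean` is derived this way from the cleaned dataframe:

```python
import pandas as pd

# Hypothetical cleaned dataframe with the exported metric columns
df = pd.DataFrame({
    'Cleaned_posts': ['post a', 'post b'],
    'Impressions': [500, 300],
    'Clicks': [20, 5],
    'Click through rate (CTR)': [0.04, 0.017],
    'Likes': [10, 4],
    'Comments': [1, 0],
    'Shares': [3, 1],
    'Engagement rate': [0.066, 0.033],
})

# Keep only the features needed for our own engagement rate:
# CTR, Comments and LinkedIn's own Engagement rate are dropped
df_clean = df.drop(columns=['Click through rate (CTR)', 'Comments', 'Engagement rate'])
print(df_clean.columns.tolist())
```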
3. Composite Scoring System Implementation
After cleaning the dataset (excluding Comments, CTR and LinkedIn's own engagement rate), we create our own engagement rate metric from shares, likes, clicks and impressions:
df_clean.loc[:, 'Own_Eng_rate'] = (df_clean['Clicks'] + df_clean['Likes'] + df_clean['Shares']) / df_clean['Impressions']
Sorting the top 10 posts by the engagement rate we created:
df_clean.sort_values(by='Own_Eng_rate', ascending=False).head(10)
Dataset sorted by highest Own_Eng_rate
Initial Observations
Looking at the top 10 posts by engagement rate, we can already see that posts about sector insights or updates are at the top.
The post with the highest engagement rate (8.6%), for example, is (in Italian) "Google spinge ripresa turismo...", in English: "Google pushes the recovery of tourism in Italy".
Also "participate with your property in the resumption of short-term rent..." refers to the Google article about the recovery of the short-term market and invites property owners to participate in the recovery (as a reminder, these posts cover January 2021 to January 2022, i.e. between the COVID pandemic and its last wave).
Italy, for those who don't know, was one of the countries hit hardest by COVID, and the rental sector was severely battered.
Creation of a composite score for the posts to visualize the most engaging ones
To create the composite score, we are going to import sklearn libraries Standard Scaler and MinMaxScaler.
# Import of the normalizer and scalers from sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler
#initialize the scaler and the Min Max scaler
scale = StandardScaler()
min_max = MinMaxScaler()
# isolate the numeric columns to scale before generating the composite score
numerics = ['int64', 'float64']
numerical_columns = df_clean.select_dtypes(include=numerics).columns
# create a copy to attach the composite score to the original data frame
df_clean_feat_2 = df_clean.copy()
# scale the numeric columns and check the first 5 rows
df_clean_feat_2[numerical_columns] = scale.fit_transform(df_clean_feat_2[numerical_columns])
df_clean_feat_2.head()
Creation of composite score adding the columns
# Create a composite score by summing the scaled numeric columns
df_clean_feat_2['Composite_Score'] = df_clean_feat_2[numerical_columns].sum(axis=1)
df_clean_feat_2.head(5)
# Rescale the Composite_Score column with a min-max scaler to a range of 0 to 5
df_clean_feat_2['Composite_Score'] = min_max.fit_transform(df_clean_feat_2[['Composite_Score']]) * 5
# Sort the Composite Score from highest to lowest
df_clean_feat_2.sort_values(by='Composite_Score', ascending=False)
Final dataset with composite scores ranging from 0 (low) to 5 (high)
The next step is to create a post engagement tier system with Gold, Silver and Bronze tiers using pandas.qcut.
# Now we create a posts engagement tier system with Gold, silver and bronze tiers using Pandas.qcut
df_clean_feat_2['Eng_tiers'] = pd.qcut(df_clean_feat_2['Composite_Score'], 3, labels=['Bronze', 'Silver', 'Gold'])
# After creating the Gold, Silver and Bronze tiers, we sort so Gold is on top and visualize how many posts there are in each tier using DEX
df_clean_feat_2.sort_values(by='Composite_Score', ascending=False).head(25)
# Now we add composite score and eng tiers score to the original dataframe
df_clean['Composite_Score'] = df_clean_feat_2['Composite_Score']
df_clean['Eng_tiers'] = df_clean_feat_2['Eng_tiers']
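Without an interactive data explorer, the tier distribution can also be checked directly with `value_counts`. A minimal sketch with hypothetical composite scores on the 0-5 scale:

```python
import pandas as pd

# Hypothetical composite scores on the 0-5 scale
scores = pd.Series([0.2, 0.8, 1.5, 2.1, 2.9, 3.4, 4.0, 4.7, 5.0])

# Same tiering as above: three equal-sized quantile bins
tiers = pd.qcut(scores, 3, labels=['Bronze', 'Silver', 'Gold'])
print(tiers.value_counts())
```

Because qcut bins by quantiles, each tier ends up with roughly the same number of posts.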
4. Creation of the wordcloud with the top 25 posts based on the composite score tiers (Gold, Silver, Bronze)
# To create the wordcloud we select the top 25 posts by composite score
df_cl_top25 = df_clean.sort_values(by='Composite_Score', ascending=False).head(25)
Dataset with top 25 posts
Creation of the wordcloud with the wordcloud library
#import wordcloud library
from wordcloud import WordCloud
#function to get the wordcloud
def create_wordcloud_from_column(df, col_name, title, figsize=(20, 20)):
    wordcloud = WordCloud(background_color="white", max_words=1000, max_font_size=40, scale=3,
                          random_state=1).generate(str(df[col_name]))
    fig = plt.figure(figsize=figsize)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

# select the gold-tier posts and plot their wordcloud
df_cl_gold = df_clean[df_clean['Eng_tiers'] == 'Gold']
create_wordcloud_from_column(df_cl_gold, 'Cleaned_posts', 'Wordcloud of gold posts tier')
Wordcloud Image
Final Observations:
- Posts about sector trends drive more engagement (words: recovery, tourism, flows, results, reopening, increment, vaccines...)
- Posts with a clear call to action or a strong statement (read, make a gift) show higher engagement
- Posts that promote offers and products are less engaging, as they fall into the bronze tier
- To improve the channel's engagement, the recommendation is to focus on posts that provide insights on sector trends and updates.
Based on this analysis and the composite score results, we can confirm the importance of using LinkedIn to share and promote sector insights, research and thought-leadership content that provides value to followers.
Instagram and Facebook are better suited to promoting products and services in specific sectors such as the short-term rental/real estate business.
Author:
I write about data science, Python coding projects and data-driven marketing. I also provide data and business mentorship for people who are new to data or at entry level.
You can follow me on Medium, and Twitter, or visit my website and Github page.