PREDICTIVE WEB ANALYTICS: A CASE STUDY

Suchismita Sahu
Published in Analytics Vidhya · 13 min read · Sep 7, 2021

Day by day, the amount of data and information on the internet is growing exponentially; new websites and new images come up every second. So, how can an e-Commerce organisation make the best of this huge volume of data? This is where Web Analytics comes into play.

In this blog, we are going to cover the following:

  • What is Web Analytics?
  • Metrics used for Web Analytics.
  • What is Predictive Web Analytics
  • Steps to perform Predictive Web Analytics
  • Case Study: Build a Predictive Model using Web Data
  • Conclusion

So, let's start…

WHAT IS WEB ANALYTICS

Web analytics is the collection, reporting, and analysis of website data. Its purpose is to identify measures based on your organizational and user goals, use the website data to determine the success or failure of those goals, and derive a strategy to achieve them and improve the user’s experience.

METRICS USED FOR WEB ANALYTICS:

Some common web metrics that Web Analytics experts track include:

  • Number of visitors a website receives.
  • From where web traffic is coming.
  • Time spent by the user on each page
  • What links are and are not clicked on
  • How well a website performs in search engine results

For more details, please visit:

https://www.cooladata.com/wiki/display/webanalyticsbi/Web+Analytics+Metrics
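
To make these metrics concrete, here is a minimal sketch (not from the original article) of how a few of them could be derived from a hypothetical page-view log, where each row is one page view; the column names and values are assumptions for illustration only.

import pandas as pd

# hypothetical page-view log: one row per page view, with a session id,
# the page viewed and the time spent on it
pageviews = pd.DataFrame({
    'session_id': [1, 1, 2, 3, 3, 3],
    'page': ['/home', '/shop', '/home', '/home', '/shop', '/cart'],
    'seconds_on_page': [30, 120, 10, 45, 90, 60],
})

visits = pageviews['session_id'].nunique()                               # number of visits (sessions)
avg_time_per_page = pageviews.groupby('page')['seconds_on_page'].mean()  # time spent on each page
pages_per_session = pageviews.groupby('session_id').size()
bounce_rate = (pages_per_session == 1).mean() * 100                      # % of single-page sessions

print(visits, bounce_rate)
print(avg_time_per_page)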

WHAT IS PREDICTIVE WEB ANALYTICS

Based on these collected metrics, we can predict certain customer behaviour using predictive modelling, so that we can take corrective actions in order to achieve our target.

So, Predictive Analytics is a set of methodologies that assist us in anticipating customer behaviour. Some of the reasons why you should opt for Predictive Analytics strategies are as below:

  • Traditional Web Analytics tools generate tons of clickstream data; Predictive Analytics helps filter out the noise and go beyond aggregate-level metrics.
  • Analytical models help you understand the complex patterns between the various data points, which can become the basis of your decision-making process.
  • It helps you prepare a data-driven marketing plan and allocate investments appropriately.

A few examples of Predictive Web Analytics include the following:

A company can use the predictions produced by a predictive model to maximize engagement rates for its content, by sending personalized messages to the target audiences, which helps increase the ROI of its content marketing efforts.

The same principle can be applied to run tighter email campaigns. You can let machine-learning algorithms personalize subject lines according to demographics, time of the day, and other factors.

STEPS TO PERFORM PREDICTIVE WEB ANALYTICS

Now, in order to perform Predictive Analytics, you will require the following:

Objective: The business problem that we want to solve.

Data: The right data needed to solve the business problem. This is easier if you have a user-centric business model, where you can collect rich data about your customers’ behaviour.

Methodology: Once you have the data and a clear objective, you can start thinking about the statistical method you will use to build the prediction model. For example, through cluster analysis we can group users with similar behaviour and plan a marketing strategy to acquire those customers, or through logistic regression we can predict whether a customer will buy a plan or not.

Tool: There are a variety of predictive analytics tools available. KNIME, RStudio, Alteryx Platform, MATLAB, IBM SPSS, Python and SAP Analytics Cloud are a few names among these. We have to select the tool that is most suitable for our in-house analytics talent pool and allocated budget.

Here, we are going to build a predictive model using a dataset published in the UCI Machine Learning Repository.

CASE STUDY

We are going to build a Predictive Model using customer visits data over a website.

Please refer to the UCI Machine Learning Repository (Online Shoppers Purchasing Intention Dataset) to get the dataset and details about it.

Dataset Information:

The dataset consists of data points belonging to 12,330 sessions of customer visits to the website. The dataset was formed so that each session would belong to a different user in a 1-year period, to avoid any tendency towards a specific campaign, special day, user profile, or period.

Attribute Information:

The dataset consists of 10 numerical and 8 categorical attributes.
‘Revenue’: Class label. Possible values: False and True.

“Administrative”, “Administrative Duration”: Represent the number of administrative pages visited by the visitor in that session and the total time spent in this page category.

“Informational”, “Informational Duration”: Represent the number of informational pages visited by the visitor in that session and the total time spent in this page category.

“Product Related”, “Product Related Duration”: Represent the number of product-related pages visited by the visitor in that session and the total time spent in this page category.

“Bounce Rate” refers to the percentage of visitors who enter the site from that page and then leave without triggering any other requests to the analytics server during that session.

“Exit Rate” depicts the percentage of exits on a page.

“Page Value” feature represents the average value for a web page that a user visited before completing an e-commerce transaction.

“Special Day” feature indicates the closeness of the site visiting time to a specific special day.

The dataset also includes some other features such as operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.

Objective: To build a predictive model that decides whether the customer will buy or not; that is, the variable ‘Revenue’ is the response variable and the others are the predictor variables.

Step 1: Import all the required libraries

PYTHON

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2: Upload the required dataset

PYTHON

from google.colab import files
uploaded = files.upload()

import io
df2 = pd.read_csv(io.BytesIO(uploaded['online_shoppers_intention.csv']))

Step 3: Get the size of the dataset

PYTHON

df2.shape

Step 4: Get first 10 records from the dataset

PYTHON

df2.head(10)

Step 5: Get the descriptive statistics of the dataset

df2.describe()

Step 6: Count of Missing values

mv = df2.isnull().sum()
mv

Step 7: Plotting the percentage of customers who have brought revenue. ‘True’ means the customer bought the product and ‘False’ means the customer did not buy the product.

import seaborn as sns

sns.set(style="whitegrid")
plt.figure(figsize=(8, 5))
total = float(len(df2))
ax = sns.countplot(x="Revenue", data=df2)
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    x = p.get_x() + p.get_width()
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center')
plt.show()

Step 8: Distribution of VisitorType

df2['VisitorType'].value_counts()

sns.set(style="whitegrid")
plt.figure(figsize=(8, 5))
total = float(len(df2))
ax = sns.countplot(x="VisitorType", data=df2)
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    x = p.get_x() + p.get_width()
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center')
plt.show()

Step 9: Percentage distribution of ‘VisitorType’ over the ‘Weekend’

x, y = 'VisitorType', 'Weekend'
df1 = df2.groupby(x)[y].value_counts(normalize=True)
df1 = df1.mul(100)
df1 = df1.rename('percent').reset_index()

g = sns.catplot(x=x, y='percent', hue=y, kind='bar', data=df1)
g.ax.set_ylim(0, 100)
for p in g.ax.patches:
    txt = str(p.get_height().round(2)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    g.ax.text(txt_x, txt_y, txt)

Step 10: Distribution of Revenue (Buy or Not) for different Traffic Types

x = 'TrafficType'
y = 'Revenue'
df1 = df2.groupby(x)[y].value_counts(normalize=True)
df1 = df1.mul(100)
df1 = df1.rename('percent').reset_index()

g = sns.catplot(x=x, y='percent', hue=y, kind='bar', data=df1)
g.ax.set_ylim(0, 100)
for p in g.ax.patches:
    txt = str(p.get_height().round(2)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    g.ax.text(txt_x, txt_y, txt)

Step 11: Distribution of Customers based on Different Traffic Type Codes

plt.hist(df2['TrafficType'])
plt.title('Distribution of diff Traffic', fontsize = 30)
plt.xlabel('TrafficType Codes', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

Step 12: Distribution of Customers based on Region Codes

plt.hist(df2['Region'])
plt.title('Distribution of Customers', fontsize = 30)
plt.xlabel('Region Codes', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

Step 13: Distribution of Customers over OperatingSystems

plt.hist(df2['OperatingSystems'])
plt.title('Distribution of Customers', fontsize = 30)
plt.xlabel('OperatingSystems', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

Step 14: Distribution of Customers over Months

plt.hist(df2['Month'])
plt.title('Distribution of Customers', fontsize = 30)
plt.xlabel('Month', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

Step 15: Distribution of PageValues over Revenue. seaborn.stripplot draws a scatterplot where one variable is categorical.

sns.stripplot(x = df2['Revenue'], y = df2['PageValues'])

Step 16: Distribution of Revenue over BounceRates

sns.stripplot(x = df2['Revenue'], y = df2['BounceRates'])

Step 17: Distribution of TrafficType over Revenue

df = pd.crosstab(df2['TrafficType'], df2['Revenue'])
df.div(df.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True)
plt.title('Traffic Type vs Revenue', fontsize = 30)
plt.show()

Step 18: Distribution of Region over Revenue

# countplot of Region split by Revenue
ax4 = sns.countplot(x = 'Region', hue = 'Revenue', data = df2)
plt.show()

Step 19: Linear Regression plot between Administrative and Informational

sns.lmplot(x = 'Administrative', y = 'Informational', data = df2, x_jitter = 0.05)

Step 20: Multi-variate analysis.

Month vs Pagevalues wrt Revenue

sns.boxplot(x = df2['Month'], y = df2['PageValues'], hue = df2['Revenue'], palette = 'inferno')

plt.title('Mon. vs PageValues w.r.t. Rev.', fontsize = 30)

Step 22: Month vs BounceRates w.r.t. Revenue

# month vs bouncerates wrt revenue
sns.boxplot(x = df2['Month'], y = df2['BounceRates'], hue = df2['Revenue'], palette = 'Oranges')
plt.title('Mon. vs BounceRates w.r.t. Rev.', fontsize = 30)

Step 23: VisitorType vs BounceRates w.r.t. Revenue

# visitor type vs BounceRates w.r.t revenue
sns.boxplot(x = df2['VisitorType'], y = df2['BounceRates'], hue = df2['Revenue'], palette = 'Purples')
plt.title('Visitors vs BounceRates w.r.t. Rev.', fontsize = 30)

df2.fillna(0, inplace = True)
# checking the no. of null values in the data after imputing the missing values
df2.isnull().sum().sum()

Step 24: The goal of cluster analysis in marketing is to accurately segment customers in order to achieve more effective customer marketing via personalization. A common cluster analysis method is a mathematical algorithm known as k-means cluster analysis, sometimes referred to as scientific segmentation.

Cluster of customers: Administrative Duration vs Bounce Rates. We have considered column 1 as Administrative Duration and column 6 as Bounce Rates. We evaluate cluster counts from 1 to 10 and use the elbow method to pick the optimal number.

WCSS: One measurement is the Within Cluster Sum of Squares (WCSS), which is the sum of the squared distances of all the points within a cluster to the cluster centroid. To calculate WCSS, you first find the Euclidean distance between a given point and the centroid to which it is assigned.

Here, the elbow method is a graph of WCSS against the number of clusters.
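
To make the WCSS definition concrete, here is a minimal sketch (an illustration, not part of the original notebook) that computes WCSS by hand on a tiny made-up dataset and checks it against scikit-learn's inertia_ attribute.

import numpy as np
from sklearn.cluster import KMeans

# small made-up 2-D dataset for illustration only
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

# WCSS: sum, over all points, of the squared Euclidean distance
# from the point to the centroid of the cluster it is assigned to
wcss_manual = sum(
    np.sum((pts[km_demo.labels_ == k] - centre) ** 2)
    for k, centre in enumerate(km_demo.cluster_centers_)
)

print(wcss_manual, km_demo.inertia_)   # the two values should match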

# preparing the dataset
x = df2.iloc[:, [1, 6]].values

# checking the shape of the dataset
x.shape

from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10,
                random_state = 0, algorithm = 'elkan', tol = 0.001)
    km.fit(x)
    labels = km.labels_
    wcss.append(km.inertia_)

plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()

Step 25: The maximum bend is at the third index, i.e. the optimal number of clusters for Administrative Duration vs Bounce Rates is three. Plotting the clusters:

km = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'red', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'General Customers')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s = 100, c = 'green', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.title('Administrative Duration vs Bounce Rates', fontsize = 20)
plt.grid()
plt.xlabel('Administrative Duration')
plt.ylabel('Bounce Rates')
plt.legend()
plt.show()

Step 26: We have considered column 3 as Informational Duration and column 6 as Bounce Rates.

# informational duration vs Bounce Rates
x = df2.iloc[:, [3, 6]].values

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10,
                random_state = 0, algorithm = 'elkan', tol = 0.001)
    km.fit(x)
    labels = km.labels_
    wcss.append(km.inertia_)

plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()

Step 27: Here, we have 2 clusters

km = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'red', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.title('Informational Duration vs Bounce Rates', fontsize = 20)
plt.grid()
plt.xlabel('Informational Duration')
plt.ylabel('Bounce Rates')
plt.legend()
plt.show()

Step 28: Where the customer comes from: Region vs Traffic Type

# Region vs Traffic Type
x = df2.iloc[:, [13, 14]].values

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10,
                random_state = 0, algorithm = 'elkan', tol = 0.001)
    km.fit(x)
    labels = km.labels_
    wcss.append(km.inertia_)

plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()

km = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'red', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.title('Region vs Traffic Type', fontsize = 20)
plt.grid()
plt.xlabel('Region')
plt.ylabel('Traffic')
plt.legend()
plt.show()

Step 29: Data preprocessing to build the Random Forest classifier and Logistic Regression. Here, we want to predict whether the customer will buy or not, so we use binary classifiers.

# one hot encoding
data1 = pd.get_dummies(df2)
data1.columns

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df2['Revenue'] = le.fit_transform(df2['Revenue'])
df2['Revenue'].value_counts()

# getting dependent and independent variables
x = data1
# removing the target column revenue from x
x = x.drop(['Revenue'], axis = 1)
y = data1['Revenue']

# checking the shapes
print("Shape of x:", x.shape)
print("Shape of y:", y.shape)
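
For clarity, pd.get_dummies expands each categorical column into one 0/1 indicator column per category, which is why the encoded frame has far more columns than the original 18 attributes. A tiny sketch of the idea, using made-up rows:

# illustration only: two made-up rows with categorical columns
demo = pd.DataFrame({'Month': ['Feb', 'Nov'], 'VisitorType': ['Returning_Visitor', 'New_Visitor']})
print(pd.get_dummies(demo))
# each category becomes its own indicator column, e.g. Month_Feb, Month_Nov,
# VisitorType_New_Visitor, VisitorType_Returning_Visitor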

Step 30: splitting the data between train and test sets

# splitting the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

# checking the shapes
print("Shape of x_train :", x_train.shape)
print("Shape of y_train :", y_train.shape)
print("Shape of x_test :", x_test.shape)
print("Shape of y_test :", y_test.shape)

Step 31: Building the Random Forest classifier model

# MODELLING
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

model = RandomForestClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# evaluating the model
print("Training Accuracy :", model.score(x_train, y_train))
print("Testing Accuracy :", model.score(x_test, y_test))
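
As an optional follow-up (not part of the original steps), the fitted Random Forest also exposes feature_importances_, which gives a quick indication of which predictors drive the model's decisions:

# inspecting which features the Random Forest relies on most
importances = pd.Series(model.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False).head(10))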

Step 32: Confusion Matrix. Model accuracy is 89%.

# confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.rcParams['figure.figsize'] = (6, 6)
sns.heatmap(cm, annot = True)

# classification report
cr = classification_report(y_test, y_pred)
print(cr)

cm = confusion_matrix(y, model.predict(x))
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()
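
For reference, the figures in the classification report follow directly from the confusion-matrix counts; a minimal sketch of that arithmetic, reusing the test-set predictions from above:

# scikit-learn's binary confusion matrix unpacks as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted buyers, how many actually bought
recall    = tp / (tp + fn)   # of the actual buyers, how many were caught
print(accuracy, precision, recall)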

Step 33: Plotting the ROC curve for Random Forest

from sklearn.metrics import plot_roc_curve
rf_disp = plot_roc_curve(model, x_test, y_test)
plt.show()
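
Note that plot_roc_curve has been removed from recent scikit-learn releases (1.2 onwards). If the import above fails on your version, the equivalent call is:

# equivalent on newer scikit-learn, where plot_roc_curve no longer exists
from sklearn.metrics import RocCurveDisplay
rf_disp = RocCurveDisplay.from_estimator(model, x_test, y_test)
plt.show()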

Step 34: Saving the predictions of the Random Forest model into a dataframe, which can later be written to a .csv file, so that we can know which customers will generate revenue.

df = pd.DataFrame(y_pred, columns=["Revenue"])
df
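
Writing those predictions out to a .csv file, as mentioned above, is then a one-liner (the file name below is just an illustrative choice):

# export the Random Forest predictions; the file name is illustrative
df.to_csv('rf_revenue_predictions.csv', index=False)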

Step 35: Building Logistic Regression model

from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression(solver='liblinear', random_state=0)
model1.fit(x_train, y_train)
y_pred1 = model1.predict(x_test)

Step 36: Printing Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred1)
print("Confusion Matrix : \n", cm)

Step 37: Plotting Confusion Matrix

cm = confusion_matrix(y, model1.predict(x))
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()

Step 38: Printing the classification report. Accuracy of Logistic Regression is 87%.

# classification report
cr1 = classification_report(y_test, y_pred1)
print(cr1)

Step 39: Plotting ROC curve for Logistic Regression

from sklearn.metrics import plot_roc_curve
lr_disp = plot_roc_curve(model1, x_test, y_test)
plt.show()

Step 40: Saving the predictions of the Logistic Regression model into a dataframe

df1 = pd.DataFrame(y_pred1, columns=["Revenue"])
df1

Step 41: Plotting ROC curve for both Random Forest and Logistic Regression

ax = plt.gca()
rf_disp = plot_roc_curve(model, x_test, y_test, ax=ax, alpha=0.8)
lr_disp.plot(ax=ax, alpha=0.8)
plt.show()
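
If you also want the AUC values as plain numbers rather than reading them off the plot, roc_auc_score can be applied to the predicted class probabilities; a brief sketch:

from sklearn.metrics import roc_auc_score

# probability of the positive class (Revenue = True) for each test session
rf_auc = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
lr_auc = roc_auc_score(y_test, model1.predict_proba(x_test)[:, 1])
print("Random Forest AUC      :", rf_auc)
print("Logistic Regression AUC:", lr_auc)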

CONCLUSION

In this blog, we learnt about Predictive Web Analytics and the various metrics used for it. We took a case study, performed data visualizations, built clusters based on customer behaviour, and trained two predictive models: a Random Forest classifier and a Logistic Regression classifier. We compared the performance of both models using the confusion matrix and ROC curve, and wrote the predictions from both models into their respective dataframes, so that by exporting those prediction outputs to csv files, business decision makers can know exactly which customers will generate revenue and which will not.

Hope you enjoyed this article.

So, here is something for you: try tuning the performance of both of the models and let me know the metrics in the comments box…
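
As one possible starting point for that exercise (a sketch only; the grid values are arbitrary, not recommendations), the Random Forest's hyperparameters could be tuned with GridSearchCV:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative grid; adjust the values to your compute budget
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring='roc_auc')
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)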

See you in our next blog…till then, Happy Learning…Stay tuned!


Suchismita Sahu
Analytics Vidhya

Working as a Technical Product Manager at Jumio corporation, India. Passionate about Technology, Business and System Design.