Evaluation Metrics, Train/Test Split and Advertisement Data

Last week, I was looking at the differences between evaluation metrics and how they can be interpreted. In my data science immersion, we were looking at the train/test split and the associated scikit-learn documentation. Thanks to a video entitled Data science in Python: pandas, seaborn, scikit-learn, published on Data School’s YouTube channel, I was able to gain some insight into how to apply the theory and implement it in code.

The particular data set looks at the effects of TV, radio, and newspaper advertising on sales. Internet and social media advertising are left out, but it was still a great way to learn how these tools can be applied in a marketing setting.

We start by loading the data into a pandas DataFrame:

#Import CSV
import pandas as pd
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
data.head()

Next, it’s always a good idea to look at the shape of the data.

data.shape
(200, 5)

We can now import seaborn to see how our data looks graphically.

#import seaborn 
import seaborn as sns
%matplotlib inline
#Visualizes data using scatter plots, includes confidence interval areas 
sns.pairplot(data, x_vars = ['TV', 'Radio', 'Newspaper'], y_vars='Sales', size=8, aspect=0.7, kind='reg')

Passing kind='reg' overlays a linear regression line on each scatter plot, along with a shaded confidence interval band, which looks great visually.

Now we should define our features, which hold the advertising spend for each medium.

#Use feature names
X = data[['TV', 'Radio', 'Newspaper']]
X.head()

We should do the same for the dependent variable, i.e., Sales.

y = data.Sales
y.head()
0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: Sales, dtype: float64

Now we need to get insight into the type and shape of X:

print type(X)
print X.shape
#output below
<class 'pandas.core.frame.DataFrame'>
(200, 3)
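
For comparison (this check isn’t in the original walkthrough, but it’s a useful sanity check), the response y is a one-dimensional pandas Series:

print type(y)
print y.shape
#output below
<class 'pandas.core.series.Series'>
(200,)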

Now we can begin to execute the train/test split.

from sklearn.cross_validation import train_test_split
#note: in newer versions of scikit-learn this lives in sklearn.model_selection
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#default split is 75 percent train, 25 percent test
print X_train.shape
print X_test.shape
print y_train.shape
print y_test.shape
#the above yields the below findings 
(150, 3)
(50, 3)
(150,)
(50,)
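
If you want a different split, test_size can be passed explicitly. A minimal sketch (the walkthrough above just relies on the default):

#explicitly request a 25 percent test set (the same as the default)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)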

Next is where scikit-learn comes in.

#import model 
from sklearn.linear_model import LinearRegression
#instantiate 
linreg = LinearRegression()
#fit the model to the training data and learn coefficients 
linreg.fit(X_train, y_train)
#See output below 
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now we can derive our overall equation

#print the intercepts and coefficients 
print linreg.intercept_
print linreg.coef_
#see output below 
2.87696662232
[ 0.04656457  0.17915812  0.00345046]

y = 2.88 + 0.0466 * TV + 0.179 * Radio + 0.00345 * Newspaper
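
As a quick sanity check (the spend figures here are hypothetical, not from the data set), we can plug an example budget into the equation by hand and confirm the fitted model agrees:

#hypothetical spend of 100 (TV), 25 (radio), 25 (newspaper) in the data's units
#by hand: 2.88 + 0.0466*100 + 0.179*25 + 0.00345*25 is roughly 12.1
print linreg.predict([[100, 25, 25]])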

Let’s look at the evaluation metric RMSE. RMSE is preferable to MSE in this instance because it is interpretable in y units, and unlike MAE it punishes larger errors more heavily.

import numpy as np
from sklearn import metrics
#make predictions on the testing set, then compute RMSE
y_pred = linreg.predict(X_test)
print np.sqrt(metrics.mean_squared_error(y_test, y_pred))
#output below
1.40465142303
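
For comparison (these two lines are my addition, not part of the original walkthrough), here are the other two metrics on the same predictions:

#MAE and MSE on the same test predictions
print metrics.mean_absolute_error(y_test, y_pred)
print metrics.mean_squared_error(y_test, y_pred)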

Since Newspaper had only a slight effect on the sales value (y), let’s evaluate the RMSE when we drop it and compare to when it was included.

#create a python list of feature names 
feature_cols = ['TV', 'Radio']
#use the list to select a subset of the original dataframe 
X = data[feature_cols]
#Select a Series from the dataframe
y = data.Sales
#Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
#fit the model to the training data 
linreg.fit(X_train, y_train)
#make predictions on the testing set
y_pred = linreg.predict(X_test)
#compute the RMSE of our predictions
print np.sqrt(metrics.mean_squared_error(y_test, y_pred))
#output below
1.38790346994

Here we can conclude that the RMSE decreased when we removed the Newspaper variable, so our new model is performing better and is better suited to making predictions.
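
To generalize this comparison, we can wrap the fit/predict/score steps in a small helper and try several feature subsets. This is a sketch of my own (reusing the data and imports from above), in the spirit of the Data School video, not code from the original post:

#compare test RMSE across candidate feature subsets
def train_test_rmse(feature_cols):
    X = data[feature_cols]
    y = data.Sales
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

for cols in [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio'], ['TV'], ['Radio']]:
    print cols, train_test_rmse(cols)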
