# Evaluation Metrics, Train/Test Split and Advertisement Data

Last week, I was looking at the differences between evaluation metrics and how they can be interpreted. In my data science immersion, we were looking at train/test split and the documentation for scikit-learn. Thanks to a video entitled *Data science in Python: pandas, seaborn, scikit-learn*, published on Data School’s YouTube channel, I was able to gain some insight and understanding of how to apply the theory and implement it in code.

The particular data set looks at the effects of TV, radio, and newspaper advertising on sales. Internet and social media advertising were left out, but it was still a great way to learn how these tools can be applied in a marketing setting.

We start by loading the data frame by using pandas:

```python
# Import the CSV into a pandas DataFrame
import pandas as pd

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
data.head()
```

Next, it’s always a good idea to look at the shape of the data.

```python
data.shape
```

```
(200, 5)
```

We can now import seaborn to see how our data looks graphically.

```python
# Import seaborn and enable inline plotting
import seaborn as sns
%matplotlib inline

# Visualize the data using scatter plots with fitted regression lines
# (in current seaborn versions the size parameter has been renamed height)
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales',
             height=8, aspect=0.7, kind='reg')
```

Setting `kind='reg'` adds a fitted regression line with its confidence interval band, which looks great visually.

Now we should define our features, which tell us how much advertising came from each medium.

```python
# Use the feature names to select the feature columns
X = data[['TV', 'Radio', 'Newspaper']]
X.head()
```

We should do the same for the dependent variable, i.e. Sales.

```python
y = data.Sales
y.head()
```

```
0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: Sales, dtype: float64
```

Now we need to get insight into the type and shape of X:

```python
print(type(X))
print(X.shape)
```

```
<class 'pandas.core.frame.DataFrame'>
(200, 3)
```

Now we can execute the code for the train/test split:

```python
# train_test_split lives in sklearn.model_selection
# (it was in the now-removed sklearn.cross_validation module)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The default split is 75 percent train, 25 percent test
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

The above yields the following:

```
(150, 3)
(50, 3)
(150,)
(50,)
```
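A quick sketch (on a small synthetic array, not the advertising data) of how the `test_size` parameter controls these proportions; the 75/25 default can be overridden explicitly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)  # 100 rows, 2 feature columns
y_demo = np.arange(100)

# Default: 25 percent of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# Explicit 80/20 split via test_size
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=1)
print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)
```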

Next is where scikit-learn comes in:

```python
# Import the model class
from sklearn.linear_model import LinearRegression

# Instantiate
linreg = LinearRegression()

# Fit the model to the training data and learn the coefficients
linreg.fit(X_train, y_train)
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
```

Now we can derive our overall equation:

```python
# Print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
```

```
2.87696662232
[ 0.04656457  0.17915812  0.00345046]
```

**y = 2.88 + 0.0466 * TV + 0.179 * Radio + 0.00345 * Newspaper**

Note that 2.88 is the intercept, not a coefficient; the TV coefficient is the first value in `linreg.coef_`.
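As a quick sanity check of the equation, a prediction is just the intercept plus the dot product of the coefficients with a row of feature values. The spend figures below are hypothetical; the intercept and coefficients are the fitted values printed above:

```python
import numpy as np

intercept = 2.87696662232
coefs = np.array([0.04656457, 0.17915812, 0.00345046])

# Hypothetical budgets: TV = 100, Radio = 25, Newspaper = 10
ad_spend = np.array([100.0, 25.0, 10.0])

prediction = intercept + np.dot(coefs, ad_spend)
print(round(prediction, 3))  # 12.047
```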

Let’s look at the evaluation metric RMSE. RMSE is preferable to MAE and MSE in this instance because it is interpretable in the units of y.
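A toy illustration of why that matters (the numbers here are invented for the example): for errors of 1, 2 and 3 units, MSE comes out in squared units of y, while MAE and RMSE stay in the units of y:

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([11.0, 18.0, 33.0])  # errors of 1, -2 and 3

mae = np.mean(np.abs(y_true - y_hat))  # (1 + 2 + 3) / 3 = 2.0
mse = np.mean((y_true - y_hat) ** 2)   # (1 + 4 + 9) / 3 ~ 4.667 (squared units)
rmse = np.sqrt(mse)                    # ~ 2.16, back in the units of y
print(mae, round(mse, 3), round(rmse, 3))
```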

```python
import numpy as np
from sklearn import metrics

# Make predictions on the testing set
y_pred = linreg.predict(X_test)

# Compute the RMSE of those predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

```
1.40465142303
```

Since Newspaper had only a slight effect on the sales value (y), let’s evaluate the RMSE with it dropped and compare with when it was included.

```python
# Create a Python list of feature names
feature_cols = ['TV', 'Radio']

# Use the list to select a subset of the original DataFrame
X = data[feature_cols]

# Select the Sales Series from the DataFrame
y = data.Sales

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the model to the training data
linreg.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = linreg.predict(X_test)

# Compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

```
1.38790346994
```

Here we can conclude that the RMSE decreased when we removed the Newspaper variable, so the new model performs better and is better suited to making predictions.
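The compare-and-drop workflow above can be wrapped in a small helper so any feature subset gets scored the same way. This sketch runs on synthetic data (the generating coefficients are invented), where the third column plays the role of Newspaper and has no real effect on the target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)  # has no effect on sales below
sales = 3.0 + 0.05 * tv + 0.18 * radio + rng.normal(0, 1.5, n)

X_all = np.column_stack([tv, radio, newspaper])

def holdout_rmse(X, y):
    """Fit on a 75/25 split and return RMSE on the held-out test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
    model = LinearRegression().fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

print(holdout_rmse(X_all, sales))         # all three features
print(holdout_rmse(X_all[:, :2], sales))  # drop the irrelevant column
```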