# Evaluation Metrics, Train/Test Split and Advertisement Data

Last week, I was looking at the differences between evaluation metrics and how they can be interpreted. In my data science immersion, we were looking at train/test split and the associated scikit-learn documentation. Thanks to a video entitled "Data science in Python: pandas, seaborn, scikit-learn," published on Data School's YouTube channel, I was able to gain some insight and understanding of how to apply the theory and implement it in code.

The particular data set looks at the effects of TV, radio, and newspaper advertising on sales. Internet and social media advertising were left out, but it was still a great way to learn how these tools can be applied in a marketing setting.

We start by loading the data into a DataFrame using pandas:

```python
# Import pandas and load the CSV
import pandas as pd

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
data.head()
```

Next, it’s always a good idea to look at the shape of the data.

```python
data.shape
```

```
(200, 5)
```

We can now import seaborn to see how our data looks graphically.

```python
# Import seaborn for visualization
import seaborn as sns
%matplotlib inline

# Visualize the data using scatter plots with fitted regression lines
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales',
             size=8, aspect=0.7, kind='reg')
```

Passing `kind='reg'` adds a fitted regression line with a shaded confidence-interval band to each scatter plot, which looks great visually.

Now we should define our features, the columns that record advertising spend in each medium.

```python
# Select the feature columns
X = data[['TV', 'Radio', 'Newspaper']]
X.head()
```

We should do the same for the dependent variable, i.e. Sales.

```python
y = data.Sales
y.head()
```

```
0    22.1
1    10.4
2     9.3
3    18.5
4    12.9
Name: Sales, dtype: float64
```

Now we need to get insight into the type and shape of X:

```python
print(type(X))
print(X.shape)
```

```
<class 'pandas.core.frame.DataFrame'>
(200, 3)
```

Now we can begin to execute the train/test split:

```python
# train_test_split now lives in sklearn.model_selection
# (older scikit-learn releases had it in sklearn.cross_validation)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The default split is 75 percent train, 25 percent test
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

```
(150, 3)
(50, 3)
(150,)
(50,)
```
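If you want a split other than the 75/25 default, `train_test_split` accepts a `test_size` parameter. A minimal sketch on synthetic stand-in data (the random arrays below are illustrative, not the Advertising data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples, 3 features
X_demo = np.random.rand(200, 3)
y_demo = np.random.rand(200)

# Ask for an 80/20 split instead of the 75/25 default
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

print(X_tr.shape, X_te.shape)  # (160, 3) (40, 3)
```

Passing `random_state` keeps the split reproducible across runs, which matters when you compare models.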

Next is where scikit-learn's linear regression comes in:

```python
# Import the model
from sklearn.linear_model import LinearRegression

# Instantiate
linreg = LinearRegression()

# Fit the model to the training data and learn the coefficients
linreg.fit(X_train, y_train)
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
```

Now we can derive our overall equation:

```python
# Print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
```

```
2.87696662232
[ 0.04656457  0.17915812  0.00345046]
```

y = 2.88 + 0.0466 * TV + 0.179 * Radio + 0.00345 * Newspaper
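Plugging numbers into the fitted equation shows how a prediction is formed. The spend figures below are made up purely for illustration:

```python
# Coefficients reported by the fitted model above
intercept = 2.87696662232
coef_tv, coef_radio, coef_news = 0.04656457, 0.17915812, 0.00345046

# Hypothetical ad spend, chosen only for illustration
tv, radio, newspaper = 100, 25, 20

predicted_sales = (intercept + coef_tv * tv
                   + coef_radio * radio + coef_news * newspaper)
print(round(predicted_sales, 2))  # 12.08
```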

Let's look at the evaluation metric RMSE. RMSE is preferable to MAE and MSE in this instance because it is interpretable in y units.

```python
import numpy as np
from sklearn import metrics

# Make predictions on the testing set
y_pred = linreg.predict(X_test)

# Compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

```
1.40465142303
```
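To see how the three metrics relate, here is a quick side-by-side comparison on a toy set of predictions (the numbers are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred_toy = np.array([2.0, 5.0, 9.0])  # errors of 1, 0, and 2

mae = np.mean(np.abs(y_true - y_pred_toy))   # average absolute error: 1.0
mse = np.mean((y_true - y_pred_toy) ** 2)    # average squared error (squared units): ~1.67
rmse = np.sqrt(mse)                          # back in the units of y: ~1.29

print(mae, mse, rmse)
```

Note how squaring inflates the larger error: MSE punishes the error of 2 four times as hard as the error of 1, and taking the square root brings the result back to the scale of y.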

Since Newspaper had only a slight effect on the sales value (y), let's evaluate the RMSE with it dropped, compared to when it was included.

```python
# Create a Python list of feature names
feature_cols = ['TV', 'Radio']

# Use the list to select a subset of the original DataFrame
X = data[feature_cols]

# Select a Series from the DataFrame
y = data.Sales

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the model to the training data
linreg.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = linreg.predict(X_test)

# Compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

```
RMSE = 1.38790346994
```

Here we can conclude that the RMSE decreased when we removed the Newspaper variable, so the new model performs better and is better suited to making predictions.
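This compare-feature-subsets workflow can be wrapped in a small helper so each candidate feature list is evaluated the same way. This is only a sketch: the `train_test_rmse` name is mine, and it runs on fabricated data (a `Noise` column that truly has no effect on the target) rather than the Advertising set:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

def train_test_rmse(frame, feature_cols, target_col):
    """Split, fit a linear regression, and return the test-set RMSE."""
    X = frame[feature_cols]
    y = frame[target_col]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# Fabricated data: Sales depends on TV and Radio; Noise is irrelevant
rng = np.random.RandomState(0)
df = pd.DataFrame({'TV': rng.rand(200),
                   'Radio': rng.rand(200),
                   'Noise': rng.rand(200)})
df['Sales'] = 3 + 2 * df.TV + 1.5 * df.Radio + rng.normal(scale=0.1, size=200)

print(train_test_rmse(df, ['TV', 'Radio', 'Noise'], 'Sales'))
print(train_test_rmse(df, ['TV', 'Radio'], 'Sales'))
```

The same pattern applies to the Advertising data: call the helper once with `['TV', 'Radio', 'Newspaper']` and once with `['TV', 'Radio']`, then compare the two RMSE values.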
