# Predicting Starting Pitcher Salaries

In baseball, you don’t know nothing.

-Yogi Berra

**THE PROBLEM**

Today’s post focuses on applying linear regression techniques to a less-than-ideal dataset. In order to do so, I need a scenario from which to work. As of this writing, the MLB free agent signing period (or ‘Hot Stove’ as it is affectionately named) is in full effect. Therefore, I chose the following problem statement as my challenge: **my client, a professional baseball team, is interested in offering a contract to a free agent starting pitcher and wants a recommendation for the annual salary it should propose**. Now that I have my problem, I can begin working on the answer!

**FINDING DATA**

In order to find a salary recommendation for my client, I must find a dataset with my target (salary) and potential predictors. Fortunately, sabermetrics has become quite the phenomenon in the world of baseball — thanks to pioneers like Bill James and Billy Beane — as it allows teams to take more analytical approaches to determining player value. One great website for accessing standard and advanced statistics common in the field of sabermetrics is www.baseball-reference.com. On the website, I can access a player value table for pitchers that includes a number of features and each player’s salary for 2017.

**SCRAPING & CLEANING**

Now that I have found the data table for my analysis, I need to scrape it from the web and into my jupyter notebook. There are many options for performing this kind of work, but I used Selenium (okay, I tried BeautifulSoup first, but it could not find the table I needed within the HTML code).

driver = webdriver.Chrome(chromedriver)

driver.get(‘https://www.baseball-reference.com/leagues/MLB/2017-value-pitching.shtml')

time.sleep(6)

pvtable = driver.find_element_by_xpath('//div[@id="div_players_value_pitching"]').get_attribute('outerHTML')

print(pvtable)

**Note: **I struggled with using Selenium to scrape the HTML code for this particular table, but eventually stumbled upon the **.get_attribute() **function. When I call it with ‘outerHTML,’ it returns what I need!

The above Python code provides the HTML code for the table in which I am interested. At this point, I can have pandas read the HTML and create a DataFrame, which can then be accessed using the 0 index of my pd.read_html() object as outlined below:

df = pd.read_html(pvtable)

df[0]

df = df[0]

I then clean the data by stripping unwanted characters like $ from the Salary column and removing pitchers that do not have listed salaries. I also convert all of the columns containing numerical information to integer and float dtypes by using **.to_numeric()**.

Once I have a workable DataFrame, I want to save that information in the same directory in which I am using my Jupyter Notebook so that I can access it in a separate notebook for the anlaysis portion of the project. Pandas contains a **.to_pickle() **function that makes this process pretty efficient.

df.to_pickle('mlb_sal_stats_17.pkl')

At this point, I have a data file that will be unaffected by any changes made to the Jupyter Notebook.

**EXPLORATORY DATA ANALYSIS**

Like any set of metrics, pitcher value metrics based on a small number of observations can drastically impact their accuracy. As a check on the pitchers included in my dataset, I decided to check the minimum value for innings pitched and plot annual salary against the number of innings pitched.

df.IP.min()

Output: .10000000

Hmmm, it looks like we have some players that did not even pitch a full inning, as well as a high concentration of pitchers with fewer than 100 innings pitched. The small number of innings pitched by certain players could mean that pitchers were injured, underutilized, **OR** that position players (non-pitchers) are occasionally pitching in games and might skew the data. Remember the data table snapshot from Baseball-Reference earlier in the post? It demonstrates the problem of having highly paid position players who pitched for one inning during the year!

The high concentration likely results from all of the aforementioned issues as well as the inclusion of relief pitchers in the data, which is beyond the scope of this project. Instead of including all pitchers, I chose to examine those that started five or more games. Eliminating those observations that do not truly belong in our target population yields the following, much friendlier scatter plot:

At this point, I need to explore the various features included in my dataset and determine which ones are the best candidates for performing a linear regression.

**WARNING:** Bare in mind that I want to practice linear regression techniques, regardless of whether or not the features and target exhibit a linear relationship. Performing a linear regression assumes that a linear relationship exists among the data, so I do not advocate violating said assumption when performing actual work.

In order to gain insight into which features are more correlated with a starting pitcher’s salary, I create a correlation matrix in pandas and a pair plot in Seaborn. Due to the size of the correlation matrix and unrefined pair plot, I will only display the output for the plot with five features.

df.corr()

sns.pairplot(df[['Age', 'IP', 'GS', 'RA9',

'WAR', 'Salary']]);

Although I plotted five features, some of them demonstrate collinearity with one another. For this reason, I choose to eliminate two of the collinear measures and concentrate on three independent features with the highest correlations: Age, IP, and WAR. Age is just the age of the player taken at a predetermined point in the season. IP is the number of innings pitched by the player over the course of 2017. Finally, WAR is an acronym for ‘Wins Above Replacement,’ which is an important sabermetric statistic designed to represent the number of wins a particular player contributed to his team above what a replacement level player might produce (WAR = 0). The highest correlation for the three features is .559, which is not fantastic. At this point, I would already consider using a different model if it were an actual deliverable.

**REGRESSION — MINIMUM VIABLE PRODUCT**

In order to create a minimum viable product in the event that everything falls apart, I decide to focus on creating a linear model to predict Salary dependent upon Age. To accomplish this, I use the LinearRegression class from scikit-learn.

lr_age = LinearRegression()

X = df.loc[:,'Age'].values.reshape(-1,1)

y = df.loc[:,'Salary'].values.reshape(-1,1)

lr_age.fit(X,y)

rmse = np.sqrt(mean_squared_error(y,lr_age.predict(X)))

print('Coefficient Matrix: %s\n Intercept: %s\n R^2: %s\n RMSE: %s'\

%(lr_age.coef_, lr_age.intercept_, lr_age.score(X,y), rmse))

Plotting the actual observations with the linear model overlaid:

Clearly, there exist a few players with much higher salaries than the majority. There are also a few high leverage points at the age of 40 and above, which further emphasizes the non-linearity of my data. However, these points are real data, so I do not want to discard them. I do, however, want to address the particularly high salaries, so I log transform them. This transformation results in a much tighter concentration of points, and improves my model.

lr_age = LinearRegression()

X = df.loc[:,'Age'].values.reshape(-1,1)

y = np.log(df.loc[:,'Salary']).values.reshape(-1,1)

lr_age.fit(X,y)

rmse = np.sqrt(np.exp(mean_squared_error(y,lr_age.predict(X))))

print('Coefficient Matrix: %s\n Intercept: %s\n R^2: %s\n RMSE: %s'\

%(lr_age.coef_, lr_age.intercept_, lr_age.score(X,y), rmse))

Ideally, the model would create a sine-like wave through the points, which would be representative of increasing salaries for pitchers until their prime, at which point age would take its toll on their abilities and result in lower salaries.

**REGRESSION — MULTIPLE FEATURES**

With a basic linear regression using one feature in hand, I want to explore a model with the three features mentioned earlier: Age, IP, and WAR. To do this, I use K-folds cross-validation to tune my model. This approach involves splitting the data into ‘folds’ of **K** size, where **K **is typically 5 or 10, and iterating through each fold as the validation set, while the remainder form the training set. Here is a nice graphic to aid in understanding:

In order to do apply this technique to my data, I adapt some code from David Ziganto. The first step is defining a few helper functions to calculate and return the training and validation error.

def calc_train_error(X_train, y_train, model):

'''returns in-sample error for already fit model.'''

predictions = model.predict(X_train)

mse = mean_squared_error(y_train, predictions)

rmse = np.sqrt(mse)

return mse

def calc_validation_error(X_test, y_test, model):

'''returns out-of-sample error for already fit model.'''

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)

rmse = np.sqrt(mse)

return mse

def calc_metrics(X_train, y_train, X_test, y_test, model):

'''fits model and returns the RMSE for in-sample error and out-of-sample error'''

model.fit(X_train, y_train)

train_error = calc_train_error(X_train, y_train, model)

validation_error = calc_validation_error(X_test, y_test, model)

return train_error, validation_error

This is more advantageous than using scikit-learn’s **cross_val_score** function because it returns both in-sample (training) and out-of-sample (validation) error, allowing me to determine where my model lies on the bias-variance plot. If you are unfamiliar with the concept, I recommend investing the time to understand it. At a high level, it states that if I lower my training error past a certain point, my model is overfitting the training data and will actually result in poorer performance on the validation data. The preferred region within the chart is at or near the minimum of the validation error curve, because it results in the lowest overall error.

This concept is critical to cross-validation, and is apparent in the final output of the K-Folds approach I am using. One other important requirement for cross-validation is that I must scale my data so that the magnitudes of certain features do not cause those features to outweigh the others. In this case, I standardize the three features using the StandardScaler class from scikit-learn and apply K-Folds.

s = StandardScaler()

K = 10

kf = KFold(n_splits=K, shuffle=True, random_state=42)

alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]

for alpha in alphas:

train_errors = []

validation_errors = []

for train_index, val_index in kf.split(X, y):

# split data

X_train, X_val = X[train_index], X[val_index]

y_train, y_val = y.loc[train_index], y.loc[val_index]

X_train = s.fit_transform(X_train)

X_train = X_train.reshape(164,3)

X_test = s.fit_transform(X_test)

X_test = X_test.reshape(18,3)

# instantiate model

lasso = Lasso(alpha=alpha, fit_intercept=True, random_state=77)

#calculate errors

train_error, val_error = calc_metrics(X_train, y_train, X_val, y_val, lasso)

# append to appropriate list

train_errors.append(train_error)

validation_errors.append(val_error)

# generate report

print('alpha: {:6} | mean(train_error): {:7} | mean(val_error): {}'.

format(alpha,

round(np.mean(np.exp(train_errors)),4),

round(np.mean(np.exp(validation_errors)),4)))

Output:

alpha: 1e-05 | mean(train_error): 2.5996 | mean(val_error): 2.9349

alpha: 0.0001 | mean(train_error): 2.5996 | mean(val_error): 2.9348

alpha: 0.001 | mean(train_error): 2.5996 | mean(val_error): 2.934

alpha: 0.01 | mean(train_error): 2.6002 | mean(val_error): 2.9268

alpha: 0.1 | mean(train_error): 2.6555 | mean(val_error): 2.9145

alpha: 1 | mean(train_error): 6.9193 | mean(val_error): 7.728

This particular cross-validation recommends an alpha for a lasso punishment term to the regression, but the most important results are the trends of the error terms. Training error decreases rapidly while moving from an alpha of 1 to an alpha of .1, but begins to level off. Meanwhile, the validation error forms a convex pattern as it moves across alphas in the same direction. This is the bias-variance tradeoff in action! With this information, I can choose my optimal alpha of .01, because it has the smallest delta between error terms. At this point, I use the K-Folds cross-validation multiple times to determine the best model for the data. It turns out that WAR exhibits too much collinearity with IP and does not actually provide much value. Because I have two features, I can represent the model as a hyperplane on a 3D plot. For interpretability, I undo the log transformation on Salary and get an interesting shape.

Again, it is clear that this model has shortfalls, especially when it comes to predicting salaries for the oldest players. This model predicts incredibly high salaries for those players that pitch into their forties, which is simply not the case.

As it turns out, a polynomial transformation of Age and IP results in an even better model than what I had with those two features alone. Should you wish to explore polynomial feature transformation, you can visit my github.

**RECOMMENDATION**

Despite my model’s shortcomings, I choose to use it to predict a salary for this year’s preeminent free agent starting pitcher: Yu Darvish. By incrementing his age one year to 31 and using his innings pitched from 2017, I can recommend an annual salary to offer Darvish. The grand total offer estimate is **$7.08** **Million**. So how does this compare to Darvish’s 2017 salary and the consensus appraisal seems to be that he will receive an offer with an average annual value of **$26.7 Million**. I will defer to the movie Major League to summarize the model’s performance:

**SUMMARY AND FURTHER INVESTIGATION**

The result of this exercise is a recommended salary for Yu Darvish in 2017 using his statistics for that year. In reality, player salaries are determined in advance, and are effective for a certain number of years as agreed upon by teams and players’ agents. Hence, this model should really incorporate past season data and exclude the present year for the purposes of prediction. To do so, I could create a weighted average of a player’s statistics over a predetermined timeframe as a method of predicting future salary. There likely exist transformations other than log that fit the data better, which could be obtained via additional exploration. I could also use a different algorithm to create a more appropriate model for the data — one that more appropriately handles the non-linearity of the features compared to salary. Finally, I am interested in exploring the relationships between the statistics from Baseball-Reference and the salaries of relief pitchers — one of the eliminated populations from the original dataset.

Ultimately, I gained some valuable practice in creating and evaluating linear models. Even though my model was not ideal for evaluating the relationships in the data, it resulted in a prediction that was interpretable and simple to assess. The process of attempting to provide insights for a Major League Baseball team was an enjoyable undertaking and gave me a deeper appreciation of the work that front offices do to evaluate players and provide competitive bids to free agent players. My initial expectation was for Darvish to sign with a team prior to this post, so I could have another measure by which to measure the prediction. Alas, this year’s free agency market has been exceptionally slow. I will have to wait a while longer for that comparison!