Support Vector Regression in 6 Steps with Python

Samet Girgin
Published in PursuitOfData · 4 min read · May 22, 2019

Support Vector Regression is a type of Support Vector Machine that supports linear and non-linear regression. The mission is to fit as many instances as possible between two boundary lines while limiting margin violations; the width of that margin is controlled by the hyperparameter ε (epsilon).
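For reference, this objective can be written in a standard textbook form (added here for clarity, since the original graphic is not reproduced): keep the function as flat as possible while penalizing points that fall outside the ε-tube through slack variables ξ, which measure the margin violations:

\min_{w,\,b,\,\xi,\,\xi^*} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_i (\xi_i + \xi_i^*)

\text{subject to} \quad y_i - (w^\top x_i + b) \le \varepsilon + \xi_i, \qquad (w^\top x_i + b) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\ \xi_i^* \ge 0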

SVR requires training data τ = {X, Y} that covers the domain of interest and is accompanied by solutions on that domain. The work of the SVM is to approximate the function that generated the training set, much as it approximates a separating boundary in the classification problems we have already discussed.

How to Build a Support Vector Regression Model:

  1. Collect a training set τ = {X, Y}.
  2. Choose a kernel and its parameters, plus any regularization needed. (A Gaussian kernel with noise regularization is one example of each.)
  3. Form the correlation matrix (written out after this list).
  4. Train your machine, exactly or approximately, to get the contraction coefficients; this is the main part of the algorithm.
  5. Use these coefficients to create an estimator.
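Under the Gaussian kernel and noise regularization suggested in step 2, steps 3 to 5 can be written out roughly as follows (a common formulation, sketched here because the original equation image is not reproduced):

K_{ij} = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) + \varepsilon\,\delta_{ij} \quad \text{(step 3: correlation matrix, with noise on the diagonal)}

K\alpha = y \ \Rightarrow\ \alpha = K^{-1} y \quad \text{(step 4: solve for the contraction coefficients)}

\hat{y}(x^*) = \sum_i \alpha_i \exp\left(-\gamma \lVert x_i - x^* \rVert^2\right) \quad \text{(step 5: the estimator)}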

The goal in linear regression is to minimize the error between the prediction and the data. In SVR, the goal is to make sure that the errors do not exceed a set threshold ε.
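In loss-function terms (a standard comparison, added for clarity rather than taken from the article): least squares penalizes every deviation, while the ε-insensitive loss used by SVR is zero for any error inside the tube:

L_{\text{squared}}(y, \hat{y}) = (y - \hat{y})^2, \qquad L_{\varepsilon}(y, \hat{y}) = \max\left(0,\ \lvert y - \hat{y} \rvert - \varepsilon\right)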

SVR in 6 Steps with Python:

Let’s jump to the Python practice on this topic. Here is the link where you can reach the dataset for this problem.
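For orientation, here is a quick look at the file (a sketch: the column names and sample rows follow the commonly distributed version of Position_Salaries.csv and are assumptions here, not taken from the article):

import pandas as pd
dataset = pd.read_csv('Position_Salaries.csv')
print(dataset.head(3))
#             Position  Level  Salary
# 0   Business Analyst      1   45000
# 1  Junior Consultant      2   50000
# 2  Senior Consultant      3   60000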

#1 Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#2 Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values.astype(float)
y = dataset.iloc[:, 2:3].values.astype(float)
#3 Feature Scaling (skipped for now; we will come back to this step below)
#4 Fitting the Support Vector Regression Model to the dataset
# Create your support vector regressor here
from sklearn.svm import SVR
# The most important SVR parameter is the kernel type. It can be
# linear, polynomial or Gaussian. We have a non-linear problem, so
# we could select polynomial or Gaussian; here we select the RBF
# (a Gaussian type) kernel.
regressor = SVR(kernel='rbf')
regressor.fit(X, y.ravel())  # SVR expects a 1-D target array

#5 Predicting a new result
y_pred = regressor.predict([[6.5]])  # predict() takes a 2-D array

The prediction output is 130002. This prediction is not good enough.
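Why is it so far off? A quick diagnostic (a sketch, not part of the original article): scikit-learn's SVR defaults to C=1.0 and epsilon=0.1, which are negligible on the scale of salaries in the hundreds of thousands, so the fitted curve stays almost flat:

# Feature and target live on very different scales
print(X.ravel())  # position levels, roughly 1 to 10
print(y.ravel())  # salaries, roughly 45,000 to 1,000,000
# Default hyperparameters are tiny relative to the salary scale
print(regressor.get_params()['C'], regressor.get_params()['epsilon'])  # 1.0 0.1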

#6 Visualising the Support Vector Regression results
plt.scatter(X, y, color = 'magenta')
plt.plot(X, regressor.predict(X), color = 'green')
plt.title('Truth or Bluff (Support Vector Regression Model)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
An interesting SVR line

This problem is caused by the unscaled dataset in our practice. Many commonly used estimator classes include feature scaling internally, so scaling happens automatically. The SVR class is not one of them, so we have to do the feature scaling in our own code.
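As an aside, scikit-learn can also bundle both scaling steps into the model itself. A minimal sketch, assuming Pipeline and TransformedTargetRegressor (available since scikit-learn 0.20); this is an alternative to, not the method of, this article:

from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scales X inside the pipeline and y via the transformer, so
# predictions come back in original salary units automatically.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    transformer=StandardScaler(),
)
model.fit(X, y.ravel())
print(model.predict([[6.5]]))  # no manual inverse_transform needed

Here, though, we will follow the article's manual approach.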

Before adding the feature scaling lines, restart the kernel of your Python IDE. Then put the code below into the 3rd step:

#3 Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)  # y is a 2-D column, as StandardScaler expects
Compare the data before and after the feature scaling
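A small check of what the scaler did (illustrative; the printed values are approximate): each column now has mean 0 and standard deviation 1:

print(X.ravel())          # levels 1-10 become roughly -1.57 ... +1.57
print(X.mean(), X.std())  # approximately 0.0 and 1.0
print(y.mean(), y.std())  # approximately 0.0 and 1.0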

Then restart and run the code again. The final SVR graph will look like the one below. In this model, the point at the far upper right is the CEO, and the model treats this value as an outlier.

#5 Predicting a new result
# Scale the input, predict, then convert the scaled prediction back
y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform(np.array([[6.5]]))).reshape(-1, 1))

The two pairs of square brackets here create an array with one row and one column, that is, a single cell containing the numerical value 6.5. The transform method converts the value into the feature-scaled space, and inverse_transform converts the scaled prediction back into normal values. The prediction for 6.5 now comes out as about 170370, which looks much more accurate.
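A quick shape check (illustrative) makes the bracket point concrete:

import numpy as np
print(np.array([6.5]).shape)    # (1,)   - a 1-D vector
print(np.array([[6.5]]).shape)  # (1, 1) - one row, one column

Finally, the last block of our code does the same visualisation, except at a higher resolution: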

#6 Visualising the Regression results (for higher resolution and smoother curve)
X_grid = np.arange(X.min(), X.max(), 0.1)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Support Vector Regression Model, High Resolution)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
