Machine Learning made Easy — Linear Regression: Code Concept (Python)

Chinmay s yalameli
Aug 9, 2019


We have already covered the theory behind linear regression. If you have not read my previous article, please read it first. Let’s now see an example of the same in Python.

Before we proceed, note that we are using Jupyter Notebook to run our algorithms. Please check this link if you want to know more about Jupyter Notebook. You can find the files in this GitHub repository.

Code Analysis in Python:-

We need to install a few libraries, namely numpy, matplotlib and pandas, to run the code below. To install a library, use pip:

pip install <package-name>

Add --user to the command if you do not have administrator rights. For example,

pip install numpy --user

Installing libraries-

Run the following commands in your command prompt (CMD):

  1. pip install numpy
  2. pip install matplotlib
  3. pip install pandas
  4. pip install scikit-learn

Once they have finished, start your Jupyter notebook.
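Before moving on, it can help to confirm that Python can actually find the libraries. The snippet below is a minimal sketch using only the standard library; the helper name check_installed is hypothetical and not part of any package. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names used in this article (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "matplotlib", "pandas", "sklearn"]


def check_installed(names):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_installed(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Any library reported as missing can be installed with the pip commands listed above.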

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

We imported the libraries first; now we load the dataset Salary_Data.csv using pandas. I assign the independent variable to X and the dependent variable to y. X takes every column except the last, and y takes the second column (index 1). Here X holds the years of experience and y holds the salary.

Note- Independent variables (also referred to as Features) are the input for a process that is being analysed. Dependent variables are the output of the process.
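To see what the two iloc slices return, here is a small sketch on a made-up table. The column names and values are assumptions for illustration only and are not taken from Salary_Data.csv.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT the values in Salary_Data.csv.
toy = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0],
    "Salary": [40000.0, 45000.0, 50000.0],
})

X = toy.iloc[:, :-1].values  # every column except the last -> 2-D array
y = toy.iloc[:, 1].values    # second column only -> 1-D array

print(X.shape)  # (3, 1)
print(y.shape)  # (3,)
```

X stays two-dimensional (rows by features), which is the shape scikit-learn expects for its inputs, while y is a flat array of targets.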

After that, we divide the data into a training set and a test set. The training set is used to train the model, while the test set is used to check how well it predicts. Suppose we have details of 100 employees and their salaries. A common choice is an 80-20 split: 80 percent of the data for training and the remaining 20 percent for testing. The code above uses test_size = 1/3, so one third of the rows are held out for testing. Accuracy is then judged by the difference between the actual and predicted values on the test set.
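The 100-employee example can be sketched as follows. The data here is generated on the spot and is purely hypothetical; only the sizes of the resulting sets matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical employees: one feature each and a made-up salary.
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 30000.0 + 1000.0 * X.ravel()

# test_size=0.2 holds out 20 percent of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))  # 80 20
```

Fixing random_state makes the shuffle repeatable, so the same rows land in the test set on every run.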

If you type dataset in a new cell of your notebook, you should see the contents of the table. You can also type X_train, y_train or any other variable to inspect it.

Here comes an important part,

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)

Here I import the LinearRegression model from the sklearn library. It does work similar to what we discussed in the theory article. I assign the model to a variable called regressor, and regressor.fit() trains it on the training data. These lines are the heart of our project. Next, regressor.predict() produces outputs for inputs the model was not trained on. In our case, I pass the details of the employees in the test set, and the predicted salaries are stored in y_pred. Now we can compare the predictions in y_pred with the actual values in y_test to check accuracy.
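To get a feel for what fitting means with a single feature, here is a minimal sketch of the textbook least-squares formulas for slope and intercept. This is not how scikit-learn is implemented internally; the function name fit_line and the sample points are assumptions made for illustration.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
x_demo = [0.0, 1.0, 2.0, 3.0]
y_demo = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(x_demo, y_demo)
print(slope, intercept)  # slope 2, intercept 1
```

With one feature, these values should agree (up to rounding) with regressor.coef_[0] and regressor.intercept_ after regressor.fit() has been called on the same data.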

Run,

# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

This code produces a graph of the training results.

We can see the best-fit line drawn through the salaries, found by minimising the cost function mentioned in the previous article. We use matplotlib to visualise the data. Similarly, the code below visualises the test data.

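# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
# The line is still drawn from the training data: it is the same fitted line
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')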
plt.show()

The blue line comes from the trained model, while the red points are the test set values (X_test, y_test).

In the graph for the test set, we can see that most of the points lie close to the predicted line.

Now, let’s compare the actual and predicted salaries. Run:

df_new = pd.DataFrame({'Y_predicted': y_pred, 'Y_test': y_test})
df_new  # display the table in the notebook
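Beyond reading the table row by row, the gap between actual and predicted values can be summarised in a single number. The sketch below computes the mean absolute error and the R² score by hand with numpy; the helper names and the salaries are made up for illustration and are not results from this dataset.

```python
import numpy as np


def mean_abs_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up salaries, NOT output from the model in this article.
actual = [40000.0, 50000.0, 60000.0]
predicted = [42000.0, 49000.0, 61000.0]

print(round(mean_abs_error(actual, predicted), 2))
print(round(r_squared(actual, predicted), 2))
```

In the notebook, the same helpers can be called as mean_abs_error(y_test, y_pred) and r_squared(y_test, y_pred). scikit-learn also provides mean_absolute_error and r2_score in sklearn.metrics for the same purpose.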

You can see that the predicted salaries are close to the actual salaries. This is how linear regression works. Follow me for future updates.

Reference-

1. Machine Learning A-Z course by the SuperDataScience team on Udemy.
