Linear Regression using Iris Dataset — ‘Hello, World!’ of Machine Learning

Darryl See Wei Shen
Published in Analytics Vidhya
5 min read, Mar 10, 2020
Picture of an Iris Setosa (https://commons.wikimedia.org/wiki/File:Iris_setosa_var._setosa_(2595031014).jpg)

Agenda

  1. A (VERY) basic introduction to the Linear Regression Model.
  2. A basic introduction to the Iris Data.
  3. Code for making predictions using a Linear Regression Model.

Preamble

Regression Models are used to predict continuous values, while Classification Models are used to predict discrete values (class labels).

What do these terms mean?

  1. Continuous data points are data points that can take any value over a continuous range and are always numeric. Between any two continuous values, there are infinitely many other possible values. For the sake of simple comparison, I will call this the ‘grey area’.
  2. Discrete data points are data points that can only take particular values; they can be either numerical or categorical. For discrete data points, there is NO ‘grey area’.
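
To make the distinction concrete, here is a small sketch using the Iris data that the rest of this article works with. It assumes scikit-learn and NumPy are already installed, and the variable names are my own choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Sepal length is continuous: measured in cm and stored as floats
sepal_length = iris.data[:, 0]
print(sepal_length.dtype, sepal_length[:5])

# Species is discrete: only three class labels exist
species_codes = np.unique(iris.target)
print(species_codes, iris.target_names)
```

Predicting sepal length is therefore a regression problem, while predicting the species would be a classification problem.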

What is a Linear Regression Model?

Linear Regression is a type of Regression Model and a Supervised Learning algorithm in Machine Learning. It is one of the basic Machine Learning models every Machine Learning enthusiast should know. Linear Regression is a linear approach to modelling the relationship between a scalar response (y, the dependent variable) and one or more explanatory variables (X, the independent variables).
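
As a rough illustration of what fitting a linear model means, the sketch below estimates an intercept and a slope by ordinary least squares using NumPy only. The data are made up for the example (they follow y = 1 + 2x exactly) and all variable names are hypothetical. scikit-learn's LinearRegression, used later in this article, fits by the same least-squares criterion.

```python
import numpy as np

# Made-up data that follow y = 1 + 2x exactly (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones (intercept) next to x (slope)
A = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: coefficients minimising ||A @ beta - y||^2
beta, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

print('Intercept:', round(intercept, 6))
print('Slope:', round(slope, 6))
```

With real, noisy data the recovered coefficients will not match any "true" line exactly; they are simply the values that make the squared prediction errors as small as possible.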

I will demonstrate Linear Regression by predicting the sepal length (cm) of flowers from the genus Iris.

Before we begin…

Firstly, you will need Python installed and, optionally, an Integrated Development Environment (IDE) of your choice. I am using Jupyter Notebook, which in my opinion is by far the best environment for data visualisation/manipulation and Machine Learning.

Secondly, you will also need to install pandas, NumPy, scikit-learn (imported in code as sklearn), matplotlib and seaborn. Run pip/pip3/conda install on your command line to install these packages, like so:

pip/pip3/conda install pandas
pip/pip3/conda install numpy
pip/pip3/conda install scikit-learn
pip/pip3/conda install matplotlib
pip/pip3/conda install seaborn

Run the command that matches your pip installer, or use conda if you are on the Anaconda package management system.

Finally, to check that the libraries were installed successfully, you can either run

pip/pip3 freeze

in the command line, or check the Environments tab in Anaconda Navigator.
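
If you would rather check from inside Python, the following is a minimal sketch that uses only the standard library. It assumes Python 3.8 or newer (for importlib.metadata); the helper name installed_versions is hypothetical and not part of any package.

```python
from importlib.metadata import version, PackageNotFoundError

# Distribution names as published on PyPI (note: scikit-learn, not sklearn)
packages = ['pandas', 'numpy', 'scikit-learn', 'matplotlib', 'seaborn']

def installed_versions(names):
    """Return {name: version string, or None if the package is missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(packages).items():
    print(name, ':', ver if ver is not None else 'NOT INSTALLED')
```

Any package reported as NOT INSTALLED needs to be installed before the code in the rest of this article will run.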

The Iris Dataset

The dataset covers 3 species of the Iris genus, namely Iris Setosa, Iris Versicolor and Iris Virginica, with 50 rows of data for each species (150 rows in total). The column names represent the features of the flower that were measured: sepal length, sepal width, petal length and petal width, all in cm.

This is how I prepared the Iris Dataset, which I loaded from sklearn.datasets. Alternatively, you could download the dataset from the UCI Machine Learning Repository as a CSV file.

# Import pandas and the Dataset loader from sklearn
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris Data
iris = load_iris()

# Creating pd DataFrames
iris_df = pd.DataFrame(data= iris.data, columns= iris.feature_names)
target_df = pd.DataFrame(data= iris.target, columns= ['species'])

# Map the integer codes 0/1/2 to species names
def converter(specie):
    if specie == 0:
        return 'setosa'
    elif specie == 1:
        return 'versicolor'
    else:
        return 'virginica'

target_df['species'] = target_df['species'].apply(converter)

# Concatenate the DataFrames
iris_df = pd.concat([iris_df, target_df], axis= 1)
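
For comparison, a more compact way to attach the species names is sketched below. It is an alternative, not the approach used in the rest of this article: it reads the names from iris.target_names and maps them with pandas.Categorical.from_codes instead of a hand-written converter. The name iris_df_alt is hypothetical, chosen so it does not clash with iris_df.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Feature columns, named as in the dataset
iris_df_alt = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Map the integer codes 0/1/2 to the names stored in iris.target_names
iris_df_alt['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(iris_df_alt['species'].value_counts())
```

One difference to be aware of: the species column here has a categorical dtype rather than plain strings.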

An overview of the dataset:

iris_df.describe()

.describe() generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

iris_df.info()

.info() prints a concise summary of a DataFrame.

import seaborn as sns
sns.pairplot(iris_df, hue= 'species')

Problem Statement: Predict the sepal length (cm) of the iris flowers

Here comes the coding part!

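# Imports for modelling and evaluation
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Converting Objects to Numerical dtype:
# replace the species names with their integer codes (0, 1, 2)
iris_df.drop('species', axis= 1, inplace= True)
target_df = pd.DataFrame(columns= ['species'], data= iris.target)
iris_df = pd.concat([iris_df, target_df], axis= 1)

# Variables: every column except sepal length is a feature
X= iris_df.drop(labels= 'sepal length (cm)', axis= 1)
y= iris_df['sepal length (cm)']

# Splitting the Dataset (33% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state= 101)

# Instantiating LinearRegression() Model
lr = LinearRegression()

# Training/Fitting the Model
lr.fit(X_train, y_train)

# Making Predictions on the held-out test set
pred = lr.predict(X_test)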

# Evaluating Model's Performance
print('Mean Absolute Error:', mean_absolute_error(y_test, pred))
print('Mean Squared Error:', mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, pred)))

Results of fitting the model!

Mean Absolute Error: 0.26498350887555133
Mean Squared Error: 0.10652500975036944
Root Mean Squared Error: 0.3263816933444176
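
To show what these three numbers measure, here is a small sketch that computes them by hand with NumPy. The actual and predicted values are made up for illustration and are not taken from the Iris model above.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([5.0, 6.0, 7.0, 8.0])
y_pred = np.array([5.5, 6.0, 6.0, 8.5])

errors = y_true - y_pred           # [-0.5, 0.0, 1.0, -0.5]

mae = np.mean(np.abs(errors))      # average size of the errors
mse = np.mean(errors ** 2)         # average squared error (penalises large errors more)
rmse = np.sqrt(mse)                # square root of the MSE, back in the units of y

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

Because the mean absolute error and the root of the mean squared error are both in the same units as the target, the results above can be read directly as errors in centimetres of sepal length.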

Now to test…

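# Look up row 6 of the dataset to use as a test case
iris_df.loc[6]

# Rebuild that row as a one-row DataFrame
d = {'sepal length (cm)' : [4.6],
     'sepal width (cm)' : [3.4],
     'petal length (cm)' : [1.4],
     'petal width (cm)' : [0.3],
     'species' : [0]}
test_df = pd.DataFrame(data= d)
test_df

# Predict from the features only (drop the target column)
X_single = test_df.drop('sepal length (cm)', axis= 1)
pred = lr.predict(X_single)
print('Predicted Sepal Length (cm):', pred[0])
print('Actual Sepal Length (cm):', 4.6)

Expect a discrepancy between the predicted value and the actual value. In my original run, where pred[0] was taken from lr.predict(X_test), the difference was approximately 0.283 cm (3 S.F.), a little higher than the mean absolute error; predicting on test_df as above may give a slightly different figure. Note also that row 6 may have landed in the training split, so treat this as a sanity check rather than an independent test.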

Conclusion

I hope this introductory article has given you a basic understanding of what a Linear Regression Model is (and the code behind it), and of the Iris Data, the ‘Hello, World!’ data set of Machine Learning.

Source Code: https://github.com/peanutsee/Basic-Linear-Regression-Using-Iris-Dataset
