[Data Analysis] Machine Learning (8/9)

Sam Taylor
11 min read · Nov 5, 2023

--

Learn how to load data, preprocess it, choose a model, and evaluate its performance. Start your journey into the world of data analysis and machine learning today!

[This guide is part 8 of a 9-article walkthrough.]

Key concepts:
Data analysis · Machine learning · Random Forest · Data analysis process · Data analysis projects · VS Code · Python

Photo by Goran Ivos on Unsplash

Are you an aspiring data analyst looking to jump into the world of machine learning? Look no further! In this blog post, we’ll walk you through the most common steps for getting started with machine learning using the famous Iris Flower dataset. We’ll use Python, Jupyter notebooks and Visual Studio Code, so let’s get started.

Applying a random forest model to the iris dataset

To remind ourselves where in the data analysis process machine learning comes into play, here is a general outline of the data analysis process:

  1. Define Objectives: Clearly understand the goals of your analysis.
  2. Data Acquisition: Obtain the dataset you’ll be working with.
  3. Data Exploration: Explore the dataset to get an initial understanding of its structure and content.
  4. Data Cleaning: Preprocess the data to ensure its quality and consistency.
  5. Data Visualization: Create visualizations to gain insights into the data.
  6. Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power.
  7. Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
  8. ➡️ Machine Learning (if applicable): If your analysis involves predictive modelling, split the data into training and testing sets.
    ◦ Select an appropriate machine learning algorithm.
    ◦ Train and evaluate the model’s performance using metrics like accuracy, precision, recall, or F1-score.
  9. Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.

Prerequisites

Step 1: Setting Up Your Environment

Before we start, make sure you have the necessary tools and libraries installed:

  • Visual Studio Code (VS Code): A coding environment.
    Step-by-step guide
  • Python: A coding language and the backbone of our data analysis.
    Step-by-step guide
  • Jupyter Extension for VS Code: For interactive notebooks within VS Code.
    Step-by-step guide
  • Pandas, Matplotlib, Seaborn: Python libraries for data manipulation and visualization.
    Step-by-step guide
Installing a Python package via the command terminal (macOS)

Step 2: Creating a Jupyter Notebook

Launch Visual Studio Code, create a new Jupyter Notebook, connect a kernel, and save the notebook with an appropriate name like: “Iris_Flower_Data_Visualization.ipynb”.
Step-by-step guide

Step 3: Importing Libraries

In your Jupyter Notebook, start by importing the necessary libraries:
Step-by-step guide

# Import the core libraries for data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Step 4: Loading the Iris Flower Dataset

We will use a dataset called the Iris dataset. The Iris dataset is a classic dataset in the field of data analysis and machine learning, often used for classification and data exploration.

  • Download the dataset: Download the Iris Flower dataset as a CSV file from a trusted source, for example Kaggle.
    Step-by-step guide
  • Upload the dataset to VS Code: Now, load the CSV dataset into a Pandas DataFrame:
    ◦ Replace ‘your_file_path’ with the actual path to your dataset.
    Step-by-step guide
# Import the iris dataset using pd.read_csv 
# Replace 'your_file_path' with the actual path to your dataset.
df = pd.read_csv('your_file_path/iris.csv')
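
If you prefer not to download a CSV manually, the Iris data also ships with scikit-learn. The following is an optional sketch assuming scikit-learn is installed; note that the column names and species labels it produces differ slightly from the Kaggle CSV used in this guide, so you would need to rename them to follow along exactly.

# Optional alternative: load the Iris dataset bundled with scikit-learn
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # four measurement columns plus a numeric 'target' column

# Map the numeric target (0, 1, 2) onto the species names
df['species'] = iris.target_names[iris.target]
df = df.drop(columns='target')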

Step 5: Data exploration

Next, take a quick look at the data to familiarise yourself with it, using the .head() method.
◦ Click here for a more in-depth guide to data exploration.

# Check the first 5 rows of data
df.head()
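
If you want a slightly broader first look, the optional checks below use standard Pandas calls; they assume the 'species' column name from the Kaggle CSV.

# Optional extra checks: structure, summary statistics and class balance
df.info()                     # column names, dtypes and non-null counts
df.describe()                 # summary statistics for the numeric columns
df['species'].value_counts()  # number of rows per species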

Step 6: Data cleaning & preprocessing

Finally, we will clean the data by handling missing values, removing duplicates, and addressing outliers:
Step-by-step guide

# Handle missing values
df.dropna(inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Outlier removal (if needed)
# Example: removing rows more than 2 standard deviations from the mean
# (apply the z-score to the numeric columns only, since 'species' contains text)
from scipy import stats
numeric_cols = df.select_dtypes(include=np.number).columns
df = df[(np.abs(stats.zscore(df[numeric_cols])) < 2).all(axis=1)]
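
As a quick sanity check after cleaning, you can confirm that no missing values or duplicates remain and see how many rows are left. This is an optional sketch using standard Pandas calls.

# Sanity check after cleaning
print(df.isnull().sum())      # should show 0 missing values per column
print(df.duplicated().sum())  # should show 0 duplicate rows
print(df.shape)               # remaining rows and columns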

Machine Learning

Context

Before we dive in, it’s useful to understand what machine learning is.

Machine learning is a subset of artificial intelligence (AI) that focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. It involves creating systems that can automatically improve their performance over time through exposure to new data (Russell & Norvig, 2016).¹

The Two Main Types of Machine Learning

1. Supervised Learning
In supervised learning, the algorithm is trained on a labelled dataset, where each data point is associated with a known outcome. The goal is to learn a mapping from input data to the correct output. This type of learning is used for tasks like classification and regression, where the algorithm predicts categories or numerical values based on input features (Hastie et al., 2009).² A short scikit-learn sketch follows the examples below.

Examples:

  • Linear Regression: Linear regression is a common supervised learning method used for regression tasks. It finds the linear relationship between input features and a continuous target variable.
    ◦ For example, it can be used to predict house prices based on features like square footage and the number of bedrooms.
  • Logistic Regression: Logistic regression is used for binary classification tasks. It models the probability of an instance belonging to a particular class.
    ◦ For instance, it can be applied to predict whether an email is spam (class 1) or not spam (class 0).
  • ➡️ Random Forest: Random Forest is an ensemble learning method used for both classification and regression. It combines multiple decision trees to make more accurate predictions.
    ◦ It can be used to classify images, detect diseases, or predict stock prices.
  • Support Vector Machines (SVM): SVM is a powerful method for classification and regression. It aims to find the optimal hyperplane that best separates different classes in the data.
    ◦ SVMs are used in applications such as image recognition, text classification, and anomaly detection.
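
One practical point worth noting: all of these supervised models share the same fit/predict interface in scikit-learn, so swapping one for another is usually a one-line change. Here is a minimal sketch of that idea, assuming scikit-learn is installed and that the X_train, X_test, y_train and y_test variables we create later in this guide already exist.

# Scikit-learn classifiers share the same fit / predict interface,
# so trying a different supervised model is usually a one-line change.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

for candidate in (LogisticRegression(max_iter=200), SVC(), RandomForestClassifier()):
    candidate.fit(X_train, y_train)  # learn from the training data
    print(type(candidate).__name__, candidate.score(X_test, y_test))  # accuracy on the test data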

2. Unsupervised Learning
Unsupervised learning involves working with unlabelled data, where the algorithm aims to find patterns, structures, or relationships within the data without prior knowledge of the outcomes. Common tasks in unsupervised learning include clustering, dimensionality reduction, and anomaly detection. It is used to discover hidden insights and group similar data points (Bishop, 2006).³

Examples:

  • K-Means Clustering: K-Means is a popular clustering algorithm used in unsupervised learning. It groups similar data points into clusters based on their features (see the short sketch after this list).
    ◦ For example, it can be applied to segment customers into different market segments for targeted marketing.
  • Principal Component Analysis (PCA): PCA is a dimensionality reduction technique. It identifies the most important components in high-dimensional data and projects them onto a lower-dimensional space.
    ◦ PCA is used for feature reduction, visualization, and noise reduction in data.
  • Hierarchical Clustering: Hierarchical clustering groups data into a tree-like structure where similar data points are linked together.
    ◦ It is often used in biology for phylogenetic analysis, in document retrieval for clustering similar documents, and in image analysis for object recognition.
  • Anomaly Detection with Isolation Forest: Isolation Forest is used to detect anomalies or outliers in a dataset. It works by isolating anomalies faster than inliers.
    ◦ It is used in various applications, such as fraud detection, network security, and quality control.
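
For contrast with the supervised approach we use below, here is a minimal unsupervised sketch: running K-Means on the Iris measurements while ignoring the species labels entirely. It assumes the DataFrame df loaded earlier and that scikit-learn is installed.

# Unsupervised example: cluster the Iris measurements without using the labels
from sklearn.cluster import KMeans

features = df.drop('species', axis=1)      # keep only the numeric measurements
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(features)    # assign each row to one of 3 clusters
print(pd.Series(clusters).value_counts())  # how many flowers ended up in each cluster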

✅ In our dataset, we have the ‘species’ column. This means our data is labelled, so we will use a supervised learning model: the Random Forest algorithm.

Step 1: Import Necessary Libraries

In addition to Pandas, NumPy and Matplotlib, which we loaded above, let’s import the essential libraries for machine learning:

# Import a module used to help us split our data into training data and test data
from sklearn.model_selection import train_test_split

# Import a module used to help standardise our data
from sklearn.preprocessing import StandardScaler

# Import the Random Forest algorithm
from sklearn.ensemble import RandomForestClassifier

# Import modules used to evaluate how accurate our algorithm is
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Step 2: Split Your Data into Training and Testing Sets

A train-test split is a fundamental step in machine learning. It involves dividing a dataset into two subsets:

  • A ‘training’ set: used to train a machine learning model
  • A ‘testing’ set: used to evaluate the model’s performance.

This division allows the model to learn from the training data and then be assessed on unseen data, helping to estimate its ability to generalize to new, unseen examples.

To split your data, you can use the following code:

# Split our data into a testing and training set

# Separate the X and Y variables
X = df.drop('species', axis=1)
y = df['species']
  • X: The feature(s) of our data that we will use to help predict our Y value
  • Y: The variable that we are trying to predict
# Use the train_test_split module on our data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • This code takes our X and Y values above and splits them into 4 variables
    X_train: Training dataset (features)
    X_test: Test dataset (features)
    y_train: Training dataset (values to predict)
    y_test: Test dataset (values to predict)
  • test_size = 0.2
    ◦ Here, we save 20% of our original dataset as our ‘test data’
    ◦ Meaning, we use 80% of our original dataset as our training data

A common practice, for beginners, is to use a 70–30 or 80–20 split, where 70% or 80% of the data is used for training, and the remaining 30% or 20% is used for testing.
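
To confirm the split behaved as expected, you can print the shapes of the four new variables; with 20% of roughly 150 rows held out, you should see about 120 training rows and 30 test rows. An optional check:

# Quick check of the split sizes (roughly 80% / 20% of the rows)
print(X_train.shape, X_test.shape)  # feature rows for training and testing
print(y_train.shape, y_test.shape)  # matching target values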

Step 3: Feature Scaling

Scaling your data puts all features on a comparable scale, so that no single feature dominates simply because its values happen to be larger. We can use StandardScaler to scale our features:
◦ If you have been following our 9-article series, you will already have scaled your Iris dataset and can skip this step.

# Scale our data, so that our features have the same scale.

# Start an instance of the scaler
scaler = StandardScaler()

# Fit the scaler on the training data, then apply the same scaling to the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
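
If you want to verify the scaling, each training feature should now have a mean of roughly 0 and a standard deviation of roughly 1. An optional check:

# After scaling, each training feature should have mean ~0 and std ~1
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))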

Step 4: Choose Your Model

For this example, we’ll use a Random Forest Classifier.

The code to initiate a Random Forest instance is:

# Initiate a random forest instance 
model = RandomForestClassifier()
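
The default settings work well for the Iris data, but RandomForestClassifier also accepts hyperparameters you may want to set explicitly, for example the number of trees and a random seed for reproducible results. An optional variant:

# Optional: set a few common hyperparameters explicitly
model = RandomForestClassifier(
    n_estimators=100,  # number of decision trees in the forest
    max_depth=None,    # let each tree grow until its leaves are pure
    random_state=42    # makes the results reproducible between runs
)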

Step 5: Train Your Model

Next, we will train our model using the training data we created:
◦ Here, the model will look through our data and try to find patterns for each species of iris flower, similar to the patterns we have seen in the previous blog posts in this series.

# Fit our training data to our model
model.fit(X_train, y_train)
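
Once fitted, a Random Forest can also report how much each measurement contributed to its decisions via its feature_importances_ attribute. This optional sketch assumes the column names from the original X DataFrame:

# Optional: see which measurements the forest relied on most
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))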

Step 6: Evaluate Your Model

So, after all that, the question is: how did the model do?
◦ To answer this question, we need to ask our model to compare its own predictions (y_pred) with the test data we created earlier (y_test).

To do this, we can use the following code:

# Ask our model to predict what Y would be, given our test data
y_pred = model.predict(X_test)

# Compare our test data with our predictions, and see how accurate our model is
accuracy = accuracy_score(y_test, y_pred)
accuracy
The accuracy score of the Random Forest model on the Iris dataset: 100%

🎉 Not too shabby at all!
◦ 1.0 means that 100% of our test data (y_test) was correctly identified by our model’s predictions (y_pred)!

Let’s take our evaluation a little further and produce a classification report, which will provide more details about our model’s performance.

# Print a more detailed classification report
print(classification_report(y_test, y_pred))
A classification report of our Random Forest model on our Iris dataset

Below you will find an explanation of each of the terms in the report (precision, recall, f1-score & accuracy).

Accuracy:

  • Accuracy is a commonly used metric that measures the proportion of correctly classified instances out of the total instances in a dataset (Hastie et al., 2009).²
  • It’s calculated using the formula:
    ◦ Accuracy = (Number of Correct Predictions) / (Total Number of Predictions).

Precision:

  • Precision quantifies the ability of a model to make correct positive predictions among all positive predictions (Powers, 2011).⁵
  • It is calculated as:
    ◦ Precision = (Number of True Positives*) / (Number of True Positives + Number of False Positives**).

* True positives represent the instances that the model correctly predicts as positive, and they are indeed positive in reality.
** False positives are instances that the model incorrectly predicts as positive, when in reality, they are negative.

Recall (Sensitivity):

  • Recall, also known as sensitivity or true positive rate, quantifies a model’s ability to correctly identify all positive instances (Powers, 2011).⁵
  • It’s calculated as:
    ◦ Recall = (Number of True Positives*) / (Number of True Positives + Number of False Negatives**).

* True positives represent the instances that the model correctly predicts as positive, and they are indeed positive in reality.
** False negatives are instances that the model incorrectly predicts as negative, when they are actually positive.

F1 Score:

  • The F1 score is a metric that balances precision and recall, providing a single measure of a model’s performance (Van Rijsbergen, 1979).⁴
  • It is calculated as the harmonic mean of precision and recall:
    ◦ F1 Score = 2 * (Precision * Recall) / (Precision + Recall).
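
To make these formulas concrete, here is a small worked sketch using hypothetical counts; the numbers are made up for illustration and are not taken from our Iris results.

# Worked example with hypothetical counts (for illustration only)
tp, fp, fn, tn = 40, 5, 10, 45  # true/false positives and negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.85
precision = tp / (tp + fp)                          # ~0.89
recall = tp / (tp + fn)                             # 0.80
f1 = 2 * precision * recall / (precision + recall)  # ~0.84

print(accuracy, precision, recall, f1)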

Here’s a good matrix to help visualise the parts of each formula (true positive, true negative, false positive & false negative):
◦ Example: If our model predicts Iris-versicolor and the answer is indeed Iris-versicolor, this is a true positive.
◦ Example: If our model predicts Iris-versicolor but the answer is actually Iris-setosa, this is a false positive.

Confusion matrix showing the definitions of TP/FP/FN/TN (Qian, Yawei & Zeng, Guang & Pan, Yue & Liu, Yang & Li, Kun., 2021).⁶

We can also visualise these metrics a little better by plotting our predictions (y_pred) against our test data (y_test) on a heatmap.
◦ You’ll notice that all of our predictions are ‘True Positives’.

# Create a confusion matrix from our test data (true values) and our model's predictions
cm = confusion_matrix(y_test, y_pred)

# Create the labels for the chart
x_axis_labels = ["Iris-setosa","Iris-versicolor","Iris-virginica"] # labels for x-axis
y_axis_labels = ["Iris-setosa","Iris-versicolor","Iris-virginica"] # labels for y-axis

# Plot the confusion matrix on a heatmap (+ set a color & the X/Y labels)
sns.heatmap(cm, annot=True, cmap='Blues', xticklabels=x_axis_labels, yticklabels=y_axis_labels);
A heatmap visualising the predictions made by our Random Forest model on our Iris dataset

We can see above that:

  • 10 predictions were correctly classified as Iris-setosa, where the true species was Iris-setosa.
  • 9 predictions were correctly classified as Iris-versicolor, where the true species was Iris-versicolor.
  • 11 predictions were correctly classified as Iris-virginica, where the true species was Iris-virginica.
Applying a random forest model to the iris dataset (end-to-end)

Summary

🎉 Congratulations! You’ve taken your first steps into the world of machine learning with the Iris Flower dataset.

By following these steps, you’ve learned how to load and preprocess data, choose a model, train it, and evaluate its performance.

This is just the beginning of your journey into the exciting field of data analysis and machine learning. Keep practicing and exploring!

Reference(s)

¹ Russell, S. J., & Norvig, P. (2016). ‘Artificial Intelligence: A Modern Approach’. Pearson.

² Hastie, T., Tibshirani, R., & Friedman, J. (2009). ‘The Elements of Statistical Learning: Data Mining, Inference, and Prediction’. Springer.

³ Bishop, C. M. (2006). ‘Pattern Recognition and Machine Learning’. Springer.

⁴ Van Rijsbergen, C. J. (1979). ‘Information Retrieval’. Butterworth-Heinemann.

⁵ Powers, D. M. (2011). ‘Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation’. Journal of Machine Learning Technologies, 2(1), 37–63.

⁶ Qian, Yawei, Zeng, Guang, Pan, Yue, Liu, Yang, & Li, Kun (2021). ‘A Prediction Model for High Risk of Positive RT-PCR Test Results in COVID-19 Patients Discharged From Wuhan Leishenshan Hospital, China’. Frontiers in Public Health, 9, 778539. doi:10.3389/fpubh.2021.778539. https://www.researchgate.net/figure/Confusion-matrix-The-accuracy-precision-recall-F1-score-and-AUC-mainly-rely-on-the_fig2_355985914. Accessed: 05 November, 2023.


Sam Taylor

Operations Analyst & Data Enthusiast. Sharing insights to support aspiring data analysts on their journey 🚀. Discover more at: https://samtaylor92.github.io