The 7 Steps of Python Machine Learning (with Code Examples)

[ ryn0f1sh ]
The Machine Learner
5 min read · Oct 5, 2023

This post is inspired by “Python Machine Learning Tutorial (Data Science)” by Mosh Hamedani. The video is amazing: he walks you through creating a working model and explains things in a very clear manner.

This write-up is a way for me to share some of my notes after walking through the video. I would highly recommend checking out Mosh’s video for more details.

The Outline of this Post.
1- The 7 Steps. (high level)
2- The 7 Steps. (with my notes)
3- The code of the model.
4- The code of the model. (broken down into the 7 steps)
Conclusion

The 7 Steps. (high level)

1- Import data.
2- Clean the data.
3- Split the data (Training / Testing).
4- Create a model.
5- Train the model.
6- Make a prediction.
7- Evaluate & Improve your model.

The 7 Steps. (with my notes)

1- Import data.
Most of the time this will be a CSV file.
If you have a specific database, export it as a CSV file.

2- Clean the data.
Depending on the project you're working on, this could be:
- Removing duplicates.
- Removing irrelevant data.
- Modifying incomplete or missing data.
- Converting/Labeling “text” to “numeric” values.
etc.
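As a quick sketch of what those cleaning steps look like in pandas (the DataFrame and its values here are made up purely for illustration):

```python
import pandas as pd

# Toy dataset with the kinds of problems listed above (hypothetical values)
df = pd.DataFrame({
    'age': [23, 23, 30, None, 45],
    'gender': ['male', 'male', 'female', 'female', 'male'],
    'genre': ['HipHop', 'HipHop', 'Dance', 'Dance', 'Classical'],
})

df = df.drop_duplicates()            # remove duplicate rows
df = df.dropna(subset=['age'])       # drop rows with missing values
df['gender'] = df['gender'].map({'male': 1, 'female': 0})  # text -> numeric
```

Real cleaning depends entirely on the dataset, but these three pandas calls cover the common cases above.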

3- Split the data (Training / Testing).
Usually the rule of thumb is 80/20.
This means you split your data as follows:
- 80% — Training Set
- 20% — Testing Set
This is the general rule, but of course every project is different.

4- Create a model.
Selecting an algorithm to analyze the data. (Decision Tree / Naive Bayes, etc.)
Each algorithm has Pros and Cons.
So you would need to figure out which one is best for your situation.
You choose an algorithm based on:
- The kind of data that you are working with.
- The kind of problem you are trying to solve.
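One thing that makes trying different algorithms painless in scikit-learn is that classifiers share the same fit/predict interface, so swapping one for another is a one-line change. A minimal sketch (the tiny dataset here is invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Tiny toy dataset: [age, gender] -> genre
X = [[20, 1], [23, 1], [25, 0], [26, 0], [30, 1], [31, 0]]
y = ['HipHop', 'HipHop', 'Dance', 'Dance', 'Jazz', 'Classical']

# Both classifiers expose fit() and predict(), so only the
# constructor changes when you switch algorithms.
for model in (DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=1)):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[22, 1]]))
```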

5- Train the model.
Feed it the Training set.
Have it learn the patterns in your data.

6- Make a prediction.
Feed it the Testing set.
This is where you ask it “Is this new data This or That?” to see what it learned from the training set.
Usually in the beginning it might be inaccurate, and that's OK; it's learning.

7- Evaluate & Improve your model.
Evaluate and measure the accuracy of the prediction.
Based on that you can:
- Fine tune the current model to optimize accuracy.
- Or choose a different algorithm all-together and see if that gives you a better result.
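A common way to compare candidate algorithms is cross-validation, which scores each model on several different train/test splits instead of just one. A sketch using scikit-learn's cross_val_score (the dataset below is made up for illustration):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset: [age, gender] -> genre (illustrative values only;
# each genre appears 3 times so 3-fold cross-validation is valid)
X = [[20, 1], [22, 1], [23, 1], [25, 0], [26, 0], [27, 0],
     [31, 1], [33, 0], [35, 1]]
y = ['HipHop'] * 3 + ['Dance'] * 3 + ['Classical'] * 3

for model in (DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=3)):
    scores = cross_val_score(model, X, y, cv=3)  # 3-fold cross-validation
    print(type(model).__name__, scores.mean())
```

Whichever model averages higher across folds is usually the safer pick.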

The code of the model.

In this exercise, our model predicts the genre of music a user likes based on 2 parameters: Age and Gender.
In the video he provides a link to the CSV file to use.
This is what the model looks like.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

music_data = pd.read_csv('music-file.csv')
X = music_data.drop(columns=['genre'])
y = music_data['genre']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
score

The code of the model. (broken down into the 7 steps)

The libraries we imported, and what they are for.

import pandas as pd
The Pandas library has many methods and is used often in data science. It allows us to import CSV files and create “DataFrames,” which are easier to work with in machine learning environments.

from sklearn.tree import DecisionTreeClassifier
This is the Algorithm that we will use for our model, it will use a Decision Tree method to learn the patterns of our data. This is also where you could choose a different algorithm if you wanted.

from sklearn.model_selection import train_test_split
This is the super helpful function that will help us split our data into Training and Testing sets.

from sklearn.metrics import accuracy_score
This is another super helpful function, that will help calculate how accurate our model’s prediction was, after it has trained and was given new data to analyze.

1: Import Data
I’m using Pandas to read the CSV file, and assigning it to the variable “music_data” .
music_data = pd.read_csv('music-file.csv')

2: Clean Data
Luckily this data is clean, there were no missing or inaccurate inputs.
When working with larger datasets that have even more inputs, this step will most likely be used more.
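If you want to verify that for yourself, pandas can report missing values and duplicates directly. A small sketch (a toy DataFrame stands in for the real CSV here):

```python
import pandas as pd

# Stand-in for: music_data = pd.read_csv('music-file.csv')
music_data = pd.DataFrame({
    'age': [20, 23, 26, 30],
    'gender': [1, 1, 0, 0],
    'genre': ['HipHop', 'HipHop', 'Dance', 'Classical'],
})

print(music_data.isna().sum())        # missing values per column
print(music_data.duplicated().sum())  # number of duplicate rows
```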

BUT we will create Input and Output sets. Our file has 3 columns (Age / Gender / Genre). We will split this data into the 2 parameters that we need.

These 2 parameters (Input / Output) are needed to help us create the Training and Testing sets.

The Input Set: We only want the “Age” and “Gender”, not the “Genre”, so we use the “drop” method to drop the “genre” column from this set and assign the result to a new variable (X).

This is our INPUT SET. (Age , Gender)
X = music_data.drop(columns=['genre'])

The Output Set: We only need the “Genre” so we just assign the ‘genre’ column to a new variable (y).

This is our OUTPUT SET. (Genre)
y = music_data['genre']

3: Splitting the Data : Training / Testing
Splitting the data using the ‘Train Test Split’ function for an 80/20 split.
80% Training / 20% Testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
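One thing to note: train_test_split shuffles the data randomly, so the accuracy score can change from run to run. Passing the optional random_state parameter makes the split reproducible. A sketch with toy data:

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, so an 80/20 split gives 8 train / 2 test
X = [[i] for i in range(10)]
y = list(range(10))

# Same random_state -> the same split on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```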

4: Create Model
Making an instance for the classifier.
model = DecisionTreeClassifier()

5: Train Model
The ‘fit’ method takes 2 parameters: Input and Output
Give it the Training sets
model.fit(X_train, y_train)

6: Make Prediction
Use the INPUT Testing set for the prediction.
predictions = model.predict(X_test)

Calculate the accuracy.
Using the “accuracy_score” function, we give it the OUTPUT Testing set, and the Prediction it just made.
score = accuracy_score(y_test, predictions)

Display the score on the screen. (In a Jupyter notebook, the last expression in a cell is displayed automatically.)
score

7: Evaluate & Improve
Since it's a small dataset, our model has been giving us great results, so we will keep it as it is.
When working with larger datasets that have even more inputs, this step will most likely be used more.

Conclusion

Mosh’s video really helped me demystify many things about creating a model. Yes this is in its simplest form, but still, having that guideline was super helpful. I would highly recommend checking the video out and walking through it yourself.

Towards the end he teaches you how to create a “Persistent Model”, meaning once you’ve trained a model, you can then just use it to check new data without having to re-train it every time you want to use it. Pretty cool stuff.
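Model persistence like that is typically done with joblib, which saves a trained model to disk so you can load and reuse it without retraining. A sketch under that assumption (toy data and a hypothetical filename):

```python
import joblib
from sklearn.tree import DecisionTreeClassifier

# Train a tiny model (toy data for illustration)
X = [[20, 1], [23, 1], [25, 0], [26, 0]]
y = ['HipHop', 'HipHop', 'Dance', 'Dance']
model = DecisionTreeClassifier().fit(X, y)

# Save once...
joblib.dump(model, 'music-recommender.joblib')

# ...then later, load and predict without retraining
loaded = joblib.load('music-recommender.joblib')
print(loaded.predict([[21, 1]]))
```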

Thanks for reading.
Let's code something cool.
Ash, The Machine Learner.

Support The Project.
Buy me a coffee | Become my GitHub Sponsor | Become a Patreon
