A Guide to Building Your First Machine Learning Model in Python: An Introduction
Building machine learning models is COOL! Your first model should not be complex and fancy. No! It is your first attempt; have fun doing it!
My first model was a linear regression model for predicting housing prices based on apartment size. My second? I predicted housing prices based on location. Then the third predicted housing prices from size, location, and type of neighborhood. I finally learned to train models on time-series datasets; those were a bit more advanced. Nonetheless, all of these were fun projects.
While at it, I realized that building a simple machine-learning model generally follows a specific set of steps.
I’ll discuss them below.
CASE IN POINT: In this article, I’ll base my discussion on building a linear regression model that predicts house prices (“house_price_usd”) from size (“house_size_m2”) and location (“lat”, “lon”).
Phase #1: Preparing Data
Before anything else, data analysts should have access to relevant data. Usually, this isn’t served to you on a silver platter. It is part of a data analyst’s work to convert RAW DATA into a ‘clean’ dataset that can provide meaningful insights.
Pick your favorite tools — Python, R, MS Excel, SPSS, and so on — and dive deep into these three activities:
Import Data
I like to study the nature of a dataset in its original format before cleaning it. For example, I’ll often use Microsoft Excel to inspect data stored in CSV or XLSX files before importing it into Python — this is considered good practice.
Next, get your hands dirty!
#importing CSV files into Python
import pandas as pd
df = pd.read_csv(filepath) #filepath points to your CSV file
Explore Data
Data exploration is like art — everyone might have a unique style of exploring data. However, your research questions should ALWAYS be your guiding compass.
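Whatever your style, a few pandas commands are worth running first to get oriented. A minimal sketch, assuming your dataset is already loaded into df:
#preview the first few rows
df.head()
#check column names, data types, and non-null counts
df.info()
#summary statistics for the numeric columns
df.describe()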
First, explore null values (NaN) and how they’ll impact your analysis — drop them if they add little information, or fill (impute) them if they are significant.
#calculate the fraction of NaN values in each column
nan_fraction = df.isnull().sum() / len(df)
#identify columns where more than 50% of the values are NaN
nan_fraction[nan_fraction > 0.5]
#drop columns with more than 50% NaN values
df.drop(columns=["column1", "column2"], inplace=True)
Other issues you should consider before moving to the next stage:
High and Low Cardinality: High cardinality refers to a situation where a dataset or a column contains a large number of unique values. For example, a column with URL links provides little to no information for your model; you should drop it.
Low cardinality? This is present when a column contains a relatively small number of unique values. Examples of low cardinality column values include Boolean values, gender, and other major classifications. In most cases, we won’t need these columns to build a linear regression model.
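A quick way to gauge cardinality is to count the unique values per column. A minimal sketch — the "listing_url" column name is hypothetical:
#count the unique values in each column
df.nunique()
#a column with (nearly) one unique value per row, e.g. a URL column, can be dropped
df.drop(columns=["listing_url"], inplace=True)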
Leakages in your dataset: Leakage occurs when your model has access to “future information” — information that would not be available at prediction time. Think of it as having the answers to an exam before sitting for it. Or your model seeing part of the test data during training. For example, if house prices are in USD, then any column that reflects house prices in a local currency (a forex conversion of the target) should be dropped.
In other words, any column derived from the target should not be used to train the model.
Multicollinearity: This is a situation where two or more features are highly correlated with each other. If two features are highly correlated, drop one of them, preferably the one with fewer complete entries (more missing values).
An excellent way to check for multicollinearity is to use a heatmap.
#import the necessary library
import seaborn as sns
#create a correlation matrix from the numeric columns
corr_matrix = df.select_dtypes("number").corr()
#visualize the correlation matrix as a heatmap
sns.heatmap(corr_matrix)
Note: this list is not exhaustive. Explore other characteristics of your data based on the needs and expectations of your analysis.
Split Data
Determine your “target vector” and “feature matrix.” In our case, the target is house prices, and the features are size and location (latitude and longitude).
#define your target: price in USD
target = "house_price_usd"
y_train = df[target] #use y for dependent variables
#define your features: size and location of the house
features = ["house_size_m2", "lat", "lon"]
X_train = df[features] #use X for independent variables
Note that the target variable is selected without square brackets []. That’s because we need the target to be one-dimensional (a VECTOR: a single column of values) rather than two-dimensional (a MATRIX: multiple rows and columns). The opposite is true for the “features” variable.
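You can verify the dimensionality by printing the shapes of the two variables (the row count in the comments below is illustrative):
#confirm the dimensionality of the target and the features
print(y_train.shape) #one-dimensional, e.g. (1000,) — a vector
print(X_train.shape) #two-dimensional, e.g. (1000, 3) — a matrix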
In some cases, you’ll need to split your data into a training dataset and a testing dataset — unless you have two different datasets for testing and training purposes.
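Scikit-learn’s train_test_split is the usual tool for this. A minimal sketch — the 80/20 split and the random_state value are arbitrary choices:
#split the data: 80% for training, 20% for testing
from sklearn.model_selection import train_test_split
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)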
Either way, you have a clean and usable dataset at this point. And you are ready to build your linear regression model.
Phase #2: Build Model
This phase involves extracting insights and knowledge from your dataset and making connections between variables. The model will highlight the degree of relation (coefficient) between dependent and independent variables. This phase can be broken into THREE stages:
Baseline
To build a reliable model, you’ll need to set specific standards. This involves building a simple model that provides you with a baseline standard for reference throughout your project.
For a simple linear regression model, the baseline metric is derived from your target’s MEAN and your simple model’s MEAN ABSOLUTE ERROR. Let’s see how this can be done:
#find the mean of your target (house prices)
y_mean = y_train.mean()
#create your prediction baseline from the mean value above
y_prediction_baseline = [y_mean] * len(y_train)
#use the two results to calculate the mean absolute error (MAE) baseline
from sklearn.metrics import mean_absolute_error #import MAE from the scikit-learn library
MAE_baseline = mean_absolute_error(y_train, y_prediction_baseline)
#print the mean house price and the baseline MAE
print("Mean house price:", round(y_mean, 2))
print("Baseline MAE:", round(MAE_baseline, 2))
Iterate
Iterating means using specific methods or tools in cycles to refine and improve your model. That is, fit a model comprising one or more transformers and a predictor from the scikit-learn library.
A simple model will run on a single predictor as follows:
#instantiate the LinearRegression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
#fit the model to the data, X_train and y_train
model.fit(X_train, y_train)
Interestingly, scikit-learn lets data analysts define a set of steps to process data before making any prediction. You can do this by creating a pipeline — a pipeline consists of one or more transformers (such as imputers) and ends with a predictor, as shown in the example below.
Examples of transformers in scikit-learn include SimpleImputer, MinMaxScaler, OrdinalEncoder, OneHotEncoder, and PowerTransformer.
For example, if your large dataset has few incomplete or missing values, Scikit-learn can impute the missing values in appropriate rows or columns. One way to do this is by utilizing Scikit-learn Simple Imputer.
#build a pipeline: a SimpleImputer transformer followed by a LinearRegression predictor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(
SimpleImputer(), #SimpleImputer is a transformer
LinearRegression() #LinearRegression is a predictor
)
#fit the model to the data, X_train and y_train
model.fit(X_train, y_train)
Evaluate
You’ve trained your model. Now, how does it perform against the set standards? You’ll need to compare the baseline MAE with the training MAE. See below:
#check the code for MAE_baseline above
#perform house price predictions using your model
y_predictions = model.predict(X_train)
#calculate the training MAE
MAE = mean_absolute_error(y_train, y_predictions)
#print the results
print("Training MAE:", round(MAE, 2))
Evaluation criteria: The training MAE should be well below the MAE_baseline; otherwise, your model performs no better than simply guessing the mean house price.
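If you held out a test set in Phase #1, it’s also worth computing the MAE on data the model never saw during training. A sketch, assuming X_test and y_test exist from an earlier split:
#predict on the held-out test set
y_test_predictions = model.predict(X_test)
#calculate and print the test MAE
MAE_test = mean_absolute_error(y_test, y_test_predictions)
print("Test MAE:", round(MAE_test, 2))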
Phase #3: Communicate Results
Prediction Function
What does your model say about the variables? Usually, this can be determined by analyzing the coefficients of each variable. Looking at the simple linear regression function, “m” is the coefficient of the variable “x” while “y” is the target and “c” is the intercept.
y = mx + c
The same function extends to multiple variables, where β0 is the intercept and β1, β2, β3 are the coefficients of the three features:
y = β0 + β1(x1) + β2(x2) + β3(x3)
First, let’s find the coefficients associated with the “house_size_m2”, “lat”, and “lon” variables.
#coefficients of a model without a transformer
coefficient = model.coef_.round(2)
#coefficients of a model with a transformer/pipeline
coefficient = model.named_steps["linearregression"].coef_.round(2)
Then you’ll need to find the intercept:
#intercept of a model without a transformer
intercept = round(model.intercept_, 2)
#intercept of a model with a transformer
intercept = model.named_steps["linearregression"].intercept_.round(2)
Finally, we have a linear regression model of the form:
#print the linear regression equation determined by your model
print(
"house price USD = {} + ({}* house size) + ({}*lat) + ({}*lon)"
.format(intercept,
coefficient[0],
coefficient[1],
coefficient[2])
)
The size of the coefficients will determine the degree to which a feature affects the target (house prices in USD).
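Beyond reading coefficients, you can use the model directly for point predictions. Here’s a sketch predicting the price of a hypothetical 120 m² house; the size and coordinates below are made-up values:
#predict the price of a new house (the feature values are hypothetical)
import pandas as pd
new_house = pd.DataFrame([[120, -1.29, 36.82]], columns=["house_size_m2", "lat", "lon"])
predicted_price = model.predict(new_house)[0]
print("Predicted house price (USD):", round(predicted_price, 2))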
Interactive Dashboard or Chart
Visual representation of data is crucial for communication because it allows complex information to be conveyed quickly, clearly, and effectively. Utilizing visuals in your communication simplifies complex data, enhances clarity, facilitates memorability, and engages the audience.
A good practice when working with linear regression models is to plot them against their respective data sets.
That is, depending on the number of independent variables in your model, plot a 2D or a 3D scatter plot of your data with the fitted linear model overlaid.
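For instance, here’s a minimal 2D sketch using Matplotlib. For the sake of a flat plot, it fits an illustrative single-feature model on house_size_m2 alone; your full three-feature model would need a 3D plot instead:
#2D visualization: an illustrative model trained on house size alone
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
X_size = X_train[["house_size_m2"]] #single feature, kept two-dimensional for scikit-learn
simple_model = LinearRegression().fit(X_size, y_train)
plt.scatter(X_train["house_size_m2"], y_train, alpha=0.5, label="observed prices")
plt.plot(X_train["house_size_m2"], simple_model.predict(X_size), color="red", label="linear model")
plt.xlabel("house size (m2)")
plt.ylabel("house price (USD)")
plt.legend()
plt.show()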