Predicting Weather Temperature Change Using Machine Learning Models

Published in The Startup · 14 min read · Aug 28, 2020


A Practical Machine Learning Workflow Example

Problem Introduction

The problem we will tackle is predicting the average global land and ocean temperature using over 100 years of past weather data. We are going to act as if we don’t have access to any weather forecasts. What we do have access to is a century’s worth of historical global temperature averages, including global maximum temperatures, global minimum temperatures, and global land and ocean temperatures. Having all of this, we know that this is a supervised, regression machine learning problem.

It’s supervised because we have both the features and the target that we want to predict. During training, we will give multiple regression models both the features and the targets, and they must learn how to map the data to a prediction. Moreover, this is a regression task because the target value is continuous (as opposed to discrete classes in classification).

That’s pretty much all the background we need, so let’s start!

ML Workflow

Before we jump right into programming, we should outline exactly what we want to do. The following steps are the basis of my machine learning workflow now that we have our problem and model in mind:

  1. State the question and determine the required data (completed)
  2. Acquire the data
  3. Identify and correct missing data points/anomalies
  4. Prepare the data for the machine learning model by cleaning/wrangling
  5. Establish a baseline model
  6. Train the model on the training data
  7. Make predictions on the test data
  8. Compare predictions to the known test set targets and calculate performance metrics
  9. If performance is not satisfactory, adjust the model, acquire more data, or try a different modeling technique
  10. Interpret model and report results visually and numerically

Data Acquisition

First, we need some data. To use a realistic example, I retrieved temperature data from the Berkeley Earth Climate Change: Earth Surface Temperature Dataset. Because Berkeley Earth is a well-regarded source of climate data, we will assume the data in the dataset is accurate.

Dataset link:

After importing some important libraries and modules, the code below loads in the CSV data which I store into a variable we can use later:
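The original snippet is not reproduced here, so the following is only a minimal sketch of the loading step. The helper name load_global_temps, the file name in the comment, and the two sample rows are assumptions made for illustration; only the variable name globalTemp comes from later in this post.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV so the sketch runs on its own.
# The rows are made up and only mimic part of the real column layout.
SAMPLE_CSV = io.StringIO(
    "dt,LandAverageTemperature,LandAndOceanAverageTemperature\n"
    "1850-01-01,0.75,12.83\n"
    "1850-02-01,3.07,13.59\n"
)


def load_global_temps(source):
    """Read the temperature CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# With the real download this might be load_global_temps("GlobalTemperatures.csv"),
# where the file name is an assumption.
globalTemp = load_global_temps(SAMPLE_CSV)
print(globalTemp.head())
```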

Following are explanations of each column:

dt: starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures

LandAverageTemperature: global average land temperature in celsius

LandAverageTemperatureUncertainty: the 95% confidence interval around the average

LandMaxTemperature: global average maximum land temperature in celsius

LandMaxTemperatureUncertainty: the 95% confidence interval around the maximum land temperature

LandMinTemperature: global average minimum land temperature in celsius

LandMinTemperatureUncertainty: the 95% confidence interval around the minimum land temperature

LandAndOceanAverageTemperature: global average land and ocean temperature in celsius

LandAndOceanAverageTemperatureUncertainty: the 95% confidence interval around the global average land and ocean temperature

Identify Anomalies/Missing Data

Looking through the data (shown above) from Berkeley Earth, I noticed several missing data points, which is a great reminder that data collected in the real world will never be perfect. Missing data can impact analysis immensely, as can incorrect data or outliers.

To identify anomalies, we can quickly find missing values using the info() method on our DataFrame.

Also, we can chain the “.isnull()” and “.sum()” methods on our dataframe to find the total number of missing values in each column.
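As a rough illustration of both checks, here is a small sketch on a made-up dataframe. The values are invented; only the column names follow the dataset description above.

```python
import numpy as np
import pandas as pd

# Made-up rows: the early dates lack the land-and-ocean measurement.
df = pd.DataFrame({
    "dt": ["1750-01-01", "1750-02-01", "1850-01-01", "1850-02-01"],
    "LandAverageTemperature": [3.03, np.nan, 0.75, 3.07],
    "LandAndOceanAverageTemperature": [np.nan, np.nan, 12.83, 13.59],
})

df.info()                    # prints non-null counts and dtypes per column
missing = df.isnull().sum()  # Series with the number of missing values per column
print(missing)
```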

Data Preparation

Unfortunately, we aren’t quite at the point where we can just feed the raw data into a model and have it return an answer (although you could, it would not be the most accurate)! We will need to do some minor modification to put our data into machine-understandable terms.

The exact steps for preparation of the data will depend on the model used and the data gathered, but some amount of data manipulation will be required.

First things first, I will be creating a function called wrangle() that takes our dataframe as its argument.

We want to make a copy of the dataframe so we do not corrupt the original. After that, we are going to drop columns that have high cardinality.

High cardinality refers to columns whose values are mostly unique or very uncommon. Such columns are common in time-series datasets, so we address the problem directly by removing them from our dataset so they do not confuse our model later.

Next, inside our wrangle function, we are going to create a helper function called convertTemp(). This function is really just for my own eyes (and maybe yours): I am from the United States, where temperature is officially measured in Fahrenheit, and the dataset I used is measured in Celsius.

So, purely for convenience, and not because it will affect our model results or predictions in any way, I chose to apply that function to the remaining columns that hold Celsius temperatures:

Finally, the last step in our data wrangling function is to convert the dt (date) column to a datetime object. We then create columns for the month and year, and eventually drop the dt and Month columns.

Now, if you remember, we also saw missing values earlier in our dataset. As described for the dt column, LandAverageTemperature starts in 1750 while the other 4 columns we chose to keep in our wrangle function start in 1850.

So I think we will solve much of the missing value problem by just slicing the dataset by year, creating a new dataset that starts from the year 1850. We will also call dropna(), just in case there are any other missing values in our dataset:
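Putting the steps above together, the function might look roughly like the sketch below. This is a reconstruction from the description, not the original code: in particular, treating the uncertainty columns as the ones being dropped is an assumption, and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def wrangle(df):
    df = df.copy()  # work on a copy so the original stays untouched

    # Assumption: the dropped columns are the uncertainty columns.
    df = df.drop(columns=[c for c in df.columns if c.endswith("Uncertainty")])

    def convertTemp(celsius):
        """Convert Celsius to Fahrenheit."""
        return celsius * 1.8 + 32

    temp_cols = [c for c in df.columns if c.endswith("Temperature")]
    df[temp_cols] = df[temp_cols].apply(convertTemp)

    # Parse the date, derive month and year, then drop dt and Month.
    df["dt"] = pd.to_datetime(df["dt"])
    df["Month"] = df["dt"].dt.month
    df["Year"] = df["dt"].dt.year
    df = df.drop(columns=["dt", "Month"])

    # Keep 1850 onward, then drop any rows that still have missing values.
    df = df[df["Year"] >= 1850]
    return df.dropna()


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "dt": ["1750-01-01", "1850-01-01", "1850-02-01"],
    "LandAverageTemperature": [3.0, 0.0, 10.0],
    "LandAverageTemperatureUncertainty": [3.5, 1.1, 1.2],
    "LandAndOceanAverageTemperature": [np.nan, 10.0, np.nan],
})
clean = wrangle(raw)
print(clean)
```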

Let's see how it looks:

After calling our wrangle function on our globalTemp dataframe, we can now see a new cleaned-up version of our globalTemp dataframe, free of any missing values.

It looks like we are ready for the next steps: setting up our target and features, doing the train/test split, and establishing our baseline…

Quick Correlation Visualization

One thing I like to do when working with regression problems is to look at the cleaned dataframe and to see if we can truly use one column as our target and the others as our features.

One way I loosely determine that is by plotting a correlation matrix, just to get an understanding of how related each column is to each other:
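A correlation matrix can be computed with pandas alone. The sketch below uses synthetic columns that are deliberately built to move together, so the numbers are illustrative and not the real dataset's; the result could then be passed to a plotting library to draw a heatmap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
land_avg = rng.normal(50, 8, 200)

# Synthetic stand-in data: each column is the land average plus some noise.
df = pd.DataFrame({
    "LandAverageTemperature": land_avg,
    "LandMaxTemperature": land_avg + 10 + rng.normal(0, 1, 200),
    "LandMinTemperature": land_avg - 10 + rng.normal(0, 1, 200),
    "LandAndOceanAverageTemperature": 0.3 * land_avg + 45 + rng.normal(0, 0.3, 200),
})

corr = df.corr()  # pairwise Pearson correlation between columns
print(corr.round(2))
```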

Global Temps Correlation Matrix Plot

As we can see, and as some of you probably guessed, the columns we chose to keep moving forward are HIGHLY correlated with one another. So, just from glancing at this plot, we should expect the features to be strong predictors of the target.

Separating our Target From Our Features

Now, we need to separate the data into the features and the target. The target, also known as Y, is the value we want to predict, in this case the actual land and ocean average temperature. The features are all the columns (minus our target) that the model uses to make a prediction:
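In pandas this separation is a two-liner. A minimal sketch with made-up values, assuming the target column is named LandAndOceanAverageTemperature as in the column list earlier:

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "LandAverageTemperature": [33.3, 37.5, 40.6],
    "LandMaxTemperature": [46.8, 49.9, 50.6],
    "LandMinTemperature": [26.2, 27.1, 28.6],
    "LandAndOceanAverageTemperature": [55.1, 56.5, 57.3],
})

target = "LandAndOceanAverageTemperature"
Y = df[target]                 # target vector
X = df.drop(columns=[target])  # features matrix: everything except the target
print(X.shape, Y.shape)
```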

Creating Target Vector and Features Matrix

Train-Test Split

Now we are on the final step of the data preparation part of our ML workflow: splitting data into training and testing sets.

During training, we let the model ‘see’ the answers, in this case, the actual temperature, so it can learn how to predict the temperature from the features. As we know, there is a relationship between all the features and the target value, and the model’s job is to learn this relationship during training. Then, when it comes time to evaluate the model, we ask it to make predictions on a testing set where it only has access to the features (not the target)!

Generally, when training a regression model, we randomly split the data into training and testing sets to get a representation of all data points.

For example, if we trained the model on the first nine months of the year and then used the final three months for prediction, our algorithm would not perform well because it has not seen any data from those last three months.

Make sense?

The following code splits the data sets:

Train/Test Split Creation

We can look at the shape of all the data to make sure we did everything correctly. We expect the training features (X_train) to have the same number of columns as the testing features (X_val), and the number of rows to match between each set of features and its target:
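A random split and the shape check can be sketched with scikit-learn's train_test_split. The data here is random noise, and the test_size and random_state values are illustrative choices rather than the ones used in this post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["LandAverageTemperature", "LandMaxTemperature", "LandMinTemperature", "Year"],
)
Y = pd.Series(rng.normal(size=100), name="LandAndOceanAverageTemperature")

# Randomly hold out 25% of the rows for evaluation.
X_train, X_val, Y_train, Y_val = train_test_split(
    X, Y, test_size=0.25, random_state=42
)

print("Training features:", X_train.shape)
print("Training target:  ", Y_train.shape)
print("Testing features: ", X_val.shape)
print("Testing target:   ", Y_val.shape)
```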

The shape of each training and test set

It looks as if everything is in order! Just to recap, we:

  1. Got rid of missing values and unneeded columns
  2. Split data into features and target
  3. Split data into training and testing sets

These steps may seem tedious at first, but once you get the basic ML workflow, it will be generally the same for any machine learning problem. It’s all about taking human-readable data and putting it into a form that can be understood by a machine learning model.

Establish Baseline Mean Absolute Error

Before we can make and evaluate predictions, we need to establish a baseline, a sensible measure that we hope to beat with our model. If our model cannot improve upon the baseline, then it will be a failure and we should try a different model or admit that machine learning is not right for our problem.

The baseline prediction for our case will be the average temperature. In other words, our baseline is the error we would get if we simply predicted the mean of our training target (Y_train) every time.

To find this mean absolute error (MAE) very easily, we can import the mean_absolute_error function from the scikit-learn library, which will calculate it for us:
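Here is a minimal sketch of that baseline on four made-up target values, chosen so the arithmetic is easy to check by hand: the mean is 61, the absolute errors are 3, 1, 1, and 3, so the MAE is 2.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up training targets (degrees Fahrenheit).
Y_train = np.array([58.0, 60.0, 62.0, 64.0])

# The baseline "model" predicts the training mean for every row.
baseline_pred = np.full_like(Y_train, Y_train.mean())
baseline_mae = mean_absolute_error(Y_train, baseline_pred)

print(f"Baseline MAE: {baseline_mae:.2f}")  # prints "Baseline MAE: 2.00"
```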

Baseline Mean Absolute Error

We now have our goal! If we can’t beat an average error of 2 degrees, then we need to rethink our approach.

Train Model

After all the work of data preparation, creating and training the model is pretty simple using scikit-learn. For this problem, we could try a multitude of models, but in this situation, we are going to use two different models: a Linear Regression model and a Random Forest Regressor model.

Linear Regression Model

Linear regression is a statistical approach that models the relationship between input features and output. Our goal here is to predict the value of the output based on the input features.

In the code below, I created what is called a pipeline, which allows stacking multiple processes into a single scikit-learn estimator. Here the only processes we are using are StandardScaler(), which subtracts the mean from each feature and then scales it to unit variance, and of course LinearRegression():
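A pipeline of this kind can be sketched with make_pipeline. The training data below is synthetic, with a roughly linear relationship baked in, and the name lr_model is just an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: the target is a noisy linear mix of three features.
rng = np.random.default_rng(0)
X_train = rng.normal(50, 8, size=(200, 3))
Y_train = X_train @ np.array([0.2, 0.1, 0.1]) + 40 + rng.normal(0, 0.2, 200)

# Scale each feature to zero mean and unit variance, then fit a linear model.
lr_model = make_pipeline(StandardScaler(), LinearRegression())
lr_model.fit(X_train, Y_train)

print(f"Training R^2: {lr_model.score(X_train, Y_train):.3f}")
```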

Linear Regression pipeline

Random Forest Regressor Model

A Random Forest is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique commonly known as bagging.

The basic idea behind bagging is to combine multiple decision trees in determining the final output rather than relying on individual decision trees.
Random Forest has multiple decision trees as base learning models. We randomly perform row sampling and feature sampling from the dataset forming sample datasets for every model:

Random Forest Regressor Pipeline

A little information on what’s going on in the code snippet above:

n_estimators represents the number of trees in the random forest.

max_depth represents the maximum depth of each tree in the forest. The deeper the tree, the more splits it has and the more information it captures about the data.

n_jobs refers to the number of cores the regressor will use. -1 means it will use all cores available to run the regressor.

SelectKBest scores each feature with a scoring function and keeps the k highest-scoring ones. In this case, I chose to keep all the features.
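For reference, a pipeline using the parameters described above might look like the following sketch. The hyperparameter values, the f_regression scoring function, and the synthetic data are all assumptions for illustration, not the settings used in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline

# Synthetic training data: the target is a noisy linear mix of three features.
rng = np.random.default_rng(0)
X_train = rng.normal(50, 8, size=(200, 3))
Y_train = X_train @ np.array([0.2, 0.1, 0.1]) + 40 + rng.normal(0, 0.2, 200)

rf_model = make_pipeline(
    # Score every feature against the target; k="all" keeps all of them.
    SelectKBest(score_func=f_regression, k="all"),
    # Illustrative settings: 100 trees, depth capped at 10, all cores.
    RandomForestRegressor(n_estimators=100, max_depth=10, n_jobs=-1, random_state=42),
)
rf_model.fit(X_train, Y_train)

print(f"Training R^2: {rf_model.score(X_train, Y_train):.3f}")
```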

After creating our pipelines and having fit our training data into our pipeline models, we now need to make some predictions.

Make Predictions on the Test Set

Our model has now been trained to learn the relationships between the features and the targets. The next step is figuring out how good the model is! To do this we make predictions on the test features and compare the predictions to the known answers.

When evaluating regression predictions, we need to make sure to use the absolute error because we expect some of our predictions to be too low and some to be too high, and those errors would otherwise cancel out. We are interested in how far away our predictions are, on average, from the actual values, so we take the absolute value (as we also did when establishing the baseline earlier in this blog):
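The evaluation step can be sketched as below. The data is synthetic and the models are plain estimators rather than the pipelines from earlier, so the printed errors will not match the figures reported in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target is a noisy linear mix of three features.
rng = np.random.default_rng(0)
X = rng.normal(50, 8, size=(300, 3))
Y = X @ np.array([0.2, 0.1, 0.1]) + 40 + rng.normal(0, 0.2, 300)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

maes = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    y_pred = model.predict(X_val)  # the model only sees the features here
    maes[name] = mean_absolute_error(Y_val, y_pred)
    print(f"{name} MAE: {maes[name]:.2f}")
```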

Linear Regression MAE

Let’s look at our Random Forest Regressor MAE:

Our average temperature prediction is off by 0.28 degrees for Linear Regression and 0.24 degrees for Random Forest. That is an improvement of roughly 1.75 to 1.8 degrees over the baseline of 2.03 degrees.

Put differently, the error is about 86% to 88% lower than the baseline, which, depending on the field and the problem, could represent millions of dollars to a company.

Determine Performance Metrics

To put our predictions in perspective, we can calculate an accuracy as 100% minus the mean absolute percentage error (MAPE).
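This metric can be written out in a few lines of NumPy. The helper name accuracy_from_mape and the three sample values are made up for illustration; note that the percentage error is undefined if any true value is zero.

```python
import numpy as np


def accuracy_from_mape(y_true, y_pred):
    """Return 100 minus the mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))
    return 100 - mape


# Percentage errors are 2%, 1%, and 0%, so MAPE is 1% and accuracy is 99%.
y_true = [50.0, 60.0, 80.0]
y_pred = [49.0, 60.6, 80.0]
accuracy = accuracy_from_mape(y_true, y_pred)
print(f"Accuracy: {accuracy:.1f}%")
```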

Linear Regression Test/Train Accuracy:

Random Forest Regressor Train/Test Accuracy:

Judging by these metrics, both models perform well and should give accurate predictions (y_pred) for a new set of records.

Both models have learned to predict the average land and ocean temperature with about 99% accuracy.


Model Tuning

In many machine learning workflows, we would stop here after achieving 99% accuracy. But in most cases, as I stated before, the dataset would not be as clean, and this is when we would start hyperparameter tuning the model.

Hyperparameter tuning is a complicated phrase that means “adjust the settings to improve performance”. The most common way to do this is to simply make a bunch of models with different settings, evaluate them all on the same validation set, and see which one does best.

An accuracy of 99% is obviously satisfactory for this problem, but it is known that the first model built will almost never be the model that makes it to production. So let us try to reach 100% accuracy if that is possible.


In the beginning, I wanted to use GridSearchCV to tune my model, but GridSearchCV can be computationally expensive, especially when searching over a large hyperparameter space with multiple hyperparameters.

A cheaper alternative is random search, available in scikit-learn as RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.

Since we only have 5 columns in total, there is really no need for us to use RandomizedSearchCV, but for blogging purposes, we will see how to use it to tune a model.

Let’s see if we have any gains in our prediction accuracy score and MAE:

RandomizedSearchCV pipeline

A little information on the code snippet above:

n_iter: the number of iterations. Each iteration trains a new model on a new draw from the dictionary of hyperparameter distributions.

param_distributions: the parameters and the distributions to sample from

cv: the number of cross-validation folds, here 10. It determines how many times each candidate model is trained, each time on a different subset of the data.

n_jobs: the number of cores the search will use. -1 means it will use all available cores.

best_estimator_: an attribute holding an instance of the specified model type, refit with the best combination of parameters found among those sampled from the params variable
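To make these parameters concrete, here is a rough sketch of a randomized search over a random forest. It tunes a bare RandomForestRegressor rather than a pipeline, and the data, search ranges, n_iter value, and scoring choice are assumptions for illustration. n_jobs=-1 could also be passed to the search to parallelize it.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data: the target is a noisy linear mix of three features.
rng = np.random.default_rng(0)
X = rng.normal(50, 8, size=(200, 3))
Y = X @ np.array([0.2, 0.1, 0.1]) + 40 + rng.normal(0, 0.2, 200)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, random_state=42)

# Distributions to sample from; the ranges are illustrative.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(3, 12),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled hyperparameter settings
    cv=10,     # 10-fold cross-validation for each setting
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X_train, Y_train)

best_model = search.best_estimator_  # refit on all training data by default
tuned_mae = mean_absolute_error(Y_val, best_model.predict(X_val))
print(search.best_params_)
print(f"Tuned MAE: {tuned_mae:.2f}")
```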

We then use the best set of hyperparameter values chosen in the RandomizedSearchCV in the actual model which we named best_model as shown:

RandomizedSearchCV MAE

As suspected, after calling the predict method on our best_model, we can see that RandomizedSearchCV gives the same prediction results and accuracy score as our Random Forest Regressor model earlier.

Although there was no need for it here, we have seen how hyperparameter tuning could help improve model scores if needed.


Partial Dependence Plots

PDPbox is a partial dependence plot toolbox written in Python. The goal of pdpbox is to visualize the impact of certain features towards model prediction for any supervised learning algorithm.

The problem is when using machine learning algorithms like random forest, it is hard to understand the relations between predictors and model outcomes. For example, in terms of random forest, all we get is the feature importance. Although we can know which feature is significantly influencing the outcome based on the importance calculation, we really don’t know in which direction it is influencing.

This is where PDPbox comes into play:

A little background on what’s going on in the code above:

feature: the feature column whose effect on the model prediction (our target) we want to examine

isolated: the result of calling pdp_isolate, which computes the partial dependence for a single feature, hence the name isolated

All other arguments should be self-explanatory.
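The PDPbox code is not reproduced here. As a stand-in that shows the same idea, the sketch below uses scikit-learn's partial_dependence function instead of PDPbox, on synthetic data where the target rises with LandAverageTemperature by construction. The feature values and model settings are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

# Synthetic data: the target increases with both features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "LandAverageTemperature": rng.uniform(30, 60, 300),
    "LandMaxTemperature": rng.uniform(40, 70, 300),
})
Y = (
    0.3 * X["LandAverageTemperature"]
    + 0.1 * X["LandMaxTemperature"]
    + 40
    + rng.normal(0, 0.2, 300)
)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, Y)

# Average prediction as one feature varies over a 10-point grid,
# with the other feature averaged out.
pd_result = partial_dependence(
    model, X, features=["LandAverageTemperature"],
    kind="average", grid_resolution=10,
)
avg_prediction = pd_result["average"][0]
print(avg_prediction.round(2))
```

An increasing sequence of averaged predictions corresponds to an upward-sloping partial dependence curve, which tells us the direction of the feature's influence and not just its importance.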

Now let us look at our plot:

From this plot, we can see that as LandAverageTemperature increases, the predicted land and ocean temperature tends to increase.

We also created another PDPbox plot in which we used two features (LandMinTemperature and LandMaxTemperature) to see how they jointly affect the model’s prediction of our target column (LandAndOceanAverageTemperature):

From this plot, we can see the same result. As LandMaxTemperature and LandMinTemperature increase, the predicted target, the land and ocean average temperature, tends to increase.


We have now completed an entire end-to-end machine learning example!

At this point, if we want to improve our model, we could try different hyperparameters (with RandomizedSearchCV, or something new like GridSearchCV), try a different algorithm, or, often the best approach of all, just gather more data! The performance of a model generally depends on how much data it can learn from, and we were using a fairly limited amount of information for training.

I hope everyone who made it through has seen how accessible machine learning has become and how it can be used.

Until next time, my friends…


