Prediction of the Output Power of a Combined Cycle Power Plant using Machine Learning

Timothy Osirike · Published in Analytics Vidhya · Jul 6, 2020 · 9 min read

A Simple Introduction to Multiple Regression


Background

Single-cycle gas turbine power plants generate electricity using natural gas and compressed air. Air is drawn from the surroundings, compressed, and fed into the combustion chamber of the gas turbine, where natural gas is injected, mixed with the compressed air, and ignited. The combustion produces a high-pressure, hot gas stream that flows through the turbine, causing it to spin at tremendous speeds. This, in turn, spins a generator connected to the turbine to produce electricity.

In single-cycle gas turbines, much of the energy is wasted as hot exhaust, so they achieve an energy conversion efficiency of 35% at best. Combined cycle power plants address this inefficiency by capturing the waste heat with a heat recovery steam generator (HRSG) to produce even more power.

Combined cycle power plants are power generation plants that use both gas and steam turbines together to generate electricity. The waste heat from the gas turbine is used to produce steam, which is fed to a steam turbine to generate even more electricity. This increases the power produced (by up to 50%) for the same amount of fuel and raises the plant's efficiency to about 60%.

The output power of a combined cycle power plant (CCPP) depends on a few parameters: atmospheric pressure, exhaust steam pressure, ambient temperature, and relative humidity. Being able to predict the full-load electrical power output is important for the efficient and economic operation of the power plant. In this article, we will use machine learning to develop a model that predicts the full-load output power of a CCPP.

Combined Cycle Power Plant (Sourced from Planete energies, Total)

Objectives

  1. Develop a predictive model to predict full-load power output.
  2. Evaluate the performance of the model.

Tip: Since the goal is to predict the output power based on some parameters, this is a regression problem. Regression aims to establish a relationship between predictors (variables that help us make a prediction) and the target (the value we want to predict).

Dataset

For this article, we will use the dataset provided by Pinar Tufekci, available at the UCI Machine Learning Repository. The dataset was collected over a six-year period and comprises 9568 data points, recorded while the power plant was operating at full load.

https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Multiple Regression

Multiple regression is an extension of simple linear regression. It is a modeling method that uses multiple predictors or independent variables to predict a target variable or outcome.

The general form of multiple regression:

y = b0 + b1x1 + b2x2 + … + bnxn + ε
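
Here y is the target, x1 … xn are the predictors, b0 is the intercept, b1 … bn are the coefficients learned from the data, and ε is the error term. For our problem, this becomes PE = b0 + b1·AT + b2·V + b3·AP + b4·RH + ε.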

Our Workflow

  1. Exploratory Data Analysis
  2. Develop the model
  3. Evaluate the model
  4. Select the best model

1. Exploratory Data Analysis

First, we will load the data after importing the Python libraries we need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read_csv already returns a DataFrame, so a separate pd.DataFrame call is not needed
df = pd.read_csv('CCPP.csv')

Next, we explore the data to get a feel for it.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
AT 9568 non-null float64
V 9568 non-null float64
AP 9568 non-null float64
RH 9568 non-null float64
PE 9568 non-null float64
dtypes: float64(5)
memory usage: 373.9 KB

We can see that this dataset consists of 5 numerical variables. There are no missing values (9568 non-null entries in every column), and each column is a floating-point number (float64). This is certainly good news, as we have a clean dataset.

Next, we look at the distribution of the dataset.

df.describe()

Statistical details showing mean, min, max and std

The dataset consists of 4 hourly averaged features and the target variable, the output power (PE):

  • Ambient Temperature (AT): 1.81–37.11°C
  • Ambient Pressure (AP): 992.89–1033.30 millibar
  • Relative Humidity (RH): 25.56–100.16%
  • Exhaust Vacuum (V): 25.36–81.56 cm Hg
  • Net hourly electrical energy output (PE): 420.26–495.76 MW

Now that we have a feel for our dataset, we need to determine which features will help us predict the output power. Because the goal of regression is to create a mathematical model from the features to predict the target variable (PE), we need to select features that have a strong correlation with the target (i.e., high predictive power). A correlation matrix is useful for this.

A correlation matrix offers a structured way to rank the importance of the predictors, that is, to identify the input variables with the most impact on the output. To build one, we plot a heatmap of the correlation matrix using Seaborn.

import seaborn as sns
plt.figure(figsize = (7, 5))
sns.heatmap(df.corr(), annot = True)

Heatmap of Correlation matrix

Tip: Correlation is measured on a scale of -1 to 1, where -1 means a perfect negative correlation, 1 means a perfect positive correlation, and 0 means no correlation at all.

From the correlation matrix, we can see that AT and V have a strong negative correlation with the target variable (PE), with correlation coefficients of -0.95 and -0.87 respectively. AP and RH have a weak positive correlation with PE, with correlation coefficients of 0.52 and 0.39.
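
If you prefer numbers to colors, you can extract this ranking directly (a small convenience snippet, not part of the original walkthrough):

# Absolute correlation of each feature with PE, strongest first
df.corr()['PE'].drop('PE').abs().sort_values(ascending=False)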

We can also visualize the bivariate distributions of the dataset, which show how each feature relates to every other feature and to PE.

sns.set(style="ticks")
sns.pairplot(df, diag_kind='hist')

Bivariate distribution of the dataset

When visualized, we can easily see a distinctive pattern: AT and V each show a clear negative correlation with PE.

Caution: You will notice that AT and V are highly correlated with each other. This is usually undesirable, as our features should ideally be independent of one another; the problem is called multicollinearity. One way to address it is to keep only the feature that correlates more strongly with our target variable (PE), in this case AT (-0.95). In some cases, we can choose to live with the problem and use all the features anyway.
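
If you would rather quantify multicollinearity than eyeball it, the variance inflation factor (VIF) is a common diagnostic. Below is a minimal sketch using statsmodels (an extra dependency, not used elsewhere in this article); as a rough rule of thumb, a VIF above about 5–10 signals problematic collinearity:

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df.drop('PE', axis=1)
# One VIF per feature: how much its variance is inflated by the other features
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))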

2. Develop the model

Here we will develop several regression models using different machine learning algorithms and different combinations of features. I have decided to use the Linear Regression, Decision Tree Regression, and Random Forest Regression algorithms on this dataset.

Feature Selection

A critical part of the success of a machine learning project is coming up with a good set of features or predictors to train on. Feature selection involves choosing the most useful features to train on from among the existing features.

First, let's create 4 different combinations of features and train models with each of the 3 regression algorithms.

Model 1: We select only AT as the predictor, since it has the strongest correlation with the target variable (PE).

df_1 = df[['AT']]  # double brackets keep a 2-D DataFrame, which scikit-learn expects

Model 2: We select AT and V as the predictors

df_2 = df[['AT', 'V']]

Model 3: We select AT, V and RH as the predictors

df_3 = df[['AT', 'V', 'RH']]

Model 4: We select AT, V, AP, and RH as the predictors

df_4 = df[['AT', 'V', 'AP', 'RH']]

Alternative: df_4 = df.drop(['PE'], axis=1)

Our target variable (PE) is y

y = df['PE']

Now that we have finished our feature selection, it is time to train our ML models.

Training the model

Before we actually train a machine learning model, we first need to split our data into a training set and a test set. The training set is used to fit the mathematical model of the relationship between the features and the target variable, while the test set is used to validate the model. To do this, we will import the train_test_split function from the scikit-learn library.

We will split the dataset into an 80% training set and a 20% test set (a common convention, sometimes justified by appeal to the Pareto principle).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_1, y, test_size = 0.2, random_state = 0)

This step is repeated for each of the other models by replacing df_1 with the corresponding feature set, for instance df_4 for Model 4. (A compact way to prepare all four splits at once is sketched below.)
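
As an aside, here is one way to prepare the splits for all four feature sets in a single pass (a convenience sketch, not how the rest of this article proceeds):

# Map each model name to its feature set, then split each one identically
feature_sets = {
    'Model 1': df[['AT']],
    'Model 2': df[['AT', 'V']],
    'Model 3': df[['AT', 'V', 'RH']],
    'Model 4': df[['AT', 'V', 'AP', 'RH']],
}
splits = {name: train_test_split(X, y, test_size=0.2, random_state=0)
          for name, X in feature_sets.items()}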

Finally, we have gotten to the exciting part, where we implement the regression algorithms and develop our predictive models: ordinary least squares (OLS) Linear Regression, Decision Tree Regression, and Random Forest Regression. First, we import LinearRegression from the scikit-learn library and train the model.

For Linear Regression

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
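
Once fitted, the model's intercept and coefficients correspond to b0 and b1 … bn in the general form shown earlier; a quick, optional inspection:

print(regressor.intercept_)  # b0, the intercept
print(regressor.coef_)       # one coefficient per feature in X_train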

Now that we have trained our model on the training set, we are ready to make predictions on the test set (data the model has never seen before).

y_pred = regressor.predict(X_test)

The code implementation for Decision Tree Regression and Random Forest Regression can be seen below:

For Decision Tree Regression

from sklearn.tree import DecisionTreeRegressor
dt_regressor = DecisionTreeRegressor()
dt_regressor.fit(X_train, y_train)

For Predictions

y_pred = dt_regressor.predict(X_test)

For Random Forest Regression

from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor()
rf_regressor.fit(X_train, y_train)

For Predictions

y_pred = rf_regressor.predict(X_test)

A random forest is an ensemble algorithm that fits many decision trees on various sub-samples of the dataset and averages their predictions. This improves predictive accuracy and helps control over-fitting.
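
To make the averaging idea concrete, here is a small illustrative check: a regression forest's prediction is exactly the mean of its individual trees' predictions.

# Stack each tree's predictions, then average across trees
tree_preds = np.stack([tree.predict(X_test) for tree in rf_regressor.estimators_])
print(np.allclose(tree_preds.mean(axis=0), rf_regressor.predict(X_test)))  # True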

Brilliant! We have now created predictive models and made predictions with them. It is time to evaluate how well they have performed.

3. Performance evaluation

Whenever a machine learning model is developed, it is important to evaluate its performance to ensure it is yielding useful outputs and not overfitting. For regression problems, there are 3 key performance metrics used to assess how well your model is performing. These are:

  1. Root Mean Squared Error (RMSE): the square root of the average squared difference between predicted and actual values; it measures the typical size of the model's prediction error, in the same units as the target.

Tip: The lower the RMSE score the better

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE
rmse

2. R-Squared: the proportion of the variation in the target variable that can be explained by the set of features used to train the model.

Tip: The higher the R-squared score the better

from sklearn.metrics import r2_score
r_squared = r2_score(y_test, y_pred)
r_squared

3. Mean Absolute Error (MAE): the average absolute difference between the predicted and the actual values.

Tip: The lower the MAE value the better

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
mae
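
To avoid repeating these three snippets for every model and algorithm, you could wrap them in a small helper (a convenience sketch, not part of the original code):

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

def evaluate(y_test, y_pred):
    # Return all three regression metrics for one model's predictions
    return {'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
            'R-squared': r2_score(y_test, y_pred),
            'MAE': mean_absolute_error(y_test, y_pred)}

print(evaluate(y_test, y_pred))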

Table showing Evaluation metrics for the Learning algorithms

Now that we have evaluated all our models, we can see that the Random Forest Regression algorithm with Model 4 (all features) gave us the best performance. The R-squared is 0.9644, which means that 96.44% of the variation in the target variable PE can be explained by the model.

It also yields the lowest RMSE, 3.1891.

We can conclude from our results that Model 4 with Random Forest Regression should be selected.

Endnotes

At the beginning of this article, we set out to develop a predictive model for full-load output power (PE) based on the dataset provided. We explored the dataset to check for missing values and other problems, then tried 4 feature subsets with 3 different machine learning regression algorithms. We found that using the complete set of features with the Random Forest Regression algorithm yielded the best results: an R-squared of 0.9644 and an RMSE of 3.1891.

Hope you found this useful. All the best!
