Unraveling Predictive Patterns for CHP in Conventional Power Plants Through Data-Driven Insight

Elsa Saji
Nov 15, 2023


A conventional power plant is a facility that generates electricity through traditional methods, typically involving the combustion of fossil fuels like coal, natural gas, or oil. These plants produce electricity by converting heat energy into mechanical energy, which is then transformed into electrical energy. In this context, CHP (Combined Heat and Power) refers to a more efficient approach where the plant not only generates electricity but also captures and utilizes the resulting waste heat for heating or industrial processes. This integrated system enhances overall energy efficiency, making CHP an environmentally friendly and economically viable energy solution.

In the dynamic world of energy production, leveraging data-driven strategies to optimize the performance of conventional power plants is paramount. In this blog post, I am thrilled to share a significant milestone in my journey — a comprehensive exploration into feature engineering and predictive modeling within the context of conventional power plants.

This analysis revolves around a meticulous examination of a dataset sourced from conventional power plants, a reservoir of information with the potential to unveil critical patterns and insights. The primary focus of this endeavor is to predict the presence of Combined Heat and Power (CHP) within these plants, a crucial factor influencing their overall efficiency.

As an integral part of this exploration, I delve into the intricate process of feature engineering, where raw data is refined, transformed, and augmented to extract more meaningful features. The ultimate aim is to enhance the dataset’s predictive capabilities, enabling the development of robust models capable of discerning whether CHP is operational or not.

About the dataset

The master dataset I used had 909 rows and 38 variables, which are:

  • id
  • name_bnetza
  • block_bnetza
  • name_uba
  • company
  • street
  • postcode
  • city
  • state
  • country
  • capacity_net_bnetza
  • capacity_gross_uba
  • energy_source
  • technology
  • chp
  • chp_capacity_uba
  • commissioned
  • commissioned_original
  • retrofit
  • shutdown
  • status
  • type
  • lat
  • lon
  • eic_code_plant
  • eic_code_block
  • efficiency_data
  • efficiency_source
  • efficiency_estimate
  • energy_source_level_1
  • energy_source_level_2
  • energy_source_level_3
  • eeg
  • network_node
  • voltage
  • network_operator
  • merge_comment
  • comment

Cleaning, Filtering and Formatting Data

Upon acquiring the dataset, I faced challenges: missing values, inconsistencies, and diverse data types. These issues made immediate analysis or model development infeasible. Given the pivotal role of data integrity, a rigorous cleaning and preprocessing phase was imperative. This foundational step ensures reliable analyses and model construction: through systematic refinement, the dataset's potential for meaningful insights and accurate predictions is unlocked, underscoring that high-quality data is indispensable for successful analytics.

The first step was to remove columns, creating a column-filtered dataset. Columns where the majority of observations were missing, or that were insignificant for the analysis, were removed. These included:

id, block_bnetza, name_uba, street, capacity_gross_uba, chp_capacity_uba, retrofit, shutdown, type, eic_code_plant, eic_code_block, efficiency_data, efficiency_source, energy_source_level_3, network_node, merge_comment and comment

Now there are 21 columns remaining.

The next step was to remove rows, creating a row-filtered dataset. Rows that had a large number of missing values across the remaining variables were removed, leaving 783 observations for the analysis. The dataset is now free of missing values.

Before analysis, each column must be formatted to a single data type, so every column was made strictly either text or numeric. This gives us the processed data for analysis.
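
To make these steps concrete outside of any particular tool, here is a minimal pandas sketch of the same pipeline. The file name and the text/numeric split are my own assumptions; the column lists come from the dataset described above.

```python
import pandas as pd

# Load the master data: 909 rows x 38 columns (file name is an assumption)
df = pd.read_csv("conventional_power_plants.csv")

# Step 1 - column filter: drop mostly-missing or insignificant columns
cols_to_drop = [
    "id", "block_bnetza", "name_uba", "street", "capacity_gross_uba",
    "chp_capacity_uba", "retrofit", "shutdown", "type", "eic_code_plant",
    "eic_code_block", "efficiency_data", "efficiency_source",
    "energy_source_level_3", "network_node", "merge_comment", "comment",
]
df = df.drop(columns=cols_to_drop)          # 21 columns remain

# Step 2 - row filter: drop rows that still have missing values
df = df.dropna()                            # 783 observations remain

# Step 3 - formatting: make each column strictly numeric or text
numeric_cols = ["capacity_net_bnetza", "commissioned", "commissioned_original",
                "lat", "lon", "efficiency_estimate"]
for col in df.columns:
    df[col] = pd.to_numeric(df[col]) if col in numeric_cols else df[col].astype(str)
```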

Analysis

I ran a predictive analysis on the data using the PredictEasy tool (to learn more about how the tool works, refer to my previous blog posts) to predict whether a power plant uses CHP or not. The tool provides a real-time interface for changing a model's inputs and viewing the output: it shows predictions, confidences, and explanations for those inputs, helping us decide whether a power plant is efficient. Essentially, we just feed in the required data and the model predicts whether the plant uses CHP.

The dataset with 783 observations and 21 variables was used as-is for the analysis, without any up-sampling or down-sampling, since it had an almost equal number of positive and negative cases for the prediction variable chp.
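
You can verify that balance directly on the cleaned dataframe from the sketch above:

```python
# Roughly even split between CHP and non-CHP plants, so no resampling needed
print(df["chp"].value_counts(normalize=True))
```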

Initially, all the variables except chp were used as the features (the independent X variables), with chp as the prediction variable (the dependent Y variable). I ran a classification model using PredictEasy and obtained the following results.
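
PredictEasy handles this step through its no-code interface, so the snippet below is only an illustrative scikit-learn equivalent of fitting such a baseline classifier. The random forest and the 80/20 stratified split are my assumptions, not a description of PredictEasy's internals.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Separate the features (X) from the prediction target (y = chp)
X = df.drop(columns=["chp"])
y = df["chp"]

# Encode the text columns as integers so the model can consume them
text_cols = X.select_dtypes(include="object").columns
X[text_cols] = OrdinalEncoder().fit_transform(X[text_cols])

# Hold out a test set to estimate accuracy on unseen plants
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(f"Baseline accuracy: {model.score(X_test, y_test):.2f}")
```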

This gives us a baseline to keep in mind: the fine-tuned final prediction model should reach an accuracy close to or greater than 83% to be considered a good one. The confusion matrix for this model shows that the majority of the classification predictions were correct.

However, a few of the features did not contribute to the predictive model and received a very low feature rank.

Feature Rank Plot

Variables such as country and postcode can be removed as part of feature engineering to isolate the features that genuinely drive the model; fine-tuning the feature set this way improves the model's efficiency.

One word of caution: when removing variables, do not blindly drop everything that has a low feature rank or little apparent impact on the model's output. This step calls for a thorough understanding of the variables and the domain. Sometimes a variable that shows no significance initially ends up being the most significant one in the final model.
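
Continuing the illustrative scikit-learn model above, a comparable feature ranking can be read straight from the fitted estimator. In line with the caution just mentioned, the low scorers are printed as candidates for review rather than dropped automatically:

```python
import pandas as pd

# Rank features by importance, mirroring the feature rank plot
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Flag near-zero contributors for manual review (threshold is arbitrary)
low_rank = importances[importances < 0.01].index.tolist()
print("Candidates to review for removal:", low_rank)
```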

After multiple iterations of removing features and fine-tuning the data at each step, the final prediction model was obtained. The model showed the following results:

The predictive model achieved an accuracy of 0.83, meaning it correctly classified 83% of the instances. This suggests the model performs well in predicting the target variable.

The precision of the model is 0.84, which implies that when the model predicts a plant has CHP, it is correct 84% of the time. The model therefore does a good job of avoiding false positives.

The recall of the model is 0.83, meaning it correctly identifies 83% of the plants that actually have CHP, so it captures the positive cases well.

The F1 score of the model is 0.83. As the harmonic mean of precision and recall, it indicates good overall performance in predicting the target variable.

These results were supported by the ROC plot and Confusion Matrix of the model.
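
For the illustrative model above, the same metrics and confusion matrix can be reproduced with scikit-learn's metrics module. Weighted averaging is my assumption here; it works regardless of how the chp labels are encoded.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_pred = model.predict(X_test)

print(f"Accuracy : {accuracy_score(y_test, y_pred):.2f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.2f}")
print(f"Recall   : {recall_score(y_test, y_pred, average='weighted'):.2f}")
print(f"F1 Score : {f1_score(y_test, y_pred, average='weighted'):.2f}")
print(confusion_matrix(y_test, y_pred))
```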

We can see that only a small fraction of the observations tested were misclassified, and the final model's accuracy is very close to that of the initial model.

The final model retained eeg, energy_source, capacity_net_bnetza, and status as the significant variables impacting the model output, as shown in the XAI plot below.

The ranking of these features is also shown here:

Conclusion

Based on the feature scores, the most important feature for predicting chp is eeg, with a score of 0.52. (In this dataset, eeg flags whether a plant is remunerated under Germany's Renewable Energy Sources Act, the EEG.) This suggests that EEG status plays a significant role in determining whether a plant runs CHP. The second most important feature is status, with a score of 0.20, giving it a moderate impact on the prediction. The feature capacity_net_bnetza has a score of 0.18, indicating that it also contributes, though to a lesser extent. Lastly, energy_source has the lowest score, 0.10, suggesting it has the least influence on the prediction of chp.

For a more example-oriented understanding of each variable's effect, we can use the real-time interface provided by PredictEasy, which gives predictions, confidences, and explanations for any set of inputs.
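
As a rough code analogue of that interface, you can score a single plant and inspect the model's class probabilities. Here I reuse one held-out row as a stand-in for user-entered inputs:

```python
# Score one plant and report the predicted class with its confidence
sample = X_test.iloc[[0]]            # stand-in for manually entered inputs
prediction = model.predict(sample)[0]
confidence = model.predict_proba(sample).max()
print(f"CHP prediction: {prediction} (confidence: {confidence:.0%})")
```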

Based on these insights, the Business team can take the following actions:

  1. Focus on the EEG Flag: Given its high score, the eeg feature should be given special attention. Further analysis and exploration of this feature may provide valuable insights for improving the predictive model.
  2. Consider Status Information: The “status” feature is also important in predicting the target variable. Understanding the different statuses and their impact on the outcome can help in refining the model.
  3. Evaluate Capacity and Energy Source: The “capacity_net_bnetza” and “energy_source” features, although less influential, should still be considered. Analyzing their relationship with the target variable can provide additional insights and potential improvements to the model.

By focusing on these key features and their impact on the prediction, the Business team can make informed decisions and take actions to enhance the model’s performance.
