Predict Electrical Energy Output of a Combined Cycle Power Plant Using Machine Learning & AI

Combined Cycle Power Plant

A combined cycle power plant is an assembly of heat engines that work in tandem from the same source of heat, converting it into mechanical energy. The principle is that after completing its cycle in the first engine, the working fluid (the exhaust) is still hot enough that a second subsequent heat engine can extract energy from the heat in the exhaust.

Introduction

A small dataset is provided with the following variables:

- Temperature (AT) in the range 1.81°C to 37.11°C

- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar

- Relative Humidity (RH) in the range 25.56% to 100.16%

- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg

- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW (the target we are trying to predict)


PE is the dependent variable to be predicted. Since it is a continuous numeric variable, this is a regression problem. My solution therefore uses linear regression models (ordinary linear regression, lasso, and ridge) and tree-based regressors (decision tree and random forest).
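As a rough sketch of this framing, the snippet below separates the four inputs from the target and holds out a test set. It assumes pandas and scikit-learn, and the column names AT, V, AP, RH, and PE are assumed labels. The helper `split_ccpp` is hypothetical, and the demo frame is filled with random placeholder values drawn from the ranges listed above, not with the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["AT", "V", "AP", "RH"]  # assumed column names
TARGET = "PE"                       # assumed column name

def split_ccpp(df, test_size=0.2, seed=0):
    """Hypothetical helper: return X_train, X_test, y_train, y_test."""
    X = df[FEATURES]
    y = df[TARGET]
    return train_test_split(X, y, test_size=test_size, random_state=seed)

# Placeholder frame: uniform random values inside the documented ranges.
# This is NOT the real dataset; load your own file here instead.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "AT": rng.uniform(1.81, 37.11, n),
    "V": rng.uniform(25.36, 81.56, n),
    "AP": rng.uniform(992.89, 1033.30, n),
    "RH": rng.uniform(25.56, 100.16, n),
    "PE": rng.uniform(420.26, 495.76, n),
})

X_train, X_test, y_train, y_test = split_ccpp(demo)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the split reproducible, so every model is later scored on the same held-out rows.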

Here is my proposed plan to tackle this problem from end to end. 

Data Exploration


The scatter plot shows strong correlation between PE and the independent variables. It also reveals some correlations among the independent variables themselves.
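The relationships seen in a scatter plot can be quantified with a Pearson correlation matrix. The sketch below uses NumPy only. The helper `correlation_table` is hypothetical, and the three arrays are synthetic stand-ins built to be correlated, so the printed numbers say nothing about the actual dataset.

```python
import numpy as np

def correlation_table(columns):
    """Pearson correlation matrix for a dict of name -> 1-D array."""
    names = list(columns)
    data = np.vstack([np.asarray(columns[k], dtype=float) for k in names])
    return names, np.corrcoef(data)  # each row of `data` is one variable

# Synthetic stand-in data (assumption): V is built to track AT,
# and PE is built to fall as AT and V rise.
rng = np.random.default_rng(1)
at = rng.uniform(2, 37, 500)
v = 25 + 1.4 * at + rng.normal(0, 6, 500)
pe = 497 - 1.7 * at - 0.3 * v + rng.normal(0, 4, 500)

names, corr = correlation_table({"AT": at, "V": v, "PE": pe})
for name, row in zip(names, corr):
    print(name, np.round(row, 2))
```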


Modeling and Comparison

The correlation matrix shows strong correlation among the independent variables as well. For example, AT and V have a correlation of 84%, and AP and RH are each correlated with AT. This indicates that we might be facing a multicollinearity issue. To keep the model simple and interpretable, I will skip PCA transformation at this point and instead use lasso and ridge regularization to handle the overfitting issue.
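One common way to put a number on multicollinearity is the variance inflation factor (VIF), which for each feature equals the matching diagonal entry of the inverse correlation matrix. This check is an addition for illustration, not something the analysis above reports. The sketch uses NumPy only, the function name `vif` is hypothetical, and the data is synthetic.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features).

    Uses the identity VIF_j = [inverse of the correlation matrix]_jj.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example (assumption): x2 nearly duplicates x1, x3 is independent.
rng = np.random.default_rng(2)
x1 = rng.normal(size=1000)
x2 = x1 + 0.3 * rng.normal(size=1000)
x3 = rng.normal(size=1000)
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

A VIF close to 1 means a feature is nearly uncorrelated with the others. Rules of thumb often treat values above 5 or 10 as a warning sign.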

Linear Regression

Linear regression, lasso, and ridge perform similarly.
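A minimal sketch of how the three linear models could be fitted and compared side by side, assuming scikit-learn. The `alpha` values are arbitrary illustrative choices, not tuned settings from this analysis, and the data is a synthetic stand-in with two correlated features.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two correlated features, linear target.
rng = np.random.default_rng(2)
n = 1000
at = rng.uniform(2, 37, n)
v = 25 + 1.4 * at + rng.normal(0, 6, n)
X = np.column_stack([at, v])
y = 497 - 1.7 * at - 0.3 * v + rng.normal(0, 4, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),   # illustrative penalty strength
    "ridge": Ridge(alpha=1.0),   # illustrative penalty strength
}
scores = {}
for name, model in models.items():
    # Standardize first so the penalty treats both features on the same scale.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)  # R^2 on the held-out rows
    print(f"{name:>6}: R^2 = {scores[name]:.3f}")
```

With a weak penalty and a target that really is linear, the three fits land close together, which is consistent with the similar performance noted above.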

Lasso

Ridge

Tree models

Decision Tree

The decision tree performed better than the linear models.
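One reason a tree can beat a linear fit is that it can follow a nonlinear response. The sketch below, assuming scikit-learn, shows this on synthetic data whose target is deliberately nonlinear, so the ordering of the two scores is true by construction and is not evidence about the real dataset. The `max_depth` and `min_samples_leaf` values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): S-shaped dependence on the first feature.
rng = np.random.default_rng(3)
n = 1000
at = rng.uniform(2, 37, n)
rh = rng.uniform(25, 100, n)
X = np.column_stack([at, rh])
y = 460 - 25 * np.tanh((at - 20) / 4) - 0.05 * rh + rng.normal(0, 1.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Depth and leaf-size limits keep the tree from memorizing noise.
tree = DecisionTreeRegressor(max_depth=6, min_samples_leaf=5, random_state=0)
linear = LinearRegression()

tree_r2 = tree.fit(X_tr, y_tr).score(X_te, y_te)
linear_r2 = linear.fit(X_tr, y_tr).score(X_te, y_te)
print(f"tree R^2 = {tree_r2:.3f}, linear R^2 = {linear_r2:.3f}")
```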

Random Forest

Random forest is the best model at this point, with the lowest MSE and MAPE and the highest R-squared.
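For reference, the three metrics named above can be computed as in this sketch, which assumes NumPy and scikit-learn. The helper `regression_report` is hypothetical, MAPE is expressed in percent, and the forest is fitted on synthetic stand-in data with default-style settings, so the printed values are not the results of this analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def regression_report(y_true, y_pred):
    """Return MSE, MAPE (in percent), and R^2 for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)
    mape = 100.0 * np.mean(np.abs(residual / y_true))  # assumes no zero targets
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mse": mse, "mape": mape, "r2": r2}

# Synthetic stand-in data (assumption), not the real measurements.
rng = np.random.default_rng(4)
n = 1000
at = rng.uniform(2, 37, n)
rh = rng.uniform(25, 100, n)
X = np.column_stack([at, rh])
y = 460 - 25 * np.tanh((at - 20) / 4) - 0.05 * rh + rng.normal(0, 1.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

report = regression_report(y_te, forest.predict(X_te))
print({k: round(float(val), 3) for k, val in report.items()})
```

Lower MSE and MAPE and a higher R-squared all point the same way, so scoring every model with the same function on the same test rows gives a like-for-like comparison.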

Conclusion


Using the CCPP dataset, I built three linear regression models (linear, lasso, and ridge) and two tree models (decision tree and random forest). Of the five models, random forest significantly outperformed the rest, so I would choose it as my solution. However, if interpretability is critical in this problem, I would go with linear regression plus some preprocessing to deal with the multicollinearity problem. If easy interpretation of the model is not a concern, I would also try PCA and a neural network.

Here are the next steps I will follow up: