Predict Electrical Energy Output of a Combined Cycle Power Plant Using Machine Learning & AI
Combined Cycle Power Plant
A combined cycle power plant is an assembly of heat engines that work in tandem from the same source of heat, converting it into mechanical energy. The principle is that after completing its cycle in the first engine, the working fluid (the exhaust) is still hot enough for a second heat engine to extract energy from it.
Introduction
A small dataset is provided with the following variables:
- Ambient Temperature (AT) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW (the target we are trying to predict)
PE is the dependent variable to be predicted. Because it is a continuous numeric variable, this is a regression problem. My solution therefore uses linear regression models (plain, lasso, and ridge) together with decision tree and random forest regressors.
Here is my proposed plan to tackle this problem from end to end.
- Identifying tools or platforms: out of a few options, I decided to use Python for this project.
- Data exploration: looking for missing values and correlations.
- Data preprocessing: getting ready for modeling.
- Model building (linear regression, lasso, ridge, decision tree, random forest).
- Model comparison.
- Conclusion and write-up.
Data Exploration
The scatter plots show a strong correlation between PE and the independent variables. They also reveal some correlations among the independent variables themselves.
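The exploration step described above (missing values, then correlations) could be sketched as follows. This is a minimal illustration, not the original analysis: the data frame is a synthetic stand-in generated on the spot, and the coefficients used to build it are made up. Only the column names and value ranges are taken from this write-up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CCPP data (NOT the real measurements).
# Column names follow the write-up; the generating coefficients are invented.
rng = np.random.default_rng(0)
n = 500
at = rng.uniform(1.81, 37.11, n)
df = pd.DataFrame({
    "AT": at,
    "V": 25.0 + 1.3 * at + rng.normal(0, 6, n),  # built to track AT
    "AP": rng.uniform(992.89, 1033.30, n),
    "RH": rng.uniform(25.56, 100.16, n),
})
df["PE"] = 500.0 - 1.7 * df["AT"] - 0.3 * df["V"] + rng.normal(0, 4, n)

# Count of missing values in each column.
missing_per_column = df.isna().sum()

# Pearson correlation between every pair of columns.
corr = df.corr()

# Correlation of each predictor with the target, most negative first.
pe_corr = corr["PE"].drop("PE").sort_values()

print(missing_per_column)
print(corr.round(2))
print(pe_corr.round(2))
```

With the real data loaded into `df`, the same three calls apply unchanged. If matplotlib is installed, `pd.plotting.scatter_matrix(df)` would draw the pairwise scatter plots.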
Modeling and Comparison
The correlation matrix also suggests strong correlation among the independent variables. For example, AT and V have a correlation of about 0.84, and AP and RH are each correlated with AT as well. This indicates that we might be facing a multicollinearity issue. To keep the model simple and interpretable, I will skip the PCA transformation at this point and instead use lasso and ridge regularization to handle the overfitting issue.
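One common way to quantify the multicollinearity concern is the variance inflation factor (VIF), which is not part of the original analysis but follows directly from the correlation matrix: the VIF of predictor j is the j-th diagonal entry of the inverse of the predictors' correlation matrix. The sketch below uses synthetic data in which V is deliberately built to track AT; the numbers it prints say nothing about the real dataset.

```python
import numpy as np

# Synthetic predictors (NOT the real CCPP data); V is built to track AT.
rng = np.random.default_rng(1)
n = 2000
at = rng.uniform(1.81, 37.11, n)
v = 25.0 + 1.3 * at + rng.normal(0, 6, n)
ap = rng.uniform(992.89, 1033.30, n)
rh = rng.uniform(25.56, 100.16, n)

names = ["AT", "V", "AP", "RH"]
X = np.column_stack([at, v, ap, rh])

# Correlation matrix of the predictors (columns are variables).
corr = np.corrcoef(X, rowvar=False)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of inv(corr).
vif = dict(zip(names, np.diag(np.linalg.inv(corr))))

for name, value in vif.items():
    print(f"{name}: VIF = {value:.2f}")
```

A VIF close to 1 means the predictor is nearly uncorrelated with the others. Values above roughly 5 to 10 are often treated as a rule-of-thumb warning sign, though the threshold is a convention rather than a hard rule.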
Linear Regression
Linear regression, lasso, and ridge perform similarly.
Lasso
Ridge
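A minimal sketch of how the three linear models could be fitted and scored side by side with scikit-learn. The data is synthetic, and the regularization strengths (`alpha=0.1` for lasso, `alpha=1.0` for ridge), the 80/20 split, and the standardization step are assumptions chosen for illustration, not the settings used in the original analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CCPP data (NOT the real measurements).
rng = np.random.default_rng(2)
n = 1000
at = rng.uniform(1.81, 37.11, n)
v = 25.0 + 1.3 * at + rng.normal(0, 6, n)
ap = rng.uniform(992.89, 1033.30, n)
rh = rng.uniform(25.56, 100.16, n)
X = np.column_stack([at, v, ap, rh])
y = 500.0 - 1.7 * at - 0.3 * v + rng.normal(0, 4, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Hypothetical hyperparameters; in practice alpha would be tuned.
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
}

results = {}
for name, model in models.items():
    # Standardize first so the penalty treats all predictors on one scale.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "mape": mape(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {k: round(val, 3) for k, val in scores.items()})
```

Scaling inside the pipeline matters for lasso and ridge because their penalties depend on the magnitude of the coefficients, which in turn depends on the units of each predictor.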
Tree models
Decision Tree
The decision tree performed better than the linear regression models.
Random Forest
Random forest is the best model at this point, with the lowest MSE and MAPE and the highest R-squared.
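The two tree models could be compared on the same three metrics along the lines below. This is a sketch under stated assumptions: the data is synthetic (with a mild nonlinear term added on purpose), and the settings (`n_estimators=100`, default tree depth, 80/20 split) are illustrative rather than the ones used in the original analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the CCPP data (NOT the real measurements).
rng = np.random.default_rng(3)
n = 1500
at = rng.uniform(1.81, 37.11, n)
v = 25.0 + 1.3 * at + rng.normal(0, 6, n)
ap = rng.uniform(992.89, 1033.30, n)
rh = rng.uniform(25.56, 100.16, n)
X = np.column_stack([at, v, ap, rh])
# The squared term is an invented nonlinearity for the trees to pick up.
y = 500.0 - 1.7 * at - 0.3 * v + 0.03 * (at - 20.0) ** 2 + rng.normal(0, 4, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical settings; depth and forest size would normally be tuned.
models = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        # scikit-learn returns MAPE as a fraction, so convert to percent.
        "mape": mean_absolute_percentage_error(y_test, pred) * 100,
        "r2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {k: round(val, 3) for k, val in scores.items()})
```

Tree models need no feature scaling, which is why the pipeline from the linear models is dropped here. A fitted forest also exposes `feature_importances_`, which gives a rough view of which predictors drive the splits.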
Conclusion
Using the CCPP dataset, I built three linear regression models and two tree models. Of the five, random forest significantly outperformed the rest, so I would choose random forest as my solution. However, if interpretability is critical for this problem, I would go with linear regression, with some preprocessing to deal with the multicollinearity problem. If easy interpretation of the model is not a concern, I would try PCA and neural networks as well.
Here are the next steps I would follow up on:
- Improve the linear regressions by transforming variables V and AP, which appear to violate the linearity assumption.
- PCA dimension reduction: I would use PCA to improve the regression models. However, after PCA the coefficients may not translate back easily to the original variables.
- Neural networks are worth a try as well, provided the interpretability of the model is not a concern.
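To illustrate the PCA step mentioned above, here is a rough sketch of principal component regression with scikit-learn. It is an assumption about how that step might look, run on synthetic data, and the choice of three components is arbitrary. It shows two things: keeping all four components reproduces the ordinary least squares fit exactly (PCA is then only a rotation), and the coefficients of a reduced model can be mapped back to the standardized predictors through the loadings, although they then describe the reduced model rather than a direct fit.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CCPP data (NOT the real measurements).
rng = np.random.default_rng(4)
n = 1000
at = rng.uniform(1.81, 37.11, n)
v = 25.0 + 1.3 * at + rng.normal(0, 6, n)
ap = rng.uniform(992.89, 1033.30, n)
rh = rng.uniform(25.56, 100.16, n)
X = np.column_stack([at, v, ap, rh])
y = 500.0 - 1.7 * at - 0.3 * v + rng.normal(0, 4, n)

# Baseline: ordinary least squares on standardized predictors.
ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# PCA keeping all 4 components, then keeping only 3 (an arbitrary choice).
pcr_full = make_pipeline(
    StandardScaler(), PCA(n_components=4), LinearRegression()
).fit(X, y)
pcr_3 = make_pipeline(
    StandardScaler(), PCA(n_components=3), LinearRegression()
).fit(X, y)

pca = pcr_3.named_steps["pca"]
reg = pcr_3.named_steps["linearregression"]

# Each row of `loadings` expresses one component in terms of the
# standardized AT, V, AP, RH columns.
loadings = pca.components_

# Map the component coefficients back onto the standardized predictors.
coef_on_scaled = loadings.T @ reg.coef_

# Reproduce the pipeline's predictions by hand from the mapped coefficients.
X_scaled = pcr_3.named_steps["standardscaler"].transform(X)
manual_pred = X_scaled @ coef_on_scaled + reg.intercept_

print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("R^2 (OLS, in sample):", round(ols.score(X, y), 4))
print("R^2 (3 components, in sample):", round(pcr_3.score(X, y), 4))
print("coefficients on standardized AT, V, AP, RH:", coef_on_scaled.round(3))
```

Any benefit from PCA therefore comes from dropping components, which trades some in-sample fit for more stable coefficients. Whether that trade helps on the real data would have to be checked on a held-out set.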