A small dataset is provided with the following variables:
- Temperature (T) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW (the target to predict)
PE is the dependent variable to be predicted. Because it is a continuous numeric variable, this is a regression problem. My solution therefore uses linear regression models together with decision tree and random forest regressors.
Here is my proposed end-to-end plan for this problem:
- Identifying tools or platforms: out of a few options, I decided to use Python for this project.
- Data exploration: looking for missing values and correlations.
- Data preprocessing: getting the data ready for modeling.
- Model building: linear regression, lasso, ridge, decision tree, and random forest.
- Model comparison.
- Conclusion and write-up.
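The data exploration step could be sketched as follows. This is only an illustration: the data frame is a synthetic stand-in generated inside the stated ranges, and the column names AT, V, AP, RH, and PE are assumptions, so the printed correlations will not match those of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumed column names: AT, V, AP, RH, PE).
rng = np.random.default_rng(0)
n = 500
at = rng.uniform(1.81, 37.11, n)
df = pd.DataFrame({
    "AT": at,
    # V is made to move with AT so that the correlation matrix has some structure.
    "V": 25.36 + (at - 1.81) * 1.5 + rng.normal(0, 5, n),
    "AP": rng.uniform(992.89, 1033.30, n),
    "RH": rng.uniform(25.56, 100.16, n),
})
# Made-up target relation, for illustration only.
df["PE"] = 495.76 - 2.0 * (df["AT"] - 1.81) + rng.normal(0, 4, n)

# Count missing values per column.
missing = df.isna().sum()
print(missing)

# Pairwise Pearson correlations, including the target.
corr = df.corr()
print(corr.round(2))
```

With the real data, `df` would instead be loaded from the provided file, and the same two calls would apply unchanged.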
The correlation matrix shows strong correlations among the independent variables as well. For example, AT (temperature) and V have a correlation of 84%. AP and RH are each correlated with AT too. This suggests a multicollinearity issue. To keep the model simple and interpretable, I will skip a PCA transformation at this point and instead rely on lasso and ridge regularization to limit overfitting.
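To illustrate how lasso and ridge behave when predictors are correlated, here is a minimal sketch on synthetic data. The two features, their correlation of roughly 0.85, and the `alpha` values are all assumptions chosen for the example, not settings from this project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic predictors with a correlation of about 0.85 (illustrative only).
rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(size=n)
x2 = 0.85 * x1 + np.sqrt(1 - 0.85**2) * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Hypothetical regularization strengths; in practice they would be tuned.
models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=10.0),
    "lasso": Lasso(alpha=0.1),
}

coefs = {}
for name, model in models.items():
    # Standardize first so the penalty treats both features on the same scale.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X, y)
    coefs[name] = pipe[-1].coef_
    print(name, coefs[name].round(3))
```

Ridge shrinks the coefficients toward zero, while lasso can set some of them exactly to zero, which is one reason both are common choices when predictors overlap.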
Linear regression, lasso, and ridge perform similarly.
The decision tree performs better than the linear models.
Random forest is the best model at this point, with the lowest MSE and MAPE and the highest R-squared.
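A comparison of this kind could be set up roughly as below. The data is synthetic, and the hyperparameters and the 75/25 split are assumptions, so the numbers it prints say nothing about the results reported above. MAPE is computed by hand here to keep the metric definition explicit.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a mildly nonlinear target (illustrative only).
rng = np.random.default_rng(42)
n = 600
X = rng.uniform(0, 1, size=(n, 4))
y = 450 + 30 * X[:, 0] - 20 * X[:, 1] ** 2 + rng.normal(scale=2.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical settings; none of these values come from the original analysis.
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "decision_tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": float(mean_squared_error(y_test, pred)),
        "mape": mape(y_test, pred),
        "r2": float(r2_score(y_test, pred)),
    }

for name, m in results.items():
    print(f"{name:>14}  MSE={m['mse']:.2f}  MAPE={m['mape']:.2f}%  R2={m['r2']:.3f}")
```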
Improve the linear regressions by transforming the variables V and AP, which appear to violate the linearity assumption.
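As an illustration of why a transformation can help, the sketch below fits a straight line to a curved relation before and after transforming the predictor. The log transform and the made-up relation between `v` and `y` are assumptions chosen for the example; which transform suits V and AP would need to be checked against the real data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic predictor drawn from the stated V range.
v = rng.uniform(25.36, 81.56, 500)
# Made-up curved relation: the target is linear in log(v), not in v.
y = 500 - 40 * np.log(v) + rng.normal(scale=1.0, size=500)


def r_squared(x, target):
    """In-sample R-squared of a one-variable least-squares line."""
    slope, intercept = np.polyfit(x, target, deg=1)
    resid = target - (slope * x + intercept)
    return 1 - resid.var() / target.var()


r2_raw = r_squared(v, y)          # straight line on the raw predictor
r2_log = r_squared(np.log(v), y)  # straight line on the transformed predictor
print(f"R2 raw: {r2_raw:.4f}  R2 log-transformed: {r2_log:.4f}")
```

A residual plot against each predictor is a common way to judge whether such a transform is needed at all.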
PCA dimension reduction: I would use PCA to improve the regression models. However, after PCA the coefficients may not translate easily back to the original variables.
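A minimal sketch of PCA followed by linear regression, on synthetic data with three hypothetical predictors. It also prints the component loadings, which show that each component mixes several original variables; this is why the regression coefficients are harder to read after PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors: x1 and x2 are correlated, x3 is independent (illustrative only).
rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(size=n)
x2 = 0.85 * x1 + 0.5 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x3 + rng.normal(scale=0.5, size=n)

# Keeping two components is an arbitrary choice for the example.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pipe.fit(X, y)

pca = pipe.named_steps["pca"]
reg = pipe.named_steps["linearregression"]
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("loadings (rows = components, columns = x1, x2, x3):")
print(pca.components_.round(3))
print("coefficients on the components:", reg.coef_.round(3))
```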
Neural networks are also worth trying if the interpretability of the model is not a concern.
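One way to try this is a small multilayer perceptron from scikit-learn, sketched below on synthetic data. The layer sizes, iteration limit, and data are assumptions for illustration, not a tuned configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a nonlinear target (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Inputs are standardized because MLP training is sensitive to feature scale.
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
net.fit(X_train, y_train)
pred = net.predict(X_test)
print("test R^2:", round(net.score(X_test, y_test), 3))
```

For a target on the scale of PE, scaling the target as well may make training easier.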