AIN311 Machine learning in Sustainability project Week 2: EDA

Tuncersivri
AIN311 Fall 2023 Projects
3 min read · Nov 16, 2023

Welcome to the world of Exploratory Data Analysis (EDA), a crucial phase of the data analysis journey that lets us uncover the hidden stories within datasets. This blog post shares the experiences and insights gained from conducting EDA on the project’s dataset. The main purpose of EDA is to understand and clean the data while revealing the patterns within it. The dataset, as mentioned in last week’s blog, offers a comprehensive view of weather conditions, facilitating analysis of solar energy-related patterns and broader meteorological insights. The data was analyzed under three headings: Weather Data Insights ☀️, Time Series Analysis 📈 and Analysis of Discrepancies 🔍.

Under Weather Data Insights, frequency histograms of GHI (Global Horizontal Irradiance), temperature and wind speed were examined alongside their box plots to look for outliers. A significant number of outliers were found at the upper end of the GHI box plot, which may indicate anomalies or exceptionally sunny days. Finally, a correlation heatmap of the weather features was analyzed. It provides valuable insight into the relationships between the weather variables and solar irradiance, which is essential for feature selection in predictive modeling because it indicates which variables may be more influential. 📊💡
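As a rough illustration of the two checks above, here is a minimal sketch using only the Python standard library. It assumes the usual box-plot rule (a point is an upper outlier when it lies above Q3 + 1.5 × IQR) and a plain Pearson correlation, which is the quantity a correlation heatmap typically displays. The sample values and the names `upper_outliers` and `pearson` are made up for this example and do not come from the project’s code or data.

```python
import statistics

def upper_outliers(values, k=1.5):
    """Return the values above the upper box-plot fence, Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    return [v for v in values if v > upper_fence]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical hourly readings, NOT taken from the project dataset.
ghi = [0, 0, 10, 50, 120, 200, 180, 90, 20, 0, 0, 950]
temperature = [8, 8, 9, 11, 14, 17, 18, 16, 12, 10, 9, 25]

print("Upper GHI outliers:", upper_outliers(ghi))
print("corr(GHI, temperature):", round(pearson(ghi, temperature), 3))
```

In a real notebook the same numbers would more likely come from a dataframe library and be drawn as a heatmap; the sketch only shows what the box plot and the heatmap are measuring.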

Under Time Series Analysis, the time series of DHI (Diffuse Horizontal Irradiance), DNI (Direct Normal Irradiance), solar zenith angle, temperature, wind speed, and both the calculated and the given GHI were plotted and analyzed.

Both “Calculated GHI” and “Given GHI” are plotted to check whether the given GHI values are consistent with the other irradiance components. GHI can be calculated with the formula below:

$$\mathrm{GHI} = \mathrm{DHI} + \mathrm{DNI} \cdot \cos(\alpha_{\mathrm{zenith}})$$

The close match between the two lines validates the consistency of the data. ✅

The overlapping lines in the graph suggest that the provided data closely follows the GHI formula, which is crucial for any analytical or predictive work. 📉
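The consistency check described above can be sketched in a few lines. This is only an illustration: it assumes the solar zenith angle is stored in degrees (so it is converted to radians before taking the cosine), and the rows and the function name `calculated_ghi` are hypothetical rather than taken from the dataset.

```python
import math

def calculated_ghi(dhi, dni, zenith_deg):
    """GHI = DHI + DNI * cos(zenith), with the zenith angle given in degrees."""
    return dhi + dni * math.cos(math.radians(zenith_deg))

# Hypothetical rows, NOT taken from the project dataset.
rows = [
    {"DHI": 100.0, "DNI": 800.0, "zenith": 60.0, "GHI": 500.0},
    {"DHI": 50.0, "DNI": 0.0, "zenith": 85.0, "GHI": 50.0},
]

for row in rows:
    calc = calculated_ghi(row["DHI"], row["DNI"], row["zenith"])
    # Absolute gap between the recomputed value and the value in the data.
    row["abs_diff"] = abs(calc - row["GHI"])
    print(f"given={row['GHI']:.1f} calculated={calc:.1f} abs_diff={row['abs_diff']:.3f}")
```

Plotting the given and calculated values against time is then what produces the two overlapping lines.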

For the last part, Analysis of Discrepancies, the mean absolute difference, the maximum absolute difference and the standard deviation of the absolute differences between the given and calculated GHI were computed and analyzed. This led to the conclusion that “While the mean differences are low, the existence of significant outliers and variability indicates areas where the predictive model or measurement approach could potentially be improved.” 🛠️💡
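The three summary statistics can be computed as in the sketch below. It assumes the sample standard deviation is used (the post does not say whether sample or population was chosen), and the values and the helper name `discrepancy_summary` are invented for illustration.

```python
import statistics

def discrepancy_summary(given, calculated):
    """Summarise the absolute differences between given and calculated GHI."""
    abs_diffs = [abs(g - c) for g, c in zip(given, calculated)]
    return {
        "mean_abs_diff": statistics.fmean(abs_diffs),
        "max_abs_diff": max(abs_diffs),
        # Sample standard deviation; an assumption, see the note above.
        "std_abs_diff": statistics.stdev(abs_diffs),
    }

# Hypothetical values, NOT taken from the project dataset.
given = [500.0, 50.0, 300.0, 0.0]
calculated = [498.0, 50.0, 310.0, 0.0]

summary = discrepancy_summary(given, calculated)
print(summary)
```

A low mean combined with a large maximum and standard deviation is the pattern the quoted conclusion describes: most rows agree closely, while a few disagree by a wide margin.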

The impact of weather conditions on the GHI discrepancies was also explored across cloud types and humidity ranges, using box plots and ANOVA tests.

The ANOVA tests show a very high F-statistic and a p-value of 0.0 for both factors, which indicates statistically significant differences in the mean absolute difference across the various cloud types and humidity ranges. The p-value is so small that it is reported as 0.0, which means we can reject both null hypotheses: that all cloud types have the same mean absolute difference, and that all humidity ranges do.
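To make the F-statistic less abstract, here is a minimal sketch of a one-way ANOVA computed by hand: the variance between the group means divided by the variance within the groups. The group labels and numbers are hypothetical, and the function `one_way_anova_f` is written for this example only. In practice a library routine such as `scipy.stats.f_oneway` returns both the F-statistic and the p-value; the p-value is left out here because it needs the F-distribution, which the standard library does not provide.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    means = [fmean(g) for g in groups]
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each value around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical absolute GHI differences per cloud type, NOT project data.
diffs_by_cloud_type = {
    "clear": [1.0, 2.0, 3.0],
    "cloudy": [2.0, 3.0, 4.0],
    "overcast": [5.0, 6.0, 7.0],
}

f_stat = one_way_anova_f(list(diffs_by_cloud_type.values()))
print("F-statistic:", f_stat)
```

A large F-statistic means the group means differ by much more than the scatter inside each group would explain, which is what a p-value near zero reflects.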

This finding is important because it suggests that cloud type and humidity range are significant factors affecting the accuracy of GHI calculations. For predictive modeling purposes, this means that including them as features could potentially improve the model’s performance. Therefore, they should be considered when creating predictive models for GHI.

In summary, the Exploratory Data Analysis (EDA) of the data revealed crucial insights and notable findings. These results lay the groundwork for informed decision-making, emphasizing the significance of weather variables in developing precise models for solar irradiance predictions. This EDA is a pivotal step towards optimizing machine learning approaches for sustainable energy practices. 🌐🔍💡
