Tackling Kaggle Tasks: Descriptive Analytics on Solar Panel Sites in India

Published in Analytics Vidhya · 11 min read · Oct 4, 2020

Hello, brave readers, and welcome to a new series of mine called “Tackling Kaggle Tasks”. In this series I will be exploring the vast ocean of data that Kaggle has to offer and completing the various tasks that each dataset owner puts forth for their submitted dataset. For this edition we will be taking on a dataset called “Solar Power Generation Data”, submitted by Ani Kannal. The dataset contains records from two solar power plant sites in India over a 34-day period. There are four data files in total, two for each plant site: one holds that site’s power generation data, and the other holds its sensor data, such as temperature and irradiation levels. The dataset can be found here. The creator of the dataset asks Kaggle users to complete any of four tasks: Descriptive Analytics, Visualization and Further Exploration, Competition, and Tell a Story. In this article I will outline the process I followed in completing the first of the four, Descriptive Analytics.

Part 1 — Load the data, briefly explore each dataset, and understand and explore underlying patterns in the data

I began by creating a separate dataframe for each of the four files provided and then looking at basic exploratory output and statistics for each. To do this I wrote a simple function that prints a passed-in dataframe’s first five rows, its column names, its number of rows and columns, and a statistical summary covering both numeric and non-numeric columns.

The function I created to output basic dataframe exploration and statistics
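A minimal sketch of what such a helper might look like. The function name `explore_dataframe`, the stand-in frame, and the column names `DC_POWER` and `AC_POWER` are assumptions for illustration; the real dataframes would be loaded from the Kaggle CSV files.

```python
import pandas as pd


def explore_dataframe(df: pd.DataFrame, name: str = "dataframe") -> dict:
    """Print a quick overview of df and return the pieces for later reuse."""
    overview = {
        "head": df.head(),                      # first five rows
        "columns": list(df.columns),            # column names
        "shape": df.shape,                      # (number of rows, number of columns)
        "summary": df.describe(include="all"),  # numeric and non-numeric statistics
    }
    print(f"=== {name} ===")
    for label, value in overview.items():
        print(f"\n{label}:\n{value}")
    return overview


# Tiny stand-in frame (hypothetical values, not taken from the dataset).
sample_df = pd.DataFrame(
    {
        "SOURCE_KEY": ["inv_a", "inv_b", "inv_a"],
        "DC_POWER": [0.0, 120.5, 240.0],
        "AC_POWER": [0.0, 11.8, 23.5],
    }
)
overview = explore_dataframe(sample_df, "sample_df")
```

Passing `include="all"` to `describe` is what brings the non-numeric columns into the summary alongside the numeric ones.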

The outputs I received from running each dataframe through the function above, from top left to bottom right: plant_1_generation_df, plant_1_sensor_df, plant_2_generation_df, plant_2_sensor_df

I then made a list of each column within the dataframes, pulling some information from what the author offered as well as doing some of my own minor research.

Column names and their descriptions found in the plant_generation data

Column names and their descriptions found in the plant_sensor data — the link at the bottom is here.

I followed this up by creating pair plots for each of my dataframes, eyeing the results and coming up with a number of base conclusions and hypotheses. First, given the pairplots for both plant_generation sets, I determined that DC power shows a near-perfect positive correlation with AC power. This makes sense because AC power depends directly on the amount of DC power fed to the inverter, whose job is to convert DC to AC. Furthermore, I could see that DC and AC power both contribute to DAILY_YIELD as well as TOTAL_YIELD, which intuitively makes sense, as both yields should be direct results of the power generated.

The pairplots for the plant_generation dataframes, plant 1 (left) and plant 2 (right)
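The visual impression from a pair plot can be backed up with a correlation matrix. The sketch below uses a small made-up frame; the column names `DC_POWER` and `AC_POWER` and all values are assumptions for illustration, not figures from the dataset.

```python
import pandas as pd

# Hypothetical readings over one day: power rises toward midday, then falls.
generation_df = pd.DataFrame(
    {
        "DC_POWER": [0.0, 1500.0, 6200.0, 11800.0, 7400.0, 0.0],
        "AC_POWER": [0.0, 147.0, 607.0, 1153.0, 724.0, 0.0],
        "DAILY_YIELD": [0.0, 40.0, 900.0, 3100.0, 5600.0, 6000.0],
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = generation_df.corr()
dc_ac_corr = corr_matrix.loc["DC_POWER", "AC_POWER"]

print(corr_matrix.round(3))
print(f"DC/AC correlation: {dc_ac_corr:.4f}")
```

A value close to 1 for the DC/AC pair is the numeric counterpart of the near-straight line seen in that panel of the pair plot.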

Second, given the pairplots for both plant_sensor sets, I determined that increasing ambient temperature is strongly correlated with increasing module temperature and increasing irradiation levels, and that increasing module temperature is very strongly correlated with increasing irradiation levels. This leads us to conclude that increasing energy received from the sun has the effect of increasing module temperature as well as the surrounding air (ambient) temperature. This intuitively makes sense because higher levels of energy from the sun cause increasing temperatures.

One interesting thing of note, though it appears minor in the visualizations, is the relationship between ambient temperature and module temperature. The two increase together up to a certain point, after which module temperature begins to decrease marginally as ambient temperature continues to rise. As I mentioned, the effect is slight and could be nothing, but it could also be indicative of a temperature regulation system that keeps the modules from rising above a certain level, or of a property of the material used to make the modules (the solar panels). One final thought: taking into consideration how heat and energy work, this could also be caused by the modules passing the sun’s energy downstream to the inverter and then on to further energy storage or management systems. The ambient air is not in a controlled environment and thus will continue to gain heat as the amount of energy received from the sun increases, whereas the module shifts that energy away, in a sense regulating its moment-to-moment temperature.

The pairplots for the plant_sensor dataframes, plant 1 (left) and plant 2 (right)

Next I converted the “DATE_TIME” column from the “object” data type to datetime format so that it could be used in comparisons and feature engineering.

Code for converting “DATE_TIME” columns in each dataframe to datetime format
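A conversion along these lines can be sketched with `pd.to_datetime`. The format string below is an assumption about how the raw strings are laid out (day-month-year); it should be checked against each file. Spelling the format out matters because a string such as `01-06-2020` is ambiguous, and the same text yields a different date when read month-first.

```python
import pandas as pd

# Hypothetical raw strings in an assumed day-month-year layout.
raw = pd.DataFrame({"DATE_TIME": ["15-05-2020 23:45", "01-06-2020 00:15"]})

# Explicit format: no guessing about which field is the day.
day_first = pd.to_datetime(raw["DATE_TIME"], format="%d-%m-%Y %H:%M")

# The same ambiguous string, deliberately read month-first for comparison.
ambiguous = "01-06-2020 00:15"
as_month_first = pd.to_datetime(ambiguous, format="%m-%d-%Y %H:%M")

print(day_first.dt.month.tolist())  # months under the day-first reading
print(as_month_first)               # the month-first reading of the same text
```

Comparing the two readings on a handful of raw values is a cheap way to confirm the parsed months and days match what the file actually contains.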

Following this and a quick look at the instances of the dataframes by year I decided to feature engineer a few new columns from the “DATE_TIME” column, namely a “month” column, a “day” column, an “hour” column, and a “minute” column. All instances occurred in the year 2020 so I decided that I could ignore the year.

A view of instances in each dataframe by year

My dataframe after some datetime feature engineering
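The year check and the new columns can be sketched with the `.dt` accessor. The frame below is a stand-in with made-up timestamps; only the column names “DATE_TIME”, “month”, “day”, “hour”, and “minute” come from the article.

```python
import pandas as pd

# Stand-in frame with an already-converted datetime column.
df = pd.DataFrame(
    {"DATE_TIME": pd.to_datetime(["2020-05-15 00:00:00", "2020-06-17 23:45:00"])}
)

# Count of rows per year; a single year means it carries no information.
rows_per_year = df["DATE_TIME"].dt.year.value_counts()

# Split the timestamp into its components.
df["month"] = df["DATE_TIME"].dt.month
df["day"] = df["DATE_TIME"].dt.day
df["hour"] = df["DATE_TIME"].dt.hour
df["minute"] = df["DATE_TIME"].dt.minute

print(rows_per_year)
print(df)
```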

After feature engineering I took a look at the pairplots again, wanting to determine what new trends I could see. A number of new observations became clear given this new viewpoint. First, given the plant_generation dataframe pairplots:

Pairplot for plant_1_generation dataframe with newly engineered features

There appears to be missing data across all features between day 1 and around day 12 or 13, possibly due to a pause in gathering observations. Looking into this further should give a more accurate range.
Daily yields appear to be slightly higher around the middle of the year. This could be due to a sunnier season.

Pairplot for plant_2_generation dataframe with newly engineered features

The total power yield shows a slow positively increasing trend through each passing month, as we would intuitively expect.
Most of the data appears to have been collected within months 5 and 6 (May and June). Why? Higher solar radiation periods for better data collection?
DC and AC power increase quickly as the hour of day increases, up to a certain point, then begin to decline as the hour increases further. This is likely due to the sun’s movement throughout the day: more sun at midday, no sun early and late in the day.
The daily yield begins with a gradual increase as the hour increases, then rises significantly up to a certain point (around the same point at which DC and AC power begin to decline), then flattens out as DC and AC power fall back to lower levels. Again this can be explained by the sun’s movement throughout the day. The daily yield starts low; then, as the hours of the day bring more sunlight and thus more AC/DC power, daily yield begins to increase. This increase slows later in the day as the amount of sunlight decreases, lowering DC and AC power levels. Eventually the increase becomes negligible and we see the daily yield flatline.
Total yield shows a steady increase as we get later in the year as well as later in the day, which again intuitively makes sense. Total yield should be continuously increasing as time passes and more power is collected.
Plant site 2 appears to only have recordings for two months, 5 and 6 (May and June).
In plant site 2 the correlation between daily yield and the hour of the day seems drastically different at first glance; however, we can see that it undergoes the same positive correlation that we saw at plant site 1. The major difference here is that at plant site 2 we appear to start the day off with high levels of daily yield power collection. One possible explanation of this is that it is excess yield carried over from the previous day.

I was also able to ascertain a few observations from the new plant_sensor dataframe pairplots:

Pairplot for plant_1_sensor dataframe with newly engineered features

We can immediately see a glaring difference between the plant_1_generation dataframe and all other dataframes. In the month column, the former has 12 values, but all others have two values (5 and 6), indicating that there appear to be records of observations only for those two months. Incidentally, they are the same two months that had the most observations in the plant_1_generation dataframe.

Pairplot for plant_2_sensor dataframe with newly engineered features

As we would expect, we can see a positive correlation between Ambient Temperature, Module Temperature, and Irradiation as the hour of day increases, up to a certain point, then we see a negative correlation where the aforementioned features begin a decline as the hour of day increases. This can be explained in the same way as our first observation: the sun’s movement through the sky throughout the day.
Perhaps an important note, perhaps not: the days we saw missing in the plant_1 dataframes, around days 2–14, do not appear to be missing in the plant_2 dataframes.

Part 2 — In depth data exploration

With some of the more basic observations finished, my next goal was to explore each of the dataframes in more depth. Here I look into questions such as the mean value of the daily yield, the total amount of irradiation per day, ambient and module temperatures, the inverters, and AC/DC power.

I started out by finding the mean value of the daily yield. I created two new dataframes that included only the instances recorded during the last 15 minutes of any given day. Because the “DAILY_YIELD” column is a running total of the yield throughout the day, the last reading should give me the total daily yield for any given day. I found the following:

The output of calculating the mean daily power yield for each plant (bottom) and its corresponding code (top)
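One way to sketch this step is to keep only the rows stamped 23:45 and average their “DAILY_YIELD”. The assumption that the last 15-minute slot of each day is stamped 23:45, the stand-in values, and the variable names are all illustrative.

```python
import pandas as pd

# Stand-in data: one inverter, two days, a midday and an end-of-day reading each.
generation_df = pd.DataFrame(
    {
        "DATE_TIME": pd.to_datetime(
            [
                "2020-05-15 12:00",
                "2020-05-15 23:45",
                "2020-05-16 12:00",
                "2020-05-16 23:45",
            ]
        ),
        "SOURCE_KEY": ["inv_a"] * 4,
        "DAILY_YIELD": [2500.0, 6000.0, 3000.0, 7000.0],
    }
)

# Keep only readings from the final 15-minute slot of each day (assumed 23:45).
is_last_slot = (generation_df["DATE_TIME"].dt.hour == 23) & (
    generation_df["DATE_TIME"].dt.minute == 45
)
last_daily_recording = generation_df[is_last_slot]

# Mean end-of-day yield across all remaining rows (per inverter, per day).
mean_daily_yield = last_daily_recording["DAILY_YIELD"].mean()
print(mean_daily_yield)
```

With several inverters in the frame, this mean is taken over every inverter's end-of-day reading, so it describes a typical inverter-day and not a plant-wide daily total.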

Next I determined the total irradiation of each plant site per day. To do this I separated the plant_sensor dataframes into data from the month of May and data from the month of June. I then calculated the total number of days by taking the length of the value counts of each of the month-separated dataframes and adding them together by plant site, giving me a count of 34 days. This number was provided by the author of the dataset, but I chose to approach it as if it had not been given to me. I then summed all of the irradiation recordings from the main plant_sensor dataframes; a plain sum is appropriate because this column is not a running total over each passing 15-minute interval but the amount of irradiation recorded since the previous recording. Finally, I divided these sums by the total number of days accounted for in each dataframe, coming up with the following irradiation per day calculations:

The output of calculating the total irradiation per day of each plant (bottom) and its corresponding code (top)
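The calculation described above amounts to dividing the summed irradiation by the number of distinct calendar days. In the sketch below the column name `IRRADIATION` and all values are assumptions; counting distinct dates directly is a swapped-in shortcut for the month-by-month value counts.

```python
import pandas as pd

# Stand-in sensor data: two readings on each of two days.
sensor_df = pd.DataFrame(
    {
        "DATE_TIME": pd.to_datetime(
            [
                "2020-05-15 11:45",
                "2020-05-15 12:00",
                "2020-05-16 11:45",
                "2020-05-16 12:00",
            ]
        ),
        "IRRADIATION": [0.25, 0.75, 0.5, 0.5],
    }
)

# normalize() zeroes the time of day, so each distinct value is one calendar day.
calendar_day = sensor_df["DATE_TIME"].dt.normalize()
n_days = calendar_day.nunique()

# Readings are per interval, not cumulative, so a plain sum is meaningful.
total_irradiation = sensor_df["IRRADIATION"].sum()
irradiation_per_day = total_irradiation / n_days

# Per-day totals, for a breakdown instead of a single average.
per_day = sensor_df.groupby(calendar_day)["IRRADIATION"].sum()
print(n_days, irradiation_per_day)
print(per_day)
```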

To explore the ambient and module temperatures I chose to find the max of each and do a comparison. I approached this in two ways: first I created a series of four graphs comparing ambient temperature and module temperature with the day of the month for each plant dataframe. I split this up further by separating the data by month.

Graphic representation of Ambient and Module temperature compared to the time of day and month.

I then quantified these observations by simply finding the max value of the “AMBIENT_TEMPERATURE” and “MODULE_TEMPERATURE” columns in each dataframe.

The max ambient and module temperatures recorded for each plant (bottom) and its corresponding code (top)
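Finding both maxima takes one call when the two columns are selected together. The values below are made up for illustration.

```python
import pandas as pd

# Stand-in sensor readings (hypothetical values).
sensor_df = pd.DataFrame(
    {
        "AMBIENT_TEMPERATURE": [24.1, 31.6, 29.0],
        "MODULE_TEMPERATURE": [22.8, 55.3, 41.7],
    }
)

# Column-wise maximum; the result is a Series indexed by column name.
max_temps = sensor_df[["AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE"]].max()
print(max_temps)
```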

Each plant site has a specific number of inverters hooked up to rows of modules, or solar panels. I wanted to find how many inverters each plant site contained. For this I found all the unique “SOURCE_KEY” values in each plant_generation dataframe, then took the length of that list, giving me a total of 22 inverters for each plant site.

Number of inverters for each plant (bottom) and the source code for calculating them (top)
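Counting inverters comes down to counting distinct “SOURCE_KEY” values. The inverter ids below are placeholders.

```python
import pandas as pd

# Stand-in generation data with placeholder inverter ids.
generation_df = pd.DataFrame({"SOURCE_KEY": ["inv_a", "inv_b", "inv_a", "inv_c"]})

# unique() returns the distinct ids; its length is the inverter count.
inverter_ids = generation_df["SOURCE_KEY"].unique()
n_inverters = len(inverter_ids)
print(n_inverters)
```

`generation_df["SOURCE_KEY"].nunique()` gives the same count in one step when the ids themselves are not needed.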

The next few explorations were a bit more time consuming. First I wanted to find the maximum power yield generated in a daily time interval. I used a series of Python dictionaries and for loops to split the last_daily_recording dataframes I had created earlier by month and create key-value pairs in the form of “day” - “daily_yield_sum” pairs.

Source code for separating last_daily_recording dataframes into dictionaries holding “day” as a key and the data frame’s “DAILY_YIELD” sum as the value

To find the max daily sum by month I took the above results and used a for loop, comparing the current iteration to what was set to the max value, eventually ending with a group of four variables holding the max values for each plant for the months May and June.

The max daily power yield of each plant site by month and its corresponding code.

Finally I used these final results to calculate the absolute max daily power yield for each plant site. I ended this exploration by looking at these results graphically.

Absolute max daily yield for each plant site and its corresponding code.
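The dictionary-and-loop approach described in the last three steps can be sketched as follows. The stand-in rows, the nested dictionary layout, and the variable names are assumptions, not the original code.

```python
import pandas as pd

# Stand-in end-of-day readings: several inverters can report on the same day.
last_daily_recording = pd.DataFrame(
    {
        "month": [5, 5, 5, 6, 6],
        "day": [15, 15, 16, 1, 1],
        "DAILY_YIELD": [6000.0, 5500.0, 7000.0, 6400.0, 6100.0],
    }
)

# Build month -> {day -> summed DAILY_YIELD across inverters}.
daily_sums = {}
for row in last_daily_recording.itertuples(index=False):
    month_dict = daily_sums.setdefault(row.month, {})
    month_dict[row.day] = month_dict.get(row.day, 0.0) + row.DAILY_YIELD

# For each month, walk the days and keep the largest sum seen so far.
max_by_month = {}
for month, day_sums in daily_sums.items():
    best_day, best_sum = None, float("-inf")
    for day, total in day_sums.items():
        if total > best_sum:
            best_day, best_sum = day, total
    max_by_month[month] = (best_day, best_sum)

# The absolute maximum is the largest of the monthly maxima.
absolute_max = max(total for _, total in max_by_month.values())
print(max_by_month, absolute_max)
```

The same result could be reached with `groupby(["month", "day"])["DAILY_YIELD"].sum()` followed by a per-month `max`, which trades the explicit loops for a shorter pandas expression.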

A graphical representation of max daily yields for each Plant site by month.

Earlier I had calculated how many inverters each plant contained. I decided that would be useful for exploring each inverter’s AC/DC power yields. I started by sorting the dataframes by unique “SOURCE_KEY” values and, through a series of Python dictionaries and for loops, found the summed daily yield for each inverter. With these results I used more for loops to find which inverter yielded the most power at each plant site and what its corresponding power value was.

The source code for separating the data by inverter ids and their daily yield sums

The inverters with the greatest max yield for each plant along with their yield values (bottom) and the source code for the calculation (top)
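A sketch of the per-inverter totals and the search for the top inverter, again using a plain dictionary. The inverter ids and yields are placeholders, and the rows are assumed to be end-of-day readings.

```python
import pandas as pd

# Stand-in end-of-day readings for three placeholder inverters.
generation_df = pd.DataFrame(
    {
        "SOURCE_KEY": ["inv_a", "inv_b", "inv_a", "inv_b", "inv_c"],
        "DAILY_YIELD": [6000.0, 5500.0, 7000.0, 6800.0, 9000.0],
    }
)

# Accumulate the summed daily yield per inverter id.
yield_by_inverter = {}
for key, value in zip(generation_df["SOURCE_KEY"], generation_df["DAILY_YIELD"]):
    yield_by_inverter[key] = yield_by_inverter.get(key, 0.0) + value

# The inverter whose summed yield is largest, and that sum.
top_inverter = max(yield_by_inverter, key=yield_by_inverter.get)
top_yield = yield_by_inverter[top_inverter]
print(top_inverter, top_yield)
```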

Finally, I organized the inverters for each plant in order of max daily yield to minimum daily yield.

Source code for ordering inverters by max daily power yield

Ordered list of best daily yield for each inverter in both plant sites
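An ordering like this can also be produced with `groupby` and `sort_values`, shown here as an alternative to explicit loops. The data is made up, and "best daily yield" is read as each inverter's largest end-of-day value, which is an interpretation of the caption above.

```python
import pandas as pd

# Stand-in end-of-day readings for three placeholder inverters.
generation_df = pd.DataFrame(
    {
        "SOURCE_KEY": ["inv_a", "inv_b", "inv_a", "inv_b", "inv_c"],
        "DAILY_YIELD": [6000.0, 5500.0, 7000.0, 6800.0, 9000.0],
    }
)

# Best (largest) daily yield per inverter, ordered from highest to lowest.
ranked = (
    generation_df.groupby("SOURCE_KEY")["DAILY_YIELD"]
    .max()
    .sort_values(ascending=False)
)
print(ranked)
```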

I finished things up by taking a closer look at why one of our dataframes appears to contain data for months other than May and June. I looked at the “day” value counts for that dataframe, including only the months before 5 (May) or after 6 (June). I then looked at only the data that landed in the month of May and then only the data that landed in the month of June. I was able to ascertain that data was gathered on the 6th day of every month of the year at plant site 1, but only for the generation calculations, and that the only months in which data was obtained outside of that 6th day were May and June. I determined that this could have been for testing purposes, perhaps a year-long research project for that particular plant, or even a required monthly inspection used to determine the plant’s productivity. A parsing artifact is also worth ruling out: if the raw DATE_TIME strings are day-first, dates from the 1st through the 12th of June could have been read as the 6th day of months 1 through 12.
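The filtering described here can be sketched with boolean masks on the engineered “month” and “day” columns. The rows below are invented to mimic the pattern and are not taken from the dataset.

```python
import pandas as pd

# Stand-in engineered columns mimicking the pattern described above.
df = pd.DataFrame(
    {
        "month": [1, 2, 3, 5, 5, 6, 6, 12],
        "day": [6, 6, 6, 15, 16, 6, 13, 6],
    }
)

# Rows whose month is neither May (5) nor June (6), and which days they fall on.
outside = df[~df["month"].isin([5, 6])]
day_counts_outside = outside["day"].value_counts()

# Days present within May and within June.
may_days = sorted(df.loc[df["month"] == 5, "day"].unique().tolist())
june_days = sorted(df.loc[df["month"] == 6, "day"].unique().tolist())

print(day_counts_outside)
print(may_days, june_days)
```

If every row outside May and June lands on the same day of the month, as in this stand-in, that regularity is the signal worth investigating further.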

Thank you for coming along with me on this very long and arduous journey. I hope you did enjoy at least some of the analysis and I hope that you will continue on with me to the next installment, where I will use this same dataset to extend our exploration to several visualizations and graphs. Until next time, brave reader. Keep on reppin’ on and happy coding.

Written by Daniel Benson