Photo by Chelsea on Unsplash

Solar panel array, what does it cost?

Sergey Mouzykin · Published in Analytics Vidhya · Oct 14, 2019 · 15 min read


Let’s use data to answer that question!

Introduction

Needless to say, the Sun is by far the most powerful object in our solar system.

Solar energy is still underutilized, and perhaps misunderstood by the general public in the United States, despite technological advancements and a dramatic reduction in hardware prices. It can provide us with clean and stable energy should we choose to embrace it.

Perhaps it is misconceptions about solar technology, along with perceived high costs, that deter many people from purchasing solar panels.

Truly, it is difficult to accept something that we do not understand. The truth about solar panels seems hidden in the dark, but once we shine a light on it, it can never be unseen.

To enable greater understanding of solar energy’s potential, we can use data to analyze historic installations and predict its cost.

This is a relatively large data set and there is a plethora of information which can be extracted from it. However, this article will focus on answering these particular questions about residential installations:

  1. Which states have the cheapest and most expensive installations; which states have highest/lowest incentives?
  2. How have the prices changed over the years?
  3. Can we predict the cost of a solar array installation? Which factors have the greatest impact on the cost of hardware?
Photo by American Public Power Association on Unsplash

The Data

This data set contains solar array installations across the United States from 1998 through January 2018. All code and data can be found on my GitHub.

At this point, the data frame contains about one million rows and 81 columns; however, many of the columns are entirely empty. Each row represents one installation and its variables, which will be discussed shortly.

Data frame summary.

Above is an overview of the data after removing the empty columns, which knocks the count down from 81 to 29.

Photo by Lightscape on Unsplash

Data Wrangling

The most important step before anything is data cleaning. This data was pretty messy, wild, and untamed.

For example, some of the numerical columns (rebate and cost) contained unwanted characters like dollar signs, commas, and percent signs.

In the state column, some states were counted twice because their values contained trailing white space.

The categorical column install_type contained numerous categories that were actually the same category, or similar ones with spelling errors.

In general, the following was performed to clean the data (a sketch follows the list):

  1. Drop columns which contained more than 80% missing values
  2. Deal with other missing values
  3. Convert to lowercase all non-numeric columns
  4. Convert date column to a datetime object
  5. Combine categories in ‘install_type’ column
  6. Remove unwanted characters (%, $, commas, trailing white spaces)
  7. Convert data types (some numeric columns were loaded as object)
  8. Get more data via NREL API
  9. Use data from API to fill some of the missing values
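Most of these steps reduce to a few lines of pandas. Below is a minimal sketch, assuming column names such as cost, rebate, and date_installed and a hypothetical file name; the real schema may differ slightly:

```python
import pandas as pd

# Hypothetical file name; the actual data is on the project's GitHub
df = pd.read_csv("solar_installations.csv", low_memory=False)

# 1. Drop columns with more than 80% missing values
df = df.loc[:, df.isna().mean() <= 0.80]

# 3./6. Lowercase all text columns and strip trailing white space
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.lower().str.strip()

# 6./7. Remove $, %, and commas, then convert to numeric
for col in ["cost", "rebate"]:
    df[col] = pd.to_numeric(
        df[col].astype(str).str.replace(r"[$,%]", "", regex=True),
        errors="coerce",
    )

# 4. Convert the date column to a datetime object
df["date_installed"] = pd.to_datetime(df["date_installed"], errors="coerce")
```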

Exploratory Data Analysis (EDA)

Now for the fun part.

The search for insights begins here, although along the way a bit more data wrangling was required, as there were outliers that had to be removed.

Since residential installations dominate the data, they will be the focus of this article.

Types of installations.

To begin with, I searched for correlations between the target variable (cost), which I wanted to predict, and the predictors (all other columns). Going into this, however, I already had an idea of which variables might have the strongest correlation.

Cost vs System Size (measured in kilowatts)

Looking at the plot above, a strong linear correlation can be seen between the cost of the installation and the system capacity. Additionally, it can be seen that one particular tracking type is the most common.

In general, the pattern appears to be linear. However, considering that this is supposed to be all residential installations, the magnitude of capacity and cost does not make sense.

Typically, the average residential solar array capacity will range from six to eight kW (kilowatts), but this data contains values which are well beyond what is considered to be ‘average’ for a residential type of installation.

For example, it is extremely unlikely (if not impossible) that any home will have a need for a 100 kW system that produces about 154,864 kWh annually and costs $995,956.

The most plausible explanation is that these data points have been mislabeled. It is more reasonable to assume that the higher values represent commercial or utility-scale installations.

Therefore, this project will follow through on the assumption that these extreme values have been mislabeled and do not actually represent residential values.

System capacity vs Power Generation

From the plot above, it is easy to visualize how capacity and power generation are correlated. In terms of building a machine learning model, this is a bit of a problem, since we have two features which are highly correlated.

This is called multicollinearity. In short, when building a linear regression model, this will result in having inaccurate coefficients. However, this will not be detrimental to the predictive power of the model.

The coefficients are useful because they tell us the importance of each variable in predicting the cost. This in turn allows us to understand how each feature impacts the cost of a solar array installation.

Box-plots

With box-plots we can visualize the effect of outliers and high influence data points. Below, we see some rebellious data points which are really skewing the data and they must be dealt with!

Terrible Box-plot

After removing the extremely large values, we have the box-plots below.

This was achieved by simply filtering system size, power generation, and cost to only include data below the 97.5th percentile. Above that percentile, the values are discarded.
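As a sketch, that percentile filter might look like this (column names such as size_kw and annual_pv_prod are assumptions about the schema):

```python
# Keep only rows where all three columns fall below their 97.5th percentile
cols = ["size_kw", "annual_pv_prod", "cost"]
caps = df[cols].quantile(0.975)
df = df[(df[cols] < caps).all(axis=1)]
```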

Polycrystalline and monocrystalline panels.

Types of Technology

Solar panels are not all created equal. There are various technologies out there; some are more expensive than others, and some are more efficient than others.

This data set contains various types of technologies, but only a few of them are prevalent: polycrystalline and monocrystalline.

Most common types of solar panels.

In the box-plot below, we can see the varying costs associated with different technology.

Variance of cost among different technologies.

Below, we can see the varying power generation associated with different technology.

Power generation of each technology type.

A few things are noticeable from the plots above. The box-plots indicate that the CIS/CIGS (Copper Indium Gallium Selenide) modules have higher power production with much lower variance, and are relatively cheaper than other technologies.

Cost and power generation of varying technology.

Above is the average cost per unit of power produced by each solar panel technology type.
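A plot like that can come from a simple group-by; here is a sketch, again assuming column names like tech_1 and annual_pv_prod:

```python
# Average installed cost per kWh of estimated annual production, by technology
cost_per_kwh = (
    df.assign(cost_per_kwh=df["cost"] / df["annual_pv_prod"])
      .groupby("tech_1")["cost_per_kwh"]
      .mean()
      .sort_values()
)
print(cost_per_kwh)
```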

Distributions

Box-plots give us some sense of the distribution but it’s much better to actually make a histogram. Histograms give us a better understanding of how the data is distributed.

This data is still a bit skewed to the right, but it's pretty close to being normally distributed. Ideally, we want the data to be close to normally distributed when building a linear model.

Therefore, this data will need to be transformed such that it is normally distributed, or very close to it. This can greatly improve the performance of the model.
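A common fix for right-skewed, non-negative data is a log transform. A minimal sketch, where the column names are assumptions and np.log1p is used so that zeros are handled safely:

```python
import numpy as np

# Compress the long right tails of the skewed columns
for col in ["cost", "size_kw", "annual_pv_prod"]:
    df[col + "_log"] = np.log1p(df[col])
```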

ECDF (Empirical Cumulative Distribution Function)

The ECDF below is accomplished by sorting the values and then plotting them. This enables us to arrive at some meaningful statistics about each feature.

ECDF plot

For example, in terms of the annual energy estimate (Annual PV Prod), we can see that about 75% of installations (in the United States) are estimated to produce up to about 10,000 kWh of energy.

In terms of cost, about 70% of people have paid less than $35,000 for their solar array installation. Keep in mind that these are historical numbers ranging from 1998 to 2018, and this cost does not include installation labor or any incentives.
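For reference, an ECDF like the one above takes only a few lines to compute; a sketch, assuming the cost column:

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(values):
    """Return sorted values and the cumulative fraction at each value."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf(df["cost"].dropna())
plt.plot(x, y, marker=".", linestyle="none")
plt.xlabel("Cost ($)")
plt.ylabel("ECDF")
plt.show()
```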

Statistics by State

As we can see, most solar arrays in this data set have been installed in California. The red line simply represents the one percent marker to indicate that most states appear in this data less than one percent of the time.

By excluding CA, temporarily, we can get a better look at the other states that are represented in this data.

Below is a count of incentives collected via an NREL API. This is simply the collective amount of state and local government incentives that are available. This does not include any federal incentives.

According to the information gained from the API, every state has at least one incentive available for solar energy. Additionally, there are federal incentives available as well which are not represented in this data.

Number of incentives per state.

Average cost and rebates are visualized below. For some states, rebate information was missing, and it is difficult to tell why. Perhaps no rebate was given for a particular installation, or maybe rebates weren't available at the time.

So, to answer the first question:

  1. Which states have the cheapest and most expensive installations; which states have highest/lowest incentives?
Median Cost and Rebate

The data above provides median values and covers the period from 2013 to the end of 2018. From the two plots above, we gain insight into the costs and rebate values across the country.

However, keep in mind that federal rebates, which may further reduce the cost, are not represented in this data.

2. How have the prices changed over the years?

To answer this question, I created a time series and resampled it into six-month bins. The resulting time-series plot shows the average cost within each six-month interval.

Time-series of installation cost.
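A sketch of that resampling step, assuming a datetime column named date_installed:

```python
import matplotlib.pyplot as plt

# Average cost within each six-month bin
ts = df.set_index("date_installed")["cost"].resample("6M").mean()
ts.plot()
plt.ylabel("Average cost ($)")
plt.show()
```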

There’s some good news for the consumer judging from the plot above. Hardware costs have been decreasing since about 2010 after peaking between 2007 and 2010.

However, what does it mean for the solar industry? Is manufacturing becoming less complex and more cost effective? Or, is there more competition developing and is that what drives the prices downward? Unfortunately, we cannot answer these questions from this data alone.

Well, this was unexpected. The rebate amount has also been decreasing quite drastically, and it appears to have peaked around the same time as cost. The data also contains many missing rebate values for 2017 and 2018, resulting in the anomaly seen above for 2017. The exact reason is unknown, but it may be due to changes in the data collection process, as the source of this data was deprecated in 2019.

This plot is informative but also brings about even more questions regarding the solar industry, economy, and the overall state of our concern for the environment.

But not all hope is lost. Although the value of rebates has declined since the early 2000s, the number of installations and rebates has been growing during the same period.

Number of installations and incentives.

3. Can we predict the cost of a solar array installation? Which factors have the greatest impact on the cost of hardware?

To answer this question, a few different approaches can be used. We can look at correlations between the dependent variable, cost, and all the independent variables. However, correlation does not imply causation and therefore doesn't tell us the entire story.

Correlations heatmap between features.

The range of correlation lies between -1 and 1. The strongest correlations approach 1 or -1, while values near zero indicate weak or no correlation.

By visualizing correlations between the features, we can get a glimpse of which predictors may have the greatest impact on hardware cost.
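A heatmap like the one above is nearly a one-liner with seaborn; a minimal sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlations between all numeric features
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```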

Machine Learning

What do you do when there’s a whole heap of data and you want to find a pattern within it?

By building a machine learning model we can predict the target variable, cost, and estimate the impact of each variable on it. We can explore a number of methods for building such models, including Multiple Linear Regression, Gradient Boosting, and Random Forest Regression.

Building a model is an iterative process. There are many variables that may influence the outcome and we may not know which variables are the most useful from the start.

These variables may have different data types as well — categorical, numerical, and strings. Therefore, data must be prepared in such a way that all variables are numerical before building any model.

For this project, the general process was as follows (a sketch follows the list):

  • Categorical encoding (if any features are non-numeric)
  • Feature engineering. Play around with math and try to create additional features
  • Build a pipeline to handle categorical and numeric variables
  • Search for best pipeline parameters and regression hyper-parameters
  • Evaluate pipeline with cross-validation
  • Repeat process, reuse pipeline to evaluate other algorithms
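Below is a minimal sketch of such a pipeline, using Ridge as the estimator. The feature lists and parameter grid are illustrative assumptions rather than the exact ones used in the project, and X_train/y_train come from the chronological split described in a later section:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["size_kw", "annual_pv_prod", "rebate"]    # assumed columns
categorical = ["state", "tech_1", "tracking_type"]   # assumed columns

# Scale numeric features and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipe = Pipeline([("prep", preprocess), ("model", Ridge())])

# Search the regularization strength using time-ordered validation folds
search = GridSearchCV(
    pipe,
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)  # X_train, y_train: chronological training split
```

The same pipeline can then be reused as-is with other estimators, which is what makes the iteration in the last list item cheap.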

The first few algorithms used were Ridge and ElasticNet Regression. Both implement regularization, which penalizes coefficients that have very little impact on the outcome, shrinking them toward zero.

Overfitting/Underfitting

One of the issues which needs to be addressed is overfitting and underfitting. In general, overfitting occurs when the model fits to the noise surrounding the true signal or pattern.

An example of overfitting can be seen in the plot below, on the right. Here, the model tries to fit every single point. This happens when the model is too complex; such a model has high variance.

Examples of under and overfitting from Scikit-Learn.

It’s also possible to build a model that lacks complexity, which results in underfitting the actual signal in the data. This is observed on the left of the plot above: the model is too simple and is said to have high bias.

Bias-variance tradeoff.

One of the goals of this machine learning model is to achieve balance between bias and variance. To accomplish this, the data is split into a training subset and a testing subset. The model is fit on the training data. It is then used to predict the outcome, cost.

In this case, the training set consists of data ranging from 1998 to May 2015, and the testing data ranges from June 2015 through December 2017.
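That chronological split comes down to a date cutoff; a sketch, again assuming the date_installed and cost column names:

```python
# Train on installations before June 2015; test on June 2015 - December 2017
cutoff = "2015-06-01"
train = df[df["date_installed"] < cutoff]
test = df[(df["date_installed"] >= cutoff) & (df["date_installed"] <= "2017-12-31")]

X_train, y_train = train.drop(columns="cost"), train["cost"]
X_test, y_test = test.drop(columns="cost"), test["cost"]
```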

Photo by William Iven on Unsplash

Metrics

How do we evaluate the success of a model?

In this case, two metrics were used: mean absolute error (MAE) and root mean squared error (RMSE).

MAE is essentially the average error between predicted and actual values, where the direction of the error (positive or negative) is irrelevant. RMSE squares the differences between predicted and actual values, averages them, and then takes the square root of that average. Outliers in the data will have a greater influence on RMSE than on MAE; as a result, RMSE is an excellent metric to use when outliers are present.
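Both metrics are available in Scikit-Learn; a minimal sketch, where search is the fitted pipeline from earlier:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred = search.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"MAE: ${mae:,.0f}  RMSE: ${rmse:,.0f}")
```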

Results

During feature engineering, transforming the predictors was necessary so that the resulting model would provide the “best results”. In this case, best results are achieved by minimizing RMSE on the validation data.

Through a number of trials, reasonable results were achieved by multiplying all the variables by size_kw since it had the highest correlation to the target cost. Additionally, categorical variables such as state, tracking_type, and tech_1 were initially encoded as dummy variables but had no significant impact on the error and thus were not used.

Using ElasticNet and Ridge Regression we can visualize the results below.

ElasticNet Regression metrics
Ridge Regression metrics

The impact of outliers can be seen through the RMSE. The training score being higher than the test score would imply that there may be some underfitting in the training set. However, both models perform quite well on the test set. The fit time is measured in seconds, and we can see that both linear models are very quick to train.

Coefficients

Above, we can see that both linear models agree on the most important variable, which is the product of two variables, size_kw and cost_per_watt.

Cross-Validation

So far both models have performed pretty well, but they need to be tested further. To do that, we can use Scikit-Learn’s TimeSeriesSplit for time-series validation. This essentially splits the data such that training is performed on past data and predictions are made on future data.

Visualizing cross validation time-series split. Scikit-Learn

Below, we can visualize this process of splitting data, training on past data, and then predicting on future data.
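A sketch of that evaluation, using cross_validate with TimeSeriesSplit so that each fold reports both a training and a validation score:

```python
from sklearn.model_selection import TimeSeriesSplit, cross_validate

scores = cross_validate(
    pipe, X_train, y_train,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)
print(-scores["train_score"])  # training RMSE per fold
print(-scores["test_score"])   # validation RMSE per fold
```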

Note that the validation error starts higher than training, implying that there is overfitting occurring, but begins to converge with training score as more data is used to train the model. It also seems as though the error continues to decrease, although very gradually. At some point, adding more training data will result in minuscule error reduction.

Random Forest and Gradient Boosting

The two models from earlier, Ridge and ElasticNet, are relatively simple but provided pretty good results. Now we can try to build a few more models using more complex algorithms such as Random Forest and Gradient Boosting.

Random Forest Regression metrics

The resulting metrics from the Random Forest model appear to be better than those of the linear models, Ridge and ElasticNet. The training time is much higher due to its complexity, but Random Forest requires less feature engineering and doesn’t require scaling, which in turn saves time when producing a model.

Additionally, this is a tuned model: the results were obtained after finding the optimal parameters that minimize the RMSE on the validation set. The process of finding the best parameters is computationally expensive and thus adds more time towards the production of a model.

The last model, Gradient Boosting, is also more complex than the linear models and its results are visualized below. This model seems to perform the worst and has the longest training time.

Gradient Boosting metrics

A final comparison of all models is presented below. We can interpret RMSE in terms of dollars and fit time in seconds, as they both fall on a similar scale.

Additionally, we can visualize the feature importance produced by Gradient Boosting and Random Forest models.

The feature with the greatest impact on cost is the product of two variables, size_kw and cost_per_watt. All models appear to agree on this.
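Tree-based models expose these rankings through the feature_importances_ attribute; a sketch, where rf stands in for an already fitted RandomForestRegressor and feature_names matches its training columns:

```python
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```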

What’s the best model?

Generally, to classify a model as being the “best” is relative to the context of the problem that is being solved. In this case, the goal was to predict the hardware cost of a solar array installation with minimal error, as measured by RMSE.

Considering that the cost of installations ranges in the tens of thousands of dollars, all models performed quite well, as their RMSEs were below $100. Making a prediction that is within $100 of the actual cost is very accurate. However, numerous other factors may be considered, such as preprocessing/transforming the data, training time, and interpretability.

For simplicity and interpretability, the linear models are the best. They provide a simple understanding of how the predictors affect the outcome. In this case, hardware cost is mostly the product of two predictors, size_kw and cost_per_watt, plus some error.

Photo by Jakub Kriz on Unsplash

Conclusion

No matter where you live, solar costs are approaching historic lows, while incentives are becoming more widely available from various sources: local, state, and federal governments, and utilities.

At the same time, the environment needs a break from fossil fuels. Cheaper solar energy is here for the taking. Might as well do some good while saving money.

The complete project is available on my GitHub.

If you made it this far, or even skimmed through it, thank you for reading.
