A project I did as a data science intern: Solar-Powered House Data Analysis Report

Chenjing.Claire
8 min read · Jul 27, 2017


*This project was done during my internship at Mobagel.

There is a solar-powered house located in Taiwan. With the given dataset, my goal is to predict and propose the following:

  1. Predict the daily pattern of solar-power generation of each month
  2. Predict the daily pattern of household electricity usage of each month
  3. Ideal inverter power
  4. Ideal peak household electricity consumption

Image credit: AGL Solar

How a solar-powered house works

On sunny days, the solar panels convert sunlight to DC electricity, and part of it is stored in the battery. The inverter converts DC electricity to AC electricity to supply the house loads. Appliances in the house draw electricity from the inverter first. When there is a surplus, it is exported to the grid (mains). When there isn’t enough electricity to supply the house loads, the shortfall is covered by importing electricity from the grid (mains).
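The export/import behaviour described above can be read as a simple power balance. The sketch below is a deliberate simplification: it ignores the battery and conversion losses, and the function name and sign convention (positive means export) are my own assumptions, not something taken from the dataset.

```python
def grid_exchange_kw(inverter_kw: float, house_kw: float) -> float:
    """Return the net power exchanged with the grid, in kW.

    Positive: surplus exported to the grid.
    Negative: shortfall imported from the grid.
    Assumption: battery charging and conversion losses are ignored.
    """
    return inverter_kw - house_kw


surplus = grid_exchange_kw(inverter_kw=1.2, house_kw=0.5)    # exporting
shortfall = grid_exchange_kw(inverter_kw=0.2, house_kw=0.5)  # importing
```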

Data Visualization: ‘house_KW’

  • The November data is abnormal.
  • Power consumption is higher in summer than in winter.

Daily pattern: (x = _id.hour, y = _id.month)

Battery percentage peak:
10:00 ~ 17:00

House_KW peak: 20:00 ~ midnight

House_KWH: daily cumulative usage

Mains: sell 10:00 ~ 16:00; buy 16:00 ~ 10:00

Solar_powergen: high in August, low in November

Solar_stats: cumulative power generation

Regardless of the battery’s charge level, the house often imports electricity from the grid and rarely exports it.

1. Predict the daily pattern of solar-power generation of each month

September’s hourly solar_powergen
(X = time, y = solar_powergen)

September’s average daily solar_powergen

Feature Engineering:
1. I added weather features: sunshine hours, cloudiness, temperature, and the number of days with a temperature above 30 degrees Celsius. I chose these because solar-power generation is likely to be strongly affected by them. (Data source: Central Weather Bureau, 交通部中央氣象局)

2. I grouped the data by month and hour, since I presumed that knowing the daily pattern of each month is sufficient in this case. I then took the median of each column within each group so that the values are not skewed by outliers.
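The grouping step can be sketched with pandas as follows. This is a minimal illustration rather than the original pipeline: the column names mirror those used in this report, but the readings are made up.

```python
import pandas as pd

# Hypothetical hourly readings; the values are invented for illustration.
df = pd.DataFrame({
    "month": [9, 9, 9, 9, 10, 10],
    "hour": [12, 12, 12, 13, 12, 12],
    "solar_powergen": [1.0, 1.2, 9.0, 0.8, 0.6, 0.4],
})

# One row per (month, hour). The median damps the 9.0 outlier,
# which would have pulled a mean far upwards.
hourly_median = df.groupby(["month", "hour"], as_index=False).median()
print(hourly_median)
```

With the mean, the September 12:00 slot would be dominated by the single 9.0 reading; with the median it stays at a typical value.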

Model:

I tried random forest, gradient boosting, and bagging regressors, because they often perform better than other estimators. On the test set, the gradient boosting regressor had the highest R² score of the three.
*(Bagging: 0.793 / random forest: 0.807 / gradient boosting: 0.816)

GradientBoostingRegressor(max_depth=3, learning_rate=0.1, n_estimators=200, subsample=0.9, random_state=0)

X_train = Time columns + hourly average of the rest of the columns
y_train = solar_powergen
X_test = Time columns + hourly average of the rest of the columns from the training set
y_test = solar_powergen
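To make the setup concrete, here is a minimal sketch of fitting the regressor with the hyperparameters listed above. The data is synthetic (a made-up hour feature and a made-up sunshine feature), and the 80/20 split is an assumption; the real feature matrix is described only in words in this report.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: generation peaks around midday
# and scales with a made-up "sunshine" feature, plus a little noise.
hour = rng.integers(0, 24, size=500)
sunshine = rng.uniform(0.0, 1.0, size=500)
y = np.clip(np.sin((hour - 6) / 12 * np.pi), 0, None) * sunshine
y = y + rng.normal(0.0, 0.05, size=500)
X = np.column_stack([hour, sunshine])

# Assumed 80/20 split; the report does not state the split ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

gbr = GradientBoostingRegressor(
    max_depth=3, learning_rate=0.1, n_estimators=200,
    subsample=0.9, random_state=0,
)
gbr.fit(X_train, y_train)

# score() returns R² on the held-out samples.
r2_test = gbr.score(X_test, y_test)
print(f"R² on held-out split: {r2_test:.3f}")
```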

*Note: “Accuracy” in this report is defined as 1 − |Σ|prediction − true| / Σ prediction|. The values below are reported as fractions rather than percentages.

  • The gbr.score (R² score) below is computed over each sample and its prediction.
  • The accuracy below measures how close the predicted average daily solar-power generation of each month is to the true value.

R² provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. Accuracy is a description of observational errors. (Wikipedia)

R² measures “explained variance” but not prediction error, so I used two evaluations to show how well the model performs and how accurate the prediction is.
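The accuracy measure from the note above can be written as a small function. This is my reading of that definition (total absolute error divided by the total of the predictions), so treat it as a sketch; the function name is hypothetical.

```python
def report_accuracy(y_true, y_pred):
    """Return 1 - |sum(|pred - true|) / sum(pred)|.

    Assumption: this follows the definition given in the note above,
    with the result kept as a fraction rather than a percentage.
    """
    total_abs_error = sum(abs(p - t) for p, t in zip(y_pred, y_true))
    return 1 - abs(total_abs_error / sum(y_pred))


# Total absolute error is 1.0 and the predictions sum to 6.0.
acc = report_accuracy(y_true=[1.0, 2.0, 3.0], y_pred=[1.5, 2.0, 2.5])
```

Unlike R², this value depends directly on the size of the errors relative to the predicted totals, which is why the two numbers can move in different directions from month to month.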

month   gbr.score (R²)    Accuracy
8       0.949485042148    0.87971514877
9       0.816181749495    0.783695311456
10      0.791087363623    0.667679129653
11      0.884837348683    0.71054816527
12      0.927028536038    0.895645958349
1       0.833262924263    0.767611800185
2       0.896935738999    0.817152968986
3       0.758768541144    0.779393732883
4       0.856984796955    0.722903100651
5       0.859756340341    0.820325939924

Average accuracy of each month grouped by hour = 0.784
Average R² score of each month grouped by hour = 0.857

cross_val_predict (cv=10)
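For reference, cross_val_predict returns one out-of-fold prediction per sample, so every point is predicted by a model that did not see it during fitting. A minimal sketch on synthetic data (the single "hour" feature and the target curve are made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Made-up data: one "hour" feature and a noisy daily curve.
X = rng.uniform(0, 24, size=(200, 1))
y = np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 0.1, size=200)

gbr = GradientBoostingRegressor(random_state=0)

# 10-fold CV: each sample's prediction comes from the fold
# in which that sample was held out.
y_cv = cross_val_predict(gbr, X, y, cv=10)
```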

2. Predict the daily pattern of household electricity usage of each month

September’s hourly household power consumption
(X = time, y = house_KW)

September’s daily household power consumption

Model:
GradientBoostingRegressor(
max_depth=2,
learning_rate=0.09,
n_estimators=50,
random_state=0)

X_train = Time columns + hourly average of the rest of the columns
y_train = house_KW
X_test = Time columns + hourly average of the rest of the columns from the training set
y_test = house_KW

month   Accuracy of each sample   Accuracy of daily average
8       0.730                     0.754
9       0.534                     0.617
10      0.704                     0.798
11      0.650                     0.711
12      0.730                     0.791
1       0.667                     0.740
2       0.778                     0.813
3       0.731                     0.777
4       0.651                     0.751
5       0.547                     0.620

On test set:
Average accuracy of each month = 0.672
Average accuracy of each month grouped by hour = 0.737
Average of each month’s median CV score (excluding November) = 0.639

On full dataset:
Average accuracy of each month = 0.68
Average of each month’s median CV score (excluding November) = 0.627

cross_val_predict (cv=10)

Benchmark:
A purely statistical baseline (predicting the median house_KW of each hour in each month) gives an accuracy of 0.48 per sample and 0.62 for the daily average.
My model outperforms this benchmark on both measures: 0.672 per sample and 0.737 for the daily average on the test set.
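The benchmark amounts to a lookup table of medians. A minimal sketch in plain Python, with made-up readings and hypothetical helper names:

```python
from collections import defaultdict
from statistics import median

# Hypothetical (month, hour, house_KW) readings for illustration.
readings = [
    (9, 20, 0.8), (9, 20, 1.0), (9, 20, 1.2),
    (9, 3, 0.2), (9, 3, 0.4),
]

# Collect the readings that fall into each (month, hour) slot.
by_slot = defaultdict(list)
for month, hour, kw in readings:
    by_slot[(month, hour)].append(kw)

# Baseline "model": always predict the median of the matching slot.
baseline = {slot: median(values) for slot, values in by_slot.items()}


def predict_baseline(month, hour):
    """Return the median house_KW seen for this month and hour."""
    return baseline[(month, hour)]
```

Any learned model should beat this lookup to justify its extra complexity, which is the comparison made above.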

3. Ideal inverter power
Model:
DecisionTreeClassifier(max_depth=3, random_state=1,
min_samples_split=0.1, max_leaf_nodes=5, min_samples_leaf=0.1)
cv_score = 0.874

The house can mostly operate independently, without importing electricity from the grid, when:

  1. the inverter power is above 0.838 kW (about 20% of the time)
  2. the inverter power is below 0.838 kW and house_KW is below about 0.23 kW (about 10% of the time)
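Thresholds like these can be read off a fitted tree with export_text. The sketch below uses the same hyperparameters but synthetic data, and the label (1 when no grid import is needed) is my assumption about the target, since the report does not spell it out.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Made-up inverter power and household load, in kW.
inverter = rng.uniform(0.0, 2.0, size=400)
house_kw = rng.uniform(0.0, 1.0, size=400)

# Assumed label: 1 when the inverter covers the load (no grid import).
independent = (inverter >= house_kw).astype(int)

X = np.column_stack([inverter, house_kw])
clf = DecisionTreeClassifier(
    max_depth=3, random_state=1,
    min_samples_split=0.1, max_leaf_nodes=5, min_samples_leaf=0.1,
)
clf.fit(X, independent)

# Text rendering of the learned split thresholds.
rules = export_text(clf, feature_names=["inverter", "house_KW"])
print(rules)
```

The small max_leaf_nodes and large min_samples_leaf keep the tree shallow, so the printed rules stay short enough to turn into plain-language advice.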

Current inverter power (kW):
mean 0.533212 | std 0.796064
min -1.945600 | max 4.410000
25% 0.000000 | 50% 0.237000 | 75% 0.711000

The total amount of electricity imported from the mains is 1875.57 kWh, which costs NTD 3,923.07.
The total amount of electricity exported to the mains is 1740.60 kWh, which earned NTD 12,463 (calculated at 7.16 NTD/kWh). (Data source: Taipower electricity tariff calculation example, 台電電價計算範例)

Provided the battery has charge left, if we can raise the inverter power to the 75th percentile of house_KW (0.7 kW), we can reduce the total electricity imported to 185.79 kWh and cut the fee to NTD 302.83. Moreover, we can sell 1055.7 kWh more back to the grid, for a total of NTD 20,021.
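The export revenue figures follow directly from the 7.16 NTD/kWh rate. A quick check of the arithmetic (variable names are my own; the results match the figures above to within rounding):

```python
FEED_IN_RATE = 7.16  # NTD per kWh, the rate quoted above

exported_kwh = 1740.60       # current total export
extra_export_kwh = 1055.7    # additional export with a larger inverter

current_revenue = exported_kwh * FEED_IN_RATE
projected_revenue = (exported_kwh + extra_export_kwh) * FEED_IN_RATE

print(f"current revenue:   NTD {current_revenue:,.2f}")
print(f"projected revenue: NTD {projected_revenue:,.2f}")
```

The import cost is not recomputed here, because the mains tariff is tiered and the report only gives the totals.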

The ideal inverter power is 0.838 kW. With that, the solar-powered house would be very close to energy self-sufficient.

The correlation between inverter and solar_powergen is high.

Linear Regression:

Coefficients: [[ 0.661]]
Intercept: [ 0.128]
Score: 0.712

inverter = 0.128 + 0.661*solar_powergen
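The fitted line can be applied directly to estimate inverter power from solar generation. A small sketch (the function name is hypothetical, and with a score of 0.712 the estimate is only approximate):

```python
INTERCEPT = 0.128  # from the regression above
SLOPE = 0.661


def estimate_inverter_kw(solar_powergen_kw: float) -> float:
    """Apply inverter = 0.128 + 0.661 * solar_powergen."""
    return INTERCEPT + SLOPE * solar_powergen_kw


# Estimated inverter power when the panels generate 1.0 kW.
estimate_at_1kw = estimate_inverter_kw(1.0)
```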

Suggestion: Use appliances during the daytime (10:00 ~ 16:00), when the inverter has more power available, to reduce the amount of power imported from the grid.

4. Ideal peak household electricity consumption
Model:
DecisionTreeClassifier(max_depth=3, random_state=1, min_samples_split=0.001, max_leaf_nodes=5, min_samples_leaf=0.1)
cv_score = 0.828

Most of the time when solar_powergen is above 1.467 kW (about 20% of the time), no restriction on household electricity consumption is needed. Otherwise, it is better to keep household electricity consumption below 0.23 kW to stay independent of the mains.
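This rule of thumb can be expressed as a tiny helper. The thresholds are the ones reported above; the function name and the use of None for "no cap" are my own choices for illustration.

```python
SOLAR_THRESHOLD_KW = 1.467  # solar_powergen threshold from the tree
HOUSE_LIMIT_KW = 0.23       # suggested cap on house_KW otherwise


def suggested_house_limit_kw(solar_powergen_kw):
    """Return the suggested cap on house_KW, or None if no cap applies."""
    if solar_powergen_kw > SOLAR_THRESHOLD_KW:
        return None  # enough generation: no restriction suggested
    return HOUSE_LIMIT_KW


sunny_limit = suggested_house_limit_kw(2.0)     # no cap
overcast_limit = suggested_house_limit_kw(0.5)  # capped
```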
