A project I did as a data science intern: Solar-Powered House Data Analysis Report
*This project was done during my internship at Mobagel.
There is a solar-powered house located in Taiwan. With the given dataset, my goal is to predict and propose the following:
- Predict the daily pattern of household electricity usage of each month
- Predict the daily pattern of solar-power generation of each month
- Ideal inverter power
- Ideal peak household electricity consumption
How a solar-powered house works
On sunny days, solar panels convert sunlight into DC electricity, part of which is stored in the battery. The inverter converts DC electricity to AC to supply the house loads, and appliances draw electricity from the inverter first. Any surplus is exported to the grid (mains). When the inverter cannot cover the house loads, the shortfall is imported from the grid.
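The flow described above can be sketched as a balance at a single moment in time. This is a simplified illustration, not the house's actual control logic: the function name `grid_flow` is hypothetical, and it assumes the inverter output is the only non-grid supply.

```python
def grid_flow(inverter_kw: float, house_kw: float) -> tuple[float, float]:
    """Return (imported_kw, exported_kw) for one moment in time.

    Simplifying assumption: the house load is served by the inverter first,
    and the grid absorbs any surplus or covers any shortfall.
    """
    surplus = inverter_kw - house_kw
    if surplus >= 0:
        return 0.0, surplus   # enough inverter power: export the rest
    return -surplus, 0.0      # not enough: import the shortfall


midday = grid_flow(inverter_kw=1.0, house_kw=0.25)   # surplus case
evening = grid_flow(inverter_kw=0.25, house_kw=1.0)  # shortfall case
```

Here `midday` is an export of 0.75 kW and `evening` is an import of 0.75 kW.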
Data Visualization: ‘house_KW’
- The November data is abnormal.
- Power consumption is higher in summer than in winter.
Daily pattern: (x = _id.hour, y = _id.month)
Battery percentage peak: 10:00 ~ 17:00
House_KW peak: 20:00 ~ midnight
House_KWH: daily cumulative usage
Mains: sell 10:00 ~ 16:00; buy 16:00 ~ 10:00
Solar_powergen: high in August, low in November
Solar_stats: cumulative power generation
Regardless of the battery's charge level, the house often imports electricity from the grid and rarely exports.
1. Predict the daily pattern of solar-power generation of each month
September’s hourly solar_powergen
(X = time, y = solar_powergen)
September’s average daily solar_powergen
Feature Engineering:
1. I added weather features: sunshine hours, cloudiness, temperature, and the number of days with a temperature above 30 degrees Celsius. I chose these because I expected them to strongly affect solar-power generation. (Data source: Central Weather Bureau, 交通部中央氣象局)
2. I grouped the data by month and hour, since I presumed that knowing the daily pattern of each month is sufficient in this case. I then took the median of each column within each group so that the values are not influenced by outliers.
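The grouping step can be illustrated with a small standard-library sketch. The readings below are made up for illustration, and `median_by_month_hour` is a hypothetical helper, not the code used in the project.

```python
from collections import defaultdict
from statistics import median


def median_by_month_hour(rows):
    """rows: iterable of (month, hour, value) -> {(month, hour): median value}."""
    groups = defaultdict(list)
    for month, hour, value in rows:
        groups[(month, hour)].append(value)
    return {key: median(values) for key, values in groups.items()}


# Made-up readings; 9.0 is an outlier that the median ignores.
readings = [
    (9, 12, 1.9), (9, 12, 2.1), (9, 12, 9.0),
    (9, 13, 1.5), (9, 13, 2.5),
]
profile = median_by_month_hour(readings)
```

The median for (month 9, hour 12) is 2.1, whereas the mean would be pulled up to about 4.3 by the outlier. With pandas, the same aggregation is typically written as `df.groupby(["month", "hour"]).median()`.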
Model:
I tried random forest, gradient boosting, and bagging regressors, because they often perform better than other estimators. On the test set, the gradient boosting regressor had the highest R² score of the three.
*(Bagging: 0.793 / random forest: 0.807 / gradient boosting: 0.816)
GradientBoostingRegressor(max_depth=3, learning_rate=0.1, n_estimators=200, subsample=0.9, random_state=0)
X_train = Time columns + hourly average of the rest of the columns
y_train = solar_powergen
X_test = Time columns + hourly average of the rest of the columns from the training set
y_test = solar_powergen
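A minimal sketch of fitting the regressor with the hyperparameters listed above. The data here is synthetic (an hour-of-day column and a made-up `sunshine` feature), so the resulting score says nothing about the real dataset; it only shows the shape of the workflow, assuming scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: generation follows a daytime
# bell curve scaled by a made-up sunshine feature, plus a little noise.
hour = rng.integers(0, 24, size=500)
sunshine = rng.uniform(0, 1, size=500)
daylight = np.clip(np.sin((hour - 6) / 12 * np.pi), 0, None)
solar_powergen = daylight * (0.5 + sunshine) + rng.normal(0, 0.05, size=500)

X = np.column_stack([hour, sunshine])
X_train, X_test, y_train, y_test = train_test_split(
    X, solar_powergen, random_state=0
)

# Same hyperparameters as in the report.
gbr = GradientBoostingRegressor(
    max_depth=3, learning_rate=0.1, n_estimators=200,
    subsample=0.9, random_state=0,
)
gbr.fit(X_train, y_train)
r2 = gbr.score(X_test, y_test)  # R² on the held-out samples
```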
*Note: “Accuracy” in this report is defined as 1 - | sum(|prediction - true|) / sum(prediction) |, that is, one minus the total absolute error relative to the total predicted value.
- The gbr.score (R² score) below is computed over each sample and its prediction.
- The accuracy below represents how close the predictions are to the true values of the average daily solar-power generation of each month.
R² provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. Accuracy is a description of observational errors. (Wiki)
R² measures “explained variance” but not prediction error, so I used both metrics: R² to show how well the model fits, and accuracy to show how close the predictions are.
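The accuracy definition can be written as a short function. This follows my reading of the formula in the note (total absolute error divided by the total prediction); the name `report_accuracy` and the sample numbers are illustrative only.

```python
def report_accuracy(predictions, actuals):
    """1 - |sum(|prediction - true|) / sum(prediction)|."""
    abs_error = sum(abs(p - a) for p, a in zip(predictions, actuals))
    return 1 - abs(abs_error / sum(predictions))


# Made-up values: total absolute error is 2.0, total prediction is 8.0.
example = report_accuracy([2.0, 4.0, 2.0], [1.0, 4.0, 3.0])
```

For these values the result is 0.75. Note that the metric is undefined when the predictions sum to zero.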
month: 8 | gbr.score: 0.949485042148 | Accuracy: 0.87971514877
month: 9 | gbr.score: 0.816181749495 | Accuracy: 0.783695311456
month: 10 | gbr.score: 0.791087363623 | Accuracy: 0.667679129653
month: 11 | gbr.score: 0.884837348683 | Accuracy: 0.71054816527
month: 12 | gbr.score: 0.927028536038 | Accuracy: 0.895645958349
month: 1 | gbr.score: 0.833262924263 | Accuracy: 0.767611800185
month: 2 | gbr.score: 0.896935738999 | Accuracy: 0.817152968986
month: 3 | gbr.score: 0.758768541144 | Accuracy: 0.779393732883
month: 4 | gbr.score: 0.856984796955 | Accuracy: 0.722903100651
month: 5 | gbr.score: 0.859756340341 | Accuracy: 0.820325939924
Average accuracy of each month grouped by hour = 0.784
Average R² score of each month grouped by hour = 0.857
2. Predict the daily pattern of household electricity usage of each month
September’s hourly household power consumption
(X = time, y = house_KW)
September’s daily household power consumption
Model:
GradientBoostingRegressor(
max_depth=2,
learning_rate=0.09,
n_estimators=50,
random_state=0)
X_train = Time columns + hourly average of the rest of the columns
y_train = house_KW
X_test = Time columns + hourly average of the rest of the columns from the training set
y_test = house_KW
month: 8 | Accuracy of each sample: 0.730 | Accuracy of daily average: 0.754
month: 9 | Accuracy of each sample: 0.534 | Accuracy of daily average: 0.617
month: 10 | Accuracy of each sample: 0.704 | Accuracy of daily average: 0.798
month: 11 | Accuracy of each sample: 0.650 | Accuracy of daily average: 0.711
month: 12 | Accuracy of each sample: 0.730 | Accuracy of daily average: 0.791
month: 1 | Accuracy of each sample: 0.667 | Accuracy of daily average: 0.740
month: 2 | Accuracy of each sample: 0.778 | Accuracy of daily average: 0.813
month: 3 | Accuracy of each sample: 0.731 | Accuracy of daily average: 0.777
month: 4 | Accuracy of each sample: 0.651 | Accuracy of daily average: 0.751
month: 5 | Accuracy of each sample: 0.547 | Accuracy of daily average: 0.620
On the test set:
Average accuracy of each month = 0.672
Average accuracy of each month grouped by hour = 0.737
Average of each month’s median CV score (excluding November) = 0.639
On the full dataset:
Average accuracy of each month = 0.68
Average of each month’s median CV score (excluding November) = 0.627
Benchmark:
A purely statistical baseline, which predicts the median house_KW for each hour of each month, reaches an accuracy of 0.48 per sample and 0.62 for the daily average.
My model outperforms this benchmark on both measures (0.672 vs. 0.48 per sample, 0.737 vs. 0.62 for the daily average on the test set).
3. Ideal inverter power
Model:
DecisionTreeClassifier(max_depth=3, random_state=1,
min_samples_split=0.1, max_leaf_nodes=5, min_samples_leaf= 0.1)
cv_score = 0.874
According to the decision tree, the house can mostly operate without importing electricity from the grid when:
- the inverter power is above 0.838 kW (about 20% of the time), or
- the inverter power is below 0.838 kW and house_KW is below about 0.23 kW (about 10% of the time)
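The two conditions above can be expressed as a simple rule. This is a hand-written restatement of the thresholds reported from the tree, not the fitted classifier itself, and the function name `likely_independent` is hypothetical. Because the tree's rule holds "mostly", the function is a heuristic rather than a guarantee.

```python
INVERTER_THRESHOLD_KW = 0.838  # split reported for inverter power
HOUSE_THRESHOLD_KW = 0.23      # approximate split reported for house_KW


def likely_independent(inverter_kw: float, house_kw: float) -> bool:
    """True if the reported thresholds suggest no grid import is needed."""
    if inverter_kw > INVERTER_THRESHOLD_KW:
        return True
    return house_kw < HOUSE_THRESHOLD_KW


strong_inverter = likely_independent(inverter_kw=1.2, house_kw=0.9)
light_load = likely_independent(inverter_kw=0.3, house_kw=0.1)
heavy_load = likely_independent(inverter_kw=0.3, house_kw=0.9)
```

Only the last case (weak inverter output with a heavy load) is flagged as likely to need the grid.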
Current Inverter Power:
mean 0.533212 | std 0.796064
min -1.945600 | max 4.410000
25% 0.000000 | 50% 0.237000 | 75% 0.711000
The total amount of electricity imported from the mains is 1875.57 kWh, which costs NTD 3923.07.
The total amount of electricity exported to the mains is 1740.60 kWh, which earned NTD 12,463 (calculated at 7.16 NTD/kWh). (Data source: Taipower electricity price calculation example, 台電電價計算範例)
Provided the battery has charge left, raising the inverter power to the 75th percentile of house_KW (0.7 kW) would reduce the total imported electricity to 185.79 kWh and lower the fee to NTD 302.83. It would also let us sell 1055.7 kWh more back to the grid, for a total of NTD 20,021.
The ideal inverter power is 0.838 kW. With that, the solar-powered house would be very close to an energy self-sufficient one.
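The export income figures can be checked with a few lines of arithmetic, using only the quantities and the 7.16 NTD/kWh rate stated above. The variable names are mine.

```python
EXPORT_RATE_NTD_PER_KWH = 7.16

current_export_kwh = 1740.60  # exported today
extra_export_kwh = 1055.7     # additional export with the larger inverter

current_income = current_export_kwh * EXPORT_RATE_NTD_PER_KWH
improved_income = (current_export_kwh + extra_export_kwh) * EXPORT_RATE_NTD_PER_KWH

print(f"current:  NTD {current_income:.2f}")
print(f"improved: NTD {improved_income:.2f}")
```

The results come to roughly NTD 12,462.70 and NTD 20,021.51, consistent with the reported NTD 12,463 and NTD 20,021. The import fee is not reproduced here because the report gives no per-kWh import rate.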
The correlation between inverter and solar_powergen is high.
Linear Regression:
Coefficients: [[ 0.661]]
Intercept: [ 0.128]
Score: 0.712
inverter = 0.128 + 0.661*solar_powergen
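The fitted line can be applied directly to estimate inverter power from a solar_powergen reading. The coefficients are the ones reported above; the function name `estimate_inverter_kw` is hypothetical, and the estimate is only as good as the fit (score 0.712).

```python
INTERCEPT = 0.128
SLOPE = 0.661


def estimate_inverter_kw(solar_powergen_kw: float) -> float:
    """Estimate inverter power from solar generation using the fitted line."""
    return INTERCEPT + SLOPE * solar_powergen_kw


at_one_kw = estimate_inverter_kw(1.0)  # approximately 0.789
```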
Suggestion: Use appliances during the daytime (10:00 ~ 16:00) where possible, when the inverter has more power available, to reduce the amount of power imported from the grid.
4. Ideal peak household electricity consumption
Model:
DecisionTreeClassifier(max_depth=3, random_state=1, min_samples_split=0.001, max_leaf_nodes=5, min_samples_leaf= 0.1)
cv_score = 0.828
When solar_powergen is above 1.467 kW (about 20% of the time), there is usually no need to restrict household electricity consumption. Otherwise, it is better to keep household electricity consumption below 0.23 kW to stay independent of the mains.