ASHRAE GREAT ENERGY PREDICTION III CHALLENGE

rashi mathur · Analytics Vidhya · Oct 13, 2020

Table Of Contents:

1. Problem Statement

2. Dataset

3. Mapping the real-world problem to an ML problem

4. Exploratory Data Analysis

5. Feature Engineering

6. Models and Hyperparameter tuning

7. References

1. Problem Statement:

This is a Kaggle competition organized by ASHRAE (the American Society of Heating, Refrigerating, and Air-Conditioning Engineers). ASHRAE, founded in 1894, is a global society advancing human well-being through sustainable technology for the built environment. The data comes from over 1,000 buildings over a three-year timeframe (2016 to 2019). With better estimates of the savings from energy-efficiency investments, large-scale investors and financial institutions will be more inclined to invest in this area and enable progress in building efficiencies.

Source: https://www.kaggle.com/c/ashrae-energy-prediction/overview

2. Dataset:

The dataset consists of five CSV files: train.csv, test.csv, building_metadata.csv, weather_train.csv, and weather_test.csv. train.csv, test.csv, weather_train.csv, and weather_test.csv are time-series data. building_metadata.csv consists of features like site_id, building_id, primary_use, square_feet, and year_built. The weather_train and weather_test data consist of weather features such as air_temperature, dew_temperature, cloud_coverage, wind_speed, wind_direction, precip_depth_1_hr, and sea_level_pressure. The weather_train data was measured from 1 Jan 2016 to 1 Jan 2017, and the weather_test data spans from 1 Jan 2017 to 1 Jan 2019.

Source: https://www.kaggle.com/c/ashrae-energy-prediction/data

3. Mapping the real-world problem to an ML problem :

In this problem, we have to predict meter_readings for two years (2017 to 2019). This is a regression problem, as meter_reading, our target variable, is a continuous variable. The evaluation metric given is RMSLE (Root Mean Squared Logarithmic Error):

$$\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(\log(\hat{y}_i + 1) - \log(y_i + 1)\big)^2}$$

where:

$N$ is the total number of observations in the (public/private) data set,

$\hat{y}_i$ is your prediction of the target,

$y_i$ is the actual target for observation $i$,

$\log(x)$ is the natural logarithm of $x$.
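For reference, a minimal NumPy sketch of this metric (my own illustrative implementation, not the competition's scoring code):

import numpy as np

def rmsle(y_true, y_pred):
    # RMSLE as defined above; log1p(x) computes log(x + 1).
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))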

Source: https://www.kaggle.com/c/ashrae-energy-prediction/overview/evaluation

4. Exploratory Data Analysis:

First, we merge the train data with weather_train and building_metadata and perform our analysis on this merged train data. We start by analyzing the meter_type feature of the train data (0: electricity, 1: chilled water, 2: steam, 3: hot water). There are two bar plots for this: one showing log meter reading vs. meter type, the other showing the frequency of each meter type.
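A minimal pandas sketch of this merge step (file names from the competition data; the join keys building_id and site_id/timestamp follow the dataset description above):

import pandas as pd

train = pd.read_csv("train.csv")
building = pd.read_csv("building_metadata.csv")
weather_train = pd.read_csv("weather_train.csv")

# Attach building metadata by building_id, then weather by site_id and timestamp.
train = train.merge(building, on="building_id", how="left")
train = train.merge(weather_train, on=["site_id", "timestamp"], how="left")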

We see that the steam meter type consumes the most energy, the most frequently used meter type is the electricity meter, and the hot water meter type is the least used. Next, we analyze the site_id and primary_use features. site_id is the location of the building: there are 16 sites in this data, each given a unique identifier, so site_id ranges from 0 to 15. The primary_use feature tells us the purpose for which the building at a given location (site_id) is used; there are 16 primary_use types. Through a stacked plot we see the average meter_reading by primary_use per site_id.

We see from the above plot that site_id 13 has the highest mean meter_reading, with its 'Education' buildings contributing the most. To further analyze which meter type is installed at the various building types, we again use a stacked plot. Due to the high number of steam meters installed in Education buildings, we have a high average meter reading.

Next, we plot the daily mean meter reading for the year 2016. We want to see the pattern of mean meter_readings per day in 2016, which gives us an idea of which months' readings are high or low.

Here the meter readings explode abnormally after March, stay almost at zero from July to November, and then peak in November. We have to dig deeper to find the outliers causing such abnormal jumps in the mean meter reading, so we analyze the mean meter readings of each site_id. It was found that the mean meter_reading of building 1099 (an Education building at site_id 13) is nearly the same as the overall 2016 mean meter_reading: this single building dominates the whole mean meter_reading in 2016, so it is an outlier. After removing building 1099, we get the following meter_reading plot:

From the above plot, we see a drastic increase in the mean meter_reading from September to October and again from October to November, so let's find its cause. From the previous stacked-plot analysis, we saw that site_id 6 has the second-highest meter_reading, so we analyze site_id 6 further. It was found that building 778 (an entertainment building at site_id 6) has mean meter_readings that dominate the overall mean meter_reading, so it is also an outlier and was removed. The overall mean meter_reading now is:
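The removal itself is a simple filter on the merged DataFrame; a sketch, using the two building_id values identified above:

# Drop the two outlier buildings: 1099 (Education, site_id 13)
# and 778 (entertainment, site_id 6).
train = train[~train["building_id"].isin([1099, 778])].copy()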

This is the plot of the overall mean meter_reading after the removal of the outliers. The curve now looks normal, with reasonable mean meter_readings.

5. Feature Engineering:

a) Filling missing values:

Many weather-based features have missing values, so we need to fill them. We fill missing air temperature values with the mean temperature of that day of the month: each month falls in a season and temperature varies a lot within a season, so filling with a yearly mean value is not a good idea. Some features have long runs of consecutive NaN values. So we first fill cloud_coverage, precip_depth_1_hr, sea_level_pressure, and wind_direction with their mean for that day of the month, and then fill the remaining missing values with the last valid observation (the method='ffill' option; 'ffill' stands for 'forward fill' and propagates the last valid observation forward). Time-based features like 'day' and 'month' are used for this filling.
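A pandas sketch of this filling strategy (column names from the dataset; grouping by site_id as well is my assumption, since weather differs per site):

import pandas as pd

weather = pd.read_csv("weather_train.csv", parse_dates=["timestamp"])
weather["day"] = weather["timestamp"].dt.day
weather["month"] = weather["timestamp"].dt.month

cols = ["air_temperature", "cloud_coverage", "precip_depth_1_hr",
        "sea_level_pressure", "wind_direction"]
for col in cols:
    # Fill with the mean for that day of the month, then forward-fill the rest.
    weather[col] = weather.groupby(["site_id", "month", "day"])[col].transform(
        lambda s: s.fillna(s.mean()))
    weather[col] = weather[col].fillna(method="ffill")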

b) Adding Features to data:

Time-based features were added to both train and test data. Time-based features are:

1. day of month

2. month

3. hour of day

4. weekday

5. year

These time-based features are extracted from the timestamp feature of the dataset. Weather-based lag features were also added, which improved model accuracy.
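A sketch of extracting these time-based features, plus one example lag feature (the 3-hour rolling window is an illustrative choice, not necessarily the one used in the project):

import pandas as pd

train["timestamp"] = pd.to_datetime(train["timestamp"])

# Time-based features extracted from the timestamp.
train["day"] = train["timestamp"].dt.day
train["month"] = train["timestamp"].dt.month
train["hour"] = train["timestamp"].dt.hour
train["weekday"] = train["timestamp"].dt.weekday
train["year"] = train["timestamp"].dt.year

# Example weather lag feature: rolling mean of air_temperature per site.
train = train.sort_values(["site_id", "timestamp"])
train["air_temperature_lag3"] = (
    train.groupby("site_id")["air_temperature"]
         .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)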

The meteorological feature relative humidity (RH) was also added:

$$\mathrm{RH} = 100 \cdot \frac{\exp\!\left(\frac{17.625\,T_D}{243.04 + T_D}\right)}{\exp\!\left(\frac{17.625\,T}{243.04 + T}\right)}$$

where $T_D$ is the dew temperature and $T$ is the air temperature, both in degrees Celsius.
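A NumPy sketch of this feature, assuming air and dew temperatures are in degrees Celsius as in the competition data:

import numpy as np

def relative_humidity(t_air, t_dew):
    # August-Roche-Magnus approximation, matching the formula above.
    return 100.0 * (np.exp((17.625 * t_dew) / (243.04 + t_dew))
                    / np.exp((17.625 * t_air) / (243.04 + t_air)))

train["relative_humidity"] = relative_humidity(
    train["air_temperature"], train["dew_temperature"])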

6. Models and Hyperparameter tuning:

I have already applied a log1p transformation to the target, so throughout the project I use RMSE (root mean squared error) as the single evaluation metric; RMSE on the log1p-transformed target is equivalent to RMSLE on the original target. The metric is:

import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error on the (already log1p-transformed) target.
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

DECISION TREE REGRESSOR MODEL:

RANDOM FOREST REGRESSOR MODEL:

LIGHTGBM REGRESSOR MODEL:

CATBOOST REGRESSOR MODEL:

STACKING CV REGRESSOR:

Conclusion from the above models:

Here we see that LightGBM gives the lowest RMSE of all the models tested on the test data, and it is also the fastest. The Decision Tree and Random Forest models give a higher RMSE than the other tree-based models.
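As an illustration, a minimal LightGBM setup on the engineered features might look like the following (the feature list and hyperparameters are placeholders, not the tuned values from this project):

import lightgbm as lgb
import numpy as np

features = ["square_feet", "air_temperature", "dew_temperature",
            "relative_humidity", "hour", "weekday", "month"]  # placeholder subset
target = np.log1p(train["meter_reading"])

model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05, num_leaves=31)
model.fit(train[features], target)
print("train RMSE:", rmse(target, model.predict(train[features])))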

7. References:

https://www.kaggle.com/isaienkov/lightgbm-fe-1-19

https://www.kaggle.com/cereniyim/save-the-energy-for-the-future-1-detailed-eda

https://www.kaggle.com/aitude/ashrae-hyperparameter-tuning

https://www.appliedaicourse.com

https://www.kaggle.com/jesucristo/starter-great-energy-predictor

GitHub link for the full code: https://github.com/rash938/ASHRAE-GREAT-ENERGY-PREDICTION-III


