Rising Heat in the City — A Machine Learning Problem

B. Shiv Kumar
Published in Analytics Vidhya
Mar 9, 2021

Recently, I took part in a machine learning competition organized by the Computer Science and Engineering (CSE) department of the National Institute of Technology, Trichy (NITT). The competition, named Data Crunch, gave us a dataset on which we had to train a machine learning model to predict the “Heat” variable. A test set was also provided so that we could generate predictions and submit them, and submissions were judged on the R2_score. In this article, I will walk through how I approached the problem and outline my solution.

Photo: The Times of India

We will begin by taking the exact problem statement from the competition link. The statement is as follows:

“Jayavi was moving around in the city and she felt that there is summer coming and the city needs to be prepared for the rising heat. So she decided to use the data provided to her to build a model to predict the heat based on the given factors in the dataset. Help her build the model.

The dataset has train and test datasets. Train your model on the training dataset and run it on the test dataset to generate the submission file. The scoring will be based on the R2_score.”
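Since the scoring metric is the R2_score (the coefficient of determination), it is worth recalling what it measures: it compares the model's squared error against that of simply predicting the mean, with 1.0 meaning a perfect fit and 0 meaning no better than predicting the mean. Here is a tiny, self-contained illustration using scikit-learn's r2_score; the numbers are made up purely for demonstration.

from sklearn.metrics import r2_score

# Made-up target values and predictions, just to show the call
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.3, 6.9, 9.4]

print(r2_score(y_true, y_pred))  # close to 1.0 indicates a good fit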

I have linked both the dataset and the competition in the intro. We will do our coding in Python 3 using Jupyter Notebooks. Let’s begin by importing some libraries.

import pandas as pd
import numpy as np

We then read the data using Pandas.

data = pd.read_csv("/content/drive/MyDrive/Datasets/Data Crunch Vortex/TRAIN.csv")
test = pd.read_csv("/content/drive/MyDrive/Datasets/Data Crunch Vortex/TEST.csv")

Let’s see what the first few rows of the train dataset look like.

data.head()

We get the following output.

First few rows of train data.

We can see our target variable Heat. We also have other factors defined as follows:

  • idx: The Id of each observation.
  • UNIXTime: The moment at which the observation was made, expressed in Unix time, i.e. the number of seconds elapsed since 1970-01-01 00:00:00 GMT (see the short conversion sketch after this list).
  • Data: It specifies the date on which the observation was made.
  • Time: It specifies the time of the day at which the observation was made.
  • Humidity: Humidity is a measure of the amount of water vapor in the air.
  • Temperature: It is the temperature recorded at the time of observation.
  • Pressure: It is the pressure recorded at the time of observation.
  • WindDegrees: It specifies the direction of the wind, in degrees, at the time of observation.
  • WindSpeed: It specifies the speed of wind at the time of observation.
  • RiseTime: It is the time at which Sunrise was observed.
  • SetTime: It is the time at which Sun Set was observed.
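To make the UNIXTime column more concrete, here is a quick conversion sketch with pandas; the timestamp below is an arbitrary example value, not taken from the dataset.

import pandas as pd

# 1,615,000,000 seconds after 1970-01-01 00:00:00 GMT
print(pd.to_datetime(1615000000, unit='s'))  # 2021-03-06 03:06:40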

We shall proceed by making some basic observations about the data, starting with a check for missing values.

data.isnull().mean()

We get the following output:

idx            0.0
UNIXTime       0.0
Data           0.0
Time           0.0
Heat           0.0
Temperature    0.0
Pressure       0.0
Humidity       0.0
WindDegrees    0.0
WindSpeed      0.0
RiseTime       0.0
SetTime        0.0
dtype: float64

Nice! So there are no missing values. Let’s see some descriptive statistics of the data.

data.describe()

We get the following output.

Descriptive Statistics

We can see that the dataset has 26,149 observations. Other details like the mean, 25th percentile, minimum and maximum can also be observed. Let’s check the data type of each variable.

data.dtypes

We get the following output.

idx              int64
UNIXTime         int64
Data            object
Time            object
Heat           float64
Temperature      int64
Pressure       float64
Humidity         int64
WindDegrees    float64
WindSpeed      float64
RiseTime        object
SetTime         object
dtype: object

In the first few rows of the data, we can observe that the time of observation appears twice: once as UNIX Time (which is always relative to GMT) and once as a date plus a 24-hour clock time, which may be in a different, local timezone. Let's check whether the two agree. To do this, we extract the date from the Data column and combine it with the Time column to get a datetime, convert the UNIXTime column to a datetime as well, and subtract the two. If the difference is zero, the Data/Time columns are also in GMT; a constant non-zero difference tells us the local timezone's offset from GMT. We will create a copy of the data to avoid changing the original.

temp = data.copy()
# 'Data' holds the date plus a midnight time stamp; split it into its parts
temp[['Date','Midnight','AM or PM']] = temp['Data'].str.split(' ', expand=True)
# Combine the extracted date with the 24-hour 'Time' column to get the local observation datetime
temp['Data'] = pd.to_datetime(temp['Date'] + ' ' + temp['Time'])
# Convert the UNIX timestamp (seconds since the epoch, GMT) to a datetime
temp['UNIXTimeDate'] = pd.to_datetime(temp['UNIXTime'], unit='s')
# The difference between the two is the timezone offset
temp['ZoneDiff'] = temp['UNIXTimeDate'] - temp['Data']
temp['ZoneDiff'].head()

We get the following output.

0   0 days 10:00:00
1   0 days 10:00:00
2   0 days 10:00:00
3   0 days 10:00:00
4   0 days 10:00:00
Name: ZoneDiff, dtype: timedelta64[ns]

We notice that GMT is 10 hours ahead of the local time zone of the observations. We must therefore apply this timezone correction to the time of observation, SetTime, and RiseTime. Datetime columns are also awkward for a model to consume directly, so it is better to convert them into something easier to work with. Here we convert all our datetimes into UNIX-style timestamps, which keeps everything in GMT for uniformity and, more importantly, turns them into numbers. We do all this preprocessing in the following function.

def cleanData(temp):
    # Split 'Data' into the date part and the trailing midnight time stamp
    temp[['Date','Midnight','AM or PM']] = temp['Data'].str.split(' ', expand=True)
    # Rebuild full local datetimes for the observation, sunset and sunrise
    temp['Data'] = pd.to_datetime(temp['Date'] + ' ' + temp['Time'])
    temp['SetTime'] = pd.to_datetime(temp['Date'] + ' ' + temp['SetTime'])
    temp['RiseTime'] = pd.to_datetime(temp['Date'] + ' ' + temp['RiseTime'])
    temp = temp.drop(['Date','Time','Midnight','AM or PM'], axis=1)
    # UNIX timestamp converted to a GMT datetime; the difference is the timezone offset
    temp['UNIXTimeDate'] = pd.to_datetime(temp['UNIXTime'], unit='s')
    temp['ZoneDiff'] = temp['UNIXTimeDate'] - temp['Data']
    temp['Data'] = temp['UNIXTimeDate'] - temp['ZoneDiff']
    # Apply the timezone correction to sunset and sunrise
    temp['SetTime'] = temp['SetTime'] + temp['ZoneDiff']
    temp['RiseTime'] = temp['RiseTime'] + temp['ZoneDiff']
    temp = temp.drop(['UNIXTime','ZoneDiff'], axis=1)
    # Convert every datetime column to UNIX-style seconds so the model sees plain numbers
    temp['Data'] = pd.to_datetime(temp['Data']).astype(int) / 10**9
    temp['SetTime'] = pd.to_datetime(temp['SetTime']).astype(int) / 10**9
    temp['UNIXTimeDate'] = pd.to_datetime(temp['UNIXTimeDate']).astype(int) / 10**9
    temp['RiseTime'] = pd.to_datetime(temp['RiseTime']).astype(int) / 10**9
    return temp

train_clean = cleanData(data.copy())
train_clean.head()

We get the following output.

Cleaned Training set.

Our dataset looks much tidier now. We can begin modeling our data. Let’s import the required libraries. We will be using a gradient booster (specifically XGBoost) to model our data. Gradient Boosters are among the best models to work on structured data.

from sklearn.model_selection import train_test_split,RandomizedSearchCV
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

We shall create three subsets of the data: 80% for training, 10% for validation, and 10% for testing (apart from the test set available for submission). Note that the second split below takes 11% of the remaining 90%, which works out to roughly 10% of the full dataset.

X_train, X_test, y_train, y_test = train_test_split(train_clean.drop(['Heat','idx'], axis=1), train_clean['Heat'], test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.11, random_state=42)
X_train.head()

We get the following output.

Train data.

With numerical data, it's usually a good idea to scale the features so that differences in their ranges don't hurt model performance. We shall use the MinMaxScaler for this purpose. We will also declare a base model to make some predictions and see whether the performance can be improved. We shall use a scikit-learn Pipeline to do all of this.

heat_pipe = Pipeline([('scaler', MinMaxScaler()),('model', XGBRegressor())])
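As a quick aside, MinMaxScaler rescales each feature (column) to the [0, 1] range using (x - min) / (max - min). A tiny standalone illustration with made-up numbers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(MinMaxScaler().fit_transform(X))
# Each column is mapped to [0, 1]:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]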

Let’s fit our pipeline and make some predictions.

heat_pipe.fit(X_train,y_train)
X_train_preds = heat_pipe.predict(X_train)
X_val_preds = heat_pipe.predict(X_val)
X_test_preds = heat_pipe.predict(X_test)
print("Train R2_score: {}".format(r2_score(y_train,X_train_preds)))
print("Validtation R2_score: {}".format(r2_score(y_val,X_val_preds)))
print("Test R2_score: {}".format(r2_score(y_test,X_test_preds)))

We get the following output.

Train R2_score: 0.7713148026068124
Validation R2_score: 0.7629794446793128
Test R2_score: 0.7561512410326393

Our scores look pretty okay for a base model. But as they say, don’t settle for less if there’s room for improvement :).

We shall try to tune our model and create some new features which may help us get better results. At this juncture, I'd like to mention that feature engineering is based on intuition and domain knowledge; there are no hard and fast rules to it. Tools like featuretools can automate the process a bit by applying every operation you specify between columns. If you would like to explore featuretools further, I suggest going through this post, which is very detailed and covers its usage.

With that being said, features created using your domain knowledge almost always give better results. So, I suggest taking time to understand your variables better so that you can create better features using them.

In our case, the data describes weather conditions, with Heat as the target. A bit of common sense tells us that the amount of heat depends on how long it has been since sunrise; afternoons are hotter than mornings, for example. Surely we can use this information to our advantage. Let's create a new variable “TimeSinceSunRise” that captures the difference between the sunrise time and the observation time. The cleanData function is modified as follows:

def cleanData(temp):
    temp[['Date','Midnight','AM or PM']] = temp['Data'].str.split(' ', expand=True)
    temp['Data'] = pd.to_datetime(temp['Date'] + ' ' + temp['Time'])
    temp['SetTime'] = pd.to_datetime(temp['Date'] + ' ' + temp['SetTime'])
    temp['RiseTime'] = pd.to_datetime(temp['Date'] + ' ' + temp['RiseTime'])
    temp = temp.drop(['Date','Time','Midnight','AM or PM'], axis=1)
    temp['UNIXTimeDate'] = pd.to_datetime(temp['UNIXTime'], unit='s')
    temp['ZoneDiff'] = temp['UNIXTimeDate'] - temp['Data']
    temp['Data'] = temp['UNIXTimeDate'] - temp['ZoneDiff']
    temp['SetTime'] = temp['SetTime'] + temp['ZoneDiff']
    temp['RiseTime'] = temp['RiseTime'] + temp['ZoneDiff']
    temp = temp.drop(['UNIXTime','ZoneDiff'], axis=1)
    temp['Data'] = pd.to_datetime(temp['Data']).astype(int) / 10**9
    temp['SetTime'] = pd.to_datetime(temp['SetTime']).astype(int) / 10**9
    temp['UNIXTimeDate'] = pd.to_datetime(temp['UNIXTimeDate']).astype(int) / 10**9
    temp['RiseTime'] = pd.to_datetime(temp['RiseTime']).astype(int) / 10**9
    # New feature: signed gap between sunrise time and observation time
    temp['TimeSinceSunRise'] = temp['RiseTime'] - temp['Data']
    return temp

We can use this modified function to clean our train data and model it exactly as before (a quick retraining sketch follows).
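The sketch below simply reuses the split, pipeline, fitting, and scoring code from earlier on the re-cleaned data, assuming the imports from earlier are still in scope; nothing in it is new.

train_clean = cleanData(data.copy())

X_train, X_test, y_train, y_test = train_test_split(train_clean.drop(['Heat','idx'], axis=1), train_clean['Heat'], test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.11, random_state=42)

heat_pipe = Pipeline([('scaler', MinMaxScaler()), ('model', XGBRegressor())])
heat_pipe.fit(X_train, y_train)

print("Train R2_score: {}".format(r2_score(y_train, heat_pipe.predict(X_train))))
print("Validation R2_score: {}".format(r2_score(y_val, heat_pipe.predict(X_val))))
print("Test R2_score: {}".format(r2_score(y_test, heat_pipe.predict(X_test))))

Making predictions with the retrained model gives the following results.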

Train R2_score: 0.8975578362931129
Validation R2_score: 0.8889480708391524
Test R2_score: 0.8849217036129068

Whoa! We got a huge bump in our R2_score! Looks like our feature “TimeSinceSunRise” helped our model make really good predictions!

But let’s not stop here! We can further try to improve this by tuning the parameters of our model. XGBoost has a lot of hyperparameters that can be tuned. We shall focus on five of them and tune them as below:

params = {
    "learning_rate": [0.1, 0.01, 0.001],
    "n_estimators": [100, 250, 500, 1000],
    "gamma": [0.01, 0.1, 0.3, 0.5, 1, 1.5, 2],
    "max_depth": [2, 4, 7, 10],
    "min_child_weight": [1, 3, 5, 7]
}
model = XGBRegressor()
xgb_rscv = RandomizedSearchCV(model, param_distributions=params, scoring='r2', cv=10, verbose=3, random_state=42)
model_xgboost = xgb_rscv.fit(X_train, y_train)
print("Learning Rate: ", model_xgboost.best_estimator_.get_params()["learning_rate"])
print("Gamma: ", model_xgboost.best_estimator_.get_params()["gamma"])
print("Max Depth: ", model_xgboost.best_estimator_.get_params()["max_depth"])
print("Minimum Sum of the Instance Weight Hessian to Make a Child: ", model_xgboost.best_estimator_.get_params()["min_child_weight"])
print("Number of Trees: ", model_xgboost.best_estimator_.get_params()["n_estimators"])

We get the following output:

Learning Rate: 0.1
Gamma: 0.5
Max Depth: 10
Minimum Sum of the Instance Weight Hessian to Make a Child: 1
Number of Trees: 100

Let’s plug these values into our pipeline and see what results we get.

heat_pipe = Pipeline([('scaler', MinMaxScaler()), ('model', XGBRegressor(n_estimators=100, gamma=0.5, learning_rate=0.1, max_depth=10, min_child_weight=1))])
heat_pipe.fit(X_train, y_train)
X_train_preds = heat_pipe.predict(X_train)
X_val_preds = heat_pipe.predict(X_val)
X_test_preds = heat_pipe.predict(X_test)
print("Train R2_score: {}".format(r2_score(y_train,X_train_preds)))
print("Validation R2_score: {}".format(r2_score(y_val,X_val_preds)))
print("Test R2_score: {}".format(r2_score(y_test,X_test_preds)))

We get the following output:

Train R2_score: 0.9908897998162235
Validation R2_score: 0.941652850962926
Test R2_score: 0.9312753509058348

And Voila! This has further increased our score to 0.93 on the test set and 0.94 on the validation set. Maybe we can further improve our results, but that’s for another day. :)

To predict on the test set provided in the competition for submission, we first clean it with the same cleanData function and then run it through our fitted pipeline:

# Clean the competition test set with the same preprocessing before predicting
test_clean = cleanData(test.copy())
submission = test_clean.drop(['idx'], axis=1)
X_submission_preds = heat_pipe.predict(submission)
submission_df = pd.DataFrame(X_submission_preds)
submission_df.rename(columns={0: "Heat"}, inplace=True)
submission_df['idx'] = test_clean['idx']
submission_df = submission_df[['idx','Heat']]
submission_df.to_csv("/content/drive/MyDrive/Datasets/Data Crunch Vortex/submission_1.csv", index=False)

The created CSV file can be submitted to the competition to get scores for our performance.

We thus analyzed the dataset provided to us and made some solid predictions on it. Thanks for sticking around till the end! Cheers! See you later!
