EV detection on Melbourne House Data

Kiran J Chemmanatte
6 min read · Apr 6, 2024


The EV or Electric Vehicle craze has hit Australia, with 8.4% of all new cars sold in Australia being an EV, a 120% jump from 2022. Although this change is fantastic for the environment, it creates potential challenges for power suppliers. EVs use a considerable amount of electrical energy to charge, which is totally fine when it is just one house. However, if a block of a suburb has many EV users who all come back from work at similar times and charge up their EVs, it puts the local power grid under a huge load, potentially even overwhelming the system.

Therefore, being able to detect which houses contain EVs, and to identify such clusters of houses in a block, is of vital importance. That is the problem we are attempting to solve today with ML.

Ok, first let's explore our dataset. We have the power-consumption data of 88 houses in Melbourne, taken at a bi-hourly rate for 46 days, along with the date, the id of the house, and finally the label of whether the house has an EV or not. The csv file along with the code can be found here.

Let's explore the dataset with pandas.

# Importing the data and viewing the first few rows
import pandas as pd

df = pd.read_csv("../data/EV_data.csv")
df.head()
Overview of Data in the dataset

As always, the first step is data cleaning.

Step 1.1 Check for missing/null values

The following code counts the number of null values in each column and displays the result as a table.

df.isna().sum()

We are good: the dataset contains no null values, impressive to say the least.

Step 1.2 Check if all column datatypes are correct.
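
A quick way to check is the dtypes attribute that pandas exposes on every DataFrame:

# Show the datatype pandas inferred for each column
df.dtypes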

dtypes of the columns

We observe a problem: the read_date column is not represented in a date-time format, so to use it we have to transform the data into a usable form.

As we can see, the data in the column follows the pattern month/day/year, so pandas' to_datetime function should help with transforming the data.

df['read_date'] = pd.to_datetime(df['read_date'], format='%m/%d/%Y')
Errors thrown by the dataset on data transformation

Unfortunately, some entries in the date column also contain a timestamp '0:00'. That data is redundant, as the other columns already give the time information, so we strip these values out and then apply the transformation.

df['read_date'] = df['read_date'].str.replace(' 0:00', '')
df['read_date'] = pd.to_datetime(df['read_date'], format='%m/%d/%Y')
Updated datatypes

Yes, success: we now have the correct datatypes.

Step 2 Feature engineering

An important point to note is that categorical or string data can't be used directly by many classification models such as logistic regression, KNN and random forests, so we have to use techniques such as encoding, or create custom numeric features, to make the data suitable for modelling.
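
As a quick illustration of encoding (not something this dataset actually needs), a hypothetical string column could be one-hot encoded with pandas' get_dummies; the 'suburb' column below is made up purely for the example:

# Toy example: one-hot encode a hypothetical string column named 'suburb'
toy = pd.DataFrame({'suburb': ['Carlton', 'Fitzroy', 'Carlton']})
pd.get_dummies(toy, columns=['suburb'])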

Here read_date is not in a numeric format, but we can extract features such as the month, day of month and day of week, and turn them into new columns to use in modelling.

df['day_of_week_num'] = df['read_date'].dt.day_of_week
df['day_of_month'] = df['read_date'].dt.day
df['month'] = df['read_date'].dt.month
df.head()
Updated dataset with new features

Very important!

If you have noticed, we cannot treat this dataset like a normal dataset: we are trying to predict whether each id (house id) has an EV or not, yet each id has 46 rows of readings. Hence, when we perform the train-test split we have to ensure that rows belonging to ids in the test data do not appear in the train data, i.e. if house id 50 is in the test dataset, no row of id 50 should be in the train dataset.

To accomplish this we split the dataset based on the unique house ids and populate the training and testing datasets from that initial split.

from sklearn.model_selection import train_test_split

# Split on house ids so that no house appears in both train and test
unique_ids = df['id'].unique()
train_ids, test_ids = train_test_split(unique_ids, test_size=0.2, random_state=42)
train_data = df.loc[df['id'].isin(train_ids)]
test_data = df.loc[df['id'].isin(test_ids)]
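
The models below also expect separate feature matrices and target vectors. The exact column names depend on the csv, so the following is only a sketch: it assumes the EV indicator column is called 'label' and drops the non-feature columns id and read_date.

# Build features and targets from the id-based split
# (assumes the EV indicator column is named 'label'; adjust to the real column name)
feature_cols = [c for c in df.columns if c not in ['id', 'read_date', 'label']]

X_train = train_data[feature_cols]
y_train = train_data['label']
X_test = test_data[feature_cols]
y_test = test_data['label']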

We will use XGBoost, KNN, logistic regression and RF (random forest) as models, with 5-fold cross-validation.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

The code needed to train and test the models is given below.

# Set up the KFold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Lists to store the scores from each fold
train_scores_lr = []
val_scores_lr = []
train_scores_rf = []
val_scores_rf = []
train_scores_knn = []
val_scores_knn = []
train_scores_xgb = []
val_scores_xgb = []

# Iterate over the folds
for train_idx, val_idx in kfold.split(X_train, y_train):
    # Split the training data into train and validation sets
    X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]

    # Create the models
    lr_model = LogisticRegression(random_state=42)
    rf_model = RandomForestClassifier(random_state=42)
    knn_model = KNeighborsClassifier()
    xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42)

    # Train the models on the training fold
    lr_model.fit(X_train_fold, y_train_fold)
    rf_model.fit(X_train_fold, y_train_fold)
    knn_model.fit(X_train_fold, y_train_fold)
    xgb_model.fit(X_train_fold, y_train_fold)

    # Evaluate the models on the training and validation folds
    train_score_lr = accuracy_score(y_train_fold, lr_model.predict(X_train_fold))
    val_score_lr = accuracy_score(y_val_fold, lr_model.predict(X_val_fold))

    train_score_rf = accuracy_score(y_train_fold, rf_model.predict(X_train_fold))
    val_score_rf = accuracy_score(y_val_fold, rf_model.predict(X_val_fold))

    train_score_knn = accuracy_score(y_train_fold, knn_model.predict(X_train_fold))
    val_score_knn = accuracy_score(y_val_fold, knn_model.predict(X_val_fold))

    train_score_xgb = accuracy_score(y_train_fold, xgb_model.predict(X_train_fold))
    val_score_xgb = accuracy_score(y_val_fold, xgb_model.predict(X_val_fold))

    # Append the scores to the lists
    train_scores_lr.append(train_score_lr)
    val_scores_lr.append(val_score_lr)
    train_scores_rf.append(train_score_rf)
    val_scores_rf.append(val_score_rf)
    train_scores_knn.append(train_score_knn)
    val_scores_knn.append(val_score_knn)
    train_scores_xgb.append(train_score_xgb)
    val_scores_xgb.append(val_score_xgb)

    print(f'Fold {len(train_scores_lr)}:')
    print(f'Logistic Regression: Train Accuracy = {train_score_lr:.4f}, Validation Accuracy = {val_score_lr:.4f}')
    print(f'Random Forest: Train Accuracy = {train_score_rf:.4f}, Validation Accuracy = {val_score_rf:.4f}')
    print(f'KNN: Train Accuracy = {train_score_knn:.4f}, Validation Accuracy = {val_score_knn:.4f}')
    print(f'XGBoost: Train Accuracy = {train_score_xgb:.4f}, Validation Accuracy = {val_score_xgb:.4f}')
# Print the mean scores
print('\nMean Scores:')
print(f'Logistic Regression: Mean Train Accuracy = {sum(train_scores_lr) / len(train_scores_lr):.4f}, Mean Validation Accuracy = {sum(val_scores_lr) / len(val_scores_lr):.4f}')
print(f'Random Forest: Mean Train Accuracy = {sum(train_scores_rf) / len(train_scores_rf):.4f}, Mean Validation Accuracy = {sum(val_scores_rf) / len(val_scores_rf):.4f}')
print(f'KNN: Mean Train Accuracy = {sum(train_scores_knn) / len(train_scores_knn):.4f}, Mean Validation Accuracy = {sum(val_scores_knn) / len(val_scores_knn):.4f}')
print(f'XGBoost: Mean Train Accuracy = {sum(train_scores_xgb) / len(train_scores_xgb):.4f}, Mean Validation Accuracy = {sum(val_scores_xgb) / len(val_scores_xgb):.4f}')

# Train the final models on the entire training set
final_lr_model = LogisticRegression(random_state=42)
final_rf_model = RandomForestClassifier(random_state=42)
final_knn_model = KNeighborsClassifier()
final_xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42)

final_lr_model.fit(X_train, y_train)
final_rf_model.fit(X_train, y_train)
final_knn_model.fit(X_train, y_train)
final_xgb_model.fit(X_train, y_train)

# Evaluate the final models on the test set
test_accuracy_lr = accuracy_score(y_test, final_lr_model.predict(X_test))
test_accuracy_rf = accuracy_score(y_test, final_rf_model.predict(X_test))
test_accuracy_knn = accuracy_score(y_test, final_knn_model.predict(X_test))
test_accuracy_xgb = accuracy_score(y_test, final_xgb_model.predict(X_test))

print('\nTest Accuracies:')
print(f'Logistic Regression: {test_accuracy_lr:.4f}')
print(f'Random Forest: {test_accuracy_rf:.4f}')
print(f'KNN: {test_accuracy_knn:.4f}')
print(f'XGBoost: {test_accuracy_xgb:.4f}')

Let's see the test results: XGBoost has performed the best, followed by RF.

Model test results.

Essentially, the best model is able to correctly classify the data with an accuracy of 83%, fantastic. Now let's look at the feature importances to see which features helped the model come up with its predictions.

import matplotlib.pyplot as plt

# Get the feature importances from the trained XGBoost model
feature_importances = final_xgb_model.get_booster().get_score(importance_type='weight')

# Convert the feature importances to a DataFrame
feature_importances = pd.DataFrame(feature_importances.items(), columns=['feature', 'importance'])

# Sort the DataFrame by importance in descending order
feature_importances = feature_importances.sort_values('importance', ascending=False)

# Print the top 10 most important features
print("Top 10 Important Features:")
print(feature_importances.head(10))

# Plot the top 20 most important features
plt.figure(figsize=(10, 6))
feature_importances.head(20).plot.bar(x='feature', y='importance', ax=plt.gca())  # draw on the figure created above
plt.title('Feature Importances')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

Ok, so the morning hours gave the model the most useful data, and the features that we engineered also helped slightly. Based on this we could further refine the model or engineer other features, but that is for another time.

Thank you for taking the time to read my blog, I hope you found value from it.
