Bike Curious — Neural Network Regression with TensorFlow

Blair Hone
5 min read · Apr 15, 2024

I’m learning to build neural networks with TensorFlow. The first focus is building prediction, or regression, models. As a use case, I built a model to predict the number of bikes rented in a city by a bike share company. The open data comes from Kaggle.

The data represents the daily count of bike share users, including information about the day, such as the season, month, and temperature. The first thing I did was review the data.

Columns are:

instant: record index
dteday : date
season: season (1:winter, 2:spring, 3:summer, 4:fall)
yr: year (0: 2011, 1:2012)
mnth: month ( 1 to 12)
holiday: whether the day is a holiday or not (extracted from [Web Link])
weekday: day of the week
workingday: if day is neither weekend nor holiday is 1, otherwise is 0.
weathersit:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: Normalized temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min), t_min = -8, t_max = +39 (only in hourly scale); see the de-normalization sketch after this list
atemp: Normalized feeling temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min), t_min = -16, t_max = +50 (only in hourly scale)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
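
Since temp and atemp are min-max normalized, it can be handy to convert them back to degrees Celsius when eyeballing the data. A minimal sketch inverting the formula above (the helper name and example value are mine, not from the dataset):

# Invert (t - t_min) / (t_max - t_min) to recover degrees Celsius
T_MIN, T_MAX = -8, 39  # bounds quoted for the 'temp' column

def denormalize_temp(temp_normalized, t_min=T_MIN, t_max=T_MAX):
    return temp_normalized * (t_max - t_min) + t_min

print(denormalize_temp(0.5))  # 0.5 maps back to 15.5 °C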

I’m using a Google Colab notebook. First I’ll load the raw data from Kaggle with the pandas Python library and inspect the first 3 rows.

# Bike Rent
# https://www.kaggle.com/datasets/ayessa/bike-sharing-dataset-regression
# day.csv

import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

# Import the bike share data as a DataFrame
bikerent = pd.read_csv("/content/day.csv")
bikerent.head(3)
[Figure: sample of the bike rental data]

I identified the value that I would predict. While the data provides the total count of rental bikes per day, it also breaks that count down between casual and registered users. I decided that I wasn’t interested in whether a rider was a registered user or not. The rationale was that if the company wanted to estimate the number of bikes required for a day based on the date and the weather forecast, it wouldn’t matter whether the riders were registered users.
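
As a quick sanity check of that relationship, one line against the bikerent DataFrame confirms that cnt is exactly casual plus registered:

# Every row should satisfy cnt = casual + registered
assert (bikerent["casual"] + bikerent["registered"] == bikerent["cnt"]).all()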

For the input data, I also decided to drop a few columns that I deemed to be less relevant.

# drop columns
X = bikerent.drop(["instant","yr","temp","dteday","casual","registered","cnt"], axis=1)
X
[Figure: input data sample after dropping columns]

These are the columns I’m going to base my predictions on.

Next I define my output data from the associated ‘cnt’ column, which is the total count of bikes rented each day.

y = bikerent["cnt"]
y.head(3)

0     985
1     801
2    1349
Name: cnt, dtype: int64

Four of the columns contain categorical data. Let’s consider weekday. The values are 0 through 6, representing the days of the week.

0 - Sunday
1 - Monday
2 - Tuesday
3 - Wednesday
4 - Thursday
5 - Friday
6 - Saturday

As I’ve been learning, deep learning models don’t deal with information in this form well: the integer codes imply an ordering and magnitude that don’t really exist. So the column needs to be one-hot encoded. The result is a data structure with 7 columns, one per weekday, where each row holds a single 1 in the column for its day and 0s everywhere else.

So rather than Tuesday, for example, being represented as:

weekday = 2

With one-hot encoding it’s represented as:

weekday = [0, 0, 1, 0, 0, 0, 0]

In the data, weekday isn’t the only column that requires one-hot encoding; season, mnth, and weathersit also require it.
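
To see what pandas’ get_dummies does on its own, here’s a minimal sketch on a toy weekday column (the values are invented for illustration):

import pandas as pd

# Toy example: one-hot encode three weekday values into 0/1 columns.
# Note that get_dummies only creates columns for values actually present.
demo = pd.DataFrame({"weekday": [2, 5, 0]})
print(pd.get_dummies(demo, columns=["weekday"]).astype(int))
#    weekday_0  weekday_2  weekday_5
# 0          0          1          0
# 1          0          0          1
# 2          1          0          0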

Split the data into two DataFrames, one categorical and one numerical. Apply one-hot encoding to the categorical DataFrame, then concatenate the results.

# Split dataframe into categorical and numerical columns
categorical_labels = ["season", "mnth", "weekday", "weathersit"]
numerical_labels = list(set(X.columns) - set(categorical_labels))
categorical_columns = X.drop(numerical_labels, axis=1)
numerical_columns = X.drop(categorical_labels, axis=1)

# Apply one-hot encoding to the categorical columns
categorical_columns_encoded = pd.get_dummies(categorical_columns, columns=categorical_labels).astype(int)

# Concatenate results & review first 3 records
result = pd.concat([categorical_columns_encoded, numerical_columns], axis=1, join="inner")
result.head(3)

Split the input and output data into a training set and testing set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(result, y, test_size=0.2, random_state=42)
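
A quick check that the 80/20 split came out as expected (day.csv has 731 rows, so this should print roughly 584 training and 147 test rows):

# Confirm the split sizes: ~80% train, ~20% test
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)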

Create, compile and fit the deep learning model.

# Set random seed
tf.random.set_seed(42)

# Set the early stopping callback: stop once the loss hasn't improved for 3 epochs
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

# 1. Create the model
model_bike = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1)
])

# 2. Compile the model
model_bike.compile(loss=tf.keras.losses.mae,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                   metrics=["mae"])

# 3. Fit the model with the early stopping callback
bike_history = model_bike.fit(X_train, y_train, epochs=200, verbose=0, callbacks=[callback])
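
To check that training actually converged, it helps to plot the loss curve recorded in bike_history (this diagnostic isn’t part of the original walkthrough, just a common sanity check):

# Plot the training loss per epoch to confirm the loss flattens out
pd.DataFrame(bike_history.history).plot()
plt.ylabel("loss (MAE)")
plt.xlabel("epoch")
plt.show()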

Evaluate the model with the test data and express the mean absolute error (MAE) as a percentage of the mean rental count.

# 'eval_results' avoids shadowing Python's built-in eval()
eval_results = model_bike.evaluate(X_test, y_test, return_dict=True)
print("Accuracy Loss Percentage: {:.2f}%".format(eval_results["mae"] / y_train.mean() * 100))

On average, predictions on data the model has never seen before will be within about 27% of the actual value. For example, if the model predicted 4200 bike rentals for a future day based on the weather forecast, it could be assumed that the actual count for that day would fall between roughly 3066 and 5334.
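
That band is just the prediction plus or minus the error percentage. A minimal sketch of the arithmetic (4200 is the hypothetical prediction from above):

# Error band around a hypothetical prediction of 4200 rentals
prediction = 4200
error_pct = 0.27  # MAE as a fraction of the mean rental count
print(prediction * (1 - error_pct), prediction * (1 + error_pct))  # 3066.0 5334.0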

Business analysis: the business would need to decide whether that level of accuracy is useful to them. Is it close enough that they could prepare and charge that number of bikes? Does the business save enough on labour and maintenance to offset any lost revenue when bikes are not available to rent?

Plot the predictions.

# Generate predictions on the test set
y_preds_bike = model_bike.predict(X_test).squeeze()

# Plot the predictions against the true values
plt.figure(figsize=(10, 10))
plt.scatter(y_test, y_preds_bike, c='crimson')

# Draw the ideal y = x line: points on it are perfect predictions
p1 = max(max(y_preds_bike), max(y_test))
p2 = min(min(y_preds_bike), min(y_test))
plt.plot([p1, p2], [p1, p2], 'b-')
plt.xlabel('True Values', fontsize=15)
plt.ylabel('Predictions', fontsize=15)
plt.axis('equal')
plt.show()

With a regression model like this, the closer a red dot is to the blue y = x line, the more accurate that prediction is.
