A bike-sharing system is a service in which bicycles are made available to individuals for short-term use. Users borrow a bike from a dock and return it at another dock belonging to the same system. Docks are bike racks that lock the bike and only release it under computer control.
You’ve probably seen docks around town. They look like this:
I often used bike sharing to get around town when I was living in Belgium. But I’d sometimes walk all the way to a dock and find it completely empty with all the bikes already rented out.
Bike sharing companies try to even out supply by manually distributing bikes across town, but they need to know how many bikes will be in demand at any given time in the city.
So let’s give them a hand with a machine learning model!
I’m going to build a regression model with C#, .NET Core, and ML.NET, and train it on a dataset of bike sharing demand. Then I’ll use the fully trained model to make a prediction for a given date and time.
ML.NET is Microsoft’s new machine learning library. It can run linear regression, logistic regression, clustering, deep learning, and many other machine learning algorithms.
And .NET Core is Microsoft’s cross-platform version of the .NET Framework that runs on Windows, macOS, and Linux. It’s the future of cross-platform .NET development.
The first thing I need is a data file with lots of bike sharing demand numbers. I’m going to use the UCI Bike Sharing Dataset from Capital Bikeshare in Metro DC. This dataset has 17,379 hourly bike sharing records spanning a 2-year period.
The file looks like this:
It’s a comma-separated file with 17 columns:
- Instant: the record index
- Date: the date of the observation
- Season: the season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
- Year: the year of the observation (0 = 2011, 1 = 2012)
- Month: the month of the observation (1 to 12)
- Hour: the hour of the observation (0 to 23)
- Holiday: if the date is a holiday or not
- Weekday: the day of the week of the observation
- WorkingDay: if the date is a working day
- Weather: the weather during the observation (1 = clear, 2 = mist, 3 = light snow/rain, 4 = heavy rain)
- Temperature: the normalized temperature in Celsius
- ATemperature: the normalized feels-like temperature in Celsius
- Humidity: the normalized humidity
- Windspeed: the normalized wind speed
- Casual: the number of casual bike users at the time
- Registered: the number of registered bike users at the time
- Count: the total number of rental bikes in operation at the time
I will ignore the record index, the date, and the numbers of casual and registered users, and use everything else as input features. The casual and registered numbers add up to Count, so feeding them to the model would simply give away the answer. The final column, Count, is the number I’m trying to predict.
I will build a regression model that reads in all feature columns and then predicts the total number of rental bikes in operation for any given date, time, and weather conditions.
Let’s get started. Here’s how to set up a new console project in .NET Core:
$ dotnet new console -o Bikes
$ cd Bikes
Next, I need to install the required ML.NET packages:
$ dotnet add package Microsoft.ML
$ dotnet add package Microsoft.ML.FastTree
I’m also going to install a really nice package called BetterConsoleTables for displaying my model results:
$ dotnet add package BetterConsoleTables
Now I’m ready to add some classes. I’ll need one to hold a bike sharing demand observation, and one to hold my model’s predictions.
I will modify the Program.cs file like this:
The DemandObservation class holds a single bike demand record. Note how each field is tagged with a LoadColumn attribute that tells the CSV data loading code which column to import data from.
I’m also declaring a DemandPrediction class which will hold a single bike demand prediction.
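As a rough sketch of what these two classes could look like, assuming the 17-column layout listed earlier and zero-based column indices: the property names below simply mirror the column list, PredictedCount is a hypothetical name, and the "Label" and "Score" column names are the defaults that ML.NET regression trainers read from and write to.

```csharp
using Microsoft.ML.Data;

namespace Bikes
{
    // One row of the bike sharing CSV file.
    // LoadColumn takes the zero-based index of the column in the file.
    // Columns 0 (Instant), 1 (Date), 14 (Casual) and 15 (Registered) are skipped.
    public class DemandObservation
    {
        [LoadColumn(2)] public float Season { get; set; }
        [LoadColumn(3)] public float Year { get; set; }
        [LoadColumn(4)] public float Month { get; set; }
        [LoadColumn(5)] public float Hour { get; set; }
        [LoadColumn(6)] public float Holiday { get; set; }
        [LoadColumn(7)] public float Weekday { get; set; }
        [LoadColumn(8)] public float WorkingDay { get; set; }
        [LoadColumn(9)] public float Weather { get; set; }
        [LoadColumn(10)] public float Temperature { get; set; }
        [LoadColumn(11)] public float ATemperature { get; set; }
        [LoadColumn(12)] public float Humidity { get; set; }
        [LoadColumn(13)] public float Windspeed { get; set; }

        // Column 16 is Count. Exposing it as "Label" marks it as the value to predict.
        [LoadColumn(16), ColumnName("Label")] public float Count { get; set; }
    }

    // One prediction. Regression models write their output to the "Score" column.
    // PredictedCount is a hypothetical property name chosen for this sketch.
    public class DemandPrediction
    {
        [ColumnName("Score")] public float PredictedCount { get; set; }
    }
}
```

Everything is declared as float because ML.NET trainers expect single-precision numeric features.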
Now I’m going to load the training data in memory:
This code uses the method LoadFromTextFile to load the training and testing data directly into memory. The class field annotations tell the method how to store the loaded data in the DemandObservation class.
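The loading step might be sketched as follows. The helper class and method names are hypothetical, the file paths in the usage note are placeholders, and the sketch assumes the CSV file has a header row. The method is generic so that the block stands on its own; in this article the type argument would be DemandObservation.

```csharp
using Microsoft.ML;

public static class BikeDataLoader
{
    // Hypothetical helper: sets up a data view over a comma-separated file.
    // T is the row class whose LoadColumn attributes describe the file layout.
    public static IDataView LoadCsv<T>(MLContext context, string path) where T : class
    {
        return context.Data.LoadFromTextFile<T>(
            path,
            separatorChar: ',',   // the file is comma-separated
            hasHeader: true);     // assumption: the first line holds column names
    }
}
```

A call such as `BikeDataLoader.LoadCsv<DemandObservation>(context, "train.csv")`, repeated for the test file, would then produce the two data views.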
Now let’s build the machine learning pipeline:
Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
My pipeline has the following components:
- Concatenate which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column.
- AppendCacheCheckpoint which caches all training data at this point. This is an optimization step that speeds up the learning algorithm.
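A two-step pipeline of this kind could look roughly like the sketch below. The class name is hypothetical, and the feature names are assumed to match the dataset columns listed earlier, minus the ignored columns and Count.

```csharp
using Microsoft.ML;

public static class BikePipeline
{
    // Assumed feature column names, mirroring the dataset column list.
    public static readonly string[] FeatureColumns =
    {
        "Season", "Year", "Month", "Hour", "Holiday", "Weekday", "WorkingDay",
        "Weather", "Temperature", "ATemperature", "Humidity", "Windspeed"
    };

    public static IEstimator<ITransformer> Build(MLContext context)
    {
        return context.Transforms
            // step 1: combine all input columns into one vector column called Features
            .Concatenate("Features", FeatureColumns)
            // step 2: cache the data at this point so each trainer can reuse it
            .AppendCacheCheckpoint(context);
    }
}
```

Returning the pipeline as `IEstimator<ITransformer>` keeps it easy to append different trainers to the same preprocessing steps later.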
Which training algorithm should I use to build my model? I could try FastTree, FastTreeTweedie, StochasticDualCoordinateAscent (SDCA), or maybe a Poisson regression.
Well, why not try everything and see what sticks?
Here’s how to train the model with all four algorithms:
This code sets up an array of tuples holding the name and class of each training algorithm. Then a simple for-loop tries out each algorithm by appending it to the pipeline and calling Fit(…) to train the model.
Also note how I set up a console table in the results variable before the loop starts, and display the results at the end of the loop. Table is a helper class in the BetterConsoleTables package that will display my final results in a very nice tabular format.
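A minimal sketch of such a training loop is shown below, assuming the ML.NET 1.x trainer names (where StochasticDualCoordinateAscent is exposed as Sdca) and the Microsoft.ML.FastTree package. The class and method names are hypothetical, and plain console output stands in for the console table.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public static class BikeTrainer
{
    // Trains one model per algorithm on top of the shared preprocessing pipeline.
    public static List<(string Name, ITransformer Model)> TrainAll(
        MLContext context, IEstimator<ITransformer> pipeline, IDataView trainingData)
    {
        // Name and estimator of each candidate regression trainer.
        // All of them read the default "Label" and "Features" columns.
        (string Name, IEstimator<ITransformer> Trainer)[] learners =
        {
            ("FastTree", context.Regression.Trainers.FastTree()),
            ("Poisson", context.Regression.Trainers.LbfgsPoissonRegression()),
            ("SDCA", context.Regression.Trainers.Sdca()),
            ("FastTreeTweedie", context.Regression.Trainers.FastTreeTweedie())
        };

        var models = new List<(string Name, ITransformer Model)>();
        foreach (var (name, trainer) in learners)
        {
            Console.WriteLine($"Training {name}...");

            // Append the trainer to the pipeline and fit it on the training data.
            ITransformer model = pipeline.Append(trainer).Fit(trainingData);
            models.Add((name, model));
        }
        return models;
    }
}
```

Keeping the trained models in a list makes it straightforward to evaluate each one against the test data afterwards.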
I now have a fully trained model. Next I need to grab the test data, predict the bike demand for each data record, and compare that to the actual values:
This code calls Transform(…) to set up predictions for every bike demand record in the file. The Evaluate(…) method then compares these predictions to the actual bike demand and automatically calculates three very handy metrics for me:
- metrics.RootMeanSquaredError: this is the root mean square error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE is the square root of the average squared prediction error. Geometrically, it is the length of the vector made up of the n individual prediction errors, divided by the square root of n.
- metrics.MeanAbsoluteError: this is the mean absolute prediction error, expressed in the number of bikes.
- metrics.MeanSquaredError: this is the mean square prediction error, or MSE value. Note that RMSE and MSE are related: RMSE is just the square root of MSE.
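To make the three metrics concrete, here is a small sketch that computes them by hand in plain C#. It is not how ML.NET implements Evaluate(…); the class name is hypothetical and the numbers in the usage note are made up for illustration.

```csharp
using System;
using System.Linq;

public static class RegressionErrors
{
    // Mean absolute error: the average of |actual - predicted|.
    public static double Mae(double[] actual, double[] predicted) =>
        actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Average();

    // Mean squared error: the average of (actual - predicted) squared.
    public static double Mse(double[] actual, double[] predicted) =>
        actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();

    // Root mean squared error: the square root of the MSE.
    public static double Rmse(double[] actual, double[] predicted) =>
        Math.Sqrt(Mse(actual, predicted));
}
```

For actual demand values of 10, 20, 30, and 40 bikes and predictions of 12, 18, 33, and 37, the errors are 2, 2, 3, and 3 bikes. That gives an MAE of 2.5, an MSE of 6.5, and an RMSE of about 2.55. The RMSE comes out slightly higher than the MAE because squaring gives the larger errors more weight.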
To wrap up, let’s use the model to make a prediction:
I’m creating a new bike record for the fall of 2012, on a Thursday in August at 10 a.m. in clear weather.
The CreatePredictionEngine method sets up a prediction engine. The two type arguments are the input data class and the class to hold the prediction. Then I simply call Predict(…) to make a single prediction.
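The prediction step could be wrapped in a small helper like the sketch below. The class and method names are hypothetical. The helper is generic so that the block is self-contained; in this article the two type arguments would be DemandObservation and DemandPrediction.

```csharp
using Microsoft.ML;

public static class BikePredictor
{
    // Runs a single prediction through a trained model.
    // TInput is the input data class, TPrediction the class that holds the result.
    public static TPrediction PredictOne<TInput, TPrediction>(
        MLContext context, ITransformer model, TInput sample)
        where TInput : class
        where TPrediction : class, new()
    {
        // The prediction engine is disposable, so release it when done.
        using (var engine = context.Model.CreatePredictionEngine<TInput, TPrediction>(model))
        {
            return engine.Predict(sample);
        }
    }
}
```

A prediction engine is meant for one-off predictions like this sample. For scoring a whole file, calling Transform(…) on the model is the better fit.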
The final thing I need to do is collect all the results in the console table:
The AddRow() method adds a new row to the console table. I’m putting the training algorithm in column 1, followed by the RMSE, the MAE (L1 score), the MSE (L2 score), and the demand prediction for the sample.
Let’s find out how accurate my model is. Here’s the code running in the Visual Studio Code debugger on my Mac:
And here’s the app again running in a zsh shell:
The SDCA trainer performs the worst with an RMSE of 191.57. The Poisson regression is only slightly better. The clear winners are the fast tree algorithms, with FastTreeTweedie in the lead with an RMSE of only 62.84.
The MAE score represents the mean absolute error across all predictions. Looking at FastTreeTweedie, it means that on average my model is off by only 39 bikes.
For my sample, the model predicts that I will need 205 bikes to meet demand on a clear Thursday in August at 10 a.m. Given the MAE score, I should add 39 bikes just to be safe and allocate a total of 244.
So what do you think?
Are you ready to start writing C# machine learning apps with ML.NET?