Use VisualML Studio To Predict Taxi Fares In New York
INTRODUCTION
I’ve created a visual workflow designer (VisualML Studio) for Microsoft’s ML.NET library that runs on premises. It simplifies ML training and processing. One of its strong points is that it lays the foundations for simple exchange of models and projects.
It’s also easily extendable with C# code, so there is no need for learning Python.
VisualML Studio is part of the Zenodys open source microservice orchestration stack.
INSTALLATION
Download and extract the prebuilt binaries from the releases page.
Check the latest release and, under the Assets section, find the binaries for your OS (visual_ml_studio_*).
This bundle contains:
- VisualML Studio, which allows you to create and train models visually
- Computing Engine, which executes the script created in VisualML Studio
- .NET Core runtime, which is hosted from this bundle so you don’t need to have it installed on your computer
Next, download the Taxi Fare Prediction visual script (template) so that you don’t have to build it from scratch. During this tutorial you will only have to adapt a few properties (file paths) to your environment.
- Navigate to the Visual Templates releases page. Download and extract the TaxiFarePrediction.zip visual script from the latest release.
- Navigate to the ZenEngine\project\HelloWorld folder under the root of VisualML Studio. Copy assets and DB from the extracted TaxiFarePrediction.zip into this folder.
This is an open source project built from the following repositories:
- VisualML Studio
- Computing Engine that executes scripts created in the visual development tool
- Elements, visual wrappers around the ML.NET object model
- Pre-created Visual Templates
LET’S MAKE SOME PREDICTIONS
This tutorial is based on the Taxi Fare prediction sample. You can read more in the ML.NET Samples GitHub repository.
Here is also a great article about Taxi Fare predictions written in ML.NET. We are going to extend this example visually, so that even non-developers can easily make these kinds of predictions.
The problem description is taken from the official GitHub repository:
“This problem is centered around predicting the fare of a taxi trip in New York City. At first glance, it may seem to depend simply on the distance traveled. However, taxi vendors in New York charge varying amounts for other factors such as additional passengers, paying with a credit card instead of cash and so on. This prediction can be used in application for taxi providers to give users and drivers an estimate on ride fares.”
We are going to train and build a model that takes some inputs (trip time, passenger count…) and predicts the fare of the ride. This example uses a regression algorithm, because regression algorithms predict a continuous value from the given parameters.
After you download the prebuilt binaries, start VisualML Studio by clicking on the visual-ml-studio executable file.
Click the Open template button inside the visual designer and select the root path of the template that is saved in visual_ml_studio_win\ZenEngine\project\HelloWorld.
Before going into details, let’s briefly explain the process of training and evaluating the model:
- The first element is Start. It represents the entry point for template execution.
- The second element is ML Context. This is the starting point for training, prediction and model operations, and it serves as a catalog of available operations. ML Context is needed for all your pipelines.
- The next one is ML Text Loader Schema. We are going to read data from CSV files, and here we define the data structure (giving names to the columns).
- After the schema is defined, ML Text Loader reads the data from the CSV file.
- ML Data Processing Pipeline is a set of data transformations needed to train the model.
- ML Training Pipeline is a set of training operations. It also defines which training algorithm is going to be used to predict taxi fares.
- Text Loader Test reads the test data for evaluating the trained model.
- ML Model Evaluation runs the evaluation on the trained model and prints the result (metrics).
Let’s look closer at each element.
ML Text Loader Schema
When you click on the ML Text Loader Schema element, you’ll see the Define schema button. This will open the following view:
These are the column definitions from the taxi-fare-train.csv file (you will need to set the path to the file later). Here you can see the columns and data types that will be used in the data processing pipelines.
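To make the idea of a schema concrete, here is a minimal sketch in plain Python (not the C# code that the element wraps). The column names are the ones used later in this tutorial; their order, their data types, the header line and the two sample rows are assumptions made up for illustration, so check the Define schema view for the real definitions.

```python
import csv
import io

# Assumed column order and types, for illustration only.
SCHEMA = [
    ("VendorId", str),
    ("RateCode", str),
    ("PassengerCount", float),
    ("TripTime", float),
    ("TripDistance", float),
    ("PaymentType", str),
    ("FareAmount", float),
]


def load_rows(text_stream, has_header=True):
    """Read CSV records and convert each field to the type named in SCHEMA."""
    reader = csv.reader(text_stream)
    if has_header:
        next(reader, None)  # skip the header line
    rows = []
    for record in reader:
        rows.append({name: cast(value) for (name, cast), value in zip(SCHEMA, record)})
    return rows


# Made-up sample data standing in for taxi-fare-train.csv.
sample = io.StringIO(
    "vendor_id,rate_code,passenger_count,trip_time,trip_distance,payment_type,fare_amount\n"
    "CMT,1,1,1271,3.8,CRD,17.5\n"
    "VTS,1,2,474,1.5,CSH,8.0\n"
)
rows = load_rows(sample)
print(rows[0])
```

The schema does two jobs here: it gives every column a name that later steps can refer to, and it decides whether a field is kept as text or parsed as a number.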
ML Text Loader (requires manual input)
This element reads data from files. Here you must select the path to the taxi-fare-train.csv file on your computer. The CSV file is part of the bundle, and you can find it under the VisualML Studio root directory: \visual_ml_studio_win\ZenEngine\project\HelloWorld\assets\taxi-fare-train.csv
ML Data Process Pipeline
This is one of the most powerful ML elements. It defines the data transformation pipeline required by the training process. By clicking on the Define data workflow button, you’ll get the following pipeline:
- The first data processing operation is DPCopyColumns, which copies the FareAmount column to a new column called Label. This Label column holds the actual taxi fare that the model has to predict.
- The next three elements apply one hot encoding to the three columns that contain categorical data: VendorId, RateCode, and PaymentType. These are text fields, and this step is required because machine learning models cannot handle categorical data directly. You can read about one hot encoding here
- The next three elements (PassengerCount, TripTime and TripDistance) are normalize mean variance transformers, which calculate the mean and variance of the training data during model training. These columns are already numeric, but we don’t want to use them raw, so we normalize them to improve the prediction results. Other normalization methods include Normalize Log Mean Variance, Normalize Lp Norm and Normalize Min Max, to name a few.
- The last one is the Concatenate operation, which combines all input data columns into a single aggregate column called Features (ML.NET trains on a single input column).
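The four kinds of operations above can be sketched in plain Python as follows. This is an illustration of the ideas, not the ML.NET implementation: the helper names and the three sample rows are made up, and the normalization shown is the textbook "subtract the mean, divide by the standard deviation" form, which may differ in detail from what the ML.NET transformer does.

```python
import math


def copy_column(rows, source, target):
    """Copy one column to a new name (FareAmount -> Label)."""
    for row in rows:
        row[target] = row[source]


def fit_categories(rows, column):
    """Collect the distinct values of a categorical column."""
    return sorted({row[column] for row in rows})


def one_hot(value, categories):
    """Encode a category as a vector with a single 1.0."""
    return [1.0 if value == category else 0.0 for category in categories]


def fit_mean_std(rows, column):
    """Compute mean and standard deviation from the training data."""
    values = [row[column] for row in rows]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def normalize(value, mean, std):
    return (value - mean) / std if std > 0 else 0.0


def build_features(rows, categorical, numeric):
    """One hot encode, normalize, then concatenate into one Features vector."""
    categories = {c: fit_categories(rows, c) for c in categorical}
    stats = {c: fit_mean_std(rows, c) for c in numeric}
    for row in rows:
        features = []
        for c in categorical:
            features.extend(one_hot(row[c], categories[c]))
        for c in numeric:
            features.append(normalize(row[c], *stats[c]))
        row["Features"] = features
    return rows


# Made-up rows for illustration only.
rows = [
    {"VendorId": "CMT", "RateCode": "1", "PaymentType": "CRD",
     "PassengerCount": 1.0, "TripTime": 600.0, "TripDistance": 2.0, "FareAmount": 9.0},
    {"VendorId": "VTS", "RateCode": "1", "PaymentType": "CSH",
     "PassengerCount": 2.0, "TripTime": 1200.0, "TripDistance": 4.0, "FareAmount": 15.0},
    {"VendorId": "CMT", "RateCode": "2", "PaymentType": "CRD",
     "PassengerCount": 3.0, "TripTime": 1800.0, "TripDistance": 6.0, "FareAmount": 21.0},
]
copy_column(rows, "FareAmount", "Label")
build_features(rows,
               categorical=["VendorId", "RateCode", "PaymentType"],
               numeric=["PassengerCount", "TripTime", "TripDistance"])
print(rows[0]["Features"], rows[0]["Label"])
```

Note that the categories and the mean and standard deviation are learned from the training rows; the same learned values have to be reused when the test data is transformed.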
Now you can close this view to return to the main workflow.
ML Training Pipeline
This is the step where you define the training algorithm and the Features and Label columns.
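To show what a regression trainer does with the Features and Label columns, here is a small stand-in written in plain Python. It fits a linear model with batch gradient descent on squared error; this is a deliberately simple substitute chosen for illustration and is not the algorithm configured in the template. The function names, the learning rate, the number of epochs and the sample data are all assumptions.

```python
def predict(weights, bias, features):
    """Linear model: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def train_linear(feature_rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n = len(feature_rows)
    weights = [0.0] * len(feature_rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(feature_rows, labels):
            error = predict(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Made-up data: one normalized feature, fare grows linearly with it.
feature_rows = [[-1.0], [0.0], [1.0]]
labels = [9.0, 15.0, 21.0]

weights, bias = train_linear(feature_rows, labels)
print(round(predict(weights, bias, [0.5]), 2))
```

Whatever trainer is selected in the element, the contract is the same: it reads the Features vector and the Label value of every training row and produces a model that maps new Features to a predicted fare.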
Text Loader Test (requires manual input)
At this point, the model is built. All that remains is to validate it. This element loads the test data. Select the path to taxi-fare-test.csv on your computer (it is in the same folder as taxi-fare-train.csv).
ML Model Evaluation (requires manual input)
This is the model evaluation step. Enter the location where the trained model will be saved. This will create a zip file with all the needed information about the training process.
* After the path is set, click on ML Model Evaluation again so that it turns red
You can pass the trained model around and use it elsewhere just by loading it from a file.
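The save-then-load idea can be illustrated with a few lines of plain Python. This is only a sketch of the concept: the archive layout (a single model.json entry holding weights and a bias) and the function names are invented for this example and have nothing to do with the actual contents of the zip file that ML.NET writes.

```python
import json
import os
import tempfile
import zipfile


def save_model(path, weights, bias):
    """Write the model parameters into a zip archive (hypothetical layout)."""
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("model.json", json.dumps({"weights": weights, "bias": bias}))


def load_model(path):
    """Read the parameters back so the model can be used elsewhere."""
    with zipfile.ZipFile(path) as archive:
        data = json.loads(archive.read("model.json"))
    return data["weights"], data["bias"]


with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "taxi-fare-model.zip")
    save_model(model_path, [6.0], 15.0)
    weights, bias = load_model(model_path)

print(weights, bias)
```

The point is that training and prediction are decoupled: once the file exists, any process that can read it can make predictions without access to the training data.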
Part of the VisualML Platform will be standardized model sharing (a marketplace for trained models) that will offer simple exchange and use of pre-trained models inside VisualML Studio.
Now you can click the Save template button to save the changed paths.
CREATE AND EVALUATE MODEL
To start the template execution, click the Run template button.
This will give you the following output:
The evaluation process prints the following metrics for regression:
- R2: The coefficient of determination represents the predictive power of the model as a value between -inf and 1.00. 1.00 means there is a perfect fit, and the fit can be arbitrarily poor, so the scores can be negative. A score of 0.00 means the model is guessing the expected value for the label. R2 measures how close the actual test data values are to the predicted values. The closer to 1.00, the better the quality. However, low R-squared values (such as 0.50) can sometimes be entirely normal or good enough for your scenario, and high R-squared values are not always good and can be a reason for suspicion.
- Absolute loss: measures how close the predictions are to the actual outcomes. It is the average of all the model errors, where a model error is the absolute distance between the predicted label value and the correct label value. This prediction error is calculated for each record of the test data set, and then the mean of all the absolute errors is taken. The closer to 0.00, the better the quality. Note that the mean absolute error uses the same scale as the data being measured (it is not normalized to a specific range). Absolute loss, Squared loss, and RMS loss can only be used to compare models for the same dataset or for datasets with a similar label value distribution.
- Squared loss: tells you how close a regression line is to a set of test data values. It does this by taking the distances from the points to the regression line (these distances are the errors E), squaring them and averaging the squares. The squaring gives more weight to larger differences. It is always non-negative, and values closer to 0.00 are better. Depending on your data, it may be impossible to get a very small value for the mean squared error.
- RMS loss: measures the difference between the values predicted by a model and the values actually observed from the environment that is being modeled. RMS loss is the square root of Squared loss and has the same units as the label, similar to Absolute loss but giving more weight to larger differences. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results. It is always non-negative, and values closer to 0.00 are better. It is a measure of accuracy used to compare forecasting errors of different models for a particular dataset, not between datasets, because it is scale-dependent.
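The four metrics follow directly from their definitions. The sketch below computes them in plain Python from a list of actual and predicted values; the function name and the three sample fares are made up for illustration, and the numbers are not output from VisualML Studio.

```python
import math


def regression_metrics(actual, predicted):
    """Compute R2, absolute loss, squared loss and RMS loss."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]

    absolute_loss = sum(abs(e) for e in errors) / n   # mean absolute error
    squared_loss = sum(e * e for e in errors) / n     # mean squared error
    rms_loss = math.sqrt(squared_loss)                # same units as the label

    # R2 compares the model with always predicting the mean label.
    mean_actual = sum(actual) / n
    total = sum((a - mean_actual) ** 2 for a in actual)
    residual = sum(e * e for e in errors)
    r2 = 1.0 - residual / total if total > 0 else float("nan")

    return {
        "R2": r2,
        "AbsoluteLoss": absolute_loss,
        "SquaredLoss": squared_loss,
        "RmsLoss": rms_loss,
    }


# Made-up fares for illustration only.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
for name, value in regression_metrics(actual, predicted).items():
    print(f"{name}: {value:.3f}")
```

In this example the errors are 2, -2 and 3, so the absolute loss is 7/3, the squared loss is 17/3, and R2 is 1 - 17/200 = 0.915. Predicting the mean label (20) for every row would give an R2 of exactly 0.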
So, if everything went well, you’ll be able to predict taxi fares the next time you are in New York :)
In a second article, I refined the Taxi Fares Prediction case.
CONCLUSION
VisualML Studio is still a work in progress, with the following features on the roadmap:
- UX and UI improvements
- Additional ML.NET object model implementations
- TensorFlow.NET object model implementation
- Simple exchange of models and templates
- ML visualizations
We live in interesting times. There are great ML libraries available and making ML models has never been easier.
Any feedback or ideas about improving VisualML Studio will be highly appreciated.
Stay tuned for updates. I wish you all accurate predictions, object recognition and all the sweetness of ML :)