Jupyter notebooks have become the standard tool for hosting advanced machine learning code online.
A Jupyter notebook is a fully interactive document that mixes text content, live source code, and program output in a single file. Notebooks are living documents: the embedded source code can be run and modified interactively.
This makes notebooks ideal for AI research and for teaching complex machine learning skills to IT students. Unfortunately, out of the box Jupyter only supports Python and R code. There isn’t any support for C#.
Until today that is!
Thanks to Microsoft’s tireless efforts we now have a fully functional C# kernel that allows us to run C# machine learning code directly in a Jupyter notebook.
And getting everything set up is a piece of cake. Here’s a quick guide on how you can run your own notebook server locally with full C# support.
Let’s start by installing Python. Yes, we really do need Python, because the Jupyter server itself is a Python application.
A great Python distribution for AI that comes bundled with Jupyter is called Anaconda. Here’s how to install it on Windows:
$ choco install anaconda3
I’m using Chocolatey, a nice package management system for Windows that runs on Powershell and makes installing new software a breeze.
Next you’ll need the .NET Core SDK version 3.0. You can download it here.
You’ll also need dotnet try, an interactive version of the .NET runtime that can execute code snippets on the fly and forms the core of the Jupyter C# kernel.
Type the following on the command line:
dotnet tool install -g dotnet-try
This will install dotnet try as a global tool.
Almost done! The final step is to check your start menu for the new Anaconda3 folder. Open it and run the Anaconda Powershell Prompt app. Then run the following command to add C# support to Jupyter:
dotnet try jupyter install
Now open the Anaconda3 menu from the start menu and click the Jupyter Notebook app.
You should see a Jupyter session appearing:
Click the New button in the top right, and notice that the first two items in the dropdown menu are .NET (C#) and .NET (F#).
Congratulations! Everything is working perfectly.
Let’s hack together a quick machine learning app to make sure everything is working. I’m going to build an app that loads and processes the famous California Housing dataset containing prices of houses in California.
I’ll start by creating a new notebook using the C# kernel (click New, then select .NET (C#) from the dropdown).
Next I’ll download the 1990 California census data and save the file as california_housing.csv in the same folder as my notebook.
The file is a CSV file with 17,000 records, one for each housing block in the state of California. Each record contains the following columns:
- Column 1: The longitude of the housing block
- Column 2: The latitude of the housing block
- Column 3: The median age of all the houses in the block
- Column 4: The total number of rooms in all houses in the block
- Column 5: The total number of bedrooms in all houses in the block
- Column 6: The total number of people living in all houses in the block
- Column 7: The total number of households in all houses in the block
- Column 8: The median income of all people living in all houses in the block
- Column 9: The median house value for all houses in the block
I could use this data to build a machine learning app that predicts the median house value of any housing block in California.
Let’s get started and install the NuGet package I need. Add a new code block, type the following line, and run the block:
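The line in question is the NuGet magic command for the Microsoft.ML package (mentioned below), which looks like this:

```csharp
// Install the Microsoft.ML NuGet package directly from within the notebook.
// The #r "nuget:..." command is supported by the .NET Jupyter kernel.
#r "nuget:Microsoft.ML"
```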
Note how Jupyter immediately installs the package when you run this line.
Microsoft.ML is Microsoft’s new machine learning library. I’ll use it to load and process my data. And note how the #r "nuget:XXX" command installs a NuGet package.
Now I’m ready to add code. Let’s start with a bunch of using statements. Create a new code block, paste the following into it, and run the block:
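The using statements would look something like this (the exact list is a sketch; adjust it to match the namespaces your code actually uses):

```csharp
// Namespaces for ML.NET, its data attributes and transforms,
// plus the XPlot plotting library and LINQ.
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;
using XPlot.Plotly;
using System.Linq;
```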
Note the XPlot.Plotly namespace. This is the awesome XPlot plotting library that Jupyter loads by default. We’ll use it in this article to plot the data in our California Housing dataset.
Now I’m ready to add classes. I am going to need one class to hold all the information for a single housing block.
You know the drill by now. Create a new code block, paste the following into it, and run the block:
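Based on the column list above, the class might look like this (the property names are my own choice; note that LoadColumn uses zero-based column indices):

```csharp
// One record from the California Housing dataset.
// Each LoadColumn attribute tells ML.NET which zero-based
// CSV column to read the field from.
public class HouseBlockData
{
    [LoadColumn(0)] public float Longitude { get; set; }
    [LoadColumn(1)] public float Latitude { get; set; }
    [LoadColumn(2)] public float HousingMedianAge { get; set; }
    [LoadColumn(3)] public float TotalRooms { get; set; }
    [LoadColumn(4)] public float TotalBedrooms { get; set; }
    [LoadColumn(5)] public float Population { get; set; }
    [LoadColumn(6)] public float Households { get; set; }
    [LoadColumn(7)] public float MedianIncome { get; set; }
    [LoadColumn(8)] public float MedianHouseValue { get; set; }
}
```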
The HouseBlockData class holds all the data for one single housing block. Note how each field is tagged with a LoadColumn attribute that will tell the CSV data loading code which column to import data from.
Now I need to load the data in memory:
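A minimal sketch of the loading code, assuming the HouseBlockData class from the previous cell and the CSV file sitting next to the notebook:

```csharp
// Set up an ML.NET context and load the CSV file into a data view.
var context = new MLContext();
var data = context.Data.LoadFromTextFile<HouseBlockData>(
    "california_housing.csv",
    hasHeader: true,
    separatorChar: ',');
```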
This code calls the LoadFromTextFile method to load the CSV data in memory. Note the HouseBlockData type argument that tells the method which class to use to load the data.
So we have the data in memory as a data view. Now let’s convert that to an enumeration of HouseBlockData instances:
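The conversion is a one-liner, assuming the context and data variables from the previous cell:

```csharp
// Convert the lazy data view into a concrete array of records.
var houses = context.Data.CreateEnumerable<HouseBlockData>(
    data, reuseRowObject: false).ToArray();
```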
This code calls CreateEnumerable to convert the data view to an enumeration of HouseBlockData instances.
Your notebook should now look like this:
Now I’m going to plot the median house value by latitude and longitude. Let’s see what happens:
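XPlot code along these lines should produce the plot; treat it as a sketch, since property names can differ slightly between XPlot versions, and it assumes the houses array from the earlier cell:

```csharp
// Scatter plot of housing blocks by longitude/latitude,
// with each marker colored by the median house value.
var chart = Chart.Plot(
    new Graph.Scattergl
    {
        x = houses.Select(v => v.Longitude),
        y = houses.Select(v => v.Latitude),
        mode = "markers",
        marker = new Graph.Marker
        {
            color = houses.Select(v => v.MedianHouseValue),
            colorscale = "Jet"
        }
    });
chart.WithXTitle("Longitude");
chart.WithYTitle("Latitude");
display(chart);
```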
Check out the display() call at the end. This is a highly versatile method that can display many different types of data out of the box, including XPlot graphics.
When you run this block, you’ll notice that the Jupyter server immediately calculates the plot and renders the output in your notebook:
How cool is that? Interactive code and graphics output in the same interactive online document. That’s the awesome power of a Jupyter notebook!
So that definitely looks like California. Note the two high-value areas around San Francisco and Los Angeles, and how the house value gradually drops as we move further eastward.
I’m now going to search for a linear relationship that can predict the median house value. I’ll start by creating a plot of the median house value as a function of median income and see what happens.
This makes sense. People with a higher median income will probably tend to buy more expensive houses, so I expect these two columns to be related. But is the relationship linear or more complex?
Let’s find out!
Here’s the code to plot the relationship:
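A sketch of that plotting code, again assuming the houses array from earlier:

```csharp
// Scatter plot of median house value versus median income.
var incomeChart = Chart.Plot(
    new Graph.Scattergl
    {
        x = houses.Select(v => v.MedianIncome),
        y = houses.Select(v => v.MedianHouseValue),
        mode = "markers"
    });
incomeChart.WithXTitle("Median income");
incomeChart.WithYTitle("Median house value");
display(incomeChart);
```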
And this happens when you paste the code and run the block:
There’s a vaguely linear relationship here. The median house value increases when the median income increases. There’s a big spread in the house values but a vague ‘cigar’ shape is visible which suggests a linear relationship between these two variables.
But look at the horizontal line at 500,000. What’s that all about?
This is called clipping. The creator of this dataset has clipped all housing blocks with a median house value above $500,000 to $500,000. This shows up in the graph as a horizontal line that disrupts the linear cigar shape.
The clipped values pollute my dataset so I’m going to use data scrubbing to get rid of these clipped records:
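The scrubbing step might look like this, assuming the context, data, and houses variables from the earlier cells:

```csharp
// Keep only the rows with a median house value below 500,000.
// FilterRowsByColumn keeps rows in the range [lowerBound, upperBound),
// so the clipped records at exactly 500,000 are removed.
data = context.Data.FilterRowsByColumn(data, "MedianHouseValue", upperBound: 500_000);

// Refresh the in-memory enumeration as well.
houses = context.Data.CreateEnumerable<HouseBlockData>(
    data, reuseRowObject: false).ToArray();
```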
The FilterRowsByColumn method will keep only those records with a median house value below 500,000 (the upper bound is exclusive), and remove the clipped records from the dataset.
Let’s check if that worked:
The notebook now looks like this:
Much better! Notice how the horizontal line at $500k is gone now?
Now let’s take a closer look at the CSV data. I’m going to look at the first 10 records in the dataset:
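A quick way to do that, assuming the houses array from earlier (the .NET Jupyter kernel renders an enumeration of objects as a table):

```csharp
// Display the first 10 records of the dataset.
display(houses.Take(10));
```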
The notebook now looks like this:
All the columns are numbers in the hundreds or thousands range, but the median house value column is an outlier because it contains values that go all the way up to 500,000.
I’m going to fix this by using data scaling. I will divide the median house value by 1,000 to bring it down to the thousands range, more in line with the other data columns.
I will add the following class:
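The class only needs to declare the output column of the custom mapping. The class name is my own choice; the column name NormalizedMedianHouseValue matches the pipeline description below:

```csharp
// Output class for the custom mapping that scales the house value.
public class ToMedianHouseValue
{
    public float NormalizedMedianHouseValue { get; set; }
}
```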
and a bit more code:
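Something along these lines, assuming the context variable and the HouseBlockData and ToMedianHouseValue classes from the earlier cells:

```csharp
// Build a one-step pipeline that divides the median house value
// by 1,000 and stores it in a new column. A null contract name is
// fine here because we won't save this model to disk.
var pipeline = context.Transforms.CustomMapping<HouseBlockData, ToMedianHouseValue>(
    (input, output) => output.NormalizedMedianHouseValue = input.MedianHouseValue / 1000f,
    contractName: null);
```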
Machine learning models in ML.NET are built with pipelines which are sequences of data-loading, transformation, and learning components.
This pipeline has only one component:
- CustomMapping which takes the median house values, divides them by 1,000 and stores them in a new column called NormalizedMedianHouseValue.
Let’s see if the conversion worked. But first I’m going to need a quick helper method to print the results of the machine learning pipeline:
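A sketch of such a formatter, using the Formatter API that ships with the .NET Jupyter kernel (the exact registration signature may vary between kernel versions):

```csharp
// Register a custom HTML formatter so DataDebuggerPreview values
// render as a table instead of as an opaque object.
Formatter<DataDebuggerPreview>.Register((preview, writer) =>
{
    writer.Write("<table><thead><tr>");
    foreach (var column in preview.ColumnView)
        writer.Write($"<th>{column.Column.Name}</th>");
    writer.Write("</tr></thead><tbody>");
    foreach (var row in preview.RowView)
    {
        writer.Write("<tr>");
        foreach (var value in row.Values)
            writer.Write($"<td>{value.Value}</td>");
        writer.Write("</tr>");
    }
    writer.Write("</tbody></table>");
}, "text/html");
```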
This code sets up an output formatter for Jupyter that can display DataDebuggerPreview values which I get from running the machine learning pipeline.
Let’s run the pipeline now, grab the first 10 results and display them:
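A sketch of that cell, assuming the pipeline and data variables from the earlier cells:

```csharp
// Fit the pipeline on the data, transform the data with the
// resulting model, and display a 10-row preview.
var model = pipeline.Fit(data);
var transformedData = model.Transform(data);
display(transformedData.Preview(maxRows: 10));
```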
The Fit method sets up the pipeline, creates a machine learning model and stores it in the model variable. The Transform method then runs all data through the pipeline and stores the result in transformedData. And finally the Preview method extracts a 10-row preview from the transformed data.
Here’s what that looks like in Jupyter:
Notice the NormalizedMedianHouseValue column at the end? It contains house values divided by 1,000. The pipeline is working!
Now let’s fix the latitude and longitude. I am reading them in directly, but geographical data should preferably be binned, one-hot encoded, and crossed before being passed to a machine learning algorithm.
I’ll do that now. I will start by adding the following classes:
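These are the input and output classes for the custom mapping that will cross the two one-hot encoded location vectors later on. The class and property names are my own choice:

```csharp
// Input class: the two one-hot encoded location vectors.
public class FromLocation
{
    public VBuffer<float> EncodedLongitude { get; set; }
    public VBuffer<float> EncodedLatitude { get; set; }
}

// Output class: the crossed location vector.
public class ToLocation
{
    public float[] Location { get; set; }
}
```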
I am going to use these classes in the upcoming code snippets.
Now I will extend the pipeline with extra steps to process the latitude and longitude:
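A sketch of the extended pipeline, assuming the pipeline variable holding the custom-mapping step from earlier:

```csharp
// Extend the pipeline: bin the longitude and latitude
// values into 10 bins each.
var pipeline2 = pipeline
    .Append(context.Transforms.NormalizeBinning(
        "BinnedLongitude", "Longitude", maximumBinCount: 10))
    .Append(context.Transforms.NormalizeBinning(
        "BinnedLatitude", "Latitude", maximumBinCount: 10));
```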
Note how I’m extending the data loading pipeline with extra components. The new components are:
- A NormalizeBinning component that bins the longitude values into 10 bins
- A NormalizeBinning component that bins the latitude values into 10 bins
Let’s see if that worked:
And here’s the output:
Check out the BinnedLongitude and BinnedLatitude columns at the end. Each unique longitude and latitude value has been grouped into a set of 10 bins, and the bin numbers have been normalized on a scale from 0..1.
Let’s plot the bins to get a feel for what just happened:
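A sketch of that cell, assuming the pipeline2 variable from the binning step; the BinnedHouseBlockData helper class pulls the binned columns back out of the data view:

```csharp
// Helper class to read the binned columns out of the data view.
public class BinnedHouseBlockData
{
    public float BinnedLongitude { get; set; }
    public float BinnedLatitude { get; set; }
    public float NormalizedMedianHouseValue { get; set; }
}

// Run the data through the binning pipeline and plot the result.
var binnedModel = pipeline2.Fit(data);
var binnedHouses = context.Data.CreateEnumerable<BinnedHouseBlockData>(
    binnedModel.Transform(data), reuseRowObject: false).ToArray();

var binnedChart = Chart.Plot(
    new Graph.Scattergl
    {
        x = binnedHouses.Select(v => v.BinnedLongitude),
        y = binnedHouses.Select(v => v.BinnedLatitude),
        mode = "markers",
        marker = new Graph.Marker
        {
            color = binnedHouses.Select(v => v.NormalizedMedianHouseValue),
            colorscale = "Jet"
        }
    });
display(binnedChart);
```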
I’ve added a quick helper class called BinnedHouseBlockData to access the two new binned columns, and the plotting code is exactly the same as before.
Here’s the output:
The plot again shows the median house value by latitude and longitude, but now all locations have been binned into a 10x10 grid of tiles. This helps a machine learning algorithm pick up coarse-grained location patterns without getting bogged down in details.
Now let’s one-hot encode the binned latitude and longitude:
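A sketch of these steps, assuming the pipeline2 variable from the binning step and the FromLocation and ToLocation classes defined earlier; the exact list of columns to drop is my guess:

```csharp
// Extend the pipeline: one-hot encode both bin columns, cross the
// resulting vectors into a single Location column, and drop the
// columns we no longer need.
var pipeline3 = pipeline2
    .Append(context.Transforms.Categorical.OneHotEncoding(
        "EncodedLongitude", "BinnedLongitude"))
    .Append(context.Transforms.Categorical.OneHotEncoding(
        "EncodedLatitude", "BinnedLatitude"))
    .Append(context.Transforms.CustomMapping<FromLocation, ToLocation>(
        (input, output) =>
        {
            // Manual feature cross: the outer product of the two
            // one-hot vectors, flattened into one long vector.
            var lon = input.EncodedLongitude.DenseValues().ToArray();
            var lat = input.EncodedLatitude.DenseValues().ToArray();
            output.Location = new float[lon.Length * lat.Length];
            var index = 0;
            for (var i = 0; i < lon.Length; i++)
                for (var j = 0; j < lat.Length; j++)
                    output.Location[index++] = lon[i] * lat[j];
        },
        contractName: null))
    .Append(context.Transforms.DropColumns(
        "Longitude", "Latitude", "BinnedLongitude", "BinnedLatitude",
        "EncodedLongitude", "EncodedLatitude", "MedianHouseValue"));
```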
Note how I’m extending the data loading pipeline again. The new components are:
- A OneHotEncoding component that one-hot encodes the longitude bins
- A OneHotEncoding component that one-hot encodes the latitude bins
- A CustomMapping component that crosses the one-hot encoded vectors of the longitude and latitude. ML.NET has no built-in support for crossing one-hot encoded vectors, so we do it manually with a nested for loop and store the result in a new column called Location.
- A final DropColumns component to delete all columns from the data view that we don’t need anymore.
Let’s see if this worked:
And here’s the output:
Note how we now have an extra column called Location with a 100-element buffer of Single values. This is the result of the feature cross of longitude and latitude. Each vector will contain almost all zeroes with only a single 1.
Let’s display the crossed vector to make sure everything is working:
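One way to do that, assuming the pipeline3 variable from the crossing step and the ToLocation class from earlier:

```csharp
// Print the first 10 crossed location vectors as strings of 0s and 1s,
// so the single 1 in each vector is easy to spot.
var finalModel = pipeline3.Fit(data);
var locations = context.Data.CreateEnumerable<ToLocation>(
    finalModel.Transform(data), reuseRowObject: false);
foreach (var row in locations.Take(10))
    display(string.Join("", row.Location.Select(v => v == 0 ? "0" : "1")));
```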
And here’s the output:
That looks perfect. There’s only a single 1 in every row, just as expected for a one-hot encoded feature.
So that was a quick walkthrough of the power of C# and ML.NET in a Jupyter notebook.
What do you think?
Are you ready to start building your own Jupyter notebooks with C# machine learning code?