You’ve probably missed one of the greatest ML packages out there

Oliver K. Ernst, Ph.D.
Published in Practical Coding
Dec 28, 2020 · 11 min read

And no, it’s not in Python. Or C++.

Artistic visualization of neural network outputs with random weights (2D inputs, scalar output). Credit: Author. To reproduce this visualization, see Wolfram docs.

It might not even be in a language you already know, given how relatively unpopular it is. But if you’ve studied math or physics at university, you’ve probably at least heard of it.

I’m talking about Mathematica. And since version 11, it’s been possible to train neural networks directly in Mathematica.

Seriously? Mathematica?

Yes. Seriously. Keep reading.

If you’ve ever been interested in symbolic math, let’s get it out there: there’s really no parallel. Matlab has only basic symbolic capabilities and isn’t really comparable (although both Mathematica and Matlab share the maddening indexing convention that starts at one instead of zero…). Similarly, Python’s SymPy is a nice project, but it doesn’t come close to Mathematica’s capabilities. Mathematica can take integrals and derivatives, simplify complicated expressions, and solve systems of equations analytically. Plus, Mathematica essentially invented the idea of computational notebooks, long before Jupyter notebooks arrived in Python as something of an afterthought. The plots look a lot nicer than the default matplotlib ones, too!
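To give you a quick taste of the symbolic side (the expressions here are just illustrative):

Integrate[Exp[-x^2], {x, -Infinity, Infinity}]  (* Sqrt[Pi] *)
D[Sin[x^2], x]                                  (* 2 x Cos[x^2] *)
Solve[{x + y == 2, x - y == 0}, {x, y}]         (* {{x -> 1, y -> 1}} *)
FullSimplify[Sin[x]^2 + Cos[x]^2]               (* 1 *)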

But what about symbolic math for ML? The answer should be obvious: automatic differentiation is much easier when you can do symbolic math.
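For example, the gradient of a toy squared-error loss is a one-liner (just an illustration of the idea, not the machinery NetTrain uses under the hood):

loss = (w*x + b - y)^2;
D[loss, {{w, b}}]  (* {2 x (b + w x - y), 2 (b + w x - y)} *)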

If you’re coming from e.g. TensorFlow, you’ll be stunned by how different the interface is for building neural networks. Mathematica focuses on building general graphs — but I think you’ll find it incredibly elegant!

In the rest of the article, I’m gonna show you the pieces you need to get started with ML in Mathematica.

Get Mathematica

Before you ask if it’s free — it is free-ish. Officially it’s proprietary and, of course, expensive — at the time of writing, a home desktop license runs $354. But there are two large caveats here:

  • If you’re a student or affiliated with a university, your institution probably provides it for free — this is probably the most common exposure to this great tool.
  • If you’re not at a university — Raspberry Pi OS (formerly Raspbian) ships with Mathematica for free. That means you can pick up a little $5 Raspberry Pi Zero (maybe even the Zero W with built-in WiFi!) and get a free working copy of Mathematica. From $354 down to $5 — how absurdly cool is that!

Three ways to make a neural network

  1. Load a pre-made one: NetModel.
  2. Make a simple layer-to-layer one, like the ones you might see in an introductory ML textbook: NetChain.
  3. Make a general graph: NetGraph.

Let’s do each one in turn.

Load a pre-made network

The main command here is NetModel:

net = NetModel["LeNet Trained on MNIST Data"]

Note: you can (and should) always click the little plus button to get a more detailed breakdown of the network.
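Once loaded, you can apply the network directly to an input. A minimal sketch, assuming net is the LeNet model loaded above and img is an image of a handwritten digit (the random placeholder below is just to have something to run):

img = Image[RandomReal[1, {28, 28}]]; (* placeholder; substitute a real digit image *)
net[img]                   (* the predicted digit, 0 through 9 *)
net[img, "Probabilities"]  (* an association of class probabilities *)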

You can just fire up NetModel[] to see a list of all possible pre-loaded networks — I’ll put it at the bottom of this article.

Simple neural networks

A “simple” neural network in this case means a graph of layers where the i-th layer is connected to the (i+1)-th layer. This is done with the NetChain command:

net = NetChain[{
    LinearLayer[10],
    Tanh,
    LinearLayer[20],
    LogisticSigmoid
    },
   "Input" -> 10,
   "Output" -> 20
   ]

This introduced a couple of new things:

  • Layers: a NetChain is composed of layers that are connected together. A LinearLayer computes W.input + b. There are lots more, like ElementwiseLayer, SoftmaxLayer, FunctionLayer, CompiledLayer, PlaceholderLayer, ThreadingLayer, and RandomArrayLayer. Probably the most useful one is FunctionLayer, which lets you apply a custom function (I’ll show you this layer below).
  • Activation functions: you can stick in pretty much any function you like! Tanh, Ramp (ReLU), ParametricRampLayer (leaky ReLU), and LogisticSigmoid are obvious choices, but you can use anything here. You can even repeat activation functions (for whatever reason) — no constraints. A custom activation is sketched right after this list.
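As a quick aside, here is a minimal sketch of a custom element-wise activation (a swish-style function, x*sigmoid(x), built with ElementwiseLayer; customNet and the layer sizes are throwaway choices):

customNet = NetChain[{
    LinearLayer[10],
    ElementwiseLayer[# LogisticSigmoid[#] &], (* x * sigmoid(x) *)
    LinearLayer[20],
    LogisticSigmoid
    },
   "Input" -> 10
   ]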

You can see the network is “uninitialized”. We can initialize the net with random weights and zero biases:

net = NetInitialize[net]
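By default NetInitialize uses its standard random scheme; it also takes a Method option if you want a particular initialization (the choice of "Xavier" here is just an example):

net = NetInitialize[net, Method -> "Xavier"]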

Let’s get those weights out. You can use Information[net, "Properties"] to see all the properties you can query. The layers can be extracted with:

layers = Information[net, "LayersList"]
linearLayers = Select[layers, ToString[Head[#]] == "LinearLayer" &]

The weights and biases can be obtained with:

weights1 = NetExtract[linearLayers[[1]], "Weights"]
Normal[weights1] // MatrixForm
biases1 = NetExtract[linearLayers[[1]], "Biases"]
Normal[biases1] // MatrixForm

Note that Normal converts the NumericArray to a standard Mathematica list that we can manipulate easily.
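As a quick sanity check on the shapes for the first LinearLayer of the chain above:

Dimensions[Normal[weights1]]  (* {10, 10} *)
Dimensions[Normal[biases1]]   (* {10} *)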

Make any arbitrary graph

Mathematica is great at graphs. The main command to make neural networks is NetGraph. This is where it gets really cool!

Let’s start with a network that takes a target vector of size 2 and another vector of size 2, and computes the L2 loss:

layerDefns = <|
   "l2" -> FunctionLayer[(#MyTarget - #MyLossInput)^2 &,
     "MyTarget" -> 2, "MyLossInput" -> 2, "Output" -> 2],
   "sum" -> SummationLayer[]
   |>;
layerConn = {{NetPort["MyTarget"], NetPort["MyLossInput"]} ->
    "l2" -> "sum" -> NetPort["MyLoss"]};
l2lossNet = NetGraph[layerDefns, layerConn]

We see the format for NetGraph :

  • An association (the Mathematica term for a dictionary of key/value pairs) mapping names to layers.
  • A list of connections between the named layers, written with arrows ->.

Also introduced is the notion of a NetPort. A NetPort is a named input/output of a network. You can clearly see the input ports on the left-hand side of the layerConn definition, and the output ports on the right-hand side:

layerConn = {{NetPort["MyTarget"], NetPort["MyLossInput"]} ->
    "l2" -> "sum" -> NetPort["MyLoss"]};

The inputs are "MyTarget" and "MyLossInput", and the output is "MyLoss". You can clearly see these in the graph as well:

Look at this carefully — it shows you the dimensions of each connection in the graph, which is extremely helpful. The inputs are vectors of length 2, while the output is a scalar.
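You can also evaluate this little loss network directly, by passing an association keyed by the input port names (the numbers are arbitrary):

l2lossNet[<|"MyTarget" -> {1., 0.}, "MyLossInput" -> {0.8, 0.1}|>]
(* (1. - 0.8)^2 + (0. - 0.1)^2 = 0.05 *)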

You can now reuse this graph in yet larger graphs, for example with another subnetwork:

subNet = NetChain[{LinearLayer[2], Tanh, LinearLayer[10], Tanh, LinearLayer[1]}]

and the combination of the subnetwork and the loss network:

layerDefns = <|
   "sn1" -> subNet,
   "sn2" -> subNet,
   "l2loss" -> l2lossNet,
   "catentate" -> CatenateLayer[]
   |>;
layerConn = {
   NetPort["MyInput"] -> "sn1" -> "catentate",
   NetPort["MyInput"] -> "sn2" -> "catentate",
   {NetPort["MyTarget"], "catentate"} ->
    "l2loss" -> NetPort["MyLoss"]
   };
net = NetGraph[layerDefns, layerConn]

Let’s look at that more closely:

The input vector gets passed into two separate copies of the subnetwork, each of which maps it down to a scalar. The two scalar outputs are then catenated (stacked) into a vector of size 2 and passed into the loss network along with the target to finally obtain the loss.
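As a quick check before training, you can initialize the combined graph and evaluate it on a single sample to get the (untrained) loss; the numbers here are arbitrary:

initNet = NetInitialize[net];
initNet[<|"MyInput" -> {0.1, 0.2}, "MyTarget" -> {Cos[0.3], Sin[0.3]}|>]
(* a single scalar: the current value of "MyLoss" *)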

Train networks

The command to train neural networks is NetTrain . Let’s train the net from the previous example. First, we need some data:

data = Flatten@Table[
    <|
     "MyInput" -> {x, y},
     "MyTarget" -> {Cos[x + y], Sin[x + y]}
     |>,
    {x, -1, 1, .02},
    {y, -1, 1, .02}];
data[[1]]

We can see that each data sample is an association (dictionary) with one entry per named NetPort.

Next the training:

trainedNet = NetTrain[net, data, LossFunction -> "MyLoss"]

This is great — we get a simple overview and can visualize the training live. You can set various options — for example, the maximum number of training rounds with:

NetTrain[..., MaxTrainingRounds->50000]
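NetTrain has plenty of other options too; here is a sketch with a few common ones (the particular values are just illustrative):

NetTrain[net, data,
   LossFunction -> "MyLoss",
   MaxTrainingRounds -> 1000,
   BatchSize -> 64,
   Method -> "ADAM",
   ValidationSet -> Scaled[0.1]
   ]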

Finally, let’s evaluate the trained network. The network trainedNet outputs the value of the loss function. Since we’re more interested in the actual predictions, we can use NetTake to grab all layers up to and including the CatenateLayer:

trainedSubNet = NetTake[trainedNet, "catentate"]
trainedSubNet[data[[1]]["MyInput"]]
data[[1]]["MyTarget"]

Let’s plot that and do some regression!

pts = Flatten[Table[
     {x, y, trainedSubNet[{x, y}][[2]]},
     {x, -1, 2, .05},
     {y, -1, 2, .05}], 1];
ptsTrain = Select[pts, #[[1]] <= 1 && #[[2]] <= 1 &];
ptsExtrapolate = Select[pts, #[[1]] > 1 || #[[2]] > 1 &];
pltPtsTrain = ListPointPlot3D[ptsTrain, PlotStyle -> Black];
pltPtsExtrapolate = ListPointPlot3D[ptsExtrapolate, PlotStyle -> Blue];
pltTrue = Plot3D[Sin[x + y], {x, -1, 2}, {y, -1, 2},
   PlotStyle -> Opacity[0.3], Mesh -> None, PlotLabel -> "Sin[x+y]",
   BaseStyle -> Directive[FontSize -> 24]];
Row[{
  pltTrue,
  Show[pltPtsTrain, pltPtsExtrapolate, PlotRange -> All],
  Show[pltPtsTrain, pltPtsExtrapolate, pltTrue, PlotRange -> All]
  }]

Final thoughts

You can read Mathematica’s more complete guide on neural networks here. However, it’s a bit long… If you prefer to just read through the API pages, here’s a better list from Wolfram.

I hope you got a sense of how powerful Mathematica’s neural networks are. The most powerful aspect is that you can build arbitrary graphs and loss functions — I don’t know of any interface as capable as Mathematica’s for this, even if it is a bit different from TensorFlow. I really like the ability to combine networks easily with NetPorts and to visualize the dimensions of inputs and outputs. Additionally, the automatic monitoring of training progress is incredibly useful.

Appendix: list of all possible pre-loaded networks at the time of writing

From the command NetModel[]:

2D Face Alignment Net Trained on 300W Large Pose Data
3D Face Alignment Net Trained on 300W Large Pose Data
AdaIN-Style Trained on MS-COCO and Painter by Numbers Data
Ademxapp Model A1 Trained on ADE20K Data
Ademxapp Model A1 Trained on Cityscapes Data
Ademxapp Model A1 Trained on PASCAL VOC2012 and MS-COCO Data
Ademxapp Model A Trained on ImageNet Competition Data
Age Estimation VGG-16 Trained on IMDB-WIKI and Looking at People Data
Age Estimation VGG-16 Trained on IMDB-WIKI Data
BERT Trained on BookCorpus and Wikipedia Data
BioBERT Trained on PubMed and PMC Data
BPEmb Subword Embeddings Trained on Wikipedia Data
CapsNet Trained on MNIST Data
Clinical Concept Embeddings Trained on Health Insurance Claims, Clinical Narratives from Stanford and PubMed Journal Articles
Colorful Image Colorization Trained on ImageNet Competition Data
ColorNet Image Colorization Trained on ImageNet Competition Data
ColorNet Image Colorization Trained on Places Data
ConceptNet Numberbatch Word Vectors V17.06
ConceptNet Numberbatch Word Vectors V17.06 (Raw Model)
CREPE Pitch Detection Net Trained on Monophonic Signal Data
CycleGAN Apple-to-Orange Translation Trained on ImageNet Competition Data
CycleGAN Horse-to-Zebra Translation Trained on ImageNet Competition Data
CycleGAN Monet-to-Photo Translation
CycleGAN Orange-to-Apple Translation Trained on ImageNet Competition Data
CycleGAN Photo-to-Cezanne Translation
CycleGAN Photo-to-Monet Translation
CycleGAN Photo-to-Van Gogh Translation
CycleGAN Summer-to-Winter Translation
CycleGAN Winter-to-Summer Translation
CycleGAN Zebra-to-Horse Translation Trained on ImageNet Competition Data
Deep Speech 2 Trained on Baidu English Data
Dilated ResNet-105 Trained on Cityscapes Data
Dilated ResNet-22 Trained on Cityscapes Data
Dilated ResNet-38 Trained on Cityscapes Data
DistilBERT Trained on BookCorpus and English Wikipedia Data
EfficientNet Trained on ImageNet
EfficientNet Trained on ImageNet with AutoAugment
EfficientNet Trained on ImageNet with NoisyStudent
ELMo Contextual Word Representations Trained on 1B Word Benchmark
Enhanced Super-Resolution GAN Trained on DIV2K, Flickr2K and OST Data
ForkNet Brain Segmentation Net Trained on NAMIC Data
Gender Prediction VGG-16 Trained on IMDB-WIKI Data
GloVe 100-Dimensional Word Vectors Trained on Tweets
GloVe 100-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data
GloVe 200-Dimensional Word Vectors Trained on Tweets
GloVe 25-Dimensional Word Vectors Trained on Tweets
GloVe 300-Dimensional Word Vectors Trained on Common Crawl 42B
GloVe 300-Dimensional Word Vectors Trained on Common Crawl 840B
GloVe 300-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data
GloVe 50-Dimensional Word Vectors Trained on Tweets
GloVe 50-Dimensional Word Vectors Trained on Wikipedia and Gigaword 5 Data
GPT2 Transformer Trained on WebText Data
GPT Transformer Trained on BookCorpus Data
Inception V1 Trained on Extended Salient Object Subitizing Data
Inception V1 Trained on ImageNet Competition Data
Inception V1 Trained on Places365 Data
Inception V3 Trained on ImageNet Competition Data
LeNet
LeNet Trained on MNIST Data
Micro Aerial Vehicle Trail Navigation Nets Trained on IDSIA Swiss Alps and PASCAL VOC Data
MobileNet V2 Trained on ImageNet Competition Data
Multi-scale Context Aggregation Net Trained on CamVid Data
Multi-scale Context Aggregation Net Trained on Cityscapes Data
Multi-scale Context Aggregation Net Trained on PASCAL VOC2012 Data
OpenFace Face Recognition Net Trained on CASIA-WebFace and FaceScrub Data
Pix2pix Photo-to-Street-Map Translation
Pix2pix Street-Map-to-Photo Translation
Pose-Aware Face Recognition in the Wild Nets Trained on CASIA WebFace Data
Pre-trained Distilled BERT Trained on BookCorpus and English Wikipedia Data
ResNet-101 for 3D Morphable Model Regression Trained on Casia WebFace Data
ResNet-101 Trained on Augmented CASIA-WebFace Data
ResNet-101 Trained on ImageNet Competition Data
ResNet-101 Trained on YFCC100m Geotagged Data
ResNet-152 Trained on ImageNet Competition Data
ResNet-50 Trained on ImageNet Competition Data
RetinaNet-101 Feature Pyramid Net Trained on MS-COCO Data
RoBERTa Trained on BookCorpus, English Wikipedia, CC-News, OpenWebText and Stories Datasets
SciBERT Trained on Semantic Scholar Data
Self-Normalizing Net for Numeric Data
Sentiment Language Model Trained on Amazon Product Review Data
Single-Image Depth Perception Net Trained on Depth in the Wild Data
Single-Image Depth Perception Net Trained on NYU Depth V2 and Depth in the Wild Data
Single-Image Depth Perception Net Trained on NYU Depth V2 Data
Sketch-RNN Trained on QuickDraw Data
Squeeze-and-Excitation Net Trained on ImageNet Competition Data
SqueezeNet V1.1 Trained on ImageNet Competition Data
SSD-MobileNet V2 Trained on MS-COCO Data
SSD-VGG-300 Trained on PASCAL VOC Data
SSD-VGG-512 Trained on MS-COCO Data
SSD-VGG-512 Trained on PASCAL VOC2007, PASCAL VOC2012 and MS-COCO Data
U-Net Trained on Glioblastoma-Astrocytoma U373 Cells on a Polyacrylamide Substrate Data
Unguided Volumetric Regression Net for 3D Face Reconstruction
Vanilla CNN for Facial Landmark Regression
Very Deep Net for Super-Resolution
VGG-16 Trained on ImageNet Competition Data
VGG-19 Trained on ImageNet Competition Data
VGGish Feature Extractor Trained on YouTube Data
Wide ResNet-50-2 Trained on ImageNet Competition Data
Wolfram AudioIdentify V1 Trained on AudioSet Data
Wolfram C Character-Level Language Model V1
Wolfram English Character-Level Language Model V1
Wolfram ImageIdentify Net V1
Wolfram JavaScript Character-Level Language Model V1
Wolfram LaTeX Character-Level Language Model V1
Wolfram Python Character-Level Language Model V1
Yahoo Open NSFW Model V1
YOLO V2 Trained on MS-COCO Data
YOLO V3 Trained on Open Images Data
