Accelerated Data Science

Data Science and Machine Learning Workloads on an M1 Mac

Peter Killick
Met Office Informatics Lab
Jun 21, 2022 · 12 min read


Apple’s latest M1 chip versions (excluding the M1 Ultra). Credit: apple.com

A New Order

November 2020 saw the announcement of a significant change of direction for Apple: for the first time in nearly 15 years, the Apple Mac computer would no longer use Intel processors. Intel processors would be replaced by the M1 System on Chip (SoC), developed in-house by Apple and an evolution of the A-series SoCs that have powered iPhones and iPads for a number of years.

Why does this matter?

Intel’s x86 microarchitecture is prevalent across desktop computers, laptops, servers, the Cloud, and HPC (although IBM’s Power microarchitecture is a notable exception in the HPC sphere). ARM’s microarchitecture, meanwhile, is ubiquitous in mobile devices. If you own a modern smartphone, it’s almost certain to be running an ARM-architected chip, manufactured under license by Qualcomm, Samsung, Huawei, MediaTek, Apple (of course), or another chip manufacturer.

Apple’s move to the M1 chip for desktop computing is much more significant than turning MacBooks into glorified mobile phones, however. The change in microarchitecture from x86 to ARM introduces new uncertainty to the usability of desktop applications. Putting aside the question of the different application installers used by Windows, macOS and Linux, a binary (that is, a program compiled to an executable object) built for x86 would be likely to function on all of these desktop operating systems without trouble.

The M1 SoC’s ARM microarchitecture upends this comfortable certainty. A different microarchitecture means a different instruction set, which fundamentally means that a binary compiled for x86 will not run on ARM. It’s similar to going into a café in a country where you don’t speak the local language — it’s going to be hard to buy a local delicacy if you and the barista don’t understand each other’s spoken language.

A person makes a purchase in a café.
Better hope they’re both speaking the same language. Photo by Toa Heftiba on Unsplash

The Pertinent Questions

To get around this, every program that you want to use on an M1 must be compiled for the ARM microarchitecture as well as for x86. The question is: have they been?

This question recently came home to roost in the Informatics Lab as we received our first Macs that use the M1 SoC. The Lab (and indeed the wider Met Office) runs a lot of Python-based Data Science and Machine Learning analysis workflows on large n-dimensional gridded weather and climate datasets.

The pertinent question for the Lab, therefore, is can we run these workflows on an M1 Mac? And if we can, how does the experience compare to running the same workflows on an Intel Mac? We’ll spend the rest of this blog post exploring the answers to these questions.

Data Science

I’ll handle Data Science as distinct from Machine Learning here and define it as all classical analysis workflows that don’t involve an element of machine-driven learning (unsurprisingly we’ll look at that in the next section on Machine Learning).

The Lab’s data science workflows typically use common scientific Python packages to load, analyse, visualise and save nD datasets of numerical data describing weather and climate phenomena. These are often provided as CSV or NetCDF files. To do this we use Python libraries such as Jupyter, NumPy, matplotlib, dask, pandas, Iris, cartopy and Xarray, and their surrounding ecosystems of dependency packages. The simple question I wanted to ask of the M1 Mac in this case is “can I install and use these Python libraries on an M1 Mac?”

To test this I installed Python and conda via miniforge and went to work setting up environments that contained these different packages. Quite simply, I found that all of these packages installed and functioned without issue on an M1 Mac, so this was an easy and very gratifying conclusion to reach.
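As a quick sanity check that an environment really is running natively (rather than under Rosetta 2 emulation), you can ask Python which microarchitecture its interpreter was built for, and confirm that the key libraries import cleanly. A minimal sketch; the package list mirrors the libraries named above:

import platform

# 'arm64' indicates a native M1 build of Python; 'x86_64' means the
# interpreter is running under Rosetta 2 emulation.
print(platform.machine())

# Confirm the key scientific Python libraries import without error.
for pkg in ("numpy", "pandas", "xarray", "dask", "matplotlib", "cartopy", "iris"):
    __import__(pkg)
    print(f"{pkg}: OK")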

There is no doubt that this state of affairs is on account of the amazing efforts put in by the maintainers — both of the libraries themselves and the package builds of the libraries at conda-forge. Their efforts have ensured that these libraries are all available with either microarchitecture-independent (noarch) package builds, or with builds for multiple different microarchitectures, including x86 and ARM. So, when you install these libraries using conda, it will automatically select the package build that’s best suited to your system, ensuring consistent functionality across all sorts of different systems.

Machine Learning

For Machine Learning (ML) workflows I wanted to take things beyond the first question of “can I install the libraries?” and explore whether we can also find a computational advantage to using the M1. There is good reason for hoping that we might: the M1 Pro SoC in our MacBook Pros contains an 8 core CPU as well as a 14 core GPU and a 16 core Neural Engine. These last two items are both accelerators that can be used to speed up the training of ML models — a GPU handles matrix operations at high speed, typically to drive graphical applications, but can just as well be applied to matrix operations in ML applications; and the Apple Neural Engine is a type of Neural Processing Unit (NPU) that’s specifically designed for the sort of operations found in Neural Networks.

However, just because an accelerator is present doesn’t necessarily mean that you can simply use it to accelerate training your ML model. You need a mechanism to transfer the training to the accelerator, otherwise it will just run on the CPU and you won’t benefit from the accelerator’s presence. This turned out to apply particularly to the Apple Neural Engine, which seems to be accessible only via Core ML, Apple’s Swift library for running ML models. But as Core ML is written in Swift it’s not a lot of direct use for Python ML libraries, so I didn’t explore it further.

The GPU on the M1 SoC, however, is accessible via Metal, Apple’s graphical accelerator API. Apple have also written a TensorFlow plugin for Metal that enables accelerated training of ML models written in TensorFlow using Mac GPUs. It made sense, therefore, to test accelerating ML workflows using ML models built with TensorFlow (when we started, PyTorch and XGBoost, for example, only supported CUDA-based acceleration on NVIDIA accelerators; PyTorch’s Metal support arrived later, as we’ll see below).

The Experiment — TensorFlow

Accelerators work best when data volumes are large and the computations being run are complex. This is because there is a computational overhead associated with moving the model and in-memory data from the CPU to the accelerator. If the model and/or data are small, this overhead may be significantly larger than any speed-up gained from running on the accelerator, and the net result is that using the accelerator actually decreases performance.

So, to best test for performance improvements for using the GPU in the M1 SoC, we want a complex ML model to be run on large data volumes. Or perhaps better still, we want a series of models (ranging from simple to complex) but all running on the same data, from which we can look for noticeable performance improvements when accelerating the more complex models compared to the simpler models.

Pleasingly, just such an example can be found in the Overfitting and Underfitting tutorial in the TensorFlow docs. This demonstrates an important principle in ML model architecture (a complex model doesn’t necessarily produce better results) by running a series of models of increasing complexity against the same large-ish dataset. In other words, this is exactly what we were after. More details about the architectures of the different models and the dataset used can be found at the link above.
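For orientation, the simplest of these, the ‘tiny’ model, is a two-layer dense network along the following lines. This is a sketch in the spirit of the tutorial rather than a verbatim copy; the 28 input features correspond to the Higgs dataset the tutorial trains on:

import tensorflow as tf

N_FEATURES = 28  # the tutorial's Higgs dataset has 28 features per sample

tiny_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="elu", input_shape=(N_FEATURES,)),
    tf.keras.layers.Dense(1),
])
tiny_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
tiny_model.summary()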

The setup of the experiment was as follows:

  • Use six of the models defined in the tutorial linked above (the models named ‘tiny’, ‘medium’, ‘large’, ‘L2’, ‘dropout’ and ‘combined’). Record the wall time taken to train each model and the total number of training epochs required, giving a result value of seconds per training epoch.
  • Run each model three times on each hardware platform, then take the mean result value across the three runs.
  • As Metal allows us to accelerate ML models using any recent Mac with a GPU, we can compare the M1 SoC to the previous generation of Intel+AMD Macs.

Therefore, the following hardware platforms were used:

  • A 14” MacBook Pro 2021 with an M1 Pro SoC with an 8 core CPU, a 14 core GPU and 16GB integrated memory, with and without Metal acceleration.
  • A 16” MacBook Pro 2019 with an 8 core Intel i9 9880H, an AMD Radeon Pro 5500M 4GB and 32GB system memory, with and without Metal acceleration.

Thus we had two devices each running with and without Metal-enabled acceleration, providing four platforms in total, on which to run six experiments three times each. To accurately compare models that ran for different numbers of epochs, the value we chose to record was the time in seconds taken to compute one training epoch, calculated by dividing the total number of wall-time seconds the model trained for by the number of epochs taken to train.
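A minimal sketch of how that value can be captured, assuming model, train_ds and val_ds are the compiled Keras model and tf.data datasets from the tutorial (the early-stopping settings here are illustrative):

import time
import tensorflow as tf

start = time.perf_counter()
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10_000,  # an upper bound; early stopping ends training sooner
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=200)],
    verbose=0,
)
wall_time = time.perf_counter() - start
epochs_run = len(history.epoch)
print(f"{epochs_run} epochs in {wall_time:.1f} s: {wall_time / epochs_run:.3f} s/epoch")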

Results — TensorFlow

The graph below shows the average runtime (s) per epoch for each of the six models on each of the four hardware platforms.

Line graph showing mean time per epoch for various ML models on different processor microarchitectures.

The graph demonstrates a number of interesting results:

  • The M1 SoC outperforms the Intel processor both with and without Metal acceleration enabled.
  • Metal acceleration makes small and medium-sized ML models train considerably slower than they do without it.
  • However, as the models get larger, the difference between Metal acceleration being enabled and not being enabled shrinks for the M1 Mac, and in fact the Intel Mac trains faster for the two largest models with Metal acceleration enabled.
  • The performance difference between M1 and Intel gets more pronounced as the ML models get more complex.

One other result that is not demonstrated in the graph above is the thermal impact of running these experiments, particularly with the most complex models and with Metal acceleration enabled. The M1 Mac became warm to the touch, but at no point did its fans switch on. The Intel Mac, meanwhile, became too hot to touch and its fans were running at maximum — to the extent that some runs showed obvious evidence of thermal throttling.

The Experiment — PyTorch

PyTorch has just (at the time of writing) released support for Metal devices in v1.12. Unlike TensorFlow’s Metal plugin, which is provided by Apple, PyTorch’s Metal support has been provided by the PyTorch community. With this support in place, we can run a similar experiment in PyTorch to the one we ran in TensorFlow.

Again we’ll use an example from the documentation to provide a reasonably-sized ML model with which we can test the performance of Metal acceleration. For PyTorch we’ll work through this transfer learning tutorial, using both the fine-tuning and feature-extraction applications. We’ll use the same hardware platforms for PyTorch as were used in the TensorFlow experiment.
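For reference, both applications start from a pretrained ResNet-18 and replace its final layer; broadly, they differ only in whether the backbone’s weights remain trainable. A sketch of the setup based on the tutorial (its dataset has two classes, ants and bees):

import torch
import torchvision

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Fine-tuning: all weights remain trainable.
model_ft = torchvision.models.resnet18(pretrained=True)
model_ft.fc = torch.nn.Linear(model_ft.fc.in_features, 2)
model_ft = model_ft.to(device)

# Feature extraction: freeze the backbone; only the new final layer trains.
model_fe = torchvision.models.resnet18(pretrained=True)
for param in model_fe.parameters():
    param.requires_grad = False
model_fe.fc = torch.nn.Linear(model_fe.fc.in_features, 2)
model_fe = model_fe.to(device)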

Results — PyTorch

The bar chart below shows the results of running the two applications from the transfer learning tutorial on the Intel and M1 Mac, with and without Metal acceleration enabled. Here we simply recorded the total wall time for each application on each platform and treated these as our results.

Bar chart showing run time (s) for two different ML applications on different processor microarchitectures.

Again, we see broadly the same pattern as with the TensorFlow experiment — typically the M1 is faster than the Intel Mac, and there is typically limited benefit to using the GPU (if any speed-up at all). The notable exception is the fine-tuning experiment on the M1 Mac, where a significant speed-up was recorded with the GPU enabled.

Note that the results for this experiment were recorded a little differently from the TensorFlow experiment, with the total time elapsed for each application recorded rather than the time per epoch. Each experiment was also run only once, meaning spurious results were not averaged out over multiple runs.

Timing each run allowed us to capture not only wall time (the time between the start and end of execution) but also the user process time (the amount of time the CPU spent executing the user process). CPU time for the two applications, with and without GPU acceleration, is shown below:

Bar chart showing CPU time (s) for two different ML applications on different processor microarchitectures.

It’s notable that CPU time decreases significantly when the GPU is enabled. This is a direct demonstration that the processing is being offloaded from the CPU to the GPU: the time the CPU spends running the processing is significantly reduced, even if the overall wall time of the processing (from the previous chart) does not change much.
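For reference, both quantities can be captured from within Python itself: time.perf_counter() measures wall time, while time.process_time() measures the CPU time of the process. A minimal sketch, assuming a hypothetical train() function that runs one complete application:

import time

def train():
    # Hypothetical stand-in for one complete training application.
    sum(i * i for i in range(10_000_000))

start_wall = time.perf_counter()
start_cpu = time.process_time()
train()
wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"wall time: {wall:.2f} s, CPU time: {cpu:.2f} s")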

Conclusions

The results from both these experiments allow us to draw these interesting conclusions:

  • There is an evident improvement in training more complex ML models on an Intel Mac with Metal acceleration enabled.
  • There is the suggestion that a similar performance improvement would appear on the M1 Mac if the ML model were even more complex.
  • The difference in thermal performance between M1 and Intel is quite striking: the M1 is not only faster but also far more thermally efficient than Intel. Indeed this conclusion reflects one of the main reasons that Apple pursued an ARM-based SoC for Mac computers.

In Summary

This brief exploration of using an M1 SoC for Data Science and Machine Learning workflows has led to a number of conclusions that are significant to the applicability of the M1 to these fields:

  • We encountered no difficulty with installing and using common Scientific Python libraries for Data Science workflows on an M1 Mac.
  • The M1 SoC is faster than Intel in the ML model training experiment that we performed. And the M1 seemed to be faster for the Data Science workflows that I used to test the functionality of scientific Python libraries on the M1 Mac.
  • We found that Apple’s Metal API can be effectively used to accelerate training ML models both on M1 and Intel Macs, with a performance gain evident with more complex models on the Intel Mac.
  • The M1 SoC is much more thermally efficient than Intel.

Next Steps

Here are some ideas of where to go to extend this investigation further:

  • Data Science workflows felt like they were faster on the M1. We could time these workflows and see if we can prove (or disprove!) this with some numerical results.
  • The TensorFlow experiment showed that the M1 always trained faster without Metal acceleration. It would be interesting to explore how big an ML model would need to be for the M1 with Metal acceleration to train faster than without.
  • I still want to try to make use of the Neural Engine to accelerate ML workflows. It would be a lot of effort, but it would be interesting to see if we could reproduce the ML model in Swift for Core ML, or even call Core ML from Python somehow (see the sketch after this list) and make use of the Neural Engine that way.
  • Simply out of curiosity it would be interesting to run the ML experiments again but on Colab Pro with a TPU connected, and see just how fast we could go.
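On that last Core ML idea: Apple’s coremltools package can convert a trained TensorFlow or PyTorch model to Core ML from Python, and its compute_units setting allows the Core ML runtime to schedule eligible layers on the Neural Engine. A speculative sketch, untested on our workloads (whether the Neural Engine is actually used is decided by the Core ML runtime, not by us):

import coremltools as ct
import tensorflow as tf

# A small Keras model standing in for one of the experiment's models.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(28,)),
    tf.keras.layers.Dense(1),
])

# Convert to a Core ML ML Program, permitting any compute unit
# (CPU, GPU or Neural Engine).
mlmodel = ct.convert(model, convert_to="mlprogram", compute_units=ct.ComputeUnit.ALL)
mlmodel.save("model.mlpackage")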

Appendix

In this article we’ve explored how you can speed up training of ML models using the GPU in recent Mac devices via Apple’s Metal API. We’ve shown that this is possible with both TensorFlow and PyTorch. What we haven’t explored is how to actually enable GPU acceleration in either of these libraries! We’ll correct that here, detailing how to install each library with Metal integration enabled, and also how to utilise Metal acceleration in both TensorFlow and PyTorch code.

TensorFlow

There is a specific build of TensorFlow with Metal acceleration available from Apple, along with install instructions, which must be followed exactly. This is particularly true on the Intel Mac, where in testing I found that the build of TensorFlow with Metal acceleration was only found by pip when installed in a Python virtualenv — and not in a conda env. The build is only available for Python 3.8.

To check Metal acceleration is available to TensorFlow:

import tensorflow as tf

# List the devices TensorFlow can see; a Metal-enabled install reports a GPU.
print(tf.config.list_physical_devices())

If you get the following result (or similar, but with a GPU in the list of PhysicalDevices), Metal support has been properly configured for TensorFlow:

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Note that you do not need to explicitly target the GPU for your workloads — TensorFlow will use it automatically if it is present.
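That said, you can pin a computation to a specific device with tf.device, which is handy for comparisons. A small sketch (the device string assumes a GPU is visible, as in the output above):

import tensorflow as tf

# Explicitly place a computation on the GPU (normally unnecessary, as
# TensorFlow places work on a visible GPU automatically).
with tf.device("/GPU:0"):
    a = tf.random.normal((1_000, 1_000))
    b = tf.linalg.matmul(a, a)

print(b.device)  # e.g. '/job:localhost/replica:0/task:0/device:GPU:0'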

PyTorch

It’s a lot easier to install PyTorch with Metal acceleration enabled — you just need to install v1.12 or later. At time of writing this requires installing PyTorch from the preview channel, but once PyTorch v1.12 moves out of pre-release this will get simpler again. Details are available in the PyTorch install instructions.

Conversely, it’s a little more difficult to check Metal acceleration is available in PyTorch applications and then also use it to accelerate training. You can use the following to check if Metal acceleration is available:

import torch

mps_built = torch.backends.mps.is_built()
mps_avail = torch.backends.mps.is_available()
device = "mps" if mps_built and mps_avail else "cpu"

Note that PyTorch refers to Metal acceleration as MPS (Metal Performance Shaders). To utilise the MPS device to train a model you can then do something similar to the following, having defined model beforehand:

model = model.to(device)
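The same applies to the data: every tensor in a training step must be on the same device as the model. A generic training-loop sketch (dataloader, criterion and optimizer are assumed to be already defined; the names are illustrative rather than taken from the tutorial):

# Move each batch to the same device as the model before using it.
for inputs, labels in dataloader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()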


Peter Killick
Met Office Informatics Lab

Cloud Platform Architect, open-source software engineer and technology researcher in the UK Met Office Informatics Lab. I tend to blog on these themes.