GPU Training on Apple M1

Razin Tailor
Jul 25, 2021

Since their release in November 2020, the first Macs with the Arm-based M1 chip have been a topic of discussion in the developer community. The M1 in the MacBook Pro packs an 8-core CPU, an 8-core GPU, and a 16-core Neural Engine, among other things, and both the processor and the GPU are a significant step up from the previous-generation Intel configurations.

So far, it has held up well against anything Intel has offered. However, the deep learning crowd has been struggling with native Arm support, since most libraries and frameworks target CUDA and the x86 architecture.

A few days ago, I noticed that https://github.com/apple/tensorflow_macos was archived, and its README now states that TensorFlow v2.5 supports the M1 natively.

To learn more about it and for the installation guide, see Apple's tensorflow-metal documentation.
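Once TensorFlow 2.5 is installed, a quick sanity check (not part of the original script) is to confirm that the M1 GPU is actually visible to TensorFlow; with the tensorflow-metal plugin set up, list_physical_devices should report at least one GPU device:

import tensorflow as tf

# Print the TensorFlow version and the devices it can see.
# On an M1 with tensorflow-metal installed, a GPU entry should appear here.
print("TensorFlow version:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))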

To test the performance, I wrote a simple feed-forward network in TensorFlow and trained it on MNIST. The code is structured as follows:

1. Library Imports
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import mnist
from tensorflow.keras import backend as K
import matplotlib.pyplot as plt
import numpy as np
import argparse
import time

2. Getting the MNIST dataset. Note that the script downloads the dataset the first time you run it. Each MNIST sample is a 28x28x1 image; for a feed-forward architecture, I need to “flatten” each image into a 784-element vector (28 * 28). I also normalize the pixel values to the range [0, 1] and, finally, convert the labels into one-hot vectors using LabelBinarizer.

train, test = mnist.load_data()
Xtrain, Ytrain = train
Xtest, Ytest = test
Xtrain = Xtrain.reshape((Xtrain.shape[0], 28 * 28 * 1))
Xtest = Xtest.reshape((Xtest.shape[0], 28 * 28 * 1))

Xtrain = Xtrain.astype("float32") / 255.0
Xtest = Xtest.astype("float32") / 255.0
lb = LabelBinarizer()
Ytrain = lb.fit_transform(Ytrain)
Ytest = lb.transform(Ytest)

3. Model Architecture

model = Sequential()
model.add(Dense(256, input_shape=(784,), activation="sigmoid"))
model.add(Dense(128, activation="sigmoid"))
model.add(Dense(10, activation="softmax"))

4. Using the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01, a batch size of 128, and the categorical cross-entropy loss, I train the model for 100 epochs. I wrap the training call in a simple time.time() measurement, then compute the total training time and the average time per epoch.

model.compile(
    loss="categorical_crossentropy",
    optimizer=SGD(learning_rate=0.01),
    metrics=["accuracy"],
)
print("Begin Training")
start_time = time.time()
H = model.fit(
    Xtrain, Ytrain, validation_data=(Xtest, Ytest), epochs=100, batch_size=128
)
end_time = time.time()
print("Total Time Taken: {}".format(end_time - start_time))
print("Average Time Per Epoch: {}".format((end_time - start_time) / 100))

5. Then, I predict on the test dataset and generate the classification report

predictions = model.predict(Xtest, batch_size=128)
print(
    classification_report(
        Ytest.argmax(axis=1),
        predictions.argmax(axis=1),
        target_names=[str(x) for x in lb.classes_],
    )
)

6. I also plot and save the loss and accuracy graph

plt.figure(figsize=(16,9))
plt.plot(np.arange(0, 100), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, 100), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, 100), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, 100), H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.savefig("output.png")

The entire script is available on GitHub.

I ran this experiment on my Apple M1 MacBook Pro as well as on Google Colab, where the runtime I tested had an NVIDIA Tesla T4 GPU. The performance comparison is summarized below.

As you can see, the M1 takes roughly 3 seconds per epoch while the Tesla T4 on Google Colab takes roughly 2 seconds, which I consider pretty much comparable. At least now I can leverage the GPU to train models natively on the Apple M1 using tensorflow-metal.
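If you want to be explicit about device placement rather than relying on TensorFlow's defaults, you can wrap the training call in a tf.device context. This is just a sketch on top of the script above, assuming the Metal GPU is exposed as "/GPU:0":

import tensorflow as tf

# Force the training step onto the GPU (TensorFlow usually does this
# automatically whenever a GPU device is available).
with tf.device("/GPU:0"):
    H = model.fit(
        Xtrain, Ytrain, validation_data=(Xtest, Ytest), epochs=100, batch_size=128
    )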

Let me know what you think.
