Classify stuff using TensorFlow 2

Feiyu CHEN
4 min read · May 2, 2020

TensorFlow is Google’s machine learning (ML) framework.

TensorFlow 2 Symbol, looks like 7F…

TensorFlow 1 is well known for being unfriendly to users, especially ML beginners. But things have been different since TensorFlow 2 was released late last year. In TensorFlow 2, most of the APIs are simplified, and in my personal opinion it is the simplest and most powerful ML framework available right now. So it is high time for ML lovers to try TensorFlow 2 and see what it can do.

In this post I won’t go into the differences between TensorFlow 1 and 2; I’ll focus only on how to use TensorFlow 2. Specifically, I’ll walk you through classifying samples as cancer or not cancer, using a publicly available dataset.

So here it goes.

First of all, let’s install TensorFlow 2 for Python.

The installation is pretty straightforward: Python’s package installation tool (pip) gets the job done quickly. (This part used to be quite complicated in TensorFlow 2’s early days.)

After that, remember to confirm the installation by printing the version number. As of 2020-04-18, you should see something like ‘2.2.0-rc2’ printed.

Code snippet for installing tensorflow 2 for Python
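A minimal sketch of this step. The pip command runs in a shell, so it appears here only as a comment, and the exact version string you see depends on when and how you install.

```python
# Run this once in a shell first (not inside Python):
#   pip install tensorflow

import tensorflow as tf

# Print the installed version to confirm the installation worked.
print(tf.__version__)
```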

Secondly, let me introduce the dataset used in this post. It is officially called the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. (A lot of organizations make their data public so that anybody in the world can help analyze the data and learn something. This dataset is one of those.)

Instead of downloading the data from the link above, we can just use sklearn (scikit-learn) to load the dataset. An introduction to sklearn is out of scope for this post. If you don’t know it already, look it up. It will be well worth your while.

The Python code for loading the dataset is below.

Load breast cancer dataset from sklearn.
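A short sketch of the loading step, assuming sklearn's built-in `load_breast_cancer` loader. The variable names `data`, `x` and `y` are my own choice. Note the label order that sklearn uses: 0 is malignant and 1 is benign.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
x = data.data      # feature matrix: 569 samples, 30 features each
y = data.target    # labels: 0 = malignant, 1 = benign

print(x.shape, y.shape)   # (569, 30) (569,)
print(data.target_names)  # ['malignant' 'benign']
```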

Thirdly, let’s do some pre-processing on the dataset. By this I mean we want to split the dataset into train and test sets. Again, instead of doing this ourselves, we will use sklearn. The code for this is below.

Separate dataset into train and test datasets.

This code snippet splits the loaded dataset into train (x_train, y_train) and test (x_test, y_test) datasets, with train = 67% of the original dataset and test = 33%.
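The split described above can be sketched with sklearn's `train_test_split`, assuming `test_size=0.33` for the 67/33 ratio. With 569 rows this leaves 381 for training and 188 for testing.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

# test_size=0.33 holds out 33% of the rows for testing; rows are shuffled first.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

print(x_train.shape, x_test.shape)  # (381, 30) (188, 30)
```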

Note that having a validation dataset is usually preferable in ML problems (it is used for hyperparameter tuning, which is out of scope for this post). But we will ignore that and continue with the train/test datasets for the sake of simplicity.

It is a good habit to “scale” the train/test datasets before training, because the datasets can sometimes contain extremely large numbers as well as extremely small ones. Scaling converts those numbers into the same range, which usually helps the model train better and become more accurate. The code for this is below.

Adjust the scale of the train and test datasets.
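One common way to do the scaling is sklearn's `StandardScaler`, which shifts every feature to mean 0 and standard deviation 1. This is a sketch under that assumption, and other scalers would work too. The scaler is fitted on the train set only and then reused on the test set, so no information from the test data leaks into training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

scaler = StandardScaler()
# Learn the per-feature mean and standard deviation from the train set only...
x_train = scaler.fit_transform(x_train)
# ...and apply those same statistics to the test set.
x_test = scaler.transform(x_test)
```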

Finally, after all of that is done, we are ready to use TensorFlow 2. TensorFlow 2 includes the Keras APIs, which let us build models fairly easily. It also provides low-level APIs that allow us to build custom models (this is quite difficult). Here we will follow the easy path and use Keras to build our classification model.

The Keras API we will be using is called “Sequential”. We will add 2 layers to the model: an input layer (this layer declares the shape of the input data) and a dense layer.

Define a model for classification.
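Here is a minimal sketch of such a model. The single sigmoid output unit, the “adam” optimizer and the “binary_crossentropy” loss are my assumptions for a two-class problem, and `n_features` is a name I introduce for the 30 columns of this dataset.

```python
import tensorflow as tf

n_features = 30  # number of feature columns in the breast cancer dataset

# Step 1: model declaration. The input shape must match the number of features.
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Step 2: compile with an optimizer, a loss to minimize and metrics to report.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.summary()
```

With one dense unit on 30 inputs, the model has 31 trainable parameters (30 weights plus 1 bias), which makes it equivalent to logistic regression.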

As you can see from the code snippet, there are always 2 steps in defining a TensorFlow model:

Model declaration > Compile.

In the model declaration step, we add layers to the model using Keras Layers. Note that the input shape of the model needs to match the number of features in your training dataset.

After that, we compile the model by specifying the “optimizer”, the “loss” function to minimize, and the “metrics” to report during training.

With all that work done, we can finally start training the model using our train dataset. In the code I specify the model to run 100 epochs (one epoch is one pass through the whole train dataset, so the model sees the data 100 times). Since our train dataset is fairly small, 100 epochs is sufficient and won’t take long.

After training is finished, we can print the evaluation scores for both the train and test datasets. On my PC, the numbers look like the following. The 1st number in each row is the loss and the 2nd number is the “accuracy” score. Both accuracy scores are very high (>95%), which means our model is working very well.

Train Score: [0.09029368311166763, 0.9842519760131836]

Test Score: [0.12974153459072113, 0.957446813583374]
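The training and evaluation steps can be sketched end to end as below. This is a sketch under the same assumptions as before (StandardScaler, one sigmoid unit, “adam”, “binary_crossentropy”), and names such as `r`, `train_score` and `test_score` are my own.

```python
import tensorflow as tf
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data preparation: load, split 67/33, scale.
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Model declaration and compile.
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(x_train.shape[1],)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train for 100 passes over the train dataset; verbose=0 hides the per-epoch log.
r = model.fit(x_train, y_train, epochs=100, verbose=0)

# evaluate() returns [loss, accuracy] because "accuracy" is the only metric.
train_score = model.evaluate(x_train, y_train, verbose=0)
test_score = model.evaluate(x_test, y_test, verbose=0)
print("Train Score:", train_score)
print("Test Score:", test_score)
```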

Note that the scores can be different on your PC. The reason is that we split the original dataset into train and test randomly, so we may be training on different data, and the model weights are also initialized randomly.
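If you want the same split on every run, `train_test_split` accepts a `random_state` argument. A small sketch, where the value 42 is an arbitrary choice of mine:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

# The same random_state produces the same shuffle, hence the same split.
split_a = train_test_split(x, y, test_size=0.33, random_state=42)
split_b = train_test_split(x, y, test_size=0.33, random_state=42)

# Element 0 of each result is x_train; the two copies are identical.
print((split_a[0] == split_b[0]).all())  # True
```

This only pins the split. The initial weights of the model are random too, and `tf.random.set_seed` can fix TensorFlow's seed, although scores may still vary slightly between machines.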

To evaluate a classification model, the most commonly used metrics are “accuracy” and “loss”. The following code allows you to plot the accuracy and loss curves for training data.

Plot the model metrics.
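A sketch of the plotting step using matplotlib. The helper name `plot_history` is my own. It expects the dictionary that Keras stores in the `history` attribute of the object returned by `model.fit`, which holds one value per epoch under the keys “accuracy” and “loss” when the model is compiled with the “accuracy” metric.

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Plot training accuracy and loss per epoch from a Keras history dict."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

    ax_acc.plot(history["accuracy"], label="accuracy")
    ax_acc.set_xlabel("epoch")
    ax_acc.legend()

    ax_loss.plot(history["loss"], label="loss")
    ax_loss.set_xlabel("epoch")
    ax_loss.legend()

    return fig
```

If `r` is the object returned by `model.fit`, call `plot_history(r.history)` and then `plt.show()` to display the two curves.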

And the execution result of the code should look like this.

Accuracy vs. epoch. (Accuracy is expected to increase as the number of epochs increases.)
Loss vs. epoch. (Loss is expected to decrease as the number of epochs increases.)

Now that the model is trained, we can finally use it to make some predictions. Since we don’t have any fresh data to predict on, we’ll re-use x_test as input.

The code for prediction is quite simple.

Predict with the trained model.

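If you print p, it will be something like [[9.86109257e-01], [9.58251834e-01], …]. Each number is the predicted probability that the sample belongs to class 1. Be careful with the interpretation: in sklearn’s copy of this dataset, label 1 means benign and label 0 means malignant, so if the labels are used unchanged, these numbers are the probability of “not cancer” rather than the “probability of having cancer”.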
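To turn such probabilities into class labels, you can apply a cut-off. A small sketch using the two values quoted above as a stand-in for the output of `model.predict(x_test)`; the 0.5 threshold is my assumption, and the class order follows sklearn's `target_names`.

```python
import numpy as np

# The two values quoted above, standing in for the output of model.predict(x_test).
p = np.array([[9.86109257e-01], [9.58251834e-01]])

# sklearn's label order for this dataset: 0 = malignant, 1 = benign.
class_names = np.array(["malignant", "benign"])

# Probabilities above the assumed 0.5 cut-off become class 1, the rest class 0.
labels = (p > 0.5).astype(int).ravel()

print(labels)               # [1 1]
print(class_names[labels])  # ['benign' 'benign']
```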

Summary: This post shows how to solve simple classification problems using TensorFlow 2.
