Playing with machine learning: An introduction using Keras + TensorFlow.
I took some artificial intelligence (AI) courses while I was studying, but I never did anything outside the classroom. Machine learning (ML), and AI in general, is one of the most talked-about subjects in the tech scene right now.
And for a good reason. We are in the “democratizing” era of AI. Tools are being developed and popularized that are meant to be used by everyone, not just a small academic community. If you want an introduction to the inner workings of Machine Learning and an intuition for how things work, I cannot recommend the Machine Learning course by Andrew Ng enough. I will try to explain some basic theory here, but I am just exploring/learning, so follow with caution :)
A bit of theory
Artificial Intelligence (AI) is a “superset” of Machine Learning (ML), and a Neural Network (NN) is a computational model used in ML. NNs are most commonly trained with Supervised Learning, which means that a dataset with multiple examples and their “right answers” is needed for the model to “learn”. To get to the NN, we have to define a few more things along the way.
A regression problem means we have to predict a real-valued output. For example, let’s use the (typical) house pricing example where we only have 1 characteristic (feature), the area of the house, and we want to predict the house price. Given a set of area(X)/house-price(Y) pairs (dataset) we can plot the following graph (where each orange dot is an area/house-price pair):
From basic linear algebra we know that the equation of a straight line has the form Y = θ0 + θ1*X. This is called the hypothesis, and what we are looking for are the values of θ0 and θ1 (called the parameters). If we find a good set of θ0, θ1 then we will be able to predict the price of any house given its area. To do that, we first need to define the error/cost, which is the difference between the prediction of our hypothesis and the actual value:
There are many ways to calculate the cost, one of the most popular being the mean squared error. Then all we need to do is start from an initial (usually random) assignment of θ0, θ1 and use an iterative optimization algorithm (like Gradient Descent) that repeatedly goes over our dataset and adjusts the parameters to minimize the cost of our hypothesis.
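To make the idea concrete, here is a minimal sketch of Gradient Descent on the mean squared error for the one-feature hypothesis Y = θ0 + θ1*X. The dataset is made up for illustration, the function names, learning rate and iteration count are my own choices, and the parameters start at zero instead of random values so that the result is reproducible.

```python
# Toy dataset, made up for illustration: prices follow price = 1 + 2 * area exactly.
areas = [0.5, 1.0, 1.5, 2.0, 2.5]
prices = [2.0, 3.0, 4.0, 5.0, 6.0]


def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis: Y = theta0 + theta1 * X."""
    return theta0 + theta1 * x


def mean_squared_error(theta0, theta1, xs, ys):
    """Average of the squared differences between predictions and actual values."""
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.1, iterations=5000):
    """Batch gradient descent on the mean squared error, starting from (0, 0)."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction error for every example with the current parameters.
        errors = [hypothesis(theta0, theta1, x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. theta0 and theta1.
        grad0 = 2.0 / m * sum(errors)
        grad1 = 2.0 / m * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient to reduce the cost.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


theta0, theta1 = gradient_descent(areas, prices)
print(theta0, theta1)
print(mean_squared_error(theta0, theta1, areas, prices))
```

Because the toy prices lie exactly on a line, the parameters should end up very close to θ0 = 1 and θ1 = 2, with a cost close to zero. Real data is noisy, so the cost will not reach zero there.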
But what if we do not want a real-valued output and we would like a probability instead? We can use Logistic Regression, where the hypothesis outputs a value between 0 and 1. To do that we can use the Sigmoid function to map any real-valued output of our linear hypothesis to the (0, 1) interval:
We can then adapt the cost function and use the same optimization algorithms to find a hypothesis that minimizes the cost.
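The Sigmoid mapping described above can be sketched in a few lines of plain Python. The function names are mine, and the parameters passed to the hypothesis are arbitrary placeholders.

```python
import math


def sigmoid(z):
    """Map a real number z to the open interval (0, 1).

    Fine for moderate inputs; a very large negative z would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-z))


def logistic_hypothesis(theta0, theta1, x):
    """Linear hypothesis passed through the sigmoid, so the output reads as a probability."""
    return sigmoid(theta0 + theta1 * x)


for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

Negative inputs land below 0.5, zero maps to exactly 0.5, and positive inputs land above 0.5, which is what lets us read the output as a probability.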
But most real-world problems are not linearly separable. It turns out that connecting small units that each perform logistic regression is one of the most computationally efficient ways to compute non-linear hypotheses:
Every node in the graph, except those in the input layer, represents a simple logistic regression whose inputs arrive on its incoming edges and whose output is sent along its outgoing edges. The rightmost node, the output layer, will give us a final probability given the input characteristics (features).
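As a rough sketch of how data flows through such a graph, the snippet below wires up two input features, one hidden layer of three units and a single output unit. All weights and biases are hypothetical values I picked by hand only to show the mechanics; a real network learns them during training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unit(inputs, weights, bias):
    """One node: a logistic regression over the values on its incoming edges."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))


def layer(inputs, weight_rows, biases):
    """A layer is a list of units that all read the same inputs."""
    return [unit(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical parameters, chosen by hand for illustration only.
hidden_weights = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.5, 0.2]]
output_biases = [0.05]

features = [1.0, 2.0]  # input layer: two features
hidden = layer(features, hidden_weights, hidden_biases)
probability = layer(hidden, output_weights, output_biases)[0]
print(hidden, probability)
```

Each hidden unit squashes its own weighted sum into (0, 1), and the output unit does the same with the hidden values, so the final number can again be read as a probability.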
Common problems and validation set
The two most common problems are underfitting and overfitting (the latter being the more common). Underfitting occurs when your hypothesis does not fit the data well enough. Overfitting occurs when the hypothesis fits the training dataset too closely and does not generalize well to new, unseen data. To evaluate our model, we split our dataset into a training set and a validation set. Then we use the training set to fit our model and test it with the unseen validation set.
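A minimal sketch of such a split in plain Python might look like the following. The helper name, the 30% fraction and the seed are assumptions of mine, and the ten (area, price) pairs are made up to stand in for a real dataset.

```python
import random


def train_validation_split(rows, validation_fraction=0.3, seed=7):
    """Shuffle a copy of rows and hold back a fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    # The first n_validation shuffled rows are held back; the rest are used for training.
    return shuffled[n_validation:], shuffled[:n_validation]


# Ten made-up (area, price) pairs standing in for a real dataset.
dataset = [(area, 1 + 2 * area) for area in range(10)]
training_set, validation_set = train_validation_split(dataset)
print(len(training_set), len(validation_set))
```

The model only ever sees the training rows while fitting, so its score on the validation rows tells us how well it generalizes.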
TensorFlow + Keras
If you are following any tech news site, you’ve probably heard of TensorFlow. It’s Google’s machine learning framework, which was open-sourced in 2015 and met with huge success in the open-source community. Unfortunately, the framework, although powerful, is highly technical and difficult to use for a data analyst or machine learning “outsider”. This is where Keras comes in: a higher-level (= friendlier) layer on top of TensorFlow (and other ML frameworks) that allows you to build ML models, and more specifically Neural Networks (NN), faster and more easily.
I used the Anaconda Python distribution and found the setup of the development environment relatively easy. After installing Anaconda for your platform (you can also try Miniconda for a lighter version if you are short on space), cd to the bin directory of the distribution and run (Linux & macOS):
./pip install tensorflow
./pip install keras
./pip install pandas
That’s it. If everything works fine, you are all set. You can try the excellent PyCharm Community Edition if you are a fan of the JetBrains family of IDEs.
“But I don’t know Python!” I hear you say. Neither did I. I just learned the basics and I keep learning. For my introduction to the language, I used the interactive PyCharm Edu, a desktop app that bundles an IDE and a Python interpreter and whose only aim in life is to teach you Python. The basic tutorial consists of a set of unit tests, each with a small part left for you to complete so that the test runs successfully. I found Python in general to be a very pleasant language to use.
Let’s start with the actual script. I used the “Medical Appointments No Show” dataset from Kaggle for this experiment. It contains 300k medical appointments with characteristics (features) of the patients and whether each patient showed up for the appointment or not. We will try to create a Neural Network that, given the characteristics of a patient, will predict the probability of that patient showing up.
Preprocess the dataset
The first step is to prepare our data before training our model. We read our data and separate it into X and Y; X holds the patient’s characteristics (features) and Y is whether the patient shows up or not. Use the following to get a glimpse of the structure of our dataset:
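import pandas as pds
# X: the feature columns; Y: the Status column (assumed from the later use of dataframeY.Status)
dataframeX = pds.read_csv('No-show-Issue-Comma-300k.csv', usecols=[0, 1, 4, 6, 7, 8, 9, 10, 11, 12, 13])
dataframeY = pds.read_csv('No-show-Issue-Comma-300k.csv', usecols=['Status'])
print(dataframeX.head())
print(dataframeY.head())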
As you can see, some of the features are not represented as numeric values. We can change this using the powerful Pandas library:
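# Map each categorical value to an integer (the specific integer codes are an assumption)
def weekdayToInt(weekday):
    return ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'].index(weekday)
def genderToInt(gender):
    return 0 if gender == 'M' else 1
def statusToInt(status):
    return 1 if status == 'Show-Up' else 0
dataframeX.DayOfTheWeek = dataframeX.DayOfTheWeek.apply(weekdayToInt)
dataframeX.Gender = dataframeX.Gender.apply(genderToInt)
dataframeY.Status = dataframeY.Status.apply(statusToInt)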
Now we have replaced the values for each feature with a unique numeric value. We do the same for the target dataframeY.
The Neural Network
Time to create our Neural Network. Creating and training a NN in Keras is really easy. You describe the model as sequential layers that data flow through. Then you feed it the dataset we prepared in the earlier section and train the model:
import numpy as np
#1: fix the random seed so that runs are comparable
seed = 7
np.random.seed(seed)
#2: describe the model as a stack of fully connected layers
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(12, input_shape=(11,), kernel_initializer='uniform', activation='sigmoid'))
model.add(Dense(12, kernel_initializer='uniform', activation='sigmoid'))
model.add(Dense(12, kernel_initializer='uniform', activation='sigmoid'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
#3: log the training so that TensorBoard can visualize it
import keras
tbCallBack = keras.callbacks.TensorBoard(log_dir='/tmp/keras_logs', write_graph=True)
#4: compile and train the model
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.fit(dataframeX.values, dataframeY.values, epochs=9, batch_size=50, verbose=1, validation_split=0.3, callbacks=[tbCallBack])
Let’s take it step by step and I will do my best to explain concisely what is happening.
#1: We use a constant seed for our random number generator so that it creates the same pseudo-random numbers each time. This comes in handy when we want to try different models and compare their performance.
#2: Define the NN as a stack of fully connected (Dense) layers: three layers with 12 nodes each, followed by a single-node output layer, all using the sigmoid function.
#3: Create the callback that writes the logs the TensorBoard tool needs to visualize the model training.
#4: Train our model using the defined optimizer and loss function. epochs is the number of times (iterations) the whole dataset will go through the network, batch_size is the number of examples processed before each update of the weights, and validation_split is the fraction of the dataset to hold back just to validate the performance of the model.
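To get a feel for what the values passed to model.fit() in #4 mean in practice, here is a small back-of-the-envelope calculation. It assumes the dataset has exactly 300,000 rows (the post only says “300k”), and the variable names are mine.

```python
import math

# Assumption: exactly 300,000 rows; the dataset is only described as "300k".
total_rows = 300_000
validation_split = 0.3
batch_size = 50
epochs = 9

# Rows held back for validation and rows actually used for training.
validation_rows = round(total_rows * validation_split)
training_rows = total_rows - validation_rows

# One weight update per batch; every epoch goes through all training rows once.
batches_per_epoch = math.ceil(training_rows / batch_size)
total_updates = batches_per_epoch * epochs

print(training_rows, validation_rows, batches_per_epoch, total_updates)
```

Under that assumption the network trains on 210,000 rows, validates on 90,000, and performs 4,200 weight updates per epoch, or 37,800 in total.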
We can now make predictions with our trained model by calling the model.predict() function.
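Since the output layer uses the sigmoid function, each prediction is a number between 0 and 1. If you want a yes/no answer instead of a probability, one simple option is to apply a threshold. The sketch below uses made-up probabilities in place of real model output, and both the helper name and the 0.5 threshold are my own assumptions.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn probabilities into 0/1 labels: 1 when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up probabilities standing in for the values returned by model.predict().
predicted = [0.91, 0.35, 0.62, 0.08]
print(to_labels(predicted))
```

Which label means “shows up” depends on how the Status column was encoded during preprocessing.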
We can use the TensorBoard tool to visualize the training of our model by running the following inside the Anaconda bin folder (we defined the path in #3 of the previous section):
./tensorboard --logdir='/tmp/keras_logs'
Machine Learning and Neural Networks are powerful, and with TensorFlow + Keras they can be fun as well. I have just scratched the surface of the subject in this post, but I hope that I gave you a glimpse of what you can do with a few lines of code. I will keep learning about the subject and maybe come back with something more advanced (and useful)!