Getting Started with Keras

Himang Sharatun
Published in SkyshiDigital · Feb 21, 2018 · 12 min read

Machine learning (ML) has progressed from a nerd-only topic into widespread commercial practice. As the core of artificial intelligence (AI), ML combines computer science and statistics to answer two questions: how can we construct computer systems that automatically improve through experience, and what are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations? (Jordan and Mitchell, 2015). Yes, it sounds complicated and hard to implement. Fortunately, thanks to the growing demand for ML, many libraries have been created to lower the entry barrier for newcomers, such as Keras, TFLearn and PyTorch. In this article I would like to present a tutorial on how to implement ML using Keras.

What should I learn?

For someone new to ML, your main goal is to understand ML principles as soon as possible, not to learn how to code. In the learning process you should not focus on “How should the code look?” but pay more attention to questions such as “What happens if I increase the number of hidden layers?”, “What differentiates a good dataset from a bad one?”, “Why do we need to separate our data into training data and test data?”, “Given my dataset’s characteristics, which ML technique is more suitable?” or any other question about how ML actually works. This is important because ML is not about the code, but about how you choose the ML technique and troubleshoot the model when it behaves poorly in real-world use.

In my opinion, ML is more like an art than programming, and like any other art, the only way to learn and get better is to focus on the art itself, not on the tools used to create it. Pablo Picasso was able to make art out of anything: sculpture, painting, and I believe he could even have turned garbage into art. He could do it not because he kept worrying about which tools and media to use, but because he understood art, so he could express any emotion beautifully in any medium. In this article, I want you to be Picasso for a moment: don’t worry about the code and focus on how to express your problem as something ML can solve. Therefore, for someone new, I recommend Keras, since its syntax is simple and you can easily play around with your ML structure, not to mention the great documentation and community support Keras has compared to other high-level ML libraries such as TFLearn and PyTorch.

Before that, an ML dictionary….

The biggest pain point in understanding ML principles is that almost every article or video about ML on the internet explains the concepts using terms we don’t understand, such as features, convergence and model. To help you understand this article and any other learning resources you might find in the future, I will explain some ML terms in the simplest English possible.

  1. Training: the process of teaching the program to produce a certain output for a certain input by using a set of known input-output pairs. You can imagine it like a human who learns to play chess from experience so he can defeat his next opponent based on that experience. For a machine, this learning process is usually called training, and the experience is the dataset we have.
  2. Supervised training: training that uses data with paired inputs and outputs. The goal is to map every known input to its known output. Examples: classification problems and regression (prediction) problems. Most ML uses this approach since it is easy to implement.
  3. Unsupervised training: training that uses data consisting only of inputs, without any known output (label). The goal of this method is to understand or model the underlying structure of our data. Example: clustering.
  4. Model: to put it simply, the result of the training process. You can imagine it as a black box that gives a certain output when you feed it an input; the training process is just a method for tuning this black box so it produces the desired output.
  5. Convergence: the condition in the training process where the model produces the correct output for every input in the dataset. In our chess analogy, it is the condition where a player can defeat anyone in a training session. High convergence doesn’t mean the model can predict any input accurately; just as a chess player doesn’t always win against someone he has never played, even though he always beats his training partner, convergence only means the model produces correct outputs for the known inputs inside the dataset, not for unknown inputs outside it.
  6. Test: the process of evaluating how good the training result is by using data outside the dataset used in training. The purpose of training to play chess is not just to beat the same opponents you faced in training; the main goal is to defeat anyone, even people you have never played before. Therefore, to test your training effort you need to face someone other than your training partner. That’s why in ML we split the data into training data and test data: we need to know how good our model is when it faces input outside the known data.
  7. Overfitting: just like someone who defeats everyone in training sessions but never wins a competition, an ML model can perform well during training but poorly on test data and in real implementations. This happens because the test data is new data the machine has never seen before. In fancier terms, overfitting is the condition in which the machine fails to generalize and only manages to create a model specialized to the training data.
  8. Features: the representation of our input in the form of numbers, arrays or anything else a computer can process. The main goal of a feature representation is to make the data unique and computable by a machine. For example, if our data is a picture, we need to convert it into an array of colors; in this case the features are how red, how green and how blue the picture is (RGB). Another example: if our input is text, since a computer can’t do calculations on text, we need a method to represent it, such as bag-of-words (BOW) or word2vec. With BOW the features are the individual words in the vocabulary, while with word2vec the features are the columns of the embedding matrix.

What will we do?

Now that we understand some ML terms, it’s time to put them into practice using Keras. To build a better understanding, we will use Keras to solve a text classification problem. I believe text classification gives good insight since it involves techniques that are commonly used to solve other problems as well. Using this dataset, which is a modification of its original version, we will classify the reasons people complain when they travel with US airlines. There will be 4 classes: Flight Attendant Complaints, Late Flight, Customer Service Issue and Lost Luggage. To follow this article you need to install several Python libraries:

  1. tensorflow==1.4.1
  2. Keras==2.1.3
  3. pandas==0.22.0
  4. numpy==1.14.0
  5. scikit_learn==0.19.1

The ML technique we will implement in this article is a neural network with one hidden layer consisting of 5,000 neurons. The structure of our neural network looks like this: an input layer the size of the BOW vector, a single hidden layer of 5,000 neurons, and an output layer with one neuron per class.

I chose the number of neurons arbitrarily, so if you think there are too many, you can change it to any number that suits your preference. The main thing to understand when deciding the number of neurons is that the bigger the number, the better your model will fit your training data, but it also consumes more resources and is more prone to overfitting. For now we will pick the number of neurons arbitrarily, but if you have sufficient knowledge and want a challenge, you can look into evolutionary algorithms, which can be used to automatically determine a suitable number of neurons.

Text to Feature using BOW

As I explained earlier in the ML dictionary, if our data consists of text we need a method to convert it into a form the computer can process, preferably an array of numbers. Therefore, we will use bag-of-words (BOW) to extract features from our input text.

Basically, BOW creates a list of all the words that appear in our data (the vocabulary), and for each sentence the value for a word is 1 if the sentence contains that word and 0 otherwise. Because I’m bad at explaining with words, it’s better for you to just look at the illustration below :P.
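For instance, here is a tiny made-up illustration (the two sentences are my own toy examples, not taken from the dataset) run through scikit-learn’s CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["my flight is late", "my luggage is lost"]
vectorizer = CountVectorizer(binary=True)  # binary=True gives 1/0 instead of word counts
bow = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names())  # ['flight', 'is', 'late', 'lost', 'luggage', 'my']
print(bow[0])  # "my flight is late"  -> [1 1 1 0 0 1]
print(bow[1])  # "my luggage is lost" -> [0 1 0 1 1 1]
```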

To create the vocabulary and convert sentences into BOW arrays we will use CountVectorizer from scikit-learn. The corpus consists of all the text inputs we will use as training data. Here is the code to convert the corpus (csv) into a vocabulary (json) that can later be used for BOW with CountVectorizer.
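A sketch of what that script might look like (the file names and the “text” column name are my assumptions about the CSV layout, so adjust them to match your copy of the dataset):

```python
import json

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# read every training sentence from the csv
corpus = pd.read_csv('training_data.csv')['text'].astype(str).tolist()

# let CountVectorizer build the vocabulary for us
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps each unique word to its column index in the BOW vector
vocabulary = {word: int(index) for word, index in vectorizer.vocabulary_.items()}

with open('vocabulary.json', 'w') as f:
    json.dump(vocabulary, f)

print('Vocabulary size:', len(vocabulary))
```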

You might notice that our corpus does not consist only of letters; there are also numbers and special characters. If you build your vocabulary with your own code, you need to clean out these numbers and special characters, since we don’t need them and they increase the dimension of the BOW representation. Remember that each unique word adds a column to the BOW matrix, so if your corpus contains thousands of unique words, I suggest using another word representation such as word2vec or GloVe, since they produce a smaller 2D matrix compared to BOW, which produces a 1D vector with thousands of columns.

After we create the vocabulary, it’s time to try converting a sentence into BOW. The code below is what we are going to use to convert a sentence into its BOW vector.
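A sketch of that converter, saved as tobow.py so it can be imported later (the tokenizer rule mirrors CountVectorizer’s default, and the vocabulary.json file name comes from the previous step):

```python
# tobow.py
import json
import re

import numpy as np

# load the vocabulary produced by the previous script
with open('vocabulary.json') as f:
    VOCABULARY = json.load(f)


def to_bow(sentence):
    """Convert a sentence into a 1/0 bag-of-words vector based on VOCABULARY."""
    vector = np.zeros(len(VOCABULARY))
    # same tokenizer rule as CountVectorizer: lowercase words of 2+ characters
    for word in re.findall(r'\b\w\w+\b', sentence.lower()):
        index = VOCABULARY.get(word)
        if index is not None:
            vector[index] = 1
    return vector
```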

To use the code above, you can open Python in your terminal and import the function with from tobow import to_bow, as illustrated below (ignore the warning for now, since I’m not sure why it happens lol):
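Something along these lines (the example sentence is my own):

```python
>>> from tobow import to_bow
>>> vector = to_bow("My luggage is still lost and nobody answers my emails")
>>> vector.shape   # (vocabulary size,) -- one column per word in vocabulary.json
>>> vector.sum()   # how many vocabulary words appear in this sentence
```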

Create the Neural Network

Now it’s time to use Keras to create the neural network structure we designed earlier. To create the NN with Keras, we will use the Keras Sequential model and build the layers by adding Dense layers to it, as shown in the code below:
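A minimal sketch of that model, assuming softmax on the output layer, which is the usual choice for multi-class classification:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout


def create_model(input_dim, num_classes):
    """Build the network: BOW input -> 5000-neuron hidden layer -> dropout -> output."""
    model = Sequential()
    model.add(Dense(5000, input_shape=(input_dim,), activation='relu'))  # hidden layer, 5000 neurons
    model.add(Dropout(0.2))                                              # ignore 20% of neurons per update
    model.add(Dense(num_classes, activation='softmax'))                  # one output neuron per class
    return model
```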

As you can see, creating a layer in Keras takes just a single line of code. If we used TensorFlow or Theano directly, we might need to build it from scratch and end up with many lines of code just to create a single layer.

In the first Dense layer, which is the hidden layer, I pass 3 arguments: 5000 as the number of neurons in that layer, the input shape, which is the size of our BOW vector, and relu as the activation function. Remember that we only need to define the input shape for the first layer of the model, because for subsequent layers Keras automatically infers the input shape from the previous layer’s output.

At this point you might be wondering what an activation function is and what relu means. To put it simply, the activation function shapes the output of a neuron. Say the output of our neuron is expected to contain no negative values, but the neuron produces -0.87. In this scenario we need a function to deal with that negative number. If we use relu as the activation function, we apply f(x) = max(0, x), which in plain English means any negative number is converted to 0, so in our example the final output of the neuron is 0, not -0.87. Besides relu there are other popular activation functions you can use in Keras, such as softmax, tanh and linear. For the full list of activation functions you can check the Keras documentation, while an explanation of how activation functions work can be found in this article.
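In code, relu is just a one-liner, for example with numpy:

```python
import numpy as np


def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0, x)


print(relu(-0.87))  # 0.0 -- negative outputs are clipped to zero
print(relu(1.50))   # 1.5 -- positive outputs pass through unchanged
```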

For the second layer, I add a Dropout layer to prevent overfitting. Theoretically, dropout prevents overfitting by randomly ignoring neuron(s) in each iteration, meaning that the ignored neurons’ weights are not updated and the neurons are temporarily removed from the training process. The intuition behind dropout is that by temporarily removing a neuron from training, the weight adjustments of the other neurons change differently than they would if the neuron were included. So if our model would otherwise overfit by the end of training, this can be prevented because the weights are adjusted in different ways than the overfit weights. In our model, I pass 0.2 as the argument, representing the probability that a neuron is ignored during training. Generally this probability should be small, around 20%–50%, so if you want to play around with the Dropout layer, I suggest you don’t set the probability below 20% or above 50%.

For the last layer, which is the output layer, we need to make sure that the number of neurons equals the number of classes we have. Why? In this example we use one-hot encoding to represent the classes, so the number of outputs of our neural network should equal the size of the one-hot-encoded vector, which is determined by the number of classes.

After we define the layers of our model, in Keras we need to compile the model, defining the loss function, optimizer and metrics. The loss function measures the performance of our model during training and thereby tracks the learning progress; a smaller loss means our model is closer to converging on the training data. The blunt way to explain the loss function is that it is just a formula measuring the convergence error of our model during training. An optimizer in machine learning refers to the method, rule or formula that governs how our model minimizes the loss during training. Because the optimizer’s job is to govern how the weights in the neural network are updated, some people also call it the update rule. For a classification problem, we can use categorical_crossentropy as the loss function in Keras and adam as the optimizer.
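Putting that together with the create_model sketch above (the 4 comes from our 4 complaint classes, and VOCABULARY is the dictionary loaded in tobow.py):

```python
from tobow import VOCABULARY  # word -> column index, loaded from vocabulary.json

model = create_model(input_dim=len(VOCABULARY), num_classes=4)
model.compile(loss='categorical_crossentropy',  # measures the convergence error during training
              optimizer='adam',                 # the update rule for the weights
              metrics=['accuracy'])             # also report accuracy each epoch
```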

Train the Model

To train an ML model in Keras we just need to use model.fit(), but before that we need to prepare our data, since it is still in csv form, using the code below:
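A sketch of that preparation step, assuming the CSV has “text” and “label” columns (adjust the names to match your file):

```python
import numpy as np
import pandas as pd
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

from tobow import to_bow

data = pd.read_csv('training_data.csv')

# 1. convert every sentence into its BOW vector
x_train = np.array([to_bow(text) for text in data['text'].astype(str)])

# 2. convert the class names into one-hot-encoded vectors
encoder = LabelEncoder()
labels = encoder.fit_transform(data['label'])  # e.g. "Late Flight" -> 1
y_train = to_categorical(labels)               # e.g. 1 -> [0, 1, 0, 0]

print(x_train.shape, y_train.shape)
```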

The code above does 2 things: it converts the text data into BOW vectors and converts the labels into one-hot-encoded vectors. It’s worth noting that Keras model.fit only accepts data in the form of numpy.array, so we need to convert all of our data into numpy.array format.

Combining the code for creating the model with the data preparation, our final training code looks like this:
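A sketch of the combined training script (file names, column names, epochs and batch size are my choices, not prescriptions):

```python
# train.py
import pickle

import numpy as np
import pandas as pd
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

from tobow import to_bow

# prepare the data
data = pd.read_csv('training_data.csv')
x_train = np.array([to_bow(text) for text in data['text'].astype(str)])
encoder = LabelEncoder()
y_train = to_categorical(encoder.fit_transform(data['label']))

# build and compile the model
model = Sequential()
model.add(Dense(5000, input_shape=(x_train.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(y_train.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# train
model.fit(x_train, y_train, epochs=10, batch_size=32)

# save the model structure, the weights and the label encoder for later use
with open('model.json', 'w') as f:
    f.write(model.to_json())
model.save_weights('weights.h5')
with open('encoder.pickle', 'wb') as f:
    pickle.dump(encoder, f)
```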

Notice that at the end of the code I save the trained model, weights and encoder. We need this because if we want to classify new data in the future, we can simply load the trained model from file instead of training a model from scratch every single time we classify an input.

Evaluate the Model

To evaluate the model we could use Keras’ built-in functions, but I also want to show you how to load the model from file, so we are going to do the evaluation with our own code, as follows:
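A sketch of such an evaluation script, loading the files saved by the training script (test_data.csv and its column names are my assumptions):

```python
# evaluate.py
import pickle

import numpy as np
import pandas as pd
from keras.models import model_from_json

from tobow import to_bow

# load the trained model, its weights and the label encoder
with open('model.json') as f:
    model = model_from_json(f.read())
model.load_weights('weights.h5')
with open('encoder.pickle', 'rb') as f:
    encoder = pickle.load(f)

# classify the test data and compare with the true labels
test = pd.read_csv('test_data.csv')
x_test = np.array([to_bow(text) for text in test['text'].astype(str)])
predictions = encoder.inverse_transform(np.argmax(model.predict(x_test), axis=1))

for text, predicted, actual in zip(test['text'], predictions, test['label']):
    print('%-50s predicted: %-28s actual: %s' % (text[:50], predicted, actual))

accuracy = np.mean(predictions == test['label'].values)
print('Accuracy: %.2f%%' % (accuracy * 100))
```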

The result of our classification can be visualized in this table below:

Of the 16 input examples we use, the classifier manages to predict 11 correctly (68.75% accuracy). If we observe the table above, we can see that inputs belonging to “Lost Luggage” and “Late Flight” are predicted correctly, but inputs belonging to “Flight Attendant Complaints” and “Customer Service Issue” are not always predicted correctly. There are many possible reasons for this behavior, but my guess is that there is ambiguity in how the training data is classified, or that the training texts labeled “Flight Attendant Complaints” and “Customer Service Issue” contain no identifying words that separate them from the other classes.

What’s Next?

Well, to be honest, following this article doesn’t mean you will suddenly understand ML. But the program we created throughout this article can serve as a foundation for you to explore ML behavior. There are several things you can change in our structure to observe the behavior of a neural network built with Keras, for example increasing the number of hidden layers, reducing the neurons in the first layer, or changing the activation function. After you change the structure of the neural network, observe the training time and accuracy to improve your understanding of how each parameter affects ML performance.

If you encounter any difficulty following this article or simply want to discuss ML implementation with Keras further, you can contact me at himang@skyshi.io or, preferably, come to my office at PT Skyshi Digital Indonesia. Thanks for spending your 12 minutes reading this article.

PS: you can find the code from this article on my GitHub.
