Airplane Image Classification using a Keras CNN

Kyle O'Brien
19 min read · Feb 7, 2018

Hi everyone! I have wanted to start writing Medium articles about machine learning (ML) for a while now, so this represents my first attempt to do just that!

This article details building an ML pipeline to classify the presence of planes in satellite images using a Convolutional Neural Network (CNN). The topics that will be covered are as follows:

  1. Prerequisites
  2. Data Acquisition
  3. Data Preprocessing
  4. Data Visualization
  5. Model Creation
  6. Model Training
  7. Model Evaluation
  8. Conclusion

This tutorial is meant to be somewhat beginner friendly, but we will see if I succeed in accomplishing that! Therefore, it is intentionally long and detailed. If you just want the code, follow this GitHub link.


Prerequisites

There are 3 major prerequisites for this tutorial:

  1. Familiarity with the Python programming language
  2. A Python environment equipped with numpy, scikit-learn, Keras, and TensorFlow (with TensorBoard). The code below also imports scipy and matplotlib. These packages are by far the most annoying prerequisite to meet (especially if you plan on using a GPU), and many machine learners have died on this hill. The best way to install them depends on your OS and changes fairly rapidly, so I am going to save myself some trouble and tell you to do your best to install these packages on your own. If you don’t have a GPU or can’t get it to work, you can also try using “pip install (package name)” to obtain all of these packages. The model won’t train as quickly, but it should get the job done.
  3. A Kaggle account. Kaggle is a great resource if you are interested in ML, and it is unlikely you will regret opening an account there.

Data Acquisition

For the most part, every ML algorithm hinges upon access to data. This is because the data defines the task and effectively writes your code for you. In my opinion, this is something that makes ML very different from other algorithmic approaches.

The dataset we will be using is from Kaggle. It was uploaded by Kaggle user rhammel and contains 32,000 labeled satellite images, some with airplanes and some without. By “labeled” I mean that a human being has combed through the images, identified the ones that contain airplanes, and then neatly organized them in an easy-to-understand format. This is typically a laborious effort, so we are lucky that this has been done for us! If you don’t fully understand what I meant by “labeled”, don’t worry, the details will become clear in the Data Visualization section.

Perform the following steps to obtain and situate your dataset:

  1. Go to this link
  2. On the left-hand side, under “Data” select “”
  3. Select “Download”
  4. After this compressed file has finished downloading, extract it to any location on your computer
  5. Viewing the extracted “planesnet” folder, you should see a bunch of PNG image files with seemingly nonsensical names like “0__20140723_181317_0905__-122.14328662_37.697282118.png”

That’s it! You now have all the data you need for this tutorial.

Data Preprocessing

Data preprocessing refers to all the steps you must take to get your data ready for an ML algorithm. Effectively, we are trying to convert the image files we downloaded into an organized and structured format that allows our ML algorithm to interpret the data. So, without further ado, let’s get to the code.

Import statements first. These are the tools we will need for this section.

# Imports
import glob
import numpy as np
import os.path as path
from scipy import misc

Retrieve the filename of all images in the IMAGE_PATH

# IMAGE_PATH should be the path to the downloaded planesnet folder
file_paths = glob.glob(path.join(IMAGE_PATH, '*.png'))

Load the images into a single variable and convert to a numpy array

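# Load the images
# Note: misc.imread was removed in SciPy 1.2; SciPy's deprecation notice points to imageio.imread instead
images = [misc.imread(file_path) for file_path in file_paths]
images = np.asarray(images)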

Retrieve the size of the images

# Get image size
image_size = np.asarray([images.shape[1], images.shape[2], images.shape[3]])

This image_size array will let us know the dimensions of the images. If we print this variable, we learn that the dimensions of the images are [20, 20, 3]. This means that each image in the dataset has 20 rows, 20 columns, and a depth of 3 (or 3 channels, Red, Green, and Blue). These numbers define the spatial resolution of the image.

Next, we scale the images so that their values are between 0 and 1.

# Scale
images = images / 255

The images we are working with are 8-bit. This means that each pixel in the image is one of 256 (2⁸ = 256) possible values. While some machine learning algorithms can handle relatively large pixel values, most methods perform optimally (train within our lifetimes) when small, floating point values are processed. Hence we divide by 255 (0 is a possible pixel value as well, so 255 is actually the maximum value found within our data) to scale our data between 0 and 1. So, after this step, a value that was represented by an integer of 128 would instead be represented by a floating point value of 128/255 = ~0.502. It is important to note that the appearance of the images will be fundamentally unaltered after this step.
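
To make the scaling step concrete, here is a minimal sketch on a tiny array. The pixel values are made up for illustration and do not come from the dataset.

```python
import numpy as np

# A made-up 2x2 single-channel "image" with 8-bit pixel values
pixels = np.array([[0, 128], [200, 255]], dtype=np.uint8)

# Dividing by 255 converts to floating point and maps the range [0, 255] onto [0, 1]
scaled = pixels / 255

print(scaled.dtype)                   # float64
print(scaled.min(), scaled.max())     # 0.0 1.0
print(round(float(scaled[0, 1]), 3))  # 0.502
```

Because every pixel is divided by the same constant, the relative brightness of the pixels is preserved, which is why the images look the same after scaling.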

The next step is to retrieve the image labels. As referenced above, these labels were manually annotated by a human being. Each 20 by 20 image was reviewed, and given a “1” (True) if it contained an airplane and a “0” (False) if it did not contain an airplane. We can extract these labels by reading in the first character in the image filename. For example, in the image filename “0__20140723_181317_0905__-122.14328662_37.697282118.png”, the very first “0” is indicating to us that this image does not contain an airplane. We can extract these labels with the following code:

# Read the labels from the filenames
n_images = images.shape[0]
labels = np.zeros(n_images)
for i in range(n_images):
    # The first character of the filename is the label ("0" or "1")
    filename = path.basename(file_paths[i])
    labels[i] = int(filename[0])
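
If you want to convince yourself that the label parsing works before loading thousands of images, a small standalone check like the one below may help. The helper name label_from_filename is my own, and the second file name is invented; only the first one is the example from this article.

```python
import os.path as path

def label_from_filename(file_path):
    """Return the integer label encoded as the first character of the file name."""
    return int(path.basename(file_path)[0])

# The first name is the example from this article; the second is a made-up positive example
example_paths = [
    path.join("planesnet", "0__20140723_181317_0905__-122.14328662_37.697282118.png"),
    path.join("planesnet", "1__hypothetical_scene__0.0_0.0.png"),
]
example_labels = [label_from_filename(p) for p in example_paths]
print(example_labels)  # [0, 1]
```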

Great! Now we have a bunch of images and their corresponding labels. In the classic supervised machine learning lingo, we consider the images to be “X” and the labels to be “y”. The machine learning algorithm is going to find a function “f” such that for any given image, y = f(X). In other words, we are going to take the 1,200 (20 x 20 x 3 = 1,200) X values, run them through a function, f, and predict one singular value, y, that represents a classification of either “plane” or “not plane”.

The next step is to split our data into training and test sets. This practice is important, because it is the only way to evaluate your model in an unbiased way. Basically, you want your model to learn on the training set, (usually about 90% of all of the images you have available), and then report back its accuracy by evaluating it on the test set (the remaining 10%). Generally, a model that only performs well on data that it was trained on is not useful or interesting, and is described as “over-fit”. Over-fitting is the equivalent of the model memorizing the labels. It would be like a student who learns that 2 + 2 = 4, and then assumes that every addition problem they see is equal to 4. They have not actually learned how to perform addition, rather they have just memorized an output. We want our model to generalize a concept. It needs to be able to grab an image with an unknown label and accurately predict if it contains an airplane or not, regardless of whether or not the model has seen the image before. We can best simulate this scenario by creating a test set, as shown below.

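# Split into test and training sets
# TRAIN_TEST_SPLIT is the fraction of images used for training (for example, 0.9 for the 90/10 split described above)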

# Split at the given index
split_index = int(TRAIN_TEST_SPLIT * n_images)
shuffled_indices = np.random.permutation(n_images)
train_indices = shuffled_indices[0:split_index]
test_indices = shuffled_indices[split_index:]

# Split the images and the labels
x_train = images[train_indices, :, :, :]
y_train = labels[train_indices]
x_test = images[test_indices, :, :, :]
y_test = labels[test_indices]
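
A quick sanity check on a split like this is to confirm that no image ends up in both sets. The sketch below uses the same shuffle-then-cut idea on plain indices; the function name, the fixed seed, and the 1000/0.9 numbers are assumptions chosen for illustration.

```python
import numpy as np

def split_indices(n_items, train_fraction, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into train and test parts."""
    rng = np.random.RandomState(seed)  # fixed seed so the split is repeatable
    shuffled = rng.permutation(n_items)
    split_index = int(train_fraction * n_items)
    return shuffled[:split_index], shuffled[split_index:]

train_idx, test_idx = split_indices(n_items=1000, train_fraction=0.9)

print(len(train_idx), len(test_idx))             # 900 100
# No index appears in both sets
print(len(np.intersect1d(train_idx, test_idx)))  # 0
```

Shuffling before cutting matters here because files on disk are often sorted by name, which for this dataset would group the images by label.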

With this split, we have finished our preprocessing steps. In the next section we will visualize some images to see what we are up against.

Data Visualization

Data visualization is typically the process of reviewing your training data and getting a sense of the patterns in it. In the case of image data, this (not surprisingly) involves taking a peek at some of the images in your dataset alongside their labels. If you are familiar with Python, the details of this won’t be particularly interesting, and you can just look at the final product at the end of this section.

First, let’s import matplotlib.pyplot to help us plot the images

import matplotlib.pyplot as plt

Next, let’s define a visualization function that will plot the data for us.

def visualize_data(positive_images, negative_images):
    # positive_images - Images where the label = 1 (True)
    # negative_images - Images where the label = 0 (False)

    figure = plt.figure()
    n_columns = positive_images.shape[0]
    count = 0
    for i in range(n_columns):
        count += 1
        # Top row: positive examples
        figure.add_subplot(2, n_columns, count)
        plt.imshow(positive_images[i, :, :])
        plt.title("1")

        # Bottom row: negative examples
        figure.add_subplot(2, n_columns, n_columns + count)
        plt.imshow(negative_images[i, :, :])
        plt.title("0")
    plt.show()

This function will plot positive and negative image examples for us. By “positive” we mean data with a label of “1” or “True”, and by “negative” we mean data with a label of “0” or “False”. We need negative examples (i.e. images that do not contain airplanes), because otherwise the model would not have any reference point and likely assume that the mere presence of an image is indicative that it contains an airplane. The inner workings of the visualize_data function create two rows of images where some number of positive examples are on the top, and the same number of negative examples are on the bottom. The titles above them indicate what their label is. Let’s take a look at what a call to this function looks like.

# Number of positive and negative examples to show
# (set N_TO_VISUALIZE to a small integer before running this block)

# Select the first N positive examples
positive_example_indices = (y_train == 1)
positive_examples = x_train[positive_example_indices, :, :]
positive_examples = positive_examples[0:N_TO_VISUALIZE, :, :]

# Select the first N negative examples
negative_example_indices = (y_train == 0)
negative_examples = x_train[negative_example_indices, :, :]
negative_examples = negative_examples[0:N_TO_VISUALIZE, :, :]

# Call the visualization function
visualize_data(positive_examples, negative_examples)

While this may not be the most efficient code, it demonstrates that for both the positive and negative examples, we retrieve the appropriate labels, select the corresponding subset of relevant training examples, and then feed them to our visualization function. The result of calling this function is pictured below.

Wow, those are some blurry images! This isn’t surprising given that the images only have a resolution of 20px by 20px. At this poor resolution we can somewhat pick out the airplanes in the positive examples, and the lack of airplanes in the negative images. It seems we have our work cut out for us! That being said, let’s list out some more detailed observations we might make from this data.

  1. There are partial airplanes in the negative examples
  2. The positive examples appear to contain airplanes of different sizes. Sometimes the plane occupies 16–18 pixels, other times only 8–10 pixels
  3. There are varied atmospheric conditions present in the images

If you spend more time looking at these examples, you will probably notice even more intricacies about the data. The devil is often in the details, so it is worth your time to get to know your data as well as possible. Next, we will explore the creation of the ML model.

Model Creation

In machine learning, a model is something that accepts input data (X) and “predicts” an outcome (y). The model is trained (or “fit”) such that it achieves the best possible performance on the data it is given. It is important to keep this in mind as we go through the details below.

The model we are going to create is called a Convolutional Neural Network (CNN). CNNs have been all the rage in recent years. CNNs as we know them were first proposed in this 1998 paper by Yann LeCun et al. The paper was groundbreaking when it came out, but CNNs did not become commonplace until very recent hardware improvements in the form of advanced GPUs and TPUs. Since then, CNNs have been rapidly improving to the point where keeping up with them is a full-time job. Most modern societies benefit heavily from CNNs, as they power the latest and greatest advancements in Computer Vision, such as self-driving cars and medical image analysis.

For the purposes of this tutorial, we will develop a CNN that serves as a practical introduction. While parts of this may get tricky, I promise that this example will give you the necessary tools to investigate this topic further.

Let’s jump right in with our import statements. Don’t worry too much about understanding all of them

# Imports
from keras.models import Sequential
from keras.layers import Activation, Dropout, Flatten, Dense, Conv2D, MaxPooling2D
from keras.callbacks import EarlyStopping, TensorBoard
from sklearn.metrics import accuracy_score, f1_score
from datetime import datetime

Notice that there are a ton of Keras related imports. Keras is the framework on which we will build our CNN. It is very popular, and (as of February 2018) my personal favorite for deep learning model creation. Keras is a high level package that is built on top of other “backends” like TensorFlow, MXNet, and Theano. This means that whatever code you write in Keras could also be written with any of those other backends. However, it would likely take a lot more code to build the same model on one of those backends, and the performance would effectively be the same. For this reason, Keras is invaluable to the beginner or practical deep learning coder.

Okay, I have been throwing around this phrase “deep learning” but haven’t really explained it. I am going to rely on the graphic below to do some of the explaining for me.

A poorly drawn figure describing how deep learning fits in the broader ML world

This figure indicates that “Deep Learning” is a classification that occurs when we are talking about a Neural Network with 2 or more layers. I’m sure some audience members will disagree with me on this definition of deep learning, but it is easily quantifiable and I haven’t been particularly convinced by other definitions, so I prefer it. This is all great, but what are “layers”?

A layer is basically a computational unit. We could have a layer called “multiply by 5” that multiplies every number it gets by 5. It might then feed into a layer called “multiply by 10” that takes the output from the “multiply by 5” layer and multiplies it by 10. In the case of CNNs, the prominent layer type that gets used is a “Convolutional Layer”, which convolves the input to produce an output. I am not going to go into the mathematical details about convolution, but here is a pretty neat graphic of what it means to convolve an input. The blue image is the input, and the green image is the output.

Convolution of an image (Courtesy of:

Imagine the operation above happening many times in parallel, and you’ll have a reasonable grasp on what a Convolutional layer is. There are other types of layers as well. For instance, a “ReLU” layer simply converts all negative numbers to 0. At the time of writing, there are dozens of layer types one could utilize in their model. The “secret sauce” of a particular CNN is the arrangement of its layers, commonly referred to as the “network architecture”.
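
The idea of layers feeding into one another can be sketched in a few lines of plain Python. This is only a toy illustration of chaining, not how Keras implements layers; the function names and input values are made up.

```python
def multiply_by_5(values):
    return [v * 5 for v in values]

def multiply_by_10(values):
    return [v * 10 for v in values]

def relu(values):
    # ReLU replaces every negative number with 0 and leaves the rest unchanged
    return [max(0, v) for v in values]

def run_layers(values, layers):
    """Feed the values through each layer in order, like a tiny sequential model."""
    for layer in layers:
        values = layer(values)
    return values

output = run_layers([-2, 0, 3], [multiply_by_5, multiply_by_10, relu])
print(output)  # [0, 0, 150]
```

Changing the order or the choice of functions in the list is the toy equivalent of changing the network architecture.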

It should also be noted that the little 3x3 square that is moving across the animated image above is called the “convolutional kernel”. During training, the CNN is going to attempt to learn the optimal values for the kernels such that feeding the X data through the layers will yield the correct prediction for y. You can think of these kernel values as learnable model parameters. There is a lot more to kernels, but it is outside the scope of this tutorial.
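
For readers who prefer code to animations, here is a rough sketch of a single 3x3 kernel sliding over a small input. It assumes no padding and a stride of 1, and the kernel values are hand-picked; in a real CNN they would be learned during training. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products at each position."""
    k_rows, k_cols = kernel.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    output = np.zeros((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            patch = image[r:r + k_rows, c:c + k_cols]
            output[r, c] = np.sum(patch * kernel)
    return output

image = np.arange(25, dtype=float).reshape(5, 5)  # made-up 5x5 input
kernel = np.ones((3, 3))                          # hand-picked kernel that sums each 3x3 patch

result = convolve2d_valid(image, kernel)
print(result.shape)  # (3, 3)
print(result[0, 0])  # 54.0
```

Note that the 5x5 input shrinks to 3x3: without padding, a 3x3 kernel trims one pixel from every edge.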

At this point, I think I have laid enough of a foundation for us to start building our own CNN in Keras.

First, let’s define a hyperparameter. A hyperparameter is a value typically defined by the user to tweak the model’s performance. This is not to be confused with a model parameter, which is something the model itself tweaks during training.

# Hyperparameter

The “N_LAYERS” hyperparameter defines how many convolutional layers our CNN will have. Next, let’s go ahead and use Keras to define our model.

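def cnn(size, n_layers):
    # INPUTS
    # size     - size of the input images
    # n_layers - number of convolutional layers
    # OUTPUTS
    # model    - compiled CNN

    # Define hyperparameters
    # (MIN_NEURONS and MAX_NEURONS must also be defined; their values are not shown here)
    KERNEL = (3, 3)

    # Determine the # of neurons in each convolutional layer
    steps = np.floor(MAX_NEURONS / (n_layers + 1))
    neurons = np.arange(MIN_NEURONS, MAX_NEURONS, steps)
    neurons = neurons.astype(np.int32)

    # Define a model
    model = Sequential()

    # Add convolutional layers (only the first one needs the input shape)
    for i in range(0, n_layers):
        if i == 0:
            shape = (size[0], size[1], size[2])
            model.add(Conv2D(int(neurons[i]), KERNEL, input_shape=shape))
        else:
            model.add(Conv2D(int(neurons[i]), KERNEL))

        model.add(Activation('relu'))

    # Add max pooling layer
    model.add(MaxPooling2D(pool_size=(2, 2)))

    # Flatten and add a dense layer (inferred from the parameter count in the model summary)
    model.add(Flatten())
    model.add(Dense(MAX_NEURONS))
    model.add(Activation('relu'))

    # Add output layer
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Print a summary of the model
    model.summary()

    return model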

This is likely going to be an intimidating block of code. For the purposes of this guide, you do not need to become a master of this code, as it is just one of hundreds of possible network architectures. However, you should note that the “model.add()” operations are building a chain of layers that the network is going to pass data through. In addition to the Conv2D layers (this is how Keras constructs convolutional layers), these include many non-convolutional layers such as MaxPooling2D, Dense, and Activation layers. At the end of this function, we compile the model, which allows us to define how we would like the performance to be optimized and evaluated. When designing a CNN in Keras, you can generally follow the guidelines below:

  1. Create a model that we will sequentially add layers to with “model = Sequential()”
  2. Add various layers to the model using “model.add()”
  3. Compile the model with “model.compile()”

We should take a moment to discuss the inputs of the model.compile() block.

  • Loss — The loss function that will define the model’s success or failure. Each iteration of training, it will calculate its loss using this function to evaluate its current performance, and tweak the model parameters based on this feedback. Binary cross entropy is a pretty standard choice when working with a binary classification task.
  • Optimizer — Defines how the parameters will be tweaked (i.e., should the parameters be modified by a large or small amount?) The adam optimizer we are using is particularly robust, and does a few neat tricks to make sure that optimization is efficient. It is also a standard choice for most deep learning models.
  • Metrics — Defines which metrics we would like the model to report back to us during training. Accuracy gives us a more human-friendly interpretation of the model’s current performance than loss, so I chose to report it here.
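
To see why loss and accuracy are related but not identical, here is a small numpy sketch of binary cross entropy next to accuracy. The labels and predicted probabilities are made up, and the function is a simplified stand-in rather than the Keras implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy; eps keeps log() away from 0."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))

def accuracy(y_true, y_prob):
    """Fraction of predictions that land on the correct side of the 0.5 threshold."""
    return float(np.mean((y_prob > 0.5) == (y_true == 1)))

y_true = np.array([1.0, 0.0, 1.0, 0.0])      # made-up labels
confident = np.array([0.9, 0.1, 0.8, 0.2])   # correct and fairly sure
hesitant = np.array([0.6, 0.4, 0.55, 0.45])  # correct, but only barely

print(accuracy(y_true, confident), accuracy(y_true, hesitant))  # 1.0 1.0
print(binary_cross_entropy(y_true, confident) < binary_cross_entropy(y_true, hesitant))  # True
```

Both sets of predictions are 100% accurate, yet the loss is lower for the confident one. That extra information is what gives the optimizer something to improve even when accuracy has stopped moving.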

The code above also prints out a summary of the model architecture, as shown below

Summary of model architecture

This summary is a powerful tool. Note that “None” is the first dimension of every “Output Shape”. This is because we are going to be giving the model many images at once during training, but have not yet defined how many. “None” is therefore more of a wildcard in this scenario. When this model is deployed in the real world, it is likely that the None values will be 1, to represent that the model will make a prediction on one image at a time.

Additionally, it is important to notice that the “Output Shape” changes until it eventually becomes (None, 1). This final shape indicates that we will have a single floating point value that tells us the model’s final prediction (i.e., what is the probability that the given image contains an airplane?).

We also see the number of trainable parameters in each layer. When we “train” this model, it will basically have 489,597 different values it can tweak to try to optimize its performance.
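
If you are curious where a number like that comes from, the sketch below counts parameters by hand. Everything marked "assumed" is my own guess rather than something stated in the text: 4 convolutional layers, MIN_NEURONS = 20, MAX_NEURONS = 120, no padding, and a Flatten, Dense(MAX_NEURONS), Dense(1) tail after the pooling layer. Under those assumptions the count happens to match the total quoted above.

```python
import numpy as np

# Assumed hyperparameter values (not stated in the text above)
N_LAYERS, MIN_NEURONS, MAX_NEURONS = 4, 20, 120
KERNEL = (3, 3)
height, width, channels = 20, 20, 3  # image size reported earlier

# Same filter schedule as in cnn()
steps = np.floor(MAX_NEURONS / (N_LAYERS + 1))
neurons = np.arange(MIN_NEURONS, MAX_NEURONS, steps).astype(np.int32)

total = 0
for i in range(N_LAYERS):
    filters = int(neurons[i])
    # Each filter has one weight per kernel cell per input channel, plus one bias
    total += KERNEL[0] * KERNEL[1] * channels * filters + filters
    # Assumed: no padding, so a 3x3 kernel trims 2 pixels from each spatial dimension
    height, width, channels = height - 2, width - 2, filters

# Assumed: one 2x2 max pooling layer, then Flatten, Dense(MAX_NEURONS), Dense(1)
flattened = (height // 2) * (width // 2) * channels
total += flattened * MAX_NEURONS + MAX_NEURONS  # hidden dense layer
total += MAX_NEURONS * 1 + 1                    # output layer

print(neurons[:N_LAYERS].tolist())  # [20, 44, 68, 92]
print(total)                        # 489597
```

Most of the parameters sit in the dense layer after flattening, not in the convolutional layers, which is typical for small CNNs like this one.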

Finally, we actually build an instance of this model by passing in the input image size and the model hyperparameter we defined earlier

# Instantiate the model
model = cnn(size=image_size, n_layers=N_LAYERS)

Model Training

Training is the process the ML algorithm undertakes to tweak and optimize its parameters. Typically, a CNN is trained iteratively, and the model tries to improve its performance every iteration. A bit of terminology: An iteration is defined as a pass through a single batch of images. An epoch is defined as a single pass through all images in the training dataset. To make this clear, let’s say we have 1000 training images, and a batch size of 100. This would mean that there are 10 iterations per epoch, because we would need to pass 10 batches in before we would have cycled through the entire training dataset.
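
The arithmetic above is easy to check in code. The helper below is my own; the second call uses a made-up image count to show what happens when the batch size does not divide the dataset evenly.

```python
import math

def iterations_per_epoch(n_images, batch_size):
    """Number of batches needed to see every training image once."""
    return math.ceil(n_images / batch_size)

print(iterations_per_epoch(1000, 100))  # 10
print(iterations_per_epoch(1050, 100))  # 11 (the last batch holds the 50 leftover images)
```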

This should help make sense of some of the hyperparameters defined below.

# Training hyperparameters
# (BATCH_SIZE and PATIENCE, used below, must also be defined here)
EPOCHS = 150

Why deal with the overhead of managing batches of data? The answer is somewhat complicated, but basically we are looking to optimize the loss function on small pieces of the training data at a time. Looking at the entire training dataset at once may not lead to the best minimization of the loss function. Additionally, in many use cases you cannot fit all of your data into the GPU’s memory simultaneously, so it is impractical to do so. However, it would also be computationally inefficient to train on a single image at a time, as it would take an unbearably long amount of time to complete training. Therefore the batch size represents a compromise between training the entire dataset simultaneously and iterating through individual instances.

Next, let’s define an early stopping callback. A callback is simply a function that can “check-in” on the model as it trains.

# Early stopping callback
early_stopping = EarlyStopping(monitor='loss', min_delta=0, patience=PATIENCE, verbose=0, mode='auto')

Training is an iterative process, and the model can either increase, decrease or maintain its performance after each iteration. Early stopping tells the model to stop training if it doesn’t see any improvement for a user-defined time-frame (defined in number of epochs via the PATIENCE hyperparameter).
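
The logic behind early stopping is simple enough to sketch without Keras. The function below is a simplified stand-in for what the callback does with the monitored loss, not the actual Keras code, and the loss values are made up.

```python
def epochs_run_with_early_stopping(losses, patience, min_delta=0.0):
    """Return how many epochs run before training stops.

    Simplified rule: stop once the loss has failed to improve on its best
    value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(losses)

# Made-up per-epoch loss values: improvement stalls after the third epoch
losses = [0.9, 0.5, 0.4, 0.41, 0.42, 0.40, 0.43, 0.39]
print(epochs_run_with_early_stopping(losses, patience=3))   # 6
print(epochs_run_with_early_stopping(losses, patience=10))  # 8
```

With a patience of 3, training stops at epoch 6 and never reaches the slightly better loss at epoch 8. A larger patience trades longer training for a better chance of escaping plateaus like this one.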

Now let’s define a TensorBoard callback

# TensorBoard callback
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
log_dir = "{}/run-{}/".format(LOG_DIRECTORY_ROOT, now)
tensorboard = TensorBoard(log_dir=log_dir, write_graph=True, write_images=True)

TensorBoard is a powerful visualization tool available to anyone using Keras with a TensorFlow backend. It allows you to view the model’s performance in real-time as training progresses. To open up TensorBoard, open a command window and type:

tensorboard --logdir LOG_DIRECTORY_ROOT

You should then open up your web browser to the address mentioned in the response from this command. This is usually something like 192.168.1.X:6006. No information about your model will appear in TensorBoard yet, but once we actually start training the model, you will see outputs in TensorBoard that look something like this:

Accuracy vs. training iteration

Aside: This plot indicates that the model generally improved every epoch until it hit nearly 100% (1.0) accuracy.

Finally let’s organize both callbacks into a list, as this arrangement will be expected by the Keras model’s fit function.

# Place the callbacks in a list
callbacks = [early_stopping, tensorboard]

With the callbacks defined, we are ready to train the model

# Train the model
model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, callbacks=callbacks, verbose=0)

This call to model.fit() is where the magic happens. This function call passes in the training images (x_train), labels (y_train), epochs, batch_size, and callbacks. On a GPU, the model will take about 10 minutes to run.

While the model trains, you should take some time to open TensorBoard in your browser and watch as the training progresses. You should go to the “Scalars” tab and expand the 2 metrics that appear there. You should see that the “acc” graph is gradually approaching 1.0, while the loss graph is gradually approaching 0. These are signs of a healthy model! It is obvious why you would want acc (accuracy) to be high, but it may not immediately be clear why you would want loss to be low. For the most part, loss and accuracy are two sides of the same coin. Loss is simply how the model internally evaluates its performance. It is trying to do everything it can to not make mistakes, thereby minimizing its loss. As alluded to earlier, most people find the idea of maximizing the accuracy easier to grasp than minimizing the loss, so accuracy is also reported, even though the algorithm technically doesn’t use that information.

Additionally, if you are looking for a highly detailed representation of the model we built, go to the “Graphs” tab in TensorBoard. I will not walk through reading a graph, but you should be able to see that TensorBoard neatly arranged the layers we defined in the cnn function.

Model Evaluation

After the model has finished training, you should evaluate the model on the test set using the following code.

# Make a prediction on the test set
test_predictions = model.predict(x_test)
test_predictions = np.round(test_predictions)

We have to round the scores so that we get a binary output.
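
Here is a small sketch of what the rounding does, using made-up scores shaped like the (n_images, 1) array a single-output model produces. It also shows an explicit 0.5 threshold as an alternative.

```python
import numpy as np

# Made-up sigmoid outputs, one row per image
scores = np.array([[0.03], [0.97], [0.5], [0.51]])

rounded = np.round(scores).ravel()                  # flatten to shape (n_images,)
thresholded = (scores > 0.5).astype(float).ravel()  # explicit threshold at 0.5

print(rounded)      # [0. 1. 0. 1.]
print(thresholded)  # [0. 1. 0. 1.]
```

One detail worth knowing: np.round sends a score of exactly 0.5 to 0 because it rounds halves to the nearest even number, so both approaches treat 0.5 as “not plane”.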

Then use the predictions and compare them to the ground truth

# Report the accuracy
accuracy = accuracy_score(y_test, test_predictions)
print("Accuracy: " + str(accuracy))

This will tell you the model’s final performance on the test dataset. The model I trained had ~98.88% accuracy. Your result should be within about ±0.5% of that.

With the final performance, it is typical to inspect the images that your model failed on, so you can know where you need to improve. The block of code below will display the images from the test set that failed.

import matplotlib.pyplot as plt

def visualize_incorrect_labels(x_data, y_real, y_predicted):
    # x_data      - images
    # y_real      - ground truth labels
    # y_predicted - predicted labels
    count = 0
    figure = plt.figure()
    incorrect_label_indices = (y_real != y_predicted)
    y_real = y_real[incorrect_label_indices]
    y_predicted = y_predicted[incorrect_label_indices]
    x_data = x_data[incorrect_label_indices, :, :, :]

    # Side length of the smallest square grid that fits every incorrect image
    # (add_subplot expects integers)
    maximum_square = int(np.ceil(np.sqrt(x_data.shape[0])))

    for i in range(x_data.shape[0]):
        count += 1
        figure.add_subplot(maximum_square, maximum_square, count)
        plt.imshow(x_data[i, :, :, :])
        plt.title("Predicted: " + str(int(y_predicted[i])) + ", Real: " + str(int(y_real[i])), fontsize=10)
    plt.show()

visualize_incorrect_labels(x_test, y_test, np.asarray(test_predictions).ravel())

Here is the output

Output showing all of the false positives and false negatives in the test dataset

False Positives: We can see from the images that our model is fooled by roads that appear to take the shape of an airplane. Additionally, if there is a large white spot in the image, the model might also incorrectly predict that an airplane is present. Images that contain partial planes also can result in false positives.

False Negatives: Airplanes that are non-white in color seem to throw off the model. The model also has trouble detecting airplanes with unusual wing shapes.


Conclusion

If you made it to the end of this tutorial, congratulations! To recap, we acquired and prepared data, built a deep learning model called a Convolutional Neural Network (CNN), and achieved >98% accuracy. Thank you for reading, and I hope you enjoyed this tutorial!

Further Exploration:

  1. We used accuracy to evaluate the performance of this model. However, due to the unbalanced nature of this dataset (meaning we have more negative examples than positive examples), it may be wiser to evaluate the model with F1 score. You should try to do this on your own using the “f1_score” function we imported earlier. Also, try to do some research to understand why I am suggesting we use F1 score.
  2. Try learning more about the other layers used in the model, particularly the ReLU and MaxPooling layers. These layers were crucial discoveries, and drastically improved the performance and feasibility of CNNs.
  3. Fiddle with the model hyperparameters. Better yet, try to write a script that automatically tests different hyperparameter combinations and see the highest accuracy you can achieve. If you do better than what I posted, please share your network architecture in the comments! (Hint: I have achieved improved accuracy by adding something called a “Dropout” layer to the model)
  4. See if you can save the model and load it back up again.
  5. Try this model out on a different dataset. Kaggle has many labeled, binary classification, image datasets that you can run this model on.
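
As a starting hint for item 1, the toy example below shows how accuracy can look good on unbalanced data while F1 does not. The labels are made up, and the hand-written function is only meant to expose the arithmetic; on real predictions you would call the imported f1_score instead.

```python
import numpy as np

def f1_from_labels(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

# Made-up unbalanced labels: 9 negatives, 1 positive, and a "model" that always predicts 0
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
always_negative = np.zeros_like(y_true)

print(float(np.mean(y_true == always_negative)))  # 0.9
print(f1_from_labels(y_true, always_negative))    # 0.0
```

A model that never finds a single airplane still scores 90% accuracy here, while its F1 score is 0.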