Anomaly Detection Part 1: Autoencoder
Autoencoder in action
What is an anomaly?
Lexico defines it as: something that deviates from what is standard, normal, or expected.
This isn’t a great definition because it implies a perfect observer.
In the real world, anomalies turn up all the time, either because the observer’s view of the subject is imperfect or, as the dictionary definition states, because something has changed intrinsically within the subject we are investigating.
Either way, anomalies are an opportunity to correct mistaken prior beliefs or to capture behavioral changes within subjects.
In this post, I will introduce and discuss:
- A number of common approaches to detecting anomalies;
- The autoencoder;
- An example of the autoencoder in action.
Popular Approaches To Anomaly Detection
There are many ways to build an anomaly detection system. Here are a few of the most notable:
- Density-based techniques (k-nearest neighbor, local outlier factor, isolation forests, and many more variations of this concept);
- Subspace, correlation-based, and tensor-based outlier detection for high-dimensional data;
- One-class support vector machines;
- Replicator neural networks, autoencoders, and long short-term memory neural networks;
- Bayesian Networks;
- Hidden Markov models (HMMs);
- Cluster analysis-based outlier detection;
- Deviations from association rules and frequent itemsets;
- Fuzzy logic-based outlier detection;
- Ensemble techniques, using feature bagging, score normalization, and different sources of diversity.
In this series, I’ll introduce the models I personally use, one by one.
At work, I tackle anomaly detection with an ensemble model. I will start with one member of that ensemble: the classic version of an autoencoder. In my next post, I’ll cover a fancy variant called the variational autoencoder.
Introducing Autoencoder
If most traditional programming approaches can be categorized as lossless compression, in which programs are built on top of strict, handcrafted business rules, then machine learning approaches are lossy compression, in which the business rules (logic, patterns) are induced from data.
An autoencoder is the best example of this. Here are a few key features of autoencoders:
- Autoencoders are data-specific, which means they will only be able to compress data similar to that on which they have been trained. An autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees because the features it has learned would be face-specific.
- Autoencoders are lossy, which means that the decompressed outputs will be degraded compared to the original inputs.
- Autoencoders are learned automatically from data examples, which is useful because we do not have to do the extra work of preparing extra labels or data for the algorithm.
To build an autoencoder, you need three things: an encoding function, a decoding function, and a distance function (a loss) that measures how much information is lost between the original data and the version decompressed from its compressed representation.
You could think of an autoencoder as doing what a copy machine does, which may lead you to ask: why do we need another copy machine?
As you can see, the shape of an autoencoder resembles that of an hourglass. We call the hidden layer h in the middle the bottleneck layer.
This bottleneck layer has far fewer neurons than the input and output layers, so the autoencoder has to find a way to represent the data while letting go of its noise and fluff.
If you are interested in the math and in-depth explanations, you can check out Andrew Ng’s lecture on the topic.
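To make the three ingredients and the bottleneck concrete, here is a minimal NumPy sketch. It is an illustration only: the layer sizes (784 and 32), the random untrained weights, and the helper names `encode`, `decode`, and `reconstruction_loss` are assumptions made for this example, and mean squared error is just one possible distance function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration: a flattened 28x28 image
# squeezed through a 32-unit bottleneck.
input_dim, bottleneck_dim = 784, 32

# Randomly initialised weights; a real autoencoder learns these from data.
W_enc = rng.normal(scale=0.01, size=(input_dim, bottleneck_dim))
W_dec = rng.normal(scale=0.01, size=(bottleneck_dim, input_dim))


def encode(x):
    """Encoding function: compress the input down to the bottleneck h."""
    return np.maximum(x @ W_enc, 0.0)  # ReLU activation


def decode(h):
    """Decoding function: expand the bottleneck back to the input size."""
    return 1.0 / (1.0 + np.exp(-(h @ W_dec)))  # sigmoid keeps values in (0, 1)


def reconstruction_loss(x, x_hat):
    """Distance function: mean squared error between input and reconstruction."""
    return float(np.mean((x - x_hat) ** 2))


x = rng.random((5, input_dim))  # five fake "images" with values in [0, 1)
h = encode(x)                   # compressed representation
x_hat = decode(h)               # lossy reconstruction
loss = reconstruction_loss(x, x_hat)

print(h.shape, x_hat.shape)  # (5, 32) (5, 784)
```

Training an autoencoder amounts to adjusting the encoder and decoder weights so that this loss shrinks on the training data; the sketch skips that step and only shows how the pieces fit together.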
Autoencoders sound super daunting, but they aren’t.
Example of An Autoencoder In Action
Without further ado, let’s implement an autoencoder!
Here is the code sample implemented with TensorFlow. The encoder will consist of a stack of Conv2D and MaxPooling2D layers (max pooling being used for spatial down-sampling), while the decoder will consist of a stack of Conv2D and UpSampling2D layers.
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from tensorflow.keras.models import Model
from tensorflow.keras.datasets import mnist
from tensorflow.keras.callbacks import TensorBoard
import numpy as np
import matplotlib.pyplot as plt
input_img = Input(shape=(28, 28, 1)) # adapt this if using `channels_first` image data format
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)
# at this point the representation is (4, 4, 8) i.e. 128-dimensional
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
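# no padding='same' here on purpose: 'valid' padding shrinks 16x16 to 14x14,
# so the final up-sampling step lands back on 28x28
x = Conv2D(16, (3, 3), activation='relu')(x)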
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = np.reshape(x_train, (len(x_train), 28, 28, 1)) # adapt this if using `channels_first` image data format
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1))
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=128,
                shuffle=True,
                validation_data=(x_test, x_test),
                callbacks=[TensorBoard(log_dir='/tmp/autoencoder')])
decoded_imgs = autoencoder.predict(x_test)
n = 10
plt.figure(figsize=(20, 4))
for i in range(1, n + 1):
    # display original
    ax = plt.subplot(2, n, i)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    # display reconstruction
    ax = plt.subplot(2, n, i + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
The code covers the usual steps: importing libraries and preparing the data. The rest is pretty simple: we create the encoder with a few convolutional layers, and the same goes for the decoder.
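One detail in the listing is easy to miss: the Conv2D(16, ...) layer near the end of the decoder is the only convolution without padding='same'. The sketch below traces the spatial size of the image through the pooling, up-sampling, and unpadded convolution steps using plain arithmetic (no TensorFlow required). The helper names are made up for this illustration, and the size rules are the standard ones for stride-1 convolutions and 2x2 pooling.

```python
import math


def same_pool(size, pool=2):
    # MaxPooling2D with padding='same': output size is ceil(size / pool)
    return math.ceil(size / pool)


def upsample(size, factor=2):
    # UpSampling2D repeats rows and columns, multiplying the size
    return size * factor


def valid_conv(size, kernel=3):
    # Conv2D with the default padding='valid' and stride 1
    return size - kernel + 1


size = 28
trace = [size]

# Encoder: the 'same'-padded convolutions keep the size, each pooling halves it.
for _ in range(3):
    size = same_pool(size)
    trace.append(size)

# Decoder: two up-samplings, the unpadded convolution, one last up-sampling.
for step in (upsample, upsample, valid_conv, upsample):
    size = step(size)
    trace.append(size)

print(trace)  # [28, 14, 7, 4, 8, 16, 14, 28]
```

With padding='same' on every decoder convolution, the output would come out as 32x32 rather than 28x28 and would no longer match the input images.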
Voila! We have our first autoencoder up and running.
After running this code, here’s what we get:
The top row contains the original digits, while the bottom row shows the reconstructed ones. As you can see, we are losing quite a bit of detail with this basic approach.
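Since this series is about anomaly detection, here is one hedged sketch of how reconstructions like these could be turned into an anomaly score: compute a per-image reconstruction error and flag images whose error is unusually high. This is not taken from the code above. The mean-squared-error score, the 99th-percentile threshold, and the helper names are all assumptions for illustration, and synthetic arrays stand in for real images and the output of autoencoder.predict.

```python
import numpy as np


def reconstruction_errors(originals, reconstructions):
    """Per-sample mean squared error, averaged over all non-batch axes."""
    diff = originals - reconstructions
    return np.mean(diff ** 2, axis=tuple(range(1, diff.ndim)))


def anomaly_threshold(normal_errors, percentile=99.0):
    """Hypothetical rule: anything above this percentile of normal errors is flagged."""
    return float(np.percentile(normal_errors, percentile))


rng = np.random.default_rng(0)

# Synthetic stand-ins for "normal" images and their (good) reconstructions.
normal_imgs = rng.random((200, 28, 28, 1))
normal_recon = np.clip(
    normal_imgs + rng.normal(scale=0.01, size=normal_imgs.shape), 0.0, 1.0)
threshold = anomaly_threshold(reconstruction_errors(normal_imgs, normal_recon))

# A new batch in which sample 3 is reconstructed very badly (simulated here
# by inverting the image).
new_imgs = rng.random((10, 28, 28, 1))
new_recon = np.clip(
    new_imgs + rng.normal(scale=0.01, size=new_imgs.shape), 0.0, 1.0)
new_recon[3] = 1.0 - new_imgs[3]

errors = reconstruction_errors(new_imgs, new_recon)
is_anomaly = errors > threshold
print(np.flatnonzero(is_anomaly))  # indices of the flagged samples
```

With the model above, the same functions could be applied to x_test and decoded_imgs. The right percentile depends on how many false alarms are acceptable, so treat 99 as a placeholder rather than a recommendation.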
In Summary
Autoencoders are powerful. They can be used to learn efficient data codings in an unsupervised manner.
Variational autoencoders, which I will introduce in my next post, instead learn the parameters of a probability distribution that represents the data. Because they learn to model the data, we can sample from that distribution and generate new data samples.
If autoencoders were a scrappy copy machine, you could think of variational autoencoders as cunning art forgers. Much more exciting, don’t you think?
Stay tuned!