Artificial Intelligence and Anomaly Detection

Anomaly Detection With LSTM Autoencoders

Unsupervised ML Approach for fund management

Sarit Maitra
Sep 15, 2020

The anomaly to detect here is when actual results differ from predicted results in price prediction. Real-life data is often streaming, time-series data, where anomalies carry significant information in critical situations. In anomaly detection we are interested in discovering abnormal, unusual or unexpected records, and in the time-series context an anomaly can be detected within the scope of a single record or as a subsequence/pattern.

A time-series predictive model estimated on historical data helps us predict future prices from current data. Once we have the predictions, we can detect anomalies by comparing them with the actuals.

Let’s implement it and look at its pros and cons. Our objective here is to develop an anomaly detection model for time series data, and we will use a neural network architecture for this use case.

Let us load the Henry Hub Spot Price data from EIA. We have to remember that the order of the data here is important and should be chronological, since we are going to forecast the next point.

print("....Data loading...."); print()
print('\033[4mHenry Hub Natural Gas Spot Price, Daily (Dollars per Million Btu)\033[0m')
def retrieve_time_series(api, series_ID):
series_search = api.data_by_series(series=series_ID)
spot_price = DataFrame(series_search)
return spot_price
def main():
try:
api_key = "....API KEY..."
api = eia.API(api_key)
series_ID = 'xxxxxx'
spot_price = retrieve_time_series(api, series_ID)
print(type(spot_price))
return spot_price;
except Exception as e:
print("error", e)
return DataFrame(columns=None)
spot_price = main()
spot_price = spot_price.rename({'Henry Hub Natural Gas Spot Price, Daily (Dollars per Million Btu)': 'price'}, axis = 'columns')
spot_price = spot_price.reset_index()
spot_price['index'] = pd.to_datetime(spot_price['index'].str[:-3], format='%Y %m%d')
spot_price['Date']= pd.to_datetime(spot_price['index'])
spot_price.set_index('Date', inplace=True)
spot_price = spot_price.loc['2000-01-01':,['price']]
spot_price = spot_price.astype(float)
print(spot_price)

Raw data visualization:

import matplotlib.pyplot as plt

print('Historical Spot price visualization:')
plt.figure(figsize=(15, 5))
plt.plot(spot_price)
plt.title('Henry Hub Spot Price (Daily frequency)')
plt.xlabel('Date_time')
plt.ylabel('Price ($/MMBtu)')
plt.show()

# checking missing values
print('Missing values:', spot_price.isnull().sum())
# dropping missing values
spot_price = spot_price.dropna()
print('....Dropped Missing value row....')
# rechecking missing values
print('Rechecking Missing values:', spot_price.isnull().sum())

The common characteristic of different types of market manipulation is an unexpected pattern or behavior in the data.

# Generate Boxplot
print('Box plot visualization:')
spot_price.plot(kind='box', figsize = (10,4))
plt.show()
# Generate Histogram
print('Histogram visualization:')
spot_price.plot(kind='hist', figsize = (10,4) )
plt.show()

Detecting anomalous subsequence:

Here, the goal is to identify an anomalous subsequence within a given long time series (sequence).

Anomaly detection is based on the fundamental concept of modeling what is normal in order to discover what is not. (Dunning & Friedman)

Pre-processing:

We’ll use 95% of the data and train our model on it:
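The split itself isn’t shown in the original snippet; a minimal sketch of a chronological 95/5 split (the exact fraction is the only assumption beyond the text above):

# chronological split: no shuffling, because order matters for forecasting
train_size = int(len(spot_price) * 0.95)
train, test = spot_price.iloc[:train_size], spot_price.iloc[train_size:]
print('Train shape:', train.shape)
print('Test shape:', test.shape)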

Next, we’ll scale the data by fitting the scaler on the training set and applying the same transformation to the test set. I have used RobustScaler as shown below:

from sklearn.preprocessing import RobustScaler

# data standardization: scale by the interquartile range, which is
# less sensitive to the very price spikes we are trying to detect
robust = RobustScaler(quantile_range=(25, 75)).fit(train[['price']])
train['price'] = robust.transform(train[['price']])
test['price'] = robust.transform(test[['price']])

Finally, we’ll split the data into sub-sequences with the help of a helper function.

import numpy as np

# helper function: slice the series into overlapping windows of `time_steps`
def create_dataset(X, y, time_steps=1):
    a, b = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        a.append(v)
        b.append(y.iloc[i + time_steps])
    return np.array(a), np.array(b)

# we'll create sequences with 30 days of historical data
n_steps = 30
# reshape to 3D [n_samples, n_steps, n_features]
X_train, y_train = create_dataset(train[['price']], train['price'], n_steps)
X_test, y_test = create_dataset(test[['price']], test['price'], n_steps)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

LSTM Autoencoder in Keras:

An autoencoder is a form of neural network architecture capable of discovering structure within data in order to develop a compressed representation of the input. It employs a recurrent network as an encoder to read the input sequence into a hidden representation, which is then fed to a decoder recurrent network that reconstructs the input sequence itself.

Here, our autoencoder takes a sequence as input and outputs a sequence of the same shape. We have a total of 5219 data points in the sequence, and our goal is to find the points at which the data behaves abnormally.

If we can predict a data point at time ‘t’ based on the historical data until ‘t-1’, then we have a way of looking at an expected value compared to an actual value to see if we are within the expected range of values for time ‘t’.

We can compare y_pred with the actual value (y_test). The difference between y_pred and y_test gives the error, and when we collect the errors of all the points in the sequence, we end up with a distribution of errors. To accomplish this, we will use a sequential model in Keras.

Model architecture:

  • The model consists of an LSTM layer and a dense layer.
  • The LSTM layer takes the time series data as input and learns its temporal structure.
  • The next layer is the dense (fully connected) layer.
  • The dense layer takes the output of the LSTM layer as input and applies a fully connected transformation.
  • Then we apply a sigmoid activation on the dense layer so that the final output lies between 0 and 1.

We also use the ‘adam’ optimizer and the ‘mean squared error’ as the loss function.
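The model definition is not reproduced in the post, so here is a minimal sketch of one way to build such a sequence-to-sequence LSTM autoencoder in Keras, consistent with the 3D reconstructions used in the evaluation below. The layer width (128), batch size (32) and validation split are assumptions, and the output layer is kept linear rather than sigmoid, since the robust-scaled prices are not confined to [0, 1]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

# encoder: compress each 30-step window into a single hidden vector;
# decoder: unroll that vector back into a sequence of the same shape
model = Sequential([
    LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2])),
    RepeatVector(X_train.shape[1]),
    LSTM(128, return_sequences=True),
    TimeDistributed(Dense(X_train.shape[2]))
])
model.compile(optimizer='adam', loss='mse')
model.summary()

# autoencoder training: the input is also the reconstruction target
history = model.fit(X_train, X_train, epochs=20, batch_size=32,
                    validation_split=0.1, shuffle=False)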

Issue with Sequences:

  • ML algorithms and neural networks are designed to work with fixed-length inputs.
  • The temporal ordering of the observations can make it challenging to extract features suitable for use as input to supervised learning models.

The fixed 30-day windows created earlier address both points. The loss history below shows how training converged:
# history for loss
plt.figure(figsize = (10,5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

Evaluation:

Once the model is trained, we can predict on the test data set and compute the error (MAE). Let’s start by calculating the Mean Absolute Error (MAE) on the training data.

MAE on train data:
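The post doesn’t reproduce this computation; a sketch mirroring the test-set calculation below, with a histogram to get a feel for the error distribution:

# reconstruction error on the training data
X_train_pred = model.predict(X_train)
train_mae = np.mean(np.abs(X_train_pred - X_train), axis=1).ravel()

plt.figure(figsize=(10, 5))
plt.hist(train_mae, bins=50)
plt.xlabel('Train MAE')
plt.ylabel('Count')
plt.show()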

Accuracy metrics on test data:

import math
from sklearn.metrics import mean_squared_error

# MAE on the test data
y_pred = model.predict(X_test)
print('Predict shape:', y_pred.shape); print()
mae = np.mean(np.abs(y_pred - X_test), axis=1)
# reshaping prediction to 2D
pred = y_pred.reshape((y_pred.shape[0] * y_pred.shape[1]), y_pred.shape[2])
print('Prediction:', pred.shape); print()
print('Test data shape:', X_test.shape); print()
# reshaping test data to 2D
X_test = X_test.reshape((X_test.shape[0] * X_test.shape[1]), X_test.shape[2])
print('Test data:', X_test.shape); print()
# error computation
errors = X_test - pred
print('Error:', errors.shape); print()
# RMSE on the test data
RMSE = math.sqrt(mean_squared_error(X_test, pred))
print('Test RMSE: %.3f' % RMSE)

RMSE is 0.099, which is low, and this is also evident from the low loss in the training phase after 20 epochs (loss: 0.0749, val_loss: 0.0382). Though this might be a good prediction with low error, the anomalous behavior in the actuals can’t be identified from these aggregate error metrics alone.

Threshold computation:

The objective is to flag an anomaly whenever the error is larger than a selected threshold value.
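The cutoff selection isn’t shown in the snippet; a minimal sketch, assuming the threshold is taken as a high quantile of the training reconstruction error (the 99th percentile here is illustrative, not the article’s exact value):

# cutoff from the training error distribution
threshold = np.quantile(train_mae, 0.99)
print('Reconstruction error threshold:', threshold)

# per-window error on the test set, computed earlier as `mae`
test_mae = mae.ravel()
plt.figure(figsize=(10, 5))
plt.plot(test_mae, label='test MAE')
plt.axhline(threshold, color='r', label='threshold')
plt.legend()
plt.show()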

Looks like we’re thresholding the extreme values quite well. Let’s create a data frame using only those points; it doubles as the anomalies report.

Anomalies report format:
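A sketch of that data frame, assuming each 30-day window is aligned with the date it reconstructs (the column names are illustrative):

# align each window's error with the date it reconstructs
anomaly_df = pd.DataFrame(index=test[n_steps:].index)
anomaly_df['loss'] = test_mae
anomaly_df['threshold'] = threshold
anomaly_df['anomaly'] = anomaly_df['loss'] > anomaly_df['threshold']
anomaly_df['price'] = test[n_steps:]['price'].values

anomalies = anomaly_df[anomaly_df['anomaly']]
print(anomalies.head())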

Inverse-transform the scaled test data back to the original price scale:
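A sketch using the RobustScaler fitted earlier:

# undo the robust scaling so prices read in $/MMBtu again
anomaly_df[['price']] = robust.inverse_transform(anomaly_df[['price']])
anomalies = anomaly_df[anomaly_df['anomaly']]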

Finally, let’s look at the anomalies found in the testing data:
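A sketch of the plot, overlaying the anomalous points in red on the inverse-transformed test prices:

plt.figure(figsize=(15, 5))
plt.plot(anomaly_df.index, anomaly_df['price'], label='price')
plt.scatter(anomalies.index, anomalies['price'], color='r', label='anomaly')
plt.title('Detected anomalies in Henry Hub spot price (test set)')
plt.xlabel('Date_time')
plt.ylabel('Price ($/MMBtu)')
plt.legend()
plt.show()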

The red dots are the anomalies, and they cover most of the points with abrupt changes in the spot price. The threshold can be changed according to the parameters we choose, especially the cutoff value. If we play around with some of the parameters we used, such as the number of time steps, threshold cutoff, number of epochs, batch size, hidden layers etc., we can expect a different set of results.

With this we conclude a brief overview of finding anomalies in time series in the context of trading.

Key takeaways:

Though the stock market is highly efficient, it is impossible to rule out historical and long-term anomalies. For investors and fund managers, using anomalies to earn superior returns is a risk, since the anomalies may or may not persist in the future. Every reported metric therefore needs to be validated, with parameters fine-tuned, when predictions are used to detect anomalies. Also, for metrics with a different distribution of data, a different approach to identifying anomalies needs to be followed.

Connect with me here.

Note: The programs described here are experimental and should be used with caution for any commercial purpose. All such use is at your own risk.
