Published in Analytics Vidhya

Random Cut Forest with example

Random Cut Forest with code, and what you should watch out for.

This is a continuation of my previous story, where I explained the theory behind Random Cut Forest.

Photo by James Harrison on Unsplash

Hey guys! As we already know how RCF works, let's jump straight into coding. I can guarantee it will be the easiest anomaly detection code you have come across so far.

To summarize in one line: the RCF model assigns a score to each of our data points, and anomalies are decided based on that score.

To start with, I am not going to take the NYC taxi data example that AWS provides; instead, I will use the Machine Failure data from here.


Let’s import all necessary libraries

## Importing all necessary Libs that will be required
import pandas as pd
import numpy as np
import os
import warnings
import pickle
import matplotlib.pyplot as plt
import boto3
# import sagemaker
import sys
import seaborn as sns
import holoviews as hv
from holoviews import opts

We will also add some better display options:

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Loading the data:

## Loading Data
df = pd.read_csv("./data/nab_machine_failure.csv")
print("Total Number of records ---> {}".format(df.shape[0]))
print("Total Number of features ---> {}".format(df.shape[1]))

# known anomaly windows for this dataset
anomaly_points = [
    ["2013-12-10 06:25:00.000000", "2013-12-12 05:35:00.000000"],
    ["2013-12-15 17:50:00.000000", "2013-12-17 17:00:00.000000"],
    ["2014-01-27 14:20:00.000000", "2014-01-29 13:30:00.000000"],
    ["2014-02-07 14:55:00.000000", "2014-02-09 14:05:00.000000"]
]

df['timestamp'] = pd.to_datetime(df['timestamp'])

# is anomaly? : True => 1, False => 0
df['anomaly'] = 0
for start, end in anomaly_points:
    df.loc[((df['timestamp'] >= start) & (df['timestamp'] <= end)), 'anomaly'] = 1

df['year'] = df['timestamp'].apply(lambda x: x.year)
df['month'] = df['timestamp'].apply(lambda x: x.month)
df['day'] = df['timestamp'].apply(lambda x: x.day)
df['hour'] = df['timestamp'].apply(lambda x: x.hour)
df['minute'] = df['timestamp'].apply(lambda x: x.minute)

We modified the data frame by converting the full timestamp into a few individual features. Let's see what they look like.
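As a side note (my addition, not from the original post), the same features can be extracted with pandas' vectorized `.dt` accessor instead of a per-row `apply`, which is noticeably faster on large frames. A minimal sketch on two sample timestamps:

```python
import pandas as pd

# two sample timestamps standing in for the machine-failure data
sample = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2013-12-10 06:25:00", "2014-01-27 14:20:00"])})

# vectorized datetime component extraction
sample["year"] = sample["timestamp"].dt.year
sample["month"] = sample["timestamp"].dt.month
sample["day"] = sample["timestamp"].dt.day
sample["hour"] = sample["timestamp"].dt.hour
sample["minute"] = sample["timestamp"].dt.minute

print(sample[["year", "month", "day", "hour", "minute"]].values.tolist())
# → [[2013, 12, 10, 6, 25], [2014, 1, 27, 14, 20]]
```

The output is identical to the `apply` version above; only the execution is vectorized.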

df.index = df['timestamp']
df.drop(['timestamp'], axis=1, inplace=True)

Let's look at an example of what the anomaly points look like, since we explicitly labeled them from anomaly_points. For visualization I have used HoloViews; it's a great library, and I will write a story on it sometime.

anomalies = [[ind, value] for ind, value in zip(df[df['anomaly']==1].index, df.loc[df['anomaly']==1,'value'])]
(hv.Curve(df['value'], label="Temperature") * hv.Points(anomalies, label="Anomaly Points").opts(color='red', legend_position='bottom', size=2, title="Temperature & Given Anomaly Points"))\
.opts(opts.Curve(xlabel="Time", ylabel="Temperature", width=700, height=400,tools=['hover'],show_grid=True))

Now for the main event: how to apply RCF to this data and get all of our anomalies. Let's start by importing some important libraries again in order to use Random Cut Forest. As RCF is an AWS-created model, we have to load SageMaker, boto3, and friends.

from sagemaker import RandomCutForest
import boto3
import botocore
import sagemaker
import sys

Defining the role:

iam = boto3.client('iam')
role = iam.get_role(RoleName='NAB_machine_failure_analytics')['Role']['Arn']

Making the training estimator ready:

# Training
rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',   # instance and tree settings assumed from AWS's RCF example
    num_samples_per_tree=512,
    num_trees=50
)
# automatically upload the training data to s3 and run the training job
rcf.fit(rcf.record_set(df['value'].to_numpy().reshape(-1, 1)))

If you see output like this, it means your training has started. Make sure you also get the training-completed message at the end.

To predict on data we need inference, which means we have to deploy the model; the deploy method creates an inference endpoint.

rcf_inference = rcf.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')  # instance type assumed to match training

Then we have to attach a serializer and a deserializer, so we can add the scores back to the main data frame. You will end up with a data frame like this.

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

rcf_inference.serializer = CSVSerializer()
rcf_inference.deserializer = JSONDeserializer()

machine_data_numpy = df['value'].to_numpy().reshape(-1, 1)

# sanity-check the endpoint on the first few records
results = rcf_inference.predict(
    machine_data_numpy[:6], initial_args={"ContentType": "text/csv", "Accept": "application/json"}
)

# score the full dataset
results = rcf_inference.predict(machine_data_numpy)
scores = [datum["score"] for datum in results["scores"]]

# add scores to the machine data frame and print the first few values
df["score"] = pd.Series(scores, index=df.index)

Let’s see what the score shows us as anomalies.
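The post doesn't spell out how scores become anomaly flags; a common convention (the one AWS uses in its own RCF example) is to flag any point whose score is more than three standard deviations above the mean score. A minimal sketch on a synthetic `score` column, so it runs without the SageMaker endpoint:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the scored data frame from the previous step
rng = np.random.default_rng(0)
scored = pd.DataFrame({"score": rng.normal(1.0, 0.1, 1000)})
scored.loc[::200, "score"] = 2.5  # plant five obvious outliers

# flag points whose score exceeds mean + 3 * std
threshold = scored["score"].mean() + 3 * scored["score"].std()
scored["predicted_anomaly"] = (scored["score"] > threshold).astype(int)

print(scored["predicted_anomaly"].sum())  # number of flagged points
```

The three-sigma cutoff is just a starting point; on real data you would tune it against the known anomaly windows we labeled earlier.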

As you can see, this is a very straightforward way to work with anomaly detection.

Thanks, guys, for reading this. Please comment for any queries.

What Now?

There you go: you can now code a Random Cut Forest. The next blog will look into other AWS models, and will also take a closer look at HoloViews.

Thanks for reading.

If you like the article, please make sure to give a clap. Please follow me for more projects and articles on my GitHub and my Medium profile.

Don’t forget to check out the end-to-end deployment of a deep learning project with Android application development.




Tapan Kumar Patro

📚 Machine learning | 🤖 Deep Learning | 👀 Computer vision | 🗣 Natural Language processing | 👂 Audio Data | 🖥 End to End Software Development | 🖌