K-Means clustering with Apache Spark and Zeppelin notebook on Docker
Happy ML
Background
In my previous post I discussed running the K-Means machine learning algorithm with Apache Spark and Scala. That post covered running the ML model with IntelliJ IDEA on a local Spark environment. In this post I'm gonna discuss running a K-Means ML model with an Apache Zeppelin notebook on top of Docker. Apache Zeppelin is a web based notebook which we can use to run and test ML models. It provides data visualization, data exploration and sharing features which I could not get with the IntelliJ IDEA based local environment. Read more about Apache Zeppelin from here. All the source code and the dataset related to this post are available on gitlab. Please clone the repo and follow along with the post.
Zeppelin docker
I have run Zeppelin on my local machine with Docker. Following is the docker-compose.yml to run Zeppelin. It defines two docker volumes which keep the datasets and the notebooks.
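The exact compose file lives in the gitlab repo; the following is a minimal sketch of it. The image tag and the notebook volume path are assumptions, while the data volume matches the paths used later in this post.

```yaml
version: '3'
services:
  zeppelin:
    # image tag is an assumption; use the tag from the repo's compose file
    image: apache/zeppelin:0.8.1
    container_name: zeppelin
    ports:
      - "8080:8080"
    volumes:
      # dataset volume, matches the /usr/zeppelin/data path used by the notebook
      - /private/var/services/spark/zeppelin/data:/usr/zeppelin/data
      # notebook volume (host/container paths are assumptions)
      - /private/var/services/spark/zeppelin/notebook:/usr/zeppelin/notebook
```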
Dataset
I'm gonna analyze network data anomalies and find attack behaviors in the network by using K-Means clustering. To do that I had to generate an attack dataset. To generate the dataset I ran a simple Scala based HTTP REST service and did real attacks against it by using Metasploit. Then I captured all the packets coming to the server via TShark. I wrote a blog post about capturing network data with TShark here. The captured dataset is published as a .CSV file in gitlab. Please download the dataset and put it in the data directory volume of the Zeppelin docker container. Then it will be available inside the Zeppelin docker container. For example, I have put it in the /private/var/services/spark/zeppelin/data directory on my local machine.
Run Zeppelin
I can start the Zeppelin container via the docker-compose up -d zeppelin command. Then I can browse the Zeppelin web UI at the http://localhost:8080 url. The Zeppelin web UI provides an interface to create notebooks, import notebooks and visualize data.
I have created a notebook named tshark. The K-Means clustering Spark program is written in that notebook. I have used the Scala language to write the Spark job.
Load dataset
To build the K-Means model I have loaded the dataset into a Spark DataFrame. The dataset lives in the /usr/zeppelin/data/tshark.csv path inside the docker container. Following is the way to load the dataset into a DataFrame.
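The original load code is in the notebook in the gitlab repo; here is a minimal sketch of that step, assuming the CSV has a header row and that Spark can infer the column types (including the frame_time timestamp):

```scala
// read the captured tshark packets into a spark DataFrame;
// header and inferSchema options assume a CSV with a header row
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/usr/zeppelin/data/tshark.csv")

// quick sanity checks on the loaded data
df.printSchema()
df.show(5)
```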
Prepare dataset
Next I have done various data preparations and added feature columns to the dataset. I have used packet count, average packet length, average frame length, packet local anomaly and frame local anomaly as the features. To generate the features I have grouped the dataset by frame_time (in 1 minute time intervals), ip_src and frame_protocols. Then I calculated the packet count, average packet length, average frame length, packet local anomaly and frame local anomaly, and generated a new data frame. This data frame is joined with the original data frame to obtain the dataset with the full feature set. Following is the way to do that. I have added comments in each phase of the code to make things clearer.
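A sketch of the preparation step. The raw column names ip_len and frame_len, and the definition of the local anomaly features as the ratio between a packet's length and its group average, are assumptions; the notebook in the gitlab repo has the exact logic:

```scala
import org.apache.spark.sql.functions._

// phase 1: group the packets into 1 minute windows per source ip and
// protocol, and compute the aggregate features
val aggDf = df
  .groupBy(window(col("frame_time"), "1 minute"), col("ip_src"), col("frame_protocols"))
  .agg(
    count("*").as("packet_count"),
    avg("ip_len").as("avg_packet_length"),
    avg("frame_len").as("avg_frame_length")
  )
  .withColumn("frame_time_window", col("window.start"))
  .drop("window")

// phase 2: join the aggregates back to the original rows and derive the
// local anomaly features (assumed here to be the ratio of each packet's
// length to its group average)
val featureDf = df
  .withColumn("frame_time_window", window(col("frame_time"), "1 minute").getField("start"))
  .join(aggDf, Seq("frame_time_window", "ip_src", "frame_protocols"))
  .withColumn("packet_local_anomaly", col("ip_len") / col("avg_packet_length"))
  .withColumn("frame_local_anomaly", col("frame_len") / col("avg_frame_length"))
```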
K-Means model
In order to train and test the K-Means model, the dataset needs to be split into a training dataset and a test dataset. 70% of the data is used to train the model, and 30% is used for testing. The K-Means model is built with the training dataset and evaluated with the test dataset. I have used K=2 here.
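A sketch of the train/test step with Spark ML, assuming the featureDf from the previous step and a Spark 2.3+ ClusteringEvaluator for the evaluation (the notebook may evaluate differently):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.feature.VectorAssembler

// assemble the five feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("packet_count", "avg_packet_length", "avg_frame_length",
    "packet_local_anomaly", "frame_local_anomaly"))
  .setOutputCol("features")
val vectorDf = assembler.transform(featureDf)

// 70% training data, 30% test data
val Array(trainDf, testDf) = vectorDf.randomSplit(Array(0.7, 0.3), seed = 1L)

// build the K-Means model with K=2 on the training dataset
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(trainDf)

// evaluate the clustering on the test dataset via the silhouette score
val predictions = model.transform(testDf)
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"silhouette score: $silhouette")
```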
Cluster distance
Finally I have calculated the distance of each data point from its respective cluster center (it's for visualization purposes). A Spark User-Defined-Function is used to calculate the distance. I have created a new data frame with the distance and id (which is a sequence no) fields and saved it as a Spark SQL view.
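A sketch of the distance calculation, assuming euclidean distance to the assigned cluster center; monotonically_increasing_id stands in for the sequence number (it is increasing but not gapless, which is fine for plotting):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions._

// cluster centers indexed by the predicted cluster id
val centers = model.clusterCenters

// UDF: euclidean distance between a data point and the center of the
// cluster it was assigned to
val distanceUdf = udf { (features: Vector, cluster: Int) =>
  math.sqrt(Vectors.sqdist(features, centers(cluster)))
}

// build an (id, distance) data frame and register it as a SQL view
val distanceDf = model.transform(vectorDf)
  .withColumn("distance", distanceUdf(col("features"), col("prediction")))
  .withColumn("id", monotonically_increasing_id())
  .select("id", "distance")

distanceDf.createOrReplaceTempView("tshark")
```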
Visualization
Zeppelin supports data visualization via SQL queries. I can execute SQL queries against the data in the SQL view and generate various graphs/charts from the query output. Following are some example charts that I have used to visualize the distance of each data point from its respective cluster center. select id, distance from tshark is the SQL query that I have used to select the relevant data. tshark is the SQL view name that I have previously created.
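In Zeppelin the query runs in a %sql paragraph; the chart type (scatter, line, bar) is then picked from the paragraph's toolbar:

```sql
%sql
select id, distance from tshark
```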
In these graphs most of the data is scattered close to the cluster center. The points which are far away from the cluster center (exceeding the threshold) can be identified as anomaly data (attack data).
Full spark job
Following is the full source code of the Spark job. Spark jobs can be exported as JSON documents from the Zeppelin web UI. I have exported this job as a JSON document and published it in gitlab. If you want, you can download it and directly import it into the Zeppelin web UI.
References
- https://www.cloudera.com/products/open-source/apache-hadoop/apache-zeppelin.html
- https://lamastex.github.io/scalable-data-science/sds/2/2/db/999_01_StudentProject_NetworkAnomalyDetection/
- https://zeppelin.apache.org/docs/0.5.5-incubating/tutorial/tutorial.html
- https://medium.com/rahasak/k-means-clustering-with-apache-spark-cab44aef0a16
- https://metasploit.help.rapid7.com/docs/metasploitable-2-exploitability-guide
- http://www.sunlab.org/teaching/cse6250/fall2222/lab/zeppelin-tutorial/
- https://www.offensive-security.com/metasploit-unleashed/scanner-http-auxiliary-modules/