K-Means clustering with Apache Spark and Zeppelin notebook on Docker

Happy ML

(λx.x)eranga
Effectz.AI
Feb 2, 2020 · 4 min read


Background

In my previous post I discussed running the K-Means machine learning algorithm with Apache Spark and Scala. In that post the ML model ran with IntelliJ IDEA on a local Spark environment. In this post I'm gonna discuss running a K-Means ML model with an Apache Zeppelin notebook on top of Docker. Apache Zeppelin is a web-based notebook which we can use to run and test ML models. It provides data visualization, data exploration and sharing features which I could not get with the IntelliJ IDEA based local environment. Read more about Apache Zeppelin from here. All the source code and the dataset related to this post are available on GitLab. Please clone the repo and follow along with the post.

Zeppelin docker

I have run Zeppelin on my local machine with Docker. Following is the docker-compose.yml to run Zeppelin. It defines two Docker volumes which keep the datasets and the notebooks.
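The compose file in the repo is the definitive version; a minimal sketch looks roughly like this. The image tag and the container-side notebook path here are placeholders, the host paths match the data directory mentioned later in the post.

```yaml
version: '3'
services:
  zeppelin:
    image: apache/zeppelin:0.8.2          # assumed image tag, use the one from the repo
    container_name: zeppelin
    ports:
      - "8080:8080"                       # Zeppelin web UI
    volumes:
      # dataset volume, mounted at /usr/zeppelin/data inside the container
      - /private/var/services/spark/zeppelin/data:/usr/zeppelin/data
      # notebook volume, so notebooks survive container restarts
      - /private/var/services/spark/zeppelin/notebook:/usr/zeppelin/notebook
```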

Dataset

I’m gonna analyze network data anomalies and find attack behaviors in the network by using K-Means clustering. To do that I had to generate an attack dataset. To generate the dataset I ran a simple Scala based HTTP REST service and did real attacks on it by using Metasploit. Then I captured all the packets coming to the server via TShark. I wrote a blog post about capturing network data with TShark in here. The captured dataset is published as a .CSV file on GitLab. Please download the dataset and put it in the data directory volume of the Zeppelin Docker container. Then it will be available inside the Zeppelin Docker container. For example, I have put it in the /private/var/services/spark/zeppelin/data directory on my local machine.
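The exact capture command is in the TShark post linked above; a rough sketch of how such a CSV can be produced with TShark's field extraction looks like this (the interface, capture filter and field list here are placeholders):

```bash
# capture traffic hitting the REST service and write selected fields as CSV
tshark -i eth0 -f "tcp port 8080" \
  -T fields -E header=y -E separator=, \
  -e frame.time -e ip.src -e ip.dst -e frame.len -e frame.protocols \
  > tshark.csv
```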

Run Zeppelin

I can start the Zeppelin container via the docker-compose up -d zeppelin command. Then I can browse the Zeppelin web UI at the http://localhost:8080 URL. The Zeppelin web UI provides an interface to create notebooks, import notebooks and visualize data.

I have created a notebook named tshark. The K-Means clustering Spark program is written in that notebook. I have used the Scala language to write the Spark job.

Load dataset

To build the K-Means model I have loaded the dataset into a Spark DataFrame. The dataset is located at the /usr/zeppelin/data/tshark.csv path inside the Docker container. Following is the way to load the dataset into a DataFrame.
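A minimal sketch of that notebook paragraph (in Zeppelin the %spark interpreter already provides the spark session; the header/schema options are assumptions about the CSV layout):

```scala
%spark

// load the captured packet data from the mounted data volume into a DataFrame
val df = spark.read
  .option("header", "true")       // the CSV is assumed to carry a header row
  .option("inferSchema", "true")  // let Spark guess the column types
  .csv("/usr/zeppelin/data/tshark.csv")

df.printSchema()
df.show(5)
```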

Prepare dataset

Next I have done various data preparations and added feature columns to the dataset. I have used packet count, average packet length, average frame length, packet local anomaly and frame local anomaly as the features. To generate the features I have grouped the dataset by frame_time (in 1 minute time intervals), ip_src and frame_protocols. Then I calculated the packet count, average packet length, average frame length, packet local anomaly and frame local anomaly and generated a new data frame. This data frame is joined with the original data frame to obtain the dataset with the full feature set. Following is the way to do that. I have added comments in each phase of the code to make things clearer.
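The notebook in the repo has the exact feature calculation; the sketch below only shows the general shape of it. The column names (frame_time, ip_src, frame_protocols, ip_len, frame_len) and the local-anomaly formula (relative deviation of each packet from its group average) are assumptions:

```scala
import org.apache.spark.sql.functions._

// make sure the capture timestamp is a proper timestamp column
val withTime = df.withColumn("frame_time", to_timestamp(col("frame_time")))

// bucket each packet into a 1 minute window
val bucketed = withTime.withColumn("time_bucket", window(col("frame_time"), "1 minute"))

// aggregate per window / source ip / protocol stack
val aggDf = bucketed
  .groupBy("time_bucket", "ip_src", "frame_protocols")
  .agg(
    count("*").as("packet_count"),
    avg("ip_len").as("avg_packet_length"),
    avg("frame_len").as("avg_frame_length")
  )

// join the aggregates back to the original rows and derive the local anomaly
// features as each packet's relative deviation from its group averages
val featureDf = bucketed
  .join(aggDf, Seq("time_bucket", "ip_src", "frame_protocols"))
  .withColumn("packet_local_anomaly",
    abs(col("ip_len") - col("avg_packet_length")) / col("avg_packet_length"))
  .withColumn("frame_local_anomaly",
    abs(col("frame_len") - col("avg_frame_length")) / col("avg_frame_length"))
```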

K-Means model

In order to train and test the K-Means model, the dataset needs to be split into a training dataset and a test dataset. 70% of the data is used to train the model, and 30% is used for testing. The K-Means model is built with the training dataset and evaluated with the test dataset. I have used K=2 here.
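A sketch of the training paragraph, assuming the featureDf from the previous step and the five feature columns named above (the ClusteringEvaluator line is one way to evaluate the clusters and may differ from the exact evaluation in the notebook):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// assemble the engineered feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("packet_count", "avg_packet_length", "avg_frame_length",
    "packet_local_anomaly", "frame_local_anomaly"))
  .setOutputCol("features")
val vectorDf = assembler.transform(featureDf)

// 70% of the data to train, 30% to test
val Array(trainDf, testDf) = vectorDf.randomSplit(Array(0.7, 0.3), seed = 1L)

// train K-Means with K=2 (normal traffic vs. attack traffic)
val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("prediction")
val model = kmeans.fit(trainDf)

// assign every test record to a cluster and evaluate the clustering
val predictions = model.transform(testDf)
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"silhouette on test data: $silhouette")
```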

Cluster distance

Finally I have calculated the distance of each data point from its respective cluster center (for visualization purposes). A Spark User-Defined Function is used to calculate the distance. I have created a new data frame with the distance and id (which is a sequence number) fields and saved it as a Spark SQL view.
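A sketch of that paragraph, assuming the model and predictions from the previous step (the id here comes from monotonically_increasing_id, which is unique and increasing but not a strictly consecutive sequence):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions._

// cluster centers of the trained model, indexed by cluster id
val centers = model.clusterCenters

// UDF: euclidean distance between a data point and its assigned cluster center
val distanceUdf = udf { (features: Vector, cluster: Int) =>
  math.sqrt(Vectors.sqdist(features, centers(cluster)))
}

// build a frame with an id and the distance, and expose it as a SQL view
val distanceDf = predictions
  .withColumn("distance", distanceUdf(col("features"), col("prediction")))
  .withColumn("id", monotonically_increasing_id())
  .select("id", "distance")

distanceDf.createOrReplaceTempView("tshark")
```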

Visualization

Zeppelin supports data visualization via SQL queries. I can execute SQL queries against the data in the SQL view and generate various graphs/charts from the query output. Following are some example charts that I have used to visualize the distance of each data point from its respective cluster center. select id, distance from tshark is the SQL query that I have used to select the relevant data. tshark is the SQL view name that I previously created.
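In Zeppelin this goes into a %sql paragraph, and the chart type (scatter, bar, line etc.) is picked from the toolbar under the result:

```sql
%sql
select id, distance from tshark
```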

In these graphs most of the data points are scattered close to the cluster centers. The points which lie far away from the cluster center (exceeding the threshold) can be identified as anomaly data (attack data).

Full spark job

Following is the full source code of the Spark job. Spark jobs can be exported as JSON documents from the Zeppelin web UI. I have exported this job as a JSON document and published it on GitLab. If you want, you can download it and directly import it into the Zeppelin web UI.

References

  1. https://www.cloudera.com/products/open-source/apache-hadoop/apache-zeppelin.html
  2. https://lamastex.github.io/scalable-data-science/sds/2/2/db/999_01_StudentProject_NetworkAnomalyDetection/
  3. https://zeppelin.apache.org/docs/0.5.5-incubating/tutorial/tutorial.html
  4. https://medium.com/rahasak/k-means-clustering-with-apache-spark-cab44aef0a16
  5. https://metasploit.help.rapid7.com/docs/metasploitable-2-exploitability-guide
  6. http://www.sunlab.org/teaching/cse6250/fall2222/lab/zeppelin-tutorial/
  7. https://www.offensive-security.com/metasploit-unleashed/scanner-http-auxiliary-modules/
