K-Means clustering with Apache Spark

Happy ML

(λx.x)eranga
Effectz.AI
4 min read · Jun 27, 2019


This is the first part of my Happy ML blog series. In this post I will discuss machine learning basics and the K-Means unsupervised machine learning algorithm with an example. The second part of this blog series, which discusses the Logistic Regression algorithm, can be found here.

About Machine Learning

Machine learning uses algorithms to find patterns in data. It first builds a model based on the patterns in existing/historical data, then uses this model to make predictions on newly generated live data. In general, machine learning can be categorized into three main categories: Supervised, Unsupervised and Reinforcement machine learning.

Supervised machine learning, also identified as Predictive Modeling, builds a model on labeled data (data with defined categories or groups). Classification and Regression are the two types of problems in supervised machine learning. Decision Tree, Linear Regression and Logistic Regression are some examples of supervised machine learning algorithms. Unsupervised machine learning finds patterns in unlabeled data (data without defined categories or groups). It deals with two types of problems, Clustering and Dimensionality Reduction. Examples of unsupervised machine learning algorithms are K-Means, K-Medoids and Feature Selection. Reinforcement machine learning learns from feedback: an agent takes actions in an environment and improves its behavior based on the rewards it receives. Since there are several machine learning algorithms available, we have to choose the right algorithm to solve our problem. This article describes the available machine learning algorithms and their application scenarios.

In this post I’m gonna use the K-Means algorithm to build a machine learning model with Apache Spark (if you are new to Apache Spark, you can find more information here). The K-Means model clusters the Uber trip data based on the trip attributes. Then this model can be used to do real-time analysis of new Uber trips. All the source code and the dataset related to this post are available in the GitLab repo. Please clone the repo and follow along with the post.

About K-Means

K-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms. The goal of this algorithm is to find groups in the data, with the number of groups/clusters represented by the variable K. The K-Means algorithm iteratively allocates every data point to the nearest cluster based on its features. In every iteration of the algorithm, each data point is assigned to its nearest cluster based on some distance metric, usually Euclidean distance. The outputs of the K-Means clustering algorithm are the centroids of the K clusters and the labels of the training data. Once the algorithm has run and identified the groups in a data set, any new data can be easily assigned to a group.
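The assign-and-update loop described above can be sketched in plain Python. This is only a toy illustration of the algorithm, not the Spark implementation used later in the post; the point names and data are made up.

```python
import random

def kmeans(points, k, iterations=20, seed=7):
    """Toy K-Means on 2-D points using squared Euclidean distance."""
    random.seed(seed)
    centroids = random.sample(points, k)  # pick k initial centroids
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(x for x, _ in cl) / len(cl),
                                sum(y for _, y in cl) / len(cl))
    return centroids

# two well-separated blobs: the centroids converge near the two blob means
pts = [(0.1, 0.2), (0.0, -0.1), (-0.2, 0.1),
       (10.1, 9.9), (9.8, 10.2), (10.0, 10.1)]
print(sorted(kmeans(pts, 2)))
```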

The K-Means algorithm can be used to identify unknown groups in complex and unlabeled data sets. Following are some business use cases of K-Means clustering.

  1. Customer segmentation based on purchase history
  2. Customer segmentation based on interest
  3. Insurance fraud detection
  4. Transaction fraud detection
  5. Detect unauthorized IoT devices based on network traffic
  6. Identify crime localities
  7. Group inventory by sales

Uber data set

As mentioned previously, I’m gonna use K-Means to build a model from Uber trip data. This model clusters Uber trips based on trip attributes/features (lat, lon etc.). The Uber trip data set is available in the GitLab repo as a .CSV file. Following is the structure/schema of a single Uber trip record.

Load data set

To build the K-Means model from this data set, first we need to load it into a Spark DataFrame. Following is the way to do that. It loads the data into a DataFrame from the .CSV file based on the schema.

Add feature column

We need to transform the features in the DataFrame records (the lat and lon values of each record) into a feature vector. In order for the features to be used by a machine learning algorithm, this vector needs to be added as a feature column of the DataFrame. Following is the way to do that with VectorAssembler.

Build K-Means model

Next we can build the K-Means model by defining the number of clusters, the feature column and the output prediction column. In order to train and test the K-Means model, the data set needs to be split into a training data set and a test data set: 70% of the data is used to train the model, and 30% is used for testing.

Save K-Means model

The built model can be persisted to disk in order to use it later, for example with a Spark Streams application to detect the clusters of real-time Uber trips.

Use K-Means model

Finally, the K-Means model can be used to detect the clusters/categories of new data (e.g. real-time Uber trip data). The following example shows how to detect the clusters of sample records in a DataFrame.

Reference

  1. https://www.quora.com/What-is-machine-learning-in-laymans-terms-1
  2. https://www.goodworklabs.com/machine-learning-algorithm/
  3. https://mapr.com/blog/apache-spark-machine-learning-tutorial/
  4. https://mapr.com/blog/fast-data-processing-pipeline-predicting-flight-delays-using-apache-apis-pt-1/
  5. https://www.datascience.com/blog/k-means-clustering
  6. https://medium.com/rahasak/hacking-with-apache-spark-f6b0cabf0703
  7. https://medium.com/rahasak/hacking-with-spark-dataframe-d717404c5812
