Streaming ML: Perceptron Algorithm

Machine learning has become mainstream in big data analysis and improving exponentially breaking lot of barriers with the help of modern computation capability. As the vast popularity of Internet of Things, large amount of data is acquired over time. Instead of waiting until the data is collected, streaming data analysis bring us a great opportunity to analyse and predict the data on the spot. When the data change over the time streaming analysis adapt and scale when it’s impractical to store raw data and while persist on smaller and focused representation.

Batch Processing vs Streaming

Batch processing is used for the large volume of data while streaming is suitable for data with high velocity. As the growth of the IoT, large amount of sensor data is gathered and shared rapidly overtime. So there is a huge demand has emerged for the streaming data analysis. Batch processing is efficient if we used method like mapreduce to distribute the data across the clusters to be processed in batch processing. But when it comes to streaming data, it should be able to handle data with extremely low latency for workloads that must be processed with minimal delay.

Traditional Data Mining Algorithms work with a static dataset and the algorithm can afford to read the data several times on the other hand Stream Mining only can afford to read the data once so the algorithms for this subfield of data mining are based on a single scan. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data.

Implement the Perceptron

Building the hypothesis function h(x)

x(i) = input variable
y(i) = target variable

approximate y as a linear function of x:

h(x) = predicted value
θ = Weights values

let x0 = 1 (intercept term)

Cost Function

Here we use gradient descent algorithms to minimize the cost function.

gradient descent algorithm

α = Learning Rate

But as the Widrow-Hoff learning rule or LMS update rule for a single training example:


Stochastic Gradient Descent (incremental gradient descent)

In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step — a costly operation if m is large — stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch gradient descent. (Note however that it may never “converge” to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

However, it has been shown that SGD almost surely converges to the global cost minimum if the cost function is convex (or pseudo-convex).

Improve Stochastic Gradient Descent based learning

  • An adaptive learning rate η Choosing a decrease constant d that shrinks the learning rate over time:
  • Momentum learning by adding a factor of the previous gradient to the weight update for faster updates:
  • Randomly shuffle samples in the training set.


I used Iris Data Set from UC Irvine Machine Learning Repository for the calculate the accuracy of the model. Which can be downloaded from this link

Note — As this dataset is used for the binary classification, I preprocessed the dataset accordingly.