Try to kill batch processing with unified log stream processing...

Mukesh Kumar
logika.io
Aug 2, 2017

Logs in your application are an abstract view of its functionality and behavior, whether the application is a web application or a cryptocurrency system.

I love analogies, and here I compare the logs of any system to white blood cells. White blood cells in our body help us fight infection by recognizing invading particles before they cause disease.

Let's start with: what is a log?

It is a structured record of errors, requests, and other messages, written in sequence to a set of rotating text files. A log is not all that different from a file or a table: a file is an array of bytes, a table is an array of records, and a log is really just a kind of table or file where the records are sorted by time. We can read logs manually to see what is happening in an application.
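To make this concrete, here is a tiny sketch of a log as an append-only, time-ordered sequence of records (my own illustrative code, not a production implementation):

```java
import java.util.ArrayList;
import java.util.List;

// A log is just an append-only sequence of records, ordered by the time they were appended.
class SimpleLog {
    record Entry(long offset, long timestampMillis, String payload) {}

    private final List<Entry> entries = new ArrayList<>();

    // Appends always go to the end; the offset is simply the position in the sequence.
    synchronized long append(String payload) {
        long offset = entries.size();
        entries.add(new Entry(offset, System.currentTimeMillis(), payload));
        return offset;
    }

    // Reads are by offset, so any reader can replay the log from any position it likes.
    synchronized Entry read(long offset) {
        return entries.get((int) offset);
    }
}
```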

This definition becomes more complicated in a distributed environment. When many services or servers are joined together, each generating its own logs, that log information becomes difficult to manage, and we end up relying on graphs and dashboards to understand the behavior of our application.

Logs in Distributed Systems: -

The log-centric approach to distributed systems arises from a simple observation that we can call the state machine replication principle.

If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.
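As a toy illustration of this principle, consider two deterministic counter replicas that consume the same input log; they are guaranteed to end in the same state (again, an illustrative sketch of my own):

```java
import java.util.List;

// A deterministic state machine: its next state depends only on its current state and the input.
class CounterReplica {
    private long state = 0;

    void apply(long input) {
        state += input;   // deterministic transition
    }

    long state() { return state; }
}

public class ReplicationPrinciple {
    public static void main(String[] args) {
        List<Long> log = List.of(5L, -2L, 10L);   // the shared, ordered input log

        CounterReplica a = new CounterReplica();
        CounterReplica b = new CounterReplica();

        // Both replicas apply the same inputs in the same order...
        log.forEach(a::apply);
        log.forEach(b::apply);

        // ...so they end in the same state: 13 == 13.
        System.out.println(a.state() + " == " + b.state());
    }
}
```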

The distributed systems literature commonly distinguishes two broad approaches to processing and replication using logs.

The state machine model and the primary-backup model.

The primary-backup model

In the primary-backup model, a master node is chosen to handle all reads and writes. Each write is posted to the log. Slaves subscribe to the log and apply the changes that the master executed to their local state. If the master fails, a new master is chosen from among the slaves. In the state machine replication model, all nodes are peers: writes go first to the log, and all nodes apply each write in the order determined by the log.

The primary-backup model vs The state machine model

The state machine model

The state machine model usually refers to an active-active model, where we keep a log of the incoming requests and each replica processes every request in log order. The primary-backup model is a slight modification of this: one replica is elected as the leader, and the leader processes requests in the order they arrive and logs the changes to its state that result from processing them. The other replicas apply the state changes the leader makes, so that they stay in sync and are ready to take over as leader if it fails.

The log also acts as a buffer that makes data production asynchronous from data consumption. This matters when multiple subscribers consume at different rates. I use the term "log" here instead of "messaging system" or "pub-sub" because it is much more specific about semantics and a much closer description of what we need in a practical implementation to support data replication. The critical point is that the destination system only knows about the log and does not know any details of the system of origin. The consumer need not concern itself with whether the data came from a relational database, a key-value store, or was generated directly by some application.
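With Apache Kafka, for example, a source system simply appends records to a topic and never learns who will read them. Here is a minimal producer sketch; the topic name "events", the broker address, and the serializers are assumptions chosen for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only knows about the log (the "events" topic); it knows nothing
            // about which downstream systems will read these records, or when.
            producer.send(new ProducerRecord<>("events", "user-42", "{\"action\":\"click\"}"));
            producer.flush();
        }
    }
}
```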

How about creating a unified log:-

My aim in writing this blog is to give you an idea of how to simplify practical systems using log-centric design.

A fully connected architecture has separate pipelines

Now, what if we try to make something more generic, like below:-

Unified log centric approach

To achieve this, we need to isolate each consumer from the source of the data. The consumer should ideally integrate with just a single data repository that gives it access to everything. The idea is that adding a new data system, whether a data source or a data destination, should create integration work only to connect it to a single pipeline, instead of to each consumer of data.
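Continuing the Kafka sketch above, a new consumer integrates with the log alone: it subscribes to the topic and does not care which systems produced the data (topic and consumer-group names are again assumptions):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        props.put("group.id", "reporting-service");          // each consumer group reads independently
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The only integration point is the log itself: the "events" topic.
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```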

Kafka is somewhat unique in achieving this architecture as an infrastructure product: it is neither a database, nor a log file collection system, nor a traditional messaging system.

There are already a few products out there that take this approach.

So far I have only described what amounts to a fancy method of copying data from place to place. However, copying bytes between storage systems is not the end of the story. It turns out that "log" is another word for "stream", and logs are at the heart of stream processing.

But wait, what exactly is stream processing?

OK, you have probably heard a definition of stream processing like this: stream processing is a model where you process all your data immediately and then throw it away.

But stream processing is really a model for continuous data processing, with the ability to compute results at low latency. The role of stream processing emerges when we collect data continuously: it is then natural to process it continuously as well.

One common area of confusion about stream processing is the belief that certain kinds of processing cannot be done in a stream processing system and must be done in batch. A typical example is computing percentiles, maximums, averages, or other summary statistics that require seeing all the data. But this somewhat confuses the issue. It is true that computing, say, the maximum is a blocking operation that requires seeing all the records in the window in order to choose the biggest record. But this kind of computation can absolutely be carried out in a stream processing system. Indeed, if you look at the earliest academic literature on stream processing, virtually the first thing it does is give precise semantics to windowing, so that blocking operations over a window are still possible.
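Here is a rough sketch of that idea: a tumbling-window maximum maintained incrementally as records arrive, with no real stream-processing framework involved (illustrative code of my own):

```java
import java.util.HashMap;
import java.util.Map;

// Tumbling-window maximum: a "blocking" aggregate computed per window, not over the whole stream.
class WindowedMax {
    private final long windowSizeMillis;
    private final Map<Long, Long> maxPerWindow = new HashMap<>();

    WindowedMax(long windowSizeMillis) {
        this.windowSizeMillis = windowSizeMillis;
    }

    // Each incoming record updates the running maximum of its own time window.
    void accept(long eventTimeMillis, long value) {
        long windowStart = (eventTimeMillis / windowSizeMillis) * windowSizeMillis;
        maxPerWindow.merge(windowStart, value, Math::max);
    }

    // When a window is considered complete, its maximum can be emitted downstream.
    Long emit(long windowStart) {
        return maxPerWindow.remove(windowStart);
    }
}
```

A real framework such as Kafka Streams or Flink adds event-time handling, late-data policies, and fault tolerance on top of this, but the core point stands: the blocking operation is scoped to a window, not to the unbounded stream.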

When data is collected in batches, it is almost always due to some manual step or lack of digitization, or it is a historical relic left over from the automation of some nondigital process.

A modern web company does not need batch data collection, yet many data transfer processes still depend on periodic dumps, bulk transfer, and integration. As these processes are replaced with continuous feeds, we naturally move toward continuous processing, which smooths out the processing resources needed and reduces latency.

Logs and Stream processing: -

Why do you need a log at all in stream processing? Why not have the processors communicate more directly using simple TCP or other lighter-weight messaging protocols?

First, it makes each data set multi-subscriber: each stream processing input is available to any processor that wants it, and each output is available to anyone who needs it.

The second use of the log is to ensure that order is maintained in the processing done by each consumer of data.

The third use of the log is arguably the most important: it provides buffering and isolation to the individual processes. If a processor produces output faster than its downstream consumer can keep up with, we would otherwise have serious problems.
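The log absorbs that difference in rates. A rough, self-contained sketch of my own: each consumer keeps its own offset into the shared log, so a slow consumer simply lags behind instead of blocking the producer or the other consumers:

```java
import java.util.ArrayList;
import java.util.List;

// Each consumer keeps its own offset into the shared log, so consumers are isolated
// from each other and from the producer's rate.
class BufferedLogDemo {
    static final List<String> log = new ArrayList<>();   // the shared, append-only log

    static class Consumer {
        long offset = 0;   // position of this consumer, independent of all others

        // Pull as many records as this consumer can handle now; the rest stay buffered in the log.
        void poll(int maxRecords) {
            while (offset < log.size() && maxRecords-- > 0) {
                System.out.println("consumed: " + log.get((int) offset++));
            }
        }
    }

    public static void main(String[] args) {
        Consumer fast = new Consumer();
        Consumer slow = new Consumer();

        for (int i = 0; i < 10; i++) log.add("record-" + i);   // the producer runs ahead

        fast.poll(10);   // the fast consumer catches up immediately
        slow.poll(2);    // the slow consumer lags at offset 2; nothing is lost, nothing blocks
    }
}
```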

We have developed a web log analysis system in this style; we believe it works well for that one use case, and we hope it will work well for others too.

In the next article we will discuss the Lambda Architecture and alternatives to it.


Mukesh Kumar
logika.io

Apart from big data, which is my full-time profession, I am a robotics hobbyist and enthusiast… My website: http://ammozon.co.in/