The case for Realtime Stream Processing

Merrin Kurian · Published in The Startup · Jun 11, 2020


Photo by fabio on Unsplash

Ever since I got interested in providing insights based on data already available in the database, I’ve been looking for the infrastructure to provide them. Insights come in many categories; the ones I am interested in are those powered by heavy-duty analytical queries that you wouldn’t want to run against your online transactional database. For a very long time the alternative was the data warehouse: powerful, large-scale analytical databases that store facts and dimensions, with reports generated from them. These processes run in offline batch mode, so the number of jobs is limited by the available infrastructure and the window in which they can run. That was the state of affairs before the public cloud became popular.

Now it is only a matter of provisioning additional hardware to schedule more jobs, provided you can also make the data available to those jobs. In this problem space, it is not only about adding compute; it is also about making data available on those compute nodes, so tasks can be broken into smaller subtasks and executed in parallel, as the sketch below illustrates. Hadoop and MapReduce made this paradigm popular, although heavy-duty analytical databases such as Vertica, Netezza, and Redshift perform these operations as well.
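To make that decomposition concrete, here is a minimal sketch of the map-reduce idea using a word count, the canonical example. This is not Hadoop’s actual API; Python’s `ProcessPoolExecutor` stands in for a cluster of compute nodes, and the input list stands in for data that has already been distributed to them.

```python
# A sketch of the map-reduce decomposition: split the input into chunks,
# map each chunk independently (the parallelizable part), then reduce
# the partial results into a final answer.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

def map_chunk(lines: list[str]) -> Counter:
    """Map step: count words within one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return a + b

if __name__ == "__main__":
    lines = ["stream processing", "batch processing", "stream insights"]
    # Break the task into smaller subtasks and run them in parallel,
    # as a cluster scheduler would across many nodes.
    chunks = [lines[i::2] for i in range(2)]
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = reduce(reduce_counts, partials, Counter())
    print(total)
```

The point of the pattern is that each map task touches only its own chunk, so adding hardware adds throughput, but only if the data for each chunk is already local to the node doing the work.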

With ever-increasing data volumes, however, batch jobs could only catch up so much. There was still a lag between events and insights…
