Events and Data Processing: 10 Must-Learn Topics

Jayant Kumar
2 min read · Dec 29, 2019

There has been tremendous progress in event and large-scale data processing (batch or stream) over the last two decades, leading to many successful novel applications such as feed-based apps (e.g., LinkedIn, Facebook, Quora), ride-sharing apps (e.g., Uber, Lyft), home and travel booking (e.g., Airbnb, Expedia), and real-time transactions in financial apps (banks, credit cards, E*TRADE).

I recently realized that even for data science (DS) and machine learning (ML) engineers, whose main job is data analysis and modeling, a better understanding of these topics helps in designing end-to-end pipelines for many ML-based applications.

I started learning about these topics and came up with a list of resources (mainly YouTube videos) that might be useful for folks working on productionizing DS and ML-based applications. Many of these are open-source projects (initiated and supported by big companies), so their adoption and progress are more community-driven, which tends to keep the quality high.

Apache Kafka

Intro to Streams | Apache Kafka

Apache Beam over Apache Kafka Stream processing

Hive Tables (Data lake)

Cassandra feature store

Apache Spark

Spark vs. MapReduce

Apache Flink

Adopting stream processing

What is NoSQL and how are NoSQL databases different?

MongoDB Crash Course

Elasticsearch in Action

Lastly, I want to add TFX (TensorFlow Extended), which is aimed at productionizing ML models and has Data Ingestion and Data Validation components that connect some of the above topics to a model in production.

TensorFlow Extended

Disclaimer: I am not an expert on these topics, but I found these resources useful while learning as an ML engineer.

Jayant Kumar

I am passionate about technology and how it impacts our daily life. I am a computer vision and applied machine learning researcher/engineer/leader.