Real-Time Big Data Processing with Spark and MemSQL

Global IT News
Feb 23, 2017

Background
I got the opportunity to work extensively with big data and analytics at Myntra, an e-commerce store based in India. Data-driven intelligence is one of Myntra's core values, so crunching and processing data and reporting meaningful insights back to the company is of utmost importance.

Every day, millions of users visit Myntra via the app or website, generating billions of clickstream events. The data platform team has to scale to this volume of incoming events, ingest them in real time with minimal or no loss, and process the unstructured or semi-structured data to generate insights.

We use a varied set of technologies and in-house products to achieve the above, including Go, Kafka, Secor, Spark, Scala, Java, S3, Presto, and Redshift.

Motivation
As more and more business decisions came to be based on data and insights, batch and offline reporting was simply not enough. We required real-time user behavior analysis, real-time traffic, real-time notification performance, and more, all available with minimal latency. We needed to ingest, filter, and process data in real time, and to persist it in a write-fast, performant data store for dashboarding and reporting.

Meterial is a pipeline that does exactly this and more, with a feedback loop that lets other teams act on the data in real time.
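
The article doesn't go into Meterial's internals, but as a rough illustration of the kind of pipeline described above, here is a minimal Spark Structured Streaming sketch in Scala that reads clickstream events from Kafka, filters them, and persists each micro-batch into MemSQL over its MySQL-compatible JDBC interface. The broker address, topic and table names, and the event schema are placeholders for illustration, not Myntra's actual configuration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumes the spark-sql-kafka-0-10 package and a MySQL JDBC driver are on the
// classpath (MemSQL speaks the MySQL wire protocol, so a plain JDBC sink works
// for this sketch).
object ClickstreamToMemSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clickstream-to-memsql")
      .getOrCreate()
    import spark.implicits._

    // Illustrative event schema; real clickstream events carry many more fields.
    val eventSchema = new StructType()
      .add("userId", StringType)
      .add("eventType", StringType)
      .add("pageUrl", StringType)
      .add("eventTime", TimestampType)

    // Ingest raw events from Kafka (broker and topic names are placeholders).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "clickstream-events")
      .load()

    // Parse the JSON payload and keep only the event types we want to report on.
    val events = raw
      .select(from_json($"value".cast("string"), eventSchema).as("e"))
      .select("e.*")
      .filter($"eventType".isin("page_view", "add_to_cart", "purchase"))

    // Persist each micro-batch into MemSQL so dashboards can query it immediately.
    def writeBatch(batch: DataFrame, batchId: Long): Unit =
      batch.write
        .format("jdbc")
        .option("url", "jdbc:mysql://memsql-host:3306/analytics")
        .option("dbtable", "clickstream_events")
        .option("user", "app")
        .option("password", sys.env.getOrElse("MEMSQL_PASSWORD", ""))
        .mode("append")
        .save()

    events.writeStream
      .foreachBatch(writeBatch _)
      .option("checkpointLocation", "/tmp/checkpoints/clickstream-to-memsql")
      .start()
      .awaitTermination()
  }
}
```

Using foreachBatch keeps the sink pluggable: the same job could write to a different store by swapping only the batch-write function.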

Read more at https://dzone.com/articles/realtime-big-data-processing-with-spark-and-memsql
