Introducing Saltside Data Platform

Pramod Niralakeri
Saltside Engineering
4 min read · Nov 22, 2019

Data, yes data: a new language spoken by every industry in the modern era. Why, and how, does data play an important role in any business? Because data helps you drive your business. It gives you a kick start: where to start, whom to target, what to improve, when to make the call, and what not.

Since data helps you in so many ways, be it decision making, measuring your KPIs, or something else, it is important to know how we consume the data, how we transform it, and how we look at it (a single data point may have multiple dimensions).

Today we’ll showcase our data platform at Saltside and discuss how we’re building it. We’ll also talk about the traditional data pipeline and the various technologies to consider before you jump into building one.

Data goes through various stages during its journey in a data pipeline, where it gets stored and undergoes various transformations.

From a very high-level view, what does a pipeline look like?

You have source data coming in; the data gets stored and transformed in your pipeline, and finally it goes out to help your business and its stakeholders.

The following technologies are good to know when building a data pipeline. Which specific tools and frameworks to choose depends on the use case you are trying to solve.

  • Storage: A data storage system holds your incoming data. It is important because you can reprocess the data if it is stored somewhere, and multiple consumers can read it from one source. Over time you can also build warehouses and data lakes on top of it.
  • Processing engine: Processes the data; this is where most of the data transformation and business logic lives.
  • Scheduler: Schedules the periodic jobs.
  • BI tool(s): Help you visualise your data.

Now, coming to the interesting part: how did we start the data analytics journey at a small/mid-sized organisation like Saltside? From the early stage of our data platform, we had hundreds of thousands of events coming in every day, so we took the traditional approach to pipeline development to process them.

Our data pipeline at Saltside looks like this:

Data pipeline at Saltside

Obviously, it’s a very high-level design, so let’s break it into pieces to understand it better.

Kafka, message queuing system.

We have various data sources; let’s call them upstream data generators. These upstream data generators push events to the Kafka brokers using a Kafka publisher. So let’s look at the Kafka architecture first.

Kafka architecture

A full walkthrough of the Kafka architecture above is out of scope for this post, but in short, Kafka is a message queuing system. So let’s talk about our Kafka implementation a bit. We have multiple topics in Kafka, each keeping homogeneous data in one place, divided into partitions based on the desired business logic, with a replication factor for disaster recovery in case a node fails.

Some of the basic configuration of our Kafka cluster is shown below.
Note that some of these configs may differ for a specific topic.

9-node Kafka cluster across all the markets, with
30+ topics
replication factor of 3
3 partitions
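
As a small concrete illustration, here is a minimal sketch of how a topic with the partition and replication settings above could be created, and how an upstream data generator might publish an event to it, using the open-source kafka-python client. The broker address, topic name and event fields are hypothetical.

import json
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BOOTSTRAP = "kafka-broker-1:9092"  # hypothetical broker address

# Create a topic with the partition/replication settings described above.
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
admin.create_topics([NewTopic(name="ad-view-events", num_partitions=3, replication_factor=3)])

# An upstream data generator publishing one event as JSON.
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("ad-view-events", {"event": "ad_viewed", "market": "lk", "ts": 1574380800})
producer.flush()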

Apache Storm, processing engine.

Now that we have data stored in Kafka, the next step is to consume it, transform it according to our needs, and get it into the required shape. For that we use Apache Storm as our processing engine, a real-time distributed processing framework.

Again, it’s a very high-level design. We consume data from Kafka using KafkaSpout in our Storm topologies, and we have multiple topologies consuming data from the various Kafka topics.

36-node Storm cluster, with
30+ topologies processing hundreds of thousands of messages from Kafka
each topology solving a different business use case
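
Our topologies are built on Storm’s own APIs, but the consume-and-transform pattern each one implements can be sketched in plain Python. The snippet below is only an illustration of that pattern, using the kafka-python consumer in place of KafkaSpout; the topic name and event fields are hypothetical.

import json
from kafka import KafkaConsumer

# Simplified stand-in for a Storm topology: read raw events,
# apply the business transformation, and emit enriched records downstream.
consumer = KafkaConsumer(
    "ad-view-events",                         # hypothetical topic
    bootstrap_servers="kafka-broker-1:9092",  # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    group_id="ad-view-transformer",
)

def transform(event):
    # Example business logic: keep only the fields reporting needs
    # and derive a day bucket for the warehouse.
    return {
        "market": event["market"],
        "event": event["event"],
        "event_day": event["ts"] // 86400,
    }

for message in consumer:
    enriched = transform(message.value)
    # In the real pipeline a Storm bolt would emit this record onward
    # (eventually landing in S3); here we simply print it.
    print(enriched)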

Redshift, data storage system.

The final bit: once we have sliced and diced the data in Apache Storm, the next step is to store it in the data warehouse. But we don’t write it directly to Redshift; instead, Storm uploads the data to S3 (Simple Storage Service from AWS) and we then do a Redshift bulk load from S3. Why keep a copy in S3? It serves as a backup and also builds up our data lake.

Redshift with 64 TiB of storage for analytics; the rest of the data is stored in the S3 data lake.
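
A rough sketch of that S3-then-bulk-load step: boto3 uploads the transformed batch (which doubles as the data-lake copy), and a plain SQL COPY, issued here through psycopg2, loads the same file into Redshift. Bucket, table, file and IAM role names are all hypothetical.

import boto3
import psycopg2

# 1) Upload the transformed batch to S3 (this also becomes the data-lake copy).
s3 = boto3.client("s3")
s3.upload_file("ad_views_batch.json", "saltside-data-lake", "ad_views/2019/11/22/batch.json")

# 2) Bulk load the same file into Redshift with COPY.
conn = psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                        dbname="analytics", user="etl", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY analytics.ad_views
        FROM 's3://saltside-data-lake/ad_views/2019/11/22/batch.json'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto';
    """)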

After the data lands in Redshift, we run periodic ETL (Extract-Transform-Load) jobs triggered from Apache Airflow (an Apache tool for job scheduling). These jobs are written in SQL and triggered from simple Python scripts within Airflow.
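
Here is a minimal sketch of how one of these SQL jobs might be wired up as an Airflow DAG (Airflow 1.10-style imports), assuming a Redshift connection registered in Airflow as redshift_default; the DAG id, schedule and SQL are illustrative only.

from datetime import datetime
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

# Illustrative rollup: aggregate raw events into a daily reporting table.
DAILY_ROLLUP_SQL = """
    INSERT INTO analytics.ad_views_daily (event_date, market, views)
    SELECT event_date, market, COUNT(*)
    FROM analytics.ad_views
    WHERE event_date = '{{ ds }}'
    GROUP BY event_date, market;
"""

with DAG(
    dag_id="daily_ad_views_rollup",
    start_date=datetime(2019, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    rollup = PostgresOperator(
        task_id="rollup_ad_views",
        postgres_conn_id="redshift_default",  # assumed Airflow connection to Redshift
        sql=DAILY_ROLLUP_SQL,
    )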

Once we’re done with all the processing, transformation, slicing and dicing, a cube is finally ready for reporting. For that we use Tableau as our BI tool to consume the aggregated data from Redshift and visualise it.
