Data Engineering Practices

How to efficiently version, store and process data for your pipeline

When you have large volumes of data, storing it logically helps users discover information and makes that information easier to understand. In this post, we talk about some of the techniques we use to do so in our application.

In this post, we are going to use the terminology of AWS S3 buckets when talking about storing information. The same techniques can be applied on other cloud providers, non-cloud providers and bare-metal servers. Most setups will include high-bandwidth, low-latency network-attached storage close to the processing cluster, or disks on HDFS if the entire platform uses HDFS. Your…


If you're using AWS EMR to run your Spark jobs, enable Ganglia (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-ganglia.html). It is a pretty good free tool to use. People also seem to use Datadog and a few other paid tools, but I don't have experience with setting them up for Spark jobs.
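As a rough sketch of what enabling Ganglia can look like when you create the cluster from Python: the function below only builds the keyword arguments that boto3's EMR `run_job_flow` call accepts, with Ganglia listed next to Spark under `Applications`. The function name `build_emr_request`, the release label, the instance types and the role names are all placeholder assumptions for illustration, not values from our setup.

```python
from typing import Any, Dict


def build_emr_request(name: str, release_label: str = "emr-5.30.0") -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Every concrete value here (release label, instance types, role names)
    is a placeholder assumption; swap in whatever your account uses.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Listing Ganglia next to Spark asks EMR to install it on the cluster.
        "Applications": [{"Name": "Spark"}, {"Name": "Ganglia"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_emr_request("spark-with-ganglia")
    print([app["Name"] for app in request["Applications"]])
    # With AWS credentials configured, the cluster could be launched with:
    #   import boto3
    #   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as a plain dictionary means you can review or unit test the cluster definition without touching AWS, and only hand it to boto3 when you actually want to start a cluster.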


Terraform is a tool to build your infrastructure as code, and we used it to manage ours. What we struggled with was managing our environment configuration and secrets, as there was no clear pattern for managing them in the Terraform ecosystem.

In this post, we’d like to share the challenges of managing configurations and secrets manually in the beginning, and how moving to a version-controlled mechanism for configurations and secrets helped improve the reliability of our system.

Life before version control

Before we can do that, it’s important to understand our build process before we began on this…


How to make your showcases exciting without destroying a world heritage site. (PC: dilbert.com)

Showcases are a key part of our agile ceremonies. We showcase our work to our stakeholders for feedback at the end of every iteration. And as with every presentation, I believe there is a Science in the Art of the Showcase (for distributed teams).

On one of our recent teams, our showcases had challenges. Each of these challenges is a piece of feedback. We added structure to our showcases by running them like a theatre recording TV shows.

This isn’t revolutionary stuff. …

Karun Japhet

Karun is a human. Writes code at Sahaj Software. Speaks publicly about scaling software, testing, CD & ML. ❤️ riding motorcycles and playing CS:GO.
