Crafting a Data Platform with Precision and Passion — Part 1
Introduction: A Tale of Two Pitches
In 2017, I embarked on a journey to establish a data engineering practice at a sprightly startup. Fast-forward to April 2022, and I faced the same challenge, but within the broader horizons of a mid-sized public company. Each venture offered unique hurdles, akin to playing two very different innings.
The Opening Bat: Swinging at Data Pitches with Platform Strategy
My first stint setting up a data engineering and analytics practice in a startup was a tough yet enriching experience. In my second attempt, I wanted to build the practice faster and on stronger foundations. Two years in, I was not sure how I had fared.
During the second stint, some of our data engineers had voiced displeasure with mundane projects; creating data pipelines felt limiting and repetitive. A few days ago, I heard from the team that the work had become exciting and they wanted to build more data products! “A good change, I felt!” quipped the team’s manager.
The transformation from a service-oriented to a platform and product-centric mindset was a pivotal shift for the data engineering team and marked our first victory.
The Second Innings — Batting Long with a Data Platform
As the title suggests, setting up a data engineering team, tech stack and processes is like playing a Test match, not a T20. Patience, perseverance and playing one session at a time are key.
Cricket is entertainment for spectators, but it is also a serious business. A well-prepared pitch is crucial for an exciting match; the same holds true in business.
From day zero, my goal was to construct our “data pitch” with the precision of a curator, layer by thoughtful layer. There was a lot to do, and it was unclear which layers to consider, where to start, and how to unify them.
I drew inspiration from Linux’s init systems and runlevels (or targets in systemd). A runlevel defines the state of the machine after it starts: each runlevel has a certain set of services stopped or started, giving the user control over the machine’s behavior. The progression runs from single-user mode (1), to multi-user without NFS (2), to multi-user with a CLI but no GUI (3), to multi-user with a GUI (5, typically the default on desktop systems).
Runlevel 1, also known as single-user mode, is used for system maintenance. The main difference between runlevels 3 and 5 is that runlevel 5 can run a desktop environment.
I applied this sequence to design our data stack, continuously adding to the developer experience and customer experience along the way.
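As an illustration of how the analogy guided the build order, here is a small Python sketch. The stage names and the capabilities switched on at each level are my own illustrative mapping, not a formal spec of our stack.

```python
from enum import IntEnum

# Hypothetical mapping of Linux runlevels to data-platform maturity stages;
# the names and capabilities are illustrative, not Ingesta's actual terms.
class PlatformRunlevel(IntEnum):
    SINGLE_USER = 1  # maintenance mode: core ingestion, run only by the data team
    MULTI_USER = 2   # several internal pipelines, still no self-service
    CLI = 3          # engineers author pipelines via code and config, no UI
    GUI = 5          # full self-service experience with a web interface

def services_enabled(level: PlatformRunlevel) -> list[str]:
    """Return the platform capabilities switched on at a given 'runlevel'."""
    services = {
        PlatformRunlevel.SINGLE_USER: ["core ingestion library"],
        PlatformRunlevel.MULTI_USER: ["scheduler", "shared data lake"],
        PlatformRunlevel.CLI: ["pipeline-as-config authoring", "observability"],
        PlatformRunlevel.GUI: ["self-service web UI", "sandbox test runs"],
    }
    # Like init, each level includes everything enabled at the levels below it.
    return [s for lvl, svcs in services.items() if lvl <= level for s in svcs]

print(services_enabled(PlatformRunlevel.CLI))
# ['core ingestion library', 'scheduler', 'shared data lake',
#  'pipeline-as-config authoring', 'observability']
```

The sections that follow roughly trace this climb.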
Cultivating a Culture of Innovation
In the early days, we used Apache NiFi for authoring Extract-Transform-Load (ETL) pipelines. We began running into NiFi’s limitations, such as its weak support for error handling and observability. One day, a senior data engineer suggested: “I can build an ETL pipeline utility tool in a week.” “Writing software is easy, but maintaining it is hard,” I reminded myself.
We built the data ingestion platform as a Python library that ran on Apache Spark and named it Ingesta. We selected AWS MWAA (managed Airflow) as the scheduler and AWS EMR as the compute infrastructure, backed by a microservice written in Go. It was designed to do one specific task: extract data from an RDBMS and publish it to the S3 data lake. We built it quickly and migrated hundreds of NiFi data pipelines to Ingesta.
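To make those moving parts concrete, here is a minimal sketch of how an Airflow DAG on MWAA might submit a Spark ingestion step to an EMR cluster. This is not Ingesta’s actual code: the entry-point script, connection string, table, bucket paths and cluster ID are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# One Spark step per pipeline run: pull a table from an RDBMS, land it in S3.
SPARK_STEP = [{
    "Name": "ingest_orders_table",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "s3://example-bucket/jobs/ingesta_job.py",  # hypothetical entry point
            "--source", "jdbc:mysql://orders-db:3306/shop",
            "--table", "orders",
            "--target", "s3://example-data-lake/raw/orders/",
        ],
    },
}]

with DAG(
    dag_id="ingesta_orders_daily",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    submit = EmrAddStepsOperator(
        task_id="submit_spark_step",
        job_flow_id="j-EXAMPLECLUSTER",  # hypothetical EMR cluster ID
        steps=SPARK_STEP,
    )
    # Block until EMR reports the step has completed.
    wait = EmrStepSensor(
        task_id="wait_for_step",
        job_flow_id="j-EXAMPLECLUSTER",
        step_id="{{ task_instance.xcom_pull(task_ids='submit_spark_step')[0] }}",
    )
    submit >> wait
```

The same shape, one operator to submit work and one sensor to watch it, repeats for every table a pipeline ingests.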
This was our cover drive: swift, elegant, and impactful. Our platform journey had begun!
Embracing Product Thinking
With runlevels 1–3 in place, we started thinking about runlevel 5. The goal: automate data movement and speed up product experimentation.
We had built a good tool and it had the potential to become the de facto platform for data ingestion in the company. We started thinking about externalizing the tool by adding a good developer experience and data consumer experience layer. Data analysts and software engineers within the company were our initial target audience.
Ingesta received a shiny new web interface that cleverly hid the complexities of configuring and executing a reliable data pipeline. The team added some delight features, such as workspaces for teams and sandbox test runs for pipelines.
After the rollout, we learned that expecting a user’s data pipeline to run smoothly on its first go is like expecting a toddler to keep a white shirt clean at a spaghetti dinner — optimistic, but wildly unrealistic.
Some of our basic assumptions about simple pipeline attributes, such as timezone, turned out to be wrong, and these had to be made configurable. We also observed that users scheduled their pipelines to run too frequently, which made the pipelines unstable. The team quickly iterated on this feedback.
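To give a flavour of the guardrails that feedback pushed us toward, here is a hedged sketch of a pipeline configuration with a user-settable timezone and a floor on run frequency. The field names and the 15-minute minimum are illustrative assumptions, not Ingesta’s actual schema.

```python
from dataclasses import dataclass
from zoneinfo import ZoneInfo

# Assumed floor on run frequency to keep pipelines stable; a real
# threshold would be tuned per workload.
MIN_INTERVAL_MINUTES = 15

@dataclass
class PipelineConfig:
    name: str
    timezone: str = "UTC"        # configurable now, instead of hard-coded UTC
    interval_minutes: int = 60

    def validate(self) -> None:
        ZoneInfo(self.timezone)  # raises if the timezone key is invalid
        if self.interval_minutes < MIN_INTERVAL_MINUTES:
            raise ValueError(
                f"Runs every {self.interval_minutes} min are too frequent; "
                f"the minimum interval is {MIN_INTERVAL_MINUTES} min."
            )

PipelineConfig(name="orders_sync", timezone="Asia/Kolkata", interval_minutes=30).validate()
```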
Playing a long innings
Just as a cricket pitch develops a few cracks over the course of a match, our data engineering efforts faced their own challenges.
In a data ingestion platform ecosystem, schema evolution, basic transformations, data quality metrics and data governance play a crucial role. As the platform scaled, we realized that Ingesta had to architecturally evolve to address these unique requirements.
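As one concrete example, here is a minimal PySpark sketch that detects schema drift between an incoming extract and the data already in the lake. The paths are hypothetical, and the drift handling shown is a placeholder for a real evolution policy.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-drift-check").getOrCreate()

# Compare the incoming extract against what already lives in the lake.
existing = spark.read.parquet("s3://example-data-lake/raw/orders/")
incoming = spark.read.parquet("s3://example-staging/orders_extract/")

new_columns = set(incoming.columns) - set(existing.columns)
if new_columns:
    # Surface the drift instead of failing silently; a real platform would
    # apply a policy here (add nullable columns, alert the table owner, etc.).
    print(f"Schema drift detected, new columns: {sorted(new_columns)}")

# Parquet readers can also merge compatible schemas across files:
merged = spark.read.option("mergeSchema", "true").parquet(
    "s3://example-data-lake/raw/orders/"
)
```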
Product and Platform mindset
In this journey, Ingesta as a platform could evolve because the team’s mindset evolved. They began to view their projects not as isolated tasks, but as integral components of a cohesive system designed to deliver sustained value to the company and its customers.
In our experience, transitioning to a platform and product mindset encouraged our engineers to think more strategically about the data platform’s long-term needs. As innovation thrived, engineers felt empowered to take ownership of their work, and the development of a new platform became a shared mission.
Ingesta Stats
Since Nov 2023, the platform has transferred 100+ Bn records and accumulated 30 TB in the data lake. Each day, the Ingesta scheduler executes 6,000 data pipelines. From a cost perspective, transferring 1 Mn records costs less than $1.
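A quick back-of-the-envelope check on those figures (the arithmetic is mine; the inputs are the stats above):

```python
records = 100e9          # 100+ Bn records moved since Nov 2023
cost_per_million = 1.0   # upper bound: less than $1 per 1 Mn records
implied_ceiling = records / 1e6 * cost_per_million
print(f"Implied upper bound on total transfer cost: ${implied_ceiling:,.0f}")
# Implied upper bound on total transfer cost: $100,000
```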
My biggest learning has been that servant leadership paired with a platform strategy is a winning combination.
Conclusion
Our ground-up journey of building a simple, scalable data platform shows an evolution from modest aspirations to a robust infrastructure capable of handling immense data volumes with elegance and efficiency.
As we continue to build our data platform, we’re aiming for a century — not in runs, but in innovative data products and solutions. May our data batsmen stand tall, and may our pipelines flow smoother than a perfect cover drive.
In part 2, we will cover Ingesta’s design philosophy, its architecture and features.
Dedicated to the Data Core team at A1.