My Journey to Data Engineering

Naman Chawhan
Pinboard Consulting
4 min read · Mar 3, 2022

My first encounter with data engineering came when I started my career, as part of a team that handled schedulers. A scheduler is a task-scheduling technology that runs a script at a set date and time. Because the scale was so large, schedulers failed every day, and my team's role was to look into the failed schedulers and fix them. The primary job of these scheduled scripts was to fetch data from CSV, PDF and TXT files, and even databases, merge or compile it, and place it at a certain location on the network for further processing. These files were then read by one, or sometimes multiple, ETL flows before being loaded into different databases.
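A minimal sketch of what one of those scheduled jobs looked like in spirit is below. The directories, filenames and the "merge every CSV, then hand off to ETL" shape are hypothetical stand-ins, not the actual scripts we ran.

```python
import csv
import glob
import os
import shutil
from datetime import datetime

# Hypothetical locations -- the real jobs pulled from many sources (CSV, TXT, databases).
INCOMING_DIR = "/data/incoming/client_a"
STAGING_DIR = "/data/staging/client_a"


def merge_daily_files():
    """Merge every CSV dropped since the last run into one file for the ETL flow."""
    files = sorted(glob.glob(os.path.join(INCOMING_DIR, "*.csv")))
    if not files:
        # A missing file was the most common reason a scheduler failed.
        raise FileNotFoundError(f"No input files found in {INCOMING_DIR}")

    merged_path = os.path.join(STAGING_DIR, f"merged_{datetime.now():%Y%m%d}.csv")
    with open(merged_path, "w", newline="") as out:
        writer = None
        for path in files:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)  # write the header once
                writer.writerows(reader)

    # Move processed files aside so the next run starts clean.
    processed_dir = os.path.join(STAGING_DIR, "processed")
    os.makedirs(processed_dir, exist_ok=True)
    for path in files:
        shutil.move(path, os.path.join(processed_dir, os.path.basename(path)))


if __name__ == "__main__":
    merge_daily_files()
```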

My key learning here was understanding the complexity, and how quickly things get complicated, when building a large-scale architecture. Files were created, manipulated, transferred, copied and deleted. More often than not, schedulers failed because files were missing, usually when an unread file had been deleted due to buggy code or human error. Under those circumstances the respective database and workflow teams had to be pulled in to recover the missing data and rerun the workflow with it. In spite of the extreme complexity, new workflows and data streams were created every day as new clients and new client requirements came in. Since this architecture ran around the clock, every day of the year including holidays, a dedicated team was needed to watch over the schedulers through the week and make sure they were never left unattended.

The next addition to my data engineering experience came while I was working as a backend developer. A large chunk of dynamic data had to be migrated from a SQL database to a NoSQL database. Because the database was huge, we couldn't take the easy route of dumping the data into a text file and uploading it into the NoSQL database. Instead, we had to build APIs in front of both databases. For the bulk load, the purpose of these APIs was not just to send data but also to validate that each chunk had loaded successfully before requesting the next one. The endpoints communicated constantly to keep the data in sync. Triggers fired on every change on the SQL side, and the corresponding change was sent to the NoSQL database and validated before it was reflected there. The load was slow and full of redundant checks, but the data was very sensitive and had to be handled carefully, unlike my first job where there was some breathing room for mishaps.
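Here is a rough sketch of that "load a chunk, validate it, then ask for the next" loop. The endpoint URLs, payload shapes and chunk size are invented for illustration; the real APIs were internal services in front of the two databases.

```python
import requests

# Hypothetical endpoints -- the real services sat in front of the SQL and NoSQL databases.
SQL_API = "https://sql-side.internal/api"
NOSQL_API = "https://nosql-side.internal/api"
CHUNK_SIZE = 500


def migrate():
    offset = 0
    while True:
        # Ask the SQL-side API for the next chunk of rows.
        chunk = requests.get(
            f"{SQL_API}/records", params={"offset": offset, "limit": CHUNK_SIZE}
        ).json()
        if not chunk:
            break  # nothing left to migrate

        # Push the chunk to the NoSQL side.
        requests.post(f"{NOSQL_API}/records", json=chunk).raise_for_status()

        # Validate that every record landed before requesting the next chunk.
        loaded = requests.get(
            f"{NOSQL_API}/records/count",
            params={"offset": offset, "limit": CHUNK_SIZE},
        ).json()["count"]
        if loaded != len(chunk):
            raise RuntimeError(
                f"Validation failed at offset {offset}: {loaded}/{len(chunk)} loaded"
            )

        offset += len(chunk)


if __name__ == "__main__":
    migrate()
```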

The latest and best learning experience came with my current job. The task itself wasn't complicated; it was simple in both size and architecture: transfer data from an MSSQL Server database to a graph database. The reason it was so much fun was that every tool I had to use was new to me. I had never worked with a graph database before, and I had to understand how it works and how different its schema design is from other databases. The query language used to interact with it was new as well. Since the scale was really small, we did not take the simple route of loading the data through a file; instead, we built an integrated data stream. We picked Kafka and Apache Spark as two of our tools for the job. I had never used them before, which was exciting but also a little intimidating.
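A simplified sketch of that kind of stream is below: Spark Structured Streaming reads messages from a Kafka topic and processes each micro-batch toward the graph database. The topic name, broker address, schema and the placeholder graph write are assumptions for illustration, not our actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Hypothetical topic and broker -- the real pipeline carried change data published from MSSQL.
spark = SparkSession.builder.appName("mssql-to-graph").getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
])

# Read the Kafka topic as a stream and parse each message's JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "customer-changes")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("row"))
    .select("row.*")
)


def write_to_graph(batch_df, batch_id):
    # Placeholder for the graph-database write; a real job would use a connector or driver.
    for record in batch_df.collect():
        print(f"batch {batch_id}: would upsert node {record['id']}")


# Process each micro-batch and push it toward the graph database.
query = events.writeStream.foreachBatch(write_to_graph).start()
query.awaitTermination()
```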

As developers with years of experience, it's safe to say we relied heavily on Google for solutions, but this time we were dealing with graph databases and there was little to no support available online. It was a roller coaster of emotions: you celebrated every inch of progress and doubted yourself when you were stuck for days. In spite of that, I learned something every day. The interesting bit was that we literally had the opportunity to speak to the actual developers of the database. How cool is that!? Not just the graph database; Kafka and Spark proved fairly challenging too. New technologies take time to understand. But we finally made it work with Spark, and the loads were completed. The aim wasn't just to load the data but to make the pipeline as scalable as we could within our means. This taught us a lot: new concepts and certainly new tools.

By joining Pinboard Consulting I took a leap of faith and trusted myself to pick up a new culture, new technology and a consultant's way of looking at things. Being part of a small team, you're not restricted to one role or one job. Here, getting the work done matters more than doing your job, which makes for an excellent environment to grasp as much as you can and get hands-on experience in as many roles as you like. People here are already fulfilling the roles of Data Engineer, Data Analyst, Data Scientist and Programmer.

Looking back at my days in my first and second companies, tasks that took us days could have been done in hours had we used newer tools. A lot of complexity and redundancy could have been eliminated had we overcome the fear of trying an unknown tool or learning something new. I'm sure my journey is nowhere close to its end. I have so much to learn and a heap of new tools to explore. With that mentality I will keep moving forward and keep exploring.

Naman Chawhan
Pinboard Consulting

Programmer with an inclination towards Graph Database and Data Engineering.