Get that “Spark” in your life — Data engineering, the new draw!

As a graduate in Information Systems Management preparing to enter the job market a few years back, I was overwhelmed by the number of job titles in the data space: analytics engineer, data analyst, data engineer, ML engineer, data scientist... phew! I knew I wanted to be in a data role, but like many others I faced the ultimate question: which of these is the best fit for me, and what exactly are the skill sets needed to succeed? I finally settled on data engineering because I felt it would put me at the intersection of software development, big data technologies, and business strategy, all of which I wanted to learn about.
We hear about “data” a lot these days. Your colleagues. Your friends. Your parents, your long-lost relatives, your neighbors, your neighbor’s dog… everybody is talking about the growing opportunities in data.
While artificial intelligence, machine learning, and data science are the "buzzwords", data engineering usually gets left out of these conversations. Sure, AI, ML, and data science are important and changing lives, but data engineering is just as pivotal: it is the fundamental building block for AI, ML, and data science work. I am a bit tired of reading articles that describe data engineering as the least "glamorous" of all data roles, or that overlook the foundational role of a data engineer entirely. There are certainly a lot of myths surrounding data engineering, and the thing about myths is that they are not always true!
Data engineer? So do you analyze data?
Kind of, but not really! There is certainly a gray area in these job roles, which vary from company to company, but to oversimplify: data scientists need a strong math/statistics background in order to convert raw data into actionable results using machine learning, mathematical models, and algorithms. Data analysts define key performance metrics for the business, create visualizations using out-of-the-box business intelligence tools, and work with key stakeholders to provide insights. There are also more specialized roles, like machine learning (ML) engineers, who run ML experiments and build and deploy end-to-end ML models. Data engineers, on the other hand, have more of a programming background, in addition to data modeling, data warehousing, and SQL skills. They focus on building and maintaining the analytics infrastructure and architecture for data generation, and on building data pipelines. A typical data interaction flow looks like this.
The services that software engineers build generate a lot of data in an unclean, unstructured format. For data analysts to analyze this data and for data scientists to build scalable and useful models, it needs to be cleaned up and transformed into an easily consumable format. The models and insights are only as good as the data used to generate them; if the input data is of poor quality and poorly governed, it's a "garbage in, garbage out" situation. This is where data engineers come in: they automate pipelines that clean, munge, and transform the data into a healthy, usable dataset. Building a data pipeline has essentially three phases.
- Ingestion — Researching and gathering the right input datasets. The source data can come from a variety of different sources — data warehouse, log files, internal or third-party APIs, event-driven messages, etc.
- Processing — Plug into the source systems, munge the raw data, transform it to conform to a desirable data model, and decorate the data with useful identifiers.
- Storage — Store the data in a storage system or data warehouse for easy consumption by data scientists and analysts.
In a nutshell, the major responsibilities of a data engineer are to:
- Build and maintain complex data pipelines for moving data across systems in an automated fashion
- Design data models for storing the data in a consumable format
- Maintain data quality and integrity (this is where a little bit of the data analytics skill set kicks in)
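The three phases can be sketched end to end in a few lines of Python. This is a minimal, hypothetical pipeline (the event fields and table name are illustrative, not from any real system): it ingests raw log lines, drops malformed records, conforms the rest to a simple data model, and lands them in a queryable store (SQLite standing in here for a real warehouse).

```python
import json
import sqlite3

# Hypothetical raw event feed, e.g. lines pulled from a log file or API.
RAW_EVENTS = [
    '{"user_id": 1, "action": "click", "ts": "2023-01-01T10:00:00"}',
    '{"user_id": 2, "action": "view", "ts": "2023-01-01T10:05:00"}',
    'not-valid-json',  # real feeds always contain some malformed records
]

def ingest(raw_lines):
    """Ingestion: parse raw records, dropping anything unreadable."""
    events = []
    for line in raw_lines:
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # in production this would go to a dead-letter queue
    return events

def process(events):
    """Processing: conform records to the target data model."""
    return [
        (e["user_id"], e["action"], e["ts"])
        for e in events
        if {"user_id", "action", "ts"} <= e.keys()
    ]

def store(rows, conn):
    """Storage: load the cleaned rows into a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id INT, action TEXT, ts TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
store(process(ingest(RAW_EVENTS)), conn)
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2 valid events survive the pipeline
```

In a real pipeline each function would be a scheduled, monitored task rather than an in-process call, but the ingest, process, store shape stays the same.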
What a cakewalk!
In simplistic terms, we clean and transform data. Doesn't sound very complicated, does it? Sure, data engineering is not rocket science, but that does not mean it can be treated as dead weight. It takes as much effort and hard work as any other software discipline. Data engineers handle complex, challenging data problems on a day-to-day basis, and doing it at scale adds a layer of complexity of its own. As data volumes grow at massive rates, data engineering comes down to this: a single machine running an RDBMS (like SQL Server) cannot store all of the data produced at this scale. We need a distributed way of storing and processing these large volumes of data over a cluster of machines. Today, companies use distributed computing frameworks like Spark and Hadoop to process large datasets, as the growing volumes pose frequent scalability challenges for data engineers.
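To make the "cluster of machines" idea concrete, here is a toy word count written in the same partition, map, and merge shape that frameworks like Spark and Hadoop apply across a cluster. This is plain single-machine Python for illustration only; in a real cluster each partition would live on a different node and the merge step would happen over the network.

```python
from functools import reduce

# Toy dataset; in a distributed system these lines would be spread
# across machines instead of sliced from one in-memory list.
lines = ["spark spark hadoop", "hadoop data", "spark data data"]

def partition(data, n):
    """Split the dataset into n roughly equal partitions."""
    k = (len(data) + n - 1) // n
    return [data[i:i + k] for i in range(0, len(data), k)]

def map_partition(part):
    """Per-partition work: count words locally, as one worker would."""
    counts = {}
    for line in part:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(a, b):
    """Reduce step: combine partial counts from two workers."""
    for word, n in b.items():
        a[word] = a.get(word, 0) + n
    return a

partials = [map_partition(p) for p in partition(lines, 3)]
totals = reduce(merge, partials, {})
print(totals["spark"])  # 3
```

The point of frameworks like Spark is that they handle the partitioning, scheduling, and merging automatically, so the engineer writes only the per-partition and merge logic.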
For instance, healthcare is one sector that generates a large volume of data and has extensive use cases for big data analytics. Let's say a large-scale healthcare company has just released a new wearable that tracks a customer's fitness in terms of calories burnt, exercise length, and so on. The device constantly tracks and sends the customer's activity as real-time event feeds, which are stored in a transactional data store. Since these high-volume logs are unstructured events, they are not easily queryable for analysis. The company's data analysts, however, want to analyze customers' fitness levels over time to better serve their needs and suggest personalized, realistic goals, and for that they need clean, queryable, reliable data to run the analysis on. To bridge this gap, the data engineer processes these high-volume logs in a distributed fashion: parsing out the raw events, applying transformations such as extracting the necessary data fields and anonymizing sensitive customer data, and applying data optimization techniques for improved query performance before storing them in the data lake used for analytics. Handling large volumes of sensitive data like this demands a well-thought-out engineering approach that addresses not only scalability and latency problems but also major customer privacy and data security challenges.
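A tiny sketch of the anonymization step described above, assuming a hypothetical JSON event shape (the field names and salt are made up for illustration). The transform keeps only the fields analysts need and replaces the direct customer identifier with a salted one-way hash before the record reaches the data lake; in practice you would use a keyed HMAC with a securely managed secret rather than a hard-coded salt.

```python
import hashlib
import json

# One hypothetical raw wearable event (field names are illustrative).
raw = '{"customer_email": "jo@example.com", "calories": 320, "minutes": 45, "device": "fitband-x"}'

def anonymize(value, salt="demo-salt"):
    """Replace a sensitive identifier with a salted one-way hash."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def transform(line):
    """Parse a raw event, keep only the analyst-facing fields, and
    strip direct customer identifiers before the record is stored."""
    event = json.loads(line)
    return {
        "customer_key": anonymize(event["customer_email"]),
        "calories": event["calories"],
        "exercise_minutes": event["minutes"],
    }

row = transform(raw)
print("customer_email" in row)  # False: no raw identifier reaches the lake
```

The salted hash still lets analysts group and join a customer's events over time via `customer_key`, without ever exposing who the customer is.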
It’s all about SQL!
The big data ecosystem as it stands today offers an overwhelming number of technology choices. The data space is going through a major transformative phase. The scalability issues surrounding traditional relational databases gave birth to a ton of NoSQL databases and drove the rise of languages like Python and Scala for data pipelining. While SQL is still a relevant skill and has found a place in the vast big data stack, there are more than a couple of SQL options for big data as well: Presto, HiveQL, and Spark SQL, to name a few. Furthermore, there is an entirely different technology stack for real-time data processing, with Kafka, Spark Streaming, and Apache Storm among the top choices. And let's not forget the plethora of workflow orchestration tools like Airflow, Luigi, and Oozie. To make a long story short, the big data landscape is continuously growing and evolving. There is never a dull day in this space, as you are continuously growing and learning.
Honey, there is less money!
There is a common belief that data scientists and software engineers make more money than their data engineering counterparts. But there has been a paradigm shift in the skill sets required of a data engineer: the role now demands niche skills in big data storage and processing, and these are just as necessary and valuable in the industry as math or software development skills. Data engineering offers pay packages just as competitive as any other software field.
Conclusion
Love it or hate it, but you cannot ignore it: data engineering has become a fundamental requirement for building scalable and efficient data infrastructure. Data engineers create the foundational blocks that enable data analysts and scientists to build analytical data products.
Data science, engineering, and analytics — none can live without the other as all have come together to form a data intelligence cocktail to enable organizations to grow and scale.
