The rise of the Data Engineer

Alexander Jacobsen
Published in KTH AI Society
6 min read · Sep 29, 2021

It’s been an open secret in the tech community for the last decade that most ML projects fail. As a student looking to get into the space, you might not yet have been confronted with the realities of the industry, and you may have bought into the ML hype yourself. Simply put, your ML skills and models can be infinitely intricate, but your results will only ever be as good as your data. In this post, I will explain why and how you should add data engineering to your skillset so that you can deliver the amazing results you’ve always dreamt of on the job.

Data engineering VS Data Science (by Terence Shin)

Recent years have seen the ascent of a new job title taking job boards and tech boardrooms by storm. With the lofty expectations of the former sexiest job of the 21st century finally brought back down to earth, the time has come for Data Engineers to step into the foreground. According to DICE’s Tech Job Report 2020, Data Engineering was the fastest-growing job role (ahead of Data Scientist and Backend Developer), with demand increasing by a whopping 50% in 2019 and no sign of slowing down. Casey McNamara, Head of Data & Analytics at Capgemini, put it simply:

I had a team filled with data scientists and started shifting to more data engineers. The data engineering team works far better and solves real world problems far faster.

So what’s all the fuss about? Who are these elusive Data Engineers, and why have they so suddenly become indispensable, even preferred, at companies that were previously hoarding Data Scientists like COVID vaccines? Why now and not in 2012?

The promise of data

Curing cancer, communicating by telepathy, a technological singularity that would propel our civilisation into a Star Trek-like future: these are some of the promises of the new data age that engineers and business analysts alike have made over the last two decades. In the early to mid 2010s, the promise of Big Data was perhaps at its peak, giving rise to phrases like “the big data revolution”.

The world if we had bigger data

The realisation of “real AI” seemed to be just around the corner. Granted, some exciting developments have come of it, such as Google’s DeepMind or super-relevant ads on Facebook (yay). Still, only 46% of companies reported seeing positive ROI on their big data investments (which, in my opinion, is probably a gross overstatement).

Google searches for Big Data increasing fast around 2012 then stagnating.

Despite the hype, it is fair to say that the age of data still has not delivered on its big promises. The question is, why?

The problem of data

To illustrate the issue, I’ll take you back to a harrowing spring afternoon of my junior year in college, the day I realised the problem of data. I was tasked with finding an employer for which I could carry out a 2-week project for my course in software engineering. A small government agency responsible for collecting healthcare-related complaints offered me a position to build them a shiny new chatbot. Naively enough, I accepted and asked them to show me what kind of data I had to work with… Imagine the horror I felt when they opened the door to a real-life file room, filled from floor to ceiling and wall to wall with actual physical files made of paper. Dreadful. Not even Jesus could have transformed this dusty library into a conversational AI assistant in 2 weeks, let alone a Gen-Z programmer to whom this file room looked more intimidating than Pan’s Labyrinth. Suffice it to say, I kindly declined the offer. Most probably, some less inquisitive student fell into the trap and spent two weeks in absolute misery with very little to show for it.

An example of bad data infrastructure

Despite the extreme nature of this anecdote, it exemplifies the main problem of investments in AI: most companies are simply not ready for it. Hasty decisions such as the one I described have led to shocking statistics showing that 85% of big data projects fail (Gartner, 2017) and that 87% of data science projects never make it to production (VentureBeat, 2019). The table below shows results from a 2014 survey asking organisations what they see as the leading causes of these massive failures.

Clearly, the culprit is a lack of the right technology, data management knowledge, and appropriate infrastructure. In the face of such bleak prospects, many have lost faith. In fact, the percentage of companies self-reporting as “data-driven” has actually decreased in recent years, from 37.1% in 2017 to only 31.0% in 2019. Enter the Data Engineer.

The rise of the Data Engineer

Around 2011, the term “Data Engineer” started to crop up in the circles of new data-driven companies such as Facebook and Airbnb. Sitting on mountains of potentially valuable real-time data, software engineers at these companies needed to develop tools to handle all that data quickly and correctly.

As big data grew, “data engineering” came to describe a kind of software engineering focused deeply on data: data infrastructure, data warehousing, data mining, data modelling, data crunching, and metadata management. Essentially, Data Engineers are the ones who enable downstream data professionals, such as Data Scientists or Machine Learning Engineers, to use data effectively. To reuse the oft-quoted analogy of data being the new oil, employing a Data Scientist without a Data Engineer would be akin to attempting to produce gasoline without a working oil pipeline. Following the same analogy, one could expect Data Engineers in the 21st century to take up a position akin to that of Petroleum Engineers in the 20th century. That is: absolutely indispensable for the functioning of our modern economy.
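To make the “pipeline” part of the analogy a bit more concrete, here is a minimal sketch of the kind of extract, transform, load (ETL) step a Data Engineer might build. Everything in it is an assumption made up for illustration: the complaint records, the column names, the cleaning rules, and the choice of an in-memory SQLite table as the destination. Real pipelines are far larger and typically rely on dedicated tooling.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, a duplicate row, a missing value.
RAW_CSV = """complaint_id,category,received
1,Billing,2021-03-01
2,waiting time,2021-03-02
2,waiting time,2021-03-02
3,,2021-03-04
"""


def extract(text):
    """Parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows without a category, drop duplicate ids, lower-case categories."""
    seen_ids = set()
    clean = []
    for row in rows:
        if not row["category"]:
            continue  # incomplete record: skip it (a real pipeline might log it)
        complaint_id = int(row["complaint_id"])
        if complaint_id in seen_ids:
            continue  # duplicate record
        seen_ids.add(complaint_id)
        clean.append((complaint_id, row["category"].strip().lower(), row["received"]))
    return clean


def load(rows, conn):
    """Write the cleaned rows to a table that downstream users can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS complaints "
        "(id INTEGER PRIMARY KEY, category TEXT, received TEXT)"
    )
    conn.executemany("INSERT INTO complaints VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT id, category FROM complaints ORDER BY id").fetchall())
# prints [(1, 'billing'), (2, 'waiting time')]
```

The point is not the specific cleaning rules but the division of labour: once a step like this runs reliably, a Data Scientist can query the `complaints` table instead of wrestling with the raw export.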

Conclusion

Following the failures of big data investments in the 2010s, the 2020s are shaping up to be the decade of the Data Engineer. Indeed, it has now become clear to organisations that to reap the benefits of data and machine learning, it is not sufficient to hire data scientists and invest in fancy models: one must first ensure a robust data infrastructure, and for that one needs Data Engineers. Whether you run an AI-driven company or are simply a student interested in AI, it is vital to see the bigger picture: AI is powered by data. How that data is managed and used is key to the success of any ML initiative, and that’s what data engineers are for.

More and more SaaS companies are offering solutions to manage your data infrastructure. For a deep dive into existing solutions, check out Matt Turck’s review of the MAD landscape in 2021 (Machine Learning, AI, Data), featuring, amongst others, Stockholm’s very own Validio, which tackles the massive but underappreciated problem of data quality. Next, become a data engineer yourself: a good starting place is Stanford’s free course CS 329S: Machine Learning Systems Design. If you’re a PhD student at KTH, enrol in the course FID3024 Systems for Scalable Machine Learning. And of course, become a member of KTHAIS or contact us at business@kthais.com for more content and events related to Data Engineering. Cheerio!
