Data & Data Engineering — the past, present, and future

Zach Wilson
7 min read · Aug 3, 2021


Data has been part of our identity as humans since the time of the ancient Romans. Our growing population caused us to bump into scale issues as far back as the 1880 US census. We’ve gone from 0% of the world on the internet to 59.5%, and 4.32 billion people with cell phones generate quite a huge data feed. How has humanity dealt with the need to analyze this unprecedented amount of data? We’ve come up with punch cards, relational databases, the cloud, Hadoop, distributed computing, and even real-time stream processing to try to manage this gusher of 1s and 0s. Before I get too caught up talking about yellow elephants, let’s walk through the history of data and data engineering.

The pre-Computer Era: By Hand
When a lot of people think about data engineering, they automatically think of computers and pipelines. Yet data and data engineering as a concept have existed for a long time. The human-driven pipelines before computers existed had really terrible SLAs and were riddled with data quality issues (much like many computer-driven pipelines are today).

The first data table that we have historical evidence for is Ulpian’s life table created by the ancient Romans around 230 AD. This table was used to describe Roman life expectancy. The data quality of this table is unknown; we can probably guess they didn’t use automated data quality checks with awesome libraries like Great Expectations.

The first use of the word “data” in English was in the 1640s. Back then the definition meant, “a fact given or granted.” I’m sure people back then would be surprised to learn that the new definition is “potentially sketchy byte value in a computer.”

Humans didn’t really have any significant data engineering achievements until around 1880. The US census in 1880 took EIGHT YEARS to complete because it was all done by hand. Herman Hollerith, aka data engineer #1, decided that waiting 8 years for the data pipeline to finish was far too long. He invented a tabulation machine that cut the processing time for the 1890 census from 8 years to 6 years! According to the U.S. Census Bureau, the census results were “… finished months ahead of schedule and far under budget.” Herman definitely deserved a promotion from junior data engineer to senior data engineer for his huge impact!

Hollerith 1890 tabulating machine

Punch cards and tabulation machines were the Apache Spark of data engineering for an incredibly long time! It wasn’t until humans invented the relational database that punch cards were really dethroned!

The Past: Electronic Spreadsheets & the RDBMS

If you ask the average person today what data is, an electronic spreadsheet is the first thing that comes to mind. Microsoft didn’t come out with Excel until 1985 though! Let’s take a step back to the 70s and talk about the ramp-up of relational databases.

In 1970, an IBM computer scientist named Edgar F. Codd invented the relational model for database management. A few years later, in 1974, the query language SQL was born. The initial version of SQL was called SEQUEL, which is how a lot of modern data engineers still pronounce it! The name was changed to SQL because Hawker Siddeley had SEQUEL as a registered trademark and IBM wanted to avoid litigation. It’s amazing to think that we still use SQL for a lot of modern-day data engineering 47 years later!

Relational databases revolutionized how we store, manage, and even think about data. They came with mechanisms like primary keys that guarantee uniqueness, foreign keys that enforce relationships between tables, and indexes that make data access much quicker. We could finally start analyzing large amounts of data in a fast, reliable, and trustworthy way!
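
Here’s a minimal sketch of those mechanisms using Python’s built-in sqlite3 module; the users and orders tables are hypothetical examples I made up for illustration, not anything from a real system.

```python
# Illustrative only: a primary key, a foreign key, and an index in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys

conn.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,   -- primary key: uniqueness guaranteed
        name    TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL REFERENCES users(user_id),  -- foreign key: relationship enforced
        total    REAL NOT NULL
    )
""")
# An index makes lookups of a user's orders much quicker.
conn.execute("CREATE INDEX idx_orders_user_id ON orders(user_id)")

conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 42.50)")
# Inserting an order for a nonexistent user would now raise an IntegrityError.
```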

One of the problems with relational databases was that they were still very technical to work with. Unless you were a computer scientist, you couldn’t really use them. That’s where Microsoft saw a huge opportunity to bring data to the masses with Excel. Excel made data management and analysis so easy that anyone could do it with minimal training!

Bill Gates obviously excelling at data analysis!

Once the Internet went mainstream in the 1990s, we started amassing data at an exponentially accelerating rate! We started to see some of the drawbacks of relational databases and Excel. For vast amounts of data, these tools became unusable. We were drowning in so much data that no single computer could handle it all!

The Present: Distributed & Cloud Computing

So how did we solve this huge data problem? Teamwork makes the dream work. Instead of building bigger, beefier computers, we decided to use regular computers working as a team, a concept known as horizontal scalability. This gave rise to the field of distributed computing, and it allowed extremely large-scale analytics to happen for the first time.

In 2006, Apache Hadoop arrived as the first widely adopted open-source technology to leverage this brand-new paradigm.

Looking at the Hadoop elephant brings most data engineers joy, terror, or a mix of both

Hadoop worked by splitting large amounts of data into bite-sized chunks that could be processed in parallel by individual computers and then recombined, using a framework called MapReduce (written in Java). This made it possible to run analyses that wouldn’t even have been feasible on traditional relational databases.
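
To make the idea concrete, here’s a toy, single-machine sketch of the MapReduce pattern in plain Python (not Hadoop’s actual Java API): map each chunk into key/value pairs, shuffle them by key, then reduce each group into a result. On Hadoop, each chunk would be mapped on a different machine.

```python
# Toy word-count illustrating the map -> shuffle -> reduce pattern.
from collections import defaultdict

chunks = [
    "big data big pipelines",
    "big elephants yellow elephants",
]

# Map: each chunk independently emits (word, 1) pairs.
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle: group the pairs by key so each word's counts end up together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: recombine each group into a final count.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 3, 'data': 1, 'pipelines': 1, 'elephants': 2, 'yellow': 1}
```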

Around the same time that Hadoop was born, the cloud computing segment of Amazon Web Services (AWS) was launched. Cloud computing meant that you could borrow someone else’s team of computers to process your big data instead of managing the team yourself! Netflix started migrating off its own servers and into the cloud in 2008, becoming a “cloud native” company around the same time it was destroying Blockbuster, a late-fee native company. Cloud computing lowered the barrier to entry for every company to jump into the big data space.

Hadoop definitely had its drawbacks though. Intermediate results were constantly written to disk between processing steps, and RAM wasn’t used efficiently anywhere. The Hadoop Distributed File System (HDFS) replicated your data by a default factor of 3 for fault tolerance, which tripled your storage footprint. And writing big data pipelines directly in MapReduce was very technical and challenging.

To lower the technical barrier to entry for big data processing, Facebook created Hive, which became a top-level Apache project in 2010. Hive lets you write MapReduce jobs using a SQL-like language called HiveQL (HQL). While Hive lowered the technical barrier to entry, it didn’t do anything to address the other drawbacks of MapReduce.

The current king of data engineering, Apache Spark, hit its 1.0 release in 2014 and addressed the drawbacks of MapReduce and Hadoop. Spark is a distributed computing framework that keeps data in RAM wherever possible while still processing it in a distributed way. Many companies saw huge savings when migrating from Hive/MapReduce to Spark!
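
As a rough illustration, here’s a minimal PySpark sketch (assuming a local Spark installation and a hypothetical events.csv file with event_date and user_id columns) of how caching keeps a dataset in memory so multiple aggregations can reuse it instead of re-reading from disk:

```python
# Minimal PySpark sketch: cache a DataFrame in RAM and reuse it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-vs-mapreduce").getOrCreate()

# Read a (hypothetical) CSV of page-view events into a distributed DataFrame.
events = spark.read.option("header", True).csv("events.csv")

# cache() asks Spark to keep the DataFrame in memory, so both
# aggregations below reuse it rather than hitting disk twice.
events.cache()

daily_views = events.groupBy("event_date").count()
views_by_user = events.groupBy("user_id").count()

daily_views.show()
views_by_user.show()
```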

Humanity has come a very long way since we started this data journey back in 230 AD! So what’s coming next? How can we anticipate and learn the skills needed for the next data revolution?

The Future: Drowning in Privacy and a River of Data

In 2017, Equifax was hacked and the sensitive data of nearly 150 million Americans was leaked. This event made Americans feel very vulnerable, with many of them changing their banking information and credit cards to avoid identity theft. It made us realize that data is incredibly valuable! Big parts of our identity have become locked up in the 1s and 0s of these computers.

We’ve realized that protecting our data is protecting our humanity. Data engineers who are privacy-focused will be in very high demand in the 2020s and beyond!

As our world keeps kicking into a higher and higher gear, we have to write pipelines that keep up with it. Currently, most data pipelines are written to run on daily or hourly batches of data, which causes a significant latency between when the data is collected and when it is processed. Fascinating technologies like Apache Flink and Spark Streaming allow data to be processed as it is generated instead of waiting for a daily or hourly batch of that data to be collected. These technologies allow us to perform truly real-time analytics and machine learning! Data engineers who learn how to build and manage streaming data pipelines will definitely see their career prospects grow!
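
To give a flavor of what this looks like, here’s a minimal Spark Structured Streaming sketch (the Kafka topic page_views and its broker address are hypothetical, and the Kafka connector package is assumed to be available) that counts events per minute as they arrive instead of once per daily batch:

```python
# Minimal Structured Streaming sketch: count events in 1-minute windows as they arrive.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Continuously read events from a (hypothetical) Kafka topic.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page_views")
    .load()
)

# Tumbling one-minute windows keyed on Kafka's message timestamp,
# updated continuously as new data arrives.
counts = events.groupBy(F.window(F.col("timestamp"), "1 minute")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```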

The cloud is growing into something incredibly powerful. A lot of data engineering tasks that were cumbersome in the Hadoop world are simpler to do with tools like Snowflake or BigQuery. They allow you to do what took hundreds of lines of Java in dozens of lines of SQL. The cloud will continue to develop new offerings that will reduce the overhead of doing machine learning, data quality checks, and so much more! The future of data is incredibly exciting! I hope we can harness the immense opportunity in front of us to the fullest!
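
As one hedged example of that shift, here’s a short sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical, and it assumes GCP credentials are already configured. A query like this replaces what would once have been a full MapReduce job.

```python
# Illustrative only: a small BigQuery aggregation over a hypothetical events table.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT user_id, COUNT(*) AS page_views
    FROM `my_project.analytics.events`
    GROUP BY user_id
    ORDER BY page_views DESC
    LIMIT 10
"""

# Run the query on BigQuery's managed infrastructure and print the results.
for row in client.query(query).result():
    print(row.user_id, row.page_views)
```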

The Flink squirrel wishes you a lifetime of prosperity so long as you don’t use Spark Streaming!


Zach Wilson

Founder @ EcZachly Inc. 250k followers on LinkedIn, data engineering and mental health influencer. https://www.linkedin.com/in/eczachly