The Evolution of Data Engineering: A Journey Through Time (Part 2)

Udaya Chathuranga
Nov 26, 2023


Photo by Piotr Cichosz on Unsplash

This is the second part of the data engineering evolution multi-part series; check out the first part if you have not already.

As business intelligence captured the world in the 1990s, propelling organizations towards unlocking their business potential through data analysis, the escalating volume of collected data created the need for applications with rapid computational capabilities.

Data-intensive applications

The exponential growth of the web, a.k.a. the dot-com bubble, which expanded from just 130 websites to 6.5 million between 1993 and 1999¹, resulted in a plethora of technological innovations in the decade that followed 2000. In the early 2000s, Google was already winning its battle against Yahoo and becoming the technological giant we know today. Internet companies like Friendster, MySpace, Twitter and Facebook were founded and reinvented social media. The world was slowly starting to have its digital self.

The growth of big tech companies pioneered the earliest industry requirements for distributed data processing.

The birth of Hadoop

As the web became popular, so did search engines. Yahoo had been ruling the web since the mid-1990s. The industry requirements for search engines incubated full-text search projects like Lucene, followed by web crawling projects like Nutch.

In the early 2000s, the creators of Nutch, Doug Cutting and Mike Cafarella, were facing problems due to the lack of a proper distributed file system and computational model. Shortly thereafter, Google released two of its most renowned papers: one on the Google File System in 2003, followed by a paper on MapReduce in 2004. These publications solved the problems Doug and Mike were facing. In 2006, the HDFS and MapReduce parts were taken out of the Nutch project and moved into a new project called Hadoop!

Hadoop was ideal for batch processing of massive amounts of unstructured data, yet it was also fully extensible to support structured data, e.g. support for Parquet/Avro by writing custom InputFormats. These capabilities led internet giants like Facebook, LinkedIn, and Yahoo, which operate on petabytes of data, to adopt Hadoop early. Later, they made great contributions back to Hadoop by open-sourcing much of their internal development, which became integral components of the Hadoop ecosystem.

Hive, originally developed by Facebook and open sourced in 2008 (see HADOOP-3601⁵), was a data warehouse built on top of flat files (you could also write a custom SerDe). The support for SQL in Hive encapsulated the complexities of MapReduce and made data analysis with Hadoop 100x easier. This accelerated the industry adoption of Hadoop; even data analysts without programming skills could benefit directly from Hive thanks to its SQL interface.
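To get a sense of what Hive hid from analysts, below is a minimal, hypothetical word-count job written as two Hadoop Streaming scripts in Python. The file names, the table name in the closing comment and the job itself are illustrative assumptions rather than anything taken from the Hadoop or Hive documentation; the point is that even a trivial aggregation required hand-written map and reduce logic plus knowledge of how Hadoop sorts and feeds data between them.

```python
# mapper.py -- hypothetical Hadoop Streaming mapper: emits one "word<TAB>1" line
# per word read from stdin (Hadoop pipes each input split to the script's stdin).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- hypothetical Hadoop Streaming reducer: sums the counts per word.
# Hadoop sorts mapper output by key, so identical words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

# In HiveQL the same result is roughly a single statement:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
```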

Similarly, Pig, a project from Yahoo, was open sourced in the same year, 2008. It introduced a high-level programming language for MapReduce.

In the same year, 2008, Cloudera, the first commercialized distribution of Hadoop, was founded, followed by Hortonworks in 2011. Hortonworks offered Apache products under an open source license. In 2018 Hortonworks merged with Cloudera and became a trademark of Cloudera.

Some parts of this Hadoop history have been sourced from the history of Hadoop by Marco Bonaci. It is an impressive article, and the story has even been verified by Doug Cutting himself.

NoSQL

The rising use cases for unstructured data, together with the requirements for scalability, gave birth to NoSQL databases. Even though the term NoSQL first appeared in 1998, coined by Carlo Strozzi to describe the Strozzi NoSQL database, which did not use a SQL interface yet was still relational, it was reintroduced by Johan Oskarsson in 2009 to describe non-relational databases.

The whole idea behind NoSQL was to design extremely scalable databases that support unstructured data such as documents, key-value pairs and graphs. The foundation of scalable processing is partitioning the database engine. But the CAP theorem, which was also published in 2000, states that when data is distributed you are left with either availability or consistency, but not both.

To implement consistency with partition tolerance, you have to follow the ACID model. ACID is implemented on distributed database engines with two-phase commit (2PC), but the challenge is that 2PC does not scale. The availability of a 2PC-based distributed system equals the product of the availabilities of the individual instances. For example, six database instances, each with 99.9% availability, yield a reduced overall availability of roughly 99.4%. In addition, the performance of 2PC degrades as the number of database instances increases.
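A minimal sketch of that arithmetic, assuming instance failures are independent so that availabilities multiply:

```python
# Availability of a 2PC-style commit that must reach every participating
# instance, assuming independent failures (so availabilities multiply).
def overall_availability(per_instance: float, instances: int) -> float:
    return per_instance ** instances

print(overall_availability(0.999, 6))  # ~0.9940, i.e. roughly 99.4%
```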

As a result, NoSQL engines implement the BASE model, which offers eventual consistency instead of strong consistency but allows near-unlimited scaling without compromising the availability or performance of the entire system.

NoSQL became a hot topic in the late 2000s as implementations such as CouchDB, HBase and Cassandra became available to the public.

Distributed relational databases

Big tech firms like Google, Facebook and LinkedIn operate decades ahead of their time and use many proprietary technologies and infrastructures to run their businesses. The data needs of the remaining, non-mega-tech companies were addressed by commercialized products from vendors like Microsoft, IBM and Oracle, who dominated the RDBMS market in the early 2000s. When databases started to hit the limits of vertical scaling due to growing data volumes and analytical requirements, industry demand for distributed database engines emerged.

The roots of the earliest commercialized distributed SQL databases run as deep as 1987, when NonStop SQL was released. It was one among many projects that originated from Ingres. NonStop SQL was popular because its performance scaled linearly with the number of CPUs. The NonStop SQL line was followed by HP Neoview in 2007², a distributed database geared towards business intelligence workloads, but that line was retired in 2011.

Oracle was leading the RDBMS market with a share of 33.8% in 2000³. Oracle RAC, a feature that enabled clustering, was released with Oracle9i in 2001, followed by the release of Exadata in 2007 with 11g. Exadata was a specialized hardware appliance that supported horizontal scaling and high availability, ideal for a mix of mission-critical OLTP and OLAP workloads. Even the first release of Exadata supported 168 TB of storage and 64 CPU cores⁴.

On-premises Microsoft SQL Server (MSSQL) did not introduce any scale-out features and continued to maintain its simplicity. It supports only scale-up even in the present day, although it does have HA and DR features. One reason MSSQL never got distributed features could be Microsoft's early focus on Azure. Microsoft had been working on Azure since the mid-2000s and probably wanted to push customers towards the cloud once they hit distributed data processing requirements.

However, the lack of scale-out features did not put MSSQL out of business; while Oracle focused on big enterprises, MSSQL was ideal for small to mid-scale businesses in emerging markets that required a smaller initial investment. MSSQL was also surrounded by a rich set of tooling like SQL Server Reporting Services (SSRS) and SQL Server Management Studio (SSMS), released in 2004 and 2005 respectively. Installation and configuration of MSSQL was entirely GUI-driven and required much less specialized knowledge. Microsoft being an industry leader in operating systems made all of this possible, and MSSQL is still a very popular choice for on-premises deployments even in 2023. Nobody can really beat Microsoft in terms of tooling and usability, especially within the Windows platform!

MySQL also prepared itself for distributed workloads by releasing MySQL Cluster in 2004, while PostgreSQL, being much more extensible than MySQL, gave birth to many distributed database projects⁶ like pgCluster, EnterpriseDB and Greenplum.

Distributed database engines certainly increased industry demand for database administration due to their added complexity.

Further SQL enhancements

In the early 2000s, SQL continued to receive enhancements, particularly with the standardization of window functions (the OVER clause) in SQL:2003. Soon, database vendors began to introduce analytical functions like ROW_NUMBER() and RANK(), which have since become invaluable in data analysis.
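As a quick, self-contained illustration of what the OVER clause enables, the sketch below ranks rows within groups using Python's built-in sqlite3 module. SQLite only added window functions in version 3.25, so this assumes a reasonably recent build, and the sales table and its columns are made up purely for the example:

```python
import sqlite3

# In-memory demo database (window functions require SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 100), ('north', 250), ('north', 250),
        ('south', 80), ('south', 300);
""")

# RANK() numbers rows within each region by amount; tied amounts share a rank.
rows = conn.execute("""
    SELECT region,
           amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()

for region, amount, rnk in rows:
    print(region, amount, rnk)
```

Before the OVER clause, this kind of ranking typically required awkward self-joins or correlated subqueries, which is why these functions became so valuable for analytical work.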

Maturity of data integration tools

As businesses became more data-aware, the requirements for data transformation and integration kept increasing from the 1990s onwards. This created a huge demand for data integration, or ETL, tools.

After Informatica and IBM released PowerCenter and InfoSphere DataStage in the late 1990s, plenty of ETL tools emerged in the market during the 2000s.

Microsoft also released its own ETL tool, Data Transformation Services (DTS), along with SQL Server 7.0 in 1998¹², which later evolved into SQL Server Integration Services (SSIS).

Oracle acquired Sunopsis, a company specializing in ETL tools, and rebranded its product suite as Oracle Data Integrator (ODI) in 2006.

Pentaho Data Integration (aka Kettle), Talend Open Studio for Data Integration, and SAS Data Integration Studio are a few more among the many integration tools that surfaced in the 2000s.

Business intelligence 2.0

The business intelligence that started in the 1990s, referred to as business intelligence 1.0, revolved around data warehousing and reporting. The IT or BI department was responsible for developing and maintaining a data warehouse, and reports built on top of the DWH were delivered to business users on a schedule. This whole process was handled by a centralized team: whenever another department had a new data requirement, it had to ask the centralized data team to handle it. Even though this approach worked, it limited the freedom and flexibility of those who were supposed to explore and uncover insights in their business domain using data.

Business intelligence 2.0 (BI 2.0) refers to the evolution of BI practices in the mid-to-late 2000s, with a primary focus on achieving data democratization. Data democratization stands for everyone's right to access data in an organization without technical barriers. The idea greatly benefited from the growth of the web, which brought browser-based BI solutions to the market.

SQL Server Reporting Services (SSRS) is a paginated reporting platform released in 2004 as an extension to Microsoft SQL Server. SSRS allowed users to manage their reports themselves using nothing but a web browser. The BI department could develop and publish data sources to the SSRS server, and business users could then connect to these data sources and build reports themselves using a GUI-based report builder application.

Visualization tools like QlikView also started releasing web-based versions from the mid-2000s⁹. Compared to standalone applications, web-based solutions make it much easier to implement collaborative features, such as sharing reports and visualizations. In practice, users can publish their reports to a centralized repository, allowing the entire organization to access them through a web browser.

The cloud

The cloud, which offers computing resources over the internet through a subscription-based pricing model, has influenced every field in IT, including data engineering, since its emergence in the late 2000s. The cloud primarily has three service models, known as SaaS, PaaS and IaaS. While SaaS had been around for some time, the cloud as we know it gained popularity with the emergence of IaaS in the late 2000s.

Brief history of SaaS and PaaS

In the 1960s, computers were enormous as well as expensive, and only a few businesses could own one. That's when IBM and other mainframe providers promoted time-sharing computing to provide computing power to large organizations from their data centers. These systems involved connecting a series of dumb terminals (keyboards and monitors without CPUs) to a mainframe that kept all the applications and data¹⁰. It was the earliest form of connecting computers together to provide software as a service. By the 1990s, however, the price of computers had dropped greatly and office workers started to have their own personal computers. The use of personal computers eventually led to a rise in bloatware, which kept filling up hard drives. At the time hard drive space was still quite expensive; a 15 MB hard drive cost $2,495 USD¹⁰. SaaS vendors responded not only to these storage constraints but also to the increasing popularity of the web, offering their enterprise applications online.

In 1999, Salesforce launched its CRM system as a subscription service on the internet, officially becoming the first SaaS solution.

Zimki, a development platform, became the world's first PaaS solution in 2007. Zimki was far ahead of its time in many ways but lost its chance at becoming a major PaaS provider. Take a look at the nice article about how Canon almost became a major cloud provider. Being first to market does not always make you the winner (Google vs Yahoo, Facebook vs MySpace).

In 2006, Amazon Web Services became the first commercial IaaS provider by offering two of its initial cloud services: Simple Storage Service (S3) and Elastic Compute Cloud (EC2). Two years later, in 2008, Google released Google Cloud Platform (GCP). Azure was also announced in 2008 under the code name “Project Red Dog” and released as Windows Azure in 2010.

Even though the cloud started to emerge in the late 2000s, it took the better part of the next decade for it to become widespread. IaaS accelerated cloud adoption for many organizations thanks to its ability to scale resources up and down on demand, reducing upfront costs compared to on-premises setups. Once cloud providers realized that improving their IaaS capabilities could convince the remaining hesitant customers to make the transition to the cloud, they started to include more and more IaaS features in their offerings, such as Azure Compute, Linux on Azure, VPCs, load balancers and other networking components.

Being a data engineer in the 2000s

I see the 2000s as the decade that laid the foundation for data engineering as we know it today. The rise of mega-tech companies that gifted the data world the Hadoop ecosystem, the early use cases for unstructured data, the requirements for distributed data processing fuelled by growing tech businesses, and the birth of NoSQL and the cloud all originated in the 2000s. This is also the decade in which computing power went exponential, from trillions to quadrillions of FLOPS. The web also created early trends towards visually appealing content, which eventually created demand for data visualization, followed by the invention of collaborative BI platforms to enhance the sharing and understanding of data-driven insights.

Apart from the mega-tech firms, the data domain pretty much revolved around relational databases and data warehouses, as the majority of the world was still working with structured data. Experience with Oracle database products and integration tools like PowerCenter and DataStage could land you pretty amazing job offers, as the roles of DBA and DWH/ETL developer reached maturity between 2000 and 2010. The role of business intelligence developer was also gaining popularity, although I believe it was not as widely recognized and established as it became in the following decade, from 2010 onwards.

TL;DR: In a nutshell, as a data engineer in the 2000s, you would have been working with relational databases, setting up MySQL clusters, developing ETL workflows with integration tools like PowerCenter, and designing and talking about data warehousing a lot… and also probably debating relational vs NoSQL databases.

Conclusion

The 2000s marked a transformative era in data engineering. The advancement of distributed processing, the evolution of business intelligence and the rise of cloud computing were pivotal aspects, among many others, that laid the foundation for further advancements in the decade that followed. It’s not an exaggeration to say that big tech companies reshaped data engineering. They enriched the whole industry with a robust set of tools and frameworks to harness the power of data in unprecedented ways, accelerating the advancements and groundbreaking innovations in the data engineering field for years to come.

References

[1] https://www.zakon.org/robert/internet/timeline/

[2] https://www.hp.com/hpinfo/newsroom/feature_stories/2007/07masterdata.html

[3] https://www.tech-insider.org/statistics/research/2001/0523.html

[4] Oracle, “Exadata Technical Overview”

[5] https://issues.apache.org/jira/browse/HADOOP-3601

[6] https://wiki.postgresql.org/wiki/PostgreSQL_derived_databases

[7] https://web.archive.org/web/20080519220323/http://www.maia-intelligence.com/Articles/BI-2-Technology.html

[8] http://support.sas.com/resources/papers/proceedings10/040-2010.pdf

[9] https://data-flair.training/blogs/qlikview-versions/

[10] https://bebusinessed.com/history/the-history-of-saas/

[11] https://www.zdnet.com/article/microsoft-launched-azure-10-years-ago-and-lots-but-not-everything-has-changed/

[12] https://www.sqlservercentral.com/articles/a-brief-history-of-ssis-evolution
