The Evolution of Data Engineering: A Journey Through Time (Part 1)
Data Engineering! One of the most in-demand jobs of recent years, and it did not come into existence in the blink of an eye. Data engineering has been around for a long time. In fact, it existed even before computers were invented. Of course, there were no fancy titles like “Data Engineer” in those days, yet the work was still getting done. To answer how, we need to look back and see how we dealt with data in the pre-computer era.
Pen and Paper! Yes, our very own pre-historic way of saving data is none other than pen and paper. Thanks to its convenience and simplicity it is still used today, even though its origins trace back to very ancient times, as far back as 3000 BC. We did not stop there: we engineered those pieces of paper into books with page numbers and indexes, and into filing cabinets, so that we could store and retrieve data efficiently.
Since then, with the rise of computers, we evolved: from a piece of paper to a solid-state drive, from a humble pen to a massively parallel data processing application!
The world's first database, 1960s
The introduction of the very first database dates back to the 1960s, when the Integrated Data Store (IDS) was developed by Charles Bachman. IDS used the CODASYL data model, which is also referred to as the network data model. CODASYL, along with the hierarchical model, were the very first data models to be introduced. This is where two of the most frequently spoken words in data engineering came into existence: tables and rows!
Both the CODASYL and hierarchical models were navigational models, meaning you had to traverse the database record by record. Each record held a reference to the next and the previous records.
The relational model, which we still use today, was designed by the scientist Edgar F. Codd in 1970. He worked at IBM and was not very happy with the navigational models, primarily because they lacked a search facility.
Codd used primary keys to link records, unlike the previous models where references were purely physical disk addresses. This introduced a great deal of flexibility for users who wanted to define relationships in data. It also solved one of the greatest pains we have to endure when dealing with data: data duplication! The relational model blessed databases with normalization, which maintains integrity and consistency by eliminating duplicated data.
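To make the idea concrete, here is a minimal sketch (the customers and orders tables are hypothetical, not from any of the cited sources) of how the relational model links records through keys and avoids duplicating customer details on every order:

-- Hypothetical example: customer details live in one table and orders
-- reference them by primary key instead of a physical disk address.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    city        VARCHAR(50)
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    order_date  DATE,
    amount      DECIMAL(10, 2)
);

-- Relationships are expressed declaratively at query time with a join.
SELECT c.name, o.order_date, o.amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;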
The era of relational databases, 1970s
The release of Codd's model was followed by a project at IBM called System R, which started development in 1974 and became the first implementation of SQL.
Relational databases were not widely available commercially until the 1980s, when pioneering database engines like Db2, Informix and Oracle came onto the market. Since then, relational databases have spread like wildfire. The relational model mapped naturally onto many business use cases, and that practicality made it very popular.
Those initial releases were tightly integrated with the operating system and ran on vendor-specific hardware such as the AS/400 or mainframes. The relational schemas of those systems were pre-baked by the vendors themselves, and solutions were specially engineered for a range of industries such as financial services and banking. Customers mainly needed systems engineers and database administrators to operate and maintain those servers, but evolving or customizing the schema was highly specialized work and was handled by the vendor itself.
The need for a data warehouse, 1970s
According to Wikipedia, early discussions on data warehousing date back as far as the 1960s, when the terms dimensions and facts were introduced. But things started to take shape when Bill Inmon, the father of the modern data warehouse, began defining the term in the 1970s. The idea of the data warehouse, herein referred to as DWH, revolved around moving data from operational systems to decision support systems.
Data in its raw form, sitting inside databases, is nothing but a series of bits and bytes and has no value on its own. To create value you have to transform it into sensible information. When data is collected by multiple operational systems, it often leads to the creation of data silos. A data silo, by definition, arises when data produced by one system becomes unusable in another system due to inconsistencies; for example, the city in an address is stored as New York in one system and NY in another. Data has to be evaluated, sourced, cleansed and integrated from multiple operational systems before it makes it into a target decision support environment. In the absence of a centralized data store, the very same steps had to be repeated every time a new decision-making requirement appeared in a different environment such as marketing or finance.

It is important to understand that during the 80s, computer resources were scarce and considerably expensive compared to the abundance we enjoy today. As a result, the extensive data redundancy caused by repeating the same steps placed a heavy financial burden on enterprises. Inmon tackled all these concerns by defining the centralized enterprise data warehouse (EDW) on top of the relational model, utilizing the third normal form (3NF).
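As a toy illustration of the cleansing and integration step described above, here is a hedged sketch (the staging table names and the NY-to-New York mapping are hypothetical) of standardizing an inconsistent city value while integrating two operational systems:

-- Hypothetical standardization applied while integrating two source systems,
-- so that 'NY' and 'New York' no longer split the same city across silos.
SELECT customer_id,
       CASE WHEN UPPER(city) IN ('NY', 'NYC') THEN 'New York' ELSE city END AS city
FROM staging_customers_system_a
UNION ALL
SELECT customer_id,
       CASE WHEN UPPER(city) IN ('NY', 'NYC') THEN 'New York' ELSE city END AS city
FROM staging_customers_system_b;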
However, it is important to understand that Inmon's definition of the DWH did not come without its challenges, and high complexity was one of them. Crafting an EDW that spans the entire organization was not easy. It required a pool of subject matter experts and data warehousing experts working together to come up with the right solution. This somewhat limited the early adoption of the DWH across the world.
Since Inmon introduced the EDW there have been many changes and discussions along the way, and Ralph Kimball's definition of the dimensional data warehouse in the 1990s contributed greatly to the data warehousing world. Kimball's star schema and bottom-up approach to designing a DWH made it much easier for an organization to adopt a DWH with a significantly lower initial investment.
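To give a feel for Kimball's approach, here is a minimal, hypothetical star schema sketch: one fact table surrounded by denormalized dimension tables, queried with simple joins and aggregates (the table and column names are illustrative, not from any of the cited sources):

-- Hypothetical star schema: dimensions describe the "who/what/when",
-- the fact table records the measurable events.
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date DATE,
    year      INTEGER,
    month     INTEGER
);

CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    category    VARCHAR(50)
);

CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      DECIMAL(10, 2)
);

-- Typical analytical question: sales by category and year.
SELECT p.category, d.year, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d    ON d.date_key    = f.date_key
GROUP BY p.category, d.year;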
The rise of business intelligence, 1980s — 1990s
Relational databases had been leaping through innovations since the 1980s, and businesses had already started working with data in the megabyte and gigabyte range¹. SQL was also gaining popularity due to its simplicity. Oracle released its first SQL RDBMS, Oracle v2, in 1979. Microsoft released the first version of SQL Server in 1989 in collaboration with Sybase. Alongside IBM's System R development, another project known as Ingres emerged. Ingres did not use SQL, but it influenced the birth of a project called Postgres (Post-Ingres) in 1986, which a decade later was renamed PostgreSQL and open sourced in 1996. MySQL was also released in 1995 in a parallel context.
SQL was also maturing in the 1990s. In 1992, SQL-92 standardized a wider range of functionality, such as table joins, temporary tables and indexes, compared to its predecessors SQL-86 and SQL-89⁵ ⁶. But products like Oracle7, released in 1992, already had advanced features such as PL/SQL, which was later standardized in SQL:1999 as SQL/PSM.
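For illustration, here is a hedged sketch of the join syntax SQL-92 standardized, written against hypothetical employees and departments tables: the same query in the older comma-separated style and in the SQL-92 explicit JOIN style.

-- Pre-SQL-92 style: tables listed in FROM, join condition buried in WHERE.
SELECT e.name, d.dept_name
FROM employees e, departments d
WHERE e.dept_id = d.dept_id;

-- SQL-92 style: explicit JOIN with the condition stated in the ON clause.
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;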
These technological advancements facilitated the widespread adoption of relational databases by organizations, and the increased volume of collected data created early use cases for business intelligence. The term business intelligence was coined by Hans Peter Luhn, an IBM researcher, in the 1950s, but the definition we are familiar with today came from Howard Dresner in 1989.
business intelligence is an umbrella term to describe concepts and methods to improve business decision making by using fact-based support systems²
The growing use cases of business intelligence expanded the need for data warehouse (DWH) projects, consequently driving demand for Extract, Transform, Load (ETL) tools. Informatica released its first ETL tool, Informatica PowerMart³, in 1996, which was renamed PowerCenter in 1998. The first version of DataStage (later IBM InfoSphere DataStage) was released in 1997.
Multidimensional Data Analysis
Edgar F. Codd also coined the term OLAP in 1993 in his paper “Providing OLAP to User-Analysts: An IT Mandate”¹, which illustrated industry requirements for multidimensional analysis. In the early 1990s, analysis applications were using SQL with aggregate functions to generate multidimensional views. There were several issues with SQL when it came to multidimensional analysis, especially with GROUP BY, which were comprehensively examined in a 1996 paper⁴ by Jim Gray. One of them was that an N-dimensional summary required N separate SQL queries combined with UNION, as explained below with an example sourced from the original paper⁴ itself.
Figure 1 illustrates a three-dimensional summary table, which results from combining three SQL queries as outlined below.
SELECT Model, ALL, ALL, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model
UNION
SELECT Model, Year, ALL, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year
UNION
SELECT Model, Year, Color, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year, Color;
Besides being tedious to write, such queries increased query complexity, degraded database performance and delayed query processing, which made analytical operations quite time consuming. Gray's paper⁴ suggested the use of multidimensional cubes. The CUBE operator works by pre-aggregating and building a table with the aggregate values for every combination of dimensions.
The CUBE operator is the N-dimensional generalization of simple aggregate functions. The 0D data cube is a point. The 1D data cube is a line with a point. The 2D data cube is a cross tab, a plane, two lines, and a point. The 3D data cube is a cube with three intersecting 2D cross tabs.⁴
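For context, this idea was later standardized in SQL:1999 as the GROUP BY CUBE and GROUP BY ROLLUP extensions. A sketch against the same Sales table: ROLLUP reproduces the three aggregation levels of the UNION query above (plus a grand total), while CUBE produces all 2³ = 8 dimension combinations in a single statement, with NULL playing the role of the paper's ALL placeholder.

-- ROLLUP: aggregates at (Model), (Model, Year), (Model, Year, Color), plus a grand total.
SELECT Model, Year, Color, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY ROLLUP (Model, Year, Color);

-- CUBE: aggregates for every combination of the three dimensions.
SELECT Model, Year, Color, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY CUBE (Model, Year, Color);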
Eventually, discussions around the cube led to the inception of Multidimensional Expressions (MDX) and the release of Microsoft Analysis Services (SSAS) back in 1998.
Conclusion
I have summarized all the important defining moments in history covered in this article into a timeline.
The way I see it, problems and requirements that are too big for the technology of a particular time always give rise to a specialized branch of technology. The early requirements to solve data problems led to the emergence of the relational model and data warehousing principles at a time when computational power was scarce. Industry trends create market demands; market demands result in a series of technological advancements. As new technologies emerge, so do the specialized talents that accompany them. Similarly, the roles of ETL/DWH developer and business intelligence developer began to surface in the 1990s to bridge the gap between data sources and their consumers.
References
[1] E. F. Codd (1993), “Providing OLAP to User-Analysts: An IT Mandate”
[2] https://dssresources.com/history/dsshistory.html
[3] https://www.bloorresearch.com/company/informatica/
[4] Gray, Jim; Bosworth, Adam; Layman, Andrew; Pirahesh, Hamid (1996). “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”
[5] https://archive.org/details/federalinformati127nati
[6] https://archive.org/details/understandingrel00pasc/page/n10/mode/1up?q=Index