SQL/RDBMS: Set Theory in Action
Originally this post was going to be a comparison between various relational database manage systems (RDBMS), specifically MySQL and PostgreSQL. After some diligent research I came to this conclusion. Either you’re a seasoned developer working on a difficult scaling problems and you know your problem well enough to make an educated choice for which database you choose. Or, more likely, you’re just trying app off ground like the rest of us, in which case you should go to your near sofa reach in between the cushion find the coin with the most pocket lint stuck to it and flip it. Heads you chose MySQL, tails PostgeSQL. Then you can be free to focus on more important things like building your application. In reality, most RDBMS’s are more the same than they are different. at the end of the data if you’re at the point when you have to optimize the performance of your database then you’re in a good position. So instead of doing a comparison my plan is to break down the basic principle of RDBMS and SQL.
Edgar F. Codd is a famous computer scientist who started work at IBM in 1948 as a mathematical programmer. In 1965 he went on to receive a PhD from the University of Michigan, his thesis was about self-replication in cellular automata. 5 years later, Edgar publish a paper, “A Relational Model of Data for Large Shared Data Banks.” This paper laid the foundation for the modern relational database management system or RDBMS. Despite the retrospective genius of Codd’s paper IBM took no action. It wasn’t until after a team at UC-Berkeley began work on INGRES, a precursor to PostgreSQL, and Larry Ellison released the first commercially available relational database in 1977 (with the company that would eventually change its name to Oracle) that IBM finally decided to take action. Today the top 4 most used databases on the web are relational databases based off of Codd’s pivotal paper. Unfortunately Codd never made much money off of his ideas, but in 1981 he was awarded the Turing Award (the Nobel prize of computing) for his achievements.
In 1970 most databases typically had a combination of three dependencies “ordering dependence, indexing dependence, and access path dependence.” An ordering dependence is when an element is ordered based on a key attribute. For example, if all data entries were ordered by first name. This dependency causes issues to arise when a diferent ordering is needed. For example: when all elements are ordered by first, changing the ordering to last name could be computationally expensive. Indexing dependence happens with data structures like arrays. Each element in an array has a unique index to locate it, which is great for random access times, but gets complicated when data is inserted or deleted. Also, since each entry requires an index it creates more unnecessary data. Finally, access path dependence is when data is stored using pointers to locate data such as in a tree or map. In order to get to each entry a path of pointers must be followed. This makes certain queries such as fetching all entries with the first name “John” much more complicated.
In order to combat the issue of dependent and ad hoc databases Codd uses his mathematical knowledge of set theory to define a more abstract system. Edgar’s system would be a relational system that stores data in tables, like an excel spread sheet. Each row in the table is a new entry and the column defines the type. This allows engineers to treat tables as a set.
In mathematics a set can be any collection of definite and distinguishable data. This means that sets can’t just be defined as arbitrary ranges. For example all values less than zero is an invalid set, because there are an infinite number of values less than zero. An example of a set is all even numbers greater than 0 and less than 10. This set could also be represented as:
{2, 4, 6, 8}
2, 4, 6, and 8 are all independent members of this set. In this example each member stores one value. However, each member can contain more than one value. Exampling being if we have a set of people. As, the most modest person I know, I’m going use a subset of the people who have complemented me over the years for this example:
{(Jamie Yu), (Barrack Obama), (Caroline Wozniaki), (Channing Tatum)}
Each element in this set is called a member. Each member is made up of two values a first name and a last name. The order is of the two values is important; (Jennifer Lawrence) does not equal (Lawrence Jennifer). Since there are two values in each member, each member would be considered a relation of degree 2. This can scale up, and instead of each member having double the number of values, a member can have, triple, quadruple, sextuple, septuple, n-tuple the number of values. (If you’ve ever written code in Python or another language that uses tuples, this is where the term comes from). In relational databases every table is a set, every row is a tuple, and columns define the order of data in that tuple. One of the key principles that carries over from set theory is that each member has to be unique. For example this is an invalid set:
{(Jamie Yu), (Barrack Obama), (Caroline Wozniaki), (Caroline Wozniaki), (Channing Tatum)}
Since the member (Caroline Wozniaki) was repeated twice we violated set theory. In the real world repeat values happen all the time. In order to comply with set theory RDBMS will require that each row has its own unique primary key. Every entry in a table/set is given a unique primary key. This allows each member/row/tuple to be unique. In a RDBMS the above set might look more like:
{(111 Jamie Yu), (792 Jennifer Lawrence), (643 Caroline Wozniaki), (432 Caroline Wozniaki), (519 Channing Tatum)}
Each member now has their own 3 digit primary key that be can used to differentiate it. By using the principles of set theory Codd was able to define a whole new type of database. This method is much more abstract, which means goodbye ad hoc method. The ad hoc databases of the past were programmed uniquely for a single purpose or application. If a programmer needed to change the structure of the database, even if that meant changing a floating number field to an integer field, a programmer would have update the source code to make the change, which is never ideal. After Codd’s paper SQL was developed.
The programming language SQL was developed by Donald Chamberlin and Raymond Boyce. SQL stands for structured query language. It is a domain-specific language designed for querying relational databases. A domain-specific language is what it sounds like: “a computer language specialized to a particular application domain.” So for an example Python is not a domain specific language because it can be used to create a wide variety of applications. HTML is an example of a popular domain specific language that’s used only to create user interfaces. SQL was originally named SEQUEL for structured English query language. Despite the name change the pronunciation remained. SQL uses a natural sounding syntax so it’s easy to pick up. For example if I wanted to query for all emails in an Employee table I would write:
SELECT Email FROM Employees
This makes SQL easy to pick up and easy to read even for people with out a technical background. SQL creates a layer of abstraction to enable developers to create their own implementation of an RDBMS behind it. Not surprisingly, each SQL operation is designed around set theory in order to work well with any RDBMS implementation. So, for those with experience with SQL, SELECT, FROM, WHERE, CREATE, etc. are all set operations.
There’s no doubt that Edgar Codd had a huge impact on the world. Today relational databases are responsible for a large percentage of the persistent storage on the web. Applications like Facebook would most likely be agonizingly slow and many startups would fail to get off the ground if it wasn’t for Edgar F. Codd. Sadly he passed away in 2003 at the age of 79, but his contributions to society live on.
Fun fact: Edgar is often abbreviated to Ted. Maybe I’m just ignorant by I didn’t know that before writing this post.
Interesting Link:
- Ted’s Pivotal Paper: A Relational Model of Data for Large Shared Data Banks
- The history of SQL
- IBM Archives: Edgar (Ted) Codd
- New Yorrk Times: Edgar F. Codd, Dies; Key Theorist of Databases
- MariaDB: Set Theory and SQL
Disclaimer: I make mistakes. It’s getting late and I need to get this post out if you find any typos or errors let me know in the comments below.