Formal Conversations With the Data

Jakub Cabała
The Quantastic Journal
6 min readAug 14, 2024
Formal Conversations With the Data
Source: fotor.com

Introduction

If you’ve ever worked with databases, you’ve probably encountered SQL. It’s a user-friendly query language used to manage the data efficiently. Recently, while refreshing it, I wondered, “Is there a way to abstract it into a mathematical system, so I can use Greek letters and make it look nice?” It turns out there is, and it’s quite interesting. This system is called relational algebra. It allows you to express database operations formally, and in this blog post, we will explore how it does that. To follow along, one needs only a basic knowledge of SQL and relational databases.

What is relational algebra?

Relational algebra is a system introduced by Edgar F. Codd as a way to provide the theory behind relational databases. It allows one to reason about this model without binding oneself to a particular query language. The main object relational algebra works with is, you guessed it, relations which are just data tables.

Example data table

Set theory operations

The first two basic operations are well-known from set theory: union and difference. I believe that their work requires no further explanation. It is worth noting that for those two operations to make sense, the tables have to have the same columns (same data type, names, and numbers).

Another basic operation that can be found in set theory is a Cartesian product. This is an operation of taking all possible ordered pairs of elements from two sets (tables in our case) and is denoted by X. More formally:

How do those operations manifest themselves in SQL? The keyword UNION performs a union operation and the keyword EXCEPT performs a difference operation, as demonstrated in the query below:

SELECT customer_name FROM sales_db.sales_records
UNION
SELECT customer_name FROM marketing_db.campaign_responses
UNION
SELECT customer_name FROM support_db.support_tickets
EXCEPT
SELECT sr.customer_name FROM sales_db.sales_records sr
JOIN support_db.support_tickets st ON sr.customer_name = st.customer_name;

The Cartesian product operation is implemented by a CROSS JOIN:

SELECT e.employee_id, e.name AS employee_name, d.department_id, d.department_name
FROM employees e
CROSS JOIN departments d;

Other fundamental operations

We already used SELECT in the examples above and now have to make sense of it in our system. To do that, we have to introduce the selection operator. It's denoted in the following way:

In theory, it is a unary operator, but it can also be considered a binary operator that takes a formula and a table, returning all rows that satisfy the formula. Note that the returned rows form a table. For example, this SQL query:

SELECT *
FROM employees
WHERE department = 'Sales';

would be translated to:

However, we don’t always need all the data from each row. Sometimes, we only want specific parts, such as the name of a student or the price of a grocery item. This is where the projection operator comes in. It takes in a table and returns only the specified attributes. It’s denoted as:

Now, we can translate a slightly modified version of the previous query:

SELECT name, age
FROM employees
WHERE department = 'Sales';

It becomes:

That is almost all of the basic operators in relational algebra. There is just one more called rename, which does exactly what the name suggests, so we will leave it as is.

The beautiful part is that it is enough! Using only the couple of operators we introduced, we can create many more complicated ones. One example that we are going to look at is a JOIN operation.

Join operation

The join operation combines rows from two tables based on a specified condition. In relational algebra, there are several types of joins. Let's see what they are and how we can express them in terms of fundamental operations.

Natural Join

The natural join combines two relations in which all attributes with the same name are equal, and projects one of each pair of duplicate columns. It is denoted by the symbol:

Let’s break down a natural join of 2 relations R and S (R⋈S)in terms of basic operations.

  1. Cartesian Product: First, create a Cartesian product of R and S.
  2. Selection: Then, apply a selection condition where all common attributes are equal.
  3. Projection: Finally, project only the necessary attributes, excluding duplicate columns.

For example, given the relations:

this query:

SELECT 
Employees.EmployeeID, Employees.EmployeeName, Departments.DepartmentName
FROM Employees
NATURAL JOIN Departments;

translates to:

Theta Join

A theta join is a more general form that allows for a condition other than simple equality. It is represented by

where θ is the condition. In relational algebra, this is implemented as:

  1. Cartesian product
  2. Selection on condition θ

Thus, the relational algebra expression is:

where R and S are relations. For example, given the data from above, the following query:

SELECT *
FROM Employees
JOIN Departments
ON Employees.DepartmentID = Departments.DepartmentID
AND Employees.EmployeeName = 'Alice';

translates to:

Outer Joins

Outer joins extend the idea of joins to include unmatched rows from one or both tables. There are three types of outer joins:

  1. Left Outer Join: Includes all rows from the left table and matched rows from the right table. Unmatched rows from the left table will have NULL for columns from the right table.
  2. Right Outer Join: Same as the previous one but with the left and right tables’ roles switched
  3. Full Outer Join: Includes all rows when there is a match in one of the tables. Unmatched rows will have NULL for the columns from the non-matching side.

Expressing outer joins in basic relational algebra operations is complex, as relational algebra does not directly support NULL values or the idea of retaining unmatched rows. I'm not going to derive them in this article, but expressing them in terms of basic operations is possible.

Conclusion

Relational algebra might seem abstract, but it is a nice way to explain how we talk with data. It is a minimal system, and yet it allows us to construct all the operations we need. Next time you write a SQL query, try to translate it to relational algebra. You’ll learn a lot about how you exactly want the data to behave, and you will get a nice intellectual exercise.

--

--

Jakub Cabała
The Quantastic Journal

Software developer and student. Interested in math and CS.