SQL and aggregated data: is there a better way?

ZD
GoodData Developers
6 min readJan 26, 2021

SQL is great for working with raw data. SELECTs, INSERTs, DELETEs, and UPDATES work great for the CRUD (Create, Read, Update, Delete) operations. However, SQL sucks when you start working with aggregated data. Let me demonstrate it on a simple sales orders example. You’ll be able to try it yourself with a simple SQLite setup described below.

There are two tutorials that show how to setup DBeaver with SQLite to run all the SQL queries and GoodData XAE (Extensible Analytical Engine) to execute the metric queries.

Working with aggregated data in SQL

I have the following data model:

SQL uses the GROUP BY statement for data aggregation. So if I want to slice and dice sales by product, I'll end up with this SQL query

SELECT 
PRODUCT_NAME,
SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS ORDER_AMOUNT
FROM ORDER_LINES O
JOIN PRODUCT P ON P.PRODUCT_ID = O.PRODUCT_ID
GROUP BY PRODUCT_NAME
ORDER BY 1 ASC;

Similarly, if I want to slice the sales by customer’s city, I get:

SELECT 
CUSTOMER_CITY,
SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS ORDER_AMOUNT
FROM ORDER_LINES O
JOIN PRODUCT P
ON P.PRODUCT_ID = O.PRODUCT_ID
JOIN CUSTOMER C
ON C.CUSTOMER_ID = O.CUSTOMER_ID
GROUP BY CUSTOMER_CITY
ORDER BY 1 ASC;

Same for trending sales by month (I use the SQLite date formatting):

SELECT 
STRFTIME('%Y/%m', ORDER_DATE),
SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS ORDER_AMOUNT
FROM ORDER_LINES O
JOIN PRODUCT P ON P.PRODUCT_ID = O.PRODUCT_ID
GROUP BY 1
ORDER BY 1 ASC;

The aggregation queries have a lot in common. However, with SQL, I always need to write the correct joins, grouping, etc.

Is there a better way?

Let me look at the three scenarios above from a more high-level perspective. In all cases, we work with the same number, the SUM(ORDER_QUANTITY * PRODUCT_PRICE) that we want to aggregate by PRODUCT_NAME, CUSTOMER_CITY, or by the ORDER_DATE. Let's call it REVENUE for now. Wouldn't it be great if SQL had a good, simple representation for the REVENUE? A representation that I can reuse in a simple way in all queries above to make them much simpler? For example:

CREATE METRIC "REVENUE" AS 
SELECT SUM("ORDER_QUANTITY" * "PRODUCT_PRICE");

Getting rid of JOINs

The next thing that I’d like to get rid of are the joins. They are always the same. I’d like to define them in the SQL data model once and don’t specify it in every query. I can perhaps use referential integrity constraints (primary keys and foreign keys) for the definition and preprocess the simplified queries to enrich them with the JOIN conditions inferred from the foreign keys.

Removing GROUP BYs

The SQL GROUP BY clauses can also be generated. I can perhaps use the supported SQL syntax GROUP BY 1,2,3,... to make the preprocessing simpler.

Result

With the metric, no joins, and grouping, I can rewrite the aggregation queries to something as simple as this:

SELECT PRODUCT_NAME, REVENUESELECT CUSTOMER_CITY, REVENUESELECT STRFTIME('%Y/%m', ORDER_DATE), REVENUE

Multiple aggregations in one query

The simple SQL query generator idea looks interesting. But we can take it even further. What if I need to compute the REVENUE for two different product categories (e.g. 'Home', and 'Outdoor') and aggregate it by CUSTOMER_CITY? The best way how to do this in SQL that I found so far is following:

SELECT 
CUSTOMER_CITY,
SUM(CASE
WHEN PRODUCT_CATEGORY = 'Outdoor'
THEN ORDER_QUANTITY * PRODUCT_PRICE
END),
SUM(CASE
WHEN PRODUCT_CATEGORY = 'Home'
THEN ORDER_QUANTITY * PRODUCT_PRICE
END)
FROM ORDER_LINES O
JOIN PRODUCT P ON P.PRODUCT_ID = O.PRODUCT_ID
JOIN CUSTOMER C ON C.CUSTOMER_ID = O.CUSTOMER_ID
GROUP BY CUSTOMER_CITY;

If I want to aggregate the revenue by other columns (e.g. PRODUCT_NAME or ORDER_DATE) instead of the CUSTOMER_CITY, I need another complex query.

Metric’s WHERE clause

The metric can support WHERE clause. Then the views and queries above would read like this:

CREATE METRIC "REVENUE (Home)" AS 
SELECT "REVENUE" WHERE "PRODUCT_CATEGORY" = 'Home';
CREATE METRIC "REVENUE (Outdoor)" AS
SELECT "REVENUE" WHERE "PRODUCT_CATEGORY" = 'Outdoor';

and then the final query would be:

SELECT "CUSTOMER_CITY", "REVENUE (Outdoor)", "REVENUE (Home)";

Moreover, if I’d need the aggregation by ORDER_DATE, I can simply reuse both metrics

SELECT ORDER_DATE, "REVENUE (Outdoor)", "REVENUE (Home)";

Simple one-liners that reuse metrics across multiple aggregations unlike the complex SQL queries above. Beautiful!

Percentages and shares

Another great opportunity for simplification is aggregation locking. The most frequent use-case is the computing base for percentages and shares in one query with fractions. For example, computing a share of a certain product category for every city. I’ll again start with the SQL implementation and use views for decomposing the complex query to multiple steps:

First, I’ll compute the total revenue per city:

CREATE VIEW REVENUE_BY_CITY AS 
SELECT
CUSTOMER_CITY,
SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS BASE_AMOUNT
FROM ORDER_LINES O
JOIN PRODUCT P ON P.PRODUCT_ID = O.PRODUCT_ID
JOIN CUSTOMER C ON C.CUSTOMER_ID = O.CUSTOMER_ID
GROUP BY CUSTOMER_CITY;

then I’ll compute the PRODUCT_CATEGORY breakdowns of the revenue in different cities

CREATE VIEW REVENUE_BY_PRODUCT_CATEGORY_AND_CITY AS 
SELECT
PRODUCT_CATEGORY,
CUSTOMER_CITY,
SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS ORDER_AMOUNT
FROM ORDER_LINES O
JOIN PRODUCT P ON P.PRODUCT_ID = O.PRODUCT_ID
JOIN CUSTOMER C ON C.CUSTOMER_ID = O.CUSTOMER_ID
GROUP BY PRODUCT_CATEGORY, CUSTOMER_CITY;

and divide the ORDER_AMOUNT by the BASE_AMOUNT in both view join:

SELECT 
RPC.CUSTOMER_CITY,
PRODUCT_CATEGORY,
ORDER_AMOUNT,
BASE_AMOUNT,
ORDER_AMOUNT / BASE_AMOUNT
FROM REVENUE_BY_PRODUCT_CATEGORY_AND_CITY RPC
JOIN REVENUE_BY_CITY RC ON RC.CUSTOMER_CITY = RPC.CUSTOMER_CITY
GROUP BY RPC.CUSTOMER_CITY, PRODUCT_CATEGORY;

Again, I needed to create two very aggregation specific views and join them.

Alternatively, if the target database supports analytics/window functions, this can be done in one query:

SELECT * FROM (
SELECT
CUSTOMER_CITY, PRODUCT_CATEGORY,
SUM(ORDER_QUANTITY * PRODUCT_PRICE)
OVER (PARTITION BY PRODUCT_CATEGORY, CUSTOMER_CITY) AS ORDER_AMOUNT,
SUM(ORDER_QUANTITY * PRODUCT_PRICE)
OVER (PARTITION BY CUSTOMER_CITY) AS CITY_ORDER_AMOUNT,
SUM(ORDER_QUANTITY * PRODUCT_PRICE) OVER (PARTITION BY PRODUCT_CATEGORY, CUSTOMER_CITY)
/
SUM(ORDER_QUANTITY * PRODUCT_PRICE) OVER (PARTITION BY CUSTOMER_CITY) AS PERCENTAGE,
ROW_NUMBER() OVER (PARTITION BY PRODUCT_CATEGORY, CUSTOMER_CITY ORDER BY O.ORDER_LINE_ID) as ROWNUM
FROM ORDER_LINES O
JOIN PRODUCT P ON P.PRODUCT_ID = O.PRODUCT_ID
JOIN CUSTOMER C ON C.CUSTOMER_ID = O.CUSTOMER_ID) x
WHERE ROWNUM = 1
ORDER BY CUSTOMER_CITY, PERCENTAGE DESC;

By the way, there is a shorter option of the query that uses DISTINCT, but I’m not convinced about its performance.

Metric’s BY clause — forcing aggregation

The metric can force (or lock) aggregation on a certain level. Let’s create a metric that computes the total revenue by CUSTOMER_CITY:

CREATE METRIC "REVENUE BY CITY" AS 
SELECT "REVENUE" BY "CUSTOMER_CITY" ALL OTHER;

then reuse the REVENUE metric created above and compute the share:

CREATE METRIC "REVENUE PRODUCT_CATEGORY % IN CITY" AS
SELECT "REVENUE" / "REVENUE BY CITY" BY "CUSTOMER_CITY";

And the final aggregation query:

SELECT 
"CUSTOMER_CITY",
"PRODUCT_CATEGORY",
"REVENUE",
"REVENUE BY CITY",
"REVENUE PRODUCT_CATEGORY % IN CITY"

Again, much simpler and both new metrics can be reused for computing the fractions not only by the PRODUCT_CATEGORY but by any other column (e.g. PRODUCT or ORDER_DATE).

My take is that the key problem is that SQL doesn’t have any representation for aggregated numbers. Do you have any idea how to elegantly implement the metric in SQL? Do you know a product or technology that somewhat extends SQL with a concept similar to the metric concept described in this article? Write me over email.

I use GoodData MAQL syntax in this article. You can try this metric implementation yourself. Sign up for the FREE tier of GoodData platform. You can also use the experimental JDBC driver that I wrote to create and execute metrics from a SQL query console, like DBeaver or JetBrains DataGrip.

I also wrote a brief tutorial based on this article that shows how to set up GoodData XAE, connect it to DBeaver to create and execute the metrics described above. Another tutorial shows how to setup DBeaver with SQLLite to run the SQL queries from this article.

Originally published at https://github.com.

--

--