Active Counts in SQL Using a Date Dimension

What NFL Hall of Fame data can teach us about business.

Eduardo Senior
Learning SQL
11 min read · Apr 7, 2024


Source: Neil Leifer/SI

Welcome in.

I want to show you a little trick. It’s a join in SQL that will help you solve a key business question that comes up again and again across an array of functional areas:

How can we measure the active status of an entity over time?

While it might seem trivial, having the ability to work through these types of problems can dramatically expand your analytical toolkit. Think about the following use cases:

  • HR is trying to gauge the number of active employees at the company. They want to do a year-over-year analysis.
  • IT leadership is asking for a report showing how many open cases the company’s help desk has. They’re interested in seasonality.
  • Sales directors are trying to count active opportunities in their sales pipeline. They’re looking to compare trends before and after a product launch.

These are classic business intelligence questions that virtually every company is asking. Leadership often has a grasp on today’s numbers, but things tend to go awry when we ask about those very same metrics over time. And if you’re reading this article right now, I’d wager that you’re the person at your company responsible for making this happen. A good approach to tackling these problems can go a long way for you and your stakeholders. And I have you covered.

Here’s the plan.

We’ll begin this post by exploring the root of the problem: why exactly is it so difficult to answer these questions? Next, we’ll introduce our proposed solution by explaining the concept of a date dimension and its role in overcoming these obstacles. To wrap up, we’ll tie everything together with a comprehensive demonstration: we’ll interpret and visualize our findings and showcase the approach’s practical applications and benefits.

Let’s begin.

The issue.

Data availability is usually not the challenge. Most of the entities we mentioned earlier can typically be found in a dimension table or directory: a list of employees from Workday, a dump of cases from ServiceNow, or the opportunity object in Salesforce.

Upon inspection, however, you will quickly notice that many of these tables are structured so that each employee or opportunity is represented by exactly one record.

Sample employee directory. Image by author.

And therein lies the problem. Sure, these records have start and end dates associated with them, which allow us to determine whether a given employee is active or not. But in their current form, these datasets only allow us to gauge what is active as of today.

select
    count(distinct employee_id) as active_total
from employees
where termination_date is null

What we really want to know is what this number looks like over time: by year, quarter, month, or even day.

Enter the date dimension.

A date dimension is simply a table designed to have a record for each date in the calendar. These records contain a series of attributes that expand on the date in question: things like year, month, day, week, and quarter. Some even include flags to indicate special occasions, like that day being a holiday or the end of a fiscal year.

Sample date dimension. Image by author.
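
For a concrete picture, a minimal version might look like the sketch below (the date_day column is the one the demo queries later; see Appendix A for full build recipes):

-- A minimal date dimension: one row per calendar day.
create table all_dates (
    date_day date,
    dateyear int,
    datemonth int,
    dateday int,
    is_holiday boolean
);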

So let’s go back and think about our problem. We said earlier that the crux of the issue lies in the nature of most datasets. Typically, records are static, capturing only snapshots in time — such as the start and end dates of an employee’s tenure.

But what if we can change how the dataset is organized?

By leveraging a date dimension, we can join static date fields from any dataset to a continuous calendar. This method allows us to populate a record for each day an entity was active, providing a granular, day-by-day view of activity over time.

Concept of joining to date dimension. Image by author.
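
In its simplest form, the join looks something like the sketch below. It assumes an employees table with hire_date and termination_date fields and a date dimension called all_dates; the coalesce keeps still-active employees (those with no termination date) in play through today.

select
    e.employee_id,
    d.date_day
from employees e
join all_dates d
    on d.date_day between e.hire_date
        and coalesce(e.termination_date, current_date)

The output has one row per employee per day of tenure, which is exactly the shape we need for counting over time.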

Show & tell.

Alright, let’s drive this idea home with a topic that really matters: football.

Now, I know what you’re thinking: “What does football have to do with our business questions?” Well, quite a bit, it turns out.

In this demo, we’re going to explore how we can use SQL and our date dimension technique to track and analyze the careers of NFL legends. The idea here is simple: if we can understand the dynamics of football careers over time, we can apply the same concepts to employees, cases, sales opportunities — you name it.

Here’s the question we’re looking to answer:

Which NFL season featured the most future Hall of Fame players?

Lucky for us, we have a great dataset available as part of the Kaggle community — shout out to Matt OP 🤙

To start, download the data and load it into your preferred data platform or RDBMS. I will be using Apache Spark via the PySpark APIs as well as Spark SQL throughout the demo, but the same concepts should apply regardless of the technology.
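
If you’re following along in Spark, one quick way to register the CSV as a table is a sketch like this (the file path is a placeholder; point it at wherever you saved the download):

-- Spark SQL: expose the downloaded CSV as a queryable table.
create table nflhofplayers
using csv
options (
    path '/path/to/nfl_hof_players.csv', -- placeholder: your download location
    header 'true',
    inferSchema 'true'
);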

After loading up the table (I named mine nflhofplayers), we can preview the data and make our initial observations.

Data preview. Image by author.

We can see the table has 306 records (one for each player inductee, 1963–2022) and is composed of 33 fields. It has things like position, all_pro_selections, and total_games_played. These are great data points that can help sharpen our analysis later on, but to answer our initial question, we will focus on two important date fields: from and to. These fields represent the first and last year of a player’s career in the NFL. (Because from and to are also SQL keywords, we’ll escape them with backticks in the queries that follow.)

Let’s zoom in on a specific player to get a better understanding:

Calvin Johnson.

Yeah, this guy.

Source: Tenor

Calvin Johnson played 9 years in the NFL (2007–2015) for the Detroit Lions and was an unstoppable force at the wide receiver position. In the clip above, he’s hauling in a clutch, game-tying catch in a showdown with the Bengals, a game in which he put up 155 receiving yards and 2 touchdowns.

Here’s what his record looks like in our table:

select
    id,
    player,
    position,
    `from` as played_from,
    `to` as played_to
from nflhofplayers
where id = 8
Calvin Johnson’s record. Image by author.

Now that we understand what we are looking at, let’s reframe the question we posed earlier:

How can we produce a record of Calvin Johnson for each year that he played in the league?

If we can achieve this, we can group on that particular date column, and Johnson will be counted in each of the years he was active in the NFL. Do this across all of the players, and we have our answer.

At least in theory... Let’s take our date dimension out for a spin.

with

calendar as (
    select year(date_day) as dateyear
    from all_dates
    group by 1
)

, players as (
    select
        id,
        player,
        position,
        `from` as played_from,
        `to` as played_to
    from nflhofplayers
    where id = 8
)

, joined as (
    select *
    from players
    left join calendar
        on dateyear between played_from and played_to
)

select * from joined;
Post-join record set for Calvin Johnson. Image by author.

Let’s break down what’s happening in this query.

  1. The first thing we’ll need to do is roll up our dates to the year level, so you’ll notice the calendar CTE does just that. We do this because it’s the level of granularity that we’re interested in. If your analysis is at the week, month, or day level, just make sure to roll up your date dimension accordingly (see the sketch after this list).
  2. Next, our players CTE is just a copy and paste from the previous code block. We’re still focused on Calvin here (notice the id=8 in the where clause). At this point, you’re welcome to grab as many columns as needed for your specific analysis. For now, we’ll just select position along with our two essential date fields.
  3. When we take stock of what’s happening in the joined CTE, the real game-changer is the between operator — it’s what lets us throw a range into the mix when we're joining data. And that's incredibly powerful. It means we can easily scoop up any dates that sit within our specified start and end points straight from the date dimension (i.e., calendar).
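
For instance, a month-level rollup of the same date dimension might look like this (a sketch; one row per year-month pair):

select
    year(date_day) as dateyear,
    month(date_day) as datemonth
from all_dates
group by 1, 2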

And that’s it.

Source: Tenor

Well, almost. To apply this logic to all players in the dataset, remove the id filter, then group by dateyear and count(distinct id).

with

calendar as (
    select year(date_day) as dateyear
    from all_dates
    group by 1
)

, players as (
    select
        id,
        player,
        position,
        `from` as played_from,
        `to` as played_to
    from nflhofplayers
)

, joined as (
    select *
    from players
    left join calendar
        on dateyear between played_from and played_to
)

, grouped as (
    select
        dateyear,
        count(distinct id) as players
    from joined
    group by 1
)

select *
from grouped
order by dateyear desc;
Active player count by year. Image by author.

Now, our problem seems easier than ever to solve. You can take a max, sort on the players column, or even graph it out.
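
For example, surfacing the top season is a one-line change to the final select of the query above:

-- Swap in as the final select to pull the single busiest season.
select dateyear, players
from grouped
order by players desc
limit 1;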

Players grouped by year active. Image by author.

Here’s to 1971.

That season must have been like watching a live audition for Canton, with each game unfolding as a chapter in future Hall of Famers’ highlight reels.

The 1971 season featured a total of 81 players who would eventually become recipients of a gold jacket.

That year, the Dallas Cowboys clinched their first Super Bowl victory, a testament to the high-caliber play that defined the season. Players like Roger Staubach began cementing their legacies, contributing to a campaign filled with epic comebacks and heart-stopping plays.

For those of us who weren’t around to see these legends live, it’s hard to deny that this feels right. We’ve all heard the anecdotes: all of those crazy stories from dad, uncle, or grandpa, telling us how ‘we used to watch the REAL football players back then!’ The data backs them up! And as much as we hoped we never had to, we may just have to admit that our obnoxious uncle was right: the 70s rocked 🤘

As promised.

I’m sure you can now see how, as advertised, this translates seamlessly into the world of business 😏… Just swap the played_from and played_to date fields in the join for whatever constitutes a “tenure” for a given entity’s state. Employees have ‘hire’ and ‘termination’ dates; cases have ‘open’ and ‘closed’ dates; opportunities have ‘created’ and ‘converted’ dates. Whatever the industry, it’s the start and end dates that define an entity’s ‘active’ season, and that’s what you feed the between operator.
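
To make that concrete, here’s a hedged sketch of the HR use case, assuming an employees table with hire_date and termination_date fields (note the coalesce, which keeps still-active employees from dropping out of the between):

with calendar as (
    select year(date_day) as dateyear
    from all_dates
    group by 1
)

select
    c.dateyear,
    count(distinct e.employee_id) as active_employees
from employees e
join calendar c
    on c.dateyear between year(e.hire_date)
        and year(coalesce(e.termination_date, current_date))
group by 1
order by 1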

Thanks for reading 🤙

References:

  1. Campos, G. (2023, October 30). Date dimension: How to create a practical and useful date dimension in dbt. Indicium Engineering. Retrieved from https://medium.com/indiciumtech/date-dimension-how-to-create-a-practical-and-useful-date-dimension-in-dbt-5ee70a18f3bb
  2. McEwen, J. (2022, June 5). How to easily create a SQL date dimension table. Learning SQL. Retrieved from https://medium.com/learning-sql/how-to-easily-create-a-sql-date-dimension-table-a3f46537c500
  3. OP, M. (n.d.). NFL Pro Football Hall of Fame 1963–2022. Kaggle: Your machine learning and data science community. Retrieved March 29, 2024, from https://www.kaggle.com/datasets/mattop/nfl-pro-football-hall-of-fame-19632022?resource=download
  4. Pro Football Hall of Fame. (n.d.). Roger Staubach. Retrieved March 29, 2024, from https://www.profootballhof.com/players/roger-staubach/

Appendix A: Create your own date dimension.

Excel. The easiest way to get started. Here are the steps:

  1. Determine the Date Range: identify your start date and end date. For example, 2023-01-01 to 2023-12-31.
  2. Create a New Excel Spreadsheet: open Excel and start a new worksheet.
  3. Enter the Start Date: in cell A1, enter your start date.
  4. Generate Sequential Dates: in cell A2, enter the formula =A1+1. Copy this formula down column A until the date displayed is your end date.
  5. Add Time Attributes: label additional columns for Year, Month, Day, etc., starting from column B. Use Excel functions like YEAR(), MONTH(), DAY() to populate these columns based on the date in column A.

You can set up a range that works for you and then bring it into your database of choice.

PostgreSQL. If you’re using Postgres, no worries: you have the generate_series function at your disposal.

  1. Create the date dimension table.
create table date_dimension (
    date_key date primary key,
    year int,
    month int,
    day int
    -- Add more attributes as needed
);

  2. Generate the date range and populate the table.

insert into date_dimension (date_key, year, month, day)
select
    date_series::date,
    extract(year from date_series),
    extract(month from date_series),
    extract(day from date_series)
from generate_series(
    'start_date'::timestamp, -- replace with a literal date, e.g. '2023-01-01'
    'end_date'::timestamp,   -- replace with a literal date, e.g. '2023-12-31'
    '1 day'::interval
) as date_series;

A similar approach works in MS SQL Server, which gained its own GENERATE_SERIES function in SQL Server 2022. For a more detailed explanation, please check out this article by Jon McEwen in the Learning SQL publication here on Medium.

Databricks/Spark. Python and DataFrames can be helpful here.

from pyspark.sql.functions import col, explode, year, month, dayofmonth

# Set start and end dates (placeholders; substitute real dates,
# e.g. "2023-01-01" and "2023-12-31")
startDate = "start_date"
endDate = "end_date"

# Generate the date range as an exploded array of days
dateDF = spark.sql(f"""
    select
        sequence(
            to_date('{startDate}'),
            to_date('{endDate}'),
            interval 1 day
        ) as date_array
""").withColumn("date", explode(col("date_array")))

# Use Spark SQL functions to extract year, month, day, etc.,
# from the date column
dateDimensionDF = dateDF.select(
    col("date"),
    year(col("date")).alias("year"),
    month(col("date")).alias("month"),
    dayofmonth(col("date")).alias("day")
    # Add more attributes as needed
)

# Save as a table, e.g.:
# dateDimensionDF.write.saveAsTable("all_dates")

dbt. Use Jinja templating and the dbt_utils package’s date_spine macro for dynamic date ranges.

with date_series as (

    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('" ~ var("start_date") ~ "' as date)",
        end_date="cast('" ~ var("end_date") ~ "' as date)"
    ) }}

)

select
    date_day,
    extract(year from date_day) as year,
    extract(month from date_day) as month,
    extract(day from date_day) as day
    -- Add more attributes as needed
from date_series
In dbt, you can pass variables in your dbt run command or set them in your dbt_project.yml.
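
For example (dates are illustrative):

dbt run --vars '{"start_date": "2023-01-01", "end_date": "2023-12-31"}'

Or set them once in dbt_project.yml:

vars:
  start_date: '2023-01-01'
  end_date: '2023-12-31'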

If you’re interested in this approach and you’d like to go a little deeper, check out this post by Gabriel Campos.

Appendix B: Alternatives.

Conditional aggregation. You do have the option of forgoing the date dimension altogether. This means no need to create or maintain a separate table, which is always nice. If your period is well-defined, this approach can be more straightforward.

-- MySQL-flavored: generate the first of each month of 2023 on the fly,
-- then count employees active as of each check date.
select
    year(check_date) as year,
    month(check_date) as month,
    count(employee_id) as active_employees
from (
    select date '2023-01-01' + interval (a + 10 * b) month as check_date
    from (select 0 a union all select 1 union all select 2 union all select 3
          union all select 4 union all select 5 union all select 6
          union all select 7 union all select 8 union all select 9) as a
    join (select 0 b union all select 1 union all select 2 union all select 3
          union all select 4 union all select 5 union all select 6
          union all select 7 union all select 8 union all select 9) as b
    where a + 10 * b < 12 -- keep only the 12 months of 2023
) as months
join poc.employees
    on hire_date <= check_date
    and (termination_date is null or termination_date >= check_date)
group by year, month
order by year, month

However, you’re taking a hit on scalability and flexibility. As your analytical needs grow more complex, this method can become cumbersome. Maintainability can suffer, too: if your period of interest changes, you’ll need to constantly update your logic. The code above only works for 2023, for example.

Snapshot facts. You could also build out a data pipeline to capture snapshots of these entities in a fact table. For example, you could create a table fct__active_employees that incrementally loads the current count of active employees.

-- Delta Lake / Spark SQL merge syntax
merge into fct__active_employees as target
using (
    select
        current_date() as snapshot_date,
        count(*) as active_employees
    from employees
    where termination_date is null
) as source
on source.snapshot_date = target.snapshot_date
when matched then update set *
when not matched then insert *
