Grouping Continuous Values in SQL with NTILE

Dan Sullivan
SQL for Data Science
3 min readJun 24, 2018

When trying to understand a trend or pattern in our data it often helps to group data into categories or groups. Sometimes, the groups naturally follow from the data. For example, departments, cities, and product categories typically have discrete values that are easily grouped values. When we work with continuous value data, like salaries, temperatures, and time we don’t have discrete values that obviously map to categories. In these cases, we can create ranges of values and designate groups based on those ranges.

Let’s consider an streaming movie service. The service has some customers that watch very little in a month while other customers like to binge watch shows one after another. Most of the customers fall somewhere in between these extremes. Data scientists at the streaming service want to better understand the characteristics of customers who watch a lot of shows, specifically what distinguishes them from the customers who watch very little. That’s no problem, because every time a customer watches a show, their viewing time is tracked in a database.

We can generate a table listing like this using:

SELECT
customerID,
viewMonth,
viewMinutes
FROM
customerViewingByMonth

If we wanted to create categories, for example, a set of four categories such as Low, Low-Mid, Mid-High, and High we could use the NTILE function. The NTILE function takes a number designating the number of categories we’d like to define so we’ll use 4 to create 4 categories. We also need to tell SQL how to order the rows when calculating the members of each category. We want to order by the total viewing time so we’ll tell SQL to order by viewMinutes. We also want to categorize viewers based on their monthly viewing so we’ll also need to tell SQL to partition, or work with a subset of all the data in the table, and group rows only within a single month.

Here’s the SQL SELECT statement to do that:

SELECT
customerID,
viewMonth,
viewMinutes,
ntile(4) OVER (PARTITION BY viewMonth ORDER BY viewMinutes)
FROM
customerViewingByMonth

The results are shown in Table 2.

NTILE is a SQL window function. That means it works with a subset of ordered rows. The window specifies which rows SQL “sees” when it is executing a query. In this case, partitioning by viewMonth makes SQL work with just one month at a time when determining how to assign NTILE numbers to each row.

NTILE is used to add discrete value columns that group continuous values. These additional columns are especially useful when you need to group your data into distinct buckets instead of operate over a continuous range of values.

For more tips on SQL for data science, see the Advanced SQL for Data Science course at LinkedIn Learning. If you aren’t a subscriber, you can view this course and others using the 30-day free trial.

--

--

Dan Sullivan
SQL for Data Science

Dan Sullivan is a software architect for data science, statistics & machine learning. He is a LinkedIn Learning instructor for advanced SQL and machine learning