Why I went NUTS — and you should, too!

Tobi Hazur
Published in Mobimeo Technology
6 min read · Apr 5, 2022

Well, at least if you are working as a data analyst in Europe or for the European market, you should consider it. I am, of course, speaking neither of my mental health nor of salty snacks, but of the Nomenclature of Territorial Units for Statistics, or NUTS for short.

This standard, established in EU law in 2003, has already proved really valuable in several of my projects. It is a simple setup, best explained by the GIF below.

© eurostat

There are four levels of regions, 0 through 3, which divide all European Union member states into meaningful geo-political subdivisions. At the top of the hierarchy, at level 0, are the member states themselves. The finest granularity is level 3, which in Germany, for example, corresponds to districts. There is an even finer level called LAU (Local Administrative Units), but I found it harder to get relevant data for it.
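The hierarchy is also visible in the region identifiers themselves: a NUTS code starts with a two-letter country code and appends one character per level. Here is a minimal Python sketch of that convention. The function names are my own, and the code checks only the length of an identifier, not whether the region actually exists.

```python
def nuts_level(nuts_id: str) -> int:
    """Return the NUTS level (0-3) implied by the length of the code."""
    level = len(nuts_id) - 2  # two-letter country code, then one character per level
    if not 0 <= level <= 3:
        raise ValueError(f"not a NUTS 0-3 code: {nuts_id!r}")
    return level


def nuts_ancestors(nuts_id: str) -> list[str]:
    """Return the codes of all parent regions, from level 0 down to the direct parent."""
    return [nuts_id[:end] for end in range(2, len(nuts_id))]


print(nuts_level("DE212"))      # 3
print(nuts_ancestors("DE212"))  # ['DE', 'DE2', 'DE21']
```

This makes it cheap to roll level 3 results up to coarser levels: truncating the identifier gives you the parent region.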

You can download the shapefiles of these regions along with socio-economic information and start making use of them in a range of ways (I have prepared a dataset for your convenience at the end of the article).

Meaningful grouping of points

Usually, when you track locations, your events carry latitude and longitude values, which gives you nothing but geographical points on a map. These rarely overlap, so grouping by them yields meaningless information: every user is in a different place. You could, for instance, group them into equally sized squares or hexagons on the map, but this is not how the real world works. Local laws affect certain states, the cultural offerings of one city differ from those of another, and so on. What really matters for your business is how certain geo-political regions are performing. Using the region shapes in GeoJSON or well-known text (WKT) format and some simple SQL, you can attribute each point to the region it is located in and then run further calculations on these aggregates.

-- NUTS level 3 regions with their geometry parsed from well-known text
WITH nuts_data AS (
    SELECT
        nuts_id,
        nuts_name,
        pop_total,
        ST_GeometryFromText(geo_wkt) AS geo
    FROM your_dataset.dim_nuts_regions
    WHERE nuts_level = 3
)
SELECT
    b.nuts_id,
    b.nuts_name,
    COUNT(DISTINCT a.user_id) AS users,
    AVG(b.pop_total) AS population,
    COUNT(DISTINCT a.user_id) / AVG(b.pop_total) AS users_per_pop
FROM your_dataset.your_table_loc a
-- attribute each event to the region that contains its point
LEFT JOIN nuts_data b
    ON ST_Within(ST_Point(a.lon, a.lat), b.geo)
GROUP BY 1, 2
ORDER BY 5 DESC;

A simple query that counts users by region and ranks the regions by users per population.

Please note that the example above uses the geo functions of Presto-based Athena queries. The syntax is slightly different in other SQL dialects, e.g. those of BigQuery or Redshift.
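If you want to see what such a spatial join does under the hood, here is a rough pure-Python sketch of the same idea: a point-in-polygon test (ray casting) followed by a distinct user count per region. The region ids, the square shapes and the events are all made up for illustration, and the code treats lon/lat as flat coordinates and ignores holes and multipolygons, which the database functions handle for you.

```python
# Hypothetical stand-ins for NUTS regions: two adjacent squares given as
# (lon, lat) vertex lists. Real regions are (multi)polygons from the shapefiles.
REGIONS = {
    "XX001": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "XX002": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how often a ray going east from the point crosses an edge."""
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[i - 1]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            # longitude at which the edge crosses that latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def region_of(lon, lat):
    """Return the id of the first region containing the point, or None."""
    for nuts_id, polygon in REGIONS.items():
        if point_in_polygon(lon, lat, polygon):
            return nuts_id
    return None


# Hypothetical events: (user_id, lon, lat)
events = [
    ("u1", 0.2, 0.5),
    ("u2", 0.7, 0.1),
    ("u1", 0.3, 0.4),
    ("u3", 1.5, 0.5),
    ("u4", 5.0, 5.0),  # outside every region
]

users_by_region = {}
for user_id, lon, lat in events:
    users_by_region.setdefault(region_of(lon, lat), set()).add(user_id)

distinct_users = {region: len(users) for region, users in users_by_region.items()}
print(distinct_users)  # {'XX001': 2, 'XX002': 1, None: 1}
```

The `None` key plays the same role as the NULL group a LEFT JOIN produces for events that fall outside every region.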

Another cherry on top is that you do not have to worry about mobile clients sending events with location information in the device language (think Munich vs München vs Münih). With NUTS, you use the granular lat/lon data and map it onto pre-defined regions, which are already neatly named. Thus, everything is tidied up automatically.

Benchmark your results

Does this sound familiar? Your German boss wants to know where your business has the largest user base. You do your magic and it turns out to be Berlin, followed by Hamburg, followed by Munich. Wait a second: those are the most populous cities in the country. Not exactly a surprising result. Now that you have NUTS at your disposal, it is easy to relate your results to the general population of each area. This way, you can find out where usage is above or below average, which is probably a lot more insightful.

As a byproduct, you now have a general baseline that shows where your business stands and what the potential for growth might be.
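One way to express this benchmark is an index that compares each region's users per inhabitant with the overall rate. The sketch below uses invented region names and numbers; the index definition (100 = average) is just one reasonable choice, not the only one.

```python
# Hypothetical figures: distinct users and inhabitants per region.
regions = {
    "Region A": {"users": 12_000, "population": 3_600_000},
    "Region B": {"users": 6_000, "population": 1_800_000},
    "Region C": {"users": 3_000, "population": 300_000},
}

total_users = sum(r["users"] for r in regions.values())
total_pop = sum(r["population"] for r in regions.values())
overall_rate = total_users / total_pop  # users per inhabitant across all regions


def penetration_index(users, population):
    """Regional users per inhabitant relative to the overall rate (100 = average)."""
    return round(100 * (users / population) / overall_rate)


ranking = sorted(
    ((name, penetration_index(r["users"], r["population"])) for name, r in regions.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranking)  # [('Region C', 271), ('Region A', 90), ('Region B', 90)]
```

In this made-up example, Region A has the most users in absolute terms but sits slightly below average, while the much smaller Region C is where usage is actually strongest.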

The data analyst’s natural habitat

Metadata, metadata everywhere

Population size is only one of many data points available per NUTS region. The EU offers CSV downloads of many more on its Eurostat website. Here are some examples that I found really useful:

  • area of the region
  • demographic information, e.g. population, gender split, median age and others
  • structural information, e.g. how remote, rural, urban or metropolitan a region is
  • economic information, e.g. the share of the population that is currently employed
  • political information, e.g. if it is a border region
  • geographic information, e.g. if it is a coastal region, a mountainous region, an island

The downloads can be tedious, but preparing your personalized dataset is a one-time task, and you can then keep reusing and extending it.
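Because every download is keyed by the NUTS identifier, extending the dataset boils down to a join on that key. Here is a small sketch using only Python's standard library. The two CSV excerpts, their column names and their values are hypothetical; the real Eurostat files need some cleaning before they look like this.

```python
import csv
import io

# Hypothetical excerpts of two downloads, both keyed by the NUTS id.
population_csv = """nuts_id,pop_total
XX001,250000
XX002,90000
"""
area_csv = """nuts_id,area_square_km
XX001,310.5
XX003,1200.0
"""


def read_keyed(text, key="nuts_id"):
    """Parse CSV text into {nuts_id: row_dict}."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def merge_on_nuts_id(*tables):
    """Outer-join several keyed tables; attributes missing for a region stay absent."""
    merged = {}
    for table in tables:
        for nuts_id, row in table.items():
            merged.setdefault(nuts_id, {}).update(row)
    return merged


dataset = merge_on_nuts_id(read_keyed(population_csv), read_keyed(area_csv))
for nuts_id in sorted(dataset):
    print(dataset[nuts_id])
```

An outer join is a deliberate choice here: not every indicator is published for every region, and dropping regions with gaps would silently shrink the dataset.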

Furthermore, this standard has already been adopted outside the EU’s own statistics department. A lot of Covid-19 related data is available for different NUTS levels. There is an extensive GESIS study on the values of Europeans towards family, work, religion, politics and society. And the term metadata takes on a new meaning when I tell you that Facebook publishes NUTS 3 level data on the social connectedness of regions.

All of these can be used to enrich your analyses and embed them in a broader context.

No stops ahead

In my view, the real power lies in the fact that these metadata points are ever-evolving. The NUTS standard provides the link between potential information on any topic. This can inspire new approaches to your analyses and leverage the academic work of previously unconnected authors. It is now up to us to make sure that the whole is truly greater than the sum of its parts. Feel free to let me know in the comments section which useful NUTS-based datasets you have used before.

Aside from NUTS, there are other great collaborative resources for location-based analyses, like OSM (OpenStreetMap). If you use Google Cloud products, you can even connect directly to its public dataset. What other tools or datasets do you personally use the most? Again, I’d be interested to hear from you in the comments!

* I hereby provide my own dataset as a CSV file for you to use. Please note that I cannot guarantee its completeness or correctness, but it might save you some collection time. It holds the following information for all NUTS regions:

  • country_code: which country is the region in
  • nuts_level: which NUTS level the region belongs to, between 0 and 3
  • nuts_id: unique identifier for the region
  • nuts_name: name of the region in native language
  • nuts_name_latin: the name from above in Latin characters only
  • geo_wkt: the well-known-text formatted version of the (multi)polygon that defines the region
  • geo_json: the JSON formatted string version of the (multi)polygon that defines the region

These are the metadata I added to the mix. Please note that not all information is available for all regions and that the values date back to slightly different points in time; the year is noted in brackets. All have been downloaded from the Eurostat website:

  • area_square_km: what is the area of the region (2015)
  • is_border_region: only for level 3, if it is close to a country border (2021)
  • coastal_code / coastal_label: if the region is directly at or near the coast (2021)
  • is_island: if the region is an island (2021)
  • is_metropolitan / metro_code / metro_label: if the region is a metropolitan area and if so what the name is (2021)
  • mountain_code / mountain_label: if the region is mountainous (2021)
  • remoteness_code / remoteness_label: how far it is to the nearest city (2021)
  • urban_rural_code / urban_rural_label: how urban or rural a region is (2021)
  • median_age: of total population and per gender (2020)
  • population: number of inhabitants, as total, per gender or per age group with buckets of five years (2020)
  • pop_per_square_km: how many inhabitants per square km (2019)
  • employed_pop_in_thousands: how many inhabitants are currently employed (2019)
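To get started with a file in this layout, something like the following Python sketch could work. It assumes that geo_json holds a GeoJSON geometry object and, for brevity, only handles plain polygons (a MultiPolygon has one more level of nesting). The sample rows are made up and stand in for the real CSV file.

```python
import csv
import io
import json

FIELDS = ["country_code", "nuts_level", "nuts_id", "nuts_name", "geo_json"]


def make_sample_csv():
    """Build a tiny made-up CSV using a subset of the columns described above."""
    rows = [
        {"country_code": "XX", "nuts_level": 0, "nuts_id": "XX", "nuts_name": "Exampleland",
         "geo_json": json.dumps({"type": "Polygon",
                                 "coordinates": [[[0, 0], [2, 0], [2, 1], [0, 1], [0, 0]]]})},
        {"country_code": "XX", "nuts_level": 3, "nuts_id": "XX001", "nuts_name": "Example District",
         "geo_json": json.dumps({"type": "Polygon",
                                 "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]})},
    ]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def load_regions(csv_text, level):
    """Return the rows of one NUTS level with the geometry parsed into a dict."""
    # Real geometries can be long; csv.field_size_limit() may need raising for them.
    regions = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["nuts_level"]) == level:
            row["geometry"] = json.loads(row["geo_json"])
            regions.append(row)
    return regions


def bounding_box(geometry):
    """(min_lon, min_lat, max_lon, max_lat) of a GeoJSON Polygon's outer ring."""
    ring = geometry["coordinates"][0]
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return min(lons), min(lats), max(lons), max(lats)


for region in load_regions(make_sample_csv(), level=3):
    print(region["nuts_id"], region["nuts_name"], bounding_box(region["geometry"]))
```

A bounding box is a handy cheap pre-filter: only points that fall inside it need the more expensive exact polygon test.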
