Analysing Dutch municipality data

Exploring the aging population

Published in

rond blog

7 min readMar 31, 2023

This blog is about retrieving Dutch municipality data, correcting for splitting and merging municipalities, predicting future trends and visualising the result. The blog takes a step-by-step approach with code examples to allow the reader to easily follow along. While the blog takes the aging population as an example, the methodology can be applied on many other social-economic and demographic topics as well.

Introduction

In the Netherlands a rich collection of data is gathered by the CBS (“Statistics Netherlands”), the national statistical office. The data consist of societal, demographic and economic topics and is based upon the micro-data collected on inhabitants, institutions and companies. The CBS subsequently aggregates the data on national and several regional levels, and makes this publicly available through the CBS statline portal.

Interestingly, for data scientists, or everyone else who wants to reproducibly analyse this data: one can also search and retrieve datasets from the CBS statline portal programmatically. CBS has kindly provided packages in R and Python which makes working with the statline API child’s play.

This allows easy exploration of municipal or national data to investigate societal relevant topics. The data can be used to guide decisions and adjust governmental policies based on data instead of opinions.

Data source

Let’s say we are interested in investigating the amount of elderly people in the Dutch municipalities in order to properly tune the municipal services such as public transport or domestic help. This example is worked out in the R language. Searching the statline database for key figures on municipalities (“Kerncijfers wijken en buurten”) can be done as follows (the full R-code used for this blog can be found here):

require("cbsodataR")
out <- cbsodataR::cbs_search(query="Kerncijfers wijken en buurten")
View(out[,1:4])

This results in a data frame that consists of details on several tables related to the requested key-figures from different years, see table below:

We are particularly interested in the tables with the title Kerncijfers. In the code below we take an identifier of one of the tables and download it to take a look at what is in there.

# take one of the Kerncijfers datasets and look at the contents
df <- cbsodataR::cbs_get_data(id="84799NED") 
dim(df)
# [1] 17341   118
View(df)

The downloaded data frame is rather large: it consists of 118 columns listing data of all Dutch municipalities (Gemeente), districts (Wijk) and neighbourhoods (Buurt). One or more neighbourhoods build up a district, and one or more districts build up a municipality. In this particular year there are 13808 neighbourhoods, 3177 districts and 355 municipalities. A small excerpt of the data is shown below.

The data consist of many details we do not necessarily need for studying the aging population at municipality level. To increase speed and efficiency it is therefore best to keep only data at municipality level and select only the columns we are currently interested in. This can be done using the code below:

require(dplyr)

# find all relevant identifiers - the title starts with 'Kerncijfers'
ids <- out$Identifier[grepl("^Kerncijfers", out$Title)]

# create empty data-frame to be filled
df <- data.frame()

# loop over all id's 
for (id in ids) {
  df <- rbind(
    df,
    # download data, filter on 'Gemeente' and select relevant columns
    cbsodataR::cbs_get_data(
      id=id,
      select=c("WijkenEnBuurten", "AantalInwoners_5", "k_65JaarOfOuder_12"),
      WijkenEnBuurten = has_substring("GM")
    ) %>%
      cbs_add_label_columns() %>%
      # add a 'Year' column to keep track of the year the data is from
      mutate(
        Year = as.numeric(
          gsub("\\D", "", out[out["Identifier"]==id, "Title"])
        )
      )
  )
}

# take a look at the result
View(df)

This code results in a data frame of which part is shown in the table below. Besides a region code and the name of the municipality the table contains yearly numbers of inhabitants and elderly. This is our starting point for the rest of the analysis.

Merging and splitting municipalities

With the help of the script above we have downloaded seven years of data which consists of 2565 rows, and no missing values. Interestingly, the number of rows is not evenly divided over the years. In fact the number of municipalities has gradually decreased over the last decades. Plotting the number of municipalities in our dataset results in the figure below.

The decrease of the number of Dutch municipalities over the last years.

The steadily decrease is caused by the split and merge of small municipalities into larger neighbouring municipalities. An example is shown below for the years 2020–2021. The municipalities of Appingedam, Delfzijl and Loppersum join to form a new municipality called Eemsdelta. The municipality of Haaren is split into four parts which are added to different neighbouring municipalities.

Visualisation of the change in municipalities for the year 2020 (left) to 2021 (right). The width of the flows is proportional to the number of addresses related to the change. Picture is taken from https://github.com/VNG-Realisatie/grenswijzigen

As a consequence, the demographic data can show sudden jumps when plotting trends over the years. An example is shown below where the number of inhabitants of Vught suddenly increases by about 5000 due to the partial uptake of the disappeared Haaren municipality. For newly formed municipalities such as Eemsdelta there is no historic data available. Clearly such interruptions make it difficult to compare data over several years.

The trend of number of inhabitants per year for five municipalities. Note the jump in number of inhabitants for Vught and the absence of historic data for Eemsdelta. The three other municipalities show a gradual growth in population.

For the plot above the total population is used. When investigating the amount of elderly (older than 65 years) the data shows similar behaviour leading to the same challenges. This severely complicates proper analysis of the aging population.

Estimates for boundary changes

When municipalities split or merge it is known which pieces of land move from one municipality to the other, in other words, how the boundaries change. Note that it is also known how many houses and how many inhabitants these pieces of land contain. Using this knowledge we can estimate demographic variables in the past, before the split or merge.

This is exactly the purpose of the grenswijzigen R package [written by the author]. This package tracks the assignment of addresses to different municipalities trough time. Additionally it estimates demographic variables for each address based on regional averages. Thus it can compute demographic estimates as if the split or merge of the municipality never happened. You can find more details on the method in the documentation (in Dutch) of this package.

The package can be applied as follows:

require(devtools)

# install 'grenswijzigen' package from Github
# Note: in some cases the 'renv' package erroneously interacts with devtools.
#   In those cases please deactivate renv by `renv::deactivate()` before installing from github with devtools.
devtools::install_github("https://github.com/VNG-Realisatie/grenswijzigen")
library(grenswijzigen)

# Transform the historic data to 2022
df_transformed <- wrapper_vertaal_naar_peiljaar(
  
  # the package requires 'wijkcode' and 'jaar' as column names
  as.data.frame(
    df %>% rename(
      wijkcode = WijkenEnBuurten,
      gemeentenaam = WijkenEnBuurten_label,
      jaar = Year
    )
  ),
  peiljaar = 2022,
  model="model.2",
  regionaalniveau = "gemeente",
  kolommen_aantal = c("AantalInwoners_5", "k_65JaarOfOuder_12")
) 

# show that we have equal numbers of municipalities each year
print(table(df_transformed$jaar))

View(df_transformed)

The transformed dataset shows gradual changes over time for all selected municipalities. Such gradual changes are more realistic. Note that we can now see the decline of the population for Eemsdelta and the slight increase for Vught more clearly.

The trend of number of inhabitants per year for five municipalities. Note the more realistic gradual change in population size for all cases.

Aging population

Having prepared our data we can now look into the aging population. We do so by calculating the percentage of elderly inhabitants (older than 65 years) with respect to all inhabitants. Doing so allows to create the plot below.

The population ages gradually. There is quite some spread in the data showing that the percentage of elderly is quite different between municipalities.

Additionally it is interesting to see how the trends will develop towards the future. To do so properly, one would need to make estimates of migration and take factors like fertility rate and life expectancy into account. Let’s take a somewhat less ambitious goal for this blog and try to estimate the future from past observations of elderly inhabitants.

Note that we are dealing with time-series, so that we need to apply a time-series method. From the graph the demographic development seems to be more or less linearly so we could apply a state-space model with a deterministic trend. This can be done as follows:

# forecast
require(dlm)

create_forecast <- function(municipality) {
  # state space model with deterministic trend
  build.dt <- function(theta) {
    dlmModPoly(2, dV=exp(theta[1]), dW=c(0,0))
  }
  # create vector with number of elderly (time sorted)
  d <- df_transformed %>% 
    filter(gemeentenaam==municipality) %>% 
    arrange(jaar) %>% 
    pull(k_65JaarOfOuder_12)
  
  # fit and filter the state-space model
  fit.dt <- dlmMLE(d, parm=log(var(d, na.rm=T)+1), build.dt)
  
  mod.dt <- build.dt(fit.dt$par)
  
  filter.dt <- dlmFilter(d, mod.dt)
  
  # forecast 5 years ahead
  pred.dt <- dlmForecast(filter.dt, nAhead=5)
  
  sqrtR <- sapply(pred.dt$R, function(x) sqrt(x[1,1]))
  
  # add prediction to data frame
  df.out <- data.frame(
    out=pred.dt$a[,1], 
    pl=pred.dt$a[,1] + qnorm(0.05, sd = sqrtR), 
    pu=pred.dt$a[,1] + qnorm(0.95, sd = sqrtR),
    jaar=2023:2027,
    gemeentenaam=municipality
  )

  return(df.out)
}

This approach is different from the more regular linear fit in the sense that the confidence margins take the time dependency of this time series into account. The figure below visualizes the forecast for Vught.

The historic and expected number of elderly inhabitants in Vught. The shaded area indicates a 90% prediction interval.

Visualize the future

Finally, one can now do cool things such as plotting the expected number of elderly inhabitants on a map. Note that we keep the boundaries of the municipalities fixed. This avoids visually distracting changing shapes and helps to focus on the aging population. From the visual below, it can be seen that the percentage of elderly population is on a rising trend.

Geographic map animating the percentage of elderly for the years 2016–2027 for each municipality.

Closing words

One can do many more interesting analytics and data science with municipal data. The statline portal of the CBS allows for easy downloading of a rich set of data. Please do not hesitate to contact me in case of questions or interesting ideas for further exploration.

The full R-code used for this blog can be found here.