UK Bank Holidays with Pandas on Jupyter

How I learned to manipulate JSON data with Pandas on a Jupyter Notebook and deconstruct it to a DataFrame ready for queries.

Simone Spaccarotella
BBC Product & Technology
7 min readSep 8, 2022

--

Image by author created from Jupiter photo by NASA and Pandas photo by Pascal Müller on Unsplash

A bit of context first

I started a self-study path to learn the theoretical fundamentals of Data Science and Machine Learning. I have also been playing with Python ever since, with coding exercises during the “Friday 10% time” here at the BBC and on my own time when I have some to spare.

I created this playground repository to refresh my Pythonic knowledge that was buried deep inside the funny head of mine. This repo is a mix of material and exercises to keep my Python coding skills sharp.

I also created a specific playground focused on Data Science and Machine learning. In this playground I practice with libraries such as NumPy, Pandas, Scikit-Learn, TensorFlow (et. all) with a Jupyter Notebook. It also contains notes and summaries of what I learn in forms of README or notebooks.

Nothing fancy or ground-breaking, just a couple of playground repos for me to practice and learn at my own pace.

The 10% time at the BBC

This happens every other Friday and it is about learning. Engineers use this time to learn new technologies and/or languages, discover new approaches and techniques.

We can organise the day as we want. It could be a self-paced learning session (reading, watching tutorials, etc.), a day of study for an apprenticeship or a university degree etc., an actual hands-on prototyping session to work on a long-running personal project, or a group collaboration with other like-minded people.

There is no limitation to how we plan our day, as long as we keep in mind the end goal. In simple terms, satisfy our own curiosity and creativity by strengthening our technical knowledge and acquire new skills to widen our horizon towards new engineering areas.

Unlike the hack day, we don’t have to present a demo to a panel. Nonetheless, we like to share with the other members of the wider team what we’ve learned during the day if we want to, discussing about different topics, Q&A and why not, showcasing a prototype if someone has built one. No prizes but praises.

The UK Bank Holidays data

I needed to find a dataset that I could use to practice with Pandas on a Jupyter Notebook.

You may be familiar with the Gov.uk page that shows the list of past and upcoming UK bank holidays (i.e. https://www.gov.uk/bank-holidays) and maybe less familiar with https://www.gov.uk/bank-holidays.json which is the same URL plus the “.json” part appended at the end.

The JSON URL returns the same information of the HTML page but in a machine-readable format (i.e. JSON response) and it’s that URL that I used to retrieve the information i needed to manipulate.

The data is released under MIT License and if you are curious about its implementation, you can find more details at the following link

This is my first Notebook by the way, which I wrote using plain Jupyter on my localhost. Once I mastered the craft, I then decided to publish it on Kaggle which is way better if you then want to practice on more interesting datasets and tasks: from simple data manipulation and visualisation to fully fledged machine learning competitions. You can find my published UK Bank Holidays Kaggle at the following link.

In this project I use Pandas to fetch, manipulate and query the JSON data. The main trick to work with JSON datasets and manipulate them as Pandas DataFrame is to flatten the portion of properties you need with the “json_normalize” Pandas function, no matter how deep the props are.

N.B. In this post I’m assuming that you are familiar with the tools and libraries I’m referring to. If not, it would be presumptuous of me to tell you where and how to find material on the Internet. But if I were asked, I’d recommend https://towardsdatascience.com, a publication of independent authors and volunteer editors of dedicated and curated Medium blog posts specialised in Data Science and Machine learning. I read it myself to support the material I am currently studying and it’s being really useful so far.

The Notebook

The JSON structure of the response returned by https://www.gov.uk/bank-holidays.json is the following

Pandas plays well with tabular data and it is able to read JSON too. The challenge with JSON though is hat it is hierarchical by nature, and Pandas can only go as far as 2 levels deep to cover the two dimensions (columns and rows). In order to get nested properties, we need to specify their path in order to flatten the data in tabular format.

Data collection

First of all, we need to import the Pandas library

and fetch the data from https://www.gov.uk/bank-holidays.json

Data exploration and transformation

Pandas DataFrame

The data frame has a row called events which contains data in JSON format. We need to deconstruct this data in tabular format, so that we are able to query it with Pandas.

With this approach we transform the bank_holidays DataFrame to a Python dictionary. The dictionary can then be manipulated with the json_normalize Pandas function, by selecting the nested property to flatten into a table.

The data frame is then returned with the appropriate data types set for each column.

The following examples refer to the bank holidays in England and Wales.

NOTE: This approach can be used with any JSON data no matter the complexity of its structure and how nested a property is.

Get all bank holidays within a specified time interval

The following examples return the 2022 holidays in England and Wales. Let’s declare the “start” and “end” dates.

Create a boolean mask

Query the data frame to select the subset rows

Bank holidays 2022

Another way to query the data frame is by using the query function. It gives you the same result as before but the syntax is powerful and more readable. Note that the @ syntax is used to reference the global variables previously defined.

If you are going to do a lot of selections by date, you may want to set the date column as the index for the data frame.

Now you can select the rows by date using pd.loc[start_date:end_date].

Bank holidays 2022 with the “date” column set as DataFrame index

Upcoming bank holidays in England and Wales

The following example returns the remaining holidays between “now” until the end of the year.

To do so, we need to import the date object from the datetime module.

Calculate the date range from today to the end of the year

Now you can query the data frame

Conclusion

This for me was a pretty good learning on how to use Pandas and Jupyter with a real example and how to manipulate and query data. Don’t get me wrong, I just scratched the surface of the iceberg by using just a few of the many functions that this library provides. It’s a powerful tool and it takes a bit more practice on many more projects to really get the hang of it. But the good news is that the documentation is comprehensive and well written plus there is plenty of examples and informations out there for you to find and support your work.

If you are already familiar with the tool and are looking for interesting datasets, I’d suggest to signup on Kaggle as I did, search for the right exercise that fit your level of expertise, and practice with it. If that is not enough, just remember that many big organisations, government agencies, public bodies etc. provide open data for the public to use: e.g https://data.nasa.gov and https://data.police.uk/data just to name two.

In conclusion, if you are in the UK and you want to check how long until the next holiday to have some time off, please just consult the Gov.uk website, it’s faster 😃. But if like me you want to learn something new while you do it, well you’ve found the right post to read.

--

--

Simone Spaccarotella
BBC Product & Technology

Software Engineering Team Lead for BBC Sounds and Data Scientist Apprentice for BBC iPlayer. Musician for myself.