How I used the Companies House Streaming API to find a newsworthy story

Ioanna Toufexi
Sep 14, 2020

I recently had the exciting opportunity to use a journalistic tool I developed to do original research and produce a data-driven story.

In this post, I’ll explain how I went from idea to completion and how anyone can use the tool to take advantage of the data stream provided by Companies House!

The story published on Belfast Live.

The idea

After a previous analysis of Companies House data, in which I found that the number of new food and drink business incorporations in the UK fell dramatically at the peak of the Covid-19 pandemic in April 2020, I wanted to see whether the trend had changed over the summer.

Since I was working on this story as part of my work experience at the Reach PLC data unit, I wanted to obtain granular data which would correspond to the various regional titles the company ran. If I found a newsworthy angle, I would produce a bulletin of articles based on a common, nationwide story, adapted with region-specific statistics.

Data sourcing

Companies House helpfully provide, among other data products, a monthly CSV with basic information on live UK companies. Each file is a snapshot taken on the last day of the previous month. There is also the main API, which returns up-to-date records but, according to its developers, is not designed for bulk data downloads.

As I worked on this story in the middle of August, I used the July snapshot to compare the number of companies incorporated in June and July 2020 with the number for the same period in 2019. For August 2020, I used the Streaming API to find incorporations from the 1st until the then-current date (the 26th), and used the CSV to get the figures for the corresponding period in 2019.

Considerations

There were a couple of caveats that could affect the accuracy of the results. Firstly, the CSV stores data for live companies only, which means it would miss any companies incorporated and then dissolved before the snapshot was taken. For this reason, I used an older CSV for the 2019 figures, closer to the dates in question, so that the number of companies dissolved before the snapshot would be negligible.

Another consideration was that Companies House stores the registered office address, which does not have to be where the company operates (nor does the trading address). In fact, company formation agents let companies choose an address somewhere else, for example at a prestigious London location. Having a look around, I confirmed that news articles used the registered office address anyway. The editor and I agreed that, with the possible exception of London, it was safe to calculate regional results based on the registered office address.

Streaming

I ran streaming/script_storing.R in my companies-stream R project to fetch data from the basic company information stream and save it in txt files. At the moment, the process is quite manual: I choose a timepoint in the past from which I want the stream to send me event updates, and set a timeout (otherwise the live stream stays open and keeps receiving events as they come). As the number of events is huge, the stream will by design return an error if the timepoint is more than a few days back. By the end, I had saved all the streaming events between 1 and 26 August 2020 in txt files.
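To illustrate the timepoint-plus-timeout idea, here is a minimal Python sketch (the project itself is written in R, so this is not the actual script). The stream URL, the `timepoint` query parameter and the API key sent as the basic-auth username are assumptions that should be checked against the Companies House streaming documentation. The function names and the output path are hypothetical.

```python
import base64
import time
import urllib.parse
import urllib.request

# Assumed endpoint for the basic company information stream.
STREAM_URL = "https://stream.companieshouse.gov.uk/companies"


def build_stream_url(timepoint=None):
    """Return the stream URL, optionally resuming from a past timepoint."""
    if timepoint is None:
        return STREAM_URL
    return STREAM_URL + "?" + urllib.parse.urlencode({"timepoint": timepoint})


def collect_until(lines, seconds, clock=time.monotonic):
    """Collect non-empty lines from an iterable until the time budget is spent.

    The deadline is only checked when a new line arrives, so a completely
    silent connection would not be cut off by this function alone.
    """
    deadline = clock() + seconds
    kept = []
    for raw in lines:
        if clock() >= deadline:
            break
        line = raw.strip()
        if line:  # skip blank lines
            kept.append(line)
    return kept


def store_events(api_key, timepoint, seconds, path):
    """Open the stream, read for roughly `seconds`, write one event per line."""
    # Assumption: the API key is the basic-auth username with an empty password.
    token = base64.b64encode((api_key + ":").encode()).decode()
    request = urllib.request.Request(
        build_stream_url(timepoint),
        headers={"Authorization": "Basic " + token},
    )
    with urllib.request.urlopen(request) as response:
        events = collect_until(response, seconds)
    with open(path, "wb") as out:
        out.write(b"\n".join(events) + b"\n")
    return len(events)
```

Calling something like `store_events(my_key, some_timepoint, 60, "events.txt")` would then write about a minute of events to a text file. As described above, a timepoint that is too far in the past should be expected to fail.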

Parsing and comparing

I then ran processing/company_data_product_script_generic.R. It parsed the CSV into a dataframe and kept all food and drink businesses registered between 1 and 26 August 2019. It also parsed the stored stream files and created a similar dataframe for 1 to 26 August 2020. To find which businesses were food and drink related, I looked for the relevant SIC codes, which were a subset of the Accommodation and food service activities section. As a company can have up to four SIC codes, I noted in the article that new registrations might be counted several times, once per activity.
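The filtering and once-per-activity counting described above can be sketched in a few lines of Python. This is an illustration rather than the R script itself. The record layout (`incorporated`, `sic`) is hypothetical, and the three SIC codes are examples from the Accommodation and food service activities section, not the full subset used for the story.

```python
from collections import Counter
from datetime import date

# Illustrative codes only; the actual subset used for the story is not listed here.
FOOD_DRINK_SIC = {"56101", "56103", "56210"}


def sic_code(text):
    """Extract the numeric code from a value such as '56101 - Licensed restaurants'."""
    return text.split(" - ")[0].strip()


def count_registrations(companies, start, end, wanted=FOOD_DRINK_SIC):
    """Count matching registrations per SIC code within [start, end].

    A company with several matching codes is counted once per code,
    which is why totals can exceed the number of distinct companies.
    """
    counts = Counter()
    for company in companies:
        if not (start <= company["incorporated"] <= end):
            continue
        for text in company["sic"]:
            code = sic_code(text)
            if code in wanted:
                counts[code] += 1
    return counts


# Hypothetical records standing in for parsed CSV rows or stream events.
sample = [
    {"incorporated": date(2020, 8, 3),
     "sic": ["56103 - Take-away food shops and mobile food stands",
             "56210 - Event catering activities"]},
    {"incorporated": date(2020, 7, 30),
     "sic": ["56101 - Licensed restaurants"]},
]
august_counts = count_registrations(sample, date(2020, 8, 1), date(2020, 8, 26))
```

Here the first company contributes two registrations, one per activity, while the second falls outside the date window and is ignored.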

After that, the script calculated the percentage change nationally and per Local Authority District (LAD), using lookups from geoportal.statistics.gov.uk, and saved the results in a CSV. I then used VLOOKUP on the results to map the LADs to regional newsroom areas.
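For reference, the percentage change here is (new − old) / old × 100. A minimal Python sketch of the per-area calculation follows. The function names and the area names in the example are hypothetical, and treating areas with no registrations in the earlier period as "no value" is a choice made for this sketch, not necessarily what the script does.

```python
def percent_change(previous, current):
    """Percentage change from previous to current; None if previous is zero."""
    if previous == 0:
        return None  # change from zero is undefined
    return (current - previous) / previous * 100


def change_by_area(counts_before, counts_after):
    """Compute the percentage change for every area seen in either period."""
    areas = sorted(set(counts_before) | set(counts_after))
    return {
        area: percent_change(counts_before.get(area, 0), counts_after.get(area, 0))
        for area in areas
    }


# Hypothetical counts per area for the two periods.
changes = change_by_area({"Area A": 40, "Area B": 4}, {"Area A": 50, "Area C": 2})
```

Small areas can show very large swings from a handful of registrations, so the raw counts are worth keeping alongside the percentages.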

Making a story out of the findings

The trend was clear: new food and drink company registrations were on the rise in summer 2020 after the slump in April, and some business activities, such as takeaways, food stalls and event catering, saw a big increase compared to the previous year.

The results were in line with wider statistics from the ONS. However, I needed to speak to industry experts to 1) check whether their experience on the ground confirmed the findings and 2) find out the reasons for the rise. For example, had the Eat Out to Help Out scheme played a role?

I eventually got a good quote from one of the directors of the Nationwide Caterers Association, who confirmed the trend, but suggested it was probably due to people losing their jobs and giving startups a shot, rather than a sign of the recovery of the sector.

The End!
