How to use OpenCorporates and Companies House APIs


A few days after starting my new job, one of my colleagues asked me: “do you know how to use an API?” I almost hugged him.

Before that, I had spent a few hours explaining that I could not only make maps but also use programming languages such as R or Python to gather and analyse data. But whenever I mentioned the word “coding”, I got puzzled faces in return, as if I were suddenly speaking Spanish instead.

So that colleague, whom I had just met, by the way, gave me the perfect excuse to show what “coding” is and why it matters for journalism.

The aim of the project was to count the number of bitcoin-related companies dissolved during the last year. Companies House offers easily downloadable bulk data (here), but it only includes “live companies.”

The API was therefore a better way to build our database, but it is still in beta and has some limitations.

Another open database of company information is OpenCorporates, which offers an interesting API too. However, it has a restriction of its own, which forced me to use both APIs and to control for duplicated results later to get the database of dissolved crypto-companies we were looking for.

COMPANIES HOUSE API

import requests
from lxml import html
import pandas as pd

First, get the API key here, and then store the URL pieces.

apiKey = "YOUR API KEY"
url1 = "https://api.companieshouse.gov.uk/search/companies?q="
words = ("word1", "word2"…)
url2 = "&items_per_page=100&start_index="
numbers = (0,100,200,300,400,500,600,700,800,900)

The URL also works with just the first part, “https://api.companieshouse.gov.uk/search/companies?q=” plus the word you want to search for (btc, for instance, the abbreviation of bitcoin). However, the Companies House API does not allow looping through pages directly, so the only way to get results beyond the first page was to add two extra pieces to the URL.

The first one is &items_per_page=100; 100 is the maximum allowed per page. But by printing total_results I learned that there were more than 100 results. So the second piece is &start_index=. I didn’t add any number after this equals sign; instead, I created another tuple to loop through.
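Put together, the pagination URLs look like this (a minimal sketch; btc is just an example search term):

```python
# Building the paginated search URLs for one example term ("btc").
# Companies House caps items_per_page at 100, so start_index
# advances in steps of 100 to reach later result pages.
base = "https://api.companieshouse.gov.uk/search/companies?q="
suffix = "&items_per_page=100&start_index="

term = "btc"
urls = [base + term + suffix + str(n) for n in range(0, 1000, 100)]
```

The first URL starts at index 0, the next at index 100, and so on up to 900.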

#This is the empty list where I will save the results
data_CH_ok = []
index = 0
for w in words:
    for n in numbers:
        url_full = url1+w+url2+str(n)
        re = requests.get(url_full, auth=(apiKey, '')).json()
        for i in range(0,100):

Some of the fields I was interested in were empty for some companies. I found the try/except combination quite useful to handle exceptions and errors, guiding the code on to the next element and preventing it from stopping mid-run.
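In miniature, the pattern looks like this (toy records, not the live API):

```python
# Toy version of the try/except pattern: a missing field raises
# KeyError (or IndexError), and `continue` skips to the next company.
items = [
    {"title": "BTC Ltd", "company_number": "01234567"},
    {"title": "Coin Co"},  # no company_number: this one is skipped
]

rows = []
for item in items:
    try:
        rows.append((item["title"], item["company_number"]))
    except (IndexError, KeyError):
        continue  # a missing field should not stop the whole run
```

The real loop below applies the same idea to the API response.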

            try:
                index = index+1
                word = w
                name = re['items'][i]['title']
                number = re['items'][i]['company_number']
                company_type = re['items'][i]['company_type']
                linkCH = re['items'][i]['links']['self']
                status = re['items'][i]['company_status']
                incorporation_date = re['items'][i]['date_of_creation']
                dissolution_date = re['items'][i]['description']
                address = re['items'][i]['address_snippet']
                matches = re['items'][i]['matches']['snippet']
                if len(matches) == 0:
                    previous_name = "Empty"
                    ceased_on = "Empty"
                    effective_from = "Empty"
                else:
                    # fetch the company's previous names once instead of three times
                    previous = requests.get("https://api.companieshouse.gov.uk/company/"+number, auth=(apiKey, '')).json()["previous_company_names"][0]
                    previous_name = previous['name']
                    ceased_on = previous['ceased_on']
                    effective_from = previous['effective_from']
            except (IndexError, KeyError):
                continue
            else:
                pass

I also used an if/else statement to test which companies had a different previous name. If the length was equal to 0, the company had no other name, and I wrote the word “Empty” in those cells. If it was not equal to 0, the previous name of the firm was stored instead.
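The same test, stripped down to toy records (the company names here are invented):

```python
# Toy version of the if/else test: an empty "snippet" means the
# company has never traded under another name.
companies = [
    {"name": "BTC Ltd", "snippet": ""},
    {"name": "Coin Co", "snippet": "COIN COMPANY LIMITED"},
]

previous_names = []
for c in companies:
    if len(c["snippet"]) == 0:
        previous_names.append("Empty")   # no previous name on record
    else:
        previous_names.append(c["snippet"])
```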

Finally, append (data_CH_ok.append) “pastes” all the results into the list where I stored them.

            data_CH_ok.append((index,word,name,number,company_type,linkCH,status,incorporation_date,dissolution_date,address,matches,previous_name,ceased_on,effective_from))

Once I had the results “pasted”, I transformed them into a data frame (the normal table of rows and columns we all have in mind) and saved it as a CSV file.

df_CH = pd.DataFrame(data_CH_ok,columns=["index","word","name","number","company_type","linkCH","status","incorporation_date","dissolution_date","address","matches","previous_name","ceased_on","effective_from"])
df_CH.to_csv('companiesCompanyHouse.csv', index=False)

The full code can be found here; see the documentation for further details.

Everything went perfectly until I realised that I couldn’t get more than 1,000 results per word, and some words returned more companies than that.

OPENCORPORATES API

OpenCorporates’ API has a wider limit. You can get 1,000 results per page and loop through 100 pages, which means 100,000 results. Although it is a worldwide database, it also offers the possibility of filtering the results by country.

As with Companies House, a key is needed (here). After getting one, I repeated the process, looping through terms and pages.

url = "https://api.opencorporates.com/v0.4/companies/search?q="
words = ("word 1", "word 2"…)
uk="&jurisdiction_code=gb" #This is the country code
per_page = "&per_page=100&page="
apikey = "&api_token=YOUR API KEY"
store = []
index = 0
for w in words:
    first_get = requests.get(url+w+uk+apikey).json()['results']['total_pages']
    pages = range(1, first_get+1)
    for p in pages:
        full_url = url+w+uk+per_page+str(p)+apikey
        response = requests.get(full_url).json()
        for i in range(0,100):

There are three loops in here.

The first one builds a URL to get the number of pages each word has. That information is stored in another object, and range makes the second loop go through every number between 1 and the last page, depending on the word.

Finally, the third loop accesses every element on each page. As I set per_page to 100 items, the first and last numbers of the range are fixed.
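Stripped of the API calls, the three nested loops have this shape (the words and page counts below are made up, and total_pages_for stands in for the first request):

```python
# Skeleton of the three loops: words -> pages -> items on each page.
# total_pages_for() plays the role of the first request that asks the
# API how many pages a word has; real code would call requests.get there.
def total_pages_for(word):
    return {"bitcoin": 3, "btc": 1}[word]  # hypothetical counts

visited = []
for w in ("bitcoin", "btc"):
    for p in range(1, total_pages_for(w) + 1):   # second loop: pages 1..last
        for i in range(0, 100):                  # third loop: items on a page
            visited.append((w, p, i))
```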

As before, I included the try/except and if/else expressions.

            try:
                index = index+1
                word = w
                # shorthand for the deeply nested company record
                company = response['results']['companies'][i]['company']
                name = company['name']
                number = company['company_number']
                company_type = company['company_type']
                linkCH = company['registry_url']
                status = company['current_status']
                incorporation_date = company['incorporation_date']
                dissolution_date = company['dissolution_date']
                address = company['registered_address_in_full']
                if len(company['industry_codes']) == 0:
                    industry1 = "NULL"
                    industry2 = "NULL"
                else:
                    industry1 = company['industry_codes'][0]['industry_code']['description']
                    try:
                        industry2 = company['industry_codes'][1]['industry_code']['description']
                    except IndexError:
                        industry2 = "NULL"
                if len(company['previous_names']) == 0:
                    previous_name = "NULL"
                    effective_from = "NULL"
                    ceased_on = "NULL"
                else:
                    previous_name = company['previous_names'][0]['company_name']
                    try:
                        effective_from = company['previous_names'][0]['start_date']
                    except KeyError:
                        effective_from = "NULL"
                    ceased_on = company['previous_names'][0]['end_date']
            except IndexError:
                continue
            store.append((index,word,name,number,company_type,linkCH,status,incorporation_date,dissolution_date,address,industry1,industry2,previous_name,effective_from,ceased_on))

After appending the results to the list, I repeated the same process as with Companies House: creating the data frame and saving it as a CSV file.

df = pd.DataFrame(store,columns=["index","word","name","number","company_type","linkCH","status","incorporation_date","dissolution_date","address","industry1","industry2","previous_name","effective_from", "ceased_on"])
df.to_csv('companiesOpenCorporates.csv', index=False)

The full code can be found here, and I recommend reading the documentation.

I admit OpenCorporates was more comfortable and intuitive, but it also has a limitation. When you search for companies with the word “bitcoin” in their name, it only returns those that match that exact word. Searching for “bitcoin,” the API does not include, for instance, companies that use “bitcoins” or “bitcoin-“ or “bit coin”…
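One way to work around the exact-match behaviour is to search every spelling variant explicitly (the variants below are illustrative, not an exhaustive list):

```python
# OpenCorporates matches the query term literally, so each spelling
# has to be searched on its own.
root = "bitcoin"
variants = [root, root + "s", root + "-", "bit coin"]

base = "https://api.opencorporates.com/v0.4/companies/search?q="
urls = [base + v.replace(" ", "+") for v in variants]
```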

On the other hand, the Companies House API does not have this problem. So, to solve it:

1. I created the list of words I was interested in.

2. I spent some time with the searchable databases on their websites to test the number of results per word.

3. I decided which API to use for each word.

4. I tried to narrow down some of the terms to be as specific as possible.

5. I included the names of some companies that I found while researching crypto-firms.

6. I made sure I got the ID of every company.
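The routing step can be captured in a simple lookup (the words and counts here are illustrative, not the real list):

```python
# Illustrative routing: words expected to return more than the
# Companies House cap of 1,000 results go to OpenCorporates instead.
expected_results = {"bitcoin": 2400, "btc": 650, "crypto": 5100}  # made-up counts

routing = {
    word: ("OpenCorporates" if n > 1000 else "Companies House")
    for word, n in expected_results.items()
}
```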

CLEANING AND ANALYSING IN R

Once I had my databases, I spent long hours cleaning them in R. Although I tried to be as specific as possible with the words, some of them returned results unrelated to bitcoin (see the process).

I also merged the datasets with do.call, which allows you to join several data frames rather than only two, and removed the duplicated results with duplicated, a convenient function.

# Merging the two datasets from CH and OC plus the 23 companies found in lists of bitcoin companies and in the CryptoUK institution
total <- do.call("rbind", list(CH_cleaned, OC_cleaned, companies23))
# Remove duplicates
total <- total[!duplicated(total["number"]), ]
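For anyone replicating this step in Python rather than R, the same merge-and-dedupe idea looks like this in plain Python, keeping the first record seen per company number (the sample rows are made up):

```python
# Merge the two result lists and keep only the first record seen for
# each company number — the analogue of R's duplicated() on "number".
ch_rows = [("01234567", "BTC Ltd"), ("07654321", "Coin Co")]   # sample data
oc_rows = [("01234567", "BTC LTD"), ("09999999", "Chain Plc")]

seen = set()
total = []
for number, name in ch_rows + oc_rows:
    if number not in seen:
        seen.add(number)
        total.append((number, name))
```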

A couple of other highly useful functions that I learned during this process were anydate (from the anytime package) and parse_date_time (from lubridate), which make working with dates much easier.

After all of this, I had my database ready for the analysis, which gave my colleague the news lead on which to base his story. The analysis also gave the outlet an original news piece that performed quite well on the digital platform.

As a result: emails congratulating us on the job, happy journalists and more talk about new projects involving “coding.” However, I still have to show what coding is, to make sure everyone understands that you can be stuck on a single problem for hours, and then find the key that completes the process in less than thirty minutes.