From company register to standardized open data, our processes explained — Part 1: Scouting for data

Published in

OpenCorporates

Apr 11, 2017

Written by Alex Skene

OpenCorporates’ mission is to be able to list every company in the world, using only public sources to provide full transparency and provenance. In order to achieve this, the OpenCorporates Data team works constantly to expand its coverage of jurisdictions where companies can be registered (120 and counting so far!), whilst maintaining a rigorous set of data quality standards.

To provide some openness about how we go about this, we’re going behind the scenes in a series of data-focused blog posts intended to help explain what happens when we introduce a new company jurisdiction to OpenCorporates.

At a broad level, the process goes through these steps:

Scouting — finding new sources of data & choosing the most appropriate one
Analysis — understanding the data in depth and mapping attributes to the OpenCorporates data model
Development — writing code to automate data collection and ongoing updates
Quality Assurance — testing the data to ensure it meets our quality standards
Pre-import readiness — final sign-off and configuration setup
Import — data is finally added to OpenCorporates, ready for use
BAU — Post-import and ongoing support activities, the “business as usual” process through which we ensure the smooth running of regular data updates

We’ll update this post with links to all parts of the series as they’re written.

This first post covers Scouting.

Why & when do we need to scout?

There can be a number of reasons that prompt us to start our investigations into a country that we’ve not already obtained company data for:

Demand from our open-data or commercial clients
Tip-offs about a new open dataset from our network of community partners, or increasingly from the register itself, when they publish as open data
Internal prioritisation of our roadmap

We also need to scout for sources when an existing source of data already in OpenCorporates becomes unavailable, for example when a registry website stops offering free access to basic company records, as happened in Spain and Gibraltar.

For all of these we generally carry out the same process outlined below.

Working with the local community

OpenCorporates could not do what it does without the help of a wide network of corporate transparency campaigners, NGOs, open data activists and even government itself, who provide local knowledge, translation and language skills, and friendly advice. Liaising with the wider community (like we did in the case of Israel) at the beginning of the process speeds up the overall process and leads to better quality data. Working alone risks incorrect assumptions being made about the different sources of data available, and having a local subject matter expert on tap reduces this risk considerably. Both sides benefit, as we often ask detailed domain-specific questions about the data that even the government hasn’t thought of before. And of course, everyone benefits from the increased transparency that occurs when we publish the data and make it available via the website and the API.

Finding New Sources

When looking to introduce a new jurisdiction, the first step is (fairly obviously) to work out where we should get our data from. OpenCorporates only obtains company data from freely available, publicly owned sources that authoritatively identify companies, and so we start with tracking down the official company registration authority (or authorities) for that jurisdiction.

Of course, OpenCorporates maintains its own list of these, the Open Company Data Index, which not only lists company registers, but also scores them for openness, and makes all the data available as open data too. There are also some third-party resources we occasionally use:

If this does not enable us to pin down the correct source, we then put our sleuthing hats on, and start searching with the help and collaboration of our community colleagues.

Source Suitability

So, what is the ‘correct’ source to use? This varies from jurisdiction to jurisdiction, and there may be multiple government bodies that make company data available. Here are some common examples:

Company registration bodies, e.g. a national register, or regional Chambers of Commerce
Government agencies acting as a data aggregator — for example national statistics bodies
Governmental Open Data agencies
Tax authorities, e.g. corporate tax databases
Business licensing bodies publishing data on companies licensed to trade in that country
Official government notices providing listings of new company registrations or amendments — for example gazettes or court judgements/listings

We’ll look into each data source and start to assess its suitability for use in OpenCorporates. The main questions we ask are:

How authoritative is the source?
Is it the main originator of the data, or is it combining & republishing other authorities’ data?
Does it contain complete listings of companies in that country, or just a sub-set? For example, only active companies, or only companies of a certain type, might be available.
Are there unique and persistent identifiers for each company?
How rich is the available data, e.g. what attributes are available?
How easy is the data to obtain? Are there any technical constraints?
Are there any restrictive legal terms & conditions regarding re-use of the data?
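One of the questions above, whether each company has a unique identifier, can be partly checked on a sample of the data. The following is only a minimal sketch, not our actual tooling: the function name, the column names and the sample rows are all hypothetical. Persistence cannot be judged from a single snapshot, so in practice a second sample taken later would also need to be compared.

```python
import csv
import io
from collections import Counter


def identifier_report(rows, id_field):
    """Summarise how usable id_field is as a unique company identifier."""
    # Treat absent or blank values as missing identifiers.
    ids = [(row.get(id_field) or "").strip() for row in rows]
    counts = Counter(i for i in ids if i)
    return {
        "total": len(ids),
        "missing": sum(1 for i in ids if not i),
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }


# Hypothetical sample; column names and values are made up for illustration.
SAMPLE = """company_number,name
001,Alpha Ltd
002,Beta Ltd
002,Beta Limited
,Gamma Ltd
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = identifier_report(rows, "company_number")
print(report)
```

A source whose sample shows missing or duplicated identifiers would need further investigation before it could be considered.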

We compare the sources and pick the best one, using various criteria. First, having good, unique, persistent identifiers is a prerequisite. Second, the sources are weighted, with higher weights given to the source that is closest to the company registration process, has the richest data, and has the most permissive T&Cs. We also prioritise the use of open, bulk data (e.g. in CSV or XML format) or APIs over other approaches to obtaining data such as web scraping.
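To make the comparison concrete, here is a rough sketch of how such a selection could be expressed. It is an illustration only: the 0 to 3 scales, the weights, the field names and the candidate sources are all assumptions invented for this example, not the scoring we actually use.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    has_persistent_ids: bool  # prerequisite: unique, persistent identifiers
    closeness: int            # 0-3, 3 = the registration body itself
    richness: int             # 0-3, how many useful attributes are available
    permissiveness: int       # 0-3, 3 = most permissive T&Cs
    access: int               # 0-3, 3 = open bulk data or API, low = scraping


# Hypothetical weights, chosen only to illustrate the idea.
WEIGHTS = {"closeness": 3, "richness": 2, "permissiveness": 2, "access": 1}


def score(source):
    """Weighted sum of the assessment criteria."""
    return sum(weight * getattr(source, field) for field, weight in WEIGHTS.items())


def pick_best(sources):
    """Drop sources without persistent identifiers, then take the top score."""
    eligible = [s for s in sources if s.has_persistent_ids]
    return max(eligible, key=score, default=None)


# Made-up candidates for a fictional jurisdiction.
candidates = [
    Source("company register", True, closeness=3, richness=2, permissiveness=1, access=1),
    Source("tax authority", True, closeness=2, richness=2, permissiveness=3, access=3),
    Source("gazette", False, closeness=2, richness=1, permissiveness=3, access=1),
]

best = pick_best(candidates)
print(best.name, score(best))
```

In this made-up case the gazette is excluded for lacking persistent identifiers, and the tax authority outscores the register because its terms and access method are more open.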

What happens when we can’t find a good source?

Sometimes we are not able to find a good source of data: perhaps the company register is behind a paywall, or is simply not available as an online register. In this case we’ll put that jurisdiction on hold, and work with our community networks to support their efforts in working with politicians and government bodies to open up register data, by providing evidence on the benefits gained through increased transparency, thought leadership, or support for publicity or pressure campaigns.

Case Study: Texas, USA
The official company register in Texas is maintained by the Secretary of State. Laws in the State allow it to charge a $1 fee for each search (http://www.statutes.legis.state.tx.us/Docs/GV/htm/GV.405.htm Sec. 405.018), making company data only accessible to those able to afford it.
In contrast, all corporations registered in Texas are required to pay Franchise Tax, and the Texas Tax Code designates details of companies and most corporate officers and directors as public information. This allows the Texas Comptroller of Public Accounts (who manages Franchise Tax collection) to make company information freely available, either via their search pages of taxable entities [https://mycpa.cpa.state.tx.us/coa/] or as a series of open data files of taxpayers [https://comptroller.texas.gov/transparency/open-data/search-datasets/].
Based on this, we analysed the data available from the Texas Comptroller to validate how complete it would be in terms of the number of companies and the availability of other data (including company numbers issued by the Secretary of State), and to confirm that it carried a permissive re-use licence. This made it an easy choice to use it as our main data source for Texas.

Next steps

Once we’ve established the preferred source, we’ll update the Open Company Data Index with any changes that are necessary. We’ll then analyse the available data in more detail and start to work with our developers to sample the data and map it to the OpenCorporates standard Company data model, a topic that will be covered in more detail in the next blog post. Stay tuned!

Originally published at blog.opencorporates.com on April 11, 2017.