Why corporate structures need to be open data

By Chris Taggart, Hera Hussain

Thomas8047, Closed (Flickr)

While the corporate structures of many large companies are in theory publicly available, in reality they are stuck in the world of 50 years ago, consisting of paper documents dispersed across multiple locations. It’s as if a multinational company were trying to manage its accounts before computers, extracting and then collating data from paper reports held in offices around the world.

Today we live in a data-driven world, and this information must be available as data to be useful — without this, we all suffer:

  • Companies struggle to get good information on their customers, suppliers and competitors
  • Investors and analysts struggle to get good data to make investment decisions
  • The resultant opacity breeds suspicion and mistrust — NGOs and public-interest organisations are forced to manually collect filings, convert them into data, and collate them together. This is a considerable amount of work, and they naturally look for stories to justify that work
  • The opacity also disadvantages ethical, ‘clean’ companies compared with their ‘dirty’ competitors as comparisons are impossible. ‘Dirty’ companies benefit from the opacity, hiding their bad behaviour, and ‘clean’ companies are assumed to be dirty, thus failing to get public benefit for their higher standards

OpenCorporates, however, has created a methodology for working with NGOs and others to convert statutory filings into corporate structure data. While this works well for select large corporations, ultimately we believe a more collaborative approach with farsighted corporations would benefit both the corporations and society as a whole. We think this will ultimately become the default for any reputable corporation.

The steps and challenges of building corporate structures from statutory filings

Step 1. Finding the ultimate controlling company of a corporate group.

When we think of Morgan Stanley, we may think of it as a single company, but in reality it is a complex web of companies with diverse names, registered jurisdictions and business purposes. To begin investigating a corporate network, we need to find the parent company that controls all the subsidiaries. Usually this is the original company, but structural changes can make the parent harder to find, and doing so may require access to company filings that in most countries are not public, or not available at all.

A useful way to locate this is to find the incorporation date and the legal entity name of the ultimate controlling company through Wikipedia, the company’s official website and OpenCorporates.com.

Step 2. Finding corporate filings to find subsidiary relationships.

To construct a corporate network, details of subsidiaries must be collected from official filings or corporate reports. For some companies these are available through the company website, but mostly the filings will need to be found through authoritative sources such as corporate registers. In the UK, subsidiary information can be found in annual returns and annual reports. In the US, SEC filings may contain this information, though in both cases the disclosure of such information is discretionary. Finding the right filings can be a laborious task, and not all jurisdictions allow free access.

For instance, it is difficult to understand which document will have the subsidiary information we’re looking for:

Step 3: identify and extract the relevant information.

This is the most difficult stage of the mapping process. While data on subsidiary relationships is sometimes available in considerable detail in official disclosures, it is often in the form of a PDF or scanned images, which means it is not even text, let alone well-formed structured data. Anyone wishing to work with the data would have to type it out again and map it to a standard form, which is expensive in time and effort and prone to errors.

Typically, this requires you to find the relevant section on subsidiaries and then turn sometimes quite complex sentences into data, a genuinely non-trivial task. Take, for instance, this paragraph from the Monsoon Accessorize Ltd AR01. Even for people who are familiar with corporate networks, it is a difficult text to decipher:

In the example above, it’s not clear exactly which entities are being described — where are the entities referred to domiciled, for example; what are the exact legal names (are the jurisdictions in brackets part of the legal name, an indication of the domicile, or both); what is the relationship of Drilgreat Limited to Balmain Invest & Trade Inc? This is a good example of well-defined underlying facts that are made impenetrable by the form in which they are given (long, complex sentences).

Even if the information on subsidiaries is in the form of a table, extracting and interpreting it can be difficult.

For instance, here is a page of subsidiary undertakings from Unilever PLC’s recent filing. Even though it was clearly output from a dataset of some sort, it has been printed and scanned, which means you cannot ‘cut-and-paste’ data from it but need to retype the whole table or use OCR software, which tries to recognise text in an image and turn it into machine-readable text.

In this case open-source OCR software would be defeated by the borders around the tables, meaning that proprietary systems that specialise in tabular data would need to be used. This increases both the cost and the barriers, as in-depth knowledge of OCR software is needed to make such decisions. In addition, there are inevitably going to be errors in recognition with any OCR program, leading to bad data and potential mismatches (this is true of rekeying the data too).

Even if you have managed to convert this table (and it is hundreds of rows long) there are problems in turning this into structured data. This is an example of such data (for a different company):

Notice how the data is in clean explicit columns, with no subheadings, and no additional data in the legal name of the company. To get to this sort of table from the Unilever PLC filing is a considerable task.

Even assuming that you can get an OCR program to accurately turn the image into tabular data, there is much work still to do. First, you need to remove all the subheadings (“Company Name”, “Registered Office Address”, etc.) that are interspersed with the actual data.
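As a rough illustration of this first clean-up step, here is a minimal Python sketch. It assumes the OCR step has already produced a list of rows, each a list of cell strings; the function name and the sample rows are hypothetical, and only the two heading labels come from the filing described above.

```python
# Column headings that the scanned filing repeats between blocks of data.
HEADER_CELLS = {"company name", "registered office address"}


def drop_repeated_headers(rows):
    """Return only the data rows, skipping heading rows and blank rows."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        non_empty = [cell for cell in cells if cell]
        if not non_empty:
            continue  # blank line produced by the OCR step
        if all(cell.lower() in HEADER_CELLS for cell in non_empty):
            continue  # a repeated heading row, not data
        cleaned.append(cells)
    return cleaned


# Hypothetical OCR output: the same headings appear twice among the data.
ocr_rows = [
    ["Company Name", "Registered Office Address"],
    ["Example Holdings Limited", "United Kingdom"],
    ["", ""],
    ["Company Name ", "Registered Office Address"],
    ["Example Trading B.V.", "Netherlands"],
]
rows = drop_repeated_headers(ocr_rows)
```

In practice OCR rarely reproduces headings this cleanly, so a real pipeline would probably need fuzzy matching rather than the exact comparison used here.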

Then you need to separate the legal names from the status or annotations that have been merged with them in the Company Name column. For example, in the last of the company names for Turkey in the above image, we believe “mersin serbest bölge subesi” actually indicates that this is a “Mersin Free Zone Branch”. Likewise, in the image below, “Accantia Limited (in liquidation)” is a concatenation of the company name and its status. Extracting this information so that just the legal name is in the company name column takes additional work and time, even assuming that the distinction is clear, which it may not be when other languages are involved. For a computer program, the task is far more difficult still.
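To show why this is awkward to automate, here is a small Python sketch that splits a trailing bracketed status from a name. The list of status phrases is an assumption made for illustration, not something taken from any filing standard, and the function name is hypothetical.

```python
import re

# Status phrases treated as annotations rather than part of the legal name.
# This short list is an assumption; real filings would need many more,
# in several languages.
KNOWN_STATUSES = ("in liquidation", "dormant", "dissolved")

# Matches a name followed by one bracketed note at the very end of the cell.
TRAILING_BRACKET = re.compile(r"^(?P<name>.*\S)\s*\((?P<note>[^()]*)\)\s*$")


def split_name_and_status(raw):
    """Split 'Name (status)' into (name, status); leave anything else as is."""
    text = raw.strip()
    match = TRAILING_BRACKET.match(text)
    if match:
        note = match.group("note").strip().lower()
        if note in KNOWN_STATUSES:
            return match.group("name"), note
    return text, None


name, status = split_name_and_status("Accantia Limited (in liquidation)")
# A bracket that is not a known status is kept, because it may be part of
# the legal name (the second name below is invented).
kept, no_status = split_name_and_status("Example Holdings (Jersey)")
```

The sketch only works because someone has decided in advance which phrases are statuses. An annotation such as the Turkish branch description above would pass straight through as part of the name.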

Similarly, the jurisdiction is problematic. Even though the column is headed “Registered Office Address”, there is no address given, nor even a region or city. This means that we don’t have enough information to identify the jurisdiction of domicile, and therefore the specific legal entity. For example, the UAE has several emirates, many of which have multiple company registers. For the US there is a similar problem: companies are incorporated by individual states, and as the states are quite distinct jurisdictions there is no barrier to completely unrelated entities having the same name in different states. OpenCorporates is making the problem of identifying US companies less difficult (we currently have 48 million entities in 46 states), but this accidental obfuscation makes the information lower quality, less useful, and more likely to be misinterpreted.

Finally, even the basic issue of the parent entity is problematic. First, the identity of the parent entity isn’t explicitly stated in the table, but indicated in the ‘NV or PLC’ column. This appears to indicate that those with an ‘N’ have Unilever N.V. as the ultimate parent, and those with ‘P’ have Unilever PLC as the ultimate parent. However, we would expect a researcher to check this interpretation, and then of course they would need to explicitly add this fact (there’s also no explicit jurisdiction or identifier for those parent entities). Neither is there any information on whether the parent entities control the entities listed directly or indirectly, still less the path of control — which some companies do in their subsidiary listings.
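The step of making the parent explicit can be sketched as below. This is only an illustration in Python: the mapping of ‘N’ and ‘P’ to the two parent entities is the unverified interpretation described above, the subsidiary names are invented, and the record layout is an assumption rather than any published schema.

```python
# Interpretation of the 'NV or PLC' column described in the text.
# A researcher would need to confirm this before relying on it.
PARENT_BY_CODE = {
    "N": "Unilever N.V.",
    "P": "Unilever PLC",
}


def to_relationships(rows):
    """Turn (subsidiary_name, code) rows into explicit parent records."""
    relationships, unresolved = [], []
    for name, code in rows:
        parent = PARENT_BY_CODE.get(code.strip().upper())
        if parent is None:
            unresolved.append((name, code))  # needs manual review
            continue
        relationships.append({
            "parent": parent,
            "subsidiary": name,
            # The filing does not say whether control is direct or indirect.
            "control": "unspecified",
        })
    return relationships, unresolved


# Hypothetical rows; these subsidiary names are invented.
sample_rows = [
    ("Example Foods Limited", "P"),
    ("Example Nederland B.V.", "n"),
    ("Example Joint Venture", "X"),
]
relationships, unresolved = to_relationships(sample_rows)
```

Even after this step the records still lack a jurisdiction and an identifier for both parent and subsidiary, which is exactly the gap the filing leaves.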

When trying to identify legal entities, it is crucial to use identifiers such as company numbers, as company names alone are not reliable. In its AR01 (UK), Unilever discloses hundreds of subsidiaries over 11 pages, yet no company numbers are mentioned in this filing.

When you try running one of the names through OpenCorporates, you get many results and there is no clear way to know which company is the right one. This has implications for Unilever’s reputation. It may be thought that these details are knowingly withheld to cover up something unpleasant, and Unilever might be connected in error to a similarly named company, which carries reputational risk.
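The difference between matching on a name and matching on an identifier can be shown with a short Python sketch. Every record here is invented for illustration, and the normalisation rule is a deliberately crude assumption; the point is only that a name can return several candidates while a jurisdiction plus company number returns one.

```python
from collections import defaultdict

# Hypothetical register extract: names and numbers are invented.
records = [
    {"name": "Example Foods Limited", "jurisdiction": "gb", "company_number": "00000001"},
    {"name": "EXAMPLE FOODS LIMITED", "jurisdiction": "us_de", "company_number": "0000002"},
    {"name": "Example Foods Ltd", "jurisdiction": "ie", "company_number": "000003"},
]


def normalise(name):
    """Crude name normalisation: lower-case, drop full stops, expand 'ltd'."""
    words = name.lower().replace(".", "").split()
    return " ".join("limited" if word == "ltd" else word for word in words)


by_name = defaultdict(list)
by_identifier = {}
for record in records:
    by_name[normalise(record["name"])].append(record)
    key = (record["jurisdiction"], record["company_number"])
    by_identifier[key] = record

# A name-only lookup returns every similarly named company.
candidates = by_name[normalise("Example Foods Limited")]
# A jurisdiction plus company number points at exactly one entity.
exact = by_identifier[("gb", "00000001")]
```

With only the name from the filing, a researcher has to choose among the candidates by hand, and a wrong choice attaches the wrong company to the group.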

All of these tasks take time, and add barriers — for humans, particularly ones that have been trained to read such statements, this is a cumbersome and frustrating process; having computers do this automatically is incredibly difficult even assuming that all companies file in the same way (which they do not). Thus the information is publicly available, but not publicly usable.

Given that the information is typically stored as detailed well-structured data in a corporate secretary database program such as Blueprint, and that it is possible to export the data from such programs, the impression left is of corporations that do not wish others to have this data.

This is highly problematic, as it breeds mistrust, breaks the social compact between companies and wider society, and creates fertile ground for problematic behaviour and incentives. It also makes it difficult to distinguish ethical companies from less clean competitors (‘dirty’ companies benefit from the opacity, hiding their bad behaviour, and ‘clean’ companies are assumed to be dirty).

That’s why we believe it’s in the interest of far-sighted ethical companies to reset the dial on the transparency of corporate structures and make this information available as open data. In doing so they will show that their commitment to transparency is genuine, avoid becoming the target of NGOs that use the above methodology to reconstruct their structures in pursuit of their own agendas, and raise the bar for competitors that have something to hide.

What does open corporate structure data look like?

We believe open data about companies is fundamental to a better business environment and a fairer society. When organisations publish open data they are deliberately lowering the barriers for reuse, and saying that they wish to engage with the wider community in a positive, proactive and forward-thinking way.

OpenCorporates has long been recognised as the world leader in open data on companies, and has done extensive original research, including in the field of corporate structures, which it has put in the public domain, including the data models and schemas for corporate structures.


Visualisations of corporate structures are important for two reasons: first, they can provide insight into data that is hard to understand as a list or, even worse, as prose; second, they are important tools for identifying data quality problems, showing bad data or data gaps at a glance.

OpenCorporates has two visualisation tools that we use to display corporate structure data; it has also developed a methodology based on the processes above to research corporate structures of multinational entities and convert these into data — we do this both automatically (for example with the Goldman Sachs data, below) and manually (with the corporate structure of BP, for example, which was produced in cooperation with OpenOil). These are then fed into these visualisation tools, with sometimes startling results.

There are two kinds of visualisations:

Corporate structure effect: This is an interactive visualisation that shows company profiles in OpenCorporates when clicked. There are two views associated with it: tree view and network view.

Corporate chains over a map: This visualisation is useful for showing the international nature of business.
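To give a sense of why structured data makes a tree view straightforward, here is a minimal Python sketch that renders parent and subsidiary pairs as an indented tree. It is not the OpenCorporates visualisation code; the function names and the sample companies are hypothetical.

```python
from collections import defaultdict


def build_children(edges):
    """Index (parent, subsidiary) pairs by parent."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    return children


def render_tree(root, children, depth=0, seen=None):
    """Return the tree under root as a list of indented lines."""
    seen = set() if seen is None else seen
    lines = ["  " * depth + root]
    if root in seen:
        return lines  # already expanded; guards against circular ownership
    seen.add(root)
    for child in sorted(children.get(root, [])):
        lines.extend(render_tree(child, children, depth + 1, seen))
    return lines


# Hypothetical corporate group; all names are invented.
edges = [
    ("Parent Inc", "Sub A Ltd"),
    ("Parent Inc", "Sub B Ltd"),
    ("Sub A Ltd", "Sub C GmbH"),
]
tree_lines = render_tree("Parent Inc", build_children(edges))
print("\n".join(tree_lines))
```

A subsidiary with no recorded parent never appears under the root, so gaps in the data show up as missing branches, which is one way such a view helps to spot data quality problems.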

Visualisations showing the corporate chains over a map

Visualisation of the corporate network linked with company profiles

The Gap Inc. (Tree view): https://opencorporates.com/companies/us_de/2157877/network
Pearson network (Network view): https://opencorporates.com/companies/gb/00053723/network