Using web content to better understand business activities.
In the UK, Standard Industrial Classification (SIC) codes are used to categorise businesses based on their activity. Policymakers and analysts use this official taxonomy to measure sectors, identify stakeholders to engage with, develop policies, and measure the impact of those policies. However, SIC codes have three important limitations:
- First, a high proportion of UK businesses currently classify themselves as ‘Other’. At the moment there is limited evidence about what kind of activities businesses in the ‘Other’ SIC codes are engaged in, which means policymakers have little understanding about the activities of a significant part of the UK economy;
- Second, some UK businesses are engaged in types of economic activity which do not sit well within SIC. Examples include businesses engaged in ‘low-carbon’ activities or in the ‘immersive economy’. Official industry codes (last updated in 2007) fail to capture these new sectors, resulting in a lack of evidence that hinders policy making;
- And third, as businesses become increasingly dynamic, innovative and technology-driven, they also perform cross-sector activities. In this context, SIC codes also fail to accurately capture the variety of business activities.
The UK government is aware of these limitations, and as a result of the Review of Economic Statistics¹, it has started investigating the economic activities undertaken by UK businesses and how these are reflected in the SIC codes.
At Glass, we believe that textual data from websites can provide deep insights into the economic activities of businesses. We’ve developed AI technology that reads and interprets the web, and in the UK our engine has mapped, for the first time, the entire economy based on its web presence: 1.4 million UK businesses across sectors and geographies.
With this new capability we decided to run a new experiment.
We investigated the activities of UK businesses classified as ‘Other’ and ‘None supplied’ within the official SIC codes taxonomy. To do this, we took a random sample of UK businesses and mapped their web data in Glass against their official information in Companies House (CH). Our process followed several steps within two main parts: core technology and mapping.
Reading the websites
To identify the UK businesses, our crawler was set to read websites that target a UK audience or have adopted the .uk domain address. Websites were considered if they were written in English, mentioned a UK address on their pages, and had some depth of representation for the business in question.
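As an illustration only, the eligibility rules described above could be expressed as a simple filter. This is a minimal sketch and not the Glass crawler: the `Site` fields, the `targets_uk` flag and the `MIN_PAGES` threshold are all assumptions introduced here, since the article does not say how "depth of representation" is measured.

```python
from dataclasses import dataclass


@dataclass
class Site:
    domain: str
    language: str            # detected language code, e.g. "en"
    has_uk_address: bool     # a UK address was found on the pages
    page_count: int          # assumed proxy for "depth of representation"
    targets_uk: bool = False  # hypothetical flag for "targets a UK audience"


MIN_PAGES = 3  # assumed threshold; the article does not give one


def is_candidate(site: Site) -> bool:
    """Apply the three stated criteria to a site that is in UK scope."""
    in_scope = site.targets_uk or site.domain.lower().endswith(".uk")
    return (
        in_scope
        and site.language == "en"
        and site.has_uk_address
        and site.page_count >= MIN_PAGES
    )
```

A site on a non-.uk domain would pass only if it is flagged as targeting a UK audience and also meets the language, address and depth criteria.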
Starting with over 200 million web pages, our engine identified approximately 1.4m UK businesses with a website. Each website was read and relevant text entities (e.g. business descriptions, addresses, people) were detected with state-of-the-art precision (> 95%). The different entities were identified using an AI model that considers multiple features such as location on the web page, use of specific keywords and phrases, sentence structure etc.
Based on the descriptions, key topics, links on the homepage and other attributes, the businesses were automatically classified into one or more economic sectors. The Glass sector taxonomy comprises 108 sectors and was trained using a sample of sector classifications from LinkedIn. Businesses with well-defined attributes were assigned a single sector, while those with diversified activities had multiple sector predictions. For the purpose of this research, we only considered the first (and most representative) sector.
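To make the last step concrete, here is a minimal sketch of reducing multiple sector predictions to a single one. It assumes each prediction is a (sector, score) pair and that "most representative" means the highest score; neither the data shape nor the scores come from the article.

```python
from typing import Optional


def primary_sector(predictions: list[tuple[str, float]]) -> Optional[str]:
    """Return the highest-scoring sector, or None if there are no predictions."""
    if not predictions:
        return None
    return max(predictions, key=lambda item: item[1])[0]


# Illustrative values only, not taken from the study.
example = [("Professional Training", 0.27), ("Financial Services", 0.61)]
print(primary_sector(example))  # prints "Financial Services"
```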
Companies House Mapping
After assigning the sectors, we randomly selected a sample of 400k businesses with address¹ information from the 1.4m UK businesses we had information on. We then used the CH data to get the names, addresses and SIC codes for the companies.
Pre-processing & matching
From the CH dataset, we selected only active and non-dormant companies. At this point, both the CH and the Glass business names were cleaned and normalised (e.g. punctuation, stop words, whitespace and company type abbreviations were removed). We matched the official data with the web data using a fuzzy match on name and an exact match on postcode. Since addresses were a key criterion in the mapping, we excluded businesses whose Registered Address differed from their Trading Address. For the name matching, we tried multiple similarity and dissimilarity metrics. The best-performing ones were the Jaccard index, cosine similarity and the number of overlapping words.
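The matching logic described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the code used in the experiment: the stop-word and company-suffix lists, the choice of Jaccard as the deciding metric and the 0.8 threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumed, deliberately short lists; the article does not publish the real ones.
STOP_WORDS = {"the", "and", "of"}
COMPANY_SUFFIXES = {"ltd", "limited", "plc", "llp", "llc", "inc", "co"}


def normalise(name: str) -> list[str]:
    """Lowercase, strip punctuation, then drop stop words and company suffixes."""
    text = re.sub(r"[^\w\s]", " ", name.lower())
    return [t for t in text.split() if t not in STOP_WORDS | COMPANY_SUFFIXES]


def normalise_postcode(postcode: str) -> str:
    """Remove whitespace and uppercase so postcodes compare exactly."""
    return re.sub(r"\s+", "", postcode).upper()


def jaccard(a: list[str], b: list[str]) -> float:
    """Size of the token intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a: list[str], b: list[str]) -> float:
    """Cosine similarity between token-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def overlap(a: list[str], b: list[str]) -> int:
    """Number of distinct words the two names share."""
    return len(set(a) & set(b))


def is_match(web_name, web_postcode, ch_name, ch_postcode, threshold=0.8):
    """Exact match on postcode, then fuzzy match on the normalised name."""
    if normalise_postcode(web_postcode) != normalise_postcode(ch_postcode):
        return False
    return jaccard(normalise(web_name), normalise(ch_name)) >= threshold
```

For example, `is_match("Smith & Sons Ltd.", "sw1a 1aa", "SMITH AND SONS LIMITED", "SW1A 1AA")` returns `True`, because both names normalise to the same two tokens and the postcodes are identical once whitespace and case are ignored. In practice the three metrics could be combined rather than relying on Jaccard alone.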
Glass to SIC results
This exercise resulted in 100k organisations² that were successfully matched. The top matched SIC codes had accurate equivalents in the Glass sector classification (Table 1).
Analysis of ‘Other’ SIC codes
Approximately 6% of all the SIC codes are labelled as ‘Other’. More strikingly, on the full CH data set³, the current SIC taxonomy fails to completely capture activity information for almost one-third of UK businesses (that is, approx. 30% of businesses in CH are classified as ‘Other’⁴). This is strong evidence that many registered UK companies do not seem to have chosen, or could not choose, an accurate SIC code and are therefore misclassified and misunderstood from a policy-making perspective.
In our analysis, the SIC code ‘Other business support service activities’ had the most matches with the web data (18.14%; 5175 businesses) (Table 2). This SIC code, along with ‘Other service activities’, was also the most diverse in terms of sector coverage (comprising 103 out of 108 Glass sectors).
We further examined the top two ‘Other’ SIC codes. First, we looked at their sector distribution with regard to the Glass sectors, and second, we inspected the proportion of ‘Other’ within each sector. The top two SIC codes were the most ambiguous about company activities, even though they were part of clearly defined SIC sections⁵.
We also discovered that businesses performing ‘Staffing-related’ activities accounted for the highest proportion (5.2%) across all the ‘Other’ SIC codes (Table 3). This could mean that this sector has one of the poorest SIC descriptions, or it could mean that Staffing-related companies tend to perform cross-sector activities. We noticed a similar situation, but at a lower proportion, with ‘Hospitals’. By contrast, companies specialising in ‘Jewellery and Wholesale’ accounted for the lowest share of ‘Other’. The top two ‘Other’ SIC codes had a slightly different sector distribution, with most companies in the Financial Services and Professional Training sectors.
Another insight was that more than half of ‘Staffing’ and ‘Health & Wellness’ businesses would classify themselves as ‘Other’ (Table 4). Why is this figure so high? This is an area of further research that could be addressed using additional data.
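The two views used in this section, the Glass sector distribution within a given SIC code and the share of ‘Other’ within each Glass sector, amount to simple counting over the matched records. The sketch below assumes each matched business is a (SIC description, Glass sector) pair and treats a SIC code as ‘Other’ when its description starts with "Other"; both are simplifying assumptions, and the sample records are invented for illustration.

```python
from collections import Counter


def is_other(sic_description: str) -> bool:
    """Assumed rule: a SIC code counts as 'Other' if its description starts with 'Other'."""
    return sic_description.strip().lower().startswith("other")


def sector_distribution(records, sic_description):
    """Share of each Glass sector among businesses with the given SIC description."""
    counts = Counter(sector for sic, sector in records if sic == sic_description)
    total = sum(counts.values())
    return {sector: n / total for sector, n in counts.items()} if total else {}


def other_share_by_sector(records):
    """For each Glass sector, the fraction of its businesses with an 'Other' SIC code."""
    totals, others = Counter(), Counter()
    for sic, sector in records:
        totals[sector] += 1
        if is_other(sic):
            others[sector] += 1
    return {sector: others[sector] / totals[sector] for sector in totals}


# Invented records, not data from the study.
records = [
    ("Other business support service activities", "Staffing"),
    ("Other service activities", "Staffing"),
    ("Temporary employment agency activities", "Staffing"),
    ("Other business support service activities", "Financial Services"),
]
shares = other_share_by_sector(records)
```

With these invented records, two of the three ‘Staffing’ businesses fall under an ‘Other’ SIC code, which is the kind of per-sector share reported in Table 4.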
Analysis of ‘None Supplied’ SIC codes
In CH, missing information on company activity is evidenced through the ‘None Supplied’ SIC codes. Choosing a SIC code at the moment a company is set up⁶ has been mandatory since 2016. Previously, the data was provided on the first annual return (now called the confirmation statement).
Based on our analysis, 5.7% of registered CH businesses did not provide a SIC code. In terms of our matching with the web data, a total of 3.2% of matched businesses had a ‘None Supplied’ SIC code. This could suggest that these businesses are less likely to have a web presence.
The sector ‘Law Practice and Services’ in Glass was the dominant sector among companies with a ‘None Supplied’ SIC code (Table 5). One possible explanation is that law firms tend to be partnerships (i.e. not registered in CH) and that, as a result, there isn’t an appropriate SIC code for law firms in the official taxonomy. We learned that this is not the case, as the ‘Legal and Accounting’ SIC code can capture the activities of law firms. We noticed that the top Glass sectors in the ‘None Supplied’ SIC category are professional services sectors. With the use of text-rich company descriptions and topics data from business websites, we can get a better understanding of what these businesses actually do.
This quick matching experiment of web data with official data shed some light on the kind of activities that UK businesses in the ‘Other’ SIC codes are engaged in. Professional services businesses related to staffing and training seem to be the most poorly classified in Companies House. We have also learned that law, accounting and investment-related businesses do not always choose a descriptive SIC code. This in itself could be an interesting line of enquiry for another piece of research.
More detailed research could also be done with the UK Glass data. For example, we could look at the specific topics that companies use to describe their activities, or help analysts categorise businesses that are active in various sectors. As shown in several reports, the open web also allows us to better understand the sizes of emerging sectors that do not sit well within official statistics.
¹ We limited the number of addresses to ten per company.
² 96% accuracy.
³ Active and non-dormant companies.
⁴ SIC codes labelled ‘Other’ vary in how much industry information they capture. For example, ‘Other service activities’ does not offer enough information on company activity, whereas ‘Other manufacturing’ gives a clear indication of the industry.
⁵ Each SIC code is part of a broader industry section. 82990 belongs to section N (Administrative and Support Service Activities) and 96090 is part of section S (Other Service Activities).
Bean, C. (2016). Independent review of UK economic statistics. HM Treasury, Cabinet Office, The Rt Hon Matt Hancock MP and The Rt Hon George Osborne MP, 11.