Mining for Data: How We Built This

Published in

Houghton Street Ventures

7 min readJun 13, 2022

Our last three blog posts have provided the world with a window into the LSE entrepreneurial ecosystem, first looking into the incredible year that was 2021 before zooming out and exploring the whole ecosystem as well as the LSE entrepreneurs themselves.

Today, we have something a little different, and, fair warning, readers, this one comes with a “Nerd Alert”. We will explain how we set up our data infrastructure that is powering the insight and sourcing engine we built at Houghton Street Ventures.

Infrastructure

The database

We wanted something which was secure, flexible, reasonably easy to use and scalable, so AWS RDS felt like one of the most straightforward and most well-known options to go with. Our data requirements were well suited for SQL, and prior experience led us to a MySQL instance which was easy to deploy.

The dashboards

Like most well-rounded teams, our technical proficiency varies, and so we needed something which was easy for everyone in the group to pick up and run their own queries. At Scalable Capital, whilst we were in our early years, we were using Metabase for all of our BI needs, and we were immediately drawn into using it again to speed up our time to insights. This product is highly recommended for setting up an easy-to-use window into your data.

The time saver

To ensure our time was spent focussing on our core data needs (sourcing and insights), we wanted a tool that made it easy to manage and maintain the core database. As such, Navicat was the perfect option for us to easily append, update and change our database tables without worrying about writing the correct SQL commands for creating tables, setting up the suitable data types and updating records.

Data Collection and Cleaning

Collection

We are interested in collecting data about our entrepreneurs, including basic attributes, their educations, their careers and any information we could on the companies they founded (including funding data). As such, it means we have to pull this from several paid and public sources to get everything we need to be able to create insights as well as keep on top of everything that is going on in the ecosystem — including reaching out to recent alumni who have recently set out upon an entrepreneurial journey.

Cleaning

Collecting data from multiple sources and across different areas of interest meant that our first major task was to clean the data and standardise it in a way which made it easy for us to store it within a relational database. For example, many collected records included complete career histories of varying lengths for sets of alumni and this all needed to be separated and cleaned. This was one of the most complex parts of the process and needed automating pretty quickly to ensure future updates were quick and correct. Any changes to the collected data structure will result in changes required to our automation here — further development down the road.

Data Model

As mentioned previously, our data was well suited for a relational database, which has so far resulted in fourteen connected tables. We will run through the general purpose of all of these below to indicate how we set things up to make querying and analysing as understandable and powerful as possible.

Core

This is the main table which holds all of the founder attribute data like location and gender. The initial setup of the table was straightforward, but the challenge comes with any update to the data. Founder unique identifiers like URLs or names are rarely stable, and so we have to use a “Fuzzy Matching” algorithm to match newly downloaded profiles to the existing database to work out if they are really new founders or just existing founders with a new name or with their data hosted on a different URL. Fun times… We also standardise founder location data at this point to help us with mapping the ecosystem across the globe.

Schools

This basic table contains little more than the name of all the academic institutions where the founders studied. The only complexity to deal with is that we are obviously interested in alumni of the London School of Economics, and there are many iterations to consider (LSE, London School of Economics, London School of Economics and Politics Science and so on).

Education

The link between founders in our Core and Schools tables. Each record has a FounderID and a SchoolID. N founders can be matched with M schools through different educational records. This gives us data on all graduation dates and degree courses. Some cleaning is undertaken in this table to accurately categorise a degree as Bachelors, Masters etc.

Companies

This is a basic table which contains data on all companies which are related to our founders. This table stores details like company name, website, headquarters location and foundation date. At this point, the database starts to get big. We store over 120,000 companies relating to a founder universe of approximately 23,000 individuals.

Career

The link between founders in our Core table and Companies. Each record has a FounderID and a CompanyID. N founders can be matched with M companies. This gives us all the data on job title and start/end dates. The complexity in this table is that it determines whether that career made the individual a founder of the company or not. Roles like “Founder’s Associate” or “Entrepreneurship Studies” make this a matching challenge.

Crunchbase Companies

Crunchbase is an excellent source for global early-stage funding round data, which is very relevant for our universe, and, as such, we have a table to store all of the company information we can pull from Crunchbase (outside of acquisitions and funding rounds which are dealt with separately). This table stores information such as company type (public/private) and descriptions of what the companies do.

Crunchbase Company Match

The vast majority of our career data matches against our Companies table, and all of our funding data matches against the Crunchbase Companies table. One of the challenges we face when cleaning and aggregating all this is that career data can change slightly, which moves the founder relationship from one company to another. For example, just because the company name changes doesn’t necessarily mean we need to link to a new Crunchbase company. One way to ensure the updated company data is linked back to the correct Crunchbase Company records is through this 1-to-1 matching table. We used to model our data without this matching table, and every update required a lot of manual rematching and intervention. This is here to prevent that.

Acquisitions

This table holds all the data on any acquisitions that have occurred within our universe of companies. If we have any details on the terms, these would also be stored here. This table links to the Crunchbase Companies table.

Investment Round

This table holds all the high-level data on any funding rounds attached to a given company, linking again to the Crunchbase Companies table. This table has been instrumental in gathering insights into the historical winners and the size of the funded LSE company ecosystem. As this is a vital part of the data, we save more details in the following three linked tables.

Investor Participation

Not only does a funding round link to a company, but it also links to an investor. This table is simple and holds both an InvestmentID and an InvestorID to match any investors who participated in an investment round with the company. It is an M to N relationship as investors invest in many companies. We have gathered insights into LSE companies’ most prominent VC investors from this data.

Investor Lead

This table is the same as the Investment Participation table but looks at the leads of the investment rounds rather than the whole list of investors who participate. It is basically a subset of the Investment Participation table.

Investors

Like our Schools table, this table once again contains little more than the name of all investors who have previously invested in an LSE company. We may choose to build more attributes out over time, but for now, we are focussing on other parts of our data collection efforts.

Industry Match

This table links up the Company table with the Industries table allowing for M to N relationships to potentially match companies across multiple industries.

Industries

Finally, another basic table which is a list of all of the industries that are related to our LSE company universe. Using our Industry Match table, we can map every company to several industries and verticals.

The Results

Setting up the framework above has been a labour of love as we set up the fund and discovered more and more about the size and strength of the LSE entrepreneurial ecosystem and how much it has changed over time. However, it is also an integral part of our outbound sourcing engine, and if it helps us to build beautiful graphics like the below, that is an additional benefit :)

Mining for Data: How We Built This

Infrastructure

Data Collection and Cleaning

Data Model

The Results

Written by Adam French