When you search index.co (one of the most interesting databases of companies and markets) for “Artificial Intelligence”, you get 3 356 records (as of the end of May 2019). Crunchbase.com returns more than 10 000. These numbers are growing rapidly: more and more companies either develop algorithms and infrastructure or try to apply AI to real-life problems. AI has huge potential, but its rapid growth makes decisions in this market risky, not only for investors (Will this technology pay off?), but also for entrepreneurs (What kind of product should I create?) and young people starting their educational journey (Should I learn this, or that?).
Fortunately, datasets on AI companies and investments can clarify this somewhat cloudy landscape. In this post I will use some basic data science methods to answer the most intriguing questions, such as: What are the most promising technologies and sectors? Where are the strongest AI ecosystems? What are the most valuable AI skills? All the datasets and the Jupyter notebook used to generate the tables and figures below are available here, so if you want to see more or try out your own ideas, you are welcome :)
So, let’s get started…
Data and Problem Understanding
Our dataset, generated on April 23rd 2019, consists of 3269 records, each corresponding to a company with an “Artificial Intelligence” tag in the index.co database. A first look at the data shows that not all the fields are equally populated:
What can we find in this data? What questions can we answer? After some contemplation, I came up with the following set of potentially intriguing questions:
- What technologies and markets are the most popular, both in terms of number of companies and total investments?
- How do AI investors invest their funds? What are the basic investment statistics? Which countries invest the most, and where are the strongest AI ecosystems (read: where is the best place to set up an AI startup)? What about investment round types?
- What are the employment patterns? What does small-medium-large mean in terms of number of employees? How much, on average, is a data scientist worth? And what are the most valuable skills?
Before we answer these questions, we must clean up and prepare our data…
After some preliminary cleaning, the first real problem I encountered was the quite messy “market” field. For a sample company it looks like this:
As you can see, although a bit messy, it contains valuable information on both technologies and markets. If we clean it up, we can retrieve very interesting data that may enhance our analysis.
In short (see my Jupyter notebook for details, available here), I first removed unnecessary “stop words” (like empty spaces, “>” signs, etc.) from the string, and then split it into “dummy variables”: a set of columns with clean TAGs (technologies or markets) as names and values of either 1 (if the TAG appears in a company description) or 0 (if it’s absent). That not only allows us to compute some interesting statistics, but also opens a window for future modeling (e.g. predicting the total investment from a set of TAGs using regression analysis).
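To make the idea concrete, here is a minimal sketch of the cleaning and dummy-encoding step in plain Python. It is not the code from my notebook: the sample strings, the `market` field name, the choice of “>” and “,” as separators, and the helper names `parse_tags` and `dummy_encode` are all assumptions made for illustration.

```python
import re

def parse_tags(raw):
    """Split a raw market string into a sorted list of clean, lowercase tags."""
    if not raw:
        return []
    # Assumption: ">" and "," both act as separators in the raw field.
    parts = re.split(r"[>,]", raw)
    # Collapse repeated whitespace, trim, lowercase, and drop empty pieces.
    tags = {re.sub(r"\s+", " ", part).strip().lower() for part in parts}
    tags.discard("")
    return sorted(tags)

def dummy_encode(rows, field="market"):
    """Return (tag_columns, encoded_rows) with one 0/1 entry per tag."""
    parsed = [parse_tags(row.get(field)) for row in rows]
    columns = sorted({tag for tags in parsed for tag in tags})
    encoded = [{col: int(col in tags) for col in columns} for tags in parsed]
    return columns, encoded

# Hypothetical sample records; the real field values are not reproduced here.
companies = [
    {"name": "A", "market": "Technology >  Machine Learning, Finance"},
    {"name": "B", "market": "Technology > Big Data"},
    {"name": "C", "market": None},
]
columns, encoded = dummy_encode(companies)
print(columns)     # ['big data', 'finance', 'machine learning', 'technology']
print(encoded[0])  # company A has 1 for every tag except 'big data'
```

A company with an empty market field simply gets 0 in every column, so it stays in the table without contributing to any TAG statistic.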
It turns out that some keywords repeat across many TAGs. E.g. marketing can be found in app marketing, brand marketing, content marketing, etc. That led me to generate TAG categories, so that different stats can be computed not only at the TAG level, but also at the category level.
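One simple way to build such categories is to group TAGs by a shared keyword. The sketch below assumes whole-word matching and a hand-picked keyword list; both are illustrative choices, and `categorize` is a hypothetical helper rather than the function used in my notebook.

```python
def categorize(tags, keywords):
    """Map each keyword to the tags that contain it as a whole word."""
    categories = {keyword: [] for keyword in keywords}
    for tag in tags:
        words = tag.split()
        for keyword in keywords:
            if keyword in words:
                categories[keyword].append(tag)
    return categories

# Tags taken from the example above plus two extra ones for contrast.
tags = ["app marketing", "brand marketing", "content marketing",
        "health care", "finance"]
categories = categorize(tags, keywords=["marketing", "health"])
print(categories["marketing"])
# ['app marketing', 'brand marketing', 'content marketing']
```

A company then belongs to a category if it has at least one of the TAGs listed under that keyword.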
The most “populated” categories in my dataset are:
Finally, since not all the fields in the dataset are equally populated, I created dedicated datasets for the funding, geographical and employment analyses (removing empty (NaN) entries).
With the data cleaned and prepared, we can start answering our questions. BTW: data cleaning and preparation consumed approx. 80% of my total workload, which is typical for data science projects following the CRISP-DM methodology…
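The idea behind these dedicated datasets can be sketched as follows. The field names (`funding`, `country`, `employees`) and the `subset` helper are hypothetical; the optional zero filter is included because the funding and employment statistics later in this post also exclude “0” entries.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def subset(rows, required, drop_zero=False):
    """Keep only rows where every required field is present (and non-zero if asked)."""
    kept = []
    for row in rows:
        values = [row.get(field) for field in required]
        if any(is_missing(v) for v in values):
            continue
        if drop_zero and any(v == 0 for v in values):
            continue
        kept.append(row)
    return kept

# Hypothetical records with gaps in different fields.
companies = [
    {"name": "A", "funding": 1_000_000, "country": "US", "employees": 10},
    {"name": "B", "funding": None, "country": "UK", "employees": 3},
    {"name": "C", "funding": 0, "country": None, "employees": float("nan")},
]
funding_data = subset(companies, ["funding"], drop_zero=True)
geo_data = subset(companies, ["country"])
employment_data = subset(companies, ["employees"], drop_zero=True)
```

Filtering per analysis, instead of dropping every incomplete row up front, keeps as many companies as possible in each view of the data.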
And here we are! We are ready to answer our questions.
What technologies and markets are the most popular?
Let’s find the TAGs and their categories with the largest number of companies.
TOP 10 TAGs:
TOP 10 categories:
As you can see, AI companies still focus on fundamentals, i.e. algorithms and/or infrastructure: technology, AI, big data, machine learning, internet, software… Business applications are not that popular, with finance leading among TAGs, and health and marketing among categories.
We are still building the fundamentals. Business applications are next in line…
OK, but what happens when we look at investments instead of number of companies? Let’s have a look at
TOP 10 tech and markets by investments:
and TOP 10 categories by investments:
Almost the same picture: technologies are far ahead of business applications. But look: security and mobile seem to be promising…
Now, let’s see which markets attract the most employees.
In total, more than 1.2 mln employees work in the AI industry. The TOP 10 most popular TAGs are (look at cloud computing and storage):
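Both rankings, by number of companies and by total investment, follow the same pattern. Here is a minimal sketch, assuming each company record carries a list of clean `tags` and a `funding` amount (hypothetical field names and toy numbers):

```python
from collections import Counter, defaultdict

def rank_tags(companies, top=10):
    """Rank tags by number of companies and by total funding."""
    counts = Counter()
    totals = defaultdict(float)
    for company in companies:
        for tag in company["tags"]:
            counts[tag] += 1
            # Missing funding counts as zero in this sketch.
            totals[tag] += company.get("funding") or 0.0
    by_count = counts.most_common(top)
    by_funding = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return by_count, by_funding

# Toy data in mln USD, made up for illustration only.
companies = [
    {"tags": ["technology", "security"], "funding": 50.0},
    {"tags": ["technology", "finance"], "funding": 2.0},
    {"tags": ["finance"], "funding": 1.0},
    {"tags": ["technology"], "funding": None},
]
by_count, by_funding = rank_tags(companies, top=2)
print(by_count)    # [('technology', 3), ('finance', 2)]
print(by_funding)  # [('technology', 52.0), ('security', 50.0)]
```

Note that in this sketch a company with several tags contributes its full funding to each of them, so the per-tag totals should not be summed into an overall figure.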
and TOP 10 categories:
That concludes our first set of questions. Now, let’s have a look at investment patterns.
How do AI investors invest their funds?
Let’s start with basic investment statistics:
For 1481 companies (I eliminated null or “0” entries) the average investment is 18 mln USD, but 25% of companies have raised less than 0.78 mln USD and 50% less than 3 mln USD. The average is therefore pulled up by a few very large deals; the typical investment is not that high.
Let’s look at the geography. Which countries have the largest number of AI companies?
The US and UK dominate, followed by the Netherlands and Germany from the EU, Canada, India, France and Finland. But that picture changes significantly if we look at the amount of $$$ invested:
Look at China! Now it’s at the top of the table.
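The gap between the mean and the median is the signature of a skewed distribution. The toy example below (made-up numbers, not my dataset) shows how a single large value drags the mean far above the quartiles. The `summarize` helper is hypothetical; it uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points.

```python
import statistics

def summarize(values):
    """Return count, mean, quartiles and max for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# Made-up funding amounts in mln USD with one large outlier.
funding = [1.0, 2.0, 3.0, 4.0, 100.0]
stats = summarize(funding)
print(stats["mean"], stats["median"])  # 22.0 3.0
```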
The plot below is much better for our imagination…
Finally, let’s look at AI ecosystems. When we identify the most populated places, our TOP10 table (by the number of companies) looks like this:
while TOP10 cities by investments:
If you want to have a coffee with an inspiring AI geek, choose London or San Francisco. Want to find an investor with deep pockets? Fly to Beijing :)
Now, let’s check which investment rounds are the most popular.
By the number of companies, the early stages of investment dominate:
But when you change the perspective to the amount of USD raised, the picture is completely different:
Investors “like” start-ups, but put their money into successful companies…
Now, finally, it’s time to look at the employment patterns.
What are the employment patterns?
Let’s start with the basic employment statistics.
For 2 350 companies (once again, I eliminated null or “0” entries), the average number of employees is 514 (there are some huge outliers here…), yet 25% of companies have only 2 people on board, 50% have fewer than 8, and 75% have 11 or fewer.
AI companies are quite small in terms of people onboard.
More detailed, and quite inspiring, statistics for these small, medium-small, medium-large and large companies are shown in my Jupyter notebook. Here I would like to focus on a somewhat more intriguing question: the value of a data scientist.
Since AI companies are quite small, we can assume that most of their employees are data scientists. Under that assumption, we can estimate how much one AI employee is worth, on average.
It appears that the average valuation of an employee (defined as total investment divided by number of employees) is around 1.5 mln USD, with a record high of 45 mln USD. Wow…
If we classify companies by employee valuation and take only those from the 4th quartile (read: those with the highest valuation per employee), we can identify the TOP10 most “valuable” technologies and markets:
Once again, no surprise: the most “valuable” competencies are just technological ones, with health care being the top business one.
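The valuation and the 4th-quartile filter can be sketched as below. The records and field names are hypothetical, every record is assumed to have non-zero `funding` and `employees`, and the quartile boundary comes from `statistics.quantiles`, which may differ slightly from the method used in my notebook.

```python
import statistics
from collections import Counter

def employee_valuation(company):
    """Total investment divided by number of employees."""
    return company["funding"] / company["employees"]

def top_quartile_tags(companies, top=10):
    """Count tags among companies whose valuation exceeds the third quartile."""
    valuations = [employee_valuation(c) for c in companies]
    q3 = statistics.quantiles(valuations, n=4, method="inclusive")[2]
    counts = Counter()
    for company, value in zip(companies, valuations):
        if value > q3:
            counts.update(company["tags"])
    return counts.most_common(top)

# Toy data (funding in mln USD), made up for illustration only.
companies = [
    {"funding": 10.0, "employees": 10, "tags": ["marketing"]},
    {"funding": 20.0, "employees": 10, "tags": ["finance"]},
    {"funding": 30.0, "employees": 10, "tags": ["technology"]},
    {"funding": 40.0, "employees": 10, "tags": ["technology"]},
    {"funding": 200.0, "employees": 2, "tags": ["technology", "health care"]},
]
valuable = top_quartile_tags(companies)
```

In this toy data only the last company (100 mln USD per employee) lies above the third quartile, so its two tags are the only ones counted.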
Conclusions and Next Steps
That concludes our first round of statistics. What next? I encourage you to try my Jupyter notebook (available here) and ask, and answer, new questions.
Thanks to dummy encoding, the dataset is ready for machine learning modeling. Why not create a model that forecasts a potential investment from the keywords describing the company? Or from its location?
Moreover, if you just change the input dataset, you can perform a very similar analysis for a completely different industry. Possibilities seem to be endless.
PS. Once again, many thanks to the index.co team, especially Jelle, for access to the database. That’s great, and I’m keeping my fingers crossed for your project!