Visualising Indian startup investments using Python — violin plots, heatmaps and sankey diagrams

Sometimes histograms and scatterplots aren’t enough. Here I’ll cover some of the more complicated plots that you might need: violin plots, heatmaps and sankey diagrams. I’ll mostly use Python, and I’ve picked up this data from here; it’s data on the startup investment scene in India. The dataset has Indian startup funding information between January 2015 and August 2017.

First I’ll read the data and do some cleaning.

# importing some stuff
import pandas as pd
import numpy as np
import math
from datetime import datetime
# importing stuff for plotting
import seaborn as sns
from matplotlib import pyplot as plt
from pylab import rcParams
# reading the data
df = pd.read_csv('startup_funding.csv')

The data has the following columns: ‘Date’, ‘StartupName’, ‘IndustryVertical’, ‘SubVertical’, ‘CityLocation’, ‘InvestorsName’, ‘InvestmentType’, ‘AmountInUSD’. I think the names are pretty self-explanatory. Now I’m not going to put every industry and investor in my plot; that’s just going to be a mess. So I’ll just choose the top ones.

# picking the top cities and investors
top_IndustryVertical = df.IndustryVertical.\
top_CityLocation = df.CityLocation.value_counts()[:9].index.tolist()
# cleaning the AmountInUSD col to convert strings into numbers
def gen_num(x):
    # strings like '1,300,000' become floats; anything else (e.g. NaN) becomes NaN
    if isinstance(x, str):
        return float(x.replace(',', ''))
    return np.nan

df['AmountInUSD'] = df.AmountInUSD.apply(gen_num)
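If you want to check the two steps above without the real file, here is a minimal sketch on a made-up table. The `toy` frame, its values and the cutoff of 2 are my own assumptions for illustration, not part of the dataset; only `value_counts()` and the `gen_num` logic mirror the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for startup_funding.csv
toy = pd.DataFrame({
    "CityLocation": ["Bangalore", "Mumbai", "Bangalore", "Delhi", "Bangalore", "Mumbai"],
    "AmountInUSD": ["1,300,000", np.nan, "500,000", "2,000,000", np.nan, "750,000"],
})

def gen_num(x):
    """Turn '1,300,000' into 1300000.0; anything that is not a string becomes NaN."""
    if isinstance(x, str):
        return float(x.replace(",", ""))
    return np.nan

toy["AmountInUSD"] = toy["AmountInUSD"].apply(gen_num)

# value_counts() sorts by frequency (most common first),
# so slicing the result keeps the most frequent cities
top_cities = toy["CityLocation"].value_counts()[:2].index.tolist()

print(top_cities)
print(toy["AmountInUSD"].tolist())
```

Missing amounts stay as NaN, so they are simply skipped later when medians are computed.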

I’ll start with a violin plot. Violin plots show the distribution of a variable (as a smoothed histogram) for different categories side by side. I’m going to plot the distribution of investment amount per industry and investment type.

# picking the columns I'll use
indus_amt = df[['IndustryVertical','AmountInUSD','InvestmentType']]\
# deciding the size of plot
rcParams['figure.figsize'] = 20,5
# plotting a violin-plot, only investments below 5 million
sns.violinplot(x="IndustryVertical", y="AmountInUSD", hue='InvestmentType', split=True, scale="count",\
               data=indus_amt[indus_amt.AmountInUSD < 5000000],\
               inner="quartile")
# plotting a swarm plot on top of the violin plot
sns.swarmplot(x="IndustryVertical", y="AmountInUSD", hue='InvestmentType',\
              data=indus_amt[indus_amt.AmountInUSD < 5000000],
              size=2, color=".5", linewidth=0)

I hope the plot is intuitive, but I’ll go into a bit of detail. Do you see the dotted lines in the distributions? They mark the 25th, 50th (median) and 75th percentiles of each distribution; for that I’m using inner="quartile". For every industry we get two distributions, one for private equity and one for seed rounds. That’s why I’m using these two arguments in the code: hue='InvestmentType' and split=True. As expected, seed round investments are way lower than private equity ones. Also, private equity investments in finance are way higher than in the other industries. The width of each violin also reflects the number of samples in that distribution; that’s why I’m using scale="count".
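To make the quartile lines and the count scaling less abstract, here is a small numeric sketch of what they summarise. The `toy` frame and its amounts are invented for illustration, and I’m assuming the lines sit at the plain per-group quantiles of the data; the labels ‘Seed Funding’ and ‘Private Equity’ are just placeholders for the two investment types.

```python
import pandas as pd

# Hypothetical amounts for the two investment types
toy = pd.DataFrame({
    "InvestmentType": ["Seed Funding"] * 5 + ["Private Equity"] * 3,
    "AmountInUSD": [100000, 200000, 300000, 400000, 500000,
                    1000000, 2000000, 3000000],
})

grouped = toy.groupby("InvestmentType")["AmountInUSD"]

# 25th, 50th and 75th percentiles per group: the values the inner lines mark
quartiles = grouped.quantile([0.25, 0.5, 0.75]).unstack()

# number of observations per group: what the count scaling is based on
counts = grouped.size()

print(quartiles)
print(counts)
```

With split=True the hue column needs exactly two levels, which is why only two investment types show up in each violin.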

Next I’ll make a heatmap of median industry investment per city.

heat_mp = pd.pivot_table(df[df.IndustryVertical.\
&df.CityLocation.isin(top_CityLocation)], \
values='AmountInUSD', index='IndustryVertical', \
columns='CityLocation', aggfunc='median').fillna(0)
rcParams['figure.figsize'] = 10,10
# all investment figures are in 100k
sns.heatmap((heat_mp/100000).astype(int), annot=True, fmt="d", cmap="YlGnBu")
All figures are in 100k

Looks like Delhi is really popular for logistics and Noida is really popular for healthcare funding. Mumbai makes a splash in finance.
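The table behind the heatmap is just a pivot of medians. Here is a rough sketch of that step on invented numbers; the `toy` frame is hypothetical and I’ve left out the plotting call so that it runs with pandas alone.

```python
import pandas as pd

# Hypothetical deals: industry, city and amount in USD
toy = pd.DataFrame({
    "IndustryVertical": ["Finance", "Finance", "Finance", "Healthcare", "Healthcare"],
    "CityLocation": ["Mumbai", "Mumbai", "Delhi", "Delhi", "Delhi"],
    "AmountInUSD": [1000000, 3000000, 500000, 200000, 400000],
})

# one row per industry, one column per city, each cell the median deal size;
# industry/city pairs with no deals come out as NaN and are filled with 0
heat = pd.pivot_table(toy, values="AmountInUSD", index="IndustryVertical",
                      columns="CityLocation", aggfunc="median").fillna(0)

# same rescaling as in the heatmap call: whole units of 100k USD
heat_100k = (heat / 100000).astype(int)

print(heat_100k)
```

Keep in mind that a 0 in the heatmap can mean either "no deals" or "a median below 100k", because of the fillna(0) and the integer cast.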

Next I’ll make some sankey diagrams. Sankeys are helpful to show flows from one category to another. I’ll first make a sankey for the flow of investment between venture capitalists and industries. Then I’ll make one for the flow of investment between cities and industries. I’m using this wonderful website to make the sankey plots. You need to put the data in a specific syntax for it to work. That’s why I’m storing the data in text format in a file and uploading the content of the file to get the plot.
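Here is a minimal sketch of how such flow lines can be built, on invented data. The `toy` frame, the helper `to_flow_line` and the choice to sum amounts per investor and industry pair are all my assumptions for illustration; the line layout (source, amount in 100k inside square brackets, target) and the cutoff that drops flows below 10 units of 100k follow the code in this post.

```python
import pandas as pd

# Hypothetical deals between investors and industries
toy = pd.DataFrame({
    "InvestorsName": ["Fund A", "Fund A", "Fund B", "Fund B"],
    "IndustryVertical": ["Finance", "Finance", "Logistics", "Healthcare"],
    "AmountInUSD": [1500000, 2500000, 3000000, 400000],
})

# Assumption: one flow per (investor, industry) pair, summing the amounts
flows = toy.groupby(["InvestorsName", "IndustryVertical"],
                    as_index=False)["AmountInUSD"].sum()

def to_flow_line(source, target, amount_usd, min_100k=10):
    """Format one flow as 'source[amount]target', or None if it is too small."""
    amount_100k = int(amount_usd / 100000)
    if amount_100k < min_100k:
        return None
    return f"{source}[{amount_100k}]{target}"

candidates = (to_flow_line(r.InvestorsName, r.IndustryVertical, r.AmountInUSD)
              for r in flows.itertuples(index=False))
lines = [line for line in candidates if line is not None]

# the text that would be saved to a file and pasted into the sankey tool
sankey_text = "\n".join(lines)
print(sankey_text)
```

Dropping the small flows keeps the diagram readable; the exact syntax your sankey tool expects may differ, so check its documentation before uploading.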

# creating a table for the sankey
sankey_investor = df[df.InvestorsName.isin(top_InvestorsName)\
# saving file in text format
sankey_investor.apply(lambda x: (str(x[0]) + str('[') + \
str(int(x[2]/100000)) + str(']') + str(x[1]))\
if int(x[2]/100000)>9 else str("''"),\
# creating a table for the sankey
sankey_city = df[df.CityLocation.isin(top_CityLocation)&\
# saving file in text format
sankey_city.apply(lambda x: (str(x[0]) + str('[') + \
str(int(x[2]/100000)) + str(']') + str(x[1]))\
if int(x[2]/100000)>9 else str("''"),\

And that’s it. Hopefully this helped you; if it did, please leave a like and a comment.
