Exploring the Indian Startup Ecosystem: A Data-Driven Analysis of Funding Trends by Industry/Sector

Success Makafui Kwawu
11 min read · Mar 24, 2024

Introduction

In this article, I share my thought process and the insights gained from an in-depth analysis of funding in the Indian startup ecosystem, using the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework.

Project Structure

The project is structured around the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework. However, the modelling and evaluation phases are excluded since this project does not involve building ML models. This project is essentially an Exploratory Data Analysis (EDA).

CRISP-DM Diagram. Inspired by Wikipedia

Business Understanding

The Indian start-up ecosystem, ranked as the third largest in the world, is a network of entrepreneurs, investors and other stakeholders working to build and grow technology-driven startups in the country.

India has seen an astronomical increase in startups and funding, with over 16,000 new companies added in 2020, resulting in unprecedented growth.

Funding is generally provided by investment firms, angel investors, venture capitalists and private equity firms. In the face of market uncertainties, the Indian start-up ecosystem received $8.4 billion in 2023 indicating how resilient it is.

The objective of the project is to investigate the Indian start-up ecosystem by analyzing funding received by start-ups from 2018 to 2021 and propose the best course of action. The information contained therein will be useful to prospective investors, entrepreneurs and other stakeholders desirous of venturing into the Indian Start-up space.

The following are the business questions to be answered at the end of the project.

  • Which particular sector received the most funding over the time frame?
  • What is the distribution of startups across stages and the amount allocated to each stage?
  • What is the distribution of funding based on location?
  • Which year had the most investors?
  • Who are the top 3 financiers in the Indian startup landscape?
  • What was the impact of COVID-19 pandemic on startup funding in 2020?

As part of the questions, a hypothesis test will be conducted on whether the sector a startup operates in has any impact on the amount of funding it receives.

Data Understanding

Data collection

I imported all the necessary packages needed to execute this project after creating a virtual environment for it.

A virtual environment is an essential tool for managing dependencies and ensuring consistent execution contexts for Python applications.

#import all necessary libraries
# data manipulation
import pandas as pd
import numpy as np
import missingno
from sklearn.impute import SimpleImputer
# data visualization libraries
import matplotlib.pyplot as plt
from plotly import express as px
import seaborn as sns
# statistical libraries
from scipy import stats
import statistics as stat
# database manipulation libraries
import pyodbc
from dotenv import dotenv_values
# hide warnings
import warnings
warnings.filterwarnings("ignore")
  • Pandas and NumPy: for data cleaning and manipulation
  • Matplotlib and Seaborn: for visualizations
  • Pyodbc: to access the database
  • Warnings: to deal with the warnings

Data for the project was collected from three sources. The 2020 and 2021 datasets were stored in a database for security reasons, the 2019 dataset was found on OneDrive, and the 2018 dataset was hosted in a GitHub repository.
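
The database part of the loading step can be sketched roughly as follows. This is only an illustration: the .env key names (SERVER, DATABASE, LOGIN, PASSWORD), the default driver and the table name passed in are all assumptions, not the project's actual settings.

```python
import pandas as pd


def build_connection_string(config: dict) -> str:
    """Assemble an ODBC connection string from values read from a .env file."""
    # Key names and the default driver are assumptions for illustration.
    return (
        f"DRIVER={{{config.get('DRIVER', 'SQL Server')}}};"
        f"SERVER={config['SERVER']};"
        f"DATABASE={config['DATABASE']};"
        f"UID={config['LOGIN']};"
        f"PWD={config['PASSWORD']}"
    )


def load_table(table_name: str, env_path: str = ".env") -> pd.DataFrame:
    """Read one database table into a DataFrame (table name is hypothetical)."""
    # Imported here so the helper above can be used without an ODBC driver
    import pyodbc
    from dotenv import dotenv_values

    config = dotenv_values(env_path)  # credentials stay out of the notebook
    connection = pyodbc.connect(build_connection_string(config))
    try:
        return pd.read_sql(f"SELECT * FROM {table_name}", connection)
    finally:
        connection.close()
```

The files from OneDrive and GitHub can then be read with pd.read_csv once they are downloaded or their raw links are known.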

Data description

After successfully loading the data, the following pandas methods and attributes were used to gain a high-level view of the datasets:

.head()
.describe()
.info()
.shape
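
As a small illustration of these checks, here is a sketch on a toy DataFrame. The column names and values are made up and only stand in for one of the yearly datasets.

```python
import pandas as pd

# Toy stand-in for one yearly dataset; names and values are made up
df = pd.DataFrame(
    {
        "Company Name": ["Alpha", "Beta", "Beta", "Gamma"],
        "Amount": ["$1,000,000", "₹5,00,00,000", "₹5,00,00,000", "Undisclosed"],
        "Location": ["Bangalore, Karnataka, India", "Mumbai", "Mumbai", None],
    }
)

print(df.head())                   # first rows of the data
print(df.describe(include="all"))  # summary statistics for every column
df.info()                          # dtypes and non-null counts
print(df.shape)                    # (number of rows, number of columns)

n_duplicates = int(df.duplicated().sum())  # fully duplicated rows
null_counts = df.isna().sum()              # missing values per column
```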

The following observations were made about the datasets:

2018 Dataset

  • The dataset has 526 rows and 6 columns
  • It has one (1) duplicate
  • The amount column contains dollars, rupees and some other non-numeric characters
  • The location column appears to contain state, regional and city information
  • There is a Google document link in the Round/Series column

2019 Dataset

  • The dataset has 89 rows and 9 columns
  • It has no duplicates but contains null values
  • The amount column is a float datatype and contains other non-numeric characters
  • Columns such as Stage, HeadQuarter and Founded have a lot of NaN values

2020 Dataset

  • The dataset contains 1055 rows and 10 columns
  • It has three (3) duplicates
  • The amount column is a float datatype and contains other non-numeric characters
  • The HeadQuarter column has some locations outside of India

2021 Dataset

  • The dataset contains 1209 rows and 9 columns
  • The dataset contains 19 duplicates
  • The amount column is a float datatype and contains other non-numeric characters
  • There are instances where values have been recorded under the wrong columns

Data quality

All the datasets have null values, with the exception of 2018. The data is riddled with misplaced values, duplicates and missing values.

The general observation is that the data is messy and will require a thorough cleaning.

Data Preparation

In order to clean the dataset thoroughly, some assumptions were made:

  • All the amount values without any currency are assumed to be in USD
  • The exchange rate between USD and the Indian Rupee as of 2018 was used

1 USD = 68.4933 INR

source:

The data was cleaned year by year, focusing on one column at a time.

Amount column

  • All commas were removed
  • Values in Rupees(₹) and Dollars ($) were extracted
  • Amounts in Rupees (₹) were converted to dollars after removing the currency sign
  • All non-numeric values were removed by running the code below
# convert the cleaned column to numeric; non-numeric values become NaN
amount_column = pd.to_numeric(amount_column, errors="coerce")
  • Where possible, the median was used to fill missing values since the amount variable has a skewed distribution.
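
The steps above can be put together in one helper. This is a minimal sketch rather than the exact project code: the function name clean_amount and the sample values are hypothetical, while the exchange rate is the 2018 rate stated in the assumptions.

```python
import pandas as pd

INR_PER_USD = 68.4933  # 2018 exchange rate from the assumptions above


def clean_amount(raw: pd.Series) -> pd.Series:
    """Return amounts in USD as floats, with gaps filled by the median."""
    # Remove commas and surrounding spaces
    text = raw.astype(str).str.replace(",", "", regex=False).str.strip()
    # Remember which values were quoted in rupees before stripping the sign
    is_rupee = text.str.startswith("₹")
    # Strip currency signs; anything still non-numeric becomes NaN
    numeric = pd.to_numeric(text.str.lstrip("₹$"), errors="coerce")
    # Convert rupee amounts to dollars; values without a sign are treated as USD
    usd = numeric.where(~is_rupee, numeric / INR_PER_USD)
    # The median is less sensitive to skew than the mean
    return usd.fillna(usd.median())


sample = pd.Series(["$1,000", "₹68,493.3", "2000", "Undisclosed", None])
cleaned = clean_amount(sample)
```

Note that filling with the median changes the totals reported later, so it is worth keeping track of how many values were filled this way.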

Stage column

To enable deeper analysis, the Stage column was grouped into high-level funding stages based on the stages listed on the Indian Startup Ecosystem website. A dictionary was created to facilitate the groupings. A seventh stage named Undisclosed was added by the author to take care of missing values in the column. The Google document link mentioned during the data understanding phase was replaced by Undisclosed.
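
A rough sketch of this grouping is shown below. The dictionary entries are illustrative only, since the full mapping is not reproduced in this article, and the link in the sample data is a placeholder.

```python
import pandas as pd

# Hypothetical raw labels mapped to high-level stages (illustrative only)
STAGE_MAP = {
    "Pre-seed": "Ideation",
    "Seed": "Validation",
    "Angel": "Validation",
    "Series A": "Early Traction",
    "Series B": "Scaling",
    "Series C": "Scaling",
    "Private Equity": "Scaling",
}


def group_stage(stage: pd.Series) -> pd.Series:
    """Map raw round names to high-level stages."""
    cleaned = stage.str.strip()
    # Missing values and anything not in the dictionary (such as a stray
    # link) end up as NaN after map() and are labelled Undisclosed
    return cleaned.map(STAGE_MAP).fillna("Undisclosed")


raw_stage = pd.Series(["Seed", " Series A ", None, "https://example.com/doc"])
grouped = group_stage(raw_stage)
```

Sending every unmapped label to Undisclosed is a simplification; in practice it helps to list the unmapped labels first and extend the dictionary where a sensible stage exists.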

HeadQuarter column

The HeadQuarter column was also cleaned by creating a location dictionary to map the data to the appropriate locations. Similar city names were mapped together, for example Bangalore and Bengaluru, or Delhi and New Delhi.

Cities outside of India were replaced with ‘Overseas’. This is very important as most investors view well established companies favourably when making investment decisions.

The regions and states information were dropped.
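
The location cleaning can be sketched as follows. The alias dictionary and the list of overseas cities are small assumed examples, not the full dictionaries used in the project.

```python
import pandas as pd

# Illustrative entries only
CITY_ALIASES = {"Bengaluru": "Bangalore", "New Delhi": "Delhi"}
OVERSEAS_CITIES = {"San Francisco", "London", "Singapore"}  # assumed examples


def clean_headquarter(hq: pd.Series) -> pd.Series:
    """Keep the city, merge aliases and label non-Indian cities as Overseas."""
    # "City, State, Country" -> "City": drops the state and region parts
    city = hq.str.split(",").str[0].str.strip()
    # Map different spellings of the same city to one name
    city = city.replace(CITY_ALIASES)
    # Replace cities outside India with a single label
    return city.where(~city.isin(OVERSEAS_CITIES), "Overseas")


raw_hq = pd.Series(
    ["Bengaluru, Karnataka, India", "New Delhi", "London", "Mumbai, Maharashtra"]
)
cleaned_hq = clean_headquarter(raw_hq)
```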

Sector column

The next major task was cleaning the sector column, which provides information about the sector of the economy a startup operates in or serves.

By now you have guessed it, right? Yep, another dictionary.

From research, the values in the sector column of the dataset can be generalised into 15 sectors of the Indian economy, plus a catch-all Others category. They are therefore mapped onto the following:

  • IT & Technology
  • Financial Services
  • Healthcare & Life Sciences
  • Consumer Goods
  • Business Services
  • Media & Entertainment
  • Education
  • Manufacturing
  • Retail
  • Transportation & Logistics
  • Sports
  • Agriculture
  • Real Estate
  • Travel & Tourism
  • Energy
  • Others

Other columns

  • A new column named ‘funded_year’ was inserted into all the four datasets signifying when the fund was received
  • The 2018 columns were renamed to agree with the other years and missing columns were added as well.
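# rename columns: lowercase with underscores instead of spaces
df_2018 = df_2018.rename(columns=lambda x: x.lower().replace(' ', '_'))
# add funded_year column (the year the funding was received)
df_2018["funded_year"] = 2018
# add investor column (missing from the 2018 data)
df_2018["investor"] = "Undisclosed"
# add founded column (assumed to equal the funding year)
df_2018["founded"] = 2018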
  • I made an assumption that all the values within the founded column with nulls should be replaced by the corresponding year the company received funding (funded_year)
  • The founders column was dropped as it is not that important to my analysis
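
The last two steps could look roughly like this. The toy frame and the name df_year are hypothetical stand-ins for one of the yearly datasets.

```python
import pandas as pd

# Toy frame; values are made up for illustration
df_year = pd.DataFrame(
    {
        "company_brand": ["Alpha", "Beta"],
        "founded": [2015.0, None],
        "founders": ["A. Founder", "B. Founder"],
        "funded_year": [2020, 2020],
    }
)

# Replace missing founding years with the year the funding was received
df_year["founded"] = df_year["founded"].fillna(df_year["funded_year"]).astype(int)

# Drop the founders column, which the analysis does not use
df_year = df_year.drop(columns=["founders"])
```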

All the datasets were concatenated into a final dataframe.

#concatenating the datasets into a single dataframe
df = pd.concat([df_2018,df_2019,df_2020,df_2021],ignore_index=True)

Next, I checked for missing values in the final dataset and found none.

Upon further checks, a duplicate and two null values were discovered and promptly dropped.
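
These final checks can be sketched on two tiny stand-in frames. The values are made up; only the sequence of calls is the point.

```python
import pandas as pd

# Two tiny stand-ins for the cleaned yearly datasets (made-up values)
df_a = pd.DataFrame(
    {"company_brand": ["Alpha", "Beta"], "amount": [1e6, None], "funded_year": [2018, 2018]}
)
df_b = pd.DataFrame(
    {"company_brand": ["Gamma", "Gamma"], "amount": [2e6, 2e6], "funded_year": [2019, 2019]}
)

df = pd.concat([df_a, df_b], ignore_index=True)

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows

# Drop duplicated rows and rows with missing values, then renumber the index
df = df.drop_duplicates().dropna().reset_index(drop=True)
```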

Checking the basic information about the dataset reveals the following:

Next, I computed descriptive statistics on the final dataset.

Descriptive statistics

Hypothesis testing

The test was performed using one-way ANOVA (Analysis of Variance).

Null Hypothesis (H_o): The sector of a startup does not influence the amount of funding it receives
Alternate Hypothesis (H_a): The sector of a startup influences the amount of funding it receives

Significance level (α) = 0.05

One Way ANOVA (Analysis of Variance)

Since the p-value of 0.979 is greater than the significance level of 0.05, we fail to reject the Null Hypothesis (F-statistic = 0.402). In effect, the data provide no statistically significant evidence that the sector of a startup influences the amount of funding it receives.
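
For readers who want to reproduce this kind of test, here is a minimal sketch with scipy. The funding amounts are made up for illustration, and the column names sector and amount are assumptions about the final dataframe.

```python
import pandas as pd
from scipy import stats

# Made-up funding amounts (USD) for three sectors, for illustration only
df = pd.DataFrame(
    {
        "sector": ["Retail"] * 4 + ["Education"] * 4 + ["Energy"] * 4,
        "amount": [
            1.0e6, 2.5e6, 0.8e6, 3.0e6,
            1.2e6, 2.0e6, 0.9e6, 2.8e6,
            1.1e6, 2.4e6, 1.0e6, 2.9e6,
        ],
    }
)

# One array of amounts per sector, passed together to the one-way ANOVA
groups = [group["amount"].to_numpy() for _, group in df.groupby("sector")]
f_stat, p_value = stats.f_oneway(*groups)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

One-way ANOVA assumes roughly normal groups with similar variances. Since funding amounts are skewed, a rank-based alternative such as stats.kruskal is a reasonable cross-check.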

Business questions

It is time to attempt the business questions we set out at the beginning of the project. I know this journey is taking longer than you thought but stick with me, we are almost there.

Which particular sector received the most funding over the time frame?

From the chart above, it is evident that the Financial Services sector received the most funding over the years, amassing over 1.6 billion dollars, followed by the Retail sector. The Real Estate, Agriculture, Sports and Energy sectors received little or no funding.

This is not surprising as the financial services sector is highly regulated, with entry barriers such as a stipulated Initial Capital Requirement aimed at protecting the general public.

Also, the top five (5) sectors, namely Financial Services, Retail, IT & Technology, Education and Consumer Goods, were the very sectors that were heavily impacted by COVID-19, and as such attracted a lot of investment during and after the pandemic.
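
A ranking like the one discussed above can be computed with a simple group-by. The sketch below uses made-up amounts and assumes the columns are named sector and amount.

```python
import pandas as pd

# Made-up rows standing in for the final dataframe
df = pd.DataFrame(
    {
        "sector": ["Financial Services", "Retail", "Financial Services", "Education"],
        "amount": [5.0e6, 3.0e6, 2.0e6, 1.0e6],
    }
)

# Total funding per sector, largest first
funding_by_sector = (
    df.groupby("sector")["amount"].sum().sort_values(ascending=False)
)
top_sector = funding_by_sector.index[0]
print(funding_by_sector)
```

The resulting series can be passed to a bar chart, for example with plotly express or seaborn.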

What is the distribution of startups across stages and the amount allocated to each stage?

From the analysis, it can be observed that startups in the early stages (ideation, validation and early traction) do not receive much funding. This is to be expected, as most of the capital at this point comes from the savings of the investor or founder, and the business is yet to validate its chance of succeeding.

A lot of trials have to be done in order to gain the attention of investors willing to invest. It is also interesting to observe that the influence of individual investors, government loan schemes and banks is minimal compared with the funding at the other stages.

The visualisation suggests that more funding is allocated to the scaling stages as well as exit options. This could imply that many startups survive the first few stages, break even and start to rake in a significant amount of profit, even to the point where the company is comfortable selling shares.

A greater portion of funding seems to come from external sources. There is an increase in private equity and investment firms providing funds for fast-growing late-stage startups, which again corresponds to startups in the scaling and exit stages.

What is the distribution of funding based on location?

Bangalore is widely regarded as the Silicon Valley of India, and the analysis of the startup ecosystem confirmed this, as Bangalore has the highest concentration of startups.

However, the focus of this question is to ascertain the location that received the largest funding during the period under review.

From my observations, Mumbai received the highest funding, well ahead of the next location, Bangalore.

This is probably due to Mumbai's massive population of about 20 million, which translates into more customers and a higher chance of investors recouping their investments.

Which year had the most investors?

From the line plot, 2021 had the most investors, with about 1,200. It seems funding for startups in India started to increase after 2019.

Who are the top 3 financiers in the Indian startup landscape?

This question sought to find out the major financiers of the Indian startup ecosystem.

The top three investors in the Indian startup ecosystem, in order of frequency, are:

  • Inflection Point Ventures with a presence of 41% in most start-up businesses
  • Venture Catalysts with a presence of 38.6% in most start-up businesses
  • Mumbai Angels Network with a presence of 20.5% in most start-up businesses

These are the financiers worth considering when one is venturing into the ecosystem.
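
One way to count how often each investor appears is sketched below. It assumes the investor column holds comma-separated names, which may differ from how the project handled it, and the fund names are placeholders.

```python
import pandas as pd

# Placeholder investor names; each row is one funding deal
df = pd.DataFrame(
    {
        "investor": [
            "Fund A, Fund B",
            "Fund A",
            "Undisclosed",
            "Fund B, Fund C",
            "Fund A",
        ]
    }
)

# One row per investor: split the names, expand the lists, trim spaces
investors = df["investor"].str.split(",").explode().str.strip()

# Ignore deals where the investor is not known
investors = investors[investors != "Undisclosed"]

# Most frequent investors first
top_3 = investors.value_counts().head(3)
print(top_3)
```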

What was the impact of COVID-19 pandemic on startup funding in 2020?

Although COVID-19 brought about a near-total shutdown of the world’s economy, funding for the Indian startup ecosystem, a tech-heavy ecosystem, rose to record highs in 2020, with the retail (e-commerce) sector taking the largest share of the investment. As online businesses do not require physical infrastructure, there was a boom in online businesses.

The funding trend continued its upward trajectory in 2021, with Financial Services as the dominant sector in terms of funding.

This might sound horrible given that many lives were lost to COVID-19, but it seems COVID was a good omen for the ecosystem.

Recommendations

Anyone desirous of venturing into the Indian Startup space should take note of the following:

Deployment

The final phase of the project is deployment: how stakeholders access the results.

To facilitate a collaborative learning experience, I’ve made the entire project available on my GitHub repository, where you can find the comprehensive code, detailed documentation, and insightful analyses. Additionally, I’ve created an interactive Power BI dashboard presenting captivating visualizations and key trends. Do not hesitate to explore the GitHub repository and interact with the Power BI dashboard. Let’s embark on this insightful journey together, empowering entrepreneurs and enthusiasts alike to harness data-driven strategies for a thriving Indian startup ecosystem!

Power BI Link: Link to Published Power BI Dashboard

GitHub: Link to Project Here

Appreciation

I highly recommend Azubi Africa for their comprehensive and effective programs. Read more articles about Azubi Africa here and take a few minutes to visit this link to learn more about Azubi Africa's life-changing programs.

Tags

Azubi Data Science
