Data Analysis Project in Python

By RP
Published in Analytics Vidhya, Dec 26, 2019 (5 min read)

A company, ABC, wants to invest in startups and other companies. The CEO wants to understand the global trends in investments so that the company can invest effectively.

Business Understanding

ABC has two major constraints for investments:

  • It wants to invest between 5 and 15 million US dollars per round of investment.
  • It only wants to invest in English-speaking countries (countries with English as one of their official languages).

The objective is to find the best sectors, countries, and a suitable investment type for making investments. Here, "best" means where the largest number of investors are investing.

Data Understanding

The data is real-world data taken from crunchbase.com and consists of three files:

Mapping file: Contains the eight main sectors and their sub-sectors.

Companies file: Contains companies from all over the world with their basic information, such as sector and country of origin.

Rounds file: Contains the funding rounds of the companies (linked through company_permalink), with details such as the funding round code and the amount raised in USD.

Methodology

First we have to load the data into our IPython notebook. Before that, we need to import our modules.

We are going to use:

  • pandas
  • NumPy
  • Matplotlib
  • Seaborn

This is a data analysis and cleaning project, so we are not going to use scikit-learn.

We have to clean the data first, so we will look into the data and gather as much information as we can.

We used companies.head() to see the first rows of the companies data file.

Here we have 10 columns.
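As a rough sketch of this loading step, the snippet below reads a tab-separated file and peeks at it. The inline sample, its three columns, and the tab separator are assumptions made to keep the example self-contained; the real companies file has 10 columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for the companies .txt file (assumed tab-separated).
sample = io.StringIO(
    "permalink\tname\tcountry_code\n"
    "/Organization/-Fame\tFame\tIND\n"
    "/organization/0xdata\tH2O.ai\tUSA\n"
)

# With the real file you would pass its path instead of `sample`.
companies = pd.read_csv(sample, sep="\t")

print(companies.head())   # first rows of the dataset
print(companies.shape)    # (number of rows, number of columns)
```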

The permalinks contain a mix of uppercase and lowercase letters, so we need to convert them to a single case. We converted them to lowercase.
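A minimal sketch of the case conversion, using a tiny made-up DataFrame in place of the real companies data:

```python
import pandas as pd

# Toy data: the same kind of key written with different casing.
companies = pd.DataFrame(
    {"permalink": ["/Organization/-Fame", "/ORGANIZATION/0xdata"]}
)

# Normalise the case so that a company always maps to the same key.
companies["permalink"] = companies["permalink"].str.lower()

print(companies["permalink"].tolist())
```

Doing the same on company_permalink in the rounds data keeps the two keys comparable.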

Similarly, we peeked into the rounds dataset.

The rounds data also has company permalinks, so we can first check whether they all match the ones in the companies dataset.

Here we use the isin function to find the permalinks that appear in one dataset but not in the other.
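The check could look roughly like this. The two small DataFrames are invented for illustration; only the column names permalink and company_permalink come from the data described here.

```python
import pandas as pd

companies = pd.DataFrame(
    {"permalink": ["/organization/a", "/organization/b"]}
)
rounds = pd.DataFrame(
    {"company_permalink": ["/organization/a", "/organization/b", "/organization/c"]}
)

# Rounds whose company does not exist in the companies data.
missing_in_companies = rounds.loc[
    ~rounds["company_permalink"].isin(companies["permalink"])
]

# Companies that never appear in the rounds data.
missing_in_rounds = companies.loc[
    ~companies["permalink"].isin(rounds["company_permalink"])
]

print(missing_in_companies)
print(missing_in_rounds)
```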

The result shows some weird characters in company_permalink.

This is an encoding issue. The companies dataset is a .txt file, so it may have been saved with a different encoding.

To fix this, we encode the permalinks to UTF-8 and then decode them to ASCII.
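One way to sketch this encode/decode round trip with the pandas string accessor is shown below. The sample values are made up, and passing errors="ignore" (which silently drops every non-ASCII character) is an assumption about how the decoding step is meant to behave.

```python
import pandas as pd

# Toy data: one permalink containing a non-ASCII character.
rounds = pd.DataFrame(
    {"company_permalink": ["/organization/caf\u00e9-labs", "/organization/0xdata"]}
)

# Encode to UTF-8 bytes, then decode as ASCII, skipping any bytes
# that are not valid ASCII. Non-ASCII characters are removed, not replaced.
rounds["company_permalink"] = (
    rounds["company_permalink"]
    .str.encode("utf-8")
    .str.decode("ascii", errors="ignore")
)

print(rounds["company_permalink"].tolist())
```

Because characters are dropped rather than transliterated, the same treatment has to be applied to both datasets for the keys to keep matching.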

This fixes the problem. If you want a proper explanation, see this Stack Overflow question:

https://stackoverflow.com/questions/45871731/removing-special-characters-in-a-pandas-dataframe

We will use the same technique on the companies dataset.

The encoding issue should now be resolved, so we can move on with our data cleaning.

It’s time to check for missing values, and it seems that we have a lot of them.

That was the companies dataset. Let’s check the rounds dataset as well.
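A small sketch of the missing-value check on an invented rounds table; the percentage view is an optional extra that makes columns easier to compare.

```python
import numpy as np
import pandas as pd

# Toy data with some missing values.
rounds = pd.DataFrame(
    {
        "company_permalink": ["/organization/a", "/organization/b",
                              "/organization/c", "/organization/d"],
        "funding_round_code": ["A", np.nan, np.nan, np.nan],
        "raised_amount_usd": [1_000_000, np.nan, 5_000_000, 2_000_000],
    }
)

# Number of missing values per column.
null_counts = rounds.isnull().sum()

# Share of missing values per column, in percent.
null_pct = (100 * rounds.isnull().mean()).round(2)

print(null_counts)
print(null_pct)
```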

Since there are no missing values in the permalink or company_permalink columns, let’s merge the two datasets into one. This way it will be easier to clean the data.

Now we will drop company_permalink, because permalink and company_permalink hold the same values.
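The merge and the drop might be written as follows. The data is invented, and the choice of an inner join is an assumption; a left join on rounds would behave the same whenever every round has a matching company.

```python
import pandas as pd

companies = pd.DataFrame(
    {
        "permalink": ["/organization/a", "/organization/b"],
        "country_code": ["USA", "IND"],
    }
)
rounds = pd.DataFrame(
    {
        "company_permalink": ["/organization/a", "/organization/a", "/organization/b"],
        "raised_amount_usd": [5_000_000, 10_000_000, 2_000_000],
    }
)

# One row per funding round, enriched with the company's details.
master = pd.merge(
    rounds, companies,
    how="inner", left_on="company_permalink", right_on="permalink",
)

# The two key columns are identical after the merge, so keep only one.
master = master.drop(columns="company_permalink")

print(master)
```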

We will again check for null values in our main dataset (let’s call the new dataset master) using the isnull() and sum() functions.

Looking at the data, funding_round_code, homepage_url, founded_at, state_code, region and city are not needed for our business objective, so we will drop these columns. Note, however, that raised_amount_usd, country_code and category_list are useful, so we need to clean them properly.
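A sketch of this cleaning step on a toy master table with only a few of the columns. Removing rows that lack any of the three key fields is an assumption; imputing values would be another option.

```python
import numpy as np
import pandas as pd

master = pd.DataFrame(
    {
        "permalink": ["/organization/a", "/organization/b",
                      "/organization/c", "/organization/d"],
        "funding_round_code": ["A", np.nan, np.nan, "B"],
        "city": ["Austin", np.nan, "Pune", "London"],
        "raised_amount_usd": [5_000_000, np.nan, 2_000_000, 10_000_000],
        "country_code": ["USA", "IND", np.nan, "GBR"],
        "category_list": ["Software", "Media", "Games", np.nan],
    }
)

# Drop columns that do not serve the business objective (subset shown).
master = master.drop(columns=["funding_round_code", "city"])

# Assumption: rows missing a key field are removed rather than filled in.
master = master.dropna(
    subset=["raised_amount_usd", "country_code", "category_list"]
)

print(master)
```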

After all the cleaning, we save our master dataset to a separate CSV file so that we can use that file directly for our analysis.

Now it’s time for analysis

Let’s use a separate IPython notebook for the analysis. This way we can keep our work clean and organized.

First we import the libraries mentioned at the beginning of the post and load our master CSV file.

We only need the four main funding types, so we will keep only the rows that belong to them.

The four main funding types are:

  • Venture
  • Angel
  • Seed
  • Private_equity
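Filtering down to these four types could be sketched like this. The column name funding_round_type and the lowercase spellings of the values are assumptions about the dataset, and the rows are invented.

```python
import pandas as pd

master = pd.DataFrame(
    {
        # Assumed column name for the funding type.
        "funding_round_type": ["venture", "seed", "debt_financing",
                               "angel", "grant", "private_equity"],
        "raised_amount_usd": [5_000_000, 300_000, 1_000_000,
                              200_000, 50_000, 20_000_000],
    }
)

funding_types = ["venture", "angel", "seed", "private_equity"]

# Keep only the rows whose funding type is one of the four of interest.
master = master[master["funding_round_type"].isin(funding_types)]

print(master["funding_round_type"].value_counts())
```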

We need to compute a representative investment amount for each funding type. We can choose either the mean or the median. Let’s have a look at the raised_amount_usd column to get a sense of its distribution.

Let’s also look at the raised amount by funding type.

The median investment for type ‘private_equity’ is 20M USD, which is beyond ABC’s investment range, whereas the median for type ‘venture’ is 5M USD, which is within the range.
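The comparison can be sketched with a groupby. The amounts below are toy values picked only so that the medians mirror the figures quoted above, and funding_round_type is an assumed column name. A few very large rounds pull the mean up, which is one reason to lean on the median.

```python
import pandas as pd

master = pd.DataFrame(
    {
        "funding_round_type": ["venture"] * 3 + ["private_equity"] * 3 + ["seed"] * 3,
        "raised_amount_usd": [
            2_000_000, 5_000_000, 40_000_000,      # venture
            10_000_000, 20_000_000, 300_000_000,   # private_equity
            100_000, 300_000, 2_000_000,           # seed
        ],
    }
)

# Mean and median raised amount per funding type.
summary = master.groupby("funding_round_type")["raised_amount_usd"].agg(
    ["mean", "median"]
)

# ABC's constraint: 5 to 15 million USD per round (bounds included).
low, high = 5_000_000, 15_000_000
in_range = summary[summary["median"].between(low, high)]

print(summary)
print(in_range.index.tolist())
```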

Now let’s compare the total amount invested across countries.

These are the top nine countries in terms of total investment amount for the venture type.

Among these nine countries, USA, IND and GBR are the top three English-speaking countries.
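A rough sketch of the country ranking follows. The amounts are invented, funding_round_type is an assumed column name, and the set of English-speaking country codes is a hand-written, incomplete assumption rather than part of the dataset.

```python
import pandas as pd

master = pd.DataFrame(
    {
        "funding_round_type": ["venture", "venture", "venture", "venture", "seed"],
        "country_code": ["USA", "CHN", "GBR", "IND", "USA"],
        "raised_amount_usd": [30_000_000, 20_000_000, 10_000_000,
                              5_000_000, 1_000_000],
    }
)

venture = master[master["funding_round_type"] == "venture"]

# Total venture funding per country, largest first, keeping up to nine.
top9 = (
    venture.groupby("country_code")["raised_amount_usd"]
    .sum()
    .sort_values(ascending=False)
    .head(9)
)

# Assumption: a manually maintained set of English-speaking countries.
english_speaking = {"USA", "GBR", "IND", "CAN"}
top3_english = [c for c in top9.index if c in english_speaking][:3]

print(top9)
print(top3_english)
```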

Now it’s time for our mapping file to come into play. It maps each sub-sector to its main sector.

We will merge the mapping file and the master file into one.
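The merge and a per-sector count might look like the sketch below. It assumes the mapping is available in a long format with one row per sub-sector and a column named main_sector, and that category_list holds a single sub-sector per row; the sector names and rows are illustrative only.

```python
import pandas as pd

master = pd.DataFrame(
    {
        "country_code": ["USA", "USA", "USA", "GBR"],
        "category_list": ["Software", "Software", "Games", "Software"],
    }
)

# Hypothetical long-format mapping: sub-sector -> main sector.
mapping = pd.DataFrame(
    {
        "category_list": ["Software", "Games"],
        "main_sector": ["Others", "Entertainment"],
    }
)

# Attach the main sector to every investment.
master = master.merge(mapping, how="left", on="category_list")

# Number of investments per country and main sector, largest first.
sector_counts = (
    master.groupby(["country_code", "main_sector"])
    .size()
    .sort_values(ascending=False)
)

print(sector_counts)
```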

Applying the same approach as above, we can see that the USA is the country to invest in, with Others as the sector to go for.


Here is the link to the project:

https://github.com/Rushil2311/Projects/tree/master/Data_analysis
