EDA for LinkedIn Data - USA

Samprithi Ravisanker
4 min read · Jun 14, 2023


The main reason for beginning any data analysis project is to extract information from raw data and turn it into something usable. Every business has one overarching goal: to grow its customer base and, with it, its revenue.

Collected data arrives in many forms: JSON, CSV, and Excel files, data stored on the many cloud services now at our disposal, and more. Analyzing data scattered across these formats is a tedious task. The stage where the data is collected, cleaned, and shaped into usable content is called Exploratory Data Analysis (EDA).

The dataset I have chosen is from Kaggle: LinkedIn Data Analyst jobs listings (https://www.kaggle.com/datasets/cedricaubin/linkedin-data-analyst-jobs-listings?select=linkedin-jobs-usa.csv). The data for the USA is stored in a CSV file, “linkedin-jobs-usa.csv”.

Step 1:

Download the CSV file to your local system.

Step 2:

Import the libraries and the data.
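A minimal sketch of this step, assuming the file sits in the working directory under its Kaggle name:

```python
import pandas as pd

# Load the LinkedIn USA job listings into a DataFrame
df = pd.read_csv("linkedin-jobs-usa.csv")

# Peek at the first few rows to confirm the load worked
print(df.head())
```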

Step 3:

Understand the components and structure of the data.

df.info() prints a summary of the DataFrame, including the column names, their data types, the non-null counts, and the memory usage. In this dataset, every column except salary has 2845 non-null values.

df.isnull().sum() shows that the salary column has 1916 null values.

df.describe() is helpful for getting statistical information about the data. For the text columns here it reports the number of values in each column, the number of unique values, the most frequent value, and how often that value appears.
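The three inspection calls from this step look like this (passing include="object" to describe() so the text columns are summarized):

```python
# Column names, dtypes, non-null counts, and memory usage
df.info()

# Count of missing values per column; here only salary has nulls
print(df.isnull().sum())

# For object (text) columns: count, unique, top, and freq
print(df.describe(include="object"))
```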

Step 4:

Data cleaning is done to fix the data types and to pull apart information that has been lumped together in single columns.

The first column we clean is salary; it has many null values and mixes different formats.

Here we fill the missing entries with NaN, split each salary range into a starting salary and a highest salary, and add those back to the main DataFrame. We then drop the original salary column, since its values have been extracted into the new columns.
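A sketch of the salary cleanup, assuming salaries are stored as range strings such as "$50,000 - $70,000"; the regex and the new column names starting_salary and highest_salary are illustrative:

```python
import numpy as np

# Make sure missing salaries are proper NaN values, not empty strings
df["salary"] = df["salary"].replace("", np.nan)

# Pull the dollar amounts out of range strings
# such as "$50,000 - $70,000" (format assumed for illustration)
amounts = df["salary"].str.findall(r"\$[\d,]+")

df["starting_salary"] = amounts.str[0]
df["highest_salary"] = amounts.str[-1]

# Drop the original column now that its values live elsewhere
df = df.drop(columns=["salary"])
```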

Next, we move on to the criteria column, which packs multiple values into a single string.

We split this into four different columns: Seniority level, Employment Type, Job Function, and Industry.
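One possible way to do the split, assuming each criteria entry is a stringified list of single-key dictionaries (a common shape for this kind of scrape); the parsing helper and the resulting column names are assumptions:

```python
import ast

def parse_criteria(raw):
    # Each row is assumed to hold a stringified list of single-key
    # dicts, e.g. "[{'Seniority level': 'Associate'}, ...]"
    merged = {}
    for item in ast.literal_eval(raw):
        merged.update(item)
    return merged

# Expand the packed strings into one column per criterion
criteria = df["criteria"].apply(parse_criteria).apply(pd.Series)

# Attach the new columns and drop the packed original
df = pd.concat([df, criteria], axis=1).drop(columns=["criteria"])
```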

Convert the date_added column from string format to datetime.
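A one-liner covers this, with errors="coerce" turning any unparseable dates into NaT:

```python
# Parse date strings into proper datetime64 values
df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
```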

Clean the location column to get the city names.
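Assuming locations follow the "City, State" pattern, the city is everything before the first comma; the city column name is illustrative:

```python
# Keep only the part before the first comma, e.g. "Austin, TX" -> "Austin"
df["city"] = df["location"].str.split(",").str[0].str.strip()
```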

Step 5:

The dataset is now clean and ready for further analysis.

This data can be visualized in Python or any visualization tool.
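For example, a quick matplotlib bar chart of the city column created earlier (the chart choice is illustrative):

```python
import matplotlib.pyplot as plt

# Bar chart of the ten cities with the most listings
top_cities = df["city"].value_counts().head(10)

top_cities.plot(kind="bar", figsize=(8, 4), title="Top 10 cities by job listings")
plt.ylabel("Number of listings")
plt.tight_layout()
plt.show()
```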

Here is a simple visualization of the cleaned data in Tableau. You can check it out at Tableau Dashboard.

The Jupyter notebook accompanying this post is at Linkedin_data_EDA.
