EDA for LinkedIn Data - USA

Samprithi Ravisanker
4 min read · Jun 14, 2023


The main reason for beginning any data analysis project is to extract information from raw data and turn it into something usable. Every business has one overarching goal: to grow its customer base and, with it, its revenue.

Collected data arrives in many forms: JSON, CSV, and Excel files, data stored on the many cloud services now at our disposal, and more. Analyzing data scattered across these formats is a tedious task. The stage where the data is collected, cleaned, and shaped into usable content is called Exploratory Data Analysis (EDA).

The dataset I have chosen is from Kaggle: LinkedIn Data Analyst jobs listings (https://www.kaggle.com/datasets/cedricaubin/linkedin-data-analyst-jobs-listings?select=linkedin-jobs-usa.csv). The data for the USA is stored in a CSV file, “linkedin-jobs-usa.csv”.

Step 1:

Download the CSV file to your local system.

Step 2:

Import the libraries and the data.
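A minimal sketch of this step, assuming the file sits in the working directory under its Kaggle name:

```python
import pandas as pd

# Load the LinkedIn USA job listings into a DataFrame
df = pd.read_csv("linkedin-jobs-usa.csv")

# Peek at the first few rows to confirm the load worked
print(df.head())
```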

Step 3:

Understand the components and structure of the data.

df.info() prints a summary of the DataFrame, including the column names, their data types, the non-null counts, and the memory usage. In this dataset, every column except salary has 2845 non-null values.

df.isnull().sum() shows that the salary column has 1916 null values.

df.describe() is helpful for getting statistical information about the data. For the text columns here it reports the number of values in each column, the number of unique values, the most frequent value, and how often that value appears.
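The three inspection calls from this step look like this (passing include="object" to describe() so the text columns are summarized):

```python
# Column names, dtypes, non-null counts, and memory usage
df.info()

# Count of missing values per column; here only salary has nulls
print(df.isnull().sum())

# For object (text) columns: count, unique, top, and freq
print(df.describe(include="object"))
```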

Step 4:

Data cleaning is done to fix the data types and to pull apart information that has been lumped together in single columns.

The first column we clean is salary; it has many null values and mixes different formats.

Here we fill the missing entries with NaN, split each salary range into a starting salary and a highest salary, and add those back to the main DataFrame. We then drop the original salary column, since its values have been extracted into the new columns.
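A sketch of the salary cleanup, assuming salaries are stored as range strings such as "$50,000 - $70,000"; the regex and the new column names starting_salary and highest_salary are illustrative:

```python
import numpy as np

# Make sure missing salaries are proper NaN values, not empty strings
df["salary"] = df["salary"].replace("", np.nan)

# Pull the dollar amounts out of range strings
# such as "$50,000 - $70,000" (format assumed for illustration)
amounts = df["salary"].str.findall(r"\$[\d,]+")

df["starting_salary"] = amounts.str[0]
df["highest_salary"] = amounts.str[-1]

# Drop the original column now that its values live elsewhere
df = df.drop(columns=["salary"])
```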

Next, we move on to the criteria column, which packs multiple values into a single string.

We split this into four different columns: Seniority level, Employment Type, Job Function, and Industry.
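One possible way to do the split, assuming each criteria entry is a stringified list of single-key dictionaries (a common shape for this kind of scrape); the parsing helper and the resulting column names are assumptions:

```python
import ast

def parse_criteria(raw):
    # Each row is assumed to hold a stringified list of single-key
    # dicts, e.g. "[{'Seniority level': 'Associate'}, ...]"
    merged = {}
    for item in ast.literal_eval(raw):
        merged.update(item)
    return merged

# Expand the packed strings into one column per criterion
criteria = df["criteria"].apply(parse_criteria).apply(pd.Series)

# Attach the new columns and drop the packed original
df = pd.concat([df, criteria], axis=1).drop(columns=["criteria"])
```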

Convert the date_added column from string format to datetime.
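A one-liner covers this, with errors="coerce" turning any unparseable dates into NaT:

```python
# Parse date strings into proper datetime64 values
df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
```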

Clean the location column to get the city names.
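Assuming locations follow the "City, State" pattern, the city is everything before the first comma; the city column name is illustrative:

```python
# Keep only the part before the first comma, e.g. "Austin, TX" -> "Austin"
df["city"] = df["location"].str.split(",").str[0].str.strip()
```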

Step 5:

The dataset is now clean and ready for further analysis.

This data can be visualized in Python or any visualization tool.
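For example, a quick matplotlib bar chart of the city column created earlier (the chart choice is illustrative):

```python
import matplotlib.pyplot as plt

# Bar chart of the ten cities with the most listings
top_cities = df["city"].value_counts().head(10)

top_cities.plot(kind="bar", figsize=(8, 4), title="Top 10 cities by job listings")
plt.ylabel("Number of listings")
plt.tight_layout()
plt.show()
```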

Here is a simple visualization of the cleaned data in Tableau. You can check it out at Tableau Dashboard.

The Jupyter notebook accompanying this post is at Linkedin_data_EDA.
