Data Scientist Job Hunting? A LinkedIn Data Analysis

Dante Caraballo
INST414: Data Science Techniques
5 min read · Feb 12, 2024

Motivations

As a prospective graduate, I often wonder what employers want from me before they will offer me a job. This information is useful to many stakeholders, but my main motivation for this analysis was my own job search: I am currently looking for a data science position after graduation. My goal was to provide useful insight to others in the same position and help them find data scientist roles.

Research Questions and Stakeholders

In my analysis of LinkedIn job listing data, I first wanted to determine whether geographic location influences the availability of data science roles and, if so, which locations have the most data science job postings. I also wanted to know which skills are most valued for mid-senior level data scientist roles.

These research questions could be of interest to HR managers, policymakers, career development professionals, or anyone looking for a role in data science like myself. For example, a career development professional might use the insight from my analysis to advise their clients on where to look for jobs and what skills they need to develop before advancing to senior positions.

Decisions Informed

Career Development Professionals: Tailor advice on which areas of the country to focus on in a job search and which skills the industry wants.

HR Managers: Shape recruitment strategies, including where to find new candidates and what skills to prioritize for senior positions.

Policy Makers: Understand employment trends to guide economic development strategies and educational programs.

Job Seekers: Point their job search and personal development towards areas with high demand, and understand what skills they need to progress in their careers.

The Data

To answer these questions I needed a large quantity of job listing data restricted to data science roles. I obtained this data from Kaggle; it originated from LinkedIn postings. The datasets came in the form of three Excel spreadsheets packed full of data. The main libraries I used were pandas, for manipulating the dataframes, and matplotlib, for the visualizations. For my analysis, I only needed the location, required skills, and level of each job posting. Where I used location data, I only included postings in the United States. This limited the scope of the data and ensured that I could complete the analysis.

Job_link: This was used to merge the required datasets. It was the key to merging because each table included this column.

Location: To determine geographic distribution and job availability.

Skills: To determine the skills needed for mid-senior roles in data science.

Level: To make distinctions between the skill level and seniority of the different listings.

Data Cleaning and Bugs

The first step of the data-cleaning process was merging the three datasets on their job_link columns to create a combined view. This step can introduce bugs if URLs are mismatched or if postings are missing from one of the datasets. I used a left join so that every posting in the left-hand table was kept, even if some of its fields were missing from the other tables.
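As a rough illustration of this merge step, here is a minimal sketch rather than the code from my analysis. It uses two tiny made-up tables instead of the three real spreadsheets, and the job_location column name and the example.com URLs are assumptions for the example. The indicator option is one way to spot postings that found no match in the other table.

```python
import pandas as pd

# Made-up miniature tables. Only job_link, job_skills, and job_level are named
# in the article; job_location and the URLs are assumptions.
postings = pd.DataFrame({
    "job_link": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "job_location": ["Austin, TX", "New York, NY", "Reston, VA"],
    "job_level": ["Mid senior", "Associate", "Mid senior"],
})
skills = pd.DataFrame({
    "job_link": ["https://example.com/a", "https://example.com/c"],
    "job_skills": ["Python, SQL", "Communication, Python"],
})

# A left join keeps every row of `postings`; skills are NaN where no URL matched.
merged = postings.merge(skills, on="job_link", how="left", indicator=True)

# Rows flagged "left_only" are postings with no entry in the skills table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["job_link", "job_level"]])
```

With three tables, the same call can be chained once more on job_link.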

The second step was handling the NaN values that the merge left behind. I checked for missing values in fields like job_skills and job_level, and opted to keep all rows to maximize the number of data points for the geographic analysis. I also converted job_skills into lists and made every skill lowercase so that the skills would be counted consistently. In this step, I used a function that also collapses redundant variants of the “communication” skill. Next, I filtered for “mid senior” positions and worked on a copy of the filtered rows to avoid pandas’ SettingWithCopyWarning.
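The cleaning described above might look roughly like the sketch below. The helper name normalize_skills, the comma-separated format of job_skills, and the exact rule for folding "communication" variants together are assumptions for illustration, not the exact function from my analysis.

```python
import pandas as pd

def normalize_skills(raw):
    """Split a comma-separated skills string into a lowercase list (assumed format)."""
    if pd.isna(raw):
        return []  # keep the row, but with no skills to count
    cleaned = []
    for part in str(raw).split(","):
        skill = part.strip().lower()
        if not skill:
            continue
        # Assumed rule: any variant mentioning "communication" becomes one label.
        if "communication" in skill:
            skill = "communication"
        if skill not in cleaned:
            cleaned.append(skill)
    return cleaned

df = pd.DataFrame({
    "job_level": ["Mid senior", "Associate", "Mid senior"],
    "job_skills": ["Python, SQL, Communication Skills", None, "communication, Python "],
})
df["job_skills"] = df["job_skills"].apply(normalize_skills)

# .copy() makes an independent frame, so adding a column later does not
# trigger SettingWithCopyWarning.
mid_senior = df[df["job_level"] == "Mid senior"].copy()
mid_senior["n_skills"] = mid_senior["job_skills"].apply(len)
```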

Cleaning the location data was the greatest challenge of this analysis, and I did not find a way to clean all of it. I managed to clean the US data, but locations outside the US proved much harder. Redundancies, such as the same state being written in several different ways, were very common.
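One possible way to reconcile the different spellings of US states is sketched below. It assumes locations look like "City, State" or "City, State, Country", which may not hold for every row in the dataset, and the lookup table is deliberately truncated to the six states named later in this post. The function name extract_state is hypothetical.

```python
# Truncated lookup for illustration; a full version would list every state.
STATE_NAMES = {
    "california": "CA",
    "texas": "TX",
    "virginia": "VA",
    "new york": "NY",
    "illinois": "IL",
    "new jersey": "NJ",
}
STATE_CODES = set(STATE_NAMES.values())

def extract_state(location):
    """Return a two-letter state code, or None when no known US state is found."""
    if not isinstance(location, str):
        return None
    parts = [p.strip() for p in location.split(",") if p.strip()]
    # Scan from the end, so "New York, NY" resolves through "NY" first.
    for part in reversed(parts):
        if part.upper() in STATE_CODES:
            return part.upper()
        if part.lower() in STATE_NAMES:
            return STATE_NAMES[part.lower()]
    return None

for loc in ["San Francisco, CA", "Austin, Texas, United States", "London, England, United Kingdom"]:
    print(loc, "->", extract_state(loc))
```

Rows that come back as None can then be set aside as non-US or unrecognized, which matches how I limited the geographic analysis to the United States.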

Visualizations

This visualization shows the top 10 skills desired in “mid senior” level data science positions across the whole dataset. These insights are useful for determining which skills are essential for growth in data science roles. I found it interesting that communication skills were more in demand than data analysis for roles like this, which speaks to the importance of being an effective communicator in the workplace. I also thought it was insightful that proficiency in Python and SQL ranks among the most sought-after skills in the field.
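The counts behind a chart like this can be computed in a couple of lines. This is a sketch on made-up rows, assuming job_skills already holds cleaned lists of lowercase skills; it is not the exact code from my analysis.

```python
import pandas as pd

# Made-up rows; job_skills is assumed to contain cleaned, lowercase lists.
mid_senior = pd.DataFrame({
    "job_skills": [
        ["python", "sql", "communication"],
        ["python", "communication"],
        ["python", "data analysis"],
        [],  # a posting with no listed skills
    ]
})

# explode() gives each skill its own row; empty lists become NaN and are dropped.
skill_counts = mid_senior["job_skills"].explode().dropna().value_counts()
top_skills = skill_counts.head(10)
print(top_skills)
```

The resulting Series can be handed to matplotlib, for example with top_skills.plot(kind="barh"), to draw the bar chart.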

This next visualization shows the distribution of job postings across the United States. It is useful for determining which areas of the U.S. are hotspots for data science employment, and it could help college students like me who are looking for jobs. The hotspots are mostly located in California, Texas, Virginia, New York, Illinois, and New Jersey. Beyond these states, the number of postings per state drops off.

Limitations

A potential limitation of this data is that it was collected solely from LinkedIn, so the analysis reflects only one source of job listings. If other sources were included, the distributions might change, especially the distribution of job listings by state. The best way to enhance this data and produce a more accurate analysis would be to include job listings from other sites like Indeed.

Lastly, my analysis was limited by time. With more time, I would have been able to include a visualization of job listings across all of the locations in the dataset. The location data contained many redundancies: locations outside the United States appeared in at least three to four variations each, which made that data difficult to clean and analyze. Overall, my analysis answered the questions I set out to answer, but in future work I would attempt to clean the non-US location data.

Click Here to View the Analysis
