How One’s Degree Could Influence Their Future

Michael Campbell
INST414: Data Science Techniques
May 18, 2022

By Michael Campbell and Michael Kelley

Work Breakdown

Problem Brainstorming — Michael Kelley

Data Collection — Michael Campbell

Data Cleaning — Michael Kelley

Data Visualization — Michael Campbell

Problem Statement

Have you ever wondered what jobs are obtainable given your major or skills? If you’ve ever asked yourself this question, chances are you’ve looked it up before. Depending on your major, you may find a website with a list of job titles, but those titles may not apply directly to your college, since not all colleges teach the same things. Another option is to consult your college’s website, which usually lists job titles, but those lists can be outdated or miss variations of job titles, since some companies rename positions. We thought it would be useful for college students to have a place where they could see all the variations of job titles available to them based on their major. We also thought the best way to build a dataset for this analysis would be to collect data from alumni, so that the results could be specific to a university. Given that each college offers different specializations, it would also be useful to find jobs based on skills, since not everyone takes the same classes.

Background

We decided to do this project because our major (Information Science) is one of the newer majors at the university, and there are a few different pathways that people take once they graduate. The college’s advertised specializations are Data Science, Cybersecurity and Privacy, and Digital Curation, but there are other pathways students can take. Popular pathways not really advertised by the college include UI/UX and Software Development/Web Development, among a few others. Given that we are pretty close to graduation, we thought it would be useful to find all the jobs accessible to us and to help people just starting out decide which pathway they want to work towards.

Data Collection Decision Making

From the start of this project, we knew that we wanted to scrape user information from LinkedIn, but we didn’t know specifically how much we were going to need. In general, we wanted to gather experience, education, and skills. Each of these sections has a lot of fields we could possibly take from it. For example, the experience tab contains the company name, position name, position type, area, and description. For this project, I thought it would be sufficient to take just the company name and job title. The other information could also be useful for filtering the data, but it would take some work to clean.

Image of experience section on LinkedIn

The education section was easy since there wasn’t much to take. We chose to collect the degrees a person has completed. However, we could have also taken the completion date for additional filtering; the biggest reason we decided against this is that a lot of people don’t add one.

Image of education section on LinkedIn

Finally, for the skills section, we collected the skills, but we could have also used endorsements as a way to filter the data.

Image of skills Section on LinkedIn

Finding Users

Before we could start scraping data, we first had to decide whose profiles we were going to scrape. After weighing the different options, we decided to target the people recommended in the alumni section of the University of Maryland page. The main reason for this is that it gives you up to 1,000 results as long as you keep scrolling down; other methods only load 15 people at a time and require you to click a button to load more. The only downside is that you are more likely to be shown connections and people that you know.

Image of the Alumni Section of The University of Maryland

The code for scraping the profile links was pretty simple. I used Selenium to load the page and scroll to the bottom. Once the page was fully loaded, I grabbed the HTML and used BeautifulSoup to find all of the links starting with ‘/in/’, because all the profile links look like this: ‘/in/firstName-LastName-a5444b195/’.

import time
from bs4 import BeautifulSoup

# Scroll to the bottom of the page repeatedly so lazy-loaded results appear
for i in range(1, 150):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(4)

# Save the HTML
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

# Save all the profile links in a list
anchors = soup.find_all('a')
links = []
for a in anchors:
    href = a.get('href')
    if href and href.startswith('/in/') and href not in links:
        links.append(href)

Scraping profiles

The process for scraping profiles was also pretty easy, but there are a few things you must take into account. First, LinkedIn uses lazy loading, so you need to scroll the whole page to make sure that everything is actually in your HTML file. The scrolling code from above can be reused with a smaller loop range. Second, the skills and experience sections might require clicking through to another page to reveal all of the entries. This can be handled by checking whether such a button exists in the section, as in the sketch below.

Example of skills that leads to another page
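
A minimal sketch of that check, assuming Selenium 4; the link-text selector is a hypothetical stand-in, since LinkedIn’s actual markup varies:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Look for a "Show all ... skills" style link; this selector is an assumption
try:
    see_all = browser.find_element(By.PARTIAL_LINK_TEXT, 'skills')
    browser.get(see_all.get_attribute('href'))  # load the dedicated skills page
except NoSuchElementException:
    pass  # no button, so everything is already on the main profile page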

Third, you have to account for cases where someone has had multiple jobs at one company, since this produces differently structured text.

Example of someone who has had multiple jobs at one company

Once you’ve taken those things into consideration, you can start scraping the data. LinkedIn has containers for each section, so it is relatively easy to find all the content. LinkedIn also uses the Ember web framework, so class names are long, generated combinations describing the type of element. For example, the class name for the profile sections is “artdeco-card ember-view break-words pb3 mt2”. If we look up that class name in the browser’s devtools, we can see all the different sections.

Gif Showcasing sections

Once I knew how I was going to find the data, I iterated through the list of profile links, using Selenium to load the whole page and save the HTML. I then loaded each file into BeautifulSoup and found each section using the class name above. The sections contain bold headings, which is what I used to differentiate them. If a section did not exist, I would add a ‘NaN’ placeholder so that I could easily remove the row in the data cleaning phase. From there, I would grab the data I needed and append it to lists, which were ultimately put into a dictionary. The dictionary was then stored in a CSV file. An important thing to note is that you might need to play around with the sleep timer to find a value you are comfortable with, to avoid being flagged as a bot. We had a sleep timer every time we loaded a new page and after each profile scraped. LinkedIn also limits the number of profiles you can view in a day; I tried not to go over 250 on any given day.
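
Putting those pieces together, a simplified sketch of the per-profile loop might look like this. The section class name comes from the devtools screenshot above, but the heading lookups and column names are assumptions, since LinkedIn’s markup changes often:

import time
import pandas as pd
from bs4 import BeautifulSoup

SECTION_CLASS = "artdeco-card ember-view break-words pb3 mt2"
records = {'major': [], 'titles': [], 'skills': []}

for link in links:
    browser.get('https://www.linkedin.com' + link)
    time.sleep(5)  # let the page load; tune this to avoid bot detection

    # Scroll so lazy-loaded sections render before saving the HTML
    for _ in range(10):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

    soup = BeautifulSoup(browser.page_source, 'html.parser')

    # Label each section card by its bold heading (e.g. "Experience");
    # the heading tags here are assumptions
    cards = {}
    for card in soup.find_all('div', class_=SECTION_CLASS):
        heading = card.find('b') or card.find('h2')
        if heading:
            cards[heading.get_text(strip=True)] = card

    # Pull the text we want from each section, or 'NaN' if it is missing
    for name, column in [('Education', 'major'), ('Experience', 'titles'),
                         ('Skills', 'skills')]:
        card = cards.get(name)
        if card is None:
            records[column].append(['NaN'])
        else:
            records[column].append(
                [b.get_text(strip=True) for b in card.find_all('b')])

    time.sleep(10)  # pause between profiles to stay under the daily limit

# Store the dictionary as a CSV, one row per profile
pd.DataFrame(records).to_csv('profiles.csv', index=False)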

Data Cleaning

The final product of the data collection phase would look something like this:

Image of the dictionary from the data collection phase

During the data cleaning phase, there were some major decisions we had to make about how to proceed with the project. A few people had no experience, no skills, or no major, so we needed to choose whether to keep them in the dataset or remove them entirely. In the end, we decided to remove these people completely, because the overall goal of the project is to show what skills or majors can help you get certain job titles, which makes having all three sections essential. In the following code block, I remove the rows that are missing one of the sections and then convert the sections from strings back into lists, since they were converted to strings when the dictionary was stored as a CSV.

from ast import literal_eval

# Remove rows where a section was missing ('NaN' placeholder from scraping)
df = df[df['skills'] != "['NaN']"]
df = df[df['titles'] != "['NaN']"]
df = df[df['major'] != "['NaN']"]

# Remove rows where a section was scraped as just a number
df = df[df.skills.apply(lambda x: not x.isnumeric())]
df = df[df.titles.apply(lambda x: not x.isnumeric())]
df = df[df.major.apply(lambda x: not x.isnumeric())]

# Convert string versions of lists back into real lists
df['skills'] = df.skills.apply(literal_eval)
df['titles'] = df.titles.apply(literal_eval)
df['major'] = df.major.apply(literal_eval)

# Reset the row counter
df = df.reset_index(drop=True)

After removing these rows, the data was almost clean enough to run our analysis. We still needed to make sure there were no duplicates and decide how to deal with people who had more than one degree. We opted to use just the bachelor’s degree, since that was the degree held by our target audience; a sketch of this step follows the image below. The final product after data cleaning looks something like this:

Image showcasing data after cleaning
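
A rough sketch of that last cleanup step, assuming the column names from the earlier code and that a bachelor’s degree string contains the word “Bachelor” (both assumptions):

# Drop duplicate profiles; compare on the string form since lists are unhashable
df = df[~df.astype(str).duplicated()]

def bachelors_only(degrees):
    # Keep the bachelor's degree if one is listed, otherwise the first degree
    matches = [d for d in degrees if 'Bachelor' in d]
    return matches[0] if matches else degrees[0]

df['major'] = df['major'].apply(bachelors_only)
df = df.reset_index(drop=True)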

Graph Analysis

The first analysis method we used was graphs. We figured graphs would be one of the best ways to visualize this data given its size. Looking at the data, we have two independent variables, the skills and the major, with the dependent variable being the job titles. Given this, we thought it would be good to have two different graphs. To create the graph structures, we used the NetworkX library. The code was pretty simple: we created a list of unique majors, then iterated through the dataset, adding an edge wherever a major was found. Once that was done, the graph was exported as a GEXF file. The skills graph follows the same method, with the unique majors list swapped for the unique skills list.

import networkx as nx

# Connect each major to every job title held by someone with that major
major_graph = nx.Graph()
for major in unique_majors:
    for index, row in df.iterrows():
        if row['major'] == major:
            for cur_title in row['titles']:
                major_graph.add_edge(major, cur_title)

# Export for visualization in Gephi
nx.write_gexf(major_graph, "majors.gexf")

Once we have those two files, we can import them into Gephi to properly visualize them. In Gephi, we start by changing the color and size of the nodes based on degree, then visualize the result with the Fruchterman-Reingold layout. This gives us something that looks like this:

Jobs based on skills Graph
Jobs based on major

This could then easily be converted into a website so that a student could look up their major or skills to find jobs that they have the qualifications for. Here is an example using the skills graph.

Jobs based on skills (mcampb.me)

Job connections to the skill solid works

The advantage of using a graph is that it makes the connections in a large dataset very easy to see. Companies tend to have different names for similar positions, so this would definitely help a student learn about new positions. There is also the possibility that someone may find a job they didn’t know was possible with their major.

Similarity Analysis

Next, we focused on a similarity analysis of the data. Specifically, we wanted to find people with similar skills to see what jobs they hold. This would allow someone to find jobs that might not be the typical path for their major but that they still have the skills to qualify for. To start off, we created a sparse matrix from the skills that looked like this:

image of the sparse matrix
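
A minimal sketch of building such a skill matrix with pandas, assuming df['skills'] holds a list of skill strings per person (the exact construction method is an assumption):

import pandas as pd

# One row per person, one column per skill: 1 if they list it, else 0
skill_matrix = pd.DataFrame(
    [{skill: 1 for skill in skills} for skills in df['skills']]
).fillna(0).astype(int)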

From here, we converted the matrix into a NumPy array to be able to calculate distances. For our test, we decided to use the computer science major in row one, so we subset the data frame to just that row and converted it into a NumPy array. I then used the spatial distance module from SciPy to calculate the cosine distance. After running that, we found a mechanical engineering student who had 39 skills in common. The jobs the mechanical engineering major had held were:

‘Manufacturing Operations Supervisor’, ‘Undergraduate Research Assistant’, ‘Associate Supply Chain Engineer’, ‘Corporate Relations Board Chair’, ‘Conference Planning Chair’, ‘Mechanical Engineering Intern, Global Engineering Service Center’, ‘How Stuff Works Teaching Assistant’, ‘Program Manager’, ‘Intern for Cellulose Esters and Specialty Plastics Department’, ‘NPI OLM Program Manager’, ‘Mechanical Engineering Intern’

Knowing this, the computer science major may be able to find a job within this list that interests them and for which they already have the skills.
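
For reference, here is a minimal sketch of this nearest-neighbor lookup with SciPy, using the skill_matrix sketched above; taking row 0 as the computer science major is an assumption:

import numpy as np
from scipy.spatial import distance

vectors = skill_matrix.to_numpy()
target = vectors[0]  # the computer science major's skill vector

# Cosine distance to every person; smaller means more skills in common
dists = np.array([distance.cosine(target, v) for v in vectors])
closest = int(np.argsort(dists)[1])  # index 0 is the target itself
print(df.loc[closest, 'titles'])  # job titles of the most similar person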

Limitations

While this project provides ample data analysis on the matter of academic and vocational backgrounds, there are still some limitations to how useful it can be. First of all, we only focused on a relatively small subset of possible majors. Computer Science, Information Science, and related fields could benefit significantly from the findings of this data, but a more inclusive study would have to be done in order to produce an analysis that could benefit people with other skill sets.

Additionally, while the data that we have collected is extensive, the job titles posted online may not tell the whole story. It is likely not an extremely prevalent issue for the purpose of this analysis, but there are many professions with skills and typical work days that are not obvious from the job title alone. This analysis based on job titles could conceivably have missed out on some of the details of jobs and what skills would actually be required for them. While this wouldn’t stop us from achieving our goal of finding correlations between majors and job titles, it is worth noting that our findings may not provide the full picture regarding what professional skills one would specifically use after obtaining their degree.

On top of that, the analysis of the data does not take into account the extent to which major and job title are correlated, and the extent to which a causal relationship exists. While it is true that one’s academic background can greatly influence the career they are able to get, the reality is that people often have a general idea of what they want to do for a living when deciding their major, so one would already expect a strong correlation between certain majors and certain professional fields. A future analysis could attempt to examine research subjects to determine the extent to which one’s academic background directly influences their career.

Conclusion

Overall, this was a very productive analysis of the relationship between one’s academic background and their eventual job title. We found a strong relationship between the observed majors and several professional fields. With the results of this study, students and academic institutions, particularly in Information Science, Computer Science, and related fields, can have a better understanding of the careers that graduates from these majors can expect. The findings of this project are presented both numerically and in the form of visual graphs. While some shortcomings remain, we are hopeful not only that what is provided will be of great use, but that similar work can be done in the future to expand our findings and aid students and educators from a wide variety of fields of study.

Source Code

mac5617/JobFinder (github.com)
