Playing with Free Code Camp Data Pt 2

Published in

The Data Logs

6 min read · Sep 24, 2018

Getting deep into the data

Heyo everyone! Happy that you’re back. Today we’re finishing this project. We’re analyzing the people who are interested in the data scientist and data engineer positions from the 2016 Free Code Camp survey to build a customer persona. We started the EDA process by finding out the age range of our participants. Let’s figure out how many already went to tech bootcamps.

table(DSInt_df$AttendedBootcamp)
  0   1
622  13

Don’t be alarmed by the 0 and 1. In R, 1 is treated as TRUE and 0 as FALSE. I originally went through a super sleuth method of figuring this out before I actually googled it. You live and you learn. For convenience, this is something that’s easily fixed:

DSInt_df$AttendedBootcamp <- factor(DSInt_df$AttendedBootcamp, levels = c(1, 0), labels = c('Yes','No'))

table(DSInt_df$AttendedBootcamp)
Yes  No 
 13 622

DSInt_df %>%
    ggplot(aes(AttendedBootcamp)) +
    geom_bar(fill = "green")

It is clear that the vast majority of our prospects have never gone to a bootcamp. This eliminates further questions on this topic, such as whether they graduated, whether they got a job afterwards, how much they make now, etc. Sometimes these things happen. You just have to make the most of what you have. And with that we move on to more questions.

Are you a software developer?

It is very common for someone in a technical role to transition to another technical role with a similar skill set. Considering that a data engineer has a lot of the skills a developer would have, it wouldn’t be a surprise if software development was usually the position they held previously. Let’s take a look.

DSInt_df$IsSoftwareDev <- factor(DSInt_df$IsSoftwareDev, levels = c(1, 0), labels = c('Yes','No'))

table(DSInt_df$IsSoftwareDev)
Yes  No
  0   0

Huh, that’s odd. While not impossible, I highly doubt that there wasn’t at least 1 software developer within the group of prospects. Let’s see if we can find the cause of this:

summary(DSInt_df$IsSoftwareDev)
 Yes   No NA's 
   0    0  646

646 NAs. That would mean our prospects left this question blank for one reason or another. Disappointing, but we continue to move forward.
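
The all-zero table above happens because `table()` drops missing values by default. Here is a minimal sketch of how to surface them right away, using a made-up vector (these values are hypothetical, not the survey data):

```r
# Hypothetical 0/1 answers with missing values; NOT the survey data
answers <- c(1, 0, NA, 0, NA, 1, 0)

# Same recoding pattern as above: 1 becomes "Yes", 0 becomes "No"
recoded <- factor(answers, levels = c(1, 0), labels = c("Yes", "No"))

# Default table() silently drops the NAs
print(table(recoded))

# useNA = "ifany" adds an NA column whenever missing values exist
counts <- table(recoded, useNA = "ifany")
print(counts)

# Count the missing answers directly
n_missing <- sum(is.na(recoded))
print(n_missing)
```

Running the `useNA = "ifany"` version first would show straight away whether a column is empty or just lopsided.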

While I’ve already answered my initial 4 questions, I have a few more things I’m interested in looking into. For example:

How big of a city do our prospects live in?

DSInt_df %>%
  ggplot(aes(CityPopulation)) +
  geom_bar(fill = "magenta")

It goes without saying that most prospects live within large cities. The largest group lives in cities with populations bigger than a million: 229 to be exact. For a marketing push this is helpful in deciding where to focus. But I feel like there’s more we can get from this. Of the folks living within these areas, what is their employment status?
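
If you want the exact number behind a bar instead of eyeballing the plot, a sorted `table()` gets you there. A small sketch on made-up answers (the category labels below are assumptions for illustration, not copied from the survey):

```r
# Hypothetical city-size answers; labels and values are made up
city <- c("more than 1 million", "less than 100,000", "more than 1 million",
          "between 100,000 and 1 million", NA, "more than 1 million")

# Counts per category, largest first (NAs are dropped by default)
city_counts <- sort(table(city), decreasing = TRUE)
print(city_counts)

# Name and size of the biggest group
top_group <- names(city_counts)[1]
top_n <- city_counts[[1]]
print(paste(top_group, top_n))
```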

DSInt_df %>%
  ggplot(aes(CityPopulation, fill = EmploymentStatus)) +
  geom_bar()

This is the same plot as the previous one except each bar is now filled with colors that reflect their employment status. This is a neat way for you to see and compare where everybody stands. You can see that most people here are currently working but probably looking for work.

DSInt_df %>%
  ggplot(aes(CityPopulation, fill = EmploymentStatus)) +
  geom_bar(position = "dodge")

By dodging the bars, you can view each employment status side by side within each city population group. This way you can see that employed individuals are the biggest group in every city population group, regardless of city size. They are followed by those who are unemployed but looking.
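
The same comparison can be read off as numbers with a two-way table. This is a rough sketch on a made-up data frame: the column names follow the survey, but the rows and labels are hypothetical.

```r
# Hypothetical rows; column names follow the survey, values are made up
toy_status <- data.frame(
  CityPopulation = c("more than 1 million", "more than 1 million",
                     "less than 100,000", "less than 100,000",
                     "more than 1 million"),
  EmploymentStatus = c("Employed for wages", "Not working but looking for work",
                       "Employed for wages", "Employed for wages",
                       "Employed for wages")
)

# Rows are employment statuses, columns are city-size groups
status_by_city <- table(toy_status$EmploymentStatus, toy_status$CityPopulation)
print(status_by_city)

# Share of each status within a city-size group (each column sums to 1)
shares <- prop.table(status_by_city, margin = 2)
print(round(shares, 2))
```

The counts mirror the dodged bars, while the shares make groups of very different sizes easier to compare.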

All in all, there are 267 people who are employed and 137 people who are jobless but looking. It would have been really interesting to see the field in which the individuals were employed on top of that, but alas, over 400 observations were missing. Seeing a running theme here? Now, there are a multitude of actions that can be taken for handling missing data. In every case the fact remains that it changes the story and/or the truth of the data, so you have to be careful and aware of whatever trade-off you might be making. For this project, I didn’t want to tell a story that wasn’t there, so I’m letting you all know what is and isn’t available. My GitHub shows more of my code where I tried to look into things and discovered decent chunks of missing data.
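
A quick way to see how much is missing before committing to a question is to count NAs per column. Here is a minimal sketch on a made-up data frame (the `EmploymentField` column name is an assumption, and the values are hypothetical):

```r
# Hypothetical data frame; EmploymentField is an assumed column name
toy_missing <- data.frame(
  EmploymentStatus = c("Employed for wages", NA, "Employed for wages",
                       "Not working but looking for work"),
  EmploymentField = c(NA, NA, "software development", NA)
)

# Number and share of missing values in each column
na_counts <- colSums(is.na(toy_missing))
na_share <- colMeans(is.na(toy_missing))
print(data.frame(na_counts, na_share))
```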

We’re in the final stretch now with a few questions left.

What’s the education level of our demographic?

DSInt_df %>%
  ggplot(aes(SchoolDegree)) +
  geom_bar(fill = "Turquoise") +
  coord_flip()

So at the time, the majority of all prospects had a bachelor’s degree. It would be interesting to see what this looks like now and have it matched up against the level of education different companies ask for. Maybe a future project?

Now, how much are our prospects expecting to make with data?

ggplot(DSInt_df, aes(x = ExpectedEarning)) +
  geom_histogram(fill = "dark green")

The data skews to the right, which is quite interesting considering that a quick search on Glassdoor shows the base pay for data scientists this year is $120,931. The average in this data is about half that, and the highest counts appear to hover around 60k. I’m curious as to how much our prospects researched these salaries, or if they all just guessed. The minimum is actually $6,000 and the max is $200,000 (which is possible if you work in an industry like banking).
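
The numbers quoted above (minimum, maximum, average) can be pulled with `summary()`, and comparing the mean to the median is a rough check on right skew. A sketch on made-up salaries (hypothetical values, not the survey data):

```r
# Hypothetical expected salaries in USD; NOT the survey data
expected <- c(6000, 40000, 55000, 60000, 60000, 65000, 80000, 120000, 200000, NA)

# Min, quartiles, mean, and max; NAs are reported separately
print(summary(expected))

# In a right-skewed distribution the mean usually sits above the median
mean_pay <- mean(expected, na.rm = TRUE)
median_pay <- median(expected, na.rm = TRUE)
print(mean_pay > median_pay)
```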

Considering that there is a relationship between educational level and wages, it’d be interesting to look at what people expect to make based on their degree:

DSInt_df %>%
  mutate(SchoolDegree = fct_infreq(SchoolDegree)) %>%
  ggplot(aes(SchoolDegree, ExpectedEarning)) +
  geom_boxplot() +
  coord_flip()

This is a great way to combine the data we’ve seen previously to give a bit more context. For example, the minimum for every education group is less than $50,000. The maximum for each group, with the exception of those who have studied trades, is over $100,000. Bachelor’s degree holders have the highest max at about $125,000. PhDs expect a higher median pay than all groups excluding the small number of non-GED holders, but have a smaller maximum than every group except those same non-GED holders and associate’s degree holders. It’s interesting to see that the few people with no high school diploma/GED actually expect to make more than people who have trade skills.
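
To put numbers on what the boxplot shows, you can compute the minimum, median, and maximum per degree. This is a rough sketch on a made-up data frame: the column names follow the survey, but the degree labels and salaries are hypothetical.

```r
# Hypothetical degrees and expected salaries; NOT the survey data
toy_pay <- data.frame(
  SchoolDegree = c("bachelor's degree", "bachelor's degree", "Ph.D.",
                   "Ph.D.", "trade, technical, or vocational training",
                   "bachelor's degree"),
  ExpectedEarning = c(40000, 125000, 70000, 90000, 45000, 60000)
)

# Split earnings by degree, then summarise each group
by_degree <- split(toy_pay$ExpectedEarning, toy_pay$SchoolDegree)
pay_stats <- t(sapply(by_degree, function(x) {
  c(min = min(x), median = median(x), max = max(x))
}))

# One row per degree, one column per statistic
print(pay_stats)
```

Note that these are raw minimums and maximums, while boxplot whiskers stop short of any points drawn as outliers, so the two can differ.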

So with that we’ll create 2 basic marketing personas based on our data. We’ll call them Frank and Jane.

Frank is a 24-year-old man who recently graduated from his university with a bachelor’s degree in computer science and is currently working. Frank is hoping to transition into a data science role in the future and hopes to make $65,000 as a starting salary.

Jane is a 26-year-old woman who recently graduated with a PhD in statistics and is currently employed. She, however, would like to explore her options and is curious about a machine learning role. She is hoping for a starting salary of about $80,000.

Thanks for sticking around for this analysis. The challenge of doing what you can despite missing data, while tedious, can be fun. I may compare this data to the 2017 survey to see if the story or personas have changed at all. If you’re interested, David Venturi did his own analysis on this subset of data, as well as two other articles based on other factors in the FreeCodeCamp data. I suggest you take a look!

Written by Kerry Benjamin