Playing with Free Code Camp Data Pt 1
Analyzing data from 2 awesome groups
Hello everyone. I hope you’re doing well. Today we are going to explore data from Free Code Camp: specifically, the data from the survey they ran a few months ago with CodeNewbie. They were nice enough to release it to the public for us to visualize and analyze. As a big fan, I was super excited to start diving in!
For those of you who don’t know, Free Code Camp is an online program where you can start learning HTML and CSS all the way through making web apps with various JavaScript libraries. Towards the end of the program, you’ll use your new skills to help a non-profit organization. How cool is that?
CodeNewbie is quite simply one of the most supportive online communities on the web for programmers of all skill levels, beginners in particular. They have an amazing weekly podcast on a variety of topics as well as a really fun Twitter chat every Wednesday at 9 pm Eastern. Definitely check them out. I love them (in a totally platonic way).
This data set also doubles as Activity 3 for the Data Science Learning Club. The task is to ask business questions that the data will be able to answer. So I will pose a few of them throughout this post and see if there are suggestions I can provide based on my marketing and business background. Perhaps you may even have questions of your own?
Ok, let’s get things started. If you want to follow along, here’s a link to where the data is hosted on Kaggle. This is in the form of a CSV file so we don’t need a package for this. We can just use the read.csv function.
FCC_df <- read.csv("~/Data/2016-FCC-New-Coders-Survey-Data.csv", header = T)
We now have a new data frame. In this function we tell read.csv where the data is found (in this case, the Data folder on my computer) as well as the name of the file. The header argument tells R whether the first row of the file contains the column names (TRUE, or T for short) or whether it should generate generic ones instead (FALSE/F).
Let’s have a look see at what we got:
dim(FCC_df)
15620 113
There are 15,620 responses (rows) across 113 variables (columns). I used dim instead of str because with 113 variables, str would print a wall of output.
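To see the difference between the two, here’s a quick sketch on a tiny, made-up data frame (the columns here are hypothetical, not the real survey):

```r
# a tiny stand-in data frame (hypothetical columns, not the real survey data)
toy_df <- data.frame(Age = c(25, 30, 22), Attended = c(TRUE, FALSE, TRUE))

dim(toy_df)  # just rows then columns: 3 2
str(toy_df)  # the type and first values of every column; with 113 columns this gets long
```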
I’ve been thinking about how I can make this analysis relevant to me and still fulfill the Activity 3 challenge. After looking over the survey responses for jobs that respondents are interested in, I noticed that there were some for data scientists and data engineers. Eureka! I want to know about these people. So we’re subsetting this data frame and making this project about them. We’ll have a little help from some packages:
library(tidyverse)
library(stringr)
For budding R users, it is always a good habit to load the packages you’ll need towards the beginning of a project. You’ll see it when this is up on GitHub. Tidyverse is a series of packages combined into one that makes EDA convenient. Instead of loading ggplot2 on one line and dplyr on another, tidyverse loads its core packages all at once. Stringr is a package that helps with operations involving strings. R is a bit fussy when it comes to white space, so a package like this will save you a headache. So first, we’re going to tackle the white space problem.
FCC_df$JobRoleInterest<- str_trim(FCC_df$JobRoleInterest)
str_trim is a function in stringr that removes whitespace from the beginning and end of a string. Now to subset FCC_df to get all our hopeful data scientists and engineers:
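Here’s a quick sketch of what str_trim actually does:

```r
library(stringr)

# str_trim() drops leading and trailing whitespace; internal spaces are kept
str_trim("  Data Scientist / Data Engineer  ")

# the side argument trims one end only: "left", "right", or "both" (the default)
str_trim("  Data Scientist", side = "left")
```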
DSInt_df <- subset(FCC_df, JobRoleInterest == "Data Scientist / Data Engineer")
subset is a function in base R that works like dplyr’s filter function. Let’s see what we got now:
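To show the two approaches side by side, here’s a sketch on a toy data frame standing in for FCC_df (the values are made up for illustration):

```r
library(dplyr)

# a toy frame standing in for FCC_df (made-up values for illustration)
toy_df <- data.frame(
  JobRoleInterest = c("Data Scientist / Data Engineer",
                      "Front-End Web Developer",
                      "Data Scientist / Data Engineer")
)

# base R subset() and dplyr's filter() keep the same rows here
base_way  <- subset(toy_df, JobRoleInterest == "Data Scientist / Data Engineer")
dplyr_way <- filter(toy_df, JobRoleInterest == "Data Scientist / Data Engineer")

nrow(base_way)  # 2
```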
dim(DSInt_df)
646 113
Now we know that 646 people have an interest in becoming a data scientist or data engineer. That’s about 4% of the original respondents.
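If you want the exact share rather than eyeballing it, the arithmetic is a one-liner (with the real data frames you’d use nrow(DSInt_df) / nrow(FCC_df) instead of the literal counts):

```r
# share of respondents interested in data science / data engineering,
# using the counts reported above: 646 out of 15,620
round(646 / 15620 * 100, 1)  # 4.1
```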
Questions and Visualizations
Now we’re on to the EDA portion of the project. There are a few things I want to know:
- How many respondents already went to tech bootcamps?
- How many respondents currently work as software engineers?
- Where are most of the prospects located?
- How old are they?
These questions will no doubt lead to other questions and more things to explore. What fun!
Heads up: if you view this dataset, you’ll see a few responses listed as ‘NA’. This basically means that data is missing or unavailable. In the case of this survey, it simply means that some people didn’t answer particular questions. In some cases, NA won’t affect your analysis much or at all. However, there are functions and options you can use to take care of them.
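A quick sketch of how NAs behave, on a small made-up vector standing in for a survey column:

```r
# a small vector with missing values, standing in for a survey column
ages <- c(25, 22, NA, 30, NA)

sum(is.na(ages))          # how many values are missing: 2
mean(ages)                # NA, because NAs propagate by default
mean(ages, na.rm = TRUE)  # drop the NAs first, then average the rest
```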
Let’s start with age:
ggplot(data = DSInt_df, mapping = aes(x = Age)) + geom_histogram(binwidth = 1.0, fill = "dark blue")
This histogram skews heavily to the right, which means most of this group is pretty young, with the highest concentration in the mid-twenties. In this plotting function, as in a few others, ggplot2 is nice enough to leave NAs out of the graph so they don’t interfere. Let’s dig a little deeper into the ages of our respondents:
summary(DSInt_df$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
14.00 22.00 26.00 27.72 31.25 65.00 74
The youngest person is 14 and the oldest is 65. You go y’all! The median age is 26. And there are a total of 74 people who didn’t tell us their age. Boo. If we wanted a count of exactly how many people gave each age, we can use the table function:
table(DSInt_df$Age)
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
 1  2  9  9 18 32 28 24 41 40 37 42 27 25 29 23 24 18 18 12 11 15  8
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 54 55 56 58 65
10  6  4  7  5  7  4  3  8  6  2  1  2  4  2  1  1  2  2  1  1
I know this may look confusing, but trust me, table is effective. The numbers on the top, running from 14 to 65, are the ages respondents entered. The numbers below them are how many people gave each age. For example, while only 1 person is 65, there are 42 people who are 25 years old, 41 who are 22, and so on. From the table, the most common age is 25, and if you look closely, you can see this on the histogram as well. Another way to express this visually is a stem-and-leaf plot. That wraps up age.
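For the curious, base R can draw a stem-and-leaf plot right in the console with stem(). A minimal sketch on some made-up ages (with the real data you’d pass DSInt_df$Age):

```r
# stem() prints a text stem-and-leaf plot to the console; toy ages here
ages <- c(14, 19, 22, 22, 25, 25, 25, 26, 31, 34, 42, 65)
stem(ages)
```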
That’s enough for now. The EDA process will continue in part 2, which will be out next week. We’ll also build a mini customer persona for a would-be data science bootcamp. If you have any questions that this data may answer, respond to this post and I may include them.
If you liked this post and learned something new, hit the recommend button ^__^