The Computer Science Experience at Harvard: A Data Visualization Project

The computer science concentration (“major”) is rapidly increasing in popularity at Harvard. For a data visualization project, I set out with two classmates, Alex Abrahams ’18 and Tommy O’Shea ’17, to visualize data about the computer science experience at Harvard, particularly in this unique time of rapid development and perhaps some growing pains.

The final data visualization project can be found here. More about our process and results can be found below.

Mission

Since 2008, the number of computer science concentrators at Harvard has quadrupled, and even more students are earning secondaries or simply exploring computer science through a class. As the computer science population has grown, we’ve noticed several patterns, first through personal experience and now quantified with data: the number of faculty has grown at a much slower rate, potentially creating larger class sizes and less accessible professors; the representation of women and minorities is low and has remained low despite the growing interest in computer science; and varying levels of prior experience can affect a student’s confidence in computer science, perhaps unjustifiably, since computer science students now span a wider range of experience than ever.

Our motivation for this project is to break down and visualize this data, as well as the general computer science experience at Harvard. Using survey data from over 900 respondents, as well as a data set from the Harvard School of Engineering and Applied Sciences, we sought to examine the computer science concentration at Harvard by comparing responses from concentrators and non-concentrators of different demographics.

Goals

The overall question we set out to answer is “What is the computer science experience like at Harvard, and how do students feel about computer science at Harvard?”

To answer this question thoroughly, we broke it into several smaller questions, many of which correspond directly to our visualizations:

  • What differing experiences have computer science students had at Harvard depending on their gender, race, and past programming experience?
  • How has the computer science department changed from 2008 to 2014, as measured by the number of concentrators?
  • How accessible are TFs (teaching fellows, or teaching assistants) and professors in the computer science department?
  • What outside perspectives do students, based on their past experience, gender, race, etc., have about computer science concentrators?

Data Set

We used two primary data sets to build our visualizations and give context to the project. Both are quite comprehensive, and both are unique to computer science at Harvard.

Our main data set was taken from a Women in Computer Science survey of 900 Harvard College students. The survey asks numerous questions about the respondents themselves and their experience with the computer science department at Harvard. The data set includes respondents’ gender, race, and computer science background, as well as their opinions and perceptions of computer science at Harvard.

Each respondent was required to provide gender, race, and whether they are concentrating in computer science. This allowed us to filter the rest of the data in different ways to find potential correlations between categories.

Two survey questions particularly stood out to us and were especially helpful in finding results:

  • “How good are you at programming compared to others in the computer science classes you’ve taken?” Respondents selected an answer of 1 through 5 (from much worse to much better). We classified this as “programming confidence,” since the survey offers no evidence of how good respondents actually are at programming. We wanted to explore whether programming confidence differs between genders.
  • “What three words would you use to describe a computer science concentrator?” At first glance, this could seem like a trivial question. However, when we first looked through the data set, many of the answers were enlightening about how people perceive computer science students. We classified this as “perceptions,” since it captures each respondent’s personal stereotypes and perceptions of computer science students. Because these varied, we wanted to explore how perceptions of computer science students differ based on respondents’ gender and race.

Our second data set comes from the School of Engineering and Applied Sciences and records the number of computer science concentrators each year from 2008 to 2014. Beyond the total number of concentrators, it also breaks out the number of women and minority concentrators. All of these figures can be compared against the College’s total enrollment and its total numbers of women and minority students.

Data Wrangling

Our two data sets were given to us in a format that was easily converted to CSV. The pie charts did not require much data cleaning, since all we needed to do was count totals: we used Excel’s COUNTIF function to find the total number of respondents in each demographic. We then created the object arrays for each pie chart by hand in the corresponding JavaScript code. We also did not need to clean any of the data for the “Growth of CS Concentration” and “Coding Confidence Between Genders” sections; we simply loaded it in the format it was given to us.
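For readers who would rather script this step than use a spreadsheet, the same COUNTIF-style tally can be sketched in Python. This is only an illustration, not what we ran: the function name `pie_slices`, the column name `gender`, and the `label`/`count` keys are all made up for the example.

```python
from collections import Counter


def pie_slices(rows, column):
    """Count how many rows fall into each category of `column`.

    Returns a list of {"label", "count"} objects, similar in spirit to the
    object arrays a pie chart would consume. Key names are hypothetical.
    """
    counts = Counter(row[column] for row in rows)
    # Sort by label so the output order is stable from run to run.
    return [{"label": label, "count": n} for label, n in sorted(counts.items())]


# Hypothetical survey rows; the real data has many more columns and rows.
example_rows = [
    {"gender": "Female"},
    {"gender": "Male"},
    {"gender": "Female"},
]
slices = pie_slices(example_rows, "gender")
```

The resulting list could then be pasted or exported into the chart code in place of hand-counted totals.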

The bulk of our data cleaning and wrangling went into the word clouds. The original data was difficult to manage: responses varied in punctuation and capitalization, and not every response contained the same number of words. With the help of our TF, Niamh, we cleaned the data with Python so that each response had uniform punctuation, capitalization, and number of words. We then used Python again to convert the cleaned data from CSV to JSON, which made it easier to load with d3.json when implementing the word cloud. Once the JSON file was loaded, we wrangled the data with three different filters (one each for gender, race, and prior CS background) and a function that concatenates the three-word arrays onto each other to create one large, semicolon-separated array.
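To give a sense of what this kind of cleaning involves, here is a minimal Python sketch. It is not the script we actually ran: the column names (`gender`, `race`, `cs_background`, `words`), the choice of separators, and the three-word cap are assumptions made for illustration.

```python
import csv
import io
import json
import re


def clean_response(raw, max_words=3):
    """Normalize one free-text answer into a list of lowercase words."""
    # Treat commas, semicolons, slashes, and newlines as word separators.
    parts = re.split(r"[,;/\n]+", raw)
    if len(parts) == 1:
        # No explicit separators: fall back to splitting on whitespace.
        parts = raw.split()
    words = []
    for part in parts:
        # Lowercase, then drop everything except letters, hyphens, and spaces.
        word = re.sub(r"[^a-z\- ]", "", part.lower()).strip()
        if word:
            words.append(word)
    # Keep at most `max_words` words so every response has a uniform shape.
    return words[:max_words]


def csv_to_json(csv_text):
    """Convert CSV text with hypothetical column names into a JSON string."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "gender": row["gender"].strip().lower(),
            "race": row["race"].strip().lower(),
            "cs_background": row["cs_background"].strip().lower(),
            "words": clean_response(row["words"]),
        })
    return json.dumps(rows, indent=2)


# Made-up rows showing inconsistent punctuation and capitalization.
SAMPLE_CSV = (
    "gender,race,cs_background,words\n"
    'Female,Asian,Yes,"Smart, Nerdy; hard-working!"\n'
    "Male,White,No,Busy driven CREATIVE.\n"
)
cleaned = json.loads(csv_to_json(SAMPLE_CSV))
```

Each cleaned record ends up with a `words` list of at most three lowercase entries, which is the uniform shape the word cloud code expects to filter and count.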

Designs + Implementation

The first overall design structure for our website:

When we stepped back from our individual visualizations and looked at the bigger picture, we wanted the whole website to flow well and tell an effective digital story. At first, we split our sections into perceptions, gender, and race. However, we realized this didn’t make sense, because gender and race were often factors in the same visualizations. We also had other topics we wanted to explore, such as the growth of the computer science department. This led to the near-final sketches above. (The structure of our website can also be seen in written form under the “Feature List” section of our process, on which we based these sketches.)

We ultimately changed the final design slightly. We removed the bubble visualization, because we didn’t think it would add much to the overall data presentation and our time was better spent elsewhere, and we added more charts to the department growth section instead.

We added interactivity to our website, visible through the interactivity storyboard:

We implemented these visualizations using the JavaScript D3 library (reminder that the full data visualization can be found here).

Growth of computer science at Harvard: the number of students has grown four-fold since 2008, while the faculty has grown at a slower rate, so the student-to-faculty ratio has increased.
Men tend to be more confident in their computer science skills at Harvard (and beyond, but that’s old news).
Generally, access to the department is fairly decent. Students feel particularly comfortable approaching their TFs (teaching fellows, or teaching assistants) with problems and for help. It seems they feel a bit less comfortable approaching professors.
The number of student concentrators in computer science is growing, and with that, the numbers of women and minority students are slowly but surely increasing as well. Among faculty, however, those numbers don’t seem to be increasing.
Word cloud data visualization, perhaps our most challenging one to code.

The word cloud comparison above, which compares stereotypes and judgements of computer science “people” at Harvard (based on gender, race, etc.), was one of the more difficult visualizations to code. The goal of this visualization is to let users examine how perceptions differ between respondents of different demographics (gender, computer science background, and race). We implemented the word clouds with radio buttons that let the user pick how to sort the responses, and we created SVG areas to display the clouds themselves. To implement the cloud, we used http://bl.ocks.org/ericcoopey/6382449 as a starting point. Three filter functions, one for each of the three categories, handled the filtering. To count the words, we wrote one function that builds a single large array of all the words, and another that counts the instances of each word and stores that count in a “size” variable. This let us size each word in the cloud by its frequency, conveying the information we wanted.
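The filter-then-count logic described above can be sketched compactly. Our implementation was written in JavaScript with D3; the Python below is only an illustrative restatement of the idea, and the field names, function names, and sample records are all hypothetical.

```python
from collections import Counter


def filter_responses(responses, gender=None, race=None, cs_background=None):
    """Keep only responses matching every filter that is not None."""
    selected = []
    for response in responses:
        if gender is not None and response["gender"] != gender:
            continue
        if race is not None and response["race"] != race:
            continue
        if cs_background is not None and response["cs_background"] != cs_background:
            continue
        selected.append(response)
    return selected


def word_sizes(responses):
    """Flatten every response's words into one list, then count each word.

    Returns {"text", "size"} objects, where "size" is the word's frequency
    and can be mapped to a font size by the word cloud layout.
    """
    all_words = [word for response in responses for word in response["words"]]
    counts = Counter(all_words)
    return [{"text": word, "size": n} for word, n in counts.most_common()]


# Made-up records in the cleaned, three-words-per-response shape.
SAMPLE_RESPONSES = [
    {"gender": "female", "race": "asian", "cs_background": "yes",
     "words": ["smart", "creative", "busy"]},
    {"gender": "male", "race": "white", "cs_background": "no",
     "words": ["smart", "nerdy", "busy"]},
    {"gender": "female", "race": "white", "cs_background": "no",
     "words": ["creative", "smart", "driven"]},
]
female_cloud = word_sizes(filter_responses(SAMPLE_RESPONSES, gender="female"))
```

Keeping the filters independent of the counting step means any combination of radio button selections can reuse the same counting function.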

Realized Trends

Based on the data visualization, we came to several interesting conclusions and realizations:

Perceptions of CS:

  • There are very few differences in perception between men and women, according to the survey question, which surprised all of us.
  • However, we did find some differences: female respondents viewed computer science as more “creative” than male respondents did, and Asian American respondents viewed it as more “logical” than Caucasian respondents did.

Coding Confidence Between Genders:

  • 54% of men rated themselves as better or much better at computer science than their peers, whereas only 22% of women did.

Growth of Computer Science:

  • The number of students in computer science increased about 4x from 2008 to 2013, whereas the number of faculty in computer science increased only about 1.4x over the same period.
  • The student-to-faculty ratio, as a result, has increased dramatically since 2008.
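The size of that increase follows from the two growth factors above: if students grew about 4x and faculty about 1.4x, the student-to-faculty ratio grew by roughly 4 / 1.4. A small Python sketch of the arithmetic, using only the approximate factors quoted above (the function name is ours, for illustration):

```python
def ratio_growth(student_growth, faculty_growth):
    """Factor by which the student-to-faculty ratio changes.

    If students grow by a factor s and faculty by a factor f, the ratio
    students / faculty grows by s / f.
    """
    return student_growth / faculty_growth


# Approximate growth factors from 2008 to 2013, as quoted in the text.
growth = ratio_growth(4.0, 1.4)
print(round(growth, 2))  # 2.86
```

In other words, under these rounded figures there were nearly three times as many students per faculty member in 2013 as in 2008.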

Accessing the Department:

  • Around 74% of survey respondents find TFs to be accessible or very accessible (rated 4 or 5), whereas only 38% of survey respondents find professors to be accessible or very accessible.
  • The line graphs below directly compare the growth of the student population with the growth of the faculty. The faculty has not grown at nearly the same rate as the student population, which is a potential reason why a majority of survey respondents don’t see professors as very accessible. It would be especially interesting to have professor accessibility information from 2008, when faculty size was more appropriate for the number of students.
  • The two line graphs also show the disproportionately low numbers of women and minorities among both students and faculty, as well as the stagnant rates of women and minorities among faculty.