Share the code: adventures in education data wrangling


When I transitioned from high school science teacher to data scientist, I thought it was a phenomenal stroke of luck that I was the only person in the first organization that hired me who:

  1. Had a solid knowledge base in analytics and statistics
  2. Was familiar enough with R to favor it over Excel

When you add to that the fact that I was working with proprietary data that under no circumstances could be shared with the public, it put me in a very safe position — I would never have to share my code, I could work at my own pace, and I never had to subject myself to criticism.

Fast forward to six months ago, when I finally came to terms with the fact that I was doing myself an utter and complete disservice on both a personal and professional level. Not only had my programming skills stalled, I simply wasn’t learning or growing.

I happen to like my comfort zone sometimes!

Why share your code now?

  1. I currently work with K-12 education data, which is publicly available and therefore has far fewer restrictions on what can be shared.
  2. I’m really leaning in to the whole idea of sharing my learning experience as a way to demonstrate that learning data science can be challenging, but it’s something you can absolutely do. Learning is a messy process, and you don’t just wake up one day being great at data science. Or maybe you do, and I’m going about this entirely wrong.
  3. I want to get better at programming! Outside of graduate statistics coursework from 10 years ago, I’ve had no formal training in R. Everything I’ve learned has been through a haphazard “grab and go” process, and I’ve only just started to strategically organize my learning.
  4. When I asked a question about this on Twitter, I realized that I had never shared my code, and that maybe now was a good time to change that! The entire thread gave me all the feels: the responses were both diverse and incredible.

OK, so where is your code?

Right here!

What am I looking at?

The code goes through importing and wrangling the Texas Academic Performance Report (TAPR) data for students both approaching and meeting grade level, as measured on the State of Texas Assessments of Academic Readiness (STAAR) tests.

There’s a whole host of additional information as well, and you can find it here.

What were the project guidelines?

By November 27th, I needed to have data in a format that allowed me to accomplish the following:

  1. Compare growth from 2016 to 2017 in students approaching and students meeting grade level for math and reading, looking at all students, English language learners (ELL), and students from lower socioeconomic backgrounds. The data needed to be analyzed at the state and district level.
  2. Determine which schools were in the top 10% and top 25% in terms of both achievement and year-over-year growth at both the state and district level.
  3. Determine which districts were making notable gains with English language learners and students from lower socioeconomic backgrounds.
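Guideline 2, flagging the top 10% and top 25%, maps neatly onto dplyr’s percent_rank(). Here’s a minimal sketch with invented campus names and scores — nothing below comes from the actual TAPR files:

```r
# Illustrative only: flagging top-10% and top-25% campuses with dplyr.
# The campuses and achievement scores are made up for this example.
library(dplyr)

scores <- tibble(
  campus = c("A", "B", "C", "D", "E"),
  achievement = c(45, 60, 72, 88, 93)
)

ranked <- scores %>%
  mutate(
    pct_rank = percent_rank(achievement),  # 0 for lowest, 1 for highest
    top_25   = pct_rank >= 0.75,
    top_10   = pct_rank >= 0.90
  )
```

The same mutate() would run unchanged within a group_by(district) to get district-level rankings alongside the state-level ones.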

How I approached wrangling the data:

There were three main steps:

  1. Link campus information with student mobility data as well as student testing data.
  2. Decode the information packed into the student testing data’s column headers and get it into proper columns; this was critical to the downstream analysis.
  3. Calculate additional metrics, including the average rate for each subgroup as well as growth from the 2015–2016 school year to the 2016–2017 school year.
This project felt a lot bigger than it actually was
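To make those three steps concrete, here is a minimal dplyr/tidyr sketch. Every column name below (CAMPUS, math_all_2016, and so on) is an invented stand-in for illustration, not an actual TAPR field name:

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the three source files
campus   <- tibble(CAMPUS = c("001", "002"), district = c("A", "A"))
mobility <- tibble(CAMPUS = c("001", "002"), mobility_rate = c(0.12, 0.20))
testing  <- tibble(
  CAMPUS = c("001", "002"),
  math_all_2016 = c(61, 70), math_all_2017 = c(65, 72),
  read_all_2016 = c(58, 66), read_all_2017 = c(60, 71)
)

# Step 1: link campus information with mobility and testing data
linked <- campus %>%
  left_join(mobility, by = "CAMPUS") %>%
  left_join(testing, by = "CAMPUS")

# Step 2: decode the information packed into the column headers,
# splitting names like math_all_2016 into subject / group / year columns
tidy <- linked %>%
  pivot_longer(
    cols = matches("_\\d{4}$"),
    names_to = c("subject", "group", "year"),
    names_sep = "_",
    values_to = "pct_meeting"
  )

# Step 3: calculate growth from 2016 to 2017
growth <- tidy %>%
  pivot_wider(names_from = year, values_from = pct_meeting) %>%
  mutate(growth = `2017` - `2016`)
```

The real headers encode more than three pieces of information, but the pattern is the same: pivot long, split the name, pivot back as needed.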

The non-programming skills I relied on (heavily)

Knowing my data. I spent several hours reading through column headers and looking at how the data was organized before starting to actually wrangle it. That time spent familiarizing myself with the data made it easier to understand what I ultimately needed to do, and to articulate my questions to others when something wasn’t working correctly.

Tenacity. Being confident in my ability to figure things out helped keep the impending sense of panic at bay. The deadline for this project was initially pretty tight (two weeks) and eventually got chopped in half, resulting in more than one moment of “oh my gosh I can’t actually do this.” It helped to remind myself that with a couple of good search terms and 20 minutes of reading, the answer can almost always be found.

Knowing when to ask for help. I got to a point in the wrangling where I knew what I needed to do with the data, but I couldn’t find an answer online that I both understood and could get to work. After working at it for 30 minutes, I reached out to a couple of friends who are more fluent in R than I am, and they had the issue sorted within minutes.

Recognize when I’m headed down a rabbit hole. This is where I’ll spend all of my time if I’m not careful, because it’s so darn easy to do. When I found myself reading through a weird online argument about coalesce vs. coalesce2, I knew that I had taken a wrong turn about 30 clicks back and needed to re-focus on the task at hand.
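For the curious: dplyr’s coalesce() itself is simple — it returns the first non-missing value at each position across its arguments. A quick illustration of my own (the rabbit hole was about naming debates, not about anything this complicated):

```r
library(dplyr)

# coalesce() takes the first non-NA value position-by-position
coalesce(c(NA, 2, NA), c(1, 1, NA), c(9, 9, 9))
#> [1] 1 2 9
```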

It may not be the prettiest code, but it works, and it’s done!

Questions you might have:

Why did you [insert R programming technique here]?

Because I didn’t know of another way to do it!

Can I make a suggestion about your code?

Yes, absolutely! Feel free to use GitHub, start a conversation here, and/or reach out to me on Twitter.

Why does it look like you simply uploaded your code into GitHub?

Because that’s exactly what I did.

The account and repository that I primarily work in is for my current job, and contains proprietary data and analysis. As such, I’ve created files that utilize nothing but publicly available data sets and uploaded them to my long-neglected public GitHub account.

How did you make it this far in data science while struggling with R?

There are two big components.

First, the organizations that I work with have relatively straightforward needs in terms of data science, because they have never had a “data person” on staff before. As such, a lot of my time is usually spent sorting out what the organization has in terms of data and analysis, what it needs, and then building organizational initiatives that help bridge the gap between the two.

Second, my strengths and passions are in teaching, communication, and strategic work. I absolutely love working with others to help them better understand the results of data analysis, as well as assisting them in developing their own analytical skill sets. This, combined with my strengths in organizational strategy (building departments, goal-oriented planning and execution, and developing staff and organizational capacity), has made me an ideal fit for the organizations I’ve worked for.

Remember: being able to program is only part of being a data scientist.

I have more questions — what’s the best way to get in touch?

Twitter is a fantastic way to connect, and I would love to hear from you!