I love hackathons. I can learn more in a day or two of hard work with friends than I could in six months studying on my own. So when I heard about the first Social Data Science Hackathon in the Twin Cities, I was one of the first to sign up.
The hackathon took place on November 14th in a cavernous conference room at the headquarters of General Mills, just outside Minneapolis. With four teammates, I sweated over data science problems (and munched on General Mills snacks) for eight hours, then presented our work to a panel of esteemed judges. Here are a few of the most important lessons I learned from the experience.
1. Nothing beats help from a domain expert
Most hackathons are just for fun, but this one had a higher purpose in mind. In Minneapolis and St. Paul, there’s a stark gap in education performance between white students and students of color. The hackathon organizers partnered with Generation Next, a nonprofit that aims to close this achievement gap. Generation Next had tens of thousands of data tables, mostly in Excel spreadsheets. The spreadsheets captured hundreds of variables, from percentage of first-graders who have had school eye exams, to ACT test scores, to school budgets and teacher training levels. Every variable is grouped by school, ethnicity, gender, and year, so no individual student information is revealed.
Our challenge was to sift through this massive trove of data to make a compelling case to the judges on how we could close the achievement gap between white students and students of color. This would be impossible without help from an expert. Fortunately, we had one: Generation Next’s Data and Research Director, Jonathan May. Jonathan has spent years studying education data and was able to help us understand what it meant and where it came from.
2. A little advance preparation goes a long way
Even with expert assistance, we would have been lost without an enormous amount of prep work by the hackathon organizers. Co-organizer David Radcliffe spent weeks writing scripts to extract data from thousands of spreadsheets and load them into a single MySQL database. While he was working on that, we were planning our team.
The day before the hackathon, I met my teammates at a mall food court. Kevin Church, an independent statistical consultant who previously worked for the New York Times, was our group leader. I’m a software engineer, and was able to help by writing Python scripts to download data. Machine learning expert Steve Damer and data science consultant Pedro Medina contributed their professional expertise. GIS specialist Renee Huset rounded out our team.
We gathered around the mall food court table for nearly three hours, planning our day of hacking. Steve wrote a script to merge the data files that we would get in the competition. Renee showed us how to use GIS to join geographic data with tabular data. I wrote a script to download geocoded crime statistics from the Minneapolis Police Department. We picked a team name: Mine the Gap.
With five experienced members and a great game plan, our team seemed strong. But we knew there were five other, equally strong teams entering the competition.
We planned our hackathon team at a mall food court.
3. Data cleaning is tedious but vital
When we arrived on the competition day, we were presented with a slip of paper with login credentials to a MySQL server. We logged in, and got to work on the actual data from Generation Next.
Right away we noticed problems. Schools were listed under slightly different names in different tables. Columns had null values. Tables were split into multiple pieces. The hackathon organizers had done an amazing job building the database, but we couldn’t join the tables together without painful work by hand.
If you analyze data professionally, you probably know that getting data into an analyzable form is often more than half the battle. It came as a surprise to me, though. We spent the entire first half of the hackathon just cleaning up the data.
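As one small example of the kind of cleanup this involved: mismatched school names can often be reconciled with fuzzy string matching from Python's standard library. This is my own sketch with invented names and a guessed similarity cutoff, not our actual hackathon code:

```python
import difflib

# Hypothetical example: the same schools appeared under slightly
# different names in different tables.
canonical = ["Anwatin Middle School", "Northeast Middle School", "Sanford Middle School"]
messy = ["Anwatin Middle Sch.", "Sanford Middle"]

def match_school(name, choices, cutoff=0.6):
    """Map a messy school name to its closest canonical spelling, or None."""
    hits = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None

for name in messy:
    print(name, "->", match_school(name, canonical))
```

In practice a cutoff like this still needs a human to spot-check the matches, which is part of why the cleanup took us half the day.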
The atmosphere was tense early in the hackathon.
4. Data mining is a powerful tool — but also a dangerous one
From the beginning, we decided to use a technique called stepwise regression. In traditional statistics, you start with a hypothesis, and try to prove that hypothesis right or wrong. But we weren’t education experts, so we didn’t have a solid hypothesis to start from. Instead, we decided to feed in as many variables as possible, and have the computer do the work of selecting a model. In general, using an algorithm to build a model automatically is known as data mining.
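To make the idea concrete, here is a toy forward-stepwise selection on synthetic data. This is my own sketch of the technique, not our hackathon code; the data, the stopping threshold, and the function names are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the hackathon data: 200 schools, 6 candidate
# predictors, where only x0 and x3 actually drive the outcome.
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

def r_squared(X_sub, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_stepwise(X, y, improvement=0.01):
    """Greedily add whichever predictor most improves R^2, until gains stall."""
    selected, best_r2 = [], 0.0
    while True:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if not remaining:
            break
        r2, j = max((r_squared(X[:, selected + [j]], y), j) for j in remaining)
        if r2 - best_r2 < improvement:
            break
        selected.append(j)
        best_r2 = r2
    return selected

print(forward_stepwise(X, y))  # greedy order: strongest predictor first
```

The danger is visible even in this toy: the algorithm will happily keep adding variables as long as each one nudges R^2 past the threshold, whether or not a human would find them meaningful.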
Stepwise regression is controversial, since it is prone to a problem known as overfitting. For our purposes, though, there was a bigger issue: the variables that kept showing up in our model were interesting to the algorithm, but not to a human being. In one case, the model flagged attendance at a particular school as a strong predictor of low test scores; it turned out to be a school for kids with learning disabilities — statistically true information, but not surprising to anyone with domain knowledge.
If we had a few days, we could have resolved these issues. But we only had one day, and time was running out.
The output of our stepwise regression run is a linear model like this one. The dependent variable was eighth grade math scores as measured by the MCA-III test for students in the Minneapolis Public School district. All of these factors are statistically significant, although many of them are not actionable to policy-makers.
5. There’s a geography side to almost every story
If you’re not familiar with geographic information systems (GIS), they’re a family of techniques and software for analyzing geographic data. My teammate Renee Huset was a GIS wizard, and she used her knowledge to incorporate local crime data into our analysis. We had crime rates for each of Minneapolis’ 86 neighborhoods, and we had catchment area maps for the city’s 76 schools. They didn’t line up — a school could draw from three different neighborhoods, and a neighborhood could feed into three different schools.
In just a few minutes Renee was able to merge the two data sources, producing a new “crime score” for each school based on an area-proportionate weighting of the crime rates of the neighborhoods in its catchment area. I was blown away — solving the same problem would have taken me 1000 lines of Python code and a week of work.
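I don't have Renee's actual GIS workflow, but the arithmetic behind an area-proportionate score is simple enough to sketch. All school names, neighborhood overlaps, and crime rates below are made up for illustration — the GIS software's real contribution was computing the overlap areas, which this sketch takes as given:

```python
# Hypothetical overlap areas (sq km) between each school's catchment
# area and the neighborhoods it intersects.
overlaps = {
    "School A": {"Whittier": 1.2, "Lyndale": 0.8},
    "School B": {"Lyndale": 0.5, "Kingfield": 1.5},
}
# Hypothetical crime rates per neighborhood.
crime_rate = {"Whittier": 40.0, "Lyndale": 25.0, "Kingfield": 10.0}

def crime_score(school):
    """Average the overlapping neighborhoods' crime rates, weighted by area."""
    areas = overlaps[school]
    total = sum(areas.values())
    return sum(crime_rate[n] * a / total for n, a in areas.items())

for school in overlaps:
    print(school, round(crime_score(school), 2))
```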
Including crime rates in our analysis was a challenge, but it produced a fascinating result. For one school, the gap in MCA test scores between it and a group of high-performing schools completely disappeared when crime rate data were included in the analysis. That let us make a concrete suggestion to policy makers: try reducing crime near such schools in order to improve test scores.
Black lines are Minneapolis middle school catchment area boundaries, while shades of gray indicate crime rate by neighborhood (darker is higher crime). Data from Minneapolis Police Department via Minnpost’s Crime site, and Minnesota Department of Education (via MN Geospatial Commons).
6. When in doubt, find or create a metric
Team Gopher Geeks had a brilliant insight. All the teams were tasked with understanding the achievement gap, but none of us had a definition of what it actually was. So they scoured the web and found the official achievement gap formula from the Minnesota Department of Education. It’s based on the percentage of students in each ethnic group at a school who meet a proficiency standard based on test scores. Instead of muddling through a variety of dependent variables, they had just one, and it was easy to explain and understand. They were able to correlate school funding, school diversity, and student-teacher ratio to statistically significant differences in the metric.
Given their wisdom in finding an applicable metric, it shouldn’t be surprising that the Gopher Geeks took home the first prize! The team members were all students in the Carlson School of Management’s MS in Business Analytics program.
While I was disappointed that my team didn’t win, I was proud of our strong showing and unbelievably impressed by the Gopher Geeks’ work.
Heat map of the achievement gap metric from the Gopher Geeks team — black dots are schools, and red regions have the highest achievement gap.
7. A great team is everything
Data science is a truly interdisciplinary field. To do great data science, you need techies, statisticians, and people who know how to put it all together and tell a coherent story. I saw that in effect on our team: none of us had all the skills we needed individually, but combined, we were a force to be reckoned with.
In case you’re looking for a great team of your own, Dave Radcliffe, Kevin Church, and Pedro Medina are always open for consulting engagements, while Priyanka Saboo, Wenqiuli Zhang and Sharada Narayanan of the Gopher Geeks are looking for full-time positions when they graduate from their MS program in May.
The winning team, Gopher Geeks: Priyanka Saboo, Sharada Narayanan, Wenqiuli Zhang, and Shari Roling. All four members are students in the University of Minnesota Carlson School of Management’s MS in Business Analytics program.
If you’d like to participate in a data science hackathon in the Twin Cities, there are many ways to get involved! Social Data Science is going to improve the database they developed for the hackathon and give it back to Generation Next, which will require many more hours of volunteer work. They’re also in the planning stages for their next hackathon. Finally, I’m helping out with a new group, called Analyze This, which will run 3-month business-oriented data science competitions. I hope to see you there!
Thanks to the event sponsors (Veritas, Apex, Cloudera, General Mills, and phData), the event organizers, and the volunteer judges for making this amazing experience possible! As Kevin Church said after the hackathon, “The event was a selfless display of community and just one more reason to reside in the Twin Cities!”
I got a selfie with competition judge and former Minneapolis mayor RT Rybak.