Meet Professor Robert Gould
Professor Gould is the founder of ASA DataFest and Undergraduate Vice Chair of Statistics at UCLA. Here’s our Q&A.
How did DataFest start?
It started in 2011 when we were looking for a way for students to gain experience working with complex datasets that they couldn’t get in the classroom. We wanted to put them in a situation where no one knew what the answer was, and let them apply everything they learned in an informal setting.
Where did each year’s dataset come from?
The first dataset was from the LAPD and it arose from a consulting project. We thought from then on DataFest would be a great way to work with nonprofits since they usually don’t have statisticians on staff. However, it turns out they also don’t have database managers, so they couldn’t get us the data. By default, we’ve worked more with businesses that have the resources to manage the data.
At the American Statistical Association (ASA) meetings, we’d go down the exhibition hallway and ask whether any company had data for us to use. We look for a large dataset (around 1.5 gigabytes) that has many variables and is unified around a common theme that is understandable without requiring too much effort or subject knowledge. Since it is a national event now, we also want the theme to be interesting nationally.
Some companies have to consult with their lawyers. This year, Ticketmaster was concerned that they could not reveal the names of their artists, so they had to work with lawyers to make the data available.
What was the best part about this year’s DataFest?
Having it at Covel Commons. It was a great venue and there was a lot of space for people to sit down and think. The very first year we had it in a conference room in Ackerman, except we would get kicked out at night so we had to move to the hallway till the morning. This year’s room size was good but it can always be bigger.
Over the last several DataFests, do you notice any changes in the tools people use and the questions they explore?
Students have become much more sophisticated with programming. There are fewer questions about the basics, and it’s harder to determine who’s stats and who’s CS. Some teams now have a blend of both. The sophistication of graphics and presentation quality have improved dramatically. There used to be teams that were complete failures, but now pretty much every presentation is professional.
There are changes in the topics students explore, but mostly because the data is different each year. This year’s guidelines were the most specific our data donor (Ticketmaster) ever provided, whereas E-Harmony and Edmunds.com (from previous years) provided questions that were more open ended.
What are topics or tools that always come up in presentations?
It depends on the dataset. There was a lot of K-means clustering and PCA this year. One thing we saw in the early years was that many teams tried to fit a statistical model to the data without exploring whether the model was suitable for the data. This year, teams were more exploratory with the data, rather than just fitting a model.
What makes winning teams win?
Having a very clear story — the problem, the solution, and why you believe in the story. If the story’s too complex, then it may get lost. Judges appreciate students who are careful about what their inference does and doesn’t do. They like teams that are thoughtful in communicating the robustness of their conclusions.
Are there changes you’re looking to make to DataFests in the future?
I’m hoping we’ll have more regional DataFests. This year, UCSB, Cal State Fullerton, USC and UCI wanted to send more teams but couldn’t. Having an Orange County branch would take some of the load off and allow more students to participate.
We used to host workshops during DataFest on the context of the data, how to deal with certain data types, how to make maps, etc. It would be nice to do that again.
Which classes prepare students for DataFest?
One of our goals for DataFest was to pay more attention to what we’re teaching and see if we need to include or exclude certain topics. Stats 20 came out of this. We used to assume that people would know how to subset and arrange the data.
The question I’m always asking is: Is there a course we should be offering? I think the one skill students still struggle with is generating hypotheses and questions and understanding how to track them down.
We ask every professor, what’s your favorite distribution?
The Poisson distribution. It’s a distribution that arises from thinking about a practical problem and it’s fun to say the name.
This interview was conducted by Edward Fu and Tyson Ni on May 10, 2016, and edited for length and clarity. Stay tuned for more Q&As like this with other professors!