Moore Sloan Data Science Summit Recap: Day 1

Reproducibility, data journalism, and…Gossip Girl? Data scientists from NYU, University of Washington, and UC Berkeley share their research and discuss the future of data

Every profession has that one annual event where you drop everything to attend. For actors and actresses, it’s the Oscars. For models and designers, it’s New York Fashion Week. And, for data scientists? It’s the Moore Sloan Data Science Summit.

The Moore Sloan Data Science Summit 2017

Held in snazzy New Orleans this year, the annual event brings together data scientists who are supported by the Moore Sloan Data Science Environment, a five-year, $37.8 million cross-institutional partnership that aims to advance data-driven scientific discoveries at NYU’s Center for Data Science (CDS), University of Washington’s eScience Institute, and University of California, Berkeley’s Institute for Data Science.

The two-day summit is a chance for researchers to update each other on their progress and share their work. It began with the Software panel, which brought together Claudio Silva (NYU), Jacob Vanderplas (UW), and Stefan J. Van Der Walt (UC Berkeley). They outlined major research ventures and new curriculum initiatives supported by the Moore Sloan grant, such as NYU’s Data Science Capstone Projects for its Master’s students, UW’s Data Science for Social Good summer program, and UC Berkeley’s Data Structures for Data Science workshops.

They were followed by the Data Reproducibility panel consisting of Juliana Freire, NYU CDS’s Executive Director of the Moore Sloan Data Science Environment, and Ariel Rokem (UW), who has long worked on reproducing neuroscience research.

Technology advances quickly, and each software update to a platform carries the danger of losing data saved under older versions. That is why, as Freire and Rokem said, developing reproducibility tools is vital.

An example of such a tool is ReproZip, a powerful platform developed by Freire and her team that helps users preserve and reproduce their data. ReproZip is especially useful for archivists and librarians, and it’s on the way to becoming a key tool for data journalists, too. In her roundtable, Meredith Broussard (NYU) explained how she is collaborating with the investigative news organization ProPublica and the rest of the NYU ReproZip team to find ways of preserving interactive online news applications, such as data-driven demographic maps or interactive databases. After all, several of these applications currently depend on software that is likely to be obsolete soon. (Did you know that Flash, for example, won’t be supported by Adobe after 2020? Yikes!)

After a quick lunch, the rest of the day was an opportunity for data scientists to attend short tutorials on new ways to use different computing tools like Python, Baselayer, and Jupyter.

Of particular interest was Jacob Schreiber’s (UW) tutorial, where we discovered that die-hard fans of the popular television show Gossip Girl should’ve turned to his new Python modeling package, pomegranate, to work out who the show’s absurd blabbermouth was. (It was Dan, for those who still don’t know.) Built as a fast and flexible package for data scientists who need to perform probabilistic modeling tasks, pomegranate is an alternative to scikit-learn or custom packages in R.

The day closed with a series of four-minute lightning talks, sixteen (!) of them in two hours. Some of the most fascinating projects included using data-driven image segmentation to improve brain scans (Anisha Keshavan, UW), applying machine learning to estimate the effect of medical treatments (Soren Kunzel, UC Berkeley), and using agent-based modeling to learn more about violence in Chicago (Tom Laetsch, NYU).

More photos and videos to come. (Follow us on Facebook to get ’em once they’re out.)

by Cherrie Kwok