Open Data Science Conference 2018

Ema Kaminsky
3 min read · May 8, 2018


Peter Wang of Anaconda speaking on stage.

Recently, I was lucky enough to score a free ticket from Women Who Code to the Open Data Science Conference 2018 in Boston. I got back from the conference two days ago, and here is my summary of its highlights, together with the projects I’m inspired to replicate this month.

Presentations I Learned Most From

Most of the talks were amazing, and it was really hard to choose which ones to attend. I attended more than 10 talks and workshops, but I learned the most from the following three speakers.

Automatic Text Summarization of Documents at Scale by Guilherme de Oliveira from Dataiku.

The presentation provided me with enough information to replicate some examples on my own. I now know that I can take the publicly available Enron email dataset, strip its metadata, remove stop words, and map the remaining words to their base forms. Then, I can run a statistical model to see which words and topics appear most frequently in the dataset. Essentially, I learned the basic steps to algorithmically analyze large sets of documents, comments, or other text files.
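Here is a minimal sketch of that pipeline in Python, assuming the gensim and nltk packages; the file name enron_emails.txt and its one-email-per-line format are placeholders of my own, not anything shown in the talk.

```python
# A rough sketch of the pipeline described above, assuming the gensim and
# nltk packages are installed (and nltk.download("wordnet") has been run).
# The file name "enron_emails.txt" and its one-email-per-line format are
# placeholders of my own.
from gensim import corpora, models
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenize and lowercase, drop stop words, map words to their base form.
    return [lemmatizer.lemmatize(token)
            for token in simple_preprocess(text)
            if token not in STOPWORDS]

with open("enron_emails.txt") as f:
    documents = [preprocess(line) for line in f]

# Build a bag-of-words corpus and fit an LDA topic model on it.
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=5)

# Print the words that dominate each discovered topic.
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```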

Project Feels: Deep Text Models for Predicting the Emotional Resonance of New York Times Articles by Alexander Spangher.

Alex’s ability to captivate and connect with the audience was a sight to behold. The whole talk felt like an informal conversation between the presenter and the 150+ people in the audience. Managing such a big crowd in a conversational way, encouraging questions and sparking curiosity, takes real skill and a bit of talent.

Project Feels aims to predict the emotional effect of NYT articles on readers, with the goal of recommending relevant articles or ads. The initial dataset was built with the help of Amazon Mechanical Turk workers, who tagged about 20,000 articles with the emotions those articles evoked, such as boredom, interest, love, and fear. The talk was well structured and gave me a good understanding of how to approach a data question and which tools to use.
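To get a feel for the shape of the problem, here is a bare-bones multi-label baseline in scikit-learn; this is emphatically not the NYT team’s deep model, and the toy articles and labels are invented for illustration.

```python
# Not the NYT team's deep model -- just a bare-bones multi-label baseline
# to show the shape of the problem. The articles and labels below are
# invented for illustration; the real dataset had ~20,000 tagged articles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

articles = [
    "A heartwarming reunion between long-lost siblings.",
    "Markets tumble as uncertainty grips anxious investors.",
    "A quiet profile of a retired lighthouse keeper.",
]
labels = [["love", "interest"], ["fear"], ["boredom"]]

# Turn the emotion tags into a binary indicator matrix.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Simple bag-of-words features; the real system used deep text models.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(articles)

# One binary classifier per emotion.
model = OneVsRestClassifier(LogisticRegression())
model.fit(X, y)

new_X = vectorizer.transform(["An alarming report grips nervous readers."])
print(dict(zip(mlb.classes_, model.predict_proba(new_X)[0])))
```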

From Numbers to Narrative: Data Storytelling by Isaac Reyes

This talk provided a summary of best practices for data visualization. It contained a lot of interesting examples of data charts from popular media and even from the speaker’s dating life. Isaac referenced the data visualization hero Edward Tufte and the Gestalt school, with its laws of similarity, proximity, and enclosure. A fun formula presented was Data-Ink Ratio = Data Ink / Total Ink Used to Produce a Graphic. Ideally, the ratio should be close to 1, meaning nearly all the ink used to produce a graphic goes toward depicting the data rather than coloring the background or adding other nonfunctional embellishments. For example, a chart that spends 60 of its 80 ink units on the data itself has a ratio of only 0.75. This formula reminded me that, above all, visualizations should show and not hide data. The talk was a great refresher on the core design principles to keep in mind while reporting data.
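To make the data-ink idea concrete, here is a toy matplotlib sketch that strips non-data ink (the chart box, tick marks, and y-axis) from a simple bar chart; the numbers are made up.

```python
# A toy illustration of raising the data-ink ratio in matplotlib: strip
# the non-data ink (chart box, tick marks, y-axis) and label the bars
# directly. The numbers are made up.
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E"]
values = [4, 7, 3, 8, 5]

fig, ax = plt.subplots()
ax.bar(labels, values, color="gray")

# Remove ink that does not encode data.
for spine in ax.spines.values():
    spine.set_visible(False)
ax.tick_params(length=0)
ax.set_yticks([])

# Put the values on the bars themselves, so the remaining ink is data ink.
for i, v in enumerate(values):
    ax.text(i, v + 0.1, str(v), ha="center")

plt.show()
```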

Takeaways

  • It was great to hear from some young speakers, both male and female. It showed me that everyone, regardless of age or gender, can work on impactful data projects. What’s important is to have an idea, test it, document findings, reflect, iterate, and repeat until you have results worth sharing.
  • Several presentations inspired me to try to replicate projects related to natural language processing. For example, I am interested in parsing the Enron email dataset and learning how to summarize large documents with the help of a machine learning model called Latent Dirichlet Allocation.
  • I would like to learn more about and try out the following Python libraries for analyzing large text files: BeautifulSoup for parsing XML and HTML files (e.g., Facebook page comments) and gensim for examining recurrent patterns of words in large documents and text files (e.g., the Enron emails). A minimal BeautifulSoup sketch follows this list.
  • Most importantly, the conference reminded me that all great projects start with a couple of small steps, followed by numerous iterations and relentless dedication.
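For the BeautifulSoup item above, a minimal parsing sketch might look like this; the file name facebook_page_export.html and the "comment" CSS class are hypothetical placeholders, since real exports use different markup.

```python
# A minimal BeautifulSoup sketch for pulling comment text out of a saved
# HTML page. The file name and the "comment" CSS class are hypothetical
# placeholders -- real Facebook exports use different markup.
from bs4 import BeautifulSoup

with open("facebook_page_export.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

comments = [div.get_text(strip=True)
            for div in soup.find_all("div", class_="comment")]

print(f"Parsed {len(comments)} comments")
```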


Ema Kaminsky

I use Medium to reflect on the design of everyday digital products.