Being a Data Scientist in 2016

TL;DR: The field of data science has rapidly matured in the last two years. Finally, data scientists have the ability to easily analyze large quantities of data. In the upcoming blog posts, I will explain the methods I used to study large-scale datasets, such as WikiTree and Reddit.

Disclaimer: This blog post isn’t about “deep learning” — yes, I know that’s very rare for blog posts in 2016!


I have been working as a data scientist both in academia and industry for over a decade, long before the term data scientist became popular. In the last couple of years, the field of data science has matured and received a significant boost. It is a really exciting time to be a data scientist! Now when people ask me what I do, I simply answer “I am a data scientist” instead of launching into a complicated explanation.

Data scientists finally have both the software tools and the hardware available to tackle problems that require analyzing big datasets. No longer are we data janitors. For the first time in my career, I find myself spending most of my time actually analyzing the data, instead of cleaning it and figuring out how to deal with hardware limitations.

Only a few years ago, analyzing large quantities of data was a lot more difficult. When I started my PhD at the end of 2010, doing data-oriented research was logistically challenging. First, before my colleagues and I could start doing anything, we needed to find a way to obtain the data for our studies. That meant we had to find online resources with relevant data for our research. Then, we typically developed a web crawler to collect the data. Next, we parsed and cleaned the crawled data by developing C#, JavaScript, or Python code. We crawled so many websites that in the end we even developed a crawling and cleaning framework.

Satellite image from Google Earth combined with accident and police report heatmap based on data collected from Waze

After collecting the data, we stored it in relational tables using MySQL and finally started to analyze it by executing SQL queries. In many cases, we used the data to build some type of prediction model, usually using Weka. Weka wasn’t able to construct models using all the records in the dataset, however, so in many scenarios we needed to find a smart way to filter the dataset before constructing the desired prediction model. In some of my studies, I ran out of places to put all the collected data and ended up asking favors from other faculty members to use their relatively strong servers to analyze the data. I am sure that easier solutions to store and analyze data existed back then, but I wasn’t aware of some of them, and the solutions I was aware of were beyond my budget.

Now data science research is much easier. For the first time, there are many diverse publicly available datasets, such as datasets published on Kaggle, data.gov, and Yahoo! Webscope (which recently released the largest machine learning dataset). Moreover, I believe that companies have become more aware of the advantages of sharing their data; on several occasions when I needed a specific dataset for my research, I just asked permission to access the data and in most cases I got it. After I received the desired dataset, I still needed to clean it and sometimes parse it. In my recent work, I parsed and cleaned the data with the GraphLab SFrame, a data structure that is similar to the pandas DataFrame, only more scalable.

Twins most frequent first names based on data from WikiTree — for getting access to this dataset I just asked for permission.

To actually analyze the data, I usually take a small part of it and run prototype analyses on my personal laptop. Once the code becomes stable enough, I move it to a relatively strong server that can analyze the full dataset. In many cases the server is not fast enough to do the full analysis in a reasonable time, so after I’m sure my code is pretty much working on the full dataset, I rent a super strong server in the cloud by the hour (I usually use AWS spot instances) and run the full analysis there. In most cases, to perform the analysis I use various GraphLab toolkits (as a former data scientist at Dato, I am truly a fan of GraphLab), scikit-learn, igraph, networkx, and other tools such as the Gensim Word2Vec implementation. In the past, I needed to filter the data to construct prediction models. Today, unlike two years ago, I don’t need to worry about that: with many of the current tools, I can create prediction models utilizing numerous examples with plenty of features.
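To make the "take a small part of the data" step concrete, here is a minimal sketch of one way to do it, not necessarily how I do it in my own projects. It assumes the data sits in a CSV file with a header row, and it uses reservoir sampling from the Python standard library so the full file never has to fit in memory. The function name sample_rows and the toy user/score columns are made up for illustration.

```python
import csv
import io
import random


def sample_rows(lines, k, seed=0):
    """Return the header and a uniform random sample of up to k data rows.

    Uses reservoir sampling, so the input is read once and at most
    k rows are held in memory at any time.
    """
    reader = csv.reader(lines)
    header = next(reader)
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Stand-in for a large file (hypothetical columns); in practice,
# pass an open file object instead of this in-memory string.
fake_file = io.StringIO(
    "user,score\n" + "\n".join(f"u{i},{i % 7}" for i in range(10000))
)
header, sample = sample_rows(fake_file, k=100)
print(header, len(sample))
```

The sampled rows can then be loaded into whatever tool the prototype uses, and the same analysis code can later be pointed at the full file on a bigger machine.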

Additionally, in the last two years, visualization tools such as D3.js and seaborn have become part of my workflow. These tools are now considerably more user-friendly and they can create beautiful diagrams. I really love using these visualization tools to explain my research results and construct figures for my papers.

In my next couple of blog posts, I will tell you more about my experience as a data scientist, and I’ll give an overview and some code samples to describe how I actually perform my data-oriented research.

Michael

BTW, what hasn’t changed in my work over the past decade is the fact that I still use LaTeX to write my articles :)