I specialize in information retrieval and at the moment I’m using data science techniques to predict which information might be useful for a user before she actually opens a browser and starts searching for it. In this post I’m providing 7 essential resources & tips to help you get started with Data Science.
1. Data Science
Data science is an umbrella term for a collection of techniques from many distinct areas, such as computer science, statistics, and machine learning, to name just a few. The main objective is to extract information from data and turn it into knowledge on which you can base further decisions. It sounds easy, but it’s rarely straightforward. The process usually comprises many steps, starting with a research question. Once you know what you want to study, you need to obtain the right data, clean it, explore it, and create and evaluate a model. You typically repeat this cycle a couple of times, and only then are you ready to look for a way to communicate your results properly.
The Python for Data Analysis book is a great starting point: it guides you through all these stages and helps you make this workflow second nature.
A definition of ‘Data Scientist’ by Josh Wills
2. Data Set
First of all you need an interesting data set to play with. Either you already have your own data (congratulations!) or you need to acquire some. We happen to be living in the age of information overload which probably means that data is everywhere and it’s easy to get it, right? Yes and no.
Data is wherever you look; however, it’s not always trivial to get what you want. The path of least resistance when searching for data is to explore publicly available data sets. People tend to organize them in curated lists such as ‘Awesome Public Datasets’ by Xiaming Chen. Alternatively, you can use a data repository like datahub.io. If you don’t succeed, you can try to find a public API and collect the precious data yourself. Chances are high that such an API is not available or is very limited. In that case you have to extract the data by other means, for example by scraping webpages. This approach typically requires some data-cleaning steps, which can be costly in terms of time and effort.
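To give a rough idea of what the scraping and cleaning steps can look like, here is a minimal sketch that uses only Python’s standard library. The HTML fragment, the `CellCollector` class and the `clean_number` helper are all made up for this illustration; a real page would have to be downloaded first and would most likely need more cleaning rules than this.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; in practice this would be downloaded HTML.
SAMPLE_HTML = """
<table>
  <tr><td>apples</td><td> 1,200 </td></tr>
  <tr><td>pears</td><td>n/a</td></tr>
  <tr><td>plums</td><td>350</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")    # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data   # text may arrive in several chunks


def clean_number(raw):
    """Turn a string like ' 1,200 ' into an int, or None if it is not a number."""
    text = raw.strip().replace(",", "")
    return int(text) if text.isdigit() else None


parser = CellCollector()
parser.feed(SAMPLE_HTML)
cleaned = [(name.strip(), clean_number(value)) for name, value in parser.rows]
print(cleaned)
```

Even in this toy case most of the code deals with cleaning (whitespace, thousands separators, missing values) rather than with the extraction itself, which is typical for scraped data.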
3. Statistics
Having a good understanding of statistics is extremely helpful when performing data analysis. A rule of thumb says that the first step after getting a data set is to have a quick look at it, and basic descriptive statistics are your good friends here. If your data set contains numerical variables, you might be interested in their distributions: where their center lies (e.g., the mean) and how spread out they are (e.g., the variance).
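As a small sketch of such a quick look, Python’s built-in `statistics` module already covers the basics. The numbers below are invented purely to show the calls.

```python
import statistics

# Made-up measurements, used only to illustrate the calls.
values = [2, 4, 4, 4, 5, 5, 7, 9]

center = statistics.mean(values)       # arithmetic mean
spread = statistics.variance(values)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)     # sample standard deviation
middle = statistics.median(values)     # less sensitive to outliers than the mean

print(f"mean={center}, variance={spread:.2f}, stdev={std_dev:.2f}, median={middle}")
```

Comparing the mean with the median is a cheap first check: if they differ a lot, the distribution is probably skewed or contains outliers.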
In short, statistics offers you a toolbox for understanding your data, distinguishing between causation and correlation, analyzing patterns, modeling, predicting, and more. Last but not least, statistics quantifies how certain your outcomes are and therefore gives you confidence in your results. In our ZEEF list you can find, among others, an awesome hands-on tutorial called “An Introduction to Statistics” prepared by Thomas Haslwanter.
4. Machine Learning
In layman’s terms, the goal of machine learning algorithms is to learn to make decisions based on data. This approach, in contrast to designing hard-coded algorithms, has the huge benefit that one method can serve many purposes. Moreover, machine learning systems are designed to improve as new data come in. That’s exactly why your Amazon account looks different when you’re logged in than when you’re not: as you browse the catalogue, it learns your preferences. Google search, to mention another example, is constantly learning the importance of webpages. You don’t have time to manually inspect the thousands of results it returns; all you want is for the ten blue links to be the best hits.
If you want to start with machine learning right away, visit Joseph Misiti’s GitHub repository, which hosts a great hack-first-get-serious-later tutorial called Dive into Machine Learning. It uses Python and one of its most popular ML libraries, scikit-learn.
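To make the idea of one method serving many purposes a bit more concrete, here is a toy nearest-neighbor classifier. It is only a sketch written with the standard library; it is not taken from the tutorial above and does not use scikit-learn. The `nearest_neighbor` function and both tiny data sets are invented for this illustration.

```python
import math


def nearest_neighbor(training_data, point):
    """Return the label of the training example closest to `point`."""
    closest = min(training_data, key=lambda example: math.dist(example[0], point))
    return closest[1]


# Task 1: (weight in g, diameter in cm) -> kind of fruit
fruit = [((150, 7.0), "apple"), ((170, 7.5), "apple"),
         ((110, 5.5), "plum"), ((95, 5.0), "plum")]

# Task 2: (number of links, number of exclamation marks) -> kind of e-mail
mail = [((8, 5), "spam"), ((6, 7), "spam"),
        ((1, 0), "ham"), ((0, 1), "ham")]

# The same function handles both tasks; only the data differs.
mail_label = nearest_neighbor(mail, (7, 4))

# New data can change a decision without any change to the code.
before = nearest_neighbor(fruit, (125, 6.0))
fruit.append(((128, 6.1), "apple"))
after = nearest_neighbor(fruit, (125, 6.0))

print(mail_label, before, after)
```

Nothing in `nearest_neighbor` knows about fruit or e-mail, and adding a single labeled example is enough to change what it answers for a nearby point. Real systems are far more sophisticated, but the principle of letting the data drive the decision is the same.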
5. Data Visualization
I’ve already mentioned the descriptive power of statistics. Let me illustrate the importance of visualization with an example where simple statistics are not enough. Anscombe’s quartet is a collection of four different data sets, each with two variables, x and y. Interestingly, these data sets appear nearly the same through the lens of statistics, despite looking very different when plotted. They share almost identical values of the following properties: mean of x, sample variance of x, mean of y, correlation between x and y, and linear regression line. Yet in fact they’re very dissimilar.
Anscombe’s Quartet (Avenue/Wikipedia)
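If you would like to check the quartet’s summary statistics yourself, the following sketch computes them with plain Python. The data values are the ones published by Anscombe; the `summarize` helper and the variable names are my own, chosen for this example.

```python
import statistics

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics the four data sets have in common."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx                          # least-squares regression slope
    return {
        "mean_x": mean_x,
        "var_x": sxx / (len(x) - 1),           # sample variance of x
        "mean_y": mean_y,
        "corr": sxy / (sxx * syy) ** 0.5,      # Pearson correlation
        "slope": slope,
        "intercept": mean_y - slope * mean_x,  # regression line: y = intercept + slope * x
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, summary in summaries.items():
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

The four printed rows agree to two decimal places, which is exactly why the numbers alone cannot tell these data sets apart and a plot can.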
Data visualization is important both when analyzing data and when conveying your findings. The human eye and brain are great co-workers when it comes to recognizing patterns. They make it easy for us to immediately spot relationships, trends, outliers, or anomalies in visualizations, especially for low-dimensional data. Whenever possible, you should try to leverage the enormous bandwidth of the human visual system and explain your data in graphical form. I’d recommend first getting some inspiration from this amazing overview of visualizations based on the D3.js library.
An animated visualization of cultural mobility in the world between 600 BC and the present, revealing migration patterns of people. The animation is based on a publication by M. Schich et al., with data extracted from the publicly available Freebase knowledge base.
6. Online Courses
Data science in various forms is being introduced as a new program at many universities around the world. Massive online courses go hand in hand with this trend, and you can already find a plethora of free or very affordable courses that will guide you from Introduction to Data Science, through Data Analysis and Statistical Inference, Data Mining, or Data Visualization, to Machine Learning taught by Andrew Ng.
7. Competitions
Now that you have all the pieces together, it’s time to apply your knowledge in practice. And what could be more fun than participating in a competition? Data science challenges, such as those on Kaggle, are a great opportunity to test your own abilities and to learn from others (you’ll also get nice data for free). On top of that, if you manage to win, you may be offered a dream job or at least a lot of money. If that doesn’t tickle your fancy, some competitions (e.g., DrivenData.org) offer another, more noble reward: saving the world!
I hope you have found these tips and resources useful, especially if you’re starting your first data-related project. The field is evolving incredibly fast and new resources are popping up every day. Keep in mind that it’s good to keep up with the latest trends, but it’s essential to learn the basics.