Data Analysis, A Complete Roadmap for Learning Data Science — Part 2
Before getting to the more interesting parts of projects like machine learning, a data scientist will first need to spend time locating, extracting, combining and reorganising data. This data will likely also need some cleaning and transformation before being in a usable state. Additionally prior to modelling it is important to analyse the data to understand the size, shape, and scope of the dataset, and to identify patterns, trends and correlations.
This is part two of a series of posts describing the roadmap I am following to learn data science, my previous post covered programming skills and can be found here part 1. Part 3, covering maths and statistics, is available here. In the following post I am going to list the important skills that I feel should be on a roadmap for learning data science in order to be able to perform these preliminary analytical steps.
As a data scientist it is highly likely that you will be using sql to extract data from databases before performing modelling tasks. You can even do some relatively advanced data analysis using sql alone, and if you are using Google BigQuery, discussed later on in this post, you can even do some basic machine learning. Knowing some sql and having an understanding of databases is an important skill in data science.
SQL, stands for Structured Query Language, and is the standard language used for querying relational databases. A relational database is a collection of tables consisting of columns and rows of data, that are related to each other via common key identifiers.
For data science it is important to have an understanding of select statements, where clauses and join operators. You will also need to know how to create tables, and how to update, insert and delete rows and columns of data.
There are many different types of relational databases and most have slightly different sql syntaxes, but once you have learned the basic concepts, you should be able to adapt queries regardless of the database you are using.
There are lots of free tutorials out there, but I always find that I need to put anything I have learnt into practice immediately. Whilst learning data science I was working full time as a data analyst so I was never short of access to databases to practice with. However, as most people don’t have that there is this useful website http://www.sqlfiddle.com/, which allows you to create tables, and query them across several common database platforms including MySQL, PostgreSQL and SQLite.
Cloud Data Storage
I have been working with Google BigQuery for a little over a year, the standard sql syntax is fairly similar to other databases, but BigQuery’s nested and repeated fields can take some getting used to. With more and more businesses opting for cloud based data storage solutions it is useful to know a little about how to interact with them. BigQuery has a free usage tier which supports up to 10 GB of storage, and 1 TB of querying which is enough to get some experience working with data on the platform.
There are a selection of readily available public datasets to use, and you can also create your own data sources here. Google are regularly releasing functionality which makes applying machine learning easier and easier. One of the latest releases BigQuery ML allows you to create machine learning models using standard SQL queries. Although the implementation is currently limited only to regression based models I think it has potential for quickly determining the predictive capabilities of your data. There is some good documentation for this https://cloud.google.com/bigquery/docs/bigqueryml-intro which I was just recently working my way through.
In python I have 3 preferred libraries that I use for visualisations they include the following:
Pandas has some basic plotting tools, most of the popular plots are available here including histograms, scatter plots and box plots. They don’t produce the most beautiful visualisations, as there not a huge variety of formatting options, but I find that when working with data frames they enable you to get a really quick approximation of patterns and trends in a dataset. The pandas documentation covers this very well https://pandas.pydata.org/pandas-docs/stable/visualization.html.
Matplotlib provides a greater variety of plots, and gives you greater control over colours, font properties, line styles and axis properties. Although the plots it produces are not the most visually pleasing it is important to have an understanding of matplotlib, as many of the more advanced plotting libraries are built on top of it. Again the documentation for matplotlib is very good and includes lots of example plots to experiment with. CodeAcademy also has a short course https://www.codecademy.com/learn/learn-matplotlib.
Whilst matplotlib alone can produce some reasonable looking plots, it takes a lot of lines of code to generally produce them. This is where seaborn comes in, seaborn is built on top of matplotlib and allows you to create some fantastic looking plots with just a couple of lines of code. Checkout this example below, here I was able to quickly, and attractively review the correlations in this dataset with very few lines of code. There is a really comprehensive tutorial on Kaggle https://www.kaggle.com/kanncaa1/seaborn-tutorial-for-beginners.
As well as extracting and combining data in SQL you will also need to know how to clean and transform it ready for modelling. Python again has a number of libraries for this. Pandas can be used to handle missing data, and merge and join datasets. The python datetime library is also useful to learn as you are likely to need to manipulate data containing dates and timestamps. Dataquest covers all of this really well in their data exploration path https://www.dataquest.io/course/data-exploration.
It is also useful to have an understanding of the different types of data, and how to process them prior to modelling. For example, having an understanding of how to turn a column containing categorical text data into numerical data. The Sci-Kit Learn library has a number of pre-processing functions which are also really useful to learn. This library has built in functions to standardise features, handle outliers in the data, normalise features and impute missing values. Again I found the documentation really good here http://scikit-learn.org/stable/modules/preprocessing.html#binarization.
In later posts in this series I will be providing a roadmap for other aspects of learning data science including maths and statistics, data engineering, and machine learning.