Someone asked me recently how he could get the knowledge and skills necessary to become a Data Scientist. There are different ways to learn data science: go to university for a bachelor’s or master’s degree in data science, join a bootcamp program, or learn it by yourself. Nowadays a lot of material is available on the internet, often for free, to learn the skills necessary for Data Science.
There are three primary skills needed for Data Science: 1) a programming language used in the data ecosystem, typically Python, R, or Scala; 2) SQL, for data manipulation and extraction; and 3) Statistics & Machine Learning.
Assuming this is your first programming language, Python is an excellent language to learn. It is a general-purpose programming language with a broad ecosystem of data libraries. It is also relatively straightforward to learn and is often taught in introductory computer science classes.
It is good to start with a general introduction to computer science rather than a more data-focused Python course. edX’s Introduction to Computer Science and Programming Using Python provides a decent entry point for getting accustomed to programming principles. Although the course includes some problem sets, you will still need to consolidate what you learn through further practical exercises or a project.
HackerRank provides a practical learning path that I believe is an excellent follow-up to an introductory course: doing a few of the easy Python exercises should familiarize you with using the language.
One of the best ways to learn a programming language is to first learn some of its concepts and then work on some projects to consolidate the knowledge. A simple project of your own provides a way to see how everything fits together. At this stage, there is no need to look beyond Python’s standard library.
When you feel comfortable applying Python in your projects, it should be a good time to deepen your understanding of Python.
Some good topics to deep dive are
It is always helpful to dive into parts of the Python library reference. The chapters on data types, file and directory access, and generic operating system services are good ones to get into, as well as the odd topics of pickle, csv, json, and unit testing.
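As a small taste of those standard library modules, here is a minimal sketch that writes a CSV file, reads it back, and converts it to JSON. The file name and the sample rows are made up for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Made-up sample data, only for illustration
rows = [
    {"name": "Ada", "language": "Python"},
    {"name": "Grace", "language": "SQL"},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "people.csv"  # hypothetical file name

    # The csv documentation recommends newline="" when opening files
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "language"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back as a list of dictionaries
    with csv_path.open(newline="", encoding="utf-8") as handle:
        loaded = list(csv.DictReader(handle))

# Serialize to JSON text and parse it back again
as_json = json.dumps(loaded, indent=2)
round_tripped = json.loads(as_json)
print(as_json)
```

Keep in mind that csv reads every value back as a string, so numeric columns would need an explicit conversion.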
With this knowledge, it should be possible to work on some personal projects that manipulate and process files purely in Python. Creating a project that leverages the command line (argparse) is an excellent way to make this useful and entertaining. An example project is a file organizer that sorts files into subfolders based on file type and content. It will be less efficient than using specialized libraries, but it will help consolidate this knowledge.
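A file organizer along those lines could start out like the sketch below. It is only one possible shape for such a project: it sorts by file extension only (not by content), and the function names, the "no_extension" folder, and the --dry-run flag are my own choices for the example.

```python
import argparse
import shutil
from pathlib import Path


def plan_moves(source):
    """Map each file directly inside `source` to a path in a per-extension subfolder."""
    moves = {}
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue  # leave existing subfolders untouched
        # "report.PDF" goes to "pdf"; files without an extension go to "no_extension"
        folder = path.suffix.lower().lstrip(".") or "no_extension"
        moves[path] = source / folder / path.name
    return moves


def organize(source, dry_run=False):
    """Print the planned moves and, unless dry_run is set, carry them out."""
    moves = plan_moves(source)
    for src, dst in moves.items():
        print(f"{src.name} -> {dst.relative_to(source)}")
        if not dry_run:
            dst.parent.mkdir(exist_ok=True)
            shutil.move(str(src), str(dst))
    return moves


def main(argv=None):
    parser = argparse.ArgumentParser(description="Sort files into subfolders by extension.")
    parser.add_argument("directory", type=Path, help="folder to organize")
    parser.add_argument("--dry-run", action="store_true", help="only print the planned moves")
    args = parser.parse_args(argv)
    organize(args.directory, dry_run=args.dry_run)
```

Calling main(["some_folder", "--dry-run"]) previews the moves without touching anything. Adding the usual if __name__ == "__main__": main() guard at the bottom turns the file into a command-line script.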
Once you have some grasp of creating applications for processing files, the logical next step is to learn the more specific data libraries pandas and NumPy. DataCamp and Dataquest are some of the providers that offer interactive tutorials for these libraries. Understanding how to use these libraries is usually a prerequisite for learning the statistical and machine learning-oriented libraries.
The next step is to learn some of the other important libraries used in Python, such as requests, scrapy, sqlalchemy or django. This will give you the knowledge to really make full-scale applications that either fetch or surface data. After that, it will be a continuous learning path, and the best way to keep improving is to work on projects that allow you to have your code reviewed. Open source projects are usually great for this.
SQL is an important skill for every data scientist to learn. Data scientists use it to extract and transform data from databases, and it is one of the most commonly asked subjects in data science interviews.
Of the different types of SQL out there, what matters most for data scientists is mastering basic analytical SQL. There are a few online resources that help with getting some background in SQL. W3Schools gives a great first overview, while Codecademy, HackerRank, and Khan Academy provide a practical approach to learning SQL.
SQL is better learned through a practical approach, and there is no better way to do that than to play with different datasets and try to make sense of them. A SQLite database provides a decent way to get some experience dealing with small datasets at relatively low effort. The main difficulty with practicing SQL is finding datasets to practice with.
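To show how little setup SQLite needs, here is a minimal sketch using Python’s built-in sqlite3 module with an in-memory database. The orders table and its rows are invented for the example; the query is the kind of basic analytical SQL (aggregation, filtering on groups, ordering) worth practicing.

```python
import sqlite3

# An in-memory database: nothing to install and nothing written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, order_date TEXT)")

# Invented rows, only for illustration
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", 20.0, "2021-01-03"),
        ("alice", 35.0, "2021-01-10"),
        ("bob", 15.0, "2021-01-04"),
        ("carol", 50.0, "2021-01-05"),
        ("carol", 10.0, "2021-01-12"),
        ("carol", 40.0, "2021-01-20"),
    ],
)

# Customers with at least two orders, ranked by total spend
query = """
    SELECT customer,
           COUNT(*)    AS n_orders,
           SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer
    HAVING COUNT(*) >= 2
    ORDER BY total_spent DESC
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)

conn.close()
```

Swapping ":memory:" for a file path keeps the data between sessions, which is handy once you start loading your own datasets.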
Statistics & Machine Learning
If you haven’t had many statistics courses during your prior studies, it is helpful to go through an introductory Statistics and Machine Learning course covering the following topics: regression (linear/logistic), decision trees, random forests, k-means, and KNN. edX offers a good one in “The Analytics Edge”. One drawback of the course, however, is that it uses R as its programming language, which means that it will be necessary to learn Python’s statistical libraries, such as scikit-learn (sklearn), separately.
A good follow-up is a more in-depth theoretical course such as Introduction to Statistical Learning, which is centered around a free-to-download book of the same name (hard copy also available). The course manages to go deep into the concepts without getting too deep into the mathematics.
It is worth getting a good grounding in the algorithms by coding them from scratch. A lot of data science interviews test your knowledge of some of the basic algorithms to see if you understand them. Coding them from scratch lets you know them inside out.
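As an illustration of what coding an algorithm from scratch can look like, here is a minimal sketch of a k-nearest-neighbours (KNN) classifier in plain Python. The function names and the toy data are my own, and it leaves out things a real implementation would handle, such as feature scaling and deliberate tie-breaking.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated clusters, invented for the example
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.5)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (1.2, 1.4)))  # prints "small"
print(knn_predict(points, labels, (6.2, 6.1)))  # prints "large"
```

Writing it out like this makes the interview-style questions concrete: what happens when k changes, what the prediction costs per query, and why features on different scales distort the distance.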
On a more practical note, Kaggle provides some decent projects to get accustomed to part of the data science workflow. It provides datasets and allows you to see how other people approach the same problems. It is worth trying some of the challenges, at the very least to get more familiar with the different machine learning libraries and preprocessing steps. Kaggle competitions tend to focus heavily on the modeling approach rather than on data transformation and commonly use very clean datasets, which is very far from the working reality of a data scientist.
After having done a few Kaggle challenges, the most beneficial next step is to get some practical experience on real (read: unclean) datasets. Get some expertise in cleaning datasets, debugging, and correcting the issues that arise when training on such datasets. For this, you can either source your own data by scraping information off websites or get some project experience on websites such as freelancer.com.
Depending on what you would like to specialize in, there are multiple online resources available that introduce specific areas. Coursera’s Machine Learning course from Andrew Ng provides, for instance, an excellent introduction to deep learning, as well as other machine learning concepts. There are other topics from Stats and Machine Learning that are useful for data scientists to learn, such as NLP, Computer Vision, or Bayesian Stats.
There are quite a few resources to help you self-learn the different aspects of data science. It is essential to get good enough coding experience, both through the theoretical foundation and through working on projects. Couple this with a good understanding of the core machine learning models, some Kaggle exercises, and some work on real datasets, and you should have the right foundation for
The learning journey, however, doesn’t stop there. Data Science is a continuous learning discipline in which it is possible to learn across many axes.
There are topics that data scientists can get into, such as programming and leveraging Spark, going deep in TensorFlow code rather than relying only on Keras, programming for GPU using CUDA, working with Graph technology …
It is also possible to grow by getting a broader stats/ML background, specializing in specific domain areas, reading and implementing research papers, improving your engineering skills, or moving more towards product management.