How to Start your Data Science Journey?

Quickly set up your data science environment and use the different Python libraries that can speed up the learning process.

Ankit Goel
The Startup
7 min read · Jun 7, 2020


Different Python Libraries which are helpful for Data Science.
Source: Image by Author

Machine Learning seems fascinating to a lot of beginners, but they often get lost in the pool of information available across different resources. It is true that there are many different algorithms and steps to learn, but starting with a strong base gives not only confidence but also the motivation to learn and explore further. In this story, we will go through the steps to set up your environment and start learning with the help of a well-known dataset, the Iris dataset, which is a multi-class classification problem in Machine Learning. We will also go through some helpful Python libraries that can speed up the learning process and can be useful even if you are already a data scientist. If you have done the setup already, you can skip the setup steps. Let’s begin with the first step of your journey.

Setting up the Environment

We will use the Anaconda distribution to set up the data science environment. Download the latest version of Anaconda from here, open the Anaconda Prompt, and run the following command-
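With a default Anaconda installation, that command is simply:

```
jupyter notebook
```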

The above command will start the Jupyter server and load the notebook directory in your browser.

Create a virtual environment

I hope you are aware of virtual environments; if not, you can read about them here. Although Anaconda comes with a base environment that already has most of the libraries installed, it is recommended to use virtual environments, as they help us manage different environments that can have different packages, and if something goes wrong with one environment it will not affect the others. Here are the commands you can use to create, activate, and install packages in a virtual environment.
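A minimal sketch, assuming we call the environment ml_env (the name and Python version are just example choices):

```
conda create --name ml_env python=3.8
conda activate ml_env
conda install <package-name>
```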

Link virtual environment with the Notebook

By default, the new environment will not show up in the Jupyter notebook. You need to run the following commands to link your environment with the Jupyter client.
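A minimal sketch, reusing the example environment name ml_env from above:

```
conda install ipykernel
python -m ipykernel install --user --name=ml_env
```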

Starting notebook and useful commands

Once you have a virtual environment, go to the browser and open a new notebook as shown below. Select the environment you just created.

Jupyter notebook provides many handy shortcuts. Below 2 are my favorite-

  1. Tab- this acts as an autocomplete.
  2. Shift + Tab- this shows the details of the command under the cursor, so you do not need to go to the library documentation every time.

See how these shortcuts can be helpful-

Source: GIF by Author

Exploring Python libraries and applying Machine Learning

We need different libraries for loading datasets, visualization, and modeling. We will go through each and install them in the environment. You can have a look at my notebook; feel free to download it, import it into your environment, and play around with it-

Jupyter Contrib Nbextensions

We often need to share our notebooks with different stakeholders or present them, and this library provides a lot of useful extensions. I will not go through all the extensions here, but I recommend using it. My favorite ones are-

  1. Collapsible headings.
  2. Table of Contents.
  3. Execution Time.

You can install it using
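A minimal sketch of the usual installation steps (after installing, individual extensions can be enabled from the Nbextensions tab in Jupyter):

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
```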

Here is a short demo on how it can help-

Source: GIF by Author

Pandas- Python Data Analysis Library

This is the heart of data science with Python and provides many different capabilities, like

  • Data structures to work with the data.
  • Operations you can perform on the data.
  • Load and save data in different formats.

and many more. Many other libraries we use for machine learning with Python have pandas as a dependency. Install it using-
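Inside the activated environment:

```
conda install pandas   # or: pip install pandas
```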

The above command will install other libraries like NumPy which pandas uses under the hood.
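Just to illustrate the kind of things pandas can do, here is a tiny sketch (the column names and file name are made up for this example):

```python
import pandas as pd

# Build a small DataFrame, inspect it, run a simple operation, and save it
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.2],
    "species": ["setosa", "setosa", "virginica"],
})
print(df.describe())                   # summary statistics
print(df.groupby("species").mean())    # a simple aggregation
df.to_csv("sample.csv", index=False)   # save in CSV format
```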

Sklearn (Scikit-Learn)

We will use this library to download test datasets and apply different machine learning algorithms. Install using the following command.
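Installing via pip with -U makes sure you get a recent version, which matters for the next step:

```
pip install -U scikit-learn   # or: conda install scikit-learn
```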

In machine learning classification problems, the task can be described as: given the features X (input variables), predict y (the target value). Sklearn provides a few test datasets which we can use to play with; we will take the Iris dataset for this exercise, but if you would like to play with others, you can refer to this.

Scikit-learn 0.23 added a feature by which we can get the test dataset directly as pandas dataframes for X and y. Make sure you are running version 0.23 or later.
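A minimal sketch of loading the data this way:

```python
from sklearn.datasets import load_iris

# Return the Iris features and target as pandas objects (needs scikit-learn >= 0.23)
X, y = load_iris(return_X_y=True, as_frame=True)
print(X.head())
print(y.head())
```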

Source: Created by Author

We will now go through the other libraries, and we will come back to Sklearn for modeling later.

Pandas Profiling

I am sure many of you are already aware of this library, but if you are not, please do give it a try. It provides a rich profiling report for the data, covering everything from missing values to correlations. You need to install it using pip, as the conda package downloads an old version of it.
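For example:

```
pip install pandas-profiling
```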

Source: GIF by Author

This report provides many details, a few of which are-

  1. Overview of different variables in the dataset.
  2. Correlation between variables.
  3. Interactions between variables.
  4. Details about each variable.

The following commands can be used to generate and save the profile report-
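A minimal sketch, reusing the X dataframe loaded earlier (the title and file name are just examples):

```python
from pandas_profiling import ProfileReport

# Build a profiling report for the Iris features
profile = ProfileReport(X, title="Iris Profiling Report")

profile.to_notebook_iframe()                  # render inline in the notebook
profile.to_file("iris_profile_report.html")   # or save it as an HTML file
```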

Plotly Express

Although pandas-profiling provides a lot of useful information, we still need to visualize other things, for example how the target variable is distributed across the input variables. There are many libraries for visualization; Matplotlib and Seaborn are the famous ones you would have heard about. The main thing where Plotly stands out is interactive plots, i.e. you can interact with the generated plots. Install it using the following command
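For example:

```
pip install plotly
```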

Below, we plot a scatter plot of sepal length against petal length and use ‘color’ to show how the target variable is related.
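A minimal sketch of such a plot, using the column names produced by load_iris:

```python
import plotly.express as px

# Combine the features and target into one dataframe for plotting
df_iris = X.copy()
df_iris["target"] = y.astype(str)  # cast to string so the classes get discrete colors

# Interactive scatter plot, colored by target class
fig = px.scatter(df_iris, x="sepal length (cm)", y="petal length (cm)", color="target")
fig.show()
```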

You can see below how we can filter out different targets.

Source: GIF by Author

This library provides a lot of additional functionality, maybe we can cover that in a different story.

Training and Test dataset

The idea of building models is to predict values that are not known. If we fit the model on the entire dataset, we will not be able to evaluate how it performs on unseen data. To achieve this, we split the dataset into a training and a test dataset. The training dataset is used to train the model, and the test set is used to evaluate it. Sklearn provides a function ‘train_test_split’ which splits the dataset into train and test datasets. The following code can be used to split the dataset.
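A minimal sketch (the test size and random seed are just example values):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for evaluation; stratify to keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```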

Source: Image by Author

Tuning Hyperparameters

One of the important tasks in machine learning is tuning hyperparameters; these are the attributes of the algorithm that control the learning process. Different values suit different learning problems, and it is important to find the best ones. Sklearn provides mainly two ways, ‘GridSearchCV’ and ‘RandomizedSearchCV’, to find the best parameters. For large training sets, we might need to use RandomizedSearchCV, as trying every combination of parameters would take a lot of time. The Iris dataset has only 150 rows, so we use ‘GridSearchCV’.

For this story, we will train a LogisticRegression model, which is well-suited for classification problems and has different hyperparameters like ‘solver’, ‘C’, ‘penalty’, and ‘l1_ratio’. Not every solver supports all parameters, so we create a separate parameter dictionary for each group of solvers.
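A minimal sketch of such a search (the parameter values are only example choices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Separate grids, since not every solver supports every penalty
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.01, 0.1, 1, 10]},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.25, 0.5, 0.75],
     "C": [0.01, 0.1, 1, 10]},
]

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```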

The above code searches over the different combinations of parameters and finds the one that best generalizes the problem.

Evaluating the model

As mentioned, we need to evaluate the model on the test dataset. Many different metrics are available; the most common one for classification problems is accuracy. Here we will show the accuracy score, classification report, and confusion matrix that Sklearn provides.
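A minimal sketch, continuing with the names used above:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the best model found by the grid search on the held-out test set
y_pred = grid.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```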

The Iris dataset is classified as an easy dataset, which means the data is already well-suited for machine learning purposes, and hence we were able to get a perfect score, i.e. an accuracy of 1.0, with our model. This means our model predicted all the samples in the test dataset correctly. Results will vary with the different problems you try to solve.

Conclusion

The idea of this story was to give you a head start on machine learning and a glimpse of the different libraries you can use to speed up the process. I kept the overview of each topic simple to keep this story short and precise. There is still a lot to explore, such as different types of machine learning problems, different models, different metrics, and where to use them. You can try different things in the same way I did here so that you can see how they work. I will try to add more stories that dive deeper into specific areas to help accelerate your learning.

Update 16th June 2020: Recently I found a way to combine Sklearn Pipeline with GridSearchCV to search for the best preprocessing steps. If interested, check out this- Are you using Pipeline in Scikit-Learn?

Ankit Goel
The Startup

I recently developed an interest in Data Science and am exploring the field to see what we can achieve. Linkedin- www.linkedin.com/in/ankit-goel-9b2b2037