5 Libraries to learn in Python for Data Science! (Explained)

Yash Gupta
Data Science Simplified
9 min read · Oct 11, 2020

Starting out with Python for Data Science? You’ve come to the right place.

Disclaimer: This article is not meant only for students, academicians or professionals who are well versed with Python. *THIS IS NOT A CODING TUTORIAL*. It is for anyone who wants to understand just how awesome Python is and would like to chart out a path to learn it. It is written in highly simplified terms, and you do not need to read any code to follow it. It does contain the detailed information you need to understand how the code works.

If you are already familiar with Python, skip the next section, which explains what Python is in very non-technical and simplified terms.

What is Python?

Python is a programming language. But you already knew that, didn’t you? So what exactly is a programming language? Let’s say you have to make a cup of tea. What you need is just the right ingredients. Think of code as the ingredients and the cup of tea as your outcome: you need the right amounts of sugar, water and tea leaves to make the perfect cup, or it just doesn’t work for you. The beauty of it? You can get creative and take your cup to a whole new level, like lemon tea or red tea. You can do anything you want. Try the same with coffee, cake or an Indian curry. It’s all in your hands.

In short, your code is your ingredients, and the outcome depends on how well and how accurately you combine them. Coding is simple: the commands are written in English and read much like normal words and sentences.

Technically, Python is an open source programming language that has been around since 1991. It is the primary language used in Data Science and Analytics because of its wide application and the availability of open source libraries (each one a particular set of ingredients for a particular task) for a multitude of tasks, from Statistics all the way to Machine Learning.

What are Libraries?

A library is a set of functions and commands bundled together to perform tasks in one particular area. For example, Scipy Stats covers the Statistics used in the real world and has the relevant code to apply to datasets being analyzed in Python. There are over 137K libraries available for Python, and they go far beyond our imagination: they can do pretty much everything from working on numbers to astrophysics. (Really cool, right?)

New libraries are also being developed every day to help users around the world perform tasks using code. Following is a non-code explanation of 5 amazing libraries to learn in Python for data science:

Pandas

Numpy

Scipy Stats

Scikit Learn

Seaborn

PANDAS:

Pandas are some of the cutest animals on Earth and, here, pandas is one of the most powerful libraries in Python. Think of it as Excel with upgrades for programming. Pandas is the go-to library for handling datasets in Python. It works with tabular data, can handle enormous numbers of rows and columns, and can manipulate them in seconds.

It is also highly preferred in the Data Analytics process because of its role in the Preprocessing and Transformation of data. You can combine its commands with other Python libraries and get your desired output in just a handful of lines of code. You can load data into pandas from many sources, such as CSV files and Excel files, and transform it as necessary.

To show how powerful pandas is, I’ll give you an example. Suppose you have a table with 5000 rows of entries across 10 numeric columns. Working out the correlation between these 10 variables by hand would be really difficult, but in pandas all it takes is adding ‘.corr()’ after the name of your dataset. You’ll be presented with a correlation matrix showing the pairwise relationship of every column against every other, and it takes no more than 10 seconds if you have a clean dataset.

The same goes for tasks such as imputing data, cleaning null or empty values, adding or removing columns, and taking subsets of data. You can do almost anything you imagine with tabular data in Pandas.
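You can follow the article without reading any code, but if you are curious what the ‘.corr()’ example looks like in practice, here is an optional, minimal sketch. The dataset is random numbers and the column names (var_1 to var_10) are made up purely for illustration; they are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 5000 rows and 10 numeric columns of random values.
rng = np.random.default_rng(seed=0)
data = pd.DataFrame(
    rng.normal(size=(5000, 10)),
    columns=[f"var_{i}" for i in range(1, 11)],
)

# One call produces the 10 x 10 matrix of pairwise correlations.
correlation_matrix = data.corr()
print(correlation_matrix.round(2))

# The cleaning steps mentioned above: create an empty value,
# fill it with the column average, then take a small subset.
data.loc[0, "var_1"] = np.nan
filled = data.fillna(data.mean())
subset = filled[["var_1", "var_2"]].head(100)
```

With your own data you would typically replace the random table with something like pd.read_csv on your file; the rest stays the same.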

Link to Official Documentation:

NUMPY:

Numpy is what Data Scientists use to analyze arrays of data. It is important to understand how numpy works in order to get a firm grip on pandas too. Numpy arrays can be converted with simple code into Pandas DataFrames, or tabular data, which can then be analyzed effectively. The library has a separate section for linear algebra functions and works effectively on vectors, matrices and arrays with ‘n’ dimensions.

Numpy can transform lists into arrays and then perform calculations on them, such as the mean, the transpose or the inverse of a matrix, to help you understand and transform your data easily. It is also fast, because its core is implemented in compiled C code rather than in pure Python.
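For the curious, the steps described above can be sketched in a few optional lines. The numbers and the column names "a" and "b" are arbitrary choices for illustration only.

```python
import numpy as np
import pandas as pd

# Start from a plain Python list of lists and turn it into an array.
values = [[4.0, 7.0], [2.0, 6.0]]
matrix = np.array(values)

mean_value = matrix.mean()        # average of all four numbers
transposed = matrix.T             # rows become columns
inverse = np.linalg.inv(matrix)   # from the linear algebra section of Numpy

# The same array can be handed to pandas as tabular data.
table = pd.DataFrame(matrix, columns=["a", "b"])
print(mean_value)
print(inverse)
```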

Understanding how Numpy works will also take your pandas skills to the next level as they are highly interrelated.

Fun Fact: Numpy, Scipy and Matplotlib (the library upon which seaborn is built) helped present to the world the first ever image of a black hole, captured using the Event Horizon Telescope.

Link to Official Documentation:

Scipy Stats:

Scipy is a vast library, though for the purposes of this article we’ll only go over Scipy Stats. As the name suggests, Scipy Stats is another open source library for Python with excellent commands relating to Statistics. It provides many probability distributions and statistical methods to give insights into data. It also covers correlation functions, multivariate and univariate analysis, and summary statistics.

Kendall’s tau, p-values, z-scores, Student’s t-tests, one-tailed and two-tailed tests, and confidence intervals can all be computed on data using just a couple of lines.

Following is the official documentation of SciPy Stats. Head into it and see the number of possibilities that exist for applying statistics to data. It is actually overwhelming.
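As an optional illustration of that "couple of lines" claim, here is a minimal sketch. The two groups are randomly generated stand-ins (assumed averages of 50 and 52), not real measurements, and the variable names are made up for the example.

```python
import numpy as np
from scipy import stats

# Two hypothetical groups of 200 measurements each.
rng = np.random.default_rng(seed=1)
group_a = rng.normal(loc=50, scale=5, size=200)
group_b = rng.normal(loc=52, scale=5, size=200)

# Kendall's tau and its p-value.
tau, tau_p = stats.kendalltau(group_a, group_b)

# Two-tailed Student's t-test comparing the two group averages.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# z-scores: how many standard deviations each value sits from the average.
z_scores = stats.zscore(group_a)

# 95% confidence interval for the average of group_a.
low, high = stats.t.interval(
    0.95,
    df=len(group_a) - 1,
    loc=group_a.mean(),
    scale=stats.sem(group_a),
)
print(tau, t_p, low, high)
```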

Scikit Learn:

Scikit Learn is an open source Machine Learning library for Python used to create, fit and evaluate machine learning models on datasets. These models include Linear Regression, Logistic Regression, Lasso Regression, Support Vector Machines, Decision Tree Classifiers, K Nearest Neighbors, K Means Clustering and Polynomial Regression, along with a lot of other Supervised and Unsupervised learning approaches.

But that’s not all. It also has built-in tools to fit and transform data to suit the models being used. It can also be used to identify outliers in datasets using LocalOutlierFactor and similar tools, given an expected amount of contamination in the dataset.

The best lesson this library taught me, from personal experience, is that the machine learning algorithms themselves are not hard to run; the result depends almost entirely on how well you prepare your dataset before it goes through the model. The model then learns from the historical data presented to it, picks up the patterns in that data and helps with predictive analytics on a similar dataset. It is also very easy to split data into training data and testing data using Scikit Learn.
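If you want to see how short the "split, fit, evaluate" routine really is, here is an optional sketch using Linear Regression. The data is synthetic: I am assuming a made-up relationship (y is roughly 3 times x plus 5, with some noise) so that the model has a pattern to learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic "historical" data with a known pattern plus random noise.
rng = np.random.default_rng(seed=2)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(scale=1.0, size=200)

# Keep 25% of the rows aside to test the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows, then score on the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
r_squared = model.score(X_test, y_test)
print(model.coef_, model.intercept_, r_squared)
```

Swapping LinearRegression for another model from the list above usually changes only the import and the line that creates the model.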

Link to Official Documentation:

All in all, if you want to take predictive analytics to the next level, Scikit Learn is the library to go to.

SEABORN:

Seaborn is a library built on top of Matplotlib, which was itself originally modeled on the plotting interface of MATLAB. Seaborn is a smaller but prettier version of Matplotlib that helps users create visualizations that can compete with any other visualization tool out there. While it would be easy to use a dedicated visualization tool, Seaborn can do the same job, and its visuals are highly customizable, right from their size to the way markers are placed.

Users are given a wide range of plots, like boxplots, violin plots, swarm plots, KDE plots, 2D histograms, histplots, heatmaps, cluster maps, bar plots, scatter plots and others, which are good enough to make a clear and easily understandable visualization out of any dataset. Seaborn is generally preferred when the visualizations don’t have to be specified down to the minute details; when they do, Matplotlib is the best alternative. I’ll not elaborate on this excessively and let the visuals do the talking. Following are some plots created by users using Seaborn:

Courtesy: Google Images

By now, you’ve been through the top 5 libraries to begin with in order to get into Data Science. If you looked at them carefully, you’d notice that each one builds on and feeds into the next. Together they cover a whole Data Analytics process. You can retrieve and work on datasets using Numpy and Pandas. You can understand their metrics and their meaning using SciPy Stats, and then move on to fit machine learning models for predictive analytics using Scikit Learn. Then you can go ahead and report your findings using Seaborn!

Python is like a one-stop destination where you can improve your skills in any dimension. You could combine it with your existing HTML or CSS knowledge and Flask to grow your web development skills. You could learn Pandas and integrate it with your everyday spreadsheet work, or you could move on to AI and Deep Learning using Tensorflow and OpenCV. The possibilities are endless. All that’s needed is the right combination of code, and you’ll be ready to perform almost any task.

Mentioned in this article are just 5 out of the 137K libraries offered to users. So go ahead and find out such amazing libraries and get started with your coding journey today! For more such articles, stay tuned with us as we chart out paths on understanding data and coding and demystify other concepts related to Data Science and Coding. Please leave a review down in the comments. It was a long article, thank you very much for reading it all the way here! Great going!

Yash Gupta
Data Science Simplified

Lead Analyst at Lognormal Analytics and self-taught Data Scientist! Connect with me at - https://www.linkedin.com/in/yash-gupta-dss