5 Important Python Libraries and Methods For Data Scientists!

Hasan Basri Akçay
Published in DataBulls · Jan 17, 2022 · 4 min read


Photo by Carlos Muza on Unsplash

Many Python libraries have already been written for data science, but newcomers to data science and machine learning are often not familiar with them. In this article, I explain five different Python libraries that make their job easier.

Some of the libraries I am going to share will surely be familiar to you, depending on how far you are into your career.

The Titanic dataset is used throughout this article.

Pandas Profiling

Data analysis is one of the most necessary parts of data science, and it takes a lot of time. Therefore, many data scientists use the pandas_profiling library for this step. You can see the Python code below.

import pandas as pd
from pandas_profiling import ProfileReport

# load the Titanic training data
train = pd.read_csv('../input/titanic/train.csv')

# build an exploratory report and export it as HTML
profile = ProfileReport(train, title="Pandas Profiling Report")
profile.to_file("profile.html")
ProfileReport Results — image by author
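If you work in a Jupyter notebook, you can also render the report inline instead of writing it to a file. A minimal sketch, assuming the same profile object from the code above:

# show the report directly inside the notebook
profile.to_notebook_iframe()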

Imblearn Library

In real-world datasets, the target values are usually not balanced. For this reason, you often have to bring balance to the dataset. There are two types of sampling methods: oversampling and undersampling. Oversampling increases the number of samples in the minority class, and undersampling decreases the number of samples in the majority class.

You can see the distribution of the target below.

import seaborn as sns

sns.countplot(data=train, x='Survived')
Target Distribution — image by author

Over Sampling

from imblearn.over_sampling import SMOTE

# oversample the minority class with SMOTE
sm = SMOTE(random_state=42)
X_train = train[['Pclass', 'SibSp', 'Parch', 'Fare']]
y_train = train[['Survived']]
X_res, y_res = sm.fit_resample(X_train, y_train)

# target distribution after oversampling
sns.countplot(data=y_res, x='Survived')
Target Distribution After Over Sampling — image by author
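Besides the plot, you can check the class counts numerically. A minimal sketch, assuming the y_train and y_res objects from the SMOTE code above:

# class counts before and after oversampling
print(y_train['Survived'].value_counts())
print(y_res['Survived'].value_counts())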

Under Sampling

from imblearn.under_sampling import RandomUnderSampler

# undersample the majority class at random
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

# target distribution after undersampling
sns.countplot(data=y_res, x='Survived')
Target Distribution After Under Sampling — image by author

Q-Q plot

The distribution of the features is important for predictions. Many ML models predict better when the features are close to a normal distribution. The Q-Q plot is one of the best plots for inspecting a distribution.

from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt

# drop missing ages before plotting
train_age_dropna = train[['Age']].dropna()

fig, ax = plt.subplots(figsize=(8, 5))
qqplot(train_age_dropna['Age'], line='45', fit=True, ax=ax)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles", fontsize=15)
plt.ylabel("Sample quantiles", fontsize=15)
plt.title("Q-Q plot of Age", fontsize=18)
plt.grid(True)
plt.show()
Q-Q plot — image by author

Box-Cox Transformation

The Box-Cox transformation is used to bring the distribution of a feature closer to a normal distribution. As you can see below, before the Box-Cox transformation the skew of Age is 0.39, and after the transformation it becomes -0.05, which is closer to zero.

from scipy.stats import skew, boxcox

# skewness before the transformation
skew_value = skew(train_age_dropna['Age'])
print('old skew: ', skew_value)

# Box-Cox needs strictly positive inputs; the Age values here are all positive
new_value, fitted_lambda = boxcox(train_age_dropna['Age'])
print('new skew: ', skew(new_value))

fig, ax = plt.subplots(figsize=(8, 5))
qqplot(new_value, line='45', fit=True, ax=ax)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles", fontsize=15)
plt.ylabel("Sample quantiles", fontsize=15)
plt.title("Q-Q plot of Age after Box-Cox", fontsize=18)
plt.grid(True)
plt.show()
old skew: 0.3882898514698657
new skew: -0.04897110694154816
Q-Q plot After Box-Cox Transformation — image by author
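In a real pipeline you would apply the same fitted lambda to new data (for example the test set) instead of refitting it. A minimal sketch, assuming the Titanic test.csv sits next to train.csv and using scipy.special.boxcox, which applies a given lambda rather than fitting one:

from scipy.special import boxcox as boxcox_with_lambda

# apply the lambda fitted on the training data to the (positive) test ages
test = pd.read_csv('../input/titanic/test.csv')
test_age = test['Age'].dropna()
test_age_transformed = boxcox_with_lambda(test_age, fitted_lambda)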

Lazy Predict

Every machine learning model has its advantages and disadvantages. You can see the results of many ML models at once by using the lazypredict library. After the predictions, you can select the best model for your problem.

from lazypredict.Supervised import LazyClassifier, LazyRegressor
from sklearn.model_selection import train_test_split

# load data
X, y = train[['Pclass', 'SibSp', 'Parch', 'Fare']], train[['Survived']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# fit all models
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models
LazyClassifier Results — image by author
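The returned models object is a pandas DataFrame with one row of metrics per model, and predictions contains each model's test-set predictions. A minimal sketch for ranking them, assuming the objects from the code above and that the metrics include an 'Accuracy' column:

# top models by accuracy
print(models.sort_values('Accuracy', ascending=False).head())

# per-model predictions for the test split
print(predictions.head())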
