5 Important Python Libraries and Methods For Data Scientists!

Hasan Basri Akçay
Published in DataBulls · Jan 17, 2022 · 4 min read


Photo by Carlos Muza on Unsplash

Many Python libraries have already been written for data science, but newcomers to data science and machine learning are often not familiar with them. In this article, I explain five different Python libraries that make their job easier.

Some of the libraries I am going to share will surely be familiar to you, depending on how far you are into your career.

The Titanic dataset is used throughout this article.

Pandas Profiling

Data analysis is one of the most necessary parts of data science, and it takes a lot of time. Therefore, many data scientists use the pandas_profiling library for this step. You can see the Python code below.

import pandas as pd
from pandas_profiling import ProfileReport

# load the Titanic training data
train = pd.read_csv('../input/titanic/train.csv')

# build an exploratory report and export it as HTML
profile = ProfileReport(train, title="Pandas Profiling Report")
profile.to_file("profile.html")
ProfileReport Results — image by author
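If you work in a Jupyter notebook, you can also render the report inline instead of writing it to a file. A minimal sketch, assuming the same profile object from the code above:

# show the report directly inside the notebook
profile.to_notebook_iframe()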

Imblearn Library

In real-world datasets, the target values are usually not balanced. For this reason, you often have to bring balance to the dataset. There are two types of sampling methods: oversampling and undersampling. Oversampling increases the number of samples in the minority class, and undersampling decreases the number of samples in the majority class.

You can see the distribution of the target below.

import seaborn as sns

sns.countplot(data=train, x='Survived')
Target Distribution — image by author

Over Sampling

from imblearn.over_sampling import SMOTE

# oversample the minority class with SMOTE
sm = SMOTE(random_state=42)
X_train = train[['Pclass', 'SibSp', 'Parch', 'Fare']]
y_train = train[['Survived']]
X_res, y_res = sm.fit_resample(X_train, y_train)

# target distribution after oversampling
sns.countplot(data=y_res, x='Survived')
Target Distribution After Over Sampling — image by author
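Besides the plot, you can check the class counts numerically. A minimal sketch, assuming the y_train and y_res objects from the SMOTE code above:

# class counts before and after oversampling
print(y_train['Survived'].value_counts())
print(y_res['Survived'].value_counts())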

Under Sampling

from imblearn.under_sampling import RandomUnderSampler

# undersample the majority class at random
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

# target distribution after undersampling
sns.countplot(data=y_res, x='Survived')
Target Distribution After Under Sampling — image by author

Q-Q plot

The distribution of the features is important for predictions. Many ML models predict better when the features are close to a normal distribution. The Q-Q plot is one of the best plots for inspecting a distribution.

from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt

# drop missing ages before plotting
train_age_dropna = train[['Age']].dropna()

fig, ax = plt.subplots(figsize=(8, 5))
qqplot(train_age_dropna['Age'], line='45', fit=True, ax=ax)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles", fontsize=15)
plt.ylabel("Sample quantiles", fontsize=15)
plt.title("Q-Q plot of Age", fontsize=18)
plt.grid(True)
plt.show()
Q-Q plot — image by author

Box-Cox Transformation

The Box-Cox transformation is used to bring the distribution of a feature closer to a normal distribution. As you can see below, before the Box-Cox transformation the skew of Age is 0.39, and after the transformation it becomes -0.05, which is closer to zero.

from scipy.stats import skew, boxcox

# skewness before the transformation
skew_value = skew(train_age_dropna['Age'])
print('old skew: ', skew_value)

# Box-Cox needs strictly positive inputs; the Age values here are all positive
new_value, fitted_lambda = boxcox(train_age_dropna['Age'])
print('new skew: ', skew(new_value))

fig, ax = plt.subplots(figsize=(8, 5))
qqplot(new_value, line='45', fit=True, ax=ax)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles", fontsize=15)
plt.ylabel("Sample quantiles", fontsize=15)
plt.title("Q-Q plot of Age after Box-Cox", fontsize=18)
plt.grid(True)
plt.show()
old skew: 0.3882898514698657
new skew: -0.04897110694154816
Q-Q plot After Box-Cox Transformation — image by author
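In a real pipeline you would apply the same fitted lambda to new data (for example the test set) instead of refitting it. A minimal sketch, assuming the Titanic test.csv sits next to train.csv and using scipy.special.boxcox, which applies a given lambda rather than fitting one:

from scipy.special import boxcox as boxcox_with_lambda

# apply the lambda fitted on the training data to the (positive) test ages
test = pd.read_csv('../input/titanic/test.csv')
test_age = test['Age'].dropna()
test_age_transformed = boxcox_with_lambda(test_age, fitted_lambda)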

Lazy Predict

Every machine learning model has its advantages and disadvantages. You can see the results of many ML models at once by using the lazypredict library. After the predictions, you can select the best model for your problem.

from lazypredict.Supervised import LazyClassifier, LazyRegressor
from sklearn.model_selection import train_test_split

# load data
X, y = train[['Pclass', 'SibSp', 'Parch', 'Fare']], train[['Survived']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# fit all models
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models
LazyClassifier Results — image by author
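The returned models object is a pandas DataFrame with one row of metrics per model, and predictions contains each model's test-set predictions. A minimal sketch for ranking them, assuming the objects from the code above and that the metrics include an 'Accuracy' column:

# top models by accuracy
print(models.sort_values('Accuracy', ascending=False).head())

# per-model predictions for the test split
print(predictions.head())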
