5 Important Python Libraries and Methods For Data Scientists!
Many Python libraries make data science work easier, but newcomers to data science and machine learning are often unfamiliar with them. In this article, I explain five Python libraries and methods that simplify the job.
Some of these will surely be familiar to you already, depending on how far along you are in your career.
All examples in this article use the Titanic dataset.
Pandas Profiling
Data analysis is one of the most essential parts of data science, and it takes a lot of time. That is why many data scientists use the pandas_profiling library for this step. You can see the Python code below.
import pandas as pd
from pandas_profiling import ProfileReport

train = pd.read_csv('../input/titanic/train.csv')
profile = ProfileReport(train, title="Pandas Profiling Report")
profile.to_file("profile.html")
Imblearn Library
Real-world datasets rarely have balanced target values, so you often have to rebalance them before training. There are two families of sampling methods: oversampling and undersampling. Oversampling increases the number of samples in the minority class, while undersampling decreases the number of samples in the majority class.
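Before resampling, it helps to quantify the imbalance numerically. A minimal sketch with a toy Survived column (the 549/342 split is similar to the Titanic training labels):

```python
import pandas as pd

# toy binary target with an imbalanced split
y = pd.Series([0] * 549 + [1] * 342, name='Survived')

counts = y.value_counts()
print(counts)
print('minority share: {:.2f}'.format(counts.min() / counts.sum()))
```

A minority share far below 0.5 is a sign that resampling (or class weights) is worth trying.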
You can see the distribution of the target below.
import seaborn as sns
sns.countplot(data=train, x='Survived')
Oversampling
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train = train[['Pclass', 'SibSp', 'Parch', 'Fare']]
y_train = train[['Survived']]
X_res, y_res = sm.fit_resample(X_train, y_train)
sns.countplot(data=y_res, x='Survived')
Undersampling
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)
sns.countplot(data=y_res, x='Survived')
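Random undersampling is conceptually simple: sample the majority class down to the minority class size. A pure-pandas sketch of the idea (a toy frame stands in for the Titanic data, so the counts are illustrative only):

```python
import pandas as pd

# toy imbalanced target: 549 non-survivors, 342 survivors
df = pd.DataFrame({'Survived': [0] * 549 + [1] * 342})
minority_n = df['Survived'].value_counts().min()

# sample each class down to the minority class size, then recombine
balanced = pd.concat([
    df[df['Survived'] == label].sample(minority_n, random_state=42)
    for label in df['Survived'].unique()
])
print(balanced['Survived'].value_counts())
```

RandomUnderSampler does essentially this for you, with extra options such as target sampling ratios.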
Q-Q plot
The distribution of a feature matters a lot for predictions: most ML models predict better when features are close to a normal distribution. A Q-Q plot is one of the best plots for inspecting a distribution.
from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt
train_age_dropna = train[['Age']].dropna()
plt.figure(figsize=(8,5))
fig = qqplot(train_age_dropna['Age'], line='45', fit=True)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles",fontsize=15)
plt.ylabel("Sample quantiles",fontsize=15)
plt.title("Q-Q plot of Age", fontsize=18)
plt.grid(True)
plt.show()
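A Q-Q plot is a visual check; you can pair it with a numerical normality test such as Shapiro-Wilk (scipy.stats.shapiro). A small sketch on synthetic data, not the Titanic ages:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=30, scale=10, size=200)  # roughly bell-shaped
skewed_sample = rng.exponential(scale=30, size=200)     # heavily right-skewed

# a small p-value means the sample is unlikely to come from a normal distribution
_, p_normal = shapiro(normal_sample)
_, p_skewed = shapiro(skewed_sample)
print(p_normal, p_skewed)
```

The skewed sample gets a tiny p-value, matching what the curved Q-Q plot would show.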
Box-Cox Transformation
The Box-Cox transformation brings the distribution of a feature closer to a normal distribution. As you can see below, the skew of Age is 0.39 before the Box-Cox transformation and -0.05 after it, which is much closer to zero.
from scipy.stats import skew, boxcox
skew_value = skew(train_age_dropna['Age'])
print('old skew: ', skew_value)
new_value, fitted_lambda = boxcox(train_age_dropna['Age'])
print('new skew: ', skew(new_value))
plt.figure(figsize=(8,5))
fig = qqplot(new_value, line='45', fit=True)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles",fontsize=15)
plt.ylabel("Sample quantiles",fontsize=15)
plt.title("Q-Q plot of Box-Cox transformed Age", fontsize=18)
plt.grid(True)
plt.show()

old skew: 0.3882898514698657
new skew: -0.04897110694154816
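boxcox also returns the fitted lambda, which you need if you ever want to map transformed values (or model predictions) back to the original scale via scipy.special.inv_boxcox. A minimal round-trip sketch on synthetic, strictly positive data rather than the Titanic ages:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

rng = np.random.default_rng(42)
ages = rng.exponential(scale=30, size=500) + 1  # strictly positive, right-skewed

transformed, fitted_lambda = boxcox(ages)
recovered = inv_boxcox(transformed, fitted_lambda)

# the inverse transform recovers the original values
print(np.allclose(recovered, ages))  # True
```

Keep the fitted lambda with your preprocessing artifacts; without it the transformation cannot be inverted.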
Lazy Predict
Every machine learning model has its advantages and disadvantages. With the lazypredict library, you can fit many ML models at once and compare their results, then select the best model for your problem.
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split
# load data
X, y = train[['Pclass', 'SibSp', 'Parch', 'Fare']], train[['Survived']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
# fit all models
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models
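The models object returned by fit is a pandas DataFrame of per-model scores, so picking a winner is ordinary DataFrame work. A sketch with a mocked-up leaderboard (the scores and column names here are illustrative assumptions, modeled on lazypredict's output):

```python
import pandas as pd

# hypothetical leaderboard in the shape lazypredict returns
leaderboard = pd.DataFrame(
    {'Accuracy': [0.81, 0.78, 0.74], 'F1 Score': [0.80, 0.77, 0.72]},
    index=['RandomForestClassifier', 'LogisticRegression', 'DecisionTreeClassifier'],
)

# pick the model with the highest accuracy
best = leaderboard['Accuracy'].idxmax()
print(best)  # RandomForestClassifier
```

From there you would refit only the winning model with proper hyperparameter tuning, since lazypredict uses default settings.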
👋 Thanks for reading. If you enjoy my work, don’t forget to like it, follow me on Medium and LinkedIn. It will motivate me in offering more content to the Medium community! 😊
References
[1]: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/index.html
[2]: https://imbalanced-learn.org/stable/user_guide.html#user-guide
[3]: https://scipy.github.io/devdocs/tutorial/index.html#user-guide
[4]: https://lazypredict.readthedocs.io/en/latest/