20 Useful Python Libraries for Data Science Projects

Published in

Data Science Earth

5 min read · Jun 6, 2021

Scikit-learn, NumPy, Pandas, Matplotlib, Plotly, Bokeh and Seaborn are some of the most common Python libraries in data science. Let’s look at some other libraries that can be useful for data science projects.

1-Feature Engine

Feature-engine is a library for feature engineering. Its transformers let us select the variables we want to transform, so it is easy to apply different engineering procedures to different feature subsets. Feature-engine transformers can also be assembled within a Scikit-learn pipeline.

Feature-Engine includes transformers for:

Variable transformation
Variable selection
Categorical variable encoding
Missing data imputation
Discretisation
Outlier capping or removal
Variable creation

For the documentation and more:

Feature-engine: A Python library for Feature Engineering for Machine Learning - 1.0.2

Feature-engine is a Python library with multiple transformers to engineer features for use in machine learning models…

feature-engine.readthedocs.io

2-Yellowbrick

Yellowbrick is a library for machine learning visualization. It extends the Scikit-learn API to make model selection and hyperparameter tuning easier, and it uses Matplotlib for its visualizations. Some of its visualizers: Precision-Recall Curves, Confusion Matrices, Residuals Plot, K-Elbow Plot, Silhouette Plot, Learning Curve, t-SNE Corpus Visualization…

For the documentation and more:

Yellowbrick: Machine Learning Visualization - Yellowbrick v1.3.post1 documentation

No matter your level of technical skill, you can be helpful. We appreciate bug reports, user testing, feature requests…

www.scikit-yb.org

3-PDPbox

Partial dependence plots show the marginal effect that one or two features have on the predicted outcome of a machine learning model (J. H. Friedman, 2001). PDPbox is a library for drawing partial dependence plots.
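
To make the idea of a marginal effect concrete, here is a minimal sketch that computes a one-feature partial dependence curve by hand with NumPy and Scikit-learn. It does not use the PDPbox API; the helper name partial_dependence_1d, the synthetic data and the linear model are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite one feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

# Synthetic data: the target depends linearly on both features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1]

model = LinearRegression().fit(X, y)
grid = np.linspace(-2, 2, 5)
pd_curve = partial_dependence_1d(model, X, feature=0, grid=grid)
print(dict(zip(grid, pd_curve)))
```

Because the toy model is linear, the curve for feature 0 is a straight line whose slope matches that feature's coefficient. With a nonlinear model the same procedure traces out whatever shape the model has learned.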

For the documentation and more:

SauceCat/PDPbox

python partial dependence plot toolbox Update for versions: xgboost==1.3.3 matplotlib==3.1.1 sklearn==0.23.1 This…

github.com

4-Eli5

ELI5 is a Python package that helps debug machine learning classifiers and explain their predictions, including those of black-box estimators. It supports Scikit-learn, Keras, XGBoost, LightGBM, CatBoost…

For the documentation and more:

Welcome to ELI5's documentation! - ELI5 0.11.0 documentation

ELI5 is a Python library which allows to visualize and debug various Machine Learning models using unified API. It has…

eli5.readthedocs.io

5-Researchpy

Researchpy creates Pandas DataFrames that contain the statistical testing information commonly required for academic research. It builds on Pandas, SciPy, NumPy, Statsmodels…

For the documentation and more:

Welcome to researchpy's documentation! - researchpy 0.3.2 documentation

Researchpy produces Pandas DataFrames that contains relevant statistical testing information that is commonly required…

researchpy.readthedocs.io

6-LIME

LIME (Local Interpretable Model-agnostic Explanations) is an algorithm for explaining the predictions of black-box estimators. The LIME library is one of the most popular Python libraries for model explainability.
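
The core idea can be sketched without the library: perturb the instance, weight the perturbed samples by how close they are to it, and fit a simple weighted linear model whose coefficients act as the local explanation. The code below is a simplified illustration of that idea, not the lime package's API or its exact procedure; black_box, local_surrogate and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def black_box(X):
    # Stand-in for an opaque model: nonlinear in feature 0, linear in feature 1.
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

def local_surrogate(predict, instance, n_samples=1000, scale=0.1, seed=0):
    """Fit a proximity-weighted linear model around one instance."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance with small Gaussian noise.
    samples = instance + rng.normal(scale=scale, size=(n_samples, instance.size))
    # 2. Weight samples by closeness to the instance (exponential kernel).
    distances = np.linalg.norm(samples - instance, axis=1)
    weights = np.exp(-(distances ** 2) / (2 * scale ** 2))
    # 3. Fit an interpretable model on the black-box outputs.
    surrogate = LinearRegression().fit(samples, predict(samples), sample_weight=weights)
    return surrogate.coef_

instance = np.array([1.0, 0.0])
local_weights = local_surrogate(black_box, instance)
print(local_weights)
```

Around the point (1, 0) the stand-in model behaves roughly like 2·x0 + 3·x1, so the surrogate's coefficients should land close to 2 and 3.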

For the documentation and more:

marcotcr/lime

This project is about explaining what machine learning classifiers (or models) are doing. At the moment, we support…

github.com

7-SHAP

SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model. The SHAP library lets us compute and visualize Shapley values.
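
For intuition about what a Shapley value is, the sketch below computes exact Shapley values by brute force for a tiny three-feature function. It deliberately avoids the shap package, which relies on much faster approximations; the toy model, the all-zeros baseline and the helper shapley_values are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        z = baseline.copy()
        for j in coalition:
            z[j] = x[j]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                phi[i] += weight * (value(coalition + (i,)) - value(coalition))
    return phi

def toy_model(z):
    # One additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(toy_model, x, baseline)
print(phi, phi.sum(), toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction for x and the prediction for the baseline, which is the additive property the name refers to. The interaction term is split equally between the two features involved.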

For the documentation and more:

slundberg/shap

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model…

github.com

8-Missingno

Missingno provides a small toolset of flexible and easy-to-use missing data visualizations and it helps us to get a quick visual summary of our dataset.

For the documentation and more:

ResidentMario/missingno

Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data…

github.com

9-Imblearn

Imbalanced-learn (Imblearn) is an open-source library that relies on scikit-learn and provides tools for classification with imbalanced classes.

For the documentation and more:

User guide: contents - Version 0.8.0

imbalanced-learn.org

10-Dataprep

DataPrep allows us to prepare and visualize our data with a few lines of code. I think it could be an alternative to Pandas Profiling.

For the documentation and more:

sfu-db/dataprep

DataPrep lets you prepare your data using a single library with a few lines of code. Currently, you can use DataPrep…

github.com

11-Dython

Dython is a set of data analysis tools that can help you get more insight into your data. Dython automatically finds which features are categorical and which are numerical, computes a relevant measure of association between each pair of features, and plots the result as a heatmap.
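
To show what a measure of association between two categorical features can look like, here is a minimal sketch of Cramér's V computed with Pandas and SciPy. It is not Dython's API, and it omits any bias correction; the cramers_v helper and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(a, b):
    """Cramér's V between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(a, b)
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy data: "shade" is fully determined by "color", "size" is only loosely related.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red", "blue"],
    "shade": ["warm", "warm", "cold", "cold", "warm", "cold"],
    "size": ["S", "L", "S", "L", "S", "L"],
})

v_color_shade = cramers_v(df["color"], df["shade"])
v_color_size = cramers_v(df["color"], df["size"])
print(v_color_shade, v_color_size)
```

Computing such a value for every pair of columns gives the kind of matrix that can then be drawn as a heatmap.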

For the documentation and more:

dython

Dython is a set of Data analysis tools in Python 3.x, which can let you get more insights about your data. This…

shakedzy.xyz

12-Statsmodels

Statsmodels is a Python module that includes classes and functions for the estimation of many different statistical models as well as for conducting statistical tests, and statistical data exploration.

For the documentation and more:

Examples - statsmodels

This page provides a series of examples, tutorials and recipes to help you get started with statsmodels. Each of the…

www.statsmodels.org

13-Mlxtend

Mlxtend (machine learning extensions) is a library that contains useful tools for data science tasks.

For the documentation and more:

Home - mlxtend

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks.

rasbt.github.io

14-SciPy

SciPy is open-source software for mathematics, science, and engineering. It contains useful functions in areas such as linear algebra, optimization, signal processing and statistics.
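
The sketch below touches three of these areas in a few lines: solving a linear system, minimizing a function, and running a two-sample t-test. The numbers and the objective function are arbitrary choices for illustration.

```python
import numpy as np
from scipy import linalg, optimize, stats

# Linear algebra: solve 3a + b = 9 and a + 2b = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

# Optimization: minimize a simple quadratic bowl centred at (1, -2).
def objective(v):
    return (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2

result = optimize.minimize(objective, x0=np.zeros(2))

# Statistics: two-sample t-test on samples drawn with different means.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=1.0, scale=1.0, size=200)
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

print(solution, result.x, p_value)
```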

For the documentation and more:

Documentation - SciPy.org

Documentation for the core SciPy Stack projects:

www.scipy.org

15-Pandas Profiling

Pandas Profiling generates reports from a pandas DataFrame for quick exploratory data analysis.

For the documentation and more:

pandas-profiling/pandas-profiling

Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe()…

github.com

16-Sweetviz

Sweetviz is an open-source Python library that generates high-density visualizations for exploratory data analysis (EDA) with just two lines of code. The output is a fully self-contained HTML application.

For the documentation and more:

fbdesignpro/sweetviz

In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code! Sweetviz is an…

github.com

17-Dtreeviz

Dtreeviz is a library for decision tree visualization and model interpretation. It currently supports Scikit-Learn, XGBoost, Spark MLlib and LightGBM trees.

For the documentation and more:

parrt/dtreeviz

A python library for decision tree visualization and model interpretation. Currently supports scikit-learn, XGBoost…

github.com

18-category_encoders

category_encoders is a library of scikit-learn-style transformers that encode categorical variables as numeric values using different techniques. Some of the encoders: Count Encoder, CatBoost Encoder, James-Stein Encoder, Target Encoder…

For the documentation and more:

Category Encoders - Category Encoders 2.2.2 documentation

A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques…

contrib.scikit-learn.org

19-tslearn

tslearn is a package that provides machine learning tools for time series analysis. It builds on the scikit-learn, NumPy and SciPy libraries.

For the documentation and more:

Quick-start guide - tslearn 0.5.1.0 documentation

tslearn.readthedocs.io

20-sktime

sktime is a unified framework for machine learning with time series. It provides specialized time series algorithms and scikit-learn-compatible tools for building time series models.

For the documentation and more:

Welcome to sktime - sktime documentation

www.sktime.org

To see other published articles: https://medium.com/datasciencearth

To see published Turkish articles: https://www.datasciencearth.com

Written by ferhatmetin