20 Useful Python Libraries for Data Science Projects

By ferhatmetin
Published in Data Science Earth
5 min read · Jun 6, 2021

Scikit-learn, NumPy, Pandas, Matplotlib, Plotly, Bokeh and Seaborn are some of the most common Python libraries in data science. Let’s look at some other libraries that can be useful for data science projects.

1-Feature Engine

Feature-Engine is a library for feature engineering. Each transformer lets us select the variables we want to transform, so it is easy to apply different engineering procedures to different feature subsets. Feature-Engine transformers can also be assembled within a Scikit-learn pipeline.

Feature-Engine includes transformers for:

  • Variable transformation
  • Variable selection
  • Categorical variable encoding
  • Missing data imputation
  • Discretisation
  • Outlier capping or removal
  • Variable creation

For the documentation and more:

2-Yellowbrick

Yellowbrick is a library for machine learning visualization. It extends the Scikit-learn API to make model selection and hyperparameter tuning easier, and it uses Matplotlib for its visualizations. Some of its visualizers: Precision-Recall Curves, Confusion Matrices, Residuals Plot, K-Elbow Plot, Silhouette Plot, Learning Curve, t-SNE Corpus Visualization…

For the documentation and more:

3-PDPbox

Partial dependence plots show the marginal effect that one or two features have on the predicted outcome of a machine learning model (J. H. Friedman, 2001). PDPbox is a library for partial dependence plots.
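To make "marginal effect" concrete, here is a rough sketch that computes a one-feature partial dependence curve by hand with NumPy and Scikit-learn. It does not use PDPbox's own API. The synthetic data, the model choice and the helper name `partial_dependence_1d` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on feature 0 and ignores feature 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite the feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(-1, 1, 5)
pd_feature0 = partial_dependence_1d(model, X, 0, grid)
pd_feature1 = partial_dependence_1d(model, X, 1, grid)
print(pd_feature0)  # rises steadily with the grid value
print(pd_feature1)  # stays roughly flat
```

Plotting `grid` against these averages gives the partial dependence plot. Libraries such as PDPbox add the plotting, two-feature interactions and other conveniences on top of this idea.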

For the documentation and more:

4-Eli5

Eli5 is a Python package that helps to debug machine learning classifiers and explain their predictions, including those of black-box estimators. It provides support for Scikit-learn, Keras, XGBoost, LightGBM, CatBoost…

For the documentation and more:

5-Researchpy

Researchpy creates Pandas DataFrames that contain the statistical testing information required for academic research. It uses Pandas, SciPy, NumPy, Statsmodels…

For the documentation and more:

6-LIME

LIME (Local Interpretable Model-agnostic Explanations) is an algorithm for explaining the predictions of black-box estimators, and the LIME library is one of the most popular Python libraries for model explainability.
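The core idea behind a local explanation can be sketched in a few lines: perturb the instance, query the black box, weight the perturbed points by their proximity to the instance, and fit a simple linear surrogate. The code below is a simplified illustration of that idea using Scikit-learn only. It is not the LIME library's API, and the helper `explain_locally`, its parameters and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# A black-box model trained on data that is nonlinear in feature 0.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = X[:, 0] ** 2 + 0.1 * X[:, 1]
black_box = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def explain_locally(predict, instance, n_samples=1000, scale=0.3,
                    kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around `instance`."""
    local_rng = np.random.default_rng(seed)
    # Sample perturbed points around the instance.
    Z = instance + local_rng.normal(scale=scale, size=(n_samples, instance.size))
    # Closer samples get larger weights (exponential kernel).
    dist = np.linalg.norm(Z - instance, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0).fit(Z - instance, predict(Z), sample_weight=weights)
    return surrogate.coef_

coef_right = explain_locally(black_box.predict, np.array([1.0, 0.0]))
coef_left = explain_locally(black_box.predict, np.array([-1.0, 0.0]))
print(coef_right)  # feature 0 pushes the prediction up near x0 = 1
print(coef_left)   # feature 0 pushes the prediction down near x0 = -1
```

The same feature gets opposite local effects at the two points, which is exactly what a single global importance score would hide.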

For the documentation and more:

7-SHAP

SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model, and the SHAP library allows us to compute and visualize Shapley values.
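To show what a Shapley value is, the sketch below computes exact values for a tiny three-feature function by enumerating every coalition of features. It uses only the standard library and is not the SHAP library's API. The toy `model`, the instance `x` and the all-zero `baseline` are assumptions, and replacing "absent" features with a single baseline is a simplification of how background data is usually handled.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_players):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of player i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical model with an interaction between features 0 and 2.
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[2]

x = [1.0, 2.0, 3.0]          # instance to explain
baseline = [0.0, 0.0, 0.0]   # values used for "absent" features

def value_fn(coalition):
    # Features in the coalition keep their real value; the rest use the baseline.
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(mixed)

phi = shapley_values(value_fn, 3)
print(phi)  # approximately [3.5, 2.0, 1.5]
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split equally between the two features involved. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations and model-specific algorithms.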

For the documentation and more:

8-Missingno

Missingno provides a small toolset of flexible, easy-to-use missing data visualizations that give us a quick visual summary of how complete our dataset is.

For the documentation and more:

9-Imblearn

Imbalanced-learn (Imblearn) is an open-source library that relies on Scikit-learn and provides tools for classification problems with imbalanced classes.

For the documentation and more:

10-Dataprep

DataPrep allows us to prepare and visualize our data with a few lines of code. I think it could be an alternative to Pandas Profiling.

For the documentation and more:

11-Dython

Dython is a set of data analysis tools that can give you more insight into your data. It can automatically find which features are categorical and which are numerical, compute a relevant measure of association between every pair of features, and plot it all as a heatmap.
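Which measure is "relevant" depends on the types of the two columns. For two categorical columns, one common choice is Cramér's V, sketched below by hand with Pandas and SciPy. This is an illustration of the kind of measure involved, not Dython's own API. The helper `cramers_v` (without bias correction) and the toy DataFrame are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(a, b):
    """Cramér's V (no bias correction) between two categorical series."""
    table = pd.crosstab(a, b)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1   # assumes each column has at least two categories
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical data: `label` is fully determined by `color`, `size` is not.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red", "blue"],
    "label": ["a", "a", "b", "b", "a", "b"],
    "size": ["S", "L", "S", "L", "L", "S"],
})

v_color_label = cramers_v(df["color"], df["label"])
v_color_size = cramers_v(df["color"], df["size"])
print(v_color_label, v_color_size)
```

A value near 1 means one column almost determines the other, and a value near 0 means they are close to independent. Computing such a measure for every pair of columns gives the matrix that is then drawn as a heatmap.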

For the documentation and more:

12-Statsmodels

Statsmodels is a Python module that includes classes and functions for the estimation of many different statistical models as well as for conducting statistical tests, and statistical data exploration.

For the documentation and more:

13-Mlxtend

Mlxtend (machine learning extensions) is a library of useful tools for day-to-day data science tasks, such as feature selection, stacking ensembles, frequent pattern mining and decision region plots.

For the documentation and more:

14-SciPy

SciPy is open-source software for mathematics, science, and engineering. It contains useful functions in areas such as linear algebra, optimization, signal processing and statistics.
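The short sketch below touches three of those areas: solving a linear system, minimizing a function and running a two-sample t-test. All numbers and sample data are made up for illustration.

```python
import numpy as np
from scipy import linalg, optimize, stats

# Linear algebra: solve the system A @ x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)

# Optimization: find the minimum of (t - 2)^2.
res = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2)
print(res.x)

# Statistics: compare the means of two samples with a t-test.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=1.0, scale=1.0, size=200)
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(t_stat, p_value)
```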

For the documentation and more:

15-Pandas Profiling

Pandas Profiling generates reports from a Pandas DataFrame for quick exploratory data analysis.

For the documentation and more:

16-Sweetviz

Sweetviz is an open-source Python library that generates high-density visualizations for exploratory data analysis with just two lines of code. The output is a fully self-contained HTML application.

For the documentation and more:

17-Dtreeviz

Dtreeviz is a library for decision tree visualization and model interpretation. It currently supports Scikit-Learn, XGBoost, Spark MLlib and LightGBM trees.

For the documentation and more:

18-category_encoders

Category Encoders is a library of Scikit-learn-style transformers that encode categorical variables as numeric values using different techniques. Some of the encoders: Count Encoder, CatBoost Encoder, James-Stein Encoder, Target Encoder…

For the documentation and more:

19-tslearn

tslearn is a package that provides machine learning tools for time series analysis. It builds on the Scikit-learn, NumPy and SciPy libraries.

For the documentation and more:

20-sktime

sktime is a unified framework for machine learning with time series. It provides specialized time series algorithms and Scikit-learn-compatible tools for building time series models.

For the documentation and more:

To see other published articles: https://medium.com/datasciencearth

To see published Turkish articles: https://www.datasciencearth.com
