20 Useful Python Libraries for Data Science Projects
Scikit-learn, NumPy, Pandas, Matplotlib, Plotly, Bokeh and Seaborn are some of the most common Python libraries in data science. Let’s look at some other libraries that can be useful for data science projects.
1-Feature Engine
Feature-Engine is a library for feature engineering. It lets us select the variables we want to transform, so it is easy to apply different engineering procedures to different feature subsets. Feature-Engine transformers can also be assembled within a Scikit-learn pipeline.
Feature-Engine includes transformers for:
- Variable transformation
- Variable selection
- Categorical variable encoding
- Missing data imputation
- Discretisation
- Outlier capping or removal
- Variable creation
For the documentation and more:
2-Yellowbrick
Yellowbrick is a library for machine learning visualization. It extends the Scikit-Learn API to make model selection and hyperparameter tuning easier, and it uses Matplotlib for its visualizations. Some of the visualizers: Precision-Recall Curves, Confusion Matrices, Residuals Plot, K-Elbow Plot, Silhouette Plot, Learning Curve, t-SNE Corpus Visualization…
For the documentation and more:
3-PDPbox
Partial dependence plots show the marginal effect that one or two features have on the predicted outcome of a machine learning model (J. H. Friedman, 2001). PDPbox is a library for creating partial dependence plots.
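To see what a partial dependence curve actually measures, here is a hand-rolled sketch. It does not use the PDPbox API. It only relies on NumPy and Scikit-learn, and the helper `partial_dependence_1d` is a hypothetical name written for this example. For each grid value, we overwrite one feature in every row with that value and average the model's predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data and a black-box model to inspect
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # fix the feature, keep the others as observed
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
pd_curve = partial_dependence_1d(model, X, feature=0, grid=grid)
print(pd_curve)
```

Plotting `grid` against `pd_curve` gives the one-feature partial dependence plot. Libraries such as PDPbox add the plotting and the two-feature interaction case on top of this idea.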
For the documentation and more:
4-Eli5
Eli5 is a Python package that helps to debug machine learning classifiers and explain their predictions, including those of black-box estimators. It provides support for Scikit-Learn, Keras, xgboost, LightGBM, CatBoost…
For the documentation and more:
5-Researchpy
Researchpy creates Pandas DataFrames that contain the relevant statistical testing information required for academic research. It uses Pandas, SciPy, NumPy, Statsmodels…
For the documentation and more:
6-LIME
LIME (Local Interpretable Model-agnostic Explanations) is an algorithm for explaining the predictions of black-box estimators, and the LIME library is one of the most popular Python libraries for model explainability.
For the documentation and more:
7-SHAP
SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model, and the SHAP library allows us to compute and visualize Shapley values.
For the documentation and more:
8-Missingno
Missingno provides a small toolset of flexible, easy-to-use missing data visualizations that give us a quick visual summary of how complete our dataset is.
For the documentation and more:
9-Imblearn
Imbalanced-learn (Imblearn) is an open-source library that builds on scikit-learn and provides tools for classification problems with imbalanced classes.
For the documentation and more:
10-Dataprep
DataPrep allows us to prepare and visualize our data with a few lines of code. I think it could be an alternative to Pandas Profiling.
For the documentation and more:
11-Dython
Dython is a set of data analysis tools that helps you get more insights about your data. Dython automatically finds which features are categorical and which are numerical, computes a relevant measure of association between each pair of features, and plots it all as a heatmap.
For the documentation and more:
12-Statsmodels
Statsmodels is a Python module that includes classes and functions for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration.
For the documentation and more:
13-Mlxtend
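Mlxtend (machine learning extensions) is a library of useful tools for everyday data science tasks, such as frequent pattern mining, feature selection, stacking ensembles and plotting decision regions.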
For the documentation and more:
14-SciPy
SciPy is open-source software for mathematics, science, and engineering. It contains useful functions in areas such as linear algebra, optimization, signal processing and statistics.
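Here is a small sketch that touches three of these areas: solving a linear system, minimizing a one-dimensional function, and running a two-sample t-test. All numbers are made up for the example.

```python
import numpy as np
from scipy import linalg, optimize, stats

# Linear algebra: solve 3x + y = 9 and x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)
print(solution)

# Optimization: minimize (t - 2)^2 + 1, whose minimum is at t = 2
res = optimize.minimize_scalar(lambda t: (t - 2.0) ** 2 + 1.0)
print(res.x)

# Statistics: two-sample t-test on simulated groups
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=0.5, scale=1.0, size=100)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```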
For the documentation and more:
15-Pandas Profiling
Pandas Profiling generates reports from a Pandas DataFrame for quick exploratory data analysis.
For the documentation and more:
16-Sweetviz
Sweetviz is an open-source Python library that generates high-density visualizations for exploratory data analysis with just two lines of code. The output is a fully self-contained HTML application.
For the documentation and more:
17-Dtreeviz
Dtreeviz is a library for decision tree visualization and model interpretation. It currently supports Scikit-Learn, XGBoost, Spark MLlib and LightGBM trees.
For the documentation and more:
18-category_encoders
category_encoders is a library of scikit-learn-style transformers that encode categorical variables as numbers using different techniques. Some of the encoders: Count Encoder, CatBoost Encoder, James-Stein Encoder, Target Encoder…
For the documentation and more:
19-tslearn
tslearn is a package that provides machine learning tools for time series analysis. It builds on the scikit-learn, NumPy and SciPy libraries.
For the documentation and more:
20-sktime
sktime is a unified framework for machine learning with time series. It provides specialized time series algorithms and scikit-learn-compatible tools for building time series models.
For the documentation and more:
To see other published articles: https://medium.com/datasciencearth
To see published Turkish articles: https://www.datasciencearth.com