5 lesser-known Python libraries to improve your Data Science workflow

Hint: No Pandas, Numpy, Scikit-learn in this list

Sayar Banerjee
School of ML
5 min read · Aug 30, 2020

Photo by JC Gellidon on Unsplash

“A star does not compete with other stars around it; it just shines.”
Matshona Dhliwayo

Python is by far the most popular programming language in the field of Data Science. Its rich set of libraries, simple syntax, and high productivity make it an extremely popular language among beginners as well as seasoned practitioners.

Therefore, it is not unusual to find countless articles praising the power of Python and its famous data science libraries like NumPy, Pandas, TensorFlow, Matplotlib, etc.

This post shifts the attention to some of the lesser-known Python libraries that are slowly gaining recognition in the Data Science community.

1. Streamlit

Streamlit has been gaining tremendous popularity in recent times. It launched only two years ago, in 2018, and already bills itself as “the fastest way to create data apps” on its website.

By embracing Python scripting, users can create data apps within minutes. Additionally, UI components like sliders, buttons, widgets, and text boxes can be added with just a single line of code.

Streamlit demo code
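As a rough illustration (a minimal sketch, not the exact script from the demo), a small app could look like this. The app title, the slider label, and the `squares` helper are made up for this example; `st.title`, `st.slider`, `st.write`, and `st.line_chart` are standard Streamlit calls. The `try`/`except` guard is an extra assumption that only keeps the helper usable where Streamlit is not installed; a real app would simply import the library.

```python
def squares(n):
    """Return the first n square numbers: 0, 1, 4, ..."""
    return [i * i for i in range(n)]


try:
    import streamlit as st
except ImportError:
    # Assumption for this sketch: skip the UI when Streamlit is missing,
    # so the helper above can still be imported on its own.
    st = None

if st is not None:
    st.title("Squares explorer")  # page heading
    # Each widget is one line; the call returns the widget's current value.
    n = st.slider("How many points?", min_value=1, max_value=100, value=10)
    st.write(f"Plotting the first {n} squares")
    st.line_chart(squares(n))  # Streamlit turns the list into a line chart
```

Saved as app.py (a hypothetical filename), this would be started with `streamlit run app.py`. Moving the slider reruns the script from top to bottom with the new value, which is why no callbacks are needed.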

By just writing the above lines of code, we are able to produce this:

streamlit demo app

Furthermore, the library is compatible with many other major libraries and frameworks like scikit-learn, Keras, OpenCV, TensorFlow, PyTorch, NumPy, Matplotlib, etc.

The CEO of Streamlit also presented a demo at PyData LA 2019 that is worth checking out.

Founded by Amanda Kelly, Thiago Teixeira, and Adrien Treuille, all of whom previously worked at Google, the company recently announced a Series A funding round in which it raised $21 million. It’s safe to say that the future looks very promising for Streamlit.

GitHub Stats:

Stars ⭐️ ⭐️ ⭐️ — 10.2k
Forks 🍴 🍴 🍴 — 921

2. tqdm

According to its documentation, tqdm means "progress" in Arabic (taqadum, تقدّم) and is an abbreviation for "I love you so much" in Spanish (te quiero demasiado).

As you might have guessed, tqdm is a library for adding smart progress meters to your iterative processes. All you have to do is wrap an iterable with tqdm(), and you’re all set.

tqdm demo
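Here is a minimal sketch of the wrap-an-iterable pattern. The `slow_sum` function and its `delay` parameter are hypothetical stand-ins for real work; `tqdm(iterable, desc=...)` is the library's actual call. The fallback in the `except` branch is an assumption added so the loop still runs where tqdm is not installed.

```python
import time

try:
    from tqdm import tqdm
except ImportError:
    # Assumption for this sketch: fall back to a no-op wrapper so the
    # loop below still runs (without a progress bar) when tqdm is missing.
    def tqdm(iterable, **kwargs):
        return iterable


def slow_sum(values, delay=0.01):
    """Add up values, pausing briefly per item to simulate real work."""
    total = 0
    # Wrapping the iterable is the only change needed to get a progress bar.
    for value in tqdm(values, desc="Summing"):
        time.sleep(delay)
        total += value
    return total


print(slow_sum(range(50)))  # prints 1225 once the loop finishes
```

The same wrapper works around any iterable, for example the epoch loop or the batch loader in a training script.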

tqdm has become extremely popular among Data Scientists. It is often used with PyTorch to track progress through the training epochs of neural networks. The next time you are building a Neural Network, do not forget to use this useful library! 😊

GitHub Stats:

Stars ⭐️ ⭐️ ⭐️ — 15.6k
Forks 🍴 🍴 🍴 — 810

3. pandas-profiling

This is another library that has recently gained a lot of recognition in the Kaggle community. pandas-profiling generates HTML profile reports for Pandas DataFrames. According to its GitHub docs, the library is meant as an upgrade over the basic df.describe() in Pandas.

Here are some of the statistics the library reports, where they are relevant for the column type:

  • Type inference: detect the types of columns in a dataframe.
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables; Spearman, Pearson, and Kendall matrices
  • Missing values: matrix, count, heatmap, and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic), and blocks (ASCII) of text data
  • File and image analysis: extract file sizes, creation dates, and dimensions, and scan for truncated images or those containing EXIF information

pandas-profiling demo code
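A minimal sketch of how such a report might be generated (not the exact code from the demo). `write_profile`, its defaults, and the file names are hypothetical names for this example; `ProfileReport` and `to_file` come from pandas-profiling itself. Note that newer releases of the package are published under the name ydata-profiling, so the import may need adjusting depending on the installed version.

```python
def write_profile(df, path="report.html", title="Profiling report", minimal=False):
    """Profile a pandas DataFrame and save the result as an HTML file."""
    # Imported inside the function so this module loads even where the
    # (fairly heavy) profiling package is not installed.
    from pandas_profiling import ProfileReport

    # The core really is one line: wrap the DataFrame in a ProfileReport.
    report = ProfileReport(df, title=title, minimal=minimal)
    report.to_file(path)  # writes the HTML report to disk
    return path


# Example usage (assumes pandas and pandas-profiling are installed, and
# that "data.csv" is a placeholder for your own file):
#   import pandas as pd
#   write_profile(pd.read_csv("data.csv"))
```

Passing `minimal=True` switches off the most expensive computations, which can help on large datasets.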

A simple one-liner like the one above can produce a report like the one below:

One drawback of this library is that on very large datasets, creating the profile can take a very long time. In some cases, it may even hang.

Nevertheless, pandas-profiling is a fantastic library to add to your EDA workflow.

GitHub Stats:

Stars ⭐️ ⭐️ ⭐️ — 5.7k
Forks 🍴 🍴 🍴 — 857

4. PyCaret

PyCaret is rapidly gaining popularity because of its low-code approach to machine learning. The library acts as a wrapper around popular classical Machine Learning libraries like scikit-learn and XGBoost.

As a result, PyCaret can carry out complicated tasks, such as comparing the performance of several models, with a single line of code. The official docs note that one of the key objectives behind the library was to reduce costs for startups that want to leverage Machine Learning technology.

It is easy to use and deployment-ready. The documentation also notes that PyCaret pipelines can be saved as binary files, which helps when transferring them between machines and environments.

Pycaret demo code

With just the two lines of code above, you get a detailed summary of the inferred features and the outcomes of the preprocessing steps, like the following table:

Similarly, to compare multiple models you just need to write a single line of code:

Pycaret demo code 2
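Putting the two steps together, a minimal sketch (not the exact code from the demos) might look like this. `pick_best_model` and `seed` are hypothetical names for this example; `setup` and `compare_models` are the PyCaret functions this section describes. The dataset and target column in the usage comment are assumptions based on PyCaret's bundled sample data. As far as I know, recent releases of `compare_models` return the top-ranked estimator, but this may differ in older versions.

```python
def pick_best_model(data, target, seed=123):
    """Run PyCaret's preprocessing setup, then rank candidate classifiers."""
    # Imported lazily so this module loads even where PyCaret is missing.
    from pycaret.classification import compare_models, setup

    # Step 1: infer column types and build the preprocessing pipeline.
    setup(data=data, target=target, session_id=seed)
    # Step 2: cross-validate the available models and rank them.
    return compare_models()


# Example usage (assumes PyCaret is installed; "juice" and "Purchase" are
# an assumed sample dataset and target column):
#   from pycaret.datasets import get_data
#   best = pick_best_model(get_data("juice"), target="Purchase")
```

Fixing `session_id` keeps the train/test split and the cross-validation folds reproducible between runs.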

As you can see, PyCaret gives users incredible power to create Machine Learning workflows with very little code. Use PyCaret to develop machine learning models at lightning speed! ⚡️⚡️⚡️

GitHub Stats:

Stars ⭐️ ⭐️ ⭐️ — 2.1k
Forks 🍴 🍴 🍴 — 419

5. cudf and cuml by RAPIDS

Okay, so the last one is actually two libraries. However, the RAPIDS project is the new talk of the town for machine learning libraries. The reason is that it offers the niche ability to use GPUs to train classical Machine Learning models.

Traditionally, GPUs are used to speed up Deep Learning models. However, with the introduction of the RAPIDS project, the powerful computational capabilities of GPUs can now be extended to classical models.

Under the hood, cudf and cuml use the CUDA programming model to accelerate workflows that involve data frames and machine learning models.

cudf offers an API that looks very similar to Pandas. Here is some sample code:

Sample code that performs group by with aggregation
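As a minimal sketch of that similarity (not the exact sample from the original post), the function below takes the dataframe module as an argument, so the same body can run against either cudf or pandas. `mean_by_group`, the column names, and the sample data are hypothetical; `DataFrame`, `groupby`, and `agg` are calls both libraries provide.

```python
def mean_by_group(xdf, columns):
    """Average the 'value' column per 'group' using the given dataframe module."""
    df = xdf.DataFrame(columns)
    # sort=True is passed explicitly so both libraries order the groups alike.
    return df.groupby("group", sort=True).agg({"value": "mean"})


columns = {"group": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]}

# GPU version (needs an NVIDIA GPU and a RAPIDS install):
#   import cudf
#   print(mean_by_group(cudf, columns))
# CPU version, same function body:
#   import pandas
#   print(mean_by_group(pandas, columns))
```

Because only the imported module changes, an existing Pandas pipeline can often be tried on the GPU with few edits, although not every Pandas feature is guaranteed to have a cudf counterpart.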

Likewise, the API offered by cuml is very similar to scikit-learn:

example code for a simple Linear Regression model in cuml
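A minimal sketch of what that could look like (not the exact example from the original post). `fit_linear_model` and its `use_gpu` flag are hypothetical names for this example, and the data in the usage comment is made up; `LinearRegression` with `fit` and `predict` exists under `linear_model` in both cuml and scikit-learn.

```python
def fit_linear_model(X, y, use_gpu=True):
    """Fit a linear regression with cuml on the GPU, or scikit-learn on the CPU."""
    if use_gpu:
        # Needs an NVIDIA GPU and a RAPIDS install.
        from cuml.linear_model import LinearRegression
    else:
        # CPU fallback with the matching scikit-learn estimator.
        from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X, y)  # same call in both libraries
    return model


# Example usage (assumes NumPy is installed; the data is made up):
#   import numpy as np
#   X = np.array([[1.0], [2.0], [3.0], [4.0]], dtype=np.float32)
#   y = np.array([2.0, 4.0, 6.0, 8.0], dtype=np.float32)
#   model = fit_linear_model(X, y, use_gpu=False)
#   print(model.predict(X))
```

Only the import line differs between the two branches, which is the point of the scikit-learn-style API.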

According to the RAPIDS documentation, the cuml implementations can speed up training by 10–50x compared with CPU equivalents on large datasets. RAPIDS has many more features available. Don’t forget to check it out and give it a spin the next time you are working on a Machine Learning project!

cudf GitHub Stats:

Stars ⭐️ ⭐️ ⭐️ — 3.1k
Forks 🍴 🍴 🍴 — 419

cuml GitHub Stats:

Stars ⭐️ ⭐️ ⭐️ — 1.6k
Forks 🍴 🍴 🍴 — 256

Conclusion

I hope all of you enjoyed this article! You can check out my other articles on Medium as well. Feel free to connect with me on LinkedIn. Until next time! ✋

Sayar Banerjee
School of ML

Graduate Student in Analytics at UC Davis | Data Scientist | Amateur crypto investor | UIUC Alumnus