Lesser Known Python Libraries for Data Science

Parul Pandey
Nov 8, 2018 · 7 min read

Python is an amazing language. It is one of the fastest-growing programming languages in the world and has time and again proved its usefulness in both developer roles and data science positions across industries. The Python ecosystem and its libraries make it an apt choice for users, beginners and advanced alike, all over the world. One of the reasons for its success and popularity is its set of robust libraries, which make it so versatile and quick to work with.

In this article, we will look at some Python libraries for data science tasks other than the commonly used ones like pandas, scikit-learn, and matplotlib. Although libraries like pandas and scikit-learn are the default names that come to mind for machine learning tasks, it’s always good to learn about other Python offerings in this field.


Wget

Extracting data, especially from the web, is one of the vital tasks of a data scientist. Wget is a free utility for non-interactive download of files from the web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Since it is non-interactive, it can work in the background even if the user isn’t logged in. So the next time you want to download a website or all the images from a page, wget is there to assist you.
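
As a rough illustration of the "all the images from a page" use case, the sketch below builds a GNU Wget command line from Python. The helper name build_wget_command, the example URL, and the images output directory are assumptions made for this illustration; the flags are standard GNU Wget long options. Nothing is downloaded by the snippet itself, and it assumes the wget executable would be available on the PATH when the command is eventually run.

```python
import shlex


def build_wget_command(url, out_dir="images", background=False):
    """Return the argument list for a one-level image crawl with GNU Wget.

    Hypothetical helper for illustration; it only builds the command.
    """
    cmd = [
        "wget",
        "--recursive",                    # follow links found on the page
        "--level=1",                      # ...but only one level deep
        "--no-directories",               # save files flat, without host/path folders
        "--accept=jpg,jpeg,png,gif",      # keep only files with these extensions
        f"--directory-prefix={out_dir}",  # directory the files are written to
        url,
    ]
    if background:
        # Ask wget to detach right after startup and keep working on its own.
        cmd.insert(1, "--background")
    return cmd


# Placeholder URL: replace it with a page you are allowed to crawl.
command = build_wget_command("https://example.com/gallery/", background=True)
print(shlex.join(command))
```

Passing the resulting list to subprocess.run(command, check=True) would start the actual download. Building the command as a list, rather than as one string, avoids shell-quoting problems with unusual URLs.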


Pendulum

For those who get frustrated when working with date-times in Python, Pendulum is here for you. It is a Python package that eases datetime manipulation, and it is a drop-in replacement for Python’s native datetime class. Refer to the documentation for the details.
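
To make the frustration concrete, here is a small standard-library sketch (it does not use Pendulum) of two common stumbling blocks: naive and timezone-aware datetimes cannot be ordered against each other, and time zones have to be attached by hand. The fixed UTC offsets and the variable names are illustrative assumptions. Pendulum's own datetimes are timezone-aware by default (UTC unless told otherwise), which sidesteps the first problem.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for real time zones to keep the sketch self-contained.
toronto = timezone(timedelta(hours=-5))
vancouver = timezone(timedelta(hours=-8))

naive = datetime(2012, 1, 1)                   # no timezone attached
aware = datetime(2012, 1, 1, tzinfo=toronto)   # timezone attached by hand

try:
    naive < aware
    comparison_error = None
except TypeError as exc:
    # The standard library refuses to order naive and aware datetimes.
    comparison_error = exc

# Midnight in Vancouver happens three hours after midnight in Toronto.
gap = datetime(2012, 1, 1, tzinfo=vancouver) - aware
hours_apart = gap.total_seconds() / 3600

print(type(comparison_error).__name__, hours_apart)
```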


imbalanced-learn

It’s a fact that most classification algorithms work best when the number of samples in each class is almost the same, i.e., balanced. But real-life cases are full of imbalanced datasets, which can have a bearing upon the learning phase and the subsequent predictions of machine learning algorithms. Fortunately, this library has been created to address this issue. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects. Try it out the next time you encounter imbalanced datasets.
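
One of the simplest rebalancing strategies is random oversampling: minority-class samples are duplicated at random until every class is as large as the biggest one. imbalanced-learn ships this strategy as RandomOverSampler, alongside more elaborate ones such as SMOTE. The sketch below is a standard-library illustration of the idea only, not the library's implementation; the function name random_oversample and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original sample, then top up with random duplicates.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy dataset: nine samples of class 0 and a single sample of class 1.
X = [[i] for i in range(10)]
y = [0] * 9 + [1]

X_res, y_res = random_oversample(X, y)
print(Counter(y), "->", Counter(y_res))
```

Oversampling should be applied to the training split only; duplicating samples before splitting would leak copies of the same point into the test set.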

For usage and examples refer to the documentation.


FlashText

Cleaning text data during NLP tasks often requires replacing keywords in sentences or extracting keywords from sentences. Such operations are usually accomplished with regular expressions, but that becomes cumbersome when the number of terms to be searched runs into the thousands. Python’s FlashText module, based upon the FlashText algorithm, provides an apt alternative for such situations. The best part of FlashText is that the runtime stays roughly the same irrespective of the number of search terms. You can read more about it here.
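
The reason the number of search terms barely matters is that the keywords are stored in a character trie and the text is walked character by character, so the cost depends on the length of the text rather than on the size of the keyword list. The sketch below is a simplified standard-library illustration of that idea, not the FlashText module's actual code; all function names and the example keywords are assumptions. In the library itself, the corresponding entry point is the KeywordProcessor class with its add_keyword, extract_keywords, and replace_keywords methods.

```python
END = "_end_"  # marker key that stores the replacement for a complete keyword


def build_trie(keywords):
    """Build a character trie from a {keyword: replacement} mapping."""
    trie = {}
    for keyword, replacement in keywords.items():
        node = trie
        for char in keyword.lower():
            node = node.setdefault(char, {})
        node[END] = replacement
    return trie


def _is_word_char(char):
    return char.isalnum() or char == "_"


def find_matches(text, trie):
    """Yield (start, end, replacement) for the longest keyword at each word start."""
    i, n = 0, len(text)
    while i < n:
        best = None
        if i == 0 or not _is_word_char(text[i - 1]):  # only start at word boundaries
            node, j = trie, i
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                ends_word = j == n or not _is_word_char(text[j])
                if END in node and ends_word:
                    best = (i, j, node[END])  # remember the longest match so far
        if best:
            yield best
            i = best[1]  # continue right after the matched keyword
        else:
            i += 1


def extract_keywords(text, trie):
    return [replacement for _, _, replacement in find_matches(text, trie)]


def replace_keywords(text, trie):
    pieces, last = [], 0
    for start, end, replacement in find_matches(text, trie):
        pieces.append(text[last:start])
        pieces.append(replacement)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


trie = build_trie({
    "Big Apple": "New York",
    "Bay Area": "Bay Area",
    "New Delhi": "NCR region",
})
print(extract_keywords("I love Big Apple and Bay Area.", trie))
print(replace_keywords("I love big apple and new delhi.", trie))
```

Adding a thousand more keywords makes the trie larger but leaves the scan over the text unchanged, whereas a naive regex approach would have to try each pattern in turn.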

Extract keywords

Replace keywords

For more usage examples, refer to the official documentation.


Fuzzywuzzy

The name does sound weird, but fuzzywuzzy is a very useful library when it comes to string matching. One can quickly implement operations like string comparison ratios, token ratios, etc. It is also handy for matching records which are kept in different databases.
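
To give a feel for what a comparison ratio and a token ratio measure, here is a standard-library approximation built on difflib. fuzzywuzzy exposes the corresponding operations as fuzz.ratio and fuzz.token_sort_ratio; the functions below are simplified stand-ins written for this illustration, so their scores may differ from the library's in some cases.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a, b):
    """Compare strings after lowercasing and sorting their words, so word order is ignored."""
    def normalise(s):
        return " ".join(sorted(s.lower().split()))
    return ratio(normalise(a), normalise(b))


# A single extra character costs only a few points.
print(ratio("this is a test", "this is a test!"))

# Same words in a different order: the plain ratio drops, the token ratio does not.
print(ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
print(token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
```

This is why token-based ratios are the usual choice when matching records such as names or addresses, where the same words often appear in a different order.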

More interesting examples can be found at their GitHub repo.


PyFlux

Time series analysis is one of the most frequently encountered problems in the machine learning domain. PyFlux is an open source Python library explicitly built for working with time series problems. The library has an excellent array of modern time series models including, but not limited to, ARIMA, GARCH, and VAR models. In short, PyFlux offers a probabilistic approach to time series modeling. Worth trying out.

Please refer to the documentation for usage and examples.


Ipyvolume

Communicating results is an essential aspect of data science, and being able to visualize results is a significant advantage. IPyvolume is a Python library to visualize 3D volumes and glyphs (e.g., 3D scatter plots) in the Jupyter notebook, with minimal configuration and effort. However, it is currently in the pre-1.0 stage. A good analogy would be something like this: IPyvolume’s volshow is to 3D arrays what matplotlib’s imshow is to 2D arrays. You can read more about it here.

  • Animation
  • Volume Rendering

Dash

Dash is a productive Python framework for building web applications. It is written on top of Flask, Plotly.js, and React.js, and ties modern UI elements like dropdowns, sliders, and graphs to your analytical Python code without the need for JavaScript. Dash is highly suitable for building data visualization apps, which are then rendered in the web browser. The user guide can be accessed here.

The example below shows a highly interactive graph with drop-down capabilities. As the user selects a value in the Dropdown, the application code dynamically pulls data from Google Finance into a pandas DataFrame. Source


Bashplotlib

Bashplotlib is a Python package and command-line tool for making basic plots in the terminal. Written purely in Python, it comes in handy for visualizing data when users do not have access to a GUI.
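
To show what a "basic plot in the terminal" amounts to, here is a standard-library sketch of a text histogram. It is not Bashplotlib's code (the package provides its own hist and scatter command-line tools); the function name text_histogram, its parameters, and the sample data are assumptions made for this illustration.

```python
from collections import Counter


def text_histogram(values, bins=5, width=40, char="#"):
    """Return the lines of a horizontal histogram drawn with plain characters."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    counts = Counter(
        min(int((v - lo) / span * bins), bins - 1)  # clamp the maximum into the last bin
        for v in values
    )
    tallest = max(counts.values())
    lines = []
    for b in range(bins):
        left_edge = lo + span * b / bins
        bar = char * round(width * counts[b] / tallest)
        lines.append(f"{left_edge:8.2f} | {bar} {counts[b]}")
    return lines


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
print("\n".join(text_histogram(data, bins=4)))
```

Each row shows the left edge of a bin, a bar scaled so that the fullest bin spans the whole width, and the raw count, which is usually enough to eyeball a distribution over SSH.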

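To install it with pip, run pip install bashplotlib. To install it from source, clone the project's repository and run python setup.py install from its root directory.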


Colorama

Colorama colorizes terminal output in Python. It uses the standard ANSI escape codes to color and style terminal output. Sometimes it is a good idea to color the logs in the terminal so that if anything goes wrong, it stands out. Although one can colorize output manually by using escape characters, that is a lengthy and tedious task. Colorama offers a simple solution: just import it into your scripts and add it to any text to be colorized.

Run the following script and see how the color of the text changes with different options.
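
As a minimal sketch that runs without installing anything, the script below writes the ANSI escape sequences by hand. The constant names and the colorize helper are assumptions made for this illustration; the comments give the Colorama constants that carry the same codes. With Colorama installed, you would call colorama.init() once (it is what makes the codes work on Windows) and concatenate Fore, Back, and Style constants instead of defining your own.

```python
# Raw ANSI escape sequences and the Colorama names that hold the same codes.
RED = "\033[31m"        # colorama.Fore.RED
GREEN_BG = "\033[42m"   # colorama.Back.GREEN
DIM = "\033[2m"         # colorama.Style.DIM
RESET = "\033[0m"       # colorama.Style.RESET_ALL


def colorize(text, *codes):
    """Wrap text in the given escape codes and reset the style afterwards."""
    return "".join(codes) + text + RESET


print(colorize("some red text", RED))
print(colorize("and with a green background", GREEN_BG))
print(colorize("and in dim text", DIM))
print(colorize("red text on a green background", RED, GREEN_BG))
print("back to normal now")
```

Resetting after every string matters: without the reset code, the style leaks into everything printed afterwards.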

For a full list of options, refer to the official GitHub page.

Conclusion

These were my picks for useful Python libraries for data science, other than the common ones like numpy, pandas, etc. In case you know about others which can be added to the list, mention them in the comments below. Do not forget to try them out.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Parul Pandey

Written by

Data Science+Community+Evangelism @H2O.ai | Linkedin: https://www.linkedin.com/in/parul-pandey-a5498975/
