Lesser Known Python Libraries for Data Science
Python is an amazing language. It is one of the fastest-growing programming languages in the world, and it has proved its usefulness time and again in both developer roles and data science positions across industries. The ecosystem of Python and its libraries makes it an apt choice for users, beginners and advanced alike, all over the world. One of the reasons for its success and popularity is its set of robust libraries, which make it so dynamic and fast to work with.
In this article, we will look at some Python libraries for data science tasks other than the commonly used ones like pandas, scikit-learn, and matplotlib. Although libraries like pandas and scikit-learn are the default names that come to mind for machine learning tasks, it’s always good to learn about other Python offerings in this field.
Extracting data, especially from the web, is one of the vital tasks of a data scientist. Wget is a free utility for non-interactive download of files from the web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Since it is non-interactive, it can work in the background even if the user isn’t logged in. The Python wget package installed below is a separate, pure-Python module of the same name that exposes a simple download function. So the next time you want to download a file from within a Python script, wget is there to assist you.
$ pip install wget
import wget
url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
filename = wget.download(url)
100% [................................................] 3841532 / 3841532
For those who get frustrated when working with date-times in Python, Pendulum is here for you. It is a Python package that eases datetime manipulation, and it is a drop-in replacement for Python’s native datetime class. Refer to the documentation for in-depth details.
$ pip install pendulum
import pendulum
dt_toronto = pendulum.datetime(2012, 1, 1, tz='America/Toronto')
dt_vancouver = pendulum.datetime(2012, 1, 1, tz='America/Vancouver')
print(dt_vancouver.diff(dt_toronto).in_hours())  # 3
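The two datetimes above denote midnight in two different time zones, so they are not the same instant. As a rough standard-library sketch of what that means (this is not Pendulum’s API), the snippet below assumes fixed UTC offsets for January, UTC-5 for Toronto and UTC-8 for Vancouver, instead of real time zone data. The names `TORONTO`, `VANCOUVER`, and `gap_hours` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for January (standard time, no DST handling):
# Toronto is UTC-5 and Vancouver is UTC-8.
TORONTO = timezone(timedelta(hours=-5), "EST")
VANCOUVER = timezone(timedelta(hours=-8), "PST")

dt_toronto = datetime(2012, 1, 1, tzinfo=TORONTO)
dt_vancouver = datetime(2012, 1, 1, tzinfo=VANCOUVER)

# Subtracting two aware datetimes compares the underlying instants.
gap = dt_vancouver - dt_toronto
gap_hours = gap.total_seconds() / 3600
print(gap_hours)  # 3.0
```

Hard-coded offsets break as soon as daylight saving time is involved, which is exactly the bookkeeping a library like Pendulum takes care of.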
Most classification algorithms work best when the number of samples in each class is almost the same, i.e. balanced. But real-life cases are full of imbalanced datasets, which can affect the learning phase and the subsequent predictions of machine learning algorithms. Fortunately, the imbalanced-learn library has been created to address this issue. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects. Try it out the next time you encounter imbalanced datasets.
pip install -U imbalanced-learn
conda install -c conda-forge imbalanced-learn
For usage and examples, refer to the documentation.
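To make the idea of rebalancing concrete, here is a minimal pure-Python sketch of random over-sampling, the simplest strategy of this kind. It is not imbalanced-learn’s implementation or API (the library provides its own resamplers, such as `RandomOverSampler`); the function `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until every
    class is as large as the biggest one. Returns new (samples, labels)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy dataset: five samples of class 0, one sample of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 5 samples
```

Plain duplication adds no new information and can encourage overfitting, which is one reason to reach for the more careful resampling methods a dedicated library offers.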
Cleaning text data during NLP tasks often requires replacing keywords in sentences or extracting keywords from sentences. Usually, such operations can be accomplished with regular expressions, but this becomes cumbersome when the number of terms to be searched runs into the thousands. Python’s FlashText module, which is based on the FlashText algorithm, provides an apt alternative for such situations. The best part of FlashText is that its runtime stays almost the same irrespective of the number of search terms. You can read more about it here.
$ pip install flashtext
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
# keyword_processor.add_keyword(<unclean name>, <standardised name>)
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
['New York', 'Bay Area']
keyword_processor.add_keyword('New Delhi', 'NCR region')
new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
'I love New York and NCR region.'
For more usage examples, refer to the official documentation.
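Why does the number of search terms matter so little? FlashText stores its keywords in a trie and scans the text once instead of trying each term in turn. The sketch below illustrates that idea with a simplified word-level trie; it is not FlashText’s actual implementation, and the helpers `build_trie` and `extract_keywords` are hypothetical names chosen for this example.

```python
END = object()  # sentinel key marking the end of a complete keyword


def build_trie(keywords):
    """Build a word-level trie from a {phrase: clean_name} mapping."""
    trie = {}
    for phrase, clean_name in keywords.items():
        node = trie
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[END] = clean_name
    return trie


def extract_keywords(sentence, trie):
    """Scan the sentence once, returning clean names of the longest matches."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    found = []
    i = 0
    while i < len(words):
        node, match, match_end = trie, None, i
        j = i
        # Walk down the trie for as long as the next word continues a keyword.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                match, match_end = node[END], j
        if match is not None:
            found.append(match)
            i = match_end  # skip past the matched phrase
        else:
            i += 1
    return found


trie = build_trie({"Big Apple": "New York", "Bay Area": "Bay Area"})
print(extract_keywords("I love Big Apple and Bay Area.", trie))
# ['New York', 'Bay Area']
```

Each word is looked up in a dictionary rather than compared against every keyword, so adding more keywords grows the trie but barely changes the work done per word.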
The name does sound weird, but fuzzywuzzy is a very helpful library when it comes to string matching. One can easily implement operations like string comparison ratios, token ratios etc. It is also handy for matching records which are kept in different databases.
$ pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Simple Ratio
fuzz.ratio("this is a test", "this is a test!")  # 97
# Partial Ratio
fuzz.partial_ratio("this is a test", "this is a test!")  # 100
More interesting examples can be found at their GitHub repo.
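To get a feel for what these two scores measure, here is a rough approximation built on the standard library’s `difflib.SequenceMatcher`. It is a simplified sketch, not fuzzywuzzy’s own code: `simple_ratio` and `partial_ratio` are hypothetical helpers, and the sliding-window approach to the partial score is an assumption made for clarity.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of the two full strings, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long
    window of the longer string, scaled to 0-100."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[start:start + width]).ratio()
        for start in range(len(longer) - width + 1)
    )
    return round(100 * best)


print(simple_ratio("this is a test", "this is a test!"))   # 97
print(partial_ratio("this is a test", "this is a test!"))  # 100
```

The simple score is penalised by the extra exclamation mark, while the partial score finds the shorter string fully contained in the longer one.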
Time series analysis is one of the most frequently encountered problems in the machine learning domain. PyFlux is an open source Python library built explicitly for working with time series problems. The library has an excellent array of modern time series models, including but not limited to ARIMA, GARCH, and VAR models. In short, PyFlux offers a probabilistic approach to time series modelling. It is worth trying out.
pip install pyflux
Please refer to the documentation for usage and examples.
Communicating results is an essential aspect of data science, and being able to visualise results is a significant advantage. IPyvolume is a Python library for visualising 3D volumes and glyphs (e.g. 3D scatter plots) in the Jupyter notebook with minimal configuration and effort. However, it is currently in the pre-1.0 stage. A good analogy would be something like this: IPyvolume’s volshow is to 3D arrays what matplotlib’s imshow is to 2D arrays. You can read more about it here.
$ pip install ipyvolume
$ conda install -c conda-forge ipyvolume
- Volume Rendering
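Dash is a Python framework for building web applications. The commands below install its components as separate packages.
pip install dash==0.29.0 # The core dash backend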
pip install dash-html-components==0.13.2 # HTML components
pip install dash-core-components==0.36.0 # Supercharged components
pip install dash-table==3.1.3 # Interactive DataTable component (new!)
A typical example is a highly interactive graph driven by a dropdown. As the user selects a value in the dropdown, the application code dynamically loads data from Google Finance into a pandas DataFrame.
Gym from OpenAI is a toolkit for developing and comparing reinforcement learning algorithms. It is compatible with any numerical computation library, such as TensorFlow or Theano. The gym library is essentially a collection of test problems, also called environments, that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.
pip install gym
A standard example runs an instance of the CartPole-v0 environment for 1000 timesteps, rendering the environment at each step.
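Because every environment exposes the same interface, a loop like that is independent of the task being solved. As a minimal sketch that runs without gym installed, the code below assumes a hypothetical toy environment, `CoinFlipEnv`, written in the style of gym’s classic API, where `reset()` returns an observation and `step(action)` returns an `(observation, reward, done, info)` tuple. The `sample_action` method is a stand-in for gym’s `action_space.sample()`, and there is no rendering.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment (not part of gym): guess a coin flip."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.steps = 0
        return 0

    def step(self, action):
        """Apply an action and return (observation, reward, done, info)."""
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps += 1
        done = self.steps >= self.episode_length
        return coin, reward, done, {}

    def sample_action(self):
        """Pick a random action (stand-in for action_space.sample())."""
        return self.rng.randint(0, 1)


def run_random_agent(env, max_steps=1000):
    """Generic loop: works with any object offering the methods used here."""
    env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        _, reward, done, _ = env.step(env.sample_action())
        total_reward += reward
        steps += 1
        if done:
            env.reset()  # episode finished, start a new one
    return total_reward, steps


total_reward, steps = run_random_agent(CoinFlipEnv())
print(steps, total_reward)
```

Nothing in `run_random_agent` refers to coins, so the same function would drive any other environment that follows this interface, which is the point of having a shared one.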
You can read about other environments here.
These were my picks for useful Python libraries for data science, other than the common ones like numpy and pandas. If you know of others that could be added to the list, mention them in the comments below. Do not forget to try them out.